Submission information
Submission Number: 136
Submission ID: 240
Submission UUID: d4d4c6e4-7142-4b12-8e0a-937a5787f8d6
Submission URI: /form/project
Created: Sat, 01/22/2022 - 10:59
Completed: Sat, 01/22/2022 - 10:59
Changed: Wed, 05/31/2023 - 15:16
Remote IP address: 173.59.10.161
Submitted by: Vinayak Mathur
Language: English
Is draft: No
Webform: Project
Project Title: High throughput Python pipeline to identify Horizontal Gene Transfer
Program: CAREERS (323)
Project Image: {Empty}
Tags: bioinformatics (277), biology (515), data-wrangling (6), genomics (537), github (490), python (69), workflow (365)
Status: Halted

Project Leader
--------------
Project Leader: Vinayak Mathur
Email: vm7027@cabrini.edu
Mobile Phone: 7324214925
Work Phone: {Empty}

Project Personnel
-----------------
Mentor(s): Simon Delattre (1801)
Student-facilitator(s): Arun Dash (1602)
Mentee(s): {Empty}

Project Information
-------------------
Project Description: This project seeks to further investigate the genetic phenomenon of horizontal gene transfer (HGT), specifically in interactions between bacteriophages and their host bacteria. From a biological perspective, this type of horizontal gene transfer occurs when bacteriophages attach to a bacterial cell and inject their genetic material, which can integrate into the host genome and take control of the bacterium to make copies of the phage. The main aim of the project is to develop an analysis pipeline written in Python that automatically generates a large output list of bacterial accession numbers from an input list of phage accession numbers. The current program uses BLAST to create this list. The pipeline iterates through the input list and submits each phage accession number as a BLAST query to be aligned against the NCBI database of bacterial genes. The top bacterial result for each phage query ID is stored and, in turn, aligned against the database of bacteriophage genes. If the phage result of this second search, in which the bacterial accession number is the query ID, matches the original phage query ID, the match indicates the presence of horizontal gene transfer.
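
The two-way search described above can be sketched as follows. This is a minimal illustration, not the code in the CSPpipeline repository: it assumes the BLAST results have already been saved in tabular form (`-outfmt 6`, where the query ID, subject ID, and bit score are columns 1, 2, and 12), and the function names, the `row` helper, and all accession numbers are hypothetical.

```python
def top_hits(tabular_lines):
    """Map each query ID to its best subject ID, ranked by bit score.

    Assumes BLAST tabular output (-outfmt 6): tab-separated columns with the
    query ID first, the subject ID second, and the bit score twelfth.
    """
    best = {}  # query ID -> (bit score, subject ID)
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query_id, subject_id = fields[0], fields[1]
        bit_score = float(fields[11])
        if query_id not in best or bit_score > best[query_id][0]:
            best[query_id] = (bit_score, subject_id)
    return {query: subject for query, (_, subject) in best.items()}


def reciprocal_matches(phage_to_bacterium, bacterium_to_phage):
    """Return (phage, bacterium) pairs whose reverse search recovers the phage."""
    return [
        (phage_id, bacterium_id)
        for phage_id, bacterium_id in phage_to_bacterium.items()
        if bacterium_to_phage.get(bacterium_id) == phage_id
    ]


def row(query_id, subject_id, bit_score):
    """Hypothetical helper: build one 12-column line with placeholder middle columns."""
    return "\t".join([query_id, subject_id] + ["0"] * 9 + [str(bit_score)])


# Made-up accession numbers and scores, for illustration only.
forward_lines = [
    row("PHAGE_A", "BACT_1", 540),
    row("PHAGE_A", "BACT_9", 210),
    row("PHAGE_B", "BACT_2", 330),
]
reverse_lines = [
    row("BACT_1", "PHAGE_A", 540),
    row("BACT_2", "PHAGE_X", 300),
]

forward = top_hits(forward_lines)   # phage query -> top bacterial hit
reverse = top_hits(reverse_lines)   # bacterial query -> top phage hit
candidates = reciprocal_matches(forward, reverse)
print(candidates)
```

Here only the PHAGE_A / BACT_1 pair is reported, because the reverse search for BACT_2 returns a different phage than the one that found it. Ranking by bit score is one possible definition of "top result"; the existing pipeline may rank hits differently.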
Running this analysis in an HPC environment accessed over SSH could significantly speed up data collection compared with the current pipeline or with manual searches on the NCBI website, where BLAST is available. The current version of the pipeline is available here: https://github.com/genomesolver/CSPpipeline

Research goals: This research project has three major goals:
1) Identify instances of HGT in a large dataset of bacteriophage proteins: The data list produced by the program facilitates more in-depth analysis of bacteriophage-mediated horizontal gene transfer.
2) Predict likelihood of HGT: By developing a probabilistic classifier, we can attempt to predict the likelihood that a certain clade of bacteria is affected by horizontal gene transfer given the HGT status of the other members of the clade. This model could help establish the statistical significance of HGT occurrences among bacterial relatives and help identify cellular features specific to those groups of bacteria that could explain their vulnerability to phage infection.
3) Functional analysis: A Gene Ontology (GO) enrichment analysis is a further research aim, intended to extract meaningful conclusions from the data. Since the current version of the pipeline generates a list of bacterial accession numbers that correspond to phage query IDs, that list can be processed to find GO terms in groups of genes regulated by the integration of the bacteriophage's nucleic acids. This type of analysis would help visualize and better understand how phage infections disrupt the genetic network of the bacteria.
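
The functional analysis in goal 3 could be approached with a one-sided hypergeometric test, a common choice for GO enrichment. The sketch below is an assumption about how that step might look, not part of the current pipeline: the function name, its parameters, and the example counts are all hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, sample_annotated):
    """Upper-tail hypergeometric probability P(X >= sample_annotated).

    population       -- total number of genes considered (background)
    annotated        -- background genes carrying the GO term
    sample           -- genes in the list under study (e.g. HGT candidates)
    sample_annotated -- genes in that list carrying the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)  # number of ways to draw the sample
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(sample_annotated, upper + 1)
    )
    return tail / total


# Made-up counts: 1000 background genes, 50 with the term,
# 20 HGT candidates, 5 of which carry the term.
p = enrichment_p_value(1000, 50, 20, 5)
print(f"enrichment p-value: {p:.4g}")
```

A small p-value suggests the term appears in the candidate list more often than random sampling from the background would explain. When many GO terms are tested, a multiple-testing correction would be needed before drawing conclusions.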

Project Information Subsection
------------------------------
Project Deliverables: The goals of the project are:
1) Fine-tune the existing Python pipeline so that it can analyze larger datasets
2) Use an offline version of the NCBI database to run the analysis
3) Develop a model to predict the likelihood of HGT
Project Deliverables: {Empty}
Student Research Computing Facilitator Profile: {Empty}
Mentee Research Computing Profile: {Empty}
Student Facilitator Programming Skill Level: Some hands-on experience
Mentee Programming Skill Level: {Empty}
Project Institution: Cabrini University
Project Address: 610 King of Prussia Road IAD 224 Radnor, Pennsylvania. 19087
Anchor Institution: CR-Penn State
Preferred Start Date: {Empty}
Start as soon as possible.: Yes
Project Urgency: Already behind4Start date is flexible
Expected Project Duration (in months): 4
Launch Presentation: {Empty}
Launch Presentation Date: {Empty}
Wrap Presentation: https://support.access-ci.org/system/files/webform/project/240/Arun_Dash_Wrap_Presentation.pdf
Wrap Presentation Date: 05/10/2023
Project Milestones:
- Milestone Title: Improvement of Current Pipeline
  Milestone Description: Attempt to improve the run time of the current Python pipeline, possibly by downloading the NCBI database needed for comparison to local server space
  Completion Date Goal: 2022-07-30
- Milestone Title: Develop Classifier
  Milestone Description: Develop the probabilistic classifier for the pipeline, to predict the likelihood of HGT in different bacterial clades
  Completion Date Goal: 2022-09-16
- Milestone Title: Functional Analysis Function
  Milestone Description: Create a functional analysis function that compares results against existing Gene Ontology databases. Work on writing a manuscript to publish the results.
  Completion Date Goal: 2022-10-07
Github Contributions: (https://github.com/genomesolver/CSPpipeline)
Planned Portal Contributions (if any): {Empty}
Planned Publications (if any): Plan to publish the manuscript in the journal: https://iubmb.onlinelibrary.wiley.com/journal/15393429
What will the student learn?: {Empty}
What will the mentee learn?: {Empty}
What will the Cyberteam program learn from this project?: {Empty}
HPC resources needed to complete this project?: {Empty}
Notes: {Empty}

Final Report
------------
What is the impact on the development of the principal discipline(s) of the project?: {Empty}
What is the impact on other disciplines?: {Empty}
Is there an impact on physical resources that form infrastructure?: {Empty}
Is there an impact on the development of human resources for research computing?: {Empty}
Is there an impact on institutional resources that form infrastructure?: {Empty}
Is there an impact on information resources that form infrastructure?: {Empty}
Is there an impact on technology transfer?: {Empty}
Is there an impact on society beyond science and technology?: {Empty}
Lessons Learned: {Empty}
Overall results: {Empty}