Submission information
Submission Number: 87
Submission ID: 123
Submission UUID: 07e37fba-7cd1-46eb-8bb6-da0d02be33ff
Submission URI: /form/project
Created: Sat, 02/06/2021 - 06:01
Completed: Sat, 02/06/2021 - 06:05
Changed: Tue, 08/02/2022 - 15:06
Remote IP address: 24.34.184.238
Submitted by: Gaurav Khanna
Language: English
Is draft: No
Webform: Project
Project Title: Assembly and Taxonomic Profiling of Metagenomic Sequences using Deep Learning
Program: CAREERS (323)
Project Image: https://support.access-ci.org/system/files/webform/project/123/13%20June%202014.jpg
Tags: ai (271), bioinformatics (277), biology (515), deep-learning (303), gpu (80), hardware (74), machine-learning (272), neural-networks (435), python (69)
Status: Complete

Project Leader
--------------
Project Leader: Ying Zhang
Email: yingzhang@uri.edu
Mobile Phone: {Empty}
Work Phone: {Empty}

Project Personnel
-----------------
Mentor(s): Cecile Cres (1346), Ying Zhang (514)
Student-facilitator(s): Eric Rangel (635)
Mentee(s): {Empty}

Project Information
-------------------
Project Description: Microorganisms play important roles in nutrient cycling, energy production, and ecosystem health. Collections of microorganisms, also known as “microbiomes”, are central participants in plant-soil interactions, bioproduct synthesis, environmental sustainability, and human health. The study of microbiomes has long been hindered by the lack of laboratory-cultivated microbial isolates. This problem has been alleviated over the past decade through the rapid advancement and adaptation of metagenomics, a sequencing technology that determines the genomic composition of mixed microbial populations encompassing hundreds to thousands of genomes as a whole. The reconstruction of species diversity and function from metagenomic data is computationally very expensive because of the challenges of assembling short sequencing reads and the complexity of assigning sequencing data to specific taxonomic lineages, which causes significant lags between data generation and data interpretation. The goal of this project is to facilitate the development of deep learning models to enhance metagenomic data analysis.
This effort will lead to an improvement in existing models developed in our prior studies by examining the influences of training data, model parameterization, and model architectures on the speed and accuracy of model development. We will also aim to improve the computational pipeline for model training and testing to increase speed and gain better control of the data flow.

Project Information Subsection
------------------------------
Project Deliverables: The main deliverable of this project is a computational workflow, running on an appropriate computational resource, that allows for the development of a well-trained DL model that can be broadly applied to the analysis of metagenomic data from diverse environments.
Project Deliverables: {Empty}
Student Research Computing Facilitator Profile: Graduate or undergraduate; experience with machine learning; Python programming; prior experience with GPU computing
Mentee Research Computing Profile: {Empty}
Student Facilitator Programming Skill Level: Some hands-on experience
Mentee Programming Skill Level: {Empty}
Project Institution: University of Rhode Island
Project Address: {Empty}
Anchor Institution: CR-University of Rhode Island
Preferred Start Date: 06/15/2021
Start as soon as possible.: No
Project Urgency: Already behind5Start date is flexible
Expected Project Duration (in months): 6
Launch Presentation: {Empty}
Launch Presentation Date: 07/14/2021
Wrap Presentation: https://support.access-ci.org/system/files/webform/project/123/Final%20Presentation%20CyberTeams.pptx.pdf
Wrap Presentation Date: 01/12/2022
Project Milestones:
- Milestone Title: Milestone #1
  Milestone Description: Launch presentation; access to appropriate computational resources; set up the project with version control etc. via GitHub.
  Completion Date Goal: 2021-07-15
  Actual Completion Date: 2021-07-15
- Milestone Title: Milestone #2
  Milestone Description: Compare a newly developed in-house reads simulator with existing simulators to identify differences in algorithm design and data simulation; adapt the new simulator to simulate reads from diverse sequencing platforms.
  Completion Date Goal: 2021-08-15
  Actual Completion Date: 2021-08-15
- Milestone Title: Milestone #3
  Milestone Description: Train a deep learning model on a dataset generated with the new simulator and compare the training outcomes.
  Completion Date Goal: 2021-09-15
  Actual Completion Date: 2021-09-15
- Milestone Title: Milestone #4
  Milestone Description: Incorporate the oversampling approach and investigate how it affects the model’s performance.
  Completion Date Goal: 2021-10-15
  Actual Completion Date: 2021-10-15
- Milestone Title: Milestone #5
  Milestone Description: Adapt the approach to a dataset of reference genomes.
  Completion Date Goal: 2021-11-15
  Actual Completion Date: 2022-12-15
- Milestone Title: Milestone #6
  Milestone Description: Evaluate model performance; update GitHub with code and documentation; wrap-up presentation.
  Completion Date Goal: 2021-12-15
  Actual Completion Date: 2022-01-12
Github Contributions: {Empty}
Planned Portal Contributions (if any): {Empty}
Planned Publications (if any): {Empty}
What will the student learn?: {Empty}
What will the mentee learn?: {Empty}
What will the Cyberteam program learn from this project?: {Empty}
HPC resources needed to complete this project?: GPU cluster; potentially AiMOS?
Notes: {Empty}

Final Report
------------
What is the impact on the development of the principal discipline(s) of the project?: Eric worked on a project related to taxonomic classification, a task frequently conducted by biologists to identify microbial organisms in environmental samples.
His project consisted of designing a read simulator capable of simulating short DNA sequences using bacterial genomes as templates. By showing that a deep learning model trained on such data can efficiently classify regular short DNA sequences, Eric’s project provides promising results for improving the classification of short DNA sequences with deep learning algorithms and thereby contributes to the fields of metagenomics and bioinformatics.
What is the impact on other disciplines?: The approach taken towards parallelism and the lessons learned are applicable to a variety of other areas in science and engineering. Those outcomes may benefit other disciplines that need the same thing: increased throughput on relevant computations. Eric was able to transfer the knowledge he acquired through this project into his academic training, especially his computer science and Python coding classes.
Is there an impact on physical resources that form infrastructure?: None.
Is there an impact on the development of human resources for research computing?: The supported RCF, Eric Rangel, is now very interested in a career in research computing owing to his experience in CyberTeams. The supported student was effectively retained, i.e., will continue on to a STEM career as a direct outcome of this grant. Funded research opportunities appear to have a strong impact on students in STEM disciplines. In addition, this student was mentored and trained on how research in computational biology (and other sciences) is conducted, the mathematical and technological tools involved, and how to overcome roadblocks and challenges when working with unknowns.
Is there an impact on institutional resources that form infrastructure?: None.
Is there an impact on information resources that form infrastructure?: None.
Is there an impact on technology transfer?: None.
Is there an impact on society beyond science and technology?: By improving the identification of microorganisms in samples collected from various environments such as oceans, biologists can provide more accurate results on the impact of climate change on microbial communities and the subsequent effects on ocean health and marine animals. Advancement of scientific efforts and the development of a STEM-trained workforce have many established positive impacts on society beyond those particular areas. While difficult to quantify, we envision that in the long run there will be a tangible positive impact of such projects well beyond their domains.
Lessons Learned: Beyond the scientific benefits that can be obtained from advanced instrumentation such as HPC systems, it became clear that a student with the right combination of interests and background can positively impact the throughput of a research lab over a short-term engagement, as long as there is a supporting team available to the student as a resource. Additionally, such positive short-term engagements seemed to be sufficient to get the student enthusiastic about a career in research computing support.
Overall results: The project was successful in training a student-facilitator in computer skills and machine learning, and in making progress on new techniques to enhance the performance of taxonomic classification using deep learning algorithms; positive impact from short-term student engagement on research lab throughput and student retention in a STEM discipline; workforce development and training in a STEM area (research computing).