Submission Number: 87
Submission ID: 123
Submission UUID: 07e37fba-7cd1-46eb-8bb6-da0d02be33ff
Submission URI: /form/project

Created: Sat, 02/06/2021 - 06:01
Completed: Sat, 02/06/2021 - 06:05
Changed: Tue, 08/02/2022 - 15:06

Remote IP address: 24.34.184.238
Submitted by: Gaurav Khanna
Language: English

Is draft: No
Webform: Project
Project Title Assembly and Taxonomic Profiling of Metagenomic Sequences using Deep Learning
Program CAREERS
Project Image 13 June 2014.jpg
Tags ai (271), bioinformatics (277), biology (515), deep-learning (303), gpu (80), hardware (74), machine-learning (272), neural-networks (435), python (69)
Status Complete
Project Leader Ying Zhang
Email yingzhang@uri.edu
Mobile Phone
Work Phone
Mentor(s) Cecile Cres, Ying Zhang
Student-facilitator(s) Eric Rangel
Mentee(s)
Project Description Microorganisms play important roles in nutrient cycling, energy production, and ecosystem health. Collections of microorganisms, also known as “microbiomes”, are central participants in plant-soil interactions, bioproduct synthesis, environmental sustainability, and human health. The study of microbiomes has long been inhibited by the lack of laboratory-cultivated microbial isolates. This problem has been alleviated over the past decade through the rapid advancement and adoption of metagenomics, a sequencing technology that determines the genomic composition of mixed microbial populations encompassing hundreds to thousands of genomes as a whole. The reconstruction of species diversity and function from metagenomic data is computationally very expensive due to the challenges in assembling short sequencing reads and the complexity of assigning sequencing data to specific taxonomic lineages, causing significant lags between data generation and data interpretation.

The goal of this project is to facilitate the development of deep learning models to enhance metagenomic data analysis. This effort will lead to an improvement in existing models developed in our prior studies by examining the influences of training data, model parameterization, and model architectures on the speed and accuracy of model development. We will also aim to streamline the computational pipeline for model training and testing to increase speed and gain better control of the data flow.
Project Deliverables The main deliverable of this project is a computational workflow on the appropriate computational resource that will allow for the development of a well-trained DL model that can be broadly applied for the analysis of metagenomic data from diverse environments.
Student Research Computing Facilitator Profile Graduate or undergraduate
Experiences with machine learning
Python programming
Prior experiences with GPU computing
Mentee Research Computing Profile
Student Facilitator Programming Skill Level Some hands-on experience
Mentee Programming Skill Level
Project Institution University of Rhode Island
Project Address
Anchor Institution CR-University of Rhode Island
Preferred Start Date 06/15/2021
Start as soon as possible. No
Project Urgency 5 (scale: Already behind to Start date is flexible)
Expected Project Duration (in months) 6
Launch Presentation
Launch Presentation Date 07/14/2021
Wrap Presentation
Wrap Presentation Date 01/12/2022
Project Milestones
  • Milestone Title: Milestone #1
    Milestone Description: Launch presentation; Access to appropriate computational resources; set up the project with version control etc. via GitHub.
    Completion Date Goal: 2021-07-15
    Actual Completion Date: 2021-07-15
  • Milestone Title: Milestone #2
    Milestone Description: Compare a newly developed in-house reads simulator with existing simulators to identify differences in algorithm design and data simulation; adapt the new simulator to simulate reads based on diverse sequencing platforms.
    Completion Date Goal: 2021-08-15
    Actual Completion Date: 2021-08-15
  • Milestone Title: Milestone #3
    Milestone Description: Train a deep learning model with a dataset generated with the new simulator and compare the training outcomes.
    Completion Date Goal: 2021-09-15
    Actual Completion Date: 2021-09-15
  • Milestone Title: Milestone #4
    Milestone Description: Incorporate and investigate how the oversampling approach affects the model’s performance.
    Completion Date Goal: 2021-10-15
    Actual Completion Date: 2021-10-15
  • Milestone Title: Milestone #5
    Milestone Description: Adapt approach to a dataset of reference genomes.
    Completion Date Goal: 2021-11-15
    Actual Completion Date: 2022-12-15
  • Milestone Title: Milestone #6
    Milestone Description: Evaluate model performance; update GitHub with code, documentation; wrap-up presentation.
    Completion Date Goal: 2021-12-15
    Actual Completion Date: 2022-01-12
Github Contributions
Planned Portal Contributions (if any)
Planned Publications (if any)
What will the student learn?
What will the mentee learn?
What will the Cyberteam program learn from this project?
HPC resources needed to complete this project? GPU cluster; potentially AiMOS?
Notes
What is the impact on the development of the principal discipline(s) of the project? Eric worked on a project related to taxonomic classification, a task frequently conducted by biologists to identify microbial organisms in environmental samples. His project consisted of designing a read simulator capable of simulating short DNA sequences using bacterial genomes as templates. By showing that a deep learning model trained with such data is able to efficiently classify regular short DNA sequences, Eric’s project provides promising results for improving the classification of short DNA sequences using deep learning algorithms and therefore contributes to the fields of metagenomics and bioinformatics.
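To illustrate what simulating short reads from a template genome can look like, here is a minimal sketch. It is not the project's in-house simulator, whose design is not described in this record. It assumes uniform sampling of fixed-length reads and a simple per-base substitution error; the function name `simulate_reads` and its parameters are hypothetical.

```python
import random


def simulate_reads(genome, read_length, num_reads, error_rate=0.0, seed=None):
    """Sample fixed-length reads from a template genome (illustrative only).

    Assumptions, not taken from the project: start positions are drawn
    uniformly, and sequencing errors are modeled as independent base
    substitutions occurring with probability `error_rate`.
    Returns a list of (start_position, read_sequence) tuples.
    """
    if read_length > len(genome):
        raise ValueError("read_length must not exceed the genome length")

    rng = random.Random(seed)  # seeded generator for reproducible output
    bases = "ACGT"
    reads = []
    for _ in range(num_reads):
        # Pick a start position such that the read fits inside the genome.
        start = rng.randrange(len(genome) - read_length + 1)
        read = list(genome[start:start + read_length])
        for i, base in enumerate(read):
            # Substitute the base with a different one at the given rate.
            if rng.random() < error_rate:
                read[i] = rng.choice([b for b in bases if b != base])
        reads.append((start, "".join(read)))
    return reads


if __name__ == "__main__":
    template = "ACGTTGCAAGGCTTAACCGGTTAGC"  # toy sequence, not real data
    for start, read in simulate_reads(template, 8, 3, error_rate=0.05, seed=0):
        print(start, read)
```

Reads produced this way can be labeled with the taxonomy of their template genome and used as training data for a classifier. A realistic simulator would also need platform-specific error profiles, variable read lengths, and reverse-complement strands.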
What is the impact on other disciplines? The approach taken towards parallelism and the lessons learned are applicable to a variety of other areas in science and engineering. Those outcomes may benefit other disciplines in a similar context, namely through increased throughput on relevant computations.

Eric was able to transfer the knowledge he acquired through this project into his academic training, especially his computer science and Python coding classes.
Is there an impact on physical resources that form infrastructure? None.
Is there an impact on the development of human resources for research computing? The supported RCF, Eric Rangel, is now very interested in a career in research computing owing to his experience in CyberTeams. The supported student was effectively retained, i.e., will continue on to a STEM career as a direct outcome of this grant. Funded research opportunities appear to have a strong impact on students in STEM disciplines.

In addition, the student was mentored and trained on how research in computational biology (and other sciences) is conducted, the mathematical and technological tools involved, and how to overcome roadblocks and challenges when working with unknowns.
Is there an impact on institutional resources that form infrastructure? None.
Is there an impact on information resources that form infrastructure? None.
Is there an impact on technology transfer? None.
Is there an impact on society beyond science and technology? By improving the identification of microorganisms in samples collected from various environments such as oceans, biologists can provide more accurate results on the impact of climate change on microbial communities and the subsequent effects on ocean health and marine animals.

Advancement of scientific efforts and the development of a STEM-trained workforce have many established positive impacts on society beyond those particular areas. While difficult to quantify, we envision that in the long run there will be a tangible positive impact of such projects well beyond their domains.
Lessons Learned Beyond the scientific benefits that can be obtained from advanced instrumentation such as HPC systems, it became clear that a student with the right combination of interests and background can positively impact the throughput of a research lab over a short-term engagement as long as there is a supporting team available to the student as a resource.

Additionally, such positive short-term engagements seemed to be sufficient to get the student enthusiastic about a career in research computing support.
Overall results The project was successful both in training a student-facilitator to gain computing skills and knowledge of machine learning and in making progress on new techniques to enhance the performance of taxonomic classification using deep learning algorithms;
Positive impact from short-term student engagement on research lab throughput and student retention in a STEM discipline;
Workforce development and training in a STEM area (research computing).