Submission Number: 87
Submission ID: 123
Submission UUID: 07e37fba-7cd1-46eb-8bb6-da0d02be33ff
Submission URI: /form/project

Created: Sat, 02/06/2021 - 06:01
Completed: Sat, 02/06/2021 - 06:05
Changed: Tue, 08/02/2022 - 15:06

Remote IP address: 24.34.184.238
Submitted by: Gaurav Khanna
Language: English

Is draft: No
Webform: Project
Assembly and Taxonomic Profiling of Metagenomic Sequences using Deep Learning
CAREERS
13 June 2014.jpg
ai (271), bioinformatics (277), biology (515), deep-learning (303), gpu (80), hardware (74), machine-learning (272), neural-networks (435), python (69)
Complete

Project Leader

Ying Zhang
{Empty}
{Empty}

Project Personnel

Cecile Cres, Ying Zhang
Eric Rangel
{Empty}

Project Information

Microorganisms play important roles in nutrient cycling, energy production, and ecosystem health. Communities of microorganisms, known as “microbiomes”, are central participants in plant-soil interactions, bioproduct synthesis, environmental sustainability, and human health. The study of microbiomes has long been hindered by a lack of laboratory-cultivated microbial isolates. This problem has been alleviated over the past decade by the rapid advancement and adoption of metagenomics, a sequencing approach that determines the genomic composition of mixed microbial populations, encompassing hundreds to thousands of genomes, as a whole. Reconstructing species diversity and function from metagenomic data is computationally expensive because short sequencing reads are difficult to assemble and sequencing data are difficult to assign to specific taxonomic lineages. This causes significant lags between data generation and data interpretation.

The goal of this project is to advance the development of deep learning models for metagenomic data analysis. We will improve on the models developed in our prior studies by examining how training data, model parameterization, and model architecture influence the speed and accuracy of model development. We will also streamline the computational pipeline for model training and testing to increase speed and gain better control of the data flow.

Project Information Subsection

The main deliverable of this project is a computational workflow, running on an appropriate computational resource, that supports the development of a well-trained deep learning (DL) model broadly applicable to the analysis of metagenomic data from diverse environments.
{Empty}
Graduate or undergraduate
Experience with machine learning
Python programming
Prior experience with GPU computing
{Empty}
Some hands-on experience
{Empty}
University of Rhode Island
{Empty}
CR-University of Rhode Island
06/15/2021
No
Already behind
5
Start date is flexible
6
{Empty}
07/14/2021
01/12/2022
  • Milestone Title: Milestone #1
    Milestone Description: Launch presentation; Access to appropriate computational resources; set up the project with version control etc. via GitHub.
    Completion Date Goal: 2021-07-15
    Actual Completion Date: 2021-07-15
  • Milestone Title: Milestone #2
    Milestone Description: Compare a newly developed in-house read simulator with existing simulators to identify differences in algorithm design and data simulation; adapt the new simulator to simulate reads from diverse sequencing platforms.
    Completion Date Goal: 2021-08-15
    Actual Completion Date: 2021-08-15
  • Milestone Title: Milestone #3
    Milestone Description: Train a deep learning model on a dataset generated with the new simulator and compare the training outcomes.
    Completion Date Goal: 2021-09-15
    Actual Completion Date: 2021-09-15
  • Milestone Title: Milestone #4
    Milestone Description: Incorporate an oversampling approach and investigate how it affects the model’s performance.
    Completion Date Goal: 2021-10-15
    Actual Completion Date: 2021-10-15
  • Milestone Title: Milestone #5
    Milestone Description: Adapt approach to a dataset of reference genomes.
    Completion Date Goal: 2021-11-15
    Actual Completion Date: 2021-12-15
  • Milestone Title: Milestone #6
    Milestone Description: Evaluate model performance; update GitHub with code, documentation; wrap-up presentation.
    Completion Date Goal: 2021-12-15
    Actual Completion Date: 2022-01-12
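Milestone #4 refers to an oversampling approach without specifying it. As a minimal sketch, assuming plain random oversampling (duplicating reads from under-represented taxa until every class matches the largest one), the idea can be illustrated as follows. The function name `oversample`, the toy reads, and the labels `taxon_A` and `taxon_B` are hypothetical and are not taken from the project's code.

```python
import random
from collections import Counter


def oversample(sequences, labels, seed=0):
    """Randomly duplicate minority-class items until all classes are equal in size.

    Assumption: simple random oversampling with replacement; the project
    may have used a different strategy.
    """
    rng = random.Random(seed)

    # Group the sequences by their class label.
    by_label = {}
    for seq, lab in zip(sequences, labels):
        by_label.setdefault(lab, []).append(seq)

    # Every class is grown to the size of the largest class.
    target = max(len(seqs) for seqs in by_label.values())

    out_seqs, out_labels = [], []
    for lab, seqs in by_label.items():
        extra = [rng.choice(seqs) for _ in range(target - len(seqs))]
        for seq in seqs + extra:
            out_seqs.append(seq)
            out_labels.append(lab)
    return out_seqs, out_labels


# Toy example: three reads from one taxon, one read from another.
toy_reads = ["ACGT", "GGCA", "TTAG", "CATG"]
toy_labels = ["taxon_A", "taxon_A", "taxon_A", "taxon_B"]
balanced_reads, balanced_labels = oversample(toy_reads, toy_labels)
class_counts = Counter(balanced_labels)
```

After balancing, `class_counts` holds the same count for every label, so a classifier trained on the result sees each taxon equally often. Duplicated reads add no new information, which is one reason the effect on model performance has to be evaluated empirically.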
{Empty}
{Empty}
{Empty}
{Empty}
{Empty}
GPU cluster; potentially AiMOS?
{Empty}

Final Report

Eric worked on a project related to taxonomic classification, a task biologists frequently perform to identify microbial organisms in environmental samples. His project consisted of designing a read simulator that generates short DNA sequences using bacterial genomes as templates. By showing that a deep learning model trained on such data can efficiently classify regular short DNA sequences, Eric’s project provides promising results for improving the classification of short DNA sequences with deep learning algorithms and thereby contributes to the fields of metagenomics and bioinformatics.
The approach taken towards parallelism and the lessons learned are applicable to a variety of other areas in science and engineering. These outcomes may benefit other disciplines that face a similar need: increased throughput on relevant computations.
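The report does not describe how the read simulator works internally. The following is only a minimal sketch of a template-based simulator, assuming uniformly sampled start positions, a fixed read length, and independent substitution errors. The name `simulate_reads`, its parameters, and the toy template string are hypothetical and do not come from the project's code.

```python
import random


def simulate_reads(genome, read_length, n_reads, error_rate=0.01, seed=0):
    """Sample fixed-length reads from a template and add substitution errors.

    Assumptions (not from the report): uniform start positions, a single
    read length, and independent per-base substitutions.
    """
    if read_length > len(genome):
        raise ValueError("read_length exceeds genome length")
    rng = random.Random(seed)

    reads = []
    for _ in range(n_reads):
        # Pick a start position so the read fits inside the template.
        start = rng.randrange(len(genome) - read_length + 1)
        bases = list(genome[start:start + read_length])
        for i, base in enumerate(bases):
            # With probability error_rate, swap the base for a different one.
            if rng.random() < error_rate:
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))
    return reads


# Toy template for illustration only; a real run would load a bacterial genome.
template = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
exact_reads = simulate_reads(template, read_length=10, n_reads=5, error_rate=0.0)
noisy_reads = simulate_reads(template, read_length=10, n_reads=5, error_rate=1.0)
```

Each returned tuple keeps the start position next to the read, so a simulated read can be traced back to its source genome when building labeled training data. Real sequencing platforms have position-dependent error profiles and insertions or deletions, which this sketch does not model.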

Eric was able to transfer the knowledge he acquired through this project into his academic training, especially his computer science and Python coding classes.
None.
The supported RCF, Eric Rangel, is now very interested in a career in research computing owing to his experience in CyberTeams. The supported student was effectively retained, i.e., will continue on to a STEM career as a direct outcome of this grant. Funded research opportunities appear to have a strong impact on students in STEM disciplines.

In addition, the student was mentored and trained in how research in computational biology (and other sciences) is conducted, the mathematical and technological tools involved, and how to overcome roadblocks and challenges when working with unknowns.
None.
None.
None.
By improving the identification of microorganisms in samples collected from various environments such as oceans, biologists can provide more accurate results on the impact of climate change on microbial communities and the subsequent effects on ocean health and marine animals.

Advancement of scientific efforts and the development of a STEM-trained workforce have many established positive impacts on society beyond those particular areas. While difficult to quantify, we envision that in the long run there will be a tangible positive impact of such projects well beyond their domains.
Beyond the scientific benefits that can be obtained from advanced instrumentation such as HPC systems, it became clear that a student with the right combination of interests and background can positively impact the throughput of a research lab over a short-term engagement as long as there is a supporting team available to the student as a resource.

Additionally, such positive short-term engagements seemed to be sufficient to get the student enthusiastic about a career in research computing support.
The project was successful both in training a student-facilitator in computing skills and machine learning and in making progress on new techniques to enhance the performance of taxonomic classification using deep learning algorithms;
Positive impact from short-term student engagement on research lab throughput and student retention in a STEM discipline;
Workforce development and training in a STEM area (research computing).