MATCHPlus Engagements
Investigation of the robustness of state-of-the-art methods for anxiety detection in real-world conditions
I am new to ACCESS. I have a little bit of past experience running code on NCSA's Blue Waters. As a self-taught programmer, it would be interesting to learn from an experienced mentor.
Here's an overview of my project:
Anxiety detection is an actively studied topic, but existing methods struggle to generalize and perform outside of controlled lab environments. I propose to critically analyze state-of-the-art detection methods, quantitatively characterize the failure modes of existing applied machine learning models, and introduce methods that are robust to real-world challenges. The study will begin with a sensitivity analysis of the existing best-performing models, followed by tests of existing hypotheses about why these models fail in the real world. We expect this to yield a deeper understanding of why models fail, and to let us use explainability techniques to design better in-lab experimental protocols and machine learning models that perform better in real-world scenarios. The findings will dictate future directions, which may include improving personalized health detection, carefully designing experimental protocols that empower transfer learning to extend the reach of existing anxiety detection models, and using explainability techniques to inform better sensing methods and hardware.
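As a concrete starting point, a perturbation-based sensitivity analysis can be sketched in a few lines of Python. The `sensitivity` helper and the linear toy model below are illustrative stand-ins under assumed interfaces, not any published anxiety-detection model:

```python
import numpy as np

def sensitivity(model_predict, X, noise_scale=0.05, trials=20, seed=0):
    """Perturbation-based sensitivity analysis: add small Gaussian noise to
    each feature in turn and measure how much the model's output moves.
    `model_predict` is any callable mapping an (n, d) array to predictions."""
    rng = np.random.default_rng(seed)
    base = model_predict(X)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(trials):
            Xp = X.copy()
            # perturb feature j by ~5% of its own spread
            Xp[:, j] += rng.normal(0, noise_scale * X[:, j].std(), size=len(X))
            scores[j] += np.mean(np.abs(model_predict(Xp) - base))
    return scores / trials  # higher = model more sensitive to that feature

# Toy usage: a linear "model" whose known weights the scores should mirror.
X = np.random.default_rng(1).normal(size=(200, 3))
weights = np.array([2.0, 0.5, 0.0])
scores = sensitivity(lambda A: A @ weights, X)
print("per-feature sensitivity:", np.round(scores, 3))
```

The same loop applies unchanged to a trained classifier's `predict_proba`, which is what makes it a reasonable first probe of a lab-trained model's fragility.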

GPU-accelerated ice sheet flow modeling
Sea levels are rising at 3.7 mm/yr, and the rate is increasing. The primary contributor is enhanced polar ice discharge due to climate change; however, the dynamic response of the ice sheets to climate change remains a fundamental uncertainty in future projections. Computational cost limits the simulated time spans over which models can run, and thus how far the uncertainty in future sea level rise predictions can be narrowed. The project's overarching goal is to leverage GPU hardware capabilities to significantly alleviate that computational cost and narrow the uncertainty in future sea level rise predictions.
[Plan A] The PI is investigating numerical techniques to predict ice-sheet flow for large-to-continental glaciers on GPUs. If a successful technique has been identified, implemented in MATLAB, and verified before the start of this project, the objective will be to port the code to CUDA C and investigate techniques that justify the GPU implementation on price-to-performance and power-consumption-to-performance metrics compared to a "standard" CPU implementation.
[Plan B] The PI developed a preliminary GPU implementation to predict ice-sheet flow for regional-scale glaciers. The GPU ice velocity predictions agreed (~1% discrepancy) with a "standard" CPU implementation for the chosen glacier model configurations and input data, and the GPU implementation was justified on the price-to-performance and power-consumption-to-performance metrics. However, profiling of the preliminary GPU implementation indicated non-optimal global memory access patterns reported for the L1TEX and L2 caches. Furthermore, we observed significant drops in effective memory throughput as the spatial resolution, and hence the number of degrees of freedom (DoFs), increased. This project will investigate techniques to reduce mesh non-localities and the drop in effective memory throughput at higher spatial resolutions.
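For readers unfamiliar with the effective-memory-throughput metric discussed above, it can be illustrated on the CPU with a NumPy stand-in for a memory-bound stencil update. This is a sketch of the metric only, not the PI's CUDA code; the sizes and the 5-point stencil are assumptions:

```python
import time
import numpy as np

def effective_throughput_gbs(n, reps=3):
    """Time a memory-bound 5-point stencil update on an n-by-n grid and
    report effective throughput = (bytes read + bytes written) / time,
    the same style of metric a GPU profiler reports per kernel."""
    u = np.random.rand(n, n)
    t0 = time.perf_counter()
    for _ in range(reps):
        # 5-point stencil: each output element reads 5 inputs, writes 1
        out = (u[1:-1, 1:-1] + u[:-2, 1:-1] + u[2:, 1:-1]
               + u[1:-1, :-2] + u[1:-1, 2:])
    elapsed = (time.perf_counter() - t0) / reps
    bytes_moved = (5 + 1) * out.size * u.itemsize  # nominal reads + writes
    return bytes_moved / elapsed / 1e9

for n in (256, 512, 1024, 2048):
    print(f"n={n:5d}: {effective_throughput_gbs(n):6.1f} GB/s")
```

Watching this number fall as n grows past cache sizes mirrors, in miniature, the resolution-dependent throughput drop the project aims to investigate on the GPU.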
Run Markov Chain Monte Carlo (MCMC) in Parallel for Evolutionary Study
My ongoing project uses species trait values (as data matrices) and their corresponding phylogenetic relationships (as a distance matrix) to reconstruct the evolutionary history of the smoke-induced seed germination trait. The results are expected to improve predictions of which untested species could benefit from smoke treatment, which could promote the germination success of native species in ecological restoration. The computational resources allocated for this project come from the high-memory partition of our Ivy cluster of HPCC (CentOS 8, Slurm 20.11, 1.5 TB memory/node, 20 cores/node, 4 nodes). However, given that I have over 1,300 species to analyze, using the maximum amount of resources to speed up the analysis is a challenge for two reasons: (1) the ancestral state reconstruction (the evolutionary history of plant traits) relies on Bayesian Markov Chain Monte Carlo (MCMC), which runs more than 10 million steps and, according to experienced evolutionary biologists, could take a traditional single-core simulation up to 6 months; and (2) my data contain over 1,300 native species with about 500 polymorphic points (phylogenetic uncertainty), which require large-scale random simulation to give statistical strength. For instance, with 100 simulations for each of the 500 uncertainty points, I would have 50,000 simulated trees. Based on my previous experience with simulations, I could write code to analyze the 50,000 simulated trees in parallel, but even with this parallelization the long-running MCMC would still require 50,000 cores running for up to 6 months. Given this computational and evolutionary research challenge, my current work focuses on finding a suitable parallelization method for the MCMC steps, and I hope to discuss my project with computational experts.
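One possible starting point for that discussion: since the 50,000 simulated trees are independent, one chain per tree can be farmed out across cores with Python's standard library. The toy random-walk Metropolis chain below is only a stand-in for the real ancestral-state-reconstruction MCMC; the tree count and step count are placeholders:

```python
import math
import random
from concurrent.futures import ProcessPoolExecutor

def run_chain(seed, n_steps=10_000):
    """Toy random-walk Metropolis chain standing in for one MCMC run on
    one simulated tree. Target density: standard normal, pi(x) ~ exp(-x^2/2)."""
    rng = random.Random(seed)
    x, accepted = 0.0, 0
    for _ in range(n_steps):
        prop = x + rng.gauss(0, 1)
        # Metropolis acceptance ratio pi(prop)/pi(x)
        if rng.random() < min(1.0, math.exp((x * x - prop * prop) / 2)):
            x, accepted = prop, accepted + 1
    return seed, x, accepted / n_steps

if __name__ == "__main__":
    tree_ids = range(8)  # stands in for the 50,000 simulated trees
    with ProcessPoolExecutor() as pool:  # one process per available core
        for tree, state, rate in pool.map(run_chain, tree_ids):
            print(f"tree {tree}: final state {state:+.2f}, acceptance {rate:.2f}")
```

Because the chains never communicate, this "embarrassingly parallel over trees" layout scales to as many cores as Slurm grants; shortening the 10-million-step chains themselves would need within-chain methods (e.g., multiple shorter chains with convergence diagnostics), which is a separate design question.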
Stock Return Predictability with Machine Learning Methods
In this project, we plan to apply machine learning methods such as the Transformer to company fundamentals to predict their future stock returns or future earnings.
Our data are monthly from 1965 to 2022, and the dataset is about 750M in size.
Our dependent variables are stock returns or earnings in subsequent periods; our independent variables are company fundamentals. We will implement a rolling scheme. For example, at the end of year t, we use the past five years of data, with four years for training and one year for validation. With the selected model, we test return or earnings predictions on data from year t+1. We then repeat this process for years t+1, t+2, and so on. This will require substantial computing power.
I am still working out how to apply a Transformer in this setting.
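Whatever model is chosen, the rolling train/validation/test scheme described above can be sketched independently of it. The year ranges below are illustrative:

```python
def rolling_splits(years, train=4, val=1, test=1):
    """Yield (train_years, val_years, test_years) tuples for the rolling
    scheme above: at the end of year t, fit on the previous `train` years,
    select the model on the next `val`, then test on the following year."""
    window = train + val
    for i in range(len(years) - window - test + 1):
        yield (years[i:i + train],
               years[i + train:i + window],
               years[i + window:i + window + test])

# Example over the first few sample years
for tr, va, te in rolling_splits(list(range(1965, 1973))):
    print(f"train {tr[0]}-{tr[-1]}  val {va}  test {te}")
```

Keeping the split logic separate from the model makes it easy to run each (train, validate, test) window as an independent job on the cluster, since the windows share no fitted state.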

Improving AlphaFold performance and applications of AlphaFold
The project involves improving the AlphaFold program. It includes two parts: performance improvements to the program itself, and possible applications of it in collaboration with different researchers. The effort will be coordinated across research groups interested in applying AlphaFold.
Specifically, the performance improvements span diverse aspects of the code: (1) I/O, (2) database queries, and (3) multithreading/multiprocessing in Python, as well as GPU acceleration, and more.
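As a minimal sketch of point (3), overlapping I/O-bound work such as sequence-database queries with Python threads can hide much of the waiting time. The database names and timings below are placeholders, not AlphaFold's actual pipeline:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def query_database(name):
    """Placeholder for one I/O-bound database lookup; the sleep simulates
    waiting on disk or network rather than doing CPU work."""
    time.sleep(0.1)
    return f"{name}: done"

databases = ["uniref90", "mgnify", "bfd", "pdb70"]

# Serial: total wall time is roughly the sum of all query times.
t0 = time.perf_counter()
serial = [query_database(db) for db in databases]
serial_t = time.perf_counter() - t0

# Threaded: queries overlap while each waits on I/O (the GIL is released
# during blocking calls), so wall time approaches the slowest single query.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(databases)) as pool:
    threaded = list(pool.map(query_database, databases))
threaded_t = time.perf_counter() - t0

print(f"serial {serial_t:.2f}s vs threaded {threaded_t:.2f}s")
```

Threads suit the I/O and query stages; CPU-bound Python stages would instead need `ProcessPoolExecutor`, since the GIL is not released during pure-Python computation.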
A High-Performance Computing Platform for the MS in Biomedical Image Computing Program
We have requested an educational allocation (EVE220001) to support a new course developed by the Department of Bioengineering at UIUC, BIOE486 Applied Deep Learning for Biomedical Imaging. This course, part of our new Master of Science in Biomedical Image Computing (MS-BIC) program, covers basic concepts, methodology, and algorithms in deep learning and their applications to modern biomedical imaging challenges. To achieve its educational goals, the course requires students to have access to high-performance computing, especially GPUs, to study, train, and improve various deep-neural-network-based models. The purpose of this request is to seek opportunities to build a consistent, more accessible computing and data management platform for our course, and for the MS-BIC program in general.
While students who have used or are using this resource have provided good feedback, they found that getting access to the GPU computing resource, setting up the necessary environment, and installing all the needed software can take a lot of effort and distract them from the core computational and deep learning problems. Moving forward, we hope to work with the MATCH program to create support for our students.
Specifically, if possible, we would like to offer the computing resources in ACCESS, e.g., the Expanse GPU cluster, to all of our MS-BIC students. To that end, we would like a consistent way of setting up the environment(s) in which they develop deep learning models and algorithms in TensorFlow or PyTorch. A training module on working efficiently in interactive programming mode to build and train deep neural networks, e.g., using Jupyter notebooks or VS Code to write TensorFlow or PyTorch code on the cluster, would be highly appreciated. Lastly, the availability of student support through the MATCH program to help students resolve issues when needed would be very helpful.
Ideally, we would like technical support for the remainder of the semester (particularly on using TensorFlow/PyTorch on the Expanse cluster), as the students will use Expanse more frequently while the homework and projects become more complex. For the consistent environment setup and training materials, we do not have an urgent deadline.
Adapting a GEOspatial Agent-based model for Covid Transmission (GeoACT) for general use
GeoACT (GEOspatial Agent-based model for Covid Transmission) is designed to simulate a range of intervention scenarios to help schools evaluate their COVID-19 plans to prevent super-spreader events and outbreaks. It consists of several modules that compute infection risks in classrooms and on school buses, given specific classroom layouts, student populations, and school activities. The first version of the model was deployed on the Expanse (and earlier, Comet) resource at SDSC and accessed via the Apache Airavata portal (geoact.org). The second version is a rewrite that makes the model easier to adjust to new strains, vaccines, and boosters, and includes detailed user-defined school schedules, school floor plans, and local community transmission rates. This version is nearing completion. We will use Expanse to run additional scenarios with the enhanced model and the newly added meta-analysis module. The current goal is to generalize the model so that it can be used for other health emergencies. GeoACT has been in the news, e.g., https://ucsdnews.ucsd.edu/feature/uc-san-diego-data-science-undergrads-help-keep-k-12-students-covid-safe and https://www.hpcwire.com/2022/01/13/sdsc-supercomputers-helped-enable-safer-school-reopenings/ (HPCWire 2022 Editors' Choice Award).
MATCHPremier Engagements

High Performance Computing vs Quantum Computing for Neural Networks supporting Artificial Intelligence
A personalized learning system that adapts to learners' interests, needs, prior knowledge, and available resources is possible with artificial intelligence (AI) that utilizes natural language processing in neural networks. These deep learning neural networks can run on high-performance computers (HPC) or on quantum computers (QC). Both HPC and QC are emergent technologies. The ultimate goal of this project is to understand both systems well enough to select which is more effective for a deep learning AI program, and to demonstrate that understanding through example. The entry path into technologies such as HPC and QC is narrow at present because it relies on classical education methods and mentoring. The gap is widening between the demand for such knowledge workers, which is high, and the rate at which experts able to teach these subjects are produced, which is much slower. Here, an AI cognitive agent trained via deep learning neural networks can help in emergent technology subjects by assisting the instructor-learner pair with adaptive wisdom. We are building the foundations for this AI cognitive agent in this project.
The role of the student facilitator will involve optimizing a deep learning neural network, comparing and contrasting the newest technologies, such as a quantum computer (and/or a quantum computer simulator) and a high-performance computer, and showing the efficiency of the different computing approaches. The student facilitator will perform these tasks at the rate described in the proposal. Milestone work will be displayed and shared publicly by posting Jupyter notebooks on Google Colab, linked to regular GitHub uploads.

Developing Computational Labs for Upper Level Physical Chemistry II Course
Of all the upper-level chemistry courses, physical chemistry is the only one that provides in-depth insight into the fundamental principles underpinning the concepts taught in the various sub-disciplines of chemistry. Further, physical chemistry connects the microscopic and macroscopic worlds of chemistry through mathematical models and experimental methods that test the validity of those models. Computational techniques are therefore a perfect vehicle for teaching the content of a physical chemistry course to undergraduate students. Additionally, the American Chemical Society recommends that computational chemistry be incorporated into the undergraduate chemistry curriculum. At Bridgewater State University (BSU), physical chemistry is a two-semester course referred to as 'Physical Chemistry I' and 'Physical Chemistry II'. While the overarching goal is to develop computational experiments (referred to as 'dry labs'), the project proposed here focuses on designing and developing dry labs for the 'Physical Chemistry II' course at BSU. The inherently theoretical nature of this course, along with its connection to the wide range of spectroscopic techniques commonly used by chemists and physicists, makes it a perfect choice for assessing BSU students' reception of the idea of dry labs. It should be noted that there are currently no computational experiments in the physical chemistry curriculum (either I or II) at BSU. The proposed project focuses on developing 4-6 computational experiments to be introduced (in spring 2018) either as stand-alone dry-lab experiments or as companions to existing experiments. These dry labs will be developed on the Gaussian 09 platform, which is currently installed on the C3DDB server at MGHPCC. Finally, I also expect to make these experiments available to other New England instructors who teach Physical Chemistry II or an equivalent course and are interested in incorporating computational chemistry into their curriculum.

UVM Art and AI Initiative
The UVM Art and AI Initiative is exploring approaches to artistic image production, comparing the results of StyleGAN and genetic algorithms*. More broadly, the project explores emerging artistic practices with machine learning and AI while referencing an artistic lineage to the artists Wassily Kandinsky, John Cage, and Yoko Ono, all of whom employ(ed) instructions and systems in their non-digital artworks. Kandinsky distinguished systems and developed a science of aesthetics from the basic elements of point, line, and plane; Cage used the oracle 'I Ching' like a computer to inform his compositional decisions; Ono writes poetic scores that turn her audience into active participants as they follow a series of imaginative instructions. Through this ongoing research and practice, we intend to join the larger conversation about art and AI and to design new curriculum for UVM undergraduate students.
This work began in February 2020 and is led by Jennifer Karson of UVM’s Department of Art and Art History and the CEMS UVM FabLab. The team has included three UVM students: two graduate students in data science and one undergraduate mechanical engineering student. The team currently uses RunwayML for the StyleGAN experiments and Processing, an open-source language and development environment built on top of the Java programming language, for Genetic Algorithms.
Additional summer funding ($2,000) is sought for one of the UVM Art and AI Initiative student coders. The funding will help the team reach a short-term goal of presenting initial findings this July at ALIFE 2020 in Montreal; a longer-term goal is to create an art installation for the UVM Fleming Museum of Art in spring 2021. This is a unique opportunity to exhibit as part of the statewide project 2020 Vision: Seeing the World through Technology and alongside the work of internationally renowned computer artist and co-founder of the Processing programming language Casey Reas.
Milestone 1:
Genetic Algorithms: Develop working genetic algorithm code that meets compositional standards (color, architecture, appropriate datasets) while creating new compositions from the elements of existing hand-drawn compositions. The program should output image files that can be stored and printed at high resolution on paper.
StyleGAN: Transition from RunwayML to coding in Python on the VACC computer cluster. The process should output image files that can be stored and printed at high resolution on paper for exhibition.
Milestone 2:
Genetic Algorithms: Create an interactive version of the program that allows for audience participation and can be exhibited in a museum gallery and online.
StyleGAN: Develop a video that can be exhibited in a museum gallery and online.
*Our genetic algorithm base code was developed by Daniel Shiffman.
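For readers unfamiliar with the evolutionary loop behind Milestone 1, a minimal Python analogue is sketched below. The genome layout and the "compositional standard" fitness are invented placeholders, not the Initiative's actual Processing/Shiffman code:

```python
import random

# A "composition" here is a toy genome: GENES placements of (x, y, size)
# for elements drawn from a pool of hand-drawn sources.
POP, GENES, GENERATIONS = 30, 10, 40
rng = random.Random(1)

def random_genome():
    return [(rng.random(), rng.random(), rng.random()) for _ in range(GENES)]

def fitness(genome):
    # Stand-in "compositional standard": reward elements clustered near
    # the canvas centre with moderate sizes (higher is better, max 0).
    return -sum((x - 0.5) ** 2 + (y - 0.5) ** 2 + (s - 0.4) ** 2
                for x, y, s in genome)

def crossover(a, b):
    cut = rng.randrange(1, GENES)          # single-point crossover
    return a[:cut] + b[cut:]

def mutate(genome, rate=0.1):
    return [(rng.random(), rng.random(), rng.random())
            if rng.random() < rate else gene
            for gene in genome]

population = [random_genome() for _ in range(POP)]
initial_best = max(map(fitness, population))
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[:POP // 2]        # truncation selection (elitist)
    children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

final_best = max(map(fitness, population))
print(f"best fitness: {initial_best:.3f} -> {final_best:.3f}")
```

In the real project the fitness would encode the team's compositional judgments, and each genome would be rendered to a high-resolution image file rather than scored analytically; because the top half of each generation survives unchanged, the best score can never regress.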