Submission Number: 98
Submission ID: 141
Submission UUID: f2357cc0-989a-468e-8cfc-be09d5a36a09
Submission URI: /form/project

Created: Tue, 04/27/2021 - 10:19
Completed: Tue, 04/27/2021 - 11:05
Changed: Wed, 07/06/2022 - 15:09

Remote IP address: 47.14.5.69
Submitted by: Chris Hill
Language: English

Is draft: No
Webform: Project
Project Title: A green, open-source, greater-than-10B-parameter language model.
Program:
CAREERS (323)

Project Image: {Empty}
Tags:
{Empty}

Status: Halted
Project Leader
--------------
Project Leader:
Chris Hill

Email: Cnh@mit.edu
Mobile Phone: {Empty}
Work Phone: {Empty}

Project Personnel
-----------------
Mentor(s):
Chris Hill (151)

Student-facilitator(s):
{Empty}

Mentee(s):
{Empty}


Project Information
-------------------
Project Description:
We are developing a new language model derived from the EleutherAI GPT-Neo initiative ( https://github.com/EleutherAI/gpt-neo ) for application to two projects. These problems need models with skill close to that of the state-of-the-art proprietary GPT-3 model. One project is a demonstration of the model for state-of-the-art image captioning; the other is the publication of the full model as an open community tool for the research community.

For both projects we are interested in collaborating with Cyberteams students to work on model training optimization and testing. The project is looking to run model training and evaluate performance on multi-node configurations of the Aimos 6-GPU/node system. This will allow us to examine scaling and, with appropriate discussions with IBM teams, potentially prepare for larger experiments. The model we will use is efficient, and some preliminary work has been undertaken at MGHPCC. Both the RPI and MGHPCC systems have excellent carbon emissions footprints, so we also anticipate being able to report energy and emissions statistics that are state-of-the-art for large-scale language model training.
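
The sketch below is a minimal illustration of the kind of multi-node, multi-GPU training setup described above, using PyTorch DistributedDataParallel over NCCL. It assumes a launcher such as torchrun has set the usual RANK/LOCAL_RANK/WORLD_SIZE environment variables; the tiny linear model and random data are placeholders, not the project's GPT-Neo-derived code.

# Minimal multi-node DDP sketch (illustrative only; not the project's actual
# training code). Assumes a launcher such as torchrun has set MASTER_ADDR,
# RANK, LOCAL_RANK, and WORLD_SIZE, and that NCCL is available on the nodes.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # One process per GPU; NCCL handles intra- and inter-node collectives.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    # Placeholder model; a real run would build the GPT-Neo-derived network here.
    model = torch.nn.Linear(1024, 1024).to(device)
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    for step in range(10):  # placeholder loop; real training iterates a dataset
        x = torch.randn(8, 1024, device=device)
        loss = ddp_model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across every GPU on every node
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

A run matching the requested configuration might be launched along the lines of torchrun --nnodes=4 --nproc_per_node=6 train.py; this is an assumption about the launcher, not a prescription for how jobs are submitted on Aimos.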

Project Information Subsection
------------------------------
Project Deliverables:
Start -  Core benchmarks on Aimos 4-nodes, 6 GPUs/node
Middle -  Extended model training with checkpointing and "Turing test" validations against standard benchmarks.
Conclusion - Initial paper submitted to NeurIPS and plans for extended experiments explored. Code and models will be made fully available as open source, professionally documented and available to any research group.

Project Deliverables:
{Empty}

Student Research Computing Facilitator Profile:
A student with interest in ML and some programming experience would be best. 

Mentee Research Computing Profile:
{Empty}

Student Facilitator Programming Skill Level: Can work with any level
Mentee Programming Skill Level: {Empty}
Project Institution: MIT
Project Address:
Cambridge, Massachusetts. 02139

Anchor Institution: NE-MGHPCC
Preferred Start Date: 04/29/2021
Start as soon as possible.: Yes
Project Urgency: Already behind / Start date is flexible
Expected Project Duration (in months): One month
Launch Presentation: {Empty}
Launch Presentation Date: {Empty}
Wrap Presentation: {Empty}
Wrap Presentation Date: {Empty}
Project Milestones:
- Milestone Title: Start
  Milestone Description:  Core benchmarks on Aimos 4-nodes, 6 GPUs/node
  Completion Date Goal: 2021-05-06
- Milestone Title: Middle
  Milestone Description: Extended model training with checkpointing and "Turing test" validations against standard benchmarks.
  Actual Completion Date: 2021-05-18
- Milestone Title: Close
  Milestone Description: Initial paper submitted to NeurIPS and plans for extended experiments explored. Code and models will be made fully available as open source, professionally documented and available to any research group.
  Completion Date Goal: 2021-06-15

Github Contributions: {Empty}
Planned Portal Contributions (if any):
Tools will be openly published via GitHub.

Planned Publications (if any):
At least two:
 1. NeurIPS
 2. TBD

What will the student learn?:
Various ML tools:
 PyTorch, NCCL, Megatron, DeeperSpeed

Performance profiling:
 profiling of GPU code on the Aimos and MGHPCC Satori systems (a sketch follows below)
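
As a hedged example of the profiling work mentioned above, the sketch below uses torch.profiler to collect CPU and CUDA kernel timings. The linear layer is a stand-in for the real model, and the trace file name is arbitrary.

# Minimal GPU profiling sketch with torch.profiler (illustrative only; the
# linear layer stands in for the actual model, and "trace.json" is an
# arbitrary output name).
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(64, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    for _ in range(10):
        model(x).sum().backward()

# Summarize the hottest CUDA kernels and export a trace that can be opened
# in chrome://tracing or TensorBoard.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")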

What will the mentee learn?:
{Empty}

What will the Cyberteam program learn from this project?:
This will be a collaboration involving some of the most energy-efficient systems with models that are traditionally seen as very resource-hungry.

HPC resources needed to complete this project?:
4 Aimos nodes, 6 GPUs per node. 
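
As a rough, assumption-laden illustration of why a 4-node, 6-GPU/node allocation pairs with tools like Megatron and DeeperSpeed for a greater-than-10B-parameter model, the back-of-envelope estimate below uses a common rule of thumb of roughly 18 bytes of model state per parameter under mixed-precision Adam; the 32 GB per-GPU memory figure is an assumed value, not a statement about the Aimos hardware.

# Back-of-envelope estimate only (rule-of-thumb values, not measured figures):
# mixed-precision Adam training is commonly estimated at ~18 bytes of model
# state per parameter (fp16 weights + fp16 grads + fp32 master weights,
# momentum, and variance), ignoring activations entirely.
params = 10e9            # >10B-parameter target from the project title
bytes_per_param = 18     # rule-of-thumb for fp16 training with Adam
gpus = 4 * 6             # 4 Aimos nodes x 6 GPUs per node
gpu_hbm_gb = 32          # assumed per-GPU memory (hypothetical value)

total_gb = params * bytes_per_param / 1e9
per_gpu_gb = total_gb / gpus
print(f"~{total_gb:.0f} GB of model state in total; far more than one "
      f"{gpu_hbm_gb} GB GPU, but ~{per_gpu_gb:.1f} GB per GPU if sharded "
      f"evenly across {gpus} GPUs, before counting activations.")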

Notes:
There are two student facilitators (Alex Andonian and David Bau) and one mentor (John Cohn) not currently in the Cyberteams system.  



Final Report
------------
What is the impact on the development of the principal discipline(s) of the project?:
{Empty}

What is the impact on other disciplines?:
{Empty}

Is there an impact physical resources that form infrastructure?:
{Empty}

Is there an impact on the development of human resources for research computing?:
{Empty}

Is there an impact on institutional resources that form infrastructure?:
{Empty}

Is there an impact on information resources that form infrastructure?:
{Empty}

Is there an impact on technology transfer?:
{Empty}

Is there an impact on society beyond science and technology?:
{Empty}

Lessons Learned:
{Empty}

Overall results:
{Empty}