Knowledge Base Resources

Contributed by cyberinfrastructure professionals (researchers, research computing facilitators, research software engineers and HPC system administrators), these resources are shared through the ConnectCI community platform. Add resources you find helpful!

ACCESS Pegasus Documentation

The documentation provides an overview of using Pegasus, a workflow management system, on ACCESS resources for high throughput computing (HTC) workloads, covering logging in, workflow creation, resource configuration, and monitoring options.

pegasus

1 Like

Type

documentation

Managing Python Packages on an HPC Cluster

Python Packages on HPC

This workshop covers the different ways Python packages can be managed in a cluster environment using conda and Python virtual environments, both in batch mode from the command line and interactively with Jupyter Notebooks and JupyterLab on the cluster. The examples are run on the GMU HOPPER Cluster.
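As a minimal sketch of the virtual-environment approach mentioned above, the snippet below creates an isolated environment with Python's standard-library `venv` module. It is not taken from the workshop materials: the directory name `demo-env`, the helper `create_env`, and the temporary location are assumptions for illustration. On a cluster you would typically load a Python module first and place the environment in project or scratch space.

```python
import sys
import tempfile
import venv
from pathlib import Path


def create_env(parent_dir: str, name: str = "demo-env") -> Path:
    """Create a virtual environment under parent_dir and return its path."""
    env_dir = Path(parent_dir) / name
    # with_pip=False keeps this sketch fast and offline; use with_pip=True
    # when pip should be bootstrapped into the new environment.
    venv.create(env_dir, with_pip=False)
    return env_dir


# Hypothetical location: a throwaway temporary directory.
workdir = tempfile.mkdtemp()
env_dir = create_env(workdir)

# Each environment records its base interpreter in pyvenv.cfg.
config_text = (env_dir / "pyvenv.cfg").read_text()
bin_dir = env_dir / ("Scripts" if sys.platform == "win32" else "bin")

print(f"Created environment at {env_dir}")
print(f"Activation script directory: {bin_dir}")
```

Keeping one environment per project makes it easier to reproduce the package set later in batch jobs.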

documentation pytorch data-science open-ondemand batch-jobs job-submission slurm environment-modules anaconda jupyterhub python library-paths dependencies pip version-control

1 Like

Type

documentation

Introduction to Python for Digital Humanities and Computational Research

Introduction to Python book

This documentation contains introductory material on Python programming for Digital Humanities and Computational Research. It can serve as go-to material for beginners learning Python and for anyone wanting a Python refresher.

ai big-data data-analysis deep-learning data-science python

1 Like

Type

documentation

Useful R Packages for Data Science and Statistics

https://www.udacity.com/blog/2021/01/best-r-packages-for-data-science.html

This Udacity article lists the most frequently used R packages for data science and statistics. For each package, the article links to its official documentation. It is a great starting point if you want to begin your data science journey in R.

plotting visualization data-analysis machine-learning data-science r

1 Like

Type

documentation

Enhanced Sampling for MD simulations

Tools and plugins to enhance molecular dynamics sampling

data-analysis computational-chemistry c++ conda cuda python

1 Like

Type

tool

DARWIN Documentation Pages

DARWIN Documentation

DARWIN (Delaware Advanced Research Workforce and Innovation Network) is a big data and high performance computing system designed to catalyze Delaware research and education.

darwin big-data

1 Like

Type

documentation

PyTorch for Deep Learning and Natural Language Processing

Introduction to PyTorch for Deep Learning

PyTorch is a Python library that supports accelerated GPU processing for Machine Learning and Deep Learning. In this tutorial, I will teach the basics of PyTorch from scratch. I will then explore how to use it for some ML projects such as Neural Networks, Multi-layer perceptrons (MLPs), Sentiment analysis with RNN, and Image Classification with CNN.

ai big-data data-analysis deep-learning machine-learning neural-networks

1 Like

Type

documentation

Data Visualization tools for Python

MatPlotLib Docs

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. It simplifies analyzing and presenting your data, and it works with Python, which many people already know.

documentation python

1 Like

Type

documentation

GIS: Geocoding Services

Geocoding is the process of converting a street address into coordinates that can be plotted on a map. This conversion typically requires an API call to a remote server hosted by an organization or institution. The server compares the address attributes you provide against the data it holds and returns a best estimate of the coordinates for that location. Many geocoding services are available, and they differ in world coverage, result quality, and rate limits. For R, the "tidygeocoder" package provides an easy way to connect to these services. As an added benefit, its documentation gives a good summary of the available geocoding services, with links to their documentation. The link to the documentation for geocoding services accessible through "tidygeocoder" is provided below. For Python, the geopy package provides connections to various geocoding services. The link to its documentation is also included below.
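The request-and-lookup pattern described above can be sketched in Python without contacting any real service. Everything in this example is hypothetical: `FAKE_SERVICE` stands in for the remote server's data, `stub_geocode` stands in for the API call, and the addresses and coordinates are made up. The pause between calls illustrates one simple way to stay within a service's rate limit.

```python
import time
from typing import Callable, Dict, Iterable, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)

# Stand-in for a remote service: a tiny lookup table with made-up entries.
FAKE_SERVICE: Dict[str, Coordinates] = {
    "1 example street, sampletown": (10.0, 20.0),
    "2 demo avenue, testville": (-5.5, 42.25),
}


def stub_geocode(address: str) -> Optional[Coordinates]:
    """Pretend API call: normalise the address and look it up."""
    return FAKE_SERVICE.get(address.strip().lower())


def geocode_all(
    addresses: Iterable[str],
    geocode: Callable[[str], Optional[Coordinates]],
    min_delay_seconds: float = 0.0,
) -> Dict[str, Optional[Coordinates]]:
    """Geocode each address, pausing between calls to respect a rate limit."""
    results: Dict[str, Optional[Coordinates]] = {}
    for index, address in enumerate(addresses):
        if index > 0:
            time.sleep(min_delay_seconds)
        results[address] = geocode(address)  # None means "no match found"
    return results


# With geopy installed, a function wrapping a real geocoder (for example
# geopy.geocoders.Nominatim) could be passed in place of stub_geocode.
results = geocode_all(
    ["1 Example Street, Sampletown", "Nowhere Road"],
    stub_geocode,
    min_delay_seconds=0.01,
)
for address, coords in results.items():
    print(address, "->", coords)
```

Passing the geocoder in as a function keeps the rate-limiting loop independent of whichever service is chosen.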

gis

1 Like

Type

documentation

Intro to Machine Learning on HPC

This tutorial introduces machine learning on high performance computing (HPC) clusters. While it focuses on the HPC clusters at The University of Arizona, the content is generic enough that it can be used by students from other institutions.

ai supervised-learning unsupervised-learning deep-learning machine-learning neural-networks

0 Likes

Type

documentation

CUDA Toolkit Documentation

NVIDIA CUDA Toolkit Documentation: If you work with GPUs in HPC, the NVIDIA CUDA Toolkit is essential. The CUDA Toolkit documentation, including programming guides and API references, is available at the linked website.

documentation c c++ fortran python

0 Likes

Type

documentation

Gaussian 16

Gaussian 16 is a computational chemistry package that is used in predicting molecular properties and understanding molecular behavior at a quantum mechanical level.

gaussian computational-chemistry

0 Likes

Type

tool

MDAnalysis - Python library for the analysis of molecular dynamics simulations

MDAnalysis

MDAnalysis is a Python-based library of tools for the analysis of molecular dynamics simulations. It can read and write many popular simulation formats, including CHARMM, LAMMPS, GROMACS, and AMBER. This link contains the documentation pages for all MDAnalysis functions and links to tutorials that use Jupyter Notebooks.

computational-chemistry materials-science python

0 Likes

Type

tool

Representation Learning in Deep Learning

Representation learning is a fundamental concept in machine learning and artificial intelligence, particularly in the field of deep learning. At its core, representation learning involves the process of transforming raw data into a form that is more suitable for a specific task or learning objective. This transformation aims to extract meaningful and informative features or representations from the data, which can then be used for various tasks like classification, clustering, regression, and more.

deep-learning image-processing machine-learning neural-networks

0 Likes

Type

documentation

Official Documentation for PyTorch and NumPy

The official documentation for PyTorch, a tensor-based machine learning framework, and NumPy, which provides ndarrays that are useful for building tensors when implementing neural networks. Both libraries can be installed with pip.

deep-learning neural-networks pytorch python

0 Likes

Type

documentation

MATLAB bioinformatics toolbox

https://www.mathworks.com/products/bioinfo.html

Bioinformatics Toolbox provides algorithms and apps for Next Generation Sequencing (NGS), microarray analysis, mass spectrometry, and gene ontology. Using toolbox functions, you can read genomic and proteomic data from standard file formats such as SAM, FASTA, CEL, and CDF, as well as from online databases such as the NCBI Gene Expression Omnibus and GenBank.

visualization data-analysis bioinformatics genomics matlab

0 Likes

Type

tool

EasyBuild Documentation

EasyBuild is a software installation framework that allows administrators to easily build and install software on high-performance computing (HPC) systems. It supports a wide range of software packages, toolchains, and compilers. Supported software is listed in the EasyConfigs repository, one of several repositories in the EasyBuild project.

easybuild

0 Likes

Type

documentation

Introductory Tutorial to Numpy and Pandas for Data Analysis

Numpy and Pandas for Data Analysis

In this tutorial, I present an overview, with many examples, of using Numpy and Pandas for data analysis. Beginners in the field of data analysis can find it incredibly helpful, and anyone who already has experience in data analysis and needs a refresher can find value in it as well. I discuss the use of Numpy for analyzing one- and two-dimensional data and give an introduction to using Pandas to manipulate CSV files.

ai big-data data-analysis vectorization

0 Likes

Type

documentation

Weka

Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data preparation, classification, regression, clustering, association rules mining, and visualization.

big-data data-analysis machine-learning weka data-science java

0 Likes

Type

tool

Neural Networks in Julia

Neural Networks in Julia using Flux.jl

Making a neural network has never been easier! The following link directs users to the Flux.jl package, an easy way of programming a neural network in the Julia programming language. Julia is a fast-growing language for AI/ML, and this package offers an alternative to Python's TensorFlow and PyTorch that is written in 100% native Julia and supports GPUs.

ai deep-learning machine-learning neural-networks julia

0 Likes

Type

tool

ACCESS KB Guide - Anvil

Purdue University is the home of Anvil, a powerful supercomputer that provides advanced computing capabilities to support a wide range of computational and data-intensive research spanning from traditional high-performance computing to modern artificial intelligence applications.

anvil

0 Likes

Type

documentation

A survey on datasets for fairness-aware machine learning

The research paper provides an overview of various datasets that have been used to study fairness in machine learning. It discusses the characteristics of these datasets, such as their size, diversity, and the fairness-related challenges they address. The paper also examines the different domains and applications covered by these datasets.

ai data-analysis deep-learning data-science

0 Likes

Type

documentation

Samtools Documentation

https://www.htslib.org/doc/

Samtools is a suite of programs for interacting with high-throughput sequencing data, especially in the SAM/BAM format. It offers various utilities for processing, analyzing, and managing sequence data generated from next-generation sequencing (NGS) experiments. Samtools is widely used in bioinformatics and genomics research for tasks such as read alignment, variant calling, and data manipulation.
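To give a feel for the SAM text format mentioned above, here is a minimal sketch that parses one alignment line into its eleven mandatory fields and checks two bits of the FLAG field. It is an illustration only, not part of Samtools: the record is made up, and `SamRecord` and `parse_sam_line` are hypothetical helper names. Real analyses should rely on Samtools itself or an established library.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory columns of a SAM alignment line."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line: str) -> SamRecord:
    """Split a tab-separated alignment line; optional tags are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)


def is_unmapped(record: SamRecord) -> bool:
    return bool(record.flag & 0x4)   # FLAG bit 0x4: segment unmapped


def is_reverse_strand(record: SamRecord) -> bool:
    return bool(record.flag & 0x10)  # FLAG bit 0x10: reverse complemented


# Made-up record for illustration only.
example_line = "read1\t16\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n"
record = parse_sam_line(example_line)
print(record.qname, record.rname, record.pos, is_reverse_strand(record))
```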

documentation data-analysis bioinformatics data-science genomics

0 Likes

Type

documentation

DAGMan for orchestrating complex workflows on HTC resources (High Throughput Computing)

DAGMan (Directed Acyclic Graph Manager) is a meta-scheduler for HTCondor that manages dependencies between jobs at a higher level than the HTCondor scheduler. It is a workflow management system developed by the High-Throughput Computing (HTC) community for managing large-scale scientific computations and data analysis tasks. Users define complex workflows as directed acyclic graphs (DAGs), in which nodes represent individual computational tasks and directed edges represent dependencies between them. DAGMan manages the execution of these tasks and ensures that they run in the correct order based on their dependencies. Its primary purpose is to simplify the management of large-scale computations that consist of numerous interdependent tasks: once the dependencies are expressed in a DAG, DAGMan handles scheduling and coordination, which makes complex scientific workflows easier to develop, run, and track.
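The ordering idea behind a DAG of tasks can be illustrated with Python's standard-library `graphlib` module. This is a sketch of dependency-ordered execution in general, not of DAGMan's own input format or behavior; the task names and the `run_workflow` helper are hypothetical, and "running" a task here only records its name.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
dependencies = {
    "preprocess": set(),
    "simulate_a": {"preprocess"},
    "simulate_b": {"preprocess"},
    "analyze": {"simulate_a", "simulate_b"},
}


def run_workflow(deps):
    """Visit tasks so that every task runs after all of its dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    order = []
    while sorter.is_active():
        # Tasks returned together have no unfinished dependencies, so a
        # real scheduler could submit them in parallel.
        ready = sorted(sorter.get_ready())
        for task in ready:
            order.append(task)  # placeholder for submitting a job
            sorter.done(task)   # marking it done unlocks dependent tasks
    return order


order = run_workflow(dependencies)
print(order)
```

Here `simulate_a` and `simulate_b` become ready at the same time once `preprocess` finishes, and `analyze` waits for both.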

open-science-grid

0 Likes

Type

tool

Official Documentation of VisIt

VisIt is a prominent open-source, interactive parallel visualization and graphical analysis tool predominantly used for viewing scientific data. Its GitHub repository offers a detailed insight into the software's source code, documentation, and contribution guidelines. In particular, it offers useful examples on how it

visIt novel-accelerators particle-physics

0 Likes

Type

documentation
