Knowledge Base Resources

These resources have been contributed and “vetted” by the community of cyberinfrastructure professionals (researchers, research computing facilitators, research software engineers, and HPC system administrators) who participate in programs such as this one, which are supported by the ConnectCI community management platform. Additional Knowledge Base resources are always welcome!

ACCESS Pegasus Documentation

ACCESS Pegasus Documentation

The documentation provides an overview of using Pegasus, a workflow management system, on ACCESS resources for high throughput computing (HTC) workloads, covering logging in, workflow creation, resource configuration, and monitoring options.

pegasus

1 Like

Type

documentation

Level

Useful R Packages for Data Science and Statistics

https://www.udacity.com/blog/2021/01/best-r-packages-for-data-science.html

This Udacity article lists the most frequently used R packages for data science and statistics. For each package, the article links to its official documentation. It is a great starting point if you want to begin your data science journey in R.

plotting visualization data-analysis machine-learning data-science r

1 Like

Type

documentation

Level

DARWIN Documentation Pages

DARWIN Documentation

DARWIN (Delaware Advanced Research Workforce and Innovation Network) is a big data and high-performance computing system designed to catalyze Delaware research and education.

darwin big-data

1 Like

Type

documentation

Level

Neural Networks in Julia

Neural Networks in Julia using Flux.jl

Making a neural network has never been easier! The following link directs users to Flux.jl, a package for programming neural networks in the Julia programming language. Julia is a fast-growing language for AI/ML, and Flux.jl offers an alternative to Python's TensorFlow and PyTorch that is written entirely in Julia and supports GPUs.

ai deep-learning machine-learning neural-networks julia

0 Likes

Type

tool

Level

Fairness and Machine Learning

Fairness and Machine Learning

The "Fairness and Machine Learning" book offers a rigorous exploration of fairness in ML and is suitable for researchers, practitioners, and anyone interested in understanding the complexities and implications of fairness in machine learning.

ai data-analysis deep-learning machine-learning data-science

0 Likes

Type

documentation

Level

Samtools Documentation

https://www.htslib.org/doc/

Samtools is a suite of programs for interacting with high-throughput sequencing data, especially in the SAM, BAM, and CRAM formats. It offers various utilities for processing, analyzing, and managing sequence data generated from next-generation sequencing (NGS) experiments. Samtools is widely used in bioinformatics and genomics research for tasks such as sorting, indexing, and filtering read alignments and, together with its companion BCFtools, variant calling.

documentation data-analysis bioinformatics data-science genomics

0 Likes

Type

documentation

Level

MATLAB with other Programming Languages

Using MATLAB with Other Programming Languages

MATLAB is a useful tool for data analysis, among other computational work. This tutorial takes you through using MATLAB with other programming languages, including C, C++, Fortran, Java, and Python.

c c++ fortran java matlab python

0 Likes

Type

tool

Level

ACCESS KB Guide - Anvil

ACCESS KB Guide - Anvil

Purdue University is the home of Anvil, a powerful supercomputer that provides advanced computing capabilities to support a wide range of computational and data-intensive research spanning from traditional high-performance computing to modern artificial intelligence applications.

anvil

0 Likes

Type

documentation

Level

Probabilistic Semantic Data Association for Collaborative Human-Robot Sensing

Probabilistic Semantic Data Association for Collaborative Human-Robot Sensing

Humans cannot always be treated as oracles for collaborative sensing. Robots thus need to maintain beliefs over unknown world states when receiving semantic data from humans, as well as account for possible discrepancies between human-provided data and these beliefs. To this end, this paper introduces the problem of semantic data association (SDA) in relation to conventional data association problems for sensor fusion. It then develops a novel probabilistic semantic data association (PSDA) algorithm to rigorously address SDA in general settings. Simulations of a multi-object search task show that PSDA enables robust collaborative state estimation under a wide range of conditions.

ai machine-learning

0 Likes

Type

documentation

Level

Chameleon

Chameleon User Guide

Chameleon is an NSF-funded testbed system for Computer Science experimentation. It is designed to be deeply reconfigurable, with a wide variety of capabilities for researching systems, networking, distributed and cluster computing and security.

data-sharing data-reproducibility

0 Likes

Type

documentation

Level

TensorFlow for Deep Neural Networks

TensorFlow Docs

TensorFlow is a powerful framework for deep learning, developed by Google. This link covers its Python package, which is easy to use and can train very powerful models.

documentation faster tensorflow

0 Likes

Type

tool

Level

Implementing Markov Processes with Julia

Markov Decision Processes in Julia

The following link provides an easy method of implementing Markov Decision Processes (MDPs) in the Julia programming language. MDPs are a framework for modeling sequential decisions in stochastic situations where the actor has some level of control. For example, at a low level MDPs can be used to control an inverted pendulum, while in higher-level decision making they can decide when to take evasive action in air traffic management. MDPs can also be extended to the partially observable domain to form the Partially Observable Markov Decision Process (POMDP). This link contains a wealth of information showing how one can easily implement basic POMDP and MDP models and apply well-known online and offline solvers.

ai machine-learning julia

0 Likes

Type

tool

Level
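
To make the Markov Decision Processes entry above more concrete, here is a minimal value-iteration sketch, written in Python rather than Julia. It does not use the API of the linked Julia package: the two-state model, its rewards and probabilities, and the function names are all hypothetical and chosen only for illustration.

```python
# Hypothetical two-state MDP (not taken from the linked documentation).
# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    "low": {
        "wait": [(1.0, "low", 0.0)],
        "charge": [(0.9, "high", -1.0), (0.1, "low", -1.0)],
    },
    "high": {
        "wait": [(1.0, "high", 1.0)],
        "work": [(0.7, "high", 2.0), (0.3, "low", 2.0)],
    },
}


def action_value(outcomes, values, gamma):
    """Expected discounted return of one action, given current value estimates."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)


def value_iteration(transitions, gamma=0.9, tol=1e-8):
    """Repeat Bellman optimality backups until the values stop changing."""
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            best = max(action_value(o, values, gamma) for o in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


def greedy_policy(transitions, values, gamma=0.9):
    """Pick, in each state, the action with the highest expected return."""
    return {
        s: max(actions, key=lambda a: action_value(actions[a], values, gamma))
        for s, actions in transitions.items()
    }


values = value_iteration(transitions)
policy = greedy_policy(transitions, values)
print(policy)
```

Value iteration is only one of several offline solvers; it is used here because it fits in a few lines and needs nothing beyond the standard library.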

CUDA Toolkit Documentation

CUDA Toolkit Documentation

NVIDIA CUDA Toolkit Documentation: If you are working with NVIDIA GPUs in HPC, the NVIDIA CUDA Toolkit is essential. You can access the CUDA Toolkit documentation, including programming guides and API references, at the linked website.

documentation c c++ fortran python

0 Likes

Type

documentation

Level

Optimizing Research Workflows - A Documentation of Snakemake

https://snakemake.readthedocs.io/en/stable/

Snakemake is a powerful and versatile workflow management system that simplifies the creation, execution, and management of data analysis pipelines. It uses a user-friendly, Python-based language to define workflows, making it particularly valuable for automating and reproducibly managing complex computational tasks in research and data analysis.

documentation data-analysis data-reproducibility workflow bioinformatics data-science python

0 Likes

Type

documentation

Level

DeepChem

DeepChem Tutorial

DeepChem is an open-source library built on TensorFlow and PyTorch. It is helpful in applying machine learning algorithms to molecular data.

pytorch tensorflow computational-chemistry

0 Likes

Type

tool

Level

ACCESS KB Guide - DELTA

ACCESS KB Guide - DELTA

NCSA is the home of Delta, a computing and data resource that balances cutting-edge GPU and CPU architectures and pairs them with a non-POSIX file system that exposes a POSIX-like interface. This lets applications reap the benefits of modern file systems without rewriting code.

delta

0 Likes

Type

documentation

Level

Globus Documentation

Globus Documentation

Globus is a data transfer, sharing, automation, and discovery service used by hundreds of thousands of researchers to manage "big data" at universities, research labs, and national systems such as ACCESS. The Globus documentation website provides how-to guides, reference documentation, and examples for Globus's web application, command-line interface, Python software development kit (SDK), and APIs.

cloud-storage data-sharing data-management data-management-software data-transfer data-wrangling file-transfer globus dtn python data-security data-compliance federated-authentication secure-data-architecture

0 Likes

Type

documentation

Level

Official Documentation of VisIt

VisIt is a prominent open-source, interactive parallel visualization and graphical analysis tool predominantly used for viewing scientific data. Its GitHub repository offers a detailed insight into the software's source code, documentation, and contribution guidelines. In particular, it offers useful examples on how it

visIt novel-accelerators particle-physics

0 Likes

Type

documentation

Level

ACCESS Guide (originally given at Duke OIT)

Using Jetstream 2 for Duke members (written for Duke OIT)

A guide for Duke OIT on how to advise Duke University members on using ACCESS and applying allocation credits to Jetstream2. It can also be used by non-Duke members. Assumes the reader has basic knowledge of ACCESS.

ACCESS-credits adding-users allocation-management jetstream cloud-computing login ACCESS-website project-management cilogon

0 Likes

Type

documentation

Level

Paraview UArizona HPC links (advanced)

These links take you to visualization resources supported by the University of Arizona's HPC visualization consultant (http://rtdatavis.github.io/). The following links are specific to the ParaView program and the workflows that have been used by researchers at the University of Arizona. They are distinct from the beginner ParaView links from the University of Arizona posted elsewhere in this knowledge base in that they cover more complex workflows. The links explain how to use ParaView from the terminal (pvpython) and the steps to leverage HPC resources for headless batch rendering. The batch rendering tutorial is significantly more complex than the others, so if you find yourself stuck, please post on https://ask.cyberinfrastructure.org/ and I will try to troubleshoot with you.

visualization

0 Likes

Type

documentation

Level

RaftLib: Open Source library for concurrent data processing pipelines

RaftLib

RaftLib is an open-source C++ library that provides a framework for implementing parallel and concurrent data processing pipelines. It is designed to simplify the development of high-performance data processing applications by abstracting away the complexities of parallelism, concurrency, and data flow management. It enables stream/data-flow parallel computation by linking parallel compute kernels together using simple right-shift operators, similar to C++ streams for string manipulation. RaftLib eliminates the need for explicit use of traditional threading libraries such as pthreads, std::thread, or OpenMP, which can lead to non-deterministic behavior when misused.

parallelization pthreads openmp

0 Likes

Type

tool

Level
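
The operator-based chaining described in the RaftLib entry above can be illustrated with a small Python sketch. This is not RaftLib's C++ API, and it runs stages sequentially rather than in parallel; the `Stage` class and its `run` method are hypothetical names that only show how overloading the right-shift operator lets a pipeline read left to right.

```python
class Stage:
    """One or more per-item functions applied in order (hypothetical helper)."""

    def __init__(self, *funcs):
        self.funcs = funcs

    def __rshift__(self, other):
        # a >> b builds a new Stage that runs a's functions, then b's.
        return Stage(*self.funcs, *other.funcs)

    def run(self, items):
        # Lazily push each item through every function in the chain.
        for item in items:
            for func in self.funcs:
                item = func(item)
            yield item


# Strip whitespace, upper-case, then measure the length of each string.
pipeline = Stage(str.strip) >> Stage(str.upper) >> Stage(len)
lengths = list(pipeline.run(["  ab ", "cde "]))
print(lengths)
```

In a real streaming framework each stage would typically run concurrently and exchange data through queues; the sketch keeps only the composition syntax.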

Weka

Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data preparation, classification, regression, clustering, association rules mining, and visualization.

big-data data-analysis machine-learning weka data-science java

0 Likes

Type

tool

Level

ACCESS KB Guide - Expanse

ACCESS KB Guide

Expanse at SDSC is a cluster designed by Dell and SDSC that delivers 5.16 peak petaflops and offers Composable Systems and Cloud Bursting.

expanse composable-systems gpu

0 Likes

Type

documentation

Level

Jetstream2 Docs Site

Jetstream2 Docs Site

Jetstream2 makes cutting-edge high-performance computing and software easy to use for your research regardless of your project’s scale, even if you have limited experience with supercomputing systems. Cloud-based and on-demand, the 24/7 system includes discipline-specific apps. You can even create virtual machines that look and feel like your lab workstation or home machine, with thousands of times the computing power.

jetstream

0 Likes

Type

documentation

Level

DAGMan for orchestrating complex workflows on HTC resources (High Throughput Computing)

DAGMan (Directed Acyclic Graph Manager) is a meta-scheduler for HTCondor that manages dependencies between jobs at a higher level than the HTCondor scheduler. It is a workflow management system developed by the high-throughput computing (HTC) community for managing large-scale scientific computations and data analysis tasks. Users define complex workflows as directed acyclic graphs (DAGs), in which nodes represent individual computational tasks and directed edges represent dependencies between them. DAGMan then schedules and coordinates the tasks so that each one runs only after the tasks it depends on have completed. This simplifies developing and running large computations made up of many interdependent tasks and makes their progress easier to track.

open-science-grid

0 Likes

Type

tool

Level
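
The dependency ordering described in the DAGMan entry above can be sketched with Python's standard library. This does not use DAGMan or HTCondor: the job names and the `run_order` helper are hypothetical, and the sketch only shows how a DAG of dependencies determines which tasks may run together and in what order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each key is a job, each value is the set of
# jobs that must finish before it can start.
dependencies = {
    "preprocess": set(),
    "simulate_a": {"preprocess"},
    "simulate_b": {"preprocess"},
    "merge": {"simulate_a", "simulate_b"},
}


def run_order(deps):
    """Group jobs into waves; every job in a wave has all dependencies met."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock successors
    return waves


waves = run_order(dependencies)
print(waves)
```

Jobs within one wave are independent of each other, so a scheduler is free to run them at the same time; jobs in later waves must wait.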