In the big-data era, simple serial bioinformatics pipelines cannot efficiently handle very large input datasets. High Performance Computing (HPC) is therefore a good solution for researchers who need to analyze their data and address new biological questions.
This course is both theoretical and practical and is addressed to bioinformaticians who want to scale up their analyses on a cluster. It mainly focuses on the development and execution of automated data analysis pipelines.
On the first day, students will become familiar with a cluster (hardware, software, module environment, data storage) and will learn how to submit a single batch script via the SLURM scheduler.
On the second day, participants will be introduced to the world of Next Generation Sequencing (NGS) and will learn how to build a fully automated RNA-seq pipeline able to handle large input datasets, focusing on job chaining and on optimizing HPC resource requests.
On the last day, students will be introduced to cloud computing and to the world of Snakemake, a Python-based workflow management system for creating reproducible and scalable data analyses, with particular attention to scaling workflows on cluster and cloud without modifying the workflow definition.
Ad-hoc hands-on sessions, aimed at applying the concepts explained during the course, will be held every afternoon.
Skills:
By the end of the course each student should be able to:
- Know all the conventions and opportunities offered by CINECA for accessing HPC resources;
- Download datasets from public repositories and/or transfer input files from the user’s local computer to the CINECA clusters;
- Navigate through the software environment set up by CINECA;
- Run single-step jobs on a supercomputer via the SLURM scheduler;
- Combine several bioinformatics applications into a fully automated pipeline able to run on a supercomputer;
- Iterate through samples in order to manage huge input datasets;
- Have an overview of how to take advantage of Snakemake to build a portable, scalable and fully automated pipeline.
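To give a flavour of what iterating through samples and chaining jobs can look like, here is a minimal Python sketch. It is an illustration only, not course material: the step scripts (trim.sh, align.sh, count.sh) and sample names are hypothetical, and the submitter is a dry-run stand-in that returns fake job IDs instead of calling SLURM. The only SLURM features assumed are the sbatch options --parsable (print just the job ID) and --dependency=afterok:<jobid> (start only after the named job has completed successfully).

```python
import itertools
from typing import Callable, List, Optional, Tuple


def sbatch_command(script: str, sample: str, after_job: Optional[str] = None) -> List[str]:
    """Build (but do not run) an sbatch command for one step of one sample."""
    cmd = ["sbatch", "--parsable"]
    if after_job is not None:
        # Start this step only after the previous job finished successfully.
        cmd.append(f"--dependency=afterok:{after_job}")
    # Arguments placed after the script name are passed to the script itself.
    cmd += [script, sample]
    return cmd


def make_dry_run_submitter() -> Tuple[Callable[[List[str]], str], List[List[str]]]:
    """Return a fake submitter that logs commands and hands out made-up job IDs.

    On a real cluster this could be replaced by a function that runs the
    command with subprocess.run and returns the job ID printed by sbatch.
    """
    counter = itertools.count(1000)
    log: List[List[str]] = []

    def submit(cmd: List[str]) -> str:
        log.append(cmd)
        return str(next(counter))

    return submit, log


def submit_chain(sample: str, steps: List[str], submit: Callable[[List[str]], str]) -> Optional[str]:
    """Submit the steps for one sample in order, each depending on the previous one."""
    job_id = None
    for script in steps:
        job_id = submit(sbatch_command(script, sample, job_id))
    return job_id  # ID of the last job in the chain


# Hypothetical pipeline steps and samples, for illustration only.
steps = ["trim.sh", "align.sh", "count.sh"]
samples = ["sampleA", "sampleB"]

submit, log = make_dry_run_submitter()
for sample in samples:
    submit_chain(sample, steps, submit)

for cmd in log:
    print(" ".join(cmd))
```

Each sample gets its own independent chain, so different samples can run in parallel while the steps within a sample stay ordered.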
Target audience:
Biologists, bioinformaticians and computer scientists interested in approaching large-scale NGS-data analysis.
Course prerequisites:
Good knowledge of Python and the shell command line.
A very basic knowledge of R and biology is recommended but not strictly required.