ncibtep@nih.gov

Bioinformatics Training and Education Program

Getting started with RNA-Seq analysis (bulk and single cell)

RNA-Seq technology provides scientists with a window into how cells and tissues function by measuring levels of gene expression. Since all normal cells within an organism possess the same genome, differences in cell identity and function are determined by gene expression. Bulk RNA-Seq experiments provide a view of gene expression across an entire sample. However, they do not differentiate among cell types within the sample; instead, they give a combined view of gene expression within a whole organ or tissue type.

Single cell RNA-Seq technology allows for the identification of new cell types based on gene expression profiles, and the quantification of transcripts for each cell type. This is done by dissociating the sample into single cells, identifying the cell types, and measuring the expression products of each cell. The resources listed below will guide you through the skills needed to do RNA-Seq analysis “at the Unix command line” within Biowulf, the NIH intramural supercomputer cluster. RNA-Seq is computationally intensive, and the Unix environment provides the space and compute resources necessary to do the analysis.

In addition, there are several “point-and-click” options for working with RNA-Seq data, but many scientists find they need more flexibility in setting the parameters of their analysis, or would like to make changes to visualizations.

If you get stuck or have questions, please email ncibtep@nih.gov

Working at the unix command line

Unix is an operating system (OS), just like Windows or Mac. It is well-suited to working with large data files, like those generated by Next Generation Sequencing (NGS) technologies such as RNA-Seq. Many programs are written for the Unix operating system and are freely available within the community. By linking various programs together, analysis pipelines can be built. Linux is a variety of Unix, and the terms are sometimes used interchangeably. To get started, I’ve listed several resources for you to learn about working with Unix/Linux systems. Start by reading through either the Unix Tutorial for Beginners (1) or the Software Carpentry resource (2). Don’t worry if everything doesn’t make sense quite yet; this first part is just an introduction. After you’ve read through one or both of these resources, you’re ready to do some typing!

  1. Unix Tutorial for Beginners (mirror site of University of Surrey). This is a static site that you can read; no special access is required. https://www.cs.sfu.ca/~ggbaker/reference/unix/index.html
  2. The Unix Shell. Find the “Episodes” tab, then go through each lesson. This is also a static site; no special access is required. http://swcarpentry.github.io/shell-novice/
    • Introducing the Shell
    • Navigating Files and Directories
    • Working with Files and Directories
    • Pipes and Filters
    • Loops
    • Shell Scripts
    • Finding Things
  3. Work through “Introduction to Shell for Data Science” on DataCamp. This interactive site allows you to type at a command line and see how the commands are executed by the Unix system. https://www.datacamp.com/courses/introduction-to-shell-for-data-science
  4. Print this out and hang it next to your computer. “Unix Cheat Sheet” Unix/Linux Command Reference Fosswire.com https://files.fosswire.com/2007/08/fwunixref.pdf
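
As a small illustration of what "linking programs together" into a pipeline looks like, here is a minimal sketch you can type once you have worked through the tutorials above. It is not part of any of the listed courses. The file name example.fastq and the three reads in it are made up for practice; the commands themselves (printf, wc, awk, sort, uniq, head) are standard Unix tools.

```shell
# Create a tiny, made-up FASTQ file to practice on.
# Each read takes 4 lines: name, sequence, a "+" separator, quality scores.
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nIIII\n@read3\nACGT\n+\nIIII\n' > example.fastq

# Pipeline 1: count the lines in the file, then divide by 4 to get the read count.
# The "|" (pipe) sends the output of wc into awk.
read_count=$(wc -l < example.fastq | awk '{print $1 / 4}')
echo "reads: $read_count"

# Pipeline 2: keep only the sequence lines (2nd line of every 4),
# sort them so duplicates sit together, count each distinct sequence,
# sort by count (largest first), and keep the top one.
top_sequence=$(awk 'NR % 4 == 2' example.fastq | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')
echo "most common sequence: $top_sequence"
```

Each program does one small job, and the pipe passes its output to the next program. Real RNA-Seq pipelines follow the same idea, only with larger files and specialized tools.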

The NIH Biowulf supercomputer is a Linux cluster, where multiple, computationally intensive jobs can be run simultaneously. Over 600 tools for scientific analyses and databases are installed on Biowulf. In order to do work on the Biowulf cluster, you will need to be able to interact with it “at the command line”. Instead of a point-and-click interface like Windows or Mac, when working at the command line you will be typing in the commands you wish to execute. Your spelling and typing skills matter, as everything must be exactly as the unix system expects it.

  1. https://hpc.nih.gov/systems/
  2. How to get a Biowulf account https://hpc.nih.gov/docs/accounts.html
  3. See the “Training” section for an online Intro to Biowulf course https://hpc.nih.gov/training/intro_biowulf/
  4. See “Biowulf/Applications/Sequence Analysis”, “Biowulf/Applications/Scientific Databases” for lists of available tools to use on Biowulf.
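
To see why exact spelling matters at the command line, the following minimal sketch creates a file and then asks for it twice, once with the correct name and once with a single letter missing. The file name counts.csv is a made-up example, and the commands are run in a temporary directory so that nothing in your own folders is touched. This works on any Unix/Linux system and does not require a Biowulf account.

```shell
# Work in a throwaway directory (hypothetical practice area).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create a small file with a made-up name.
echo "gene_id,count" > counts.csv

# Ask for the file using its exact name.
if ls counts.csv > /dev/null 2>&1; then exact="found"; else exact="not found"; fi

# Ask again with one letter missing; the system does not guess what you meant.
if ls count.csv > /dev/null 2>&1; then typo="found"; else typo="not found"; fi

echo "counts.csv: $exact"
echo "count.csv:  $typo"
```

Linux file systems are also typically case sensitive, so a name typed with different capitalization is usually treated as a different file.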

Getting started with R

Once you’ve gotten comfortable at the command line, you’re ready to dive into learning about the programming language “R”. There are lots of great scientific programs written in “R”, including RNA-seq analysis pipelines. The more you learn about “R”, the more competent you will be at using these tools for data analysis. Head back to datacamp.com and go through several of the “R” modules as listed below.

Get “R” and “R Studio” working on your computer

  1. Install “R” on your computer. Go to the Comprehensive R Archive Network (CRAN) at cran.r-project.org and download the most current version of “R” (3.5.3 at the time of writing) for your operating system.
  2. While you’re at it, install “R Studio” (rstudio.com) on your computer as well; it will come in handy later.

Learn “R” at Datacamp 

  1. Go to datacamp.com, and work through the following courses:
    • Introduction to R
    • Intermediate R and Intermediate R - Practice
    • Data visualization with ggplot2
  2. The NIH Library also offers R courses; see their schedule at https://www.nihlibrary.nih.gov/training/calendar

Bulk RNA-Seq

  1. Galaxy Project, RNAseq – an introduction, https://galaxyproject.org/tutorials/rb_rnaseq/

Single Cell RNA-Seq

  1. satijalab.org, Seurat Guided Tutorials, https://satijalab.org/seurat/get_started.html
  2. Hemberg lab, https://hemberg-lab.github.io/scRNA.seq.course/introduction-to-single-cell-rna-seq.html

Other resources:

Commercially available point-and-click workflows

  • Partek Flow (bulk and single-cell)