Rna Seq Analysis
Creat necessary folders/directories
mkdir RNA_Seq_Analysis
cd RNA_Seq_Analysis
mkdir raw_data
mkdir fastqc_out
mkdir scripts
mkdir bam_files
mkdir count_file
mkdir meta_data
Download raw data from SRA
Go to following link and download accession list and metadata file. Accession list download link
Go to the link above and click on ‘Send to’(top right corner). Select “Run Selector” then click on ‘go’ Select above mentioned run then by clicking Metadata and Accession List we will get metadata and accession list file which we will use for our analysis. Accession list file name is SRR_Acc_List.txt and metadata file name is SraRunTable.txt. Move the acession list file to the raw data directory and SraRunTable.txt file to the metadata directory
cd raw_data
cat SRR_Acc_List.txt | xargs fastq-dump
Check quality of the raw data
We will use fastqc to check quality of raw data that we downloaded from sra database. The fastqc will generate html report of every file, I am not going to explain everything about those report but you can check this link to read about inpretation of fastqc report.
fastqc -o ~/RNA_Seq_Analysis/fastqc_out `ls *fastq`
Mapping raw data to the genome
Data we downloaded is RNA sequencing data from yeast strain between two different condition. In order to map this data to yeast genome I need to download yeast genome. I can download yeast genome using following command.
In order to consider sliced site I need to generate a file containing spliced site using extract_splice_sites.py script, which part of hisat2 software. Please download the hisat2
Generating index file from genome
cd scripts
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.1.0-Linux_x86_64.zip
unzip hisat2-2.1.0-Linux_x86_64.zip
Download yeast fasta file for yeast genome.
cd annotation
wget ftp://ftp.ensembl.org/pub/release-99/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.fa.gz
gunzip Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.fa.gz
In order to reads to the genome, we need to make index of the genome using following command.
hisat2-build Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.fa Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel
Generate splice sites
Download GTF file for yeast genome. You will need this file for generating known splice site and counting number of read align to the gene.
wget ftp://ftp.ensembl.org/pub/release-99/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.99.gtf.gz
gunizp Saccharomyces_cerevisiae.R64-1-1.99.gtf.gz
If you would like to perform a spliced alignment, you can provide a known spliced sites which hisat2 will use for alignment. You can make file with known splice site from gtf file using hisat2_extract_splice_sites.py scripts provided in hisat2 directory. Here I am using yeast gtf file, if you can use any gtf file you want.
python ./scripts/hisat2-2.1.0/hisat2_extract_splice_sites.py ./annotation/Saccharomyces_cerevisiae.R64-1-1.99.gtf > ./annotation/splicesites.txt
Alignment to the genome
Below is the code for aligning read to the yeast genome. My reads are single end therefore I have one fastq file as input. If you have paired end data you need to modify this code. You can run the code below from rawdata directory. The output for this code will be bam file in bam_file directory.
for fqfile in `ls *.fastq | sed 's/.fastq//g' | sort -u`
do
hisat2 -x ~/Documents/Tutorial/RNA_Seq_Analysis/annotation/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel --known-splicesite-infile ~/Documents/Tutorial/RNA_Seq_Analysis/annotation/splicesites.txt $fqfile.fastq | samtools view -Sb > ~/Documents/Tutorial/RNA_Seq_Analysis/bam_files/$fqfile.bam
done
Sorting BAM files
Before counting the code you need to sort the bam file by using samtools sort. Run code below from the bam_file directory. This will generate sorted.bam files.
for bam_file in `ls *bam | sed 's/.bam//g' | sort -u`;
do
samtools sort $bam_file.bam > $bam_file.sorted.bam
done
Indexing BAM file
If you would like visualize RNA-Seq data in IGV, you need to make an index file, below is the code for generating index file. After generating index file you can load sorted.bam file in IGV for visualization of RNA-Seq track.
for bam_file in `ls *sorted.bam`;
do
samtools index $bam_file
done
Counting number of reads mapped to each feature
for bam_file in `ls *bam.sorted`;
do
featureCounts -s 2 -t exon -g gene_id -a ~/Documents/Tutorial/RNA_Seq_Analysis/annotation/Saccharomyces_cerevisiae.R64-1-1.99.gtf -o ~/Documents/Tutorial/RNA_Seq_Analysis/count_file/$bam_file ~/Documents/Tutorial/RNA_Seq_Analysis/bam_files/$bam_file
done
Differential gene expression analysis by R package DESeq2
Above code will generate file containing number of reads in each gene. Next of the RNA seq analysis is differential gene expression analysis by the R package DESeq2. Since next part we are going to use R package you need run code below in R studio.
Counts folder contain raw count data generated by featureCounts, there is one file for each sample. Sicne we have four sample we have four count files. In the first steps we are going merge all the file in one dataframe.
file_list <- list.files("~/Documents/Tutorial/RNA_Seq_Analysis/counts", pattern = 'bam.sorted$', full.name = TRUE)
counts_data <- read.table(file = file_list[1], header = TRUE)[,c(1,7)]
for (f in 2:length(file_list)){
temp <- read.table(file = file_list[f], header = TRUE)[,c(1,7)]
counts_file <- merge(counts_data, temp, by="Geneid")
}
We are changing column name of each sample, to make it look simpler and match with the metadata.
colnames(counts_file) <- c("Geneid", "SRR1761160", "SRR1761163", "SRR1761164", "SRR1761165")
Now we will create a dataframe to represent metadata, you can also input metadata from a file. Specifically the file we downloaded from runselector.
meta_data <- data.frame(sample=c("SRR1761160", "SRR1761163", "SRR1761164", "SRR1761165"),
condition=c("Nab2_depleted", "Nab2_depleted", "WT","WT"))
Next we will create an instance of deseqdatasetfrommatrix, which take following input: countData, which is the merged output from featureCounts, in our case is count_data object. colData, which is the information about the sample which is our case is meta_data object, design, which is how we would like to medel the sample that condition in our meta_data object.
dds <- DESeqDataSetFromMatrix(countData=counts_file,
colData=meta_data,
design=~condition, tidy = TRUE)
Before running the the deseq function if you would like to filter low counts genes you can run code below. This will filter out genes that have read counts less than 10.
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
In order to determine differntially expressed gene between two condition describe in metadata, we can run deseq2 function.
dds <- DESeq(dds)
res <- results(dds)
Visualization of results
Visualization of log2 fold change data between wild type and Nab2 depleted strains by volcano plot.
with(res, plot(log2FoldChange, -log10(pvalue), pch=20, main="Volcano plot", xlim=c(-3,3)))
with(subset(res, padj<.05 ), points(log2FoldChange, -log10(pvalue), pch=20, col="blue"))
with(subset(res, padj<.05 & abs(log2FoldChange)>2), points(log2FoldChange, -log10(pvalue), pch=20, col="red"))
Here is the volcano plot:
If we have multiple samples we can visualize log2FoldChange by using clustering and heatmap. Letter tutorial I will write about making heatmap in R.