, , ,

As we learned before the analysis of count data from RNA-seq technology give us the number of sequence fragments that have been assigned to each feature. The feature could be a gene, isoform or exon.

From our sorghum data we quantified the gene expression in two different environmental conditions. Independent of the treatment, we know that there are genes that have different lengths (gene A has 600 bases and gene B has 1500 bases) with different abundance of reads. In order to compare those different genes A and B we should normalize the counts by the length of the gene.

There are two ways to normalize our count data using two different common units of expression: the Reads per kilobase of exon per million reads mapped (RPKM) or the more generic FPKM (Fragments  per kilobase of exon per million reads mapped). The RPKM has the following equation,

RPKM =  10^6 * C * 10^3 / L*M    where,

C = number of  mappable reads per each feature (in our case per gene)

L = number of the length of the feature (lenght of the gene) in kb (10^3)

M = Total number of mappable reads per sample in Millions Screenshot from 2015-03-08 22:45:51

Looking at the figure with normalized data we can conclude that from the sample 1 the gene A with a smaller length (600 B, RPKM = 3.33) has a higher expression compared to the gene C ( RPKM = 1.31). It is intuitive to think that the longer gene would probably have the more abundant expression space, but that is not a rule for expression data.  If we compare gene A and gene B from the first sample, even thought gene B is almost twice the size of gene A, both have a quite similar RPKM score(3.33, 3.64).

There are many biological process that were not taken into account when it is used the RPKM units of expression. This quantification was proposed for single-end sequencing method. However, some events of post-transcriptional modification (alternative splicing) could provide changes in relative abundance of transcripts that could not be detect when it’s used RPKM. It happens because in our count data we could have multi-reads, which are reads that may map to multiple transcripts (originated by multi-isoform genes or duplicated segments resulted by gene families). Therefore, in order to improve those weakness, the paired end sequencing technology was introduced, and the FPKM was created with a similar concept of RPKM.

Before introduce the other method FPKM, I think that we should review the difference between paired-end and single-end sequencing. The paired-end sequencing enables both ends of the DNA fragment to be sequenced (because the distance between each paired read is known). However, Single-read sequencing involves sequencing DNA from only one end. Let`s keep it in mind.

So the FPKM is calculated in the same way as RPKM, the unique difference is that instead of a read it will be counted a FRAGMENT. Normally the value of expression FPKM is used when the data was submitted to a paired-end sequencing method. But why? Because longer reads and pairs of reads from both ends of each RNA fragment can reduce uncertainty in assigning reads to alternative splice variants.

From the article that originated the method Cufflinks `Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms` we have the statement:

FPKM is conceptually analogous to the reads per kilobase per million reads sequenced (RPKM) measure, but it explicitly accommodates sequencing data with one, two, or – if needed for future sequencing platforms – higher numbers of reads from single source molecules. A fragment corresponds to a single cDNA molecule, which can be represented by a pair of reads from each end.

Therefore, FPKM is NOT 2 * RPKM if you have paired-end reads. FPKM == RPKM if you have single-end reads.

The most famous methods that were created to calculate RPKM and FPKM units of expression are rescue and Cufflinks method respectively.

It was defended in a few years by Marioni(2008) and Bullard (2010) the idea that the Poisson distribution could provide a good fit to the distribution of gene-level counts across replicate lanes, after normalization by total lane counts. DEgseq is using this concept. Maybe the explanation about why (or why not) and how could be (or could be not) a great topic for the next post!