[Research] Normalization of RNA-seq: Addressing TPM vs. GeTMM

2022-12-01 2 minute read

While revising a manuscript, I received a comment stating, “Authors used FPKM for RNA-seq analysis, but for inter-sample comparisons, you should use TPM. Please check if using TPM impacts your results.” This prompted me to dive deeper into a topic I hadn’t revisited in a while: RNA-seq normalization.

Considerations for RNA-seq normalization

When normalizing RNA-seq data, three key factors must be considered [7]:

Sequencing depth
Gene length
RNA composition

Current normalziation methods

The appropriate normalization method depends on the nature of your sample data and the assumptions you make. However, most current normalization methods are based on the assumption that most genes are not differentially expressed (DE).

Common methods under this assumption include RPKM/FPKM, TPM, Quantile normalization, RLE (DESeq2), and TMM (edgeR).

FPKM, which considers gene length, is not ideal for inter-sample comparisons because the total expression value of all genes varies significantly, leading to inconsistency. TPM, while ensuring that the sum of the normalized values is 1M across samples, is highly sensitive to a few highly expressed outlier genes, which can disproportionately affect the expression values of other genes. Therefore, neither FPKM nor TPM seems suitable. Quantile normalization, which adjusts expression values to match the distribution across samples, can increase variation between replicates, potentially leading to errors.

Focusing on RNA composition, methods like RLE (DESeq2) and TMM (edgeR) provide better results by excluding highly expressed genes and genes with high variability before calculating correction factors.

However, my research requires not only inter-sample comparison but also gene-level expression comparisons across species, necessitating the correction of biases introduced by gene length. Using TPM introduces bias when comparing species with significantly different numbers of known genes because it normalizes the total expression to 1M across all genes, which complicates the analysis of 1-to-1 orthologous genes.

Newly suggested medthod (GeTMM)

While researching this issue, I discovered efforts to incorporate gene length into RLE and TMM, leading me to GeTMM (Gene length corrected TMM). According to the literature (Ref. 1), GeTMM produces results comparable to TMM and RLE in DE analysis while also allowing intra-sample comparisons, much like TPM. I wondered why gene length correction was only applied to TMM and not RLE, and I found that DESeq2 requires integer inputs, which prevents the use of proportion values (read count/gene length) as inputs.

In conclusion, GeTMM seems to offer a balanced approach for normalization in my study, where both gene-level and species-level comparisons are crucial. It provides the necessary correction for gene length biases while maintaining robust results in DE analysis.

Reference

Share on

Twitter Facebook LinkedIn

[Research] Normalization of RNA-seq: Addressing TPM vs. GeTMM

Considerations for RNA-seq normalization

Current normalziation methods

Newly suggested medthod (GeTMM)

Reference

Share on

Leave a comment

You may also enjoy

[BI] Key Terms in T2T Genome Assembly and Pangenomics

[BI] Tools for chromatin-state discovery and annotation

[Lablife] My Lab Built from Scratch: The CDRC Story

[CS] Step-by-Step guide to registering a package on Bioconda