A short script to calculate RPKM and TPM from featureCounts output

Currently I prefer to use HISAT2, featureCounts and DESeq2 for my RNA-seq analyses. But DESeq and DESeq2 just adopted Variance Stabilizating Transformation (VST) in their normalization step, so one wired thing I have to do is to explain why no expressed genes were not zero in the final expression table.

Today one of my friends asked me about the easy way to calculate the RPKM and TPM except Cufflinks. Here is my script for the solution. It’s written in half hour, so please ignore the ugly styple…

Update 2018/06/19

Thanks to JianMing Zeng of Macau University, a bug was found as I forgot to group by the samples when I calculate TPM. It’s a tricky thing to convert the old style code to dplyr style…

#! /usr/bin/env Rscript

## functions for rpkm and tpm
## from https://gist.github.com/slowkow/c6ab0348747f86e2748b#file-counts_to_tpm-r-L44
## from https://www.biostars.org/p/171766/
rpkm <- function(counts, lengths) {
  rate <- counts / lengths
  rate / sum(counts) * 1e9
}

tpm <- function(counts, lengths) {
  rate <- counts / lengths
  rate / sum(rate) * 1e6
}

## read table from featureCounts output
args <- commandArgs(T)
tag <- tools::file_path_sans_ext(args[1])
ftr.cnt <- read.table(args[1], sep="\t", stringsAsFactors=FALSE,
  header=TRUE)
library(dplyr)
library(tidyr)

ftr.rpkm <- ftr.cnt %>%
  gather(sample, cnt, 7:ncol(ftr.cnt)) %>%
  group_by(sample) %>%
  mutate(rpkm=rpkm(cnt, Length)) %>%
  select(-cnt) %>%
  spread(sample, rpkm)
write.table(ftr.rpkm, file=paste0(tag, "_rpkm.txt"), sep="\t", row.names=FALSE, quote=FALSE)

ftr.tpm <- ftr.cnt %>%
  gather(sample, cnt, 7:ncol(ftr.cnt)) %>%
  group_by(sample) %>%
  mutate(tpm=tpm(cnt, Length)) %>%
  select(-cnt) %>%
  spread(sample, tpm)
write.table(ftr.tpm, file=paste0(tag, "_tpm.txt"), sep="\t", row.names=FALSE, quote=FALSE)