About sorting big BED files

Sorting a big BED file is a frequent real-world problem.

There are many ways to do it. The solutions I prefer are bedtools and GNU sort.

bedtools provides the sortBed command for sorting.

For example:

sortBed -i A.bed > A_sorted.bed

But sortBed needs a lot of memory on the server, so I also sometimes use GNU sort. The sort utility in GNU coreutils now supports parallel sorting and a large memory buffer.

Our Linux servers generally run old RHEL or something similarly old, so I usually use pkgsrc to install a relatively new version, from pkgsrc_source/sysutils/coreutils. The new command is then named gsort to distinguish it from the sort already in the system $PATH.

So a typical command is:

gsort --parallel=16 -S 20G -k1,1 -k2,2n -k6,6 A.bed > A_sorted.bed

Here 16 is the number of sorts run in parallel (one per core), and 20G is the size of the in-memory buffer. gsort sorts by chromosome name first, then numerically by start position, then by strand.
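
To see the key order in action, here is a minimal sketch on a tiny made-up file. The file names demo.bed and demo_sorted.bed and the four records are hypothetical, and it assumes the GNU sort on your $PATH is called sort (replace it with gsort if you installed it from pkgsrc). LC_ALL=C is my own addition, not part of the command above: it forces byte-wise comparison so the result does not depend on the locale.

```shell
# Build a tiny, hypothetical 6-column BED file (tab-separated).
printf 'chr2\t500\t600\tb\t0\t+\n'  > demo.bed
printf 'chr1\t300\t400\tc\t0\t-\n' >> demo.bed
printf 'chr1\t100\t200\ta\t0\t+\n' >> demo.bed
printf 'chr1\t100\t250\td\t0\t-\n' >> demo.bed

# Same keys as above, with small values for --parallel and -S.
# LC_ALL=C (an assumption of this sketch) makes the comparison byte-wise.
LC_ALL=C sort --parallel=2 -S 100M -k1,1 -k2,2n -k6,6 demo.bed > demo_sorted.bed

# sort -c only checks the order: it exits non-zero if the file is not sorted.
LC_ALL=C sort -c -k1,1 -k2,2n -k6,6 demo_sorted.bed && echo "sorted"

# Names (column 4) in sorted order: a, d, c, b
cut -f4 demo_sorted.bed
```

The two chr1 records starting at 100 are ordered by strand, and in byte order "+" comes before "-". Running sort -c with the same keys is a cheap way to verify a big sorted file without sorting it again.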