About sorting big BED files
Sorting a big BED file is a frequent real-world problem.
There are many ways to do it. The solutions I prefer are bedtools and GNU sort.
bedtools provides the sortBed command for sorting. It is used like this:
sortBed -i A.bed > A_sorted.bed
But the sortBed command of bedtools needs a lot of memory on the server, so I also sometimes use GNU sort. The sort utility in GNU coreutils now supports parallel sorting and a large in-memory buffer.
Because our Linux servers generally run an old RHEL release or something similarly old, I generally use pkgsrc to install a relatively new version from pkgsrc_source/sysutils/coreutils. The newly installed command is named gsort to distinguish it from the sort in the system $PATH.
So a typical command is:
gsort --parallel=16 -S 20G -k1,1 -k2,2n -k6,6 A.bed > A_sorted.bed
Here 16 is the number of cores used for sorting, and 20G is the size of the memory buffer. gsort will sort by chromosome name first, then by start position, then by strand.
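To see what these keys do, here is a small sketch on a made-up four-line BED file. A few things in it are my assumptions and not part of the command above: the file contents are invented, the script falls back to the system sort when gsort is not installed, LC_ALL=C is set so the comparison is byte-wise and does not depend on the locale, and -t makes the tab separator explicit.

```shell
# Use gsort if it is installed (as in the pkgsrc setup), otherwise the system sort.
SORT_BIN=$(command -v gsort || command -v sort)

# A tiny, made-up 6-column BED file (tab-separated), written to a temp file.
tiny_bed=$(mktemp)
printf 'chr2\t500\t600\tb\t0\t+\n'   >  "$tiny_bed"
printf 'chr1\t1000\t1100\ta\t0\t-\n' >> "$tiny_bed"
printf 'chr1\t200\t300\td\t0\t-\n'   >> "$tiny_bed"
printf 'chr1\t200\t300\tc\t0\t+\n'   >> "$tiny_bed"

# -k1,1  : chromosome name, compared as text
# -k2,2n : start position, compared as a number (so 200 comes before 1000)
# -k6,6  : strand, compared as text
# LC_ALL=C and -t are additions here, to keep the result locale-independent
# and to split fields on TAB only.
sorted=$(LC_ALL=C "$SORT_BIN" -t "$(printf '\t')" -k1,1 -k2,2n -k6,6 "$tiny_bed")
printf '%s\n' "$sorted"

rm -f "$tiny_bed"
```

The two chr1 records starting at 200 come first (the + strand before the - strand), then the chr1 record starting at 1000, then chr2. Without the n on the second key, 1000 would sort before 200 because the comparison would be textual.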