Dfam and RepeatMasker

Dfam (together with nhmmer) has been incorporated into RepeatMasker, a widely used tool for annotating interspersed repeats and low-complexity DNA sequences. RepeatMasker offers a significantly more sophisticated method for resolving and post-processing repeat annotations than our own tool, dfamscan.pl. For details on installing and running it, please visit the RepeatMasker site.

Running Dfam Searches Locally

Dfam can also be downloaded and searched locally. To do this, you will need to download three items from our web site:

  1. The Dfam HMM library, Dfam.hmm
  2. The HMMER source code. Please download the snapshot from Dfam, as this is a pre-release version, and follow the installation directions in the INSTALL file contained in the tarball.
  3. The Perl code for dfamscan.pl

If you want to include Tandem Repeats Finder (TRF) annotation, please also download the TRF binary (trf). dfamscan.pl assumes that nhmmer and, if required, trf are on the executable PATH.
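Before starting a long search, it can help to confirm that the required programs are reachable. The following is a minimal sketch for a POSIX shell; the helper name missing_tools is hypothetical and is not part of the Dfam distribution.

```shell
# missing_tools: print the name of each given program that cannot be
# resolved by the shell. (Hypothetical helper, for illustration only.)
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# nhmmer is always needed; trf only when TRF annotation is requested.
missing=$(missing_tools nhmmer trf)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing >&2
fi
```

If anything is reported, add the directory containing the binary to PATH before running dfamscan.pl.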

The script dfamscan.pl manages running nhmmer with the Dfam database of HMMs against a query sequence, including the task of resolving redundant profile hits (RPHs: cases in which multiple profile HMMs match the same region of the input sequence). The script is intended as a first pass at annotation, not as a replacement for RepeatMasker, which is a more thorough expert system that incorporates Dfam and nhmmer.

When searching with dfamscan.pl (or nhmmer), it is important to understand the two score thresholds stored in each model. The gathering threshold (GA), selected with the '--cut_ga' flag, is appropriate for masking the human genome, where a moderate false discovery rate (FDR) is acceptable. When annotating another organism, however, the empirical FDR may not hold, so the more stringent trusted cutoff (TC), selected with the '--cut_tc' flag, should be used.
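The choice between the two thresholds can be written down as a small rule. This is only a sketch of the guidance above: the helper name threshold_flag and the labels "human" and "mask" are assumptions made for illustration, while --cut_ga and --cut_tc are the flags already described.

```shell
# threshold_flag ORGANISM PURPOSE
# Print the threshold flag suggested by the guidance above.
# (Hypothetical helper; the argument labels are illustrative.)
threshold_flag() {
    organism=$1
    purpose=$2
    if [ "$organism" = "human" ] && [ "$purpose" = "mask" ]; then
        echo "--cut_ga"    # GA: moderate FDR is acceptable when masking human
    else
        echo "--cut_tc"    # TC: more stringent, used for everything else
    fi
}

threshold_flag human mask        # prints --cut_ga
threshold_flag chicken annotate  # prints --cut_tc
```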

To see the full set of command line options, execute the following command:

 dfamscan.pl --help 

This should print a help page like this:

Command line options for controlling dfamscan.pl
-------------------------------------------------------------------------------

   --help       : prints this help message
    
   Requires either
    --dfam_infile <s>
   or both of these
    -fastafile <s>
    -hmmfile <s>

   Requires
    -dfam_outfile <s>

   Optionally, one of these  (only -E and -T allowed with --dfam_infile)
    -E              (<0, <=10000)
    -T 
    --masking_thresh/--cut_ga
    --annotation_thresh/--cut_tc  (default)
    --species <i>  (not yet implemented)

   Optionally one of these
    --sortby_eval
    --sortby_model
    --sortby_seq     (default)

   All optional
    --trf_outfile <s>  (runs trf, put results in <s>; only with --fastafile)
    --cpu <i>    (default 4)
    --no_overlap
    --overlap_trim <i> (default 3, only if not --no_overlap)
    --log_file <s>

The recommended usage for masking is:

dfamscan.pl -fastafile myHumanSeq.fa -hmmfile Dfam.hmm -dfam_outfile myHumanSeqRegionsToMask.out --masking_thresh

And for accurate annotation:

dfamscan.pl -fastafile myChickenSeq.fa -hmmfile Dfam.hmm -dfam_outfile myChickenSeq.DfamHits.out 

If you have already performed a search using nhmmscan and want to adjust the post-processing (the 'sortby' flags or the 'overlap' flags, or a more stringent threshold set with -E or -T), the search need not be run again. You can use the previous output file as the new input. For example:

dfamscan.pl --dfam_infile myChickenSeq.DfamHits.out --overlap_trim 5  -T 60.0 -dfam_outfile myChickenSeq.DfamHits.out2

Finally, two other parameters are worth noting. The --trf_outfile flag turns on the Tandem Repeats Finder search, so that its results are reported alongside the Dfam hits. The --cpu flag controls how many CPUs nhmmer will try to use.
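As an illustration of combining these two options, the sketch below assembles a command line that uses --trf_outfile and --cpu together with the required options from the help page. The wrapper run_dfamscan, the DRY_RUN variable, and the output file naming are assumptions made for this example; with DRY_RUN=1 the command is only printed, not executed.

```shell
# run_dfamscan FASTA CPUS
# Build a dfamscan.pl command that also runs trf and sets the CPU count.
# (Hypothetical wrapper; output names are derived from the FASTA name.)
run_dfamscan() {
    fasta=$1
    cpus=$2
    base=${fasta%.*}    # file name without its last extension
    set -- dfamscan.pl -fastafile "$fasta" -hmmfile Dfam.hmm \
        -dfam_outfile "$base.DfamHits.out" \
        --trf_outfile "$base.trf.out" --cpu "$cpus"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"       # show the command without running it
    else
        "$@"            # run dfamscan.pl with the assembled options
    fi
}

DRY_RUN=1
run_dfamscan mySeq.fa 8
```

Setting DRY_RUN=0 (or leaving it unset) runs the command, which requires dfamscan.pl, nhmmer and trf to be on the PATH.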


A note about RPH resolution

Redundant Profile Hit (RPH) resolution is tricky business. Clearly redundant hits should be removed, and slightly overlapping hits should be retained, but some such decisions are not obvious. The dfamscan script takes a conservative approach, removing obviously redundant hits (e.g. in cases where one hit clearly outscores another, even if the first is slightly shorter than the second), but retaining possibly-redundant matches in which two model matches overlap substantially but without a clear "winner". This happens, for example, when one hit is slightly outscored by another much shorter hit, or when two hits are offset such that the 5' half of one hit overlaps the 3' half of another hit. It is left to downstream tools to make final decisions in these (relatively rare) cases.
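To make the conservative idea concrete, here is a simplified sketch of this style of filtering. It is not the algorithm used by dfamscan.pl: the input format (model, start, end, score, all on one sequence and strand), the 10-point score margin and the 75% overlap fraction are all assumptions chosen for illustration. A hit is dropped only when a better-scoring kept hit covers most of it and clearly outscores it; ambiguous pairs are both retained.

```shell
# resolve_rph: read "model start end score" lines on stdin and print the
# hits that survive a simple conservative redundancy filter.
# (Illustrative only; margin and frac are made-up thresholds.)
resolve_rph() {
    sort -k4,4nr | awk -v margin=10 -v frac=0.75 '
    {
        st = $2 + 0; en = $3 + 0; score = $4 + 0
        hitlen = en - st + 1
        drop = 0
        for (i = 1; i <= n; i++) {          # compare with kept hits so far
            lo = (st > s[i]) ? st : s[i]
            hi = (en < e[i]) ? en : e[i]
            ov = hi - lo + 1                # overlap length (<= 0 if none)
            if (ov >= frac * hitlen && sc[i] - score >= margin) {
                drop = 1                    # clearly redundant: remove
                break
            }
        }
        if (!drop) { n++; s[n] = st; e[n] = en; sc[n] = score; print }
    }'
}

# Invented example hits: modelB is clearly redundant with modelA, while
# modelC and modelD overlap by about half with similar scores.
printf '%s\n' 'modelA 100 400 250' 'modelB 105 410 120' \
              'modelC 1000 1300 200' 'modelD 1150 1450 195' | resolve_rph
```

With these inputs, modelB is removed because modelA covers almost all of it with a much higher score, whereas modelC and modelD are both kept because neither is a clear winner.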