Sequence Search

Submission

The site allows DNA sequences of up to 50 kb to be searched against Dfam. The source organism should be specified so that the appropriate cut-offs are used. At the moment, the search form is implemented such that sequences from Human use the less stringent gathering cut-offs, which are better suited to masking and allow some false positive matches (thresholds are set to allow a false discovery rate of 0.1%). If the source organism is set to 'other', then the more stringent trusted cut-offs are applied. These are set such that the empirical false discovery rate is 0, and this threshold provides the most accurate annotation. Alternatively, a user-defined E-value threshold can be supplied, where 0 < E-value ≤ 0.1.
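
The cut-off selection described above can be sketched as follows. This is only an illustration of the stated rules, not the server's code: the function name, the returned labels, and the assumption that a user-defined E-value takes precedence over the organism setting are all hypothetical.

```python
from typing import Optional

MAX_EVALUE = 0.1  # upper bound for a user-defined E-value, as stated in the text


def choose_threshold(organism: str, evalue: Optional[float] = None) -> str:
    """Return a label for the cut-off that would apply to a search.

    Assumption: a user-defined E-value, when given, overrides the
    organism-based choice. The labels are illustrative only.
    """
    if evalue is not None:
        # The text requires 0 < E-value <= 0.1.
        if not (0 < evalue <= MAX_EVALUE):
            raise ValueError("E-value must satisfy 0 < E-value <= 0.1")
        return f"evalue:{evalue}"
    if organism.strip().lower() == "human":
        return "gathering"  # less stringent, suited to masking
    return "trusted"        # more stringent, used for 'other'
```

For example, `choose_threshold("Human")` gives the gathering label, while `choose_threshold("other", 0.5)` raises a `ValueError` because 0.5 is outside the permitted range.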

Running

The search consists of two parallel phases: tandem repeat identification with TRF and an nhmmer search with all Dfam models. TRF is run using the following parameters: '2 7 7 80 10 70 5'. By default, all nhmmer hits with scores above the model-specific gathering thresholds are shown, after removing redundant model hits (RPHs; these are cases in which multiple models match the same region of the submitted sequence, and the shorter and lower scoring matches are discarded). Overall, this search can take up to approximately 1 minute, depending on the length and composition of the DNA sequence. For this reason, a progress page is displayed after submission. The results of the search are loaded into this holding page, which is unique to each job and can be bookmarked for future reference. Results will be held for the lifetime of the release, or until disk usage requires that they are deleted. The search input parameters are available at the top of the page, by clicking on the search details link after the job identifier.
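
To make the idea of RPH removal concrete, here is a minimal sketch. It is a simplified greedy procedure, not the Dfam implementation: the `Hit` class, the rule that any overlap counts as "the same region", and the tie-break on length are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    model: str
    start: int    # position on the submitted sequence, start <= end
    end: int
    score: float


def overlaps(a: Hit, b: Hit) -> bool:
    """True when the two hits share at least one sequence position."""
    return a.start <= b.end and b.start <= a.end


def remove_redundant(hits: List[Hit]) -> List[Hit]:
    """Keep the best hit per region, discarding overlapping weaker ones.

    Hits are visited from highest to lowest score (longer first on ties);
    a hit is kept only if it overlaps nothing already kept.
    """
    kept: List[Hit] = []
    for hit in sorted(hits, key=lambda h: (-h.score, -(h.end - h.start))):
        if not any(overlaps(hit, k) for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)
```

With three hypothetical hits, where a short low-scoring match lies inside a longer high-scoring one and a third lies elsewhere, only the nested match is dropped.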

Results

The results are presented in two forms: graphical and tabular.

The hit graphic - The submitted sequence is represented by a grey bar, with dark grey boxes on the sequence bar representing TRF matches. Dfam hits to the plus strand are organized above the sequence bar, and hits to the minus strand are organized below the bar. The color of each Dfam hit bar depends on the entry type (DNA transposon, RNA retrotransposon, ncRNA, etc.). When a Dfam hit bar is clicked, the page scrolls to the row corresponding to that hit and the row is highlighted.

Dfam Results table

Each row of the table represents a hit to a Dfam entry. The table contains the match score and E-value, and the positions of the match in the model and in the sequence. These data can be downloaded as a TSV file using the button in the top right corner of the table header.

The alignment between the model and sequence can be obtained by expanding the row using the > symbol found at the beginning of each row. In the alignment, the model line presents the consensus sequence for aligned states in the model, colored according to the match line. The PP line represents the posterior probability, or degree of confidence in each aligned residue (for example, '*' means highest confidence, and low numbers indicate low confidence), with corresponding grey scale coloration of the Query sequence.
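
As a rough guide to reading the PP line, the sketch below maps a PP symbol to a probability range. It assumes the usual HMMER convention ('0' for 0-5%, '1' for 5-15%, and so on up to '9' for 85-95%, and '*' for 95-100%); the function name is hypothetical, and other symbols, such as those shown at gap positions, are not handled.

```python
from typing import Tuple


def pp_range(symbol: str) -> Tuple[float, float]:
    """Return the (low, high) posterior probability range for a PP symbol.

    Assumes the HMMER convention: digit d covers roughly d/10 +/- 0.05,
    clipped at 0, and '*' covers 0.95 to 1.0.
    """
    if symbol == "*":
        return (0.95, 1.0)
    if len(symbol) == 1 and symbol in "0123456789":
        d = int(symbol)
        low = max(0, 10 * d - 5) / 100
        high = (10 * d + 5) / 100
        return (low, high)
    raise ValueError(f"unsupported PP symbol: {symbol!r}")
```

Under this convention, a residue marked '9' has a posterior probability between 0.85 and 0.95, and one marked '0' is below 0.05.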

TRF Results table

Each row in this table simply indicates the tandem repeat motif and the position of the repeat in the sequence. These results can be downloaded as a TSV file, as described for the Dfam results.

Running searches locally

Due to hardware limitations, it is not possible to submit larger sequences to the web server. However, all of the software for running these searches is freely available for download; see the Tools section for more details.


Retrieve Hits

It is possible to get a list of all Dfam hits within a 50 kb range of dfamseq. To do so, enter the chromosome and the desired sequence range in base pairs. It does not matter if the co-ordinates of the range are reversed, as both + and - strand matches will be returned for the range. As with the DNA sequence search, it is possible to remove RPHs. By default, this option is on. Searches can be further restricted by entering a Dfam accession or identifier. This field is optional. If RPH removal is selected and a Dfam accession is entered, only the hits belonging to that entry after all redundancy is removed will be displayed.

The results are presented in an identical manner to those in the DNA search results. Again, these results (Dfam hits and TRF) can be downloaded as a TSV file via the button in the top right of the table header.
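
The handling of reversed co-ordinates and the range limit can be illustrated with a short sketch. The function name is hypothetical, and it assumes 1-based inclusive positions and a limit of 50,000 bp counted inclusively; the server may count differently.

```python
from typing import Tuple

MAX_RANGE_BP = 50_000  # the 50 kb limit stated in the text


def normalize_range(start: int, end: int) -> Tuple[int, int]:
    """Return (low, high) for a requested range, accepting either order.

    Assumes 1-based inclusive co-ordinates. Raises ValueError when a
    position is below 1 or the span exceeds the 50 kb limit.
    """
    low, high = sorted((start, end))
    if low < 1:
        raise ValueError("co-ordinates must be positive")
    if high - low + 1 > MAX_RANGE_BP:
        raise ValueError("range must not exceed 50 kb")
    return low, high
```

Because matches on both strands are returned, `normalize_range(5000, 1000)` and `normalize_range(1000, 5000)` describe the same query.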


Keyword Search

The keyword search box is found at the top of every page in Dfam, and performs the same search as the keyword tab within the search section. The keyword search allows the user to quickly find entries by Dfam identifier or accession. It also searches the textual data in the Dfam database.

The search currently covers the following sections of the database:

  • Dfam entry accessions, identifiers, descriptions and comments
  • Synonyms and previous identifiers
  • Classification terms
  • Reference titles

The search is performed using the Apache Lucy search engine. Once a search has been performed, the results are displayed in a table below the keyword search interface, regardless of the entry point. The results are ordered according to the search engine scoring metric.

The total number of matches is reported in the top left corner of the results table. If the search term matches many entries, only the top 100 matches will be displayed.

Tips on keyword searches
  • Characters such as '*' are treated literally, not as wild cards.
  • Searches are case insensitive.
  • When multiple words are entered, the union of hits is presented. Boolean operators are not accepted.
  • Quoting phrases does not work.
  • Partial words or fragments will work, for example 'rotranspo' will match the word retrotransposon.
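
The tips above can be summarized with a small sketch that mimics the described matching behavior. It is not how Apache Lucy works internally and does not model the search engine's scoring or result ordering; the function name, the entry accessions and the entry texts are all hypothetical.

```python
from typing import Dict, List

MAX_DISPLAYED = 100  # only the top 100 matches are displayed


def keyword_match(query: str, entries: Dict[str, str]) -> List[str]:
    """Return accessions whose text contains any of the query words.

    Mimics the tips: case insensitive, fragments match inside words,
    multiple words give the union of hits, and '*' is an ordinary
    character rather than a wild card.
    """
    terms = [term.lower() for term in query.split()]
    matches = [
        accession
        for accession, text in entries.items()
        if any(term in text.lower() for term in terms)
    ]
    return matches[:MAX_DISPLAYED]


# Hypothetical entries, for illustration only.
EXAMPLE_ENTRIES = {
    "X1": "LTR retrotransposon",
    "X2": "DNA transposon hAT",
    "X3": "satellite repeat",
}
```

Here `keyword_match("rotranspo", EXAMPLE_ENTRIES)` finds the first entry through a fragment, while `keyword_match("trans*", EXAMPLE_ENTRIES)` finds nothing because no entry text contains a literal '*'.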