_/_/_/_/ _/_/_/_/_/ _/_/_/ _/ _/ _/ _/ _/ _/ _/ _/_/ _/_/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/_/_/_/ _/_/_/_/_/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/ _/_/_/_/ _/ _/ _/ _/ _/ This document currently includes a detailed description of the fields used in the Dfam flatfiles Dfam.hmm, Dfam.embl, and Dfam.hits. Dfam.embl ========= Dfam.embl file is composed of five sections shown in the figure below. __________________________________ | | | Header Section | | | |________________________________| | | | Classification Section | | | |________________________________| | | | Reference Section | | | |________________________________| | | | Comment Section | | | |________________________________| | | | Consensus Section | | | |________________________________| Header Section: --------------- The header section mainly contains compulsory fields. These include Dfam specific information such as accession numbers and identifiers, as well as a short description of the model. The only non-compulsory fields in the header section are the PI and SN field. All the fields in this section are described below. AC Accession number: DFxxxxxxx The Dfam accession numbers DFxxxxxxx are the stable identifier for each Dfam model. ID Identification: 15 characters or less This field is designed to be a meaningful identifier for the model. Capitalisation of the first letter will be preferred. Underscores are used in place of space, and hyphens are only used to mean hyphens. DE Definition: 75 characters or less This must be a one line description of the Dfam entry. AU Author: Author(s) of the entry. The format for this record is shown below, this is a comma separated list on a single line. AU Bloggs JJ, Bloggs JE BM HMM building command lines. See the HMMER3 user's manual for full instructions on building HMMs. Also see URL: http://hmmer.janelia.org An example of the BM lines from a single entry BM hmmbuild --hand --maxinsertlen 10 -o /dev/null HMM SEED SM HMM searching command line Search is performed using a beta version of nhmmer, based on the HMMER3 suite of homology search tools. Documentation for nhmmer is in development; for now the two primary sources of guidance should come from (1) running ‘nhmmer -h’ and (2) reading the HMMER3 user’s manual for instructions on using hmmsearch, which is most similar in functionality to nhmmer. An example of the SM line for a family: SM nhmmer -Z 3102 -E 1 HMM dfamseq.mask SE Source of seed: The source suggesting seed members belong to a family. GA Gathering threshold: Search threshold for generating the matches to the model when masking a genome for repetitive elements, when that genome agrees with the model specificity field. GA lines are the thresholds in bits used when running nhmmer on the command line. An example GA line is shown below: GA 25.00; The corresponding nhmmer command line for the HMM would be: nhmmer -T 25 HMM DB The -T option specifies the whole sequence score in bits. A warning: the GA value is typically the lowest bit score that retains a false discovery rate (FDR) <= 0.1% (using the larger of the values suggested by empirical and theoretical FDR estimates - empirical FDR considers false matches as those found in a search against an artificial sequence devoid of TEs ( generated using GARLIC ). As an example, if there are 1000 true instances of a family in a genome, then the GA will likely be set to the score required for an E-value of 1 - we expect one match with score above GA to be a false positive. Because of this, GA should only be used when masking sequence from an organism within the clade specified by the model specificity (MS) field. If GA is used to annotate an organism outside this clade, the risk of false positives is high. For example, odds are good that at least one above-GA match to Alu, a primate-specific repeat, will be found in any organism with gigabase sized genome. Finding such a match does *not* mean you’ve shown that Alus appear outside of primates. NC Noise cutoff: This field refers to the bit scores at which a match is deemed not significant. It will be less than GA. An example NC line is shown below NC 19.50; TC Trusted cutoff: This field refers to the bit score of the lowest scoring match at which no false positives are detected or corresponding to an E-value of 0.001, whichever is greater. This threshold should be used when trying to conservatively annotate repetitive elements in a genome. An example TC line is shown below TC 33.00; FR False discovery rate: This field refers to the false discovery rate (FR) that is used to set the GA threshold. The FR is the expected portion of significant matches that are expected to be false positives; for example, if 1000 hits were detected and the model was build with FR of 0.001, then 1 of the hits would be expected to be a false positives. Note that for a fixed score threshold, the number of false positives in two like-sized genomes is expected to be about the same, but the number of true matches might be very different, resulting in different false discovery rates. Thus, an FDR of 0.001 on the organism specified in the MS field does not ensure a similar FDR for all other organisms. MS Model specificity field: Indicates the clade for which the GA (gathering threshold) score may be used as a masking threshold, with reasonable expected false discovery rate. Clade Id and Name depend on NCBI taxonomy. An example line is shown below: TaxId:9606; TaxName:Homo sapiens; PI Previous IDs: Semi-colon list The most recent names are stored on the left. This field is non-compulsory. SN Synonym: Other names that have been assigned to this family, perhaps in other databases. This field is non-compulsory. Classification Section: ----------------------- CT Provides rudimentary placement of a family in the repeat hierarchy. Three levels of hierarchy are given: - Type (e.g. DNA Transposon, Retrotransposon) - Class (e.g. LINE, SINE, LTR, Rolling Circle) - Superfamily (e.g. L1, ERV1, Alu) An example collection of CT lines is shown below: CT Type; DNA Transposon; CT Class; Cut and Paste; CT Superfamily; hAT-Tip100; Reference Section: ------------------ The reference section mainly contains cross-links to other databases, and literature references. All the fields in this section are described below. WK Wikipedia Reference: A special database reference for wikipedia. This is the name of the wikipedia article. All of the articles we cite are from the English version of wikipedia. Therefore the name needs to be appended to http://en.wikipedia.org/wiki/. For example: KW Alu_sequence; Thus, the corresponding page in wikipedia would be http://en.wikipedia.org/wiki/Alu_sequence. DC Database Comment: Comment for database reference. DR Database Reference: Reference to external database. All DR lines end in a semicolon. Dfam carries links to a variety of databases, this information is found in DR lines. The format is DR Database; Primary-id; Examples: DR Repbase; CHARLIE1; DR Rfam; RF00017; DR EXPERT; name@domain.edu; DR URL; http://www.domain.edu/; RC Reference Comment: Comment for literature reference. RN Reference Number: Digit in square brackets Reference numbers are used to precede literature references, which have multiple line entries RN [1] RM Reference Medline: Eight digit number An example RM line is shown below RM 91006031 The number can be found as the UI number in pubmed http://www.ncbi.nlm.nih.gov/PubMed/ RT Reference Title: Title of paper. RA Reference Author: All RA lines use the following format RA Jurka J, Milosavljevic A; RL Reference Location: The reference line is in the format below. RL Journal abbreviation year;volume:page-page. RL J Mol Evol 1991;32:105-121. RL Nucleic Acids Res 1992;20:487-493. Journal abbreviations can be checked at http://expasy.hcuge.ch/cgi-bin/jourlist?jourlist.txt. Journal abbreviation have no full stops, and page numbers are not abbreviated. Comment Section: ---------------- The comment section contains information about the Dfam entry. The only field in the comment section is the CC field. CC Comment: Comment lines provide annotation and other information. Annotation in CC lines does not have a strict format. Links to Dfam families can be provided with the following syntax: Dfam:DFxxxxxxx. Consensus Section: ------------------ SQ Sequence: The length and composition of the sequence followed by the sequence formatted into 6, 10bp blocks per line. // End of sequence/EMBL record. Dfam.hmm ======== The concatenated set of Dfam profile HMMs. This should be used in concert with nhmmer to search query sequences. This file has a ‘header’ section, which contains the copyright and version information. Each HMM contains the standard text HMM fields, such NAME, ACC etc. See the hmmer documenation (http://hmmer.janelia.org) for more information. There are some additional field such as the classification tags (prefixed CT and described above), model specificity (MS) and CC lines that contain RepeatMasker specific information. If many searches are to be performed using the Dfam.hmm file, we suggest that you use hmmpress to convert it to a binary format. Dfam.hits ========= Dfam.hits file contains a tab separated list of the matches in dfamseq to each of the models contained in Dfam.hmm that score above the GA threshold for the model. This files contains all hits found by the Dfam HMMs and does not try and resolve overlaps between related models. The fields in that file are as follows: Dfamseq identifier - Constructed from organism name, chromosome (or NC for unplaced contigs), sequence number. Dfam accession - i.e. DFXXXXXXX, use this to link on. Dfam identifier - 16 character name for the entry. bit score - The score of the match E-value - Estimate of the statistical significance of this match in context of the the size of dfamseq. bias - Bias composition of the hit model start - The first position matched in the model by the hit. model end - The last position matched in the model by the hit. strand - Either '+' or '-'. alignment start - The position of the first base in the alignment between the hit and the model. alignment end - The start position of the first base in the alignment between the hit and the model. envelope start - The start position of the sequence that defines a subsequence for which there is substantial probability mass supporting a homologous hit. See the HMMER documentation for further information on HMM envelopes. envelope end - The end position of the sequence that defines a subsequence for which there is substantial probability mass supporting a homologous hit. sequence length - The length of the sequence containing the match. Kimura divergence - The Kimura divergence of the aligned bases