Family

A Dfam family represents a distinct transposable element from which many interspersed repeat copies trace their origin. The family is defined by a seed alignment, which contains a representative set of the interspersed repeat copies. The seed alignment is the source of a consensus sequence and a profile hidden Markov model (HMM) for the family.

Each family in Dfam is represented on the website by a single page. The URL for the page (e.g. http://dfam.org/family/DF#########) is stable and may be saved or referenced on external websites or in publications. Each entry page is divided into seven tabs, detailed below.

Families with accession numbers beginning with DF have been curated. Families with accessions beginning with DR are output of de novo discovery methods. De novo repeat libraries are often fragmented, and may under- or overrepresent the diversity of repeat subfamilies within a genome, but they are still useful for the purpose of masking repetitive DNA in genomes.
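As an illustration of the accession convention above, the following minimal Python sketch labels an accession by its prefix and builds the family page URL. The function name is hypothetical, and the pattern (DF or DR followed by digits only) is an assumption based on the examples in this document, not a formal specification.

```python
import re

# Assumption: an accession is "DF" or "DR" followed only by digits.
ACCESSION_RE = re.compile(r"^(DF|DR)(\d+)$")


def describe_accession(accession):
    """Return (kind, url) for a Dfam accession, or raise ValueError."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not a recognized Dfam accession: {accession!r}")
    # DF = curated family; DR = uncurated output of de novo discovery.
    kind = "curated" if match.group(1) == "DF" else "uncurated"
    url = f"http://dfam.org/family/{accession}"
    return kind, url


print(describe_accession("DF000001061"))
```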

Summary

Description

A detailed description of the family.

Classification

The classification of the family using the Dfam classification system. This system combines concepts from several well-known classification systems (Piégu et al., Jurka et al., Wicker et al., Curcio et al., Smit et al.) with phylogenies based on reverse transcriptase and transposases. Classification names have been chosen to be as descriptive as possible while honoring the most widely used acronyms whenever possible. In this new hierarchy, class names do not have to be unique; rather, the path through the hierarchy uniquely identifies a classification. Details and a visualization of the classification hierarchy can be viewed on the Classification page.

Taxa

The clades or species in which this family is known to be present. In most cases this will be the oldest clade in which copies of the family have been found at orthologous positions, meaning that the transposable element was active before the first speciation of extant species in this clade. In the case of horizontal transfer several clades/species may be listed. Clicking on the [..] icon will expand out the full taxonomic lineage for an entry. Clicking on the final clade/species links to the corresponding record in the NCBI taxonomy database.

Curation Details - Status

The family's curation status indicates how thoroughly it has been reviewed and worked on. Curation of a family might include filtering for redundancy with other families, extending a seed alignment for better coverage and representation of a family's true diversity in the genome(s) it appears in, ensuring gene families are not classified as transposable elements, and defining subfamilies.

Curation Details - Method and Source Assembly

For some families, the discovery method and source assembly are listed. The discovery method indicates what tool(s) produced the family and might include parameters or a command line for reproducibility. If known, a link is provided to the genome assembly that was processed to discover the family.

Length

The sequence length of the family. The process by which Dfam creates the consensus and profile HMM ensures that there is a one-to-one correspondence between consensus positions and HMM match states.

Target Site Duplication

The target site preference, encoded as a consensus string. If only a length preference is known, it will be encoded as a run of 'N's of the correct length.
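The encoding described above can be read back with a few lines of Python. This is only a sketch: the function name and the returned dictionary are hypothetical, and the example strings are made up for illustration.

```python
def describe_tsd(tsd):
    """Summarize a target site duplication (TSD) string."""
    tsd = tsd.upper()
    length = len(tsd)
    # A run made up solely of 'N' means only the length preference is known.
    composition_known = length > 0 and set(tsd) != {"N"}
    return {"length": length, "composition_known": composition_known}


print(describe_tsd("NNNNN"))  # length-only preference
print(describe_tsd("TA"))     # composition preference
```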

Citations

A list of publications pertinent to this family with links to PubMed.

Aliases and External Links

Database cross references for this family.

Submitter

The Dfam account holder who transferred the family into the database.

Author

The list of authors who created and/or curated this family in Dfam. For Dfam 2.x this was used to designate the individuals who generated the profile HMM and curated the Dfam entry, and does not include the original creators of the family itself. For those entries, details on the source of the family may be obtained from the citation list and/or through metadata stored at RepBase.

Created On

The date the Dfam family was created.

Modified On

The last date/time the family was modified.

Seed

The Seed Alignment Coverage and Whisker Plot is a combined plot showing both the coverage depth and quality of the aligned sequences. The coverage plot (top of the figure) provides a quick view of how well regions of the family are represented in the seed alignment. The whisker plot adds additional information about the fragmentation of seed alignment instances and alignment quality.

Example Seed Alignment and Whisker Plot.

The whisker plot is generated by coloring each aligned sequence based on its sequence identity to the consensus. The score is calculated in non-overlapping windows of 10 bp, where cooler colors represent regions of higher conservation and warmer colors indicate regions of poor alignment. Drawing each aligned sequence as a row, from top to bottom in the figure, creates a heatmap of the multiple alignment that highlights regions of high variability or poor alignment. The length of the family and the number of sequences in the seed alignment are displayed above the figure, and the Kimura divergence of the seed alignment is displayed below the figure.
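The windowed identity score can be sketched as follows. This is not Dfam's plotting code: the function name is hypothetical, and the treatment of gaps (columns that are insertions relative to the consensus are skipped, and deleted positions are left out of the identity fraction) is an assumption made for the example.

```python
def window_identity(consensus_row, sequence_row, window=10, gap="-"):
    """Identity to the consensus in non-overlapping windows.

    Both rows come from the same alignment and have equal length.
    Returns one fraction per window of consensus positions, or None
    for a window in which the sequence has no aligned bases.
    """
    if len(consensus_row) != len(sequence_row):
        raise ValueError("alignment rows must have equal length")
    per_position = []  # True/False for aligned bases, None for deletions
    for cons, base in zip(consensus_row.upper(), sequence_row.upper()):
        if cons == gap:
            continue  # assumption: ignore insertions relative to consensus
        per_position.append(None if base == gap else cons == base)
    scores = []
    for start in range(0, len(per_position), window):
        aligned = [m for m in per_position[start:start + window] if m is not None]
        scores.append(sum(aligned) / len(aligned) if aligned else None)
    return scores


print(window_identity("ACGTACGTACGTACGTACGT", "ACGTACGTACTTACGTAC--"))
# -> [1.0, 0.875]
```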

Features

Some families in Dfam include a curated set of known coding sequences, high-scoring matches to a protein database, or specific known sequence features. For these families a feature map is generated depicting the location of each feature within the family. Target site duplications (TSD) will also be depicted on either side of the scale if applicable to the family. Hovering over the TSD will display the consensus sequence if a composition preference is known or a string of N's if only a length is known.

Example Feature Visualization.

Coding Sequences

For some families we provide details on a curated set of TE derived proteins. This is a non-redundant set, therefore many families may not have a coding sequence directly associated with them. For those that do, this section provides the following details:

Product: A unique (within Dfam) identifier for the amino acid sequence produced by this coding sequence.
Description: A short description of the coding sequence.
Protein Type: A short keyword for the protein type or "unk" for unknown. Here are the commonly used types:
  • gag - Group-specific antigen (retroviral polyprotein).
  • pol - Reverse transcriptase, RNase-H and integrase functions.
  • env - Envelope retroviral protein.
  • yr - Tyrosine Recombinase.
  • tp - Transposase (integrase).
  • atp - ATP-dependent DNA binding protein.
  • int - Integrase
  • pro -
  • hel - Helicase
  • eg - Exapted gene
  • p# - Used when a family contains more than one protein of unknown function. This is used to make the product name unique and indicate the order in which the coding region is found within the family ( e.g. IS3EU-5_DR_p1 and IS3EU-5_DR_p2 ).
Translation: A drop-down box containing the amino acid sequence translated from the family coding sequence.
Frameshifts: The number of frameshifts detected.
Stop Codons: The count of stop codons present in the final alignment.
Exons: The number of exons.
Strand: The strand of the coding sequence relative to the family ( '+' or '-' ).

Features

In this section generic features such as transcription factor binding sites and sequence-specific structures are shown.

Model

Models in Dfam are provided in two forms: a consensus sequence and a profile HMM (Hidden Markov Model). Both of these models are built from the same set of aligned representative sequences, termed the seed alignment.

Consensus

The consensus sequence is called from the seed alignment using a specialized caller. Rather than a simple majority rule, the caller attempts to detect and fix miscalled mutagenic CpG sites.

The sequence is displayed in a scrolling box with start/end base positions listed for each line. In addition, a search box is provided for highlighting portions of the sequence. Single positions, position ranges, and DNA sequence strings may be entered into the search box.

Example Consensus Text (Ricksha_c)

HMM

For each profile HMM, three thresholds are defined per associated species. The gathering threshold is appropriate for masking sequences that match the model specificity; it is set using knowledge of the approximate copy number to yield high sensitivity with low false discovery rates. This threshold is used when defining the model's expected false discovery rate. When annotating sequences that fall outside of the model's specificity, the more stringent trusted cutoff threshold should be used. The noise cutoff indicates the score of the highest-scoring match that falls below the gathering threshold.
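The relationship between the three thresholds can be illustrated with a short Python sketch. The function names and labels are hypothetical, the scores are invented, and treating a score equal to a threshold as passing is an assumption.

```python
def noise_cutoff(scores, gathering_threshold):
    """Highest score that falls below the gathering threshold, if any."""
    below = [s for s in scores if s < gathering_threshold]
    return max(below) if below else None


def classify_hit(score, gathering_threshold, trusted_cutoff):
    """Label a hit by the most stringent threshold it meets."""
    if score >= trusted_cutoff:      # assumption: ties count as passing
        return "trusted"
    if score >= gathering_threshold:
        return "gathering"
    return "below"


scores = [12.5, 18.0, 24.9, 31.2, 47.0]  # invented bit scores
print(noise_cutoff(scores, gathering_threshold=25.0))
print([classify_hit(s, 25.0, 40.0) for s in scores])
```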

Example HMM Logo (Ricksha_c)

The Logo of the HMM is depicted using an interactive visualization. Each position in the model is represented by a stack of letters, with stack height indicating the information content of the position. The rate and expected length of insertions after each position are shown in the fields below each stack. The logo can be zoomed using the '-' and '+' on-screen buttons to show more or less of the model as desired. It is also possible to center the logo by entering a column number in the field provided. A static image of the full logo can be downloaded using the green down arrow next to the word "Logo".
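For readers unfamiliar with information content, the sketch below computes the classic Shannon measure for one nucleotide position, in bits, relative to a uniform background. This is an assumption for illustration only: the text does not state which background distribution or variant the Dfam logo uses, and the function name is hypothetical.

```python
import math


def information_content(probs):
    """Information content (bits) of one position.

    `probs` maps A, C, G, T to emission probabilities summing to 1.
    Assumes a uniform background, so the maximum is log2(4) = 2 bits.
    """
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return 2.0 - entropy


# A fully conserved position carries 2 bits; a uniform one carries none.
print(information_content({"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}))
print(information_content({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}))
```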

HMM Genome Specific Characteristics

Coverage, Conservation, and Inserts

The Coverage, Conservation, and Inserts plot shows, for hits above a variety of thresholds, (1) the distribution of hits along the model, (2) the position-specific levels of conservation of those hits, and (3) the position-specific rates of insertion among those hits. For a selected threshold, the purple line shows, for each model position, the fraction of all hits that have a match to that position, considering only RPH-filtered hits (hits for which this model is deemed to fit the sequence better than any other Dfam model). Among RPH-filtered hits, the green line shows, for each position, the average percent identity for the position. The grey line shows the number of insertions among those hits. In the 'Threshold' drop-down, each threshold is accompanied by the number of hits meeting it, in parentheses.

The figure below shows an example of the plot for the model Kanga1, representing the 214 RPH-filtered matches to the Human genome with E-value better than 1e-4. The purple line indicates that hits tend not to be full-length (the middle section is covered by only a few percent of hits), and that only about 30% of hits are aligned to the 3' end, where coverage is greatest. The green line shows that sequence conservation is on average around 70%, but with some variability. The grey lines show that inserts are generally rare, but at model position 1567, roughly 8 hits have insertions.

Example Non-Redundant Coverage, Conservation, and Inserts Plot (Kanga)

Non-Redundant Coverage

The Non-Redundant Coverage plot shows the distribution across the model for all above-threshold hits for which this model is deemed to fit the sequence better than any other Dfam model. The plot is generated using annotation model ranges and therefore does not account for insertions/deletions.
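A coverage profile of this kind can be derived from model ranges alone, as in the Python sketch below. The function name is hypothetical and the ranges are assumed to be 1-based and inclusive. Because only start and end positions are used, a hit counts toward every position in its range even where the alignment has a deletion, which is why such a plot does not account for insertions/deletions.

```python
def model_coverage(hit_ranges, model_length):
    """Number of hits covering each model position.

    `hit_ranges` holds (model_start, model_end) pairs, 1-based inclusive.
    Returns a list whose index 0 corresponds to model position 1.
    """
    diff = [0] * (model_length + 2)  # difference array, 1-based
    for start, end in hit_ranges:
        if start > end:  # tolerate reversed coordinates
            start, end = end, start
        start, end = max(start, 1), min(end, model_length)
        if start > end:
            continue  # range lies entirely outside the model
        diff[start] += 1
        diff[end + 1] -= 1
    coverage, running = [], 0
    for pos in range(1, model_length + 1):
        running += diff[pos]
        coverage.append(running)
    return coverage


print(model_coverage([(1, 3), (2, 5), (5, 5)], model_length=5))
# -> [1, 2, 2, 1, 2]
```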

The figure below shows the non-redundant coverage for the model Kanga1. This example plot shows a common signal for DNA transposons, with the interior portion of the model covered by fewer instances than the termini, since non-autonomous TEs can suffer various degrees of internal deletion, yet must retain critical terminal features. Many of the 5' terminal hits fall between the gathering threshold and trusted cutoff threshold, leading to a terminal light green bulge on the sides of the plot.

Example Non-Redundant Coverage Plot (Kanga1a)

Redundant Coverage

This plot is much like the Non-Redundant Coverage plot, but instead of showing distribution across the model for only those matches that are best hit by this model, it shows the distribution for all hits to the model, even those better hit by some other model. The plot is generated using annotation model ranges and therefore does not account for insertions/deletions.

The figure below shows the redundant coverage for SVA_A. SVAs carry two reverse-complemented Alu fragments, one from positions 60 to 315 of the SVA model, and another shorter fragment from position 315 to 400. The first of these regions hits most of the Alu instances in the human genome, leading to the very large spike of over 1 million above-threshold hits covering that region, even though there are only a few thousand SVA copies. These Alu instances are better hit by one of the Alu models, and thus do not show up in the non-redundant coverage plot or related hit counts.

Example Redundant Coverage Plot (SVA_A)

False Coverage Plot

This plot shows how matches to an artificial benchmark sequence not containing any TE insertions are distributed across the model. This plot helps identify model regions that may be responsible for generating false positive hits, for example due to low-complexity or simple repeat characteristics.

The figure below shows an example False Coverage Plot, for Harlequin. It shows a spike of 36 false hits with E-value better than 1, covering a window around position 1000 of the model. Only the few hits signified by the purple part of the graph have E-value better than 1e-4.

Example False Coverage Plot (MER4-int)

Annotations

Overview

This table contains the count of matches, the Kimura divergence and average (fragment) hit length between the model and each associated genome. The table is first divided by Non-Redundant vs All hits. Non-Redundant (aka. NRPH hits) are the subset of all hits which are not covered by a higher scoring hit from another (probably related) Dfam family. This subset represents the most likely set of members of this family within a given genome. The All hits category is provided to give a broader view of how specific the model is within the Dfam library. The hit stats are further subdivided into those that score above the gathering and trusted cut-off thresholds.
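The idea behind the Non-Redundant subset can be sketched in Python as below. This is a deliberate simplification and not Dfam's adjudication pipeline: it assumes that any overlap with a higher-scoring hit from a different family removes a hit, and the `Hit` fields, family labels, coordinates, and scores are all invented for the example.

```python
from collections import namedtuple

# Hypothetical record for one annotation.
Hit = namedtuple("Hit", "family seq start end score")


def overlaps(a, b):
    """True if two hits share at least one base on the same sequence."""
    return a.seq == b.seq and a.start <= b.end and b.start <= a.end


def non_redundant(hits):
    """Keep hits not overlapped by a higher-scoring hit of another family."""
    kept = []
    for hit in hits:
        beaten = any(
            other.family != hit.family
            and other.score > hit.score
            and overlaps(hit, other)
            for other in hits
        )
        if not beaten:
            kept.append(hit)
    return kept


hits = [
    Hit("SVA_A", "chr1", 100, 400, 50.0),   # loses to the Alu hit below
    Hit("Alu", "chr1", 120, 380, 200.0),
    Hit("SVA_A", "chr2", 10, 900, 800.0),   # no competing hit
]
print([h.family for h in non_redundant(hits)])
# -> ['Alu', 'SVA_A']
```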

The average hit length does not represent the average length of an insertion; rather, it is the average length of an alignable portion of an insertion. In the near future Dfam's annotation adjudication pipeline will be refactored to use a RepeatMasker-like algorithm. This will enable joining of interrupted insertion fragments and generating a true average insertion length statistic.

Genome Annotation Distribution

The number of genomic matches that are found for many TE entries means that it is difficult to provide all matches via a web interface or as a multiple sequence alignment. The karyotype ideogram shows the distribution of hits for the entry against a particular genome as a heat-map-style representation. Each color represents a binned range of counts. Each color band is clickable, with the hits corresponding to that location loaded below the karyotype ideogram. The table contains the match score, E-value, and positions in the respective model and sequence. The alignment between the model and sequence can be obtained by expanding the row using the > symbol found at the beginning of each row. In the alignment, the model line represents the consensus sequence for aligned states in the model, colored according to the match line. The PP line represents the posterior probability, or degree of confidence in each aligned residue (for example, with '*' meaning highest confidence, and low numbers indicating low confidence), with corresponding grey scale coloration of the Query sequence.

By default, the hits counted here are only those in which this model was determined to better explain the sequence than any other model. To see distribution of all hits to the model (including those preferred by other models), select "All Hits".
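The per-band counts behind a heat map of this kind can be sketched as a simple binning step. The function name is hypothetical, and both the fixed bin size and the choice to assign each hit to a bin by its 1-based start coordinate are assumptions for the example.

```python
from collections import Counter


def bin_hits(hit_starts, bin_size):
    """Count hits per fixed-size bin along one chromosome.

    `hit_starts` are 1-based start coordinates. Bin 0 covers positions
    1..bin_size, bin 1 covers bin_size+1..2*bin_size, and so on.
    """
    counts = Counter((start - 1) // bin_size for start in hit_starts)
    return dict(sorted(counts.items()))


print(bin_hits([1, 500, 1000, 1001, 2500], bin_size=1000))
# -> {0: 3, 1: 1, 2: 1}
```

Bins with no hits are simply absent from the result, which corresponds to the white patches in the ideogram.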

The example below shows the non-uniform distribution of MIRs across the human genome. Large patches of white in the hit distribution ideogram indicate regions with no instances of the model; in this case these are heterochromatic regions that are particularly difficult to sequence (represented by Ns in the genome sequence).

Example Karyotype plot (MIR)

Below the karyotype ideogram are the hits from a region on chromosome 21, with one hit expanded to show the alignment of that hit to the MIR model. In the alignment, the model line presents the consensus sequence for aligned states in the model, colored according to the match line. The PP line represents the posterior probability, or degree of confidence in each aligned residue (for example, with '*' meaning highest confidence, and low numbers indicating low confidence), with corresponding grey scale coloration of the Query sequence.
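The PP characters can be translated into probability ranges as sketched below. The mapping follows the convention documented for HMMER alignment output ('0' for 0-5%, '1' for 5-15%, up to '9' for 85-95%, and '*' for 95-100%); that the Dfam display uses exactly this encoding is an assumption, and the function name is hypothetical.

```python
def pp_range(symbol):
    """Posterior probability range, in percent, for one PP character.

    Returns (low, high), or None for '.', which marks a column with
    no aligned residue.
    """
    if len(symbol) != 1:
        raise ValueError("expected a single character")
    if symbol == "*":
        return (95, 100)
    if symbol == "0":
        return (0, 5)
    if symbol in "123456789":
        digit = int(symbol)
        return (10 * digit - 5, 10 * digit + 5)
    if symbol == ".":
        return None
    raise ValueError(f"unexpected PP character: {symbol!r}")


print([pp_range(c) for c in "*95."])
# -> [(95, 100), (85, 95), (45, 55), None]
```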

Example Alignment (MIR)

Relationships

The Relationship tab provides a representation of the similarities between TE entries. Consensus sequences were produced for all models using the HMMER3 tool hmmemit. These sequences were then searched with all models using nhmmer (and reciprocally with nhmmscan); a hit with an E-value better than 1e-5 is taken as support for a relationship.
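The E-value criterion can be illustrated with the Python sketch below. It works on already-parsed (query, target, E-value) tuples rather than on real nhmmer output, and the function name, family labels, and E-values are invented. Ignoring self-hits and collapsing both search directions into one unordered pair are assumptions for the example.

```python
def supported_relationships(hits, max_evalue=1e-5):
    """Unordered family pairs supported by at least one qualifying hit.

    `hits` is an iterable of (query_family, target_family, evalue).
    """
    pairs = set()
    for query, target, evalue in hits:
        # "Better than 1e-5" is read as strictly smaller than the cutoff.
        if query != target and evalue < max_evalue:
            pairs.add(frozenset((query, target)))
    return pairs


hits = [
    ("famA", "famB", 1e-20),
    ("famB", "famA", 3e-18),  # reciprocal hit, same relationship
    ("famA", "famC", 0.5),    # too weak to support a relationship
    ("famA", "famA", 0.0),    # self-hit, ignored
]
print(supported_relationships(hits))
```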

If desired, results can be restricted to sequence similarities to families that belong to ancestor or descendant clades with the 'Related Clades Only' option. Additionally, the results can be expanded to include sequence similarity to uncurated or "raw" data (accessions starting with DR) which are hidden by default.

The image below shows part of the Relationship tab for the Ricksha_c (DF000001061) entry. In this case, the Ricksha TE carries part of an ERVL and its LTR, MLT2. Simple glyphs are used to represent these relationships, as well as similarities to other Ricksha Dfam entries. The case of reverse-complement similarity is shown using a purple glyph with inverted orientation. Each glyph is shown with the accompanying percent identity between the entry consensus sequences, match E-value, and percent shared coverage (length). The list of related entries can be sorted by any of these fields.

Example Relationships plot (Ricksha_c)

Download

From this tab, you can download the seed alignment, profile HMM, and EMBL- or FASTA-formatted consensus sequence for the family. You can also download the annotations for each genome: (1) matches to this family's profile HMM after removing redundant hits to other models ("Non-redundant"), and (2) matches above the gathering threshold, including those with better scoring matches to other models ("All hits").