DFAM : Multiple alignment and profile HMMs of repetitive DNA                                            
                        RELEASE 2.0
           --------------------------------------

1. INTRODUCTION

  Dfam is a collection of conserved DNA element sequence alignments,
  hidden Markov models (HMMs) and match lists for complete genomes. This
  release focuses on models for Human, Mouse, Zebrafish, Worm and Fly.

2. LOCATIONS

  Dfam is available on the web at:

    http://dfam.org/

3. STATISTICS

  Dfam 2.0 consists of 4150 models. 

  Dfam families include retrotransposons, DNA transposons, interspersed 
  repeats of unknown origin, and a number of non-TE entries used to annotate 
  satellites or to avoid annotating noncoding RNA genes as TEs. The distribution
  of these constituent family types is given below:

                 retrotransposons   DNA transposons   unknown origin
                 ----------------   ---------------   --------------
     human only         428               46                  1
     mouse only         544                9                  8
    all mammals         388              277                 62
      zebrafish        1074              766                 13
            fly         165               27                 10
       nematode          57               98                  8
 
  In addition to the repeat families represented here, Dfam contains 76 noncoding
  RNA families and 92 satellite families.

4. CONSTRUCTION OF DFAM

  Dfam is based on a fixed sequence database called Dfamseq. Dfamseq 2.0
  contains five genomes: 

       H. sapiens ( GRCh38 ),
       C. elegans ( ce10 ),
       D. melanogaster ( dm6 ),
       M. musculus ( mm10 ), and
       D. rerio ( danRer10 )

  and species-specific GARLIC [3] artificial benchmark sequences for each genome.  
       
  Sequence alignments for human interspersed repeat families were built using
  annotation on the UCSC genome browser (http://genome.ucsc.edu/,
  hg38, ce10, dm6, mm10, and danRer10), which itself depends on the annotation
  software RepeatMasker (http://www.repeatmasker.org/) and the database
  of repeat consensus sequences, Repbase (http://www.girinst.org/repbase/). For
  each family, annotated instances were transitively aligned based on mutual 
  alignment to the Repbase consensus sequence.

  Hidden Markov models (HMMs) were constructed from the sequence alignment
  using the HMMER3 tool hmmbuild, and each model was then searched against
  Dfamseq using a beta version of the HMMER3 tool nhmmer, with hit metadata
  (sequence location, score, etc.) captured for distribution.

  Note: We are currently using a pre-release of HMMER 3.1. The source code 
  for this snapshot is available from the Dfam FTP site.

5. DESCRIPTION OF CHANGES FROM RELEASE 1.4 TO 2.0

  1. Changes to the Dfam website largely revolve around support for the 
     presence of repeat families belonging to multiple species. The majority 
     of the changes are on the back end of the website, involving speed 
     and scalability.  

  2.  New tags have been added to the DESC file format to support model-use 
      differences between species.  Also note that several tags in the DESC 
      file format have slightly different uses than in previous releases.

      New TH Tags
      -----------
      A mobile element model found in Dfam may be found in one or more of Dfam's
      reference species ( currently: human, mouse, zebrafish, fly, and worm ). A
      set of score cutoff thresholds has been independently calculated for each
      species in which the model can be found and stored in the DESC file with the
      new TH tag.  The tag includes the NCBI taxonomy database id as well
      as the Latin name of the species to which the thresholds apply.

      e.g.
      TH   TaxId:9606; TaxName:Homo sapiens; GA:5.60; TC:23.23; NC:5.55; fdr:0.002;
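
      As an illustration, a TH line of the form shown above could be split
      into its fields as follows.  This is only a minimal Python sketch: the
      function name parse_th_line and the layout of the returned dictionary
      are assumptions made for this example and are not part of any Dfam tool.

```python
def parse_th_line(line):
    """Parse one DESC 'TH' line into a dictionary (hypothetical helper)."""
    tag, _, body = line.strip().partition(" ")
    if tag != "TH":
        raise ValueError("not a TH line: %r" % line)
    record = {}
    # Fields are separated by ';' and each field is written as key:value.
    for field in body.split(";"):
        field = field.strip()
        if not field:
            continue
        key, _, value = field.partition(":")
        record[key.strip()] = value.strip()
    # Convert the numeric fields; TaxName stays a string.
    record["TaxId"] = int(record["TaxId"])
    for key in ("GA", "TC", "NC", "fdr"):
        record[key] = float(record[key])
    return record


example = "TH   TaxId:9606; TaxName:Homo sapiens; GA:5.60; TC:23.23; NC:5.55; fdr:0.002;"
th = parse_th_line(example)
```

      The sketch assumes that every TH line carries exactly the six fields
      shown in the example; a line missing one of them raises a KeyError.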

      GA/TC/NC/FR Tags
      ----------------
      With the addition of the TH tags the global score cutoff threshold tags 
      GA/TC/NC become application-specific placeholders.  For example, they
      could be populated with the values from a particular TH line immediately
      prior to a search using nhmmer.  As of this release the DESC file 
      GA/TC/NC tags are populated with the highest TC cutoff found in the 
      DESC TH lines, which is an extremely conservative value.
      We anticipate that RepeatMasker/dfamscan and others will apply the TH 
      data to these three fields as needed.

      e.g.
      GA   23.23;
      TC   23.23;
      NC   23.23;
      TH   TaxId:9606; TaxName:Homo sapiens; GA:5.60; TC:23.23; NC:5.55; fdr:0.002;
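
      The relationship between the TH lines and the global GA/TC/NC tags can
      be sketched in Python as below.  The function names are hypothetical,
      and the mouse TH line is invented purely for illustration (its values
      do not come from Dfam); only the human line is taken from the example
      above.

```python
import re

# Matches key:value; pairs such as "TC:23.23;" inside a TH line.
_FIELD = re.compile(r"(\w+):([^;]+);")


def th_fields(line):
    """Return the key/value pairs of one TH line as a dict of strings."""
    return dict(_FIELD.findall(line))


def global_cutoffs(th_lines):
    """Release default described above: GA, TC and NC all get the highest TC."""
    highest_tc = max(float(th_fields(line)["TC"]) for line in th_lines)
    return {"GA": highest_tc, "TC": highest_tc, "NC": highest_tc}


def cutoffs_for_species(th_lines, tax_id):
    """Take GA/TC/NC from the TH line of one species (NCBI taxonomy id)."""
    for line in th_lines:
        fields = th_fields(line)
        if int(fields["TaxId"]) == tax_id:
            return {key: float(fields[key]) for key in ("GA", "TC", "NC")}
    raise KeyError("no TH line for taxonomy id %d" % tax_id)


th_lines = [
    "TH   TaxId:9606; TaxName:Homo sapiens; GA:5.60; TC:23.23; NC:5.55; fdr:0.002;",
    # Invented values, for illustration only:
    "TH   TaxId:10090; TaxName:Mus musculus; GA:6.10; TC:19.80; NC:6.00; fdr:0.002;",
]
default = global_cutoffs(th_lines)
human = cutoffs_for_species(th_lines, 9606)
```

      A search tool would presumably call something like cutoffs_for_species
      just before a run, and fall back to the conservative default otherwise.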

      SM Tags
      -------
      The SM tag no longer contains the definitive parameters used to search dfamseq.  
      It now contains the parameters used for the *last* search ( one of many species 
      assemblies ).  The DESC file format does not allow for easy storage of every 
      parameter set.  Therefore this value should be used as an example search command
      line.

      e.g.
      SM   nhmmer --cpu 8 --noali -E 100 --dfamtblout mm10-full_hits -Z 3102 HMM
      SM   dfamseq.mask
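
      Because the SM value may wrap over several lines, a reader of the DESC
      file has to join the lines before interpreting the command.  A minimal
      Python sketch follows; the helper name sm_command is an assumption of
      this example, and the command is only tokenized, never executed.

```python
import shlex


def sm_command(desc_lines):
    """Join consecutive SM lines and split the result into arguments."""
    parts = [line[2:].strip() for line in desc_lines if line.startswith("SM")]
    return shlex.split(" ".join(parts))


desc_lines = [
    "SM   nhmmer --cpu 8 --noali -E 100 --dfamtblout mm10-full_hits -Z 3102 HMM",
    "SM   dfamseq.mask",
]
args = sm_command(desc_lines)
```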

      MS Tags
      -------
      A model is applicable to one or more species or clades.  Horizontal transfer of 
      mobile elements between species requires that we support multiple disjoint taxa 
      in the DESC file.  Tigger1 is a classic example of wide horizontal transfer.  Its
      MS lines look like this:

      MS   TaxId:9263; TaxName:Metatheria;
      MS   TaxId:9348; TaxName:Xenarthra;
      MS   TaxId:9443; TaxName:Primates;
      MS   TaxId:9989; TaxName:Rodentia;
      MS   TaxId:33554; TaxName:Carnivora;
      MS   TaxId:91561; TaxName:Cetartiodactyla;
      MS   TaxId:311790; TaxName:Afrotheria;
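
      One way to use several disjoint MS lines is to test whether any listed
      taxon occurs in the lineage of the species being annotated.  The Python
      sketch below assumes this interpretation; the function names are
      hypothetical, and the lineage sets are hand-written, partial inputs
      rather than the output of a taxonomy lookup.

```python
def parse_ms_lines(lines):
    """Return (taxonomy id, taxon name) pairs from DESC 'MS' lines."""
    taxa = []
    for line in lines:
        if not line.startswith("MS"):
            continue
        fields = dict(
            part.strip().split(":", 1)
            for part in line[2:].split(";")
            if part.strip()
        )
        taxa.append((int(fields["TaxId"]), fields["TaxName"]))
    return taxa


def model_applies(taxa, lineage_ids):
    """True if any MS taxon is found in the given lineage (set of ids)."""
    return any(tax_id in lineage_ids for tax_id, _ in taxa)


tigger1 = parse_ms_lines([
    "MS   TaxId:9263; TaxName:Metatheria;",
    "MS   TaxId:9348; TaxName:Xenarthra;",
    "MS   TaxId:9443; TaxName:Primates;",
    "MS   TaxId:9989; TaxName:Rodentia;",
    "MS   TaxId:33554; TaxName:Carnivora;",
    "MS   TaxId:91561; TaxName:Cetartiodactyla;",
    "MS   TaxId:311790; TaxName:Afrotheria;",
])

# Partial lineages written by hand for this example.
human_lineage = {9606, 9443}   # Homo sapiens, Primates
zebrafish_lineage = {7955}     # Danio rerio only
```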

  3. Dfamseq is Dfam's collection of reference genomes and benchmark
     sequences.  In the 1.x releases, Dfamseq was maintained as
     a monolithic database referenced by seed alignments and search
     results alike.

     The new dfamseq is a collection of individual species-specific
     databases maintained in separate hierarchies/schemas. In this
     release we continue to maintain all provided seed alignment
     sequences but do not guarantee that the parent sequence for
     each seed is in dfamseq.
  
  4. Seed alignment sequence identifiers are now independent of dfamseq. 
     The identifier is an 80-character field, and we validate entries
     conforming to the following nomenclature:
           
     assembly:sequence:start_pos-end_pos

     Where coordinates are zero-based, half open.  For example seq1:0-1 would 
     specify the first base in the sequence "seq1". An example seed identifier 
     might look like this:

     mm10:chr7:46572136-46572252

     In the future we plan to flag seed identifiers that cannot be validated
     using public databases and using a standard nomenclature.  Currently all
     seed sequences can be validated in this fashion.
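
     The identifier nomenclature and its zero-based, half-open coordinates
     can be illustrated with a short Python sketch.  The names below are
     hypothetical and this is not the validation code used by Dfam.  Two
     assumptions are made that the text above does not state: assembly and
     sequence names contain no ':' or whitespace, and start_pos is always
     smaller than end_pos.

```python
import re

# assembly:sequence:start_pos-end_pos
# Assumption: names contain neither ':' nor whitespace.
SEED_ID = re.compile(r"^([^:\s]+):([^:\s]+):(\d+)-(\d+)$")
MAX_LENGTH = 80  # width of the identifier field


def parse_seed_id(identifier):
    """Split a seed identifier into (assembly, sequence, start, end)."""
    if len(identifier) > MAX_LENGTH:
        raise ValueError("identifier longer than %d characters" % MAX_LENGTH)
    match = SEED_ID.match(identifier)
    if match is None:
        raise ValueError("malformed seed identifier: %r" % identifier)
    assembly, sequence = match.group(1), match.group(2)
    start, end = int(match.group(3)), int(match.group(4))
    if start >= end:
        # Assumption of this sketch: start_pos < end_pos.
        raise ValueError("empty or reversed range: %r" % identifier)
    return assembly, sequence, start, end


def seed_length(start, end):
    """Number of bases covered by a zero-based, half-open range."""
    return end - start


def to_one_based(start, end):
    """Convert to one-based, fully closed coordinates."""
    return start + 1, end


assembly, sequence, start, end = parse_seed_id("mm10:chr7:46572136-46572252")
length = seed_length(start, end)
```

     With half-open coordinates the length is simply end minus start, so
     the example identifier covers 116 bases.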


6. FUTURE FORMAT CHANGES

  No major changes to the flatfile format are planned for the next
  release.


7. DESCRIPTION OF RELEASE FILES

  relnotes.txt               - This file.
  userman.txt                - A fuller description of Dfam fields.
  Dfam.hmm                   - Dfam HMMs in an HMM library, searchable with the 
                               nhmmer program.
  Dfam.seed                  - Annotation and seed alignments of all Dfam entries 
                               in Stockholm format.
  <assembly>_dfam.hits       - TSV list of all matches found in the given assembly that
                               score above the GA threshold.
                               e.g. hg38_dfam.hits.gz
  <assembly>_dfam.nrph.hits  - TSV list of all non-redundant matches found in the
                               given assembly and that score above the GA threshold.
                               e.g. hg38_dfam.nrph.hits.gz
  hmmer.src                  - The source code of the current beta version of nhmmer 
                               used to make this release.

8. DESCRIPTION OF FIELDS
  
  See userman.txt for a more detailed description of each field.
 

  Compulsory fields:
  ------------------

  AC   Accession number:           Accession number in form DFxxxxxxx.
  ID   Identification:             One word name for entry.
  DE   Definition:                 Short description of entry.
  AU   Author:                     Authors of the entry.
  SE   Source of seed:             The source suggesting the seed members belong to one entry.
  GA   Gathering method:           Score used for sequences within the clade specified by MS.
  TC   Trusted Cutoff:             Score used for sequences outside the clade specified by MS.
  NC   Noise Cutoff:               Smaller cutoff than GA; not used in Dfam.
  FR   False Discovery Rate:       Target FDR used to set GA.
  BM   Build method
  SM   Internal search method
  MS   Model specificity:          TaxID and TaxName, based on NCBI taxonomy.
  CT   Classification tags:        Repeat Type, Class, and Superfamily.
  SQ   Sequence:                   Number of sequences in alignment.
  //                               End of alignment.

  Optional fields:
  ----------------

  DC   Database Comment:           Comment about database reference.
  DR   Database Reference:         Reference to external database.
  RC   Reference Comment:          Comment about literature reference.
  RN   Reference Number:           Reference Number.
  RM   Reference Medline:          Eight digit medline UI number.
  RT   Reference Title:            Reference Title.
  RA   Reference Author:           Reference Author.
  RL   Reference Location:         Journal location.
  PI   Previous identifier:        Record of all previous ID lines.
  CC   Comment:                    Comments.
  WK   Wikipedia Reference:        Reference to Wikipedia.
  SN   Synonym:                    A widely accepted alternative name for the model.
  CN   Classification Note:        A free text comment about the model classification.
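
  As a rough illustration of how the field list above might be used, the
  Python sketch below collects tagged lines and reports which compulsory
  tags are absent.  It assumes the "two-letter tag, whitespace, value"
  layout seen in the examples of section 5; the function names are
  hypothetical, and the layout of the distributed files may differ (see
  userman.txt).

```python
from collections import defaultdict

# Compulsory tags as listed above ("//" is handled as a terminator).
COMPULSORY = ("AC", "ID", "DE", "AU", "SE", "GA", "TC", "NC",
              "FR", "BM", "SM", "MS", "CT", "SQ")


def parse_desc(text):
    """Group the values of a DESC-style record by their two-letter tag."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("//"):
            continue
        tag, value = line[:2], line[2:].strip()
        fields[tag].append(value)  # tags such as MS may repeat
    return dict(fields)


def missing_compulsory(fields):
    """Return the compulsory tags that do not occur in the record."""
    return [tag for tag in COMPULSORY if tag not in fields]


# A deliberately incomplete record built from lines shown in section 5.
record = """\
GA   23.23;
TC   23.23;
NC   23.23;
MS   TaxId:9443; TaxName:Primates;
MS   TaxId:9989; TaxName:Rodentia;
//
"""
fields = parse_desc(record)
missing = missing_compulsory(fields)
```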


9. REFERENCES

  1. The Dfam Database of Repetitive DNA Families
     Robert Hubley, Robert D. Finn, Jody Clements, Sean R. Eddy, Thomas A. Jones, Weidong
     Bao, Arian F.A. Smit, Travis J. Wheeler
     Nucl. Acids Res. In Press.

  2. Dfam: a Database of Repetitive DNA Based on Profile Hidden Markov Models    
     Wheeler TJ, Clements J, Eddy SR, Hubley R, Jones TA, Jurka J, Smit AFA, Finn RD
     Nucl. Acids Res. (2013) Database Issue 41:D70-82. doi: 10.1093/nar/gks1265

  3. Realistic artificial DNA sequences as negative controls for computational
     genomics.  Caballero J, Smit AF, Hood L, Glusman G.  Nucl. Acids Res. 2014
     doi: 10.1093/nar/gku356

10. THE DFAM CONSORTIUM

  Dfam is maintained by a consortium of researchers. You can contact
  the Dfam consortium at:
      help@dfam.org

  The current members of the Dfam consortium are:
  Robert D. Finn, Jody Clements, Sean R. Eddy, Thomas A. Jones, Travis J.
  Wheeler: Janelia Farm Research Campus, USA
  Arian F. A. Smit, Robert Hubley: Institute for Systems Biology, USA
  Jerzy Jurka: Genetic Information Research Institute, USA
 

11. ACKNOWLEDGEMENTS
  
  R.D.F., J.C., S.R.E., T.A.J., and T.J.W. received institutional support from
  HHMI Janelia Farm Research Campus. J.J. was supported by grants from the
  National Library of Medicine, National Institutes of Health
  (P41LM006252-12). A.F.A.S. and R.H. were supported by a
  grant from the National Institutes of Health (R01 HG002939).

12. COPYRIGHT NOTICE

  Dfam - A database of conserved DNA element alignments and HMMs
  Copyright (C) 2015 The Dfam consortium.

  This database is free; you can redistribute it and/or modify it
  as you wish, under the terms of the CC0 1.0 license, a
  'no copyright' license:

  The Dfam consortium has dedicated the work to the public domain, waiving
  all rights to the work worldwide under copyright law, including all related
  and neighboring rights, to the extent allowed by law.

  You can copy, modify, distribute and perform the work, even for commercial
  purposes, all without asking permission. See Other Information below.
                               

  Other Information

  o In no way are the patent or trademark rights of any person affected by
    CC0, nor are the rights that other persons may have in the work or in how
    the work is used, such as publicity or privacy rights.
  o Unless expressly stated otherwise, the Dfam consortium makes no
    warranties about the work, and disclaims liability for all uses of the
    work, to the fullest extent permitted by applicable law.
  o When using or citing the work, you should not imply endorsement by the  
    Dfam consortium.

  You may also obtain a copy of the CC0 license here:
  http://creativecommons.org/publicdomain/zero/1.0/legalcode

___________________
The Dfam Consortium
2015