The yak genome database: an integrative database for studying yak biology and high-altitude adaptation

The YGD currently provides access to yak genome assembly version 1.1, a de novo genome sequence assembly prepared using second-generation sequencing technologies by the State Key Laboratory of Grassland Agro-Ecosystem, College of Life Science, Lanzhou University, Lanzhou, China and BGI-Shenzhen, China. The sequence files are maintained by the former. The 65X genome assembly has a scaffold N50 of 1.4 Mb and a total size of 2,657 Mb. It is thus similar in size to the 2,649 Mb cattle genome (UMD 3.1).
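For readers unfamiliar with the scaffold N50 statistic quoted above, the following minimal sketch shows how it is commonly computed: scaffolds are sorted from longest to shortest, and the N50 is the length of the scaffold at which the running total first reaches half of the assembly size. The function name and the toy lengths are illustrative assumptions, not values from the yak assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    The N50 is the length L such that scaffolds of length >= L
    together cover at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy scaffold lengths (hypothetical, not taken from yak assembly 1.1).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

In practice the lengths would be read from the assembly FASTA file; only the list of scaffold lengths is needed for this statistic.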

Search

The search function was designed to facilitate the identification of genes according to their predicted annotations. Gene datasets were obtained using a number of strategies, including RNA-seq, homology-based and ab initio gene prediction. For homology-based prediction, pseudogenes were detected and discarded, while gene models that lack synteny support but have a higher quality score and fewer frame shifts were retained. For ab initio prediction, only gene models with a minimum coverage of 30% in Swiss-Prot/TrEMBL were retained. A consensus non-redundant dataset containing 22,282 protein-coding genes was built by merging the different gene datasets using GLEAN (http://sourceforge.net/projects/glean-gene).

Of these, 15,716 (70.53%) have RNA-seq support, and 21,474 (96.37%) of the predicted yak genes have a homologue (TreeFam) in at least one of human (n = 19,894; 89.28%), cattle (n = 20,346; 91.31%) and dog (n = 19,455; 87.31%), with 18,040 being homologous in all species examined and 170 (0.76%) being 'unique' to yak. In addition, 8,923 (40.05%) genes have single orthologs in cattle and human. This combined gene dataset was used as the reference and has been integrated into the YGD. Predicted noncoding genes, including 481 miRNAs, 562 rRNAs and 499 tRNAs, were also integrated into the database.

The integrated genes were annotated using several methods, including Swiss-Prot annotation, TrEMBL annotation, KEGG annotation, InterPro domain annotation and Gene Ontology annotation. Swiss-Prot and TrEMBL annotations for the predicted yak proteins were generated by performing BLASTP (e-value ≤ 1e-5) searches for each of them against the Swiss-Prot and TrEMBL databases. The corresponding yak genes were then mapped to KEGG pathway maps according to the best BLASTP (e-value ≤ 1e-5) hit.

InterProScan was used to annotate motifs and domains in yak genes by comparing them against the public Pfam, PRINTS, PROSITE, ProDom and SMART databases using the HMMPfam, FPrintScan, ScanRegExp, ProfileScan, BlastProDom and HMMSmart applications with the following parameters: -format raw -goterms -iprlookup. Gene Ontology information was then extracted from the InterProScan results with in-house Perl scripts. These annotations will be refreshed when gene models are updated.
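The best-hit step described above can be sketched as follows. This is only an illustration, not the authors' pipeline: it assumes the BLASTP results are in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, e-value in column 11, bit score in column 12), and it assumes that "best" means the highest bit score among hits passing the e-value cutoff. The gene and protein identifiers in the example are hypothetical.

```python
import csv
import io

EVALUE_CUTOFF = 1e-5  # threshold stated in the text


def best_hits(handle, max_evalue=EVALUE_CUTOFF):
    """Return {query: (subject, evalue, bitscore)} for the best hit per query.

    Assumes 12-column tab-separated BLAST output. "Best" is taken to be
    the highest bit score among hits with evalue <= max_evalue
    (an assumption made for this sketch).
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # hit is not significant enough
        current = best.get(query)
        if current is None or bitscore > current[2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical example rows (IDs and scores are made up for illustration).
example = "\n".join(
    "\t".join(fields)
    for fields in [
        ["yak_gene_1", "PROT_A", "98.0", "300", "6", "0", "1", "300", "1", "300", "1e-150", "550"],
        ["yak_gene_1", "PROT_B", "60.0", "280", "112", "2", "5", "284", "3", "280", "1e-40", "160"],
        ["yak_gene_2", "PROT_C", "30.0", "50", "35", "1", "1", "50", "10", "59", "2e-3", "35"],
    ]
)

hits = best_hits(io.StringIO(example))
for gene, (subject, evalue, bitscore) in hits.items():
    print(gene, subject, evalue, bitscore)
```

Here `yak_gene_2` is dropped because its only hit does not pass the e-value cutoff, so it would receive no annotation from this step.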

Tools

Generic genome browser

We used the Generic Genome Browser (GBrowse), developed as part of the Generic Model Organism Database project (GMOD http://gmod.org/wiki/GMOD), to visualise the genome of the yak [7]. In addition, predicted genes, single nucleotide variants (SNVs), multiple types of RNA sets and repeats contained in the YGD can be visualised using GBrowse.

To identify SNVs, high-quality reads (i.e. reads with an average quality score above 30) from short-insert-size libraries were re-aligned to the assembly using SOAP (http://soap.genomics.org.cn/). The likelihood of every possible genotype at each position on the reference genome was calculated, and a statistical model based on Bayesian theory and the Illumina quality system was used to call SNVs. The allelic sequence with the highest probability was used as the reference sequence, and heterozygosity was calculated from the other high-probability alleles.

Repeat sequences were identified by two different methods. First, we identified known TEs using RepeatMasker (http://www.repeatmasker.org) and the Repbase TE library (http://www.girinst.org/repbase), and then used the RepeatProteinMask program to identify TEs by aligning the genome sequence to a self-generated curated TE protein database. Second, we built a de novo yak repeat library using RepeatModeler, which produced consensus sequences and classification information for each repeat family. The RepeatMasker program was then applied to the genome sequences, using the RepeatModeler consensus sequences as the library.

GBrowse enables users to easily browse any region of interest in the yak genome. A number of track features can be accessed according to the region's position on the scaffold, including protein-coding genes, noncoding genes, GC content and repetitive sequences.
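The idea behind the Bayesian SNV calling step can be illustrated with a deliberately simplified sketch. It is not the model used by the authors or by the SOAP tools: it assumes a diploid site with only two candidate alleles, treats reads as independent, converts each Phred quality score Q to an error probability 10^(-Q/10), spreads errors evenly over the three other bases, and uses a hypothetical heterozygosity prior (`het_prior`). All names and values are assumptions made for illustration.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def genotype_posteriors(bases, quals, ref, alt, het_prior=1e-3):
    """Posterior probability of genotypes ref/ref, ref/alt and alt/alt.

    Simplified model: independent reads, two candidate alleles,
    and a hypothetical prior (het_prior) chosen for illustration.
    """
    if ref == alt:
        raise ValueError("ref and alt alleles must differ")
    genotypes = {ref + ref: (ref, ref), ref + alt: (ref, alt), alt + alt: (alt, alt)}
    priors = {
        ref + ref: 1 - 1.5 * het_prior,
        ref + alt: het_prior,
        alt + alt: het_prior / 2,
    }
    log_post = {}
    for name, (a1, a2) in genotypes.items():
        logp = math.log(priors[name])
        for base, q in zip(bases, quals):
            err = phred_to_error(q)
            # Probability of observing this base from each chromosome copy.
            p1 = 1 - err if base == a1 else err / 3
            p2 = 1 - err if base == a2 else err / 3
            logp += math.log(0.5 * p1 + 0.5 * p2)
        log_post[name] = logp
    # Normalise in log space to avoid numerical underflow.
    top = max(log_post.values())
    total = sum(math.exp(v - top) for v in log_post.values())
    return {name: math.exp(v - top) / total for name, v in log_post.items()}


# Eight reads covering one site, all with quality 30: four show A, four show G.
posteriors = genotype_posteriors("AAAAGGGG", [30] * 8, ref="A", alt="G")
print(max(posteriors, key=posteriors.get))
```

With an even mix of the two alleles at high quality, the heterozygous genotype receives almost all of the posterior mass despite its small prior, which is the behaviour a Bayesian caller relies on to separate true variants from sequencing errors.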

BLAST

BLAST is one of the most useful tools for searching the YGD, offering users the ability to search against scaffolds and genes in Yak 1.1. On the results page of a BLAST search of the YGD, each hit is linked to the GBrowse view of the sequence.

Generic synteny browser

A three-way genome comparison between human, cattle and yak was generated using Mercator and MAVID [8], and can be viewed using the Generic Synteny Browser (GBrowse-syn), which is based on the same software framework as GBrowse. GBrowse-syn can be used to compare co-linear regions of multiple genomes through the familiar GBrowse-style web interface, and can also display gene symbols and functional information.

Downloads

Data in the YGD can be downloaded for local searches, including the whole genome, specific gene sequences and mitochondrial genomes.

Source: http://bmcgenomics.biomedcentral.com/articles/10.1186/