Consensus Annotations Track Settings

Home
Genomes
Genome Browser
Tools
Mirrors
- Third Party Mirrors
- Mirroring Instructions
Downloads
My Data
Projects
Help
About Us
- News
- Publications
- Blog
- Cite Us
- Credits
- Release Log
- Staff
- Contact Us
- Conditions of Use
- Jobs
- Licenses

Description

This track shows Open Reading Frame annotations. For consensuses from Dfam these come directly from the Dfam 3.1 annotations stored here.

You can also view them on Dfam.org under the "features" tab.

For consensuses built by the RepeatBrowser "buildSeqs" script we used tblastn to search the RepeatMasker peptide library and took the highest scoring hit (the highest scoring hit also had to pass a score threshold of 50).

Note that the Repeat Masker Peptide Library does not contain peptides for every element. That is, L1PA2 proteins are assigned to L1HS not because of a poor match, but because L1PA2 is the representative protein for all young L1PA (i.e. there is no L1HS peptide in the library).

Methods

#for exact matches

awk '($6=="exact")' summary.tsv | cut -f 1 | while read i; do sed -n -e '/^NM\s*'${i}'$/,/\/\// p' Dfam.embl | grep -A1 "FT.*CDS "| tr -s "\n" "\t" | sed "s/--/\n/g" | tr -s "." " \t" | tr -s " " "\t" | tr -s "\"" "\t" | awk -v s=$i '{print s"\t"$3"\t"$4"\t"$7}'; done > exact_annotations.tab

#for renamed matches

awk '($6=="int-strip")' summary.tsv | cut -f 1 | while read i; do sed -n -e '/^NM\s*'${i}'$/,/\/\// p' Dfam.embl | grep -A1 "FT.*CDS "| tr -s "\n" "\t" | sed "s/--/\n/g" | tr -s "." " \t" | tr -s " " "\t" | tr -s "\"" "\t" | awk -v s=$i '{print s"\t"$3"\t"$4"\t"$7}'; done > intstrip_annotations.tab

#for manual matches

awk '($6!="int-strip" && $6!="exact" && $6!="cons")' summary.tsv | cut -f 1 | while read i; do sed -n -e '/^NM\s*'${i}'$/,/\/\// p' Dfam.embl | grep -A1 "FT.*CDS "| tr -s "\n" "\t" | sed "s/--/\n/g" | tr -s "." " \t" | tr -s " " "\t" | tr -s "\"" "\t" | awk -v s=$i '{print s"\t"$3"\t"$4"\t"$7}'; done > manual_annotations.tab

cat intstrip_annotations.bed manual_annotations.bed exact_annotations.bed | awk '{print $0"\t0\t+"}'| sort | uniq > Dfam_annotations.bed

#built cons

awk '($6=="cons")' ../summary.tsv | cut -f 1 | while read i; do sed -n -e '/^>'${i}'$/,/^>/ p' ~/hive/jferna10/RepeatBrowserHub/hg38reps/hg38reps.fa | head -n-1 >> built_cons.fa; done

cd peps/

#RepeatMasker pep library from Hubley

wget https://github.com/rmhubley/RepeatMasker/raw/master/Libraries/RepeatPeps.lib

makeblastdb -in built_cons.fa -dbtype nucl -parse_seqids

#phylogenetic groupings for TEs found in hg38 (this list was arrived at by taking the Dfam name, reduce repeat peptide libs to peptides from those groups

cut -f 1 summary.tsv | while read i ; do sed -n -e '/^NM\s*'${i}'$/,/\/\// p' Dfam.embl | grep "^OS" | cut -f2; done > hg38_TE_phlyo.txt

#copy and paste output to grep,reduces peptide library significantly, prevents aligning proteins from random creatures to human

grep ">" built_cons.fa | cut -f 2 -d">" | while read i; do grep -A1 "$i" HumanRepeatPeps.lib;done > Human_built_cons_peps.lib

sed -zi "s/--\n//g" Human_built_cons_peps.lib

#find peptides in consensuses

tblastn -db built_cons.fa -query Human_built_cons_peps.lib -outfmt 6 > built_consensus_annotations.tab

cat built_consensus_annotations.tab | sort | uniq | sort -k 2,12 -n -r > temp.tab

#take the highest scoring gag and pol, (since all built consesnsus are L1s); Also impose filter that peptide score must also be above 50

grep ">" built_cons.fa | cut -f 2 -d">" | while read i; do grep -P "\t$i\t" temp.tab | grep "gag" | sort -k 12 -n -r | head -n+1 | awk '($12> 50) {print $2"\t"$9-1"\t"$10"\t"$1"\t"$12"\t+"}';done | sort | uniq > cons_gag.bed

grep ">" built_cons.fa | cut -f 2 -d">" | while read i; do grep -P "\t$i\t" temp.tab | grep "pol" | sort -k 12 -n -r | head -n+1 | awk '($12> 50) {print $2"\t"$9-1"\t"$10"\t"$1"\t"$12"\t+"}';done | sort | uniq > cons_pol.bed

mv *.bed ../

cd ..

#combine and change score to 0 (avoid issues with decimals etc that make browser unhappy)

cat Dfam_annotations.bed cons_gag.bed cons_pol.bed HERV_full.bed| awk '{print $1"\t"$2"\t"$3"\t"$4"\t"0"\t"$6}' | sed "s/L1PBA1/L1PBa1/g" | bedtools sort > ORFs.bed

bedToBigBed ORFs.bed ../hg38reps/hg38reps.sizes ORFs.bb

mv ORFs.bb ../hg38reps/orfs.bb

References

The Dfam database of repetitive DNA families, Robert Hubley; Robert D. Finn; Jody Clements; Sean R. Eddy; Thomas A. Jones; Weidong Bao; Arian F.A. Smit; Travis J. Wheeler;Nucleic Acids Research (2016) Database Issue 44:D81-89. doi: 10.1093/nar/gkv1272

Email:jferna10@ucsc.edu or max@soe.ucsc.edu