Index of /EvidentialGene/arthropods/rnasets_srapublic
Name Last modified Size
sra_rnaseq2_201403.arpods.listan 30-Mar-2014 22:06 2k
sra_rnaseq2_201403.csv 30-Mar-2014 12:11 9.2M
sra_rnaseq2_201403.flies.listan 30-Mar-2014 22:11 12k
sra_rnaseq2_201403.insects.listan 30-Mar-2014 22:10 8k
sra_rnaseq2_201403.pe100m.listan 30-Mar-2014 16:49 32k
sra_rnaseq2_201403.readme.txt 05-Apr-2014 12:39 4k
sra_rnaseq2_201403.readme2.txt 06-Aug-2014 20:27 4k
sra_rnaseq2_201408.csv 06-Aug-2014 20:08 20.4M
sra_rnaseq2_201408.pe100m.listan 06-Aug-2014 20:22 41k
sra_rnaseq2_201411.csv 17-Nov-2014 15:38 19.7M
sra_rnaseq2_201411.pe100m.listan 17-Nov-2014 15:50 48k
sra_rnaseq2_201411.readme2.txt 18-Jan-2016 20:41 6k
sra_rnaseq2_201506.csv 09-Jun-2015 13:34 27.6M
sra_rnaseq2_201506.pe100m.listan 09-Jun-2015 14:17 64k
sra_rnaseq2_201506.readme2.txt 09-Jun-2015 14:27 6k
sra_rnaseq2_201509.csv 15-Sep-2015 15:24 24.9M
sra_rnaseq2_201509.pe100m.listan 15-Sep-2015 15:35 70k
sra_rnaseq2_201509.readme2.txt 17-Sep-2015 13:54 2k
sra_rnaseq2_201601.csv 18-Jan-2016 20:45 35.1M
sra_rnaseq2_201601.pe100m.listan 18-Jan-2016 20:55 85k
sra_rnaseq2_201601.readme2.txt 18-Jan-2016 21:24 3k
sra_rnaseq2_201609.readme.txt 03-Sep-2016 14:19 1k
sra_rnaseq2_201701.pe100m.list 20-Dec-2017 13:15 88k
sra_rnaseq2_201712.pe100m.list 20-Dec-2017 13:10 101k
sra_rnaseq2_201712.readme.txt 19-Dec-2017 15:04 2k
sra_rnaseq2_201906.pe100m.list 25-Jun-2019 15:55 481k
sra_rnaseq2_201906.readme.txt 25-Jun-2019 16:03 3k
sra_rnaseq2_201906pb.pe100m.list 25-Jun-2019 15:53 6k
Publicly available RNA-Seq data from NCBI SRA, 2014.03
collected by Don Gilbert, gilbertd At indiana.edu, EvidentialGene at euGenes.org
These data sets are suitable for assembling complete species gene sets. Some of these
arthropod species lack public gene sets, or have only fragmented, low-quality ones.
Assembling these data into good-quality gene sets would be interesting and valuable, and
the EvidentialGene pipeline is now ready to do so.
Please see the subset lists sra_rnaseq*.arpods.listan, *.insects.listan and *.flies.listan:
arpods = arthropods other than insects, insects = insects other than Diptera, flies = dipterans (Drosophila, etc.)
Table columns are PairSpots=number of read pairs (100 M minimum in these tables), nSets=number of experiment sets,
Mbases=megabases of RNA-seq, Species, SpeciesInfo=clade,taxid,common name,taxon lineage
Potential collaborators or biologists with interest in species in these RNA sets should
contact Don about this.
Gene set completeness for mRNA-genes and Genome-genes of Ticks
Human genes found (n=16631)
geneset      hit%   alnh   alnt   Gene set method, species
................................................................
ixodes.evg   95.7   434    415    mRNA-assembly, deer tick (2014.04 rough draft)
ztick.evg    91.4   416    380    mRNA-assembly, zebra tick
ixodes.gno   89.5   364    326    genome-predict, deer tick
tetur.gno    83.2   399    332    genome-predict, spider mite
................................................................
hit%= percent of ref genes found
alnh= alignment average, for hit genes
alnt= alignment average, for all ref genes
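In the table above, alnt is close to alnh x hit% / 100 in every row, which is what one would expect if reference genes that were not found count as zero alignment in the all-gene average. A minimal Python sketch of that arithmetic follows; the zero-for-missed-genes rule and the function name are assumptions made for illustration, not taken from the EvidentialGene code.

```python
# Rows copied from the completeness table: (geneset, hit%, alnh, alnt).
TABLE = [
    ("ixodes.evg", 95.7, 434, 415),
    ("ztick.evg", 91.4, 416, 380),
    ("ixodes.gno", 89.5, 364, 326),
    ("tetur.gno", 83.2, 399, 332),
]


def alnt_from_hits(hit_pct, alnh):
    """Average alignment over all reference genes.

    ASSUMPTION: reference genes that were not found contribute 0,
    so the all-gene average is the hit-gene average scaled by hit%.
    """
    return alnh * hit_pct / 100.0


# Print the estimate next to the alnt value reported in the table.
for geneset, hit_pct, alnh, alnt in TABLE:
    print(geneset, round(alnt_from_hits(hit_pct, alnh)), alnt)
```

With this reading, hit% and alnh together determine alnt, so alnt mainly serves as a single combined score for ranking gene sets.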
#...............................
Tables from http://www.ncbi.nlm.nih.gov/sra
query=
(("biomol transcriptomic"[Properties]) AND "platform illumina"[Properties]) AND "library layout paired"[Properties]
Taxonomic Groups n=25171 public set
  eukaryotes (23288)
    animals (17892)
      chordates (14723)
      arthropods (2237)
      nematodes (541)
      more... (391)
    green plants (3899)
      land plants (3705)
      more... (194)
    fungi (1012)
    apicomplexans (208)
    ciliates (46)
    more... (231)
  bacteria (1308)
  unclassified (553)
  viruses (17)
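The counts above come from the SRA web page. The same query string could also be sent to the NCBI E-utilities esearch service with db=sra; that is an alternative route, not the one used to make these tables. A minimal Python sketch that only builds the request URL (no network call is made; retmax=0 asks for the hit count without returning record ids, and the function name is illustrative):

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query string copied from the text above.
QUERY = (
    '(("biomol transcriptomic"[Properties]) AND "platform illumina"[Properties]) '
    'AND "library layout paired"[Properties]'
)


def esearch_url(term, db="sra", retmax=0):
    """Return an esearch URL for the given Entrez query term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


print(esearch_url(QUERY))
```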
#...................
The output file sra_rnaseq2_201403.csv contains all of the above.
cat sra_rnaseq2_201403.csv | perl -ne \
'chomp; s/^"//; s/"$//; @v=split"\",\""; if($v[0]=~/Experiment/) { @hd=@v; next; }
($sr,$sp,$mb,$nr,$ns,$libsel)=@v[0,2,9,10,11,17];
$sp=~s/ sp\..*$//; $mb{$sp}+=$mb; $nr{$sp}+=$nr; $ns{$sp}+=$ns;
END{ print join("\t",qw(PairSpots nSets Mbases Species))."\n";
for $s (sort{ $ns{$b}<=>$ns{$a} or $a cmp $b } keys %ns) {
($ns,$nr,$mb)=($ns{$s},$nr{$s},$mb{$s}); $mb=int($mb);
print join("\t",$ns,$nr,$mb,$s)."\n" if($ns>99999999); } }' \
> sra_rnaseq2_201403.pe100m.list
# n=580 at 100M+ spots, includes bacteria
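For readers who prefer Python, the per-species tally done by the Perl one-liner above can be sketched as follows. It uses the same column positions (0, 2, 9, 10, 11), the same " sp." trimming and the same 100M spot cutoff. The function name is illustrative, and parsing with the csv module instead of splitting on "," is a substitution.

```python
import csv
import io
import re
from collections import defaultdict


def species_totals(csv_text, min_spots=100_000_000):
    """Sum size, set count and spots per species; keep species at or above min_spots.

    Returns rows of (PairSpots, nSets, Mbases, Species), largest spot count first.
    """
    mbases = defaultdict(float)
    nsets = defaultdict(int)
    spots = defaultdict(int)
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip short lines and the header row (first field names the Experiment column).
        if len(row) < 12 or "Experiment" in row[0]:
            continue
        species = re.sub(r" sp\..*$", "", row[2])  # "Genus sp. XYZ" -> "Genus"
        mbases[species] += float(row[9] or 0)
        nsets[species] += int(row[10] or 0)
        spots[species] += int(row[11] or 0)
    rows = [
        (spots[s], nsets[s], int(mbases[s]), s)
        for s in spots
        if spots[s] >= min_spots
    ]
    rows.sort(key=lambda r: (-r[0], r[3]))
    return rows


# Usage (file name from the text above):
#   with open("sra_rnaseq2_201403.csv") as fh:
#       for r in species_totals(fh.read()):
#           print("\t".join(map(str, r)))
```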
# Run the species names through NCBI Taxonomy CommonTree to get taxids, then run the taxids through Batch Entrez to get taxres.xml:
0. Cut the species names from sra.pe100m.list into spp.list.
1. spp.list > http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi > taxid.list, commontree.txt
2. taxid.list > http://www.ncbi.nlm.nih.gov/sites/batchentrez + taxonomy > taxres.xml
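Step 0 above is a plain column cut. A small Python sketch, assuming the four tab-separated columns written by the first script (PairSpots, nSets, Mbases, Species); the function name is illustrative. Steps 1 and 2 are web form uploads and are not covered here.

```python
def species_names(list_text):
    """Return the Species column from pe100m.list text, skipping the header line."""
    names = []
    for line in list_text.splitlines():
        cols = line.split("\t")
        # Data rows start with a numeric PairSpots value; the header does not.
        if len(cols) >= 4 and cols[0].isdigit():
            names.append(cols[3])
    return names


# Usage (file names as in the steps above):
#   with open("sra_rnaseq2_201403.pe100m.list") as fh:
#       names = species_names(fh.read())
#   with open("spp.list", "w") as out:
#       out.write("\n".join(names) + "\n")
```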
cat sra_rnaseq2_201403.taxres.xml | perl -ne \
'if(/^.Taxon\b/) { $nt++; $in=1; } elsif(/^\s+.Taxon\b/) { $in++; } elsif(m,^./Taxon\b,) { $in=0; }
elsif($in==1 and /<(TaxId|ScientificName|GenbankCommonName|CommonName|Division|Lineage)>([^\<]+)/) {
($t,$v)=($1,$2); $tid=$v if($t eq "TaxId"); $tv{$tid}{$t}.="$v,"; $tn{$t}++; }
END{ @tn= qw(TaxId Division ScientificName GenbankCommonName CommonName Lineage );
print join("\t",@tn)."\n"; $k="Division"; $j="Lineage";
foreach $t (sort{ $tv{$a}{$k} cmp $tv{$b}{$k} or $tv{$a}{$j} cmp $tv{$b}{$j} or $a <=> $b} keys %tv) {
@v=@{$tv{$t}}{@tn}; map{ s/,$// } @v; print join("\t",@v)."\n"; } }' \
> sra_rnaseq2_201403.taxres.tab
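The Perl above relies on line indentation to tell top-level Taxon records from the nested ancestor records. As an alternative, the same table could be built with an XML parser. A hedged Python sketch follows; the TaxaSet/Taxon/OtherNames/LineageEx layout and the sample record are assumptions made for illustration, so check them against the real taxres.xml before use.

```python
import xml.etree.ElementTree as ET

# Same output columns, in the same order, as the Perl script above.
FIELDS = ["TaxId", "Division", "ScientificName",
          "GenbankCommonName", "CommonName", "Lineage"]

# Made-up miniature record, for illustration only.
SAMPLE_XML = """<TaxaSet>
<Taxon>
  <TaxId>6945</TaxId>
  <ScientificName>Ixodes scapularis</ScientificName>
  <OtherNames>
    <GenbankCommonName>black-legged tick</GenbankCommonName>
    <CommonName>deer tick</CommonName>
  </OtherNames>
  <Division>Invertebrates</Division>
  <Lineage>cellular organisms; Eukaryota; Metazoa</Lineage>
  <LineageEx>
    <Taxon><TaxId>2759</TaxId><ScientificName>Eukaryota</ScientificName></Taxon>
  </LineageEx>
</Taxon>
</TaxaSet>"""


def field(taxon, tag):
    """Comma-join all values of tag in one record, ignoring the LineageEx ancestors."""
    values = []
    for child in taxon:
        if child.tag == "LineageEx":  # nested ancestor Taxon records; skip them
            continue
        values += [e.text.strip() for e in child.iter(tag) if e.text]
    return ",".join(values)


def taxon_table(xml_text):
    """Return one row per top-level Taxon, sorted by Division, Lineage, TaxId."""
    root = ET.fromstring(xml_text)
    rows = [[field(t, f) for f in FIELDS] for t in root.findall("Taxon")]
    rows.sort(key=lambda r: (r[1], r[5], int(r[0])))
    return rows


for row in taxon_table(SAMPLE_XML):
    print("\t".join(row))
```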
cat sra_rnaseq2_201403.taxres.tab sra_rnaseq2_201403.pe100m.list | perl -ne \
'if(/; Eukaryota/) { chomp; ($tx,$dv,$sp,$cg,$cn,$ln)=split"\t"; $spv{$sp}="$dv,tx$tx,$cg"; }
elsif(/^\d/) { ($np,$ns,$mb,$sp)=split"\t"; chomp($sp); if($sv=$spv{$sp}) { s/$/\t$sv/; print; } }
else { print; } ' \
> sra_rnaseq2_201403.pe100m.listan
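The join above appends "division,tx<taxid>,common name" to each list row whose species has a eukaryote lineage, drops numeric rows with no such match (for example bacteria), and passes other lines through. A Python sketch of the same logic, with illustrative function and variable names and made-up sample rows:

```python
def annotate(tab_rows, list_text):
    """Append species info to pe100m.list rows, keeping only eukaryote matches.

    tab_rows: rows of (TaxId, Division, ScientificName, GenbankCommonName,
              CommonName, Lineage), as in the taxres.tab table.
    """
    info = {}
    for taxid, division, species, gb_common, _common, lineage in tab_rows:
        if "; Eukaryota" in lineage:
            info[species] = f"{division},tx{taxid},{gb_common}"

    out = []
    for line in list_text.splitlines():
        if line[:1].isdigit():  # data row: PairSpots, nSets, Mbases, Species
            cols = line.split("\t")
            extra = info.get(cols[3]) if len(cols) > 3 else None
            if extra:
                out.append(line + "\t" + extra)
        else:  # header or comment line: pass through unchanged
            out.append(line)
    return out


# Made-up sample input, for illustration only.
SAMPLE_TAB = [
    ["6945", "Invertebrates", "Ixodes scapularis", "black-legged tick",
     "deer tick", "cellular organisms; Eukaryota; Metazoa"],
    ["562", "Bacteria", "Escherichia coli", "", "E. coli",
     "cellular organisms; Bacteria"],
]
SAMPLE_LIST = (
    "PairSpots\tnSets\tMbases\tSpecies\n"
    "110000000\t3\t1500\tIxodes scapularis\n"
    "200000000\t5\t9000\tEscherichia coli\n"
)

for line in annotate(SAMPLE_TAB, SAMPLE_LIST):
    print(line)
```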
#....................................