Synthetic GFF Dataset for Genome Browser Benchmark

I deployed a Gbrowse/Chado installation last week at Dow Agrosciences.  It got me thinking about how slow and basic the searches are with the Bio::DB::Das::Chado* adaptor, and wouldn’t it be nice to use SOLR here?

I made up a test dataset of gene/mRNA/exon 3-tiered feature groups by permuting some gene model data from the knownGene annotation set of the Hg18 build of the human genome.  You can grab the data set and script used to generate it here.  There are several files mRNA.EN.txt.gz that contain gzipped gene models, where N=3..7 indicates there are 10^N models in the file, uniformly distributed across a 500-megabase reference sequence.

I’m planning to load these data into a couple of different systems and then compare performance on some of the typical Bio::DB::GFF API calls.  I can personally test on:

  • Chado
  • The default Bio::DB::GFF schema (does it have a name?)
  • The SOLR backend I’m about to implement

I know there are other feature DBs out there.  It would be good to include them as well in a later pass or to have someone else contribute the data once I get the benchmarking script written.