<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Allen Day's Blog &#187; Genomics</title>
	<atom:link href="http://www.spicylogic.com/allenday/blog/category/science/genomics/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.spicylogic.com/allenday/blog</link>
	<description>♥data♥</description>
	<lastBuildDate>Mon, 21 Jun 2010 23:28:18 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Synthetic GFF Dataset for Genome Browser Benchmark</title>
		<link>http://www.spicylogic.com/allenday/blog/2009/04/07/synthetic-gff-dataset-for-genome-browser-benchmark/</link>
		<comments>http://www.spicylogic.com/allenday/blog/2009/04/07/synthetic-gff-dataset-for-genome-browser-benchmark/#comments</comments>
		<pubDate>Tue, 07 Apr 2009 08:01:52 +0000</pubDate>
		<dc:creator>allenday</dc:creator>
				<category><![CDATA[Genomics]]></category>
		<category><![CDATA[Informatics]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Perl]]></category>
		<category><![CDATA[Scalability]]></category>
		<category><![CDATA[Science]]></category>

		<guid isPermaLink="false">http://www.spicylogic.com/allenday/blog/2009/04/07/synthetic-gff-dataset-for-genome-browser-benchmark/</guid>
		<description><![CDATA[I deployed a GBrowse/Chado installation last week at Dow Agrosciences.  It got me thinking about how slow and basic the searches are with the Bio::DB::Das::Chado* adaptor.  Wouldn&#8217;t it be nice to use SOLR here?
I made up a test dataset of gene/mRNA/exon 3-tiered feature groups by permuting some gene model data from the knownGene annotation [...]]]></description>
			<content:encoded><![CDATA[<p>I deployed a GBrowse/Chado installation last week at <a href="http://www.dowagro.com/">Dow Agrosciences</a>.  It got me thinking about how slow and basic the searches are with the Bio::DB::Das::Chado* adaptor.  Wouldn&#8217;t it be nice to use <a href="http://lucene.apache.org/solr/">SOLR</a> here?</p>
<p>I made up a test dataset of three-tiered gene/mRNA/exon feature groups by permuting some gene model data from the <a href="http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/">knownGene annotation set</a> of the hg18 build of the human genome.  You can grab the dataset and the script used to generate it <a href="http://www.spicylogic.com/allenday/images/knownGene/">here</a>.  The files are named mRNA.E<strong>N</strong>.txt.gz, where <strong>N</strong>=3..7 means the file contains 10^<strong>N</strong> gzipped gene models, uniformly distributed across a 500-megabase reference sequence.</p>
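<p>To make the shape of the data concrete, here is a minimal Python sketch of one way to lay out gene/mRNA/exon groups uniformly across a 500-megabase reference.  It is not the script linked above: it does not permute knownGene models, and the reference name, the fixed exon and intron lengths, the feature IDs and the GFF3 output format are all assumptions made for illustration.</p>

```python
import random

REFERENCE_LENGTH = 500_000_000  # 500-megabase reference, as described in the post
SEQ_ID = "chrSynthetic"         # hypothetical reference sequence name


def gff_line(feature_type, start, end, strand, attributes):
    """Format one nine-column GFF3 feature line."""
    columns = [SEQ_ID, "synthetic", feature_type, str(start), str(end),
               ".", strand, ".", attributes]
    return "\t".join(columns)


def make_gene_model(index, rng, exon_count=3, exon_length=200, intron_length=1000):
    """Return GFF3 lines for one gene/mRNA/exon group at a uniform random position.

    The fixed exon and intron lengths are illustrative assumptions only.
    """
    span = exon_count * exon_length + (exon_count - 1) * intron_length
    start = rng.randint(1, REFERENCE_LENGTH - span + 1)  # keeps the model on the reference
    end = start + span - 1
    strand = rng.choice("+-")
    gene_id = f"gene{index}"
    mrna_id = f"mRNA{index}"
    lines = [
        gff_line("gene", start, end, strand, f"ID={gene_id}"),
        gff_line("mRNA", start, end, strand, f"ID={mrna_id};Parent={gene_id}"),
    ]
    for i in range(exon_count):
        exon_start = start + i * (exon_length + intron_length)
        exon_end = exon_start + exon_length - 1
        attributes = f"ID={mrna_id}.exon{i + 1};Parent={mrna_id}"
        lines.append(gff_line("exon", exon_start, exon_end, strand, attributes))
    return lines


def generate(model_count, seed=0):
    """Return a GFF3 document, as a list of lines, holding model_count gene models."""
    rng = random.Random(seed)  # seeded so a benchmark input can be regenerated
    lines = ["##gff-version 3"]
    for index in range(model_count):
        lines.extend(make_gene_model(index, rng))
    return lines
```

<p>Calling <code>generate(10 ** 3)</code> would give a file on the scale of the smallest dataset; each model contributes one gene line, one mRNA line and three exon lines.</p>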
<p>I&#8217;m planning to load these data into a few different systems and then compare their performance on some typical Bio::DB::GFF API calls.  I can personally test on:</p>
<ul>
<li>Chado</li>
<li>The default Bio::DB::GFF schema (does it have a name?)</li>
<li>The SOLR backend I&#8217;m about to implement</li>
</ul>
<p>I know there are other feature DBs out there.  It would be good to include them in a later pass, or to have someone else contribute the data for them once I get the benchmarking script written.</p>
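<p>As a rough idea of what such a benchmarking script might look like, here is a small Python timing harness.  Everything in it is hypothetical: each backend is assumed to be a plain callable that takes a sequence name, a start and an end, and the in-memory backend is only a stand-in for real adaptors that would query Chado, the Bio::DB::GFF schema or SOLR.</p>

```python
import time


def time_calls(query, regions, repeats=3):
    """Return the best wall-clock time, in seconds, for querying every region."""
    best = float("inf")
    for _ in range(repeats):
        started = time.perf_counter()
        for seq_id, start, end in regions:
            query(seq_id, start, end)
        best = min(best, time.perf_counter() - started)
    return best


def benchmark(backends, regions, repeats=3):
    """Time each named backend on the same list of (seq_id, start, end) regions."""
    return {name: time_calls(query, regions, repeats)
            for name, query in backends.items()}


def make_memory_backend(features):
    """Hypothetical stand-in backend: a linear scan over (seq_id, start, end) tuples."""
    def query(seq_id, start, end):
        # Two closed intervals overlap when the smaller end reaches the larger start.
        return [f for f in features
                if f[0] == seq_id and min(f[2], end) >= max(f[1], start)]
    return query
```

<p>Taking the best of several repeats is one common way to reduce noise from caching and other processes; averaging would be an equally reasonable choice.</p>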
]]></content:encoded>
			<wfw:commentRss>http://www.spicylogic.com/allenday/blog/2009/04/07/synthetic-gff-dataset-for-genome-browser-benchmark/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
