<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Allen Day's Blog &#187; Informatics</title>
	<atom:link href="http://www.spicylogic.com/allenday/blog/category/science/informatics/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.spicylogic.com/allenday/blog</link>
	<description>♥data♥</description>
	<lastBuildDate>Mon, 21 Jun 2010 23:28:18 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Synthetic GFF Dataset for Genome Browser Benchmark</title>
		<link>http://www.spicylogic.com/allenday/blog/2009/04/07/synthetic-gff-dataset-for-genome-browser-benchmark/</link>
		<comments>http://www.spicylogic.com/allenday/blog/2009/04/07/synthetic-gff-dataset-for-genome-browser-benchmark/#comments</comments>
		<pubDate>Tue, 07 Apr 2009 08:01:52 +0000</pubDate>
		<dc:creator>allenday</dc:creator>
				<category><![CDATA[Genomics]]></category>
		<category><![CDATA[Informatics]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Perl]]></category>
		<category><![CDATA[Scalability]]></category>
		<category><![CDATA[Science]]></category>

		<guid isPermaLink="false">http://www.spicylogic.com/allenday/blog/2009/04/07/synthetic-gff-dataset-for-genome-browser-benchmark/</guid>
		<description><![CDATA[I deployed a Gbrowse/Chado installation last week at Dow Agrosciences.  It got me thinking about how slow and basic the searches are with the Bio::DB::Das::Chado* adaptor, and wouldn&#8217;t it be nice to use SOLR here?
I made up a test dataset of gene/mRNA/exon 3-tiered feature groups by permuting some gene model data from the knownGene annotation [...]]]></description>
			<content:encoded><![CDATA[<p>I deployed a Gbrowse/Chado installation last week at <a href="http://www.dowagro.com/">Dow Agrosciences</a>.  It got me thinking about how slow and basic the searches are with the Bio::DB::Das::Chado* adaptor, and wouldn&#8217;t it be nice to use <a href="http://lucene.apache.org/solr/">SOLR</a> here?</p>
<p>I made up a test dataset of gene/mRNA/exon 3-tiered feature groups by permuting some gene model data from the <a href="http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/">knownGene annotation set</a> of the Hg18 build of the human genome.  You can grab the data set and the script used to generate it <a href="http://www.spicylogic.com/allenday/images/knownGene/">here</a>.  The files are named mRNA.E<strong>N</strong>.txt.gz and contain gzipped gene models, where <strong>N</strong>=3..7 indicates that the file holds 10^<strong>N</strong> models, uniformly distributed across a 500-megabase reference sequence.</p>
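<p>To give a feel for what "uniformly distributed across a 500-megabase reference" means, here is a minimal Python sketch.  It is not the script linked above (that one permutes real knownGene models); it just places fixed-length gene features at random.  The sequence name "ref1", the source tag "synthetic", the 10 kb model length, and the function names are all made up for illustration, and the mRNA/exon tiers are left out.</p>

```python
import random

REFERENCE_LENGTH = 500_000_000  # 500-megabase reference, as described in the post


def synthetic_models(n_models, model_length=10_000, seed=0):
    """Yield (start, end) pairs placed uniformly at random on the reference.

    model_length is a hypothetical fixed size; real gene models vary.
    """
    rng = random.Random(seed)
    for _ in range(n_models):
        # 1-based inclusive coordinates, GFF style; keep the model on the reference
        start = rng.randint(1, REFERENCE_LENGTH - model_length + 1)
        yield start, start + model_length - 1


def gff_line(index, start, end, seqid="ref1"):
    """Format one gene feature as a tab-separated, nine-column GFF3 line."""
    attributes = "ID=gene{0}".format(index)
    columns = [seqid, "synthetic", "gene", str(start), str(end), ".", "+", ".", attributes]
    return "\t".join(columns)


# N=3 corresponds to the smallest file size mentioned above (10^3 models)
models = list(synthetic_models(10 ** 3))
lines = [gff_line(i, s, e) for i, (s, e) in enumerate(models)]
print(lines[0])
```

<p>Seeding the generator keeps the output reproducible, which matters if several backends are supposed to load exactly the same features.</p>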
<p>I&#8217;m planning to load these data into a couple of different systems and then compare performance on some of the typical Bio::DB::GFF API calls.  I can personally test on:</p>
<ul>
<li>Chado</li>
<li>The default Bio::DB::GFF schema (does it have a name?)</li>
<li>The SOLR backend I&#8217;m about to implement</li>
</ul>
<p>I know there are other feature DBs out there.  It would be good to include them as well in a later pass or to have someone else contribute the data once I get the benchmarking script written.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.spicylogic.com/allenday/blog/2009/04/07/synthetic-gff-dataset-for-genome-browser-benchmark/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Upcoming AI / Machine Learning Conferences</title>
		<link>http://www.spicylogic.com/allenday/blog/2008/12/05/upcoming-ai-machine-learning-conferences/</link>
		<comments>http://www.spicylogic.com/allenday/blog/2008/12/05/upcoming-ai-machine-learning-conferences/#comments</comments>
		<pubDate>Fri, 05 Dec 2008 19:49:13 +0000</pubDate>
		<dc:creator>allenday</dc:creator>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Informatics]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Networking]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.spicylogic.com/allenday/blog/2008/12/05/upcoming-ai-machine-learning-conferences/</guid>
		<description><![CDATA[A (partial) list I found today.  Doesn&#8217;t include NIPS, so I&#8217;m not sure how exhaustive it is, but it has a bunch I haven&#8217;t seen before.
http://www.kmining.com/info_conferences.html
]]></description>
			<content:encoded><![CDATA[<p>A (partial) list I found today.  Doesn&#8217;t include NIPS, so I&#8217;m not sure how exhaustive it is, but it has a bunch I haven&#8217;t seen before.</p>
<p><a href="http://www.kmining.com/info_conferences.html">http://www.kmining.com/info_conferences.html</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.spicylogic.com/allenday/blog/2008/12/05/upcoming-ai-machine-learning-conferences/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>iPhone 2.0 User-Agent string, other iPhone/iPod data</title>
		<link>http://www.spicylogic.com/allenday/blog/2008/07/04/iphone-20-user-agent-string-other-iphoneipod-data/</link>
		<comments>http://www.spicylogic.com/allenday/blog/2008/07/04/iphone-20-user-agent-string-other-iphoneipod-data/#comments</comments>
		<pubDate>Sat, 05 Jul 2008 01:53:10 +0000</pubDate>
		<dc:creator>allenday</dc:creator>
				<category><![CDATA[Informatics]]></category>
		<category><![CDATA[Mobile]]></category>

		<guid isPermaLink="false">http://www.spicylogic.com/allenday/blog/?p=39</guid>
		<description><![CDATA[I was preparing a report on iPhone locales from some web server logs, and noticed a few oddities.  Some of the hits appear to be coming from the new 3G iPhone 2.0, check out the User-Agent strings:

# observed from 1 metrocast.net (NY) IP
Mozilla/5.0 &#40;iPod; U; iPhone OS 2_0 like Mac OS X; en-us&#41; AppleWebKit/525.17 [...]]]></description>
			<content:encoded><![CDATA[<p>I was preparing a report on iPhone locales from some web server logs, and noticed a few oddities.  Some of the hits appear to be coming from the new 3G iPhone 2.0.  Check out the User-Agent strings:</p>

<div class="wp_syntax"><div class="code"><pre class="bash"><span style="color: #808080; font-style: italic;"># observed from 1 metrocast.net (NY) IP</span>
Mozilla<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">5.0</span> <span style="color: #7a0874; font-weight: bold;">&#40;</span>iPod; U; iPhone OS 2_0 like Mac OS X; en-us<span style="color: #7a0874; font-weight: bold;">&#41;</span> AppleWebKit<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">525.17</span> <span style="color: #7a0874; font-weight: bold;">&#40;</span>KHTML, like Gecko<span style="color: #7a0874; font-weight: bold;">&#41;</span> Version<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">3.1</span> Mobile<span style="color: #000000; font-weight: bold;">/</span>5A240d Safari<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">5525.7</span>
<span style="color: #808080; font-style: italic;"># observed from 1 optonline.net (NY) IP</span>
Mozilla<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">5.0</span> <span style="color: #7a0874; font-weight: bold;">&#40;</span>iPhone Simulator; U; CPU iPhone OS 2_0 like Mac OS X; en-us<span style="color: #7a0874; font-weight: bold;">&#41;</span> AppleWebKit<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">525.18</span><span style="color: #000000;">.1</span> <span style="color: #7a0874; font-weight: bold;">&#40;</span>KHTML, like Gecko<span style="color: #7a0874; font-weight: bold;">&#41;</span> Version<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">3.1</span><span style="color: #000000;">.1</span> Mobile<span style="color: #000000; font-weight: bold;">/</span>5A345 Safari<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">525.20</span></pre></div></div>

<p>The former is confirmed to be an <a href="http://forums.macrumors.com/showthread.php?t=471274">iPhone 2.0 User-Agent string</a> on the MacRumors Forums.</p>
<p>Other unusual/rare iPhone/iPod User-Agent/UA strings:</p>

<div class="wp_syntax"><div class="code"><pre class="bash">Mozilla<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">5.0</span> <span style="color: #7a0874; font-weight: bold;">&#40;</span>iPhone; U; CPU like Mac OS X; en<span style="color: #7a0874; font-weight: bold;">&#41;</span> AppleWebKit<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">420.1</span> <span style="color: #7a0874; font-weight: bold;">&#40;</span>KHTML, like Gecko<span style="color: #7a0874; font-weight: bold;">&#41;</span> Version<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">3.0</span> Mobile<span style="color: #000000; font-weight: bold;">/</span>4A102 Safari<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">419</span> <span style="color: #7a0874; font-weight: bold;">&#40;</span>United States<span style="color: #7a0874; font-weight: bold;">&#41;</span>
Mozilla<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">5.0</span> <span style="color: #7a0874; font-weight: bold;">&#40;</span>Windows; U; Windows NT <span style="color: #000000;">5.1</span>; en-US; rv:<span style="color: #000000;">1.9</span><span style="color: #7a0874; font-weight: bold;">&#41;</span> Gecko<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">2008052906</span> Mozilla<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">5.0</span> <span style="color: #7a0874; font-weight: bold;">&#40;</span>iPhone; U; CPU like Mac OS X; en<span style="color: #7a0874; font-weight: bold;">&#41;</span> AppleWebKit<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">420</span>+ <span style="color: #7a0874; font-weight: bold;">&#40;</span>KHTML, like Gecko<span style="color: #7a0874; font-weight: bold;">&#41;</span> Version<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">3.0</span> Mobile<span style="color: #000000; font-weight: bold;">/</span>1A543 Safari<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">419.3</span>
Mozilla<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">5.0</span> <span style="color: #7a0874; font-weight: bold;">&#40;</span>iPhone; U; CPU like Mac OS X; en<span style="color: #7a0874; font-weight: bold;">&#41;</span> AppleWebKit<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">420.1</span> <span style="color: #7a0874; font-weight: bold;">&#40;</span>KHTML, like Gecko<span style="color: #7a0874; font-weight: bold;">&#41;</span> Cydia<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">1.0</span><span style="color: #000000;">.2460</span><span style="color: #000000;">-59</span></pre></div></div>

<p><b>Update July 11</b>.  iPhone 2.0 is out, and the UA is (note the Safari revision increment from the earlier pre-launch UA):</p>

<div class="wp_syntax"><div class="code"><pre class="bash">Mozilla<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">5.0</span> <span style="color: #7a0874; font-weight: bold;">&#40;</span>iPhone; U; CPU iPhone OS 2_0 like Mac OS X; en-us<span style="color: #7a0874; font-weight: bold;">&#41;</span> AppleWebKit<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">525.18</span><span style="color: #000000;">.1</span> <span style="color: #7a0874; font-weight: bold;">&#40;</span>KHTML, like Gecko<span style="color: #7a0874; font-weight: bold;">&#41;</span> Version<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">3.1</span><span style="color: #000000;">.1</span> Mobile<span style="color: #000000; font-weight: bold;">/</span>5A345 Safari<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">525.20</span></pre></div></div>

<p>The UA for an iPod with the iPhone 2.0 software update is:</p>

<div class="wp_syntax"><div class="code"><pre class="bash">Mozilla<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">5.0</span> <span style="color: #7a0874; font-weight: bold;">&#40;</span>iPod; U; CPU iPhone OS 2_0 like Mac OS X; en-us<span style="color: #7a0874; font-weight: bold;">&#41;</span> AppleWebKit<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">525.18</span><span style="color: #000000;">.1</span> <span style="color: #7a0874; font-weight: bold;">&#40;</span>KHTML, like Gecko<span style="color: #7a0874; font-weight: bold;">&#41;</span> Version<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">3.1</span><span style="color: #000000;">.1</span> Mobile<span style="color: #000000; font-weight: bold;">/</span>5A347 Safari<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">525.20</span></pre></div></div>

<p>Note that the upgraded iPod and the iPhone UAs both contain the string &#8220;iPhone&#8221;, so you may need to update your device-detection logic if you care about discriminating between iPods and iPhones.  It&#8217;s not yet clear to me how to discriminate between an upgraded iPhone 1.0 w/ 2.0 software and a bona fide 3G iPhone 2.0.  Will post more when I figure this out.</p>
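<p>One way to tell the two apart is to look at the first token inside the parenthesised platform section rather than searching the whole string for &#8220;iPhone&#8221;.  The Python sketch below is only an illustration based on the handful of UAs listed in this post; the function name and the regular expression are my own assumptions, not part of any library, and other UA layouts may need different handling.</p>

```python
import re

# Matches the first token inside the parenthesised platform section,
# e.g. "iPod", "iPhone", or "iPhone Simulator".
PLATFORM_RE = re.compile(r"\(([^;)]+);")

KNOWN_PLATFORMS = ("iPod", "iPhone", "iPhone Simulator")


def classify_device(user_agent):
    """Return the leading platform token of the UA if it is a known one, else None."""
    match = PLATFORM_RE.search(user_agent)
    if match is None:
        return None
    platform = match.group(1).strip()
    if platform in KNOWN_PLATFORMS:
        return platform
    return None


# The upgraded-iPod UA quoted above
ipod_ua = (
    "Mozilla/5.0 (iPod; U; CPU iPhone OS 2_0 like Mac OS X; en-us) "
    "AppleWebKit/525.18.1 (KHTML, like Gecko) Version/3.1.1 Mobile/5A347 Safari/525.20"
)
print(classify_device(ipod_ua))
```

<p>Because only the first parenthesised section is inspected, the odd UA above that starts with a Windows/Gecko prefix is not classified as an iPhone by this sketch.</p>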
<p>Know anything else about these?  Leave me a comment!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.spicylogic.com/allenday/blog/2008/07/04/iphone-20-user-agent-string-other-iphoneipod-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Notes on setting up Taste</title>
		<link>http://www.spicylogic.com/allenday/blog/2008/06/30/notes-on-setting-up-taste/</link>
		<comments>http://www.spicylogic.com/allenday/blog/2008/06/30/notes-on-setting-up-taste/#comments</comments>
		<pubDate>Mon, 30 Jun 2008 10:49:55 +0000</pubDate>
		<dc:creator>allenday</dc:creator>
				<category><![CDATA[Informatics]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.spicylogic.com/allenday/blog/2008/06/30/notes-on-setting-up-taste/</guid>
		<description><![CDATA[

Setting up Taste v1.7.2 on a CentOS 4 x86_64 box.
Taste has merged with Mahout now, but I still want to do this standalone b/c I&#8217;m having trouble getting the JUnit tests to pass for Mahout.  With that out of the way&#8230;
These are the shell commands I assembled after following the Taste Demo guide.

#make sure [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.spicylogic.com/allenday/blog/wp-content/uploads/2008/06/taste.png"><img class="alignright size-full wp-image-36" title="hadoop" src="http://www.spicylogic.com/allenday/blog/wp-content/uploads/2008/06/taste.png" alt="" /></a><br />
<a href='http://www.spicylogic.com/allenday/blog/wp-content/uploads/2008/06/mahout-logo-82x100.png'><img src="http://www.spicylogic.com/allenday/blog/wp-content/uploads/2008/06/mahout-logo-82x100.png" alt="" title="mahout-logo-82x100" width="82" height="100" class="alignright size-full wp-image-38" /></a></p>
<p>Setting up Taste v1.7.2 on a CentOS 4 x86_64 box.</p>
<p>Taste has merged with <a href="http://lucene.apache.org/mahout/">Mahout</a> now, but I still want to do this standalone b/c I&#8217;m having trouble getting the JUnit tests to pass for Mahout.  With that out of the way&#8230;</p>
<p>These are the shell commands I assembled after following the <a href="http://taste.sourceforge.net/#demo">Taste Demo guide</a>.</p>

<div class="wp_syntax"><div class="code"><pre class="bash"><span style="color: #808080; font-style: italic;">#make sure you have ant and the JDK.  I don't recommend the stock CentOS packages; get them from Sun/Apache</span>
<span style="color: #808080; font-style: italic;">#download necessary .jar files, sources, data files.  unpack/move them to correct locations.</span>
<span style="color: #c20cb9; font-weight: bold;">wget</span> http:<span style="color: #000000; font-weight: bold;">//</span>internap.dl.sourceforge.net<span style="color: #000000; font-weight: bold;">/</span>sourceforge<span style="color: #000000; font-weight: bold;">/</span>taste<span style="color: #000000; font-weight: bold;">/</span>taste<span style="color: #000000;">-1.7</span><span style="color: #000000;">.2</span>.<span style="color: #c20cb9; font-weight: bold;">zip</span>
<span style="color: #c20cb9; font-weight: bold;">wget</span> http:<span style="color: #000000; font-weight: bold;">//</span>internap.dl.sourceforge.net<span style="color: #000000; font-weight: bold;">/</span>sourceforge<span style="color: #000000; font-weight: bold;">/</span>proguard<span style="color: #000000; font-weight: bold;">/</span>proguard4<span style="color: #000000;">.2</span>.<span style="color: #c20cb9; font-weight: bold;">zip</span>
<span style="color: #c20cb9; font-weight: bold;">wget</span> http:<span style="color: #000000; font-weight: bold;">//</span>www.grouplens.org<span style="color: #000000; font-weight: bold;">/</span>system<span style="color: #000000; font-weight: bold;">/</span>files<span style="color: #000000; font-weight: bold;">/</span>million-ml-data.tar__0.gz
<span style="color: #c20cb9; font-weight: bold;">wget</span> http:<span style="color: #000000; font-weight: bold;">//</span>www.hightechimpact.com<span style="color: #000000; font-weight: bold;">/</span>Apache<span style="color: #000000; font-weight: bold;">/</span>tomcat<span style="color: #000000; font-weight: bold;">/</span>tomcat<span style="color: #000000;">-5</span><span style="color: #000000; font-weight: bold;">/</span>v5<span style="color: #000000;">.5</span><span style="color: #000000;">.26</span><span style="color: #000000; font-weight: bold;">/</span>bin<span style="color: #000000; font-weight: bold;">/</span>apache-tomcat<span style="color: #000000;">-5.5</span><span style="color: #000000;">.26</span>.<span style="color: #c20cb9; font-weight: bold;">tar</span>.gz
<span style="color: #c20cb9; font-weight: bold;">unzip</span> taste<span style="color: #000000;">-1.7</span><span style="color: #000000;">.2</span>.<span style="color: #c20cb9; font-weight: bold;">zip</span>
<span style="color: #c20cb9; font-weight: bold;">unzip</span> proguard4<span style="color: #000000;">.2</span>.<span style="color: #c20cb9; font-weight: bold;">zip</span>
<span style="color: #c20cb9; font-weight: bold;">tar</span> -xvzf million-ml-data.tar__0.gz
<span style="color: #c20cb9; font-weight: bold;">tar</span> -xvzf apache-tomcat<span style="color: #000000;">-5.5</span><span style="color: #000000;">.26</span>.<span style="color: #c20cb9; font-weight: bold;">tar</span>.gz
<span style="color: #c20cb9; font-weight: bold;">cp</span> proguard4<span style="color: #000000;">.2</span><span style="color: #000000; font-weight: bold;">/</span>lib<span style="color: #000000; font-weight: bold;">/</span>proguard.jar lib<span style="color: #000000; font-weight: bold;">/</span>
<span style="color: #c20cb9; font-weight: bold;">mv</span> <span style="color: #7a0874; font-weight: bold;">&#91;</span>mr<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #000000; font-weight: bold;">*</span>.dat src<span style="color: #000000; font-weight: bold;">/</span>example<span style="color: #000000; font-weight: bold;">/</span>com<span style="color: #000000; font-weight: bold;">/</span>planetj<span style="color: #000000; font-weight: bold;">/</span>taste<span style="color: #000000; font-weight: bold;">/</span>example<span style="color: #000000; font-weight: bold;">/</span>grouplens<span style="color: #000000; font-weight: bold;">/</span>
<span style="color: #808080; font-style: italic;">#start up tomcat on port 8080 (default)</span>
<span style="color: #007800;">JAVA_OPTS=</span><span style="color: #ff0000;">&quot;-server -da -dsa -Xms1024m -Xmx1024m&quot;</span> <span style="color: #007800;">JAVA_HOME=</span><span style="color: #000000; font-weight: bold;">/</span>usr<span style="color: #000000; font-weight: bold;">/</span>java<span style="color: #000000; font-weight: bold;">/</span>jdk1<span style="color: #000000;">.6</span>.0_02 <span style="color: #c20cb9; font-weight: bold;">sh</span> apache-tomcat<span style="color: #000000;">-5.5</span><span style="color: #000000;">.26</span><span style="color: #000000; font-weight: bold;">/</span>bin<span style="color: #000000; font-weight: bold;">/</span>startup.<span style="color: #c20cb9; font-weight: bold;">sh</span>
<span style="color: #808080; font-style: italic;">#build taste.war, and inject it into tomcat</span>
<span style="color: #007800;">JDK_HOME=</span><span style="color: #000000; font-weight: bold;">/</span>usr<span style="color: #000000; font-weight: bold;">/</span>java<span style="color: #000000; font-weight: bold;">/</span>jdk1<span style="color: #000000;">.6</span>.0_02 <span style="color: #007800;">JAVA_HOME=</span><span style="color: #000000; font-weight: bold;">/</span>usr<span style="color: #000000; font-weight: bold;">/</span>java<span style="color: #000000; font-weight: bold;">/</span>jdk1<span style="color: #000000;">.6</span>.0_02 ant build-grouplens-example
<span style="color: #c20cb9; font-weight: bold;">cp</span> taste.war apache-tomcat<span style="color: #000000;">-5.5</span><span style="color: #000000;">.26</span><span style="color: #000000; font-weight: bold;">/</span>webapps<span style="color: #000000; font-weight: bold;">/</span>
<span style="color: #808080; font-style: italic;">#test the app.  may take a minute or two on the first query.</span>
<span style="color: #c20cb9; font-weight: bold;">wget</span> -O - -S <span style="color: #ff0000;">'http://localhost:8080/taste/RecommenderServlet?userID=1&amp;debug=true'</span></pre></div></div>

<p>Once you get that working, you can tweak the demo slightly to work on another data set.  You just need to know the grouplens file format.  ratings.dat is of the format:</p>
<pre>UserID::MovieID::Rating::Timestamp</pre>
<p>e.g.</p>
<pre>1::1193::5::978300760</pre>
<p>and movies.dat is of the format:</p>
<pre>MovieID::Title::Genres</pre>
<p>e.g.</p>
<pre>1::Toy Story (1995)::Animation|Children's|Comedy</pre>
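<p>As a rough illustration of the two layouts, here is a small Python sketch that parses and re-emits these records.  The function names are hypothetical, it assumes ratings are whole numbers as in the example line, and it does not handle a title that itself contains &#8220;::&#8221;.</p>

```python
def parse_rating(line):
    """Split a ratings.dat line: UserID::MovieID::Rating::Timestamp."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("::")
    return int(user_id), int(movie_id), int(rating), int(timestamp)


def parse_movie(line):
    """Split a movies.dat line: MovieID::Title::Genres (genres joined by '|')."""
    movie_id, title, genres = line.rstrip("\n").split("::")
    return int(movie_id), title, genres.split("|")


def format_rating(user_id, item_id, rating, timestamp):
    """Write one record back out in the ratings.dat layout."""
    return "::".join(str(field) for field in (user_id, item_id, rating, timestamp))


# The two example lines shown above
print(parse_rating("1::1193::5::978300760"))
print(parse_movie("1::Toy Story (1995)::Animation|Children's|Comedy"))
```

<p>A function like format_rating is all a converter needs in order to write another data source out in the same layout.</p>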
<p>I wrote a script, let&#8217;s call it load_taste.pl, that can generate new movies.dat and ratings.dat files from an alternate data source.  If I make these new files, I can drop them in place of the grouplens data, rebuild the .war files, and make recommendations on this other data set.  Here&#8217;s how to do it:</p>

<div class="wp_syntax"><div class="code"><pre class="bash"><span style="color: #808080; font-style: italic;">#generate ratings.dat and movies.dat.  move them to replace the grouplens data files.</span>
<span style="color: #c20cb9; font-weight: bold;">perl</span> .<span style="color: #000000; font-weight: bold;">/</span>load_taste.pl
<span style="color: #c20cb9; font-weight: bold;">mv</span> <span style="color: #7a0874; font-weight: bold;">&#91;</span>mr<span style="color: #7a0874; font-weight: bold;">&#93;</span><span style="color: #000000; font-weight: bold;">*</span>.dat src<span style="color: #000000; font-weight: bold;">/</span>example<span style="color: #000000; font-weight: bold;">/</span>com<span style="color: #000000; font-weight: bold;">/</span>planetj<span style="color: #000000; font-weight: bold;">/</span>taste<span style="color: #000000; font-weight: bold;">/</span>example<span style="color: #000000; font-weight: bold;">/</span>grouplens<span style="color: #000000; font-weight: bold;">/</span>
<span style="color: #808080; font-style: italic;">#get rid of stale .war and .jar files</span>
<span style="color: #c20cb9; font-weight: bold;">rm</span> taste.war grouplens.jar
<span style="color: #808080; font-style: italic;">#build the &quot;quick&quot; version of the example.  see below for build.xml patch</span>
<span style="color: #007800;">JDK_HOME=</span><span style="color: #000000; font-weight: bold;">/</span>usr<span style="color: #000000; font-weight: bold;">/</span>java<span style="color: #000000; font-weight: bold;">/</span>jdk1<span style="color: #000000;">.6</span>.0_02 <span style="color: #007800;">JAVA_HOME=</span><span style="color: #000000; font-weight: bold;">/</span>usr<span style="color: #000000; font-weight: bold;">/</span>java<span style="color: #000000; font-weight: bold;">/</span>jdk1<span style="color: #000000;">.6</span>.0_02 ant build-grouplens-example-quick
<span style="color: #808080; font-style: italic;">#inject the re-built .war file into tomcat.</span>
<span style="color: #c20cb9; font-weight: bold;">cp</span> taste.war apache-tomcat<span style="color: #000000;">-5.5</span><span style="color: #000000;">.26</span><span style="color: #000000; font-weight: bold;">/</span>webapps<span style="color: #000000; font-weight: bold;">/</span>
<span style="color: #808080; font-style: italic;">#get rid of stale tomcat caches</span>
<span style="color: #c20cb9; font-weight: bold;">rm</span> -rf apache-tomcat<span style="color: #000000;">-5.5</span><span style="color: #000000;">.26</span><span style="color: #000000; font-weight: bold;">/</span>webapps<span style="color: #000000; font-weight: bold;">/</span>taste apache-tomcat<span style="color: #000000;">-5.5</span><span style="color: #000000;">.26</span><span style="color: #000000; font-weight: bold;">/</span>temp<span style="color: #000000; font-weight: bold;">/</span>taste.<span style="color: #000000; font-weight: bold;">*</span>.txt</pre></div></div>

<p>Note that I&#8217;ve defined a new ant build target called &#8220;build-grouplens-example-quick&#8221;.  It rebuilds only grouplens.jar and taste.war, and skips the reoptimize/reverify/rebuild of taste.jar, etc.  The &#8220;build-grouplens-example&#8221; target takes ~55 seconds to complete on my machine, whereas the &#8220;build-grouplens-example-quick&#8221; target takes ~2 seconds.  Here&#8217;s a diff against the original build.xml file:</p>

<div class="wp_syntax"><div class="code"><pre class="diff"><span style="color: #888822;">--- /tmp/build.xml      <span style="">2008</span><span style="">-03</span><span style="">-21</span> <span style="">21</span>:<span style="">18</span>:<span style="">20.000000000</span> <span style="">-0700</span></span>
<span style="color: #888822;">+++ ./build.xml <span style="">2008</span><span style="">-06</span><span style="">-30</span> <span style="">11</span>:<span style="">46</span>:<span style="">18.000000000</span> <span style="">-0700</span></span>
<span style="color: #440088;">@@ <span style="">-161</span>,<span style="">6</span> <span style="">+161</span>,<span style="">58</span> @@</span>
      &lt;delete file=&quot;$<span style="">&#123;</span>my-web.xml<span style="">&#125;</span>&quot;/&gt;
   &lt;/target&gt;
&nbsp;
<span style="color: #00b000;">+  &lt;target depends=&quot;&quot; name=&quot;build-taste-server-quick&quot; description=&quot;Builds deployable web-based Taste server&quot;&gt;</span>
<span style="color: #00b000;">+     &lt;fail unless=&quot;my-recommender.jar&quot; message=&quot;Please set -Dmy-recommender.jar=XXX&quot;/&gt;</span>
<span style="color: #00b000;">+     &lt;fail unless=&quot;my-recommender-class&quot; message=&quot;Please set -Dmy-recommender-class=XXX&quot;/&gt;</span>
<span style="color: #00b000;">+     &lt;tempfile property=&quot;my-web.xml&quot;/&gt;</span>
<span style="color: #00b000;">+     &lt;copy file=&quot;src/main/com/planetj/taste/web/web.xml&quot; tofile=&quot;$<span style="">&#123;</span>my-web.xml<span style="">&#125;</span>&quot;&gt;</span>
<span style="color: #00b000;">+       &lt;filterset&gt;</span>
<span style="color: #00b000;">+               &lt;filter token=&quot;RECOMMENDER_CLASS&quot; value=&quot;$<span style="">&#123;</span>my-recommender-class<span style="">&#125;</span>&quot;/&gt;</span>
<span style="color: #00b000;">+       &lt;/filterset&gt;</span>
<span style="color: #00b000;">+     &lt;/copy&gt;</span>
<span style="color: #00b000;">+     &lt;war destfile=&quot;$<span style="">&#123;</span>release-war<span style="">&#125;</span>&quot; webxml=&quot;$<span style="">&#123;</span>my-web.xml<span style="">&#125;</span>&quot;&gt;</span>
<span style="color: #00b000;">+       &lt;lib dir=&quot;.&quot;&gt;</span>
<span style="color: #00b000;">+               &lt;include name=&quot;$<span style="">&#123;</span>release-jar<span style="">&#125;</span>&quot;/&gt;</span>
<span style="color: #00b000;">+               &lt;include name=&quot;$<span style="">&#123;</span>my-recommender.jar<span style="">&#125;</span>&quot;/&gt;</span>
<span style="color: #00b000;">+       &lt;/lib&gt;</span>
<span style="color: #00b000;">+       &lt;lib dir=&quot;lib/axis&quot;/&gt;</span>
<span style="color: #00b000;">+       &lt;classes dir=&quot;build&quot;&gt;</span>
<span style="color: #00b000;">+               &lt;include name=&quot;com/planetj/taste/web/**&quot;/&gt;</span>
<span style="color: #00b000;">+       &lt;/classes&gt;</span>
<span style="color: #00b000;">+       &lt;fileset dir=&quot;src/main/com/planetj/taste/web&quot;&gt;</span>
<span style="color: #00b000;">+               &lt;include name=&quot;RecommenderService.jws&quot;/&gt;</span>
<span style="color: #00b000;">+       &lt;/fileset&gt;</span>
<span style="color: #00b000;">+     &lt;/war&gt;</span>
<span style="color: #00b000;">+     &lt;delete file=&quot;$<span style="">&#123;</span>my-web.xml<span style="">&#125;</span>&quot;/&gt;</span>
<span style="color: #00b000;">+  &lt;/target&gt;</span>
<span style="color: #00b000;">+  &lt;target depends=&quot;&quot; name=&quot;build-grouplens-example-quick&quot; description=&quot;Builds deployable GroupLens example&quot;&gt;</span>
<span style="color: #00b000;">+     &lt;javac source=&quot;<span style="">1.5</span>&quot;</span>
<span style="color: #00b000;">+            target=&quot;<span style="">1.5</span>&quot;</span>
<span style="color: #00b000;">+            deprecation=&quot;true&quot;</span>
<span style="color: #00b000;">+          debug=&quot;true&quot;</span>
<span style="color: #00b000;">+          optimize=&quot;false&quot;</span>
<span style="color: #00b000;">+            destdir=&quot;build&quot;</span>
<span style="color: #00b000;">+            srcdir=&quot;src/example&quot;&gt;</span>
<span style="color: #00b000;">+       &lt;compilerarg value=&quot;-Xlint:all&quot;/&gt;</span>
<span style="color: #00b000;">+       &lt;classpath&gt;</span>
<span style="color: #00b000;">+               &lt;pathelement location=&quot;$<span style="">&#123;</span>release-jar<span style="">&#125;</span>&quot;/&gt;</span>
<span style="color: #00b000;">+               &lt;pathelement location=&quot;$<span style="">&#123;</span>annotations.jar<span style="">&#125;</span>&quot;/&gt;</span>
<span style="color: #00b000;">+       &lt;/classpath&gt;</span>
<span style="color: #00b000;">+     &lt;/javac&gt;</span>
<span style="color: #00b000;">+     &lt;jar jarfile=&quot;grouplens.jar&quot;&gt;</span>
<span style="color: #00b000;">+       &lt;fileset dir=&quot;src/example&quot;&gt;</span>
<span style="color: #00b000;">+               &lt;include name=&quot;com/planetj/taste/example/grouplens/ratings.dat&quot;/&gt;</span>
<span style="color: #00b000;">+               &lt;include name=&quot;com/planetj/taste/example/grouplens/movies.dat&quot;/&gt;</span>
<span style="color: #00b000;">+       &lt;/fileset&gt;</span>
<span style="color: #00b000;">+       &lt;fileset dir=&quot;build&quot;&gt;</span>
<span style="color: #00b000;">+               &lt;include name=&quot;com/planetj/taste/example/grouplens/**&quot;/&gt;</span>
<span style="color: #00b000;">+       &lt;/fileset&gt;</span>
<span style="color: #00b000;">+     &lt;/jar&gt;</span>
<span style="color: #00b000;">+     &lt;property name=&quot;my-recommender.jar&quot; value=&quot;grouplens.jar&quot;/&gt;</span>
<span style="color: #00b000;">+     &lt;property name=&quot;my-recommender-class&quot; value=&quot;com.planetj.taste.example.grouplens.GroupLensRecommender&quot;/&gt;</span>
<span style="color: #00b000;">+     &lt;antcall target=&quot;build-taste-server-quick&quot;/&gt;</span>
<span style="color: #00b000;">+  &lt;/target&gt;</span>
<span style="color: #00b000;">+</span>
   &lt;target depends=&quot;build,optimize&quot; name=&quot;build-grouplens-example&quot; description=&quot;Builds deployable GroupLens example&quot;&gt;
      &lt;javac source=&quot;<span style="">1.5</span>&quot;
             target=&quot;<span style="">1.5</span>&quot;</pre></div></div>

]]></content:encoded>
			<wfw:commentRss>http://www.spicylogic.com/allenday/blog/2008/06/30/notes-on-setting-up-taste/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Statistical HTML Content Extraction</title>
		<link>http://www.spicylogic.com/allenday/blog/2008/05/27/statistical-html-content-extraction/</link>
		<comments>http://www.spicylogic.com/allenday/blog/2008/05/27/statistical-html-content-extraction/#comments</comments>
		<pubDate>Tue, 27 May 2008 08:48:08 +0000</pubDate>
		<dc:creator>allenday</dc:creator>
				<category><![CDATA[Informatics]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.spicylogic.com/allenday/blog/?p=29</guid>
		<description><![CDATA[Introduction
I&#8217;ve been learning about some of the techniques used by the so-called &#8220;Black Hat SEO&#8221; community for boosting their rankings in search engine results.  Intriguing stuff.  I&#8217;m by no means an expert in this area, but the theory underlying building black-hat pages and networks sure looks like it has a lot to do [...]]]></description>
			<content:encoded><![CDATA[<h2>Introduction</h2>
<p>I&#8217;ve been learning about some of the techniques used by the so-called &#8220;Black Hat SEO&#8221; community for boosting their rankings in search engine results.  Intriguing stuff.  I&#8217;m by no means an expert in this area, but the theory underlying building black-hat pages and networks sure looks like it has a lot to do with <a href="http://en.wikipedia.org/wiki/Network_analysis">my</a> <a href="http://en.wikipedia.org/wiki/Bioinformatics">primary</a> <a href="http://en.wikipedia.org/wiki/Genomics">areas</a> <a href="http://en.wikipedia.org/wiki/Informatics">of</a> <a href="http://en.wikipedia.org/wiki/Machine_learning">interest</a>.</p>
<h2>Generating Unique Content</h2>
<p>One &#8220;Black Hat SEO&#8221; application area is automatically generating HTML pages to improve search engine rankings.  This technique uses a <a href="http://en.wikipedia.org/wiki/Markov_process">Markov process</a> to generate text.  The idea is to build one or more web pages that contain the keywords the SEO is targeting.  The method basically works like this:</p>
<ol>
<li>Assemble a corpus of text to train the model.  For example, <a href="http://www.gutenberg.org/wiki/Main_Page">Project Gutenberg</a></li>
<li>Build an order-N (typically N=2) <a href="http://en.wikipedia.org/wiki/Markov_model">Markov model</a> that captures the state changes in the corpus</li>
<li>Generate text from the model, periodically throwing in some keywords</li>
<li>Link the generated page to some other page to which you want to send traffic</li>
<li>Repeat from Step 1</li>
</ol>
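<p>As a rough illustration of steps 2 and 3, here is a minimal Python sketch of an order-2, word-level Markov generator.  The function names (<code>build_model</code>, <code>generate</code>), the toy corpus, and the keyword injection rate are assumptions made up for this example; they don&#8217;t come from any particular tool.</p>

<div class="wp_syntax"><div class="code"><pre class="python">import random
from collections import defaultdict

def build_model(words, order=2):
    """Map each run of `order` words to the words observed right after it."""
    model = defaultdict(list)
    for i in range(len(words) - order):
        model[tuple(words[i:i + order])].append(words[i + order])
    return model

def generate(model, length, keywords=(), rate=0.1, seed=None):
    """Random walk over the model, sometimes emitting a keyword instead."""
    rng = random.Random(seed)
    states = sorted(model)
    order = len(states[0])
    out = list(rng.choice(states))
    for _ in range(length - len(out)):
        followers = model.get(tuple(out[-order:]))
        if keywords and rate > rng.random():
            out.append(rng.choice(list(keywords)))   # step 3: throw in a keyword
        elif followers:
            out.append(rng.choice(followers))        # normal Markov transition
        else:
            out.append(rng.choice(states)[0])        # unseen state: restart somewhere
    return out[:length]

# Toy corpus (hypothetical); a real run would train on a much larger text.
corpus = "the cat sat on the mat and the cat ran off the mat".split()
model = build_model(corpus)
print(" ".join(generate(model, 12, keywords=["SEO"], seed=1)))</pre></div></div>

<p>Note how an injected keyword usually lands in a state the model has never seen, so the walk has to restart.  That is the &#8220;keywords don&#8217;t fit the flow&#8221; problem in miniature.</p>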
<p>One problem with this approach (aside from the fact that the keywords don&#8217;t really fit in with the flow of the model) is that the model is trained on inappropriate text.  For instance, suppose you were trying to optimize for keywords:</p>
<ul>
<li>keywords</li>
<li>statistics</li>
<li>Search engine optimization</li>
<li>SEO</li>
<li>Automatic content generation</li>
<li>Automatic content extraction</li>
<li>HTML content extraction</li>
<li>Markov Model</li>
</ul>
<p>&#8230; then you probably wouldn&#8217;t want to train your model on, say, Jane Austen&#8217;s <a href="http://www.gutenberg.org/etext/1342">Pride and Prejudice</a>.</p>
<h2>Improve Generated Text: Use Niche Corpora</h2>
<p>A better thing to do would be to find some nice web pages containing <strong>keywords</strong>, <strong>statistics</strong>, <strong>seo</strong>, <strong>Markov model</strong>, and so on.  That way you&#8217;ll pick up related keywords that you didn&#8217;t initially think of (or weren&#8217;t suggested by your keyword expansion tool), too.</p>
<p>But let&#8217;s face it.  The corpora are going to be in HTML format.  So the question now becomes, <strong>How do I automate the transformation of HTML into plain text for input to the model?</strong>  A few strawman ideas, followed by my remarks:</p>
<ul>
<li>Get an HTML document, and remove all &lt;element/&gt;s. <i>Won&#8217;t work very well.  You end up training on page navigation, footers, headers, etc.</i></li>
<li>Build a site- or software-specific parser (e.g. for Wikipedia, or for WordPress) to extract the main content.  <i>Scalability and maintenance nightmare.  This does not generalize to arbitrary pages, and you&#8217;ll be constantly fixing broken parsers.</i></li>
<li>Devise a scoring system that can identify the main content of the page.  <i>Exactly!</i></li>
</ul>
<p>I did find some methods for scoring page fragments, such as the Perl modules <a href="http://search.cpan.org/~jtaverni/HTML-Content-Extractor-0.01/">HTML::Content::Extractor</a> and <a href="http://search.cpan.org/~cselt/HTML-Extract-0.15/">HTML::Extract</a>, and another method described by <a href="http://www.perlmonks.org/?node_id=57631">Nooks</a>.  There are also a few interesting ideas in <a href="http://www2003.org/cdrom/papers/refereed/p583/p583-gupta.html">Gupta&#8217;s WWW2003 paper</a>.</p>
<p>None of that Perl code linked above <em>actually works</em>, but Nooks and Jean Tavernier generally had the right idea.  Basically, they look &#8220;down&#8221; the DOM to find the sub-DOM with the highest text/tag ratio.</p>
<p>The main problem with this approach is that it biases for DOM leaves, or &#8220;twigs&#8221; that are very close to leaves.  You end up having to write special rules to accommodate the idiosyncrasies of each particular page, and it basically turns back into an HTML parsing exercise.</p>
<p>The other problem, and possibly more significant one from a statistician&#8217;s point of view, is that the ratio is not a well-understood metric for making decisions about what constitutes a &#8220;good&#8221; versus a &#8220;bad&#8221; sub-document.  It would be better to have a p-value&#8230;</p>
<h2>Balls and Urns</h2>
<p>Fortunately, <a href="http://en.wikipedia.org/wiki/Fisher's_exact_test">Fisher&#8217;s exact test</a> can be applied to this problem.  Here&#8217;s how you can apply it; an explanation follows.  First, let&#8217;s define some variables:</p>
<ul>
<li><b>X</b>: the total number of words in the whole document.</li>
<li><b>x</b>: the number of words in a sub-document.</li>
<li><b>Y</b>: the total number of &lt;element/&gt;s in the whole document.</li>
<li><b>y</b>: the number of &lt;element/&gt;s in a sub-document.</li>
</ul>
<p>Then, we perform the following algorithm to identify the single best sub-document:</p>

<div class="wp_syntax"><div class="code"><pre class="c">tree; <span style="color: #808080; font-style: italic;">//the HTML tree's root node</span>
minP <span style="color: #66cc66;">=</span> <span style="color: #cc66cc;">1</span>; <span style="color: #808080; font-style: italic;">//minimum p-value observed in the document</span>
subD <span style="color: #66cc66;">=</span> <span style="color: #ff0000;">&quot;&quot;</span>; <span style="color: #808080; font-style: italic;">//sub-document corresponding to minimum p-value</span>
X <span style="color: #66cc66;">=</span> calculatex<span style="color: #66cc66;">&#40;</span>tree<span style="color: #66cc66;">&#41;</span>;
Y <span style="color: #66cc66;">=</span> calculatey<span style="color: #66cc66;">&#40;</span>tree<span style="color: #66cc66;">&#41;</span>;
look<span style="color: #66cc66;">&#40;</span>tree<span style="color: #66cc66;">&#41;</span>;
<span style="color: #000000; font-weight: bold;">function</span> look <span style="color: #66cc66;">&#40;</span>node<span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
  x <span style="color: #66cc66;">=</span> calculatex<span style="color: #66cc66;">&#40;</span>node<span style="color: #66cc66;">&#41;</span>;
  y <span style="color: #66cc66;">=</span> calculatey<span style="color: #66cc66;">&#40;</span>node<span style="color: #66cc66;">&#41;</span>;
  p <span style="color: #66cc66;">=</span> calculateHyperG<span style="color: #66cc66;">&#40;</span>x,y,X,Y<span style="color: #66cc66;">&#41;</span>;
  <span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span> p &lt; minP <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
    minP <span style="color: #66cc66;">=</span> p;
    subD <span style="color: #66cc66;">=</span> node;
  <span style="color: #66cc66;">&#125;</span>
  C <span style="color: #66cc66;">=</span> children<span style="color: #66cc66;">&#40;</span>node<span style="color: #66cc66;">&#41;</span>;
  foreach <span style="color: #66cc66;">&#40;</span>c in C<span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
    look<span style="color: #66cc66;">&#40;</span>c<span style="color: #66cc66;">&#41;</span>;
  <span style="color: #66cc66;">&#125;</span>
<span style="color: #66cc66;">&#125;</span></pre></div></div>

<h2>Balls and Urns, Explained</h2>
<p>The pseudocode above examines each sub-document of the HTML document in turn and identifies the one with the smallest p-value.  The p-value is calculated using the <a href="http://en.wikipedia.org/wiki/Hypergeometric_distribution">hypergeometric distribution</a>: we consider that a sub-document has <b>x</b> words and <b>y</b> HTML &lt;element/&gt;s, in the context of the total document having <b>X</b> words and <b>Y</b> HTML &lt;element/&gt;s.  It&#8217;s better than a simple ratio calculation because it does not bias for the tree&#8217;s leaves.  That is, the p-value does not consider only the size of <b>x+y</b>.</p>
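<p>To make the calculation concrete, here is a minimal Python sketch of one way to get such a p-value.  It assumes an upper-tail test (the chance of seeing at least <b>x</b> words among <b>x+y</b> tokens drawn without replacement from the whole document).  The function name <code>hypergeom_upper_tail</code> and the toy counts are made up for illustration, and this is not necessarily what the <code>calculateHyperG</code> placeholder in the pseudocode computes.</p>

<div class="wp_syntax"><div class="code"><pre class="python">from math import comb  # needs Python 3.8 or later

def hypergeom_upper_tail(x, y, X, Y):
    """P(at least x words among x + y tokens drawn without replacement
    from a pool of X words and Y elements)."""
    n = x + y                      # tokens in the sub-document
    total = comb(X + Y, n)         # all ways to pick n tokens from the document
    hits = sum(comb(X, k) * comb(Y, n - k) for k in range(x, min(n, X) + 1))
    return hits / total

# Toy counts (hypothetical): a document with 100 words and 100 elements.
leaf = hypergeom_upper_tail(5, 0, 100, 100)     # tiny leaf: 5 words, 0 elements
block = hypergeom_upper_tail(80, 10, 100, 100)  # big block: 80 words, 10 elements
print(leaf, block)</pre></div></div>

<p>With these toy numbers the five-word leaf has the better text/tag ratio, yet the 90-token block gets a far smaller p-value, which is the behaviour we want.</p>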
<h3>Caveats</h3>
<p>Bear in mind that testing so many sub-documents, especially for very large HTML documents, warrants so-called &#8220;<a href="http://en.wikipedia.org/wiki/Multiple_comparisons">multiple hypothesis testing correction</a>&#8221;, such as a <a href="http://en.wikipedia.org/wiki/Bonferroni_correction">Bonferroni correction</a>.  It&#8217;s outside the scope of this article.</p>
<p>Also, the tests performed are not entirely independent.  That is, if node B is a child of node A then B will have some effect on A when calculating A&#8217;s p-value and must be factored out.  This is also a well-defined problem but is, alas, also outside the scope of this article.  Do your homework! <b>Hint</b>: learn about the <a href="http://en.wikipedia.org/wiki/Gene_ontology">Gene Ontology</a>.</p>
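<p>For reference, the Bonferroni correction itself amounts to very little code.  The sketch below is a generic illustration in Python (the function name <code>bonferroni</code> is my own); it does nothing about the dependence between parent and child nodes described above.</p>

<div class="wp_syntax"><div class="code"><pre class="python">def bonferroni(p_values):
    """Multiply each p-value by the number of tests performed, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# One p-value per sub-document examined (made-up numbers).
print(bonferroni([0.001, 0.02, 0.5]))</pre></div></div>

<p>Since every p-value is scaled by the same factor, the correction does not change which sub-document comes out &#8220;best&#8221;; it only changes whether that best p-value can still be called significant.</p>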
<h2>Conclusion</h2>
<p>Fine and dandy, but does it work?  My conclusion: seems to work.  Here&#8217;s a CGI script demonstrating the <a href="http://www.spicylogic.com/allenday/cgi-bin/hyperG.cgi?u=http://www.cnn.com">hypergeometric content extraction</a> technique on CNN.com.  It reports a text snippet at the beginning and end of the single &#8220;best&#8221; sub-document and the corresponding (uncorrected) p-value.  Twiddle the <b>u</b> parameter to test on a page of your choice.  Some pages may block the user-agent I&#8217;m using&#8230;</p>
<p>There is also the issue of what to consider an element and what not to&#8230; or maybe even element weighting.  For instance, maybe &lt;p/&gt; and &lt;i/&gt; elements shouldn&#8217;t be penalized because they&#8217;re commonly associated with text, while &lt;script/&gt; elements should be heavily penalized.</p>
<h1>Update 2009-02-06</h1>
<p>Someone asked for the source code used here.  I&#8217;m not actively pursuing any business with this, so here you go.  If you use it in something or make a derivative work I&#8217;d be pleased to know what you&#8217;ve done.</p>

<div class="wp_syntax"><div class="code"><pre class="perl"><span style="color: #808080; font-style: italic;">#!/usr/bin/perl</span>
<span style="color: #808080; font-style: italic;"># Copyright Allen Day &lt;allenday@gmail.com&gt;</span>
<span style="color: #808080; font-style: italic;"># License: Artistic 2.0</span>
<span style="color: #000000; font-weight: bold;">use</span> strict;
<span style="color: #000000; font-weight: bold;">use</span> lib <span style="color: #ff0000;">'/home/allenday/lib/perl/lib/perl/5.8.4/'</span>;
<span style="color: #000000; font-weight: bold;">use</span> CGI <span style="color: #000066;">qw</span><span style="color: #66cc66;">&#40;</span>:standard<span style="color: #66cc66;">&#41;</span>;
<span style="color: #000066;">print</span> header<span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #000000; font-weight: bold;">use</span> HTML::<span style="color: #006600;">TreeBuilder</span>;
<span style="color: #000000; font-weight: bold;">use</span> GO::<span style="color: #006600;">TermFinder</span>::<span style="color: #006600;">Native</span>;
<span style="color: #000000; font-weight: bold;">use</span> LWP::<span style="color: #006600;">Simple</span> <span style="color: #000066;">qw</span><span style="color: #66cc66;">&#40;</span>get<span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$d</span> = GO::<span style="color: #006600;">TermFinder</span>::<span style="color: #006600;">Native</span>::<span style="color: #006600;">Distributions</span>-<span style="color: #66cc66;">&gt;</span><span style="color: #006600;">new</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">8192</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">%ignore</span> = <span style="color: #000066;">map</span> <span style="color: #66cc66;">&#123;</span><span style="color: #0000ff;">$_</span>=<span style="color: #66cc66;">&gt;</span><span style="color: #cc66cc;">1</span><span style="color: #66cc66;">&#125;</span> <span style="color: #000066;">qw</span><span style="color: #66cc66;">&#40;</span>head iframe img script table td <span style="color: #000066;">tr</span> th tbody<span style="color: #66cc66;">&#41;</span>;
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">%ok</span>     = <span style="color: #000066;">map</span> <span style="color: #66cc66;">&#123;</span><span style="color: #0000ff;">$_</span>=<span style="color: #66cc66;">&gt;</span><span style="color: #cc66cc;">1</span><span style="color: #66cc66;">&#125;</span> <span style="color: #000066;">qw</span><span style="color: #66cc66;">&#40;</span>a b br div em h1 h2 h3 h4 i p span strong<span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$tree</span> = HTML::<span style="color: #006600;">TreeBuilder</span>-<span style="color: #66cc66;">&gt;</span><span style="color: #006600;">new</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$all</span> = param<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">'a'</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$doc</span> = get<span style="color: #66cc66;">&#40;</span>param<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">'u'</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #0000ff;">$tree</span>-<span style="color: #66cc66;">&gt;</span><span style="color: #006600;">parse</span><span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">$doc</span> <span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$wordtotal</span> = <span style="color: #cc66cc;">0</span>;
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$elemtotal</span> = <span style="color: #cc66cc;">0</span>;
tally<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">$tree</span>,\<span style="color: #0000ff;">$elemtotal</span>,\<span style="color: #0000ff;">$wordtotal</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$minp</span> = <span style="color: #cc66cc;">1</span>;
<span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$result</span>;
examine<span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">$tree</span> <span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #000066;">print</span> <span style="color: #ff0000;">'&lt;html&gt;&lt;head&gt;&lt;title&gt;x&lt;/title&gt;&lt;/head&gt;&lt;body&gt;&lt;pre&gt;'</span>;
<span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span> <span style="color: #66cc66;">!</span> <span style="color: #0000ff;">$all</span> <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
  <span style="color: #0000ff;">$result</span> =~ <span style="color: #000066;">m</span><span style="color: #808080; font-style: italic;">#^(.{50}).*?(.{50})$#s;</span>
  <span style="color: #000066;">print</span> <span style="color: #0000ff;">$minp</span>,<span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>$1 ... $2<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>;
<span style="color: #66cc66;">&#125;</span>
<span style="color: #b1b100;">else</span> <span style="color: #66cc66;">&#123;</span>
  <span style="color: #000066;">print</span> <span style="color: #0000ff;">$minp</span>,<span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>$result<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>;
<span style="color: #66cc66;">&#125;</span>
&nbsp;
<span style="color: #000066;">print</span> <span style="color: #ff0000;">'&lt;/pre&gt;&lt;/body&gt;&lt;/html&gt;'</span>;
<span style="color: #000066;">print</span> <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\n</span>&quot;</span>;
&nbsp;
<span style="color: #000000; font-weight: bold;">sub</span> examine <span style="color: #66cc66;">&#123;</span>
  <span style="color: #b1b100;">my</span> <span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">$node</span> <span style="color: #66cc66;">&#41;</span> = <span style="color: #0000ff;">@_</span>;
  <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$elemcount</span> = <span style="color: #cc66cc;">0</span>;
  <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$wordcount</span> = <span style="color: #cc66cc;">0</span>;
  tally<span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">$node</span>, \<span style="color: #0000ff;">$elemcount</span>, \<span style="color: #0000ff;">$wordcount</span> <span style="color: #66cc66;">&#41;</span>;
  <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$p</span> = <span style="color: #0000ff;">$d</span>-<span style="color: #66cc66;">&gt;</span><span style="color: #006600;">hypergeometric</span><span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">$wordcount</span>, <span style="color: #0000ff;">$elemcount</span> + <span style="color: #0000ff;">$wordcount</span>, <span style="color: #0000ff;">$wordtotal</span>, <span style="color: #0000ff;">$elemtotal</span> + <span style="color: #0000ff;">$wordtotal</span> <span style="color: #66cc66;">&#41;</span>;
&nbsp;
  <span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">$minp</span> <span style="color: #66cc66;">&gt;</span> <span style="color: #0000ff;">$p</span> <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
    <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$text</span> = <span style="color: #ff0000;">''</span>;
    text<span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">$node</span>, \<span style="color: #0000ff;">$text</span> <span style="color: #66cc66;">&#41;</span>;
    <span style="color: #0000ff;">$minp</span> = <span style="color: #0000ff;">$p</span>;
    <span style="color: #0000ff;">$result</span> = <span style="color: #0000ff;">$text</span>;
  <span style="color: #66cc66;">&#125;</span>
  <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">@content</span> = <span style="color: #0000ff;">$node</span>-<span style="color: #66cc66;">&gt;</span><span style="color: #006600;">content_list</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span>;
  <span style="color: #b1b100;">foreach</span> <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$c</span> <span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">@content</span> <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
    <span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span> <span style="color: #000066;">ref</span><span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">$c</span> <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
      examine<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">$c</span>,<span style="color: #0000ff;">$elemcount</span>,<span style="color: #0000ff;">$wordcount</span><span style="color: #66cc66;">&#41;</span>;
    <span style="color: #66cc66;">&#125;</span>
  <span style="color: #66cc66;">&#125;</span>
<span style="color: #66cc66;">&#125;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">sub</span> tally <span style="color: #66cc66;">&#123;</span>
  <span style="color: #b1b100;">my</span> <span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">$node</span>, <span style="color: #0000ff;">$elemcount</span>, <span style="color: #0000ff;">$wordcount</span> <span style="color: #66cc66;">&#41;</span> = <span style="color: #0000ff;">@_</span>;
  <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">@content</span> = <span style="color: #0000ff;">$node</span>-<span style="color: #66cc66;">&gt;</span><span style="color: #006600;">content_list</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span>;
  <span style="color: #b1b100;">foreach</span> <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$c</span> <span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">@content</span> <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
    <span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span> <span style="color: #000066;">ref</span><span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">$c</span> <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
      <span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span> <span style="color: #66cc66;">!</span> <span style="color: #0000ff;">$ok</span><span style="color: #66cc66;">&#123;</span> <span style="color: #0000ff;">$c</span>-<span style="color: #66cc66;">&gt;</span><span style="color: #006600;">tag</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#125;</span> <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
        <span style="color: #0000ff;">$$elemcount</span>++;
      <span style="color: #66cc66;">&#125;</span>
      <span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span> <span style="color: #66cc66;">!</span> <span style="color: #0000ff;">$ignore</span><span style="color: #66cc66;">&#123;</span> <span style="color: #0000ff;">$c</span>-<span style="color: #66cc66;">&gt;</span><span style="color: #006600;">tag</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#125;</span> <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
        tally<span style="color: #66cc66;">&#40;</span><span style="color: #0000ff;">$c</span>,<span style="color: #0000ff;">$elemcount</span>,<span style="color: #0000ff;">$wordcount</span><span style="color: #66cc66;">&#41;</span>;
      <span style="color: #66cc66;">&#125;</span>
    <span style="color: #66cc66;">&#125;</span>
    <span style="color: #b1b100;">else</span> <span style="color: #66cc66;">&#123;</span>
      <span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span> <span style="color: #66cc66;">!</span> <span style="color: #0000ff;">$ignore</span><span style="color: #66cc66;">&#123;</span> <span style="color: #0000ff;">$node</span>-<span style="color: #66cc66;">&gt;</span><span style="color: #006600;">tag</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#125;</span> <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
        <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">@words</span> = <span style="color: #000066;">grep</span> <span style="color: #66cc66;">/</span>\w<span style="color: #66cc66;">/</span>, <span style="color: #000066;">split</span> <span style="color: #66cc66;">/</span>\<span style="color: #000066;">s</span>+<span style="color: #66cc66;">/</span>, <span style="color: #0000ff;">$c</span>;
        <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$c</span> = <span style="color: #000066;">scalar</span><span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">@words</span> <span style="color: #66cc66;">&#41;</span>;     
        <span style="color: #0000ff;">$$wordcount</span> += <span style="color: #0000ff;">$c</span> <span style="color: #b1b100;">if</span> <span style="color: #0000ff;">$c</span> <span style="color: #66cc66;">&gt;</span> <span style="color: #cc66cc;">2</span>;
      <span style="color: #66cc66;">&#125;</span>
    <span style="color: #66cc66;">&#125;</span>
  <span style="color: #66cc66;">&#125;</span>
<span style="color: #66cc66;">&#125;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">sub</span> text <span style="color: #66cc66;">&#123;</span>
  <span style="color: #b1b100;">my</span> <span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">$node</span>, <span style="color: #0000ff;">$text</span> <span style="color: #66cc66;">&#41;</span> = <span style="color: #0000ff;">@_</span>;
  <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">@content</span> = <span style="color: #0000ff;">$node</span>-<span style="color: #66cc66;">&gt;</span><span style="color: #006600;">content_list</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span>;
  <span style="color: #b1b100;">foreach</span> <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$c</span> <span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">@content</span> <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
    <span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span> <span style="color: #000066;">ref</span><span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">$c</span> <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
      text<span style="color: #66cc66;">&#40;</span> <span style="color: #0000ff;">$c</span>, <span style="color: #0000ff;">$text</span> <span style="color: #66cc66;">&#41;</span>;
    <span style="color: #66cc66;">&#125;</span>
    <span style="color: #b1b100;">else</span> <span style="color: #66cc66;">&#123;</span>
      <span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span> <span style="color: #66cc66;">!</span> <span style="color: #0000ff;">$ignore</span><span style="color: #66cc66;">&#123;</span> <span style="color: #0000ff;">$node</span>-<span style="color: #66cc66;">&gt;</span><span style="color: #006600;">tag</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#125;</span> <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
        <span style="color: #0000ff;">$$text</span> .= <span style="color: #ff0000;">' '</span> . <span style="color: #0000ff;">$c</span>;
      <span style="color: #66cc66;">&#125;</span>
    <span style="color: #66cc66;">&#125;</span>
  <span style="color: #66cc66;">&#125;</span>
<span style="color: #66cc66;">&#125;</span></pre></div></div>

]]></content:encoded>
			<wfw:commentRss>http://www.spicylogic.com/allenday/blog/2008/05/27/statistical-html-content-extraction/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Sparse Matrices in R</title>
		<link>http://www.spicylogic.com/allenday/blog/2008/05/05/sparse-matrices-in-r/</link>
		<comments>http://www.spicylogic.com/allenday/blog/2008/05/05/sparse-matrices-in-r/#comments</comments>
		<pubDate>Mon, 05 May 2008 07:16:28 +0000</pubDate>
		<dc:creator>allenday</dc:creator>
				<category><![CDATA[Informatics]]></category>
		<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.spicylogic.com/allenday/blog/?p=10</guid>
		<description><![CDATA[
I&#8217;ve had a need over the last week to work with some sparse matrix data in R.  This was a totally new problem for me, and I can now sympathize with anyone else having to do this and will document the experience.
It seems that the de-facto standard for moving sparse matrices around is [...]]]></description>
			<content:encoded><![CDATA[<div style="float:right"><img style="width:300px" src="http://www.mathworks.com/matlabcentral/files/17385/dijkstra.jpg"/></div>
<p>I&#8217;ve had a need over the last week to work with some sparse matrix data in <a href="http://cran.r-project.org">R</a>.  This was a totally new problem for me, and I can now sympathize with anyone else having to do this and will document the experience.</p>
<p>It seems that the de-facto standard for moving sparse matrices around is the <a href="http://math.nist.gov/MatrixMarket/collections/hb.html">Harwell-Boeing file format</a>, aka &#8220;harbo&#8221;.  It&#8217;s a horrible and largely undocumented fixed-width (think Fortran) file format.  The best documentation I could find was in source code <a href="http://acts.nersc.gov/tau/programs/pdgssvx/dreadhb.c">here</a>, although you may be able to piece more of it together with <a href="http://www.koders.com/default.aspx?s=harwell-boeing">Koders</a>.  R does include a harbo reader as part of the <a href="http://cran.r-project.org/web/packages/SparseM/">SparseM</a> package.</p>
<p>Given that I&#8217;m more comfortable working in Perl than in R or Fortran, I decided to have a look on CPAN to see what was available.  As it turns out, there is a package called <a href="http://search.cpan.org/~tpederse/Text-SenseClusters-1.01/">Text::SenseClusters</a> from <a href="http://www.d.umn.edu/~tpederse/senseclusters.html">Ted Pedersen</a> that ships with a nifty program, <a href="http://search.cpan.org/~tpederse/Text-SenseClusters-0.98/Toolkit/svd/mat2harbo.pl">mat2harbo.pl</a>.  I found the preferred sparse matrix &#8220;mat&#8221; format used by Text::SenseClusters to be more reasonable than harbo. Here&#8217;s an example.</p>
<pre>5 5 15
2 9 4 9
1 6 2 5 3 7 4 8 5 6
1 4 2 5
1 7 2 6 3 7
1 9 2 8 3 9</pre>
<p>The header line, <code>5 5 15</code>, defines the number of matrix rows, columns, and non-null fields.  Each subsequent (possibly blank) line gives column-index/value pairs for the non-null positions in that row, so <code>2 9 4 9</code> says that row 1 holds a 9 in column 2 and a 9 in column 4.  Easy!</p>
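<p>To make the layout concrete, here is a minimal sketch that decodes the example above into an ordinary dense R matrix using base R only.  It assumes the column indices are 1-based (the example uses column 5 in a 5-column matrix), and the variable names are just for illustration.</p>

```r
# The example "mat" file from above, one string per line.
mat.lines = c("5 5 15",
              "2 9 4 9",
              "1 6 2 5 3 7 4 8 5 6",
              "1 4 2 5",
              "1 7 2 6 3 7",
              "1 9 2 8 3 9")

# Header: number of rows, number of columns, number of non-null entries.
header = as.integer(strsplit(mat.lines[1], " ")[[1]])
dense = matrix(0, nrow = header[1], ncol = header[2])

for (i in seq_len(header[1])) {
  fields = as.numeric(strsplit(mat.lines[i + 1], " ")[[1]])
  if (length(fields) == 0) next                   # blank line: empty row
  cols = fields[seq(1, length(fields), by = 2)]   # odd positions: column indices
  vals = fields[seq(2, length(fields), by = 2)]   # even positions: values
  dense[i, cols] = vals
}

print(dense)
```

<p>A dense matrix is only sensible for a toy case like this one; the count of non-zero cells in <code>dense</code> should match the third header field.</p>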
<p>At this point I was formulating a plan to:</p>
<ol>
<li>use my matrix writer to write in &#8220;mat&#8221; format to <code>file1.mat</code>.</li>
<li>convert <code>file1.mat</code> to <code>file2.harbo</code> using <code>mat2harbo.pl</code> from Text::SenseClusters.</li>
<li>import <code>file2.harbo</code> into R using the <code>read.matrix.hb()</code> function in the <code>SparseM</code> package.</li>
<li>convert the SparseM matrix to an R graph (<code>graph</code> package).</li>
<li>get back to my original problem&#8230; analyzing the matrix in R with <a href="http://www.boost.org/">Boost</a> via the <code><a href="http://cran.r-project.org/web/packages/RBGL/">RBGL</a></code> package.</li>
</ol>
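<p>The matrix writer from step 1 isn&#8217;t shown in this post.  As a rough stand-in, here is a sketch of what such a writer could look like in base R; <code>write.mat.lines</code> is a hypothetical helper name, and it assumes a small dense matrix where zero means &#8220;null&#8221;.</p>

```r
# Hypothetical helper: turn a dense matrix into "mat" format lines.
# Zero cells are treated as null; column indices are 1-based.
write.mat.lines = function(m) {
  rows = vapply(seq_len(nrow(m)), function(i) {
    cols = which(m[i, ] != 0)
    # Interleave index/value pairs: col1 val1 col2 val2 ...
    paste(as.vector(rbind(cols, m[i, cols])), collapse = " ")
  }, character(1))
  # Header line: rows, columns, number of non-null cells.
  c(paste(nrow(m), ncol(m), sum(m != 0)), rows)
}

m = matrix(c(0, 9, 0,
             0, 0, 0,
             4, 0, 5), nrow = 3, byrow = TRUE)
out = write.mat.lines(m)
print(out)
```

<p>The empty second row comes out as a blank line, which is the case the format explicitly allows.  The result can be saved with <code>writeLines(out, "file1.mat")</code>.</p>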
<p>Well, it wasn&#8217;t that easy.</p>
<p>Step 1 went okay.  Step 2 had problems with null columns, and had some glitches in the output format.  Some of these glitches were easy to fix (e.g. matrix definition of &#8220;rra&#8221; to &#8220;RRA&#8221;), but others were very difficult due to the fact that mat2harbo.pl didn&#8217;t provide &#8220;full&#8221; harbo support, and the SparseM reader needed some of the file format features that weren&#8217;t supported.</p>
<p>So I wrote my own &#8220;mat&#8221; file -&gt; R <code>matrix.csr</code> object constructor.  Here it is:</p>

<div class="wp_syntax"><div class="code"><pre class="c">read.<span style="color: #202020;">matrix</span>.<span style="color: #202020;">pair</span> <span style="color: #66cc66;">=</span> <span style="color: #000000; font-weight: bold;">function</span> <span style="color: #66cc66;">&#40;</span>file,debug<span style="color: #66cc66;">=</span><span style="color: #000000; font-weight: bold;">FALSE</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
mat.<span style="color: #202020;">lines</span> <span style="color: #66cc66;">=</span> readLines<span style="color: #66cc66;">&#40;</span>file<span style="color: #66cc66;">&#41;</span>;
header <span style="color: #66cc66;">=</span> mat.<span style="color: #202020;">lines</span><span style="color: #66cc66;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #66cc66;">&#93;</span>;
F <span style="color: #66cc66;">=</span> strsplit<span style="color: #66cc66;">&#40;</span>header,<span style="color: #ff0000;">' '</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#93;</span>;
nrow <span style="color: #66cc66;">=</span> as.<span style="color: #202020;">integer</span><span style="color: #66cc66;">&#40;</span>F<span style="color: #66cc66;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>;
ncol <span style="color: #66cc66;">=</span> as.<span style="color: #202020;">integer</span><span style="color: #66cc66;">&#40;</span>F<span style="color: #66cc66;">&#91;</span><span style="color: #cc66cc;">2</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>;
nelem <span style="color: #66cc66;">=</span> as.<span style="color: #202020;">integer</span><span style="color: #66cc66;">&#40;</span>F<span style="color: #66cc66;">&#91;</span><span style="color: #cc66cc;">3</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;
ja <span style="color: #66cc66;">=</span> vector<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">&quot;list&quot;</span>,nrow<span style="color: #66cc66;">&#41;</span>;
ra <span style="color: #66cc66;">=</span> vector<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">&quot;list&quot;</span>,nrow<span style="color: #66cc66;">&#41;</span>;
ia <span style="color: #66cc66;">=</span> vector<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">&quot;list&quot;</span>,nrow<span style="color: #66cc66;">&#41;</span>;
ia<span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#93;</span> <span style="color: #66cc66;">=</span> c<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">1</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #b1b100;">for</span> <span style="color: #66cc66;">&#40;</span> i in <span style="color: #cc66cc;">2</span><span style="color: #66cc66;">:</span><span style="color: #66cc66;">&#40;</span>nrow<span style="color: #cc66cc;">+1</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span> <span style="color: #339933;">#nrow</span>
mat.<span style="color: #202020;">line</span> <span style="color: #66cc66;">=</span> strsplit<span style="color: #66cc66;">&#40;</span>mat.<span style="color: #202020;">lines</span><span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span>,<span style="color: #ff0000;">' '</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#93;</span>;
<span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span> length<span style="color: #66cc66;">&#40;</span>mat.<span style="color: #202020;">line</span><span style="color: #66cc66;">&#41;</span> &gt; <span style="color: #cc66cc;">0</span> <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
<span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span> debug <span style="color: #66cc66;">&#41;</span> print<span style="color: #66cc66;">&#40;</span>paste<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">'non-empty row'</span>,i<span style="color: #cc66cc;">-1</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
ja<span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#91;</span>i<span style="color: #cc66cc;">-1</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#93;</span> <span style="color: #66cc66;">=</span> mat.<span style="color: #202020;">line</span><span style="color: #66cc66;">&#91;</span>  seq<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">1</span>,length<span style="color: #66cc66;">&#40;</span>mat.<span style="color: #202020;">line</span><span style="color: #66cc66;">&#41;</span>,by<span style="color: #66cc66;">=</span><span style="color: #cc66cc;">2</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#93;</span>;
ra<span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#91;</span>i<span style="color: #cc66cc;">-1</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#93;</span> <span style="color: #66cc66;">=</span> mat.<span style="color: #202020;">line</span><span style="color: #66cc66;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #66cc66;">+</span>seq<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">1</span>,length<span style="color: #66cc66;">&#40;</span>mat.<span style="color: #202020;">line</span><span style="color: #66cc66;">&#41;</span>,by<span style="color: #66cc66;">=</span><span style="color: #cc66cc;">2</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#93;</span>;
ia<span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#93;</span> <span style="color: #66cc66;">=</span> ia<span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#91;</span>i<span style="color: #cc66cc;">-1</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#93;</span> <span style="color: #66cc66;">+</span> length<span style="color: #66cc66;">&#40;</span>mat.<span style="color: #202020;">line</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">/</span><span style="color: #cc66cc;">2</span>;
<span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span> debug <span style="color: #66cc66;">&#41;</span> print<span style="color: #66cc66;">&#40;</span>paste<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">'  pos:'</span>,ja<span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#91;</span>i<span style="color: #cc66cc;">-1</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span> debug <span style="color: #66cc66;">&#41;</span> print<span style="color: #66cc66;">&#40;</span>paste<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">'  dat:'</span>,ra<span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#91;</span>i<span style="color: #cc66cc;">-1</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #66cc66;">&#125;</span> <span style="color: #b1b100;">else</span> <span style="color: #66cc66;">&#123;</span>
<span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span> debug <span style="color: #66cc66;">&#41;</span> print<span style="color: #66cc66;">&#40;</span>paste<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">'    empty row'</span>,i<span style="color: #cc66cc;">-1</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
ia<span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#91;</span>i<span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#93;</span> <span style="color: #66cc66;">=</span> ia<span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#91;</span>i<span style="color: #cc66cc;">-1</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#93;</span>;
<span style="color: #66cc66;">&#125;</span>
<span style="color: #66cc66;">&#125;</span>
ans.<span style="color: #202020;">ja</span> <span style="color: #66cc66;">=</span> as.<span style="color: #202020;">integer</span><span style="color: #66cc66;">&#40;</span>unlist<span style="color: #66cc66;">&#40;</span>ja<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
ans.<span style="color: #202020;">ra</span> <span style="color: #66cc66;">=</span> as.<span style="color: #202020;">numeric</span><span style="color: #66cc66;">&#40;</span>unlist<span style="color: #66cc66;">&#40;</span>ra<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>; <span style="color: #339933;"># values need not be integers</span>
ans.<span style="color: #202020;">ia</span> <span style="color: #66cc66;">=</span> as.<span style="color: #202020;">integer</span><span style="color: #66cc66;">&#40;</span>unlist<span style="color: #66cc66;">&#40;</span>ia<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
dimension <span style="color: #66cc66;">=</span> as.<span style="color: #202020;">integer</span><span style="color: #66cc66;">&#40;</span>c<span style="color: #66cc66;">&#40;</span>nrow,ncol<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
&nbsp;
<span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span> debug <span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
print<span style="color: #66cc66;">&#40;</span>paste<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">'nrow'</span>,nrow<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
print<span style="color: #66cc66;">&#40;</span>paste<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">'ncol'</span>,ncol<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span>;
print<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">'ra'</span><span style="color: #66cc66;">&#41;</span>;print<span style="color: #66cc66;">&#40;</span>ans.<span style="color: #202020;">ra</span><span style="color: #66cc66;">&#41;</span>;
print<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">'ja'</span><span style="color: #66cc66;">&#41;</span>;print<span style="color: #66cc66;">&#40;</span>ans.<span style="color: #202020;">ja</span><span style="color: #66cc66;">&#41;</span>;
print<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">'ia'</span><span style="color: #66cc66;">&#41;</span>;print<span style="color: #66cc66;">&#40;</span>ans.<span style="color: #202020;">ia</span><span style="color: #66cc66;">&#41;</span>;
<span style="color: #66cc66;">&#125;</span>
rd.<span style="color: #202020;">o</span> <span style="color: #66cc66;">=</span> new<span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">&quot;matrix.csr&quot;</span>, ra <span style="color: #66cc66;">=</span> ans.<span style="color: #202020;">ra</span>, ja <span style="color: #66cc66;">=</span> ans.<span style="color: #202020;">ja</span>, ia <span style="color: #66cc66;">=</span> ans.<span style="color: #202020;">ia</span>, dimension <span style="color: #66cc66;">=</span> dimension<span style="color: #66cc66;">&#41;</span>
<span style="color: #66cc66;">&#125;</span></pre></div></div>
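<p>If the <code>ra</code>/<code>ja</code>/<code>ia</code> triple built above looks opaque, the sketch below writes it out by hand for the 5x5 example and pulls single rows back out.  It uses base R only and does not touch SparseM; <code>csr.row</code> is a hypothetical helper, not part of any package.</p>

```r
# CSR arrays for the 5x5 example matrix (1-based, matching the reader above).
ra = c(9, 9,  6, 5, 7, 8, 6,  4, 5,  7, 6, 7,  9, 8, 9)  # non-null values, row by row
ja = c(2, 4,  1, 2, 3, 4, 5,  1, 2,  1, 2, 3,  1, 2, 3)  # column index of each value
ia = c(1, 3, 8, 10, 13, 16)  # where each row starts in ra/ja; last entry is nnz + 1

# Hypothetical helper: expand row i of the CSR triple into a dense vector.
csr.row = function(i, ncol) {
  n = ia[i + 1] - ia[i]          # number of stored entries in row i
  idx = ia[i] - 1 + seq_len(n)   # positions in ra/ja; empty when n == 0
  out = numeric(ncol)
  out[ja[idx]] = ra[idx]
  out
}

print(csr.row(2, 5))
print(csr.row(3, 5))
```

<p>An empty row simply repeats the previous <code>ia</code> value, which is what the <code>else</code> branch in the reader does.</p>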

<p>This let me just read the &#8220;mat&#8221; file directly into R.  After that, the conversion to a graph object seems to work okay.  I say seems to because <strike>I&#8217;m still waiting</strike> for the SparseM -&gt; graph conversion routine to finish.  It&#8217;s a 50K x 50K matrix with about 2 million edges, so it&#8217;s taking a little while&#8230;</p>
<p>It took about as long to convert as it took me to write this post.  Everything is fine.  Now I can get back to doing all-by-all <a href="http://en.wikipedia.org/wiki/Dijkstra's_algorithm">Dijkstra</a> on the graph, or at least to finding a reasonably fast way to allow for one-off queries.</p>
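<p>For reference, a one-off (single-source) query can be sketched in a few lines of base R.  This is a plain O(n^2) illustration on a tiny made-up graph, not the RBGL/Boost routine, and it assumes non-negative edge weights with 0 meaning &#8220;no edge&#8221;; <code>dijkstra.one</code> is a hypothetical name.</p>

```r
# Single-source Dijkstra on a dense weight matrix; 0 means "no edge".
# Assumes non-negative weights. For illustration only.
dijkstra.one = function(w, source) {
  n = nrow(w)
  dist = rep(Inf, n)
  done = rep(FALSE, n)
  dist[source] = 0
  for (step in seq_len(n)) {
    cand = which(!done)
    if (length(cand) == 0) break
    u = cand[which.min(dist[cand])]      # closest unfinished node
    if (is.infinite(dist[u])) break      # remaining nodes are unreachable
    done[u] = TRUE
    nbrs = which(w[u, ] != 0)            # outgoing edges of u
    dist[nbrs] = pmin(dist[nbrs], dist[u] + w[u, nbrs])
  }
  dist
}

# Tiny hypothetical graph: 1 to 2 (weight 7), 1 to 3 (2), 3 to 2 (3), 2 to 4 (1).
w = matrix(0, 4, 4)
w[1, 2] = 7; w[1, 3] = 2; w[3, 2] = 3; w[2, 4] = 1
d = dijkstra.one(w, 1)
print(d)
```

<p>Here the shortest route from node 1 to node 2 goes through node 3 (2 + 3 = 5) rather than over the direct edge of weight 7.  On a 50K x 50K matrix a dense version like this would not be practical, which is the reason for reaching for Boost in the first place.</p>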
]]></content:encoded>
			<wfw:commentRss>http://www.spicylogic.com/allenday/blog/2008/05/05/sparse-matrices-in-r/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
