<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Allen Day's Blog &#187; Java</title>
	<atom:link href="http://www.spicylogic.com/allenday/blog/category/computing/software/java/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.spicylogic.com/allenday/blog</link>
	<description>♥data♥</description>
	<lastBuildDate>Mon, 21 Jun 2010 23:28:18 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Synthetic GFF Dataset for Genome Browser Benchmark</title>
		<link>http://www.spicylogic.com/allenday/blog/2009/04/07/synthetic-gff-dataset-for-genome-browser-benchmark/</link>
		<comments>http://www.spicylogic.com/allenday/blog/2009/04/07/synthetic-gff-dataset-for-genome-browser-benchmark/#comments</comments>
		<pubDate>Tue, 07 Apr 2009 08:01:52 +0000</pubDate>
		<dc:creator>allenday</dc:creator>
				<category><![CDATA[Genomics]]></category>
		<category><![CDATA[Informatics]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Perl]]></category>
		<category><![CDATA[Scalability]]></category>
		<category><![CDATA[Science]]></category>

		<guid isPermaLink="false">http://www.spicylogic.com/allenday/blog/2009/04/07/synthetic-gff-dataset-for-genome-browser-benchmark/</guid>
		<description><![CDATA[I deployed a Gbrowse/Chado installation last week at Dow Agrosciences.  It got me thinking about how slow and basic the searches are with the Bio::DB::Das::Chado* adaptor, and wouldn&#8217;t it be nice to use SOLR here?
I made up a test dataset of gene/mRNA/exon 3-tiered feature groups by permuting some gene model data from the knownGene annotation [...]]]></description>
			<content:encoded><![CDATA[<p>I deployed a Gbrowse/Chado installation last week at <a href="http://www.dowagro.com/">Dow Agrosciences</a>.  It got me thinking about how slow and basic the searches are with the Bio::DB::Das::Chado* adaptor, and wouldn&#8217;t it be nice to use <a href="http://lucene.apache.org/solr/">SOLR</a> here?</p>
<p>I made up a test dataset of gene/mRNA/exon 3-tiered feature groups by permuting some gene model data from the <a href="http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/">knownGene annotation set</a> of the Hg18 build of the human genome.  You can grab the data set and script used to generate it <a href="http://www.spicylogic.com/allenday/images/knownGene/">here</a>.  There are several files mRNA.E<strong>N</strong>.txt.gz that contain gzipped gene models, where <strong>N</strong>=3..7 indicates there are 10^<strong>N</strong> models in the file, uniformly distributed across a 500-megabase reference sequence.</p>
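<p>For readers who want a feel for the layout without downloading the files, here is a minimal Java sketch of the same idea: gene/mRNA/exon groups placed uniformly at random on a 500-megabase reference.  It is not the script linked above.  The class name, the <code>ref</code> sequence name, the <code>synthetic</code> source column, the single-exon models and the GFF3-style output are all assumptions made for illustration; the real files are permuted from knownGene models.</p>

```java
import java.io.PrintStream;
import java.util.Random;

class SyntheticGeneModels {

    // 500 megabases, as stated in the post.
    static final long REFERENCE_LENGTH = 500_000_000L;

    /** Picks a 1-based start so that the whole model fits on the reference. */
    static long randomStart(Random rng, long referenceLength, long modelLength) {
        long maxStart = referenceLength - modelLength + 1;
        return 1 + (long) (rng.nextDouble() * maxStart);
    }

    /** Builds one gene/mRNA/exon group as three tab-delimited GFF3-style rows. */
    static String[] modelLines(String seqId, int id, long start, long length) {
        long end = start + length - 1;
        String gene = "gene" + id;
        String mrna = "mRNA" + id;
        return new String[] {
            row(seqId, "gene", start, end, "ID=" + gene),
            row(seqId, "mRNA", start, end, "ID=" + mrna + ";Parent=" + gene),
            row(seqId, "exon", start, end, "Parent=" + mrna),
        };
    }

    static String row(String seqId, String type, long start, long end, String attributes) {
        // Columns: seqid, source, type, start, end, score, strand, phase, attributes.
        return String.join("\t", seqId, "synthetic", type,
                Long.toString(start), Long.toString(end), ".", "+", ".", attributes);
    }

    /** Streams n models to out; the seed makes a run reproducible. */
    static void write(int n, long modelLength, long seed, PrintStream out) {
        Random rng = new Random(seed);
        out.println("##gff-version 3");
        for (int i = 0; i < n; i++) {
            long start = randomStart(rng, REFERENCE_LENGTH, modelLength);
            for (String line : modelLines("ref", i, start, modelLength)) {
                out.println(line);
            }
        }
    }

    public static void main(String[] args) {
        // 10^3 models of 10 kb each (hypothetical sizes).
        write(1000, 10_000L, 42L, System.out);
    }
}
```

<p>Rows are streamed rather than collected, so the same loop should cope with the 10^7 case without holding every model in memory.</p>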
<p>I&#8217;m planning to load these data into a couple of different systems and then compare performance on some of the typical Bio::DB::GFF API calls.  I can personally test on:</p>
<ul>
<li>Chado</li>
<li>The default Bio::DB::GFF schema (does it have a name?)</li>
<li>The SOLR backend I&#8217;m about to implement</li>
</ul>
<p>I know there are other feature DBs out there.  It would be good to include them as well in a later pass or to have someone else contribute the data once I get the benchmarking script written.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.spicylogic.com/allenday/blog/2009/04/07/synthetic-gff-dataset-for-genome-browser-benchmark/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Taste item-item recommender example</title>
		<link>http://www.spicylogic.com/allenday/blog/2009/02/11/taste-item-item-recommender-example/</link>
		<comments>http://www.spicylogic.com/allenday/blog/2009/02/11/taste-item-item-recommender-example/#comments</comments>
		<pubDate>Wed, 11 Feb 2009 22:10:00 +0000</pubDate>
		<dc:creator>allenday</dc:creator>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Mahout]]></category>

		<guid isPermaLink="false">http://www.spicylogic.com/allenday/blog/2009/02/11/taste-item-item-recommender-example/</guid>
		<description><![CDATA[I threw together a Mahout/Taste-based item-item recommender last night.

	public static void itemItemRecommendations&#40;String path, String file&#41; &#123;
		File f = new File&#40;path, file&#41;;
	    try &#123;
			DataModel model = new FileDataModel&#40;f&#41;;
			model.refresh&#40;null&#41;;
		    ItemSimilarity itemSimilarity = new LogLikelihoodSimilarity&#40;model&#41;;
		    ItemBasedRecommender itemRecommender = new GenericItemBasedRecommender&#40;model, itemSimilarity&#41;;
		    for &#40; Item [...]]]></description>
			<content:encoded><![CDATA[<p>I threw together a Mahout/Taste-based item-item recommender last night.</p>

<div class="wp_syntax"><div class="code"><pre class="java">	<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #993333;">void</span> itemItemRecommendations<span style="color: #66cc66;">&#40;</span><span style="color: #aaaadd; font-weight: bold;">String</span> path, <span style="color: #aaaadd; font-weight: bold;">String</span> file<span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
		<span style="color: #aaaadd; font-weight: bold;">File</span> f = <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #aaaadd; font-weight: bold;">File</span><span style="color: #66cc66;">&#40;</span>path, file<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
	    <span style="color: #000000; font-weight: bold;">try</span> <span style="color: #66cc66;">&#123;</span>
			DataModel model = <span style="color: #000000; font-weight: bold;">new</span> FileDataModel<span style="color: #66cc66;">&#40;</span>f<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
			model.<span style="color: #006600;">refresh</span><span style="color: #66cc66;">&#40;</span><span style="color: #000000; font-weight: bold;">null</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
		    ItemSimilarity itemSimilarity = <span style="color: #000000; font-weight: bold;">new</span> LogLikelihoodSimilarity<span style="color: #66cc66;">&#40;</span>model<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
		    ItemBasedRecommender itemRecommender = <span style="color: #000000; font-weight: bold;">new</span> GenericItemBasedRecommender<span style="color: #66cc66;">&#40;</span>model, itemSimilarity<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
		    <span style="color: #b1b100;">for</span> <span style="color: #66cc66;">&#40;</span> Item i : model.<span style="color: #006600;">getItems</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>
			    <span style="color: #b1b100;">for</span> <span style="color: #66cc66;">&#40;</span> RecommendedItem j : itemRecommender.<span style="color: #006600;">mostSimilarItems</span><span style="color: #66cc66;">&#40;</span>i.<span style="color: #006600;">getID</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span>, <span style="color: #cc66cc;">50</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#41;</span>
			    	<span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span> j.<span style="color: #006600;">getValue</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&gt;</span>= <span style="color: #cc66cc;">0.7</span> <span style="color: #66cc66;">&#41;</span>
			    		<span style="color: #aaaadd; font-weight: bold;">System</span>.<span style="color: #006600;">out</span>.<span style="color: #006600;">println</span><span style="color: #66cc66;">&#40;</span>i.<span style="color: #006600;">getID</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span> + <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span> + j.<span style="color: #006600;">getItem</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span>.<span style="color: #006600;">getID</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span> + <span style="color: #ff0000;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span> + <span style="color: #aaaadd; font-weight: bold;">String</span>.<span style="color: #006600;">format</span><span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">&quot;%.3f&quot;</span>, j.<span style="color: #006600;">getValue</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
		<span style="color: #66cc66;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span> <span style="color: #66cc66;">&#40;</span><span style="color: #aaaadd; font-weight: bold;">FileNotFoundException</span> e<span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
			<span style="color: #808080; font-style: italic;">// TODO Auto-generated catch block</span>
			e.<span style="color: #006600;">printStackTrace</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
		<span style="color: #66cc66;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span> <span style="color: #66cc66;">&#40;</span>TasteException e<span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
			<span style="color: #808080; font-style: italic;">// TODO Auto-generated catch block</span>
			e.<span style="color: #006600;">printStackTrace</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
		<span style="color: #66cc66;">&#125;</span>
	<span style="color: #66cc66;">&#125;</span></pre></div></div>

<p>This outputs <code>item1 --recommends--&gt; item2</code> pairs with a weight.  I&#8217;m putting these into a Solr document so I can display the related item2s alongside item1 when it&#8217;s viewed.</p>
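<p>Before indexing, the tab-delimited lines printed by the method above could be grouped per item, strongest match first.  The sketch below shows one way to do that in plain Java.  The <code>RelatedItems</code> and <code>Related</code> names are hypothetical, and the step that writes the result into Solr is deliberately left out.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RelatedItems {

    /** One recommended item together with its similarity weight. */
    static final class Related {
        final String itemId;
        final double weight;

        Related(String itemId, double weight) {
            this.itemId = itemId;
            this.weight = weight;
        }
    }

    /** Groups "item1 TAB item2 TAB weight" lines by item1, highest weight first. */
    static Map<String, List<Related>> group(List<String> lines) {
        Map<String, List<Related>> byItem = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length != 3) {
                continue; // skip anything that is not a three-column row
            }
            byItem.computeIfAbsent(fields[0], key -> new ArrayList<>())
                  .add(new Related(fields[1], Double.parseDouble(fields[2])));
        }
        for (List<Related> related : byItem.values()) {
            related.sort((a, b) -> Double.compare(b.weight, a.weight));
        }
        return byItem;
    }

    public static void main(String[] args) {
        // Made-up sample rows in the same shape as the recommender output.
        List<String> sample = Arrays.asList("5212\t5213\t0.812", "5212\t5300\t0.954");
        group(sample).forEach((item, related) ->
                System.out.println(item + " has " + related.size() + " related items"));
    }
}
```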
<p>Input data are comma-delimited <code>&lt;userID,itemID,score&gt;</code> tuples like so:</p>
<pre>
1fe7401b81eed49353d0cbeba5383848,5212,0.6
3c1832954a6e8781836fed670bb37b24,5212,1
70273e4c7c77700ee97acb8d0306c405,5213,0.8
1f057ccde135acbc881008bbf466e7e1,5213,1
51d44c7baca65ad39d11ba87bf2d438b,5213,1
adc924559b37114cd97d1f5cf7c71419,5213,1
78e254b4a11e61d76ff63cea02de4de8,5213,1
5c373ec7d9ad4a6f392c291d8ccba5ce,5213,0.2
fab8537564094fa8885f6214e6b682e1,5213,1
127f46aabcdbc2d2d04da8398a996c75,5213,1
</pre>
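<p>If you need to sanity-check a file like this before handing it to the recommender, a small parser is enough.  This is only a sketch: the <code>PreferenceTuple</code> class is hypothetical and is not part of Mahout, and it assumes exactly three comma-separated fields per line with no quoting.</p>

```java
class PreferenceTuple {
    final String userId;
    final String itemId;
    final double score;

    PreferenceTuple(String userId, String itemId, double score) {
        this.userId = userId;
        this.itemId = itemId;
        this.score = score;
    }

    /** Parses one "userID,itemID,score" line and rejects anything else. */
    static PreferenceTuple parse(String line) {
        String[] fields = line.trim().split(",");
        if (fields.length != 3) {
            throw new IllegalArgumentException("expected userID,itemID,score but got: " + line);
        }
        return new PreferenceTuple(fields[0], fields[1], Double.parseDouble(fields[2]));
    }

    public static void main(String[] args) {
        PreferenceTuple t = parse("1fe7401b81eed49353d0cbeba5383848,5212,0.6");
        System.out.println(t.userId + " scored item " + t.itemId + " at " + t.score);
    }
}
```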
<p>Works great.  Thanks Sean.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.spicylogic.com/allenday/blog/2009/02/11/taste-item-item-recommender-example/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Parallel DNS reverse lookups</title>
		<link>http://www.spicylogic.com/allenday/blog/2008/11/10/parallel-dns-reverse-lookups/</link>
		<comments>http://www.spicylogic.com/allenday/blog/2008/11/10/parallel-dns-reverse-lookups/#comments</comments>
		<pubDate>Mon, 10 Nov 2008 20:43:53 +0000</pubDate>
		<dc:creator>allenday</dc:creator>
				<category><![CDATA[Distributed Systems]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://www.spicylogic.com/allenday/blog/2008/11/10/parallel-dns-reverse-lookups/</guid>
		<description><![CDATA[Need to do lots of reverse DNS lookups for some reason?  Maybe b/c you&#8217;re trying to get a seed list for a web crawl or hack attempt on a bunch of ISPs.  Who cares.  Here&#8217;s a quick way to generate names from a big list of IPs like:

1.1.1.1
1.1.1.2
[...]
254.254.254.253
254.254.254.254

We can use hadoop streaming [...]]]></description>
			<content:encoded><![CDATA[<p>Need to do lots of reverse DNS lookups for some reason?  Maybe b/c you&#8217;re trying to get a seed list for a web crawl or hack attempt on a bunch of ISPs.  Who cares.  Here&#8217;s a quick way to generate names from a big list of IPs like:</p>
<pre>
1.1.1.1
1.1.1.2
[...]
254.254.254.253
254.254.254.254
</pre>
<p>We can use Hadoop streaming to chunk the list so the DNS lookups run in parallel.  It is easy and requires little to no thought:</p>

<div class="wp_syntax"><div class="code"><pre class="bash">.<span style="color: #000000; font-weight: bold;">/</span>bin<span style="color: #000000; font-weight: bold;">/</span>hadoop jar contrib<span style="color: #000000; font-weight: bold;">/</span>streaming<span style="color: #000000; font-weight: bold;">/*</span>-streaming.jar -input <span style="color: #000000; font-weight: bold;">/</span>home<span style="color: #000000; font-weight: bold;">/</span>aday<span style="color: #000000; font-weight: bold;">/</span>classC.dat -output <span style="color: #000000; font-weight: bold;">/</span>home<span style="color: #000000; font-weight: bold;">/</span>aday<span style="color: #000000; font-weight: bold;">/</span>classC_dns.dat -mapper <span style="color: #ff0000;">'perl -ne '</span>\<span style="color: #ff0000;">''</span>print `host <span style="color: #007800;">$_</span>`<span style="color: #ff0000;">'<span style="color: #000099; font-weight: bold;">\'</span>'</span><span style="color: #ff0000;">' -numReduceTasks 0</span></pre></div></div>

<p>We wrap the <code>host</code> call in Perl backticks so that a failed lookup (non-zero exit code) does not kill the mapper; Perl simply prints whatever <code>host</code> wrote to stdout, including its error message.</p>
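<p>For a list that fits on one machine, the same fan-out can be sketched in plain Java with a thread pool instead of Hadoop streaming.  This is an alternative technique, not what the command above does.  The <code>ParallelReverseDns</code> class and its <code>resolveAll</code> helper are hypothetical names; the lookup itself uses the standard <code>InetAddress.getCanonicalHostName()</code>, which falls back to the textual IP when no name is found.</p>

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelReverseDns {

    /** Reverse lookup for one address; errors come back as text, like the host output. */
    static String reverseLookup(String ip) {
        try {
            return InetAddress.getByName(ip).getCanonicalHostName();
        } catch (UnknownHostException e) {
            return "ERROR: " + e.getMessage();
        }
    }

    /** Runs resolver over all addresses on a fixed pool and keeps the input order. */
    static Map<String, String> resolveAll(List<String> ips,
                                          Function<String, String> resolver,
                                          int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            for (String ip : ips) {
                Callable<String> task = () -> resolver.apply(ip);
                pending.put(ip, pool.submit(task));
            }
            Map<String, String> names = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> entry : pending.entrySet()) {
                names.put(entry.getKey(), entry.getValue().get());
            }
            return names;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for lookups", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("lookup task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> ips = Arrays.asList("127.0.0.1", "127.0.0.2");
        resolveAll(ips, ParallelReverseDns::reverseLookup, 8)
                .forEach((ip, name) -> System.out.println(ip + "\t" + name));
    }
}
```

<p>Passing the resolver in as a function keeps the pool logic testable without touching the network.</p>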
]]></content:encoded>
			<wfw:commentRss>http://www.spicylogic.com/allenday/blog/2008/11/10/parallel-dns-reverse-lookups/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Java port of GNU getopt</title>
		<link>http://www.spicylogic.com/allenday/blog/2008/10/22/java-port-of-gnu-getopt/</link>
		<comments>http://www.spicylogic.com/allenday/blog/2008/10/22/java-port-of-gnu-getopt/#comments</comments>
		<pubDate>Thu, 23 Oct 2008 02:12:10 +0000</pubDate>
		<dc:creator>allenday</dc:creator>
				<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://www.spicylogic.com/allenday/blog/2008/10/22/java-port-of-gnu-getopt/</guid>
		<description><![CDATA[This looks useful
http://www.urbanophile.com/arenn/hacking/getopt/gnu.getopt.Getopt.html
]]></description>
			<content:encoded><![CDATA[<p>This looks useful<br />
<a href="http://www.urbanophile.com/arenn/hacking/getopt/gnu.getopt.Getopt.html">http://www.urbanophile.com/arenn/hacking/getopt/gnu.getopt.Getopt.html</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.spicylogic.com/allenday/blog/2008/10/22/java-port-of-gnu-getopt/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Thoughts on Hadoop JobTracker/TaskTracker Scheduling</title>
		<link>http://www.spicylogic.com/allenday/blog/2008/09/11/thoughts-on-hadoop-tasktracker-jobtrackerscheduling/</link>
		<comments>http://www.spicylogic.com/allenday/blog/2008/09/11/thoughts-on-hadoop-tasktracker-jobtrackerscheduling/#comments</comments>
		<pubDate>Fri, 12 Sep 2008 01:07:59 +0000</pubDate>
		<dc:creator>allenday</dc:creator>
				<category><![CDATA[Distributed Systems]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Random musings]]></category>
		<category><![CDATA[SGE]]></category>

		<guid isPermaLink="false">http://www.spicylogic.com/allenday/blog/?p=67</guid>
		<description><![CDATA[Had a brief, interesting conversation on freenode #hadoop today with Rapleaf Engineer Nathan Marz about scheduling in Hadoop.
Pretty much supports my sense that scheduling is not Hadoop&#8217;s strong suit.  It&#8217;s really pretty shitty.  Would be great to see some more cross-pollination between the Beowulf (SGE, PBS, Globus) and MapReduce (Hadoop, HBase) communities. [...]]]></description>
			<content:encoded><![CDATA[<p>Had a brief, interesting conversation on freenode #hadoop today with <a href="http://blog.rapleaf.com/2008/06/11/rapleafs-newest-engineer-nathan-marz/">Rapleaf Engineer Nathan Marz</a> about scheduling in Hadoop.</p>
<p>Pretty much supports my sense that scheduling is not Hadoop&#8217;s strong suit.  It&#8217;s really pretty shitty.  Would be great to see some more cross-pollination between the Beowulf (SGE, PBS, Globus) and MapReduce (Hadoop, HBase) communities.  The former have more mature scheduling, resource management and permissions models.  They don&#8217;t really do a good job of providing a framework for distributed, parallel computing at the application level, though: everything is roll-your-own.  Perhaps Hadoop could be integrated as a parallel environment to consume resources from an SGE master [<a href="http://www.spicylogic.com/allenday/blog/2008/09/03/sge-hadoop-integration/">1</a>, <a href="http://www.spicylogic.com/allenday/blog/2008/08/08/hadoop-sge-grid-engine-convergence/">2</a>] rather than managing its own mapper/reducer pools.</p>
<p>A less ambitious improvement is to change how the Hadoop scheduler allocates map and reduce slots.  The main itch I&#8217;m trying to scratch right now is the coupling of map and reduce allocation: a job at the head of the queue can hold reduce slots while it is still mapping, which starves the jobs behind it.  There are cases where it seems this shouldn&#8217;t be done.  Read the dialog with Nathan below if you care to know more.</p>
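<p>The effect can be illustrated with a toy simulation.  This is a sketch under strong assumptions, not Hadoop&#8217;s scheduler: time moves in ticks, a job takes all of its reduce slots at once, its reducers do no useful work until its maps are done (the copy stage is ignored), and every name in the code is hypothetical.  The <code>waitForMaps</code> flag switches between handing out slots eagerly in FIFO order and withholding them until a job&#8217;s maps have finished.</p>

```java
import java.util.Arrays;
import java.util.List;

class ReduceSlotSimulation {

    /** Toy job: maps finish at tick mapDoneAt, then the reduce phase needs reduceTicks ticks. */
    static final class Job {
        final String name;
        final int mapDoneAt;
        final int reducers;
        int workLeft;
        int slotsHeld = 0;
        int finishedAt = -1;

        Job(String name, int mapDoneAt, int reducers, int reduceTicks) {
            this.name = name;
            this.mapDoneAt = mapDoneAt;
            this.reducers = reducers;
            this.workLeft = reduceTicks;
        }
    }

    /** Runs the queue to completion and records each job's finishedAt tick. */
    static void simulate(List<Job> queue, int totalSlots, boolean waitForMaps) {
        for (Job j : queue) {
            if (j.reducers > totalSlots || j.workLeft <= 0) {
                throw new IllegalArgumentException("job cannot run: " + j.name);
            }
        }
        int free = totalSlots;
        int unfinished = queue.size();
        for (int tick = 0; unfinished > 0; tick++) {
            // Allocation pass, in submission (FIFO) order.
            for (Job j : queue) {
                boolean waiting = j.finishedAt < 0 && j.slotsHeld == 0;
                boolean eligible = !waitForMaps || tick >= j.mapDoneAt;
                if (waiting && eligible && free >= j.reducers) {
                    j.slotsHeld = j.reducers;
                    free -= j.reducers;
                }
            }
            // Work pass: held slots only make progress once the maps are done.
            for (Job j : queue) {
                if (j.slotsHeld > 0 && tick >= j.mapDoneAt) {
                    j.workLeft--;
                    if (j.workLeft == 0) {
                        j.finishedAt = tick + 1;
                        free += j.slotsHeld;
                        j.slotsHeld = 0;
                        unfinished--;
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        for (boolean waitForMaps : new boolean[] {false, true}) {
            Job crawl = new Job("crawl", 10, 4, 2); // long map phase, submitted first
            Job small = new Job("small", 1, 4, 2);  // short job stuck behind it
            simulate(Arrays.asList(crawl, small), 4, waitForMaps);
            System.out.println("waitForMaps=" + waitForMaps
                    + " crawl=" + crawl.finishedAt + " small=" + small.finishedAt);
        }
    }
}
```

<p>With four slots, the eager run finishes the crawl job at tick 12 and the small job at tick 14.  Withholding slots until maps finish lets the small job complete at tick 3, while the crawl job still finishes at tick 12.</p>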
<table class="msg-table" border="0">
<tbody>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>is it possible to decouple mapper and reducer slot allocation for jobs?</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>i mean, if a job is #1 in the MR queue, but it is not yet ready to reduce, can it be prevented from consuming reducer slots?</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>allenday: i think that would be hard&#8230; reducing starts while the mapping is happening (copy stage)</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>nathanmarz, i frequently find that while the reduce has &#8220;started&#8221;, it can just sit there for a long time doing nothing</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>this is most common with nutch</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>so there could be a bunch of other jobs further back in the queue that get starved for reduces b/c the head of the queue is squatting on the slots</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>it just sits there in the reduce phase?</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>for sure nutch does, yeah</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>during fetch, when it&#8217;s crawling</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>i see</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>i don&#8217;t have that much familiarity with nutch</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>is it possible to increase the number of reducers?</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>yep, but then you can get into i/o trouble later</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>for the job i mean, not the cluster</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>oh</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>it sounds like you propose having these squatters consume minimal # of reducers (e.g. only 1)</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>actually, the opposite</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>let&#8217;s say you have 16 reduce slots</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>and the job is set to use 16 reducers</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>each one of those reducers potentially has to go over a lot of data</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>if the job is instead set to use a lot more reducers, like 100 or something</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>then an individual reducer will go a lot faster</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>and potentially, those freed reduce slots will go to jobs with higher priority</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>ok, so you introduce priority to bump the jobs further back ahead in the queue</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>yea</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>is that settable in jobconf?</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>you can set num reducers</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>so let&#8217;s suppose the job that squats on reduce slots gets to the head of the queue. regardless of if it has 16 or 100 reducers configured</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>JobConf#setNumReduceTasks</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>and that it is still in map phase only.  has not begun reducing yet</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>until one of those reduces finishes (i.e. the map has finished) all slots are still filled</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>it&#8217;s only when the first reduce finishes that the job at #2 can take over a reduce slot</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>right</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>yea that&#8217;s true</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>that&#8217;s bad</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>this scheme doesn&#8217;t help until mappers finished</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>you really want this #1 job</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>when it is allocating reducers</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>to have low priority in acquiring the slots</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>right</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>well you don&#8217;t want it to acquire any slots until mappers finish</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>so you give reduce slots to #2, #3, #4, etc.  until everyone who wants slots has them.  then you assign to #1</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>or until #1 is ready&#8230;</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>is it just me or does the queueing system in hadoop kind of suck?</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>i am coming here from sun grid which puts a lot of emphasis on this aspect</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>well, the priority system will work if you start job #1 after the other jobs</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>if you start the other jobs after #1 then they will get starved of reducers</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>heh, but the whole reason it is in #1 is because it was submitted first, right?  isn&#8217;t hadoop FIFO wrt jobs?</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>if they&#8217;re the same priority</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>so maybe decreasing the reducers job #1 uses is the way to go</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>set it so it doesn&#8217;t use all the reduce slots on the cluster</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>i need to do some research to see if there are jira open for improving the scheduler. or if there are some commercial plugins to improve the scheduling</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>definitely room for improvement, agreed</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>yeah, that was what i thought you meant initially.  it&#8217;s a hack too though, and breaks down when the number of jobs gets large</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>i&#8217;m surprised they are coupled.  do you understand how it works when the mapper hands off to the reducer?</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>b/c i don&#8217;t and i need to</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>yes</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>can i get the 2min version?</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>the reason the reducers start while the mappers are running is because there&#8217;s some work they can do without all the map data</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>each reducer needs to copy the relevant outputs from all the mappers to its machine</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>this is called the &#8220;copy&#8221; phase and can occur in parallel with mapping</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>ok, i&#8217;ve seen that</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>so what we need is a flag that indicates there will be no data to copy until maps all finish</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>yea</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>a flag that says not to pipeline the process</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>default behavior is to have the flag off and copy greedily</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>which is like it does now</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>turn the flag on says to wait until upstream map finishes before grabbing a reduce slot and kicking off the copy</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>**all upstream maps</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span><a class="chatzilla-link" href="http://hadoop.apache.org/core/docs/current/hadoop-default.html" target="_content">http://hadoop.apache.org/core/docs/current/hadoop-default.html</a></span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>those are all the hadoop config parameters</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>you might be able to find something in there</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>yeah, i find goodies in there every time i read that page<span class="chatzilla-emote-txt"> <img src='http://www.spicylogic.com/allenday/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </span></span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>i am only ~1mo into hadoop</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>here&#8217;s another scheduling related question/issue i&#8217;m having</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>i find that job i/o and cpu usage tend to synchronize after a while</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>b/c if there is a slow moving job in the queue, all the others tend to get jammed behind it</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>have you seen this?</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>no, i haven&#8217;t</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>but that&#8217;s interesting</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>it comes back to resource (mis)allocation by the scheduler</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><a class="chatzilla-link" href="irc://irc.freenode.net/nathanmarz,isnick"><span>nathanmarz</span></a></td>
<td class="msg-data" colspan="5"><span>how are you measuring that?</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>it&#8217;s this same issue where jobs will consume all the slots</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>so if you have a slow moving thing blocking all the resources</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>no one else can get past</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>then when the slow moving job finishes, the others all start getting processed very quickly (high cpu load during map), then as they begin to finish there is a flurry of i/o</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>it&#8217;s like congestion on the freeway where one car slams on the breaks it sends this wave of traffic jam behind it</span></td>
</tr>
<tr class="msg">
<td class="msg-timestamp"></td>
<td class="msg-user"><span>allenday</span></td>
<td class="msg-data" colspan="5"><span>assuming the freeway is already close to capacity (not sparse)</span></td>
</tr>
</tbody>
</table>
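<p>The flag discussed in the chat can be illustrated with a toy model. This is only a minimal sketch, not Hadoop code: the class <code>ReduceSlotSketch</code>, its methods and the <code>slowstart</code> parameter are hypothetical names made up here. It hands out reduce slots in FIFO order and skips any job that has not yet finished the required fraction of its maps. Later Hadoop releases document a job property in this spirit, <code>mapred.reduce.slowstart.completed.maps</code>, so check the configuration reference for your version before relying on it.</p>

<div class="wp_syntax"><div class="code"><pre>
```java
import java.util.Arrays;

final class ReduceSlotSketch {
  // True once enough of a job's maps are done for its reducers to be scheduled.
  // slowstart is a fraction between 0.0 (copy greedily) and 1.0 (wait for all maps).
  static boolean mayStartReducers(int doneMaps, int totalMaps, double slowstart) {
    return doneMaps >= Math.ceil(slowstart * totalMaps);
  }

  // Hand out free reduce slots to jobs in FIFO order; returns the slots granted per job.
  static int[] assign(int freeSlots, int[] wanted, int[] doneMaps, int[] totalMaps,
                      double slowstart) {
    int[] granted = new int[wanted.length];
    for (int job = 0; job != wanted.length; job++) {
      if (!mayStartReducers(doneMaps[job], totalMaps[job], slowstart)) {
        continue; // maps still running: leave the slots for later jobs
      }
      granted[job] = Math.min(wanted[job], freeSlots);
      freeSlots -= granted[job];
    }
    return granted;
  }

  public static void main(String[] args) {
    int[] wanted = {8, 2, 2};     // reduce slots each job asks for, job #1 first
    int[] done = {10, 50, 50};    // maps finished so far
    int[] total = {100, 50, 50};  // maps in each job
    // Flag off: job #1 grabs every slot while its maps are still running.
    System.out.println(Arrays.toString(assign(8, wanted, done, total, 0.0)));
    // Flag on: job #1 waits, so the jobs that are ready to reduce get slots.
    System.out.println(Arrays.toString(assign(8, wanted, done, total, 1.0)));
  }
}
```
</pre></div></div>

<p>With the flag off the first job is granted all eight slots and the others get none. With the flag on the first job gets nothing until its maps finish, and the two smaller jobs get two slots each.</p>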
]]></content:encoded>
			<wfw:commentRss>http://www.spicylogic.com/allenday/blog/2008/09/11/thoughts-on-hadoop-tasktracker-jobtrackerscheduling/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Cascading for Hadoop</title>
		<link>http://www.spicylogic.com/allenday/blog/2008/09/05/cascading-for-hadoop/</link>
		<comments>http://www.spicylogic.com/allenday/blog/2008/09/05/cascading-for-hadoop/#comments</comments>
		<pubDate>Fri, 05 Sep 2008 23:40:20 +0000</pubDate>
		<dc:creator>allenday</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://www.spicylogic.com/allenday/blog/2008/09/05/cascading-for-hadoop/</guid>
		<description><![CDATA[Need to check this out as an alternative to hadoop streaming
]]></description>
			<content:encoded><![CDATA[<p>Need to check this out as an alternative to hadoop <a href="http://blog.rapleaf.com/dev/?p=33">streaming</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.spicylogic.com/allenday/blog/2008/09/05/cascading-for-hadoop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SGE / Hadoop integration</title>
		<link>http://www.spicylogic.com/allenday/blog/2008/09/03/sge-hadoop-integration/</link>
		<comments>http://www.spicylogic.com/allenday/blog/2008/09/03/sge-hadoop-integration/#comments</comments>
		<pubDate>Thu, 04 Sep 2008 00:10:17 +0000</pubDate>
		<dc:creator>allenday</dc:creator>
				<category><![CDATA[Distributed Systems]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[SGE]]></category>

		<guid isPermaLink="false">http://www.spicylogic.com/allenday/blog/2008/09/03/sge-hadoop-integration/</guid>
		<description><![CDATA[Yet another interesting blog post I&#8217;ve found today on integrating Hadoop and Sun Grid Engine.
]]></description>
			<content:encoded><![CDATA[<p>Yet another interesting blog post I&#8217;ve found today on <a href="http://blogs.sun.com/ravee/entry/creating_hadoop_pe_under_sge">integrating Hadoop and Sun Grid Engine</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.spicylogic.com/allenday/blog/2008/09/03/sge-hadoop-integration/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Quality Control and Monitoring at Last.FM</title>
		<link>http://www.spicylogic.com/allenday/blog/2008/09/03/quality-control-and-monitoring-at-lastfm/</link>
		<comments>http://www.spicylogic.com/allenday/blog/2008/09/03/quality-control-and-monitoring-at-lastfm/#comments</comments>
		<pubDate>Wed, 03 Sep 2008 21:39:18 +0000</pubDate>
		<dc:creator>allenday</dc:creator>
				<category><![CDATA[Distributed Systems]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Scalability]]></category>

		<guid isPermaLink="false">http://www.spicylogic.com/allenday/blog/2008/09/03/quality-control-and-monitoring-at-lastfm/</guid>
		<description><![CDATA[I found the Last.fm blog today.  They&#8217;re having a lot of fun with QC tools.  Worth a read!
]]></description>
			<content:encoded><![CDATA[<p>I found the Last.fm blog today.  They&#8217;re having a lot of fun with <a href="http://blog.last.fm/2008/08/01/quality-control">QC tools</a>.  Worth a read!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.spicylogic.com/allenday/blog/2008/09/03/quality-control-and-monitoring-at-lastfm/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hadoop streaming recipes</title>
		<link>http://www.spicylogic.com/allenday/blog/2008/09/02/hadoop-streaming-recipes/</link>
		<comments>http://www.spicylogic.com/allenday/blog/2008/09/02/hadoop-streaming-recipes/#comments</comments>
		<pubDate>Tue, 02 Sep 2008 19:18:31 +0000</pubDate>
		<dc:creator>allenday</dc:creator>
				<category><![CDATA[Distributed Systems]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://www.spicylogic.com/allenday/blog/?p=56</guid>
		<description><![CDATA[I started playing with Hadoop Streaming today because I needed to do the equivalent of the shell script

cat /some/input &#124; cut -f 1 &#124; sort &#124; uniq &gt; /some/output

on an HDFS file.
The basic thing you want to do to get a map working follows.  The general rule of thumb is that if there is [...]]]></description>
			<content:encoded><![CDATA[<p>I started playing with <a href="http://wiki.apache.org/hadoop/HadoopStreaming">Hadoop Streaming</a> today because I needed to do the equivalent of the shell script</p>

<div class="wp_syntax"><div class="code"><pre>cat /some/input | cut -f 1 | sort | uniq &gt; /some/output</pre></div></div>

<p>on an HDFS file.</p>
<p>The basic command for getting a map-only job working follows.  The general rule of thumb is that if each line of input independently produces one <i>or more</i> lines of output, then you don&#8217;t need to use any reducers, hence the <code>-numReduceTasks 0</code> option.</p>

<div class="wp_syntax"><div class="code"><pre>$HADOOP_HOME/bin/hadoop jar contrib/streaming/*-streaming.jar -input /some/input -output /some/output -mapper 'cut -f 1' -numReduceTasks 0</pre></div></div>

<p>In my case though, I wanted to <code>uniq</code>ify my list.  Putting <code>uniq</code> into the mapper chain would cause the job to fail.  Instead I had to drop the <code>-numReduceTasks 0</code> option and run <code>uniq</code> as the reducer, like so:</p>

<div class="wp_syntax"><div class="code"><pre>$HADOOP_HOME/bin/hadoop jar contrib/streaming/*-streaming.jar -input /some/input -output /some/output -mapper 'cut -f 1' -reducer 'uniq'</pre></div></div>

<p>Note also that I didn&#8217;t need to include the <code>sort</code> from my original shell command.  That&#8217;s because MapReduce sorts the map output by key before handing it to the reducers, so sorting is implicit.</p>
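<p>To make the data flow concrete, here is a minimal in-memory sketch of the three stages in plain Java. It does not use Hadoop at all: the class <code>StreamingSketch</code> and its methods are hypothetical names, and <code>Arrays.sort</code> merely stands in for the sort that the framework performs between map and reduce.</p>

<div class="wp_syntax"><div class="code"><pre>
```java
import java.util.Arrays;

final class StreamingSketch {
  // Mapper: the equivalent of `cut -f 1`, keeping the text before the first tab.
  static String map(String line) {
    int tab = line.indexOf('\t');
    return tab == -1 ? line : line.substring(0, tab);
  }

  // Reducer: the equivalent of `uniq`, dropping a line equal to the one before it.
  static String[] reduce(String[] sorted) {
    String[] out = new String[sorted.length];
    int n = 0;
    for (String key : sorted) {
      if (n == 0 || !key.equals(out[n - 1])) {
        out[n++] = key;
      }
    }
    return Arrays.copyOf(out, n);
  }

  // Map every line, sort the mapped keys, then reduce.
  static String[] run(String[] input) {
    String[] keys = new String[input.length];
    for (int i = 0; i != input.length; i++) {
      keys[i] = map(input[i]);
    }
    Arrays.sort(keys); // stands in for the framework's sort between map and reduce
    return reduce(keys);
  }

  public static void main(String[] args) {
    String[] input = {"b\t1", "a\t2", "b\t3", "a\t4"};
    System.out.println(Arrays.toString(run(input)));
  }
}
```
</pre></div></div>

<p>Because <code>reduce</code> only compares neighbouring lines, just like <code>uniq</code>, it relies on the sort having happened first.</p>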
<p>As usual, I&#8217;m new to all of this, so if you have any insights leave a comment.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.spicylogic.com/allenday/blog/2008/09/02/hadoop-streaming-recipes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using Nutch to download large binary media and image files</title>
		<link>http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/</link>
		<comments>http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/#comments</comments>
		<pubDate>Fri, 29 Aug 2008 07:29:34 +0000</pubDate>
		<dc:creator>allenday</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Nutch]]></category>

		<guid isPermaLink="false">http://www.spicylogic.com/allenday/blog/?p=55</guid>
		<description><![CDATA[Here&#8217;s a recipe for using Nutch to crawl some site(s) and extract out the images.  I&#8217;m blogging this because I couldn&#8217;t find (no surprise here, sigh) any documentation or complete examples via mailing list archives for how to do this.
Step 1: modify Nutch URL filters
Okay, so first thing, modify $NUTCH_HOME/conf/crawl-urlfilter.txt .  Let&#8217;s assume you only [...]]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s a recipe for using Nutch to crawl some site(s) and extract out the images.  I&#8217;m blogging this because I couldn&#8217;t find (no surprise here, sigh) any documentation or complete examples via mailing list archives for how to do this.</p>
<h2>Step 1: modify Nutch URL filters</h2>
<p>Okay, so first thing: modify <code>$NUTCH_HOME/conf/crawl-urlfilter.txt</code>.  Assuming you only care about JPEG images, change this line:</p>

<div class="wp_syntax"><div class="code"><pre>-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$</pre></div></div>

<p>to this:</p>

<div class="wp_syntax"><div class="code"><pre>-\.(gif|GIF|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$</pre></div></div>

<p>Also update the &#8220;MY.DOMAIN.NAME&#8221; section appropriately.</p>
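<p>If you want to check what the edited exclusion rule lets through before starting a crawl, a quick test in plain Java is enough. This is a minimal sketch: the class <code>UrlFilterSketch</code> is a hypothetical name, it models only this one rule rather than the whole filter file, and it assumes the rule is applied as an unanchored regex search.</p>

<div class="wp_syntax"><div class="code"><pre>
```java
import java.util.regex.Pattern;

final class UrlFilterSketch {
  // The edited suffix rule, minus the leading '-' that marks it as an exclusion.
  static final Pattern EXCLUDED = Pattern.compile(
      "\\.(gif|GIF|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$");

  // True when the URL ends in one of the excluded suffixes.
  static boolean rejected(String url) {
    return EXCLUDED.matcher(url).find();
  }

  public static void main(String[] args) {
    System.out.println(rejected("http://spicylogic.com/some-url.jpg")); // JPEG is no longer excluded
    System.out.println(rejected("http://spicylogic.com/some-url.png")); // PNG still is
  }
}
```
</pre></div></div>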
<h2>Step 2: set up crawl configuration</h2>
<p>Edit <code>$NUTCH_HOME/conf/nutch-site.xml</code>.  You want to update/add properties for <code>http.content.limit</code> and <code>file.content.limit</code> so that your big files don&#8217;t get truncated.  Look at <code>$NUTCH_HOME/conf/nutch-default.xml</code> for examples of how to do this.  You might also want to adjust <code>Protocol.CHECK_ROBOTS</code> <img src='http://www.spicylogic.com/allenday/blog/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<h2>Step 3: crawl</h2>
<p>I&#8217;m not going to go into this here as it is well-covered elsewhere.  Basically you just want to make a list of seed URLs, then let Nutch do its thing, e.g.:</p>

<div class="wp_syntax"><div class="code"><pre>$NUTCH_HOME/bin/nutch crawl /home/allenday/urls -dir /home/allenday/crawled -depth 5</pre></div></div>

<p>This is going to generate some directories under <code>/home/allenday/crawled/segments</code>.  </p>
<h2>Step 4: massage crawl outputs and extract images</h2>
<p>Merge the crawl segments into one big segment.  This makes the following steps easier.</p>

<div class="wp_syntax"><div class="code"><pre>$NUTCH_HOME/bin/nutch mergesegs /tmp/merged /home/allenday/crawled/segments</pre></div></div>

<p>Now dump the segment and show the image URLs from the crawl.</p>

<div class="wp_syntax"><div class="code"><pre>$NUTCH_HOME/bin/nutch readseg -dump /tmp/merged/* /tmp/dump
$NUTCH_HOME/bin/hadoop dfs -cat /tmp/dump/dump | grep -aE 'URL'</pre></div></div>

<p>The grep should show something like this:</p>
<pre>
URL:: http://spicylogic.com/some-url.html
URL:: http://spicylogic.com/some-url.jpg
</pre>
<p>Obviously you&#8217;re interested in grepping for jpg, jpeg, etc.  Do it.</p>
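<p>If you would rather do that filtering in Java than with grep, something like the following works on lines in the format shown above. It is a minimal sketch: the class <code>ImageUrlSketch</code> and its method are hypothetical names, and it assumes you have already read the dump into an array of lines.</p>

<div class="wp_syntax"><div class="code"><pre>
```java
import java.util.Arrays;
import java.util.Locale;

final class ImageUrlSketch {
  static final String PREFIX = "URL:: ";

  // Keep the URLs from "URL:: ..." lines that end in .jpg or .jpeg, in any letter case.
  static String[] imageUrls(String[] dumpLines) {
    String[] out = new String[dumpLines.length];
    int n = 0;
    for (String line : dumpLines) {
      if (!line.startsWith(PREFIX)) {
        continue; // not a URL line
      }
      String url = line.substring(PREFIX.length()).trim();
      String lower = url.toLowerCase(Locale.ROOT);
      if (lower.endsWith(".jpg") || lower.endsWith(".jpeg")) {
        out[n++] = url;
      }
    }
    return Arrays.copyOf(out, n);
  }

  public static void main(String[] args) {
    String[] dump = {
      "URL:: http://spicylogic.com/some-url.html",
      "URL:: http://spicylogic.com/some-url.jpg"
    };
    for (String url : imageUrls(dump)) {
      System.out.println(url);
    }
  }
}
```
</pre></div></div>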
<p>Once you have the image list, you can use this little Java program to pull the images out of the segment one by one.</p>

<div class="wp_syntax"><div class="code"><pre class="java"><span style="color: #000000; font-weight: bold;">package</span> com.<span style="color: #006600;">spicylogic</span>.<span style="color: #006600;">allenday</span><span style="color: #66cc66;">;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">//JDK imports</span>
<span style="color: #a1a100;">import java.io.ByteArrayInputStream;</span>
<span style="color: #a1a100;">import java.io.DataInput;</span>
<span style="color: #a1a100;">import java.io.DataInputStream;</span>
<span style="color: #a1a100;">import java.io.DataOutput;</span>
<span style="color: #a1a100;">import java.io.DataOutputStream;</span>
<span style="color: #a1a100;">import java.io.IOException;</span>
<span style="color: #a1a100;">import java.util.Arrays;</span>
<span style="color: #a1a100;">import java.util.zip.InflaterInputStream;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">//Hadoop imports</span>
<span style="color: #a1a100;">import org.apache.hadoop.conf.Configuration;</span>
<span style="color: #a1a100;">import org.apache.hadoop.fs.FileSystem;</span>
<span style="color: #a1a100;">import org.apache.hadoop.fs.Path;</span>
<span style="color: #a1a100;">import org.apache.hadoop.io.ArrayFile;</span>
<span style="color: #a1a100;">import org.apache.hadoop.io.DataOutputBuffer;</span>
<span style="color: #a1a100;">import org.apache.hadoop.io.IntWritable;</span>
<span style="color: #a1a100;">import org.apache.hadoop.io.SequenceFile;</span>
<span style="color: #a1a100;">import org.apache.hadoop.io.SequenceFile.ValueBytes;</span>
<span style="color: #a1a100;">import org.apache.hadoop.io.Text;</span>
<span style="color: #a1a100;">import org.apache.hadoop.io.UTF8;</span>
<span style="color: #a1a100;">import org.apache.hadoop.io.VersionMismatchException;</span>
<span style="color: #a1a100;">import org.apache.hadoop.io.Writable;</span>
&nbsp;
<span style="color: #808080; font-style: italic;">//Nutch imports</span>
<span style="color: #a1a100;">import org.apache.nutch.metadata.Metadata;</span>
<span style="color: #a1a100;">import org.apache.nutch.protocol.Content;</span>
<span style="color: #a1a100;">import org.apache.nutch.util.MimeUtil;</span>
<span style="color: #a1a100;">import org.apache.nutch.util.NutchConfiguration;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">final</span> <span style="color: #000000; font-weight: bold;">class</span> ExtractFile <span style="color: #66cc66;">&#123;</span>
  <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #993333;">void</span> main<span style="color: #66cc66;">&#40;</span><span style="color: #aaaadd; font-weight: bold;">String</span> argv<span style="color: #66cc66;">&#91;</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #aaaadd; font-weight: bold;">Exception</span> <span style="color: #66cc66;">&#123;</span>
&nbsp;
    <span style="color: #aaaadd; font-weight: bold;">String</span> usage = <span style="color: #ff0000;">&quot;ExtractFile (-local | -dfs &lt;namenode:port&gt;) url segment&quot;</span><span style="color: #66cc66;">;</span>
&nbsp;
    <span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span>argv.<span style="color: #006600;">length</span> <span style="color: #66cc66;">&lt;</span> <span style="color: #cc66cc;">3</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
      <span style="color: #aaaadd; font-weight: bold;">System</span>.<span style="color: #006600;">out</span>.<span style="color: #006600;">println</span><span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">&quot;usage:&quot;</span> + usage<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
      <span style="color: #000000; font-weight: bold;">return</span><span style="color: #66cc66;">;</span>
    <span style="color: #66cc66;">&#125;</span>
    Configuration conf = NutchConfiguration.<span style="color: #006600;">create</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
    FileSystem fs = FileSystem.<span style="color: #006600;">parseArgs</span><span style="color: #66cc66;">&#40;</span>argv, <span style="color: #cc66cc;">0</span>, conf<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
    <span style="color: #000000; font-weight: bold;">try</span> <span style="color: #66cc66;">&#123;</span>
      <span style="color: #aaaadd; font-weight: bold;">String</span> segment = argv<span style="color: #66cc66;">&#91;</span><span style="color: #cc66cc;">1</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">;</span>
&nbsp;
      Path file = <span style="color: #000000; font-weight: bold;">new</span> Path<span style="color: #66cc66;">&#40;</span>segment, Content.<span style="color: #006600;">DIR_NAME</span> + <span style="color: #ff0000;">&quot;/part-00000/data&quot;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
      SequenceFile.<span style="color: #aaaadd; font-weight: bold;">Reader</span> reader = <span style="color: #000000; font-weight: bold;">new</span> SequenceFile.<span style="color: #aaaadd; font-weight: bold;">Reader</span><span style="color: #66cc66;">&#40;</span>fs, file, conf<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
&nbsp;
      Text key = <span style="color: #000000; font-weight: bold;">new</span> Text<span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
      Content content = <span style="color: #000000; font-weight: bold;">new</span> Content<span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
&nbsp;
      <span style="color: #b1b100;">while</span> <span style="color: #66cc66;">&#40;</span>reader.<span style="color: #006600;">next</span><span style="color: #66cc66;">&#40;</span>key, content<span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
        <span style="color: #808080; font-style: italic;">//System.err.println( key + &quot;\t=\t&quot; + argv[0] );</span>
        <span style="color: #b1b100;">if</span> <span style="color: #66cc66;">&#40;</span>key.<span style="color: #006600;">equals</span><span style="color: #66cc66;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> Text<span style="color: #66cc66;">&#40;</span>argv<span style="color: #66cc66;">&#91;</span><span style="color: #cc66cc;">0</span><span style="color: #66cc66;">&#93;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#123;</span>
          <span style="color: #aaaadd; font-weight: bold;">System</span>.<span style="color: #006600;">out</span>.<span style="color: #006600;">write</span><span style="color: #66cc66;">&#40;</span> content.<span style="color: #006600;">getContent</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span>, <span style="color: #cc66cc;">0</span>, content.<span style="color: #006600;">getContent</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span>.<span style="color: #006600;">length</span> <span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
          <span style="color: #000000; font-weight: bold;">break</span><span style="color: #66cc66;">;</span>
        <span style="color: #66cc66;">&#125;</span>
      <span style="color: #66cc66;">&#125;</span>
      reader.<span style="color: #006600;">close</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
    <span style="color: #66cc66;">&#125;</span> <span style="color: #000000; font-weight: bold;">finally</span> <span style="color: #66cc66;">&#123;</span>
      fs.<span style="color: #006600;">close</span><span style="color: #66cc66;">&#40;</span><span style="color: #66cc66;">&#41;</span><span style="color: #66cc66;">;</span>
    <span style="color: #66cc66;">&#125;</span>
  <span style="color: #66cc66;">&#125;</span>
<span style="color: #66cc66;">&#125;</span></pre></div></div>

<p>Compile it, then you can run it like so:</p>

<div class="wp_syntax"><div class="code"><pre>$NUTCH_HOME/bin/hadoop --config YOUR:CLASS:PATH com.spicylogic.allenday.ExtractFile http://spicylogic.com/some-url.jpg /tmp/merged/* &gt; out.jpg</pre></div></div>

<p>Hope that helps.  Let me know if you have corrections/clarifications (or a complete script!) for this post and I&#8217;ll be happy to merge them in with attribution.</p>
<p>Thanks for this post go to <a href="http://kazmuzik.net/lj/77261.html">Kaz Muzik</a>, who was doing something similar to back up his blog.  Also, the <code>com.spicylogic.allenday.ExtractFile</code> class is based on the <code>org.apache.nutch.protocol.Content</code> class.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
