Computing

8 Keys to Effective Crowdsourcing

The key to effective crowdsourcing is effective communication.  You communicate with your crowdsourced workers so that you can train them.  Training has a measurable cost, and you want to minimize this cost to make the most effective use of your time and your budget.

Consider what happens when you start a new professional position or, on the flip side, when you train someone to take on a new role.  Assuming you are (or have) the “right” person, with the skills to perform the requisite tasks, why is training required?  Because knowledge transfer needs to occur.  The same is true for crowdsourced workers.  So how can we effectively transfer knowledge to workers who may only be spending a few seconds on your task?

Key 1: Be consistent.

Use similar phrasings and images for all of your task descriptions.  This allows workers to come up to speed in a minimum amount of time.  Imagine how hard it would be to read your email if each message opened in a differently styled window.  Similar phrasings/images are just one example of how to employ…

Key 2: Use variables.

Smartsheet.com got this right.  Have a look at these 2 tasks submitted from Smartsheet to Amazon’s Mechanical Turk:

Look closely at what’s going on here.  The two tasks’ input variables (Blog Name and Blog URL) are identical, only their values change.  Note also that there are 2114 tasks just like this available.  Workers like to have lots of very similar tasks because…

Key 3: Batch tasks.

Crowdsourced workers like batches of similar tasks because a batch gives them an opportunity to set up a workflow, or even to write a small computer program that does the tasks for them (and for you).  The cost of learning how to do a task is amortized over the entire batch, letting them make more efficient use of their time (and letting you make more efficient use of your budget).
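Here is a minimal sketch of Keys 2 and 3 together in Java: one task template, filled in once per row of input. Only the variable names (blog name and blog URL) come from the Smartsheet example; the task wording, the `${...}` placeholder syntax and the `TaskTemplate` class are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TaskTemplate {
    // Hypothetical task wording; only the two variable names come from the example above.
    static final String TEMPLATE =
        "Visit ${blog_url} and confirm that the blog is named \"${blog_name}\".";

    // Replace each ${name} placeholder with its value; unknown placeholders are left as-is.
    static String render(String template, Map<String, String> values) {
        String out = template;
        for (Map.Entry<String, String> e : values.entrySet()) {
            out = out.replace("${" + e.getKey() + "}", e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("blog_name", "Example Blog");
        row.put("blog_url", "http://blog.example.com");
        System.out.println(render(TEMPLATE, row));
    }
}
```

Because the wording never changes, a worker only has to learn the task once, and every further row in the batch costs them just the time to read two values.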

Key 4: Be visual.

The adage “a picture is worth one thousand words” couldn’t be more fitting for communicating with crowdsourced workers.  Compared to text, images are very information dense, are friendlier to scanning, and communicate non-linear process structure more quickly.  The most effective visual tool I have found thus far is to…

Key 5: Use flow charts.

Consider learning to use flow charts, and also to extend your visual vocabulary.  I’m an avid user of OmniGraffle for creating diagrams for crowdsourcing (as well as for myself).  I’ll be presenting some flow charts in the future.  You will find that by presenting your task graphically and in a formal way as a flow chart (as opposed to simply giving graphical examples), users will do more work for the same price because you’ve made it easier for them.  The flow chart also forces you to be clear about what you want, which brings us to…

Key 6: Know what you want.  Be unambiguous.

Know what you expect the worker to do for you.  Make each task so simple that it’s virtually impossible for a worker to do it incorrectly.  Break up complex tasks into their most elementary pieces.  Ideally one task = one decision.  Make each task closed-ended.  Do not leave any room for ambiguity.

Designing tasks in this way requires more effort on your part, but will result in less money spent and higher-quality results.

Key 7: Improve through iteration.

Being unambiguous on the first try is nigh on impossible.  It’s for the same reason that you “bounce” ideas off of your peers/friends — to see how your approach to an idea or task might be sub-optimal or misunderstood.

Iteratively remove ambiguity.  Submit a sample of tasks from a larger batch with a test task description.  See where the crowdsourced workers make mistakes.  Re-examine your task description to a) find the misunderstanding, and b) disambiguate it.

Key 8: Build validators into your tasks.

Make sure the worker’s work is validated before it gets to you.  This could mean having workers check each other’s work, and can even involve some fancy statistics.  It could also mean writing a bit of JavaScript or some other backend system to validate worker inputs (e.g. if you ask for a document of at least 300 words, count the words with JavaScript before the worker submits).  This is getting a bit more advanced, but it opens up opportunities for more complex tasks by delegating part of the work to the computer.
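As a rough sketch of the backend variant of that word-count check, here it is in Java. The class and method names are hypothetical, and "a word" is assumed to mean any run of non-whitespace characters, which may be too lenient for real submissions.

```java
final class SubmissionValidator {
    static final int MIN_WORDS = 300;

    // Count whitespace-separated tokens; null or blank input counts as zero words.
    static int countWords(String text) {
        String trimmed = text == null ? "" : text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Accept a submission only if it reaches the requested minimum length.
    static boolean meetsMinimum(String text, int minWords) {
        return countWords(text) >= minWords;
    }

    public static void main(String[] args) {
        String draft = "too short to pass";
        System.out.println(countWords(draft) + " words; accepted = "
            + meetsMinimum(draft, MIN_WORDS));
    }
}
```

Rejecting a short document at submit time costs nothing, whereas catching it by hand afterwards costs your own time on every task.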

Computing
Crowdsourcing
Random musings
Scalability


Google/HTC Nexus One Unboxing

[Photo gallery: 22 unboxing photos]

Computing
Fun
Life
Mobile


How to fix the meetup.com broken exported calendars.

I’m a big fan of meetup.com, but they’re so tragically unhip when it comes to mashups/integration/web 2.0.  One of my biggest gripes until about 6 months ago was that they had no facility (besides the API) for exporting a calendar of meetups to my calendar app (I use Google Calendar), or to any other calendar app for that matter.

They introduced an export feature recently, but it’s pretty useless.  Here’s why: they offer only two calendars:

  • [Calendar A] contains all upcoming items in all your meetup groups
  • [Calendar B] contains upcoming items which you have RSVP’d with “yes” or “maybe”.

That’s it.  The exported calendars don’t even contain links that allow you to RSVP directly from inside your calendar.  You have to click through to the meetup.com site, log in, then RSVP.  Ugh.

 

Come on, product guys.  What’s really called for is 4 separate calendars.

  • [Calendar "yes"] All groups, “yes” events
  • [Calendar "maybe"] All groups, “maybe” events
  • [Calendar "no"] All groups, “no” events
  • [Calendar "none"] All groups, events to which I have not yet submitted an RSVP.

I was finally just pissed off enough about the status quo that I fixed it for myself, and below I share the code.  You can try it out here: http://spicylogic.com/allenday/cgi-bin/mu.cgi?key=<your_api_key>&user=<your_member_id>&cal=<calendar> 

where <your_api_key> can be found here, <your_member_id> is your numeric meetup.com member ID (the script refuses to run without it), and <calendar> is one of “yes”, “no”, “none”, “maybe”.

Okay, here’s the code.  Install it on your own machine if possible, my ISP will appreciate it.  If you find fuckups, let me know and I’ll update the post.

#!/usr/bin/perl
use strict;
use CGI qw(:standard);
use Date::Manip qw(ParseDate ParseDateString ParseDateDelta DateCalc UnixDate);
use Date::Parse;
use HTML::Entities;
use LWP::Simple qw(get);
use XML::DOM;
 
use constant URL_EVENTS => 'http://api.meetup.com/events?key=%s&member_id=%d&format=xml';
 
print header(q(text/calendar));
 
my $parser = new XML::DOM::Parser ();
 
my $mode = param( 'cal' );
my $key  = param( 'key' );
my $user = param( 'user' );
 
if ( ! $mode || ! $key || ! $user ) {
  die
}
 
my $events_url = sprintf( URL_EVENTS, $key, $user );
#warn $events_url;
my $events_txt = get( $events_url );
#warn $events_txt;
my $events_dom = $parser->parse( $events_txt );
#warn $events_dom;
 
print qq(BEGIN:VCALENDAR\nPRODID:-//Meetup Inc//RemoteApi//EN\nVERSION:2.0\nMETHOD:PUBLISH\nCALSCALE:GREGORIAN\nX-ORIGINAL-URL:http://www.meetup.com/\nX-WR-CALNAME:mu $mode\n);
 
my $events = $events_dom->getElementsByTagName( 'item' );
for ( my $i = 0 ; $i < $events->getLength() ; $i++ ) {
  my $event = $events->item( $i );
  my $n_id    = $event->getElementsByTagName( 'id'             )->item( 0 )->getFirstChild();
  my $n_rsvp  = $event->getElementsByTagName( 'myrsvp'         )->item( 0 )->getFirstChild();
  my $n_addr0 = $event->getElementsByTagName( 'venue_name'     )->item( 0 )->getFirstChild();
  my $n_addr1 = $event->getElementsByTagName( 'venue_address1' )->item( 0 )->getFirstChild();
  my $n_addr2 = $event->getElementsByTagName( 'venue_address2' )->item( 0 )->getFirstChild();
  my $n_addr3 = $event->getElementsByTagName( 'venue_address3' )->item( 0 )->getFirstChild();
  my $n_addr4 = $event->getElementsByTagName( 'venue_city'     )->item( 0 )->getFirstChild();
  my $n_addr5 = $event->getElementsByTagName( 'venue_state'    )->item( 0 )->getFirstChild();
  my $n_addr6 = $event->getElementsByTagName( 'venue_zip'      )->item( 0 )->getFirstChild();
  my $n_desc  = $event->getElementsByTagName( 'description'    )->item( 0 )->getFirstChild();
  my $n_link  = $event->getElementsByTagName( 'event_url'      )->item( 0 )->getFirstChild();
  my $n_name  = $event->getElementsByTagName( 'name'           )->item( 0 )->getFirstChild();
  my $n_lat   = $event->getElementsByTagName( 'venue_lat'      )->item( 0 )->getFirstChild();
  my $n_lon   = $event->getElementsByTagName( 'venue_lon'      )->item( 0 )->getFirstChild();
  my $n_start_time  = $event->getElementsByTagName( 'time'           )->item( 0 )->getFirstChild();
 
  my $start_time;
  my $end_time;
 
  #my $dummy_time = "20000101T000000Z";
  # CREATED/DTSTAMP carry a trailing "Z" (UTC), so build them from gmtime rather than localtime.
  my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = gmtime(time());
  my $dummy_time = sprintf( q(%04d%02d%02dT%02d%02d%02dZ), $year + 1900, $mon + 1, $mday, $hour, $min, $sec );
 
  if ( $n_start_time ) {
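    my ($ss,$mm,$hh,$day,$month,$year,$zone);
 
    ($ss,$mm,$hh,$day,$month,$year,$zone) = strptime( $n_start_time->toString() );
    $start_time = sprintf( q(%04d%02d%02dT%02d%02d%02dZ), $year + 1900, $month + 1, $day, $hh, $mm, $ss );
 
    # Assume a one-hour event.  Add the hour in epoch seconds so that an
    # 11pm start rolls over correctly at month and year boundaries.
    require Time::Local;
    my $end_epoch = Time::Local::timegm( $ss || 0, $mm || 0, $hh || 0, $day, $month, $year + 1900 ) + 3600;
    my ($ess,$emm,$ehh,$eday,$emonth,$eyear) = gmtime( $end_epoch );
    $end_time   = sprintf( q(%04d%02d%02dT%02d%02d%02dZ), $eyear + 1900, $emonth + 1, $eday, $ehh, $emm, $ess );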
  }
  else {
    $start_time = '';
    $end_time = '';
  }
 
  # Skip events with no RSVP node instead of dying on an undefined value.
  if ( $n_rsvp && $mode eq $n_rsvp->toString() ) {
    my $id   = $n_id->toString();
    my $name = $n_name ? $n_name->toString() : "";
    my $desc = $n_desc ? $n_desc->toString() : "";
    my $addr = ( $n_addr0 ? $n_addr0->toString().', ' : "" )
             . ( $n_addr1 ? $n_addr1->toString().', ' : "" )
             . ( $n_addr2 ? $n_addr2->toString().', ' : "" )
             . ( $n_addr3 ? $n_addr3->toString().', ' : "" )
             . ( $n_addr4 ? $n_addr4->toString().', ' : "" )
             . ( $n_addr5 ? $n_addr5->toString().', ' : "" )
             . ( $n_addr6 ? $n_addr6->toString() : "" );
    #$desc =~ s/(.)/(ord($1) > 127) ? "" : $1/egs;
 
    $name = HTML::Entities::decode_entities( $name );
    $desc = HTML::Entities::decode_entities( $desc );
    $addr = HTML::Entities::decode_entities( $addr );
    $name =~ s/,/\\,/g;
    $desc =~ s/,/\\,/g;
    $addr =~ s/,/\\,/g;
 
    # iCalendar text values must be single-line: turn real newlines into a literal \n.
    $desc =~ s#\n#\\n#gs;
    $desc .= "\\n\\n\\nGoing?\\n\\n";
    foreach my $response ( qw( yes no maybe ) ) {
      $desc .= uc($response).qq(: http://api.meetup.com/rsvp?event_id=$id&key=$key&rsvp=$response\\n);
    }
 
    my $geo = $n_lat && $n_lon ? "GEO:" . $n_lat->toString() . ";" . $n_lon->toString() . "\n" : "";
 
    #print sprintf( qq(BEGIN:VEVENT\nSUMMARY:%s\nDESCRIPTION:%s\nLAST-MODIFIED:%s\nUID:%s\nCLASS:%s\nCREATED:%s\nDTSTAMP:%s\nDTSTART:%s\nDTEND:%s\nLOCATION:%s\n\nURL:%s\nEND:VEVENT\n),
    print sprintf( qq(BEGIN:VEVENT\nSUMMARY:%s\nDESCRIPTION:%s\nLAST-MODIFIED:%s\nUID:%s\nCLASS:%s\nCREATED:%s\nDTSTAMP:%s\nDTSTART:%s\nDTEND:%s\n%sLOCATION:%s\nURL:%s\nEND:VEVENT\n),
      $name,
      $desc,
      $start_time,
      "event_$id\@meetup.com",
      "PUBLIC",
      $dummy_time,
      $dummy_time,
      $start_time,
      $end_time,
      $geo,
      $addr,
      $n_link ? $n_link->toString() : "",
    );
  }
}
 
print qq(END:VCALENDAR\n);
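
Two details in the script are easy to get wrong: rolling the end time over a day or month boundary, and escaping text for iCalendar. Here is a small sketch of both in Java, assuming events last one hour (as the script does) and that text values are escaped the way RFC 5545 describes (backslash, semicolon, comma and newline). The class and method names are made up for illustration; this is not part of the Perl script above.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

final class ICalSketch {
    // iCalendar UTC date-time form, e.g. 20100131T233000Z.
    private static final DateTimeFormatter ICAL_UTC =
        DateTimeFormatter.ofPattern("yyyyMMdd'T'HHmmss'Z'");

    static String formatUtc(ZonedDateTime t) {
        return t.withZoneSameInstant(ZoneOffset.UTC).format(ICAL_UTC);
    }

    // Assumed one-hour duration; plusHours handles day, month and year rollover.
    static String defaultEnd(ZonedDateTime start) {
        return formatUtc(start.plusHours(1));
    }

    // Escape a TEXT value: backslash first, then semicolon, comma and line breaks.
    static String escapeText(String s) {
        return s.replace("\\", "\\\\")
                .replace(";", "\\;")
                .replace(",", "\\,")
                .replace("\r\n", "\\n")
                .replace("\n", "\\n");
    }

    public static void main(String[] args) {
        ZonedDateTime start = ZonedDateTime.of(2010, 1, 31, 23, 30, 0, 0, ZoneOffset.UTC);
        System.out.println("DTSTART:" + formatUtc(start));
        System.out.println("DTEND:" + defaultEnd(start));
        System.out.println("SUMMARY:" + escapeText("Beer, pizza; code"));
    }
}
```

The backslash has to be escaped before the other characters, otherwise the backslashes added for commas and semicolons would themselves get doubled.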

Administration
Life
Networking
Perl
Software


Synthetic GFF Dataset for Genome Browser Benchmark

I deployed a Gbrowse/Chado installation last week at Dow Agrosciences.  It got me thinking about how slow and basic the searches are with the Bio::DB::Das::Chado* adaptor, and wouldn’t it be nice to use SOLR here?

I made up a test dataset of gene/mRNA/exon 3-tiered feature groups by permuting some gene model data from the knownGene annotation set of the Hg18 build of the human genome.  You can grab the data set and script used to generate it here.  There are several files mRNA.EN.txt.gz that contain gzipped gene models, where N=3..7 indicates there are 10^N models in the file, uniformly distributed across a 500-megabase reference sequence.
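
To give a feel for the shape of the data, here is a rough Java sketch that emits three-tier gene/mRNA/exon groups as GFF3 at uniformly random positions on a 500-megabase reference. It is not the generator script linked above: that one permutes real knownGene models, whereas this sketch assumes a fixed, made-up exon layout. The sequence name `ref`, the source name `synthetic` and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class SyntheticGff {
    static final long REF_LENGTH = 500_000_000L; // 500-megabase reference, as in the post

    // One gene with one mRNA and evenly spaced exons, as tab-separated GFF3 rows.
    static List<String> geneModel(int id, long start, int exons, int exonLen,
                                  int intronLen, char strand) {
        long end = start + (long) exons * exonLen + (long) (exons - 1) * intronLen - 1;
        List<String> rows = new ArrayList<>();
        rows.add(row("gene", start, end, strand, "ID=gene" + id));
        rows.add(row("mRNA", start, end, strand, "ID=mRNA" + id + ";Parent=gene" + id));
        for (int e = 0; e < exons; e++) {
            long exonStart = start + (long) e * (exonLen + intronLen);
            rows.add(row("exon", exonStart, exonStart + exonLen - 1, strand,
                "ID=exon" + id + "." + (e + 1) + ";Parent=mRNA" + id));
        }
        return rows;
    }

    private static String row(String type, long start, long end, char strand, String attrs) {
        return String.join("\t", "ref", "synthetic", type, Long.toString(start),
            Long.toString(end), ".", String.valueOf(strand), ".", attrs);
    }

    // Place `models` gene models uniformly at random so that each fits on the reference.
    static List<String> generate(int models, long seed) {
        Random rng = new Random(seed);
        int exons = 4, exonLen = 200, intronLen = 1000; // assumed layout, not from knownGene
        long span = (long) exons * exonLen + (long) (exons - 1) * intronLen;
        List<String> out = new ArrayList<>();
        out.add("##gff-version 3");
        for (int i = 1; i <= models; i++) {
            long start = 1 + (long) (rng.nextDouble() * (REF_LENGTH - span));
            out.addAll(geneModel(i, start, exons, exonLen, intronLen,
                rng.nextBoolean() ? '+' : '-'));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : generate(1000, 42L)) { // 10^3 models, like the smallest file
            System.out.println(line);
        }
    }
}
```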

I’m planning to load these data into a couple of different systems and then compare performance on some of the typical Bio::DB::GFF API calls.  I can personally test on:

  • Chado
  • The default Bio::DB::GFF schema (does it have a name?)
  • The SOLR backend I’m about to implement

I know there are other feature DBs out there.  It would be good to include them as well in a later pass or to have someone else contribute the data once I get the benchmarking script written.

Genomics
Informatics
Java
Perl
Scalability
Science


Taste item-item recommender example

I threw together a Mahout/Taste-based item-item recommender last night.

	public static void itemItemRecommendations(String path, String file) {
		File f = new File(path, file);
		try {
			DataModel model = new FileDataModel(f);
			model.refresh(null);
			ItemSimilarity itemSimilarity = new LogLikelihoodSimilarity(model);
			ItemBasedRecommender itemRecommender = new GenericItemBasedRecommender(model, itemSimilarity);
			// For each item, print up to 50 similar items whose similarity is at least 0.7.
			for ( Item i : model.getItems() ) {
				for ( RecommendedItem j : itemRecommender.mostSimilarItems(i.getID(), 50) ) {
					if ( j.getValue() >= 0.7 ) {
						System.out.println(i.getID() + "\t" + j.getItem().getID() + "\t" + String.format("%.3f", j.getValue()));
					}
				}
			}
		} catch (FileNotFoundException e) {
			// The input file could not be opened.
			e.printStackTrace();
		} catch (TasteException e) {
			// Building the model or computing similarities failed.
			e.printStackTrace();
		}
	}

This outputs item1 –recommends–>item2 pairs with a weight. I’m taking this and putting it into a solr document so I can display related item2s alongside item1 when it’s viewed.

Input data are comma-delimited (user, item, preference) tuples like so:

1fe7401b81eed49353d0cbeba5383848,5212,0.6
3c1832954a6e8781836fed670bb37b24,5212,1
70273e4c7c77700ee97acb8d0306c405,5213,0.8
1f057ccde135acbc881008bbf466e7e1,5213,1
51d44c7baca65ad39d11ba87bf2d438b,5213,1
adc924559b37114cd97d1f5cf7c71419,5213,1
78e254b4a11e61d76ff63cea02de4de8,5213,1
5c373ec7d9ad4a6f392c291d8ccba5ce,5213,0.2
fab8537564094fa8885f6214e6b682e1,5213,1
127f46aabcdbc2d2d04da8398a996c75,5213,1

Works great. Thanks Sean.

Analytics
Java
Mahout


Google Android G1 APN Settings for AT&T / Cingular, First Impressions

I got an unlocked T-Mobile G1 today. Woo. There is a bunch of misinformation out there on blogs and forums about how to get the phone set up. Here’s the real deal. I found these settings on Piaw’s Blog.

Name: whatever_you_want_the_name_to_be
APN: wap.cingular
Username: wap@cingulargprs.com
Password: cingular1
MMSC: http://mmsc.cingular.com
MMS Proxy: wireless.cingular.com
MMS port: 80
MCC: 310
MNC: 410

Now, on to my first impressions of the phone.

Works:

  • Calling works.
  • Google contact import works.
  • Google Chat works.
  • EDGE data works.
  • WiFi data works.
  • Keyboard works. It rocks.

Doesn’t work:

  • AOL Chat does not work. Complains it can’t read my mobile number from my SIM card.
  • 3G data does not work. I read that the phone doesn’t support the 3G band used by AT&T.

Works, but not well:

  • The browser works, but it sucks compared to the iPhone. Feels very slow. I was expecting a lot more given that it’s using MobileSafari/WebKit.
  • Video download/playback works. The player was branded with the YouTube logo, so I’m guessing it only supports YouTube out of the box.
  • The UI feels clunky at first. Sort of feels like Nokia S60: too many menus, and inconsistency in how different tasks are done.

Computing
Mobile


EveryDNS – free DNS service

http://www.everydns.com

Found this today, and it works as advertised. Need to look more closely, but with this I think I can stop paying dyndns.com $30/year/domain for custom DNS.

Administration
Business


Upcoming AI / Machine Learning Conferences

A (partial) list I found today. Doesn’t include NIPS, so I’m not sure how exhaustive it is, but it has a bunch I haven’t seen before.

http://www.kmining.com/info_conferences.html

Analytics
Informatics
Mathematics
Networking
Science
Software
Statistics


Is Amazon CloudFront right for me?

Here’s what the pricing looks like. Learn more about Amazon CloudFront at http://aws.amazon.com/cloudfront.

United States Edge Locations

Data Transfer

$0.170 per GB – first 10 TB / month data transfer out
$0.120 per GB – next 40 TB / month data transfer out
$0.100 per GB – next 100 TB / month data transfer out
$0.090 per GB – data transfer out / month over 150 TB

Requests
$0.010 per 10,000 GET requests

I evaluated AWS for hosting a while back and concluded that the bandwidth and storage costs were just too high if you have even modest storage and traffic needs. Here’s the breakdown:

A dedicated 100Mbit line can xfer 30TB/month. Costs $1000/mo, or $10/Mbit/mo. Source: CalPOP. (I host here).

From AWS at a flat $0.120/GB (ignoring the pricier first 10TB tier) that’s $3600/mo. If you’re pushing sizable volumes of bits, it seems like it will only make sense to do this under 2 scenarios:

  1. you can benefit from having a >100Mbit/s cap b/c you have *very* spiky traffic. you xfer well over 400Mbit/s for a few hours/day (and 0Mbit/s the rest), and
  2. you need lower latency than a 1-2 datacenter network can give you

I suspect for most of their target clients it’s [2], or they’re clients that are really in it for the whole S3/EC2/SQS/EBS bundle. Being able to rent cores at $0.10/hour can be really attractive for some types of services.

So no, it’s not right for me. YMMV.
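
To sanity-check the arithmetic above, here is a small Java sketch that applies the pricing tiers from the table instead of a flat rate. It assumes 1 TB = 1000 GB and a 30-day month, and the class and method names are made up. Under those assumptions a saturated 100Mbit line moves about 32.4 TB in a month, and 30 TB through the tiers comes to about $4100, so the flat $3600 figure is if anything a little kind to AWS.

```java
final class TransferCost {
    // Tier sizes in GB and prices in USD per GB, from the table above (assuming 1 TB = 1000 GB).
    private static final double[] TIER_GB = {10_000, 40_000, 100_000, Double.POSITIVE_INFINITY};
    private static final double[] TIER_USD_PER_GB = {0.170, 0.120, 0.100, 0.090};

    // Walk the tiers in order, charging each slice of traffic at that tier's rate.
    static double tieredCost(double gb) {
        double remaining = gb;
        double total = 0;
        for (int i = 0; i < TIER_GB.length && remaining > 0; i++) {
            double used = Math.min(remaining, TIER_GB[i]);
            total += used * TIER_USD_PER_GB[i];
            remaining -= used;
        }
        return total;
    }

    // GB moved in `days` days by a line running flat out at `mbitPerSec`.
    static double monthlyGb(double mbitPerSec, int days) {
        return mbitPerSec / 8.0 * 86_400 * days / 1000.0; // Mbit/s -> MB/s -> MB -> GB
    }

    public static void main(String[] args) {
        System.out.println(String.format("100Mbit line, 30 days: %.0f GB", monthlyGb(100, 30)));
        System.out.println(String.format("30 TB via tiers: $%.2f", tieredCost(30_000)));
        System.out.println(String.format("30 TB at flat $0.12/GB: $%.2f", 30_000 * 0.12));
    }
}
```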

Business
Computing
Scalability


Parallel DNS reverse lookups

Need to do lots of reverse DNS lookups for some reason? Maybe b/c you’re trying to get a seed list for a web crawl or hack attempt on a bunch of ISPs. Who cares. Here’s a quick way to generate names from a big list of IPs like:

1.1.1.1
1.1.1.2
[...]
254.254.254.253
254.254.254.254

We can use Hadoop streaming to chunk the list so we can do the DNS lookups in parallel. Easy and requires little to no thought:

./bin/hadoop jar contrib/streaming/*-streaming.jar -input /home/aday/classC.dat -output /home/aday/classC_dns.dat -mapper 'perl -ne '\''print `host $_`'\''' -numReduceTasks 0

We wrap the host call in backticks so we can trap non-zero exit codes and get an error message on stdout courtesy of perl.
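
If you don’t have a Hadoop cluster handy, the same idea works on one box with a thread pool. This is a different technique from the streaming job above, not a port of it: a minimal Java sketch in which the class name, the pool size and the pluggable `resolver` argument are all my own assumptions. It relies on `InetAddress.getCanonicalHostName()`, which as far as I know falls back to the IP text when no name is found.

```java
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelReverseDns {
    // Real resolver: reverse lookup via the JDK; errors are reported in the output line.
    static String reverseLookup(String ip) {
        try {
            return InetAddress.getByName(ip).getCanonicalHostName();
        } catch (Exception e) {
            return "ERROR: " + e.getMessage();
        }
    }

    // Resolve every IP on a fixed-size pool; results come back in input order.
    static List<String> lookupAll(List<String> ips, int threads,
                                  Function<String, String> resolver) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String ip : ips) {
                pending.add(pool.submit(() -> ip + "\t" + resolver.apply(ip)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                try {
                    results.add(f.get());
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> ips = Arrays.asList("127.0.0.1", "8.8.8.8");
        for (String line : lookupAll(ips, 16, ParallelReverseDns::reverseLookup)) {
            System.out.println(line);
        }
    }
}
```

DNS lookups spend nearly all their time waiting on the network, so the pool can be much larger than the number of cores.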

Distributed Systems
Hadoop
Java
