Science

Synthetic GFF Dataset for Genome Browser Benchmark

I deployed a Gbrowse/Chado installation last week at Dow Agrosciences.  It got me thinking about how slow and basic the searches are with the Bio::DB::Das::Chado* adaptor, and wouldn’t it be nice to use SOLR here?

I made up a test dataset of gene/mRNA/exon 3-tiered feature groups by permuting some gene model data from the knownGene annotation set of the Hg18 build of the human genome.  You can grab the data set and script used to generate it here.  There are several files mRNA.EN.txt.gz that contain gzipped gene models, where N=3..7 indicates there are 10^N models in the file, uniformly distributed across a 500-megabase reference sequence.

I’m planning to load these data into a couple of different systems and then compare performance on some of the typical Bio::DB::GFF API calls.  I can personally test on:

  • Chado
  • The default Bio::DB::GFF schema (does it have a name?)
  • The SOLR backend I’m about to implement

I know there are other feature DBs out there.  It would be good to include them as well in a later pass or to have someone else contribute the data once I get the benchmarking script written.

Genomics
Informatics
Java
Perl
Scalability
Science

Comments (0)

Permalink

Taste item-item recommender example

I threw together a Mahout/Taste based item-item based recommender last night.

	public static void itemItemRecommendations(String path, String file) {
		File f = new File(path, file);
	    try {
			DataModel model = new FileDataModel(f);
			model.refresh(null);
		    ItemSimilarity itemSimilarity = new LogLikelihoodSimilarity(model);
		    ItemBasedRecommender itemRecommender = new GenericItemBasedRecommender(model, itemSimilarity);
		    for ( Item i : model.getItems() )
			    for ( RecommendedItem j : itemRecommender.mostSimilarItems(i.getID(), 50) )
			    	if ( j.getValue() >= 0.7 )
			    		System.out.println(i.getID() + "\t" + j.getItem().getID() + "\t" + String.format("%.3f", j.getValue()));
		} catch (FileNotFoundException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} catch (TasteException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}
	}

This outputs item1 –recommends–>item2 pairs with a weight. I’m taking this and putting it into a solr document so I can display related item2s alongside item1 when it’s viewed.

Input data are comma-delimited tuples like so:

1fe7401b81eed49353d0cbeba5383848,5212,0.6
3c1832954a6e8781836fed670bb37b24,5212,1
70273e4c7c77700ee97acb8d0306c405,5213,0.8
1f057ccde135acbc881008bbf466e7e1,5213,1
51d44c7baca65ad39d11ba87bf2d438b,5213,1
adc924559b37114cd97d1f5cf7c71419,5213,1
78e254b4a11e61d76ff63cea02de4de8,5213,1
5c373ec7d9ad4a6f392c291d8ccba5ce,5213,0.2
fab8537564094fa8885f6214e6b682e1,5213,1
127f46aabcdbc2d2d04da8398a996c75,5213,1

Works great. Thanks Sean.

Analytics
Java
Mahout

Comments (1)

Permalink

Upcoming AI / Machine Learning Conferences

A (partial) list I found today. Doesn’t include NIPS, so I’m not sure how exhaustive it is, but it has a bunch I haven’t seen before.

http://www.kmining.com/info_conferences.html

Analytics
Informatics
Mathematics
Networking
Science
Software
Statistics

Comments (0)

Permalink

ZIP code demographic data with Perl

I needed some demographics data earlier this week and tried using the SF3 files from census.gov’s “Census 2000″ data set.

What a time sink. Ugh.

The methods used are very well documented, and I learned a lot about the census. What I was not able to learn, however, was how to actually extract the data from the flat files. Look at what Joshua Tauberer went through to get some idea of the pain level.

Finally I got fed up and wrote a screen scraper for ZIPskinny.com in Perl. It’s one-off crappy code. You can get it from CPAN under namespace Geo::Demo::Zipskinny.

Hope it saves you some time. Leave me a comment if you have working code that can deal with SF3 files.

Here’s a little ZIP code to rich-vs-poor plot I made earlier.

Analytics
Perl
Science
Statistics

Comments (0)

Permalink

Webserver logs access time by region/language

As anyone with a popular website knows, there’s a big difference in the resources required for peak vs. off-peak hours and you typically have to pay for peak usage even if you don’t always use it (e.g. 95th percentile bandwidth billing)

Frugal as I am, I was curious to see if I could increase traffic during what are off-peak hours. Seemed sensible that people in different regions of the world might be accessing during off-hours.

So I aggregated data by country code/language and 10-minute time segment. Applied a Daniell smoothing kernel (a sliding window) of 6 segments (1 hour) and plotted a a row-scaled heatmap in R. Rows are clustered so similar access patterns are next to one another, with the left-hand-side dendrogram indicating dissimilarity between rows. Yellow-white is a traffic burst. I’ll post the code and data later for how I made this.

access times by country/language

As it turns out, the main off-peak trough corresponds to the middle of the Pacific ocean. Kinda watery for people to live there. Oh well, I tried.

Analytics
R
Science
Statistics

Comments (0)

Permalink

R Matrix sparseVector operations

I’ve only done minus(vec1,vec2) so far. More to come.

library(Matrix);
 
#F = strsplit(as.character(mm[1,2]),', ')[[1]]
#G = matrix( as.numeric(unlist(strsplit(F[c(-1,-length(F))],':'))), nrow=2 )
#tt = new('dsparseVector', x=G[2,], i=as.integer(G[1,]), length=max(as.integer(G[1,])))
 
minus = function(v1,v2) {
  i = sort(union(v1@i,v2@i));
  s = length(i);
 
  x = vector(mode='numeric',length=s);
  for ( k in 1:s ) {
    z = i[k];
    if ( z < length(v1) ) {
      x[k] = as.numeric(v1[z]);
    }
    if ( z < length(v2) ) {
      x[k] = x[k] - as.numeric(v2[z]);
    }
  }
  new("dsparseVector", x=x, i=i, length=max(v1@i,v2@i))
}

R
Statistics

Comments (0)

Permalink

aggregate – report event counts from a stream

Another shell utility. This one is useful for, e.g. counting 404, 500, 200, 302 HTTP codes from a log file.

#!/usr/bin/perl
$|++;
use strict;
use Getopt::Long;
 
my $mode = 'line';
my $tick = 100;
my $help = undef;
my $keysfile = undef;
my %keys = ();
 
GetOptions(
  'mode|m=s' => \$mode,
  'tick|t=i' => \$tick,
  'help|h'   => \$help,
  'keys|k=f' => \$keysfile,
);
 
if ( $help || ( $mode ne 'line' && $mode ne 'time' ) || $tick <= 0 || ( defined($keysfile) && !-f $keysfile ) ) {
  my $USAGE = join '', <DATA>;
  print STDERR $USAGE and exit(1);
}
 
if ( $keysfile ) {
  open(K, $keysfile) or die "Couldn't open keys file '$keysfile': $!";
  while ( my $line = <K> ) {
    chomp $line;
    $keys{ $line }++;
  }
  close(K);
}
 
my %count = %keys;
my $offset = 0;
my $mark = 0;
my $offset = 0;
 
if ( $mode eq 'time' ) {
  $mark = time();
}
 
while ( my $element = <> ) {
  chomp $element;
  if ( scalar( %keys ) ) {
    $count{ $element }++ if $keys{ $element };
  }
  else {
    $count{ $element }++;
  }
 
  if ( $mode eq 'line' ) {
    $offset++;
    $mark++;
    if ( $mark >= $tick ) {
      $mark = 0;
      flush();
    }
  }
  elsif ( $mode eq 'time' ) {
    if ( $mark + $tick < time() ) {
      $offset = time();
      $mark = time();
      flush();
    }
  }
}
flush();
 
sub flush {
  print "summary/$tick @ $offset\n";
  foreach my $k ( sort keys %count ) {
    print "\t", $count{ $k }, "\t", $k, "\n";
  }
  %count = %keys;
}
 
__DATA__
Usage: aggregate [-h] [-m <time|line>] [-t <# of seconds or lines>] [-k <keys file>]
 
Read lines from STDIN.  Print lines by frequency per input lines or time.
 
  -h    show help (this message)
  -m    mode.  one of 'time' or 'line'.  defaults to 'line'.
  -t    aggregation size.  an integer.  value is # of lines ('line' mode) or # of
        seconds ('time' mode) after which an aggregation is triggered.  defaults to 100.
  -k    keys file.  a text file of strings to *exactly* match in the input, one per line.
        if a keys file is provided, lines not present in the keys file will be silently
        ignored.

Administration
Analytics
Perl

Comments (0)

Permalink

shuffle – randomize a stream of data

Here’s another little shell utility I’ve been sitting on for a while. This one shuffles the line-oriented data read from a pipe. It has the notion of buffering and partial flushing so we can handle streams / very large data sets.

#!/usr/bin/perl
$|++;
use strict;
use Getopt::Long;
 
my $USAGE = join '', <DATA>;
 
my $B = 0;
my $D = 1;
my $H = 0;
 
GetOptions ("buffer|b=i"   => \$B,
            "draw|d=i"     => \$D,
            "help|h"       => \$H,
           ); 
 
if ( $D == 1 && $B > 0 ) {
  $D = $B;
}
 
if (
  ($B < 0) ||
  ($D < 1) ||
  ($B > $D) ||
  ($H)
) {
  print $USAGE and exit(1);
}
 
 
my @buf = ();
 
while ( my $element = <> ) {
  #buffer whole stream
  if ( $B == 0 ) {
    push @buf, $element;
  }
  #no-op
  elsif ( $B == 1 ) {
    print $element;
  }
  #buffer window
  else {
    push @buf, $element;
    if ( scalar( @buf ) >= $D && scalar( @buf ) > $B ) {
      flush();
    }
  }
}
flush();
 
sub flush {
  for ( my $j = scalar( @buf ) - 1 ; $j >= 0 ; $j-- ) {
    my $swap = int(rand($j));
    if ( $swap != $j ) {
      ($buf[ $j ], $buf[ $swap ]) = ($buf[ $swap ], $buf[ $j ]);
    }
  }
  while ( scalar( @buf ) - 1 > $B - $D ) {
    print shift @buf;
  }
}
 
 
__DATA__
Usage: shuffle [-h] [-b <buffer size>] [-d <draw size>]
 
Shuffle lines from a stream on STDIN.  Write lines to STDOUT.
 
  -h    show help (this message)
  -b    buffer size
        (default 0.  indicates shuffle whole stream, then write)
        range: 1..
  -d    draw size
        (defaults to value of -b.  number of items to remove from the
        buffer when it fills)
        range: 1..buffer size
 
You have to parameters available (besides -h for help).
 
* buffer size (-b).  Determines how many elements to temporarily hold
before shuffling.  The advantage of this buffer is to allow shuffling on
very long streams that would not fit into system memory.  The
disadavantage is that it is not a truly random shuffle, as each input
element can appear at most buffer-size positions away from the original
position.  Buffer size defaults to zero, so make sure to set it if your
data set size is large.
 
* draw size (-d).  Determines how frequently the buffer is shuffled and
flushed.  Rather than shuffling/flushing all elements in the buffer, only
do D elements.  The advantage here is elements can appear more than
buffer-size positions away from the original position.  The disadvantage
is that shuffling is done B/D times more frequently.  Draw size defaults
to buffer size, and has no effect.  Set it to 1 to maximize randomness.
 
Copyright/License:
 
  Allen Day <allenday@ucla.edu>, licensed under GPL 2006-2008

Administration
Analytics
Perl

Comments (0)

Permalink

sample – probabilistic sampling from a stream of lines

I’m frequently monitoring webservers, cache servers, database servers, etc by tailing their log files. See my previous post on making logs easier to monitor by color.

Sometimes you also have too much data, and you don’t want to look at all of it. Use this to sample.

sample source:

#!/usr/bin/perl
$|++;
use strict;
use Getopt::Long;
 
my $USAGE = join '', <DATA>;
 
my $T = 0;
my $K = 0;
my $P = 1;
my $H = 0;
my $N = 0;
my $S = 0;
 
GetOptions ("time|t=i"     => \$T,
            "number|n=i"   => \$N,
            "count|k=i"    => \$K,
            "prob|p=f"     => \$P,
            "shuffle|s"    => \$S,
            "help|h"       => \$H,
           ); 
 
if (
  ($T > 0 && $P != 1) ||
  ($K > 0 && $P != 1) ||
  ($K < 0 || $P < 0 || $T < 0 || $N < 0 || $P > 1 ) ||
  ($T > 0 && $N > 0) ||
  ($H)
) {
  print $USAGE and exit(1);
}
 
my $position = 0;
my @buf = ();
my $before = time();
 
while ( my $element = <> ) {
  # sample full stream, report at the end
  # sample K elements every T seconds
  if ( $K > 0 ) {
    if ( scalar( @buf ) < $K ) {
      push @buf, [$position, $element];
    }
    elsif ( $K/$position < rand() ) {
      my $index = int(rand($K));
      $buf[ $index ] = [$position, $element]; #save position for sort
    }
    #time-based K-sampling
    if ( $T > 0 && time() > $before + $T ) {
      flush();
    }
    #event-based K-sampling
    elsif ( $N > 0 && $position > $N ) {
      flush();
    }
  }
  # sample with probability
  elsif ( $P < 1 && rand() < $P ) {
    print $element;
  }
  $position++;
}
flush();
 
sub flush {
  $before = time();
  #Knuth shuffle
  if ( $S ) {
    for ( my $j = scalar( @buf ) - 1 ; $j >= 0 ; $j-- ) {
      my $swap = int(rand($j));
      if ( $swap != $j ) {
        ($buf[ $j ], $buf[ $swap ]) = ($buf[ $swap ], $buf[ $j ]);
      }
      print $buf[ $j ]->[ 1 ];
    }
  }
  else {
    foreach my $b ( sort {$a->[0] <=> $b->[0]} @buf ) {
      print $b->[1];
    }
  }
  @buf = ();
  $position = 0;
}
 
 
__DATA__
Usage: sample -[[h][p][t[k[n]]]]
 
Sample lines from a stream on STDIN.  Write lines to STDOUT.
 
  -h    show help (this message)
  -k    sample K elements from stream
        (default 0)
        range: 0..
  -p    sample elements from stream with probability
        (default 1)
        range: 0 <= p <= 1
  -n    sample over windows of N elements
        (default 0)
        range: 0..
  -t    sample over windows of T seconds
        (default 0, instantaneous with -p, infinity with -k)
        range: 0..
  -s    shuffle outputs
        (default false)
 
There are two modes of sampling:
 
  * sample with probability (-p)
  * sample a fixed number of elements (-k)
 
Both modes sample over a given time interval in seconds (-t).
-t defaults to zero (process full stream).  -p can only be
used alone.  -n can only be used with -k
 
Examples:
 
  * sample K elements from a stream:
    cat /etc/passwd | sample -k 5
 
  * sample 1% of elements from a stream:
    tail -f /var/logs/httpd/access_log | sample -p 0.01
 
  * sample K elements from a stream every 30 seconds:
    tail -f /var/logs/httpd/access_log | sample -k 5 -t 30
 
  * sample K elements from a stream every 30 seconds, shuffled:
    tail -f /var/logs/httpd/access_log | sample -k 5 -t 30 -s
 
  * sample K elements from a stream every 100 elements:
    tail -f /var/logs/httpd/access_log | sample -k 5 -n 100
 
Copyright/License:
 
  Allen Day <allenday@ucla.edu>, licensed under GPL 2006-2008

Administration
Analytics
Perl

Comments (0)

Permalink

Examples for data import, export, and transport with HBase

I’m in the process of setting up an analytic workflow at BiggerBoat. It’s looking like the main theme in data structures around here will be the sparse matrix. So I’ve been playing with opensource technologies for sparse matrices. Apache Hadoop’s HBase is looking like a good choice for now, maybe Hive later.

Right now I’m getting familiar with the former. As part of this, I’m improving the docs on the wiki to make them more user- (as opposed to core developer-) friendly. My documentation goal right now is to add some data transformation example code. There are already lots of hadoop examples for doing text -&gt text mapping, e.g. grep, cat, etc. For HBase not so much. I.e.

  • text to text (done, many examples
  • flatfile to HBase table (Bulk loader in the HBase wiki, I haven’t tried it yet)
  • HBase table to flatfile
  • HBase table to HBase table

I’ll be adding updated, complete, and simple code for the latter two (three?) in the next few days to the HBase/MapReduce page.

Analytics
Distributed Systems
Java

Comments (1)

Permalink