October 2008

ZIP code demographic data with Perl

I needed some demographics data earlier this week and tried using the SF3 files from census.gov’s “Census 2000” data set.

What a time sink. Ugh.

The methods used are very well documented, and I learned a lot about the census. What I was not able to learn, however, was how to actually extract the data from the flat files. Look at what Joshua Tauberer went through to get some idea of the pain level.

Finally I got fed up and wrote a screen scraper for ZIPskinny.com in Perl. It’s one-off crappy code. You can get it from CPAN under namespace Geo::Demo::Zipskinny.

Hope it saves you some time. Leave me a comment if you have working code that can deal with SF3 files.

Here’s a little ZIP code to rich-vs-poor plot I made earlier.

Analytics
Perl
Science
Statistics


Java port of GNU getopt

This looks useful:
http://www.urbanophile.com/arenn/hacking/getopt/gnu.getopt.Getopt.html

Java


Webserver logs access time by region/language

As anyone with a popular website knows, there’s a big difference in the resources required for peak vs. off-peak hours, and you typically have to pay for peak usage even if you don’t always use it (e.g. 95th-percentile bandwidth billing).
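To make the billing point concrete, here is a minimal Python sketch of how a 95th-percentile bill is commonly computed. It assumes the provider sorts the interval samples, throws away the top 5%, and bills the highest sample that remains; the function name and the sample numbers are made up for illustration, and real providers may round differently.

```python
def billable_rate(samples_mbps, percentile=95):
    """Return the rate billed under percentile billing.

    Assumption: all interval samples are sorted, the top
    (100 - percentile)% are discarded, and the highest remaining
    sample is the billed rate.
    """
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    # Number of samples kept after dropping the top slice (integer ceiling).
    keep = (len(ordered) * percentile + 99) // 100
    return ordered[max(keep, 1) - 1]


# Hypothetical data: 19 quiet samples and a single large burst.
quiet_with_one_burst = [10] * 19 + [100]
# Same, but the burst lasts for two samples instead of one.
quiet_with_two_bursts = [10] * 18 + [100] * 2
```

With 20 samples, one burst falls inside the discarded 5% and is free, while a second burst sample sets the bill. That is why filling the troughs costs nothing extra as long as the peaks stay where they are.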

Frugal as I am, I was curious to see if I could increase traffic during what are currently off-peak hours. It seemed sensible that people in different regions of the world might be accessing the site during those hours.

So I aggregated data by country code/language and 10-minute time segment. Applied a Daniell smoothing kernel (a sliding window) of 6 segments (1 hour) and plotted a row-scaled heatmap in R. Rows are clustered so similar access patterns are next to one another, with the left-hand-side dendrogram indicating dissimilarity between rows. Yellow-white is a traffic burst. I’ll post the code and data for this later.

access times by country/language
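The smoothing and row-scaling steps can be sketched in a few lines of Python. This is not the R code behind the plot, just an illustration under stated assumptions: the window is an equal-weight moving average over the next 6 bins that wraps around midnight (R's Daniell kernel centers and sizes its window differently), and rows are scaled to zero mean and unit sample standard deviation. The function names are hypothetical.

```python
from statistics import mean, stdev


def smooth(counts, window=6):
    """Circular moving average over `window` consecutive bins.

    Assumption: equal weights, and the series wraps around (the bin
    after the last one is the first one).
    """
    n = len(counts)
    return [mean(counts[(i + k) % n] for k in range(window)) for i in range(n)]


def scale_row(row):
    """Center a row on zero and divide by its sample standard deviation."""
    if len(row) < 2:
        return [0.0] * len(row)
    mu, sigma = mean(row), stdev(row)
    if sigma == 0:
        # A flat row has no pattern to show; map it to all zeros.
        return [0.0] * len(row)
    return [(x - mu) / sigma for x in row]
```

Scaling each row separately is what lets a small country's daily rhythm show up next to a large one's: the heatmap then compares the shape of the curves, not the volume.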

As it turns out, the main off-peak trough corresponds to the middle of the Pacific ocean. Kinda watery for people to live there. Oh well, I tried.

Analytics
R
Science
Statistics


R Matrix sparseVector operations

I’ve only done minus(vec1,vec2) so far. More to come.

library(Matrix);
 
#F = strsplit(as.character(mm[1,2]),', ')[[1]]
#G = matrix( as.numeric(unlist(strsplit(F[c(-1,-length(F))],':'))), nrow=2 )
#tt = new('dsparseVector', x=G[2,], i=as.integer(G[1,]), length=max(as.integer(G[1,])))
 
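# elementwise difference v1 - v2 of two sparseVectors
minus = function(v1,v2) {
  i = sort(union(v1@i,v2@i));
  s = length(i);
 
  x = vector(mode='numeric',length=s);
  for ( k in seq_len(s) ) {    # seq_len() is safe when s == 0
    z = i[k];
    if ( z <= length(v1) ) {   # '<=' so the last position is included
      x[k] = as.numeric(v1[z]);
    }
    if ( z <= length(v2) ) {
      x[k] = x[k] - as.numeric(v2[z]);
    }
  }
  # the result is as long as the longer of the two inputs
  new("dsparseVector", x=x, i=i, length=max(length(v1),length(v2)))
}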
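For comparison, the same idea in Python, with each sparse vector held as a plain dict from index to value. This is only a sketch of the technique, not a port of the Matrix class: the dict representation and the function name are assumptions, and unlike the R version above it drops entries that cancel to zero.

```python
def sparse_minus(v1, v2):
    """Subtract two sparse vectors stored as {index: value} dicts.

    Indices missing from a dict count as zero. Entries whose difference
    is exactly zero are dropped so the result stays sparse.
    """
    result = {}
    for i in sorted(v1.keys() | v2.keys()):
        diff = v1.get(i, 0.0) - v2.get(i, 0.0)
        if diff != 0:
            result[i] = diff
    return result
```

Only the union of stored indices is visited, so the cost depends on the number of non-zero entries and not on the nominal vector length.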

R
Statistics


WordPress – collapse redundant tags

I’ve been experimenting with automation of WordPress posts. Probably I’m doing something wrong with the way I make the XML-RPC calls, but I find that I end up with redundant tags in my database. For instance, if I tag two separate, RPC-posted posts with “orange”, I get two different tags both called “orange”. Until I figure out how to fix this properly, here’s a little script that will clean up the database by consolidating all redundantly named tags into one tag. Back up your database before using this.

#!/usr/bin/perl
use strict;
use DBI;
 
######configuration
my $PREFIX = 'wp_h5otpn_';
my $DB = '';
my $HOST = '';
my $USER = '';
my $PASS = '';
######
# DBI reports connection failures in $DBI::errstr, not $!
my $dbh = DBI->connect(qq(dbi:mysql:database=$DB;host=$HOST), $USER, $PASS) or die $DBI::errstr;
 
my $term_sth   = $dbh->prepare(qq(SELECT * FROM (SELECT name, count(name) AS c FROM ${PREFIX}terms GROUP BY name) AS d WHERE d.c > 1));
my $name_sth   = $dbh->prepare(qq(SELECT term_id FROM ${PREFIX}terms WHERE name = ?));
my $update_sth = $dbh->prepare(qq(UPDATE ${PREFIX}term_relationships SET term_taxonomy_id = (SELECT term_taxonomy_id FROM ${PREFIX}term_taxonomy WHERE term_id = ?) WHERE term_taxonomy_id = (SELECT term_taxonomy_id FROM ${PREFIX}term_taxonomy WHERE term_id = ?)));
my $delete1_sth = $dbh->prepare(qq(DELETE FROM ${PREFIX}term_taxonomy WHERE term_id = ?));
my $delete2_sth = $dbh->prepare(qq(DELETE FROM ${PREFIX}terms WHERE term_id = ?));
$term_sth->execute();
 
while ( my ( $name, $count ) = $term_sth->fetchrow_array() ) {
  $name_sth->execute( $name );
  my $new = undef;
  while ( my ( $term_id ) = $name_sth->fetchrow_array() ) {
    if ( ! $new ) {
      $new = $term_id;
      next;
    }
    warn "$name\t$term_id\t->\t$new";
    $update_sth->execute( $new, $term_id );
    $delete1_sth->execute( $term_id );
    $delete2_sth->execute( $term_id );
  }
}
 
__DATA__
SELECT t.term_id, t.name, r.*, s.* FROM wp_h5otpn_terms AS t, wp_h5otpn_term_taxonomy AS r, wp_h5otpn_term_relationships AS s WHERE s.term_taxonomy_id = r.term_taxonomy_id AND r.term_id = t.term_id AND r.taxonomy = 'post_tag' AND t.name = 'whatever';
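The core of the cleanup is deciding which term id survives for each name. Here is a small Python sketch of just that planning step, with no database involved. The input is a hypothetical list of (term_id, name) pairs standing in for rows of the terms table, and it keeps the lowest id per name, whereas the script above keeps whichever id the query returns first.

```python
from collections import defaultdict


def plan_merges(terms):
    """Map each duplicate term_id to the id that should replace it.

    `terms` is an iterable of (term_id, name) pairs. For every name that
    occurs more than once, the lowest id is kept and all others point to it.
    """
    ids_by_name = defaultdict(list)
    for term_id, name in terms:
        ids_by_name[name].append(term_id)

    merges = {}
    for ids in ids_by_name.values():
        keep = min(ids)
        for term_id in ids:
            if term_id != keep:
                merges[term_id] = keep
    return merges
```

Printing the plan before touching anything is a cheap dry run: each key is a term that would be deleted, and each value is the term its posts would be re-pointed to.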

Administration
WordPress


aggregate – report event counts from a stream

Another shell utility. This one is useful for, e.g. counting 404, 500, 200, 302 HTTP codes from a log file.

#!/usr/bin/perl
$|++;
use strict;
use Getopt::Long;
 
my $mode = 'line';
my $tick = 100;
my $help = undef;
my $keysfile = undef;
my %keys = ();
 
GetOptions(
  'mode|m=s' => \$mode,
  'tick|t=i' => \$tick,
  'help|h'   => \$help,
  'keys|k=s' => \$keysfile,   # =s: the keys file name is a string (=f would demand a number)
);
 
if ( $help || ( $mode ne 'line' && $mode ne 'time' ) || $tick <= 0 || ( defined($keysfile) && !-f $keysfile ) ) {
  my $USAGE = join '', <DATA>;
  print STDERR $USAGE and exit(1);
}
 
if ( $keysfile ) {
  open(my $kfh, '<', $keysfile) or die "Couldn't open keys file '$keysfile': $!";
  while ( my $line = <$kfh> ) {
    chomp $line;
    $keys{ $line } = 0;  # value doubles as the starting count, so it must be 0
  }
  close($kfh);
}

my %count = %keys;
my $offset = 0;
my $mark = 0;

if ( $mode eq 'time' ) {
  $mark = time();
}

while ( my $element = <> ) {
  chomp $element;
  if ( scalar( %keys ) ) {
    # key values are 0, so test for existence rather than truth
    $count{ $element }++ if exists $keys{ $element };
  }
  else {
    $count{ $element }++;
  }
 
  if ( $mode eq 'line' ) {
    $offset++;
    $mark++;
    if ( $mark >= $tick ) {
      $mark = 0;
      flush();
    }
  }
  elsif ( $mode eq 'time' ) {
    if ( $mark + $tick < time() ) {
      $offset = time();
      $mark = time();
      flush();
    }
  }
}
flush();
 
sub flush {
  print "summary/$tick @ $offset\n";
  foreach my $k ( sort keys %count ) {
    print "\t", $count{ $k }, "\t", $k, "\n";
  }
  %count = %keys;
}
 
__DATA__
Usage: aggregate [-h] [-m <time|line>] [-t <# of seconds or lines>] [-k <keys file>]
 
Read lines from STDIN.  Print lines by frequency per input lines or time.
 
  -h    show help (this message)
  -m    mode.  one of 'time' or 'line'.  defaults to 'line'.
  -t    aggregation size.  an integer.  value is # of lines ('line' mode) or # of
        seconds ('time' mode) after which an aggregation is triggered.  defaults to 100.
  -k    keys file.  a text file of strings to *exactly* match in the input, one per line.
        if a keys file is provided, lines not present in the keys file will be silently
        ignored.
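The line-mode logic boils down to a counter that is emitted and reset every N lines. A minimal Python sketch of that idea follows; it is an illustration, not a port of the script. The generator name and its parameters are hypothetical, time mode is left out, and a trailing partial window is only emitted if it saw at least one line.

```python
from collections import Counter


def aggregate(lines, tick=100, keys=None):
    """Yield one {line: count} dict per `tick` input lines.

    If `keys` is given, only lines in that set are counted, and every key
    is reported in each window (with 0 if unseen).
    """
    wanted = set(keys) if keys else None

    def fresh():
        # Start each window with explicit zeros for the requested keys.
        return Counter({k: 0 for k in wanted}) if wanted else Counter()

    counts, seen = fresh(), 0
    for line in lines:
        line = line.rstrip("\n")
        if wanted is None or line in wanted:
            counts[line] += 1
        seen += 1
        if seen >= tick:
            yield dict(counts)
            counts, seen = fresh(), 0
    if seen:
        yield dict(counts)
```

Fed the status-code column of an access log, each yielded dict is one summary block: how many 200s, 404s and so on appeared in the last `tick` lines.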

Administration
Analytics
Perl


shuffle – randomize a stream of data

Here’s another little shell utility I’ve been sitting on for a while. This one shuffles the line-oriented data read from a pipe. It has the notion of buffering and partial flushing so we can handle streams / very large data sets.

#!/usr/bin/perl
$|++;
use strict;
use Getopt::Long;
 
my $USAGE = join '', <DATA>;
 
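my $B = 0;
my $D = undef;  # draw size; resolved after option parsing
my $H = 0;

GetOptions ("buffer|b=i"   => \$B,
            "draw|d=i"     => \$D,
            "help|h"       => \$H,
           ); 

# draw size defaults to buffer size (or 1 when the whole stream is buffered)
if ( !defined($D) ) {
  $D = $B > 0 ? $B : 1;
}

if (
  ($B < 0) ||
  ($D < 1) ||
  ($B > 0 && $D > $B) ||   # draw size may not exceed buffer size
  ($H)
) {
  print $USAGE and exit(1);
}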
 
 
my @buf = ();
 
while ( my $element = <> ) {
  #buffer whole stream
  if ( $B == 0 ) {
    push @buf, $element;
  }
  #no-op
  elsif ( $B == 1 ) {
    print $element;
  }
  #buffer window
  else {
    push @buf, $element;
    if ( scalar( @buf ) >= $D && scalar( @buf ) > $B ) {
      flush();
    }
  }
}
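flush(1);  # end of stream: drain everything still buffered

sub flush {
  my ( $drain ) = @_;
  # Fisher-Yates shuffle: the swap partner is drawn from 0..$j inclusive
  for ( my $j = scalar( @buf ) - 1 ; $j > 0 ; $j-- ) {
    my $swap = int(rand($j + 1));
    if ( $swap != $j ) {
      ($buf[ $j ], $buf[ $swap ]) = ($buf[ $swap ], $buf[ $j ]);
    }
  }
  # emit $D elements, or the whole buffer at end of stream
  my $keep = $drain ? 0 : $B - $D + 1;
  $keep = 0 if $keep < 0;
  while ( scalar( @buf ) > $keep ) {
    print shift @buf;
  }
}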
 
 
__DATA__
Usage: shuffle [-h] [-b <buffer size>] [-d <draw size>]
 
Shuffle lines from a stream on STDIN.  Write lines to STDOUT.
 
  -h    show help (this message)
  -b    buffer size
        (default 0.  indicates shuffle whole stream, then write)
        range: 1..
  -d    draw size
        (defaults to value of -b.  number of items to remove from the
        buffer when it fills)
        range: 1..buffer size
 
You have two parameters available (besides -h for help).
 
* buffer size (-b).  Determines how many elements to temporarily hold
before shuffling.  The advantage of this buffer is to allow shuffling on
very long streams that would not fit into system memory.  The
disadvantage is that it is not a truly random shuffle, as each input
element can appear at most buffer-size positions away from the original
position.  Buffer size defaults to zero, so make sure to set it if your
data set size is large.
 
* draw size (-d).  Determines how frequently the buffer is shuffled and
flushed.  Rather than shuffling/flushing all elements in the buffer, only
do D elements.  The advantage here is elements can appear more than
buffer-size positions away from the original position.  The disadvantage
is that shuffling is done B/D times more frequently.  Draw size defaults
to buffer size, in which case it has no effect.  Set it to 1 to maximize
randomness.
 
Copyright/License:
 
  Allen Day <allenday@ucla.edu>, licensed under GPL 2006-2008
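Here is the buffer/draw idea as a short Python generator, for readers who want to experiment with the trade-off described in the usage text. It is a sketch under assumptions, not a translation of the script: the function name and parameters are hypothetical, the buffer is reshuffled every time it fills, and Python's random.shuffle does the shuffling.

```python
import random


def buffered_shuffle(items, buffer_size=0, draw=None, rng=random):
    """Yield items in shuffled order using a bounded buffer.

    buffer_size=0 shuffles the whole input in memory. Otherwise, each time
    the buffer holds `buffer_size` items it is shuffled and `draw` items
    are emitted (default: the whole buffer).
    """
    if draw is None:
        draw = buffer_size
    if buffer_size < 0 or (buffer_size > 0 and not 1 <= draw <= buffer_size):
        raise ValueError("need buffer_size >= 0 and 1 <= draw <= buffer_size")

    buf = []
    for item in items:
        buf.append(item)
        if buffer_size > 0 and len(buf) >= buffer_size:
            rng.shuffle(buf)
            for _ in range(draw):
                yield buf.pop()
    # End of input: shuffle and drain whatever is left.
    rng.shuffle(buf)
    yield from buf
```

With a small `draw`, most of the buffer survives each flush, so an early line can linger and come out much later than `buffer_size` positions from where it went in.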

Administration
Analytics
Perl


sample – probabilistic sampling from a stream of lines

I’m frequently monitoring webservers, cache servers, database servers, etc by tailing their log files. See my previous post on making logs easier to monitor by color.

Sometimes you also have too much data, and you don’t want to look at all of it. Use this to sample.

sample source:

#!/usr/bin/perl
$|++;
use strict;
use Getopt::Long;
 
my $USAGE = join '', <DATA>;
 
my $T = 0;
my $K = 0;
my $P = 1;
my $H = 0;
my $N = 0;
my $S = 0;
 
GetOptions ("time|t=i"     => \$T,
            "number|n=i"   => \$N,
            "count|k=i"    => \$K,
            "prob|p=f"     => \$P,
            "shuffle|s"    => \$S,
            "help|h"       => \$H,
           ); 
 
if (
  ($T > 0 && $P != 1) ||
  ($K > 0 && $P != 1) ||
  ($K < 0 || $P < 0 || $T < 0 || $N < 0 || $P > 1 ) ||
  ($T > 0 && $N > 0) ||
  ($H)
) {
  print $USAGE and exit(1);
}
 
my $position = 0;
my @buf = ();
my $before = time();
 
while ( my $element = <> ) {
  $position++;  # 1-based count of elements seen in the current window
  # sample full stream, report at the end
  # sample K elements every T seconds
  if ( $K > 0 ) {
    if ( scalar( @buf ) < $K ) {
      push @buf, [$position, $element];
    }
    # reservoir sampling: keep the new element with probability K/position
    elsif ( rand() < $K / $position ) {
      my $index = int(rand($K));
      $buf[ $index ] = [$position, $element]; #save position for sort
    }
    #time-based K-sampling
    if ( $T > 0 && time() > $before + $T ) {
      flush();
    }
    #event-based K-sampling
    elsif ( $N > 0 && $position >= $N ) {
      flush();
    }
  }
  # sample with probability (-p 1, the default, passes every line through)
  elsif ( rand() < $P ) {
    print $element;
  }
}
flush();
 
sub flush {
  $before = time();
  #Knuth shuffle: the swap partner is drawn from 0..$j inclusive
  if ( $S ) {
    for ( my $j = scalar( @buf ) - 1 ; $j >= 0 ; $j-- ) {
      my $swap = int(rand($j + 1));
      if ( $swap != $j ) {
        ($buf[ $j ], $buf[ $swap ]) = ($buf[ $swap ], $buf[ $j ]);
      }
      print $buf[ $j ]->[ 1 ];
    }
  }
  else {
    # restore input order; $item avoids clashing with sort's $b
    foreach my $item ( sort {$a->[0] <=> $b->[0]} @buf ) {
      print $item->[1];
    }
  }
  @buf = ();
  $position = 0;
}
 
 
__DATA__
Usage: sample [-h] [-s] [-p <probability>] [-k <count> [-t <seconds> | -n <lines>]]
 
Sample lines from a stream on STDIN.  Write lines to STDOUT.
 
  -h    show help (this message)
  -k    sample K elements from stream
        (default 0)
        range: 0..
  -p    sample elements from stream with probability
        (default 1)
        range: 0 <= p <= 1
  -n    sample over windows of N elements
        (default 0)
        range: 0..
  -t    sample over windows of T seconds
        (default 0, instantaneous with -p, infinity with -k)
        range: 0..
  -s    shuffle outputs
        (default false)
 
There are two modes of sampling:
 
  * sample with probability (-p)
  * sample a fixed number of elements (-k)
 
Fixed-number sampling (-k) can be windowed, either by time in
seconds (-t) or by number of elements (-n), but not both.  Without
a window the full stream is processed.  -p can only be used alone.
 
Examples:
 
  * sample K elements from a stream:
    cat /etc/passwd | sample -k 5
 
  * sample 1% of elements from a stream:
    tail -f /var/logs/httpd/access_log | sample -p 0.01
 
  * sample K elements from a stream every 30 seconds:
    tail -f /var/logs/httpd/access_log | sample -k 5 -t 30
 
  * sample K elements from a stream every 30 seconds, shuffled:
    tail -f /var/logs/httpd/access_log | sample -k 5 -t 30 -s
 
  * sample K elements from a stream every 100 elements:
    tail -f /var/logs/httpd/access_log | sample -k 5 -n 100
 
Copyright/License:
 
  Allen Day <allenday@ucla.edu>, licensed under GPL 2006-2008
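The -k mode is a form of reservoir sampling. For reference, here is a minimal Python sketch of the textbook version (often called Algorithm R), which picks k items uniformly from a stream of unknown length in one pass. The function name and parameters are hypothetical, and windowing by time or line count is left out.

```python
import random


def reservoir_sample(items, k, rng=random):
    """Return up to k items chosen uniformly at random from an iterable."""
    sample = []
    for seen, item in enumerate(items, start=1):
        if len(sample) < k:
            # The first k items fill the reservoir directly.
            sample.append(item)
        else:
            # Draw a slot in 0..seen-1; the new item is kept only if the
            # slot lands inside the reservoir, i.e. with probability k/seen.
            slot = rng.randrange(seen)
            if slot < k:
                sample[slot] = item
    return sample
```

Memory use is fixed at k items no matter how long the stream runs, which is the property that makes this usable on a `tail -f`.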

Administration
Analytics
Perl


Hiromi’s Sonic Bloom @ the Jazz Bakery

I saw Hiromi Uehara’s Sonicbloom at the Jazz Bakery Saturday. Wow! I’m now a fan.

Check out this bootleg of Hiromi covering Gershwin on YouTube. She played this one at the show.

Fun
Life
