Analytics

Examples for data import, export, and transport with HBase

I’m in the process of setting up an analytic workflow at BiggerBoat. It’s looking like the main theme in data structures around here will be the sparse matrix. So I’ve been playing with opensource technologies for sparse matrices. Apache Hadoop’s HBase is looking like a good choice for now, maybe Hive later.

Right now I’m getting familiar with the former. As part of this, I’m improving the docs on the wiki to make them more user- (as opposed to core developer-) friendly. My documentation goal right now is to add some data transformation example code. There are already lots of hadoop examples for doing text -&gt text mapping, e.g. grep, cat, etc. For HBase not so much. I.e.

  • text to text (done, many examples
  • flatfile to HBase table (Bulk loader in the HBase wiki, I haven’t tried it yet)
  • HBase table to flatfile
  • HBase table to HBase table

I’ll be adding updated, complete, and simple code for the latter two (three?) in the next few days to the HBase/MapReduce page.

Analytics
Distributed Systems
Java

Comments (0)

Permalink

Cookies, IP Addresses and Unique Users

I’ve been thinking about how to track unique users today. These are my so-far-unorganized thoughts. Please comment!

You can’t track users by cookie alone for a couple of reasons: 1) they might use multiple computers, 2) they might delete cookies, 3) multiple users might share the same computer (same account=same cookie)

You also can’t track users by IP address alone for some more reasons: 1) they might be using a mobile device or portable computer that moves from IP to IP, 2) there could be multiple machines passing through a single gateway IP (i.e. LAN NAT).

However, if you combine the cookie/IP information together, you can start to address some of these issues. Let’s assume you have some webserver logs that minimally contain <IP address "A">, <cookie ID "C">, <timestamp T> triples.

            === time ===>
PATTERN 1:
C1-A1 ---      ---
C1-A2    ---         ---
C1-A3       ---   ---

PATTERN 2:
C1-A1 ------
C2-A1       ------
C1-A2             ------

PATTERN 3:
C1-A1 -----!
C2-A1       -----!
C3-A1             ------

PATTERN 4:
C1-A1 ------      ------
C2-A1     ----------

PATTERN 5:
C1-A1 -----     -----
C2-A1      -----     ---

This matrix indicates the compatibility of each of the patterns (P1-P5)
with several different classes of cookie/IP address combination that we
might want to detect.

                                     patterns
                               P1  P2  P3  P4  P5
profiles                      --------------------
multiple users per IP        | -   +   -   +   +
multiple users per cookie    | -   -   -   -   -
multiple IPs per user        | +   +   -   -   -
multiple cookies per user    | -   -   +   -   +
cookie deletion              | -   -   +   -   -
"permanent" IP change        | -   +   -   -   -

Note that none of these patterns gives any indication for the “multiple
users per cookie” profile. To assess if there is more than one
user/cookie, you might want to look at the context in which you’re
observing the cookie. Consider attributes like (timezone corrected)
time-of-day, day-of-week, type of content being viewed.

Analytics
Random musings

Comments (0)

Permalink

pcoc - Piped Command Output Colorizer

I’m frequently monitoring webservers, cache servers, database servers, etc by tailing their log files, e.g.

tail -f /etc/httpd/logs/access_log

I like the –color option provided by grep, but found it to be too limited (only one allowed, no wildcard support). After a bit of searching to see if a tool existed for doing arbitrary colorizing, I found
acoc, the Arbitrary Command Output Colourer.

…which almost did what I needed, but couldn’t read from a pipe. So I wrote pcoc, the Piped Command Output Colorizer. I’m only publishing this because I’ve been using it for about 1 1/2 years, and still find it useful.

Source code at the end of this post. Here’s an example that highlights iPhone/iPod user agents and requests with a 500/400/404 HTTP response:

tail -f ./logs/access_log | pcoc -f '(iPod)=bold cyan' -f '(iPhone)=bold magenta' -f '\b(500|404|400)\b=red on_black'

Sorry, no screenshots :(.

pcoc source:

#!/usr/bin/perl
use strict;
use Getopt::Long;
use Term::ANSIColor qw(colored);
$|++;
 
my %format = ();
GetOptions( "format|f=s" => \%format);
 
if ( ! keys %format ) {
  print <<"EOF";
Synopsis:
        pcoc - Piped Command Output Colorizer.  Inspired by acoc.
 
Usage:
 
        $0 -f '<regex1>=<color1>' -f '<regex2>=<color2>'
 
$0 reads from a pipe and colorizes each line based on format (-f) parameters.
 
Arguments:
 
-f '<regex>=<color>'  Required, multiple values okay. 
 
        <regex>: A regular expression from which \$1 will be colorized
 
        <color>: One or more colorization keywords, see perldoc
        Term::ANSIColor, but briefly they are:
 
        boldness:
                bold
        foreground:
                red yellow green blue magenta cyan black white
        background:
                on_red on_yellow on_green on_blue on_magenta on_cyan
                on_black on_white
 
Examples:
 
        #highlight the account's shell in bold green
        cat /etc/passwd | $0 -f '.+:([^:]+)\$=bold green'
 
        #... and the username in red with black background
        cat /etc/passwd | $0 -f '([^:]+)=red on_black' -f '.+:([^:]+)\$=bold green'
 
Copyright/License:
 
        Allen Day <allenday\@ucla.edu>, licensed under GPL 2006-2008
 
EOF
  exit(1);
}
 
while ( my $line = <> ) {
  chomp( $line );
  foreach my $f ( keys %format ) {
    my @c = split ',', $format{ $f };
 
    if ( $line =~ qr/$f/ ) {
      while ( my ( $s, $t ) = $f =~ m/^(.*?)\(+(.+?)\)+/ ) {
        my $c = pop @c || last;
        $line =~ s/($s)($t)/$1.colored($2,$c)/e;
        $f =~ s/^(.*?)\((.+?)\)/$1$2/;
      }
    }
  }
  print "$line\n";
}

Administration
Analytics
Perl

Comments (0)

Permalink