Upcoming AI / Machine Learning Conferences
A (partial) list I found today. Doesn’t include NIPS, so I’m not sure how exhaustive it is, but it has a bunch I haven’t seen before.
♥data♥
{ Category Archives }
A (partial) list I found today. Doesn’t include NIPS, so I’m not sure how exhaustive it is, but it has a bunch I haven’t seen before.
I needed some demographics data earlier this week and tried using the SF3 files from census.gov’s “Census 2000″ data set.
What a time sink. Ugh.
The methods used are very well documented, and I learned a lot about the census. What I was not able to learn, however, was how to actually extract the data from the flat files. Look at what Joshua Tauberer went through to get some idea of the pain level.
Finally I got fed up and wrote a screen scraper for ZIPskinny.com in Perl. It’s one-off crappy code. You can get it from CPAN under namespace Geo::Demo::Zipskinny.
Hope it saves you some time. Leave me a comment if you have working code that can deal with SF3 files.
Here’s a little ZIP code to rich-vs-poor plot I made earlier.

As anyone with a popular website knows, there’s a big difference in the resources required for peak vs. off-peak hours and you typically have to pay for peak usage even if you don’t always use it (e.g. 95th percentile bandwidth billing)
Frugal as I am, I was curious to see if I could increase traffic during what are off-peak hours. Seemed sensible that people in different regions of the world might be accessing during off-hours.
So I aggregated data by country code/language and 10-minute time segment. Applied a Daniell smoothing kernel (a sliding window) of 6 segments (1 hour) and plotted a a row-scaled heatmap in R. Rows are clustered so similar access patterns are next to one another, with the left-hand-side dendrogram indicating dissimilarity between rows. Yellow-white is a traffic burst. I’ll post the code and data later for how I made this.
As it turns out, the main off-peak trough corresponds to the middle of the Pacific ocean. Kinda watery for people to live there. Oh well, I tried.
I’ve only done minus(vec1,vec2) so far. More to come.
library(Matrix);
#F = strsplit(as.character(mm[1,2]),', ')[[1]]
#G = matrix( as.numeric(unlist(strsplit(F[c(-1,-length(F))],':'))), nrow=2 )
#tt = new('dsparseVector', x=G[2,], i=as.integer(G[1,]), length=max(as.integer(G[1,])))
minus = function(v1,v2) {
i = sort(union(v1@i,v2@i));
s = length(i);
x = vector(mode='numeric',length=s);
for ( k in 1:s ) {
z = i[k];
if ( z < length(v1) ) {
x[k] = as.numeric(v1[z]);
}
if ( z < length(v2) ) {
x[k] = x[k] - as.numeric(v2[z]);
}
}
new("dsparseVector", x=x, i=i, length=max(v1@i,v2@i))
}Setting up Taste v1.7.2 on a CentOS 4 x86_64 box.
Taste has merged with Mahout now, but I still want to do this standalone b/c I’m having trouble getting the JUnit tests to pass for Mahout. With that out of the way…
These are the shell commands I assembled after following the Taste Demo guide.
#make sure you have ant, and the JDK. I don't recommend the CentOS stock, get them from Sun/Apache #download necessary .jar files, sources, data files. unpack/move them to correct locations. wget http://internap.dl.sourceforge.net/sourceforge/taste/taste-1.7.2.zip wget http://internap.dl.sourceforge.net/sourceforge/proguard/proguard4.2.zip wget http://www.grouplens.org/system/files/million-ml-data.tar__0.gz wget http://www.hightechimpact.com/Apache/tomcat/tomcat-5/v5.5.26/bin/apache-tomcat-5.5.26.tar.gz unzip taste-1.7.2.zip unzip proguard4.2.zip tar -xvzf million-ml-data.tar__0.gz tar -xvzf apache-tomcat-5.5.26.tar.gz cp proguard4.2/lib/proguard.jar lib/ mv [mr]*.dat src/example/com/planetj/taste/example/grouplens/ #start up tomcat on port 8080 (default) JAVA_OPTS="-server -da -dsa -Xms1024m -Xmx1024m" JAVA_HOME=/usr/java/jdk1.6.0_02 sh apache-tomcat-5.5.26/bin/startup.sh #build taste.war, and inject it into tomcat JDK_HOME=/usr/java/jdk1.6.0_02 JAVA_HOME=/usr/java/jdk1.6.0_02 ant build-grouplens-example cp taste.war apache-tomcat-5.5.26/webapps/ #test the app. may take a minute or two on the first query. wget -O - -S 'http://localhost:8080/taste/RecommenderServlet?userID=1&debug=true'
Once you get that working, you can tweak the demo slightly to work on another data set. You just need to know the grouplens file format. ratings.dat is of the format:
UserID::MovieID::Rating::Timestamp
e.g.
1::1193::5::978300760
and movies.dat is of the format:
MovieID::Title::Genres
e.g.
1::Toy Story (1995)::Animation|Children's|Comedy
I wrote a script, let’s call it load_taste.pl, that can generate new movies.dat and ratings.dat files from an alternate data source. If I make these new files, I can drop them in place of the grouplens data, rebuild the .war files, and make recommendations on this other data set. Here’s how to do it:
#generate ratings.dat and movies.dat. move them to replace the grouplens data files. perl ./load_taste.pl mv [mr]*.dat src/example/com/planetj/taste/example/grouplens/ #get rid of stale .war and .jar files rm taste.war grouplens.jar #build the "quick" version of the example. see below for build.xml patch JDK_HOME=/usr/java/jdk1.6.0_02 JAVA_HOME=/usr/java/jdk1.6.0_02 ant build-grouplens-example-quick #inject the re-built .war file into tomcat. cp taste.war apache-tomcat-5.5.26/webapps/ #get rid of stale tomcat caches rm -rf apache-tomcat-5.5.26/webapps/taste apache-tomcat-5.5.26/temp/taste.*.txt
Note that I’ve defined a new ant build target called “build-grouplens-example-quick”. The purpose of this is that we only want to rebuild grouplens.jar and taste.war, not reoptimize/reverify/rebuild taste.jar, etc. The “build-grouplens-example” target takes ~55 seconds to complete on my machine, whereas the “build-grouplens-example-quick” target takes ~2 seconds. Here’s a diff to the original build.xml file:
--- /tmp/build.xml 2008-03-21 21:18:20.000000000 -0700 +++ ./build.xml 2008-06-30 11:46:18.000000000 -0700 @@ -161,6 +161,58 @@ <delete file="${my-web.xml}"/> </target> + <target depends="" name="build-taste-server-quick" description="Builds deployable web-based Taste server"> + <fail unless="my-recommender.jar" message="Please set -Dmy-recommender.jar=XXX"/> + <fail unless="my-recommender-class" message="Please set -Dmy-recommender-class=XXX"/> + <tempfile property="my-web.xml"/> + <copy file="src/main/com/planetj/taste/web/web.xml" tofile="${my-web.xml}"> + <filterset> + <filter token="RECOMMENDER_CLASS" value="${my-recommender-class}"/> + </filterset> + </copy> + <war destfile="${release-war}" webxml="${my-web.xml}"> + <lib dir="."> + <include name="${release-jar}"/> + <include name="${my-recommender.jar}"/> + </lib> + <lib dir="lib/axis"/> + <classes dir="build"> + <include name="com/planetj/taste/web/**"/> + </classes> + <fileset dir="src/main/com/planetj/taste/web"> + <include name="RecommenderService.jws"/> + </fileset> + </war> + <delete file="${my-web.xml}"/> + </target> + <target depends="" name="build-grouplens-example-quick" description="Builds deployable GroupLens example"> + <javac source="1.5" + target="1.5" + deprecation="true" + debug="true" + optimize="false" + destdir="build" + srcdir="src/example"> + <compilerarg value="-Xlint:all"/> + <classpath> + <pathelement location="${release-jar}"/> + <pathelement location="${annotations.jar}"/> + </classpath> + </javac> + <jar jarfile="grouplens.jar"> + <fileset dir="src/example"> + <include name="com/planetj/taste/example/grouplens/ratings.dat"/> + <include name="com/planetj/taste/example/grouplens/movies.dat"/> + </fileset> + <fileset dir="build"> + <include name="com/planetj/taste/example/grouplens/**"/> + </fileset> + </jar> + <property name="my-recommender.jar" value="grouplens.jar"/> + <property name="my-recommender-class" value="com.planetj.taste.example.grouplens.GroupLensRecommender"/> + <antcall target="build-taste-server-quick"/> + </target> + <target depends="build,optimize" name="build-grouplens-example" description="Builds deployable GroupLens example"> <javac source="1.5" target="1.5"
I’ve been learning about some of the techniques used by the so-called “Black Hat SEO” community for boosting their rankings in search engine results. Intriguing stuff. I’m by no means an expert in this area, but the theory underlying building black-hat pages and networks sure looks like it has a lot to do with my primary areas of interest.
One “Black Hat SEO” application area is automatically generating HTML pages to improve search engine rankings. This technique uses a Markov process to generate text. The idea is to build one or more web pages that contain the keywords the SEO is targeting. The method basically works like this:
One problem with this approach — aside from the fact that the keywords don’t really fitting in with the flow of the model — is that the model is trained on inappropriate text. For instance, suppose you were trying to optimize for keywords:
… then you probably wouldn’t want to train your model on, say, Jane Austen’s Pride and Prejudice.
A better thing to do would be to find some nice web pages containing keywords, statistics, seo, Markov model, and so on. That way you’ll pick up related keywords that you didn’t initially think of (or weren’t suggested by your keyword expansion tool), too.
But let’s face it. The corpora are going to be in HTML format. So the question now becomes, How do I automate the transformation of HTML into plain text for input to the model? A few strawman ideas, followed by my remarks:
I did find some methods for scoring page fragments, such as the Perl modules HTML::Content::Extractor and HTML::Extract, and another method described by Nooks. There are also a few intersting ideas in Gupta’s WWW2003 paper.
None of that Perl code linked above actually works, but Nooks and Jean Tavernier generally had the right idea. Basically, they look “down” the DOM to find the sub-DOM with the highest text/tag ratio.
The main problem with this approach is that it biases for DOM leaves, or “twigs” that are very close to leaves. You end up having to write special rules for accomodating the idiosyncrosies of each particular page dealt with, and it basically turns back into an HTML parsing exercise.
The other problem, and possibly more significant one from a statistician’s point of view, is that the ratio is not a well-understood metric for making decisions about what constitutes a “good” versus a “bad” sub-document. It would be better to have a p-value…
Fortunately, Fisher’s exact test can be applied to this problem. Here’s how you can apply it, explanation follows. First, let’s define some variables:
Then, we perform the following algorithm to identify the single best sub-document:
tree; //the HTML tree's root node minP = 1; //minimum p-value observed in the document subD = ""; //sub-document corresponding to minimum p-value X = calculatex(tree); Y = calculatey(tree); look(tree); function look (node) { x = calculatex(node); y = calculatey(node); p = calculateHyperG(x,y,X,Y); if ( p < minP ) { minP = p; subD = node; } C = children(node); foreach (c in C) { look(c); } }
The pseudocode above is examining each sub-document of the HTML document in turn and identifying the one with the smallest p-value. The p-value is calculated using the hypergeometirc distribution, where we consider that a sub-document has x words and y HTML <element/>s. This, in the context of the total document having X words and Y HTML <element/>s. It’s better than a simple ratio calculation because it does not bias for the tree’s leaves. That is, the p-value does not consider only the size of x+y.
Bear in mind that testing so many sub-documents, especially for very large HTML documents, warrants so-called “multiple hypothesis testing correction“, such as a Bonferroni correction. It’s outside the scope of this article.
Also, the tests performed are not entirely independent. That is, if node B is a child of node A then B will have some effect on A when calculating A’s p-value and must be factored out. This is also a well-defined problem but is, alas, also outside the scope of this article. Do your homework! Hint: learn about the Gene Ontology.
Fine and dandy, but does it work? My conclusion: seems to work. Here’s a CGI script demonstrating the hypergeometric content extraction technique on CNN.com. It reports a text snippet at the beginning and end of the single “best” sub-document and the corresponding (uncorrected) p-value. Twiddle the u parameter to test on a page of your choice. Some pages may block the user-agent I’m using…
There is also the issue of what to consider an element and what not to… or maybe even element weighting. For instance, maybe <p/> and <i/> elements shouldn’t be penalized because they’re commonly associated with text, but <script/> elements are heavily penalized.
Someone asked for the source code used here. I’m not actively pursuing any business with this, so here you go. If you use it in something or make a derivative work I’d be pleased to know what you’ve done.
#!/usr/bin/perl # Copyright Allen Day <allenday@gmail.com> # License: Artistic 2.0 use strict; use lib '/home/allenday/lib/perl/lib/perl/5.8.4/'; use CGI qw(:standard); print header(); use HTML::TreeBuilder; use GO::TermFinder::Native; use LWP::Simple qw(get); my $d = GO::TermFinder::Native::Distributions->new(8192); my %ignore = map {$_=>1} qw(head iframe img script table td tr th tbody); my %ok = map {$_=>1} qw(a b br div em h1 h2 h3 h4 i p span strong); my $tree = HTML::TreeBuilder->new(); my $all = param('a'); my $doc = get(param('u')); $tree->parse( $doc ); my $wordtotal = 0; my $elemtotal = 0; tally($tree,\$elemtotal,\$wordtotal); my $minp = 1; my $result; examine( $tree ); print '<html><head><title>x</title></head><body><pre>'; if ( ! $all ) { $result =~ m#^(.{50}).*?(.{50})$#s; print $minp,"\t$1 ... $2\n"; } else { print $minp,"\t$result\n"; } print '</pre></body></html>'; print "\n"; sub examine { my ( $node ) = @_; my $elemcount = 0; my $wordcount = 0; tally( $node, \$elemcount, \$wordcount ); my $p = $d->hypergeometric( $wordcount, $elemcount + $wordcount, $wordtotal, $elemtotal + $wordtotal ); if ( $minp > $p ) { my $text = ''; text( $node, \$text ); $minp = $p; $result = $text; } my @content = $node->content_list(); foreach my $c ( @content ) { if ( ref( $c ) ) { examine($c,$elemcount,$wordcount); } } } sub tally { my ( $node, $elemcount, $wordcount ) = @_; my @content = $node->content_list(); foreach my $c ( @content ) { if ( ref( $c ) ) { if ( ! $ok{ $c->tag() } ) { $$elemcount++; } if ( ! $ignore{ $c->tag() } ) { tally($c,$elemcount,$wordcount); } } else { if ( ! $ignore{ $node->tag() } ) { my @words = grep /\w/, split /\s+/, $c; my $c = scalar( @words ); $$wordcount += $c if $c > 2; } } } } sub text { my ( $node, $text ) = @_; my @content = $node->content_list(); foreach my $c ( @content ) { if ( ref( $c ) ) { text( $c, $text ); } else { if ( ! $ignore{ $node->tag() } ) { $$text .= ' ' . $c; } } } }

I’ve had a need over the last week to work with some sparse matrix data in R. This was a totally new problem for me, and I can now sympathize with anyone else having to do this and will document the experience.
It seems that the de-facto standard for moving sparse matrices is around is to use the Harwell-Boeing file format, aka “harbo”. It’s a horrible and largely undocumented fixed-width (think Fortran) file format. The best documentation I could find was in source code here, although you may be able to piece more of it together with Koders. R does include a harbo reader as part of the SparseM package.
Given that I’m more comfortable working in Perl than in R or Fortran, I decided to have a look on CPAN to see what was available. As it turns out, there is a package called Text::SenseClusters from Ted Pedersen that ships with a nifty program, mat2harbo.pl. I found the preferred sparse matrix “mat” format used by Text::SenseClusters to be more reasonable than harbo. Here’s an example.
5 5 15 2 9 4 9 1 6 2 5 3 7 4 8 5 6 1 4 2 5 1 7 2 6 3 7 1 9 2 8 3 9
. There is a header line “
5 5 15
” that defines the matrix rows, columns, and number of non null fields. Each subsequent (possibly blank) line gives index/value pairs for the non-null positions in that row. Easy!
At this point I was formulating a plan to:
file1.mat.file1.mat to file2.harbo using mat2harbo.pl from Text::SenseClusters.read.matrix.hb() function in the SparseM package.graph package).RBGL package.Well, it wasn’t that easy.
Step 1 went okay. Step 2 had problems with null columns, and had some glitches in the output format. Some of these glitches were easy to fix (e.g. matrix definition of “rra” to “RRA”), but others were very difficult due to the fact that mat2harbo.pl didn’t provide “full” harbo support, and the SparseM reader needed some of the file format features that weren’t supported.
So I wrote my own “mat” file -> R matrix.rsc object constructor myself. Here it is:
read.matrix.pair = function (file,debug=FALSE) { mat.lines = readLines(file); header = mat.lines[1]; F = strsplit(header,' ')[[1]]; nrow = as.integer(F[1]); ncol = as.integer(F[2]); nelem = as.integer(F[3]); ja = vector("list",nrow); ra = vector("list",nrow); ia = vector("list",nrow); ia[[1]] = c(1); for ( i in 2:(nrow+1) ) { #nrow mat.line = strsplit(mat.lines[i],' ')[[1]]; if ( length(mat.line) > 0 ) { if ( debug ) print(paste('non-empty row',i-1)); ja[[i-1]] = mat.line[ seq(1,length(mat.line),by=2)]; ra[[i-1]] = mat.line[1+seq(1,length(mat.line),by=2)]; ia[[i]] = ia[[i-1]] + length(mat.line)/2; if ( debug ) print(paste(' pos:',ja[[i-1]])); if ( debug ) print(paste(' dat:',ra[[i-1]])); } else { if ( debug ) print(paste(' empty row',i-1)); ia[[i]] = ia[[i-1]]; } } ans.ja = as.integer(unlist(ja)); ans.ra = as.integer(unlist(ra)); ans.ia = as.integer(unlist(ia)); dimension = as.integer(c(nrow,ncol)); if ( debug ) { print(paste('nrow',nrow)); print(paste('ncol',ncol)); print('ra');print(ans.ra); print('ja');print(ans.ja); print('ia');print(ans.ia); } rd.o = new("matrix.csr", ra = ans.ra, ja = ans.ja, ia = ans.ia, dimension = dimension) }
This let me just read the “mat” file directly into R. After that, the conversion to a graph object seems to work okay. I say seems to because I’m still waiting for the SparseM -> graph conversion routine to finish. It’s a 50K x 50K matrix with about 2 million edges, so it’s taking a little while…
Took about as long to convert as it took me to post this. Everything is fine. Now I get back to doing all-by-all Dijkstra on the graph, or at least find a reasonably fast way to allow for one-off queries.