Computing

pcoc - Piped Command Output Colorizer

I’m frequently monitoring webservers, cache servers, database servers, etc by tailing their log files, e.g.

tail -f /etc/httpd/logs/access_log

I like the –color option provided by grep, but found it to be too limited (only one allowed, no wildcard support). After a bit of searching to see if a tool existed for doing arbitrary colorizing, I found
acoc, the Arbitrary Command Output Colourer.

…which almost did what I needed, but couldn’t read from a pipe. So I wrote pcoc, the Piped Command Output Colorizer. I’m only publishing this because I’ve been using it for about 1 1/2 years, and still find it useful.

Source code at the end of this post. Here’s an example that highlights iPhone/iPod user agents and requests with a 500/400/404 HTTP response:

tail -f ./logs/access_log | pcoc -f '(iPod)=bold cyan' -f '(iPhone)=bold magenta' -f '\b(500|404|400)\b=red on_black'

Sorry, no screenshots :(.

pcoc source:

#!/usr/bin/perl
use strict;
use Getopt::Long;
use Term::ANSIColor qw(colored);
$|++;
 
my %format = ();
GetOptions( "format|f=s" => \%format);
 
if ( ! keys %format ) {
  print <<"EOF";
Synopsis:
        pcoc - Piped Command Output Colorizer.  Inspired by acoc.
 
Usage:
 
        $0 -f '<regex1>=<color1>' -f '<regex2>=<color2>'
 
$0 reads from a pipe and colorizes each line based on format (-f) parameters.
 
Arguments:
 
-f '<regex>=<color>'  Required, multiple values okay. 
 
        <regex>: A regular expression from which \$1 will be colorized
 
        <color>: One or more colorization keywords, see perldoc
        Term::ANSIColor, but briefly they are:
 
        boldness:
                bold
        foreground:
                red yellow green blue magenta cyan black white
        background:
                on_red on_yellow on_green on_blue on_magenta on_cyan
                on_black on_white
 
Examples:
 
        #highlight the account's shell in bold green
        cat /etc/passwd | $0 -f '.+:([^:]+)\$=bold green'
 
        #... and the username in red with black background
        cat /etc/passwd | $0 -f '([^:]+)=red on_black' -f '.+:([^:]+)\$=bold green'
 
Copyright/License:
 
        Allen Day <allenday\@ucla.edu>, licensed under GPL 2006-2008
 
EOF
  exit(1);
}
 
while ( my $line = <> ) {
  chomp( $line );
  foreach my $f ( keys %format ) {
    my @c = split ',', $format{ $f };
 
    if ( $line =~ qr/$f/ ) {
      while ( my ( $s, $t ) = $f =~ m/^(.*?)\(+(.+?)\)+/ ) {
        my $c = pop @c || last;
        $line =~ s/($s)($t)/$1.colored($2,$c)/e;
        $f =~ s/^(.*?)\((.+?)\)/$1$2/;
      }
    }
  }
  print "$line\n";
}

Administration
Analytics
Perl

Comments (0)

Permalink

Notes on setting up Taste


Setting up Taste v1.7.2 on a CentOS 4 x86_64 box.

Taste has merged with Mahout now, but I still want to do this standalone b/c I’m having trouble getting the JUnit tests to pass for Mahout. With that out of the way…

These are the shell commands I assembled after following the Taste Demo guide.

#make sure you have ant, and the JDK.  I don't recommend the CentOS stock, get them from Sun/Apache
#download necessary .jar files, sources, data files.  unpack/move them to correct locations.
wget http://internap.dl.sourceforge.net/sourceforge/taste/taste-1.7.2.zip
wget http://internap.dl.sourceforge.net/sourceforge/proguard/proguard4.2.zip
wget http://www.grouplens.org/system/files/million-ml-data.tar__0.gz
wget http://www.hightechimpact.com/Apache/tomcat/tomcat-5/v5.5.26/bin/apache-tomcat-5.5.26.tar.gz
unzip taste-1.7.2.zip
unzip proguard4.2.zip
tar -xvzf million-ml-data.tar__0.gz
tar -xvzf apache-tomcat-5.5.26.tar.gz
cp proguard4.2/lib/proguard.jar lib/
mv [mr]*.dat src/example/com/planetj/taste/example/grouplens/
#start up tomcat on port 8080 (default)
JAVA_OPTS="-server -da -dsa -Xms1024m -Xmx1024m" JAVA_HOME=/usr/java/jdk1.6.0_02 sh apache-tomcat-5.5.26/bin/startup.sh
#build taste.war, and inject it into tomcat
JDK_HOME=/usr/java/jdk1.6.0_02 JAVA_HOME=/usr/java/jdk1.6.0_02 ant build-grouplens-example
cp taste.war apache-tomcat-5.5.26/webapps/
#test the app.  may take a minute or two on the first query.
wget -O - -S 'http://localhost:8080/taste/RecommenderServlet?userID=1&amp;debug=true'

Once you get that working, you can tweak the demo slightly to work on another data set. You just need to know the grouplens file format. ratings.dat is of the format:

UserID::MovieID::Rating::Timestamp

e.g.

1::1193::5::978300760

and movies.dat is of the format:

MovieID::Title::Genres

e.g.

1::Toy Story (1995)::Animation|Children's|Comedy

I wrote a script, let’s call it load_taste.pl, that can generate new movies.dat and ratings.dat files from an alternate data source. If I make these new files, I can drop them in place of the grouplens data, rebuild the .war files, and make recommendations on this other data set. Here’s how to do it:

#generate ratings.dat and movies.dat.  move them to replace the grouplens data files.
perl ./load_taste.pl
mv [mr]*.dat src/example/com/planetj/taste/example/grouplens/
#get rid of stale .war and .jar files
rm taste.war grouplens.jar
#build the "quick" version of the example.  see below for build.xml patch
JDK_HOME=/usr/java/jdk1.6.0_02 JAVA_HOME=/usr/java/jdk1.6.0_02 ant build-grouplens-example-quick
#inject the re-built .war file into tomcat.
cp taste.war apache-tomcat-5.5.26/webapps/
#get rid of stale tomcat caches
rm -rf apache-tomcat-5.5.26/webapps/taste apache-tomcat-5.5.26/temp/taste.*.txt

Note that I’ve defined a new ant build target called “build-grouplens-example-quick”. The purpose of this is that we only want to rebuild grouplens.jar and taste.war, not reoptimize/reverify/rebuild taste.jar, etc. The “build-grouplens-example” target takes ~55 seconds to complete on my machine, whereas the “build-grouplens-example-quick” target takes ~2 seconds. Here’s a diff to the original build.xml file:

--- /tmp/build.xml      2008-03-21 21:18:20.000000000 -0700
+++ ./build.xml 2008-06-30 11:46:18.000000000 -0700
@@ -161,6 +161,58 @@
      <delete file="${my-web.xml}"/>
   </target>
 
+  <target depends="" name="build-taste-server-quick" description="Builds deployable web-based Taste server">
+     <fail unless="my-recommender.jar" message="Please set -Dmy-recommender.jar=XXX"/>
+     <fail unless="my-recommender-class" message="Please set -Dmy-recommender-class=XXX"/>
+     <tempfile property="my-web.xml"/>
+     <copy file="src/main/com/planetj/taste/web/web.xml" tofile="${my-web.xml}">
+       <filterset>
+               <filter token="RECOMMENDER_CLASS" value="${my-recommender-class}"/>
+       </filterset>
+     </copy>
+     <war destfile="${release-war}" webxml="${my-web.xml}">
+       <lib dir=".">
+               <include name="${release-jar}"/>
+               <include name="${my-recommender.jar}"/>
+       </lib>
+       <lib dir="lib/axis"/>
+       <classes dir="build">
+               <include name="com/planetj/taste/web/**"/>
+       </classes>
+       <fileset dir="src/main/com/planetj/taste/web">
+               <include name="RecommenderService.jws"/>
+       </fileset>
+     </war>
+     <delete file="${my-web.xml}"/>
+  </target>
+  <target depends="" name="build-grouplens-example-quick" description="Builds deployable GroupLens example">
+     <javac source="1.5"
+            target="1.5"
+            deprecation="true"
+          debug="true"
+          optimize="false"
+            destdir="build"
+            srcdir="src/example">
+       <compilerarg value="-Xlint:all"/>
+       <classpath>
+               <pathelement location="${release-jar}"/>
+               <pathelement location="${annotations.jar}"/>
+       </classpath>
+     </javac>
+     <jar jarfile="grouplens.jar">
+       <fileset dir="src/example">
+               <include name="com/planetj/taste/example/grouplens/ratings.dat"/>
+               <include name="com/planetj/taste/example/grouplens/movies.dat"/>
+       </fileset>
+       <fileset dir="build">
+               <include name="com/planetj/taste/example/grouplens/**"/>
+       </fileset>
+     </jar>
+     <property name="my-recommender.jar" value="grouplens.jar"/>
+     <property name="my-recommender-class" value="com.planetj.taste.example.grouplens.GroupLensRecommender"/>
+     <antcall target="build-taste-server-quick"/>
+  </target>
+
   <target depends="build,optimize" name="build-grouplens-example" description="Builds deployable GroupLens example">
      <javac source="1.5"
             target="1.5"

Informatics
Java
Statistics

Comments (2)

Permalink

Notes on setting up hadoop on CentOS 5 x86

Overview

This documents my experience setting up hadoop on three 2-cpu hyperthreaded Centos 5 x86 boxes.

The machines being used are called:

  • hadoop-0-0
  • hadoop-0-1
  • hadoop-0-2

System Setup

Unless otherwise stated, each task was executed on all three machines.

Set up hadoop user

I added a “hadoop” user up on each node, and untarred the hadoop source directly in to the /home/hadoop directory.  So there was a /home/hadoop/conf, /home/hadoop/bin, etc.

useradd hadoop
passwd hadoop
cd /tmp
wget http://apache.siamwebhosting.com/hadoop/core/hadoop-0.17.0/hadoop-0.17.0.tar.gz
su hadoop
cd ~
tar -xvzf /tmp/hadoop*gz
mv hadoop*/* ./
rmdir hadoop*
exit

Next, I set up passphrase-less ssh. On hadoop-0-0, I set up a DSA key:

su hadoop
ssh-keygen -t dsa
#[...use blank passphrase...]
scp ~/.ssh/id_dsa.pub hadoop-0-1:/tmp/
scp ~/.ssh/id_dsa.pub hadoop-0-2:/tmp/

Then, on hadoop-0-1 and hadoop-0-2:

su hadoop
mkdir ~/.ssh
chmod 700 ~/.ssh
cp /tmp/id_dsa.pub ~/.ssh/authorized_keys
chmod 644 ~/.ssh/authorized_keys

This allows login from hadoop-0-0 to both hadoop-0-1 and hadoop-0-2 without typing a password. It’s necessary.

Install prerequisite software

ssh and rsync are required. You can install them like so:

yum -y install ssh rsync

I installed Java 6 from Sun (it’s important to mention here that the CentOS yum/JPackage RPMs for gcc-java, etc did not work for Mahout, so it had to be the Sun Java).

rpm -Uvh j*rpm

Setup system services

I turned off iptables and ip6tables, and disabled them on startup. You could configure them, but I just turned them off. They get in the way of the nodes communicating.

/etc/init.d/iptables stop
/etc/init.d/ip6tables stop
/usr/sbin/ntsysv
#[...]

Then, I edited the hosts file. I actually did this later in the process after a bunch of debugging, but it makes sense to do it here. You need to make sure that the name by which you refer to a host is not associated with the loopback IP address (127.0.0.1). So your /etc/hosts file should look something like this on hadoop-0-0:

127.0.0.1    localhost localhost.localdomain
::1    localhost6.localdomain6 localhost6
10.0.0.1    hadoop-0-0

where the 10.0.0.1 entry is optional if you have some other way to resolve it, e.g. DNS. The key point is that the hadoop-0-0 name not be associated with 127.0.0.1.

Configuring hadoop in standalone mode

I tried using the Hadoop Quick Start guide first. It was trivial to get working in standalone mode. The documentation is sufficient, I won’t discuss it further.

Configuring hadoop in cluster mode

I tried using the Hadoop Cluster Setup guide next. It describes there are four types of hadoop services, split across three types of machines:

  • NameNode machine, runs NameNode service
  • JobTracker machine, runs JobTracker service
  • Slave machine, runs TaskTracker and DataNode services

I wanted to have 2 slave nodes.  So I decided to split it up like:

  • hadoop-0-0 - NameNode and JobTracker
  • hadoop-0-1 - DataNode and TaskTracker
  • hadoop-0-2 - DataNode and TaskTracker

Here’s what my config files look like on all three machines:

/home/hadoop/conf/hadoop-env.sh: added one line.

export JAVA_HOME=/usr/java/jdk1.6.0_06

/home/hadoop/conf/masters:

hadoop-0-0

/home/hadoop/conf/slaves:

hadoop-0-1
hadoop-0-2

/home/hadoop/conf/hadoop-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-0-0:9000/</value>
<final>true</final>
</property>
<property>
<name>mapred.job.tracker</name>
<value>hdfs://hadoop-0-0:9001/</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/data</value>
<final>true</final>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/hadoop/mapred/system</value>
<final>true</final>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/hadoop/mapred/local</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>2</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>2</value>
<final>true</final>
</property>
</configuration>

This defines some services on hadoop-0-0 ports 9000 and 9001, and says that the slaves should get 2 map and 2 reduce tasks each (one for each of my 4 CPUs [remember, I'm hyperthreading]).

Make sure the configuration files are the same on all machines.

Starting hadoop services

From here I pretty much followed the tutorial.

First, set up the filesystem for the NameNode service on the NameNode (hadoop-0-0 for me):

/home/hadoop/bin/hadoop namenode -format

Next start up HDFS, the distributed filesystem:

/home/hadoop/bin/start-dfs.sh

Then start up the JobTracker service on the JobTracker node (hadoop-0-0 for me):

/home/hadoop/bin/start-mapred.sh

This also starts up the TaskTracker and DataNode services on all the nodes specified in the /home/hadoop/conf/slaves file.

At this point, doing a:

ps x

as the hadoop user on the hadoop-0-{0,1,2} machines will show some java daemons running. If you run into trouble, there is diagnostic information in the /home/hadoop/logs/*.log files. These were indispensible in debugging my setup.

Testing hadoop setup

Testing HDFS

To test HDFS, the distributed file system. On hadoop-0-0:

[hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -ls /
Found 1 items
/home   <dir>           2008-06-20 18:58        rwxr-xr-x       hadoop  supergroup
[hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -touchz /foo
[hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -ls /
Found 2 items
/foo    <r 3>   0       2008-06-20 23:49        rw-r–r–       hadoop  supergroup
/home   <dir>           2008-06-20 18:58        rwxr-xr-x       hadoop  supergroup
[hadoop@hadoop-0-0 ~]$ ./bin/hadoop dfs -rm /foo
Deleted /foo
[hadoop@hadoop-0-0 ~]$

Testing MapReduce

Haven’t done this yet, except in standalone mode.

Computing
Java
Science

Comments (0)

Permalink

Los Angeles SoC(i)al Tech Scene

I’m collating here all the tech events/sites I hear about that are specific to Los Angeles / Southern California. This post is the result of several conversations I’ve had, verbally and via email, about the scattered tech scene in Los Angeles. Disclaimer: I’m not the organizer of any of these events, and most of them I’ve never even attended so I can’t vouch for the quality. For now, the events are in the order in which I hear about them.

If you know of a resource I missed leave a comment and I’ll add it in.

Business
Computing
Networking

Comments (0)

Permalink

Statistical HTML Content Extraction

Introduction

I’ve been learning about some of the techniques used by the so-called “Black Hat SEO” community for boosting their rankings in search engine results. Intriguing stuff. I’m by no means an expert in this area, but the theory underlying building black-hat pages and networks sure looks like it has a lot to do with my primary areas of interest.

Generating Unique Content

One “Black Hat SEO” application area is automatically generating HTML pages to improve search engine rankings. This technique uses a Markov process to generate text. The idea is to build one or more web pages that contain the keywords the SEO is targeting. The method basically works like this:

  1. Assemble a corpus of text to train the model. For example, Project Gutenberg
  2. Build an order-N (typically N=2) Markov model that captures the state changes in the corpus
  3. Generate text from the model, periodically throwing in some keywords
  4. Link the generated page to some other page to which you want to send traffic
  5. Repeat again from Step 1

One problem with this approach — aside from the fact that the keywords don’t really fitting in with the flow of the model — is that the model is trained on inappropriate text. For instance, suppose you were trying to optimize for keywords:

  • keywords
  • statistics
  • Search engine optimization
  • SEO
  • Automatic content generation
  • Automatic content extraction
  • HTML content extraction
  • Markov Model

… then you probably wouldn’t want to train your model on, say, Jane Austen’s Pride and Prejudice.

Improve Generated Text: Use Niche Corpora

A better thing to do would be to find some nice web pages containing keywords, statistics, seo, Markov model, and so on. That way you’ll pick up related keywords that you didn’t initially think of (or weren’t suggested by your keyword expansion tool), too.

But let’s face it. The corpora are going to be in HTML format. So the question now becomes, How do I automate the transformation of HTML into plain text for input to the model? A few strawman ideas, followed by my remarks:

  • Get an HTML document, and remove all <element/>s. Won’t work very well. You end up training on page navigation, footers, headers, etc.
  • Build a site- or software-specific parser (e.g. for Wikipedia, or for Wordpress) to extract the main content. Scalability and maintenance nightmare. This is not generalizable to general text extraction. You’ll be constantly fixing broken parsers, too.
  • Devise a scoring system that can identify the main content of the page. Exactly!

I did find some methods for scoring page fragments, such as the Perl modules HTML::Content::Extractor and HTML::Extract, and another method described by Nooks. There are also a few intersting ideas in Gupta’s WWW2003 paper.

None of that Perl code linked above actually works, but Nooks and Jean Tavernier generally had the right idea. Basically, they look “down” the DOM to find the sub-DOM with the highest text/tag ratio.

The main problem with this approach is that it biases for DOM leaves, or “twigs” that are very close to leaves. You end up having to write special rules for accomodating the idiosyncrosies of each particular page dealt with, and it basically turns back into an HTML parsing exercise.

The other problem, and possibly more significant one from a statistician’s point of view, is that the ratio is not a well-understood metric for making decisions about what constitutes a “good” versus a “bad” sub-document. It would be better to have a p-value…

Balls and Urns

Fortunately, Fisher’s exact test can be applied to this problem. Here’s how you can apply it, explanation follows. First, let’s define some variables:

  • X: the total number of words in the whole document.
  • x: the number of words in a sub-document.
  • Y: the total number of <element/>s in the whole document.
  • y: the number of <element/>s in a sub-document.

Then, we perform the following algorithm to identify the single best sub-document:

tree; //the HTML tree's root node
minP = 1; //minimum p-value observed in the document
subD = ""; //sub-document corresponding to minimum p-value
X = calculatex(tree);
Y = calculatey(tree);
look(tree);
function look (node) {
  x = calculatex(node);
  y = calculatey(node);
  p = calculateHyperG(x,y,X,Y);
  if ( p < minP ) {
    minP = p;
    subD = node;
  }
  C = children(node);
  foreach (c in C) {
    look(c);
  }
}

Balls and Urns, Explained

The pseudocode above is examining each sub-document of the HTML document in turn and identifying the one with the smallest p-value. The p-value is calculated using the hypergeometirc distribution, where we consider that a sub-document has x words and y HTML <element/>s. This, in the context of the total document having X words and Y HTML <element/>s. It’s better than a simple ratio calculation because it does not bias for the tree’s leaves. That is, the p-value does not consider only the size of x+y.

Caveats

Bear in mind that testing so many sub-documents, especially for very large HTML documents, warrants so-called “multiple hypothesis testing correction“, such as a Bonferroni correction. It’s outside the scope of this article.

Also, the tests performed are not entirely independent. That is, if node B is a child of node A then B will have some effect on A when calculating A’s p-value and must be factored out. This is also a well-defined problem but is, alas, also outside the scope of this article. Do your homework! Hint: learn about the Gene Ontology.

Conclusion

Fine and dandy, but does it work? My conclusion: seems to work. Here’s a CGI script demonstrating the hypergeometric content extraction technique on CNN.com. It reports a text snippet at the beginning and end of the single “best” sub-document and the corresponding (uncorrected) p-value. Twiddle the u parameter to test on a page of your choice. Some pages may block the user-agent I’m using…

There is also the issue of what to consider an element and what not to… or maybe even element weighting. For instance, maybe <p/> and <i/> elements shouldn’t be penalized because they’re commonly associated with text, but <script/> elements are heavily penalized.

Informatics
Software
Statistics

Comments (0)

Permalink

Configure Wordpress Ping

I wanted to configure Wordpress pinging for the Facebook Flog Blog application. For some reason the feed on my profile page isn’t updating, and I thought maybe this would do the trick.

Took a bit of digging, but I found a guide at Technorati. Hint: “options” has been (moved and) renamed as “settings” as late as Wordpress 5.2.1.

Let’s see if it works!

Update: just by visiting the Flog Blog settings page, I have somehow managed to get Flog Blog to update. Hmm…

PHP
Random musings

Comments (0)

Permalink

Wordpress Disable Autosave

The autosave feature has been giving me a lot of trouble since upgrading to Wordpress 2.5.1. I found this post on the topic of disabling autosave. Moonlight gets it mostly right, having identified the causal line

wp_enqueue_script('autosave')

. I just had to comment that out in all the files under wp-admin and everything works again.

PHP

Comments (0)

Permalink

X on OS X

I’ve been using Fink for my X11 needs for the last ~5 years… the whole time I’ve had a Mac. I was never happy with the shenanigans required to get X11 working with Fink. The procedure is to erase the system-installed X11, replace with Fink X11, twiddle XDarwin (which comes from… where, exactly in the Fink build process?), twiddle some .info files to get fluxbox working, etc, etc.

Today I upgraded my 10.4 box to 10.5 and… yep. My X11 setup broke. So I got to spend today fighting with Fink again. In the end, I discovered that this is now (as opposed to 5 years ago) a dream to do with DarwinPorts. It’s pretty much

  1. sudo port install fluxbox
  2. configure .xinitrc, and
  3. startx

I’m still using fink for some other development related stuff, but I’m planning to look into using more DarwinPorts. Hadn’t realized it had come so far along.

Update: The X11.app / DarwinPorts technique suffers from clipboard problems and is actually more broken than XDarwin.app / Fink in that it will not copy/paste between X11/Aqua in either direction. I managed to get this working though by:

  1. In X11.app preferences, check “Enable key equivalents under X11″. All else unchecked.
  2. Get the quartz-wm from a Tiger X11User.pkg file. I got mine after cobbling together some information and put it in ~/Documents. Start it up in .xinitrc as “quartz-wm –proxy-only &”. This allows Aqua copy -> X11 paste.

What this achieves:

  • Copy/paste in Aqua: As usual with Command-{C,V}.
  • Copy/paste in X11: As usual with highlight/middle-click.
  • Copy in X11, paste in Aqua: highlight in X11. Command-C in X11 to copy. Command-V in Aqua to paste.
  • Copy in Aqua, paste in X11: Command-C in Aqua to copy. Middle-click in X11 to paste.

Aside from going X11 -> Aqua this feels pretty natural. Whew!

Computing

Comments (0)

Permalink

Sparse Matrices in R

I’ve had a need over the last week to work with some sparse matrix data in R. This was a totally new problem for me, and I can now sympathize with anyone else having to do this and will document the experience.

It seems that the de-facto standard for moving sparse matrices is around is to use the Harwell-Boeing file format, aka “harbo”. It’s a horrible and largely undocumented fixed-width (think Fortran) file format. The best documentation I could find was in source code here, although you may be able to piece more of it together with Koders. R does include a harbo reader as part of the SparseM package.

Given that I’m more comfortable working in Perl than in R or Fortran, I decided to have a look on CPAN to see what was available. As it turns out, there is a package called Text::SenseClusters from Ted Pedersen that ships with a nifty program, mat2harbo.pl. I found the preferred sparse matrix “mat” format used by Text::SenseClusters to be more reasonable than harbo. Here’s an example.

5 5 15
2 9 4 9
1 6 2 5 3 7 4 8 5 6
1 4 2 5
1 7 2 6 3 7
1 9 2 8 3 9

. There is a header line “

5 5 15

” that defines the matrix rows, columns, and number of non null fields. Each subsequent (possibly blank) line gives index/value pairs for the non-null positions in that row. Easy!

At this point I was formulating a plan to:

  1. use my matrix writer to write in “mat” format to file1.mat.
  2. convert file1.mat to file2.harbo using mat2harbo.pl from Text::SenseClusters.
  3. import file2.harbo into R using the read.matrix.hb() function in the SparseM package.
  4. convert the SparseM matrix to an R graph (graph package).
  5. get back to my original problem… analyzing the matrix in R with Boost via the RBGL package.

Well, it wasn’t that easy.

Step 1 went okay. Step 2 had problems with null columns, and had some glitches in the output format. Some of these glitches were easy to fix (e.g. matrix definition of “rra” to “RRA”), but others were very difficult due to the fact that mat2harbo.pl didn’t provide “full” harbo support, and the SparseM reader needed some of the file format features that weren’t supported.

So I wrote my own “mat” file -> R matrix.rsc object constructor myself. Here it is:

read.matrix.pair = function (file,debug=FALSE) {
mat.lines = readLines(file);
header = mat.lines[1];
F = strsplit(header,' ')[[1]];
nrow = as.integer(F[1]);
ncol = as.integer(F[2]);
nelem = as.integer(F[3]);
 
ja = vector("list",nrow);
ra = vector("list",nrow);
ia = vector("list",nrow);
ia[[1]] = c(1);
 
for ( i in 2:(nrow+1) ) { #nrow
mat.line = strsplit(mat.lines[i],' ')[[1]];
if ( length(mat.line) > 0 ) {
if ( debug ) print(paste('non-empty row',i-1));
ja[[i-1]] = mat.line[  seq(1,length(mat.line),by=2)];
ra[[i-1]] = mat.line[1+seq(1,length(mat.line),by=2)];
ia[[i]] = ia[[i-1]] + length(mat.line)/2;
if ( debug ) print(paste('  pos:',ja[[i-1]]));
if ( debug ) print(paste('  dat:',ra[[i-1]]));
} else {
if ( debug ) print(paste('    empty row',i-1));
ia[[i]] = ia[[i-1]];
}
}
ans.ja = as.integer(unlist(ja));
ans.ra = as.integer(unlist(ra));
ans.ia = as.integer(unlist(ia));
dimension = as.integer(c(nrow,ncol));
 
if ( debug ) {
print(paste('nrow',nrow));
print(paste('ncol',ncol));
print('ra');print(ans.ra);
print('ja');print(ans.ja);
print('ia');print(ans.ia);
}
rd.o = new("matrix.csr", ra = ans.ra, ja = ans.ja, ia = ans.ia, dimension = dimension)
}

This let me just read the “mat” file directly into R. After that, the conversion to a graph object seems to work okay. I say seems to because I’m still waiting for the SparseM -> graph conversion routine to finish. It’s a 50K x 50K matrix with about 2 million edges, so it’s taking a little while…

Took about as long to convert as it took me to post this. Everything is fine. Now I get back to doing all-by-all Dijkstra on the graph, or at least find a reasonably fast way to allow for one-off queries.

Informatics
Mathematics
R
Statistics

Comments (1)

Permalink

new R package: getopt

I just uploaded the getopt package to CRAN. This will make it easy to use command line options with Rscript #! “shebang” scripts. It’s pretty much like what is available in every other programming language (getopt.h in C, Getopt::Long in Perl), but oddly was not yet available for R. So I wrote it! :-)

Example usage, to print a sampling of a random normal variable, you might make a script named ./rnorm that contains:

#!/usr/bin/Rscript
library('getopt');
opt = getopt(c(
'verbose', 'v', 2, "integer",
'help'   , 'h', 0, "logical",
'count'  , 'c', 1, "integer",
'mean'   , 'm', 1, "double",
'sd'     , 's', 1, "double"
));
if ( !is.null(opt$help) ) {
self = commandArgs()[1];
cat(paste("Usage: ",self," [-[gh]] [-[-mean|m] <mean>] [-[-sd|s] <sd>] [-[-count|c] <count>]\n",sep=""));
q(status=1);
}
if ( is.null(opt$mean    ) ) { opt$mean    = 0     }
if ( is.null(opt$sd      ) ) { opt$sd      = 1     }
if ( is.null(opt$count   ) ) { opt$count   = 10    }
if ( is.null(opt$verbose ) ) { opt$verbose = FALSE }
if ( opt$verbose ) { write("writing...",stderr()); }
cat(paste(rnorm(opt$count,mean=opt$mean,sd=opt$sd),collapse="\n"));
cat("\n");
q(status=0);

and can be called, e.g., like:

blink:/tmp allenday$ ./rnorm  -s 10 -c 10 --mean=100 --verbose=2
writing...
80.2953014070924
109.36715703278
104.856588724070
97.7983406914681
102.163515767212
90.7613417541473
97.8344921793064
108.918662445162
100.725143995218
105.285435884127
blink:/tmp allenday$

Yay! It’s a trivial example, but this can get pretty powerful once you can start passing in data files, reading from pipes, etc. I have some more example code for doing that, but it’s not getopt related so I need to dig it up.

Ok, I seriously need to get a syntax highlighter installed on this blog. Anyone have a recommendation? Isn’t there an enscript plugin for wordpress?

Update: I installed wp-syntax. There is no R support, but the C highlighter seems to work okay… wonder what it will do if I start doing the funky <- left arrow assign syntax…

Mobile
R
Software

Comments (0)

Permalink