R

Sparse Matrices in R

I’ve had a need over the last week to work with some sparse matrix data in R. This was a totally new problem for me, and I can now sympathize with anyone else having to do this and will document the experience.

It seems that the de-facto standard for moving sparse matrices is around is to use the Harwell-Boeing file format, aka “harbo”. It’s a horrible and largely undocumented fixed-width (think Fortran) file format. The best documentation I could find was in source code here, although you may be able to piece more of it together with Koders. R does include a harbo reader as part of the SparseM package.

Given that I’m more comfortable working in Perl than in R or Fortran, I decided to have a look on CPAN to see what was available. As it turns out, there is a package called Text::SenseClusters from Ted Pedersen that ships with a nifty program, mat2harbo.pl. I found the preferred sparse matrix “mat” format used by Text::SenseClusters to be more reasonable than harbo. Here’s an example.

5 5 15
2 9 4 9
1 6 2 5 3 7 4 8 5 6
1 4 2 5
1 7 2 6 3 7
1 9 2 8 3 9

. There is a header line “

5 5 15

” that defines the matrix rows, columns, and number of non null fields. Each subsequent (possibly blank) line gives index/value pairs for the non-null positions in that row. Easy!

At this point I was formulating a plan to:

  1. use my matrix writer to write in “mat” format to file1.mat.
  2. convert file1.mat to file2.harbo using mat2harbo.pl from Text::SenseClusters.
  3. import file2.harbo into R using the read.matrix.hb() function in the SparseM package.
  4. convert the SparseM matrix to an R graph (graph package).
  5. get back to my original problem… analyzing the matrix in R with Boost via the RBGL package.

Well, it wasn’t that easy.

Step 1 went okay. Step 2 had problems with null columns, and had some glitches in the output format. Some of these glitches were easy to fix (e.g. matrix definition of “rra” to “RRA”), but others were very difficult due to the fact that mat2harbo.pl didn’t provide “full” harbo support, and the SparseM reader needed some of the file format features that weren’t supported.

So I wrote my own “mat” file -> R matrix.rsc object constructor myself. Here it is:

read.matrix.pair = function (file,debug=FALSE) {
mat.lines = readLines(file);
header = mat.lines[1];
F = strsplit(header,' ')[[1]];
nrow = as.integer(F[1]);
ncol = as.integer(F[2]);
nelem = as.integer(F[3]);
 
ja = vector("list",nrow);
ra = vector("list",nrow);
ia = vector("list",nrow);
ia[[1]] = c(1);
 
for ( i in 2:(nrow+1) ) { #nrow
mat.line = strsplit(mat.lines[i],' ')[[1]];
if ( length(mat.line) > 0 ) {
if ( debug ) print(paste('non-empty row',i-1));
ja[[i-1]] = mat.line[  seq(1,length(mat.line),by=2)];
ra[[i-1]] = mat.line[1+seq(1,length(mat.line),by=2)];
ia[[i]] = ia[[i-1]] + length(mat.line)/2;
if ( debug ) print(paste('  pos:',ja[[i-1]]));
if ( debug ) print(paste('  dat:',ra[[i-1]]));
} else {
if ( debug ) print(paste('    empty row',i-1));
ia[[i]] = ia[[i-1]];
}
}
ans.ja = as.integer(unlist(ja));
ans.ra = as.integer(unlist(ra));
ans.ia = as.integer(unlist(ia));
dimension = as.integer(c(nrow,ncol));
 
if ( debug ) {
print(paste('nrow',nrow));
print(paste('ncol',ncol));
print('ra');print(ans.ra);
print('ja');print(ans.ja);
print('ia');print(ans.ia);
}
rd.o = new("matrix.csr", ra = ans.ra, ja = ans.ja, ia = ans.ia, dimension = dimension)
}

This let me just read the “mat” file directly into R. After that, the conversion to a graph object seems to work okay. I say seems to because I’m still waiting for the SparseM -> graph conversion routine to finish. It’s a 50K x 50K matrix with about 2 million edges, so it’s taking a little while…

Took about as long to convert as it took me to post this. Everything is fine. Now I get back to doing all-by-all Dijkstra on the graph, or at least find a reasonably fast way to allow for one-off queries.

Informatics
Mathematics
R
Statistics

Comments (1)

Permalink

new R package: getopt

I just uploaded the getopt package to CRAN. This will make it easy to use command line options with Rscript #! “shebang” scripts. It’s pretty much like what is available in every other programming language (getopt.h in C, Getopt::Long in Perl), but oddly was not yet available for R. So I wrote it! :-)

Example usage, to print a sampling of a random normal variable, you might make a script named ./rnorm that contains:

#!/usr/bin/Rscript
library('getopt');
opt = getopt(c(
'verbose', 'v', 2, "integer",
'help'   , 'h', 0, "logical",
'count'  , 'c', 1, "integer",
'mean'   , 'm', 1, "double",
'sd'     , 's', 1, "double"
));
if ( !is.null(opt$help) ) {
self = commandArgs()[1];
cat(paste("Usage: ",self," [-[gh]] [-[-mean|m] <mean>] [-[-sd|s] <sd>] [-[-count|c] <count>]\n",sep=""));
q(status=1);
}
if ( is.null(opt$mean    ) ) { opt$mean    = 0     }
if ( is.null(opt$sd      ) ) { opt$sd      = 1     }
if ( is.null(opt$count   ) ) { opt$count   = 10    }
if ( is.null(opt$verbose ) ) { opt$verbose = FALSE }
if ( opt$verbose ) { write("writing...",stderr()); }
cat(paste(rnorm(opt$count,mean=opt$mean,sd=opt$sd),collapse="\n"));
cat("\n");
q(status=0);

and can be called, e.g., like:

blink:/tmp allenday$ ./rnorm  -s 10 -c 10 --mean=100 --verbose=2
writing...
80.2953014070924
109.36715703278
104.856588724070
97.7983406914681
102.163515767212
90.7613417541473
97.8344921793064
108.918662445162
100.725143995218
105.285435884127
blink:/tmp allenday$

Yay! It’s a trivial example, but this can get pretty powerful once you can start passing in data files, reading from pipes, etc. I have some more example code for doing that, but it’s not getopt related so I need to dig it up.

Ok, I seriously need to get a syntax highlighter installed on this blog. Anyone have a recommendation? Isn’t there an enscript plugin for wordpress?

Update: I installed wp-syntax. There is no R support, but the C highlighter seems to work okay… wonder what it will do if I start doing the funky <- left arrow assign syntax…

Mobile
R
Software

Comments (0)

Permalink