Sparse Matrices in R

I’ve had a need over the last week to work with some sparse matrix data in R. This was a totally new problem for me, and I can now sympathize with anyone else having to do this and will document the experience.

It seems that the de-facto standard for moving sparse matrices is around is to use the Harwell-Boeing file format, aka “harbo”. It’s a horrible and largely undocumented fixed-width (think Fortran) file format. The best documentation I could find was in source code here, although you may be able to piece more of it together with Koders. R does include a harbo reader as part of the SparseM package.

Given that I’m more comfortable working in Perl than in R or Fortran, I decided to have a look on CPAN to see what was available. As it turns out, there is a package called Text::SenseClusters from Ted Pedersen that ships with a nifty program, mat2harbo.pl. I found the preferred sparse matrix “mat” format used by Text::SenseClusters to be more reasonable than harbo. Here’s an example.

5 5 15
2 9 4 9
1 6 2 5 3 7 4 8 5 6
1 4 2 5
1 7 2 6 3 7
1 9 2 8 3 9

. There is a header line “

5 5 15

” that defines the matrix rows, columns, and number of non null fields. Each subsequent (possibly blank) line gives index/value pairs for the non-null positions in that row. Easy!

At this point I was formulating a plan to:

  1. use my matrix writer to write in “mat” format to file1.mat.
  2. convert file1.mat to file2.harbo using mat2harbo.pl from Text::SenseClusters.
  3. import file2.harbo into R using the read.matrix.hb() function in the SparseM package.
  4. convert the SparseM matrix to an R graph (graph package).
  5. get back to my original problem… analyzing the matrix in R with Boost via the RBGL package.

Well, it wasn’t that easy.

Step 1 went okay. Step 2 had problems with null columns, and had some glitches in the output format. Some of these glitches were easy to fix (e.g. matrix definition of “rra” to “RRA”), but others were very difficult due to the fact that mat2harbo.pl didn’t provide “full” harbo support, and the SparseM reader needed some of the file format features that weren’t supported.

So I wrote my own “mat” file -> R matrix.rsc object constructor myself. Here it is:

read.matrix.pair = function (file,debug=FALSE) {
mat.lines = readLines(file);
header = mat.lines[1];
F = strsplit(header,' ')[[1]];
nrow = as.integer(F[1]);
ncol = as.integer(F[2]);
nelem = as.integer(F[3]);
 
ja = vector("list",nrow);
ra = vector("list",nrow);
ia = vector("list",nrow);
ia[[1]] = c(1);
 
for ( i in 2:(nrow+1) ) { #nrow
mat.line = strsplit(mat.lines[i],' ')[[1]];
if ( length(mat.line) > 0 ) {
if ( debug ) print(paste('non-empty row',i-1));
ja[[i-1]] = mat.line[  seq(1,length(mat.line),by=2)];
ra[[i-1]] = mat.line[1+seq(1,length(mat.line),by=2)];
ia[[i]] = ia[[i-1]] + length(mat.line)/2;
if ( debug ) print(paste('  pos:',ja[[i-1]]));
if ( debug ) print(paste('  dat:',ra[[i-1]]));
} else {
if ( debug ) print(paste('    empty row',i-1));
ia[[i]] = ia[[i-1]];
}
}
ans.ja = as.integer(unlist(ja));
ans.ra = as.integer(unlist(ra));
ans.ia = as.integer(unlist(ia));
dimension = as.integer(c(nrow,ncol));
 
if ( debug ) {
print(paste('nrow',nrow));
print(paste('ncol',ncol));
print('ra');print(ans.ra);
print('ja');print(ans.ja);
print('ia');print(ans.ia);
}
rd.o = new("matrix.csr", ra = ans.ra, ja = ans.ja, ia = ans.ia, dimension = dimension)
}

This let me just read the “mat” file directly into R. After that, the conversion to a graph object seems to work okay. I say seems to because I’m still waiting for the SparseM -> graph conversion routine to finish. It’s a 50K x 50K matrix with about 2 million edges, so it’s taking a little while…

Took about as long to convert as it took me to post this. Everything is fine. Now I get back to doing all-by-all Dijkstra on the graph, or at least find a reasonably fast way to allow for one-off queries.