R

Webserver logs access time by region/language

As anyone with a popular website knows, there’s a big difference in the resources required for peak vs. off-peak hours and you typically have to pay for peak usage even if you don’t always use it (e.g. 95th percentile bandwidth billing)

Frugal as I am, I was curious to see if I could increase traffic during what are off-peak hours. Seemed sensible that people in different regions of the world might be accessing during off-hours.

So I aggregated data by country code/language and 10-minute time segment. Applied a Daniell smoothing kernel (a sliding window) of 6 segments (1 hour) and plotted a a row-scaled heatmap in R. Rows are clustered so similar access patterns are next to one another, with the left-hand-side dendrogram indicating dissimilarity between rows. Yellow-white is a traffic burst. I’ll post the code and data later for how I made this.

access times by country/language

As it turns out, the main off-peak trough corresponds to the middle of the Pacific ocean. Kinda watery for people to live there. Oh well, I tried.

Analytics
R
Science
Statistics

Comments (0)

Permalink

R Matrix sparseVector operations

I’ve only done minus(vec1,vec2) so far. More to come.

library(Matrix);
 
#F = strsplit(as.character(mm[1,2]),', ')[[1]]
#G = matrix( as.numeric(unlist(strsplit(F[c(-1,-length(F))],':'))), nrow=2 )
#tt = new('dsparseVector', x=G[2,], i=as.integer(G[1,]), length=max(as.integer(G[1,])))
 
minus = function(v1,v2) {
  i = sort(union(v1@i,v2@i));
  s = length(i);
 
  x = vector(mode='numeric',length=s);
  for ( k in 1:s ) {
    z = i[k];
    if ( z < length(v1) ) {
      x[k] = as.numeric(v1[z]);
    }
    if ( z < length(v2) ) {
      x[k] = x[k] - as.numeric(v2[z]);
    }
  }
  new("dsparseVector", x=x, i=i, length=max(v1@i,v2@i))
}

R
Statistics

Comments (0)

Permalink

Sparse Matrices in R

I’ve had a need over the last week to work with some sparse matrix data in R. This was a totally new problem for me, and I can now sympathize with anyone else having to do this and will document the experience.

It seems that the de-facto standard for moving sparse matrices is around is to use the Harwell-Boeing file format, aka “harbo”. It’s a horrible and largely undocumented fixed-width (think Fortran) file format. The best documentation I could find was in source code here, although you may be able to piece more of it together with Koders. R does include a harbo reader as part of the SparseM package.

Given that I’m more comfortable working in Perl than in R or Fortran, I decided to have a look on CPAN to see what was available. As it turns out, there is a package called Text::SenseClusters from Ted Pedersen that ships with a nifty program, mat2harbo.pl. I found the preferred sparse matrix “mat” format used by Text::SenseClusters to be more reasonable than harbo. Here’s an example.

5 5 15
2 9 4 9
1 6 2 5 3 7 4 8 5 6
1 4 2 5
1 7 2 6 3 7
1 9 2 8 3 9

. There is a header line “

5 5 15

” that defines the matrix rows, columns, and number of non null fields. Each subsequent (possibly blank) line gives index/value pairs for the non-null positions in that row. Easy!

At this point I was formulating a plan to:

  1. use my matrix writer to write in “mat” format to file1.mat.
  2. convert file1.mat to file2.harbo using mat2harbo.pl from Text::SenseClusters.
  3. import file2.harbo into R using the read.matrix.hb() function in the SparseM package.
  4. convert the SparseM matrix to an R graph (graph package).
  5. get back to my original problem… analyzing the matrix in R with Boost via the RBGL package.

Well, it wasn’t that easy.

Step 1 went okay. Step 2 had problems with null columns, and had some glitches in the output format. Some of these glitches were easy to fix (e.g. matrix definition of “rra” to “RRA”), but others were very difficult due to the fact that mat2harbo.pl didn’t provide “full” harbo support, and the SparseM reader needed some of the file format features that weren’t supported.

So I wrote my own “mat” file -> R matrix.rsc object constructor myself. Here it is:

read.matrix.pair = function (file,debug=FALSE) {
mat.lines = readLines(file);
header = mat.lines[1];
F = strsplit(header,' ')[[1]];
nrow = as.integer(F[1]);
ncol = as.integer(F[2]);
nelem = as.integer(F[3]);
 
ja = vector("list",nrow);
ra = vector("list",nrow);
ia = vector("list",nrow);
ia[[1]] = c(1);
 
for ( i in 2:(nrow+1) ) { #nrow
mat.line = strsplit(mat.lines[i],' ')[[1]];
if ( length(mat.line) > 0 ) {
if ( debug ) print(paste('non-empty row',i-1));
ja[[i-1]] = mat.line[  seq(1,length(mat.line),by=2)];
ra[[i-1]] = mat.line[1+seq(1,length(mat.line),by=2)];
ia[[i]] = ia[[i-1]] + length(mat.line)/2;
if ( debug ) print(paste('  pos:',ja[[i-1]]));
if ( debug ) print(paste('  dat:',ra[[i-1]]));
} else {
if ( debug ) print(paste('    empty row',i-1));
ia[[i]] = ia[[i-1]];
}
}
ans.ja = as.integer(unlist(ja));
ans.ra = as.integer(unlist(ra));
ans.ia = as.integer(unlist(ia));
dimension = as.integer(c(nrow,ncol));
 
if ( debug ) {
print(paste('nrow',nrow));
print(paste('ncol',ncol));
print('ra');print(ans.ra);
print('ja');print(ans.ja);
print('ia');print(ans.ia);
}
rd.o = new("matrix.csr", ra = ans.ra, ja = ans.ja, ia = ans.ia, dimension = dimension)
}

This let me just read the “mat” file directly into R. After that, the conversion to a graph object seems to work okay. I say seems to because I’m still waiting for the SparseM -> graph conversion routine to finish. It’s a 50K x 50K matrix with about 2 million edges, so it’s taking a little while…

Took about as long to convert as it took me to post this. Everything is fine. Now I get back to doing all-by-all Dijkstra on the graph, or at least find a reasonably fast way to allow for one-off queries.

Informatics
Mathematics
R
Statistics

Comments (1)

Permalink

new R package: getopt

I just uploaded the getopt package to CRAN. This will make it easy to use command line options with Rscript #! “shebang” scripts. It’s pretty much like what is available in every other programming language (getopt.h in C, Getopt::Long in Perl), but oddly was not yet available for R. So I wrote it! :-)

Example usage, to print a sampling of a random normal variable, you might make a script named ./rnorm that contains:

#!/usr/bin/Rscript
library('getopt');
opt = getopt(c(
'verbose', 'v', 2, "integer",
'help'   , 'h', 0, "logical",
'count'  , 'c', 1, "integer",
'mean'   , 'm', 1, "double",
'sd'     , 's', 1, "double"
));
if ( !is.null(opt$help) ) {
self = commandArgs()[1];
cat(paste("Usage: ",self," [-[gh]] [-[-mean|m] <mean>] [-[-sd|s] <sd>] [-[-count|c] <count>]\n",sep=""));
q(status=1);
}
if ( is.null(opt$mean    ) ) { opt$mean    = 0     }
if ( is.null(opt$sd      ) ) { opt$sd      = 1     }
if ( is.null(opt$count   ) ) { opt$count   = 10    }
if ( is.null(opt$verbose ) ) { opt$verbose = FALSE }
if ( opt$verbose ) { write("writing...",stderr()); }
cat(paste(rnorm(opt$count,mean=opt$mean,sd=opt$sd),collapse="\n"));
cat("\n");
q(status=0);

and can be called, e.g., like:

blink:/tmp allenday$ ./rnorm  -s 10 -c 10 --mean=100 --verbose=2
writing...
80.2953014070924
109.36715703278
104.856588724070
97.7983406914681
102.163515767212
90.7613417541473
97.8344921793064
108.918662445162
100.725143995218
105.285435884127
blink:/tmp allenday$

Yay! It’s a trivial example, but this can get pretty powerful once you can start passing in data files, reading from pipes, etc. I have some more example code for doing that, but it’s not getopt related so I need to dig it up.

Ok, I seriously need to get a syntax highlighter installed on this blog. Anyone have a recommendation? Isn’t there an enscript plugin for wordpress?

Update: I installed wp-syntax. There is no R support, but the C highlighter seems to work okay… wonder what it will do if I start doing the funky <- left arrow assign syntax…

Mobile
R
Software

Comments (0)

Permalink