April 2009

Standalone BitTorrent Checksum Verification Tool

I’m writing some scripts to automate the downloading and seeding of torrents. The idea is to pull torrents in from RSS or a screen scrape (much as Azureus does, except that I want to script everything with Python/Mainline BitTorrent, bash, and Perl), sit on each torrent for a day or so after its last mtime, then checksum it and, if it’s good, move it elsewhere for watching, etc.
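The "sit on it for a day, then checksum and move" step could be sketched in Python along these lines. This is only an illustration: the names is_settled and promote_if_ready, the 24-hour threshold, and the verify callback are all assumptions, not part of any existing script.

```python
import os
import shutil
import time

# "A day or so" after the last mtime; the exact threshold is an assumption.
SETTLE_SECONDS = 24 * 60 * 60


def is_settled(path, now=None, settle_seconds=SETTLE_SECONDS):
    """Return True if path was last modified at least settle_seconds ago."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path) >= settle_seconds


def promote_if_ready(path, dest_dir, verify, now=None):
    """Move path into dest_dir once it is settled and verify(path) is true.

    verify is a hypothetical callback wrapping whatever checksum tool is used.
    Returns the new path, or None if the file was left where it was.
    """
    if not is_settled(path, now=now):
        return None
    if not verify(path):
        return None
    os.makedirs(dest_dir, exist_ok=True)
    return shutil.move(path, os.path.join(dest_dir, os.path.basename(path)))
```

The now argument exists mainly so the settle check can be exercised without actually waiting a day.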

Part of this requires checksumming the files, and Mainline doesn’t ship with a standalone utility to do this, so I wrote one in Perl (see below). It only handles single-file torrents for now (i.e. no directories of files).

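#!/usr/bin/perl
# Check a single-file torrent's payload against the SHA-1 piece hashes in
# its .torrent file. Prints '.' for each good piece and 'x' for each bad one.
use strict;
use warnings;
use Convert::Bencode_XS qw(bdecode);
use Digest::SHA1 qw(sha1);

$|++;    # unbuffered output, so marks appear as pieces are checked

my $base = shift @ARGV or die "usage: $0 BASE\n";
my $torrent = "$base.torrent";

# Read the metainfo in binary mode; the piece hashes are raw bytes.
open( my $t, '<', $torrent ) or die "$torrent: $!";
binmode( $t );
my $torrent_data = join '', <$t>;
close( $t );

my $metainfo = bdecode( $torrent_data );

# The payload is expected at BASE/<name given in the torrent>.
my $file_name = "$base/" . $metainfo->{'info'}->{'name'};
my $piece_length = $metainfo->{'info'}->{'piece length'};

# 'pieces' is a run of 20-byte SHA-1 digests, one per piece.
my $pieces = $metainfo->{'info'}->{'pieces'};
my @pieces = ();
my $offset = 0;
while ( $offset < length( $pieces ) ) {
  push @pieces, substr( $pieces, $offset, 20 );
  $offset += 20;
}

open( my $f, '<', $file_name ) or die "$file_name: $!";
binmode( $f );
my $seek = 0;
my $bad = 0;
foreach my $p ( @pieces ) {
  my $buf = '';
  seek( $f, $seek, 0 );
  # The last piece is usually shorter; read() returns what is left.
  read( $f, $buf, $piece_length );
  if ( $p eq sha1( $buf ) ) {
    print '.';
  }
  else {
    print 'x';
    $bad++;
  }
  $seek += $piece_length;
}
close( $f );
print "\n";

# Exit non-zero if any piece failed, so callers can script on the result.
exit( $bad ? 1 : 0 );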
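
To make the piece layout concrete without needing a real torrent, here is a small Python sketch of the same check run on in-memory bytes. The function names split_digests and check_pieces are made up for illustration. It assumes, as the Perl above does, that the pieces field is a concatenation of 20-byte SHA-1 digests and that the final piece may be shorter than the piece length.

```python
import hashlib

DIGEST_SIZE = 20  # length of one raw SHA-1 digest in bytes


def split_digests(pieces):
    """Split the concatenated 'pieces' bytes into a list of 20-byte digests."""
    if len(pieces) % DIGEST_SIZE:
        raise ValueError("pieces length is not a multiple of 20")
    return [pieces[i:i + DIGEST_SIZE] for i in range(0, len(pieces), DIGEST_SIZE)]


def check_pieces(payload, piece_length, pieces):
    """Return one boolean per piece: True where the SHA-1 digest matches."""
    results = []
    for index, expected in enumerate(split_digests(pieces)):
        # Slicing past the end of payload simply yields the short last piece.
        chunk = payload[index * piece_length:(index + 1) * piece_length]
        results.append(hashlib.sha1(chunk).digest() == expected)
    return results
```

A list of booleans rather than printed marks makes it easy for a calling script to decide whether to keep seeding, re-download, or move the file.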

Random musings


Synthetic GFF Dataset for Genome Browser Benchmark

I deployed a GBrowse/Chado installation last week at Dow AgroSciences. It got me thinking about how slow and basic the searches are with the Bio::DB::Das::Chado* adaptor. Wouldn’t it be nice to use Solr here?

I made up a test dataset of three-tiered gene/mRNA/exon feature groups by permuting some gene model data from the knownGene annotation set of the hg18 build of the human genome. You can grab the data set and script used to generate it here. There are several files named mRNA.EN.txt.gz that contain gzipped gene models, where N=3..7 indicates that the file holds 10^N models, uniformly distributed across a 500-megabase reference sequence.
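
For a rough idea of what such a file contains, here is a Python sketch that emits GFF3-style gene/mRNA/exon groups at uniformly random positions on a 500-megabase reference. It is not the script mentioned above. That one permutes real knownGene models, whereas this one draws fixed-shape toy models. The sequence name chr_synthetic, the source column value, and the gene and exon sizes are all assumptions.

```python
import random

REFERENCE_LENGTH = 500_000_000  # 500-megabase reference sequence


def make_model(index, rng, seqid="chr_synthetic", gene_length=10_000,
               exon_count=3, exon_length=200):
    """Return tab-separated GFF3 lines for one gene/mRNA/exon group."""
    start = rng.randint(1, REFERENCE_LENGTH - gene_length + 1)
    end = start + gene_length - 1
    strand = rng.choice("+-")
    gene_id, mrna_id = f"gene{index}", f"mRNA{index}"
    rows = [
        (seqid, "synthetic", "gene", start, end, ".", strand, ".",
         f"ID={gene_id}"),
        (seqid, "synthetic", "mRNA", start, end, ".", strand, ".",
         f"ID={mrna_id};Parent={gene_id}"),
    ]
    # Evenly spaced exons: the first starts at the gene start.
    step = (gene_length - exon_length) // max(exon_count - 1, 1)
    for i in range(exon_count):
        exon_start = start + i * step
        exon_end = exon_start + exon_length - 1
        rows.append((seqid, "synthetic", "exon", exon_start, exon_end, ".",
                     strand, ".", f"ID={mrna_id}.exon{i + 1};Parent={mrna_id}"))
    return ["\t".join(str(field) for field in row) for row in rows]


def generate_models(count, seed=0):
    """Yield GFF3 lines for count models; the same seed gives the same output."""
    rng = random.Random(seed)
    for index in range(count):
        yield from make_model(index, rng)
```

Calling generate_models(10**N) and writing the lines through gzip.open would give a file of the same scale as the one for a given N, though with much less realistic gene structure.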

I’m planning to load these data into a couple of different systems and then compare performance on some of the typical Bio::DB::GFF API calls.  I can personally test on:

  • Chado
  • The default Bio::DB::GFF schema (does it have a name?)
  • The Solr backend I’m about to implement

I know there are other feature DBs out there.  It would be good to include them as well in a later pass or to have someone else contribute the data once I get the benchmarking script written.
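
As a starting point for that benchmarking script, the comparison could be framed as timing the same random range queries against each backend. The Python sketch below is purely illustrative. The query(start, end) callable interface and the in-memory list backend are assumptions standing in for real adaptors, which would wrap Chado, Bio::DB::GFF, or Solr.

```python
import random
import time


def time_range_queries(query, reference_length, window, repeats, seed=0):
    """Run query(start, end) on random windows.

    Returns (elapsed_seconds, total_hits). Reusing the same seed makes every
    backend see the same sequence of windows.
    """
    rng = random.Random(seed)
    hits = 0
    began = time.perf_counter()
    for _ in range(repeats):
        start = rng.randint(1, reference_length - window + 1)
        hits += len(query(start, start + window - 1))
    return time.perf_counter() - began, hits


def make_list_backend(features):
    """Toy backend: a linear scan over (start, end) tuples held in memory."""
    def query(start, end):
        # A feature overlaps the window if it starts before the window ends
        # and ends after the window starts.
        return [f for f in features if f[0] <= end and f[1] >= start]
    return query
```

Comparing total_hits across backends also gives a cheap sanity check that they return the same features for the same windows.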

Genomics
Informatics
Java
Perl
Scalability
Science
