Here’s another little shell utility I’ve been sitting on for a while. This one shuffles the line-oriented data read from a pipe. It has the notion of buffering and partial flushing so we can handle streams / very large data sets.
#!/usr/bin/perl $|++; use strict; use Getopt::Long; my $USAGE = join '', <DATA>; my $B = 0; my $D = 1; my $H = 0; GetOptions ("buffer|b=i" => \$B, "draw|d=i" => \$D, "help|h" => \$H, ); if ( $D == 1 && $B > 0 ) { $D = $B; } if ( ($B < 0) || ($D < 1) || ($B > $D) || ($H) ) { print $USAGE and exit(1); } my @buf = (); while ( my $element = <> ) { #buffer whole stream if ( $B == 0 ) { push @buf, $element; } #no-op elsif ( $B == 1 ) { print $element; } #buffer window else { push @buf, $element; if ( scalar( @buf ) >= $D && scalar( @buf ) > $B ) { flush(); } } } flush(); sub flush { for ( my $j = scalar( @buf ) - 1 ; $j >= 0 ; $j-- ) { my $swap = int(rand($j)); if ( $swap != $j ) { ($buf[ $j ], $buf[ $swap ]) = ($buf[ $swap ], $buf[ $j ]); } } while ( scalar( @buf ) - 1 > $B - $D ) { print shift @buf; } } __DATA__ Usage: shuffle [-h] [-b <buffer size>] [-d <draw size>] Shuffle lines from a stream on STDIN. Write lines to STDOUT. -h show help (this message) -b buffer size (default 0. indicates shuffle whole stream, then write) range: 1.. -d draw size (defaults to value of -b. number of items to remove from the buffer when it fills) range: 1..buffer size You have to parameters available (besides -h for help). * buffer size (-b). Determines how many elements to temporarily hold before shuffling. The advantage of this buffer is to allow shuffling on very long streams that would not fit into system memory. The disadavantage is that it is not a truly random shuffle, as each input element can appear at most buffer-size positions away from the original position. Buffer size defaults to zero, so make sure to set it if your data set size is large. * draw size (-d). Determines how frequently the buffer is shuffled and flushed. Rather than shuffling/flushing all elements in the buffer, only do D elements. The advantage here is elements can appear more than buffer-size positions away from the original position. The disadvantage is that shuffling is done B/D times more frequently. Draw size defaults to buffer size, and has no effect. Set it to 1 to maximize randomness. Copyright/License: Allen Day <allenday@ucla.edu>, licensed under GPL 2006-2008
Post a Comment