May 2008

Desktop Tower Defense

7872. It’s either my most recent score, or a lower bound on the number of minutes I’ve spent on Desktop Tower Defense. Take your pick, just don’t tell my dissertation committee ;-)

Play here, or check my mazes. Warning, it’s addictive if you like puzzle/realtime strategy games.

Fun

Los Angeles SoC(i)al Tech Scene

I’m collating here all the tech events/sites I hear about that are specific to Los Angeles / Southern California. This post is the result of several conversations I’ve had, verbally and via email, about the scattered tech scene in Los Angeles. Disclaimer: I’m not the organizer of any of these events, and most of them I’ve never even attended so I can’t vouch for the quality. For now, the events are in the order in which I hear about them.

If you know of a resource I missed, leave a comment and I’ll add it.

Business
Computing
Networking

The Construction and Usage of a Microarray Data Warehouse

That’s my dissertation topic, and I’m defending it Thursday morning. You can grab the latest PDF here. I’ll update this post with my PowerPoint slides when I finish them. The crux of the work was published last year as Celsius: A community resource for Affymetrix microarray data. Genome Biology, 8(6), 2007. [pdf]

Update: PowerPoint slides are now online. Also, my oral defense is complete. Other than filing my defense with the UCLA research library, I am now a Doctor of Philosophy, Human Genetics.

Random musings

Statistical HTML Content Extraction

Introduction

I’ve been learning about some of the techniques used by the so-called “Black Hat SEO” community for boosting their rankings in search engine results. Intriguing stuff. I’m by no means an expert in this area, but the theory underlying building black-hat pages and networks sure looks like it has a lot to do with my primary areas of interest.

Generating Unique Content

One “Black Hat SEO” application area is automatically generating HTML pages to improve search engine rankings. This technique uses a Markov process to generate text. The idea is to build one or more web pages that contain the keywords the SEO is targeting. The method basically works like this:

  1. Assemble a corpus of text to train the model, for example from Project Gutenberg
  2. Build an order-N (typically N=2) Markov model that captures the state transitions in the corpus
  3. Generate text from the model, periodically throwing in some keywords
  4. Link the generated page to some other page to which you want to send traffic
  5. Repeat from Step 1
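
Steps 1 through 3 can be sketched in a few lines of Python. This is only a minimal illustration, assuming a word-level model and a corpus that has already been split into words. The function names (build_markov_model, generate_text) and the keyword_every parameter are made up for this example.

```python
import random
from collections import defaultdict


def build_markov_model(words, order=2):
    """Map each run of `order` consecutive words to the words seen right after it."""
    model = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        model[state].append(words[i + order])
    return dict(model)


def generate_text(model, length, keywords=(), keyword_every=0, seed=None):
    """Random-walk the model, dropping in a keyword every `keyword_every` tokens."""
    if not model:
        raise ValueError("corpus is too short for this model order")
    rng = random.Random(seed)
    states = sorted(model)
    state = rng.choice(states)
    output = list(state)
    while len(output) < length:
        followers = model.get(state)
        if not followers:
            # Dead end: this state only occurs at the very end of the corpus.
            state = rng.choice(states)
            output.extend(state)
            continue
        word = rng.choice(followers)
        output.append(word)
        state = state[1:] + (word,)
        if keywords and keyword_every and len(output) % keyword_every == 0:
            # The keyword is spliced into the output but not into the state,
            # which is exactly why it tends to break the flow of the text.
            output.append(rng.choice(list(keywords)))
    return " ".join(output[:length])


corpus = "the cat sat on the mat and the cat ate the rat".split()
model = build_markov_model(corpus, order=2)
print(generate_text(model, 30, keywords=["SEO"], keyword_every=7, seed=42))
```

Because followers are stored with repetition, picking one uniformly at random reproduces the transition frequencies observed in the corpus.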

One problem with this approach, aside from the fact that the keywords don’t really fit in with the flow of the model, is that the model is trained on inappropriate text. For instance, suppose you were trying to optimize for these keywords:

  • keywords
  • statistics
  • Search engine optimization
  • SEO
  • Automatic content generation
  • Automatic content extraction
  • HTML content extraction
  • Markov Model

… then you probably wouldn’t want to train your model on, say, Jane Austen’s Pride and Prejudice.

Improve Generated Text: Use Niche Corpora

A better thing to do would be to find some nice web pages containing keywords, statistics, seo, Markov model, and so on. That way you’ll pick up related keywords that you didn’t initially think of (or weren’t suggested by your keyword expansion tool), too.

But let’s face it. The corpora are going to be in HTML format. So the question now becomes, How do I automate the transformation of HTML into plain text for input to the model? A few strawman ideas, followed by my remarks:

  • Get an HTML document, and remove all <element/>s. Won’t work very well. You end up training on page navigation, footers, headers, etc.
  • Build a site- or software-specific parser (e.g. for Wikipedia, or for Wordpress) to extract the main content. Scalability and maintenance nightmare. This does not generalize to arbitrary pages, and you’ll be constantly fixing broken parsers.
  • Devise a scoring system that can identify the main content of the page. Exactly!

I did find some methods for scoring page fragments, such as the Perl modules HTML::Content::Extractor and HTML::Extract, and another method described by Nooks. There are also a few interesting ideas in Gupta’s WWW2003 paper.

None of that Perl code linked above actually works, but Nooks and Jean Tavernier generally had the right idea. Basically, they look “down” the DOM to find the sub-DOM with the highest text/tag ratio.

The main problem with this approach is that it biases for DOM leaves, or “twigs” that are very close to leaves. You end up having to write special rules to accommodate the idiosyncrasies of each particular page, and it basically turns back into an HTML parsing exercise.

The other problem, and possibly the more significant one from a statistician’s point of view, is that the ratio is not a well-understood metric for making decisions about what constitutes a “good” versus a “bad” sub-document. It would be better to have a p-value…

Balls and Urns

Fortunately, Fisher’s exact test can be applied to this problem. Here’s how to apply it; an explanation follows. First, let’s define some variables:

  • X: the total number of words in the whole document.
  • x: the number of words in a sub-document.
  • Y: the total number of <element/>s in the whole document.
  • y: the number of <element/>s in a sub-document.

Then, we perform the following algorithm to identify the single best sub-document:

tree; //the HTML tree's root node
minP = 1; //minimum p-value observed in the document
subD = ""; //sub-document corresponding to minimum p-value
X = calculatex(tree);
Y = calculatey(tree);
look(tree);
function look (node) {
  x = calculatex(node);
  y = calculatey(node);
  p = calculateHyperG(x,y,X,Y);
  if ( p < minP ) {
    minP = p;
    subD = node;
  }
  C = children(node);
  foreach (c in C) {
    look(c);
  }
}
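
For anyone who wants to try the pseudocode above, here is a rough, runnable Python sketch using only the standard library. It rests on several assumptions that are not spelled out in the pseudocode: calculateHyperG is read as a one-sided hypergeometric tail (the chance of seeing at least x words among x+y tokens), text inside <script/> and <style/> counts as words, and the tree is built with a deliberately simple parser. All class and function names here are made up for the example.

```python
from html.parser import HTMLParser
from math import comb

# Elements that never take a closing tag, so they cannot have children.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=(), parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.children = []
        self.words = 0      # x: becomes the subtree total after total_up()
        self.elements = 1   # y: this element, plus descendants after total_up()


class TreeBuilder(HTMLParser):
    """Build a bare-bones element tree that records word counts per node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.root.elements = 0  # the document root is not itself an element
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        # Text inside <script> and <style> is counted as words too.
        self.current.words += len(data.split())


def total_up(node):
    """Fold each child's counts into its parent (post-order)."""
    for child in node.children:
        total_up(child)
        node.words += child.words
        node.elements += child.elements


def hyperg_p(x, y, X, Y):
    """P(at least x words among x + y tokens drawn from X words and Y elements)."""
    n = x + y
    hits = sum(comb(X, k) * comb(Y, n - k) for k in range(x, min(X, n) + 1))
    return hits / comb(X + Y, n)


def best_subdocument(html):
    """Return (node, p) for the sub-document with the smallest p-value."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    root = builder.root
    total_up(root)
    X, Y = root.words, root.elements
    best, min_p = root, 1.0
    stack = list(root.children)
    while stack:
        node = stack.pop()
        p = hyperg_p(node.words, node.elements, X, Y)
        if p < min_p:
            best, min_p = node, p
        stack.extend(node.children)
    return best, min_p


page = ('<html><body>'
        '<div id="nav"><ul><li><a href="/">Home</a></li>'
        '<li><a href="/about">About</a></li></ul></div>'
        '<div id="main"><p>' + 'word ' * 50 + '</p><p>' + 'text ' * 50 + '</p></div>'
        '<div id="footer"><a href="/c"><img src="x.png"></a><span>Copyright</span></div>'
        '</body></html>')
best, p = best_subdocument(page)
print(best.tag, best.attrs.get("id"), p)
```

On this toy page, each single paragraph has the higher text/tag ratio (50 words per element, versus 100/3 for the enclosing div), yet the smallest p-value belongs to the div with id "main". The exact integer arithmetic is slow on large pages, and very small p-values can underflow to 0.0, so a real implementation would probably work with log-probabilities.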

Balls and Urns, Explained

The pseudocode above examines each sub-document of the HTML document in turn and identifies the one with the smallest p-value. The p-value is calculated using the hypergeometric distribution, where we consider that a sub-document has x words and y HTML <element/>s, in the context of the total document having X words and Y HTML <element/>s. It’s better than a simple ratio calculation because it does not bias for the tree’s leaves. That is, the p-value takes into account not only the proportion of words to elements but also the size of the sub-document, x+y.
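
One way to write that p-value down, using the variables defined earlier, is as the upper tail of the hypergeometric distribution. The one-sided form is an assumption here: it treats the sub-document as x+y tokens drawn without replacement from a pool of X words and Y elements, and asks how likely it is to see at least x words by chance.

```latex
p(x, y \mid X, Y) \;=\; \sum_{k=x}^{\min(X,\; x+y)}
  \frac{\binom{X}{k}\,\binom{Y}{x+y-k}}{\binom{X+Y}{x+y}}
```

A small sub-document can have a high word-to-element ratio and still get an unremarkable p-value, because few tokens were drawn.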

Caveats

Bear in mind that testing so many sub-documents, especially for very large HTML documents, warrants so-called “multiple hypothesis testing correction”, such as a Bonferroni correction. That’s outside the scope of this article.
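
For readers who want a starting point anyway, the Bonferroni correction itself is tiny: multiply each p-value by the number of tests performed and cap the result at 1. The function name below is made up for this sketch.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# One p-value per sub-document tested (made-up numbers).
adjusted = bonferroni([0.001, 0.02, 0.4])
```

Since every p-value is scaled by the same factor, the correction does not change which sub-document comes out on top. It only changes whether that winner still looks significant.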

Also, the tests performed are not entirely independent. That is, if node B is a child of node A then B will have some effect on A when calculating A’s p-value and must be factored out. This is also a well-defined problem but is, alas, also outside the scope of this article. Do your homework! Hint: learn about the Gene Ontology.

Conclusion

Fine and dandy, but does it work? My conclusion: seems to work. Here’s a CGI script demonstrating the hypergeometric content extraction technique on CNN.com. It reports a text snippet at the beginning and end of the single “best” sub-document and the corresponding (uncorrected) p-value. Twiddle the u parameter to test on a page of your choice. Some pages may block the user-agent I’m using…

There is also the issue of what to consider an element and what not to… or maybe even element weighting. For instance, maybe <p/> and <i/> elements shouldn’t be penalized because they’re commonly associated with text, while <script/> elements should be heavily penalized.

Informatics
Software
Statistics

Culver City Barking Dogs, Los Angeles Barking Dogs, California Barking Dogs

I’ve been collating information on laws for the area where I live that relate to barking dogs. If you’re in Culver City, Los Angeles County, or the state of California, some or all of this should be useful to you. If you also have a Home Owners’ Association, there is probably wording in your CC&Rs or Bylaws that prohibits annoying noises and describes the powers of enforcement given to the HOA.

These links are more generally useful for looking up local laws:

Culver City Code

Los Angeles County Code

California Code

I also found this page and this page to be useful.

Below are the specific sections of the relevant codes that apply to barking dogs:

Culver City § 9.07.030 ANIMALS AND FOWL
Any animal or fowl which emanates sound or outcry in an excessive, continuous, or untimely fashion, shall be considered a public nuisance and is subject to abatement pursuant to Chapter 9.04 of the Culver City Municipal Code.

(’65 Code, § 23-44.6) (Ord. No. 95-004 § 2 (part))

Culver City § 9.01.035 ANIMAL ANNOYANCE PROHIBITED
It shall be unlawful for any person to harbor or keep any animal, bird or fowl which disturbs the peace or causes annoyance or disturbance to the neighborhood or reasonably interferes with the peace, comfort or repose of any person or persons in the quiet enjoyment of his or their property, by repeated or continuous barking, howling, whining, or making other sounds common to their species, between the hours of 10:00 p.m. and 8:00 a.m. and such disturbance shall be deemed to constitute the maintenance of a nuisance. Provided, however, that the prohibitions contained in this Section shall not apply to a licensed kennel owner or hospital or other place in which animals, birds or fowl are kept pursuant to a license or permit issued by governmental agencies.

(’65 Code, § 5-7) (Ord. No. CS-415 § 5-17; Ord. No. CS-24 § 2(d))

Los Angeles County Code § 13.45.010 Loud, unnecessary and unusual noise
Notwithstanding any other provisions of this chapter and in addition thereto, it shall be unlawful for any person to wilfully make or continue, or cause to be made or continued, any loud, unnecessary, and unusual noise which disturbs the peace or quiet of any neighborhood or which causes discomfort or annoyance to any reasonable person of normal sensitiveness residing in the area. The standard which may be considered in determining whether a violation of the provisions of this section exists may include, but not be limited to, the following:
A. The level of noise;
B. Whether the nature of the noise is usual or unusual;
C. Whether the origin of the noise is natural or unnatural;
D. The level and intensity of any background noise;
E. The proximity of the noise to residential sleeping facilities;
F. The nature and zoning of the area within which the noise emanates;
G. The density of the inhabitation of the area within which the noise emanates;
H. The time of the day or night the noise occurs;
I. The duration of the noise;
J. Whether the noise is recurrent, intermittent, or constant; and
K. Whether the noise is produced by a commercial or noncommercial activity. (Ord. 2001-0075 § 1 (part), 2001.)

Los Angeles County Code § 13.45.020 Penalty
Any person violating this chapter is guilty of a misdemeanor punishable by a fine or by imprisonment no more than six months, or both. The fines imposed under this chapter are as follows:
A. A fine of not more than $100.00 for a first violation;
B. A fine of not more than $200.00 for a second violation of the same provision of this ordinance within one year;
C. A fine of not more than $500.00 for each additional violation of the same provision of this ordinance within one year. (Ord. 2001-0075 § 1 (part), 2001.)

Los Angeles County Code § 10.40.065 Public nuisance
A. Any animal (or animals) which molests passersby or passing vehicles, attacks other animals, trespasses on school grounds, is repeatedly at large, damages and or trespasses on private or public property, barks, whines or howls in a continuous or untimely fashion, shall be considered a public nuisance.

B. Every person who maintains, permits or allows a public nuisance to exist upon his or her property or premises, and every person occupying or leasing the property or premises of another and who maintains, permits or allows a public nuisance as described above to exist thereon, after reasonable notice in writing from the department of animal care and control has been served upon such person to cease such nuisance, is guilty of a misdemeanor. The existence of such nuisance for each and every day after the service of such notice shall be deemed a separate and distinct offense. (Ord. 2000-0075 § 54, 2000: Ord. 85-0204 § 24, 1985.)

California Penal Code § 373A:
Every person who maintains, permits, or allows a public nuisance to exist upon his or her property or premises, and every person occupying or leasing the property or premises of another who maintains, permits or allows a public nuisance to exist thereon, after reasonable notice in writing from a health officer or district attorney or city attorney or prosecuting attorney to remove, discontinue or abate the same has been served upon such person, is guilty of a misdemeanor, and shall be punished accordingly; and the existence of such nuisance for each and every day after the service of such notice shall be deemed a separate and distinct offense, and it is hereby made the duty of the district attorney, or the city attorney of any city the charter of which imposes the duty upon the city attorney to prosecute state misdemeanors, to prosecute all persons guilty of violating this section by continuous prosecutions until the nuisance is abated and removed.

Random musings

Configure Wordpress Ping

I wanted to configure Wordpress pinging for the Facebook Flog Blog application. For some reason the feed on my profile page isn’t updating, and I thought maybe this would do the trick.

Took a bit of digging, but I found a guide at Technorati. Hint: “options” has been (moved and) renamed as “settings” as of Wordpress 2.5.1.

Let’s see if it works!

Update: just by visiting the Flog Blog settings page, I have somehow managed to get Flog Blog to update. Hmm…

PHP
Random musings

Kinesis Advantage Pro USB Review

I bought a couple of Kinesis Advantage™ Pro USB for PC & Mac ergonomic keyboards about 3 years ago. A friend recommended them for reducing wrist and forearm pain from spending many hours typing.

Overall it’s a good product, and it is definitely the best keyboard I’ve used. The main advantage is that it reduces fatigue. Some of the fatigue reduction is due to the QWERTY key layout — the keys are vertically aligned instead of staggered, and the concave contour lets your hands sit in a more relaxed position.

However, the major advantage of this keyboard over others is the thumb keys. Think about it — your thumbs are very strong digits, certainly much stronger than your pinkies. Why should they only be used to hit the space bar, or occasionally curled under to hit CTRL and ALT?

I set up my keyboard to leverage the thumb keys as much as possible using the built-in remapping feature of the Kinesis Advantage. Side by side below are the default layout and my modified layout.

The default layout seems to be well-suited for transcription. For programming… not so good. Especially if you use Emacs as your editor (as I used to) or Fluxbox as your window manager (as I do), because these two rely heavily on the ALT key for keychain commands.

I remapped the up/down arrows to be more like vim. I put CAPS LOCK as far away as possible to prevent hitting it, and put CTRL there instead — like the Sun layout. I’m not using a Windows machine at all, so I opted to not map the Windows key.

This is a great layout.

Downsides of the keyboard:

Crappy, crappy firmware. The keyboard frequently does not detect key-up events, so I frequently find the keys get stuck and a character will get inserted a bunch of times. The shift keys seem to be most prone to getting stuck, especially if you type A BUNCH OF CAPS IN A ROW WITHOUT LETTING OFF THE SHIFT KEY IN BETWEEN THEM. The fix for this is to hit both shift keys a bunch of times to get them to register the key-up event. Very annoying. There have been times when keys getting stuck have actually caused me to lose work in my editor (think about having the shift key stuck in vim command-mode). Ugh. Granted, this does seem to happen more on the keyboard in my home office than in my office at school. Maybe it’s the cat hair?

Function keys suck too. Frequently a single press of a function key registers as two key presses.

Random musings

Best time for that blog post… or not

Saw an article that claims Thursday after lunch is the best time to post your blog entry. But the study is flawed, not because it is necessarily incorrect, but because it does not document the methodology used.

For instance, did you normalize your data by day and by hour? How do I know most hot posts don’t occur at Thursday noon just because that’s when the most posts are submitted, so that the probability of getting a high rank is simply proportional to the volume of submissions?
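
To make the normalization point concrete, here is a small Python sketch with completely made-up numbers (I have no access to the study’s data, and the function name is hypothetical). It compares the raw count of hot posts per time slot with the fraction of submissions in that slot that became hot.

```python
from collections import defaultdict


def hot_rates(posts):
    """posts: iterable of (time_slot, is_hot) pairs.

    Returns {time_slot: (hot_count, total_count, hot_count / total_count)}.
    """
    totals = defaultdict(int)
    hots = defaultdict(int)
    for slot, is_hot in posts:
        totals[slot] += 1
        hots[slot] += bool(is_hot)
    return {slot: (hots[slot], totals[slot], hots[slot] / totals[slot])
            for slot in totals}


# Made-up numbers, purely for illustration.
posts = ([("Thu 12:00", True)] * 50 + [("Thu 12:00", False)] * 950
         + [("Sun 06:00", True)] * 10 + [("Sun 06:00", False)] * 90)
rates = hot_rates(posts)
busiest = max(rates, key=lambda s: rates[s][0])  # most hot posts in absolute terms
best = max(rates, key=lambda s: rates[s][2])     # highest chance per submitted post
print(busiest, best)
```

With these numbers, Thursday noon produces the most hot posts (50 versus 10), but a post submitted Sunday at 6am is twice as likely to become hot (10% versus 5%). Without the per-slot submission counts there is no way to tell the two situations apart.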

Pretty awesome that the analysis was done in R, though. Give us some methods (preferably with code), please.

Via ReadWriteWeb

Random musings

Costco Stainless Steel Cookware Review

I bought the Kirkland Signature™ 16-pc Stainless Steel Cookware Set Copper Bonded 5-ply Base last week. Just as functional and beautiful as the All-Clads, and at a fraction of the price (presumably because they’re manufactured in Thailand and don’t have a built-up brand name). Highly recommended.

There is a great thread over at Chowhound about these pans. If you’re thinking about getting some steel pans, it’s particularly useful to know that an oxalic acid cleanser helps keep the pans from staining. Also, make sure you heat them up before adding any cooking ingredients or things will stick. I’ve only regularly used non-stick pans before having these, and it’s quite different because the thick copper/aluminum base really holds the heat once it gets hot. For instance, I can turn the burner to low and maintain a medium temp. no problem.

Random musings

Wordpress Disable Autosave

The autosave feature has been giving me a lot of trouble since upgrading to Wordpress 2.5.1. I found this post on the topic of disabling autosave. Moonlight gets it mostly right, having identified the causal line:

wp_enqueue_script('autosave')

I just had to comment that out in all the files under wp-admin and everything works again.

PHP
