Introduction
I’ve been learning about some of the techniques used by the so-called “Black Hat SEO” community for boosting their rankings in search engine results. Intriguing stuff. I’m by no means an expert in this area, but the theory underlying building black-hat pages and networks sure looks like it has a lot to do with my primary areas of interest.
Generating Unique Content
One “Black Hat SEO” application area is automatically generating HTML pages to improve search engine rankings. This technique uses a Markov process to generate text. The idea is to build one or more web pages that contain the keywords the SEO is targeting. The method basically works like this:
- Assemble a corpus of text to train the model. For example, Project Gutenberg
- Build an order-N (typically N=2) Markov model that captures the state changes in the corpus
- Generate text from the model, periodically throwing in some keywords
- Link the generated page to some other page to which you want to send traffic
- Repeat again from Step 1
One problem with this approach — aside from the fact that the keywords don’t really fitting in with the flow of the model — is that the model is trained on inappropriate text. For instance, suppose you were trying to optimize for keywords:
- keywords
- statistics
- Search engine optimization
- SEO
- Automatic content generation
- Automatic content extraction
- HTML content extraction
- Markov Model
… then you probably wouldn’t want to train your model on, say, Jane Austen’s Pride and Prejudice.
Improve Generated Text: Use Niche Corpora
A better thing to do would be to find some nice web pages containing keywords, statistics, seo, Markov model, and so on. That way you’ll pick up related keywords that you didn’t initially think of (or weren’t suggested by your keyword expansion tool), too.
But let’s face it. The corpora are going to be in HTML format. So the question now becomes, How do I automate the transformation of HTML into plain text for input to the model? A few strawman ideas, followed by my remarks:
- Get an HTML document, and remove all <element/>s. Won’t work very well. You end up training on page navigation, footers, headers, etc.
- Build a site- or software-specific parser (e.g. for Wikipedia, or for Wordpress) to extract the main content. Scalability and maintenance nightmare. This is not generalizable to general text extraction. You’ll be constantly fixing broken parsers, too.
- Devise a scoring system that can identify the main content of the page. Exactly!
I did find some methods for scoring page fragments, such as the Perl modules HTML::Content::Extractor and HTML::Extract, and another method described by Nooks. There are also a few intersting ideas in Gupta’s WWW2003 paper.
None of that Perl code linked above actually works, but Nooks and Jean Tavernier generally had the right idea. Basically, they look “down” the DOM to find the sub-DOM with the highest text/tag ratio.
The main problem with this approach is that it biases for DOM leaves, or “twigs” that are very close to leaves. You end up having to write special rules for accomodating the idiosyncrosies of each particular page dealt with, and it basically turns back into an HTML parsing exercise.
The other problem, and possibly more significant one from a statistician’s point of view, is that the ratio is not a well-understood metric for making decisions about what constitutes a “good” versus a “bad” sub-document. It would be better to have a p-value…
Balls and Urns
Fortunately, Fisher’s exact test can be applied to this problem. Here’s how you can apply it, explanation follows. First, let’s define some variables:
- X: the total number of words in the whole document.
- x: the number of words in a sub-document.
- Y: the total number of <element/>s in the whole document.
- y: the number of <element/>s in a sub-document.
Then, we perform the following algorithm to identify the single best sub-document:
tree; //the HTML tree's root node minP = 1; //minimum p-value observed in the document subD = ""; //sub-document corresponding to minimum p-value X = calculatex(tree); Y = calculatey(tree); look(tree); function look (node) { x = calculatex(node); y = calculatey(node); p = calculateHyperG(x,y,X,Y); if ( p < minP ) { minP = p; subD = node; } C = children(node); foreach (c in C) { look(c); } }
Balls and Urns, Explained
The pseudocode above is examining each sub-document of the HTML document in turn and identifying the one with the smallest p-value. The p-value is calculated using the hypergeometirc distribution, where we consider that a sub-document has x words and y HTML <element/>s. This, in the context of the total document having X words and Y HTML <element/>s. It’s better than a simple ratio calculation because it does not bias for the tree’s leaves. That is, the p-value does not consider only the size of x+y.
Caveats
Bear in mind that testing so many sub-documents, especially for very large HTML documents, warrants so-called “multiple hypothesis testing correction“, such as a Bonferroni correction. It’s outside the scope of this article.
Also, the tests performed are not entirely independent. That is, if node B is a child of node A then B will have some effect on A when calculating A’s p-value and must be factored out. This is also a well-defined problem but is, alas, also outside the scope of this article. Do your homework! Hint: learn about the Gene Ontology.
Conclusion
Fine and dandy, but does it work? My conclusion: seems to work. Here’s a CGI script demonstrating the hypergeometric content extraction technique on CNN.com. It reports a text snippet at the beginning and end of the single “best” sub-document and the corresponding (uncorrected) p-value. Twiddle the u parameter to test on a page of your choice. Some pages may block the user-agent I’m using…
There is also the issue of what to consider an element and what not to… or maybe even element weighting. For instance, maybe <p/> and <i/> elements shouldn’t be penalized because they’re commonly associated with text, but <script/> elements are heavily penalized.
Post a Comment