More thoughts on EC2 / EBS / Hadoop

I’m still getting up to speed on running Hadoop on EC2. Today I found this AWS post describing how to easily port data from S3 into a Hadoop cluster, create new Hadoop slaves from AMI system images, and start up and tear down clusters.

I made some comments yesterday about wanting to be able to scale the Hadoop cluster down as well as up, in particular being able to disable cores, which are the really expensive part of running a cluster on AWS.

Now we need to look into the available AWS scripts and AMI images to see how feasible it is to maintain more data volumes than images. What I’m (roughly) thinking is that we might set up M data volumes for the DFS, but run N data/task nodes, where 1 <= N <= M. When N < M, you just load some of the N nodes with more than one EBS volume.
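
Here’s a rough sketch of that N < M idea, assuming the volumes are simply dealt out round-robin. The function name `assignVolumes` and the volume IDs are made up for illustration; none of this comes from the AWS scripts, and a real setup would still have to attach and mount each volume.

```javascript
// Hypothetical helper: spread M EBS volume IDs across N data nodes, round-robin.
// Returns an array of N lists; when N < M some nodes carry more than one volume.
function assignVolumes(volumeIds, nodeCount) {
  if (!Number.isInteger(nodeCount) || nodeCount < 1 || nodeCount > volumeIds.length) {
    throw new RangeError("need 1 <= N <= M");
  }
  const nodes = Array.from({ length: nodeCount }, () => []);
  volumeIds.forEach((id, i) => nodes[i % nodeCount].push(id));
  return nodes;
}

// M = 5 volumes on N = 2 nodes (IDs are placeholders)
const layout = assignVolumes(["vol-a", "vol-b", "vol-c", "vol-d", "vol-e"], 2);
console.log(layout);
```

Scaling down would then mean recomputing the layout with a smaller N and moving volumes between nodes, which is exactly the part that still needs investigating on the HDFS side.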

Also need to look into how HDFS deals with adding new volumes, i.e. will it just start replicating data onto nodes as they are added into the system? Is there a way to hot-add rather than restarting the data master? Hot-adding EBS volumes onto existing data nodes?

Distributed Systems
Java
Scalability

Darren Rush blogging

Looks like Darren has started a blog. Awesome. Adding it to my link roll as soon as I figure out how.

I’m looking forward to that post comparing monitoring systems!

Random musings

EC2 + EBS + Hadoop at BiggerBoat

Rodger and I at BiggerBoat got Hadoop and HBase up and running on Amazon EC2 today. We initially set up a cluster of 1 master and 10 slaves. After a quick calculation of how much this costs to keep running 24/7, we started trying to figure out how to scale the thing DOWN as well as UP, and to be able to do so dynamically. Seems like the tricky piece is the Hadoop storage, not so much the compute power available. Amazon just launched their Elastic Block Store a few days ago, so we’re seeing how that fits in. Seems like the EBS I/O is pretty good given our Bonnie++ tests.

Tom White has some architecture scenarios for building this kind of stuff.

Distributed Systems
Java
Scalability

Examples for data import, export, and transport with HBase

I’m in the process of setting up an analytic workflow at BiggerBoat. It’s looking like the main theme in data structures around here will be the sparse matrix, so I’ve been playing with open-source technologies for sparse matrices. Apache Hadoop’s HBase is looking like a good choice for now, maybe Hive later.

Right now I’m getting familiar with the former. As part of this, I’m improving the docs on the wiki to make them more user- (as opposed to core developer-) friendly. My documentation goal right now is to add some data transformation example code. There are already lots of Hadoop examples for doing text -> text mapping, e.g. grep, cat, etc. For HBase, not so much. The cases I have in mind:

  • text to text (done, many examples)
  • flatfile to HBase table (Bulk loader in the HBase wiki, I haven’t tried it yet)
  • HBase table to flatfile
  • HBase table to HBase table

I’ll be adding updated, complete, and simple code for the latter two (three?) in the next few days to the HBase/MapReduce page.

Analytics
Distributed Systems
Java

Cookies, IP Addresses and Unique Users

I’ve been thinking about how to track unique users today. These are my so-far-unorganized thoughts. Please comment!

You can’t track users by cookie alone for a couple of reasons: 1) they might use multiple computers, 2) they might delete cookies, 3) multiple users might share the same computer (same account = same cookie).

You also can’t track users by IP address alone for some more reasons: 1) they might be using a mobile device or portable computer that moves from IP to IP, 2) there could be multiple machines passing through a single gateway IP (i.e. LAN NAT).

However, if you combine the cookie/IP information together, you can start to address some of these issues. Let’s assume you have some webserver logs that minimally contain <IP address "A">, <cookie ID "C">, <timestamp T> triples.

            === time ===>
PATTERN 1:
C1-A1 ---      ---
C1-A2    ---         ---
C1-A3       ---   ---

PATTERN 2:
C1-A1 ------
C2-A1       ------
C1-A2             ------

PATTERN 3:
C1-A1 -----!
C2-A1       -----!
C3-A1             ------

PATTERN 4:
C1-A1 ------      ------
C2-A1     ----------

PATTERN 5:
C1-A1 -----     -----
C2-A1      -----     ---

This matrix indicates the compatibility of each of the patterns (P1-P5)
with several different classes of cookie/IP address combination that we
might want to detect.

                                     patterns
                               P1  P2  P3  P4  P5
profiles                      --------------------
multiple users per IP        | -   +   -   +   +
multiple users per cookie    | -   -   -   -   -
multiple IPs per user        | +   +   -   -   -
multiple cookies per user    | -   -   +   -   +
cookie deletion              | -   -   +   -   -
"permanent" IP change        | -   +   -   -   -

Note that none of these patterns gives any indication for the “multiple
users per cookie” profile. To assess if there is more than one
user/cookie, you might want to look at the context in which you’re
observing the cookie. Consider attributes like (timezone corrected)
time-of-day, day-of-week, type of content being viewed.
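
As a first pass over such logs, here is a minimal sketch that just groups the triples and flags cookies seen on several IPs and IPs seen with several cookies. The record shape ({ ip, cookie, time }) and the name `summarize` are my own assumptions, and it ignores how the sessions interleave in time, so on its own it cannot tell, say, P4 from P5.

```javascript
// Hypothetical record shape: { ip: "A1", cookie: "C1", time: 0 }
// Builds cookie -> IPs and IP -> cookies, then reports the keys with more than one partner.
function summarize(records) {
  const ipsByCookie = new Map();
  const cookiesByIp = new Map();
  for (const { ip, cookie } of records) {
    if (!ipsByCookie.has(cookie)) ipsByCookie.set(cookie, new Set());
    if (!cookiesByIp.has(ip)) cookiesByIp.set(ip, new Set());
    ipsByCookie.get(cookie).add(ip);
    cookiesByIp.get(ip).add(cookie);
  }
  // Keep only keys whose partner set has more than one member
  const multi = (m) => [...m].filter(([, s]) => s.size > 1).map(([k]) => k).sort();
  return {
    cookiesWithManyIps: multi(ipsByCookie), // candidates for "multiple IPs per user"
    ipsWithManyCookies: multi(cookiesByIp), // candidates for "multiple users per IP"
  };
}

// Records shaped like PATTERN 2: C1 on A1, then C2 on A1, then C1 on A2
const pattern2 = [
  { ip: "A1", cookie: "C1", time: 0 },
  { ip: "A1", cookie: "C2", time: 1 },
  { ip: "A2", cookie: "C1", time: 2 },
];
console.log(summarize(pattern2));
```

Telling the patterns apart properly would mean sorting each group by timestamp and checking whether the sessions alternate, overlap, or stop for good.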

Analytics
Random musings

Hadoop / SGE Grid Engine Convergence

I’m an old hand with SGE and a more recent user of Hadoop / Pig. Good to see that there is interest in making these technologies interoperate.

Distributed Systems
Java
Scalability
Science

Laser Magic

LATIMER the world champion of magic www.latimeronline.com

Man, I really want a Magic Castle invite…

Random musings

The Most Awesome Nigerian Scam/Spam Email Yet

from: IKEMBA OKOYE <ikembaokoye2003@yahoo.com>
reply-to: ikembaokoye2007@yahoo.fr
to: XXX@XXX.XXX
date: Mon, Jul 28, 2008 at 9:26 AM
subject: SOMEONE YOU CALL YOUR FRIEND, WANTS YOU DEAD.
SOMEONE YOU CALL YOUR FRIEND, WANTS YOU DEAD.

I felt very sorry and bad for you, that your life is going to end like this, I was paid to eliminate you and I have to do it within 10 days.

Someone you call your friend wants you dead by all means, and the person have spent a lot of money on this, the person came to us and told us that he wants you dead and he provided us your names, photograph and other necessary information we needed about you.

Meanwhile, I have sent my boys to track you down and they have carried out the necessary investigation needed for the operation, but I ordered them to stop for a while and not to strike immediately because I just felt something good and sympathetic about you. I decided to contact you first and know why somebody will want you dead. Right now my men are monitoring you, their eyes are on you, and even the place you think is safer for you to hide might not be.

Now do you want to LIVE OR DIE? It is up to you. Get back to me now if you are ready to enter deal with me, I mean life trade, who knows, and I might just spear your life, $8,000 is all you need to spend. You will first of all pay $900 then I will send the tape of the person that want you dead to you and when the tape gets to you, you will pay the remaining $7100. If you are not ready for my help, then I will have no choice but to carry on the assignment after all I have already being paid.

Warning: do not think of contacting the police or even tell anyone because I will extend it to any member of your family since you are aware that somebody want you dead, and the person knows some members of your family as well.

For your own good I will advise you not to go out once it is 8pm until I make out time to see you and give you the tape of my discussion with the person who want you dead then you can use it to take any legal action. You can send the $900 to one of my local boy in Benin with this below information via western union or money Gram.

Receivers name. Christian Oforka.
Country. Benin.
City. Cotonou.
Question. Who made
Answer God.
Amount to be sent first $900

Good luck as I await your reply to this e-mail contact: ikembaokoye2007@yahoo.fr
Bye.

Ikemba Okoye.

Random musings

Mozilla Firefox Layout DOM and element positioning ; BoxObject Box Object getBoundingClientRect getBoxObjectFor

Hopefully the keywords I’m using here will help the next poor soul who has to learn this part of the Mozilla API. I needed to find where in the window some elements were. This information is not contained in the usual DOM referred to via the document variable. There are actually two parallel DOMs: the Content DOM (the usual one) and the Layout DOM. The Layout DOM’s structure contains all the elements in the Content DOM, but has positional information available to you as BoxObject (Firefox 2.*) or BoundingClientRect (Firefox 3.*) objects. Read the XUL Box Object Tutorial.

Here’s how you get the boxes. For the sake of example, we’ll refer to the first <table> element in the page.

Firefox 2.*:

// Grab the first <table> and ask the document for its layout box
var tab = document.getElementsByTagName("table").item(0);
var tabBox = document.getBoxObjectFor(tab);
//tabBox.x
//tabBox.y
//tabBox.width
//tabBox.height

The x, y attributes give the location of the upper-left corner of the element, relative to the browser window. The width, height attributes tell you the size of the element.

Firefox 3.*:

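var tab = document.getElementsByTagName("table").item(0);
// getBoundingClientRect is a method of the element, not of the document
var tabBox = tab.getBoundingClientRect();
//tabBox.left
//tabBox.top
//tabBox.width   (if absent, use tabBox.right - tabBox.left)
//tabBox.height  (if absent, use tabBox.bottom - tabBox.top)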

The left, top attributes are equivalent to the Firefox 2.* x, y attributes, and width, height have identical meaning.
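
If the same code has to run on both versions, something like the following feature-detecting wrapper might do. It’s only a sketch: `getElementBox` is a name I made up, and it assumes the two APIs really are interchangeable for your purposes, which is worth checking on scrolled pages.

```javascript
// Hypothetical wrapper: returns { x, y, width, height } for an element,
// using whichever positioning API the browser exposes.
function getElementBox(doc, el) {
  if (typeof el.getBoundingClientRect === "function") {
    // Firefox 3.* style: method on the element itself
    var r = el.getBoundingClientRect();
    // Derive the size from the edges so this works even without r.width / r.height
    return { x: r.left, y: r.top, width: r.right - r.left, height: r.bottom - r.top };
  }
  if (typeof doc.getBoxObjectFor === "function") {
    // Firefox 2.* style: method on the document
    var b = doc.getBoxObjectFor(el);
    return { x: b.x, y: b.y, width: b.width, height: b.height };
  }
  return null; // neither API is available
}

// In a page:
//   var box = getElementBox(document, document.getElementsByTagName("table").item(0));
```

Checking for the element method first means newer browsers never touch getBoxObjectFor.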

These docs helped me piece this together:
http://developer.mozilla.org/en/docs/DOM:element.getBoundingClientRect
http://developer.mozilla.org/en/docs/XUL_Tutorial:Box_Objects

Some good stuff on MouseEvents I’m stashing here for my own reference:
http://developer.mozilla.org/samples/domref/dispatchEvent.html
http://blog.stchur.com/blogcode/event-rerouting/

Image borrowed from Mozilla’s Layout Engine by L. David Baron

Javascript

Sun Grid Engine SGE state letter symbol codes meanings

Adapted from here.

Category   State                                           SGE Letter Code
---------------------------------------------------------------------------
Pending    pending                                         qw
           pending, user hold                              qw
           pending, system hold                            hqw
           pending, user and system hold                   hqw
           pending, user hold, re-queue                    hRwq
           pending, system hold, re-queue                  hRwq
           pending, user and system hold, re-queue         hRwq
Running    running                                         r
           transferring                                    t
           running, re-submit                              Rr
           transferring, re-submit                         Rt
Suspended  job suspended                                   s, ts
           queue suspended                                 S, tS
           queue suspended by alarm                        T, tT
           all suspended with re-submit                    Rs, Rts, RS, RtS, RT, RtT
Error      all pending states with error                   Eqw, Ehqw, EhRqw
Deleted    all running and suspended states with deletion  dr, dt, dRr, dRt, ds, dS, dT, dRs, dRS, dRT
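
For scripting against qstat output, the table can be turned into a simple lookup. This is a sketch that copies the codes exactly as listed above; `SGE_STATES` and `categoryOf` are names I invented, and any code not in the table comes back as null rather than being guessed at.

```javascript
// Letter codes copied verbatim from the table above, grouped by category
const SGE_STATES = {
  Pending: ["qw", "hqw", "hRwq"],
  Running: ["r", "t", "Rr", "Rt"],
  Suspended: ["s", "ts", "S", "tS", "T", "tT", "Rs", "Rts", "RS", "RtS", "RT", "RtT"],
  Error: ["Eqw", "Ehqw", "EhRqw"],
  Deleted: ["dr", "dt", "dRr", "dRt", "ds", "dS", "dT", "dRs", "dRS", "dRT"],
};

// Hypothetical helper: map a state code to its category, or null if unlisted.
// Matching is case-sensitive because s, S and T mean different things.
function categoryOf(code) {
  for (const [category, codes] of Object.entries(SGE_STATES)) {
    if (codes.includes(code)) return category;
  }
  return null;
}

console.log(categoryOf("tS")); // Suspended
```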

Administration
Distributed Systems
