More thoughts on EC2 / EBS / Hadoop
I’m still getting up to speed on running Hadoop on EC2. Today I found this AWS post describing how to easily port data from S3 into a Hadoop cluster, create new Hadoop slaves from AMIs, and start up and tear down clusters.
Yesterday I made some comments about wanting to scale the Hadoop cluster down as well as up, and in particular about being able to disable cores, which are the really expensive part of running a cluster on AWS.
Now we need to look into the available AWS scripts and AMIs to see how feasible it is to maintain more data volumes than images. What I’m (roughly) thinking is that we might set up M data volumes for the DFS but run N data/task nodes, where 1 <= N <= M. In the case that N < M, you just load some of the N nodes with more than one EBS volume.
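To make the volume-to-node idea concrete, here is a rough sketch of spreading the data volumes across a smaller number of nodes in round-robin fashion. It is only an illustration: the function name and the volume/node IDs are made up, it assumes there are at least as many volumes as nodes, and it makes no AWS or Hadoop calls. It just computes which volumes each node would need to attach.

```python
def assign_volumes(volume_ids, node_ids):
    """Spread M volumes across N nodes round-robin (hypothetical helper).

    Assumes 1 <= N <= M, so every node gets at least one volume and
    some nodes get more than one when N < M.
    """
    if not 1 <= len(node_ids) <= len(volume_ids):
        raise ValueError("need 1 <= N <= M (nodes <= volumes)")

    assignment = {node: [] for node in node_ids}
    for i, volume in enumerate(volume_ids):
        # Volume i goes to node (i mod N), so the load differs by at most one.
        assignment[node_ids[i % len(node_ids)]].append(volume)
    return assignment


if __name__ == "__main__":
    # Placeholder IDs, not real EBS volumes or EC2 instances.
    volumes = ["vol-a", "vol-b", "vol-c", "vol-d", "vol-e"]
    nodes = ["node-1", "node-2"]
    for node, vols in assign_volumes(volumes, nodes).items():
        print(node, vols)
```

Scaling down would then mean recomputing the assignment with fewer nodes and re-attaching the volumes accordingly. Whether HDFS is happy with volumes moving between data nodes like this is exactly the open question.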
I also need to look into how HDFS deals with adding new volumes: will it just start replicating data onto nodes as they are added to the system? Is there a way to hot-add them rather than restarting the data master? And what about hot-adding EBS volumes onto existing data nodes?
