Thoughts on Hadoop JobTracker/TaskTracker Scheduling

Had a brief, interesting conversation on freenode #hadoop today with Rapleaf Engineer Nathan Marz today about scheduling in Hadoop.

Pretty much supports my sense that scheduling is not Hadoop’s strong suit. It’s really pretty shitty. Would be great to see some more cross-pollination between the Beowulf (SGE, PBS, Globus) and MapReduce (Hadoop, HBase) communities. The former have more mature scheduling, resource management and permissions models. They don’t really do a good job thought with providing a framework for distributed, parallel computing at the application level though — everything is roll-your-own. Perhaps Hadoop could be integrated as a parallel environment to consume resources from a SGE master [1, 2] rather than managing its own mapper/reducer pools.

A less ambitious scheduler improvement is to modify the way the Hadoop scheduler allocates map/reduce resources. The main itch I’m trying to scratch right now has to do with the coupling of map/reduce allocation. There are some cases where it seems this shouldn’t be done. Read the dialog with Nathan below if you care to know more.

allenday is it possible to decouple mapper and reducer slot allocation for jobs?
allenday i mean, if a job is #1 in the MR queue, but it is not yet ready to reduce, can it be prevented from consuming reducer slots?
|<– Smokinn has left irc.freenode.net ()
|<– savage- has left irc.freenode.net (Read error: 110 (Connection timed out))
–>| overlast (n=overlast@19.181.210.220.dy.bbexcite.jp) has joined #hadoop
|<– overlast has left irc.freenode.net (”Leaving…”)
nathanmarz allenday: i think that would be hard… reducing starts while the mapping is happening (copy stage)
allenday nathanmarz, i frequently find that while the reduce has “started”, it can just sit there for a long time doing nothing
allenday this is most common with nutch
allenday so there could be a bunch of other jobs further back in the queue that get starved for reduces b/c the head of the queue is squatting on the slots
nathanmarz it just sits there in the reduce phase?
allenday for sure nutch does, yeah
allenday during fetch, when it crawling
|<– cutting has left irc.freenode.net (”Leaving.”)
nathanmarz i see
nathanmarz i don’t have that much familiarity with nutch
nathanmarz is it possible to increase the number of reducers?
allenday yep, but then you can get into i/o trouble later
nathanmarz for the job i mean, not the cluster
allenday oh
allenday it sounds like you propose having these squatters consume minimal # of reducers (e.g. only 1)
nathanmarz actually, the opposite
nathanmarz let’s say you have 16 reduce slots
nathanmarz and the job i set to use 16 reducers
nathanmarz each one of those reducers potentially has to go over a lot of data
nathanmarz if the job is instead set to use a lot more reducers, like 100 or something
nathanmarz than an individual reducer will go a lot faster
nathanmarz and potentially, those freed reduce slots will go to jobs with higher priority
allenday ok, so you introduce priority to bump the further back ahead in the queue
nathanmarz yea
allenday is that settable in jobconf?
nathanmarz you can set num reducers
–>| tobias_au (n=opera@CPE-121-50-201-65.dsl.OntheNet.net) has joined #hadoop
allenday so let’s suppose the job that squats on reduce slots gets to the head of the queue. regardless of if it has 16 or 100 reducers configured
nathanmarz JobConf#setNumReduceTasks
allenday and that it it still in map phase only. has not begun reducing yet
allenday until one of those reduces finishes (i.e. the map has finished) all slots are still filled
allenday it’s only when the first reduce finishes that the job at #2 can take over a reduce slot
nathanmarz right
nathanmarz yea that’s true
allenday that’s bad
nathanmarz this scheme doesn’t help until mappers finished
allenday you really want this #1 job
allenday when it is allocating reducers
allenday to have low priority in acquiring the slots
nathanmarz right
nathanmarz well you don’t want it to acquire any slots until mappers finish
allenday so you give reduce slots to #2, #3, #4, etc. until everyone who wants slots has them. then you assign to #1
allenday or until #1 is ready…
allenday is it just me or does the queueing system in hadoop kind of suck?
allenday i am coming here from sun grid which puts a lot of emphasis on this aspect
nathanmarz well, the priority system will work if you start job #1 after the other jobs
nathanmarz if you start the other jobs after #1 then they will get starved of reducers
allenday heh, but the whole reason it is in #1 is because it was submitted first, right? isn’t hadoop FIFO wrt jobs?
nathanmarz if they’re the same priority
nathanmarz so maybe decreasing the reducers job #1 uses is the way to go
nathanmarz set it so it doesn’t use all the reduce slots on the cluster
allenday i need to do some research to see if there are jira open for improving the scheduler. or if there are some commercial plugins to improve the scheduling
nathanmarz definitely room for improvement, agreed
allenday yeah, that was what i thought you meant initially. it’s a hack too though, and breaks down when the number of jobs gets large
allenday i’m surprised they are coupled. do you understand how it works when the mapper hands off to the reducer?
allenday b/c i don’t and i need to
nathanmarz yes
allenday can i get the 2min version?
nathanmarz the reason the reducers start while the mappers are running is because there’s some work they can do without all the map data
nathanmarz each reducer needs to copy the relevant outputs from all the mappers to its machine
nathanmarz this is called the “copy” phase and can occur in parallel with mapping
allenday ok, i’ve seen that
allenday so what we need is a flag taht indicates there will be no data to copy until maps all finish
nathanmarz yea
nathanmarz a flag that says not to pipeline the process
allenday default behavior is to have the flag off and copy greedily
allenday which is like it does now
allenday turn the flag on says to wait until upstream map finishes before grabbing a reduce slot and kicking off the copy
allenday **all upstream maps
nathanmarz http://hadoop.apache.org/core/docs/current/hadoop-default.html
nathanmarz those are all the hadoop config parameters
nathanmarz you might be able to find something in there
allenday yeah, i fiind goodies in there every time i read that page :)
allenday i am only ~1mo into hadoop
allenday here’s another scheduling related question/issue i’m having
allenday i find that job i/o and cpu usage tend to synchronize after a while
allenday b/c if there is a slow moving job in the queue, all the others tend to get jammed behind it
allenday have you seen this?
nathanmarz no, i haven’t
nathanmarz but that’s interesting
allenday it comes back to resource (mis)allocation by the scheduler
nathanmarz how are you measuring that?
allenday it’s this same issue where jobs will consume all the slots
allenday so if you have a slow moving thing blocking all the resources
allenday no one else can get past
allenday then when the slow moving job finishes, the others all start getting processed very quickly (high cpu load during map), then as they begin to finish there is a flurry of i/o
allenday it’s like congestion on the freeway where one car slams on the breaks it sends this wave of traffic jam behind it
allenday assuming the freeway is already close to capacity (not sparse)