Had a brief, interesting conversation on freenode #hadoop today with Rapleaf Engineer Nathan Marz today about scheduling in Hadoop.
Pretty much supports my sense that scheduling is not Hadoop’s strong suit. It’s really pretty shitty. Would be great to see some more cross-pollination between the Beowulf (SGE, PBS, Globus) and MapReduce (Hadoop, HBase) communities. The former have more mature scheduling, resource management and permissions models. They don’t really do a good job thought with providing a framework for distributed, parallel computing at the application level though — everything is roll-your-own. Perhaps Hadoop could be integrated as a parallel environment to consume resources from a SGE master [1, 2] rather than managing its own mapper/reducer pools.
A less ambitious scheduler improvement is to modify the way the Hadoop scheduler allocates map/reduce resources. The main itch I’m trying to scratch right now has to do with the coupling of map/reduce allocation. There are some cases where it seems this shouldn’t be done. Read the dialog with Nathan below if you care to know more.
| allenday | is it possible to decouple mapper and reducer slot allocation for jobs? | |||||
| allenday | i mean, if a job is #1 in the MR queue, but it is not yet ready to reduce, can it be prevented from consuming reducer slots? | |||||
| |<– | Smokinn has left irc.freenode.net () | |||||
| |<– | savage- has left irc.freenode.net (Read error: 110 (Connection timed out)) | |||||
| –>| | overlast (n=overlast@19.181.210.220.dy.bbexcite.jp) has joined #hadoop | |||||
| |<– | overlast has left irc.freenode.net (”Leaving…”) | |||||
| nathanmarz | allenday: i think that would be hard… reducing starts while the mapping is happening (copy stage) | |||||
| allenday | nathanmarz, i frequently find that while the reduce has “started”, it can just sit there for a long time doing nothing | |||||
| allenday | this is most common with nutch | |||||
| allenday | so there could be a bunch of other jobs further back in the queue that get starved for reduces b/c the head of the queue is squatting on the slots | |||||
| nathanmarz | it just sits there in the reduce phase? | |||||
| allenday | for sure nutch does, yeah | |||||
| allenday | during fetch, when it crawling | |||||
| |<– | cutting has left irc.freenode.net (”Leaving.”) | |||||
| nathanmarz | i see | |||||
| nathanmarz | i don’t have that much familiarity with nutch | |||||
| nathanmarz | is it possible to increase the number of reducers? | |||||
| allenday | yep, but then you can get into i/o trouble later | |||||
| nathanmarz | for the job i mean, not the cluster | |||||
| allenday | oh | |||||
| allenday | it sounds like you propose having these squatters consume minimal # of reducers (e.g. only 1) | |||||
| nathanmarz | actually, the opposite | |||||
| nathanmarz | let’s say you have 16 reduce slots | |||||
| nathanmarz | and the job i set to use 16 reducers | |||||
| nathanmarz | each one of those reducers potentially has to go over a lot of data | |||||
| nathanmarz | if the job is instead set to use a lot more reducers, like 100 or something | |||||
| nathanmarz | than an individual reducer will go a lot faster | |||||
| nathanmarz | and potentially, those freed reduce slots will go to jobs with higher priority | |||||
| allenday | ok, so you introduce priority to bump the further back ahead in the queue | |||||
| nathanmarz | yea | |||||
| allenday | is that settable in jobconf? | |||||
| nathanmarz | you can set num reducers | |||||
| –>| | tobias_au (n=opera@CPE-121-50-201-65.dsl.OntheNet.net) has joined #hadoop | |||||
| allenday | so let’s suppose the job that squats on reduce slots gets to the head of the queue. regardless of if it has 16 or 100 reducers configured | |||||
| nathanmarz | JobConf#setNumReduceTasks | |||||
| allenday | and that it it still in map phase only. has not begun reducing yet | |||||
| allenday | until one of those reduces finishes (i.e. the map has finished) all slots are still filled | |||||
| allenday | it’s only when the first reduce finishes that the job at #2 can take over a reduce slot | |||||
| nathanmarz | right | |||||
| nathanmarz | yea that’s true | |||||
| allenday | that’s bad | |||||
| nathanmarz | this scheme doesn’t help until mappers finished | |||||
| allenday | you really want this #1 job | |||||
| allenday | when it is allocating reducers | |||||
| allenday | to have low priority in acquiring the slots | |||||
| nathanmarz | right | |||||
| nathanmarz | well you don’t want it to acquire any slots until mappers finish | |||||
| allenday | so you give reduce slots to #2, #3, #4, etc. until everyone who wants slots has them. then you assign to #1 | |||||
| allenday | or until #1 is ready… | |||||
| allenday | is it just me or does the queueing system in hadoop kind of suck? | |||||
| allenday | i am coming here from sun grid which puts a lot of emphasis on this aspect | |||||
| nathanmarz | well, the priority system will work if you start job #1 after the other jobs | |||||
| nathanmarz | if you start the other jobs after #1 then they will get starved of reducers | |||||
| allenday | heh, but the whole reason it is in #1 is because it was submitted first, right? isn’t hadoop FIFO wrt jobs? | |||||
| nathanmarz | if they’re the same priority | |||||
| nathanmarz | so maybe decreasing the reducers job #1 uses is the way to go | |||||
| nathanmarz | set it so it doesn’t use all the reduce slots on the cluster | |||||
| allenday | i need to do some research to see if there are jira open for improving the scheduler. or if there are some commercial plugins to improve the scheduling | |||||
| nathanmarz | definitely room for improvement, agreed | |||||
| allenday | yeah, that was what i thought you meant initially. it’s a hack too though, and breaks down when the number of jobs gets large | |||||
| allenday | i’m surprised they are coupled. do you understand how it works when the mapper hands off to the reducer? | |||||
| allenday | b/c i don’t and i need to | |||||
| nathanmarz | yes | |||||
| allenday | can i get the 2min version? | |||||
| nathanmarz | the reason the reducers start while the mappers are running is because there’s some work they can do without all the map data | |||||
| nathanmarz | each reducer needs to copy the relevant outputs from all the mappers to its machine | |||||
| nathanmarz | this is called the “copy” phase and can occur in parallel with mapping | |||||
| allenday | ok, i’ve seen that | |||||
| allenday | so what we need is a flag taht indicates there will be no data to copy until maps all finish | |||||
| nathanmarz | yea | |||||
| nathanmarz | a flag that says not to pipeline the process | |||||
| allenday | default behavior is to have the flag off and copy greedily | |||||
| allenday | which is like it does now | |||||
| allenday | turn the flag on says to wait until upstream map finishes before grabbing a reduce slot and kicking off the copy | |||||
| allenday | **all upstream maps | |||||
| nathanmarz | http://hadoop.apache.org/core/docs/current/hadoop-default.html | |||||
| nathanmarz | those are all the hadoop config parameters | |||||
| nathanmarz | you might be able to find something in there | |||||
| allenday | yeah, i fiind goodies in there every time i read that page |
|||||
| allenday | i am only ~1mo into hadoop | |||||
| allenday | here’s another scheduling related question/issue i’m having | |||||
| allenday | i find that job i/o and cpu usage tend to synchronize after a while | |||||
| allenday | b/c if there is a slow moving job in the queue, all the others tend to get jammed behind it | |||||
| allenday | have you seen this? | |||||
| nathanmarz | no, i haven’t | |||||
| nathanmarz | but that’s interesting | |||||
| allenday | it comes back to resource (mis)allocation by the scheduler | |||||
| nathanmarz | how are you measuring that? | |||||
| allenday | it’s this same issue where jobs will consume all the slots | |||||
| allenday | so if you have a slow moving thing blocking all the resources | |||||
| allenday | no one else can get past | |||||
| allenday | then when the slow moving job finishes, the others all start getting processed very quickly (high cpu load during map), then as they begin to finish there is a flurry of i/o | |||||
| allenday | it’s like congestion on the freeway where one car slams on the breaks it sends this wave of traffic jam behind it | |||||
| allenday | assuming the freeway is already close to capacity (not sparse) | |||||
Post a Comment