
Working Through Data-Intensive Text Processing With MapReduce

Local Aggregation

At a very high level, when Mappers emit data, the intermediate results are written to disk and then sent across the network to Reducers for final processing. Reducing the amount of intermediate data sent over the network should therefore improve the overall running time of a MapReduce job, and that is what local aggregation aims to do. We are going to consider three ways of achieving local aggregation:

1. Using Hadoop combiner functions.
2. In-mapper combining, option 1: aggregating within a single call to the map method.
3. In-mapper combining, option 2: aggregating across all calls to the map method.

Combiners

A combiner function is an object that extends the Reducer class; it aggregates the intermediate map output locally before it is sent across the network to the reducers. A combiner function is specified when setting up the MapReduce job like so:
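
The following is a minimal sketch of such a job setup, assuming the Hadoop 2.x mapreduce API. The WordCountDriver, WordCountMapper, and WordCountReducer class names are placeholders for illustration; job.setCombinerClass() is the actual hook in org.apache.hadoop.mapreduce.Job:

// Sketch of wiring a combiner into a word-count job. Class names other
// than the Hadoop API types are assumed for illustration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        // Reuse the reducer as the combiner: summing counts is associative
        // and commutative, so combining partial sums is safe.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}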

In Mapper Combining Option 1

The first in-mapper combining option performs the aggregation inside each call to the map method. In the example sketched below, a map is created and its contents are dumped over the wire for each invocation of the map method.
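
Here is a minimal sketch of this option; the class name and the whitespace tokenization are assumptions for illustration:

// In-mapper combining, option 1: token counts are aggregated per call to
// map(), so a fresh HashMap is built and emitted for every input line.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PerCallCombiningMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);  // aggregate within this call only
            }
        }
        // Emit the partial counts; the map is discarded after each call.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
        }
    }
}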

In Mapper Combining Option 2

The second option of in-mapper combining hoists the aggregation out of the individual map calls: in this example we are going to make the map an instance variable and shift the instantiation of the map to the setup method of our mapper.
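
A minimal sketch of this option, again with assumed names, might look like the following; the important pieces are the instance-level map, its creation in setup, and the final emission in cleanup:

// In-mapper combining, option 2: the HashMap lives for the lifetime of the
// mapper task, so counts accumulate across every call to map() and are
// emitted once, in cleanup().
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TaskLevelCombiningMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Map<String, Integer> counts;

    @Override
    protected void setup(Context context) {
        counts = new HashMap<>();  // one map per mapper task, not per call
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit the fully aggregated counts once all input has been processed.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
        }
    }
}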


Likewise, the contents of the map will not be sent out to the reducers until all of the calls to the map method have completed and the cleanup method is called. As we can see from the above code example, the mapper keeps track of unique word counts across all calls to the map method. By keeping track of unique tokens and their counts, there should be a substantial reduction in the number of records sent to the reducers, which in turn should improve the running time of the MapReduce job.

This accomplishes the same effect as using the combiner function option provided by the MapReduce framework, but in this case you are guaranteed that the combining code will be called, whereas the framework treats combiners as an optional optimization. There are caveats with this approach as well: by keeping state across all calls to the map method, depending on the data used in the job, memory could become another issue to contend with.
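
One way to contend with that memory growth (this mitigation is not from the original post, and FLUSH_THRESHOLD is a made-up tuning knob) is to emit and clear the map whenever it grows past a bound, trading some aggregation for predictable memory use:

// Sketch of bounding the memory used by in-mapper combining: flush the
// partial counts whenever the map grows past an arbitrary threshold.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BoundedMemoryCombiningMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int FLUSH_THRESHOLD = 100_000;  // hypothetical limit
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        if (counts.size() > FLUSH_THRESHOLD) {
            flush(context);  // trade some aggregation for bounded memory
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        flush(context);  // emit whatever remains
    }

    private void flush(Context context) throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
        }
        counts.clear();
    }
}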


Ultimately, one would have to weigh all of the trade-offs to determine the best approach. Now let's take a look at some results from the different mappers. Since the job was run in pseudo-distributed mode, actual running times are irrelevant, but we can still infer how using local aggregation could impact the efficiency of a MapReduce job running on a real cluster.

Results

As expected, the mapper that did no combining had the worst results, followed closely by the first in-mapper combining option (although these results could have been improved had the data been cleaned up before running the word count). The second in-mapper combining option and the combiner function had virtually identical results. Reducing the number of bytes sent over the network to the reducers by that amount would surely have a positive impact on the efficiency of a MapReduce job. As you can see, the benefits of using either in-mapper combining or the Hadoop combiner function deserve serious consideration when looking to improve the performance of your MapReduce jobs.

As for which approach to take, it is up to you to weigh the trade-offs of each.

Resources

Hadoop: The Definitive Guide by Tom White.
Project Gutenberg, a great source of books in plain text format, useful for testing Hadoop jobs locally.
