
Thinking MapReduce with Hadoop
Posted by Maarten Winkels in the early morning: July 2nd, 2009

Apache Hadoop promises "a software platform that lets one easily write and run applications that process vast amounts of data". Sure enough, when reading the documentation, descriptions like:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

are simple enough to read and understand, but how do you apply MapReduce to a problem you face in a real-life project?

This blog tries to give some insight into how to apply MapReduce with Hadoop.

What is Hadoop?

Hadoop is basically two things:

  1. A distributed file system -- HDFS
  2. A MapReduce framework that allows algorithms to work on data in the distributed file system in parallel

The Hadoop Distributed File System (HDFS) is really the heart of Hadoop. It provides scalability, reliability and performance at a low cost. The system is designed to run on commodity hardware. Although the system is written in Java, there are other ways to access and use it.

MapReduce is a software framework that allows computation to run on a cluster. It uses HDFS for data proximity: the computation is distributed and run in parallel on the cluster, and each process reads and processes data that is available locally on its node. This gives a major performance boost. Furthermore, the framework provides reliability, because processes that fail are automatically restarted on other nodes.

When to apply MapReduce?

MapReduce is only useful for systems that process large amounts of data. There is an overhead for starting tasks and there is always the network overhead. When talking about large amounts of data we mean GBs and above rather than MBs.

Another important aspect is the data usage pattern in your application. If you need to 'randomly' read and write data, for example based on a certain request coming in, MapReduce cannot help you. MapReduce really shines when data is read in a batch-like streaming manner.

The final requirement for using MapReduce is that the algorithm can be described as a map-and-reduce process. In this blog I want to focus on this last aspect.

How to apply MapReduce? - Another example

So how do you describe an algorithm in a MapReduce manner? To illustrate, nothing works better than an example. As with every example that I have seen for Hadoop, it is a bit academic. What I'm trying to explain is how a MapReduce algorithm is different from a normal approach and how to go about designing that algorithm.

The main thing with a MapReduce algorithm is that it reasons about <key, value> pairs all along, from the input format to the output format, if necessary using synthetic keys. If your input is a simple flat file, Hadoop will by default break it up at line ends and provide the offset into the file as key and the line as value. The main strength of the approach lies in the fact that between the map and the reduce phase, the framework sorts the data by key. It then provides all data with the same key to the same reducer instance. Any successful MapReduce algorithm should leverage this mechanism.
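To make the grouping step a bit more tangible, here is a rough plain-Java sketch of it. It does not use Hadoop at all: the class ShuffleSketch, the toy map function (which emits the first letter of each word as key) and the sample input are all made up for illustration, and the offset calculation assumes one byte per character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the sort-and-group step; it does not use Hadoop.
class ShuffleSketch {

  // Stand-in for a map function: emits <first letter, word> for one input line.
  // The offset key is ignored here, as it often is in real mappers.
  static List<String[]> map(long offset, String line) {
    List<String[]> pairs = new ArrayList<>();
    for (String word : line.split("\\s+")) {
      if (!word.isEmpty()) {
        pairs.add(new String[] { word.substring(0, 1), word });
      }
    }
    return pairs;
  }

  // Stand-in for the framework: feeds <offset, line> records to map and
  // groups the emitted values by key, with the keys in sorted order.
  static Map<String, List<String>> shuffle(String input) {
    Map<String, List<String>> grouped = new TreeMap<>();
    long offset = 0;
    for (String line : input.split("\n")) {
      for (String[] pair : map(offset, line)) {
        grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
      }
      offset += line.length() + 1; // +1 for the line end; assumes one byte per character
    }
    return grouped;
  }

  public static void main(String[] args) {
    // Each entry of the resulting map is what a single reduce call would receive.
    System.out.println(shuffle("banana apple\nblueberry avocado cherry"));
    // prints {a=[apple, avocado], b=[banana, blueberry], c=[cherry]}
  }
}
```

In a real job the grouped entries are spread over several reducers, but every value for a given key still ends up in the same reduce call.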

The problem - Finding Anagrams
Say you have to find anagrams in a very large input file. How would you implement this?

I think a first attempt would have some sort of function like this:

 
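  public static boolean isAnagram(String first, String second) {
    // Two words are anagrams when they contain the same characters, each equally often.
    if (first.length() != second.length()) return false;
    int[] counts = new int[Character.MAX_VALUE + 1];
    for (char c : first.toCharArray()) counts[c]++;
    for (char c : second.toCharArray()) if (--counts[c] < 0) return false;
    return true;
  }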
 

The application would have to somehow execute this function on all pairs of words in the input. However fast this method would be, the overall execution would still take quite some time.
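To get a feeling for "quite some time": comparing every word with every other word once takes n * (n - 1) / 2 calls for n words. The small sketch below (class and method names are made up for this illustration) counts the calls both with that formula and by actually walking all pairs.

```java
import java.util.Arrays;
import java.util.List;

class PairwiseCost {

  // Number of isAnagram calls needed to compare every word with every other word once.
  static long comparisons(long words) {
    return words * (words - 1) / 2;
  }

  // The same count obtained by walking all pairs, to show where the formula comes from.
  static long countByLooping(List<String> words) {
    long calls = 0;
    for (int i = 0; i < words.size(); i++) {
      for (int j = i + 1; j < words.size(); j++) {
        calls++; // one isAnagram(words.get(i), words.get(j)) call would go here
      }
    }
    return calls;
  }

  public static void main(String[] args) {
    System.out.println(countByLooping(Arrays.asList("a", "b", "c", "d"))); // prints 6
    System.out.println(comparisons(1_000_000L)); // prints 499999500000
  }
}
```

So a million words already means roughly half a trillion comparisons, no matter how fast a single comparison is.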

Hadoopifying...
How do you now design a MapReduce algorithm that will give the desired answer? The key lies in finding a function that will produce the same key for all words that are anagrams. Applying this in the map phase will use the power of the MapReduce framework to deliver all words that are anagrams to the same reducer. The solution, when found, is deceptively simple as usual:

 
  public static String sortCharacters(String input) {
    char[] cs = input.toCharArray();
    Arrays.sort(cs);
    return new String(cs);
  }
 

By sorting all the characters in all the words in the input, all anagrams will have the same key:

  aspired -> adeiprs
  despair -> adeiprs

Now the list of characters to the right has no meaning, but all anagrams will have exactly the same result for this function.
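Before involving Hadoop, the key function can be tried out on a single machine. The following is only a sketch under that assumption: the class LocalAnagrams and the word list are invented for the example, and the grouping is done with the standard Java streams API instead of the MapReduce framework.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class LocalAnagrams {

  // Same key function as above: the characters of the word in sorted order.
  static String sortCharacters(String input) {
    char[] cs = input.toCharArray();
    Arrays.sort(cs);
    return new String(cs);
  }

  // Groups words by their sorted-character key in one pass over the input.
  static Map<String, List<String>> groupByKey(List<String> words) {
    return words.stream().collect(
        Collectors.groupingBy(LocalAnagrams::sortCharacters, TreeMap::new, Collectors.toList()));
  }

  public static void main(String[] args) {
    System.out.println(groupByKey(Arrays.asList("aspired", "despair", "praised", "hadoop")));
    // prints {adeiprs=[aspired, despair, praised], adhoop=[hadoop]}
  }
}
```

Every word is looked at exactly once, and groups with more than one word are the anagrams.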

Implementation
Once the algorithm is found the implementation using Hadoop is quite straightforward and simple (though pretty long...).

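import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AnagramFinder extends Configured implements Tool {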

  public static class Mapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, Text> {

    private Text sortedText = new Text();
    private Text outputValue = new Text();

    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString(),
          " \t\n\r\f,.:()!?", false);
      while (tokenizer.hasMoreTokens()) {
        String token = tokenizer.nextToken().trim().toLowerCase();
        sortedText.set(sort(token));
        outputValue.set(token);
        context.write(sortedText, outputValue);
      }
    }

    protected String sort(String input) {
      char[] cs = input.toCharArray();
      Arrays.sort(cs);
      return new String(cs);
    }

  }

  public static class Combiner extends org.apache.hadoop.mapreduce.Reducer<Text, Text, Text, Text> {

    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
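      // Hadoop reuses one Text instance for all values, so track the String contents instead.
      Set<String> uniques = new HashSet<String>();
      for (Text value : values) {
        if (uniques.add(value.toString())) {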
          context.write(key, value);
        }
      }
    }
  }

  public static class Reducer extends org.apache.hadoop.mapreduce.Reducer<Text, Text, IntWritable, Text> {

    private IntWritable count = new IntWritable();
    private Text outputValue = new Text();

    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
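      // Hadoop reuses one Text instance for all values, so track the String contents instead.
      Set<String> uniques = new HashSet<String>();
      int size = 0;
      StringBuilder builder = new StringBuilder();
      for (Text value : values) {
        if (uniques.add(value.toString())) {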
          size++;
          builder.append(value.toString());
          builder.append(',');
        }
      }
      builder.setLength(builder.length() - 1);

      if (size > 1) {
        count.set(size);
        outputValue.set(builder.toString());
        context.write(count, outputValue);
      }
    }

  }

  public int run(String[] args) throws Exception {
    Path inputPath = new Path(args[0]);
    Path outputPath = new Path(args[1]);

    Job job = new Job(getConf(), "Anagram Finder");

    job.setJarByClass(AnagramFinder.class);

    FileInputFormat.setInputPaths(job, inputPath);
    FileOutputFormat.setOutputPath(job, outputPath);

    job.setMapperClass(Mapper.class);
    job.setCombinerClass(Combiner.class);
    job.setReducerClass(Reducer.class);

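    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);

    // The reducer emits <IntWritable, Text>, so declare the final output types as well.
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);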

    return job.waitForCompletion(false) ? 0 : -1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new AnagramFinder(), args));
  }
}

The main parts of the implementation are the following:

Mapper - Breaks up the input text into tokens (filtering some common punctuation marks) and applies the character sorting to arrive at the required key.
Combiner (optional) - Removes duplicate values from the input.
Reducer - Collects anagrams and outputs the number of anagrams (key) and all the words concatenated (value).
Main and Run - This code configures the job to run on the MapReduce framework.

The Combiner is used to do some preprocessing for the reducer. The main reason for this is that results from Mappers that run on different nodes are processed on the same node and thus have to travel over the network. To minimize network load, the Combiner can reduce the number of <key, value> pairs that have to be processed, as shown here by filtering out duplicates. The process cannot, however, rely on all <key, value> pairs for a key being processed by a single Combiner, so the Reducer also has to remove duplicates.
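The effect can be simulated in plain Java. The sketch below does not use Hadoop; the class CombinerSketch, its method names and the two mapper outputs are invented for the illustration. It shows that duplicates removed per mapper can reappear once the outputs of several mappers meet at the reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class CombinerSketch {

  // Stand-in for the Combiner: removes duplicates from one mapper's values for a key.
  static List<String> combine(List<String> values) {
    return new ArrayList<>(new LinkedHashSet<>(values));
  }

  // Stand-in for the shuffle: each mapper output is combined on its own,
  // and the results are concatenated into the reducer's input.
  static List<String> reducerInput(List<List<String>> mapperOutputs) {
    List<String> merged = new ArrayList<>();
    for (List<String> output : mapperOutputs) {
      merged.addAll(combine(output));
    }
    return merged;
  }

  public static void main(String[] args) {
    // Values emitted for the key "adeiprs" by two mappers on different nodes.
    List<String> mapperA = Arrays.asList("aspired", "despair", "aspired");
    List<String> mapperB = Arrays.asList("despair", "praised");

    List<String> input = reducerInput(Arrays.asList(mapperA, mapperB));
    System.out.println(input); // prints [aspired, despair, despair, praised]

    // The reducer therefore has to remove duplicates once more.
    System.out.println(combine(input)); // prints [aspired, despair, praised]
  }
}
```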

It is interesting to see that the concept Anagram doesn't materialize anywhere in this code. The fact that the code finds anagrams follows from the fact that all anagrams will have the same value from the sort function and that is used as the map output key. This might be quite confusing for readers.

Conclusion

The main challenge posed by Hadoop is coming up with a good algorithm for MapReduce applications. The algorithm will mostly be the result of the whole MapReduce process and might not be easy to understand from the code. This is because some of the functionality that the framework provides might be key to the algorithm. Good documentation that describes the whole process is vital to overcome this problem. Once an algorithm is designed, implementing it in Hadoop is quite straightforward.


Tags: hadoop, mapreduce
Filed under hadoop | 2 Comments »



2 Responses to “Thinking MapReduce with Hadoop”



    Jonas Bandi Says:
    Posted at: July 2, 2009 at 11:27 am

    Thanks a lot for this nice introduction.
    I never had time or need to find my way into this topic. This was a nice kick-start.



    Lars Vonk Says:
    Posted at: July 3, 2009 at 9:07 am

    Good read, thanks Maarten.


