hadoop

Automated Export of Cloudera Manager Configuration for Hadoop

Joris Bontje

Cloudera Manager is a web-based management application for your Apache Hadoop cluster. It makes the installation and configuration of your Hadoop cluster a whole lot easier and is free for clusters of up to 50 nodes. In particular I like the suggested configuration settings based on your cluster hardware.

All the configuration settings of Cloudera Manager are persisted in the configuration database, which can be manually exported through the admin interface. One of our clients wanted to export these settings programmatically for auditing and backup purposes.

Currently there isn’t an automated way to do that, besides backing up the entire database. Here is a little shell script that allows you to download the configuration automatically in text format.


Sentiment Analysis using Apache Hive

Joris Bontje

Apache Hive is a data warehouse system built on top of Hadoop. Using a SQL-like language you can query data stored in the Hadoop filesystem (HDFS). Those queries are then translated into MapReduce jobs and executed on your cluster.

As an example we’ll analyze tweets from the Twitter Streaming logs and calculate the top 5 hashtags per day which are associated with positive sentiment signals (smileys).

You can imagine how this can be expanded into simple sentiment analysis of your (potential) customer feedback.
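To make the idea concrete, here is a minimal sketch of the same calculation in plain Python rather than Hive. The input shape (pairs of day and tweet text), the list of smileys, and the function name `top_hashtags_per_day` are assumptions made for illustration; they are not taken from the Twitter Streaming log format or from the Hive queries in the full post.

```python
import re
from collections import Counter, defaultdict

# Assumed set of positive sentiment signals; extend as needed.
POSITIVE_SMILEYS = (":)", ":-)", ":D", ";)")
HASHTAG = re.compile(r"#(\w+)")


def top_hashtags_per_day(tweets, n=5):
    """Return the n most frequent hashtags per day, counting only
    tweets that contain at least one positive smiley.

    `tweets` is an iterable of (day, text) pairs (hypothetical format).
    """
    per_day = defaultdict(Counter)
    for day, text in tweets:
        if any(smiley in text for smiley in POSITIVE_SMILEYS):
            # Normalize case so #Hadoop and #hadoop count as one tag.
            per_day[day].update(tag.lower() for tag in HASHTAG.findall(text))
    return {day: counter.most_common(n) for day, counter in per_day.items()}


sample = [
    ("2012-05-01", "Loving #hadoop :)"),
    ("2012-05-01", "#Hadoop and #hive rock :D"),
    ("2012-05-01", "#hive is slow today :("),
    ("2012-05-02", "Coffee time ;) #monday"),
]
result = top_hashtags_per_day(sample)
```

In Hive the filter, the grouping by day and the top-5 selection would be expressed as a query and run as MapReduce jobs on the cluster; the sketch only shows the logic on a handful of in-memory records.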


Going NoSQL at RIPE NCC

Friso van Vollenhoven

At Xebia, we show a lot of interest in the developing NoSQL community and all the great software and solutions that result from it. Big data analysis and high-traffic websites and applications are here to stay, and we need solutions capable of dealing with those. The commodity stack of some flavor of relational database with a Java app server on top and the stateful model of server-side sessions just doesn't cut it in some cases. As data volume and traffic grow, these cases will present themselves increasingly often. In our App Incubator program we see a lot of interest in non-relational databases and stateless server-side setups with more logic on the client side (cleverly coined: NoJSP). At our clients, too, the problem of ever-growing data sets and the lack of options to do proper analysis with existing tools and databases is starting to arise. One of these clients is the RIPE NCC. The story is roughly this: about 80GB of data comes in per day and there is ten years of historical data of the same kind and volume; we need to run queries against this and get sub-second answers. We solve this with Hadoop and HBase.

Thinking MapReduce with Hadoop

Maarten Winkels

Apache Hadoop promises “a software platform that lets one easily write and run applications that process vast amounts of data”. Sure enough, when reading the documentation, descriptions like:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

are simple enough to read and understand, but how do you apply MapReduce to a problem you face in a real-life project?

This blog tries to give some insight into how to apply MapReduce with Hadoop.
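As a rough illustration of the map, combine and reduce flow quoted above, the following is a minimal single-process sketch in Python, using word counting as the example problem. It does not use the Hadoop API; the function names `map_fn`, `combine_fn`, `reduce_fn` and `run_job` are hypothetical, and the shuffle step that Hadoop performs between mappers and reducers is simulated with a dictionary.

```python
from collections import defaultdict
from itertools import chain


def map_fn(_key, line):
    # <k1, v1> = (line number, line text) -> list of <k2, v2> = (word, 1)
    return [(word.lower(), 1) for word in line.split()]


def combine_fn(pairs):
    # Local pre-aggregation of one mapper's output: <k2, v2> -> <k2, v2>
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def reduce_fn(key, values):
    # <k2, [v2, ...]> -> <k3, v3> = (word, total count)
    return key, sum(values)


def run_job(records):
    # Map and combine each input record independently, as a mapper would.
    combined = [combine_fn(map_fn(k, v)) for k, v in records]
    # Simulated shuffle: group all values by key across mappers.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(combined):
        groups[key].append(value)
    # Reduce each key group to a single output pair.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


counts = run_job([(0, "to be or not to be"), (1, "to do")])
```

The combiner here has the same shape as the reducer, which works because summing is associative and commutative; for other problems the combine step may need to differ or be left out.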
