PhantomJS data mining & BASH data analysis

Joshua Appelman

CasperJS

As a moderately large company we rent mailboxes for our employees at a hosting provider: a lot of mailboxes. These come in varying sizes, and naturally the larger you go, the more expensive they become.

The other day I received an email requesting several new accounts, and while creating them I came across what seemed to be a rather inefficient allocation. The user had a mid-size tier, costing about €150 per year, while he could seemingly make do with the very smallest tier of about €50 annually.

This, of course, made me curious about our other allocations, so I went looking for an overview of all our mail accounts' usage. No such luck. The only way to see how much of the rented space was actually being used was by navigating the (non-REST, stateful) web interface of our hosting provider and looking up the statistics for each user individually.

Challenge accepted!

I've grown tired of using Selenium lately and have been meaning to look into some of the PhantomJS-based alternatives. My eye had fallen on CasperJS, and I decided to give it a spin.

Using brew I downloaded the latest development release:

$ brew update
$ brew install casperjs --devel

The script in its entirety can be found in this gist, but let's walk through it section by section:

var casper = require('casper').create({ verbose: true, logLevel: 'info' });
var credentials = JSON.parse(require('fs').read('./credentials.json'));
var url = 'private';

casper.start(url + '/user/login', function() {
  this.fill('form#login_form', credentials, true);
});

The first line initialises CasperJS with some logging enabled. The second line reads in a simple JSON file containing the field names and values of the login form, while the third line contains the base URL of our hosting provider.

In the next section, Casper is told to navigate to said URL's login page, fill in the specified form with the credentials, and submit. It's that easy!
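For completeness, credentials.json simply maps the login form's input names to their values. The field names below are hypothetical; use whatever the provider's form actually calls its inputs:

```shell
# Write a credentials file for casper.fill(). The keys must match
# the name attributes of the login form's inputs; "username" and
# "password" here are made-up placeholders.
cat > credentials.json <<'EOF'
{
  "username": "admin@example.com",
  "password": "s3cret"
}
EOF
```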

casper.thenOpen(url + '/subscription/config/xebia.com/exchange_mailbox', function() {
  this.getElementsInfo('tr td a').forEach(function (node) {
    if (node.attributes.nicetitle === "View") {

All logged in, it's time to navigate to the exchange's overview page. Every user's account details are linked to from here, in a node with the attribute nicetitle="View", and naturally we want to iterate over these. This is where a small hitch in the plan was encountered: the HTML is completely unstructured, simply a table of varying dimensions with label/value pairs. I decided to postpone the problem and, for now, simply fetch the entire element:

      casper.thenOpen(url + node.attributes.href, function() {
        require('fs').write('output', JSON.stringify(this.getElementInfo('div.contentleft').html), 'a');
      });
    }
  });
});

Ending it all with a:

casper.run();

It's time to dive in to the console and give it a spin:

$ casperjs fetch.js

Excellent! Casper is spinning along, discovering and fetching the data, and I can see a tail of the generated output file streaming in. Unreadable, but the data is all there, bringing us nicely to the second topic of this post.

BASH data analysis:

To begin with, let's put this malformed html through tidy. Since we're not interested in the many warnings tidy will give us, we'll redirect stderr to /dev/null, yielding us:

tidy <fetched 2>/dev/null
 \n\t\t\t\t</pre>
<div class=\"clear\"></div>
<pre>\n\t\t</pre>
<div>\n\t\t\t<label>Current mailbox size</label>\n\t\t\t1921
MB\n\t\t</div>
<pre>\n\t\t</pre>
<div>\n\t\t\t<label>Warning quota</label>\n\t\t\t2250
MB\n\t\t</div>
<pre>\n\t\t</pre>
<div>\n\t\t\t<label>Block send quota</label>\n\t\t\t2375

Since, as mentioned before, all interesting fields contain a 'label' tag, let's grep for those:

tidy <fetched 2>/dev/null | grep label
<div>\n\t\t\t<label><label>Email
</label></label>
<div>\n\t\t\t<label>Email aliases</label>\n\t\t\t
<div>\n\t\t\t<label>Current mailbox size</label>\n\t\t\t1921
<div>\n\t\t\t<label>Warning quota</label>\n\t\t\t2250
<div>\n\t\t\t<label>Block send quota</label>\n\t\t\t2375
<div>\n\t\t\t<label>Block send and receive
quota</label>\n\t\t\t2500 MB\n\t\t</div>
<div>\n\t\t\t<label>Pop enabled</label>\n\t\t\t<img alt="" src="<br" />

Okay, this is starting to look like something. Let's trim away everything before the closing label tags, and remove the \n and \t characters.

tidy <fetched 2>/dev/null | grep label | sed 's/^.*\/label>//' |
sed 's/\\[nt]//g'
<div><label><label>Exchange
</label></label>
<div><label>SMTP
1921
2250
2375
2500 MB</label></div>
<label><label>
<img alt="" src="<br" /><img alt="" src="<br" />...

Let's filter out some of the uninteresting lines to get:

tidy <fetched 2>/dev/null | grep label | sed 's/^.*\/label>//' |
sed 's/\\[nt]//g' | grep -v 'label\|img\|<br>\|HOSTED'
Sunil Prakash
Sunil Prakash
REDACTED</pre>
</div>
REDACTED</div>
1921
2250
2375
2500 MB

Removing the trailing markup and the MB units, we finally have sanitised data:

tidy <fetched 2>/dev/null | grep label | sed 's/^.*\/label>//' |
sed 's/\\[nt]//g' | grep -v 'label\|img\|<br>\|HOSTED' | sed 's/<.*//' | sed 's/ MB$//'
Sunil Prakash
Sunil Prakash
REDACTED
REDACTED
1921
2250
2375
2500
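As a sanity check, the whole chain (minus tidy and grep) can be replayed on a single synthetic line. Note that the \n and \t in the fetched file are literal backslash sequences, not real whitespace, which is why the sed pattern escapes the backslash and uses a bracket expression:

```shell
# One made-up 'fetched' line through the same cleanup chain:
# strip everything up to the closing label tag, drop the literal
# \n/\t escape sequences, cut trailing markup, drop the MB unit.
printf '%s\n' '<div><label>Current mailbox size</label>\n\t\t\t1921 MB\n\t\t</div>' |
  sed 's/^.*\/label>//' |
  sed 's/\\[nt]//g' |
  sed 's/<.*//' |
  sed 's/ MB$//'
# prints: 1921
```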

We'll put this in a file called 'data'. Pasting these lines together in sets of eight, separated by commas and with empty fields padded with a period, we get:

cat data | paste -d , - - - - - - - - | sed 's/,,/,.,/' | head -n 2
REDACTED,REDACTED,REDACTED,REDACTED,1921,2250,2375,2500
REDACTED,REDACTED,REDACTED,REDACTED,40,2250,2375,2500
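paste reads one line from stdin for every `-` operand, so eight dashes fold every eight lines into one record. A toy example with two dashes (and made-up letters) shows the mechanics:

```shell
# Two '-' placeholders: paste consumes two lines per output record,
# joining them with the -d delimiter.
printf 'a\nb\nc\nd\n' | paste -d, - -
# a,b
# c,d
```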

Piping this through awk, setting the field delimiter to a comma and appending the fifth field (the current size) divided by the eighth (the largest quota) as a percentage, results in:

cat data | paste -d , - - - - - - - - | sed 's/,,/,.,/' | 
awk -F, '{print $0 "," $5/$8*100"%" }' | head -n 2
REDACTED,REDACTED,REDACTED,REDACTED,1921,2250,2375,2500,76.84% 
REDACTED,REDACTED,REDACTED,REDACTED,40,2250,2375,2500,1.6%
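The awk step can be checked on one hand-written record (the fields are invented, but laid out like the data above):

```shell
# Append field 5 (current size) divided by field 8 (largest quota),
# times 100, as a utilisation percentage.
printf 'a,b,c,d,1921,2250,2375,2500\n' |
  awk -F, '{ print $0 "," $5/$8*100 "%" }'
# a,b,c,d,1921,2250,2375,2500,76.84%
```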

Finally, it's time to sort by the last comma-separated field, the utilisation percentage, in reverse numerical order:

cat data | paste -d , - - - - - - - - | sed 's/,,/,.,/' | awk -F, '{print $0 "," $5/$8*100 }' |
sort -t, -k +9 -n -r | tail -n 2
REDACTED,REDACTED,REDACTED,REDACTED,0,2250,2375,2500,0
REDACTED,REDACTED,REDACTED,REDACTED,0,225,237,250,0

And we already find two empty mailboxes, which could at the very least be downgraded to the cheapest package! To make things more readable, let's lay the data out in neat columns:

cat data | paste -d , - - - - - - - - | sed 's/,,/,.,/' | awk -F, '{print $0 "," $5/$8*100 }' |
sort -t, -k +9 -n -r | column -t -s , | tail -n 5
REDACTED REDACTED REDACTED REDACTED 1      2250  2375  2500   0.04 
REDACTED REDACTED REDACTED REDACTED 0      225   237   250    0

And there we have it: an overview of the usage of all our mailboxes. Finally, let's use awk to filter for those mailboxes using a package larger than the minimum (250 MB) while utilising less than ten percent of it, as these can definitely be downgraded:

cat data | paste -d , - - - - - - - - | sed 's/,,/,.,/' | 
awk -F, '{print $0 "," $5/$8*100 }' | sort -t, -k +9 -n -r | 
awk -F, '{ if ($8 > 250 && $9 < 10) print $3 "," $9"%" }' | 
column -t -s,
REDACTED 0.32% 
REDACTED 0.2%

And there we go, a whole list of accounts that can easily be saved upon. Let's finish off with a quick calculation of how much we just saved:

cat data | paste -d , - - - - - - - - | sed 's/,,/,.,/' |
awk -F, '{ if ($8 > 250 && ($5/$8)*100 < 10) print $0 }' |
wc -l | xargs echo "100 *" | bc
6100

There we go: it looks like two hours of playing with CasperJS, tidy, sed, awk and grep just saved us €6100, and probably a factor of two more once I inspect the data a bit closer. Not a bad result for 18 lines of JavaScript and a few lines of BASH!

Comments (4)

  1. Chris

    February 11, 2014 at 12:42 am

    Nice tutorial. I always use Python and Selenium (which can be used with PhantomJS), but I like the workflow you outline here. I find I'm often too tempted to try and do everything (scraping, parsing, saving) within a single program.

    • Joshua Appelman

      February 11, 2014 at 12:45 am

      Thanks for the feedback! Python is pretty great for data manipulation, but when dealing with poorly structured text nothing beats sed/grep/awk.

  2. Michiel

    February 11, 2014 at 8:43 am

    Nice inspirational post & quick win! ;-)

  3. Rand

    February 12, 2014 at 12:03 am

    Fantastic writeup, love the command line fu!
