PHP and Big Data

This post first appeared as part of the PHP Advent Calendar.

Big data, data science, analytics. These are some of the hottest buzzwords in tech right now. Five years ago, the boasting rights went to the geek with the largest number of users: these days he with the biggest data wins.

There are a number of approaches to dealing with vast quantities of data, but one of the best known is Apache Hadoop. Hadoop is a toolkit for managing large data sets, based originally on the Google whitepapers about MapReduce and the Google File System. For Socorro, the Mozilla crash reporting system, we use HBase, a non-relational (NoSQL) database built on the Hadoop ecosystem.

The Hadoop world is largely a Java world, since all the tools are written in Java. However, if you feel the same way about Java as Sean Coates, you should not lose hope. You, too, can use PHP to work with Hadoop.

Let’s start by understanding MapReduce. This is a framework for distributed processing of large datasets.

A MapReduce job consists of two pieces of code:

A Mapper
The job of the Mapper is to map input key-value pairs to output key-value pairs.
A Reducer
The Reducer receives and collates results from Mappers.

More parts are needed to make this work:

  • An Input reader generates splits of data for each Mapper to work through.
  • A Partition function takes the output of Mappers and chooses a destination Reducer.
  • An Output writer takes the output of the Reducers and writes it to the Hadoop Distributed File System (HDFS).
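
To make the data flow concrete, here is a minimal in-memory sketch of the same pipeline in plain PHP. It does not talk to Hadoop at all, and every function name in it (map_record, partition, reduce_key, run_job) is hypothetical, chosen only for illustration. It assumes PHP 7.1 or later.

```php
<?php
// Local, in-memory sketch of the MapReduce data flow. Nothing here talks to
// Hadoop; all function names are hypothetical.

// Mapper: turn one input record into a list of [key, value] pairs.
function map_record(string $line): array
{
    $pairs = [];
    $words = preg_split('/\W+/', strtolower($line), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $pairs[] = [$word, 1];
    }
    return $pairs;
}

// Partition function: choose a destination reducer for a key.
// (crc32() is non-negative on 64-bit PHP builds, which this sketch assumes.)
function partition(string $key, int $numReducers): int
{
    return crc32($key) % $numReducers;
}

// Reducer: collate all the values seen for one key.
function reduce_key(string $key, array $values): array
{
    return [$key, array_sum($values)];
}

function run_job(array $lines, int $numReducers): array
{
    // Map phase, then route every pair to its reducer's bucket.
    $buckets = array_fill(0, $numReducers, []);
    foreach ($lines as $line) {
        foreach (map_record($line) as [$key, $value]) {
            $buckets[partition($key, $numReducers)][$key][] = $value;
        }
    }

    // Reduce phase: each reducer works through its own keys in sorted order.
    $output = [];
    foreach ($buckets as $bucket) {
        ksort($bucket, SORT_STRING);
        foreach ($bucket as $key => $values) {
            [$word, $total] = reduce_key((string) $key, $values);
            $output[$word] = $total;
        }
    }
    return $output;
}

$result = run_job(["To be or not to be", "Not to worry"], 2);
ksort($result, SORT_STRING);
print_r($result);
```

The point of the partition step is that every pair with the same key lands on the same reducer, so each reducer can total its keys without knowing about the others.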

In summary, the Mapper and Reducer are the core functionality of a MapReduce job. Now, let’s get set up to write a Mapper and Reducer against Hadoop with PHP.

Setting up Hadoop is a non-trivial task; luckily, a number of VMs are available to help. For this example, I am using the Training VM from Cloudera. (You’ll need VMware Player for Windows or Linux, or VMware Fusion for OS X to run this VM.)

Once you’ve started the VM, open up a terminal window. (This VM is Ubuntu based.)

The VM you have just installed comes with a sample data set of the complete works of Shakespeare. You’ll need to put these files into HDFS so that we can work with them. Run the following commands to put the files into HDFS:

cd ~/git/data
tar vzxf shakespeare.tar.gz
hadoop fs -put input /user/training/input

You can confirm this worked by viewing the files in the input directory on HDFS:

hadoop fs -ls /user/training

Next, we need to create the mapper and reducer. To demonstrate these, we’ll reproduce what is often referred to as the canonical MapReduce example: word count.

You can find the Java version of this code in the Cloudera Hadoop Tutorial.

As you can see (if you know Java), the mapper reads words from input, and for each word it encounters, emits to standard output the word and the value 1 to indicate that the word has been encountered. The reducer takes output from mappers and aggregates it to produce a set of words and counts.

The easiest way to communicate from PHP to Hadoop and back again is using the Hadoop Streaming API. This expects mappers and reducers to use standard input and output as a pipe for communication.

This is how we write the word count mapper in PHP, which we’ll name mapper.php:


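#!/usr/bin/env php
<?php
// Word count mapper: emit "word<TAB>1" for every word read from standard input.
$input = fopen("php://stdin", "r");

while (($line = fgets($input)) !== false) {
    $line = strtolower($line);
    // Split on runs of non-word characters and drop empty strings.
    $words = preg_split("/\W+/", $line, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        echo "$word\t1\n";
    }
}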


We open standard input for reading a line at a time, split that line into an array along word boundaries using a regular expression, and emit output as the word encountered followed by a 1. (I delimited this with tabs, but you may use whatever you like.)

Now, here’s the reducer (reducer.php):


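#!/usr/bin/env php
<?php
// Word count reducer: sum the counts emitted by the mappers for each word.
$input = fopen("php://stdin", "r");
$counts = array();

while (($line = fgets($input)) !== false) {
    $tuple = explode("\t", rtrim($line, "\r\n"));
    $word = $tuple[0];
    if (!isset($counts[$word])) {
        $counts[$word] = 0;
    }
    $counts[$word] += isset($tuple[1]) ? (int) $tuple[1] : 0;
}

foreach ($counts as $word => $count) {
    echo "$word $count\n";
}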

Again, we read a line at a time from standard input, and summarize the results in an array. Finally, we write out the array to standard output.
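
Holding every word in an array is fine for Shakespeare, but it will not scale forever. Hadoop Streaming sorts mapper output by key before handing it to a reducer, so a reducer can instead keep a running total for one word at a time. The following is a sketch of that variant, not a drop-in replacement: reduce_sorted is a hypothetical helper name, the tab-separated output is my choice, and the demo feeds it an in-memory stream rather than real job input. It assumes PHP 7.1 or later.

```php
<?php
// Sketch of a reducer that relies on its input arriving sorted by key, so it
// only ever holds one word's running total in memory.
// reduce_sorted() is a hypothetical helper name.

function reduce_sorted($in, $out): void
{
    $current = null; // the word currently being summed
    $total = 0;

    while (($line = fgets($in)) !== false) {
        $parts = explode("\t", rtrim($line, "\r\n"), 2);
        if (count($parts) !== 2) {
            continue; // skip malformed lines
        }
        [$word, $count] = $parts;

        if ($word !== $current) {
            // The key changed: flush the previous word before starting a new one.
            if ($current !== null) {
                fwrite($out, "$current\t$total\n");
            }
            $current = $word;
            $total = 0;
        }
        $total += (int) $count;
    }

    // Flush the last word.
    if ($current !== null) {
        fwrite($out, "$current\t$total\n");
    }
}

// Demo on in-memory streams; a streaming job would pass STDIN and STDOUT.
$in = fopen('php://memory', 'w+');
fwrite($in, "be\t1\nbe\t1\nnot\t1\nto\t1\nto\t1\nto\t1\n");
rewind($in);

$out = fopen('php://memory', 'w+');
reduce_sorted($in, $out);
rewind($out);

$result = stream_get_contents($out);
echo $result;
```

The trade-off is that this version is only correct when identical keys arrive next to each other, which the array-based reducer above does not require.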

Copy these scripts to your VM, and once you have saved them, make them executable:

chmod a+x mapper.php
chmod a+x reducer.php

You can run this example code in the VM using the following command:

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2+320.jar \
-mapper mapper.php \
-reducer reducer.php \
-input input \
-output wordcount-php-output

In the output from the command you will see a URL where you can trace the execution of your MapReduce job in a web browser as it runs. Once the job has finished running, you can view the output in the location you specified:

hadoop fs \
-ls /user/training/wordcount-php-output

You should see something like:

Found 2 items
drwxr-xr-x   - training supergroup          0 2010-12-14 15:40
-rw-r--r--   1 training supergroup     279706 2010-12-14 15:40

You can view the output, too:

hadoop fs \
-cat /user/training/wordcount-php-output/part-00000

An excerpt from the output should look like this:

yeoman 13
yeomen 1
yerk 2
yes 211
yest 1
yesterday 25

This is a pretty trivial example, but once you have this set up and running, it’s easy to extend this to whatever you need to do. Some examples of the kinds of things you can use it for are inverted index construction, machine learning algorithms, and graph traversal. The data you can transform is limited only by your imagination, and, of course, the size of your Hadoop cluster. That’s a topic for another day.
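
As a taste of the first of those, here is a rough sketch of inverted index construction as a map step and a reduce step, again run locally in plain PHP. The names index_map and index_reduce and the document IDs are hypothetical, and how a real streaming job would learn which document a line came from is left out. It assumes PHP 7.1 or later.

```php
<?php
// Inverted index sketch: for each word, list the documents that contain it.
// All names and document IDs here are hypothetical.

// Map step: emit one [word, docId] pair per distinct word in a document.
function index_map(string $docId, string $text): array
{
    $pairs = [];
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach (array_unique($words) as $word) {
        $pairs[] = [$word, $docId];
    }
    return $pairs;
}

// Reduce step: gather the document IDs for each word into a sorted list.
function index_reduce(array $pairs): array
{
    $index = [];
    foreach ($pairs as [$word, $docId]) {
        $index[$word][$docId] = true; // set semantics: one entry per document
    }
    ksort($index, SORT_STRING);

    return array_map(function (array $docs): array {
        $ids = array_map('strval', array_keys($docs));
        sort($ids, SORT_STRING);
        return $ids;
    }, $index);
}

$pairs = array_merge(
    index_map('hamlet', 'To be, or not to be'),
    index_map('macbeth', 'What is done cannot be undone')
);
$index = index_reduce($pairs);

echo 'be: ', implode(', ', $index['be']), "\n";
```

The shape is the same as word count: the mapper emits key-value pairs and the reducer collates them per key. Only the value changes, from a count to a document ID.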

Parenting Versus Programming

This post was written for and first appeared in the PHP Advent Calendar 2009.

Advent calendars are about Christmas, and for me Christmas has always been a time for family. This year I have recently joined the ranks of the parents among you. I am taking a short break from work and focusing on being a mother rather than being a programmer. This has led me to reflect on the similarities between parenting and coding. I present these here for your enlightenment, or so you can laugh at me.

Lesson One: Smells

Babies, like programs, are associated with a variety of interesting smells. If you try to ignore the smells associated with babies, they only get worse over time. Then the screaming starts.

It was Martin Fowler who popularized the notion of code smells. These are defined as parts of your code that do things in an ugly way, or to put it a different way, they are hacks. Typically, when we find these parts of code, our eyes begin to glaze over, and we enter a strange, zombie-like state. (This will also be familiar to parents.) In this state, we are paralyzed and thus prevented from doing anything about the smell. It is only when we emerge from the cave of code smells into the daylight of clean code that we can again be productive. I am not sure whether it is the horror of the code, a tendency to procrastinate that is endemic among programmers, or simply a fear of breaking things that prevents most people from doing something about the mess. It is always the smelliest parts of the code that are the most fragile.

The lesson programmers can learn from babies, here, is to face your smells and get rid of them as quickly as possible. Then everybody’s happy.

Lesson Two: Sleep

I have now been programming for mumble years. Not long after I started work at Mozilla, a delightful Canadian journalist asked me what it was like to work with a group of people “so much younger than myself.” While I am not actually that old, I am old enough to have picked up some skills. One of these skills is the ability to survive on 4 to 6 hours of sleep a night for extended periods. Although this is not my favorite hobby, I have gained a certain level of mastery.

Another thing I’ve learned is that surviving on 4 to 6 hours of sleep a night for an extended period makes you mildly deranged. With parenting, as with programming, it is sometimes required. When you achieve a certain level of derangement, you find that very silly things start to sound like good ideas, or they just start to happen, whether you intended them to or not. For example, you find yourself accidentally filing the baby in the filing cabinet or implementing a new framework.

This is a lesson both programmers and parents can learn from the military: sleep when you can do so safely, and as often as possible. If you can snatch a few minutes of nap here and there, it may not make you feel a lot better, but your degraded IQ will recover somewhat.

Lesson Three: With Great Power Comes Great Responsibility

When you set out to implement a new web app or raise a child — and yes, I do realize these things are not quite on the same scale — you have a great responsibility to do a good job. Otherwise, everyone who has to interact with your app or child in the future will curse your name. Repeatedly. Your baby — whether it is human or code — is totally dependent upon you to do a good job.

Lesson Four: It Takes a Village

On that note, it is important to realize that it is very hard to raise a child or write a web app completely on your own. Some things are better done with a little help from your friends. Whether that help is providing a role model, a shoulder to cry on, advice when you just don’t know what to do, or purely someone to vent to, it really helps to surround yourself with people you can rely on.

That summarizes the commonalities between parenting and programming that I have learned over the last 7 weeks. I suspect I have a great deal more to learn.

One final note: If anyone approaches you to work on a web app that will take 18 years, run away as fast as you can, and do not look back.

Seven Things

I feel like I’m about last to the party, but after getting tagged by Ben Ramsey I thought I’d contribute to this meme/tag/whatever that’s going around the PHP community.  I intend to blog more in 2009, so this is a starting point anyway.

Seven Things you didn’t know about me:

1.   I learned to program in the 4th grade on an Apple II, in LOGO.  A high school near me had a program where a group of selected nerds from local schools would go there once a week and share two machines.  I was the only one in primary (elementary) school.

2. I used to do a lot of singing – school choir, madrigals, musicals, an a cappella group with my friends, and a youth gospel band.  I sing alto.

3. I’m not religious and consider myself a (non militant) atheist.  We were raised that way since my family is a combination of Catholic, Jewish and Presbyterian and my parents didn’t believe in organized religion.  Despite this I went to an Anglican girls’ school for 11 years (also see the above mentioned gospel band) and am technically Jewish.

4. I met my husband Luke Welling in Advanced Software Engineering at college (RMIT University in Melbourne, Australia).  We had to do a review of each other’s code as part of an assignment.  The first words he ever said to me were, “This is crap.”

5.  I moved house every 1-2 years when I was a kid, left school at 16, went back at 19, and then spent far too much time nerding out at college, which I loved.

6.  I’ve been riding horses since I was four years old.  At four, I rode my Shetland pony Froggy through the house to annoy my mother.  (I believe I succeeded.)

7.  I sold my first article to a magazine when I was about 13 years old.  It was a light humorous piece about the challenges involved in buying a horse and was printed in a national horse magazine.  (I always wanted to be a writer when I grew up…or possibly a veterinarian…or maybe a secret agent.  I never thought it would end up being tech books that I wrote.)  I have also written one complete bad novel, and the beginnings of several others.  (Nanowrimo, I’m looking at you.)

OK, now here are my tag-ees.

Tag-wise, here are the rules:

  • Link your original tagger(s), and list these rules on your blog.
  • Share seven facts about yourself in the post—some random, some weird.
  • Tag seven people at the end of your post by leaving their names and the links to their blogs.
  • Let them know they’ve been tagged by leaving a comment on their blogs and/or Twitter.

Write Beautiful Code at OSCON

I gave my talk yesterday at OSCON 2008, and here are the slides.

It’s interesting – I think every time I have given this talk I focus on a slightly different aspect. Yesterday it was the importance of decoupling parts of your application architecture as much as possible. This is better for security reasons (allows paranoid coding practices), for scaling (allows you to switch out and/or scale components independently and quickly), and for maintainability.

OSCON is good as usual – if you’re here be sure to join Mozilla at Beerforge tonight, and come say hi.

Blog relocation

After a bunch of DNS and other broken-ness issues, I have given up on TypePad and moved my blog to self hosted WordPress.

Some links are still broken (notably About and Talks) but I’m working on that and hopefully they should be fixed soon.

Links to specific posts will also be broken, but they are all here somewhere.  (If I get time I might put in a bunch of rewrite rules but I’m not sure there are enough links into specific posts here for that to be worth it.  Tell me if you have specific requests.)

Moving on and OSCON

For those that haven’t heard it from me or the grapevine, I’m moving on from OmniTI.  I’ve had an exceptional amount of fun working with the good peeps over there for the last couple of years – it’s an excellent team and they do great work.  I’m still looking for my Next Big Thing.  Not sure yet what that will be but I am sure it will be fun.

I will be travelling a bit over the next week or so, will be in San Francisco next weekend and then on to Portland, Oregon on Sunday for OSCON.  I’ll be presenting the PHP and MySQL Best Practices tutorial with Luke, hope to catch up with all the usual suspects, meet some new people, learn some cool stuff, and have the general uber experience that is OSCON.  Hope to see some of you there.

kiwi foo, morale, and body enhancement

Today I am at Kiwi Foo Camp, also known as Baa Camp.

It’s entertaining and educational.  I’ve met a bunch of people I have not met before – I’ve kind of gotten used to knowing lots of people at conferences that I go to.  This one has a large quotient of New Zealanders and hence I’m meeting tons of new people.

I gave a talk this afternoon called From Startup to Google:  How do I grow?  where I looked at a bunch of issues to do with growing companies: how to start, how to fund yourself, how to hire good people, and how to implement a basic software process.  One of the issues I talked about is something I feel really strongly about, and that is developing your company to have a good culture, making it a place where you and other people want to work, and where people can be passionate about what they want to do.  I have noticed that this often falls by the wayside as companies grow large, and a friend of mine commented that it seems to happen somewhere around the 100-200 employee mark.  I’m interested to know what other people think.

I’ve been to some great talks today: a free flowing discussion on user experience and another on email security, and a talk by Robert O’Brien on Atom, another on agile web dev tools, but the real humdinger of the day for me was Quinn Norton on body modification and enhancement.  The concept of a drug that allows you to control your sleep, implanted rare earth magnets that let you feel your hard drive spinning in your fingertips, or another drug that makes you tanned, thin, and increases your libido…well, who wouldn’t be interested? It’s like ShadowRun made real.

2006 Year in Review

In the style of everybody else I know I thought I’d post my year in review.  2006 was a crazy, crazy year for me.


January

  • Rode my first decent EFA dressage test
  • Said goodbye to family and friends
  • Moved to Columbia MD for three months to work at OmniTI.

February

  • Began work in MD, working on Ecelerity webconsole
  • Made new friends, found a place to ride, met lots of new horsy people
  • Fell over on the ice a lot
  • Fell in love with working at OmniTI

March

  • Finished work on Ecelerity 2.1
  • Got disgustingly homesick
  • Got promoted to Director of Web Development, and made tentative plans to stay in MD

April

  • Tore the cartilage in my knee and began eight miserable weeks on crutches and doped up on painkillers
  • Thanked the powers that be for my friends who helped me in this crisis when I was on the opposite side of the planet from my husband and support system
  • Was visited by Luke
  • Spoke at MySQL UC, on crutches

May

  • Chris became a Principal
  • Finally had knee surgery and could walk again
  • Started riding again

June

  • Spoke at NYPHP
  • Babysat the office while everybody else was out of town
  • Spoke at ApacheCon in Dublin

July

  • Went home for the first time since January
  • Went to OSCON and gave a tute and a talk

August

  • Returned to MD, further homesickness ensued.
  • Went to Germany for long awaited vacation with Luke

September

  • Travelled to Microsoft for the Web Dev summit
  • Moved to our new, bigger office in Columbia, started hiring a few more staff to fill it

October

  • Spoke at ApacheCon in Austin and DCPHP
  • Rode in my first US horse trials (and was 5th)
  • Had a hideous car accident

November

  • Appointed Principal and took a trip home to celebrate

December

  • Moved into my own house in MD at last
  • Divisional Champion at a jumper show
  • Trip home, Christmas, need I say more?

It’s been a year of amazing ups and downs.  Here’s hoping 2007 has just as many ups and fewer downs.

I took a photo as a marker of where I am right now:

Be always at war with your vices, at peace with your neighbors, and let each new year find you a better man. 
-Benjamin Franklin

Not only is another world possible, she is on her way.  On a quiet day, I can hear her breathing.
– Arundhati Roy

My new role at OmniTI

I have exciting news.

As George put it in his email: "I am very excited to announce that Laura Thomson has been promoted to the position of Principal."  My role will include focusing on securing new business and improving the quality and effectiveness of service.

I’m really excited about this opportunity to help take OmniTI onwards and upwards, and I’m really looking forward to the next few years.  I’d like to thank Theo, George, Chris, and Sherry for giving me this opportunity.

And no, contrary to rumor, I will not be changing my name to Schlomson to fit in with the crew.  🙂