Updating the Twittering Shipping Forecast

Sep 4, 2010 02:58 · 872 words · 5 minute read

Back at the beginning of last year, I put up a post about how I built the Shipping Forecast Twitter bot.

Since then it’s been chuntering away quietly - or at least it was, until the Met Office in their infinite wisdom decided to change the layout of the page that my Ruby script scrapes for data.

Strictly speaking, I don’t actually have any room for complaint - I’m probably breaking all kinds of terms of service by scraping their site for content. It’s also their site, to do with as they see fit. So moaning about “them” breaking “my” bot is somewhat churlish.

However, as the Met Office are a “trading fund” - for which read quasi-public sector body that takes tax monies to produce information, and then sells that information back to the people who’ve already paid for it - the chance of them providing any kind of open data is minimal. So seeking not permission, but forgiveness - and scraping the data - is really the only option.

The other issue with the Met Office is that they are your typical public sector organisation when it comes to semantically-valid, standards-compliant web design. That is to say, they wouldn’t know semantically-valid, standards-compliant web design if it fell out of a tree onto them. The page source looks like it was rendered by Microsoft Office’s “save as HTML” function.

So all this makes the whole scraping job ever so slightly more complex. For the geeks, obsessives and anyone else who might be interested, this is how I did it - the second time around.

There are two options for retrieving the source data - either the flat printable version of the forecast, or the whizzy (for a given value of whizzy) interactive map page. The former is pretty useless for scraping purposes, because there’s no structure in the document. The latter is also fairly useless, because the map is created client-side by a rolling dirtball of Javascript.

While it works fine in a Javascript-enabled browser, it doesn’t work using a server-side command-line retrieval tool like cURL or wget. Instead of getting the HTML source for the map page, you get the “you need to enable Javascript in your browser” error page, which defaults to the flat printable version of the forecast data.
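As a rough illustration - the URL below is a made-up stand-in rather than the real Met Office address - this is the dead end a scripted fetch runs into:

```ruby
require 'open-uri'

# Hypothetical URL standing in for the Met Office interactive map page
map_page = 'http://www.metoffice.example/shipping-forecast/map'

html = open(map_page).read

# With no Javascript engine, what comes back is the fallback page,
# not the rendered map markup
if html.include?('enable Javascript')
  puts 'Got the no-Javascript fallback - nothing useful to scrape'
end
```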

This had me stumped for a while, until I hit on the idea of sidestepping the webpage entirely. Poking around the source, I discovered that when the page is loaded, the embedded Javascript grabs a small external file containing the actual forecast data, and parses that to build up the map-based view.

Because Javascript runs client-side, that data file has to be publicly-accessible - otherwise there’d be no way of creating the page in the first place. If a browser can download it, so can my script - so that’s exactly what it does.
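In sketch form - again with a placeholder URL, since the real one came from grubbing around in the page source - the fetch boils down to very little:

```ruby
require 'open-uri'

# Hypothetical URL for the external data file the map's Javascript loads -
# the real location turned up in the page source
data_url = 'http://www.metoffice.example/data/shipping_forecast.txt'

raw = open(data_url).read
puts raw[0, 200]  # eyeball the first couple of hundred characters
```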

The other advantage is that the data isn’t embedded within HTML, so it’s much easier for the script to parse - no messing about with XPath and so on. Needless to say, this being the Met Office, it’s NOT in a useful and open format like JSON - so it does need some munging - but it’s still less hassle than a whole series of XPath queries.
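Just to give a flavour of the munging - this is purely illustrative, with an invented pipe-delimited layout rather than the Met Office’s actual format - the parsing step ends up as something like:

```ruby
# Purely illustrative: pretend the file is one sea area per line,
# pipe-delimited - the real Met Office file uses its own messier layout
raw = "Viking|Southwesterly 5 to 7|Moderate or rough|Rain|Good\n" \
      "Forties|Westerly 4 or 5|Slight|Showers|Moderate"

forecasts = raw.split("\n").map do |line|
  area, wind, sea, weather, visibility = line.split('|').map { |f| f.strip }
  { :area => area, :wind => wind, :sea => sea,
    :weather => weather, :visibility => visibility }
end

forecasts.each do |f|
  # Trim to Twitter's 140-character limit before posting
  puts "#{f[:area]}: #{f[:wind]}. #{f[:sea]}. #{f[:weather]}. #{f[:visibility]}."[0, 140]
end
```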

The other major change to the bot is to deal with the consequences of the OAuthapocalypse - the rather nicely-named decision by Twitter to switch off basic username/password authentication to their API. That’s actually a good thing - OAuth is a far more robust and secure authentication process, even if Twitter’s implementation has a flaw or two. But it did mean that having taken the easy option at the outset, I needed to retrofit all my bots with OAuth if they were going to continue working.

OAuth isn’t the most straightforward process to get your head around, and I cheated slightly. The first part of the process is simple enough - register your app with Twitter and get a pair of credentials that it then uses to identify itself. The second is trickier - getting the app access to a Twitter account involves requesting permission from the user, in return for which the app is issued with a temporary request token. That then needs to be exchanged for a permanent pair of access tokens, which get used to control the app’s ability to post to the user’s stream.
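In outline, the dance looks something like this - a sketch using the generic Ruby oauth gem with placeholder credentials, rather than anything the bot actually runs:

```ruby
require 'oauth'

# Consumer key and secret are the credentials Twitter issues at registration
consumer = OAuth::Consumer.new('CONSUMER_KEY', 'CONSUMER_SECRET',
                               :site => 'http://api.twitter.com')

# Step one: get a temporary request token and send the user off to authorise
request_token = consumer.get_request_token
puts "Visit #{request_token.authorize_url} and enter the PIN shown:"
pin = gets.strip

# Step two: exchange the request token (plus PIN) for the permanent pair
access_token = request_token.get_access_token(:oauth_verifier => pin)
puts "Token:  #{access_token.token}"
puts "Secret: #{access_token.secret}"
```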

Richard Taylor has produced a Twitter-specific OAuth library for Ruby, but I was being a bit slow that day and found it fiddly to get the authorisation process working. So I cheated, and used the Tweepy Python library. Jeff Miller has a very thorough writeup on his blog which shows how to use Tweepy from the command line to grab all the required tokens, and from there it was a simple process of plugging those into the Ruby code.
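With the tokens in hand, the Ruby side only has to sign its requests with them - something along these lines, again with placeholder tokens and the generic oauth gem rather than the bot’s real innards:

```ruby
require 'oauth'

consumer = OAuth::Consumer.new('CONSUMER_KEY', 'CONSUMER_SECRET',
                               :site => 'http://api.twitter.com')

# The permanent pair retrieved via Tweepy, pasted in as configuration
access_token = OAuth::AccessToken.new(consumer, 'ACCESS_TOKEN', 'ACCESS_SECRET')

# Post to the authorised account's stream
response = access_token.post('/1/statuses/update.json',
                             'status' => 'Viking: Southwesterly 5 to 7. Moderate or rough.')
puts "#{response.code} #{response.message}"
```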

All my bots are now updated to use OAuth, which has the added advantage of providing a whole load more status and diagnostic information from the Twitter API. Given that the API can be temperamental, that feedback allows much more defensive coding, creating scripts which can cope with the odd glitch now and again.
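By way of example, the defensive wrapping amounts to little more than this kind of thing - assuming the access_token object from the earlier sketch:

```ruby
# Retry a flaky API call a few times before giving up, logging the status
# code and body that the OAuth'd response now carries
def post_with_retries(access_token, status, attempts = 3)
  attempts.times do |n|
    response = access_token.post('/1/statuses/update.json', 'status' => status)
    return response if response.code == '200'
    warn "Attempt #{n + 1} failed: #{response.code} #{response.body}"
    sleep 30
  end
  nil
end
```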