This post is about a strand of work I’ve been doing at Rattle that attempts to tell stories about the BBC from data mined from subtitles. It is meant to complement the post written by BBC R&D, “From Channelography and beyond”.
Way back in 2009 we kicked off a project for BBC R&D which looked at taking BBC subtitle data and seeing what these subtitles could tell us about BBC TV output. It was a data research project with the aim of seeing if subtitle data could be mined effectively and if that resulting mined data was of any value.
The project was called Channelography (@channelography) and has only basic styling (we call it an alpha to manage expectations, but people still expect an R&D project to be beautiful!). To mine the data we use a service we built called Muddy, which we’re in the process of simplifying and open sourcing, to extract terms and match them to a controlled vocabulary of things. This controlled vocabulary is based on Wikipedia, which has over three and a half million articles (things) for us to match to.
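To give a feel for what matching terms against a controlled vocabulary can look like, here is a minimal sketch. It is not Muddy's actual code or API: the tiny `VOCABULARY` dictionary, the `match_entities` function and the longest-phrase-first lookup are all assumptions made purely for illustration, and a real vocabulary built from Wikipedia would need proper disambiguation.

```python
import re

# Hypothetical miniature vocabulary: lowercase surface form -> canonical "thing".
# A real controlled vocabulary would hold millions of entries.
VOCABULARY = {
    "elton john": "Elton John",
    "mastermind": "Mastermind",
    "afghanistan": "Afghanistan",
}

# Longest phrase (in words) that we ever need to look up.
MAX_WORDS = max(len(key.split()) for key in VOCABULARY)


def match_entities(subtitle_text):
    """Return vocabulary entries found in the text, preferring longer phrases."""
    words = re.findall(r"[a-z0-9']+", subtitle_text.lower())
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase starting at word i first.
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in VOCABULARY:
                found.append(VOCABULARY[candidate])
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no match starting here, move on one word
    return found
```

Calling `match_entities("Which singer, Elton John, toured Afghanistan?")` on this toy vocabulary picks out the two known things and ignores everything else.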
This alpha project highlighted a few things:
- Approximately 58% of programmes have subtitles, so this is the majority, but by no means all.
- The quality of data in pre-recorded programmes (where subtitles can be done in advance) is superior to live programmes – so in news, for example, there are far more incorrect spellings and even words.
- Children’s programming throws up far more false positives as the subtitles tend to use similar phonetic words to the sounds made and the character names used, you know, weird stuff!
The data itself showed, for the first time, the number of mentions of people, places and events on BBC TV by channel, genre, format and also programme brand. (Data methodology point: We only take one mention per show of an entity. We chose to do this to remove the bias toward things mentioned in programmes with a specific current affairs remit.)
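The one-mention-per-show rule is easy to sketch in code. This is only an illustration of the counting rule, not our production code: the `(broadcast_id, channel, entity)` tuple layout and the `count_entity_mentions` name are assumptions.

```python
from collections import Counter


def count_entity_mentions(mentions):
    """Count mentions per (channel, entity), at most once per show.

    `mentions` is an iterable of (broadcast_id, channel, entity) tuples,
    a hypothetical layout chosen for this example.
    """
    seen = set()  # (broadcast_id, entity) pairs already counted
    counts = Counter()
    for broadcast_id, channel, entity in mentions:
        key = (broadcast_id, entity)
        if key in seen:
            continue  # repeated mention within the same show: ignore
        seen.add(key)
        counts[(channel, entity)] += 1
    return counts


# Made-up sample data: three mentions in one show count once.
sample = [
    ("show-1", "BBC1", "Elton John"),
    ("show-1", "BBC1", "Elton John"),
    ("show-1", "BBC1", "Elton John"),
    ("show-2", "BBC2", "Elton John"),
]
totals = count_entity_mentions(sample)
```

With the made-up sample, Elton John scores one for BBC1 and one for BBC2, however many times he comes up within a single show.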
Here is Elton John, so far mentioned 207 times on BBC TV since we’ve been collecting the data (Since Sept 2009):

And when we dig down we see the recent programmes in which Elton was mentioned…

And then the specific mentions in those shows, and in this example, Mastermind:

You can also see number of shows and repeats per channel, for example in March BBC2 had 1341 broadcasts of which 445 were new programmes. This compared with 1018 broadcasts on BBC1 of which 770 were new.
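To put those March figures side by side, here is a quick worked calculation. It assumes that every broadcast that wasn't a new programme was a repeat; the function name is just for illustration.

```python
def new_programme_share(total_broadcasts, new_programmes):
    """Fraction of broadcasts that were new programmes."""
    return new_programmes / total_broadcasts


# March figures quoted above: (total broadcasts, new programmes).
march = {"BBC2": (1341, 445), "BBC1": (1018, 770)}

for channel, (total, new) in march.items():
    share = new_programme_share(total, new)
    # Assumption: anything that is not new is a repeat.
    print(f"{channel}: {total - new} repeats, {share:.0%} new")
```

On that assumption BBC2 had 896 repeats (about 33% new) against 248 repeats on BBC1 (about 76% new).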
So, it stimulates our curiosity about programming and content and ultimately what the BBC presents.
Later in 2010 the BBC commissioned us to take this further and present a dashboard of this data, ‘slices’ of the data presented in ways to engage an audience. The audience for this was mainly BBC management. And so we created two dashboard views on the data. This was the first:

In doing the information design our principal heuristics were that the time periods should fit with people’s understanding of “TV Time” and also that information should be interesting without us having to rely on the data to deliver the story. In practice this meant choosing things like “number of films this week versus last week” and curating mentions of companies that were well known (banks, oil companies, social networks and consumer electronics companies). In short, we provided the narrative framework to drop the data into. We weren’t too happy with version 1 initially. It felt too fussy and didn’t give the impression of being a dashboard, something conveying authority and power.
The second attempt was influenced by Russell Davies’ talk at Playful in 2009 and in particular the idea that pretending is an important way to create an emotional connection with users. Help people to believe that they’re in charge of a nuclear submarine, even though it’s just their email client. That kind of thing.
It feels a bit more like a dashboard should, although actually many people prefer the first iteration. Ideally we’d like to have had as the design brief “The Director General’s Dashboard”, complete with Defcon-style warnings for when repeats hit a certain level or when mentions of D-list celebs start to climb. That would be neat. Who wouldn’t want to be the DG?
(You can grab the archive of daily dashboard views on the Channelography Flickr stream).
From this work, we became increasingly interested in how you could tell stories with data and so the BBC commissioned us to produce an annual of BBC TV for 2010, told through subtitle data as well as other data that is available from the iPlayer feeds (such as repeats). We created this as a printed, concertina document, intended as a Pocket Guide. We chose this format because we wanted BBC management, who are generally busy people, to have something they could share and show to others over a coffee. The material document somehow made this R&D concept more accessible. The BBC 2010 Annual can be viewed online, here.
There were two key design challenges in the production of the annual. The first was to show the year, the sense of movement throughout the year, together with macroscopic views into particular domains, such as “opportunities to view”, mentions of place etc. We did this by providing a timeline of key programmes and events. The second design challenge was to provide meaningful stories from the data for the eleven panels. We chose to design for things that would be inherently interesting regardless of the data. This brute-force approach helped to save time (limited on this two-week project!) but in adopting this approach we potentially missed some of the patterns in the data.
Here are some examples of the panels.
Dead Yet Alive – Top 5 historical figures by mention on BBC TV (together with a sample of their appearance in subtitles):

Company mentions correlated with share price

Repeats Opportunities to view

Head to head mentions
I think this is my favourite. It’s easy to understand and taps into a desire to want to have winners and losers.

So, in a little under two years we took an interest in data and started to explore what you could say about the BBC from subtitles. The answer is that you can say quite a lot, and you can start to infer things about the BBC and the data, such as that Afghanistan featured more prominently in the news in 2010 than Iraq and that Jamie Oliver is more popular than Gordon Ramsay.
Where’s the (public service) value?
There are different potential sources of value in this data and in using it as we have. The first is navigation: describing content more effectively so that it can be indexed, allowing people to get to content and to discrete bits of content. For example, the mention of Elton John in Mastermind, above, might be relevant to someone looking up Elton John in the BBC Music pages (provided the show is available to watch via iPlayer). We get a heap of emails from people researching particular people who come across Channelography because it places so well in search results; most of them want to correct particular factual errors they’ve seen or to see if we have contact information! Currently only the description of a BBC show is indexed, not the subtitles or associated subtitle metadata, so subtitle data would be incredibly useful for boosting the SEO of programme and iPlayer pages.
Secondly, there is potential business value in aggregating the data to know what the organisation is putting out and when. Combined with other sources, such as the Guardian API, you might start to get a sense of how news events were covered.
Thirdly, there is the storytelling: how different people appear together or cluster, and how over time the data could become a way to tell stories around content and act as a proxy for British culture more generally and the things that preoccupy us, for example how Victorian drama is replaced by Edwardian or how Shakespeare’s influence ebbs and flows. All of this is hugely interesting, and only doable when data is available on this scale from a media organisation as central to the culture of a nation as the BBC (PBS, for all its excellent work, doesn’t have nearly the reach or the scale in the US that the BBC has in the UK).
Next steps
We’re hoping to do the same for radio as we’ve done for TV, utilising speech to text. Radio has had something of a resurgence in recent years and yet it still remains hard to search across. Storytelling the Archers (for you folk over the pond the Archers is a British institution, and the longest running soap opera ever) through themes would be awesome! We’re also hoping to extend our work in visualising information to create a more useful dashboard, perhaps with a bit more pretending built in.
If you’ve got any questions about this work do get in touch via Rattle. Thanks.