Friday, August 21, 2009

Delicious Today: data mining journalism, $$

I commented briefly on data mining journalism in Monday's post, which has attracted some interest. Today the topic keeps popping up in my usual round of RSS feeds and tweets. Maybe I've stumbled onto something big.

#1: Free the Facts: The Guardian's editor-in-chief on why open data matters
#2: Guardian Data Store: use our content to improve your site
#3: Guardian Data Blog
"Comment, as Guardian founding editor CP Scott said, is free. But the second part of his maxim holds equally true for the Guardian today: facts are sacred. ... The web has given us easy access to billions of statistics on every matter. And with it are tools to visualise that information, mashing it up with different datasets to tell stories that could never have been told before. ... That is where the Data Store and the Datablog come in. Every day we will publish the raw statistics behind the news and make it easy to export in any form you like. It is about freedom of information. But it is not a one-way process – we want you to tell us what you have done with the data and what we should do with it. The facts are sacred — and they belong to all of us."
#4: The Future Of Work: It’s Data, Baby (NYT)
"The ability to extract stories from a world of increasing and abundant data will be increasingly critical to many industries. Indeed, the opening of U.S. federal government data at data.gov (and the appointment of Sir Tim Berners-Lee to similarly open the UK’s data archives) implies a new societal and cultural importance for data wranglers."
#5: Hans Rosling shows the best stats you've ever seen (TED conference video)
"You've never seen data presented like this. With the drama and urgency of a sportscaster, statistics guru Hans Rosling debunks myths about the so-called 'developing world.'"
Mark Twain popularized the saying that "there are three kinds of lies: lies, damned lies and statistics" (he attributed it to Disraeli). Indeed, we can't always trust the "statistics" we read in news reports, because what we get is often interpretation rather than raw data, and interpretation can involve human error or, worse, deliberate manipulation. But with data storage prices falling through the floor and data analysis tools becoming easier to access and use, we can put more raw data in journalists' and the public's hands and give "transparency" a lot more meaning.

I'm very impressed by the Guardian's vision in this direction and the huge volume of work they have already put in. Data mining journalism requires at least three steps (sketched in code after the list):
  1. Collect data from government and enterprise databases (e.g. data.gov, UN databases), a media organization's own archive (e.g. the American Archive project), user-contributed data, etc.
  2. Analyze the data, focusing on finding interesting relationships, i.e. treating data as links instead of points
  3. Tell stories from the data
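Here is a minimal Python sketch of those three steps end to end, using only the standard library. The dataset URL and column names are hypothetical placeholders, not a real feed; any CSV export from data.gov or the Guardian Data Store would slot in the same way.

    import csv
    import io
    import statistics
    import urllib.request

    # Step 1: collect -- download a raw CSV dataset.
    # (Hypothetical URL; substitute any real open-data export.)
    URL = "https://example.data.gov/unemployment_by_county.csv"
    with urllib.request.urlopen(URL) as resp:
        rows = list(csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8")))

    # Step 2: analyze -- look for a relationship between two columns,
    # treating the data as links between variables, not isolated points.
    # (Column names are assumptions about the dataset's schema.)
    unemployment = [float(r["unemployment_rate"]) for r in rows]
    foreclosures = [float(r["foreclosure_rate"]) for r in rows]
    mu, mf = statistics.mean(unemployment), statistics.mean(foreclosures)
    cov = sum((u - mu) * (f - mf)
              for u, f in zip(unemployment, foreclosures)) / len(rows)
    r = cov / (statistics.pstdev(unemployment) * statistics.pstdev(foreclosures))

    # Step 3: tell the story -- turn the number into a sentence a reader can use.
    print(f"Across {len(rows)} counties, unemployment and foreclosure "
          f"rates move together (correlation r = {r:+.2f}).")

The real craft, of course, is in the last step: a correlation coefficient is a lead, not a story.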
A good example of data mining journalism in the U.S. is Patchwork Nation, a joint effort by PBS's NewsHour and The Christian Science Monitor, two fine public media organizations, to tell economic stories through data. Another example is EveryBlock, which I wrote about in Monday's post.

It's expensive to start data mining journalism because it takes a lot of time and money to write code from scratch to retrieve and present data. But the cost of continuing falls with each project because of code reuse and economies of scale (see the sketch below). For this reason, data mining journalism can be a sustainable business model that draws revenue from two sources: (1) charging readers and/or other media companies for data content that is unique and hard to duplicate; and (2) selling data-mining technology and/or access to other media companies.
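To make the code-reuse point concrete, here is a sketch of what the reusable plumbing might look like; fetch_csv and the cache layout are hypothetical illustrations, not any newsroom's actual code. Written once, a helper like this serves every subsequent story.

    import csv
    import hashlib
    import urllib.request
    from pathlib import Path

    CACHE_DIR = Path("data_cache")

    def fetch_csv(url: str) -> list[dict]:
        """Download a CSV once, cache it locally, and return its rows as dicts."""
        CACHE_DIR.mkdir(exist_ok=True)
        # Key the cache file on a hash of the URL so any dataset can be stored.
        cached = CACHE_DIR / (hashlib.sha1(url.encode()).hexdigest() + ".csv")
        if not cached.exists():
            with urllib.request.urlopen(url) as resp:
                cached.write_bytes(resp.read())
        with cached.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))

The next data story just calls fetch_csv() instead of rewriting retrieval and caching, so its marginal cost is mostly analysis and writing, which is exactly why the economics improve over time.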
