A prospective study, conducted at the behest of Columbia Journalism School, to evaluate the demand for a new digital platform to serve and train data journalists.

Main challenges

Medium data problem

Several interviewees pointed out that journalists do not deal with big data (billions of records) but with medium data (millions of records). Medium data is not big enough to justify big data tools (like Hadoop), but it cannot be treated as small data and opened in Excel. That poses a problem: according to the same interviewees, the open source tools for medium data have several limitations, especially in the time they take to load, import, export, and process data. For instance, it takes 30 seconds to import all census data into SAS (a proprietary tool) and several hours to import it into PostgreSQL with a Python script (both popular open technologies). Given the usual time constraints in a newsroom, that limitation can be highly problematic.
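The memo does not record why the interviewees' Python import scripts were slow. One frequent cause in scripted loaders generally is committing each row separately instead of sending rows in batches inside a single transaction. The sketch below illustrates the batched pattern only. It rests on several assumptions: it uses the standard library's sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere, and the table name, column names, and generated rows are hypothetical, not census data.

```python
import csv
import io
import sqlite3


def make_csv(n_rows):
    """Build an in-memory CSV with hypothetical columns (tract_id, population)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["tract_id", "population"])
    for i in range(n_rows):
        writer.writerow([f"T{i:07d}", i % 5000])
    buf.seek(0)
    return buf


def bulk_load(conn, csv_file, batch_size=10_000):
    """Insert CSV rows in batches inside one transaction; return rows loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tract (tract_id TEXT, population INTEGER)"
    )
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    total = 0
    batch = []
    with conn:  # a single transaction for the whole load, committed on success
        for row in reader:
            batch.append((row[0], int(row[1])))
            if len(batch) >= batch_size:
                conn.executemany("INSERT INTO tract VALUES (?, ?)", batch)
                total += len(batch)
                batch.clear()
        if batch:  # flush the final partial batch
            conn.executemany("INSERT INTO tract VALUES (?, ?)", batch)
            total += len(batch)
    return total


conn = sqlite3.connect(":memory:")
loaded = bulk_load(conn, make_csv(50_000))
count = conn.execute("SELECT COUNT(*) FROM tract").fetchone()[0]
print(loaded, count)
```

On PostgreSQL itself, the server's COPY command is the usual bulk-loading route. Whether it would close the gap with SAS that the interviewees described is not something this sketch can show.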

Content Management System (CMS) limitations

Most interviewees from publications that were not born digital complained about their Content Management Systems (CMS). Those systems were designed simply to republish print content. As a result, they do not exploit the new affordances of the digital medium: a news piece in those systems usually consists of a headline, some paragraphs, and a picture. When data journalists and the interactive team want to publish something more sophisticated (a complex visualization or a news app), the IT department usually tells them that the CMS does not allow that kind of content. The workaround is to host the content outside the CMS, on a separate server maintained by the newsroom, with the additional costs, work, and inconveniences that such a solution entails.

Publishing for different devices

Another problem is the multiplication of devices for news consumption. It is not enough to beautifully visualize the data on a laptop screen. Ideally, the same content must be accessible on a smartphone, a tablet, or a smartwatch. Most CMS and visualization toolkits are not truly prepared to be device-agnostic.

Lack of numeracy in the newsroom

In most newsrooms, the data team does not work solely on its own projects. It also provides data services for the newsroom at large. Data journalists often complain that they spend an unreasonable amount of time on menial tasks, such as helping other journalists sort Excel spreadsheets or explaining why certain numeric or data operations cannot be performed. They argue that basic numeracy in the newsroom would free them to work on more meaningful projects.

Lack of statistical sophistication in the data team

Although most data teams have one or more developers, few have a statistician. Several data journalists mentioned that they are often uncertain about the meaning of their results. When the matter is not controversial, they might publish it and wait to see if someone raises doubts. On more sensitive topics, they usually reach out to an expert who might be willing to do some pro bono work. Several data reporters said that they would welcome tools that help them with statistical analysis.

Dependence on the Visualization/Interactive Desk

Data journalists usually rely on the visualization desk to produce any publishable graphics, even when those graphics are mere pie charts or histograms. The reason is that the tools in which data journalists create visualizations (Excel, R, Python) do not output publishable graphics. That dependence could diminish if those tools produced appealing visualizations, which would in turn free the visualization desk for more sophisticated tasks.

Lack of processes to keep track of revisions and manage source code

Journalists are now facing a challenge that the software industry solved several years ago: how to keep track of revisions (often made by different people) to source code or to a data analysis. Newsrooms usually do not have processes in place to guarantee that everybody follows the same guidelines, and editors usually give data reporters a lot of leeway in choosing tools or programming languages. Ben Welsh argued that innovative tools that merge a programming environment (like IPython Notebook) with a repository hosting service (like GitHub) could solve that problem because, to some extent, they would be an indirect way of enforcing some guidelines and best practices.

Fear of developing interactives that will demand updates or future maintenance

Data journalists and visualization editors are understandably reluctant to create news apps that will demand future maintenance. However useful those interactives might be for the community, reporters and editors want to be free to take on new projects. Tools that reduce the cost and time commitment of maintaining a news app would probably stimulate the development of more interactives.

Lack of a collaborative culture

In recent years, several newsrooms have started open-sourcing internal tools and code. However, that is still an incipient trend. The open-sourcing of data is even less common. For instance, many media outlets have created internal databases about shootings in the US in order to have a more realistic picture than the one provided by FBI figures. Those outlets could work together to create a richer and more consistent database. Some initiatives have gone in that direction (e.g. the initiative to explore campaign finance data in California that brought together CIR, LA Times, and Washington Post).

Unstructured and semi-structured data

A good example of unstructured data is speech. Computational techniques for dealing with unstructured data have been evolving fast in recent years, but despite their journalistic potential they are still far from newsroom use. Even semi-structured data (like PDF forms where the data is not organized in a well-behaved table) continues to pose a challenge for data journalists.
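To make the semi-structured case concrete, here is a minimal sketch of one step in such a workflow: turning text that has already been extracted from a form into one record per form. Everything specific in it is an assumption for illustration: the field labels, the sample values, the "Label: value" layout, and the blank line between forms are hypothetical, and real extracted text is usually messier.

```python
import re

# Hypothetical text, as it might look after extraction from two PDF forms.
SAMPLE_TEXT = """\
Filer Name: Jane Doe
Filing Date: 2015-03-02
Amount: $1,250.00

Filer Name: John Roe
Filing Date: 2015-03-09
Amount: $980.50
"""

# A "Label: value" line; the label is everything before the first colon.
FIELD_RE = re.compile(r"^\s*([^:]+?)\s*:\s*(.*?)\s*$")


def parse_forms(text):
    """Group 'Label: value' lines into one dict per blank-line-separated form."""
    records = []
    current = {}
    for line in text.splitlines():
        if not line.strip():  # a blank line closes the current form
            if current:
                records.append(current)
                current = {}
            continue
        match = FIELD_RE.match(line)
        if match:  # lines without a colon are ignored
            label, value = match.groups()
            current[label] = value
    if current:  # keep the last form if the text lacks a trailing blank line
        records.append(current)
    return records


records = parse_forms(SAMPLE_TEXT)
for record in records:
    print(record)
```

The resulting list of dictionaries can be written out as a table with the standard csv module. The difficult part in practice is that forms rarely follow a single clean layout, which is the challenge the interviewees describe.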

Technologies currently in use

Bread-and-butter

  • Microsoft Excel
  • Relational databases: PostgreSQL and MySQL
  • Python: for Web scraping, data import, and data analysis
  • JavaScript: for data visualization

Occasional

  • Microsoft Access
  • R: for statistical computing (some newsrooms are migrating from SAS and SPSS)
  • GitHub: revision control and source code management
  • Django and Ruby on Rails: as Web application frameworks
  • Tabula or ABBYY: for extracting data from PDFs
  • Google Fusion Tables

Rare

  • Machine learning: for data cleaning and classification
  • Virtual reality
  • 3D interactive models (Three.js)
  • News-writing bots
  • PANDA (pandaproject.net)
  • NoSQL databases: MongoDB and Elasticsearch

Absent

  • Tools for monitoring social media
  • Sensors
  • Voice or face recognition
  • Speech-to-text in order to perform natural language processing
  • Drones

Journalists Interviewed for this Memo

Los Angeles Times
Ben Welsh
Editor at the Data Desk

Tampa Bay Times
Adam Playford
Director of data/digital enterprise

The Wall Street Journal
Andrea Fuller
Data specialist in the investigative team

The Washington Post
Steven Rich
Database editor for investigations

Associated Press
Troy Thibodeaux
Interactive newsroom technology editor

The Guardian
Helena Bengtsson
Editor of Data Projects

ProPublica
Olga Pierce
Deputy data editor and reporter

USA Today
Jodi Upton
Senior Database Editor

BBC
John Walton
Journalist on the Visual Journalism team

The Center for Investigative Reporting (CIR)
Jennifer LaFleur
Senior editor at Reveal and CIR

The Marshall Project
Gabriel Dance
Managing editor
Tom Meagher
Deputy managing editor

VG - Norway
Dan Kåre
Editor, Interactive/Visualization Team
John Bones
Senior reporter