
Open Journalism & the Open Web

Week 4: Big Ugly Datasets for Thumb-Fingered Journalists

Fri, 2010-10-08 17:44

Week 4, October 11-15, 2010: Big Ugly Datasets For Thumb-Fingered Journalists
Lecturer: Nick Judd, Assistant Editor, Personal Democracy Forum

Call URL: 
  http://apps.calliflower.com/conf/show/131557

 

Goals 
Introduce programmers and journalists to the concept of deploying entire large-scale databases for use in reporting. We live in a data-driven world; from Facebook's Open Graph to Data.gov, databases of massive size and sometimes frightening scale are more available now than ever before — but just because they're available doesn't make them useful.  Participants in this course will leave with a better understanding of what a single journalist or a small team can accomplish in the new world of big, ugly datasets.

 
Readings
J-Learning.org, "Introducing Databases" and "Planning Your Database Tables"
- http://www.j-learning.org/build_it/page/introducing_databases/
- http://www.j-learning.org/build_it/page/planning_your_database_tables/
 
Paul Bradshaw, "Data journalism pt2: Interrogating data"
- http://onlinejournalismblog.com/2010/04/26/data-journalism-pt2-interrogating-data/
 
Bill Allison, "Editor's Notebook: Follow the Muddled Money"
- http://reporting.sunlightfoundation.com/2010/editors-notebook-following-muddled-money/
 
Matt McAlister, "A year in open data: how the Open Platform has changed what we do"
- http://www.guardian.co.uk/news/datablog/2010/may/20/open-platform-data-guardian
 
 
Reaction Questions (post your responses before the Monday lecture)
When can data help you find — or tell — a story?
What are the risks inherent in relying on databases to inform journalism? How do you mitigate those risks?
What can you do using databases that you couldn't otherwise do?
What can't you do using databases that you could through more conventional means?

Meeting times
Live meeting (audio & chat): Monday, October 11th, 1PM Eastern (attendance highly recommended)
Mid-week check-in (group chat): Wednesday, October 13th, 4PM Eastern
Peer assessment (group chat, possible video): Friday, October 15th, 1PM Eastern.

Comments

Marlon x
Sun, 2010-10-10 02:00

The example given in the Bradshaw article of how to use data to uncover a story was very cool. Based on that I'd say one way to use data is to treat deviations in the data set as leads, and follow up on them accordingly. On the other side, you can use data like you would an interview with an expert: quoting the relevant bits to lend authority to the story.

This is also a downside of using a database, since although the idea of a computerized repository of data seems very authoritative and reliable, it's potentially just as fallible and flawed as any other source. The Bradshaw article mentioned trying to "repair" corrupted data, but how can you be sure you've repaired corruption if you're not sure what the data should be? I think attention should be given to techniques for taking multiple independent data sets and checking them against each other. Consistency suggests reliable data, while any inconsistency is a potentially interesting lead.

Data sets do differ from individual sources in that they are the collected accounts of very many people or very many incidents, etc. So it's possible to get a much wider view of a situation than would ever be possible otherwise. The inverse is also true - the data set is inherently an abstraction, and few databases can convey the human elements which often make a story compelling and relatable. Although as more of our lives become recorded and indexed in extensive detail, I wonder if this will change...is it possible to write moving personal stories based on database-sourced information?

Matt Carroll
Sun, 2010-10-10 22:14

I live with data-driven stories every day, and I’m baffled more reporters don’t do them. The federal, state & local governments put all this data out there, just begging reporters to pluck the stories out of the numbers, but few do. Still, there are problems, and I’m sure I’ve been a victim of most of them. First, relying on numbers too heavily can turn a story eye-glazing. Second, the numbers might be the base of the story, but not necessarily the heart of the story. Usually the data that is the kernel of a good story is just a starting place; now you have to go find the people to bring the story to life. Also, unfortunately, all too often the data is screwed up. Sometimes it’s just a piece or two, which invariably are the outliers that piqued your curiosity in the first place. More often, I get data from the government that is a mess because the context is not clear, or the thinking behind the database’s creation was muddled, so the data is useless.
Data can give a story sweep – you can use it to show trends statewide, countrywide, worldwide. But data is a pile of cold numbers if you can’t get the right quotes or the picture that makes you cry. The numbers are just a part of the puzzle for building the story.

Steve Myers
Mon, 2010-10-11 07:54

When can data help you find — or tell — a story?
We're used to relying on data to tell governmental and crime stories, but more and more structured information is being gathered every day. As more public transit systems start tracking the location of their trains and buses, we'll be able to measure their performance better (and we can tell people when the bus is really coming). Social media is providing all kinds of new data for us to filter; by working with structured elements of online social interaction, we can see how people react to the World Cup or an election. Some of this data will provide us with new stories; others will provide us with new ways of telling stories.

What are the risks inherent in relying on databases to inform journalism? How do you mitigate those risks?
You run the risk of thinking there is a story when there isn't one, or of simply misunderstanding the story. You mitigate that by understanding your data as deeply as you can -- what is supposed to go in each field, where the data comes from and how it's all entered and processed. This risk is compounded when you start crunching numbers and linking unrelated databases; a single bad formula could throw everything off. You can avoid this by vetting your process -- starting with the originating data and ending with your data-based conclusions -- with a third party, perhaps the subject of a story.

What can you do using databases that you couldn't otherwise do?
By filtering and sorting the data, you can, relatively quickly, look at volumes of information from a variety of different angles. With the right queries, you can dig for information that meets certain criteria -- perhaps to test a hypothesis. And you can use a database to give you a broad look at the information at hand, which frees you from relying on anecdotal information.

What can't you do using databases that you could through more conventional means?
Conventional reporting methods such as interviewing and developing sources can give you the "why" behind the data. Knowledgeable sources can also tell you what story the data should illustrate, so that you know what to look for. The example in Paul Bradshaw's post about the drop in certain kinds of parking offenses in London is a great example of the limitations of data -- it was a simple change in how the offenses were categorized. It was also an example of dirty data, which shows that data is only as good as the data entry.

Potential lecture discussion: I would be interested in discussing different ways to fact-check large data sets.

Terri Langford
Mon, 2010-10-11 17:48

Data provides a telephoto lens for a story. It allows users to zoom in and out quickly, giving them both the macro and the micro view of any particular subject.

Looking at the macro, users get an overall picture of a program or a history of a situation. Data can be used to tell a simple story, like how many people or how much money a particular agency or organization has at any given time or over time.

Or, data can zero in on the specific, showing the recidivism rate among criminals who commit a particular crime or how one type of government employee seems to quit more often than others.

Data is like any other research tool: it has to be vetted.
Often, it is dirty and incomplete. Anyone who uses data must be mindful of where it came from, how it was gathered and what isn’t there. Reporters must apply the same scrutiny to data that they give to interview subjects and documents.

But data, when it’s cleaned up and vetted, can zero in on the proof or direction your story is lacking, particularly when you cannot find people to talk to you about a certain subject.

Data by itself provides a thin story. It has to be used in concert with other elements.

And often, what data does not say can be the story. If a government agency refuses to count something that could improve the work it does, that’s a story.

For example, I often ask the state to count how many children who die of abuse came from families that were previously investigated by the state. The fact that deaths among previously investigated families are not a data benchmark, for a state agency charged with keeping children safe from abuse, is a story.
I’m working on a project right now that could not have been done without the direction data has given me.

However, I’m not relying on the data completely. I’m following the direction and complementing the story with old-fashioned shoe-leather journalism, to test the data and prove what it is telling me is true or false.

Data often can’t spot emerging trends or experiences. That you get from interviewing people. A district attorney who tells you they are starting to see a particular pattern of defendant or crime may not be able to point to data yet. The data may come months down the line.

Data, which by its very nature is cumulative, is lousy when it comes to the immediate. And it’s not terribly accurate if it hasn’t been programmed or “asked” to look for what you want to know, as in the example above with the abusive families’ histories.

Sometimes what you want to know isn’t the type of data that is collected. If a government agency fails to monitor a type of business it licenses, for example, then it is tough to determine if the businesses are legitimate. All you have is registration or license info, nothing about the financial health or the safety of the business.

So you have to layer that data with data from other agencies and, perhaps, other interviews to give a complete picture.

david mason
Mon, 2010-10-11 19:09

I wish this course would focus less on old-style databases and more on emerging linked open data, which today is already practical - governments around the world are releasing their data. Sure, there are times you need to normalize data and use ACID properties, but I'm not linking these terms because they're not that important for everyone. Computers are so powerful these days, and programs so sophisticated, that I'd suggest it's more worthwhile to explore semantic databases built on link relationships, which, whether ad hoc or formalized, let you codify not just your own data but data from around the web (and please do think about giving back; the rewards will be monumental).

As for the role of data, every day I run across people who are talking about journalism's role today, and it's clear that role is vitally important: to tell an informed, connected, responsible, comprehensible story with trustworthy facts. But I think things will become more complicated and involved, and data plays a part in that; like a puzzle, the more data there is, the more the missing pieces become evident. If data gathering can become like a game, we can step away from corporate control and into a more complex yet individually responsible reality that features journalism as a pinion.

For a multitude of projects I've developed ways to import data into a semantic database that can link people, places, times, etc. (http://canbudget.zooid.org is an example, but it needs more explanation). If anyone wants to try it out, write me a note (simple spreadsheets are the easiest to import).

Nick Judd
Tue, 2010-10-12 16:06

David,

Sadly, linked open data is not yet as practical — at least not for my sort of purposes — as we both would hope. Governments are just now starting to get used to the idea that they should be providing data, period. The arc of openness is long — but it does bend towards normalization. There are only so many Tim O'Reillys and Sir Tim Berners-Lees in the world, but in the US and UK, there are many people in government who are listening.

After our "101" discussion, I'd love to have a "201" discussion along the lines of "what can we build?"

For example, I think it would be hugely useful if there were tools that allowed journalists to use linked data to cut through the noise and get at the signal in the datasets we deal with every day.

To be more specific: A central problem to thumb-fingered journalists getting the full use of big datasets, I think, is that we often don't have the understanding -- both in subject-matter expertise and technical knowhow -- to normalize multiple tables, acquired from different sources, in order to run queries against joined tables.

Can hackers like you use linked data to help us solve this problem? How?

david mason
Tue, 2010-10-12 19:47

Thanks for your response, Judd. One of the main principles of formal Linked Open Data (LOD) is that any data can be joined to any other data (as long as you can find an identifying key). But I should have been clearer that it's still quite useful to understand and engage with low-level methods for importing, storing and accessing data using "traditional" tools; I just don't think everyone needs to get down to the often rather arbitrary and non-transferable nitty-gritty. Understanding the basics of coding (variables, keys, if/then, querying, regular expressions, and now inference) will always be a huge benefit. But clearly the opening of data is happening at a breakneck pace; I bet there is an open data initiative for the city and country of everyone on the list (here's an example — http://www.toronto.ca/open/catalogue.htm ). When we start adding inferences to these massive connected data sets, we're at a next level of distributed comprehension. So there are many opportunities there, and in six months to two years people will be able to understand how to find, navigate, organize or request data (which is not particularly a technical problem). I'm working on some examples of tying together different sources of data; otherwise, there are lots of examples of useful (unformalized) "mash-ups" on the Web. http://mashupguide.net/1.0/html/ch01s02.xhtml is an outstanding example.

Inference is of particular interest: it is artificial intelligence that can be used to augment discovery and build agents. When journalists start adding inferences to these large data sets, for example (I'm not a journalist!) flagging an unusual number of new restaurant licenses alongside health problems in an area, a lot of the groundwork could be done without any effort.

Mai Hoang
Wed, 2010-10-13 04:40

I was going to put in my (very late) response to the assignment, but this point made by Nick Judd caught my eye:

"To be more specific: A central problem to thumb-fingered journalists getting the full use of big datasets, I think, is that we often don't have the understanding -- both in subject-matter expertise and technical knowhow -- to normalize multiple tables, acquired from different sources, in order to run queries against joined tables.

The best part of yesterday's lecture was realizing that there are so many opportunities to find, import and organize data in new ways. I think the big revelation is that I actually have a lot to learn. I have always considered myself pretty computer-savvy, but many of the things that came up on Monday were very new to me.

I think the most encouraging thing, however, is that once I do learn it, there is an opportunity for me to provide and visualize data in ways I never have before as a reporter. Of course, the least encouraging thing is that it takes a lot of front-end work to get there.

David Mason happens to be my partner, so I know that the issue of linked data will be a discussion point as we work on our assignment for this week.

Anyway, back to my response:

I believe that data is not something you do one time, hoping that a story comes out of it. Rather, data is something you collect over time, and over time trends -- and a good story -- will emerge.

For example, the Sunlight Foundation's investigation of the CSS Action Fund began when editor Bill Allison came upon the group in the foundation's "Follow the Unlimited Money" widget.

Data can also serve as a component of a story you're already working on. For example, I wrote a story about real-estate-owned (REO) homes: foreclosed properties taken back by the bank after they fail to sell at a trustee sale. I already had data on median prices for my county collected over the last 24 months, so I was able to compare those with the median prices of REO sales. The median price wasn't the star of the story, but it provided a very important part of it.

And a database can sometimes be a great place to track evergreen or boilerplate facts. Chances are you won't win awards for keeping the data, but it comes in really handy when you need some random fact for a story.

I agree with many of the journalists here that data can't really tell you why you see something. I will repeat what others have said: you need to talk to experts and your sources about what's in the data.

And I think you do have to be mindful of the sourcing of the data. Even though it's raw data, Nick Judd was right in our lecture: chances are the data being offered is seen as benign to release. On the other hand, sometimes the data can expose holes or other revelations that maybe were not realized at the time of the data's release.

And I think the other challenge of data is, again, that you will have to wait, and as others have said here, sometimes a story won't come out of it. In that case you just learn to move on and shelve the data.

I think the greatest takeaway I got from the readings is that there is a lot of data available and there are ways to collect and organize it. The next challenge for me is to learn all the techniques!

david mason
Wed, 2010-10-13 05:03

I think it's also worth mentioning that data is a way to engage with readers, whether it's providing a bare fact about their neighbourhood, enlisting their help in collecting it, or talking about the emerging data phenomenon. The participatory Web and the evolution of journalism (I presume :]) is also a huge topic to be covered for hacks and hackers. Most of the news sites I visit treat the people who comment like riff-raff, which is very disappointing to me, since there are some very informed opinions there and they are representative of interested readers.

Phillip Smith
Thu, 2010-10-14 18:09

Great conversation & resources folks. :) Let's keep it going!

Nick Judd
Thu, 2010-10-14 20:21

David —

The identifying key is the trickiest part! Help me walk the class through how to best address these dilemmas from a hacker's perspective.

I hope this gets everyone's gears turning.

Here's what a "201"-level response to my assignment might look like, based on joining data. My apologies to anyone who might have thought of this to fulfill the assignment:

Take the White House visitor logs, the OpenSecrets.org database of lobbyists, and the OpenSecrets.org database of lobbying as received and cleaned from the Senate Office of Public Records.

Suppose you want to see all the lobbyists who have been to the White House since the Obama administration started releasing these records.

In theory, you join the lobbyists database to the lobbying database:

The lobbying database only contains information about each lobbying firm and its clients, and the lobbyists database only contains information about lobbyists, but you can join them because each record of a lobbyist seems to pertain to a single document number that is also pertinent to only one lobbying firm.

What's more, because the database is so large, you have to output the result of that join to a temporary table and create an index on it -- otherwise queries to that database would take forever to resolve, or would never return results. Hacks: an index does exactly what you'd think it does -- it allows the RDBMS you're using to more quickly search the database for relevant records. But to do this quickly, the "page number" side of the index might only be generated from the first few characters of the column (think "subject") you're using to create the index. The fewer characters you use, the faster the index will be, but when you run a query and the RDBMS uses that index to find your results, you're also more likely to get the wrong result.

A hacker friend of mine fixes this somewhat by generating an additional column on both tables to serve as the source material for the index. Rather than indexing on the often-similar contents of a single column with first and last names, he fills that column with the output of an algorithm called MD5, which generates what's called a "hash" -- a sequence of letters and numbers -- based on each name. This should allow an index that's both reasonably fast and accurate.
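In rough code, that hash-column trick might look something like the sketch below, using Python's built-in sqlite3 and hashlib modules. The table and column names (lobbyists, visitor_logs, lobbyist_name, visitor_name) are invented for illustration, not the actual OpenSecrets or White House schemas, so treat this as a minimal sketch rather than a recipe.

import hashlib
import sqlite3

def name_key(name):
    # Normalize a name and return its MD5 hex digest to use as a join key.
    normalized = " ".join(name.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

conn = sqlite3.connect("lobbying.db")
conn.create_function("name_key", 1, name_key)  # expose the function to SQL

# Add a hash column to each (hypothetical) table, fill it, and index it.
for table, column in [("lobbyists", "lobbyist_name"),
                      ("visitor_logs", "visitor_name")]:
    conn.execute("ALTER TABLE %s ADD COLUMN name_hash TEXT" % table)
    conn.execute("UPDATE %s SET name_hash = name_key(%s)" % (table, column))
    conn.execute("CREATE INDEX idx_%s_hash ON %s (name_hash)" % (table, table))
conn.commit()

# Join lobbyists to visitor-log entries on the hash instead of the raw names.
matches = conn.execute("""
    SELECT l.lobbyist_name, v.visit_date
    FROM lobbyists AS l
    JOIN visitor_logs AS v ON v.name_hash = l.name_hash
""").fetchall()

What this buys you is a compact, uniform join key; it still only matches identical (normalized) names, so the "John Smith" problem below remains.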

Thus you can connect individual lobbyists to possible clients, and then compare the names on that list -- through your "hash" -- with the names on the White House visitor logs.

What you have at the end of this is a list of lobbyists who have the same names as people who visited the White House. Not immediately useful with lobbyists that have names like "John Smith."

With additional drilling on each name, you can then get a better guess if this person or that person really is a lobbyist.

For example, say you go out and scrape the White House phonebook so you have a list of the names, contact numbers and departments for everyone in the administration.

(Extra credit assignment!!)
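For the hackers, a bare-bones version of that scrape might look like this in Python with requests and BeautifulSoup. The URL and the assumption that the directory is a simple HTML table of name / phone / department rows are hypothetical placeholders; a real directory would need its own parsing logic.

import csv
import requests
from bs4 import BeautifulSoup

DIRECTORY_URL = "https://example.gov/staff-directory"  # placeholder URL

html = requests.get(DIRECTORY_URL).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) >= 3:  # expecting name, phone, department columns
        rows.append(cells[:3])

with open("phonebook.csv", "w") as f:
    csv.writer(f).writerows(rows)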

Then, for each name, assume the visitor is a lobbyist, look at his or her client, and see if the name of the client makes sense based on the department of the visitee.

If the visitor's client is, for example, a Native American reservation, and the visitee is in the Department of the Interior (which is responsible for Native American affairs), you may be in business.

But that requires the bulk generation of inferences. You're talking about doing that for each of hundreds of names. And even if you get those names, what is the utility to you? You have evidence suggesting that a lobbyist paid by a reservation has visited someone in a position to influence policy.

Unless the date(s) of that visit(s) can be juxtaposed against, for example, the dates of important policy decisions — in this case from Interior — or against the issuance of a grant or contract award, the utility is minimal.

But wait! The Federal Register, which lists announcements of administration decisions, is now released daily in XML. And you can get data dumps from USASpending.gov, which tracks federal spending.

... And now you're generating a new table, one that lists the visits from probable lobbyists within, say, three months of a contract award to the client or a mention of the client in the Federal Register. How many different joins have you made at this point? And how many possible points of failure?
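To make that last step concrete, here is one hedged sketch of the date-window query in Python/SQLite, again with invented table and column names (visits, awards) standing in for whatever you actually built from the visitor logs and the USASpending dumps.

import sqlite3

conn = sqlite3.connect("lobbying.db")

# Flag visits that happened up to 90 days before a contract award to the
# same client. Every assumption here (table names, clean client names,
# parseable dates) is itself one of those possible points of failure.
flagged = conn.execute("""
    SELECT v.visitor_name, v.client, v.visit_date, a.award_date, a.amount
    FROM visits AS v
    JOIN awards AS a ON a.recipient = v.client
    WHERE julianday(a.award_date) - julianday(v.visit_date) BETWEEN 0 AND 90
""").fetchall()

for row in flagged:
    print(row)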

By the time you figure out how to juxtapose these things in a meaningful way, how much time have you invested in this project? And what's your likelihood of success?

This is what I was talking about in lecture when I suggested that you be careful how much time you invest in projects involving large and complex databases.

Now, hackers: What was wrong with my approach? How could I streamline it? And what tools could you build to streamline this type of process for future hacks?

And hacks: Am I asking the right questions? What would you do with the same type of information?

Nick Judd
Thu, 2010-10-14 20:25

Another good point, David. Nick Diakopoulos, my colleague and advisor in this course, points us here: http://americanpublicmedia.publicradio.org/publicinsightjournalism/

American Public Media is building a database of experts among its "crowd," to whom they can turn for expert opinion.

david mason
Thu, 2010-10-14 23:05

If I were doing a one-off, I'd probably use denormalization (http://en.wikipedia.org/wiki/Denormalization) or store the data in separate 'tables' and write a program to associate and analyze them. For the results and extra/ongoing aspects, I'd import the interesting data into an editable semantic system (e.g. SMW) and use an agent approach to periodically update it, although it could only store millions of linked entities.

Nick Judd
Fri, 2010-10-15 01:12

... could you translate that for the hacks? :D

david mason
Fri, 2010-10-15 04:51

First off, I should say I didn't really mean denormalization; more like unnormalized. Store the data as a convenient blob of fields, which means there's duplication but querying is easier. So instead of having to create a join on a key and do complex queries because your design is so efficient that each item is only represented once, you're querying on a list of items, maybe using 'distinct' to narrow down as required. Basically you're treating the RDBMS as a big spreadsheet, duplicating each combination; for a specific purpose, rather than a generalized system, it works. The table could be:

Visitor|Organization|Department visited|Visit Date
John Smith|Ciggies Inc.|Dept of Health|June 1, 2010
Johnathan Smith|Tell It To The Hand, LLC|Dept of Health|June 5, 2010
Johnathan Smith|Tell It To The Hand, LLC|Dept of Health|June 15, 2010
John R. Smith|Ciggies are Healthy! Research|Dept of Health|July 15, 2010

and I would use a program to compare these to other tables...

1. for each $org, $dept in "select organization, departmentVisited from visits"
1.1 for each $clientInterest in "select clientInterest from orgclients where org = $org" do
1.2 "select true from govDept where dept = $dept and deptInterest is $clientInterest"
1.2.1 if not true, print "What the heck is $org meeting with $dept about $clientInterest for?"

Output might be:

What the heck is Ciggies are Healthy! Research meeting with Dept of Health about Letting the public know about the benefits of ciggies for?

Since it's a program I can also do things like find different varieties of names, companies, etc.

Of course this is simplified but I hope it makes sense.
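For anyone who wants to run the idea rather than read pseudocode, here is roughly the same comparison in plain Python, with the tables replaced by in-memory lists and dicts. The data and the "interest" values are the made-up ones from the example above.

# Hypothetical denormalized "visits" rows, as in the table above.
visits = [
    {"visitor": "John Smith", "org": "Ciggies Inc.",
     "dept": "Dept of Health", "date": "June 1, 2010"},
    {"visitor": "John R. Smith", "org": "Ciggies are Healthy! Research",
     "dept": "Dept of Health", "date": "July 15, 2010"},
]

# organization -> the client interests it lobbies for (orgclients).
org_clients = {
    "Ciggies Inc.": ["Tobacco marketing"],
    "Ciggies are Healthy! Research":
        ["Letting the public know about the benefits of ciggies"],
}

# department -> the interests it would plausibly meet about (govDept).
dept_interests = {
    "Dept of Health": ["Public health", "Food safety"],
}

for visit in visits:
    for interest in org_clients.get(visit["org"], []):
        if interest not in dept_interests.get(visit["dept"], []):
            print("What the heck is %s meeting with %s about %s for?"
                  % (visit["org"], visit["dept"], interest))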

For the canbudget project, I used Semantic MediaWiki. It couldn't easily handle tens of millions of entries, but for hundreds of thousands of entries (a narrowed-down list?) it can be useful. I entered the items in a spreadsheet, but any data source can be used. Once you have a data source, it's imported and can be viewed, queried and updated.

Say http://canbudget.zooid.org/wiki/2010/G20 is "White House visits," and say the first Cost item is a visit. They're ordered by number (cost); let's say it's by number of visits.

The query for this single item is

{{ #ask: [[Topic::White house visits]]
|?Visitor name
|?Organization
|sort=Visits
|order=Desc
|limit=1}}

which would yield one item, also represented as http://canbudget.zooid.org/wiki/Special:Ask?title=Special%3AAsk&q=[[Topic%3A%3A2010%2FG20]]%0D%0A&po=Supplier&sort_num=&order_num=ASC&sort[0]=Canadian+dollar+cost+2010&order[0]=DESC&eq=yes&p[format]=broadtable&p[limit]=1&p[headers]=&p[mainlabel]=&p[link]=&p[intro]=&p[outro]=&p[default]=&eq=yes (long result directly into a query, check the bottom for the result)

but we're in a semantic wiki, so

http://canbudget.zooid.org/wiki/2010/G20_and_G8_Budget/License_for_use_o...

Click "edit to form" to see what updating looks like.

Or a clickable graph http://canbudget.zooid.org/wiki/Graph_of_procurement_process_and_topic , timeline drill-down (facet) browser http://canbudget.zooid.org/wiki/Procurement_exhibit , timeline http://canbudget.zooid.org/wiki/Dates etc

We have canned queries on each data type; procurement process:

http://canbudget.zooid.org/wiki/Sole_Source

Supplier:

http://canbudget.zooid.org/wiki/GTAA

You can also add inline annotations: "class of Smallville 1995" could be a subproperty of "went to school together," which in turn could be a subproperty of "probably knows." Then you can do broad-to-narrow queries on "probably knows" or any other annotation.

You can add inferences to a template: for example, if an address is within 40 miles of a Ciggies Inc. location, add it to the category "Close to Ciggies Inc." These categories can be queried.
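Outside of SMW's own template syntax, that kind of inference is just a distance check. A sketch in Python using the standard haversine formula, with made-up coordinates:

import math

def miles_between(lat1, lon1, lat2, lon2):
    # Great-circle distance in miles (haversine formula).
    r = 3959  # Earth's radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

ciggies_hq = (38.9072, -77.0369)  # hypothetical coordinates
address = (39.2904, -76.6122)     # hypothetical coordinates

if miles_between(ciggies_hq[0], ciggies_hq[1], address[0], address[1]) <= 40:
    print("Add to category: Close to Ciggies Inc")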

Since it's a semantic system, marked-up text has meaning and can be reused and constantly built on.

Because it's a wiki with a front end, you can open it up to a group of people to query or edit with forms. It's standard open source software so you don't have to "invent" or maintain anything, though you can extend it as needed.

We're starting to see ways to use linked open data (LOD), so your data can include data from sites like Freebase and "official" sources. Here's a sample list of endpoints today: http://esw.w3.org/SparqlEndpoints . http://www.data.gov/ might be of particular interest to Usonians. With LOD the "back end" doesn't matter as long as it conforms to the data/query standard, so the data sets can be very large and distributed.
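If you want to see what querying one of those public endpoints looks like without hosting anything yourself, here is a small Python sketch using the SPARQLWrapper package against DBpedia. The endpoint and the example query (a handful of newspapers) are just illustrations of the pattern, not part of any of the projects above.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?paper ?name WHERE {
        ?paper a dbo:Newspaper ;
               rdfs:label ?name .
        FILTER (lang(?name) = "en")
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

# Print the English labels of the first ten newspapers the endpoint returns.
for result in sparql.query().convert()["results"]["bindings"]:
    print(result["name"]["value"])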

Enjoy! (but stay away from the ciggies)

david mason
Fri, 2010-10-15 05:01

I forgot to mention, with regard to "agents": MediaWiki (the software basis of Semantic MediaWiki, and the same software used by Wikipedia) supports a programming interface, so people can write programs to automatically (periodically) query and update content. There's quite a huge ecosystem of "bots" for Wikipedia.
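As a taste of what those bots look like, here is a minimal read-only example against the standard MediaWiki web API (api.php) using Python's requests; the wiki URL is a placeholder, and a bot that actually edits pages would also need to log in and fetch an edit token first.

import requests

API_URL = "https://example.org/w/api.php"  # placeholder wiki

# Ask the API for the ten most recent changes, in JSON.
resp = requests.get(API_URL, params={
    "action": "query",
    "list": "recentchanges",
    "rclimit": 10,
    "format": "json",
})

for change in resp.json()["query"]["recentchanges"]:
    print(change["title"])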

david mason
Fri, 2010-10-15 05:23

I made a couple of errors above, and it's hard to read; there's a version at http://canbudget.zooid.org/wiki/Churnalism

"Churnalism" refers news generated based on stats, of course the point here is people would work together to make real stories out of this.

"Potential lecture

Phillip Smith's picture
Phillip Smith
Wed, 2010-10-20 22:05

"Potential lecture discussion: I would be interested in discussing different ways to fact-check large data sets."

Great idea, Steve. Wonder if it would be worth posting that over on help.hackshackers.com to see what wisdom the crowd has to offer on the topic?

Phillip.

"I wish this course would

Phillip Smith's picture
Phillip Smith
Wed, 2010-10-20 22:12

"I wish this course would focus less on old style databases and more on emerging linked open data"

Hey there David,

Appreciate the recommendation, and I'm a big fan of linked data; see my comments on the list of topics that lead to this course: http://help.hackshackers.com/questions/611/hacks-hackers-and-mozilla-wan...

The biggest challenge I see around integrating linked data into a course like this is the availability of linked data catalogs. Typically, governments are not releasing their data in a way that is query-able at all (like an API), let alone as a linked data repository. More often than not, I see a list of CSV or XLS files, or -- on a good day -- maybe KML or shapefiles, or XML.

So, I'll kick the question back to you David: Where would you point participants to find linked data datasets that they can use today, without having to load up the data themselves into a repository or semantic wiki?

Phillip.

Phillip Smith
Wed, 2010-10-20 22:18

"I hope this gets everyone's gears turning."

Love it, Nick. :)

david mason
Wed, 2010-10-20 23:22

I might have been more precise and said "I wish this course could" (instead of would), but there are today many linked open data (LOD) sources (you pointed out http://www.data.gov/semantic/index , and most regions in the world are working on this), and kits including Drupal 7 and Ruby frameworks support LOD features natively. Semantic MediaWiki, my favourite kit (for reasons I'm happy to go on about at great length), allows easy import of spreadsheet-style data and interaction with LOD. In many cases the best approach is to create local databases, but many of the biggest benefits going forward are going to come from LOD, not least because of the phenomenon itself (puzzle pieces becoming available) but also because of what it offers.

Mariano Blejman
Mon, 2010-10-25 01:05

The situation regarding digitized public information in Argentina seems to differ greatly from the situation in the central countries. In general, digitized public information from public institutions is not as developed as it is in the United States. Journalists' management of digital resources is still not well established, and national newspapers generally have not developed strong data and interactive teams to normalize public information quickly and with journalistic criteria. On the other hand, social non-governmental organizations seem to work more on "sending" information to news media than on developing information that journalists can access themselves. In many cases, Wikipedia articles are not fully developed and their sources are not confirmed, so the work to be done is even larger. There are huge tasks to do and great social issues that deserve a lot of distributed community work, in which databases can play a crucial role. I'll give an example, and I especially invite the fellow programmers to think about solutions for it: a long-delayed trial is currently taking place in Argentina for crimes against humanity committed during the last military dictatorship, between 1976 and 1983.

http://www.juicioprimercuerpodeejercito.blogspot.com/

http://www.pagina12.com.ar/diario/elpais/1-152781-2010-09-08.html

Each morning, about five or six parallel trials take place in domestic courts. Dozens of witnesses testify each day, and investigators, prosecutors and human rights organizations are wondering how to organize the information that comes out of them. For now, prosecutors and complainants have not gone much further than trying to get the information down on paper in informal notes of what is happening there. I think this is an unprecedented opportunity to demonstrate what has been talked about in this course: perhaps an organization, through a wiki platform or a distributed system of access to information, could give the testimony heard each day (the trials have been going for a couple of years and will surely go on for a couple more) the possibility of being organized, so that such a wealth of information can reveal relationships that have not yet been found. This is a great challenge for newspapers, with great social implications. Is anyone interested in working on this?