This is the P2PU Archive. If you want the current site, go to www.p2pu.org!

Open Journalism & the Open Web

Assignment for week one. Ideas

I've been thinking about how to solve this week's assignment: pulling data programmatically. My first thought was to parse the page and find all the <a> </a> tags, where the href would be the source and the text inside the tags would be the fact related to that source.

First problem with that: some texts simply don't have hyperlinks. Bad journalism, I know, but you can find plenty of that out there.
Second problem: what's inside the <a> tags is not always the fact. Maybe we should pull the data with a script and then check the info manually...
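
The link-parsing idea above could be sketched roughly like this. This is only a minimal illustration using Python's standard html.parser module; the class name, the function name and the sample HTML are made up for the example and are not taken from any real article. It also shows both problems: the last paragraph has no link at all, and the anchor text "here" is not a fact.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, anchor text) pairs found in the HTML."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical snippet, invented for this example.
sample = (
    '<p>The <a href="http://example.com/report">official report</a> says '
    'turnout rose. Read more <a href="http://example.com/more">here</a>.</p>'
    '<p>Officials denied the claim.</p>'
)

for href, text in extract_links(sample):
    print(href, "->", text)
```

The output would still need a manual pass, since the script has no way of knowing which anchor texts are actually facts.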

Any ideas?

Michael Morisy
Wed, 2010-09-22 20:09

I've been going over this in my head a few times, and I can't think of many ways that this works out easier programmatically than manually with all the edge cases and double checking.

My initial thought was to run the news url page through something like readability to get rid of the cruft, use some basic regex to pull out everything between quotation marks for the quote, and then ... what?
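
For what it's worth, the "basic regex" step might look something like the sketch below. It assumes the article text has already been cleaned up, it handles only straight and curly double quotes, and the sample sentence is invented. It will mis-pair quotes in messy text, which is one of the edge cases mentioned above.

```python
import re

# Text between an opening and a closing double quote (straight or curly).
QUOTE_RE = re.compile(r'["\u201c]([^"\u201c\u201d]+)["\u201d]')


def extract_quotes(text):
    """Return every quoted passage found in the text, in order."""
    return [match.strip() for match in QUOTE_RE.findall(text)]


# Invented sample sentence, not from a real story.
sample = (
    'The mayor said, "The pizza was late," and left. '
    'A witness added: \u201cIt was cold.\u201d'
)

for quote in extract_quotes(sample):
    print(quote)
```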

Maybe an OpenCalais run to highlight all the sources, since it does name recognition?

There's an old journalism rule of thumb that each paragraph is supposed to be a single idea or a single fact, but I think that might be an oversimplification.

david mason
Wed, 2010-09-22 20:38

I am going to use Semantic Mediawiki to create a set of re-usable pages that have relationships between each other, with queries and views like timelines, maps, etc.

An annotated phrase would end up looking something like:

On [[Date::Sep 26, 2010|Sunday]], [[witness::John Smith]] said "[[statement::The pizza was late]]."
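
As a rough illustration of what can be done with a phrase annotated like this outside the wiki, the sketch below pulls the property/value pairs out with a regular expression. It assumes only the simple [[property::value]] and [[property::value|display text]] forms shown above; the function names are hypothetical, and Semantic MediaWiki itself does far more than this.

```python
import re

# [[property::value]] or [[property::value|display text]]
ANNOTATION_RE = re.compile(r"\[\[([^:\]|]+)::([^\]|]+)(?:\|([^\]]*))?\]\]")


def parse_annotations(wikitext):
    """Return (property, value) pairs in the order they appear."""
    return [
        (m.group(1).strip(), m.group(2).strip())
        for m in ANNOTATION_RE.finditer(wikitext)
    ]


def plain_text(wikitext):
    """Replace each annotation with its display text, or its value if none."""
    def visible(m):
        return m.group(3) if m.group(3) is not None else m.group(2)
    return ANNOTATION_RE.sub(visible, wikitext)


phrase = (
    'On [[Date::Sep 26, 2010|Sunday]], [[witness::John Smith]] said '
    '"[[statement::The pizza was late]]."'
)

for prop, value in parse_annotations(phrase):
    print(prop, "=", value)
print(plain_text(phrase))
```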

The result will look something like http://subvention.zooid.org/wiki/People (the source is fully accessible)

I haven't selected an article yet, but if it has a lot of nameable entities I might use either gate.ac.uk or OpenCalais. However, people are better annotators, so I'd have to refine these annotations.

Fernando Alvarez
Wed, 2010-09-22 21:41

David, I like the idea of Semantic MediaWiki; in long-term projects the possibilities of well-connected databases can be awesome.

I'm checking out OpenCalais and it looks great; I didn't know about it.

Anyone else have more ideas?

david mason
Wed, 2010-09-22 21:12

Well connected databases. :)

Fernando Alvarez
Wed, 2010-09-22 21:40

:) thanks

Steve Myers
Thu, 2010-09-23 04:40

I picked a very basic story, http://cityroom.blogs.nytimes.com/2010/09/22/governors-race-suddenly-tig... . There's nothing special about it, except that it does seem to have a combination of verifiable and sourced facts and those that are harder to verify and source.

Steve Myers
Thu, 2010-09-23 16:30

Deconstructing the story by hand was a pretty interesting exercise. I thought about different types of facts before I did it and came up with these categories:
-identifiable source
-identifiable but unnamed source, able to be inferred from other references
-unknown source
-opinion, belief or analysis (anything other than a fact)
-prior reported fact or common knowledge (such as biographical or historical information)
-reporter's or news org's observation or knowledge (which I surmised)

As you can see, there are gradations of sourcing evident here. And there's lots of room for argument.

In the story I picked, I saw that everything in the story could be assigned to five of those categories. The most common ones were
-unknown source
-opinion, belief or analysis (anything other than a fact)
-identifiable source
-reporter's or news org's observation or knowledge (which I surmised)

Thoughts?

Steve Myers
Fri, 2010-09-24 04:47

From reading Chris' comment, I realize that one thing is worth noting here -- that this analysis is totally based on how I think the information was obtained, in part based on my knowledge of how reporters and news organizations work. Indeed, there is little in the story that looks behind the curtain, so to speak.

Chris Nicholson
Thu, 2010-09-23 20:47

My take on this assignment is the simple fact that a lot of online journalism is effectively just a web version of the dead tree press. As an example, I took a look at one (admittedly extreme) story from a well-known tabloid rag that often runs stories like this.

HOW MIGRANTS SNATCHED OUR HOMES
http://www.dailyexpress.co.uk/posts/view/201280

The story is typical of a Daily Express narrative that has been promoted for a few years; that immigration is out of control in the United Kingdom. There is certainly evidence for and against, but the Express always run stories that confirm their inherent bias.

The Daily Express is also quite typical of a lot of newspapers (from broadsheet to tabloid, from right-wing to left wing) in that it has very basic interactivity. The "Have Your Say" section has been disabled and there are *no* hyperlinks. In that respect, we have *no* real metadata that we can pull from this, aside from the text itself. There are a number of progressive sources of journalism (such as the BBC) that are moving forward in storing their information in the form of XML. But the Express is very bog-standard text; only basic structural HTML in there too, in the form of visual blocks.

A simple tool would be one that recognises people's names, and perhaps even validates some of their quotes in the "MIGRANTS" Express article. I follow a large number of "watchdog tabloid blogs" and most of them find that a lot of quotes are made up or embellished.

I mention this just in case anyone in this thread had any romantic ideas that UK journalism was structured, double-sourced and literate, because it often isn't any of those things! So, how would we be able to use programmatic tools to gain any knowledge of sources in such an article?

David Medinets
Fri, 2010-09-24 00:56

>So, how would we be able to use programmatic tools to gain any knowledge of sources in such an article?

Perhaps an algorithmic visualization of the overlap in coverage across many news outlets. If you see the same sources in all left-leaning outlets but only 2% of right-leaning that might allow some conclusions to be drawn. You could even create similarity word vectors and cluster around common themes. Granted this kind of comparison is beyond the ability of a single journalist but well within the ability of an organization like the Knight Foundation. Over time, you could place each article on a left-right continuum based on which outlets wrote it. I'm half-joking about that last comment.
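
To make the word-vector part a little more concrete, here is a minimal sketch of the similarity step only (no clustering, no left-right scoring). It builds plain word-count vectors and compares them with cosine similarity. The outlet names and one-line "articles" are invented for the example, and a real comparison would need much more text and some stop-word handling.

```python
import math
import re
from collections import Counter
from itertools import combinations


def word_vector(text):
    """Bag-of-words vector: lowercase word -> number of occurrences."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical outlets and texts, invented for this example.
articles = {
    "outlet_a": "Migrants are taking council homes, a campaign group said.",
    "outlet_b": "A campaign group said migrants are taking council homes.",
    "outlet_c": "Hurricane Igor brought heavy wind and rainfall to the coast.",
}

vectors = {name: word_vector(text) for name, text in articles.items()}

for left, right in combinations(sorted(vectors), 2):
    score = cosine_similarity(vectors[left], vectors[right])
    print(left, right, round(score, 2))
```

Pairs of outlets with high scores are covering a story in similar words; grouping outlets by those scores would be the clustering step.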

Steve Myers
Fri, 2010-09-24 04:43

This is interesting. I think there's potential in a broad look at what sources are used in what types of media.

Chris Nicholson
Thu, 2010-09-23 20:55

I read the NY Times occasionally and admire its journalistic zeal (hence the reason I was very impressed with Steve Myers' approach). I almost went down the same route, by using a similar article from the NY Times (or The Guardian UK or the BBC UK), but figured I'd do something a bit different, by choosing a piece of journalism that wasn't as well-structured and well-written.

Using Steve's system, the sources only really fall into two categories for the Express article, since many of the quotes utilised (such as the Taxpayers Alliance and Philip Davies MP) are just copied-and-pasted from older stories of a similar vein.

Opinion, belief or analysis (anything other than a fact)
Identifiable source (but with quote rejigged)

Mariano Blejman
Fri, 2010-09-24 05:02

Hi,
It's been an absolutely difficult week here in Argentina. I couldn't participate much, and I missed the Wednesday chat, but I have been obsessively reading all your posts and links. Anyway, I am still going. If someone can help me, I need a definition of "reasonable length": how long do you think the work will be, in characters posted to the page? I haven't picked an article yet, but I think that if I pick a Spanish story, I won't be able to share my exercise with you. Another question: when does this week end? When is the deadline? Sorry to post all these basic questions...

David Medinets
Fri, 2010-09-24 05:27

Excellent questions. If you pick a fairly short Spanish article we can translate it using a tool like http://babelfish.yahoo.com/. That might be an interesting twist on the assignment.

Michael Roberts
Mon, 2010-10-04 19:23

Hey, all, I've been insanely busy with the paying work, and it's just today letting up (well - I slept on the weekend, it was refreshing). Mariano, if the article is short-ish, I'll translate it. It's what I do for the paying work anyway, so I'm fast.

Rick Martin
Fri, 2010-09-24 06:30

Here comes the story of the Hurricane: http://1rick.com/project/igor (so far, anyway)

After seeing this video round-up by the CBC, I whipped up a map of hurricane videos using Spreadsheet to CSV to Google Fusion Tables. Not the most elegant solution, but after failed attempts in other formats, this worked ok.

I'm thinking about taking this chunk of wind and rainfall data and visualizing it somehow as well.

Phillip Smith
Mon, 2010-10-11 18:11

>Here comes the story of the Hurricane: http://1rick.com/project/igor (so far, anyway)

Rick: Very interesting. I'm going to mention this in today's lecture. :)

Michael Roberts
Mon, 2010-10-04 19:25

I think in general URLs aren't going to be helpful for this sort of thing. This is a classic case of textual analysis. The tools are research-worthy, i.e. not mature. We're on the verge of really interesting possibilities in this area, though.

However, if you are looking at URLs, don't use regexps! Use Perl and HTML::TreeBuilder. It has a real live parser and will give you a lot better resilience.
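
A small illustration of why a real parser is more resilient than a regexp. Note that this sketch swaps in Python's standard html.parser rather than the Perl HTML::TreeBuilder module recommended above, purely for illustration; the naive pattern, the class name and the messy sample markup are all made up.

```python
import re
from html.parser import HTMLParser

# A deliberately naive pattern: only matches <a href="..."> written exactly so.
NAIVE_HREF_RE = re.compile(r'<a href="([^"]+)">')


class HrefParser(HTMLParser):
    """Collect href values from every <a> tag, however it is written."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def hrefs_with_parser(html):
    """Return all link targets found by the parser, in document order."""
    parser = HrefParser()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Invented markup with the kind of variation found in real pages.
messy = (
    '<a href="http://example.com/one">one</a> '
    "<A HREF='http://example.com/two' class=x>two</A> "
    '<a\n  title="t" href="http://example.com/three">three</a>'
)

print("regexp:", NAIVE_HREF_RE.findall(messy))
print("parser:", hrefs_with_parser(messy))
```

The regexp could be patched for each of these cases, but every patch invites another edge case, while the parser already understands the markup.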