Open Journalism & the Open Web
Week 1: Assignment
The idea is to think about stories in new ways:
- Pick a story: a print story of reasonable length; an audio or video story is fine, although we'd recommend finding one that includes a transcript.
- Pry out the data: sources and facts. Are sources named or unnamed? Which facts are attributed to each source, and which come from a single source, an anonymous source, or multiple sources? Data can be pulled manually or programmatically (see the sketch after this list).
- Organize this information.
- Post your findings as a comment, right here on this page. It can be text or a spreadsheet, or use a Google Docs spreadsheet and post the link.
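For anyone pulling the data programmatically rather than by hand, here is a minimal Python sketch of one possible starting point: pulling direct quotes and the attribution text that follows them with regular expressions. The attribution verbs and the sample sentence are placeholders, not drawn from any particular story, so adjust both for your own piece.

# Rough first pass at "prying out data" from a story programmatically:
# pull direct quotes and the text immediately after them, which often
# names the source. The verb list and sample sentence are placeholders.
import re

ATTRIBUTION_VERBS = r"(said|says|told|added|according to)"

def pull_quotes(text):
    """Return (quote, attribution snippet) pairs found in the text."""
    pairs = []
    # Match "quoted material" followed by up to 60 characters of context.
    for match in re.finditer(r'"([^"]+)"(.{0,60})', text):
        quote, tail = match.group(1), match.group(2)
        attributed = re.search(ATTRIBUTION_VERBS, tail)
        pairs.append((quote.strip(), tail.strip() if attributed else "UNATTRIBUTED"))
    return pairs

if __name__ == "__main__":
    sample = '"We are ready for the games," a spokesperson said on Tuesday.'
    for quote, attribution in pull_quotes(sample):
        print(quote, "->", attribution)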
Comments
I chose a story that talks
I chose a story that talks about shady financial dealings at the Vatican. I used Referata.com to create a free semantic wiki page at http://open-journalism-and-open-web.referata.com/wiki/David_Medinets. I've attempted to add a bit of semantic tagging to the article. I did not copy the whole article, to avoid copyright issues (or at least, hopefully, to avoid them). At the bottom of the page, the tags (or properties) are displayed in a table. That feature, pulling the tags out into a table, enables the user to skim the article content to decide if the article is relevant. It's more useful than I had thought it would be.
I am going to look at this
I am going to look at this story from the Guardian as it touches on international relations, health issues and sport:
http://www.guardian.co.uk/sport/2010/sep/23/commonwealth-games-scotland-...
I may well create a small application to go with it!
Wow, David Medinets, very
Wow, David Medinets, very cool. :) I'm sure it's evident how all the pages could be filled in to constantly create more detail. The meaning of the annotations can also be expanded with different types, and different views can be added. Eventually, combining this with the linked-data approach, including details like "budget breakdown of a region" or "who's formally responsible for that", will require little work.
A note that aside from Referata, Wikia offers Semantic MediaWiki hosting, and it's an extension to the (free) open-source MediaWiki, so anyone can set it up on their own server.
My partner is David Medinets
My partner is David Medinets and we're still going back and forth on our assignment. However, I was inspired by his semantic wiki and wanted to put something out there that I'm messing with. I'm looking for an efficient way to do source analysis for the news site I'm working with--and I'm talking source in the hack sense.
What I want to do is take a period of time and look at every quoted source in every published article. For my little experiment, I took a feature I wrote in 2009 (because I have the copyright) and created a semantic wiki page for it. I labeled every quoted source using a simple taxonomy:
Source Class 1: In Person Interview
Source Class 2: Phone Interview
Source Class 3: Email Interview
Source Class 4: Observed
Here's how it looks (scroll to the bottom for a breakdown by source class):
http://sourceanalysis.referata.com/wiki/Main_Page
I didn't create a class for facts gleaned from off-the-record sources or documents but you get the idea.
So how is this useful? In this example, I'm satisfied with the quantity of sources named in the piece. Nearly all of my interviews were done in person or over the phone. However, ALL of my quoted sources were men. Since my sources were mostly sports writers who were 50+, this is not a huge surprise. But a look at my original source list shows that I reached out to five women sources. We weren't able to connect by deadline, but that's my fault for not being more persistent.
Looking at articles published by the news organization I'm working with now, I'd create a class for spokespeople (our reporters seem to lean heavily on them) and quoted press releases.
I can also see a utility in creating a class for the quotes themselves, to have a look at just how substantive a reporter's quotes are--though this could be misleading when a reporter sets up a simple quote with something of substance they pen themselves.
As a hack, this little experiment comes from a concern I have that reporters consumed with feeding the beast are making fewer phone calls and doing fewer in-person interviews--which means they are not building the sorts of relationships that make a reporter effective.
I have no doubt there is a more efficient way to do this sort of analysis for a large number of articles, but I have no idea what that is.
Hackers, you have any thoughts?
And to the hacks: what other classes would you create?
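(One minimal sketch of how the counting step could be automated, assuming each quoted source is first logged to a CSV with hypothetical "article" and "source_class" columns using the taxonomy above; the batch extraction itself is the hard part and is not addressed here.)

# Sketch: tally source classes across a batch of articles, assuming the
# quoted sources were exported to a CSV with hypothetical "article" and
# "source_class" columns (e.g. "Phone Interview", "Email Interview").
import csv
from collections import Counter, defaultdict

def tally_source_classes(csv_path):
    per_article = defaultdict(Counter)
    overall = Counter()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            per_article[row["article"]][row["source_class"]] += 1
            overall[row["source_class"]] += 1
    return per_article, overall

if __name__ == "__main__":
    per_article, overall = tally_source_classes("sources.csv")
    print("All articles:", dict(overall))
    for article, counts in per_article.items():
        print(article, dict(counts))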
I am looking for or working
I am looking for, or working on, a way to pipeline batches of documents through semantic encoders (gate.ac.uk) and publish them to Semantic MediaWiki, which will also allow editing of the recognized entities (the really exciting part, imo), but it's not going to be ready soon. This is probably close to what you're describing, but it will still need human revision. As was discussed in chat, Natural Language Processing (NLP) is very dicey: 80% correct is fine statistically (systems can determine "most people like chocolate") but useless in specific applications ("John Smith likes chocolate"). Computer systems augment people; they rarely replace them.
As is usually the case, the problem is not particularly technical (things can be done one way or another); it's cultural. A big part of doing this sort of analysis is to get a large number of people involved and share insight flexibly: social hacking. Part of the culture change will be to get everyone, including journalists, to routinely contribute to and use sites like Wikipedia, with their semantic extensions.
Complex Semantic Web technologies like topic maps and RDF provide a good way to share this information neutrally, and many current frameworks are emphasizing them, so 2011 will likely be the year of the Semantic Web. In the meantime, Semantic MediaWiki is one way to encode and share documents.
The vision behind this is that anyone's work can coalesce different sources of trusted information (when thinking about trust and distributed data, think of a puzzle: a missing or inconsistent piece becomes more glaring as fewer holes remain). The journalistic requirement will exist for at least another generation :)
Interesting. Also, I want to
Interesting. Also, I want to make a shirt that says: "2011: The Year of the Semantic Web"--just because.
I like the thinking behind
I like the thinking behind the Source types. Did you have a reason to use generic tags (i.e., Class 1 and Class 2) instead of specific tags (i.e., In Person Interview and Personal Observation)?
With semantic mediawiki, I
With Semantic MediaWiki, I did not see a way to mark something in multiple categories. For example, in order to mark someone as both a crook and a donor, you'd need to mention that person twice, which would lead to awkward sentences. Another approach is generic tags (like Person), which leads to disambiguation pages.
David Medinets, you can use
David Medinets, you can use this syntax:
[[Crook::Donor::John Smith]]
From http://semantic-mediawiki.org/wiki/Help:Properties_and_types
For complex repetitive cases you'd use a template.
I haven't had a chance to use
I haven't had a chance to use GATE yet. Looking at its web page was a bit daunting.
I'm here to help with GATE,
I'm here to help with GATE, though I will have a lot more info after a meeting in two weeks.
I should reinforce that when using Mediawiki (Wikipedia's software) or Semantic Mediawiki, you should really have one "entity" per page. So our Crook Donor, John Smith, would have his own page, and your article would define the relationship with Smith semantically... if your page was Townsville Bank, it could be [[robbed by::John Smith]]
Semantically, we'd think of it as
Townsville Bank robbed by John Smith (Subject Predicate Object)
Of course, this framework leaves it open for anyone to compose different queries based on subject, predicate or object interactions with the generated information.
{{ #ask:[[robbed by::John Smith]] }} (an SMW query)
would return all of Smith's stored robberies. You might want to say Property:Robbed by [[has type::crime]], then you can query across the supertype. You can also add information like date, witnesses, etc, and add inferences (if someone is from outside the country and is on the lam, they're probably in the category of illegal immigrants. Mark this once in the class definition (template) and all instances become queryable).
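(Outside the wiki, the same subject-predicate-object statements can be handled with standard RDF tooling. A small Python sketch follows; the rdflib library and the example.org URIs are just one convenient choice for illustration, not something SMW requires.)

# Sketch: the "Townsville Bank robbed by John Smith" triple, stored and
# queried with rdflib. Library choice and example.org URIs are assumptions.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.TownsvilleBank, EX.robbedBy, EX.JohnSmith))
g.add((EX.FirstNationalBank, EX.robbedBy, EX.JohnSmith))

# Roughly the same question as {{#ask: [[robbed by::John Smith]]}}:
# which subjects have a "robbed by" relation to John Smith?
for subject, _, _ in g.triples((None, EX.robbedBy, EX.JohnSmith)):
    print(subject)

# The graph can also be exported as RDF for other tools to consume.
print(g.serialize(format="turtle"))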
The point here is you're creating a flexible, re-usable, co-developed database with something closer to text than traditional programmer's tools, with the transparent culture of wikis, and it "speaks" the distributed knowledge friendly RDF.
Here are a couple of nice example SMW sites http://www.placeography.org/index.php/Main_Page http://discoursedb.org/wiki/Main_Page (I find discourse systems very interesting and am currently developing one).
SMW uses MIT's Exhibit, which is designed to provide reusable Web tools for knowledge comprehension. A lot of web sites create flashy custom widgets for explaining information, which I think is counterproductive.
David Medinets: The class
David Medinets: The class approach was really just a matter of keeping it generic for the experiment--it helped me to better imagine other classes while looking at the results.
OK, so I looked at this story
OK, so I looked at this story from The Guardian:
http://www.guardian.co.uk/sport/2010/sep/23/commonwealth-games-scotland-...
The piece is balanced and quotes from a variety of sources both directly and indirectly as follows:
1. The comment that the village is "unfit for human habitation" is not sourced.
2. Chairman of Commonwealth Games England, Sir Andrew Foster, is quoted directly.
3. A second quote from Andrew Foster is attributed to an interview on Radio 4.
4. Statistics attributed to the Hindustan Times.
5. Front page headline from the Delhi News quoted.
6. Indian journalist Norris Pitram quoted, source unknown.
7. CGF Chief Executive Mike Hooper quoted.
8. Michael Cavanagh, Commonwealth Games Scotland Chairman quoted.
9. Craig Hunter, English chef de mission, quoted from a direct conversation with The Guardian.
10. Delhi chief minister Sheila Dikshit quoted from a conversation with reporters.
11. Final summary quote from William Hague.
The story is balanced and well-structured and all of the sources can obviously be verified. There was not too much data to be grabbed from viewing the page source - whenever a country is mentioned it simply links to the Guardian section for that country.
What proved more interesting was the ability to expand on that story. Using an extension for Google Chrome called Tweetbeat I was able to look at the Twitter stream for the Hindustan Times, for example, right in the article inline. Following that thread, I have created an accompaniment to this article at storify.com. I'll add the link to that when it's ready.
I looked at this story at The
I looked at this story at The Independent
http://www.independent.co.uk/news/business/news/richest-dynasties-back-i...
* The title seems to be an Independent editor's interpretation: "Richest dynasties back in the money after crunch" is not really developed in the article.
* The top list of US billionaires came from the Forbes 400 annual survey: 272 of the 400 richest Americans are self-made.
* The proposal about distributing the "wealth of millionaires" has no direct source.
* Using the Forbes 400 as a "recruiting tool" is attributed to Mr Buffett himself.
* The Mr Buffett quote "The way I got the message..." is direct, but the article doesn't say where it was taken from. The Internet?
* The claim about the rebound of the economy has an unknown source.
* The information about the Facebook owners has no defined source, but the reader can presume it is also the Forbes 400.
* "Mr Moskovitz left Facebook in 2008" has no known source.
* The "biggest losers" item has no named source; it appears to be an interpretation from reading the Forbes 400.
* The source of the David Koch direct quote "One day, my father..." is unknown. The article says only that it is something Koch said "in 2003".
* The Forbes top-25 rich list placed after the article has, of course, Forbes as its only source.
My conclusion is that there is one "main source", the Forbes 400 list, but it is not clear when the journalist is using the Forbes list information and when he is interpreting it himself.
The second "main" source of this article seems to be something like "the Internet". The two direct quotes do not specify when and where they were said, and the interpretation about economic trends has no source. Nor is the idea of distributing millionaires' wealth sourced in any way. For the journalist, it seems to be "something that people already know".
One strange thing: the word "financial" in the text has an unrelated link to an advertisement.
So storify.com is in closed
So storify.com is in closed beta and I couldn't include tweets from the Hindustan Times Twitter stream but you get the idea:
http://storify.com/buddhamagnet/commonwealth-games
Here is the story I
Here is the story I pulled.
http://news.yahoo.com/s/ynews_excl/ynews_excl_pl3712;_ylt=AkHluPzqww2_86...
I chose a story from Yahoo! News because it is a destination many people use for news. Yahoo! News brings together a nice variety of content from news outlets I view mostly as reputable.
This particular story is based on a phone interview with the primary source. There are several statements of fact sprinkled into the story that are not sourced but seem to be based on common knowledge, or that can be inferred from quoted material from the primary source.
The article also contains links to other articles that add facets to the story without bogging down the primary reporting.
Source: Shepard Fairey, phone interview
Facts
Unnamed Sources
“The poster of Barack Obama became a rallying image during the hope-and-change election of 2008.”
“It's been reproduced countless times on the Internet, and a parody version, with Obama as The Joker and "Socialism" in place of "Hope," is a favorite at Tea Party rallies.”
“Fairey's blue-and-red image was altered from an Associated Press photograph of Obama, and the artist is embroiled in an ongoing lawsuit over use of that picture. (He didn't discuss the case with National Journal.)”
I chose the ESPN Outside The
I chose the ESPN Outside The Lines' story on food safety at stadiums. This story had a video, interactive, web story (text) and chart elements. http://sports.espn.go.com/espn/eticket/story?page=100725/stadiumconcessions. I concentrated on the text/chart elements and was pretty satisfied with the sourcing. Most of it could be traced to interviews with food safety experts and inspectors, industry representatives, health departments and workers. The bulk of the story came from ESPN's requests for health inspection reports from 107 North American (US and Canada) health departments responsible for inspecting food concessions at arenas and stadiums. I dissected the facts in the story with Excel. It's here: http://bit.ly/9j45JV. My programming partner on this, Marlon, was much more diligent in ferreting out facts and attribution using Excel, which I knew he would be, coming as he does from a more precise discipline.
My partner (Sarah Laskow) and
My partner (Sarah Laskow) and I took a story from the NY Times about the egg crisis, and we each broke the attribution down separately.
I broke it down this way:
-- single attribution
-- multiple attribution
-- no attribution (but presumably from older stories)
-- assumed information (based on info that is widely available)
-- & linked attribution, which means there was no direct attribution in the story but there was a link to an archived story or source material. (this overlapped with some other categories.)
Which looked like this:
Type of attribution Number
Single attribution 11
Multiple attribution 15
No attribution in this story 31
Assumed 4
Linked attribution 5
& was also translated into a pie chart, which is here:
http://bit.ly/cPCMiR
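(For reference, a minimal sketch of how the same chart could be generated, using Python and matplotlib; the counts are copied from the table above, and any charting tool would do.)

# Sketch: the attribution tallies above as a pie chart with matplotlib.
import matplotlib.pyplot as plt

labels = ["Single attribution", "Multiple attribution",
          "No attribution in this story", "Assumed", "Linked attribution"]
counts = [11, 15, 31, 4, 5]

plt.pie(counts, labels=labels, autopct="%1.0f%%")
plt.title("Attribution in the NYT egg story")
plt.savefig("attribution_pie.png")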
Story: "The Vanishing Beat
Story: "The Vanishing Beat Cop"
Byline: Ben Joravsky, Chicago Reader, August 12, 2010
http://bit.ly/BeatCopReader
This story begs for an interactive.
While there are a couple data sets that could be focused on, the most important is the attempt to confirm how many cops are actually on the streets. (In order to be truly useful, we'd need to get more data...)
Here's the spreadsheet:
https://spreadsheets.google.com/pub?key=0Ak2OhAcTJjOPdE5oQmhuQkpCYVFqOVl...
Sol
I worked with Terri on the
I worked with Terri on the same story: http://sports.espn.go.com/espn/eticket/story?page=100725/stadiumconcessions
We both made an independent spreadsheet of facts and sources. Mine is here: https://spreadsheets.google.com/ccc?key=0AmQ6_0IhhTX8dElya2FwZWxGRFQ2Qm1...
It's interesting to observe how similar our resulting datasets were, which I guess suggests that this kind of data extraction could be somewhat standardized. Differences I noticed were that I identified more "facts" (perhaps doing a more literal reading of the text), and Terri noted significantly more detail and context about each fact. I think Terri's dataset is an example of what you get when a professional journalist organizes data by hand, while mine is more like what a computer algorithm might produce.
The story is here:
The story is here: http://www.nytimes.com/2010/09/22/business/22eggs.html
And you can see my analysis here -- http://bit.ly/9rA3YI
I broke down the sentences a little bit differently than Matt did.
Direct Quote 8
Indirect Quote 11
Document 10
Assertion 8
Government Action 10
Unsourced 12
Government Action was a category I used to note facts that seemed likely to be acts on the public record that anyone could find with a little legwork. I also looked at whether the source was "named" or "unnamed." For people this was fairly straightforward, but for documents it was more of a slippery slope. I judged a document "named" if it was linked to or clearly described.
Named 21
Unnamed 39
In the context of discussions we've had already, this assignment prompted me to think about transparency in journalism. How hard would it be for me to recreate the reporting in this article? Pretty hard. There were plenty of facts in this article -- I'm thinking of government actions in particular -- that are likely open to anyone, but the reporter only left some clues that would help another person retrace his steps. Even if he didn't link to specific records that he used to report the story -- inspection records, for instance -- it'd be incredibly useful for someone else reporting on this type of health recall to know the name or form number of the documents he references. But that sort of information is nowhere to be found.
I pulled out this story:
I pulled out this story: "Latest prize is a tweet, not a celebrity meet-and-greet" It was published in yesterday's Wall Street Journal.
I pulled the data by creating a Word document to analyze the facts and sources of the story. I also used the document to suggest other ways I could have gathered and shared the facts and figures mentioned in the story. I have started on some of those ideas, which I will share as I get them completed. Those ideas include using Storify, which some of you saw with Dave Goodchild's piece.
Here is a spreadsheet that shows all the people mentioned in the story, some information about them and how they were included in the story. I think this could serve as a practical exercise to keep track of how I use sources in my stories.
Some thoughts:
** You can hide bad reporting (or shortfalls in reporting) in a story if you write it in a certain way. Of course, I knew this (I've had to write around a fact I didn't have from time to time), but doing this exercise made me realize how easy it is to expose our sourcing and facts if people took the time to deconstruct it like we're doing here. On the flip side, in this story, I was impressed at how well the writer did in compressing a whole lot of facts into one easy-to-read story!
** That said, as journalists, trust from the reader is a very important thing. I wonder if providing the raw information in some sort of data platform could help in building that trust.
** Forming a database, of either the people sourced or the facts themselves, is also great as boilerplate material for future stories. I find myself searching my newspaper's archives for background information when I need it. Having it in an organized structure would be much easier and faster, which is very much needed these days.
** My partner, David Mason, is working on a separate story, but we have plans to follow up and evaluate each other's assignments and perhaps get feedback on how to further improve our respective assignments and perhaps provide more ideas on how to pull and organize data.
Very nice! I hope to use
Very nice! I hope to use Storify with my assignment as well. Working on it right now.
Story:
Story: http://bit.ly/9Pe1pS
Analysis: https://spreadsheets.google.com/pub?key=0Agbu2FfNye2GdEpPNnY1MDY0NWFlZEo...
Nothing too shocking or revelatory about my analysis here, though it did lay bare how biased our daily newspaper is in Vancouver. The story had three sources: a government minister, a doctor and the executive director of an organization hired by government who also happens to be the brother-in-law of our premier (shocking!), and a bureaucrat/doctor. No sources from health care advocates or patient advocates, etc.
What excited me, though, is the potential for tagging stories in this way. Not only could you tag the source, but could further tag the source's descriptors: doctor, executive director, brother-in-law of the premier. That would be one powerful database.
So, yes, excited to learn more about semantic mediawiki and what David Mason and others were discussing. I especially liked:
"The point here is you're creating a flexible, re-usable, co-developed database with something closer to text than traditional programmer's tools, with the transparent culture of wikis, and it "speaks" the distributed knowledge friendly RDF."
Fascinating and looking forward to learning more.
Story: A Look at Who Got in
Story: A Look at Who Got in Where Shows Preferences Go Beyond Racial Ones
Link: http://online.wsj.com/public/resources/documents/golden1.htm
Analysis: http://bit.ly/bxlV2G
Ended up going through by hand, which seemed pretty simple and straightforward since the story pretty much followed the golden "one sentence, one fact" rule. An interesting twist was that much of the story was based on a document that was anonymously obtained, so I filed that under "anonymous source," but could see an argument either way.
For this assignment I started
For this assignment I started to read an article about the Sudan that appeared in Newsweek, but I soon determined that the article was more of an opinion piece that would not work well for this assignment. Sure, there were facts and sources, but overall it was a longer, history-type piece about the Sudan.
I shifted my sights to an article that appeared in my local paper that was much more of a just-the-facts-ma'am type of article and would work better for this assignment.
The article I read, which appeared in the Boulder Daily Camera, was entitled "Colorado has worst ethnic education gap in the country."
All of the sources in this article were named and are as follows: a report commissioned by the Governor entitled "Colorado's Strategic Plan for Higher Education"; David Aragon, director of student success in the Office of Diversity, Equality and Community Engagement at CU; John Poynton, spokesman for the St. Vrain school district; and CU Regent Michael Carrigan.
The facts were as follows: Colorado has a 39 percentage-point ethnic achievement gap, compared to 19.3 nationally; Hispanics are the fastest-growing demographic in the state, but only 6% have an associate's degree and 8% a bachelor's degree; in-state and out-of-state Hispanic enrollment at CU has increased; Hispanic graduation rates have increased; and the St. Vrain Valley School District offers additional classes for at-risk students.
I was surprised how many of the facts were not tied specifically to sources. None of the facts tied to CU were sourced.
Michael Newman and I are
Michael Newman and I are working on a piece from the New Orleans Times-Picayune about the current hurricane season.
I’m a hack. And even though I’ve spent the past week reading about all kinds of codey stuff, I still don’t even know the right terminology to describe most of this. Here goes anyway. I think what we are talking about in part is attaching some kind of tag or metadata to parts of written articles so that the components could be extracted and manipulated/repurposed.
But because human readers and writers have almost subconscious associations and biases, and because certain words and phrases are code or shorthand for much larger ideas, and because writers often assume their readers have background knowledge, the written word taken literally (as a machine might) can be confusing.
First I marked this up as if I were fact checking it. Fact? Says who? Who’s that? Name spelled right? Institutional affiliation and role specified?
Then I tried to think like a researcher. I wouldn’t go to the Times-Picayune for weather data, but if I wanted to know if the people of New Orleans were more religious or superstitious now than before Katrina, I might look at articles about hurricanes in the local paper. Maybe I’d be looking for knowledge gaps in hurricane science. That might show up in speculative language when meteorologists are quoted.
So that's how I approached it. CK indicates supposedly factual information. Some, maybe all of it could be considered data. If I were looking for information to extract from this, I’d ignore all the historical stuff because it’s already in lots of other databases so why waste time on it. Most of this is weather data. Other notes are to show how I would approach this as an editor.
The various experts and their agencies, and the relationships between them, could be useful in the future, so maybe tag and link them somehow. The vaguely religious tone of the piece might be interesting to someone researching attitudes, so maybe tagging stuff like “feel blessed” and “benevolent hand” would be worthwhile. Speculative words and phrases, and those that reveal uncertainty, such as: could, may, might, ‘is likely to’, expected, predict-, potentially, ‘hope to’, might be worth tagging.
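(A minimal Python sketch of that last idea, flagging speculative language with the word list above; the list and the amount of surrounding context shown are assumptions to tune.)

# Sketch: flag speculative or hedging language using the word list above.
import re

HEDGES = ["could", "may", "might", "is likely to", "expected",
          "predict", "potentially", "hope to"]
PATTERN = re.compile(r"\b(" + "|".join(re.escape(h) for h in HEDGES) + r")\w*",
                     re.IGNORECASE)

def flag_speculation(text):
    """Return each hedge word found, with a little surrounding context."""
    hits = []
    for match in PATTERN.finditer(text):
        start = max(0, match.start() - 30)
        hits.append((match.group(0), text[start:match.end() + 30]))
    return hits

if __name__ == "__main__":
    sample = "Forecasters predict the storm could strengthen and is likely to turn north."
    for word, context in flag_speculation(sample):
        print(word, "::", context)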
Not all things marked CK would really be checked. If the writer was an unknown person who might make things up, you’d check a lot, but if it was someone who you knew was a trustworthy reporter, you might check some spellings, clarify agency connections, and pin down the term “expert” as it applied to the guy who “grew up in Mandeville”.
Sometimes you don’t fact-check quotes as long as they are not actionable in any way (slanderous, libelous). This might be because you are letting someone hang himself, or as a way to show someone’s mindset or bias. In this case I’d probably trust the people quoted, but maybe check the exact numbers just to avoid embarrassment. Sometimes even experts misspeak. But information in quotes should probably not be automatically treated as factual.
If I'd been working on paper I would have highlighted things that needed checking. Here's a link to a Google Doc version of that process: http://tiny.cc/nr4si
(Long link in case tiny fails:
https://docs.google.com/document/edit?id=1yesncpqRGLGHvOw7eFItga4RI8NtU9...)
(And guess what? I just found out that if you put something inside those sideways Vs <> it disappears. I'm coding! Too bad it was what I used to enclose all my comments on the piece.)
Aside from everything else, I
This is in response to Jeff Severns Guntzel's post here: http://sourceanalysis.referata.com/wiki/Main_Page
Aside from everything else, I really enjoyed this piece. And the thing I liked about it illustrates one problem I see with the idea that information equals data and is somehow extractable and codable. By focusing on this one guy and his world, the article beautifully describes a certain time and place in journalistic culture. That multi-dimensional portrait only emerges when the "data" in the piece is viewed as it was intended. You have to read it and let the details (aka data) click together in your mind in order to see the whole scene. So you could tag it 'journalism, culture, newspapers, sportswriter,' etc., but is tagging what the semantic-mediawiki-XML-Ruby-Python thing is about? Or are tagging and semantic markup stuff two totally different things? This is a hack knowledge gap. Right now I can search for whatever I want. How does the linking/markup thing improve on that? Do the links make it easier to extract the data? Anyway, cool piece.
>is tagging what the
>is tagging what the semantic-mediawiki-XML-Ruby-Python thing is about?
No. You've lumped three things together. XML is a language to describe and store data. It's inert, non-acting. Ruby and Python are programming languages. They do not describe or store data. The languages are used to execute actions (programs). Semantic MediaWiki is a web-based hybrid. The articles both describe data (via markup) and store data (as articles). And SMW also executes queries so it has a small bit of programming to it.
>are tagging and semantic markup stuff two totally different things?
They are different. Tags are non-discriminating; they're just a bunch of words related to the article. Markup is pulling out specific pieces of information.
>How does the linking/markup thing improve on that?
Markup allows simple and complex queries to execute against your document set.
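(A toy Python illustration of the difference, using made-up article and property names: tags are a flat bag of words about the piece, while markup attaches specific values to specific properties, which is what makes queries possible.)

# Toy illustration: tags versus semantic markup. All names are made up.
tags = {"journalism", "culture", "newspapers", "sportswriter"}

# Markup, reduced to (subject, property, value) statements.
markup = [
    ("Article:Sportswriter profile", "source class", "In Person Interview"),
    ("Article:Sportswriter profile", "quotes", "John Smith"),
    ("John Smith", "occupation", "sportswriter"),
]

# A tag search can only say the article mentions sportswriters somewhere.
print("sportswriter" in tags)

# A markup query answers a specific question: who is quoted in the article?
print([value for subject, prop, value in markup if prop == "quotes"])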
Wow, thank you! Between your
Wow, thank you! Between your reply to my questions and a conversation I had today with my hacker collaborator, my understanding of this has increased exponentially. Still miles to go though. There is one sentence here I don't totally understand: "The articles both describe data (via markup) and store data (as articles)." I'm not sure how you are using the word 'articles.' I really appreciate the explanations. Thanks.
Each page of a wiki can be
Each page of a wiki can be considered an article. I was just being 'loose' with my terms.
I chose a story I wrote a
I chose a story I wrote a year or two ago for the Atlantic Monthly, partly because I have the copyright, and partly because I think the editors there did the internet side of this story particularly well. (E.g., the "Darwin's Revenge" sidebar.) You can see the original at: http://www.theatlantic.com/magazine/archive/2008/09/heart-of-darwin/6932/
I have put up a partly annotated copy on Google Docs:
https://docs.google.com/document/pub?id=1GdeBSAWEKArBsxNb2qca4DwKFtAksdO...
A couple of general thoughts: A lot of people in this class have said something like "we're in the information business." I don't agree. Humans are not a data-mining species. We are a story-telling species, and a lot of us are in the story-telling business.
Data can of course be useful. And sometimes you can just put it out there and say, "ok, do me." (I liked some of the neat wiki things readers are doing with raw data in that Guardian article someone posted.) But most of the time, for most readers, an editor's job is to make the data get up off the bed and dance with the reader a little. We need to make it tell a story.
I also disagree with the suggestion that stories lacking lots of internal web links are a form of "bad journalism." In my Atlantic story, those links work fine in the "travel advisory" sidebar, which is primarily information. But within the body of the story, links would just invite the reader to break the spell, step out of the story, and go somewhere else. Like footnotes in a book (which I also avoid; endnotes are less distracting).
When I submit a manuscript to the editor, however, I always include those links, along with source names and coordinates, page numbers in books referenced, etc. I do this for fact-checking, because magazines generally insist that I back up every assertion with a source. For the most part, I do this in Word comments. (Which is not great: my manuscripts tend to be polka-dot pink with sidebar comments and dotted lines, and it's sometimes hard to figure out which comment goes with which fact.) The URL links are a godsend to me--particularly when something like the Darwin Correspondence Project puts almost every letter Darwin wrote and received at my fingertips. I could see that this might also be a godsend to readers in an online story. But it would only appeal to me as a writer if the information were turned off by default, and available for interested readers to turn on as a follow-up. The editors I work with at places like The NYT online, Yale Environment 360, etc., also seem to want to avoid too much clutter and protect the mood of the story.
So here's a little experiment
So here's a little experiment I did at Storify. Basically it's sort of an extension of the criticisms mentioned in the original WSJ story. It's nothing groundbreaking, but I can definitely see its use for online journalism. One use: providing ongoing reader reaction to a major news event.
http://storify.com/maiphoang/criticisms-of-twitchange
By the way, I take it the
By the way, I take it the shortcut Chris alluded to was OpenCalais at API Playground. If not, what am I missing?
I worked on
I worked on http://www.thenation.com/article/katrinas-hidden-race-war and I failed miserably... I tried to find a way, using GATE (and I spent an enormous amount of time fiddling with it), to determine which sentences are facts and which are quotations. It turns out that no matter how I look at parts of speech, noun phrases and co-references, I cannot come up with a logical rule. This story in particular is nicely written; facts and quotations are tightly interwoven, and even as a human I find it hard to decompose. I searched various research papers; most of the fact-extraction algorithms are aimed at simple statements of the "Paris is in France" style. The only approach that would be feasible appears to be machine learning over human-annotated examples. Here is a PDF, gridinoc.name/j/article.pdf, with the article highlighted for noun chunks (blue), names (yellow) and locations (green).
Have you tried the entity
Have you tried the entity extractor located at http://viewer.opencalais.com? You can paste into the form and the OpenCalais service will percolate and extract information like places and people. For example, it found NOPD (New Orleans Police Department) but incorrectly identified 'Body of Evidence' as a movie.
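(For batches of text, the same service can be called programmatically rather than through the form. A hedged sketch follows: the endpoint, header names and response fields are taken from the OpenCalais REST documentation of this period and should be verified against the current docs, and YOUR_API_KEY is a placeholder.)

# Sketch: send text to the OpenCalais REST service and list the entities
# it finds. Endpoint, headers and response fields are assumptions; check
# the Calais documentation and substitute a real API key.
import json
import urllib.request

CALAIS_URL = "http://api.opencalais.com/tag/rs/enrich"

def extract_entities(text, api_key):
    request = urllib.request.Request(
        CALAIS_URL,
        data=text.encode("utf-8"),
        headers={
            "x-calais-licenseID": api_key,
            "Content-Type": "text/raw",
            "Accept": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        result = json.loads(response.read().decode("utf-8"))
    # Entity items in the response are expected to carry _typeGroup,
    # _type and name fields; verify against a real response.
    return [(item.get("_type"), item.get("name"))
            for key, item in result.items()
            if key != "doc" and isinstance(item, dict)
            and item.get("_typeGroup") == "entities"]

if __name__ == "__main__":
    text = "The New Orleans Police Department patrols the French Quarter."
    for entity_type, name in extract_entities(text, "YOUR_API_KEY"):
        print(entity_type, name)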
Sorry for my tardiness! I
Sorry for my tardiness! I just now had time to write up my thoughts on this.
First, I did a manual analysis of the facts in a New York Times story on the Republican primary for governor:
http://docs.google.com/Doc?docid=0AcK5ra0CMfEGZGM1dm4yOG1fNzVocGZ4OXRkYw...
I parsed the story, breaking everything into one of these categories and identifying the source in brackets:
-identifiable source
-identifiable but unnamed source, able to be inferred from other references
-unknown source
-opinion, belief or analysis (anything other than a fact)
-prior reported fact or common knowledge (such as biographical or historical information)
-reporter's or news org's observation or knowledge
At first I was surprised by the amount of content that was "opinion, belief or analysis," considering this was a simple, incremental story. But in thinking about it more, I can see that most political stories would probably be similar. (I put some more thoughts about the process in the doc itself.)
Next I ran OpenCalais on it, using its sample Web entry. I wasn't able to export the marked-up text file in a usable way, so I just created a separate file and highlighted the few facts that OpenCalais identified.
OpenCalais performed a good entity extraction and correctly placed attributes on each entity. So it identified cities, organizations, political events, "industry terms" (one of which it misidentified) and position.
I was not impressed with its ability to pull out facts, however; it was not able to parse many of the things in the story that I considered facts. I would expect some difference, but I wouldn't think that so much of the story would have been undigestible.
It identified five types of "events and facts."
* "Person career" (person, position, careertype, status)
* "Political endorsement" (groupendorsee, groupendorser)
* "Polls result" (politicalevent, winning candidate, datestring)
* "Quotation" (person, quote)
* "generic relations (subject, verb, object)
Biggest takeaway: Judging by OpenCalais' performance -- and I haven't tried another method -- there are too many subtleties and unspoken meanings in written language for machines to read them without help. For instance, OpenCalais doesn't know that a subject-verb-object construction that follows the word "that" isn't a fact -- it identified "Mr. Cuomo lacks the backbone to face his attacks" as a fact. OpenCalais also doesn't know writers commonly attribute the first part of a quotation and then continue the quotation without attributing the second sentence.
Further thoughts are noted in the file:
http://docs.google.com/Doc?docid=0AcK5ra0CMfEGZGM1dm4yOG1fNzdobWJxODljcw...
Very late here, my apologies.
Very late here, my apologies. My partner is Steve Myers, and we wound up going solo a bit. He picked the story, and it's a good one for these purposes.
I decided to read the assignment literally, and set up two classes of data: Sources and Facts, which I organized as the X and Y axes of a matrix.
I manually culled the data, and chose a matrix to visually display their interrelations.
In choosing a matrix format, I found myself wishing I could make one with a "Z" axis, where I could list the assumptions/assertions/analysis that emerged out of the data.
But I lack the 3d skills, and instead stuck with two classes of info, "Datum" and "Source," and added an asterisk * whenever a particular datum lacked anything immediately verifiable other than the reporters' own good word.
I like the "meta" approach Steve took of color-coding types of data -- an easy way to visually test the strength of a story's sourcing.
For this piece, I noticed that anything that didn't have a direct source but was presented as factual had the reporters as the source. From that reading, the reporters were the most frequently used sources for their own story.
Of course, they're doing a news analysis, so unattributed assertions are easier to present.
Here's the matrix ... it seems like a useful way to analyze the strength of a story based on its sourcing:
https://spreadsheets.google.com/pub?key=0AhPQrzxcJophdDNIRF8tX2QyOWZrWWR...
Corrections -- I meant, two
Corrections -- I meant, two classes of data, "Fact" and "Source" ... and, rather than set up separate third+ classes of data for subtypes of data ("unattributed fact," "anonymous source") I decided, due to the limits of my matrix-building skills, to assign the characteristics of those subtypes through flags on each datum.
Thus an assertion may exist -- the "fact" of the assertion's existence is real, i.e. "John says X" -- but the nature of it as conjecture (maybe true or maybe false, as opposed to verifiably true) is indicated by an asterisk (which indicates that a given "asterisked" datum contains or consists entirely of assertions that are conjecture posited by an asserter).
I've finally posted my week 1
I've finally posted my week 1 assignment here. http://canbudget.zooid.org/wiki/David_Mason/P2PU_week_1_assignment