This is the P2PU Archive. If you want the current site, go to www.p2pu.org!

Open Journalism & the Open Web

Week 4: Assignment

Danielle Fankhauser
Mon, 2010-10-11 20:13

  You are an investigative team that does freelance work around the country and are working up a pitch for your next project.
 
+  Pick a subject matter you want to investigate
+  Identify a dataset or datasets that will help you formulate your story. For this exercise, only pick a dataset that is already available on the Web, e.g. via Data.gov or a state- or city-level data website. Plan:
- What you need to do to clean these data, e.g. remove columns, perform functions on other columns, remove special characters ...
- The schema you'll make to house the dataset(s) — column names, column types
- What are you doing with this data — Are you proving an existing thesis (and if so, what is it) or are you seeking to better understand the subject matter (and if so, what questions will this data answer)?
- What will your queries look like? Are you going to join multiple databases together? If so, how, and why will the results of the join be accurate and relevant? Hackers: If possible, write out a query in SQL.
- How will you express the results of your inquiry? In the text of an article? As a chart or graph? Informing a search for relevant photos and video?
- What questions won't the data answer that you want to address in your project? Who will you turn to as you start looking for those answers?

Comments

I couldn't get the entire

Terri Langford
Fri, 2010-10-15 20:53

I couldn't get the entire data set into Google Docs, so I've put a sample of it here. http://bit.ly/ccbMYH

This is a spreadsheet with two sheets: (1) a list of names of assisted living facilities in Texas, and (2) the violations at those facilities, keyed to each facility's vendor number.

In my sample, I just show 10 or so rows of each sheet, one underneath the other. (Sorry, I had trouble putting this up because it's so large; Google wouldn't take it.)

This is pretty clean, but I'll need to go through it again for spellings. I'd also have already requested a layout sheet so I can eliminate some of these fields that might be unnecessary.

Then I'd put this into Access and join the two sheets by vendor number and query the data so the resulting database shows me each facility and its violations and the dates of the violations, the county, the address, etc.
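For the hackers: in SQL terms, the join I have in mind would look something like this (the table and column names here are placeholders; the real field names will come from the layout sheet):

SELECT f.vendor_number, f.facility_name, f.county, f.address,
       v.violation_code, v.violation_date
FROM facilities AS f
INNER JOIN violations AS v
  ON v.vendor_number = f.vendor_number
ORDER BY f.facility_name, v.violation_date;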

Now, one thing looking at this is that it's clear some violations are more minor than others. So I'd have to either assign a value to each or eliminate the more minor ones.

I would check to see if there were any deaths during this time as a result of these violations, which may be another datasheet to add, plus I would check the online inspection reports to see if I could get more narrative about the more serious violations.

For a basic state story, I could try to determine whether incidents were falling or rising and present the information on a map showing where the most violations are, but also use census data to bring the numbers back to earth by showing how many people, say over the age of 70, live in assisted living or long-term care homes.

Without really looking at the data, it's probably a good bet there's a Houston facility with problems. I'd try to see why that is: was there new ownership, did the violations happen after a major storm or hurricane? Could that have explained it? I'd also look at the staffing numbers for the more problematic facilities and find out what the state recommendation or standard for staffing is.

What this won't tell me is who was involved in any serious incident. For that, I'd have to lean on advocacy groups, family member associations, etc., to see if I could unlock some identities of those involved or those hurt. Was there a police report or a lawsuit filed against this particular facility?

"I couldn't get the entire

Phillip Smith
Fri, 2010-10-15 21:42

"I couldn't get the entire data set into Google Docs, so I've put a sample of it here. http://bit.ly/ccbMYH

Looking good. And you've managed to stumble on the main theme of Nick's lecture: spreadsheets are _very_ limited in the amount of data they can work with. :)

Otherwise, fantastic work on the assignment. Very impressive!

Phillip.

I'm interested in

Marlon x
Sun, 2010-10-17 03:13

I'm interested in investigating trends of police enforcement in Atlanta, GA, especially how crime and police activity have changed over the transition from housing bubble to foreclosure crisis.

To do this, I grabbed the police report datasets available here: http://www.atlantapd.org/index.asp?nav=crimemapping

The data was in quasi-CSV format, but it had various inconsistencies, especially between years. Some fields were padded with spaces and some were enclosed unnecessarily in quotes. I opened each year's file in OpenOffice Spreadsheet and ran some regexp find-and-replaces to remove the inconsistencies. Then I imported the clean CSV files into MySQL, which created one big database containing all the crime reports from 2004 to the present.
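One way to pull each cleaned file into MySQL is a LOAD DATA statement along these lines (the file name and the column list here are illustrative, not the exact ones I used):

LOAD DATA LOCAL INFILE 'crime_reports_2004.csv'
INTO TABLE police_reports
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(report_date, offense_description, quad);  -- plus the rest of the fields in the file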

There's no specific thing I'm trying to learn, but some areas of general curiosity I have are: Did theft and other "poverty" crimes change as the foreclosure crisis expanded? If so, what crimes became more prevalent? For example, I have a hunch that bank robbery has become significantly more common. Did areas of high enforcement change? Did traffic citations and other revenue-generating police activities change?

Because the dataset is pretty much one long table, the queries I need aren't very sophisticated. Queries like:
SELECT * FROM police_reports WHERE offense_description REGEXP '.*LARCENY.*' AND quad = 'NE' ORDER BY report_date
should give me everything I need.

Since I'm primarily interested in trend information, I imagine that line graphs showing change over time would be a good way to present the information. To demonstrate correlations between trends, overlaying multiple lines on the same time axis would probably be useful.
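For those trend lines, a grouped count per year (or per month) would do most of the work, assuming report_date is stored as a DATE:

SELECT YEAR(report_date) AS yr, COUNT(*) AS larceny_reports
FROM police_reports
WHERE offense_description REGEXP 'LARCENY'
GROUP BY yr
ORDER BY yr;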

The central issue which this data cannot answer is what sorts of policy and police activity led to its creation. For example, is a rise in reports about a particular crime due to a rise in people committing that crime, or to an increased focus by police? These are things to be discovered by gathering information directly from sources other than the police department, like residents in affected communities and those actually arrested. Unfortunately that information has to be gathered the old-fashioned way, because so far nobody has bothered to compile it into a neat database :)

I have to admit I have been

Tim Groves
Sun, 2010-10-17 17:46

I have to admit I have been working on this for over a month but I just finished my first CAR project this week and it fits perfectly with the assignment.

http://toronto.mediacoop.ca/story/buying-influence/4843

It is an investigation into which Toronto lobbyists tried to buy influence by donating to candidates' campaigns.

I manually scraped all the names off the City of Toronto Lobbyist Registry, and downloaded all the campaign donations in Toronto's 2006* municipal election. (* The 2010 election is underway, but financial disclosures are not required until April 2011.) I then ran an inner join query to see which names appeared in both tables.

The hardest part was cleaning the data: some companies used slightly different names (inc vs. inc. vs. incorporated), some were abbreviated, etc. I went through each table in a spreadsheet and used find/replace to make every ending I could think of match up properly.

Once I had generated a table of "Candidate Name", "Amount Donated", "Lobbyist Name", "Type of Lobbyist" and "Type of Donator", I had 148 entries, but sometimes the same lobbyist donated numerous times. At this point I made a new spreadsheet that grouped the lobbyists' donations, with the total amount they donated and the total number of councillors they donated to. I went back to the lobbyist registry and found the issues that the top ten donors had lobbied on, adding them to the spreadsheet. That is what I published in my report.
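For anyone curious, the join and the grouping step would look roughly like this in SQL (table and column names are made up for illustration; I actually did the work in a spreadsheet):

SELECT l.lobbyist_name,
       SUM(d.amount) AS total_donated,
       COUNT(DISTINCT d.candidate_name) AS councillors_donated_to
FROM lobbyists AS l
INNER JOIN donations AS d
  ON d.donor_name = l.lobbyist_name
GROUP BY l.lobbyist_name
ORDER BY total_donated DESC;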

I would have loved to make a searchable database of the results, but this is beyond my know-how. Instead I made two different spreadsheets and attached them to my article. I also wrote a list in the text of the top 8 lobbyists in the second spreadsheet. So far I have gotten really good feedback, including from a few people who had inside knowledge of one of the lobbyists on the list.

The report I published is really a warm-up round for the next election, but since it is only a week away I decided to publish what I had before looking into any of the questions raised by the research. However, the things the report doesn't tell are the details of what the lobbyists are up to. The giving of some lobbyists indicates they were trying to buy influence, but on its own this report does not show that any councillors were swayed by the donations. The next steps would be to interview the lobbyists and people familiar with city politics, as well as going through voting records (which are not stored in a central location, but are included within individual meeting minutes).


Wow, that's really great. Let

david mason
Sun, 2010-10-17 18:13

Wow, that's really great. Let me know if you'd like to work together on putting a searchable database and front end on it.

--Goal: Look at Lottery sales

Matt Carroll
Mon, 2010-10-18 01:24

--Goal: Look at Lottery sales in Mass. by Zip code.
--Datasets: Lottery sales by Zip code. Avg income by Zip using IRS data.
--The Lottery data is in PDFs. These need to be sent to the “cometdoc.com” site, which can turn them into Excel. However, extensive cleanup is required because the columns do not come out quite right. The IRS data is relatively clean, but some cleanup is needed. Also, the data has the total number of filers and total AGI (income, basically), but a simple calculation needs to be done to get the average.
--Thesis: Poorer neighborhoods bet the Lottery more than wealthier neighborhoods. The Lottery ends up as a “tax” on the people who can least afford it.
--Lottery Zips and income data need to be joined by Zip (a query sketch is at the end of this post).
--End result: a story explaining what the data show, along with a strong data visualization showing the Lottery spending patterns by Zip.
--Need to talk with experts on gambling. Also, the data will have outliers – for instance, communities that border other states might have higher or lower Lottery spending, depending on what types of lotteries are in those states and how attractive they are to residents of this state.
(I have the data for Lotteries and income, but am tied up on other stories. This story will be done in a few weeks.)
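--A rough SQL sketch of the Zip join and the average-income calc (table and column names are placeholders, not the real ones):

SELECT l.zip,
       l.lottery_sales,
       i.total_agi / i.num_filers AS avg_income
FROM lottery_sales AS l
JOIN irs_income AS i ON i.zip = l.zip
ORDER BY avg_income;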

Good work, Terri!

Steve Myers
Tue, 2010-10-19 05:34

Good work, Terri!

(Reposting here b/c I think I

Steve Myers
Tue, 2010-10-19 05:48

(Reposting here b/c I think I posted in the incorrect place earlier)

We decided to look into U.S. foreign aid since 9/11 -- specifically, we were curious to see whether aid to Muslim countries has increased or decreased in that time.

DATASET
We'll need two sources of data -- the database of U.S. aid and a table of countries with their Muslim populations -- which we will join.

1. U.S. Overseas Loans and Grants, also known as the "Greenbook" http://gbk.eads.usaidallnet.gov/
2. Wikipedia, "Islam by country" http://en.wikipedia.org/wiki/Islam_by_country

CLEANING
1. The grants table looks pretty good, from my initial review. I notice that the "country" field includes some catch-alls, such as "Western Africa (not specified)," which I'll have to keep an eye out for. But otherwise it's quite clean.

2. When I imported the Wikipedia table into Google Spreadsheet (here's how: http://blog.ouseful.info/2008/10/14/data-scraping-wikipedia-with-google-... ), I saw several parts of the table that will have to be cleaned:
-Remove footnotes and brackets
-Figure out how to turn this "< 0.1" into something that GDocs recognizes as a number
-Rename columns in database-friendly format, without spaces

(Thanks to David Mason, who gave me some advice on how to tackle these. I'll use a recognizable small dummy number for the "< 0.1" entries, and I'll use regular expression in OpenOffice to get rid of the footnotes.)

The two data sources may not name all of the countries the same, as evident from the catch-all entries in the aid database. But because we're just interested in the Muslim countries, not all of them, this shouldn't be much of an issue.

SCHEMA

Tables; Columns (column type)

Country; aid_country_name (text), muslim_country_name (text)
Grants; program_name (text), year (date), amount (number)
country_muslim_population; country (text), muslim_pop (number), muslim_pct (percentage), world_muslim_pct (percentage)

I think a separate table is necessary for the country names because I'm joining the data based on the names. I want to keep the names as entered in each of the source data sets to sort out any problems resulting from inconsistencies.

The last column in the last table is not necessary for my work, but it's already in the table, so I will keep it in case it comes in handy later.
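In SQL, the schema above might look something like this (the column types are guesses at this point, and I've added a country name to the grants table, which I'll need for the join):

CREATE TABLE country (
  aid_country_name    VARCHAR(100),  -- name as it appears in the Greenbook
  muslim_country_name VARCHAR(100)   -- name as it appears in the Wikipedia table
);

CREATE TABLE grants (
  program_name VARCHAR(200),
  country_name VARCHAR(100),  -- not listed above, but needed to join to the country table
  year         DATE,
  amount       DECIMAL(15,2)
);

CREATE TABLE country_muslim_population (
  country          VARCHAR(100),
  muslim_pop       BIGINT,
  muslim_pct       DECIMAL(5,2),
  world_muslim_pct DECIMAL(5,2)
);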

PURPOSE
This is exploratory; I would like to see if US aid to predominantly Muslim countries has decreased since 9/11. My first guess was that it had, but after briefly looking over the data, it seems that my hypothesis may be incorrect. It makes sense that the U.S. would provide more aid, not less, to Muslim countries since 9/11. My next question: what kinds of aid is the U.S. providing? I will learn this to some degree from the names of the relevant programs, but that will only be a starting point.

QUERIES
Haven't done the SQL yet.
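As a rough starting point, though, something along these lines should total aid per year to countries that are at least half Muslim (the 50 percent cutoff is just a placeholder, and it assumes the extra country column in the grants table above):

SELECT YEAR(g.year) AS aid_year,
       SUM(g.amount) AS aid_to_muslim_countries
FROM grants AS g
JOIN country AS c
  ON c.aid_country_name = g.country_name
JOIN country_muslim_population AS m
  ON m.country = c.muslim_country_name
WHERE m.muslim_pct >= 50
GROUP BY aid_year
ORDER BY aid_year;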

PRESENTATION
At first glance, this appears to be a good data set to display visually. However, because I'm just exploring the data, I don't know what correlation to highlight. For instance, displaying amount of aid in relation to the percentage of the Muslim population doesn't make sense because there are too many other variables that could factor into aid decisions.

UNANSWERED QUESTIONS
This won't tell me what the aid actually paid for -- just the name of the program under which the money was disbursed. That could give me a general idea of the purpose of the money, but then again, Congress could've directed Agency X to disburse money that has little connection to Agency X's stated mission.

Steve, Great work! I'd be

Nick Judd
Wed, 2010-10-20 18:00

Steve,

Great work!

I'd be curious to know if OpenOffice on your computer can handle the dataset for you to run regexp on it. As I mentioned in the lecture, if that kind of tool fails, go old-school — robust and simple text editors like vim are difficult to learn at first, but are very difficult to crash.

Here's the vim cheat sheet I promised during lecture: http://www.lagmonster.org/docs/vi.html

Hi Nick, The Muslim countries

Steve Myers
Wed, 2010-10-20 18:37

Hi Nick,

The Muslim countries table, which is the one that has to be cleaned, is really small. But it's good to have another tool. I don't know if any of my other applications handle regular expressions. It looks like Komodo will do it, though I've never tried.

I'll work on the cleaning tonight, and I'll see if I get far enough to put together the SQL queries.

So, Nick...Now that I'm

Steve Myers
Thu, 2010-10-21 03:55

So, Nick...Now that I'm looking into regular expressions, I see that I could use OpenOffice's Writer, Calc and Base (even Word) and a text editor like Vim. So why would I use one or the other? Is this a situation in which there are many programs that have this functionality and I just decide which one I like? I'm not even sure how I would decide that.

Yes. It's one of those highly

Nick Judd
Fri, 2010-10-22 02:25

Yes. It's one of those highly subjective things. Do you keep the tie for unexpected political interviews in the office drawer or the trunk of the car? Mets or Phillies? Ruby or Python? Vim or emacs? Ah, the existential quandaries we face in life.

The issue with Word is not regular expressions, it's that if you open a file of great size in Word, it will crash or take forever depending on your hardware. AND Word doesn't quite understand that you don't want rich text formatting in your CSV file, which has caused me some irritation in the past. So when I'm working with this: http://www.fec.gov/data/CandidateDisbursement.do?format=html I use vim, because it's less apt to break.

UPDATED: For this purpose, I'd bet OpenOffice Calc could do the job. As you get more savvy with regular expressions, and become more familiar with the command line, you'll graduate to vim, because you can accomplish with two lines of code what would take you two dozen mouse clicks. As you surpass me and become more like my developer friends, you will look with scorn upon a repetitive task like opening a text editor and doing recursive find and replace operations, will build yourself a Python script to handle the most common operations, and will pass your files through that script on their way into your RDBMS.

Python has a large overhead,

david mason
Fri, 2010-10-22 02:48

Python has a large overhead; I'd use vim or, in extreme cases, sed (a stream editor), which requires a Unix-based command line (where the output of each program is "piped" through the next), or Cygwin on Windows. There's an old expression: those who don't understand Unix (concepts) are bound to re-invent them. All these graphical programs add layers and layers, which end up being inefficient and not suited to every task. Ultimately I think it's better for hacks and hackers to work together on interesting projects, with the same general interests but each with their specialized corner cases.

Matt: An interesting thesis

Nick Judd
Fri, 2010-10-22 04:38

Matt:

An interesting thesis and the right data to approach it. Where do you get the lottery sales dataset? What format is it in? And how big is it?

Do your ZIP Codes correspond nicely to the neighborhoods you're looking at? What do you need to turn the joined data into a visualization?

It sounds like you have this project well in hand. I would wager that there are not enough records in this set to warrant using an RDBMS — and you could probably get away with simply ordering the two spreadsheets the same way in Excel and then pasting one into the other. However, the issues of "cleaning" data and planning through what you're getting are the same — they're just not happening on a scale such that you'll need to automate those tasks through regular expressions or automated find and replace.

Marlon, This is an

Nick Judd
Fri, 2010-10-22 04:42

Marlon,

This is an interesting project. It sounds like you've already fixed the inconsistencies in the data and are now sitting on a database that can answer exactly the questions you're asking.

I'm curious: Did you find answers?

EXTRA CREDIT: Geocode the addresses, find the KML file you'd need to display this as a map, and display the results of your inquiry. Perhaps using OpenStreetMap?

Nick and David: Yes, in this

Steve Myers
Fri, 2010-10-22 04:53

Nick and David: Yes, in this case I did do a couple of regular expression find-and-replaces (here's a good tutorial on regex in OpenOffice: http://wiki.services.openoffice.org/wiki/Documentation/How_Tos/Regular_E... )

I've never used regular expressions, so this was a good introduction to how I could use them to save some time when dealing with a data set. (Actually, I may have used them once or twice, but I didn't know what I was doing.)

David has a good point -- hacks shouldn't spend all their time trying to learn what hackers already know. But I think data cleaning shouldn't just be left to the hackers -- many data-oriented reporters deal with these situations. I guess it's up to the hack to decide how far to go down the coding track in pursuit of cleaner data. As for me, in my previous job I spent a lot of time doing super-repetitive cleaning on data sets, so I wish I had learned this earlier. I bet a lot of my needs would've been satisfied with regex in Calc.

Absolutely, I would say using

david mason
Fri, 2010-10-22 16:26

Absolutely, I would say using regular expressions is an essential tool for anyone who works with data, to find or fix text that conforms to patterns. My point was more that hacks shouldn't have to *become* hackers (spend most of their time learning hacking, start watching Star Trek, etc). But the good news is, mostly due to the open source culture, there is a lot of technical documentation out there. People should be aware that while regexes are a general idea, implementations are subtly different so make sure, as Steve has done, that the tutorial applies to what you're using.

Slightly connected to this

Tim Groves
Fri, 2010-10-22 16:37

Slightly connected to this thread, I am gonna put out a call for some help in converting a couple of .pdf docs into .xls or some other sort of table so that I can use them in a database. I need to get this done before the municipal election, which is in 4 days, so I decided to ask for help.

Two of the three mayoral candidates in the Toronto municipal election released a list of their financial backers, and I am trying to convert that list from a PDF into an Excel document so that I can compare it to the names in the lobbyist registry, which I already scraped for my first investigation. Unfortunately I am stuck.

I thought I knew how to do this: run the PDF through OCR and get a .txt file, convert it to .csv and open it in Excel, then use fixed-width delimiters to get the data in the right columns. But that hasn't been working. When I run OCR it takes the data out of the tables and lists first all the dates, then below that all the names, and below that all the donations, often without spaces between the names. Am I mixing up how to do this? Does anyone have tips, or know where there are tutorials?

I also tried running it through a couple "pdf to xls" services, but was told that no tables could be found in the document.

Anyone have any ideas or suggestions? Right now I am using a Mac, but if there are better programs on PC or Linux, I can try to borrow a computer to generate this table. (As a note, both PDFs are just over 4 MB.)

If anyone can help me out, I would appreciate it a lot.

-tim

In 2008 the Argentine

Mariano Blejman
Mon, 2010-10-25 04:44

In 2008 the Argentine government decided to raise taxes on soybean production due to the disproportionate increase in soybean plantings in the country. The reason for the excessive growth was that soybeans offset the gains of other crops such as corn, wheat and livestock. Because of protests by the owners of the fields, which paralyzed the country for several months, the government backtracked on the measure. The thesis of this article is that the planting of soybeans has gone up over the last ten years, while the rest of Argentina's major crops (corn, wheat, etc.) have fallen consistently, and the growth of soybeans has increased further in the last two years. Furthermore, compared with the major world producers, Argentina has risen in the global ranking of soybean producers. The main information related to this work is on Usda.gov, which I accessed through data.gov. Most of the information is available on the web, but it is difficult, at least for this reporter, to export it in a systematic way in order to process it.

I have used the Crop Explorer

http://www.pecad.fas.usda.gov/cropexplorer/cropview/CommodityView.cfm?cr...

Clean data:

It is necessary to identify the last ten years of soybean production, use the maps to identify the shift of the agricultural frontier, and identify as sources the remaining countries in the ranking prepared by the USDA.

Schema:

Create a database with the fields country / year / production, then compare across years and countries and create a map.

We are testing an existing thesis: to demonstrate interactively that soybean planting
1 - displaced other crops during the past ten years
2 - rose sharply after the government's attempt to stop its growth through heavy taxes.
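A sketch of the kind of SQL query behind that comparison, with illustrative table and column names (the real figures would come from Crop Explorer):

SELECT crop, year, SUM(production) AS total_production
FROM crop_production
WHERE country = 'Argentina'
  AND crop IN ('soybeans', 'corn', 'wheat')
  AND year BETWEEN 2000 AND 2010
GROUP BY crop, year
ORDER BY crop, year;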

Results
The results will be shown in tables and maps, as well as in stories about the growing use of soy and stories of life around the crops.

Questions the data won't answer:
The broader investigation in this work has to do with the problems caused by the planting of transgenic soybeans: clearing of land, monoculture as a national model, and glyphosate damage in the communities where soybeans are planted.