
Methodology

Our methods in more detail

Based on a meeting with the City of Berkeley's IT department and research on government open data policies, we determined five key categories that we could use to compare the open data platforms of different local governments in the Bay Area: quantity, recency, accessibility, breadth, and quality.

To select the cities to compare, we decided to focus on governments that use Socrata, a software company whose platform powers the open data portals of many cities, including San Francisco and Oakland. Next, we considered various statistics we could collect about each city's data platform and how these could inform our metrics.

Having decided on a set of cities and metrics by which to assess them, we built a collection of web-scraping scripts. These deterministically searched outward from the navigation page of each open data platform, "clicking" every reachable link and checking whether it satisfied our "dataset" criterion. We applied regular expressions (regex) to each page's source code and URL to guide the scraper's search.
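For concreteness, the sketch below shows the kind of regex-guided, breadth-first crawl we describe; the dataset URL pattern (Socrata's four-character-by-four-character page identifier), the library choices, and the page limit are illustrative assumptions rather than our exact code.

```python
import re
import requests
from collections import deque

# Hypothetical pattern: Socrata dataset pages typically end in a "four-by-four"
# identifier such as .../abcd-1234 (this pattern is an assumption).
DATASET_RE = re.compile(r"/[a-z0-9]{4}-[a-z0-9]{4}$")
LINK_RE = re.compile(r'href="(https?://[^"]+)"')

def crawl_for_datasets(start_url, max_pages=200):
    """Breadth-first search from a platform's navigation page,
    collecting links that look like dataset pages."""
    seen, datasets = {start_url}, set()
    queue = deque([start_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable pages
        for link in LINK_RE.findall(html):
            if DATASET_RE.search(link):
                datasets.add(link)          # looks like a dataset page
            elif link.startswith(start_url) and link not in seen:
                seen.add(link)              # stay on the same platform
                queue.append(link)
    return datasets
```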

Once we had a list of links to the datasets published by each city, we extracted further information from the corresponding JSON (JavaScript Object Notation) files. Our scripts pulled relevant fields from the metadata section of each file, such as view count and recency.
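The extraction step can be pictured with a short sketch like the following; the metadata field names (viewCount, downloadCount, createdAt, rowsUpdatedAt) are assumptions based on what Socrata's dataset JSON typically exposes.

```python
import json

def extract_metadata(json_path):
    """Pull the fields we used from one dataset's JSON description.
    Field names are assumptions based on Socrata's dataset metadata."""
    with open(json_path) as f:
        meta = json.load(f)
    return {
        "id": meta.get("id"),
        "name": meta.get("name"),
        "views": meta.get("viewCount"),
        "downloads": meta.get("downloadCount"),
        "created_at": meta.get("createdAt"),        # Unix timestamp
        "last_updated": meta.get("rowsUpdatedAt"),  # Unix timestamp
    }
```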

 Finally, we organized these statistics by city and dataset and wrote our results to files for further analysis. We discuss our analytical methods below.


1. Quantity

Ranking based on number of datasets published, scaled according to size of city government

We defined the quantity of data as the number of items a city published in its data catalog under the type "Dataset." Cities also publish other item types, such as data visualizations and maps, which we excluded. Once filtered, all of the quantity counts were extracted from pages similar to this page (San Francisco's list of datasets).

Once the data was collected, we noticed that larger cities have more published datasets than smaller ones. We corrected for this by normalizing the number of datasets by city population, on the idea that a larger population entails a larger government infrastructure. The resulting statistic (datasets per one thousand residents) is a ratio rather than a raw count, which makes cities easier to compare.
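The normalization itself is a simple ratio; the city names, counts, and populations in this sketch are hypothetical.

```python
# Hypothetical counts and populations, for illustration only.
dataset_counts = {"San Francisco": 1200, "DataVille": 90}
populations = {"San Francisco": 884_000, "DataVille": 60_000}

datasets_per_1000 = {
    city: dataset_counts[city] / (populations[city] / 1000)
    for city in dataset_counts
}
print(datasets_per_1000)  # -> {'San Francisco': 1.357..., 'DataVille': 1.5}
```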

The number of datasets parsed from the HTML of a platform's web pages varies from source to source, since each platform has its own notion of what counts as a dataset; some counts include data visualizations and maps. To mitigate this and avoid double counting, all of the quantity numbers were extracted from pages similar to San Francisco's list of datasets, which filters the results down to datasets only.

2. Recency

Ranking based on the average time elapsed since datasets were created and last updated

The recency portion of the metric rewards cities with more recently created and updated datasets. From each JSON file we collected the date the dataset was created and the date it was last updated.

We defined the "present" to be Monday, April 30th, and computed the number of days elapsed since each dataset's creation and last update. This somewhat arbitrary definition of the "present" has no impact on the scores as long as the underlying data stays the same, since the "present" serves only as a common reference point for relative comparison.

We then calculated, for each city, the average time since last update and since creation across its data files. However, simply ranking recency by average update time did not account for the fact that certain datasets do not require updating. For example, San Francisco's Salary Ranges by Job Classification dataset does not need to be updated as often as police department incident datasets.

We therefore modeled update time with a cumulative distribution function showing the percentage of datasets that were updated within a given number of days before Monday, April 30th. This allows comparison of update rates between cities without unfairly penalizing datasets that do not need frequent updates.
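A minimal sketch of the empirical CDF computation, assuming the last-update dates have already been parsed; the example dates and the year of the "present" are illustrative.

```python
from datetime import date

PRESENT = date(2018, 4, 30)  # the fixed "present"; the year is assumed for illustration

def days_since(update_dates, present=PRESENT):
    """Days elapsed from each last-update date to the present, sorted ascending."""
    return sorted((present - d).days for d in update_dates)

def empirical_cdf(days):
    """For each observed value x, the fraction of datasets updated within x days."""
    n = len(days)
    return [(x, (i + 1) / n) for i, x in enumerate(days)]

# Hypothetical example: three datasets last updated on these dates.
example = [date(2018, 4, 29), date(2018, 1, 1), date(2016, 4, 30)]
print(empirical_cdf(days_since(example)))
# -> [(1, 0.33...), (119, 0.66...), (730, 1.0)]
```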


3. Accessibility

Measured using average dataset download count and average number of views

We initially planned to use the number of "clicks" needed to reach the open data platform from the city homepage as a measure of accessibility, and we created a network graph of all the links reachable from each city homepage. However, we soon realized this was not a good measure.
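For reference, the abandoned graph construction looked roughly like the following sketch; the crawl depth, regex, and use of networkx are assumptions for illustration, not our exact code.

```python
import re
import networkx as nx
import requests

LINK_RE = re.compile(r'href="(https?://[^"]+)"')

def link_graph(homepage, depth=2):
    """Crawl `depth` hops out from a homepage, adding an edge for every hyperlink found."""
    graph = nx.DiGraph()
    visited, frontier = set(), {homepage}
    for _ in range(depth):
        next_frontier = set()
        for url in frontier - visited:
            visited.add(url)
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue  # skip pages that fail to load
            for link in LINK_RE.findall(html):
                graph.add_edge(url, link)
                next_frontier.add(link)
        frontier = next_frontier
    return graph

# The "number of clicks" would then be, e.g. (URLs are hypothetical):
# nx.shortest_path_length(link_graph("https://example-city.gov"),
#                         "https://example-city.gov", "https://data.example-city.gov")
```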

First, the city home page and the open data home page are almost always connected directly, even if the link is not easy to find. To generate the network graph, we extracted every link from each page, so a degree-one connection (a direct link) between the city home page and the open data platform may rest on a URL that is buried or unlabeled; the existence of a degree-one connection therefore does not by itself make the platform "accessible."

The second issue was that this graph was too expansive. Any given website links not only to directly relevant resources but also to advertisers, development platforms, and many other miscellaneous sites. While it is possible to filter out some of these results, the filtering quickly started to overfit to the city we were testing on and became difficult to generalize.

As a workaround, we instead used the number of views and the number of downloads per dataset. We parsed these values from the JSON files and took both the median and the mean for each city.
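A sketch of that aggregation, assuming each city's datasets have already been reduced to view and download counts (the numbers shown are hypothetical):

```python
from statistics import mean, median

def accessibility_summary(datasets):
    """datasets: list of dicts with 'views' and 'downloads' keys for one city."""
    views = [d["views"] for d in datasets]
    downloads = [d["downloads"] for d in datasets]
    return {
        "mean_views": mean(views),
        "median_views": median(views),
        "mean_downloads": mean(downloads),
        "median_downloads": median(downloads),
    }

# Hypothetical example for one city.
print(accessibility_summary([
    {"views": 1200, "downloads": 40},
    {"views": 300, "downloads": 5},
    {"views": 50, "downloads": 1},
]))
```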

4. Breadth

Calculation of the spread and relevancy of datasets available using tags and dataset titles

A naive approach to measuring the breadth of a city's data might use the city's own categorization scheme and simply count the number of datasets under each category. However, this approach has major issues:

  • Cities do not share the same categorization schemes, so this would prevent inter-city comparison.
  • In many cases, there is no “proper” way to categorize a dataset. Should a dataset of city health clinic budgets go under the “Budget” category, or “Public Health,” or both?

To remedy these issues, we chose a measure of breadth built on a Natural Language Processing (NLP) benchmark: GloVe similarity (Pennington et al.). GloVe is a machine-learning-based word embedding; the similarity between a pair of embedded words ranges from 1 for maximal semantic similarity to -1 for semantic opposites.

GloVe represents words as points in a 50-dimensional vector space, learned from word co-occurrence statistics in a large text corpus. Intuitively, words with similar meanings should lie close together in this space. GloVe is especially good at capturing the degree to which a pair of words is similar, which is why we used it instead of methods such as word2vec or Latent Semantic Analysis (LSA).

We modeled 8 categories of data relevant to local government on New York City's open data categorization schema, then picked a set of representative words for each category. For example, the public safety category contained words such as "police" and "crime."

To measure breadth for each city and category, we scraped dataset tags from each dataset's web page. After embedding both our category words and the dataset tags into the GloVe vector space, we computed a similarity score between each city and each category by averaging the inner products of the embedded category word vectors and the city's tag word vectors. For example, DataVille might have 500 tag words; for each of these we compute a similarity score against every word in each category. This might give DataVille a similarity score of 0.2 to the education category but 0.8 to the public health category.
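A minimal sketch of this scoring step, assuming pretrained 50-dimensional GloVe vectors saved as a plain-text file; cosine similarity stands in here for the normalized inner product described above, and the file name, tags, and category words are illustrative.

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors from a whitespace-separated text file."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            vectors[word] = np.array(values, dtype=float)
    return vectors

def cosine(u, v):
    """Similarity in [-1, 1] between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def category_score(tags, category_words, vectors):
    """Average pairwise similarity between a city's tag words and a category's words."""
    sims = [
        cosine(vectors[t], vectors[c])
        for t in tags if t in vectors
        for c in category_words if c in vectors
    ]
    return sum(sims) / len(sims) if sims else 0.0

# Illustrative usage (file name, tags, and category words are assumptions):
# vectors = load_glove("glove.6B.50d.txt")
# print(category_score(["clinics", "hospital"], ["health", "medicine"], vectors))
```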

GloVe addresses both problems with the naive approach. By capturing similarity of meaning, it avoids the issues caused by inconsistent naming schemes for city data categories: from GloVe's point of view, the words "education," "schools," and "pupils" are all roughly the same. Further, by comparing each dataset to each category word, it captures the fact that a single dataset may contain information from several categories.

Scores fall on a 0-to-1 scale, and as we expected, no city's score exceeded 0.5 for any category; a higher score would have been quite surprising, since a score roughly represents the fraction of a city's data matching a single category. In our DataVille example, the scores would (roughly speaking) mean that DataVille has four times as many datasets about public health as about education.


5. Quality

Measure of dataset machine readability and adherence to data delivery standards.

Project Open Data was launched during the second Obama administration to provide tools for data transparency and accountability for federal agencies. Its website offers a JSON validation tool that examines a file's metadata to check whether it conforms to federal standards. We evaluated the data files of the six cities against the Non-Federal Schema, version 1.1.
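A comparable check can be run locally with the jsonschema package against a downloaded copy of the Project Open Data catalog schema; the file names below are assumptions, and this is a sketch rather than the dashboard's own implementation.

```python
import json
from jsonschema import Draft4Validator

# Both files are assumed to have been downloaded beforehand:
# a copy of the Project Open Data catalog schema, and a city's data.json catalog.
with open("catalog_schema.json") as f:
    schema = json.load(f)
with open("city_data.json") as f:
    catalog = json.load(f)

validator = Draft4Validator(schema)
errors = list(validator.iter_errors(catalog))
for err in errors:
    print(f"{list(err.path)}: {err.message}")
print(f"{len(errors)} validation error(s) found")
```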

All of the cities we examined use Socrata, which natively supports data.json and extended metadata fields. However, when we checked each city's JSON metadata at https://labs.data.gov/dashboard/validate, we ran into errors whose cause is unclear. We conclude that the issue needs further study, and that the problem could stem from an error on Socrata's side or on Project Open Data's.

See more on our GitHub.