We look at the use case Transparency and Democracy through text that is produced by and for politicians. The main data set consists of written questions and answers from the Amsterdam Municipality Council. To showcase possibilities of structured data, Dutch Parliamentary Proceedings are analysed as well.
Our approach uses automated text analysis to enrich the content of the documents. Applications that further "open up" the data are built on top of these enriched political documents.
Looking at textual documents, one distinction we can make is between transparency within documents and transparency between documents. Within documents, data can be made more accessible by, for instance, facilitating search. Between documents, data can be compared and linked to other, new data sources.
Both are made possible by the text analysis, which summarises each document by its most relevant terms and detects the named entities therein.
Before the data is processed, it is worth examining how "open" the data actually is. The two data sets differ in the way they are made available, and in the level at which their content is consolidated.
The written questions are available from roughly 2010. They contain questions and answers from the overall municipality council.
Proceedings of the government have been published digitally since 1995. They contain a (slightly redacted) transcript of the oral questions and discussions in the Lower House and Senate.
Before the data is used in applications, it is processed through several steps.
Documents are published in many forms. For the Amsterdam municipality data, the officially published PDF files are downloaded.
Many documents are easy to read and understand for people, but much less so for computer tools. The textual content of the (visually oriented) PDF documents is extracted and stored separately.
Known implicit structure can be made explicit. XML is a useful structured data format that is used to this end. The second data set of governmental proceedings is already available as XML and downloaded as such.
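To illustrate why explicit XML structure is useful, the sketch below parses a small hypothetical fragment in the spirit of the proceedings data; the element and attribute names here are invented for illustration and differ from the real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the real proceedings XML uses different
# element and attribute names, but the idea is the same.
xml_doc = """<proceedings date="2011-03-01">
  <speech speaker="J. Jansen" party="PvdA">
    <p>Vraag over Schiphol.</p>
  </speech>
</proceedings>"""

root = ET.fromstring(xml_doc)
# The explicit structure ties each text element to its speaker,
# something the PDF documents only convey visually.
speeches = [(s.get("speaker"), s.findtext("p").strip())
            for s in root.iter("speech")]
```

With this structure in place, attributing a question to a speaker is a simple lookup rather than a text-mining problem.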
The text in the document is analysed word by word by software. Words and phrases are converted to lemmas, Named Entities (typically proper nouns) are detected, and distinctive terms (relative to other documents) are determined.
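The distinctive-terms step can be sketched with a plain tf-idf weighting over lemmatised tokens; this is an illustrative scoring, not necessarily the exact weighting used in the project.

```python
import math
from collections import Counter

def distinctive_terms(doc_tokens, corpus, top_n=5):
    """Rank the lemmas of one document against the rest of the corpus
    with a simple tf-idf score (illustrative weighting)."""
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    tf = Counter(doc_tokens)
    scores = {
        term: (count / len(doc_tokens)) * math.log(len(corpus) / df[term])
        for term, count in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Terms that occur in every document get an idf of zero and drop out, so the surviving top terms are those characteristic of this one document.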
Known terms and named entities are annotated with links to Wikipedia. In the proceedings, known politicians (and parties) are explicitly identified as unique people.
The structured, annotated data is made accessible online, through search interfaces, document viewers and aggregate charts.
Here we present five tools that were implemented using the data sets. They are examples from a much larger space of possible tools, ranging from data exploration to fact checking.
Most of the tools can be accessed through ode-tools.
With the documents and analyses available, a first application is to visualise a summary of each document. A popular method for such visualisations is a wordcloud. The terms shown here are from a document about local amateur football club AFC. Each term is lemmatised, and only verbs, nouns and adjectives are included. The sizes signify how representative the word is for this document (compared to the other documents in the municipality data set).
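The sizing of the terms can be done with a simple linear mapping from relevance score to font size; the point range below is an arbitrary choice for illustration.

```python
def term_sizes(scores, min_pt=10.0, max_pt=48.0):
    """Linearly map term relevance scores to font sizes for a
    wordcloud (illustrative point range)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all equal
    return {term: min_pt + (s - lo) / span * (max_pt - min_pt)
            for term, s in scores.items()}
```

The most representative term gets the largest size, the least representative the smallest.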
Summaries can go further than standard wordclouds. Our document viewer for instance also lists entities, ordered by count, with an internal link to the first occurrence. The summary can thus serve as a document index on subjects.
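The entity listing described above can be built in a few lines: count the mentions and record where each entity first appears, so the summary can link into the document. The data layout here is a simplification for illustration.

```python
from collections import Counter

def entity_index(entities):
    """Build a document index: each entity with its mention count and
    the position of its first occurrence (for an internal link)."""
    counts = Counter(entities)
    first = {}
    for pos, ent in enumerate(entities):
        first.setdefault(ent, pos)
    # Most frequently mentioned entities first.
    return sorted(((e, c, first[e]) for e, c in counts.items()),
                  key=lambda t: -t[1])
```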
After documents are made available as open data, a good next step is to facilitate search within those documents. Most search engines will show documents with a short text snippet where the query term was found. The relevance of a document might not be directly clear from just a snippet.
The document summaries can be used to create a short indicator of the topic of a document. In the search results shown here, several roles of Amsterdam Schiphol Airport can be distinguished, such as an employment provider, a country border, or a destination for bird colonies.
The more structured governmental proceedings can sometimes be quite long, even for a single topic. When reading through such documents, it can be useful both to have a quick overview of the current subject, and find related information online.
The example shown displays a single paragraph of text, with the detected named entities listed explicitly below. Each entity links to Wikipedia for background information.
Larger data sets, spanning a more significant length of time, are often well suited for chronological time line visualisations. One example shown here is a plot of an entity (Libya) from 2001-2012. A significant spike is visible at the beginning of 2011, when an important event took place that was widely discussed internationally.
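Such a time line boils down to counting, per period, the documents that mention an entity. A minimal sketch, assuming each document is reduced to a (year, entities) pair:

```python
from collections import Counter

def entity_timeline(docs, entity):
    """Count, per year, the documents mentioning an entity.
    `docs` is a list of (year, entities) pairs; this schema is a
    simplification of the annotated documents."""
    counts = Counter(year for year, ents in docs if entity in ents)
    # Fill in zero counts so the plot has no gaps.
    years = range(min(counts), max(counts) + 1)
    return [(y, counts.get(y, 0)) for y in years]
```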
Time lines can be deceptive and require interpretation to have meaning. They are an interesting possibility for exploration, however. The chart tool allows a direct link to a search for documents containing the given entity within the specific time range. It is easy to imagine a new combined tool that shows, for instance, a summary of a set of documents for a given time period.
The last application combines several different annotations and an external data source, for an automated news query.
When questions are asked in the Lower House during a “question hour”, related news articles are often published by newspapers. An external institute, the European Media Monitor, collects news articles and makes these available for querying.
From the structured proceedings data, we know several things. First, the questions are present in the document as text elements attributed to the specific person asking the question. Second, the date of the meeting determines when news articles are likely published. Finally, the document summary describes the likely topic of the news articles.
The information is combined to create a query to the EMM service. In words, the queried news articles must contain at least the name of the speaker who asked the question, should be published around the date of the meeting, and should contain at least one of the top ten distinctive terms. The returned set of news articles is then further analysed to collapse similar articles.
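Assembling that query can be sketched as follows; the parameter names and the date window are illustrative, not the actual EMM API.

```python
from datetime import date, timedelta

def emm_query(speaker, meeting_date, top_terms, window_days=3):
    """Combine speaker, meeting date, and distinctive terms into
    query parameters for a news-archive search (hypothetical
    parameter names, not the real EMM interface)."""
    return {
        "all_of": [speaker],                 # speaker name is required
        "any_of": top_terms[:10],            # at least one top term
        "from": (meeting_date - timedelta(days=window_days)).isoformat(),
        "to": (meeting_date + timedelta(days=window_days)).isoformat(),
    }
```

The date window around the meeting narrows the search to articles likely prompted by the question hour.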