ODE II - Work Package 6 - M2: Search with named entities and distinctive terms

A first analysis of detected entities and the value of distinctive terms is presented for three data sets. The first data set is an update of the Amsterdam City Council written questions. The second data set is a sample of annotated Dutch governmental proceedings. The third data set is a collection of notifications of building and environmental permits.

Notable deliverable results, explained further below, are a search engine for the Amsterdam data and annotated Dutch governmental proceedings.

A presentation (Dutch) for the Registry of the Amsterdam City Council gives an overview of the work done within this Work Package.

Amsterdam City Council

The data for the City Council is the same as used in M1. The internal representation of detected entities has been updated, and the visual (HTML) view has been improved. The data still contains the full textual FoLiA annotations, with part-of-speech tags and word lemmas.

A search engine was created that performs a full-text search on the processed documents and adds the top relevant terms to each search result. This makes the content of each document much clearer at a glance, before a possibly relevant search result is picked. For example, documents pertaining to the Zuidas can be about subjects ranging from the nearby football club to the temporary 'occupation' by demonstrators in 2011.

With the explicit representation of entities, it is possible to list all available entities in the data set by means of their Wikipedia link. An added benefit is the clear overview of all different textual manifestations of an entity. For instance, the College van burgemeester en wethouders occurs 55 times as "B & W", 32 times as "College van B & W", and 3 times as "College van Burgemeester en Wethouders".
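The overview of textual manifestations per entity can be sketched as a simple grouping step. This is a minimal illustration, not the project's implementation: the mention list, the entity identifiers and the function name surface_forms are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical input: one (entity link, surface form) pair per detected mention.
mentions = [
    ("College_van_burgemeester_en_wethouders", "B & W"),
    ("College_van_burgemeester_en_wethouders", "College van B & W"),
    ("College_van_burgemeester_en_wethouders", "B & W"),
    ("Noord/Zuidlijn", "NZ-lijn"),
]


def surface_forms(mentions):
    """Group mentions by entity link and count each textual manifestation."""
    forms = defaultdict(Counter)
    for link, text in mentions:
        forms[link][text] += 1
    return forms


forms = surface_forms(mentions)
for link, counts in forms.items():
    # most_common() lists the manifestations from most to least frequent.
    print(link, counts.most_common())
```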

It is also possible to search for all documents in which an entity is mentioned. This similarly has the benefit that the exact manifestation is unimportant. A disadvantage, however, is that the exact entity name is needed. For example, a query for the Noord/Zuidlijn, present both as "Noord-Zuidlijn" and "NZ-lijn", gives a list of documents ordered by the number of times the entity was found in each.
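The ordering of results by the number of entity mentions can be illustrated as follows. The document identifiers, the in-memory index and the function documents_for_entity are hypothetical; the actual search runs inside the database and may differ.

```python
from collections import Counter

# Hypothetical index: document id -> entity links detected in that document.
doc_entities = {
    "doc-1": ["Noord/Zuidlijn", "Noord/Zuidlijn", "Zuidas"],
    "doc-2": ["Zuidas"],
    "doc-3": ["Noord/Zuidlijn"],
}


def documents_for_entity(doc_entities, entity):
    """Return (document id, mention count) pairs, most mentions first."""
    hits = []
    for doc_id, links in doc_entities.items():
        count = Counter(links)[entity]
        if count > 0:
            hits.append((doc_id, count))
    # Sort by descending count; ties are broken by document id.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


ranking = documents_for_entity(doc_entities, "Noord/Zuidlijn")
print(ranking)
```

Because the lookup uses the entity link rather than the text, a document that only mentions "NZ-lijn" is found by the same query as one that mentions "Noord-Zuidlijn".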


Data Source

Source PDFs of the written questions can be downloaded at http://www.amsterdam.nl/gemeente/gemeenteraad/instrumenten-raad/schriftelijke_vragen/.

Dutch Governmental Proceedings

The addition of full text descriptions (part-of-speech tags, lemmas) can be a strain for large data sets. For the Dutch governmental proceedings of the oral questions, only the summary information is added to the data. The full text analysis, represented as FoLiA, is run as above, after which the relevant terms and entity annotations are added to the input documents. For this second milestone, a small sample set of one day of meetings is processed, containing one large document.

The same entity listing and search as for the Amsterdam municipality data are available. In the document view three additions are visible.

First is the detection of members of parliament, an important subset of the entities within the proceedings. When possible, these people are explicitly linked to an existing data set of politicians by means of a member-reference, shown in the Entities table in the "member" column. A separate step was implemented in the entity linking process that first considers solely politicians. Using additional knowledge (the type of document, the date of the document, and the possible politicians), otherwise ambiguous names can be resolved to a single person.
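How the document date can disambiguate a name is sketched below. The knowledge base, the names, the member references and the function resolve_member are all invented for illustration; the real member identifier is part of PoliticalMashup and uses more context than this.

```python
from datetime import date

# Hypothetical knowledge base: (member reference, surname, seat start, seat end).
MEMBERS = [
    ("member-001", "Jansen", date(2006, 11, 30), date(2010, 6, 17)),
    ("member-002", "Jansen", date(2010, 6, 17), date(2012, 9, 20)),
    ("member-003", "De Vries", date(2006, 11, 30), date(2012, 9, 20)),
]


def resolve_member(surname, meeting_date, members=MEMBERS):
    """Return the member reference if exactly one candidate was seated on that date."""
    candidates = [
        ref
        for ref, name, start, end in members
        if name.lower() == surname.lower() and start <= meeting_date < end
    ]
    # Only accept an unambiguous match; otherwise leave the name unlinked.
    return candidates[0] if len(candidates) == 1 else None


print(resolve_member("Jansen", date(2011, 3, 1)))
```

The surname "Jansen" alone matches two people here; restricting candidates to those seated on the meeting date leaves one.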

Second is the first occurrence of an entity in the document. Where the municipality data consists of a title and one large text element, the granularity of the proceedings is much finer. Although the per-word annotations are no longer present in this data set, the entity references still allow direct access to the relevant part of the text.

Third are the entity references, which are shown at the smallest available structural element: the paragraph. If an entity was found within a paragraph, the original word is listed below the text as a link to its respective Wikipedia page.


Data Source

For a description of the governmental proceedings, see the documentation of the PoliticalMashup project.

Building Permits

The general approach used for the municipality data was applied to a set of around 1800 building and environmental permit notifications. These documents are very small, often consisting of only one or two sentences. With so little text, summary results (entities and relevant terms) degrade to the point that they cannot be fully trusted. The detected entities are mostly street names. It is still insightful, however, to look at the specifics of the results and do an in-depth analysis of word and lemma co-occurrences. NB: the queries below are run in real time and can take several seconds to complete.

One idea is to summarise the documents with a noun-verb combination of the most distinctive term of each. Reading these shows that the summaries can work well, but not systematically. Understanding important concepts in the data is supported by looking at bigrams. These analyses can also be filtered, for instance to look only at nouns as lemmas or verbs as words, ignoring all words in between. Note that while lemmas are present as annotations, they are not part of the "text" as such, so the demo search with lemmas might not yield results.
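A filtered bigram count of this kind can be sketched as below. The token triples and the coarse part-of-speech labels ("N", "WW", "LID", "VZ") are assumptions about how the annotations might be read out, and filtered_bigrams is a hypothetical helper, not one of the project's queries.

```python
from collections import Counter

# Hypothetical annotated tokens: (word, lemma, coarse part-of-speech tag).
tokens = [
    ("Het", "het", "LID"),
    ("plaatsen", "plaatsen", "WW"),
    ("van", "van", "VZ"),
    ("een", "een", "LID"),
    ("dakkapellen", "dakkapel", "N"),
]


def filtered_bigrams(tokens, keep_pos, use_lemma=True):
    """Count bigrams over tokens with a kept tag, skipping all words in between."""
    kept = [
        (lemma if use_lemma else word)
        for word, lemma, pos in tokens
        if pos in keep_pos
    ]
    # Pair each kept token with the next kept token.
    return Counter(zip(kept, kept[1:]))


bigrams = filtered_bigrams(tokens, keep_pos={"N", "WW"})
print(bigrams.most_common())
```

With only verbs and nouns kept, the article and preposition drop out and the sentence reduces to the pair ("plaatsen", "dakkapel"), which is close to the noun-verb summary described above.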


Data Source

Permit data is available through official channels. The exact location changed during the project; the data can currently be found at geozet.koop.overheid.nl with a filter set for Amsterdam permits. The full capabilities of the service can be requested with http://geozet.koop.overheid.nl/wfs?request=GetCapabilities&service=WFS; see in particular outputFormat=json.
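The request URLs for such a WFS endpoint can be assembled as in the sketch below, which only builds the URLs and does not contact the service. The GetFeature request and the typeName value "example:layer" are placeholders; the actual layer names have to be read from the GetCapabilities response.

```python
from urllib.parse import urlencode

BASE_URL = "http://geozet.koop.overheid.nl/wfs"


def wfs_url(request, **params):
    """Build a WFS request URL from the request type and extra parameters."""
    query = {"service": "WFS", "request": request}
    query.update(params)
    return BASE_URL + "?" + urlencode(query)


capabilities_url = wfs_url("GetCapabilities")

# "example:layer" is a placeholder, not a layer name known to exist on the service.
features_url = wfs_url("GetFeature", typeName="example:layer", outputFormat="json")

print(capabilities_url)
print(features_url)
```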

Processing Pipeline

The processing of documents differs slightly per data set, but the general approach is similar. First, raw input data is transformed into a uniform XML format. Second, specified XML elements containing text are annotated with word lemmas and possible named entities (NEs), adding Wikipedia links to the NEs where possible. Third, parsimonious language models are computed using the lemmas, to construct word clouds. Finally, the detected entities and language models are added to each document as a summary of that document, which can be viewed as either XML or HTML.

Raw input to clean input

The Amsterdam municipality data, described in the previous milestone M1 (Dutch), is available at the source as PDF documents. From these PDFs the textual content is extracted as is, using the pdftotext -layout tool (see Xpdf). The text and, if available, a document title are then cleaned (removing invalid XML characters etc.) and embedded in a <pmx:text> element in a generic document structure. This structure is described in a RelaxNG schema, http://schema.politicalmashup.nl/genericX.html, and contains (separately extracted) meta-data, notably the publication date and a document identifier.
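The extraction and cleaning step could look roughly like the sketch below. It assumes pdftotext is installed and on the PATH; the namespace URI and the function names are placeholders, and the real pipeline performs more cleaning than removing invalid characters.

```python
import re
import subprocess
import xml.etree.ElementTree as ET

# Everything outside the character ranges allowed by XML 1.0.
_INVALID_XML = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

# Placeholder namespace URI; the real one is defined by the project's schema.
PMX_NS = "http://example.org/pmx"


def extract_text(pdf_path):
    """Run pdftotext -layout and return the text ("-" writes to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def clean_text(raw):
    """Remove characters that are not valid in an XML 1.0 document."""
    return _INVALID_XML.sub("", raw)


def wrap_text(raw):
    """Embed the cleaned text in a single text element."""
    element = ET.Element("{%s}text" % PMX_NS)
    element.text = clean_text(raw)
    return element


# Form feeds, which pdftotext emits at page breaks, are not valid XML 1.0.
sample = "Schriftelijke vragen\x0cPagina 2\x00"
serialised = ET.tostring(wrap_text(sample), encoding="unicode")
print(serialised)
```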

The building permits data is available as one large XML document, from which separate permit-documents are constructed, similar to the municipality data and adhering to the same schema.

The Dutch governmental proceedings are already available as clean XML. The general document structure for meta-data is the same as above, but the content of the document is much more structured, with <pm:p> elements as the smallest textual level.

Annotate text

For each of the above clean input documents, FoLiA annotations are added on a per-element basis (either <pmx:text> or <pm:p>). An integration of several tools generates this output. Notable are frog for part-of-speech tagging, lemmatisation and entity detection, and an earlier iteration of the UvA semanticizer for entity resolution to Wikipedia.

For the Dutch proceedings, an additional hook is used to first check possible entities against a knowledge base of known members of parliament. This member detection is available through the PoliticalMashup member identifier.

Compute word clouds

The lemmas (normalised word forms) are used to compute word clouds using parsimonious language models with the weighwords tool (also available on pypi). These word clouds give the most distinctive terms for a given document, by comparing its text to the text of the other documents within that data set. The distinctive terms are calculated both using all words and while considering only verbs, nouns or adjectives.
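The idea behind a parsimonious language model can be sketched in plain Python: an EM loop repeatedly moves probability mass away from terms that the corpus as a whole already explains well. This is an illustrative re-implementation under assumed settings (the doc_weight value, the iteration count, and the toy lemma lists are made up); it is not the weighwords code and does not use its API.

```python
from collections import Counter


def parsimonious_terms(doc, corpus, doc_weight=0.1, iterations=50, top=5):
    """Return the most distinctive terms of doc relative to corpus.

    doc is a list of lemmas; corpus is a list of such lists and must include doc.
    """
    background = Counter(term for d in corpus for term in d)
    total = sum(background.values())
    tf = Counter(doc)
    # Start from the maximum-likelihood document model.
    p_doc = {term: count / len(doc) for term, count in tf.items()}
    for _ in range(iterations):
        # E-step: expected number of occurrences explained by the document model.
        expected = {}
        for term in tf:
            p_bg = background[term] / total
            mix = (1 - doc_weight) * p_bg + doc_weight * p_doc[term]
            expected[term] = tf[term] * doc_weight * p_doc[term] / mix
        # M-step: renormalise to obtain the new document model.
        norm = sum(expected.values())
        p_doc = {term: value / norm for term, value in expected.items()}
    return sorted(p_doc.items(), key=lambda item: -item[1])[:top]


# Hypothetical lemmatised permit notifications.
corpus = [
    ["vergunning", "dakkapel", "plaatsen", "dakkapel"],
    ["vergunning", "boom", "kappen"],
    ["vergunning", "gevel", "wijzigen"],
]
terms = parsimonious_terms(corpus[0], corpus)
print(terms)
```

In this toy corpus "vergunning" occurs in every document, so its probability in the document model shrinks towards zero, while "dakkapel" ends up as the most distinctive term. Restricting the input lists to lemmas with a given part-of-speech tag gives the per-category variants.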

Conjoin documents and summary

The output is finalised by adding the detected named entities and computed distinctive terms to the meta-data of the clean input documents. For the municipality and permit data sets, these are added to the document with the full FoLiA annotations preserved. For the proceedings, the original documents are used, with each named entity referencing the specific paragraph (<pm:p>) in which it occurred. The output XML documents are stored in the eXist database that also runs the data analysis tools mentioned above. Viewing the data is easiest as HTML, which is generated on the fly and highlights the annotations and entities where appropriate.