weeknotes.data-search

2018 Week 14

Community

Matthieu went along to a data visualisation workshop run by Alan Smith, dataviz editor of the Financial Times. He thought it was a really interesting summary of the basic techniques. Highlights included:

On his return, Matthieu had a brief chat with Liz. The visual vocabulary chart reminded her of the Data Viz Project. Liz also recommends this post on how we position and what we compare.

One world, one web, one team

Jianhan has been working with the Content Team to add and remove Lords Members’ photos.

Samu worked with Jamie and Jenna to expand our article content model and accompanying API. They’ve added some features to enable better contextual representation on the website, including the ability to display what kind of article it is and how it relates to other articles and collections (including multiple breadcrumbs for an article in a polyhierarchy).

Dan and Samu met with the team looking into Research briefings on beta.parliament. Dan gave a lot of context about how this feature fits in with the rest of our data and web estate. Samu outlined an iterative plan for a scalable new publication workflow that builds on the strengths of the current Research briefings publishing process, while allowing for future improvements.

Liz has been using webmaster tools to find all the sites that link to www.parliament.uk. It provides a list of 1,000,000 links along with the site name, from which we can get the volume of links from each site and the number of ‘distinct’ sites. Distinct is tricky because there are variations of the same site. Nevertheless, Liz estimates there are 10,044 or fewer sites that link to www.parliament.uk, once http and .co.uk variations are ignored. The 1,000,000 figure is probably the export limit, and at that limit the list covers pages with eight or more inbound links. The most linked-to pages have ~20,000 inbound links. She’s currently looking at how to visualise this.
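For what it’s worth, the de-duplication step looks something like this Python sketch (the column name and the rule for collapsing .co.uk variations are assumptions, not necessarily what Liz is doing):

```python
import csv
from collections import Counter
from urllib.parse import urlparse

def normalise_site(url):
    """Collapse obvious variations of the same site.

    Treats http/https as equivalent, drops a leading 'www.' and folds
    .co.uk down to .com (an assumption about what counts as 'the same site').
    """
    host = urlparse(url if "//" in url else "//" + url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host.endswith(".co.uk"):
        host = host[:-6] + ".com"
    return host

# Hypothetical export: one row per inbound link, with the linking URL
# in a 'source' column.
counts = Counter()
with open("inbound_links.csv", newline="") as f:
    for row in csv.DictReader(f):
        counts[normalise_site(row["source"])] += 1

print(f"{len(counts)} distinct linking sites")
for site, n in counts.most_common(10):
    print(site, n)
```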

Domain modelling

Anya, Ben and Michael were all on holiday this week. No domains got modelled.

Data platform

Jianhan worked on an OData implementation to create, update, and delete objects in the triple store. It allows for:

He’s tested it with a locally installed triple store and Postman to show that it works.
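As a rough illustration of what OData create, update and delete look like over HTTP, here’s a sketch using a hypothetical service root, entity set and payload; Jianhan’s actual endpoint and entity names will differ:

```python
import requests

# Hypothetical OData service root; the real endpoint and entity names differ.
SERVICE = "https://example.org/odata"
HEADERS = {"Content-Type": "application/json"}

# Create: POST a new entity to the entity set.
created = requests.post(
    f"{SERVICE}/Persons",
    json={"personGivenName": "Ada", "personFamilyName": "Lovelace"},
    headers=HEADERS,
)
entity_id = created.json()["Id"]  # assumes the service echoes back a key

# Update: PATCH the individual entity with only the changed properties.
requests.patch(
    f"{SERVICE}/Persons('{entity_id}')",
    json={"personFamilyName": "King-Noel"},
    headers=HEADERS,
)

# Delete: DELETE the individual entity.
requests.delete(f"{SERVICE}/Persons('{entity_id}')")
```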

Raphael did some more work on displaying business items and procedural steps in graphs using Graphviz / the DOT language. There’s an example of one subject to the “Made affirmative” SI procedure here. Note you can already request data in DOT format from our Fixed Query API, e.g. https://api.parliament.uk/Live/fixed-query/person_by_id.dot?person_id=43RHonMf, thanks to Samu’s previous work.
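If you want to play with that DOT output yourself, a minimal Python sketch (assuming the graphviz package and the Graphviz binaries are installed locally) fetches the example above and renders it:

```python
import requests
from graphviz import Source  # needs the Graphviz binaries installed

# URL from the weeknote: person 43RHonMf as DOT from the Fixed Query API.
url = "https://api.parliament.uk/Live/fixed-query/person_by_id.dot?person_id=43RHonMf"
dot_text = requests.get(url).text

# Render the graph to SVG (the output filename is arbitrary).
Source(dot_text).render("person_43RHonMf", format="svg", cleanup=True)
```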

Search (and indeed indexing)

Work has continued on analysing the data generated by the Indexing and Data Management Section of the House of Commons Library. The data exists to pull together material for parliamentary search in a way that browsing the website never quite achieved. To date it’s been locked behind a search form, but now we’re looking at how we might liberate the data and expose it to the web.

Sara wrote an R script to do some pre-processing of the massive JSON files Alex exported from Solr. These files have a lot of data fields that can be filtered out, but the filtering was taking too much CPU and time to process. The R script reads the JSON file, extracts the data fields we want and outputs a CSV file that is 94% smaller than the original. Sara would like to thank Alex for helping out.
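Sara’s actual script is in R; purely as an illustration of the approach, a Python sketch of the same filtering might look like this (the field names are made up, and a truly massive file would want a streaming parser such as ijson rather than json.load):

```python
import csv
import json

# Hypothetical field names; the real Solr export uses different keys.
WANTED = ["id", "title", "date", "subject"]

with open("solr_export.json") as src, \
        open("solr_export_slim.csv", "w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=WANTED)
    writer.writeheader()
    for doc in json.load(src):  # assumes a top-level JSON array of documents
        writer.writerow({field: doc.get(field, "") for field in WANTED})
```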

Alex is also working on an application to regularly compare ‘non-indexed’ material with content coming out of the daily Solr exports, in order to update older content that has since been ‘indexed’. Currently, re-indexing isn’t reflected in the data, leading to some inaccuracy.
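In sketch form (with hypothetical file names and a hypothetical ID-based layout), the comparison amounts to a set intersection:

```python
import json

# Hypothetical inputs: IDs we previously recorded as 'non-indexed', and
# the IDs present in today's Solr export.
with open("non_indexed_ids.json") as f:
    non_indexed = set(json.load(f))
with open("todays_export_ids.json") as f:
    exported = set(json.load(f))

# Anything in both sets has since been indexed and needs its older
# record updated.
newly_indexed = non_indexed & exported
print(f"{len(newly_indexed)} items to re-flag as indexed")
```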

Raphael worked with Alex to make a data pipeline that takes the Solr data from source to sink (destination). Analysts can then connect to the sink in an online notebook and work in their preferred language (usually Python or R), without having to install the language on their laptops.
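From the analyst’s side, that might look as simple as the following (the sink URL and format here are invented for illustration; the real sink may be blob storage, a database or something else entirely):

```python
import pandas as pd

# Hypothetical sink location and format.
df = pd.read_csv("https://example.org/sink/solr_export_slim.csv")
print(df["subject"].value_counts().head(10))
```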

Matthieu broke out the XSLT to make our controlled vocabulary more SKOS compliant. Muenchian grouping proved a fun discovery. He had even more fun hunting down naughty BOMs in the source XML.
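A quick Python sketch of the BOM hunting, assuming a hypothetical directory of source XML files:

```python
from pathlib import Path

BOM = b"\xef\xbb\xbf"  # the UTF-8 byte order mark

# Flag any source XML file that starts with a BOM, or worse, has one
# buried mid-file (e.g. from concatenating BOM-prefixed fragments).
for path in Path("vocabulary_source").glob("*.xml"):
    data = path.read_bytes()
    leading = data.startswith(BOM)
    embedded = data.count(BOM) - (1 if leading else 0)
    if leading or embedded:
        print(f"{path}: leading BOM={leading}, embedded BOMs={embedded}")
```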

Corporate data

Away from Historic Hansard, Dan’s week consisted of two separate pieces of work on corporate information dashboards, a pretty good meeting about corporate data, and a new CRM project.

The latest version of the House of Commons HR feed was implemented. This helps keep the job data in People Data (our internal system for managing data about… people) up to date. The new integration is also smarter: it now does a comparison before processing an update, and if nothing has changed it won’t process the record. This reduces the number of records processed from ~5,000 to around 800–1,000 each run.
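A sketch of that compare-before-update idea, with made-up field names rather than the real People Data schema:

```python
import hashlib
import json

def fingerprint(record):
    """Stable hash of the fields we care about (field names are illustrative)."""
    relevant = {k: record[k] for k in ("staff_id", "job_title", "department")}
    return hashlib.sha256(json.dumps(relevant, sort_keys=True).encode()).hexdigest()

def records_to_process(incoming, stored_fingerprints):
    """Yield only records whose fingerprint differs from what we already hold."""
    for record in incoming:
        if stored_fingerprints.get(record["staff_id"]) != fingerprint(record):
            yield record
```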

The Stock Management System integration with the resource planning system is in the testing phase for the House of Commons. For the House of Lords version, Lew is waiting on a reconfiguration of their HR and Finance system.

Capability

Liz and Matthieu spent some time preparing a nice test for interviews with our data analyst candidates.

Is it possible to get promoted?

Yes, apparently it is.

Did anybody get promoted?

Chris and Raphael both got promoted. Top work, boys.

Was sarcasm deployed?

Samu would like it to be known that Chris was particularly lacking in sarcasm in his correspondence with other teams regarding faults with their systems. No further promotion until this gets sorted.

Customer service of the week award…

…goes to everyone who contributed to the Historic Hansard goal-line clearance. It was being moved to a new home, but the imminent shutdown of its old hosting arrangements made things slightly hectic. Matthieu, Mike and Samu all chipped in. Dan and Mike also answered many, many emails.

The older version of Historic Hansard was served as static HTML from our old hosting infrastructure via hansard.millbanksystems.com. The new version uses the same HTML files hosted in blob storage and served through api.parliament.uk/historic-hansard. In switching over, we ran into DNS issues. Samu created a DNS zone in the new hosting environment and Robert pointed the authoritative DNS at it, from where we redirect to api.parliament.uk.

As part of the migration we decommissioned the old Historic Hansard search and started using the beta search instead. Unfortunately, we found that the polite spiders used by Google and Bing take a very long time to index a site with over 250k previously indexed pages. So instead we restricted the search with a site:hansard.millbanksystems.com filter to get all the previously indexed results, sending users back to api.parliament.uk/historic-hansard via the redirect.

Switching the DNS records also made us realise that most mobile providers’ 4G DNS servers are really slow to update their records: it can take two days, rather than the usual couple of hours, for a change to propagate.
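For anyone wanting to check propagation against a particular resolver, a small sketch using the dnspython package (the choice of resolvers and record type here is illustrative, not what Robert actually did):

```python
import dns.resolver  # the dnspython package

def check_resolver(nameserver, hostname="api.parliament.uk"):
    """Ask one specific resolver what it currently returns for a hostname."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [nameserver]
    answer = resolver.resolve(hostname, "A")
    return [rdata.to_text() for rdata in answer]

# Compare two public resolvers to see whether a record change has
# reached them both yet.
print(check_resolver("8.8.8.8"))
print(check_resolver("1.1.1.1"))
```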

As part of the migration we’ve added the standard telemetry that comes with the data platform, which should make life easier for the data analysts amongst us.

Things that caught our eye