weeknotes.data-search

2018 Week 37

Babies

We’re delighted to announce that the break of dawn on Saturday was met with the soft and gentle cries of a newborn baby. And Jianhan Jnr entered the world. No name as yet, we’re sure it will follow along soon. Because our Jianhan is a great computer scientist. And we probably need to settle on an identifier pattern first. Welcome aboard Jianhan Jnr.

In the very same week Samu was also delivered of a new baby, in the shape of a public SPARQL endpoint. Joyful tears were shed. Samu has since been spotted wandering round the office with a rictus grin, 1000-yard stare and a cigar clamped between his lips. At least we think it’s a cigar. We’re told that mother and baby are both doing well.

About that

As we’ve been building out the new data platform, we’ve prioritised the use of open standards. So our data storage layer is a triple store with a SPARQL access layer. This means every time you request a page on the new website, the web application queries the data platform with a pre-canned SPARQL query, gets back some data and renders a page. To oversimplify things.
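For the curious, here is a minimal sketch of what a pre-canned query might look like from the web application's side. It builds a SPARQL 1.1 Protocol GET request but does not send it. The endpoint URL, the query text and the function name are all made up for illustration and are not what the platform actually runs.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint URL, for illustration only; not taken from these notes.
ENDPOINT = "https://example.org/sparql"

# A stand-in for a pre-canned query. The real queries are page-specific;
# this one just asks for a handful of arbitrary triples.
CANNED_QUERY = """
SELECT ?subject ?predicate ?object
WHERE {{ ?subject ?predicate ?object }}
LIMIT {limit}
"""


def build_sparql_request(limit: int = 10) -> Request:
    """Build (but do not send) a SPARQL 1.1 Protocol GET request."""
    query = CANNED_QUERY.format(limit=int(limit))
    # The protocol passes the query text in a URL-encoded `query` parameter.
    url = ENDPOINT + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})
```

Passing the returned request to `urllib.request.urlopen` would send it. The web application would then turn the JSON bindings into a page.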

We’ve also taken the approach that our website is our API. Or at least that data views should be available to match every page type. Try browsing beta.parliament.uk, view the page source and check out the rel-alternate links. Truly beautiful. Brings tears to your eyes. We think. It’s been a very emotional week.
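If viewing source is not your idea of fun, the rel-alternate links can be pulled out with a few lines of standard library Python. This is a sketch only. The sample markup, media types and paths below are invented for the example and are not copied from beta.parliament.uk.

```python
from html.parser import HTMLParser


class AlternateLinkParser(HTMLParser):
    """Collect (type, href) pairs from <link rel="alternate"> elements."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel can hold several space-separated values, so split before checking.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "alternate" in rel_values and attributes.get("href"):
            self.alternates.append((attributes.get("type"), attributes["href"]))


def find_alternates(html):
    parser = AlternateLinkParser()
    parser.feed(html)
    return parser.alternates


# Invented sample page, standing in for real page source.
SAMPLE_PAGE = """<html><head>
<link rel="stylesheet" href="/site.css">
<link rel="alternate" type="application/n-triples" href="/people/abc.nt">
<link rel="alternate" type="application/json" href="/people/abc.json" />
</head><body></body></html>"""

for media_type, href in find_alternates(SAMPLE_PAGE):
    print(media_type, href)
```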

The problem with this pattern is that it limits us to building the things we think of and have time for. As the graph of data expands, new ways of querying it become possible. And somebody somewhere out there might want to query across the data in a way that we haven’t exposed as a URL on the website. The word librarian springs to mind here.

So rather than restricting data access to pre-canned SPARQL queries and rather than restricting access to people inside Parliament, we now have a public SPARQL endpoint. Which means that anyone anywhere can query our data in any way they find useful. And if we find common patterns we can roll them back into the website.

Team:Samu have been beavering away at this for a while, and announced it on Wednesday. Samu encourages people to give it a try and kick its tyres. For now, queries time out after 5 seconds and are rate limited to 10 per second per IP address. So no matter how ham-fisted you are, you can’t do too much damage. Give it a whirl. Let us know what you think.
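For the polite tyre kicker, a client-side throttle is one way to stay under the 10 per second limit. The sketch below is a simple sliding-window limiter. The class name and parameters are our own invention for illustration, and it says nothing about how the endpoint enforces its limits on the server side.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most `max_calls` calls in any `period` seconds (client side)."""

    def __init__(self, max_calls=10, period=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock    # injectable so the limiter can be tested
        self.sleep = sleep
        self.sent = deque()   # timestamps of recent calls, oldest first

    def _prune(self, now):
        # Forget calls that have dropped out of the window.
        while self.sent and now - self.sent[0] >= self.period:
            self.sent.popleft()

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        self._prune(now)
        if len(self.sent) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.sent[0]))
            now = self.clock()
            self._prune(now)
        self.sent.append(now)
```

Call `limiter.wait()` before each query. Pairing it with a 5 second client timeout, for example `urlopen(request, timeout=5)`, mirrors the server-side cut-off.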

Matthieu has already started hacking around and made a quick MP name lookup service last night. You’d think he’d have enough of this stuff in his day job. Keen as mustard that lad.

The delivery is the strategy

Dan ran another data strategy open session. Which we’re told attracted a good audience. He thinks it would be great if things continued in this fashion. You can see Dan and his beard talking about it here.

Dan also logged a call with the service desk to unblock thebridalfile.co.uk website. Which means we can unsubscribe from their marketing spam to the data@parliament.uk inbox. And also maybe check out the latest in canapé trends and what have you.

Community

Over in library land, work continued to decant Michael Rush’s card index files into spreadsheets. That said, it did almost come to a grinding halt. Michael realised he’d made a schoolboy error with his original export of reference data. And that quite a lot of it was missing. He crept back to Tothill Street expecting only the disapprobation of librarians. But Anya stepped in to help. And promptly ballsed things up to a completely new level. There was some sitting in silence. It was not companionable. Tuesday passed. But no strong words were issued and no punches thrown. They combined all of their brains and all of their computational skills and pulled off yet another audacious goal-line clearance. From zeroes to heroes.

Anya and Michael met with Lorna from the Scottish Parliament to talk all things taxonomy and search and data. The subject of SEO came up but Michael was strong. He did not flinch. Or froth. He did ask for a trip to Edinburgh though.

One world, one web, one team

Anya and Michael trotted off to the House of Commons Journal Office to see Mark Hutton, Clerk of Journals. They chatted about web analytics and user data and privacy. Particularly in the context of online petitions. More chats to follow.

Sara and Matt have been looking further into the use of natural language processing to classify calls made to the helpdesk. They’ve been borrowing ideas from work by the Ministry of Justice analytical services on their parliamentary questions tool and have taken a turn toward Latent Semantic Analysis. Sara and Matt got a proof of concept running, then Matt worked on it alone. And broke it. Luckily, Matt’s also been chatting to Sam Tazzyman who was responsible for the MoJ work. Sam helped Matt get it working again. Thanks Sam.

In simple terms, Sara and Matt are looking at how often words are used together in sentences, as well as the words those words are used with, and so on, to group sentences by common topics. We’re hoping that we can then use common topic-ness to help with assigning calls to teams. Sam also provided a link to the more recent work they’ve been doing, where you’ll find some very good write ups on what this stuff actually is.
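To make that a little more concrete, here is a toy sketch of the co-occurrence idea. It is not Latent Semantic Analysis and it is not Sara and Matt's code. It is a much simpler stand-in that counts which words share a sentence, then compares sentences by the company their words keep. The example calls are invented.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def cooccurrence(sentences):
    """Count how often each pair of words appears in the same sentence."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        for a, b in combinations(words, 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def sentence_vector(sentence, counts):
    """Describe a sentence by the words its own words are used with."""
    vector = Counter()
    for word in set(sentence.lower().split()):
        vector.update(counts.get(word, {}))
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Invented helpdesk calls, for illustration only.
CALLS = ["password reset", "password login", "login locked",
         "printer jam", "printer toner"]

COUNTS = cooccurrence(CALLS)
VECTORS = [sentence_vector(call, COUNTS) for call in CALLS]
```

Here "password reset" and "login locked" share no words, yet `cosine(VECTORS[0], VECTORS[2])` is well above zero because "password" and "login" turn up together elsewhere. Against "printer jam" the score is zero. LSA gets at the same intuition by factorising a term-document matrix rather than by counting pairs directly.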

Domain modelling

Anya, Robert and Michael continued with the long hard slog of editing model comments. It’s like chipping away at a block of rock and failing to see a figure emerge. Anya and Michael get the feeling that Robert is trying to train them in the art of comment writing. Or English. They’d probably prefer it if he just rewrote them. Their trust is implicit.

September spawned a monster…

The draft affirmative procedure changes that caused so much confusion last week have now been sense checked by House of Lords Jane and House of Commons Jack. And declared to be sane. Or as sane as they’re ever likely to get. Anya and Michael had hoped to add them to the procedure data this week but the Rush data calamity cost them a day. So that’s a joyful thing to look forward to next week.

Data platform

Mike found some server errors in the Search Service telemetry and had a hunch that they were happening on searches with no results. Samu debugged and confirmed Mike’s suspicions. Ever since we upgraded to the latest version of Bing, our code hadn’t been handling cases where the external search provider found no results.

Samu’s grin temporarily disappeared when he realised he’d made the noobest possible programming mistake and not checked for nulls. But he soldiered on and did not think badly of Sir Tony Hoare. A fix was deployed within the hour.
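For anyone who would like to avoid the same fate, the shape of the fix is roughly as follows. This is a hedged sketch, not the code that was deployed. The response layout and the field names `results` and `items` are hypothetical, chosen only to show a provider that omits a section entirely when it finds nothing.

```python
def extract_results(response):
    """Return a list of results, treating a missing section as 'no results'.

    `response` is assumed to be a dict parsed from the provider's JSON, or
    None. The field names "results" and "items" are hypothetical.
    """
    if response is None:
        return []
    section = response.get("results")
    if section is None:
        # The provider found nothing and left the section out (or null).
        return []
    # A present section with a null or empty list also means no results.
    return section.get("items") or []
```

Every failure path returns an empty list, so the calling page can render ‘no results’ without caring which case it hit.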

Mike continued to peer at the usage statistics for the Search Service and found that since the fault was introduced, only 1.38% of searches were affected. And that it wouldn’t have been noticed by users because the search page just says ‘no results’ when the underlying service fails. Which in this case was actually true. So really 0% of searches were affected.

On search. And indeed indexing

Sara continued with her investigations into search terms and the House of Commons Library controlled vocabulary. Alex helped out by exporting a list of all the controlled terms in use. Which worked out to be 41,936 preferred terms and 27,130 non-preferred terms. Which is a lot of words. Sara took this list and attempted to match it to search terms from four sources: our new website search, parliamentary search, the internal only parliamentary search and Hansard search.

To make things more manageable, the analysis covered the period from May to June 2018. It’s not yet complete, but we have some numbers. So that’s good. Out of 154,918 unique search terms, 67,082 matched with preferred terms and 23,042 search terms matched non-preferred terms. So a total coverage of 58%. Super.
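The sums behind that 58% can be sketched in a few lines. The matching function assumes exact matches after lowercasing and tidying whitespace, which is our guess rather than a description of Sara's method. It also assumes no term is both preferred and non-preferred. The toy terms are invented.

```python
def normalise(term):
    """Lowercase and collapse whitespace so near-identical terms compare equal."""
    return " ".join(term.lower().split())


def match_counts(search_terms, preferred, non_preferred):
    """Count unique search terms and how many match each kind of term."""
    preferred = {normalise(t) for t in preferred}
    non_preferred = {normalise(t) for t in non_preferred}
    unique = {normalise(t) for t in search_terms}
    return {
        "unique": len(unique),
        "preferred": len(unique & preferred),
        "non_preferred": len(unique & non_preferred),
    }


def coverage(unique, preferred_matches, non_preferred_matches):
    """Share of unique search terms matching any controlled term."""
    return (preferred_matches + non_preferred_matches) / unique


# Invented toy data to show the matching step.
TOY = match_counts(
    ["Brexit", "brexit ", "EU withdrawal", "cats"],
    preferred=["Brexit"],
    non_preferred=["EU Withdrawal"],
)

# The figures reported above: (67,082 + 23,042) / 154,918.
REPORTED = coverage(154_918, 67_082, 23_042)
print(f"{REPORTED:.0%}")
```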

Corporate data

Noel and David wiped the sweat from their brows and emerged into a rubble-strewn landscape to continue the configuration of their new development and testing environment. Noel continued to clean up the duplicate or incomplete People Data records left in the wake of its recent explosion. And David reviewed an issue that’s currently affecting the Time and Labouring System. Which sounds pretty dystopian.

Elsewhere, Lew finished setting up our hosted dev machines so we can develop BizTalk solutions in a more stable environment. And made progress towards a test environment. Dan says this is important work. Lew also continued with the development work for the new catering finance system. Steamed pudding and custard being considered mission critical round these parts.

Aidan, Lew, David and Dan worked through some potential SQL upgrades for the integration servers.

Strolls

Nothing. Nada. Nowt. All quiet on the stroll front. Although Anya, Robert and Michael did take a train to go and ride on some trains. Including this absolute unit. Sadly they did not ride on Thomas, because he was sat in a siding with his face missing. Michael feels he knows that feeling.

Things that caught our eye

Dan read the Harvard Business Review article on why Design Thinking Is Fundamentally Conservative and Preserves the Status Quo and why the alternative is messier, but more democratic. He does not offer an opinion here.
Robert and Dan both read this article on how Amazon is stuffing its search results pages with ads — and they seem to be working.
Sara listened to a Radiolab episode on hate speech and Facebook.
Dan read a Google post on Five insights on voice technology. He also suspects Robert read it. Probably first. Or ^first^ as we used to say.
Dan read Translating data ethics by Sarah Gold from projectsbyif and took a look at their 9 practices for organisations operating digital services.
The Government Computational Service are removing support for a whole bunch of data formats from their registers. We’re not sure why. Or what Roy might think.