The GSS Data Project and table2qb: a data transformation tool
a blog from Swirrl
Data science courses on the GSS website
The data science courses advertised on the GSS website
Users have difficulty in finding and using GSS data because:
Connected Open Government Statistics (COGS) – previously called the GSS Data Project – aims to fix this problem by:
This Linked Data is better for users because:
Government statistics are not produced or managed by a single entity. We operate a federated system under the common banner of the Government Statistical Service. All central government departments, plus arm’s-length bodies and the devolved administrations, are part of this union, and between them they publish around 25,000 tables of statistical data each year. We distribute this huge number of outputs across the internet on many portals and websites.
This disparate landscape has been a problem for our users.
Making so much data available in the open, free to download and use (and reuse) is the right thing. But our users are struggling to get to grips with this wealth of data. Some of the problems are:
How do we overcome these barriers and challenges to accessing and using the data we produce?
A project can’t achieve this in isolation. We are all responsible, including our users, for the co-creation of this new vision for the GSS.
Connected Open Government Statistics (COGS) – previously called the GSS Data Project – is developing the tools that can help to support this. The heavy lifting equipment we are building will reduce the burden on departments. We are working in partnership with centralised teams that are already supporting the betterment of the GSS. We are collaborating with the growing communities of practice out there, such as Reproducible Analytical Pipeline (RAP) champions and presentation champions, that are already working to support better data.
This project is about being that catalyst for change and running towards a better way.
To increase the impact and reach of GSS statistical data by improving how we organise, integrate and deliver GSS data online.
We need to re-imagine our current publication model. Spreadsheets have played the lead role for many years. They will continue to play a part but this should be a supporting role and not the primary channel as we expand our distribution routes.
To support this, our focus needs to be on how to make it easier for publishers of official statistics to share their data in ways that offer the greatest use to users.
This means moving to a place where we are producing data that is web-ready, not merely ready to sit on the web. COGS is building the blueprint that will enable us to exchange data and metadata over the web and make better use of all the different information we offer.
This will make our data more findable and useful for our users. This is the pain relief our users want.
To pioneer a better way for users to connect with our data, we need to understand the current journey taken by our users to find and reuse our data and the pain points they face along the way.
The user journey is simple. It starts with awareness that data is available and the ability to find it. Users want just enough information to judge whether the data is what they need before they make any selection. Once they have found the right source, they need to be able to work with it to achieve their aim.
We have not catered for this simple user journey. Instead we have created a disparate and confusing landscape for our users. For example, housing data is published by a range of official statistics producers, in a range of different formats with various codes and labels. The myriad of possible entry points to this data confuses our users. Which data? Which website? Which organisation?
Users don’t care about departmental boundaries or services. They want our data. COGS is about creating a system that pulls the threads of all this related data together into related data groupings.
The approach is not as simple as just putting all the spreadsheets on one website. We need to make use of standards for data and metadata on the web. We must also do a better job of harmonising the statistics when possible and explain the differences when harmonisation is not possible. This is the biggest challenge we face given the differences across our statistical system.
The hard bit of the project is taking the data we have and making it ready for use on the web.
Spreadsheets lock the data into a single format. The formats can be whatever we want them to be.
This leads to organisations using different structures, deriving bespoke standards and applying inconsistent formatting. The documentation on how our reference data are defined and encoded is nowhere near the data. This makes the documentation hard for users to find and, even when they find it, many cannot make sense of it.
That is at the heart of why some statistical data is hard for people to use: it’s not always clear what the data means and whether data from different sources can be compared.
This is where our concept of ‘dataset families’ comes into play.
Dataset families aim to link groups of related data into collections for users to discover and navigate around. Using a family requires no prior knowledge of which organisation the data comes from.
Users are interested in getting data on a subject or domain. Breaking down the organisational boundaries and offering an interconnected set of related datasets should make it easier for users to find, understand and apply the data to their needs.
By examining and modelling the relationship between the various datasets in that ‘family’ collection it should allow:
The ONS has been working with a company called Swirrl, investigating how to bring datasets closer together as a family, by working on the structure of the data and the vocabulary of terms and identifiers used.
Our first step is to take the spreadsheets that have been published and remodel them. Using tools like Python and Pandas we normalise the data into a simpler format – stripping out all the presentational stuff from the spreadsheets.
The output of this we call ‘Tidy Data’. This is a machine-ready version of the data we started with. Having data in this form makes it much easier to push the data through the next stage of our pipelines to create the web-ready formats which are ‘Linked Data’.
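A minimal sketch of this kind of reshaping is shown below. It is an illustration only, not the project's pipeline code: the table title, column names and figures are all made up, and it assumes a simple wide table with one column per year.

```python
import io

import pandas as pd

# Hypothetical spreadsheet extract: a presentational title row followed by
# a wide table with one column per year. All figures are illustrative.
raw = io.StringIO(
    "Table 1: Dwellings completed (illustrative figures)\n"
    "Area,2017,2018\n"
    "England,100,110\n"
    "Wales,20,25\n"
)

# Skip the title row so that the real header row is used for column names.
wide = pd.read_csv(raw, skiprows=1)

# Reshape from wide to long: one row per observation (area, year, value).
tidy = wide.melt(id_vars=["Area"], var_name="Year", value_name="Value")

# Year labels arrive as text from the header row, so convert them to integers.
tidy["Year"] = tidy["Year"].astype(int)

print(tidy)
```

Each row of the result holds a single observation with its area and year spelled out, which is what makes the table straightforward for later pipeline stages to process.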
This step is where the clever stuff happens and adds the real value to the data. We have developed ‘table2qb’ (pronounced ‘table to cube’) that takes the Tidy Data format and integrates it with an enriched metadata file. This produces a web standard format called CSV on the Web (CSVW). The World Wide Web Consortium (W3C) developed this format to support data on the web.
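To give a feel for what CSVW looks like, the sketch below builds a very small CSVW metadata document describing a tidy CSV file. It does not use table2qb and does not reproduce its output; the helper `csvw_metadata`, the file name `observations.csv` and the column names are hypothetical, and only the `@context`, `url`, `tableSchema` and `columns` properties come from the CSVW standard.

```python
import json


def csvw_metadata(csv_url, columns):
    """Build a minimal CSVW metadata document for a single CSV file.

    `columns` is a list of (name, title, datatype) tuples.
    This helper is hypothetical and is not part of table2qb.
    """
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": {
            "columns": [
                {"name": name, "titles": title, "datatype": datatype}
                for name, title, datatype in columns
            ]
        },
    }


# Hypothetical tidy file with one observation per row.
metadata = csvw_metadata(
    "observations.csv",
    [
        ("area", "Area", "string"),
        ("year", "Year", "gYear"),
        ("value", "Value", "integer"),
    ],
)

print(json.dumps(metadata, indent=2))
```

The metadata travels alongside the CSV file, so software reading the data can learn what each column means and what type of value it holds without a person having to read separate documentation.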
CSVW provides the ingredients we need to create Linked Data, which is the mechanism used to link all the things together and which makes it easier to discover new related things about the data on the web. For our purposes, Linked Data outputs provide not only an enhanced machine-readable format but also a web-ready format.
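The linking idea can be sketched in a few lines. The example below writes one observation as N-Triples statements, assuming the RDF Data Cube vocabulary that the ‘qb’ in the tool's name suggests. Every `http://example.org/` identifier, the URI patterns and the function `observation_triples` are hypothetical and chosen only for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
QB_OBSERVATION = "http://purl.org/linked-data/cube#Observation"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# Hypothetical base URI; a real publisher would use its own domain.
BASE = "http://example.org/"


def observation_triples(dataset, area, year, value):
    """Return N-Triples lines describing one observation (illustrative only)."""
    subject = f"{BASE}data/{dataset}/{area.lower()}/{year}"
    area_uri = f"{BASE}id/area/{area.lower()}"
    year_uri = f"{BASE}id/year/{year}"
    return [
        f"<{subject}> <{RDF_TYPE}> <{QB_OBSERVATION}> .",
        f"<{subject}> <{BASE}def/dimension/area> <{area_uri}> .",
        f"<{subject}> <{BASE}def/dimension/year> <{year_uri}> .",
        f'<{subject}> <{BASE}def/measure/count> "{value}"^^<{XSD_INTEGER}> .',
    ]


# Two hypothetical datasets that describe the same area.
completions = observation_triples("housing-completions", "Wales", 2018, 25)
starts = observation_triples("housing-starts", "Wales", 2018, 30)

for line in completions + starts:
    print(line)
```

Because both datasets point at the same identifier for the area, software can bring them together without guessing whether differently labelled columns in two spreadsheets refer to the same place.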
Having data that is machine-readable and web-ready means greater use for us working in the Government Statistical Service and our users working outside of government departments:
The summer of 2019 saw the project grow from a small-scale feasibility study into a full project team of business analysts, data engineers, architects and data analysts. This project will need time to deliver all of its objectives. We have suggested a five-year programme of work. There will be several phases throughout the lifetime of the project. The first phase is to build the foundations – establishing standards for data and metadata. The second will involve developing the service to deliver a connected statistical service. The third will see the implementation of the products and services that will manage these processes into the future.
The strands of this first phase:
The team has expanded to deliver on these areas. Most of the resource and effort will be focused on:
It’s important that this work follows standards. Adopting data and metadata standards will have the biggest impact. It will improve our capabilities as individual producers and ensure operability between producers.
We welcome engagement with anyone across the GSS. You don’t have to be involved in RAP or any other initiative; just get in touch with us if you’d like to know more or get involved earlier. The project has engaged with many departments.
Those most involved so far are:
The project team is also getting great support from centralised teams and communities already supporting the GSS: harmonisation team, classifications teams, RAP champions and presentation champions.
We are currently working on:
And there are many more in the pipeline.
Once a dataset family is onboarded into the COGS platform the data is available immediately to users and has the following benefits:
Our project board is made up of: