GSS Data Project: Final Report

What has the project learnt?

We have developed tools that can reduce the GSS’s reliance on publishing file attachments and move us into a world where we can produce the data once and then distribute it in any format we need. Using web standards, we have created an environment where our data is no longer isolated but connected. We have proven that we can have the best of both worlds – where people and machines can make better use of the wealth of data we produce.

We have proven we can publish our data ‘in the web’ and not just ‘on the web’, enabling users to more simply access the data, as there is no requirement for software other than a browser. Delivering 5* Linked Open Data means we can enable other organisations to easily connect their own linked data to ours for ease of reuse, contrast and comparison.

The project has:

pioneered ‘Jobs To Be Done’ (JTBD) user research
gained understanding of why users need our data and what problems it solves
developed an open source set of transformation pipelines that automates the process of delivering 5* Linked Open Data from unstructured spreadsheets, databases or APIs
modelled and transformed data from 3 dataset families – Trade, Migration and Alcohol-related deaths
built a user interface with a powerful faceted search function
created an environment where an interconnected set of data from a range of departments can be accessed by users from one place
given us confidence that these activities can scale across the GSS
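
As a rough illustration of the kind of step a transformation pipeline automates, the sketch below reshapes a small cross-tabulated table into one observation per row, using only the Python standard library. It is not the project's own pipeline code: the sample figures, the column names and the `tidy_crosstab` function are all hypothetical.

```python
import csv
import io

# Hypothetical cross-tab as it might appear in a published spreadsheet:
# one row per area, one column per year. The figures are made up.
CROSSTAB = """area,2016,2017
North,10,12
South,7,9
"""


def tidy_crosstab(text, id_column, variable_name, value_name):
    """Turn a wide cross-tab into one observation per row (long format)."""
    reader = csv.DictReader(io.StringIO(text))
    observations = []
    for row in reader:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every observation
            observations.append(
                {
                    id_column: row[id_column],
                    variable_name: column,
                    value_name: int(value),
                }
            )
    return observations


observations = tidy_crosstab(CROSSTAB, "area", "year", "value")
for obs in observations:
    print(obs)
```

Data in this one-observation-per-row shape is easier to describe with standard metadata and to link to other datasets than a table laid out for human reading.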

User Research

Research drove the project. The team took a different research approach and pioneered the Jobs To Be Done method to help understand the end-user motivations and needs. We developed a user journey model to help identify at what stages users encountered ‘pain’ in getting their particular job done.

The approach is detailed in Annex F. This is not available online, but you can access it by emailing cogs@statistics.gov.uk to ask for a copy.

A summary of the key findings:

User need	Description
All data in one location	One location for all data would act as a comprehensive data list available across the GSS. This would make the data easier to find.
Search	Users want to use search terms that make sense to them, instead of being expected to know the terminology data publishers use. This is often a barrier to finding data.
Filtering and metadata	Users want to use filters to narrow down the number of datasets they see (from searches or browsing). Users want to know what each dataset contains and would like metadata to support this, particularly the date period, geographical coverage and dimensions available within the datasets.
Preview the data	Most users stated they need to see the data as early as possible in their selection process, to help them understand if it’s what they need.
Sharing data	Most analysis performed by users is shared with others, such as managers or colleagues in other teams. It needs to be easy for users to share their data or analysis so it is reusable, and to cite where they have sourced the data from.
Comparing data	It is not always easy for users to compare data as different outputs have different geographical coverage. Also, levels of geography are released at different times of the year or with different time periods (e.g. quarterly or monthly).
Data at the lowest level	Users focussing on specific geographical areas told us they struggle to consistently find the data they need at the right geographical level, whether standard levels (Middle Super Output Areas (MSOA), Lower Super Output Areas (LSOA), Ward, etc.) or levels specific to their analysis (uniquely defined areas, catchment areas, etc.).
Saving data, re-finding and sharing	Users struggle to re-find data they have used previously. They want to easily re-find it, shape or manipulate it as they need, and then save it for next time. This would be a huge time saver for these users.
Consistency	More consistency across data producers would make finding, using and understanding data easier for users. This would also reduce any existing confusion that has arisen around the terminology and how the data can and should be used.

Skills and capabilities needed

To move to this new position, we are reliant on new skills and capabilities. Some of these are new software skills like coding our processes, baking in automation and validation routines, but others are about ways of thinking, such as working across organisational boundaries.

It is hard to know the exact nature of the skills we are going to need but there is a pattern emerging that will get us on the right footing.

From this project’s viewpoint, the statistical community should focus on:

training for statisticians in coding with modern languages such as R and Python, including data libraries such as pandas
developing knowledge of web-ready data formats such as JSON, JSON-LD and CSV on the Web (CSVW)
introducing ‘Reproducible Analytical Pipelines’ (RAP) techniques – learning to use code to build automated output processes
developing a fusion of software engineer-type skills within the statistician groups
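
To give a flavour of what a web-ready format involves, the sketch below uses Python to build a minimal CSV on the Web (CSVW) metadata document describing the columns of a CSV file. It is an illustration under stated assumptions rather than the project's own tooling: the file name `observations.csv`, the column names and the `csvw_metadata` helper are hypothetical.

```python
import json


def csvw_metadata(csv_url, columns):
    """Build a minimal CSVW table description for a single CSV file."""
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": {
            "columns": [
                {"name": name, "titles": title, "datatype": datatype}
                for name, title, datatype in columns
            ]
        },
    }


metadata = csvw_metadata(
    "observations.csv",  # hypothetical file name
    [
        ("area", "Area", "string"),
        ("year", "Year", "gYear"),
        ("value", "Value", "integer"),
    ],
)

# The metadata is plain JSON, so it can be published alongside the CSV.
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```

Because the description is itself JSON, the same coding skills cover both producing the data and describing it for machines.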

Key take-aways for the GSS

The project took a wide-ranging review of the landscape and tried new approaches and methods to identify, map and test different hypotheses.

Overall we learnt:

we need to improve the quality and the consistency of classifications and code lists that are used across our data or these will continue to be a problem for interoperability between data
creating coherent code lists will need a dedicated work stream to review them with producers and to adopt any recommendations
end users and less technical analysts prefer cross-tabulations and/or charts
we need data initiatives such as RAP and Data Access Platforms (DAP) to ultimately remove the friction on our path to 5* Linked Open Data
we feel developments such as the United Nations Global Platform could offer a potential hosting platform for the GSS
we must work towards a common data standard such as CSVW to integrate our data on the web

Challenges for revolutionising statistical data in government

Static tables serve a purpose and will continue to do so. However, we need data that works well in software, so data can be searched, filtered and processed further for deeper analysis.

We need a common approach to data structures and formats to make this work. If software works for one dataset, it should work for all of them. Tim Berners-Lee’s 5* Open Data Model provides a framework that allows us to model the relationships between data consistently, and which works well with software, such as R and Python, and services, such as Google.
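
The point that software written for one dataset should work for all of them can be sketched in a few lines of Python. The example assumes two datasets that share the same structure (one observation per row, with named dimensions); the datasets, their figures and the `filter_observations` function are hypothetical.

```python
def filter_observations(observations, **criteria):
    """Keep observations whose dimensions match every given criterion."""
    return [
        obs
        for obs in observations
        if all(obs.get(dim) == wanted for dim, wanted in criteria.items())
    ]


# Two hypothetical datasets on different topics with the same structure.
# The figures are made up for illustration.
trade = [
    {"area": "North", "year": "2017", "value": 12},
    {"area": "South", "year": "2017", "value": 9},
]
migration = [
    {"area": "North", "year": "2017", "value": 3},
    {"area": "North", "year": "2016", "value": 4},
]

# The same function filters both datasets without any changes.
north_trade = filter_observations(trade, area="North")
north_migration_2017 = filter_observations(migration, area="North", year="2017")
print(north_trade)
print(north_migration_2017)
```

The function needs no knowledge of the topic of either dataset; it relies only on the shared structure, which is what a common standard provides.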

We have investigated and tested well-established standards for data structures and data formats and have chosen a set that have the backing of W3C and other major organisations, and are a good match for the needs of statistics.

The biggest challenges we face to creating this joined-up Government Statistical Service are around standards and harmonisation. These two areas are both critical to a linked ecosystem, but are also its biggest risk.

It may be impossible for everyone to agree a standard for everything, but we need to keep working towards this. If you want to know more, please ask for a copy of Annex H – ‘Developing a 5* Linked Open Data Solution’.

Conclusion

The project offers a game changing approach to distributing and reusing our statistics. It puts the user, departments working together and adherence to standards at the heart of the solution. These are the ingredients that will ensure we are reaching and meeting our user needs.

Linked Open Data provides opportunities for new ways of working with the statistical outputs, using data to dynamically produce reports and visualisations and bringing together data from disparate sources to tell a cohesive story.

A next phase would lay the foundations from which to build the 5* Open Data system for the GSS. The project is facing a funding challenge, and securing the resources needed to adequately deliver in this first year will be tough. Anything less will result in minimal progress on dataset families and little development of the front-facing channels.

A potential loss of momentum now could lead to a loss of interest from the GSS and a renewed push for departments to invest in expensive individual solutions.

Annexes

We have not published these online but they are all available. If you wish to access them please email cogs@statistics.gov.uk

GSS Project

Annex A: Phases of the GSS Project – describes the phases of work the project has carried out over the last two years

Annex B: Future Skills and Capabilities – highlighting future needs, skills and capability we should be looking to develop

Annex C: Project Approach and Engagement – why and how we worked on the project to give context for the findings in this report.

Dataset Families

Annex D: Dataset Families – a concept – setting out the idea behind dataset families and what they can bring for users and the GSS

Annex E: Mapping Dataset Families – an illustration of our early analysis around what dataset families likely exist in the ecosystem

Research

Annex F: User Research – a detailed look at the approach, our planning, what we did and what we found out during the two years of the project

Annex G: Benefit Dependency Mapping – an illustration of the benefits and what is needed to realise them (needs zooming in).

Developing a solution

Annex H: Developing a 5* Linked Open Data Solution – describes the technical approach of delivering data in machine-readable format using the transformation pipelines

Annex I: Managing the transformation process for the prototype – describes the technical aspects and processes for modelling statistical data into 5* open data

Annex J: Benefits of a 5* Linked Open Statistical Data to the GSS – a table of the perceived user benefits and the deliverables from the project that will be produced to realise them

Contact

Email: cogs@statistics.gov.uk