Synthetic Data Fund: what can digital twins tell us about data linkage?

Overview

Synthetic datasets are increasingly recognised by researchers and data analysts as a useful tool. In 2024, Research Data Scotland awarded funding to three projects exploring the untapped potential of synthetic data and how it can be used to support and speed up research projects for the public good. One of these projects was an investigation from the Social and Public Health Sciences Unit at the University of Glasgow, which RDS jointly funded with the Systems Science in Public Health and Health Economics Research (SIPHER) Consortium.

The key aim of the project was to provide an overview of the opportunities and challenges associated with providing synthetic linked data under a scalable ‘ready and reuse’ model. The findings will be used to shape our strategy for providing researchers with synthetic datasets. RDS is currently designing a pathway for researchers to access synthetic data in Scotland, recognising that this might allow researchers to start working with data while waiting for the necessary permissions to use actual data.

The team identified the priorities and requirements of key stakeholders in order to establish the role of, and requirements for, synthetic data across different types of application. The team also explored the information governance implications of creating and providing synthetic data at scale. In addition, the team created four minimum working examples demonstrating different approaches to the creation and linkage of low-fidelity linked data, with the aim of replicating basic relationships observed in the real data.

Who was involved and what was found?

This year-long project was led by Andreas Höhn at the University of Glasgow with co-investigators Alison Heppenstall, Petra Meier, and Charlie Mayor of West of Scotland Safe Haven, who contributed to the information governance aspects of the project.

The project draws in particular on the insights gained through the creation of the SIPHER Synthetic Population and its distribution via the UK Data Service. The SIPHER Synthetic Population is a novel “digital twin” dataset developed and updated by researchers of the UK-wide SIPHER Consortium. Created from open access census data and general licence survey data, the dataset provides an easily accessible synthetic full-scale population. As the dataset builds directly on the UK Household Longitudinal Study (Understanding Society), the resulting SIPHER Synthetic Population reflects many of the survey’s strengths and captures a variety of life domains, such as health, education, employment, and housing, for Great Britain’s population.

The project also explored the information governance requirements for creating such digital twins to support the provision of linked synthetic administrative data in Scotland under a ‘ready and reuse’ model. To further demonstrate the opportunities generally available to researchers for creating synthetic study cohorts, a dataset was created from joint information reported in an open access publication. This was used to test the linkage capability of different synthetic datasets created from aggregate-level descriptions of study populations in research papers, for example using information from the Scottish Care Information-Diabetes (SCI-Diabetes) database.
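
As a rough illustration of the general idea only, and not the project team's actual method, the Python sketch below builds a low-fidelity synthetic cohort from aggregate proportions of the kind a paper might report, then links it to a second synthetic table through a shared identifier. Every name and figure in it (AGGREGATES, make_cohort, make_admissions, link, the category shares and the admission rate) is a hypothetical placeholder invented for this example and describes no real population. Each variable is sampled independently, so the sketch reproduces marginal distributions only and does not capture relationships between variables.

```python
import random

# Hypothetical aggregate description, as might appear in a paper's summary table.
# All figures are invented for illustration and describe no real cohort.
AGGREGATES = {
    "sex": {"female": 0.45, "male": 0.55},
    "age_band": {"18-44": 0.20, "45-64": 0.45, "65+": 0.35},
}


def make_cohort(n, aggregates, seed=0):
    """Draw n synthetic people, sampling each variable independently."""
    rng = random.Random(seed)
    cohort = []
    for i in range(n):
        person = {"synthetic_id": f"S{i:05d}"}
        for variable, shares in aggregates.items():
            categories = list(shares)
            weights = list(shares.values())
            person[variable] = rng.choices(categories, weights=weights, k=1)[0]
        cohort.append(person)
    return cohort


def make_admissions(cohort, admission_rate, seed=1):
    """Create a second synthetic table that shares IDs with part of the cohort."""
    rng = random.Random(seed)
    return [
        {"synthetic_id": person["synthetic_id"], "admitted": True}
        for person in cohort
        if rng.random() < admission_rate
    ]


def link(cohort, admissions):
    """Deterministic linkage: flag each cohort member found in the admissions table."""
    admitted_ids = {row["synthetic_id"] for row in admissions}
    return [
        {**person, "admitted": person["synthetic_id"] in admitted_ids}
        for person in cohort
    ]


cohort = make_cohort(1000, AGGREGATES)
admissions = make_admissions(cohort, admission_rate=0.1)
linked = link(cohort, admissions)
```

Because the identifiers are generated rather than derived from real records, a linked table like this can be used to test linkage and analysis code without touching personal data. A higher-fidelity approach would need to sample from joint rather than marginal distributions.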

To appropriately tailor future synthetic data offerings by RDS, the team found that a better understanding of our projected user base for these synthetic datasets would be required. A clear translation of relevant information governance requirements will also be essential to future work. The findings from this project will help to inform our strategy and service offer for synthetic data at RDS as we continue to develop work in this area.

Research Data Scotland’s impact

Joint funding from RDS and the SIPHER Consortium allowed for an exploration of the untapped potential of using open access data, standard reporting encountered in publications, and simulation approaches for the creation of synthetic data.

RDS exists to make it faster and simpler for researchers to access public sector data, and tools such as synthetic versions of datasets can be valuable in the early stages of projects. We are developing an approach to using synthetic data in Scotland, recognising the untapped potential of this asset that allows researchers to test approaches while waiting for the necessary permissions to use actual data. RDS has set up and chairs the Scottish Synthetic Data Working Group to support an aligned strategy, which we have developed for synthetic data production in Scotland. This will deliver improved and faster access to this type of data for research. We have conducted user research to understand demand for synthetic data and engaged with the public to explore concerns. We are also working with data controllers to produce synthetic datasets for training, data discovery and code development.

As part of our capacity building work, we created a synthetic data fund with over £85,000 which was awarded to three organisations to investigate how synthetic data can be used by researchers, including this project from the University of Glasgow and the University of Edinburgh. Our aim is to use this learning to develop valuable synthetic data services.

We know that synthetic data can be helpful to researchers and projects, and that it is possible to create synthetic data in ways that ensure personal data is kept safe. Providing researchers with these tools is in line with our core mission to promote and advance health and social wellbeing in Scotland by enabling timely and cost-effective access to public sector data. Providing synthetic versions of datasets could encourage use of the real data for research projects, which in turn have the potential to impact public policy and services.

Find out more

Those interested in taking part in user engagement on synthetic data can join our engagement contact list.

The SIPHER Synthetic Population is now available as part of the UK Data Service curated collection. Find out more about access requirements for the dataset and how to become a registered user of the UK Data Service on their website.

You can learn more about the SIPHER Consortium project through the University of Glasgow, or take a look at their new Dashboard tool.
