
Synthetic data

Read about our work on synthetic data.

About synthetic data

A synthetic dataset is designed as an imitation of the original dataset that it is based on.

Synthetic datasets follow the same pattern and structure as their original datasets, but they do not contain real information about people or places. Instead, they are populated with information that, while plausible, is created at random. This provides an extra layer of privacy protection for individuals.

Researchers can benefit from access to synthetic versions of datasets in the early stages of their work. Real data is essential for analysis and decision making, and synthetic data is never used in this way. However, synthetic data can be used alongside the available metadata as a method of data discovery. This can help researchers to determine which datasets are the most appropriate for use in their research. It can also be used for code development while waiting for access to the real datasets required.

Research Data Scotland (RDS) has been working with researchers to understand their requirements around synthetic data, as well as engaging with our Scotland Talks Data public panel on our work to make synthetic datasets available for research. At present, this work is focused on low-fidelity synthetic datasets.

Learn more about synthetic data, including how it’s used, different types of data fidelity, and why it’s useful, in our public-friendly data explainer.

What is synthetic data?

Watch this video with British Sign Language (BSL) interpretation

Our work and impact

Discover some of Research Data Scotland's (RDS) work on synthetic data below.

How RDS is making synthetic data available to researchers

We now have the functionality to make synthetic data available on request to researchers through our metadata catalogue.

If a researcher is interested in accessing a synthetic dataset, they will be asked to enter their contact details and complete an End User License Agreement. RDS will then seek to confirm their credentials and send them the synthetic dataset they have requested. This method of access maintains safeguards and controls around releasing the dataset, while meeting researchers' need for straightforward access to synthetic data.

At present, we host synthetic datasets from National Records of Scotland (NRS). Synthetic versions of selected variables from Scotland’s Census 2001 and Census 2011 can be requested through our metadata catalogue. NRS have developed their own methodology and approach to synthetic data generation and have performed their own quality checks to confirm the required RDS standards of structure, labelling, disclosure and documentation are met.

In addition to hosting synthetic data from partner organisations, analysts in RDS have written code in R to generate synthetic datasets from existing dataset metadata. The key features of this code are that it is reproducible, that the synthetic data it generates is low-fidelity, and that it includes a quality check of the data, which requires comparison with the original dataset. Our team has worked closely with Professor Gillian Raab of the Scottish Centre for Administrative Data Research (SCADR) to develop appropriate and consistent quality checks and standards for creating low-fidelity synthetic data and making it available for research.
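
To illustrate the general idea of generating low-fidelity synthetic data from metadata, here is a minimal sketch in Python. It is not the RDS code, which is written in R and is not reproduced here. The metadata layout, variable names and the functions `generate_low_fidelity` and `check_structure` are all hypothetical, chosen only for illustration. Each variable is drawn independently and at random from the values its metadata allows, so relationships between variables are not preserved. The structural check here compares the output with the metadata only, as a simple stand-in for a comparison with the original dataset.

```python
import random

# Hypothetical metadata describing the structure of a dataset.
# Variable names, types and allowed values are illustrative assumptions.
METADATA = {
    "age": {"type": "integer", "min": 0, "max": 100},
    "sex": {"type": "categorical", "values": ["Female", "Male"]},
    "council_area": {
        "type": "categorical",
        "values": ["Fife", "Highland", "Glasgow City"],
    },
}


def generate_low_fidelity(metadata, n_rows, seed=0):
    """Draw each variable independently at random from its metadata."""
    rng = random.Random(seed)  # a fixed seed keeps the output reproducible
    rows = []
    for _ in range(n_rows):
        row = {}
        for name, spec in metadata.items():
            if spec["type"] == "integer":
                row[name] = rng.randint(spec["min"], spec["max"])
            elif spec["type"] == "categorical":
                row[name] = rng.choice(spec["values"])
            else:
                raise ValueError(f"Unsupported type for {name}: {spec['type']}")
        rows.append(row)
    return rows


def check_structure(rows, metadata):
    """Return a list of structural problems; an empty list means none found."""
    problems = []
    expected = list(metadata.keys())
    for i, row in enumerate(rows):
        if list(row.keys()) != expected:
            problems.append(f"row {i}: unexpected columns")
            continue
        for name, spec in metadata.items():
            value = row[name]
            if spec["type"] == "integer":
                if not spec["min"] <= value <= spec["max"]:
                    problems.append(f"row {i}: {name} out of range")
            elif value not in spec["values"]:
                problems.append(f"row {i}: {name} has an unknown label")
    return problems


if __name__ == "__main__":
    synthetic = generate_low_fidelity(METADATA, n_rows=5, seed=42)
    for row in synthetic:
        print(row)
    print("Problems found:", check_structure(synthetic, METADATA))
```

Because every value is sampled independently, a dataset like this is suitable for exploring structure and developing code, but not for analysis.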

Synthetic Data Fund

Research Data Scotland awarded £85,000 through our Synthetic Data Fund to three organisations investigating how synthetic data can be used by researchers to improve and speed up research access to public sector datasets.

These projects, which ran throughout 2024, explored key areas of interest on synthetic data, including disclosure risk and information governance, synthesis of data, and access, promotion and engagement around synthetic datasets.

Findings from these projects are now being used to inform our synthetic data strategy, ensuring that we incorporate key learnings and best practice into our work. You can also read our case study: Synthetic Data Fund: what can digital twins tell us about data linkage?

Contributing to wider synthetic data work across the UK

As our understanding of synthetic data and the needs of researchers grows, it’s important that we continue to collaborate with our networks of partner organisations, researchers and members of the public. To do this, we have worked on a number of synthetic data projects across the UK, including:

  • Contributing to Discussing Data (formerly DELIMIT), a project run by Cardiff University, in which 39 members of the public were recruited to participate in workshops exploring public attitudes towards the use of synthetic data for research. The recommendations from this consultation will be included within our work on synthetic data at RDS.
  • Running the Scottish Synthetic Data Working Group, a networking meeting to share progress and learning across Scotland.
  • Co-chairing the UK Synthetic Data Community Group, funded by DARE UK, through our Senior Data Analyst, Sophie McCall.
  • Hosting and funding researcher workshops led by Prof. Gillian Raab on the synthpop synthetic data generation tool.

Interested in using synthetic data? Explore our metadata catalogue for the most up-to-date availability of synthetic datasets, or sign up to our engagement contact list to be the first to know about opportunities to shape our work.

Future plans

In future, RDS hopes to work with data controllers to produce synthetic datasets for training, data discovery and code development on an ongoing basis.
