
Interview: Liam Cavin on NRS Synthetic Census Data


11 Dec 2024

Researchers can now apply for access to synthetic versions of Scotland’s census data.

We caught up with Liam Cavin, Statistician at National Records of Scotland, to find out more about the project. 

Hi Liam, thanks for joining us. Can you start by telling us about the dataset that National Records of Scotland will be making available?  

Liam Cavin: At National Records of Scotland (NRS), our purpose is to collect, preserve and produce information about Scotland's people and history and make it available to inform current and future generations. One of the information sources we are responsible for is Scotland’s census. 
 
Our team’s purpose is to make census data available to researchers in a secure and ethical way. We have created some basic, low-fidelity synthetic datasets for Scotland’s 2001 and 2011 censuses as a further resource for researchers working in this area.

Why create a synthetic version of a dataset? 

It can take a long time for research projects to be put together. Researchers have told us that, often, they aren’t even sure what the data will look like once it’s ready. This kind of synthetic data will help researchers understand some of the basic features of census data, even when they are in the early stages of scoping a project. By speeding up this part of the process, the intention is that the wider public could see the benefits of research projects more quickly, whether that’s updates to policy or changes made to public services.  

 


Who do you think will be interested in using this synthetic dataset? 
 
Anyone interested in exploring which datasets can be linked for analysis of Scotland’s population. To give some examples of the kind of things census data is currently used for: 

  • Investigating the effects of air pollution and poor housing on childhood respiratory illnesses
  • Understanding how socioeconomic factors affect people suffering from both stroke and cancer 

What type of projects do you hope to see developed using this dataset? And how could this help to inform research for the public good? 
 
Most of our existing projects focus on health. Census data contributes to these projects by helping us to understand how socioeconomic inequalities affect health outcomes. But the potential uses are wider, including education, economics, transport and more.
 
I’m glad that you mentioned the public good – it’s essential that anyone seeking access to census data can demonstrate that public benefit is the main focus of their work. 

 


Can you tell me about the collaboration between NRS and Research Data Scotland (RDS)? 

The audience for this data is researchers who are considering accessing public sector data, particularly for linked data projects. RDS has a central role in this area, not least in communicating to researchers what data is available and how it can be accessed. The RDS metadata catalogue is where researchers can explore the wide range of public sector data, so it’s the right place to make these synthetic datasets available.

RDS is leading the way in Scotland on how to use synthetic data in research and has resources to help researchers better understand synthetic data, such as this introduction to synthetic data.  

The terms 'fidelity' and 'utility' are used a lot when discussing synthetic data — how do these relate to datasets you have generated? 

Fidelity describes how many of the statistical properties of the real data are preserved in the synthetic dataset. We opted to create low-fidelity synthetic data. In other words, it looks like real census data at first glance: all of the variable names and all of the variable codings are the same as you would find in the real dataset. But none of the rows (representing people) have real or even realistic combinations of values. This ensures that people’s personal data is kept safe.

So why did we do it this way? That brings us to utility – what people can use the dataset for. Researchers told us that something very simple like this would be useful, if they can access it with fewer barriers. It would help them to discover what data is available, and plan projects. We kept the fidelity low, so that we could be confident that there weren’t any risks with making it widely available. 

 


Can you tell me more about how NRS generate the synthetic version of the data?  
 
It’s nothing fancy really. We used the census to identify a set of around 100 variables in each dataset that would likely be of interest to researchers. Then we created a randomised set of values for each of those variables. We kept it small, with 5,500 rows of information. This allowed all possible values to exist in the dataset, whilst keeping the file size small. Lastly, we spent some time checking to make sure that none of the information in the synthetic dataset revealed any information about real people. 
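The general idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method NRS used: the variable names and codes in `CODEBOOK` are made up, and `make_low_fidelity` is a hypothetical helper. Each variable is filled independently at random, and every allowed code is placed at least once so that all possible values exist in the output.

```python
import random

# Hypothetical codebook: variable name -> allowed codes.
# These names and codes are illustrative, not the real census coding.
CODEBOOK = {
    "sex": [1, 2],
    "age_band": list(range(1, 9)),
    "tenure": [1, 2, 3, 4, 5],
}


def make_low_fidelity(codebook, n_rows, seed=0):
    """Build rows whose values are drawn independently for each variable."""
    rng = random.Random(seed)
    columns = {}
    for name, codes in codebook.items():
        if len(codes) > n_rows:
            raise ValueError(f"{name} has more codes than rows")
        # Include each code once so every value appears, then pad with random draws.
        values = list(codes)
        values += [rng.choice(codes) for _ in range(n_rows - len(codes))]
        # Shuffle each column separately, so rows carry no real relationships.
        rng.shuffle(values)
        columns[name] = values
    names = list(codebook)
    return [dict(zip(names, row)) for row in zip(*(columns[n] for n in names))]


rows = make_low_fidelity(CODEBOOK, n_rows=5500)
```

Because each column is generated and shuffled on its own, any combination of values within a row is arbitrary, which is what keeps the fidelity low.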

While there isn’t information about real people in this synthetic data, we do run checks to make sure we haven’t accidentally replicated any real records or created a dataset that could be mistakenly perceived as being real.  
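The interview does not say how these checks are carried out. One simple possibility, shown here purely as an assumption, is an exact-match comparison that flags any synthetic row identical to a real row across the chosen variables. The function name `find_replicated_rows` and the toy records are hypothetical.

```python
def find_replicated_rows(synthetic_rows, real_rows, variables):
    """Return indexes of synthetic rows that exactly match a real row."""
    # Build a set of real value combinations for fast lookup.
    real_keys = {tuple(row[v] for v in variables) for row in real_rows}
    return [
        i
        for i, row in enumerate(synthetic_rows)
        if tuple(row[v] for v in variables) in real_keys
    ]


# Toy, made-up records used only to demonstrate the check.
variables = ["sex", "age_band", "tenure"]
real = [
    {"sex": 1, "age_band": 3, "tenure": 2},
    {"sex": 2, "age_band": 5, "tenure": 1},
]
synthetic = [
    {"sex": 1, "age_band": 3, "tenure": 2},  # same combination as the first real row
    {"sex": 2, "age_band": 1, "tenure": 4},
]

flagged = find_replicated_rows(synthetic, real, variables)
print(flagged)  # [0]
```

Any flagged rows could then be regenerated before release. With around 100 variables per row, an exact match by chance would be very unlikely, but a check like this makes that explicit.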

Are there any plans to make other synthetic datasets available?  
 
Next year we will be making 2022 census data available for linkage. We will make a synthetic version of that available, in a similar way to what we have done here for the 2001 and 2011 censuses. 

Thanks for chatting with us Liam! Is there anything else about these synthetic datasets that you would like people to know? 

We would love to hear from anyone who uses them! Are they useful for you? Could they be improved or extended? Please get in touch by email at dataaccess@nrscotland.gov.uk.

 

You can find out more about these new synthetic census datasets by reading our launch story. Researchers can apply to access the NRS synthetic census datasets in the metadata catalogue.
