Can you tell me about the collaboration between NRS and Research Data Scotland (RDS)?
The audience for this data is researchers who are considering accessing public sector data, particularly for linked data projects. RDS has a central role in this area, not least in communicating to researchers what data is available and how it can be accessed. The RDS metadata catalogue is where researchers can explore the wide range of public sector data on offer, which makes it the right place to make these synthetic datasets available.
RDS is leading the way in Scotland on how to use synthetic data in research and has resources to help researchers better understand synthetic data, such as this introduction to synthetic data.
The terms 'fidelity' and 'utility' are used a lot when discussing synthetic data – how do these relate to the datasets you have generated?
Fidelity describes how closely the statistical properties of the real data are reproduced in the synthetic dataset. We opted to create low fidelity synthetic data. In other words, it looks like real census data at first glance: all of the variable names and all of the variable codings are the same as you would find in the real dataset. But none of the rows (representing people) have real or even realistic combinations of values. This ensures that people’s personal data is kept safe.
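As a rough illustration of what low fidelity can mean in practice, the sketch below builds rows by drawing each variable independently from its list of valid codes. This is an assumption made for illustration, not a description of the method NRS actually used, and the variable names and codes in the codebook are hypothetical rather than the real census codings.

```python
import random

# Hypothetical codebook: variable names mapped to their allowed codes.
# These names and codes are illustrative only, not the real census codings.
CODEBOOK = {
    "age_band": [1, 2, 3, 4, 5],
    "sex": [1, 2],
    "economic_activity": [-8, 1, 2, 3, 4],
    "general_health": [1, 2, 3, 4, 5],
}


def make_low_fidelity_rows(codebook, n_rows, seed=None):
    """Return n_rows synthetic records.

    Each variable is sampled independently and uniformly from its valid
    codes (an assumed approach), so the rows keep the real names and
    codings but carry no relationships between variables.
    """
    rng = random.Random(seed)
    return [
        {name: rng.choice(codes) for name, codes in codebook.items()}
        for _ in range(n_rows)
    ]


rows = make_low_fidelity_rows(CODEBOOK, n_rows=5, seed=0)
for row in rows:
    print(row)
```

Because every column is drawn on its own, a dataset like this is good for checking that code runs against the right variable names and codes, but any pattern that appears between variables is noise.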
So why did we do it this way? That brings us to utility – what people can use the dataset for. Researchers told us that something very simple like this would be useful if they could access it with fewer barriers. It would help them to discover what data is available and to plan projects. We kept the fidelity low so that we could be confident that there weren’t any risks in making it widely available.