What did we learn?
Our project found that very simple, 'low-fidelity' synthetic datasets can be really helpful for teaching students and for helping programmers plan computer code for health applications. Low-fidelity means that the data only resembles real-world data in a very basic way. Simple distributions, like the number of people aged over 50 in a population, are preserved.
But more complicated relationships, like the number of people over 50 who have heart failure AND use a steroid inhaler AND have attended hospital as an outpatient in the last three years, are not represented. Keeping synthetic data simple and ‘lo-fi’ is a good way to guarantee that real patient details are never revealed.
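As a rough illustration of the low-fidelity idea, here is a minimal Python sketch. It is not the project's actual method. The column names and proportions are invented for the example, and each column is drawn on its own, so the simple proportions are roughly kept while any link between columns is lost.

```python
import random

# Hypothetical one-way proportions, invented for illustration only.
# They are NOT taken from any real dataset.
MARGINALS = {
    "aged_over_50": 0.40,
    "heart_failure": 0.02,
    "steroid_inhaler": 0.05,
    "outpatient_last_3_years": 0.30,
}


def make_lofi_records(n, marginals, seed=0):
    """Draw every column independently of the others.

    Only the simple, one-way proportions are reproduced; combinations
    of columns carry no information about real patients.
    """
    rng = random.Random(seed)
    return [
        {name: rng.random() < p for name, p in marginals.items()}
        for _ in range(n)
    ]


def proportion(records, name):
    """Share of records where the given column is True."""
    return sum(r[name] for r in records) / len(records)


records = make_lofi_records(10_000, MARGINALS)

# The simple distribution is preserved (close to 0.40).
over_50 = proportion(records, "aged_over_50")

# Combinations are just the product of the separate proportions,
# so they say nothing about how conditions cluster in real patients.
both = sum(
    r["aged_over_50"] and r["outpatient_last_3_years"] for r in records
) / len(records)
```

In this sketch, the share of people aged over 50 comes out close to the chosen 40%, but the share who are over 50 and also attended as an outpatient is simply what chance would give (about 0.40 × 0.30), whatever the true relationship might be.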
Straightforward sharing licences, and clear guidance for data controllers and users, can also offer reassurance when using synthetic health data.
Privacy-enhancing technologies (PETs) like synthetic data are a great way to promote data science and research in the health sector. We are planning to create some small synthetic datasets to share with NHS staff and postgraduate students in Glasgow to teach them about health data science. Users will be able to safely test computer code with our synthetic data, without having to access any real-world patient information.