13 Data privacy and anonymity – Reproducible Data Processing and Visualization

13.1 Data privacy laws, ethical responsibilities, and ‘Private Data’

<Prolific.com> is a website that many researchers use to collect data online, and where members of the public are paid to complete studies in psychology and other fields. Prolific ID codes are needed to identify which accounts participated and should be paid. However, Prolific IDs can also contain a lot of Personal Data (in the legal and data privacy sense of the phrase), so they must be removed from datasets before those datasets are shared (e.g., made public or passed to other researchers).

Other forms of Personal Data that we all have a legal responsibility to remove before sharing data include, but are not limited to: names, addresses, telephone numbers, email addresses, and other personally identifying information. Note that the rows referring to “psychiatric diagnosis” certainly appear to be sensitive information, but they might not be if the other data is suitably anonymized. Note also that data de-anonymization/re-identification is far more feasible than most researchers appreciate (e.g., knowing your age, gender, diagnosis, and region might allow a third party to link your data to your identity). This course does not cover data privacy laws, compliance, or de-anonymization/re-identification risks, which are complex topics. It covers only the code methods you might use to make your project compliant and safe for participants.
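As a minimal sketch of what such a code method might look like, the example below removes columns that hold Personal Data from a small made-up dataset before it is shared. The column names (`prolific_id`, `email`, and so on), the helper `drop_personal_columns`, and the example rows are all hypothetical and chosen for illustration only. Your own dataset will need its own list of columns, and dropping columns does not by itself guarantee that the remaining data cannot be re-identified.

```python
import csv
import io

# Hypothetical column names; adjust these to match your own dataset.
PERSONAL_COLUMNS = {"prolific_id", "name", "email", "telephone", "address"}


def drop_personal_columns(rows, personal_columns=PERSONAL_COLUMNS):
    """Return copies of the rows without the listed Personal Data columns."""
    return [
        {key: value for key, value in row.items() if key not in personal_columns}
        for row in rows
    ]


# A tiny made-up dataset standing in for a raw data file.
raw_csv = io.StringIO(
    "prolific_id,email,age,score\n"
    "abc123,someone@example.com,34,0.71\n"
    "def456,other@example.com,27,0.64\n"
)
raw_rows = list(csv.DictReader(raw_csv))

# The raw rows are left untouched; only the copies lose the listed columns.
shareable_rows = drop_personal_columns(raw_rows)
print(shareable_rows)
```

The function returns new rows rather than modifying the raw data in place, so the raw data stays intact (for example, so that participants can still be paid) while only the stripped copy is shared.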

13.2 Negative filters for Private Data

TODO

13.3 Saving anonymized data to disk

TODO

13.4 Excluding raw unanonymized data from subsequent sharing, e.g., on GitHub via gitignore or directory separation; acknowledgement of encryption requirements

TODO

Purging git histories that contain sensitive information if it was accidentally committed.