Precision of detecting identifying information

doi: 10.53962/r5gg-jjv0

Originally published on 2022-11-28 under a CC0 Public Domain Dedication

Authors

Summary

This precision report describes the specificity and sensitivity of regular expressions for detecting identifying information. Overall, regular expressions for technical identifiers (e.g., email and IP addresses) are highly specific and sensitive, with the exception of IMEI numbers (hardware identifiers for phones). Phone numbers can be detected when reported in international form (e.g., +1 555 55 555). For location information, regular expressions are highly precise for latitude/longitude combinations but not for street addresses, where they are sensitive but not specific. Among direct identifiers, gender is the hardest to detect with high specificity, and specificity decreases as the dataset grows; given the risk of disclosing marginalized gender identities, we nonetheless consider it important to check for. Similar issues exist for credit card information, where the risk of disclosure is also high. We therefore recommend scanning for all identifying information examined in this report except IMEI numbers and street addresses, for which the regular expressions underperform.
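As a rough illustration of the approach summarized above, the following Python sketch scans text with a few regular expressions and computes sensitivity and specificity from labeled results. The patterns, the `PATTERNS` dictionary, and the function names `detect` and `sensitivity_specificity` are assumptions made for this example; they are not the expressions evaluated in the report, which are documented in the main file.

```python
import re

# Hypothetical patterns for illustration only; the report's own expressions may differ.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    # International form only: a leading "+", a country code, then digit groups.
    "phone_intl": re.compile(r"(?<!\w)\+\d{1,3}(?:[ -]?\d{2,4}){2,4}(?!\d)"),
    # Decimal latitude in [-90, 90] followed by decimal longitude in [-180, 180].
    "lat_lon": re.compile(
        r"(?<![\d.])-?(?:90(?:\.0+)?|[1-8]?\d\.\d+)\s*,\s*"
        r"-?(?:180(?:\.0+)?|(?:1[0-7]\d|[1-9]?\d)\.\d+)(?![\d.])"
    ),
}


def detect(text):
    """Return the set of identifier types whose pattern matches the text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}


def sensitivity_specificity(predicted, actual):
    """Compute (sensitivity, specificity) from parallel lists of booleans."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)          # flagged, truly identifying
    tn = sum(not p and not a for p, a in pairs)  # not flagged, not identifying
    fp = sum(p and not a for p, a in pairs)      # flagged, not identifying
    fn = sum(not p and a for p, a in pairs)      # missed identifying value
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity
```

Here sensitivity is the share of truly identifying values that a pattern flags, and specificity is the share of non-identifying values that it correctly leaves unflagged. A pattern that is sensitive but not specific, as reported for street addresses, would catch most true cases while also flagging many values that are not addresses.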

Main file

report.docx

Supporting files

References

  1. Privacy Protection in the Era of Open Science. doi: 10.31234/osf.io/ybzu9