Data re-identification

Data re-identification or de-anonymization is the practice of matching anonymous data with publicly available information, or auxiliary data, in order to discover the person to whom the data belongs. This is a concern because companies with privacy policies, health care providers, and financial institutions may release the data they collect after the data has gone through the de-identification process.

Legal protections of data in the United States

Existing privacy regulations typically protect information that has been modified, so that the data is deemed anonymized, or de-identified. For financial information, the Federal Trade Commission permits its circulation if it is de-identified and aggregated. The Gramm-Leach-Bliley Act (GLBA), which requires financial institutions to give consumers the opportunity to opt out of having their information shared with third parties, does not cover de-identified data if the information is aggregate and does not contain personal identifiers, since this data is not treated as personally identifiable information.

Medical records

Patients' medical information is becoming increasingly available on the Internet, on free and publicly accessible platforms such as HealthData.gov and PatientsLikeMe, encouraged by government open data policies and data sharing initiatives spearheaded by the private sector. While this level of accessibility yields many benefits, concerns regarding discrimination and privacy have been raised. Protections on medical records and consumer data from pharmacies are stronger than those for other kinds of consumer data. The Health Insurance Portability and Accountability Act (HIPAA) protects the privacy of identifiable health data, but authorizes the release of information to third parties if it is de-identified. In addition, it mandates that patients receive breach notifications should there be more than a low probability that the patient's information was inappropriately disclosed or utilized without sufficient mitigation of the harm to the patient. The likelihood of re-identification is a factor in determining the probability that the patient's information has been compromised. Commonly, pharmacies sell de-identified information to data mining companies that sell to pharmaceutical companies in turn. The final revisions affirmed this regulation.

Re-identification efforts

There have been a sizable number of successful re-identification attempts in different fields. Even if it is not easy for a lay person to break anonymity, once the steps to do so are disclosed and learnt, no higher-level knowledge is needed to access information in a database. Sometimes technical expertise is not even needed, if a population has a unique combination of identifiers. In 1997, a researcher successfully de-anonymized medical records using voter databases. There are existing algorithms used to re-identify patients from prescription drug information.

In 2006, Netflix released subscriber movie-rating data after de-identification, which consisted of replacing individual names with random numbers and moving around personal details. Two researchers de-anonymized some of the data by comparing it with non-anonymous IMDb (Internet Movie Database) users' movie ratings. It was found that very little information from the database was needed to identify a subscriber.

Removing only a person's identity from location data does not remove identifiable patterns such as commuting rhythms, sleeping places, or work places. By mapping coordinates onto addresses, location data is easily re-identified or correlated with a person's private life contexts. Streams of location information play an important role in the reconstruction of personal identifiers from smartphone data accessed by apps.

Court decisions

In 2019, Professor Kerstin Noëlle Vokinger and Dr. Urs Jakob Mühlematter, two researchers at the University of Zurich, analyzed cases of the Federal Supreme Court of Switzerland to assess which pharmaceutical companies and which medical drugs were involved in legal actions against the Federal Office of Public Health (FOPH) regarding pricing decisions of medical drugs. In general, involved private parties (such as pharmaceutical companies) and information that would reveal the private party (for example, drug names) are anonymized in Swiss judgments. The researchers were able to re-identify 84% of the relevant anonymized cases of the Federal Supreme Court of Switzerland by linking information from publicly accessible databases. This achievement was covered by the media and started a debate about whether and how court cases should be anonymized.
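The linkage approach that recurs in this section, such as matching de-identified medical records against a voter database, can be illustrated with a minimal sketch. All records, names, and field names below are invented for illustration, and the choice of zip code, sex, and date of birth as join fields is an assumption rather than a description of any specific study's method.

```python
from collections import defaultdict

# Hypothetical de-identified records: names removed, quasi-identifiers kept.
medical = [
    {"zip": "02138", "sex": "F", "dob": "1945-07-31", "diagnosis": "hypertension"},
    {"zip": "02139", "sex": "M", "dob": "1962-01-15", "diagnosis": "asthma"},
    {"zip": "02139", "sex": "M", "dob": "1980-03-02", "diagnosis": "diabetes"},
]

# Hypothetical public auxiliary data (for example, a voter roll) with names.
voters = [
    {"name": "Alice Example", "zip": "02138", "sex": "F", "dob": "1945-07-31"},
    {"name": "Bob Example", "zip": "02139", "sex": "M", "dob": "1962-01-15"},
    {"name": "Carl Example", "zip": "02139", "sex": "M", "dob": "1980-03-02"},
    {"name": "Dan Example", "zip": "02139", "sex": "M", "dob": "1980-03-02"},
]

# Assumed quasi-identifiers shared by both data sets.
QUASI_IDENTIFIERS = ("zip", "sex", "dob")


def key(record):
    """Return the tuple of quasi-identifier values for a record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(deidentified, auxiliary):
    """Pair each de-identified record with a name when exactly one
    auxiliary record shares its quasi-identifiers."""
    index = defaultdict(list)
    for person in auxiliary:
        index[key(person)].append(person["name"])

    matches = []
    for record in deidentified:
        candidates = index.get(key(record), [])
        if len(candidates) == 1:  # ambiguous or missing matches are skipped
            matches.append((candidates[0], record["diagnosis"]))
    return matches


reidentified = link(medical, voters)
print(reidentified)
```

In this toy data the first two records are linked to a single name each, while the third stays ambiguous because two voters share the same combination of values.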

Concern and consequences

In 1997, Latanya Sweeney found from a study of Census records that up to 87 percent of the U.S. population can be identified using a combination of their 5-digit zip code, gender, and date of birth. Unauthorized re-identification on the basis of such combinations does not require access to separately kept "additional information" that is under the control of the data controller, as is now required for GDPR-compliant pseudonymization. Individuals whose data is re-identified are also at risk of having their information, with their identity attached to it, sold to organizations they do not want possessing private information about their finances, health or preferences. The release of this data may cause anxiety, shame or embarrassment. Once an individual's privacy has been breached as a result of re-identification, future breaches become much easier: once a link is made between one piece of data and a person's real identity, any association between the data and an anonymous identity breaks the anonymity of the person. Re-identification may expose companies and institutions which have pledged to assure anonymity to increased tort liability and cause them to violate their internal policies, public privacy policies, and state and federal laws, such as laws concerning financial confidentiality or medical privacy, by having released information to third parties that can identify users after re-identification.
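The kind of uniqueness measurement behind the zip code, gender, and date of birth finding can be sketched as follows. This is a toy illustration on five invented records, not a reproduction of the Census study; the function names `uniqueness_rate` and `smallest_group` are hypothetical helpers introduced here.

```python
from collections import Counter


def combination_counts(records, fields):
    """Count how many records share each combination of field values."""
    return Counter(tuple(r[f] for f in fields) for r in records)


def uniqueness_rate(records, fields):
    """Fraction of records whose combination of `fields` is unique."""
    counts = combination_counts(records, fields)
    unique = sum(1 for r in records if counts[tuple(r[f] for f in fields)] == 1)
    return unique / len(records)


def smallest_group(records, fields):
    """Size of the smallest group of records sharing the same values.
    A result of 1 means at least one person stands alone."""
    return min(combination_counts(records, fields).values())


# Invented toy population.
people = [
    {"zip": "02138", "gender": "F", "dob": "1945-07-31"},
    {"zip": "02138", "gender": "F", "dob": "1950-02-11"},
    {"zip": "02138", "gender": "M", "dob": "1950-02-11"},
    {"zip": "02139", "gender": "M", "dob": "1980-03-02"},
    {"zip": "02139", "gender": "M", "dob": "1980-03-02"},
]

for fields in [("zip",), ("zip", "gender"), ("zip", "gender", "dob")]:
    print(fields, uniqueness_rate(people, fields), smallest_group(people, fields))
```

Each added field splits the population into smaller groups, so the share of people who are alone in their group can only stay the same or grow.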

Remedies

To address the risks of re-identification, several proposals have been suggested:
• Higher standards and a uniform definition of de-identification while retaining data utility: the definition of de-identification should balance privacy protections to reduce re-identification risk with the refusal of companies to delete data
• Heightened privacy protections of anonymized information
• Creation of data-release policies: making sure de-identification rhetoric is accurate, drawing up contracts that prohibit re-identification attempts and dissemination of sensitive information, establishing data enclaves, and utilizing data-based strategies to match required protection standards to the level of risk
• Implementation of differential privacy on requested data sets
• Generation of synthetic data that exhibits the statistical properties of the raw data, without allowing real individuals to be identified

While a complete ban on re-identification has been urged, enforcement would be difficult. There are, however, ways for lawmakers to combat and punish re-identification efforts, if and when they are exposed: pair a ban with harsher penalties and stronger enforcement by the Federal Trade Commission and the Federal Bureau of Investigation; grant victims of re-identification a right of action against those who re-identify them; and mandate software audit trails for people who utilize and analyze anonymized data. A small-scale re-identification ban may also be imposed on trusted recipients of particular databases, such as government data miners or researchers. This ban would be much easier to enforce and may discourage re-identification.
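As an illustration of the differential privacy proposal, the sketch below answers a counting query with Laplace noise. It is a minimal example under stated assumptions: the records are invented, `private_count` and `laplace_noise` are hypothetical helper names, and the noise scale of 1/epsilon relies on a counting query changing by at most 1 when one person is added or removed. Production systems also track a privacy budget across queries, which is omitted here.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two independent
    exponential samples with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of records matching `predicate`.

    A counting query has sensitivity 1, so Laplace noise with scale
    1/epsilon is used. Smaller epsilon means more noise.
    """
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Invented toy data set.
patients = [
    {"zip": "02138", "diagnosis": "asthma"},
    {"zip": "02138", "diagnosis": "diabetes"},
    {"zip": "02139", "diagnosis": "asthma"},
    {"zip": "02139", "diagnosis": "asthma"},
]


def has_asthma(record):
    return record["diagnosis"] == "asthma"


noisy = private_count(patients, has_asthma, epsilon=0.5)
print(f"noisy asthma count: {noisy:.2f} (true count is 3)")
```

Because the released value is randomized, an analyst cannot tell with confidence whether any single individual's record was present in the data set, at the cost of some accuracy in the answer.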

Examples of de-anonymization

• "Researchers at MIT and the Université catholique de Louvain, in Belgium, analyzed data on 1.5 million cellphone users in a small European country over a span of 15 months and found that just four points of reference, with fairly low spatial and temporal resolution, was enough to uniquely identify 95 percent of them. In other words, to extract the complete location information for a single person from an "anonymized" data set of more than a million people, all you would need to do is place him or her within a couple of hundred yards of a cellphone transmitter, sometime over the course of an hour, four times in one year. A few Twitter posts would probably provide all the information you needed, if they contained specific information about the person's whereabouts."
• "Here, we report that surnames can be recovered from personal genomes by profiling short tandem repeats on the Y chromosome (Y-STRs) and querying recreational genetic genealogy databases. We show that a combination of a surname with other types of metadata, such as age and state, can be used to triangulate the identity of the target."
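The idea that a handful of coarse location points can single out one person in a data set can be sketched on toy data. The traces below are invented, each point is modeled as an assumed (antenna, hour) pair, and `matching_users` is a hypothetical helper; this is not the method or data of the study quoted above.

```python
def matching_users(traces, known_points):
    """Return the pseudonymous users whose trace contains every known point."""
    known = set(known_points)
    return [user for user, points in traces.items() if known <= points]


# Invented pseudonymized traces: sets of (antenna id, hour of day) pairs.
traces = {
    "u1": {("cellA", 8), ("cellB", 12), ("cellC", 18), ("cellA", 23)},
    "u2": {("cellA", 8), ("cellB", 12), ("cellD", 18), ("cellE", 23)},
    "u3": {("cellA", 8), ("cellF", 12), ("cellC", 18), ("cellA", 23)},
}

# Points an observer might learn from outside sources, such as public posts.
observed = [("cellA", 8), ("cellB", 12), ("cellC", 18)]

# Each additional known point narrows the set of candidate users.
for n in range(1, len(observed) + 1):
    print(n, matching_users(traces, observed[:n]))
```

Once the candidate list shrinks to a single pseudonym, every other point in that trace is tied to the observed person.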

Source: Wikipedia ↗
