Six Good Reasons to De-Identify Data

Tuesday, May 08, 2012

Rebecca Herold

De-identification is a great privacy tool for all types of businesses, of all sizes. 

If you have personal data that you want to use for research, marketing, testing applications, statistical trending or some other legitimate purpose, but you don’t need to know the specific individuals involved in order to meet your goals, then you should consider de-identifying the personal data. 

Even though it sounds complicated, there are many good methods you can use to accomplish de-identification.  And the great thing is, under many legal constructs de-identification is an acceptable way to use personal information for purposes beyond those for which the personal data was collected.

Unfortunately, because of the results of some studies, many organizations now think that de-identification is not a good option.  More than a dozen organizations I spoke with at the recent IAPP conference in DC told me that, after hearing about the de-identification studies in a couple of the sessions, they were thinking about advising their business management not to use de-identification, because they understood the speakers to be saying that it is not a good privacy-preserving measure.

But wait, it is! While the related research studies are valid, the results have often been viewed out of context.  I want to explain, at a high level, why more organizations need to use de-identified data.

What is de-identified data?

Basically, de-identified data is what you have left after removing the directly identifying data items from a file of personal data, such that the remaining set of data can no longer be associated with a specific individual or individuals.  Here’s an over-simplification: imagine a photo of a concert audience, in front of the stage where the band is playing.

If you replaced the head of each apparent female in the audience with a red balloon, each apparent male in the audience with a green balloon, replaced each person’s body with a cucumber, and their arms and legs with pipe cleaners, you would have effectively de-identified the photo. 

You’ve removed the information that was necessary to identify each individual.  However, you’ve left enough information to be able to do research and such things as determine how many total people were in the audience, how many likely females, how many likely males, and so on.  

Now this is very, very simplified, and data de-identification typically involves completely removing data items without replacing them with other things.  But, hopefully it gives you a good visualization for what de-identification means.
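To make the idea a little more concrete, here is a minimal Python sketch of the same thing done to a small file of records rather than a photo. The field names and sample values are made up for illustration, and simply dropping direct identifiers like this does not by itself guarantee the remaining data cannot be linked back to a person.

```python
from collections import Counter

# Hypothetical field names, chosen for illustration only.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}

def de_identify(record):
    """Return a copy of the record without the directly identifying fields."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

# Made-up audience records.
audience = [
    {"name": "Alice Example", "email": "alice@example.com",
     "phone": "555-0100", "gender": "F", "section": "floor"},
    {"name": "Bob Example", "email": "bob@example.com",
     "phone": "555-0101", "gender": "M", "section": "floor"},
    {"name": "Carol Example", "email": "carol@example.com",
     "phone": "555-0102", "gender": "F", "section": "balcony"},
]

de_identified = [de_identify(r) for r in audience]

# Aggregate questions can still be answered from what is left.
total = len(de_identified)
by_gender = Counter(r["gender"] for r in de_identified)
```

As in the photo example, the total audience size and the gender breakdown survive, while the names and contact details do not.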

It is also important to note that, contrary to what I’ve seen written in various recently published legal articles, encrypting data does NOT make it de-identified data!  No data has been removed when it is encrypted.  Encryption simply scrambles data, using one of many possible encryption algorithms, so that no one without the decryption key can interpret it; anyone who holds the key can restore the original data, identifiers and all.

When would you use de-identified data?

De-identification is on the verge of being used much more widely as we continue further down the path of “Big Data.”  

Certainly it makes sense.  As we have significantly larger amounts of data, significantly more computing power, and significantly more sophisticated data mining techniques, it becomes easier to take huge amounts of data that would be impossible to sort through manually, or even with our individual personal computers, and in the blink of an eye analyze literally hundreds or even thousands of terabytes (each terabyte is one trillion bytes!).  Such analysis can put together a little bit of information from here and a little bit from there, each piece on its own of no significance to an individual, but when combined possibly revealing a person’s life story.

Because of these increased capabilities to take many different data sets and correlate personal data items to reveal information about specific individuals’ lives and activities, it becomes more important than ever to use de-identification within these data sets to significantly reduce the related privacy risks.  De-identified data can then be used for beneficial activities such as:

  1. To allow for groundbreaking healthcare research with patient data that will not infringe upon patient privacy.
  2. To allow for innovative energy research with energy usage data that will not reveal the corresponding energy usage consumers.
  3. To allow for improved marketing based upon consumer activity data without revealing information about the individual consumers from whom the data was collected.
  4. To allow libraries to preserve their readers’ privacy regarding their reading and viewing activities while maintaining statistics and trends about which items are accessed and read.

And this list could go on and on.

Reality does not match recent negative comments regarding de-identified data

Some impressive academic research studies have shown that under the right circumstances, and with no other, or insufficient, security controls in place, de-identified data may potentially be re-identified.  Examples include one by Latanya Sweeney in 2000 and another by Paul Ohm in 2010.

These research papers, and others, continue to point out the ways in which de-identified data can be re-identified.  The findings are very important for those using de-identification, because they help in understanding the related risks.
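The general idea behind this kind of re-identification can be sketched in a few lines of Python: records that have had names removed are matched against a second list that still has names, using attributes the two lists share. This is only an illustration with made-up data and hypothetical field names, not the exact method used in the studies mentioned above.

```python
# Hypothetical "de-identified" records: names removed, other attributes kept.
de_identified = [
    {"zip": "50309", "birth_date": "1970-03-14", "sex": "F", "diagnosis": "asthma"},
    {"zip": "50310", "birth_date": "1982-11-02", "sex": "M", "diagnosis": "diabetes"},
    {"zip": "50309", "birth_date": "1991-07-25", "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical list from another source that still includes names.
public_list = [
    {"name": "Pat Sample", "zip": "50309", "birth_date": "1970-03-14", "sex": "F"},
    {"name": "Lee Sample", "zip": "50310", "birth_date": "1982-11-02", "sex": "M"},
    {"name": "Kim Sample", "zip": "50311", "birth_date": "1965-01-30", "sex": "F"},
]

# Attributes that appear in both lists (an assumption for this example).
SHARED_FIELDS = ("zip", "birth_date", "sex")

def key(record):
    """Build a lookup key from the shared attributes."""
    return tuple(record[f] for f in SHARED_FIELDS)

def link(records, named_list):
    """Return (name, record) pairs where exactly one named entry matches."""
    names_by_key = {}
    for person in named_list:
        names_by_key.setdefault(key(person), []).append(person["name"])
    matches = []
    for record in records:
        candidates = names_by_key.get(key(record), [])
        if len(candidates) == 1:
            matches.append((candidates[0], record))
    return matches

re_identified = link(de_identified, public_list)
```

In this toy data two of the three records link to a single name. Whether that can happen in practice depends on who can get at both lists, which is where the accompanying safeguards come in.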

These types of findings understandably alarm business leaders, along with information security and privacy managers, when described in terms of absolutes because they usually do not want to use within their organizations a method that has been, in their interpretation, labeled as not being effective for privacy protection.

Again, business leaders must look at and understand the context within which these, and many other, studies were executed.  The ones I’ve seen did not consider whether additional safeguards such as policies, training, monitoring and logging, just to name a few, were also, or should have been, in place.

And, as with any other scientific research study, it is important to take the findings, which typically report the worst-case scenario, and then determine the mitigating controls necessary to bring the privacy risks down to an acceptable level, which in turn allows for real advances in research, marketing, modeling and other areas.

Consider the de-identification requirements under HIPAA.  The HIPAA Privacy Rule allows for two types of de-identification standards:

(a) The Safe Harbor Standard which requires the removal of 18 specific data elements that could uniquely identify an individual in addition to having the other required security policies and procedures in place for using the resulting de-identified data.

(b) The Statistical Standard which requires that a person with appropriate knowledge of and experience with generally-accepted statistical and scientific principles and methods for rendering information not individually identifiable determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information, and documents how that determination was made.
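As a rough illustration of what the Safe Harbor approach looks like in practice, the Python sketch below drops a few direct identifiers and generalizes dates, ZIP codes and very high ages. The field names are hypothetical, and the sketch covers only a handful of the 18 categories, so it should be read as an example of the kind of transformation involved and not as a compliance tool.

```python
# Hypothetical field names; a real data set would need its own mapping onto
# the identifier categories listed in the HIPAA Privacy Rule.
DROP_FIELDS = {"name", "ssn", "phone", "email", "medical_record_number"}

def safe_harbor_sketch(record):
    """Apply a few Safe Harbor-style transformations to one record (partial)."""
    out = {k: v for k, v in record.items() if k not in DROP_FIELDS}
    # Dates tied to the individual are reduced to the year only.
    if "admission_date" in out:
        out["admission_year"] = out.pop("admission_date")[:4]
    # ZIP codes are truncated to the first three digits. The rule also requires
    # checking the population of that three-digit area; that check is not done here.
    if "zip" in out:
        out["zip3"] = out.pop("zip")[:3]
    # Ages over 89 are grouped into a single "90+" category.
    if "age" in out and out["age"] >= 90:
        out["age"] = "90+"
    return out

# Made-up patient record.
patient = {
    "name": "Pat Sample",
    "ssn": "000-00-0000",
    "zip": "50309",
    "admission_date": "2011-09-30",
    "age": 93,
    "diagnosis": "asthma",
}

result = safe_harbor_sketch(patient)
```

The diagnosis, the admission year, the three-digit ZIP area and the age band remain available for research, while the name and Social Security number are gone.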

To date these HIPAA requirements for de-identifying protected health information (PHI) for the purposes of research have worked comparatively well within the healthcare sector, considering there have been comparatively few privacy complaints and no reported incidents of re-identification breaches.  There have been five breaches reported for research facilities.  Three of them involved the theft or loss of mobile computer equipment, one involved print information, and one involved unauthorized access to a server.

None of the reported breaches that I could find involved re-identifying de-identified data.  And, according to research published in the Journal of Clinical Research Best Practices (“HIPAA Complaints in Clinical Research,” Vol. 4, No. 2, February 2008): “Counting just industry-sponsored clinical trials, at least two million study subjects sign HIPAA authorization forms each year. Since HIPAA enforcement began in 2003, at least ten million people have signed the forms. The OCR complaint rate is thus about 1 per 600,000 subjects [0.017%]. The rate is probably ten times lower (1 per 60,000) if we include people who participated in studies not sponsored by industry or who were contacted but did not enroll in a study.” I couldn’t find a more recent similar study, but would love to see any if they are available.

Safeguards are still necessary for de-identified data, just not as stringent

Now, here’s a very important point; keep it in your head: Even if you’ve de-identified data, you still need to have some safeguards in place around it!

Information security isn’t necessary only to protect personal data.  Information security is necessary to protect all types of data that are valuable to you and your business.  Your de-identified data is valuable to you; otherwise you would not have taken the time and effort to de-identify it.  It is valuable to have for the purposes for which you de-identified it.

I’ve heard some business leaders express the opinion that if data is de-identified, then it no longer needs to have security controls implemented to protect it.  This is not true!  Security controls are still necessary; however, not nearly as many controls and restrictions are necessary as if the data were not de-identified.

So what types of security controls do you need for de-identified data? That depends upon the data set you are de-identifying, the purposes for which the de-identified data will be used, where it will be stored, and how many people will have access to it.  Some of the basic information security controls typically necessary include:

  • Establish and document policies and supporting procedures detailing the situations when de-identification needs to occur (e.g., for lab research, marketing research, consumer stats reports, etc.) and the associated controls that need to be in place.
  • Establish a position or individual with documented responsibility for appropriately using and securing de-identified data.
  • Provide training for those who will be de-identifying data, and those who will be using and otherwise accessing the de-identified data.
  • Only allow those with a business need to access de-identified data.
  • De-identified data should not be allowed to leave your organization except for clearly documented situations and the associated conditions.
  • Delete de-identified data when it is no longer needed for the purposes for which it was originally created.
  • Obtain direct consent from the related individuals in situations where all personally-identifying data cannot be removed as a consequence of the goals of the research.
  • Perform risk analysis to determine the reasonable likelihood of re-identification.  The higher the risk, the more security controls will need to be implemented and the more access will need to be restricted.  The lower the risk, the fewer the security controls, and the larger the audience that can be provided access.
  • Enter into contracts that prohibit re-identifying the data wherever re-identification is a realistic possibility, and establish internal procedures to support compliance with those contracts.
  • Require any third parties with whom you share de-identified data to comply with the policies you’ve established within your organization for using and safeguarding de-identified data.
  • Perform periodic audits to confirm the de-identified data is still protected at the same levels as originally agreed.
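For the risk analysis item in the list above, one common starting point is to count how many records share the same combination of indirectly identifying attributes (the idea behind what is often called k-anonymity). The Python sketch below assumes that approach and uses made-up field names and data; it is one possible input to a risk analysis, not a complete one.

```python
from collections import Counter

def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of attribute values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def risk_summary(records, quasi_identifiers):
    """Summarize how many records stand alone on the chosen attributes."""
    sizes = group_sizes(records, quasi_identifiers)
    unique = sum(1 for n in sizes.values() if n == 1)
    return {
        "smallest_group": min(sizes.values()),
        "unique_records": unique,
        "unique_fraction": unique / len(records),
    }

# Made-up de-identified records with hypothetical field names.
records = [
    {"zip3": "503", "birth_year": 1970, "sex": "F"},
    {"zip3": "503", "birth_year": 1970, "sex": "F"},
    {"zip3": "503", "birth_year": 1982, "sex": "M"},
    {"zip3": "502", "birth_year": 1991, "sex": "F"},
]

summary = risk_summary(records, ("zip3", "birth_year", "sex"))
```

Records that sit alone in their group are the ones most exposed to being matched against another data set, so a high unique fraction suggests either generalizing the data further or tightening the surrounding controls and access.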

Six good reasons to de-identify data

I work with a large number of small and medium sized businesses (SMBs) whose clients are other businesses.  Many of these SMBs perform applications and systems development work and have been using real production customer data for their research, testing and analysis. 

The same goes for marketing SMBs doing work for a large number of client companies, plus a wide array of other types of businesses.  So, given the information you now have about de-identified information, it should be clear that it is good for organizations of all sizes, from the largest to the smallest, to consider using de-identification as a privacy protection for several reasons, including the following six.

  1. By using de-identified data wherever possible, not only will organizations mitigate the risk of privacy breaches for their clients, they will also dramatically reduce their own liability by demonstrating due diligence in the event a privacy breach occurs within their business that involves their clients’ data.
  2. Using de-identified data lessens the risk of legal non-compliance.  By using only de-identified data where feasible, organizations doing work for other businesses can significantly reduce the risk of not only privacy breaches but also legal compliance infractions.
  3. De-identification is an effective way to protect privacy in the event the de-identified information is seen or obtained by someone not authorized to see or have the personal data from which it was created.
  4. De-identification allows for important health research to occur while protecting privacy to a much greater extent than if actual patient information, with all identifiers, were used.
  5. De-identification allows for marketing research to occur while honoring posted privacy policies that indicate personal information will not be used for marketing purposes.  Of course, this depends upon how the rest of the privacy policy is written.
  6. De-identification is a unique tool that allows data to be used in business in many ways that other security tools, such as encryption or access controls, cannot support, while also lessening privacy risks.

If the risks of de-identified data are not mitigated with accompanying safeguards such as those described above, the risk of re-identification increases.  However, some of the blanket statements I’ve seen published, which typically consider only the technical aspects, simply say that de-identification on its own is not 100% non-reversible and therefore should not be used.  That is like saying that because seat belts alone do not keep 100% of vehicle passengers from being hurt or killed, seat belts shouldn’t be used.

Doesn’t this seem like silly logic?  Well, it does now, but early on when seatbelts were introduced most people didn’t want to use them.  History has shown the effectiveness of seatbelts.  Of course you want to use seatbelts, along with airbags, keeping doors shut, having good-working brakes, and so on, to help reduce the harms that could occur in car accidents.

Likewise, de-identification should be used along with other safeguards to reduce the privacy risks. Simply because de-identification may not be a 100% privacy panacea does not mean that it should not be used. I’m confident history will also show that de-identification is an effective privacy tool, used in conjunction with other privacy tools and information security controls.

Other good de-identification resources

To learn more, here are some other good resources that cover various other de-identification topics and additional viewpoints.

Cross-posted from Privacy Professor

Kevin Black: Very good post and very well written. I would also hazard a bet that if someone was so motivated to get to the data that re-identifying data looks attractive then they will get to it. There are much easier ways to get to the data. I recall an incident many years ago where an unscrupulous insurance agency was paying service desk employees for healthcare records :)

The idea that there is a magic bullet or "one ring to rule them all" type solution kills me. It is amusing to think of someone trying to say that there is no value in de-identifying sensitive and controlled data when moving it from a highly secure environment where every access is on an approval basis and logged to a relatively open test environment with little to no controls. I think I would end that conversation as useless and just walk away.

Anyhow, again, very good post!
Rebecca Herold: Thank you for your comment, Kevin; I appreciate it! :)