The site gawker.com has some click-bait titled “Public NYC Taxicab Database Lets You See How Celebrities Tip.” Despite the gossipy nature of the title, the article itself goes into some very specific detail on how this was possible despite the information being “protected.” It’s not a giant leap from there to why one must ensure the use of strong encryption for HIPAA-protected data.
NYC Releases Data to Researcher
A data analyst was the recipient of an “enormous database of every cab ride taken in New York City in 2013” (legally obtained from NYC officials). The city did make attempts to anonymize the data via a method known as “hashing”.
Hashing is sometimes called a “one-way algorithm.” What this means is that it’s easy to take A and convert it to B, but it’s impossible to figure out whence B came. For example, I have the number 12, drop it into a machine, and it comes out as 71. What is the relation between the two? Who knows? It’s impossible to figure out the exact way that 12 became a 71. Perhaps you added 59. Or maybe you multiplied by 5 and added 11. Or maybe the formula asks you to take the 59 and subtract it from 100. This leaves 41. Then you divide it by 3, dropping any decimal (which gives 13), then multiply by 5 (that leaves 65), then add the first digit of the result (which is the 6 of the 65), which gives 71.
Obviously, the formula is made extremely complicated to prevent easy analysis. The point is, there’s no way to know what the exact formula is. And, the result B will depend on what you’ve entered as the input A. There’s a way around this supposed complication, however.
Since a hash always creates the same output for a given input (in our example above, the 12 will always lead to a 71), you just feed it everything and note what you get as an output, and so create a database of the linked inputs and outputs. Of course, the help of a computer is needed to create such a database. You can distribute the workload across multiple computers, run them 24/7, and soon enough you’ve got a huge database that you can reference.
One of the oldest and most studied hashes is MD5.
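The lookup-database idea described above can be sketched in a few lines of Python, using MD5 from the standard hashlib module. The input space here (the numbers 0 through 9999) and the helper names md5_hex and build_lookup are assumptions made purely for illustration; a real attack would use a far larger list of candidate inputs.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 digest of a string as hexadecimal text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def build_lookup(candidates):
    """Hash every candidate once and map each output back to its input."""
    return {md5_hex(candidate): candidate for candidate in candidates}


# Illustrative input space: every whole number from 0 to 9999, as text.
lookup = build_lookup(str(n) for n in range(10_000))

# Someone hands us only the hash of "12"; the table gives the input back.
observed = md5_hex("12")
print(lookup[observed])
```

The hash itself is never reversed. The table only works because the same input always yields the same output and because every possible input could be tried in advance.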
All Your Passwords is Belong to Us
MD5 is notoriously weak for a number of reasons. As I already noted, it’s old and heavily researched. The former means that advances in hardware have aided in the defeat of MD5: faster computing means that running a list of inputs through MD5 is also accelerated, spitting out results faster and faster. The latter means that people have humongous databases. If the input to MD5 is short or follows a common pattern, there’s a very good chance its output has been documented somewhere.
Indeed, MD5 is the reason why security researchers were apoplectic over certain online data breaches in the past: the breached companies said that the passwords were protected but failed to mention that they had used MD5. You may as well proudly declare that your bank vault’s combination has been set to 0-0-0-0-0.
This is not to say that MD5 is useless. There’s a technique known as “salting” where random characters are added to the input to create (hopefully) undocumented outputs. For example, if a user enters 12 as the input, perhaps $12j9wK (the salt) is added to the beginning. Technically the input becomes $12j9wK12, which would create an output other than 71.
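A minimal sketch of salting in Python, again using MD5 from hashlib. The salt $12j9wK and the input 12 come from the example above; the function name salted_hash and the choice to generate a random 16-character salt when none is supplied are assumptions for illustration only.

```python
import hashlib
import secrets


def md5_hex(text: str) -> str:
    """Return the MD5 digest of a string as hexadecimal text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def salted_hash(value: str, salt=None):
    """Prepend a salt to the value before hashing; return (salt, digest)."""
    if salt is None:
        # Assumption: 8 random bytes, written as 16 hex characters.
        salt = secrets.token_hex(8)
    return salt, md5_hex(salt + value)


unsalted = md5_hex("12")
salt, salted = salted_hash("12", "$12j9wK")

# The salted digest is the hash of "$12j9wK12", not of "12", so a table
# built from plain inputs will not contain it.
print(salted == md5_hex("$12j9wK12"))  # True
print(salted == unsalted)              # False
```

The salt has to be stored next to the digest so that the same computation can be repeated later. Its job is to make precomputed tables useless, not to act as a secret.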
But this can be defeated as well. In the NYC taxicab database story, the researcher knew something about the inputs. The database included NYC taxi medallion numbers, which have a known format, as well as other known data. This could be used to reverse engineer the original data if it was hashed without salting, obviously. But one could hit upon the original data even if a salt had been used, assuming that the salt was weak (remember, MD5 has been researched heavily and for a long time).
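To see why a known format is so damaging, consider the rough sketch below. The ID format used (one digit, one letter, two digits, such as 5X55) and the two-digit "weak salt" are hypothetical stand-ins chosen for illustration, not a description of the actual taxi data; the helper names are likewise invented.

```python
import hashlib
import itertools
import string


def md5_hex(text: str) -> str:
    """Return the MD5 digest of a string as hexadecimal text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def candidate_ids():
    """Yield every ID in the hypothetical format digit-letter-digit-digit."""
    parts = (string.digits, string.ascii_uppercase,
             string.digits, string.digits)
    for d1, letter, d2, d3 in itertools.product(*parts):
        yield f"{d1}{letter}{d2}{d3}"


def recover(target_digest: str, salts=("",)):
    """Try every salt and every candidate ID until one hashes to the target."""
    for salt in salts:
        for candidate in candidate_ids():
            if md5_hex(salt + candidate) == target_digest:
                return salt, candidate
    return None


# Unsalted: only 10 * 26 * 10 * 10 = 26,000 candidates need to be tried.
print(recover(md5_hex("5X55")))  # ('', '5X55')

# Weak salt (hypothetical): one of 100 two-digit strings. The work grows
# by a factor of 100 at most, which is still trivial for a computer.
weak_salts = [a + b for a in string.digits for b in string.digits]
print(recover(md5_hex("07" + "7B31"), weak_salts))  # ('07', '7B31')
```

A salt only helps when it is long and unpredictable enough that guessing it, on top of guessing the input, becomes impractical.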
What Does this Mean for HIPAA Encryption?
Basically, encryption algorithms are subject to problems similar to those MD5 has shown over the years, as well as others. This is why HIPAA defers to the National Institute of Standards and Technology when it comes to choosing a particular data security solution. Although NIST cannot provide a recommendation on what to use, it does provide a list of features, parameters, and other requirements that an appropriate encryption solution must possess. Furthermore, it will provide a certificate of validation for any solutions it examines that meet the requirements.
Choosing a validated solution means that you’re using a strong encryption solution. Using something else could lead to problems down the line (for example, it could turn out that the solution had vulnerabilities or that it didn’t work from the beginning).