CloudKit – now that’s how to do a secure Database for users

Data Breach Hand Brick Wall Computer

One of the big controversies here relates to the appetite of the current UK government to release personal data with the most basic understanding of what constitutes personal identifiable information. The lessons are there in history, but I fear without knowing the context of the infamous AOL Data Leak, that we are destined to repeat it. With it goes personal information that we typically hold close to our chests, which may otherwise cause personal, social or (in the final analysis) financial prejudice.

When plans were first announced to release NHS records to third parties, and in the absence of what I thought were appropriate controls, I sought (with a heavy heart) to opt out of sharing my medical history with any third party – and instructed my GP accordingly. I’d gladly share everything with satisfactory controls in place (medical research is really important and should be encouraged), but I felt that insufficient care was being exercised. That said, we’re more than happy for my wife’s Genome to be stored in the USA by 23andMe – a company that demonstrably satisfied our privacy concerns.

It therefore came as quite a shock to find that a report, highlighting which third parties had already been granted access to health data with Government mandated approval, ran to a total 459 data releases to 160 organisations (last time I looked, that was 47 pages of PDF). See this and the associated PDFs on that page. Given the level of controls, I felt this was outrageous. Likewise the plans to release HMRC related personal financial data, again with soothing words from ministers in whom, given the NHS data implications, appear to have no empathy for the gross injustices likely to result from their actions.

The simple fact is that what constitutes individual identifiable information needs to be framed not only with what data fields are shared with a third party, but to know the resulting application of that data by the processing party. Not least if there is any suggestion that data is to be combined with other data sources, which could in turn triangulate back to make seemingly “anonymous” records traceable back to a specific individual.Which is precisely what happened in the AOL Data Leak example cited.

With that, and on a somewhat unrelated technical/programmer orientated journey, I set out to learn how Apple had architected it’s new CloudKit API announced this last week. This articulates the way in which applications running on your iPhone handset, iPad or Mac had a trusted way of accessing personal data stored (and synchronised between all of a users Apple devices) “in the Cloud”.

The central identifier that Apple associate with you, as a customer, is your Apple ID – typically an email address. In the Cloud, they give you access to two databases on their cloud infrastructure; one a public one, the other private. However, the second you try to create or access a table in either, the API accepts your iCloud identity and spits back a hash unique to your identity and the application on the iPhone asking to process that data. Different application, different hash. And everyone’s data is in there, so it’s immediately unable to permit any triangulation of disparate data that can trace back to uniquely identify a single user.

Apple take this one stage further, in that any application that asks for any personal identifiable data (like an email address, age, postcode, etc) from any table has to have access to that information specifically approved by the handset owners end user; no explicit permission (on a per application basis), no data.

The data maintained by Apple, besides holding personal information, health data (with HealthKit), details of home automation kit in your house (with HomeKit), and not least your credit card data stored to buy Music, Books and Apps, makes full use of this security model. And they’ve dogfooded it so that third party application providers use exactly the same model, and the same back end infrastructure. Which is also very, very inexpensive (data volumes go into Petabytes before you spend much money).

There are still some nuances I need to work. I’m used to SQL databases and to some NoSQL database structures (i’m MongoDB certified), but it’s not clear, based on looking at the way the database works, which engine is being used behind the scenes. It appears to be a key:value store with some garbage collection mechanics that look like a hybrid file system. It also has the capability to store “subscriptions”, so if specific criteria appear in the data store, specific messages can be dispatched to the users devices over the network automatically. Hence things like new diary appointments in a calendar can be synced across a users iPhone, iPad and Mac transparently, without the need for each to waste battery power polling the large database on the server waiting for events that are likely to arrive infrequently.

The final piece of the puzzle i’ve not worked out yet is, if you have a large database already (say of the calories, carbs, protein, fat and weights of thousands of foods in a nutrition database), how you’d get that loaded into an instance of the public database in Apple’s Cloud. Other that writing custom loading code of course!

That apart, really impressed how Apple have designed the datastore to ensure the security of users personal data, and to ensure an inability to triangulate data between information stored by different applications. And that if any personal identifiable data is requested by an application, that the user of the handset has to specifically authorise it’s disclosure for that application only. And without the app being able to sense if the data is actually present at all ahead of that release permission (so, for example, if a Health App wants to gain access to your blood sampling data, it doesn’t know if that data is even present or not before the permission is given – so the app can’t draw inferences on your probably having diabetes, which would be possible if it could deduce if it knew that you were recording glucose readings at all).

In summary, impressive design and a model that deserves our total respect. The more difficult job will be to get the same mindset in the folks looking to release our most personal data that we shared privately with our public sector servants. They owe us nothing less.