The NSA’s NoSQL Data Store: Accumulo


I found this fascinating article on PCWorld about the NSA’s home-grown (or government-grown, if you will) key-value store. The fact that the government has designed and built its own NoSQL database to fill its needs shows the lengths it will go to in order to get and store your data:

http://www.pcworld.com/article/2060060/nsas-accumulo-nosql-store-offers-rolebased-data-access.html

Now I’ll be one of the first people to say that the US government’s unwarranted, wide-ranging collection of data, with no justification other than “just in case you’re a terrorist”, is shameful and needs to change. However, from a technologist’s standpoint, I’m fascinated by the solution itself. It always amazes me how easily and quickly technology solutions are created to fulfill a specific need.

The government needed a solution with many of the same requirements that most companies have: uptime, durability, easily maintainable data schemas, etc. But the one requirement that is largely unique to the government, and a spy agency in particular, is very strict role-based access control, or RBAC.

Accumulo

The result? The Accumulo data store. The name itself hints at its purpose of “accumulation”, but under the hood it’s a distributed key-value store based on Google’s BigTable design and built on top of Hadoop, ZooKeeper, and Thrift. The distinguishing feature of this solution is its ability to give specific pieces of data certain “labels”, by which you can control user access. Take a look at its feature list, with specific attention to what’s listed under the heading “Cell Labels”. By labeling each field, the database system can control access by a specific user or role at the field level itself, at query time.
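To make the idea of field-level labels concrete, here is a minimal sketch in Python. It is a toy model, not Accumulo’s API: the names `is_visible`, `scan`, and `CELLS` are made up for illustration, and the label syntax (`&` for AND, `|` for OR, parentheses for grouping) is only loosely modeled on the kind of expression a cell label can hold. Each cell carries a label, each reader carries a set of authorizations, and the filter is applied at read time.

```python
import re

# Toy model only: these names and this label grammar are assumptions made
# for illustration, not Accumulo's actual API.
TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expr):
    """Split a label such as 'doctor&(psych|admin)' into tokens."""
    expr = expr.strip()
    tokens, pos = [], 0
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"bad label expression: {expr!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(label, auths):
    """Return True if the reader's authorizations satisfy the cell's label."""
    tokens = tokenize(label)
    if not tokens:
        return True  # an unlabeled cell is visible to everyone in this sketch
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()
            result = result or rhs
        return result

    def parse_and():
        # In this sketch '&' binds tighter than '|'.
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_atom()
            result = result and rhs
        return result

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError(f"unexpected end of label: {label!r}")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError(f"missing ')' in label: {label!r}")
            pos += 1
            return result
        if tok in {"&", "|", ")"}:
            raise ValueError(f"unexpected {tok!r} in label: {label!r}")
        return tok in auths  # a bare term is true if the reader holds it

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"trailing tokens in label: {label!r}")
    return result


# (row, column, label, value): made-up sample data
CELLS = [
    ("patient42", "name", "", "Jane Doe"),
    ("patient42", "diagnosis", "doctor", "flu"),
    ("patient42", "invoice", "doctor|billing", "$120"),
    ("patient42", "notes", "doctor&(psych|admin)", "follow up"),
]


def scan(cells, auths):
    """Return only the cells whose label the given authorizations satisfy."""
    return [(row, col, val) for row, col, label, val in cells
            if is_visible(label, auths)]
```

With this sample data, `scan(CELLS, {"billing"})` returns only the `name` and `invoice` cells, while a reader holding both `doctor` and `psych` sees all four. The point is that the same row can look different to different readers without any change to the query itself.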

No Read Up, No Write Down

The concept of data labeling is not new. Oracle has had it in its system for a long time, in a feature called Oracle Label Security (formerly known as Trusted Oracle). No surprise there, as Oracle’s origins date back to a project initiated by the CIA in 1978. This type of feature allows the enforcement of a security concept called “No read up, no write down”, as outlined in the Bell-LaPadula data security model.

Basically, what this model states is that, based on your level of clearance (Secret, Top Secret, etc.), you cannot read any data that is marked with a higher classification than you hold (no read up), and you cannot write data from a higher classification into a lower classification (no write down). Certainly it adds a ton of compute overhead when reading and writing data, but it serves its purpose. And with the budgeting ability of the US Government, they can simply appropriate more funds to buy bigger computers to handle the overhead (our tax dollars at work).
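The two rules are simple enough to sketch in a few lines of Python. This is a simplified illustration under my own assumptions: the four level names and the helpers `can_read` and `can_write` are hypothetical, and the full Bell-LaPadula model also includes compartments and a discretionary access matrix that are left out here.

```python
from enum import IntEnum


class Level(IntEnum):
    # Illustrative ordering; higher value means more sensitive.
    UNCLASSIFIED = 0
    CONFIDENTIAL = 1
    SECRET = 2
    TOP_SECRET = 3


def can_read(clearance, data_level):
    """No read up: a reader may only read data at or below their own level."""
    return clearance >= data_level


def can_write(clearance, target_level):
    """No write down: a writer may only write to their own level or above,
    so higher-classified information cannot leak into a lower level."""
    return clearance <= target_level


# A Secret-cleared user can read Confidential data but not Top Secret data,
# and can write into a Top Secret store but not a Confidential one.
analyst = Level.SECRET
print(can_read(analyst, Level.CONFIDENTIAL))   # True
print(can_read(analyst, Level.TOP_SECRET))     # False
print(can_write(analyst, Level.TOP_SECRET))    # True
print(can_write(analyst, Level.CONFIDENTIAL))  # False
```

Each check is just a comparison, so the overhead in a real system comes from running it against every labeled field on every read and write, not from the rule itself.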

But carrying that feature forward into the NoSQL world enables the government to marry the features of NoSQL (flexible schemas, fast data access, distributed data storage, etc.) with the ability to apply very fine-grained control over all the data. I’d be very interested to see how that read/write overhead of data labeling carries over to a NoSQL solution. Well, I guess we could find out. After all, Accumulo has been open-sourced through Apache! (This did come from the NSA, right?)

What purpose could this type of access control serve outside of a spy agency? The article above suggests a scenario of restricting access to a patient’s health care data to only that patient’s doctor. I guess that could be a use for it, although I doubt it’s needed at the field level. Record level is probably enough, and that’s easily handled by the many simpler methods of role-based access that are already in use across the industry. My guess is that this solution will live mostly in areas where data requires strict classification control, and that means companies with government contracts, or the US Government itself.
