NSA PRISM – The Mother of all Big Data Projects
As a data engineer and scientist, I have been following the NSA PRISM raw intelligence mining program with great interest. The engineering complexity, breadth, and scale are simply amazing compared to, say, credit card analytics (Fair Isaac) or marketing analytics firms like Acxiom.
Some background… PRISM – “Planning Tool for Resource Integration, Synchronization, and Management” – is a top-secret data-mining “connect-the-dots” program aimed at terrorism detection and other pattern extraction, authorized by federal judges working under the Foreign Intelligence Surveillance Act (FISA). PRISM allows the U.S. intelligence community to look for patterns across multiple gateways and a wide range of digital data sources.
PRISM is an unstructured big data aggregation framework — audio and video chats, phone call records, photographs, e-mails, documents, financial transactions and transfers, internet searches, Facebook posts, smartphone logs and connection logs – plus the analytics that enable analysts to extract patterns. Save and analyze all of the digital breadcrumbs people don’t even know they are creating.
The whole NSA program raises an interesting debate about “Sed quis custodiet ipsos custodes.” (“But who will watch the watchers.”)
What is the PRISM Program?
The program is called PRISM, after the prisms used to split light, which is used to carry information on fiber-optic cables. Think of this as a massive aggregate of aggregates.
Each vendor (Facebook, Google, LinkedIn, etc.) collects an incredible amount of data across its portfolio of properties and applications. What the NSA has done is take this to the next level by creating a massive mashup of all the sources to look for end-to-end patterns and relationships.
The challenge that NSA is tackling is look-forward, real-time intelligence. Can you predict a potential threat in near real time … intercepting a mobile phone call while someone is on the move towards a target … and mount a rapid response to avert the threat? This is not a trivial problem to solve (but it is essential in a world where soft civilian targets are increasingly being chosen).
Connecting the dots from the raw intelligence perspective is the essence of PRISM. Linking the end-to-end chain “Raw Data sources -> Raw Aggregates by provider -> Aggregated Data -> Contextual Intelligence -> Analytical Insights/Inferences -> Decisions” is an engineering feat. PRISM basically puts everything together in one place, combs through all the aggregated data, identifies related pieces of information, and surfaces trends (positive or negative).
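The chain above can be sketched as a toy pipeline. This is purely illustrative — the stage names, record shapes, and watch-pattern logic are my own assumptions, not anything known about PRISM's actual implementation:

```python
# Toy sketch of the "raw data -> provider aggregates -> context -> inference"
# chain. All providers, subjects, and actions here are made up.

def aggregate_by_provider(raw_events):
    """Group raw events under the provider that produced them."""
    by_provider = {}
    for event in raw_events:
        by_provider.setdefault(event["provider"], []).append(event)
    return by_provider

def build_context(by_provider):
    """Merge provider aggregates into per-subject dossiers."""
    dossiers = {}
    for events in by_provider.values():
        for event in events:
            dossiers.setdefault(event["subject"], []).append(event["action"])
    return dossiers

def infer(dossiers, watch_actions):
    """Flag subjects whose combined activity matches a watch pattern."""
    return {subject for subject, actions in dossiers.items()
            if watch_actions.issubset(set(actions))}

raw = [
    {"provider": "telecom", "subject": "alice", "action": "call:syria"},
    {"provider": "bank",    "subject": "alice", "action": "wire:cairo"},
    {"provider": "bank",    "subject": "bob",   "action": "wire:cairo"},
]
flagged = infer(build_context(aggregate_by_provider(raw)),
                {"call:syria", "wire:cairo"})
print(flagged)  # {'alice'}
```

The point of the sketch is the shape of the chain: no single provider's feed flags anyone; only the merged, cross-provider dossier does.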
A slide briefing about the program outlines its effectiveness and features the logos of the companies involved. These slides, posted by The Washington Post and The Guardian, represent a selection from the overall document, and certain portions are redacted.
The program is using two types of data collection methods: Upstream from the switches themselves (raw feeds) and downstream from the various providers (contextual feeds).
Mobile data collection is the new growth area. People are already walking sensor platforms; every mobile phone generates a significant data exhaust.
Monitoring a target’s communication — This slide shows how the bulk of the world’s electronic communications move through companies based in the United States. Most of the data goes through bulk taps in switches at AT&T and Verizon, making it relatively easy to capture.
Providers and data — the PRISM program collects and ingests a wide range of data from the nine companies, although the details vary by provider. One of the NSA’s research projects aims to forecast, on the basis of telephone data and Twitter and Facebook posts, when uprisings, social protests and other events will occur. The agency is also researching new methods of analysis for surveillance video, in the hope of recognizing conspicuous behavior before terrorist attacks are committed.
Participating providers — This slide shows when each company joined the program, with Microsoft being the first, on Sept. 11, 2007, and Apple the most recent, in October 2012.
Apparently the data is extracted, transferred and loaded into servers at the Utah Data Center in Bluffdale (shown below). According to Der Spiegel, there is enough capacity to store a yottabyte of data… large enough to store all the electronic communications of all of humanity for the next 100 years. It will be interesting to compare this data center to the ones Amazon Web Services is continuously building out. I am willing to bet that AWS is bigger.
Why do you need to store everything? Ira Hunt, CTO for the Central Intelligence Agency, said in a speech at the GigaOM Structure: Data conference that “The value of any piece of information is only known when you can connect it with something else that arrives at a future point in time.”
The Terrorism Detection Use Case
Here is a use case showing how PRISM can be used to unearth patterns. It is an enhanced version of one presented in the BusinessWeek article referenced below.
“In October, a foreign national named Joe Jackal does a Google search and purchases a one-way plane ticket from Cairo to Miami, where he rents a condo. Over the previous few weeks, he’d made a number of large withdrawals from ATMs linked to a Russian bank account and placed repeated calls to a few people in Syria. More recently, he rented a truck, drove to Orlando, and visited Walt Disney World by himself. As numerous security videos indicate, he spent his day taking pictures of crowded plazas and gate areas.
None of Jackal’s individual actions would raise suspicions. Lots of people rent trucks or have relations in Syria, and no doubt there are harmless eccentrics out in amusement parks taking pictures. Taken together, though, they suggested that Jackal was up to something. And yet, his pre-attack prep signature would have gone unnoticed. A CIA analyst might have flagged the plane ticket purchase; an FBI agent might have seen the bank transfers. But there was nothing to connect the two.
The day Jackal drives to Orlando, he gets a speeding ticket, which triggers an alert in the PRISM system. An analyst types Jackal’s name into a search box and up pops a wealth of information pulled from every database at the government’s disposal. There’s fingerprint and DNA evidence for Jackal gathered by a CIA operative in Cairo; video of him going to an ATM in Miami; shots of his rental truck’s license plate at a tollbooth; phone records; and a map pinpointing his movements across the globe.
As the CIA analyst starts poking around on Jackal’s file, a picture emerges. A mouse click shows that Jackal has wired money to the people he had been calling in Syria. Another click brings up CIA field reports on the Syrians and reveals they have been under investigation for suspicious behavior and meeting together every day over the past two weeks. Click: The Syrians bought plane tickets to Miami one day after receiving the money from Jackal. To aid even the dullest analyst, the software brings up a map that has a pulsing red light tracing the flow of money from Cairo and Syria to Jackal’s Miami condo. That provides local cops with the last piece of information they need to move in on their prey before he strikes.”
Data and Intelligence at the Extreme
What is fascinating about the NSA PRISM program is how they are able to push the envelope to the right. The figure below, from Informatica, illustrates the amazing progression we have made over the past five decades.
“Web -> Cloud -> Multi-cloud -> Social -> Internet of Things” – the intelligence capability has steadily progressed to keep up with the underlying data availability.
The Skillset, Toolset and Dataset behind PRISM
I am extrapolating from multiple sources, but PRISM has to do several things:
- extract, transfer and ingest disparate data sources, providing common views of unified data;
- conduct relational, temporal, geospatial, statistical, and network analysis in one unified analytical framework (potentially using a federated model, as no single tool can do everything);
- identify non-obvious relationships or connections in the data, supporting visualization and exploratory visual analysis;
- share investigations and analytic insights/discoveries in a secure broadcast environment to enable situational awareness and collective understanding.
The goal is to enable analysts to conduct rich, iterative, cross-channel investigations that span many large datasets of different formats originating from various internal or external sources. To enable this you need indexing and hypothesis-testing capabilities.
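The indexing capability mentioned above is, at its core, an inverted index: each term (a phone number, a place, a name) maps to the documents that mention it, so an analyst's query becomes an intersection of posting lists. A minimal sketch, with made-up document IDs and contents:

```python
# Illustrative inverted index: term -> set of document IDs containing it.
# Document IDs and text are invented for the example.

def build_index(documents):
    index = {}
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

def query(index, *terms):
    """Documents containing every term (AND semantics)."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    "fbi-021": "wire transfer cairo miami",
    "cia-317": "phone intercept syria cairo",
    "dmv-884": "speeding ticket orlando",
}
index = build_index(docs)
print(query(index, "cairo"))          # {'fbi-021', 'cia-317'}
print(query(index, "cairo", "wire"))  # {'fbi-021'}
```

Real systems add ranking, tokenization, and access controls on top, but the cross-dataset lookup that lets an analyst "type a name into a search box" reduces to this structure.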
Indexing…. Hadoop on steroids…. According to InformationWeek, the centerpiece of the NSA’s data-processing capability is Accumulo, a highly distributed, massively parallel key/value store capable of analyzing structured and unstructured data. Accumulo is based on Google’s BigTable data model, but the NSA added a cell-level security feature that makes it possible to set access controls on individual bits of data. Without that capability, valuable information might remain out of reach to intelligence analysts, who would otherwise have to wait for sanitized data sets scrubbed of personally identifiable information.
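The cell-level security idea is easy to illustrate: every cell carries a visibility expression, and a scan returns only the cells whose expression is satisfied by the reader's authorizations. The sketch below uses a deliberately simplified expression syntax (a single `&` or `|` level) as a stand-in for Accumulo's real ColumnVisibility parser, and all table contents are invented:

```python
# Toy model of Accumulo-style cell-level security. Expression syntax here is
# a simplification: "a&b" requires both labels, "a|b" requires either.

def visible(expression, authorizations):
    if "&" in expression:
        return all(tok in authorizations for tok in expression.split("&"))
    if "|" in expression:
        return any(tok in authorizations for tok in expression.split("|"))
    return expression in authorizations

def scan(table, authorizations):
    """Return only the cells the reader is cleared to see."""
    return {key: value for key, (value, vis) in table.items()
            if visible(vis, authorizations)}

table = {
    ("row1", "phone"): ("555-0100", "analyst"),
    ("row1", "name"):  ("J. Jackal", "analyst&pii"),
    ("row1", "loc"):   ("Miami",     "analyst|field"),
}
print(scan(table, {"analyst"}))
# the "analyst&pii" cell is filtered out for this reader
```

The payoff described in the article falls out directly: the same table serves both a reader with `{"analyst"}` and one with `{"analyst", "pii"}`, with no separate sanitized copy.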
Slicing and dicing… hypothesis testing…. Once ingested into and/or connected to the PRISM framework, data is quickly accessible to analysts in a rich data model that contains metadata plus temporal, statistical, geospatial, and relational-behavioral information.
According to an NSA presentation at a Carnegie Mellon technical conference, graph search in particular is a powerful tool for investigation. The agency gave an in-depth presentation about the 4.4-trillion-node graph database it is running on top of Accumulo. Nodes are essentially bits of information — phone numbers, numbers called, locations — and the relationships between those nodes are edges. NSA’s graph uncovered 70.4 trillion edges among those 4.4 trillion nodes. That’s an ocean of information, but just as Facebook’s graph database can help you track down a long-lost high school classmate within seconds, security-oriented graph databases can help spot threats.
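The nodes-and-edges search described above is, at small scale, just a breadth-first traversal: entities are nodes, observed links (calls, transfers) are edges, and the query "how is A connected to C?" is a shortest-path search. A toy sketch, with invented entity labels:

```python
from collections import deque

# BFS over an entity graph: find the shortest chain of observed links
# connecting two entities. Entities and edges here are made up.

def shortest_chain(edges, start, goal):
    graph = {}
    for a, b in edges:                 # build an undirected adjacency map
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None                        # no connection found

edges = [("phone:A", "phone:B"),      # A called B
         ("phone:B", "acct:RU-77"),   # B is linked to a bank account
         ("acct:RU-77", "phone:C")]   # the account funded C
print(shortest_chain(edges, "phone:A", "phone:C"))
# ['phone:A', 'phone:B', 'acct:RU-77', 'phone:C']
```

At 4.4 trillion nodes the interesting engineering is in distributing this traversal across Accumulo tablets, but the analyst-facing question is the same shape.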
The underlying architecture probably looks something like this… (again, extrapolated from documentation available on the Web from Palantir, a company funded by In-Q-Tel, the intelligence community’s venture fund).
Commercial Impact of PRISM – Predictive Search Industry
Predictive Search algorithms are at the core of the PRISM program. This is increasingly bleeding into commercial products where similar concepts are being leveraged.
A range of start-ups – Cue, reQall, Donna, Tempo AI, MindMeld and Evernote – and big companies like Apple and Google are working on what is known as predictive search or augmented reality — new tools that act as personal valets, anticipating what you need before you ask for it.
Google, for instance, is continuously changing the landscape of search with predictive analytics.
Google launched the practice of predictive search back in 2004 with Google Suggest, which was renamed Google Autocomplete in 2010. In 2010, Google Instant also came on the scene, generating search results instantly as users type. Google’s Knowledge Graph, launched in 2012, further enhances predictive search by predicting what type of information a user is searching for when they search for a celebrity name like “Brad Pitt” and generating specific related content right alongside normal search results.
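At its simplest, autocomplete-style prediction is prefix matching against a log of past queries, ranked by frequency. The sketch below is a bare-bones model of that idea, not Google's actual ranking (which layers personalization, freshness, and many other signals on top); the query log is invented:

```python
from collections import Counter

# Minimal autocomplete: complete a typed prefix from past queries,
# most frequent first. Query log is fabricated for the example.

def build_suggester(query_log):
    counts = Counter(query_log)
    def suggest(prefix, k=3):
        matches = [q for q in counts if q.startswith(prefix)]
        return sorted(matches, key=lambda q: -counts[q])[:k]
    return suggest

suggest = build_suggester([
    "brad pitt movies", "brad pitt movies", "brad pitt age",
    "braised beef recipe",
])
print(suggest("brad"))  # ['brad pitt movies', 'brad pitt age']
```

The jump from this to Google Instant is serving the top completion's results before the user finishes typing; the data structure behind the suggestions is the same.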
Google Now is the next generation of predictive search, serving as a valet or personalized assistant that can predict your needs, wants, and deep desires. This is basically taking multiple buckets of data and intelligently connecting them to facilitate decisions… everyday data-supported decision making. For some, Google Now delivers important information about the traffic on your morning commute, your updated flight itinerary, and the results of last night’s hockey game on your phone, without you even asking.
How does Google Now work? In order to provide relevant contextual info that relates to you and only you, Google uses your private data, accessing your location, Gmail, daily calendar, and other info in order to keep tabs on things like appointments, flight reservations and hotel bookings, or to auto-suggest restaurants from the Zagat guide for dinner.
Google Now is evolving and forms a key foundational element of Google Glass. For instance, you are running through the airport wearing Google Glass, and it uses its predictive powers to push a gate change or flight delay alert the moment you (the resident Glasshole) arrive at the terminal.
Having Android on every smartphone allows Google to do extremely creative things, enabling more and more of the augmented reality revolution going forward. Google is also in a unique position to know what information people are most interested in seeing, and when they want it, based on the giant volume of Web searches its search engine processes daily. The different cloud services it enables create a web of rich data matched by few other firms. Facebook and Apple might be the closest in terms of knowledge about you. It’s amazing how Microsoft dropped the ball on predictive search enabled services.
Bottom line… similar to GPS technology, the Internet, and robotics, NSA-, DoD- and CIA-funded work does change our lives.
A fuller picture of the exact operation of PRISM will emerge in the coming weeks and months. Stay tuned as I explore what PRISM is – and, crucially, isn’t. I am really curious about the architecture and techniques being used to extract patterns.
Notes and References
- PRISM stands for “Planning Tool for Resource Integration, Synchronization, and Management”
- PRISM is not the only big data analytics program out there. Recently, the Guardian released details of another N.S.A. data-mining program, called Boundless Informant. This data-mining tool appears to record and analyze where intelligence comes from; it can show on a map the amount of intelligence the N.S.A. collects from every country in the world.
- According to the Guardian, in March 2013, the N.S.A. collected 97 billion pieces of intelligence; over a separate 30 day period ending in March, the agency collected almost 3 billion pieces of intelligence from within the United States.
- GigaOM cited a report from Federal Computer Week which said the Central Intelligence Agency has contracted Amazon Web Services to build a private cloud. Neither the CIA nor Amazon has confirmed the deal, which the report said was worth $600 million over 10 years. So if the CIA truly picks Amazon to build its private cloud, that would give IBM and other private cloud vendors like Microsoft, HP, VMware and Citrix, a formidable new competitor in the Federal cloud spend.
- NSA has shared Accumulo with the Apache Foundation, and the technology has since been commercialized by Sqrrl, a startup launched by six former NSA employees. Sqrrl has supplemented the Accumulo technology with analytical tools including SQL interfaces, statistical analytics interfaces, text search and graph search engines.
- Informationweek source: http://www.informationweek.com/big-data/news/big-data-analytics/defending-nsa-prisms-big-data-tools/240156388
- NSA Graph http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
- Very interesting article on how Graph Analysis actually works in simple language. http://www.businessweek.com/magazine/palantir-the-vanguard-of-cyberterror-security-11222011.html
- Google provided some clarification on how it transmits FISA information: by hand (tapes), or over secure FTP. Google claims that it does not participate in any government program involving a lockbox or other equipment installed at its facilities to transfer data to the government. I find it hard to believe that petabytes of content can be effectively transferred and ingested via tapes or secure FTP.
- PRISM – US Gov. mining data from Google, Yahoo, MSN, Skype, YouTube, and Facebook (washingtonpost.com)
- ‘Boundless Informant’ Is a Secret NSA Tool to Data-Mine the World (mashable.com)
- Why the NSA Prism Program Could Kill U.S. Tech Companies (popularmechanics.com)
- Will Palantir be the next Silicon Valley company linked to NSA’s PRISM program? (bizjournals.com)