NSA Boundless Informant explicated - for moar efficient flat databases of all yr phone records
SOURCE: Cryptome.org : http://cryptome.org/2013/11/nsa-boundless-informant-explicated.htm
25 November 2013
NSA BOUNDLESS INFORMANT Explicated
Date: Mon, 25 Nov 2013 15:37:33 -0800 (PST)
Subject: A very interesting forum post on electrospaces
This was written from a person who purports to actually use the Boundless Informant tool. The email address is fake of course, but it sounds both knowledgeable and credible.
If the source is genuine, it provides considerable insight into the use and capabilities of the tool. It seems to do a lot more than we've seen so far, including the ability to see individual call detail records.
It also gives us clues to how mobile interception is accomplished.
Anonymous jbond@MI5.mil.gov.uk said...
I'm seeing a great deal of confusion out there about NSA databases and how reports are generated from their architecture. Here is how it works:
Let's begin with rows and columns making up a matrix, variously called a table, array, grid, flatfile database, or spreadsheet. In the database world, rows are called records, columns are called fields, and the individual boxes specified by row and column coordinates -- which hold the actual data -- are called cells.
For cell phone metadata, each call generates one record. NSA currently collects 13 fields for that call, such as To, From, IMEI, IMSI, Time, Location, CountryOrigin, Packet etc etc, primarily from small Boeing DRTBOXs placed on or near cell towers.
Because metadata from a single call can be intercepted multiple times along its path, generating duplicative records, NSA runs an ingest filtering tool to reduce redundancy, which is possible but not trivial because metadata acquisitions may not be entirely identical (eg timing). After this refinement, one call = one metadata record = one row x 13 columns in the BOUNDLESS INFORMANT's matrix.
Cell phone metadata is structured, unlike content (he said she said). However, as collected from various provider SIGADs, it is not cleanly or consistently structured -- see the messy example at wikipedia IMSI. So another refinement is needed: NSA programmers write many small extractors to get the metadata out of its various native protocols into the uniformly formatted taut database fields that it wants.
After all this, for a hundred calls, a metadata database such as BOUNDLESS INFORMANT consists of 100 records and 13 fields so 100 x 13 = 1300 cells. A counting field (all 1's) and consecutive serial numbers (indexing field) for each record may be added to facilitate report generation and linkage to other databases, see below.
-1- The first point of confusion is between BOUNDLESS INFORMANT as a flatfile database (we've never seen a single row, column or cell of it) and the one-page summary reports that can be generated using BOUNDLESS INFORMANT as the driving database (eg, the Norway slide).
These BOUNDLESS INFORMANT reports give the number of records (rows) in the table after various filters have been applied (eg country, 1EF = one end foreign, specified month, DNR type, intercept technology used, legal authority cited FISA vs FAA vs EO 12333).
BOUNDLESS INFORMANT does NOT report the number of cells nor gigabytes of storage taken up. It easily could, but it doesn't. Instead, it reports the main object of interest: the number of calls, after some filtering scheme has been applied.
-2- The second point of confusion arises over database viewing options. Myself, I like scrolling down row after row, page after page, plain black text in 8 pt courier font, lots of records per screen, thin lines separating cells, no html tables. A lot of people don't.
So a cottage industry has evolved around generating pretty monitor displays, web pages, and ppts from databases; these typically display one record per screen. All database views are equivalent: given a presentation, you can recover the database; given the database, you can make the pretty user interface.
Views are dressed up injecting the data fields into a fixed but fancy template (eg dept of motor vehicles putting your picture field into an antique wood frame and your name field into drop-shadow text). Nothing but a warmed-over version of spewing out form letters by mail-merging an address database into a letter template.
We've not seen *any* view of BOUNDLESS INFORMANT records to date, only summary reports it has generated. You cannot recover the underlying database from a few summary reports, only information about the number of records and a few of the 13 fields.
November 25, 2013 at 2:34 PM
Anonymous jbond@MI5.mil.gov.uk said...
-3- The third point of confusion: a given database like BOUNDLESS INFORMANT is capable of self-generating many summary reports about itself. Summary reports can have views too -- injections into templates. We've seen 3 of them for BOUNDLESS INFORMANT, Aggregate, DNI and DNR.
Databases can be sorted, according to the values in any column. For example, if NSA sorted by IMSI, that would pull together all the call records made from a particular cell phone with that id. Using the counting field, allowing the activity of each phone to be tallied. Or they could sort to pull up the least active phones-- to identify the user who tosses her 'burner' phones in the trash after one use.
Databases can be restricted. If NSA wanted to count the number of distinct cell phone calls during a given month that originated in Norway and terminated abroad (1EF one end foreign), it can restrict the records to the relevant time and location fields, masking out the others. They could compress each cell phone to a single line and count rows to get summary data on the number of phones doing 1EF. That summary data could be injected into a template for a BOUNDLESS INFORMANT slide.
Databases can be queried (tasked) to pull out only those records satisfying some string of selector logic. For example, you could submit a FOIA request to NSA in the form of a query that consisted of your selectors and a database like BOUNDLESS INFORMANT to see what call metadata they have on you in storage.
Here you would be wise to request simple output (rows of plain text with column values separated by commas,CSV format), to keep file size down. Then you could make your own mail-merge templates and spew out colorful BOUNDLESS INFORMANT graphs and reports about yourself, or just use the default templates provided by Excel.
November 25, 2013 at 2:36 PM
Anonymous jbond@MI5.gov.uk said...
-4- Next up on confusion, relational databases. NSA maintains hundreds of separate flatfile databases that might however share a field or two in common, for example someone texting, google searching, or shopping as well as making phone calls with with a given phone, the number or IMSI being the common field.
Those other activities involve different fields from those already in BOUNDLESS INFORMANT, such as your login to eBay or search term text instead of email subject line.
It could all be put into BOUNDLESS INFORMANT by expanding the number of fields. However this doesn't scale very well : it results in the voice call fields being massively blank for an IMSI making lots of google searches, creating a huge sparse table that is very slow to process, wasting analysts time (called high latency by NSA).
Instead, BOUNDLESS INFORMANT will just link to all the other databases which share a field. And those in turn could link to other simple databases sharing some other field that BOUNDLESS INFORMANT might lack. And so on -- it's how all the little constituent databases can be seamlessly integrated..
A query now calls through to this whole federation of linked databases, which can reside geographically anywhere on the Five Eyes network (though NSA is moving to one stop shopping from their Bluffdale cloud to improve security and reduce latency).
The primary provider of relational database software of this complexity is Oracle. However you can do about all of it free and friendly with open source MySQL. The Q is for querying -- what NSA calls tasking -- sending off some long-winded boolean logic string of field selector values and constituent databases that does the filtering you want.
The result of the query is a new little database, usually temporary, that you can use to generate fancy views and summary reports. The databases being updated continuously and storage retention varying, the same query tomorrow will give a slightly different outcome.
Your all-about-me FOIA request could be formulated in MySQL (first need to know names of linked databases) and surprisingly, the query string would be recognized and fulfilled by Oracle or whatever big relational database NSA ended up using/developing, it's that standardized.
If you're online or call a lot, that could still be a big file given 12 agencies keeping tabs, notably NSA, Homeland Security, and FBI's DITU. But if you wrote the query right, it would only take a small data center in the garage to host the response.
November 25, 2013 at 2:37 PM