Data Mining Is a Risky Business

Servers. Photograph by Torkild Retvedt

In the last few weeks a vigorous discussion on the legality of the NSA’s data mining efforts in the war on terror has raged. The revelations by former NSA contractor turned whistleblower Edward Snowden have concerned American citizens and governments around the world. No one had grasped the size and scope of US intelligence-gathering.

Most of the debate has revolved around the Fourth Amendment and privacy considerations. There is one component that should be more thoroughly discussed, however. Is data mining safe to use in a national security context?

Data mining has weaknesses. It becomes more accurate as the data filters become finer, but a base level of inaccuracy remains in the raw results.

According to public reports, the NSA captures a great deal of American phone record metadata and the world’s internet communications. It then applies a variety of filters to determine whether a packet of information originated within the United States (where monitoring it is illegal) or outside (where it is not). If analysts are 51 percent sure that the information originated outside the US, they can proceed. But 51 percent is barely better than a coin toss: of the traffic that only just clears that bar, nearly half could be domestic.
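To make the arithmetic concrete, here is a minimal sketch in Python. It assumes each packet carries a score that is a well-calibrated probability of being foreign; the function name, the scores and that calibration assumption are all illustrative, not a description of how the NSA's filters actually work.

```python
def summarize_foreignness(scores, threshold=0.51):
    """Count packets that clear the threshold and estimate how many are domestic.

    `scores` are hypothetical probabilities that each packet is foreign.
    The estimate is only meaningful if those probabilities are calibrated.
    """
    passed = [p for p in scores if p >= threshold]
    # A packet that is 51 percent likely foreign is 49 percent likely domestic.
    expected_domestic = sum(1.0 - p for p in passed)
    return len(passed), expected_domestic


# Made-up scores for five packets; not real data.
scores = [0.51, 0.55, 0.60, 0.95, 0.30]
passed, expected_domestic = summarize_foreignness(scores)
print(f"{passed} packets pass; about {expected_domestic:.2f} of them are expected to be domestic")
```

The lower the bar, the more of the traffic that clears it is expected to be domestic.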

“Foreignness” is binary. You are either inside the territorial United States or you are not. Due to the complexities of how communications traffic is routed from server to server to server, some packets of information have a very high degree of location certainty and some have a very low degree. The point is that the NSA has set the bar quite low. Technology can never give us 100 percent, or even 80 percent, certainty that only people outside the US are being monitored.

And what if you want to know whether a person is a terrorist? One man’s terrorist may be another man’s nuisance. In the United Kingdom terrorist acts are those which are “designed seriously to interfere with or seriously to disrupt an electronic system.” Although spammers are life forms somewhere between cockroaches and skunks, they are not terrorists. But under UK law, they very well could be.

At some point, additional filtering is done based on content. For computers, analyzing human speech and making decisions is still problematic. Anyone with an iPhone knows full well how accurate Siri is.

Beyond problems with speech recognition, there are nuances and context that are simply unavailable to a computer (and, in some cases, to a human examiner). For instance, in almost every marriage a frustrated spouse has declared “I’m going to kill you!” In very, very few cases is this meant literally. But a computer cannot discern whether these words should be taken literally.
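A toy keyword filter shows the problem. This is a sketch only: the watch list and the function name are invented for illustration, and real content filters are not public and are surely more elaborate.

```python
# Hypothetical watch list, invented for this example.
WATCH_WORDS = {"kill", "bomb"}


def is_flagged(message, watch_words=WATCH_WORDS):
    """Return True if any word in the message is on the watch list."""
    # Strip common punctuation and quotation marks, then compare in lower case.
    words = {w.strip(".,!?\"'“”’").lower() for w in message.split()}
    return not words.isdisjoint(watch_words)


print(is_flagged("I’m going to kill you!"))  # the frustrated spouse
print(is_flagged("See you at dinner."))
```

The filter sees only the word, not the marriage around it, so the frustrated spouse is flagged exactly as a genuine threat would be.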

Because of these complications, suspicious communications need to be examined by a human analyst. That is why thousands of innocent domestic communications have been monitored by the NSA. In fact, the incidence of privacy violations has actually increased because of Big Data.

The government claims that terrorist acts have been detected with data-mining techniques. If true, this is certainly good news. But how many innocent people had their privacy violated in order to achieve this goal? The US government has not released this figure.
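In the absence of official figures, a back-of-the-envelope calculation suggests why the answer matters. Every number below is a made-up assumption chosen for illustration (population size, number of real targets, detection rate and false-alarm rate); none comes from the government or from any report.

```python
def screening_outcome(population, targets, detection_rate, false_alarm_rate):
    """Estimate alerts raised when a whole population is screened.

    All inputs are hypothetical. Returns the expected number of true alerts,
    the expected number of false alerts, and the share of alerts that are real.
    """
    innocents = population - targets
    true_alerts = targets * detection_rate
    false_alerts = innocents * false_alarm_rate
    share_real = true_alerts / (true_alerts + false_alerts)
    return true_alerts, false_alerts, share_real


# Assumed: 1,000,000 people, 10 real targets, a filter that catches 99 percent
# of targets and wrongly flags 1 percent of innocent people.
true_alerts, false_alerts, share_real = screening_outcome(1_000_000, 10, 0.99, 0.01)
print(f"true alerts: {true_alerts:.1f}, false alerts: {false_alerts:.1f}, "
      f"share of alerts that are real: {share_real:.4%}")
```

With these invented inputs, roughly 10,000 innocent people are flagged for about 10 real targets. Even a filter that sounds very accurate produces mostly false alarms when real targets are rare.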

Here’s an analogy. If the government monitored the bedrooms of every married couple it would detect many cases of domestic violence. At the same time, it would also monitor far more cases of spouses in intimate relations. Couples who were “collateral damage” in this hypothetical campaign against domestic violence would have a right to be less than pleased.

What happens when a government makes a decision based on information gleaned from data-mining? Will we see innocent people on no-fly lists? False arrests? Inability to get jobs in sensitive areas like finance or law enforcement?

Furthermore, what if the bad guys manipulate the algorithms? In the intelligence world, there is a sub-discipline of counter-intelligence. Part of this involves the use of deception so that an adversary will believe things that are not true and will make bad decisions based on bad information.

Data-mining may sound like an obscure technology. But we take advantage of highly sophisticated public data-mining technology every time we use a search engine. Search engines work by scanning the entire internet so that when a consumer wants to find web pages related to dinosaurs, they see only web pages related to dinosaurs.

The first item in a Google search on a prominent person will almost always be Wikipedia. That isn’t due to the authority of Wikipedia per se. But because so many people link to Wikipedia, data-mining algorithms give it a higher “confidence score”.
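A deliberately naive sketch of that idea: score each page by the number of distinct pages that link to it. Real search engines use far more elaborate and unpublished signals; the function name and the example sites here are hypothetical.

```python
from collections import Counter


def inbound_link_scores(links):
    """Score each target by how many distinct sources link to it.

    `links` is an iterable of (source, target) pairs. Self-links and
    repeated links from the same source are ignored.
    """
    distinct = {(source, target) for source, target in links if source != target}
    return Counter(target for _, target in distinct)


# Made-up link data for illustration.
links = [
    ("blog-a.example", "wikipedia.example"),
    ("blog-b.example", "wikipedia.example"),
    ("news.example", "wikipedia.example"),
    ("blog-a.example", "fansite.example"),
    ("blog-a.example", "wikipedia.example"),  # duplicate, counted once
]
scores = inbound_link_scores(links)
print(scores.most_common())
```

The page with the most inbound links rises to the top regardless of whether its content is the most authoritative.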

Google has a name for search manipulation: it’s called advertising. If you search for “lawyers” it will display lawyers near where you live. Who rises to the top of the list? The firm with the best advertising agent. I should know: I run a firm that charges good money to ensure that our clients appear at the top of the first page of a search result.

An entire industry has developed around manipulating search engines: search engine optimization (SEO). Most SEO is legitimate marketing, or “white-hat SEO”. But its nasty little brother, “black-hat SEO”, uses similar techniques to infect computers with malicious software or to tempt web surfers into buying pornography.

In other words, an SEO expert can manipulate Big Data. Sure, the algorithms used by the NSA are secret but so are the ones created by Google (and Bing and Yahoo and all the others). The NSA whistleblower Edward Snowden has released enough information to give black hat SEO experts a head start in deceiving the NSA.

The NSA monitors metadata for phone calls (the source, the destination and the length of the call). But it isn’t difficult to poison that well. Hackers in the 1980s would routinely tap phone lines to make “free phone calls”. A device to do this is called a beige box, and designs are freely available on the internet.

The equivalent nowadays is cell phone “cloning”. This is far more difficult, but it is possible. A recent vulnerability may have exposed 750 million cell phone users to just this kind of attack. In other words, guys in black hats can log fake phone calls on your cell phone.

The US Drug Enforcement Administration has been using NSA data to track suspects. Could the black hats manipulate the phone records to prompt a SWAT raid on an innocent citizen? Yes, they could.

As for fake emails to direct intelligence interest at an innocent victim, this is a trivial task for a black-hat programmer (it’s called email spoofing). We see the results every day in the avalanche of spam in our inboxes. Even a novice computer user can spoof an email. Imagine what would happen if you were to send spoofed emails mentioning bombs, jihad and assassination in the name of a friend. A practical joke could land your friend (or, more likely, ex-friend) in the calaboose for a few days until the intelligence services were sure that he was innocent.
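The reason spoofing is so easy is that the “From” line of an email is ordinary text chosen by whoever composes the message. The sketch below, which sends nothing, builds a message with Python's standard `email` package and compares the claimed sender with a separately verified one. The helper names and the `verified_sender` value are assumptions for illustration; in practice that value would have to come from some authentication step outside this code.

```python
from email.message import EmailMessage
from email.utils import parseaddr


def claimed_sender(msg):
    """Return the address that the message's From header claims, lower-cased."""
    _, address = parseaddr(str(msg["From"]))
    return address.lower()


def from_header_matches(msg, verified_sender):
    """Compare the claimed sender with a (hypothetical) verified sender."""
    return claimed_sender(msg) == verified_sender.lower()


# The composer of a message can put any name and address in this header.
msg = EmailMessage()
msg["From"] = "A Friend <friend@example.com>"
msg["To"] = "someone@example.org"
msg.set_content("Placeholder body.")

print(claimed_sender(msg))
print(from_header_matches(msg, "prankster@example.net"))
```

Any analysis that trusts the header alone will attribute the message to the friend, not to the person who actually sent it.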

Similarly, why couldn’t SEO experts working for terrorists or hostile governments manipulate Big Data so that intelligence analysts waste time on wild goose chases or create useless blacklists?

Data mining is a useful tool for business but it is still a technology in its infancy. There are real risks of inaccuracy, manipulation and violation of privacy. There needs to be a well-informed examination and discussion of the risks of acting on corrupted data.

Written by John Bambenek
John Bambenek is a computer security expert from Champaign, Illinois. He is President of Bambenek Consulting, a cybersecurity firm, and a visiting lecturer in the Department of Computer Science at the University of Illinois at Urbana-Champaign. He can be reached at jcb@bambenekconsulting.com.