Signal To Noise In Real Life

A fundamental problem in life is “What intellectual tool can allow me to decide X?”.  Sadly for the advance of civilization, the answer is often not just “No. There is no such tool”, an answer that allows hope.  Rather, it is “Hell No! There can be no such tool, from the nature of nature, the fundamentals of reality as we can map it so far, we think for all time, but you never know”.

In the analog world, Signal-to-Noise ratio is the ratio of the signal power in a channel to the noise power.  It is easy to define if the signal is a sine wave of a certain frequency, less easy as the signal gets more complex, but it is always the ratio of signal power inside the envelope to noise power.  All of that is engineering from the 1930s and 1940s.
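A minimal sketch of that definition for the easy case, a sampled sine wave.  The helper names, the sample count and the noise power are my assumptions, picked only to make the numbers round; engineers usually quote the ratio in decibels, which is what `snr_db` returns.

```python
import math

def power(samples):
    """Mean power of a sampled signal: the average squared amplitude."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(signal_power / noise_power)

# A unit-amplitude sine sampled over whole periods has mean power 1/2.
n = 1000
sine = [math.sin(2 * math.pi * 5 * k / n) for k in range(n)]
signal_power = power(sine)

# Assumed noise power, chosen only for illustration.
noise_power = 0.005
ratio_db = snr_db(signal_power, noise_power)
print(f"signal power = {signal_power:.3f}, SNR = {ratio_db:.1f} dB")
```

With these assumed numbers the signal carries 100 times the noise power, i.e. 20 dB.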

More recently, S/N was extended to digital signals, also subject to solid engineering analysis, with character sets optimized for the electronics driving the channel, and with error detection and correction.  That was the engineering challenge for the 1960s and 1970s, and underlies all modern communications.

From there, S/N is stretched to discussions of other streams of information.  It is vague* in those areas, but ‘accuracy of meaning in context’ is hand-wavingly close, and that is good enough to analyze the possibility of some proposals.

My analysis suggests that NSA has not done its engineering wrt S/N in intelligence systems***.  NSA’s problem is that the signal in their data is a probability that, for instance, a particular individual’s database entry indicates that the individual is a jihadi, or a jihadi-supporter, or any of dozens of other categories of persons to be watched at some greater level of detail.

Consider that data collection system and the calculations based on the data.

Data comes from NSA’s world-wide copying of internet data.  The tuple (source address, destination address, port, protocol) is saved and collated by source ID and again by destination ID, then entered into an ‘IP database’ keyed by those IDs.  Telephony metadata is the tuple (caller phone #, callee phone #, call time, end time, cell tower and other such info).  These tuples are put into a ‘phone metadata database’ with caller indexes and callee indexes, so every call you ever made and every call you ever received are listed**.  The main database ‘records’ all contain the individual’s name, social security number, address, phone numbers, all of the personal information NSA buys from data brokers, and IP addresses.
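A toy sketch of the ‘caller index and callee index’ idea.  Everything here is hypothetical: the `CallRecord` fields simply mirror the tuple described above, and the phone numbers and tower names are made up.  It says nothing about how any real system is built.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class CallRecord:
    caller: str
    callee: str
    start: int       # call start, in seconds since an arbitrary epoch
    end: int         # call end, same units
    cell_tower: str

def build_indexes(records):
    """Index the same records twice: once by caller, once by callee."""
    by_caller = defaultdict(list)
    by_callee = defaultdict(list)
    for r in records:
        by_caller[r.caller].append(r)
        by_callee[r.callee].append(r)
    return by_caller, by_callee

# Made-up records for illustration only.
records = [
    CallRecord("555-0100", "555-0101", 0, 60, "T1"),
    CallRecord("555-0101", "555-0102", 100, 130, "T2"),
    CallRecord("555-0100", "555-0102", 200, 500, "T1"),
]
by_caller, by_callee = build_indexes(records)
made = by_caller["555-0100"]      # every call this number made
received = by_callee["555-0102"]  # every call this number received
print(len(made), len(received))
```

The point of the double index is that either lookup is a single dictionary access, however many records there are.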

NSA’s additional data, extracted from its surveillance, for each individual will be lists of codes, the results of functions such as ‘has Muslim name?’ or ‘web visits to jihadi sites?’.  Suppose that the first version of the calculation said that ‘yes’ returned from ‘Muslim name’ alone would warrant having a human look at the individual.  That simple algorithm returns far too many suspects, nearly all ‘false positives’, nearly every Muslim who is in the database.

Combining that (with an ‘and’ function) with ‘visits jihadi sites’ is an improvement, but still identifies far too many false positives because there are many reasons to do that and the definition of ‘jihadi’ site cannot be clear.  Anding the true/false values of ‘is male’, ‘age between 15 and 30’, … will narrow the field, but also miss the young women who are turning jihadi.

And so it goes: you can identify all possible suspects, but will never have the resources to check them all.  Or you can zero in on a group, still have far too many false positives, and begin to generate false negatives.
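The trade-off can be sketched with a toy population.  All of the counts below are invented for illustration (they are not estimates of anything), and the field names are hypothetical stand-ins for the yes/no functions described above.

```python
from collections import namedtuple

Person = namedtuple("Person", "name_match site_visits male age_15_30 is_target")

# (count, profile) groups; every number here is made up.
GROUPS = [
    (9000, Person(False, False, True, False, False)),
    (900,  Person(True, False, True, True, False)),
    (50,   Person(True, True, True, True, False)),
    (40,   Person(True, True, False, True, False)),
    (6,    Person(True, True, True, True, True)),
    (3,    Person(True, True, False, True, True)),    # lost once 'male' is required
    (1,    Person(False, False, False, False, True)),  # matches no rule at all
]
population = [p for count, p in GROUPS for _ in range(count)]

def evaluate(rule):
    """Count flagged people, true/false positives and false negatives for a rule."""
    flagged = [p for p in population if rule(p)]
    true_pos = sum(p.is_target for p in flagged)
    false_pos = len(flagged) - true_pos
    false_neg = sum(p.is_target for p in population) - true_pos
    return {"flagged": len(flagged), "true_pos": true_pos,
            "false_pos": false_pos, "false_neg": false_neg}

rules = {
    "name": lambda p: p.name_match,
    "name AND sites": lambda p: p.name_match and p.site_visits,
    "name AND sites AND male AND age":
        lambda p: p.name_match and p.site_visits and p.male and p.age_15_30,
}
results = {label: evaluate(rule) for label, rule in rules.items()}
for label, r in results.items():
    print(f"{label:32s} {r}")
```

In this made-up population the first rule flags 999 people to find 9 of 10 targets, and the tightest rule flags 56 but now misses 4 of the 10.  Each added ‘and’ trims false positives and adds false negatives.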

Searching for people in categories is standard data mining tech used to decide questions such as “Who might be pregnant, given online searches?” or “What ad will most likely entice different segments of the population?”.  There is big reward for doing that better and researchers from many different areas have applied themselves to the problem.

So immediately, you know that the NSA can’t do the job they claim to be able to do : That woman avoided diaper ads.  Also, are the online ads you see what you would like to see?

Nobody gets rich simply using prior data in the stock market to predict future prices (or they wouldn’t sell ‘expertise’, they would get rich doing what they claim to be able to do, isn’t that obvious?).  That is why various frauds are so prevalent: predicting the future is very difficult.

Surely human beings are as difficult to predict as stocks wrt future behavior?  That NSA has NEVER predicted a terrorist attack tends to confirm this.****

Just like electronic signals : with weak signals, amplify the signal sufficiently, and noise starts making it through the filter.  That is why communications uses digital signals.  ‘Meaning’ is far more analog than digital, words have fuzzy edges in meaning.

There are so very many sources of noise in data about people and its interpretation.  People with Pakistani names often are not Pakistani, not Muslim, not any other characteristic associated with Pakistanis, for any of a very long list of good reasons.

The question now is, given that you lower the filter hurdles 10% to catch the next ?% of the terrorists, how much do you increase the false positive rate?  That depends on the distribution of individuals in the signal dimensions.
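A rough sketch of how that question might be answered once you assume a distribution.  Here I assume a single ‘suspicion score’ that is normally distributed in both groups, with the rare group shifted upward; the means, spreads and population sizes are all invented.  The only library call is `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

# Assumed, purely illustrative score distributions.
ordinary = NormalDist(mu=0.0, sigma=1.0)   # the general population
targets = NormalDist(mu=2.0, sigma=1.0)    # the rare group being sought
N_ORDINARY = 1_000_000
N_TARGETS = 100

def flagged_counts(threshold):
    """Expected (true positives, false positives) for scores above threshold."""
    true_pos = N_TARGETS * (1.0 - targets.cdf(threshold))
    false_pos = N_ORDINARY * (1.0 - ordinary.cdf(threshold))
    return true_pos, false_pos

strict = flagged_counts(3.0)
loose = flagged_counts(2.7)   # the hurdle lowered by 10%

for label, (tp, fp) in (("threshold 3.0", strict), ("threshold 2.7", loose)):
    print(f"{label}: {tp:6.1f} targets caught, {fp:8.1f} false positives")
```

Under these assumptions, lowering the hurdle by 10% catches roughly 8 more of the 100 targets and adds roughly 2,100 false positives.  Different distributions give different numbers, but when the sought group is rare and the scores overlap, the extra false positives swamp the extra hits.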

So consider something terribly obvious : visiting a jihadi site many times.  Who does this except jihadis, potential jihadis and jihadist supporters and potential supporters?  Even if I accepted that this was a step function, yes/no, how to define a jihadi site?  Jihadis go to many Muslim religious sites, as do many others for many other reasons, etc.  We know that the web is a long-tail medium, low cost to produce and low-cost to operate, and therefore there will be many sites linking to sites that link to jihadi sites and which jihadis visit.  That alone means there is a high overlap of ‘visited jihadi site’ with these people.

Worse, search engines account for about 15% of web visits, so people may be surprised: when looking for ‘63:3’, which is a convenient shorthand search for Psalm 63:3, you will also find 63:3 in the Koran and jihadi sites quoting it.  Google saves your search; NSA very likely does also.  One hopes there are no worse misinterpretations done by analysts looking at your risk.

Noise is the signal and vice versa.

So NSA can’t predict because of the nature of nature.  The reason they do it anyway is that it is much easier to gain power over people than over nature.

NSA’s problem is just another case of trying to predict the future from the past.  Past info is characteristics of all the suicide bombers.  But some one person was the first female, English-speaking, non-Muslim, <every exception you can think of>.  Even worse, the evolution of culture means the signals get ever subtler, even for some inconceivable-to-everyone craziness like becoming a suicide bomber.  But, of course, an entire Nation shared the religious belief that they must die for their Emperor when asked as recently as 1945, so the inconceivable often isn’t.

XKCD nailed this, of course.

“Predict the future from the past” problems are common in life, if we can recognize them.  Contracts are the same : every contract is a prediction about the future.  We can’t do that, and pretending that both sides can tie down a cover-all-contingencies contract is nuts.  Japanese do long-term alliances, develop trusting relationships in nights of drinking, write a general understanding and shake hands.  They work in a different world, where long-term employment allows that to work.  It could even be true that they spend as much on geisha and sake as we spend on lawyers, but their version is much more humane.*****

Corporate policies are another example.  T.J. Rodgers’s Cypress Semiconductor is tightly controlled.  They have a process for everything, reporting to know the exact state of every project, very good information via trip reports for every visit to a customer or supplier; by all standards a perfectly-managed company.  They have done consistently well, but have not conquered the world of semiconductors.

All of this, as well as signal-to-noise, is maps versus territory.  The world out there, the reality we all share, is the territory of fractal, very complex evolving open systems interacting chaotically.  Black Swans happen unexpectedly often, and unexpectedly often in unexpectedly large swarms.  The idea that you can tie down that extremely complex reality with meanings built on top of a mere 400,000 English words, most existing in dictionaries and not contemporary language, is ludicrous.  Reality is infinitely complex, so complex we can’t even be sure of what is important to learn next.  Our words are a mapping from that continuous-in-a-fractal-dimension extremely multi-dimensional world to associations of firing neurons in our mind.  Some of those map their own actions into a ‘mind’ and meanings.

Our language is our main tool for understanding things, math second.  Math has the same problem: Goedel’s Proof.  This blog has a number of examples of the problems of mapping math to reality.

Nate Silver’s Bayesian analysis of sports and politics is not at all the same problem, because the meaning of a soccer player’s running N meters or making Y attempts at a goal does not change from one season to another, at least not enough to matter season to season, and there are very many games to provide the data to be analyzed.  Even so, it isn’t easy :

The reviewer wanted to be told specifically how to make better predictions. That is a reasonable request and there is fortunately a straightforward answer.

In the field of statistical consulting, it is generally said that after your Ph.D. in statistics, you need about three years of study in a particular domain in order to become a competent statistical analyst in that domain. Analysts draw on their domain-specific knowledge to choose the most promising strategies to collect and analyze data in a particular situation. The deeper your knowledge of a field, the better your statistical analyses will be. It’s that simple.

There are books and articles intended to guide the statistics student to becoming a competent applied statistician. They especially work to drive home this point. Two good books are Gerald Hahn, _A Career in Statistics: Beyond the Numbers_ and David Hand, _The Statistical Consultant in Action_. A more technical and abstract discussion of this issue is in Peter Huber, _Data Analysis: What Can Be Learned From the Past 50 Years_.

*As an honestly honest member of the Honest Party who strives to be an intellectual hardass, I confess that there is only a handwaving analogy between S/N in signalling systems and the ‘meaning of information’ as I extrapolate here.  But the term ‘noise’ is widely used in information systems.  The problem is defining the signal, as ‘meaning’ is in the mind of the beholder, thus individual mental maps and not the territory of the reality that produced the information.  Noise must then be ‘any information degrading the signal’, recursively vague.

**NSA keeps it all, despite their disclaimers.  The US makes about 2.4B cell calls per day.  Extrapolate to 100B calls per day world-wide on all the phones.  Call Detail Records are not large (I haven’t worked with cell phone data and can’t find the definition, but 200 bytes per record is generous).  100B * 200 bytes = 20,000GB == 20TB per day.  Today, retail, I can buy a Seagate 4TB drive for $134, and the chassis, power supply and electronics for 10 drives for another $500.

Of course they keep it all forever.  You never know when mentioning a politician’s phone calls to the prostitute 20 years ago will be necessary to ensure your budget and power.

Added a day later : Thinking more about CDRs, there is one record for the beginning of the call and another at the end so the telco can bill for the time.  I think those are consolidated to save space, the result contains the beginning and duration, so those calculations are still ball-park correct and conservative.
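The back-of-envelope storage arithmetic in the ** footnote can be checked in a few lines.  The inputs are the figures quoted there; treating the 100B calls as a per-day figure and using decimal terabytes (10^12 bytes) are my assumptions, and the constant names are made up for this sketch.

```python
import math

CALLS_PER_DAY = 100_000_000_000   # footnote's world-wide extrapolation (assumed per day)
BYTES_PER_RECORD = 200            # the footnote's 'generous' record size
DRIVE_TB = 4                      # capacity of one retail drive
DRIVE_COST = 134                  # dollars per drive
CHASSIS_COST = 500                # dollars per 10-drive chassis with electronics
BYTES_PER_TB = 10**12             # decimal terabytes, as drive vendors count them

daily_tb = CALLS_PER_DAY * BYTES_PER_RECORD / BYTES_PER_TB
drives_per_day = math.ceil(daily_tb / DRIVE_TB)
# One drive plus one tenth of a chassis per drive slot.
cost_per_day = drives_per_day * (DRIVE_COST + CHASSIS_COST / 10)
cost_per_year = 365 * cost_per_day

print(f"{daily_tb:.0f} TB/day, {drives_per_day} drives/day, "
      f"${cost_per_day:,.0f}/day, ${cost_per_year:,.0f}/year")
```

That works out to about 20 TB and five drives per day, under $1,000 a day at retail prices, with no compression and no redundancy.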

***Of course they have.  NSA has many people far smarter than I am, and Google just as many.  If Google hasn’t solved the problem with its relatively clean data, how can NSA do so for much noisier data?

****NSA has reasons to NOT predict such events, of course.  Its budget is raised with every such event that it misses, because ‘collect it all’ has no limit.

***** It is a piss-poor society that leads so many of its young and ambitious into such a purgatory.  None of the people I know of who went into law like it at all.  We tried to tell them.

Added later.  This shows how the population of potential jihadis is changing, worse than my example.

