Where does bulk interception of data stop and mass surveillance start? And in a world of Big Data and algorithmic surveillance, is it even relevant to make such a distinction?
These questions took on new importance last week, following a ruling by the UK's Investigatory Powers Tribunal and the subsequent response from the UK government and its electronic spying outfit, GCHQ (see the details in this Guardian report). That response proposes that mass surveillance doesn't really happen (even if it may look a bit like it does), because all that is really going on is bulk interception of data, and that this is OK (and can therefore be allowed to continue).
One of the most disturbing revelations flowing from Edward Snowden's exposure of the Prism and Upstream digital surveillance operations is the extent to which the US and UK governments have been capturing and storing vast amounts of information, not just on possible terrorists or criminals, but on everyone. This happened in secret, and its exposure has eventually prompted a response from government: the assertion that this collection and storage doesn't constitute mass surveillance, but is instead "the bulk interception of data which is necessary to carry out targeted searches of data in pursuit of terrorist or criminal activity."
This is the needle-in-the-haystack argument - i.e. we need to process a certain amount of everyone's hay in order to find the terrorist needles hidden within it. This seems like a reasonable justification because it implies that the hay (i.e. the information about all of us) is a disposable asset, something to be got rid of in order to expose the needles. This is basically the way that surveillance has always operated. To introduce another analogy, it is a trawling operation that is not interested in the water that passes through the net, only in the fish it contains.
However, this justification falls down because it is not the way algorithmic surveillance works. Algorithmic surveillance works by using information derived from the hay to predict where the needles are going to be. In this instance the hay (i.e. the information about all of us) is the key asset. It is not disposable, which is presumably why we find it is not being disposed of. This subtle but fundamental difference matters enormously because it turns our entire notion of privacy and data protection on its head. Until the public and civil liberties groups are aware of this fact, there can be no assumption that our governments are proceeding with sufficient consent. It also renders their explanation irrelevant, unless it is supplemented by more detailed information on exactly what the algorithms are doing with the data they are fed.
We have a situation at the moment where the seemingly irrelevant data about our daily lives could be (and we therefore have to assume is) being used to attach a score to each of us, based on our predicted probability of being a criminal or a terrorist. This score could have almost nothing to do with the reality of our lives (as it would be determined by old-fashioned mass surveillance); rather, it is the activities of other people, of everyone, that provide the framework within which an individual's probability score is assigned. Indeed a process of mass surveillance, were it feasible, would probably only serve to set at zero the score of those who are not terrorists or criminals.
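To make that difference concrete, here is a deliberately crude sketch (in Python with scikit-learn, using entirely invented data and made-up variable names; it describes no real system) contrasting the two models. The traditional approach checks records against a target list and can then throw the rest away; the algorithmic approach keeps everyone's records precisely because they are the training material from which every person's score is produced.

```python
# Toy illustration only: invented data, not any real surveillance system.
# It contrasts "filter and discard" with "retain and score".
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Imaginary per-person features derived from intercepted metadata
# (e.g. call volume, number of foreign contacts) -- purely hypothetical.
features = rng.normal(size=(10_000, 2))                              # "the hay": data about everyone
known_flags = (features[:, 0] + features[:, 1] > 2.5).astype(int)    # a handful of known "needles"

# Traditional model: check each record against a target list, then discard the rest.
target_list = {42, 4711}
suspects = [i for i in range(len(features)) if i in target_list]
# everything not on the list could, in principle, be thrown away at this point

# Algorithmic model: everyone's data is the training set, and everyone gets a score.
model = LogisticRegression(max_iter=1000).fit(features, known_flags)
risk_scores = model.predict_proba(features)[:, 1]                    # a probability attached to every person
```

The detail of the model is beside the point; what matters is that in the second approach the bulk data is the raw material of the scoring, so it cannot be "disposed of" without losing the capability.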
The anonymity argument (i.e. the Google defence) also falls down for algorithmic surveillance. Saying that anonymity makes the process safe is a bit like saying that driving a car is 100 per cent safe, which it is right up until the moment you have a crash. You can remain anonymous right up until the moment that an algorithm assigns an identity to you. And an algorithm will assign you an identity based on 'anonymised' data gathered from other people. Your own identity doesn't actually mean anything. It may be based on facts rather than probabilities, but in an algorithmic world it simply has the status of personal opinion. The only identity that counts is the one the algorithm assigns to you.
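Again as a purely hypothetical sketch (invented data, no real system), the mechanism can be pictured like this: a classifier trained only on other people's 'anonymised' records will still hand you a label and a score, because they are derived from where your behaviour sits among theirs, not from anything you have declared about yourself.

```python
# Hypothetical sketch: the label you receive comes from other people's "anonymised" data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# "Anonymised" behavioural records for other people, plus the label someone gave them.
other_people = rng.normal(size=(500, 3))
their_labels = (other_people.sum(axis=1) > 1.5).astype(int)   # 1 = "of interest"

model = KNeighborsClassifier(n_neighbors=15).fit(other_people, their_labels)

# Your record carries no name, and no one has watched you individually,
# yet the model still assigns you an identity-in-effect: a label and a score.
your_record = rng.normal(size=(1, 3))
your_label = model.predict(your_record)[0]
your_score = model.predict_proba(your_record)[0, 1]
```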
It is quite possible, in fact likely, that the small group 'identified' with the highest terrorist probability scores will then form the basis for more detailed, old-fashioned surveillance. But in the process we have all been labelled, and we are able neither to see what our label is nor to challenge it. We will not have passed through the net; we will have been assigned a position within it. Nor indeed are we even aware that this process is happening, and even the algorithms themselves will not be able to explain how they came to their conclusions in specific instances. Neither are we aware of the implications that will flow from the scores attached to us. At this point, mass surveillance starts to look like the preferable option, because at least it can be understood, made transparent or exposed. With bulk interception of data, we simply have an impenetrable black box.
It is highly unlikely that the people charged with supervising our security services have any idea how algorithmic surveillance works. It is also highly unlikely that the people at GCHQ are going to enlighten them. But until such enlightenment takes place, we have to regard everything that GCHQ et al. do as being illegal, on the basis that it proceeds with insufficient consent. As last week's ruling shows, knowledge of a process can confer legitimacy upon it - and vice versa.
And, on the basis of such knowledge, we then have to work out how to set up a process of democratic oversight that starts with understanding which questions to ask - because the current questions we are asking are the wrong ones.