Monday, May 2, 2016

The Bigger Picture of Big Data

Data is everywhere: I wake up and check my disappointing sleep patterns; monitor my sluggish steps to the kitchen; before my morning French press I’ve generated at least 3 of the 4.5 billion-plus Facebook likes for that day; and with one click I’ve warned morning commuters about the wild turkey-caused traffic jam on Russell Blvd. In the modern world, information is plentiful, and, more importantly, predictively useful.

Bing Predicts has consistently predicted NFL winners based on information about team statistics, match-up dynamics, online searches and discussion, and even facts about weather conditions and turf type. What some might take to be trifling online discussion actually increases the accuracy of Bing Predictions by 5%. Prediction is useful in spring, too: we can now trace allergy zones based on tree-planting information, and soon you'll be able to plot your sprint from pollen in real time.

Forbes reported on the increasing amount of data in the world:
  • 40,000 Google search queries every second
  • By 2020, we can expect about 1.7 megabytes of new information to be created every second for every human being on the planet.
  • And this one: At the moment less than 0.5% of all data is ever analyzed and used.
Does more data mean reliable data? Modern trends in data analysis that deal with heavy volume, velocity, variety, scope, and resolution can be grouped under 'Big Data' (Kitchin 2014). Some views on Big Data suggest that small blunders can be tolerated because a comprehensive trend is generated in the process (Halevy et al. 2009; Mayer-Schönberger and Cukier 2013). Suppose you're using an app that reports real-time pollen spread through user clicks. As soon as a user experiences blaring sneezes or watery eyes, or observes pollen clouds, that user can click different options, warning others. The problem is that a small portion of the population is subject to the nocebo effect, where sheer expectation produces real physical symptoms (even allergic reactions), thus creating bits of false-positive data that mislead others about allergy trends. The proposed solution: with enough data, the "true" trend (signal) pierces through while the false positives disappear into the background noise. This is exactly what is behind aggregation effects, where averaging multiple individual guesses produces reliable results and errors cancel out (Yi et al. 2012).
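The aggregation idea can be sketched with a toy simulation. Everything here is hypothetical: the 0-10 pollen scale, the 5% nocebo rate, and the report distributions are assumptions for illustration, not data from any real app. The point is that individual reports are unreliable, but the average of many reports settles down as the sample grows.

```python
import random

random.seed(42)

# Hypothetical setup: users report pollen intensity on a 0-10 scale.
# Honest reports are noisy readings around the true level; a small
# "nocebo" fraction reports strong symptoms regardless of actual pollen.
TRUE_LEVEL = 2.0     # assumed true pollen intensity
NOCEBO_RATE = 0.05   # assumed fraction of false-positive reporters

def simulate_reports(n):
    """Generate n user reports: mostly honest noise, a few false positives."""
    reports = []
    for _ in range(n):
        if random.random() < NOCEBO_RATE:
            reports.append(random.uniform(7, 10))          # nocebo false positive
        else:
            reports.append(random.gauss(TRUE_LEVEL, 1.0))  # noisy honest report
    return reports

# Any single report is unreliable, but the mean stabilizes as n grows:
for n in (10, 100, 10_000):
    avg = sum(simulate_reports(n)) / n
    print(f"n={n:>6}: average report = {avg:.2f}")
```

Under these assumed numbers the large-n average lands near 2.3 rather than exactly 2.0: random noise does cancel out under aggregation, but a systematic false-positive source leaves a small residual bias. That residue is one reason the questions about error sources raised below still matter.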

One methodological concept driving the value of Big Data seems to be that data produces notable trends, even in the presence of small blunders caused by, e.g., the malfunction of individual data measures. (As I've mentioned in a previous post, multiple independent pieces of data are only useful if the data is generated by independent sources.) But there is more to Big Data methodology. Big Data methods aren't merely about analytically sitting back, generating a groundbreaking algorithm, and letting the significant relationships come to the surface. Rather, Big Data is a hands-on process that consists of data gathering, classification, and analysis. One of the notable features of the Big Data process is the role of what I call 'selection.' Since Big Data doesn't provide a comprehensive picture of a system (see Callebaut 2012), scientists have to figure out what to focus on and what to ignore at every step in the Big Data process. This means that it matters not only how much data we have, but also how we select at the data-gathering, classification, and analysis stages.

Broadly, I use 'selection' to refer to scientific engagement with limited system variables—whether at the sampling, instrumentation, or modeling level (see van Fraassen 2008). Selection in data gathering occurs when information about certain variables is not recorded. For example, Kaplan et al. (2014) describe how social and behavioral choices—e.g., the consumption of fruits and vegetables—are recorded only 10% of the time in electronic health records (EHRs), even though such choices are empirically linked to relevant medical conditions (343). Selection in data classification limits the type and number of categories used to sort the data. When filling out an online survey, is there an option for 'indifferent'? When filing an insurance claim, is there a category for (unprovoked) moose attack? Information can be lost when a classificatory system is missing categories or contains vague or ambiguous ones (e.g., 'inconclusive' as a category) (Berman 2013).
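The classification point can be made concrete with a minimal sketch. The survey options and category set below are invented for illustration: a scheme with no 'indifferent' category forces distinct responses into one vague catch-all bucket, and the lost distinction is unrecoverable from the classified data.

```python
# Hypothetical classification scheme: responses must be sorted into a
# fixed set of categories; anything that doesn't fit collapses into
# the vague catch-all 'inconclusive'.
CATEGORIES = {"agree", "disagree"}   # assumed scheme: no 'indifferent' option

responses = ["agree", "indifferent", "disagree", "indifferent",
             "agree", "no opinion", "agree"]

def classify(response):
    # Selection at the classification stage: unlisted options are
    # forced into a single ambiguous bucket.
    return response if response in CATEGORIES else "inconclusive"

counts = {}
for r in responses:
    label = classify(r)
    counts[label] = counts.get(label, 0) + 1

# 'indifferent' and 'no opinion' are no longer distinguishable:
print(counts)  # {'agree': 3, 'inconclusive': 3, 'disagree': 1}
```

Downstream analysts see only the counts; whether 'inconclusive' means indifference, missing data, or something else entirely has been selected away.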

Selection at the analysis level requires finesse in working with data sets. The analyst engages Big Data by selecting relevant data sets and matching variables between sets to perform the proper statistical comparison. Selecting and matching sets occur in prospective and retrospective designs, so this is not a new problem. However, methodological transparency (e.g., knowing the methods used to generate the data) is often limited in Big Data contexts because it requires asking "preanalytic" questions without available answers (Berman 2013, 95). And without methodological transparency, our understanding of relationships between variables is limited.

Suppose that scientists aim to find the parasite responsible for some physiological condition P by analyzing data from brain samples from different regions and decades. Analysts find data on brain samples with P and brain samples without P, and then compare the samples to narrow the possible parasite culprits. To better match our groups for confounds we can ask specific questions: What was the process for selecting the individual samples? Were all brain samples taken post-mortem? How was each sample prepared for measurement? At the analytic level, this information is lost. One major difficulty emerges: we have limited information about possible error sources, such as selection bias and laboratory techniques that can alter intrinsic components of the tissue. This is problematic because if the experimental and control groups have different error sources, we may mistake confound-driven differences for genuine, condition-linked ones.
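A toy simulation can show how this goes wrong. All the numbers below are invented: suppose a preparation artifact (a hypothetical 'technique A' used in older archives) inflates a measured tissue marker, and the P samples disproportionately come from those older archives. A naive comparison of group means then shows a sizable gap that reflects laboratory technique, not condition P.

```python
import random
import statistics

random.seed(7)

# Hypothetical scenario: the marker's true level is identical for all
# samples; 'technique A' preparation adds an artifact of +2.0 units.
def sample(technique):
    base = random.gauss(5.0, 1.0)                     # true marker level, same for everyone
    artifact = 2.0 if technique == "A" else 0.0       # preparation artifact (assumed)
    return base + artifact

# Technique is confounded with group: P samples mostly use technique A.
group_P   = [sample("A") for _ in range(80)] + [sample("B") for _ in range(20)]
group_noP = [sample("A") for _ in range(20)] + [sample("B") for _ in range(80)]

diff = statistics.mean(group_P) - statistics.mean(group_noP)
print(f"mean difference: {diff:.2f}")  # sizable gap, driven entirely by technique
```

Without a record of which samples were prepared with which technique, the analyst has no way to distinguish this artifact-driven gap from a real P-linked signal: exactly the preanalytic information that gets lost at the analysis level.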

While transparency poses a problem, the aim of this picture is not to make us skeptical about Big Data, but rather to shift our focus to the Big Data process. Reliable results aren't just about aggregate outcomes. They're about careful selection at each step in the data process.


Berman, J. J. 2013. Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information. 1st ed. San Francisco, CA: Morgan Kaufmann.

Callebaut, Werner. 2012. "Scientific perspectivism: A philosopher of science's response to the challenge of big data biology." Studies in History and Philosophy of Biological and Biomedical Sciences 43(1): 69-80.

Halevy, A., Norvig, P., and Pereira, F. 2009. "The Unreasonable Effectiveness of Data." IEEE Intelligent Systems 24(2): 8-12.

Kaplan, R., Chambers, D., and Glasgow, R. 2014. "Big Data and Large Sample Size: A Cautionary Note on the Potential for Bias." Clinical and Translational Science 7(4). doi:10.1111/cts.12178.

Kitchin, R. 2014. "Big data, new epistemologies and paradigm shifts." Big Data & Society 1: 1-12.

Mayer-Schönberger, V., and Cukier, K. 2013. Big Data: A Revolution That Will Transform How We Live, Work, and Think. New York: Houghton Mifflin Harcourt.

van Fraassen, B. C. 2008. Scientific Representation: Paradoxes of Perspective. Oxford: Oxford University Press.

Yi, S. K. M., Steyvers, M., Lee, M. D., and Dry, M. J. 2012. "The Wisdom of the Crowd in Combinatorial Problems." Cognitive Science 36(3). doi:10.1111/j.1551-6709.2011.01223.x.

Vadim Keyser
Department of Philosophy
Sacramento State


  1. Vadim, thanks very much for this. My understanding of Big Data is pretty limited, but one thing that is pretty consistently emphasized in everything I read about it is that the Big Data approach avoids familiar problems associated with non-random sampling of the population by not sampling at all, i.e., by working with data generated by the entire population.

    My basic question is this. Do the problems you discuss here come up because this n=all thing is an ideal that rarely occurs in actual practice, or do all of these problems arise regardless?

  2. Solid question, Randy. This would be a good follow-up piece: “the n=all myth”.

    First, I'm not sure we can get rid of sampling entirely. While Big Data samples are large, they're not complete; and this creates sampling biases. For example, we don't have access to all medical conditions within a region--only reported ones from insured folks.

    Second, if we had access to the total population would these problems still arise? I think it depends on the point in the data process we're talking about and the type of Big Data measure.

    Gathering/recording and classification will always involve selection. In clinical data we will not have access to every variable, and variables may not be classified the same way between data pools. So we may have complete populations but incomplete variables about those populations.

    Additionally, depending on whether we're working with, e.g., clinical Big Data or astronomy Big Data, we may get different kinds of selectivity. If in a clinical setting we could compare n=all populations, we wouldn't need to match, and so our selective scientific activities would be limited. But in a case where we have thousands of variables from Big Data astronomy imaging, we'd still have to be selective about which features to look at.

    By using multiple data dimensions (e.g., data from all wavelengths), Big Data astronomy can provide information about billions of objects. Older techniques can't capture all of these dimensions, so we have to develop new techniques for classifying objects in Big Data images in order to rule out certain astronomical artifacts. Pesenson et al. (2010) have a nice article arguing that to deal effectively with high-volume data, new methodologies for classifying and representing Big Data astronomy must be invented. This means that even if we have n=all in our imaging techniques, we still have to focus on selection. So the same problems would arise, though in forms that vary with the Big Data context. I think you've asked a powerful question.