Bing Predicts has consistently predicted NFL winners based on team statistics, match-up dynamics, online searches and discussion, and even facts about weather conditions and turf type. What some might dismiss as trifling online discussion actually increases the accuracy of Bing Predictions by 5%. Useful for spring-season prediction, we can now trace allergy zones based on tree-planting information; and soon, you'll be able to plot your sprint from pollen in real time.
Forbes reported on the increasing amount of data in the world:
- 40,000 Google search queries every second.
- By 2020, we can expect about 1.7 megabytes of new information to be created every second for every human being on the planet.
- And, most strikingly: at the moment, less than 0.5% of all data is ever analyzed and used.
Broadly, I use ‘selection’ to refer to scientific engagement with limited system variables—whether this is at the sampling, instrumentation, or modeling level (see van Fraassen 2008). Selection in data gathering occurs when information about certain variables is not recorded. For example, Kaplan et al. (2014) report that social and behavioral choices—e.g., the consumption of fruits/vegetables—are recorded only 10% of the time in electronic health records (EHRs), even though such choices are empirically linked to relevant medical conditions (343). Selection in data classification limits the type and amount of categories used to sort the data. When filling out an online survey, is there an option for ‘indifferent’? When filing an insurance claim, is there a category for (unprovoked) moose attack? Information can be lost when using classificatory systems that are missing categories or contain vague/ambiguous categories (e.g., the use of ‘inconclusive’ as a category) (Berman 2013).
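The loss described above can be made concrete with a minimal sketch, using hypothetical survey data and a made-up `classify` helper: once a coding scheme lacks a category, responses that would have fit it become indistinguishable from any other unclassifiable response.

```python
# Minimal sketch: information loss from a classification scheme that
# omits a category. All data and names here are hypothetical.

def classify(response, categories):
    """Map a response to a known category, else to 'inconclusive'."""
    return response if response in categories else "inconclusive"

responses = ["approve", "indifferent", "disapprove", "indifferent"]

# Scheme A omits 'indifferent'; Scheme B includes it.
scheme_a = {"approve", "disapprove"}
scheme_b = {"approve", "disapprove", "indifferent"}

coded_a = [classify(r, scheme_a) for r in responses]
coded_b = [classify(r, scheme_b) for r in responses]

print(coded_a)  # ['approve', 'inconclusive', 'disapprove', 'inconclusive']
print(coded_b)  # ['approve', 'indifferent', 'disapprove', 'indifferent']
# Under scheme A, the indifferent respondents are lost at the moment of
# recording; no later analysis can recover the distinction.
```

The point is not about code but about timing: the selection happens before analysis begins, so the resulting data set carries no trace of what was left out.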
Selection at the analysis level requires finesse in working with data sets. The analyst engages Big Data by selecting relevant data sets and matching variables between sets to perform the proper statistical comparison. Selecting sets and matching variables occur in prospective and retrospective designs alike, so this is not a new problem. However, methodological transparency (i.e., knowing the methods used to generate the data) is often limited in Big Data contexts, because it requires that we ask “preanalytic” questions without available answers (Berman 2013, 95). Without methodological transparency, our understanding of relationships between variables is limited.
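The matching step can be sketched as follows, with hypothetical records and field names chosen only for illustration: two data sets are joined on a shared key, and the join itself silently selects which cases survive into the analysis.

```python
# Minimal sketch of matching variables across two data sets before
# analysis. The records and field names are hypothetical.

ehr_records = [
    {"patient_id": 1, "bmi": 27.4},
    {"patient_id": 2, "bmi": 31.0},
]
survey_records = [
    {"patient_id": 1, "fruit_veg_servings": 3},
    {"patient_id": 3, "fruit_veg_servings": 5},
]

# Index one set by the shared key, then keep only cases present in both.
survey_by_id = {r["patient_id"]: r for r in survey_records}
matched = [
    {**ehr, **survey_by_id[ehr["patient_id"]]}
    for ehr in ehr_records
    if ehr["patient_id"] in survey_by_id
]

print(matched)  # only patient 1 appears in both sets
# What the code cannot tell us is exactly the "preanalytic" part:
# whether 'fruit_veg_servings' was self-reported or observed, and over
# what period -- questions that require methodological transparency.
```

Note the design choice hidden in the comprehension: patients missing from either set are dropped, which is itself a selection that the aggregate results will not advertise.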
While transparency poses a problem, the aim of this picture is not to make us skeptical about Big Data, but rather to shift our focus to the Big Data process. Reliable results aren’t just about aggregate outcomes; they’re about careful selection at each step in the data process.
Berman, J. J. 2013. Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information. 1st ed. San Francisco, CA: Morgan Kaufmann.
Callebaut, W. 2012. “Scientific Perspectivism: A Philosopher of Science’s Response to the Challenge of Big Data Biology.” Studies in History and Philosophy of Biological and Biomedical Sciences 43(1): 69–80.
Halevy, A., Norvig, P., and Pereira, F. 2009. “The Unreasonable Effectiveness of Data.” IEEE Intelligent Systems 24(2): 8–12.
Kaplan, R., Chambers, D., and Glasgow, R. 2014. “Big Data and Large Sample Size: A Cautionary Note on the Potential for Bias.” Clinical and Translational Science 7(4). doi:10.1111/cts.12178.
Kitchin, R. 2014. “Big Data, New Epistemologies and Paradigm Shifts.” Big Data & Society 1(1).
Mayer-Schönberger, V., and Cukier, K. 2013. Big Data: A Revolution That Will Transform How We Live, Work, and Think. New York: Houghton Mifflin Harcourt.
Yi, S. K. M., Steyvers, M., Lee, M. D., and Dry, M. J. 2012. “The Wisdom of the Crowd in Combinatorial Problems.” Cognitive Science 36(3). doi:10.1111/j.1551-6709.2011.01223.x.