Big Data, new epistemologies and paradigm shifts

 


Summary

It is argued that: (1) Big Data and new data analytics are disruptive innovations which are reconfiguring in many instances how research is conducted; and (2) there is an urgent need for wider critical reflection within the academy on the epistemological implications of the unfolding data revolution, a task that has barely begun to be tackled despite the rapid changes in research practices presently taking place. After critically reviewing emerging epistemological positions, it is contended that a potentially fruitful approach would be the development of a situated, reflexive and contextually nuanced epistemology.

Big Data

Kitchin characterizes Big Data as having the following qualities:

  • Huge in volume (terabytes or petabytes of data)
  • High in velocity (created in or near real time)
  • Diverse in variety (structured and unstructured)
  • Exhaustive in scope (striving to capture entire populations or systems, n = all)
  • Fine-grained in resolution and uniquely indexical in identification
  • Relational in nature (common fields allow different datasets to be conjoined)
  • Flexible (new fields can be added easily and the data can scale rapidly)

A fourth paradigm in science?

Following Kuhn's lead, Kitchin suggests that we may be entering a new paradigm of research, brought on by the possibilities of Big Data. The four major paradigms are described in the table below, created from the text.

| Paradigm | Nature | Form | When |
| --- | --- | --- | --- |
| First | Experimental science | Empiricism; describing natural phenomena | Pre-Renaissance |
| Second | Theoretical science | Modeling and generalization | Pre-computers |
| Third | Computational science | Simulation of complex phenomena | Pre-Big Data |
| Fourth | Exploratory science | Data-intensive; statistical exploration and data mining | Now |

Taken from Kitchin's text, which compiled it from Hey et al. (2009).

The end of theory: Empiricism reborn

Some have argued that the "end of theory" has arrived, brought on by Big Data and computational power. The central arguments, roughly, are that: (1) Big Data can capture a whole domain at full resolution; (2) there is no need for a priori theory, models or hypotheses; (3) the data can speak for themselves, free of human bias or framing, and any patterns found within them are inherently meaningful; and (4) meaning transcends context or domain-specific knowledge, so anyone who can decode a statistic or visualization can interpret the results.

Many of these ideas originated in business and marketing circles, where understanding the world is not necessarily important.

Kitchin points out that,

Whilst this empiricist epistemology is attractive, it is based on fallacious thinking with respect to the four ideas that underpin its formulation.

  1. Big Data may seek to be exhaustive; however, this all-seeing-eye perspective is limited by regulations and logistical realities. Thus, Big Data is still a sample that may not be representative and is subject to sampling bias like any other data.

    • Data are shaped within a system built on prior assumptions and the intended goals for creating the data, which actively shape its output
  2. Big Data does not "arise from nowhere, free from ‘the regulating force of philosophy’ (Berry, 2011: 8)"

    • Systems are designed to capture certain kinds of data using methods that have been tested and (hopefully) verified in some manner through scientific reasoning and classical scientific approaches. You therefore cannot remove the classical scientific perspective from the equation, since it was used to create the approach for gathering the data in the first place.
    • "New analytics might present the illusion of automatically discovering insights without asking questions, but the algorithms used most certainly did arise and were tested scientifically for validity and veracity"
  3. "Data are not generated free from theory, neither can they simply speak for themselves free of human bias or framing"

    • Making sense of data is done through a particular lens and is shaped by the interpreter's a priori beliefs
    • More importantly, patterns found in data are not necessarily meaningful. They can arise at random, and these false connections are exacerbated by Big Data approaches that hunt for any and all associations within as large a dataset as possible (see the sketch after this list)
  4. The idea that data can speak for themselves suggests that anyone with a reasonable understanding of statistics should be able to interpret them without context or domain-specific knowledge.

    • Domain-specific knowledge will always be valuable

    • Kitchin rips into computer and data scientists, as well as physicists, who (specifically in the study of cities) he claims

      • … willfully ignore a couple of centuries of social science scholarship, including nearly a century of quantitative analysis and model building. The result is an analysis of cities that is reductionist, functionalist and ignores the effects of culture, politics, policy, governance and capital (reproducing the same kinds of limitations generated by the quantitative/positivist social sciences in the mid-20th century).

    • The central point here is that data-minded folks are likely to make fools of themselves if they ignore the literature that already exists.
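
Kitchin's point about random patterns is easy to demonstrate numerically. The sketch below is my own illustration, not something from the paper: it generates a purely random "outcome" and hundreds of purely random "predictors", then hunts for correlations the way a naive empiricist analysis might. Even though nothing is genuinely related, roughly 5% of the variables will still look "significant" by chance alone.

```python
# Illustrative sketch (not from Kitchin's paper): mining many random,
# unrelated variables still yields "significant" correlations by chance,
# which is why patterns surfaced by exhaustive association hunting are
# not automatically meaningful.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_obs = 200        # observations per variable
n_vars = 500       # number of unrelated variables to mine
alpha = 0.05       # conventional significance threshold

# Pure noise: no predictor is genuinely related to the outcome.
outcome = rng.normal(size=n_obs)
predictors = rng.normal(size=(n_vars, n_obs))

# Hunt for "any and all associations", as a naive analysis might.
false_hits = 0
for x in predictors:
    r, p = stats.pearsonr(x, outcome)
    if p < alpha:
        false_hits += 1

print(f"{false_hits} of {n_vars} unrelated variables look 'significant' "
      f"at p < {alpha} (~{alpha:.0%} expected by chance alone).")
```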

Data-driven science

Basically, Kitchin suggests that we need to find a middle ground between classical science and the hardcore Big Data folks. Clearly there are benefits to employing scientific methods and approaches; however, we should not limit ourselves to only these methods. Thus, using Big Data in a scientific, abductive manner is likely the way forward.

Abduction is a mode of logical inference and reasoning forwarded by C. S. Peirce (1839–1914) (Miller, 2010). It seeks a conclusion that makes reasonable and logical sense, but is not definitive in its claim. For example, there is no attempt to deduce what is the best way to generate data, but rather to identify an approach that makes logical sense given what is already known about such data production.

Just use your noggin to do stuff. 😏

Moreover, the advocates of data-driven science argue that it is much more suited to exploring, extracting value and making sense of massive, interconnected data sets, fostering interdisciplinary research that conjoins domain expertise (as it is less limited by the starting theoretical frame), and that it will lead to more holistic and extensive models and theories of entire complex systems rather than elements of them (Kelling et al., 2009).
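
One way to picture this data-driven, abductive mode in practice is a two-stage workflow: mine one slice of the data to generate candidate hypotheses, then treat those candidates as provisional and test only them on data that were not used to find them. The sketch below is a minimal illustration of that idea under assumed synthetic data (with an arbitrarily chosen "true" variable); it is not a procedure taken from the paper.

```python
# Minimal sketch of a data-driven, abductive workflow (my illustration,
# not a method from Kitchin's paper): use one slice of the data to
# *generate* candidate hypotheses, then test only those candidates on a
# held-out slice, correcting for how many were carried forward.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_obs, n_vars = 400, 200
X = rng.normal(size=(n_obs, n_vars))
# Assume (for illustration) that only variable 7 truly drives the outcome.
y = 0.5 * X[:, 7] + rng.normal(size=n_obs)

# Stage 1: exploratory mining on half the data -> candidate hypotheses.
half = n_obs // 2
candidates = [
    j for j in range(n_vars)
    if stats.pearsonr(X[:half, j], y[:half])[1] < 0.01
]

# Stage 2: confirmatory test of only those candidates on held-out data,
# with a Bonferroni correction for the number of candidates tested.
threshold = 0.05 / max(len(candidates), 1)
confirmed = [
    j for j in candidates
    if stats.pearsonr(X[half:, j], y[half:])[1] < threshold
]

print(f"candidates from exploration: {candidates}")
print(f"confirmed on held-out data:  {confirmed}")
```

The held-out confirmation step is what keeps the exploration honest: associations that arose purely by chance in the mining stage rarely survive it, while the genuinely informative variable does.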

Conclusion

Nonetheless, as Kitchin (2013) and Ruppert (2013) argue, Big Data presents a number of opportunities for social scientists and humanities scholars, not least of which are massive quantities of very rich social, cultural, economic, political and historical data. It also poses a number of challenges, including a skills deficit for analyzing and making sense of such data, and the creation of an epistemological approach that enables post-positivist forms of computational social science. One potential path forward is an epistemology that draws inspiration from critical GIS and radical statistics in which quantitative methods and models are employed within a framework that is reflexive and acknowledges the situatedness, positionality and politics of the social science being conducted, rather than rejecting such an approach out of hand. Such an epistemology also has potential utility in the sciences for recognizing and accounting for the use of abduction and creating a more reflexive data-driven science. As this tentative discussion illustrates, there is an urgent need for wider critical reflection on the epistemological implications of Big Data and data analytics, a task that has barely begun despite the speed of change in the data landscape.

 


Notes by Matthew R. DeVerna