A Decade of Social Bot Detection





Key Insights

  1. Social bots are a long-studied, yet unsolved, problem in our online social ecosystems, and several detection trends have emerged over time. The latest and most promising advance is represented by group-based detectors.
  2. Deception detection is intrinsically adversarial. The application of adversarial machine learning can give us an edge in the fight against all forms of online manipulation and automation.
  3. Recent advances in computing and AI (for example, deepfakes) make individual bots indistinguishable from legitimate users. Future efforts should focus on measuring the extent of inauthentic coordination rather than on trying to classify the nature of individual accounts.

Summary of Suggestions

Future deception detection techniques should:

  1. Focus on identifying suspicious coordination independently of the nature of individual accounts
  2. Avoid providing binary labels in favor of fuzzier and multifaceted indicators
  3. Favor unsupervised/semi-supervised approaches over supervised ones
  4. Account for adversaries by design

In addition, part of the massive efforts we dedicated to the task of detection should also be reallocated to measure (human) exposure to these phenomena and to quantify the impact they possibly have.


Introduction

In the aftermath of the 2016 U.S. elections, the world started to realize the gravity of widespread deception in social media. Following Trump’s exploit, we witnessed the emergence of a strident dissonance between the multitude of efforts for detecting and removing bots and the increasing effects these malicious actors seem to have on our societies. This paradox opens a burning question: What strategies should we enforce in order to stop this social bot pandemic?

 

The Social Bot Pandemic

There is still no agreed-upon definition of a social bot. The diversity of definitions results from the different fields studying the problem, as well as from the diversity of behaviors that bots display. For example:

Bots can be either malicious or benign, and researchers tend to focus on malicious bots.

Prevalence of Bots

Even more worrisome, when strong political or economic interests are at stake, the presence of bots dramatically increases.

Figure 1 below illustrates the global nature of social bots' attempts at influencing the world; however, their impact is not always real.

The ubiquity of social bots is also partly fueled by the availability of open source code...

Number of Twitter bot repositories on GitHub:

  2016: 4,000 [source]
  2018: 40,000 [source]

The looming picture is one where social bots are among the weapons of choice for deceiving and manipulating crowds. These results are backed by the same platforms where information operations took place—namely, Facebook ([source]), Twitter ([source]), and Reddit ([source])—which have banned tens of thousands of accounts involved in coordinated activities since 2016.

Growth of Research

The growth of the problem has, fortunately, led to the growth of work seeking to detect social bots (Fig. 2).

 

Perhaps even more importantly, the rate at which new papers are published implies that a huge worldwide effort is taking place in order to stop the spread of the social bot pandemic. But where is all this effort leading?

 

The Dawn of Social Bot Detection

Many early attempts focused on the analysis of individual accounts (see Figure 3A below).

The key assumption in this approach is that bot accounts are clearly different from non-bot accounts.
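
To make the individual-account approach concrete, below is a minimal sketch of a feature-based supervised detector. The account features, the synthetic data, and the choice of a random forest are illustrative assumptions, not the specific systems discussed in the article.

```python
# Minimal sketch of individual-account, feature-based bot detection.
# The features and the synthetic data below are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

# Hypothetical per-account features: tweets/day, followers/friends ratio,
# fraction of retweets, account age (days), profile completeness score.
n = 2000
X_human = rng.normal(loc=[5, 1.0, 0.3, 1500, 0.9], scale=[3, 0.5, 0.1, 700, 0.1], size=(n, 5))
X_bot   = rng.normal(loc=[60, 0.1, 0.8,  200, 0.4], scale=[20, 0.1, 0.1, 150, 0.2], size=(n, 5))
X = np.vstack([X_human, X_bot])
y = np.array([0] * n + [1] * n)   # 0 = human, 1 = bot

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Probability scores are preferable to hard labels: many accounts mix
# bot-like and human-like behavior (see the problems discussed below).
scores = clf.predict_proba(X_test)[:, 1]
print(classification_report(y_test, (scores > 0.5).astype(int), target_names=["human", "bot"]))
```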

Problems with the Approach

Systems that focus on only a few characteristics can be easily gamed. However, even more sophisticated algorithms that leverage thousands of account features can run into issues:

  1. Training machine learning classifiers requires training data. Unfortunately, ground-truth data is hard to come by.

  2. Most "ground-truth" data that is utilized is simply hand-coded by humans.

    1. This approach is problematic for at least two reasons:

      1. Different coders may utilize different methods, which may lead to different results/conclusions.
      2. Humans have been shown to suffer from several annotation biases and largely fail at spotting more sophisticated bots, correctly annotating only 24% of these accounts [source]
  3. Many of these classifiers return a binary classification. However, many malicious bots display a mixture of bot-like and human-like behavior.

  4. The evolutionary nature of social bots creates an endless cat-and-mouse game.

 

The Issue of Bot Evolution

Bot Evolution — the process by which new, more sophisticated bot detection methods "force" the creators of malicious bots to design ever more sophisticated social bots.

Newer bots are more similar to legitimate human-operated accounts than to other older bots.

Why?

Blurring Lines of "Real" and "Fake"

Kate Starbird discusses how the lines are blurring between what is "real" and "fake" online in this Nature article.

As one form of “social Web virus,” bots mutated, thus becoming more resistant to our antibodies. The social bot pandemic gradually became much more difficult to stop. Within this global picture, dichotomous classifications — such as human vs. bot, fake vs. real, coordinated vs. not coordinated — might represent oversimplifications, unable to grasp the complexity of these phenomena and unlikely to yield accurate and actionable results. Ultimately, the findings about the evolution of online automation and deception tell us that the naive assumption of early, supervised bot detection approaches — according to which bots are clearly separable from legitimate accounts — is no longer valid.

 

The Rise of Group Approaches

Around 2012-2013, a number of group-based approaches began to emerge.
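
As a rough illustration of what a group-based detector might do, the sketch below represents each account by the set of items it shared, links accounts whose sharing behavior is nearly identical, and flags large connected clusters as candidate coordinated groups. The account-by-item representation, the similarity threshold, and the synthetic data are illustrative assumptions rather than a specific published method.

```python
# Sketch of a group-based approach: flag clusters of accounts whose
# sharing behavior is suspiciously similar. Data here is synthetic.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(1)

n_accounts, n_items = 300, 1000
# Organic accounts share items mostly at random ...
organic = (rng.random((n_accounts, n_items)) < 0.01).astype(int)
# ... while a coordinated group of 30 accounts pushes the same 40 items.
coordinated = np.zeros((30, n_items), dtype=int)
coordinated[:, :40] = (rng.random((30, 40)) < 0.9).astype(int)
A = csr_matrix(np.vstack([organic, coordinated]))   # account-by-item matrix

sim = cosine_similarity(A)                # pairwise behavioral similarity
np.fill_diagonal(sim, 0.0)

# Link accounts whose similarity exceeds a (tunable) threshold and look for
# large connected components: these are candidate coordinated groups.
adj = csr_matrix(sim > 0.7)
n_comp, labels = connected_components(adj, directed=False)
sizes = np.bincount(labels)
suspicious = [c for c in range(n_comp) if sizes[c] >= 10]
print("candidate coordinated groups (id, size):", [(c, int(sizes[c])) for c in suspicious])
```

Real systems use richer behavioral traces (timing, content, retweet cascades) and more robust clustering, but the core idea is the same: suspicion attaches to the group's coordination, not to any single account.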

Bot Detection Techniques Over the Past 10 Years

The authors conducted a survey of 230 bot detection papers; see Figure 5 below for their findings.

A couple of examples are given in the original text to clarify how the different types of methods are represented by the survey's classification. Specifically, see the end of page 79 and the beginning of page 80 for these details.

As a consequence [of certain communities favoring certain methods], some combinations of approaches—above all, text-based detectors that perform unsupervised, group analyses—are almost unexplored and definitely underrepresented in the landscape of existing bot detectors. In the future, it would be advisable for multiple efforts to follow the directions that have been mostly overlooked until now.

 

A Glimpse into the Future of Deception Detection

The authors then move on to cover the potential areas of improvement in deception detection. Their comments are based on the following two observations:

  1. Both the individual and group-based methods are reactive.

    1. That is, they typically respond to bad actors by gathering data on them and then building models of those bad actors' behavior, networks, etc.

In other words, scholars and OSN administrators are constantly one step behind malicious account developers.

  2. Most of the machine learning methods utilized are designed to be operated within environments that are stationary and neutral; both of these assumptions are violated with respect to social bot detection.

    1. Stationary — violated due to the constant evolution of bots over time
    2. Neutral — violated because malicious bot creators are actively trying to avoid detection

Adversarial Machine Learning to the Rescue?

All tasks related to the detection of online deception, manipulation and automation are intrinsically adversarial.

Adversarial machine learning is a paradigm specifically designed for applications where what you're trying to detect is actively trying to fool the learned models that are doing the detection.

High-level Goal: study vulnerabilities of existing systems and possible attacks to exploit them, before such vulnerabilities are effectively exploited by adversaries. Early detection of vulnerabilities can then be utilized to build more robust detection systems.

Adversarial Bot Detection

In adversarial bot detection, researchers create meaningful adversarial examples which they use to test existing bot detection methods.

Adversarial examples might take several forms. For now, such examples are driven by the creativity of researchers; in the future, however, they could be driven by cutting-edge AI, for instance by utilizing generative adversarial networks.
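
To illustrate the idea, below is a minimal sketch of probing a detector with adversarial examples: it greedily perturbs the features of known bot accounts until the classifier's bot score drops below the decision threshold, revealing which features the detector over-relies on. It assumes the hypothetical `clf` and `X_bot` from the earlier individual-account sketch; the perturbation strategy and its parameters are illustrative choices, not a method from the article.

```python
# Sketch of crafting adversarial examples against a feature-based detector.
# Assumes a trained classifier `clf` and bot feature vectors `X_bot` as in
# the earlier sketch; step sizes and thresholds are illustrative.
import numpy as np

def evade(clf, x, steps=200, step_size=0.05, threshold=0.5, rng=None):
    """Greedy random search: nudge one feature at a time and keep the
    change whenever it lowers the predicted bot probability."""
    rng = rng or np.random.default_rng(0)
    x = x.copy()
    best = clf.predict_proba(x.reshape(1, -1))[0, 1]
    for _ in range(steps):
        i = rng.integers(len(x))
        candidate = x.copy()
        candidate[i] *= 1 + step_size * rng.choice([-1, 1])
        p = clf.predict_proba(candidate.reshape(1, -1))[0, 1]
        if p < best:
            x, best = candidate, p
        if best < threshold:                 # detector now says "human"
            break
    return x, best

# Probe the detector with a handful of known bots.
evasions = [evade(clf, x) for x in X_bot[:20]]
evaded = sum(score < 0.5 for _, score in evasions)
print(f"{evaded}/20 bot accounts evaded detection after perturbation")
```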

Generative Adversarial Networks

Video: "Making Machine Learning Robust Against Adversarial Inputs" (CACM, on Vimeo).

Generative adversarial networks (GANs) are a powerful machine learning framework where two competing deep learning networks are jointly trained in a game-theoretic setting. [source]

In particular, a GAN is composed of a generator network that creates data instances and a discriminator network that classifies data instances.

Goal of Generator — to create synthetic data instances that resemble real data

Goal of Discriminator — to classify input data as either synthetic or organic

… the generator of a GAN could be used as a generative model for creating many plausible adversarial examples, thus overcoming the previously mentioned limitations in this task and the scarcity of labeled datasets.

This provides us with two very useful and practical applications of GANs with respect to social bot detection:

  1. Creating much needed training data
  2. Testing and improving existing detection models by detecting their weaknesses

This paradigm has never been applied to the task of social bot detection, but it was tested with promising results for related tasks, such as that of fake news generation/detection. [source]

General Generative Adversarial Network Model
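
As a rough illustration of the general model, below is a minimal GAN training loop on toy one-dimensional data. This is a sketch assuming PyTorch; the network sizes, learning rates, and toy target distribution are arbitrary illustrative choices and are not tied to any bot-detection model.

```python
# Minimal sketch of the general GAN training loop on toy 1-D data.
# Purely illustrative: architectures and hyperparameters are arbitrary.
import torch
import torch.nn as nn

torch.manual_seed(0)
real_sampler = lambda n: torch.randn(n, 1) * 1.5 + 4.0   # "organic" data ~ N(4, 1.5)
noise = lambda n: torch.randn(n, 8)                       # latent input to the generator

# Generator: maps noise to synthetic data instances.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
# Discriminator: outputs the probability that an instance is real (organic).
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(3000):
    # Discriminator update: real samples labeled 1, generated samples labeled 0.
    real, fake = real_sampler(64), G(noise(64)).detach()
    loss_D = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator update: try to make the discriminator label fakes as real.
    fake = G(noise(64))
    loss_G = bce(D(fake), torch.ones(64, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

samples = G(noise(1000)).detach()
print(f"generated mean={samples.mean().item():.2f}, std={samples.std().item():.2f} (target: 4.0, 1.5)")
```

The loop alternates between the two objectives: the discriminator is trained to separate real from generated instances, and the generator is trained to fool the current discriminator.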

Challenge for the Future of GANs

 

 

Open Challenges and the Way Ahead

  1. Organization — The large body of social bot detection research needs to be organized.
  2. Standardization — Standard benchmarks, frameworks, and datasets should be developed.
  3. Generalizability — This aspect of classifier performance has been largely overlooked. Deception detectors that generalize across time as well as across types of bots are needed (see Figure 7 below).

  4. Create social bot repositories

… to reach this ambitious goal, we must first create reference datasets that comprise several different kinds of malicious accounts, including social bots, cyborgs and political trolls, thus significantly adding to the sparse resources existing as of today.

  5. Develop a diverse set of methods for creating adversarial examples

    1. This also requires methods of quantifying the value of adversarial examples — e.g., based on their novelty, diversity, how genuine they would appear to humans, etc.
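
As an illustration of how such value metrics might look, below is a sketch of two simple distance-based measures: novelty (distance of each adversarial example from its nearest known training example) and diversity (average pairwise distance within the adversarial set). These specific definitions and the synthetic data are illustrative assumptions, and "how genuine the examples would appear to humans" would still require human evaluation.

```python
# Sketch of two possible metrics for valuing a set of adversarial examples:
# novelty (distance from known training data) and diversity (spread within
# the set). The metric definitions are illustrative, not from the article.
import numpy as np
from sklearn.metrics import pairwise_distances

def novelty(adv_examples, training_data):
    """Mean distance from each adversarial example to its nearest
    neighbor in the training data: higher means more novel."""
    d = pairwise_distances(adv_examples, training_data)
    return d.min(axis=1).mean()

def diversity(adv_examples):
    """Mean pairwise distance within the adversarial set."""
    d = pairwise_distances(adv_examples)
    n = len(adv_examples)
    return d.sum() / (n * (n - 1))

rng = np.random.default_rng(2)
train = rng.normal(size=(500, 5))          # stand-in for known bot features
adv = rng.normal(loc=0.5, size=(50, 5))    # stand-in for adversarial examples
print(f"novelty={novelty(adv, train):.3f}  diversity={diversity(adv):.3f}")
```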

 

Summary of Suggestions

Future deception detection techniques should:

  1. Focus on identifying suspicious coordination independently of the nature of individual accounts
  2. Avoid providing binary labels in favor of fuzzier and multifaceted indicators
  3. Favor unsupervised/semi-supervised approaches over supervised ones
  4. Account for adversaries by design

In addition, part of the massive efforts we dedicated to the task of detection should also be reallocated to measure (human) exposure to these phenomena and to quantify the impact they possibly have.

 


Notes by Matthew R. DeVerna