A Decade of Social Bot Detection

Contents:
- Summary Video
- Key Insights
- Summary of Suggestions
- Introduction
- The Social Bot Pandemic
- Prevalence of Bots
- Growth of Research
- The Dawn of Social Bot Detection
- Problems with the Approach
- The Issue of Bot Evolution
- Blurring Lines of "Real" and "Fake"
- The Rise of Group Approaches
- Bot Detection Techniques Over the Past 10 Years
- A Glimpse into the Future of Deception Detection
- Adversarial Machine Learning to the Rescue?
- Adversarial Bot Detection
- Generative Adversarial Networks
- General Generative Adversarial Network Model
- Challenge for the Future of GANs
- Open Challenges and the Way Ahead
- Summary of Suggestions
- Social bots are a long-studied, yet unsolved, problem in our online social ecosystems, and several detection trends have appeared over time. The latest and most promising advance is represented by group-based detectors.
- Deception detection is intrinsically adversarial. The application of adversarial machine learning can give us an edge in the fight against all forms of online manipulation and automation.
- Recent advances in computing and AI (for example, deepfakes) make individual bots indistinguishable from legitimate users. Future efforts should focus on measuring the extent of inauthentic coordination rather than on trying to classify the nature of individual accounts.
Future deception detection techniques should:
In addition, part of the massive effort dedicated to the detection task should be reallocated to measuring (human) exposure to these phenomena and to quantifying their possible impact.
In the aftermath of the 2016 U.S. elections, the world started to realize the gravity of widespread deception in social media. Following Trump’s exploit, we witnessed the emergence of a strident dissonance between the multitude of efforts for detecting and removing bots and the increasing effects these malicious actors seem to have on our societies. This paradox opens a burning question: What strategies should we adopt to stop this social bot pandemic?
There is still no agreed upon social bot definition. The diversity of definitions is a result of different fields studying the problem as well as the diversity of behavior that bots display. For example:
Bots are either malicious or benign and researchers tend to focus on malicious bots.
The reason why becomes obvious when you look at how Stieglitz et al. categorize bots by their intent and capacity to imitate humans.
Even more worrisome, when strong political or economic interests are at stake, the presence of bots dramatically increases.
Figure 1 below illustrates the global nature of social bots' attempts at influencing the world; however, their impact is not always real.
The ubiquity of social bots is also partly fueled by the availability of open source code...
Number of Twitter bot repos on Github...
The looming picture is one where social bots are among the weapons of choice for deceiving and manipulating crowds. These results are backed by the same platforms where information operations took place, namely Facebook ([source]), Twitter ([source]), and Reddit ([source]), which have banned tens of thousands of accounts involved in coordinated activities since 2016.
The growth of the problem has, fortunately, led to the growth of work seeking to detect social bots (Fig. 2)
Perhaps even more importantly, the rate at which new papers are published implies that a huge worldwide effort is taking place in order to stop the spread of the social bot pandemic. But where is all this effort leading?
Many early attempts focused on the analysis of individual accounts. (Figure 3A below)
The key assumption in this approach is that bot accounts are clearly different from non-bot accounts
Systems that focus on only a few characteristics can be easily gamed; however, even more sophisticated algorithms that take thousands of account features into account can run into issues.
Training machine learning classifiers requires training data. Unfortunately, ground-truth data is hard to come by
Most "ground-truth" data that is utilized is simply hand-coded by humans
This approach raises at least two issues:
Many of these classifiers return a binary classification. However, many malicious bots display a mixture of bot-like and human-like behavior
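As a toy illustration of this individual-account, feature-based approach: a classifier is trained on labeled accounts and queried for new ones. The features, numbers, and synthetic data below are hypothetical stand-ins (real systems use hundreds to thousands of features), and returning a soft score via `predict_proba`, rather than a hard bot/human label, is one way to acknowledge accounts that mix bot-like and human-like behavior.

```python
# Sketch of a supervised, individual-account bot detector on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Hypothetical features per account:
# [follower/friend ratio, tweets per day, default-profile indicator]
humans = np.column_stack([
    rng.normal(1.0, 0.3, 200),    # balanced follower ratio
    rng.normal(5, 2, 200),        # moderate activity
    rng.integers(0, 2, 200) * 0.1,
])
bots = np.column_stack([
    rng.normal(0.1, 0.05, 200),   # few followers, many friends
    rng.normal(60, 10, 200),      # very high activity
    rng.integers(0, 2, 200) * 0.9,
])
X = np.vstack([humans, bots])
y = np.array([0] * 200 + [1] * 200)  # 0 = human, 1 = bot

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# A soft "bot score" in [0, 1] instead of a binary verdict.
account = np.array([[0.12, 55.0, 0.9]])
bot_score = clf.predict_proba(account)[0, 1]
```

Of course, this sketch inherits both problems above: it is only as good as its hand-coded labels, and an adversary who knows the features can adjust their bot's behavior to evade it.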
The evolutionary nature of social bots creates an endless cat and mouse game
Bot Evolution — the process by which the creation of new sophisticated bot detection methods "force" the creators of malicious bots to design more sophisticated malicious social bots.
Newer bots are more similar to legitimate human-operated accounts than to other older bots.
Why?
Cyborgs exist halfway between traditional concepts of bots and humans
These cyborgs are now using AI to create text (e.g., via the GPT-2 and GPT-3 deep learning models [source]) and profile pictures (e.g., via StyleGAN deep learning models [source])
Kate Starbird discusses how the lines are blurring between what is "real" and "fake" online in this Nature article.
As one form of “social Web virus,” bots mutated, becoming more resistant to our antibodies. The social bot pandemic gradually became much more difficult to stop. Within this global picture, dichotomous classifications (human vs. bot, fake vs. real, coordinated vs. not coordinated) might represent oversimplifications, unable to grasp the complexity of these phenomena and unlikely to yield accurate and actionable results. Ultimately, the findings about the evolution of online automation and deception tell us that the naive assumption of early, supervised bot detection approaches, according to which bots are clearly separable from legitimate accounts, is no longer valid.
Around 2012-2013, a number of group-based approaches began to pop up.
Most group detectors have proposed shifting from general-purpose machine learning algorithms (e.g., SVMs and decision trees) to ad-hoc algorithms designed specifically for detecting bots, in an effort to boost detection performance.
Group detectors are also typically based on unsupervised or semi-supervised approaches.
- This helps overcome both the generalization problems of supervised detectors and the limited availability of exhaustive, reliable training datasets.
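One way such an unsupervised group detector might look, sketched here on synthetic data: coordinated accounts tend to share near-identical behavioral fingerprints, so dense clusters of very similar accounts are suspicious as a group even when each account looks plausible on its own. The fingerprint and the use of DBSCAN are illustrative assumptions, not the paper's specific method.

```python
# Sketch of a group-based, unsupervised detector: flag dense clusters
# of accounts with near-identical behavioral fingerprints.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(7)

# Fingerprint per account: share of activity across 4 time-of-day bins.
organic = rng.dirichlet(np.ones(4), size=50)                        # diverse humans
botnet = rng.dirichlet(np.ones(4)) + rng.normal(0, 0.005, (20, 4))  # near-clones

X = np.vstack([organic, botnet])
labels = DBSCAN(eps=0.05, min_samples=5).fit_predict(X)

# Accounts in any dense cluster (label != -1) form a candidate
# coordinated group; isolated accounts are labeled -1 (noise).
flagged = np.flatnonzero(labels != -1)
```

No labeled training data is needed: suspicion comes from the improbable similarity among accounts, which sidesteps the ground-truth problem of supervised detectors.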
The authors conducted a survey of 230 bot detection papers; see Figure 5 below for their findings.
A couple of examples are given in the original text to clarify how the different types of methods are represented by the survey's classification; specifically, see the end of page 79 and the beginning of page 80 for these details.
As a consequence [of certain communities favoring certain methods], some combinations of approaches—above all, text-based detectors that perform unsupervised, group analyses—are almost unexplored and definitely underrepresented in the landscape of existing bot detectors. In the future, it would be advisable for multiple efforts to follow the directions that have been mostly overlooked until now.
The authors then move on to cover potential areas of improvement in deception detection. Their comments are based on the following two observations:
Both the individual and group-based methods are reactive.
In other words, scholars and OSN administrators are constantly one step behind malicious account developers.
Most of the machine learning methods utilized are designed to be operated in environments that are stationary and neutral; both of these assumptions are violated in social bot detection.
All tasks related to the detection of online deception, manipulation and automation are intrinsically adversarial.
Adversarial machine learning is a paradigm specifically designed for applications where what you're trying to detect is actively trying to fool the learned models that are doing the detection.
High-level Goal: study vulnerabilities of existing systems and possible attacks to exploit them, before such vulnerabilities are effectively exploited by adversaries. Early detection of vulnerabilities can then be utilized to build more robust detection systems.
In adversarial bot detection, researchers create meaningful adversarial examples which they use to test existing bot detection methods.
Adversarial examples might be:
The examples above are driven by the creativity of researchers; in the future, however, they could be driven by cutting-edge AI, using generative adversarial networks.
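A minimal sketch of this adversarial-testing idea, using a toy detector and hypothetical features (not a real deployed system): starting from an account the model flags as a bot, small, plausible feature changes are applied until the detector's score drops below threshold, exposing how brittle the decision boundary is.

```python
# Sketch of adversarial testing: perturb a bot's features until a
# toy detector misclassifies it as human.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy training data: feature 0 = tweets/day, feature 1 = follower ratio.
X = np.vstack([rng.normal([60, 0.1], [10, 0.05], (100, 2)),   # bots
               rng.normal([5, 1.0], [2, 0.3], (100, 2))])     # humans
y = np.array([1] * 100 + [0] * 100)
clf = LogisticRegression().fit(X, y)

def evade(x, max_steps=200):
    """Greedily nudge each feature against the model's weight vector."""
    x = x.copy()
    direction = -np.sign(clf.coef_[0])           # lowers the bot score
    for _ in range(max_steps):
        if clf.predict_proba([x])[0, 1] < 0.5:   # detector says "human"
            return x
        x += direction * np.abs(x) * 0.05        # 5% proportional nudges
    return x

bot = np.array([70.0, 0.08])
adversarial = evade(bot)
```

The resulting evading account tells the defender exactly which feature combinations the detector fails on, before real bot developers discover them.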
Making Machine Learning Robust Against Adversarial Inputs from CACM on Vimeo.
Generative adversarial networks (GANs) are a powerful machine learning framework where two competing deep learning networks are jointly trained in a game-theoretic setting. [source]
In particular, a GAN is composed of a generator network that creates data instances and a discriminator network that classifies data instances.
- See Wu et al. for examples of an early GAN approach on Twitter
Goal of Generator — to create synthetic data instances that resemble real data
Goal of Discriminator — to classify input data as either synthetic or organic
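The two goals above can be made concrete with a toy one-dimensional GAN, written in plain numpy for illustration (real GANs use deep networks; the "organic" data distribution N(4, 1) is an arbitrary stand-in):

```python
# Toy 1-D GAN: a linear generator and a logistic discriminator trained
# against each other with alternating gradient steps.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-np.clip(t, -60, 60)))  # clip avoids overflow

# Generator: x = a*z + b        (turns noise z into synthetic samples)
# Discriminator: d(x) = sigmoid(w*x + c)  (probability that x is organic)
a, b = 1.0, 0.0
w, c = 0.0, 0.0
lr, batch = 0.02, 128

for _ in range(2000):
    real = rng.normal(4.0, 1.0, batch)   # "organic" data
    z = rng.normal(0.0, 1.0, batch)
    fake = a * z + b

    # Discriminator ascent on log d(real) + log(1 - d(fake)).
    dr, df = sigmoid(w * real + c), sigmoid(w * fake + c)
    w += lr * (np.mean((1 - dr) * real) - np.mean(df * fake))
    c += lr * (np.mean(1 - dr) - np.mean(df))

    # Generator ascent on log d(fake): make fakes look organic.
    df = sigmoid(w * (a * z + b) + c)
    a += lr * np.mean((1 - df) * w * z)
    b += lr * np.mean((1 - df) * w)

fakes = a * rng.normal(0.0, 1.0, 1000) + b  # samples after training
```

After training, the generator's samples drift toward the organic distribution: each network's improvement forces the other to improve, which is exactly the dynamic that makes the generator useful for producing plausible adversarial examples.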
… the generator of a GAN could be used as a generative model for creating many plausible adversarial examples, thus overcoming the previously mentioned limitations in this task and the scarcity of labeled datasets.
This provides us with two very useful and practical applications of GANs with respect to social bot detection
This paradigm has never been applied to the task of social bot detection, but it was tested with promising results for related tasks, such as that of fake news generation/detection. [source]
This research is in its infancy, and researchers need to invest time to develop it.
Need to develop techniques for:
… to reach this ambitious goal, we must first create reference datasets that comprise several different kinds of malicious accounts, including social bots, cyborgs and political trolls, thus significantly adding to the sparse resources existing as of today.
Develop a diverse set of methods for generating adversarial examples
Future deception detection techniques should:
In addition, part of the massive effort dedicated to the detection task should be reallocated to measuring (human) exposure to these phenomena and to quantifying their possible impact.
Notes by Matthew R. DeVerna