Searching for superspreaders of information in real-world social media

• Authors: Sen Pei, Lev Muchnik, José S. Andrade Jr., Zhiming Zheng, & Hernáan A. Makse
• Publication, Year: Scientific Reports, 2014

Overview and Major Findings

• Utilizing real-world social network data, the authors compare the performance of common topological measures in order to identify the best way to locate the most influential users within a network.

• Metrics compared:

1. k-core/k-shell: see the wikipedia page for a quick crash course or this paper for more details.
2. PageRank: Wikipedia for details
3. Degree: the degree of a node in a network is the number of connections it has to other nodes. Wikipedia for details
• Networks examined:

1. LiveJournal.com connections
2. Scientific publishing in American Physical Society (APS)
4. Facebook friend/post networks from a regional network corresponding to the city of New Orleans, LA, USA
• For all of these networks they show that k-core does the best job of identifying superspreaders of information

• They also develop a "local proxy for influence" which appears to perform similarly to k-core — the sum of the nearest neighbors' degree

Background

• Central motivation for this research: previous work seeking to develop methods for identifying superspreaders of information did not look at real-world data and instead rely on simulations based on models of infection and or rumor spreading such as SIR.

• As a result, they highlight numerous times that the conclusions within this work is model dependent and has led to conflicting results
• Thus, they collected large real-world network data to test the methods most commonly utilized (PageRank, k-core, degree) General approach taken in this study

• Essentially, what they then do is calculate the average influence and recognition rate for each of the three network topology metrics, for all of the different networks.

average influence

The average influence, $M(k_s, k_{in})$, is calculated for nodes with a given combination of $k_s$ (k-core) and $k_{in}$ (in-degree) as:

Where,

• $\Upsilon(k_s,k_{in})$ is the collection of all the users participating in a diffusion of information
• $N(k_s,k_{in})$ is the number of those users

recognition rate

The recognition rate is defined as

Where $I_f$ and $P_f$ are sets of nodes ranking in the top $f$ fraction by influence and predictor, respectively, and $|I_f|$ is the number of nodes in $I_f$ .

• So, basically what they do is:

1. Create a ranked list of all nodes from highest to lowest influence.
2. Figure out which nodes are identified as the most influential by a specific predictor (i.e., k-core, in-degree, or PageRank) above some specific fraction $f$
3. Take the intersection of the top $f$ fraction of the influential list ($I_f$) and the predicted list $P_f$
4. Take the number of nodes in this intersection and divide it by the total number of nodes within the top $f$ fraction of the most influential nodes
• Thus, if $I_f$ and $P_f$ were the exact same set of nodes, they $r(f)$ value would equal $1$.

Results

They illustrate a number of different ways that k-core does the best for all networks in pretty much all ways. I simply paste the figures and their descriptions below to illustrate the results.    Local Proxy

Since k-core is a global measure it's usefulness is quite limited — rarely to we have a full picture of all the connections within a social network.

Thus, the authors offer a local proxy which seeks to stand in for k-core by utilizing only partial network information.

Because k-core appears to not only take into account the degree of an individual node but also the degree of those around it, the authors create and test two simple metrics:

1. $k_{sum}$ — Node $i$'s $k_{sum}$ is the sum of the degree of all the nearest neighbors of node $i$.
2. $k_{2sum}$ — Is the same as $k_{sum}$ except it now accounts for the nearest-neighbors of node $i$'s nearest neighbors. Thus, this measure takes the sum of node $i$'s nearest neighbor as well as the next nearest neighbor.

We can see below that $k_{sum}$ performs on-par with k-core and $k_{2sum}$ doesn't add much additional value.

Note: While this would certainly require much less Twitter data than reconstructing the entire network, gathering this data on just the nearest neighbors for even a small sample of Twitter users (say a few hundred) is still extremely temporally expensive and can require days to simply gather that data. 