I've been reading comics since I was a little kid. Neil Gaiman, Alan Moore, Warren Ellis and Art Spiegelman are among my favorite authors now, but when I was younger I was a big fan of Marvel Super Heroes.

When I discovered the Marvel Social Graph dataset, I immediately wanted to find out how the different characters influenced each other.

By using cluster analysis, I am going to figure out how my heroes actually interacted with each other in the Marvel Universe.

The Marvel dataset is composed of a list of co-occurrences of Super Heroes. For example, every time Spider Man appears in a comic book with Captain America, we will have a line with both their names. To illustrate this, here is a visual of the data in Data Science Studio.

Because of these connections, it is possible to create a graph where each node is a character and each link (or edge) shows the presence of a co-occurrence in the dataset.

## A first view of the graph

I did all the network analysis with Python (mostly using the networkx package) and used sigmajs to build a small web application to visualize the network.

Here is the first visualization of our Marvel Social Network. The node size corresponds to the degree (number of edges adjacent to the node) in the graph and the node color to graph clusters detected with the Louvain method. I used the sigmajs force layout to arrange it.
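As a rough sketch of the two quantities behind this picture, the snippet below computes node degrees and Louvain clusters with networkx. The hero pairs are invented for illustration and are not rows from the dataset, and `louvain_communities` (available in recent networkx releases) is an assumption about tooling; the original analysis may well have used a different Louvain implementation.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Toy co-occurrence pairs, invented for illustration (not rows from the dataset).
pairs = [
    ("SPIDER-MAN", "CAPTAIN AMERICA"),
    ("SPIDER-MAN", "IRON MAN"),
    ("CAPTAIN AMERICA", "IRON MAN"),
    ("CAPTAIN AMERICA", "THOR"),
    ("THOR", "LOKI"),
    ("THOR", "ODIN"),
    ("LOKI", "ODIN"),
]

graph = nx.Graph()
graph.add_edges_from(pairs)

# Degree = number of edges adjacent to each node (this drives the node size).
degree = dict(graph.degree())

# Louvain clusters (these drive the node color); the seed makes the run repeatable.
communities = louvain_communities(graph, seed=42)
cluster_of = {hero: i for i, members in enumerate(communities) for hero in members}

# List heroes from highest to lowest degree, with their cluster index.
for hero in sorted(degree, key=degree.get, reverse=True):
    print(hero, degree[hero], cluster_of[hero])
```

The cluster indices themselves carry no meaning; only which heroes share an index matters.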

Now, most people will tell you it is rarely interesting to plot the whole graph because it will be hard to interpret. But in this case, we can validate a few hypotheses:

• the heroes with the highest degree are Captain America, Spider Man and Iron Man. These heroes are therefore of high importance in the social graph, which is logical since they are among the stars of the Marvel universe.

• some clusters appear:

• the pink one on the top left corresponds to the Thor universe, the green one on the bottom to the X-Men, the pale blue on the top right to Spider Man ...
• the Avengers are mostly in the center of the graph, the four big orange nodes corresponding to the Fantastic-4.
• smaller clusters can be detected on the graph periphery, whose characters are most of the time absent from Wikipedia

## Pruning the graph

We would like to understand the internal structure of the Marvel graph, but how can we prune it of all the unknown characters, such as Fah Lo Suee, that pollute it?

One way to do that would be to use edge weights. So far, an edge has been defined by the mere existence of a co-occurrence.

However, Captain America and Spider Man appear together in so many comics that it would be smart of us to take this into account.

An edge weight is extra information about an edge that we add to our graph. In our case, this extra information will be the number of co-occurrences between the two characters the edge links.

For example, the weight of the edge between Spider Man and Captain America is the number of comics they both appear in.
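Counting these weights can be sketched in a few lines of plain Python. This assumes the dataset holds one line per comic shared by a pair of characters, as described earlier; the rows and the `edge_weights` helper below are made up for illustration.

```python
from collections import Counter

# Toy co-occurrence lines, invented for illustration:
# one (hero, hero) pair per comic the two heroes share.
cooccurrences = [
    ("SPIDER-MAN", "CAPTAIN AMERICA"),
    ("CAPTAIN AMERICA", "SPIDER-MAN"),
    ("SPIDER-MAN", "CAPTAIN AMERICA"),
    ("THOR", "LOKI"),
]


def edge_weights(pairs):
    """Count co-occurrences per unordered pair of heroes (hypothetical helper)."""
    weights = Counter()
    for a, b in pairs:
        if a != b:  # ignore a hero paired with itself
            # Sort the names so (A, B) and (B, A) count as the same edge.
            weights[tuple(sorted((a, b)))] += 1
    return weights


weights = edge_weights(cooccurrences)
print(weights)
```

Sorting each pair before counting is a design choice for an undirected graph: the order in which the two names appear on a line should not create two different edges.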

If we prune the graph by keeping only the edges (and their endpoint nodes) whose weight is at least a given threshold K, we will drastically simplify the graph: every edge with a weight less than K will disappear, and so will every node left without an edge, which includes every character with fewer than K appearances in the dataset.

So the more we increase K, the simpler the graph will be and we will get closer to the Marvel Graph skeleton.
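Here is a minimal sketch of that pruning step on a dictionary of edge weights. The `prune` helper and the weights are hypothetical, and treating an edge whose weight equals K as kept is an assumption about how ties are handled.

```python
def prune(weights, k):
    """Keep edges with weight >= k, and only the nodes those edges touch."""
    kept_edges = {pair: w for pair, w in weights.items() if w >= k}
    kept_nodes = {hero for pair in kept_edges for hero in pair}
    return kept_edges, kept_nodes


# Toy weights, invented for illustration (not counts from the dataset).
toy_weights = {
    ("CAPTAIN AMERICA", "SPIDER-MAN"): 12,
    ("CAPTAIN AMERICA", "IRON MAN"): 40,
    ("FAH LO SUEE", "SPIDER-MAN"): 1,
}

# The graph shrinks as the threshold grows.
for k in (1, 10, 30):
    edges, nodes = prune(toy_weights, k)
    print(k, len(edges), sorted(nodes))
```

Nodes are derived from the surviving edges, so a character disappears as soon as none of its edges reaches the threshold.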

For example, here is the graph generated when the value of K is 10.

It is starting to get easier to describe the Marvel Universe. Three heroes have their own very rich universe (cluster): Spider Man, Captain America and Thor. On the bottom left, the X-Men are still clustered together.

Other lesser-known Super Hero teams start to appear, like the Eternals (light blue) or Alpha Flight ("Canada's answer to the Avengers" according to Wikipedia).

Meanwhile, Iron Man, Hulk, the Fantastic 4, Hawkeye, Ant Man and Vision all belong to the central cluster. This is because the ratio of their appearances together (as the Avengers) to their appearances alone is very high.

We can do a similar analysis for K = 30.

This is the backbone of the Marvel Universe. There is a cluster each for Spider Man, Hulk (dark green), Namor (pale green), Thor and the Fantastic-4 (pink), though Captain America and Iron Man are still clustered together in the Avengers team.

Note that the X-Men are now separated into three clusters: the X-Men themselves, plus Generation X and New Mutants, two spin-offs of the X-Men franchise.

## Identifying Influencers

Numerous graph criteria have been derived to describe influencers: degree centrality, PageRank, closeness centrality or betweenness centrality. I find betweenness centrality very interesting because its value represents how important a node is for conveying information between individuals in the network who are not directly connected to each other.
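To make the difference between these measures concrete, here is a small sketch on an invented graph: two tight groups joined by a single "bridge" node. The node names are placeholders, not Marvel characters, and the networkx functions used are an assumption about tooling. PageRank is also available as `nx.pagerank`; it is left out only to keep the sketch short.

```python
import networkx as nx

# Toy graph, invented for illustration: two tight groups joined by one bridge node.
graph = nx.Graph()
graph.add_edges_from([
    ("A1", "A2"), ("A1", "A3"), ("A2", "A3"),  # first group
    ("B1", "B2"), ("B1", "B3"), ("B2", "B3"),  # second group
    ("A1", "BRIDGE"), ("BRIDGE", "B1"),        # the only path between the groups
])

measures = {
    "degree": nx.degree_centrality(graph),
    "closeness": nx.closeness_centrality(graph),
    "betweenness": nx.betweenness_centrality(graph),
}

# Show the top-ranked node for each measure.
for name, scores in measures.items():
    top = max(scores, key=scores.get)
    print(name, top, round(scores[top], 3))
```

In this toy graph the bridge node has fewer neighbors than `A1` or `B1`, yet every shortest path between the two groups runs through it, which is exactly what betweenness centrality rewards.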

Obviously, Marvel stars like Spider Man, Thor and Hulk are good candidates and indeed have very high values for every centrality measure imaginable. But if we draw the graph for K = 50, the real cornerstone of the Marvel Universe (the node with the highest betweenness value) clearly appears.

It is Beast! What is surprising is that he is not the X-Men's biggest star (Wolverine, Storm, Professor X ...), but he is the main link between the X-Men and the Avengers. So if I were to choose a negotiator between these two groups in a modern company, to target a person in order to dismantle a terrorist network, or to choose my brand ambassador in the Marvel Universe, Beast would be the perfect candidate.

## Conclusions

I had a lot of fun mining the Marvel graph for clusters and influencers.

Obviously this analysis is a pretty basic one, so here are some other things I would find interesting:

• having time information would have been awesome. I would have loved to see the network evolve with Marvel's need for new characters, spin-offs and crossovers
• labeling the characters with the side they chose during Civil War: can we predict their affiliation the way we can in the well-known Zachary's Karate Club graph?
• I am personally more of a DC fan, so if someone is able to direct me to the corresponding dataset, I would be more than happy to analyse it.

So if you see that Marvel or DC has some new data to share, contact me on Twitter @prrgutierrez. I would be happy to do a second post on the subject!