All interactive visualizations for this project can be seen separately at https://olincollege.github.io/illuminatimap/plots/vizhub.html

Illuminati Map

Jacob Smilg and Markus Leschly

Software Design Midterm Project, Spring 2021

Introduction:

While looking for topics to research for this project, we became interested in how the Wikipedia pages of the most viewed people are connected. However, we quickly realized that we needed a much smaller category to focus on. We chose the American billionaires category because we expected it to be a highly interconnected group with many interesting connections. We also assumed that the most interesting connections would occur between the most viewed pages within the category. Therefore, our research question is: How are the Wikipedia pages with the most views in the category “American billionaires” connected?

Data Acquisition:

Since we determined that Wikipedia was the best source for our data, we used the Wikipedia API to pull it. To answer our question, we needed two primary sets of data. The first was the number of page views for each billionaire. We collected this using the Wikipedia API’s built-in query that returns the number of daily page views for the past 60 days. We then summed these daily views to find a total, which allowed us to rank the billionaires by viewership.
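
The summing and ranking step can be sketched as follows. This is a minimal illustration, not the project's code: it assumes each person's daily views arrive as a date-to-count mapping in which days without data are None, the names and numbers are made up, and total_views and rank_by_views are hypothetical helper names.

```python
def total_views(daily_views):
    """Sum daily view counts, treating missing days (None) as zero."""
    return sum(count or 0 for count in daily_views.values())


def rank_by_views(pageviews_by_person):
    """Return names sorted from most to least total views."""
    totals = {name: total_views(daily) for name, daily in pageviews_by_person.items()}
    return sorted(totals, key=totals.get, reverse=True)


# Made-up sample data: three days instead of sixty, for brevity.
sample = {
    "Person A": {"2021-03-01": 120, "2021-03-02": None, "2021-03-03": 80},
    "Person B": {"2021-03-01": 500, "2021-03-02": 450, "2021-03-03": 475},
    "Person C": {"2021-03-01": 10, "2021-03-02": 20, "2021-03-03": 30},
}

ranking = rank_by_views(sample)
print(ranking)
```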

The second set of data we needed was a measure of the connections between pages. For this, we used the Wikipedia API’s built-in query that finds all of the pages that link to a selected page. We then processed this data to find which of those linking pages were within the billionaires category. This allowed us to create, for each page, a list of the people connected to it.
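
As a rough sketch of that filtering step, assuming the links have already been collected into a mapping from each page title to the titles that link to it (the function name links_within_category and all sample titles are hypothetical):

```python
def links_within_category(links_here, category_members):
    """Keep only the linking pages that are themselves members of the category."""
    members = set(category_members)
    return {
        page: [source for source in sources if source in members]
        for page, sources in links_here.items()
        if page in members
    }


# Made-up sample: pages that link to each person, including non-members.
links_here = {
    "Person A": ["Person B", "Some City", "Person C"],
    "Person B": ["Some Company"],
    "Person C": ["Person A"],
}
category = ["Person A", "Person B", "Person C"]

connected = links_within_category(links_here, category)
print(connected)
```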

After querying the data through Wikipedia’s API, we reformatted it to better suit our needs. We first restructured the data into a dictionary in which each person is a key whose value is a sub-dictionary of relevant information (such as page views and links to the page). Wikipedia returns many duplicate or empty objects in response to such a large generator query, so this step filters out what we don’t need and combines what we do need into a usable format. We then built, for each person, a list of the links to their page from within the category and added it to that person’s dictionary entry. Finally, we exported all of the data as a pickle file, which backs it up so we don’t need to generate a new dataset every time we run our scripts.
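
The merge-and-back-up idea can be illustrated with a small sketch. The shape of the raw page objects, the key names pageviews and linkshere, and the function merge_pages are assumptions made for this example; the project's actual processing lives in get_data.py. An in-memory pickle round trip stands in for writing the file.

```python
import pickle


def merge_pages(raw_pages):
    """Combine partial page objects that share a title into one entry per person."""
    merged = {}
    for page in raw_pages:
        title = page.get("title")
        if not title:
            continue  # skip empty objects
        entry = merged.setdefault(title, {"pageviews": None, "linkshere": []})
        if "pageviews" in page:
            entry["pageviews"] = page["pageviews"]
        for link in page.get("linkshere", []):
            if link not in entry["linkshere"]:  # drop duplicate links
                entry["linkshere"].append(link)
    return merged


# Made-up raw results containing an empty object and overlapping partial entries.
raw = [
    {"title": "Person A", "pageviews": 200},
    {},
    {"title": "Person A", "linkshere": ["Person B"]},
    {"title": "Person A", "linkshere": ["Person B", "Person C"]},
    {"title": "Person B", "pageviews": 1425, "linkshere": []},
]

merged = merge_pages(raw)

# Round-trip through pickle to show that the structure survives a backup.
restored = pickle.loads(pickle.dumps(merged))
print(restored == merged)
```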

For more information about how specific scripts work, please see the README.md file in the main folder.

Network Graph:

After formatting our data in a way that would allow us to understand how billionaires are connected, we began to consider the types of graphs we wanted to use.

We initially considered using a chord diagram, but quickly realized that the resulting graph was too cluttered and made it difficult to see clusters of connections.

We instead decided to use a network graph. While we initially used a 2D network graph, we again found the graph to be too cluttered, and the ordering algorithm struggled to yield interesting groupings of billionaires.

We finally settled on using a 3D network graph, where the additional dimension and the ability to rotate the view allowed for a much clearer understanding of the relationships between the billionaires. Furthermore, the additional dimension allowed the ordering algorithm to more effectively group clusters together.

Note: For more information about the family of ordering algorithms related to the one we used (Kamada-Kawai), please see the following article: https://en.wikipedia.org/wiki/Force-directed_graph_drawing
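
For reference, the Kamada-Kawai method models every pair of nodes as joined by a spring whose rest length is proportional to the shortest-path distance between them in the graph, and looks for positions that minimize the total spring energy. In its standard form, with notation chosen here for illustration ($p_i$ is the position of node $i$, $d_{ij}$ is the graph distance between nodes $i$ and $j$, and $L$ and $K$ are scaling constants):

```latex
E = \sum_{i < j} \frac{1}{2}\, k_{ij} \left( \lVert p_i - p_j \rVert - l_{ij} \right)^2,
\qquad l_{ij} = L\, d_{ij},
\qquad k_{ij} = \frac{K}{d_{ij}^{2}}
```

Because the energy depends only on distances between positions, the same formulation applies whether the positions are 2D or 3D.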

To create the 3D network graph, we used a library called igraph, which includes features for creating and ordering 3D networks. igraph builds these networks from two kinds of objects, called nodes and edges. Each node represents a person (such as Warren Buffett), while each edge represents a connection between two people (such as a connection from Warren Buffett to Bill Gates). Since our data is not naturally formatted this way when exported from get_data.py, we use a helper function called dict_to_nodes, which converts the people and connections found earlier to nodes and edges.
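
To make the conversion concrete, here is a hypothetical sketch of what such a helper might do. It is not the project's dict_to_nodes: the function name dict_to_nodes_sketch and the pageviews key are assumptions, while the nodes/links output shape and the linkshere_within_category key follow what the cells in this essay use.

```python
def dict_to_nodes_sketch(links_dict):
    """Convert a {person: info} dictionary into node and link lists (hypothetical)."""
    names = list(links_dict)
    index = {name: i for i, name in enumerate(names)}  # person -> node number

    nodes = [{"name": name, "group": links_dict[name]["pageviews"]} for name in names]
    links = [
        {"source": index[source], "target": index[name]}
        for name in names
        for source in links_dict[name]["linkshere_within_category"]
        if source in index  # ignore links from people who are not in the dictionary
    ]
    return {"nodes": nodes, "links": links}


# Made-up sample data.
sample = {
    "Person A": {"pageviews": 200, "linkshere_within_category": ["Person B", "Person C"]},
    "Person B": {"pageviews": 1425, "linkshere_within_category": []},
    "Person C": {"pageviews": 60, "linkshere_within_category": ["Person A"]},
}

graph_data = dict_to_nodes_sketch(sample)
print(graph_data["links"])
```

Edges refer to nodes by position in the node list rather than by name, which is the form an integer edge list for a graph library expects.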

In the cell below, you can see that we begin by unpacking the pickle file containing the data we collected and formatted earlier in get_data.py. We then trim the number of people that we want to graph down from ~900 to 150 since we found that this yields a good balance between interesting connections and reducing clutter. Finally, we use the helper function dict_to_nodes to convert the dictionary of billionaires into a node and edge format that igraph can understand.

import igraph as ig     # used to generate the layout
import pickle           # used to import our backed up data
import numpy as np      # used to get logarithmic color scales later
import plots_config     # config options for plots stored separately here to avoid cluttering the notebook
import helpers          # helper functions for processing the data

with open('data/billionairesdict.pkl', 'rb') as pickle_file:   # close the file once loaded
    full_links_dict = pickle.load(pickle_file)
links_dict = helpers.trim_dict(full_links_dict, 150)
data = helpers.dict_to_nodes(links_dict)
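
The trimming step might look something like the sketch below. This is an assumption about the behavior, not the project's helpers.trim_dict: it keeps the most viewed people and drops links from anyone who did not make the cut, and the pageviews key and the name trim_dict_sketch are hypothetical.

```python
def trim_dict_sketch(links_dict, count):
    """Keep the `count` most viewed people and only the links among them (hypothetical)."""
    top = sorted(
        links_dict, key=lambda name: links_dict[name]["pageviews"], reverse=True
    )[:count]
    keep = set(top)
    return {
        name: {
            **links_dict[name],
            "linkshere_within_category": [
                source
                for source in links_dict[name]["linkshere_within_category"]
                if source in keep
            ],
        }
        for name in top
    }


# Made-up sample data.
sample = {
    "Person A": {"pageviews": 200, "linkshere_within_category": ["Person B", "Person C"]},
    "Person B": {"pageviews": 1425, "linkshere_within_category": []},
    "Person C": {"pageviews": 60, "linkshere_within_category": ["Person A"]},
}

trimmed = trim_dict_sketch(sample, 2)
print(list(trimmed))
```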

Here, we further unpack the nodes and edges.

L=len(data['links'])
Edges=[(data['links'][k]['source'], data['links'][k]['target']) for k in range(L)]

G=ig.Graph(Edges, directed=False)

The cell below shows the information contained within a node. The group value is the number of page views that the page has received. This lets us color code the nodes later based on the number of views.

data['nodes'][0]
{'name': 'Elon Musk', 'group': 4279914}
labels=[]
group=[]
for node in data['nodes']:
    labels.append(node['name'] + ' | Links to this page: ' + str(len(links_dict[node['name']]['linkshere_within_category'])))
    group.append(node['group'])
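
The group values span several orders of magnitude, which is why the plotting code further down takes a base-10 logarithm of them before using them as marker colors. The small sketch below shows the effect, using the standard library's math.log10 in place of NumPy. Only the first view count comes from the node shown above; the other two are made up.

```python
import math

views = [4_279_914, 250_000, 12_000]  # first value from the node above, others made up

# Linear scaling: everything except the largest value is squashed toward zero.
linear = [v / max(views) for v in views]

# Logarithmic scaling, normalized to the 0-1 range for comparison.
logs = [math.log10(v) for v in views]
log_scaled = [(x - min(logs)) / (max(logs) - min(logs)) for x in logs]

for v, lin, log in zip(views, linear, log_scaled):
    print(f"{v:>9} views  linear={lin:.3f}  log={log:.3f}")
```

On a linear scale the less viewed pages would be nearly indistinguishable in color, while the logarithmic scale spreads them across the color range.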

Below, we use igraph’s built-in layout feature to assign each node a point in 3D space using the Kamada-Kawai algorithm. Here you can see the 3D coordinates for Elon Musk’s node.

layt=G.layout('kk', dim=3)
layt.scale(100)
layt[0]
[-55.75683663386775, 76.25895838948244, 261.8611831673949]

Next, we set up lists of the coordinates for the actual plot generation, since plotly needs them in a slightly different format than igraph provides.

Xn=[layt[k][0] for k in range(len(layt))]# x-coordinates of nodes
Yn=[layt[k][1] for k in range(len(layt))]# y-coordinates
Zn=[layt[k][2] for k in range(len(layt))]# z-coordinates
Xe=[]
Ye=[]
Ze=[]
for e in Edges:
    Xe+=[layt[e[0]][0],layt[e[1]][0], None]# x-coordinates of edge ends
    Ye+=[layt[e[0]][1],layt[e[1]][1], None]# y-coordinates
    Ze+=[layt[e[0]][2],layt[e[1]][2], None]# z-coordinates
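
The edge lists above place a None after every pair of endpoints. Plotly treats None as a gap in a line trace, so all edges can share a single trace while still being drawn as separate segments. A self-contained sketch of the same pattern on made-up coordinates (edge_coordinates is a hypothetical helper name):

```python
def edge_coordinates(positions, edges, axis):
    """Flatten edge endpoints along one axis, with None separating the edges."""
    coords = []
    for source, target in edges:
        coords += [positions[source][axis], positions[target][axis], None]
    return coords


# Made-up 3D positions for three nodes and two edges between them.
positions = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
edges = [(0, 1), (1, 2)]

xe = edge_coordinates(positions, edges, 0)
ze = edge_coordinates(positions, edges, 2)
print(xe)
```
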
import plotly.graph_objs as go      # we use plotly to create our graph
from plotly.offline import iplot    # plotly has a separate offline rendering module that isn't used by default, but we need it here.

trace1=go.Scatter3d(x=Xe, y=Ye, z=Ze, **plots_config.network_plot_config['trace1'])

trace2_config = plots_config.network_plot_config['trace2']
trace2_config['marker'].update(dict(color=np.log10(group)))
trace2=go.Scatter3d(x=Xn, y=Yn, z=Zn, text=labels, **trace2_config)

axis = plots_config.kk_axis_config

layout_scene=dict(xaxis=dict(axis), yaxis=dict(axis), zaxis=dict(axis), bgcolor='rgb(22,16,25)')
layout_config = plots_config.network_layout_config
layout_config['scene'] = layout_scene
layout = go.Layout(**layout_config)
traces=[trace1, trace2]    # separate name so the node/link dictionary in `data` is not overwritten
fig=go.Figure(data=traces, layout=layout)

iplot(fig)