Finding important connections in a network - automatically
One of the domains for which data lends itself well to be represented as a graph is trade. We can take any tradable good, represent the trading actors as nodes, and represent the (amount of) traded goods as (properties of) edges. As an example, the graph below displays the international trade of fish in 1998 (source data), where the nodes represent countries, the sizes of nodes represent the total exports per country, and the edges represent the fact that the edge's source country exported a given amount of fish to the destination country:
As international trade is influenced by various factors such as local supply and demand, tariffs, or currency exchange rates, it might be interesting to see which trading routes between countries are preferred over any alternatives. For example, zooming in onto some of the bigger players,
we'd like to know what the preferred trading routes are between the neighboring countries UK, Iceland and Norway. In general, it would be nice if we could discover such routes automatically for the global trade network at once. How can this be accomplished?
The simple solution: filtering on volume
As a first experiment, we could try to apply a filter that drops all connections for which the trading volume does not belong to the top 1 %. This leads to the following result (after re-applying layout for the graph):
We can learn from this that there's only a small group of countries between which most fish trade takes place, in absolute terms. However, we cannot answer the question stated before, because by applying the volume based filter, we already lost any connection between Iceland and our "core countries" (and more than 80% of the other countries that are not displayed in above picture). Continuing the experiment, we can lower the filter to let the connections with the top 5 percent trade volumes pass, which leads to the following result:
As can be seen, lowering the volume threshold only increased the complexity of the network between the large exporters/importers of fish, and hardly provided new insights into preferred trade routes with countries in the periphery of the network. From this we can conclude that simple filtering on trade volume has two major problems:
- Based on the chosen threshold it either completely discards most of the network, or adds enough complexity to the connected part to make it hard to find the preferred routes,
- The choice of a particular threshold requires domain knowledge in the best case and is fully arbitrary in the worst case.
A more advanced (but elegant) solution: link salience
- Choose a property that defines the weight of an edge in a given network,
- Define the distance between two nodes as 1 / the weight,
- Using above distance measure, take a node and compute the shortest path to all other nodes in the network,
- Combine all edges from the previous step into a set, which is called the Shortest Path Tree (SPT),
- For each edge in the SPT, increase a "salience" counter,
- Repeat steps 3-5 for every other node in the network,
- For each edge, divide its salience counter by the total number of nodes in the network, leading to a salience property with a value in the range [0..1].
The paper claims that (a.o.) for networks where node degrees and edge weights are distributed according to a power law, the edges' salience will display a bimodal distribution, with clusters close to the 0 and 1 values. This is a nice property because it now can be used as a (near-) binary classifier for importance of network connections. For example, if we take the same fish network and filter connections based on the top 5% link salience (instead of our previous attempt, i.e. edge filtering based on absolute weight), we get the following picture:
And zooming in to find an answer to our original question, we find out that Iceland prefers to import fish from Norway rather than directly from the UK:
As we can see, salience-based filtering has two distinct advantages compared to filtering on absolute edge weights:
- It doesn't require any parameters, the only domain knowledge required is when choosing which property to use as edge weight for computing salience,
- On a natural network as discussed here, taking the top few percent of the edges with highest salience leads to a nearly fully connected network where most routes between any two nodes are unique.
The link salience paper discussed above mentions several applications such as airline traffic and food chains. Discovering the salient links in airline traffic may help contain disease outbreaks, while for food chains it may teach us how removal of certain nodes may disrupt a complete food chain.
Another interesting domain is the analysis of IT networks: by discovering the salient links in a complex network (e.g. based on network traffic), we can learn for which connections it is most important to have a low latency, focussing our performance optimization efforts on these connections.