It takes 20 seconds to calculate the MST for a 500 node, ~2000 edge graph.

One way that we can improve the performance of the algorithm is by storing the MST in a union find/disjoint set data structure.

To refresh, Kruskal’s algorithm does the following:

- sort edges E in order of increasing cost
- let T = minimum spanning tree, m = number of edges
- for i=1 to m:
- let e = selected edge (u,v)
- if there’s no way to go between u and v using edges in T:
- add e to T

- return T

In the first version of the algorithm we stored the MST in an array and then we did a depth first search to check that adding an edge to the array wasn’t going to create a cycle which would mean we no longer had an MST.

As I understand it the way that we use the union find data structure is that we start out with n connected components i.e. every node is in its own connected component.

We then merge these components together as we come across new edges which connect nodes in different components until we only have one component left at which point we should have our MST.

- sort edges E in order of increasing cost
- initialise union-find uf with n connected components
- let T = minimum spanning tree, m = number of edges
- for i=1 to m:
- let e = selected edge (u,v)
- if u and v not in same connected component in uf:
- merge u and v into same connected component in uf
- add e to T

- return T

I came across this bit of code written by Michael Luckeneder which implements the union find data structure in Ruby and adapted that slightly to fit the nomenclature used in the Algorithms 2 videos.

The union find data structure looks like this:

class UnionFind def initialize(n) @leaders = 1.upto(n).inject([]) { |leaders, i| leaders[i] = i; leaders } end def connected?(id1,id2) @leaders[id1] == @leaders[id2] end def union(id1,id2) leader_1, leader_2 = @leaders[id1], @leaders[id2] @leaders.map! {|i| (i == leader_1) ? leader_2 : i } end end

We have two methods:

- connected? which we use to check whether or not two nodes are in the same connected component.
- union which we use to put two nodes into the same connected component.

The way it’s implemented is that we have a collection of ‘leaders’ and initially each node is its own leader. As we find edges which belong in the MST we call union passing the two nodes as arguments. After this method executes every node which has the same leader as the first node will now instead have the same leader as the second node.

For example if we have a simple graph with edges 1 -> 2 -> 3 -> 4 -> 5 our initial union find data structure would look like this:

> uf = UnionFind.new 5 => #<UnionFind:0x45e5a9b3 @leaders=[nil, 1, 2, 3, 4, 5]>

At this stage each node is in its own connected component but if we process the edge 1 -> 2 we’d first check if those two nodes were already in the same connected component:

> uf.connected?(1,2) => false

Given that they aren’t we can call union:

> uf.union(1,2) => [nil, 2, 2, 3, 4, 5]

As we can see from the output nodes 1 & 2 now both have the same leader since they are in the same connected component while the other nodes still remain on their own.

We could then process the 2 -> 3 edge which would put nodes 1, 2 & 3 together:

> uf.union(2,3) => [nil, 3, 3, 3, 4, 5]

The outline for Kruskal’s algorithm which makes use of this data structure is like so:

set = UnionFind.new number_of_nodes @minimum_spanning_tree = [] edges = file.drop(1). map { |x| x.gsub(/\n/, "").split(" ").map(&:to_i) }. map { |one, two, weight| { :from => one, :to => two, :weight => weight}}. sort_by { |x| x[:weight]} edges.each do |edge| if !set.connected?(edge[:from], edge[:to]) @minimum_spanning_tree << edge set.union(edge[:from], edge[:to]) end end puts "MST: #{@minimum_spanning_tree}" puts "Cost: #{@minimum_spanning_tree.inject(0) { |acc, x| acc + x[:weight]}}"

This version of the algorithm runs in 1.9 seconds, a significant improvement on the initial version. The full code is on github as usual!