ClosestRabbit

We are given an $$N \times M$$ rectangular board, where each cell is either empty or full ($$1 \leq N,M \leq 20$$). We define the distance between cells $$(i_1,j_1)$$ and $$(i_2,j_2)$$ to be $$(i_1-i_2)^2 + (j_1-j_2)^2$$.

In addition, there are $$R$$ rabbits who randomly assign themselves to the empty cells so that there is at most one rabbit per empty cell. After being assigned, we consider a graph with $$R$$ nodes, each node corresponding to a different rabbit. For each rabbit $$a$$ we add an edge to the closest rabbit $$b$$. More specifically, if a rabbit is assigned to cell $$(i_1,j_1)$$, we consider the rabbit who is assigned to $$(i_2,j_2)$$ such that the distance between $$(i_1,j_1)$$ and $$(i_2,j_2)$$ is minimal (as defined above), and we add an edge between the two corresponding nodes. If there is a tie, we select the rabbit whose $$(i_2,j_2)$$ is lexicographically smallest. We do this for each rabbit (in particular, there are exactly $$R$$ edges in this graph; there may be duplicate edges, but there will be no "self-loop" edges).

Given the description of the board, and the integer $$R$$, find the expected number of connected components in this graph.

(Please see the problem statement for further clarification)

1. Grid / Checker-board
2. Grids $$\Rightarrow$$ bipartite graph?
3. DP?
4. Matching?
5. Probability / Expectation / Linearity of Expectation, etc.
6. Random Permutation (i.e.: order doesn't matter)
7. MST / Minimum Spanning Tree?
8. Forests with Cycles
Notation: If rabbit $$y$$ is the closest to rabbit $$x$$ (under the tie-breaking rules), and we add an edge between rabbit $$x$$ and rabbit $$y$$, then we say $$x$$ points to $$y$$.
This section describes some key observations needed to come to the solution. For a concise description of the final algorithm, you may skip this and go to the Solution Summary below. Otherwise, please continue reading if you would like to follow the problem-solving process.
Looking at examples or drawing some pictures of the graphs are extremely useful for understanding the problem and finding obvious patterns. In this case, it was fairly helpful; drawing a few examples of boards (the sample test-cases) and their related graphs led to the following observation.
Every component of the graph will contain exactly one cycle.
Start with a single rabbit. Follow its "closest-rabbit" edge to another rabbit. Then follow this rabbit's "closest-rabbit" edge. And so on... Eventually you will run out of unique rabbits, so the process must cycle. This shows that every component has at least one cycle. And a similar argument shows that there is no more than one cycle in each component. (See the Related Problems section below for other problems / theorems related to this concept)
Note that the graph always has $$R$$ edges and $$R$$ nodes. Any graph with this property must consist of a collection of components, each containing exactly one cycle. Knowing this requires having some experience with Graph Theory.
From the above observation, we get a good characterization:
The number of components is exactly equal to the number of cycles in the graph.

At this point, I began looking at the structure of the cycles (based on the above characterization). Intuitively, based on the fact that the edges are to the "closest" neighbours, it seemed like there can't be very long cycles. For example, consider a triangle of rabbits (a cycle of length three) $$[a,b,c]$$. Then $$b$$ is the closest neighbour of $$a$$, and $$c$$ is the closest neighbour of $$b$$, and $$a$$ is the closest neighbour of $$c$$. Intuitively, it seems like they all must be closest neighbours of each other. It has to be an equilateral triangle.

We can actually formalize and prove this.
Consider any cycle in the resulting graph. Then all the rabbits/nodes on this cycle must have the same distance to each of the other rabbits.

Suppose we have a cycle of nodes $$[v_1,v_2,...,v_k,v_1]$$. Without loss of generality, assume that $$v_1$$ points to $$v_2$$ who points to $$v_3$$ and so on (where "points to" is defined above). Then $$v_2$$ is the closest neighbour to $$v_1$$. In particular, this tells us that $$|v_1v_2| \leq |v_1v_k|$$ (where $$|v_iv_j|$$ is the euclidean distance between the corresponding cells of $$v_i$$ and $$v_j$$). Similarly, $$v_3$$ is the closest neighbour to $$v_2$$, so we know that $$|v_2v_3| \leq |v_1v_2|$$.

In general, we get that $$|v_iv_{i+1}| \leq |v_{i-1}v_i|$$. Applying all of these inequalities we get: $$|v_1v_2| \leq |v_kv_1| \leq |v_{k-1}v_k| \leq ... \leq |v_{2}v_3|$$. But we have already stated that $$|v_2v_3| \leq |v_1v_2|$$, so this tells us that $$|v_1v_2| = |v_2v_3|$$ and (by symmetry) $$|v_2v_3| = |v_3v_4|$$, and so on.

Altogether, this says that all edges must be equal on this cycle.

This shows some nice structure to the graph. All the cycles are "equilateral". Based on this, intuitively, it seems like there cannot be very long cycles. For example, have you ever seen a four-sided figure (in two dimensions) where all the points are the same distance to each other? Also, even a triangle doesn't seem to work because of the "always pick the neighbour with the lowest row" rule (stated in the problem-statement). Altogether, we have the following key observation.

No cycle in the graph can have more than two nodes on it.
This is actually a result of the previous lemma. Suppose you have any cycle. Consider the node in the cycle whose cell has the lowest row in the original grid (and lowest column, in case of ties). Then all other nodes on this cycle would "point to" this node (by the tie-breaking rule). This node would point only to one other node, so there can only be a cycle of two.
I wasn't immediately sure about whether this was true. I simply wrote it down as a "hunch" (an intuition), and informally proved it afterwards. This is sometimes useful to explicitly write down your lemmas.
These observations about the cycles are really nice. Altogether we have discovered that:
1. There is exactly one cycle per component, so
2. The number of components is equal to the number of cycles; and
3. Every cycle must have exactly two nodes on it
So we can make the final observation:
The number of components is exactly equal to the number of pairs of rabbits that mutually point to each other.
Follows from everything prior.
We have transformed the problem from a question of "components" to a question about "edges".
These are really beautiful observations. There are still a few steps needed to take this idea to a final solution, but I encourage the reader to attempt to solve the problem now if you have read this far!

Experience solving probability problems is (obviously) extremely helpful in solving this problem. You should likely be familiar with the terms: "Random Variable", "Sample Space", "Expected Value / Expectation", and so on. Also, one of the most powerful results from Probability theory is the "Linearity of Expectation" rule. In particular, if $$X$$ and $$Y$$ are two random variables, $$E[X+Y] = E[X] + E[Y]$$ (where $$E[\cdot]$$ denotes expected value). Many problems can be solved using these fundamental techniques.

If you have no idea about any of these terms, consult Introduction to Algorithms by Thomas H. Cormen, et al. Appendix C (of the Third Edition, at least) has a very nice review of Probability and Counting.

Based on this "experience", the next step in this problem is to try and compute this Expectation (i.e.: the Expected number of components in the graph) using these probability techniques

We want to find the expected number of components in the graph. It would be nice if we could break this number down into the sum of other random variables, and apply the Linearity of Expectation. To begin, we make the following key observation (note: this is the key result of the previous Idea Section):
The number of components in the final graph is exactly equal to the number of pairs of rabbits that mutually point to each other.
Omitted. See the previous Idea Section if you would like to see the proof.

This gives us a nice characterization of the solution. To further exploit this, let's introduce some notation.

Consider the final arrangement of rabbits. We let the random variable $$C$$ be the number of components in the resulting graph.

Now, the rabbits are basically symmetric. If two rabbits mutually point to each other, then it is because they are in two cells with no other rabbits in a closer cell (to either one). So we get the following:

Consider the final arrangement of rabbits. Let $$p$$ and $$q$$ be two empty cells. And let $$Y_{pq}$$ denote the "indicator" for whether the rabbits in $$p$$ and $$q$$ consider each other to be mutually closest. In other words: $Y_{pq} = \left\{\begin{array}{ll} 1 & : \text{if the rabbit in cell p points to the rabbit in cell q and vice-versa} \\ 0 & : \text{otherwise} \end{array}\right.$ Note: By convention, $$Y_{pq} = 0$$ if there is no rabbit in $$p$$ or if there is no rabbit in $$q$$.

With these two definitions, we can now write out the formula for the expectation:

$\text{answer} = E[C] = \sum_{p \in S} \ \ \sum_{q \in S, q \neq p}{E[Y_{ab}]}$
It now remains to solve a simpler problem: how do we find $$E[Y_{ab}]$$ for two empty cells $$a$$ and $$b$$?

This can actually be done by counting.

If we have two cells $$a$$ and $$b$$, consider all the cells $$c$$ that would be close enough to $$a$$ or $$b$$ to cause either $$a$$ or $$b$$ to point to $$c$$. If there is a rabbit in any of these cells $$c$$, then this will cause either $$a$$ or $$b$$ to point away from each other (to $$c$$), and they will not form the two-cycle. Otherwise, if there is no rabbit in any of these cells $$c$$, then $$a$$ and $$b$$ must point to each other, and form the two-cycle.

So, the variable $$Y_{ab}$$ will be $$1$$ if and only if none of these cells $$c$$ are assigned a rabbit!

How do we find these cells? Well, by manually checking! We can check all other empty cells against $$a$$ and $$b$$. For each cell $$c$$, we check if it is close enough (or causes a tie to be broken in a certain way) so that $$a$$ would point to $$c$$ if $$c$$ had a rabbit in it. This will generate a set of "bad points". Then the probability of $$Pr(Y_{ab} = 1)$$ is exactly the probability that this set has no rabbits in it. If there are $$B$$ bad cells, and $$S$$ is the set of empty cells in total (including $$a$$ and $$b$$), then there are $\binom{|S|-2-B}{R-2}$ ways of assigning $$R$$ rabbits to keep them empty (and to make sure $$a$$ and $$b$$ are occupied). And there are $\binom{|S|}{R}$ ways in total of assigning $$R$$ rabbits randomly.

So altogether, we get the following lemma.

For any two empty cells $$a,b \in S$$: $E[Y_{ab}] = \frac{\binom{|S|-2-B}{R-2}}{\binom{|S|}{R}}$ where $$R$$ is the number of rabbits, $$S$$ is the set of empty cells, and $$B$$ is the number of bad cells for the pair $$a,b$$.
See previous key-observation. We also use the fact that $$E[Y_{ab}] = Pr(Y_{ab} = 1)$$ (for $$0-1$$ variables)

Altogether, this actually gives us our final algorithm (after combining this result with the corollary above): For each pair of empty cells $$a$$ and $$b$$, compute the probability that there are rabbits in $$a$$ and $$b$$ and that they point to each other (this $$E[Y_{ab}]$$ from the formula above). Sum these up to get the final answer. See the Solution Summary below for the final pseudo-code and analysis.

Most of my initial observations were useless (although they were not necessarily incorrect). The most helpful ones ended up being the later ones, which I summarize as : "We have to use probability tricks, along with exploiting the structure of the graph."

With most probability problems, there is often a "random variable" whose expectation we are trying to find. Often-times by writing it out, you can find nice patterns. Usually, this one "random variable" turns into the sum of many smaller / simpler random variables (often variables that can only be equal to $$0$$ or $$1$$). You can then find the original expected-value as the sum of expected values of these variables (the "Linearity of Expectation"). (See below for more resources on probability problems).

First, we must be able to write down the random variable (the number of components) in a nice form. We start by looking at the structure of the components; and we notice that, because the graph has $$R$$ nodes and $$R$$ edges (where $$R$$ is the number of rabbits), every component of the graph contains a single cycle with "arms" (paths) attached. (Drawing a picture helps here.)

Upon further inspection, we notice that there cannot be any "long" cycles in this graph. This is because each edge corresponds to a "closest neighbour" for some node / rabbit, and we can prove that all the nodes on a given cycle must mutually be same distance to each other. (This was proved in Idea 1 above.) Because of the "tie-breaking" rule, if all of these nodes are equidistant to each other then they would all pick the same node. Using this argument we can show that there won't be a cycle of length three or more, because this would lead to some kind of contradiction. So every cycle has length exactly two (as there are no self-edges either).

The above arguments tell us: 1) There is exactly one cycle in every component; and 2) Every cycle has length exactly two (i.e.: it is a pair of nodes, that share a duplicate-edge). Altogether, this is enough to yield the characterization we need.

Let $$C$$ denote the number of cycles (this is a random variable that depends on the final arrangement of rabbits). For a pair of rabbits in cell $$a$$ and cell $$b$$, we define $$Y_{ab} := 1$$ if and only if the two rabbits are mutually closest to each other, or we set $$Y_{ab} = 0$$ otherwise. Based on the above arguments, in any configuration, we have that $$C = \sum Y_{ab}$$ over all pairs of distinct empty cells $$a$$ and $$b$$. Since our problem is to find $$E[C]$$ (the expected value of $$C$$), we can now (finally) use the Linearity of Expectation to get that: $E[C] = \sum E[Y_{ab}]$ over all pairs of distinct empty cells $$a$$ and $$b$$.

It is fairly easy to compute $$E[Y_{ab}] = Pr(Y_{ab} = 1)$$ by counting. Particularly, for two cells $$a$$ and $$b$$, consider a cell $$c$$ so that $$a$$ would point to $$c$$ instead of $$b$$ if both cells have a rabbit (for simplicity, we say that $$a$$ prefers $$c$$ over $$b$$). If $$c$$ has a rabbit on it, then $$Y_{ab} = 0$$ (obviously, since $$a$$ does not point to $$b$$). A similar truth holds if $$b$$ prefers $$c$$ over $$a$$. On the other hand, if we cannot find any such cell $$c$$ then $$Y_{ab}$$ must be 1, since $$a$$ and $$b$$ must point to each other. So altogether, if we let $$B := \{c : a \text{ prefers } c \text{ or } b \text{ prefers } c\}$$, then we can write $$Pr(Y_{ab} = 1)$$ as: $E[Y_{ab}] = \frac{\binom{|S|-2-B}{R-2}}{\binom{|S|}{R}}$ where $$S$$ is the set of all empty cells, $$B$$ is the set as described now, and $$R$$ is the number of rabbits. Note that the denominator $$\binom{|S|}{R}$$ is just the total number of configurations, and the numerator is the number of configurations that result in $$a$$ and $$b$$ mutually pointing to each other.

Altogether, our algorithm is to sum up the $$E[Y_{ab}]$$ values and return this number. For a fixed $$Y_{ab}$$ we only need to find the set $$B$$ (as described above). This can be done by manually checking all other empty cells $$c$$ to see if either $$a$$ or $$b$$ prefer $$c$$. Here is the pseudo-code.


choose(n,k):
return n! / k! / (n-k)!;

Y(a, b):
if (a==b): return 0
for each empty cell c:
# Check if a prefers c over b or if b prefers c over a
a_prefers_c = false
b_prefers_c = false
if distance(a,c) < distance(a,b) or (c beats b in the tie-breaker): a_prefers_c = true
if distance(b,c) < distance(b,a) or (c beats a in the tie-breaker): b_prefers_c = true

if a_prefers_c or b_prefers_c:

S := (total empty cells including a and b)
R := (total number of rabbits including those on a and b)

return (double) choose(S-2-bad_cells, R-2) / choose(S,R)

getExpected(board, R):
for each pair (a,b) of empty cells in board:

getExpected(board,R)
.

The running time of the algorithm is $$\Theta((NM)^3)$$, since we essentially check every triplet of cells once in the worst case. Also, be careful about precision. I would NOT use the "choose" function as described above, and also make sure to use doubles or long doubles, because the numbers can get pretty large.

1. There is exactly one cycle in each component. This comes from the structure of the graph. Also, any graph with $$n$$ nodes and $$n$$ edges must have this property (I think).
2. No cycle can contain more than two nodes. This can be found by exploiting the symmetry of cycles and the "nearest neighbour" graph.
3. Rabbits in cells $$a$$ and $$b$$ are mutually "closest" if and only if there is no other cell $$c$$ such that $$a$$ prefers $$c$$ or $$b$$ prefers $$c$$.
4. Let $$S$$ be the set of empty cells. Let $$a,b$$ be a pair of distinct empty cells. Let $$B$$ be as defined earlier. Let $$C$$ be the number of components, then $$E[C] = \sum E[Y_{ab}]$$, and $E[Y_{ab}] = \frac{\binom{|S|-2-B}{R-2}}{\binom{|S|}{R}}$
1. The first couple of key observations can easily be found by looking at example grids and/or example graphs.
2. Alternatively, you can find out these observations by focusing on the constraints (the structure of the closest-neighbour function)
3. Knowledge of Graph Theory is pretty useful here in being able to recognize the structure of the graph. Also, knowledge of Probability (such as Random Variables, Linearity of Expectation) was crucial to being able to write down the final solution.
4. With probability problems, it's often helpful to wrote down the random variables to find out exactly what the "expected value" corresponds to. Then we can use other techniques to modify and simplify the formula. It's also useful to write down every little observation you come across; they can be useful later.
5. We transformed our problem on components to a problem on cycles. Later, we transformed this to a problem of pairs of vertices.
6. I would say the entire "Linearity of Expectation" rule is an example of simplifying a problem: instead of trying to find $$E[C]$$, we broke it down into smaller variables: $$Y_{ab}$$ for each the expected values were easy to compute. This often works in general.
• This is the second time (for me) that I have come across a problem with Probability and Graph Theory in the last little while (I will write the other problem in the "Related Problems" section below). Both times, the solution was to reduce the Expected Value of one variable to the sum of expected values over pairs of nodes (i.e.: over potential edges in the graph). Maybe this is common?
• Linearity of Expectation is extremely powerful. Whenever you can write one random variable as the sum of other random variables: do it! It often reduces to the sum of "0-1" variables (often called "Indicator Variables"), and we can exploit the fact that $$E[X] == Pr(X=1)$$ for 0-1 variables.
• Probability / Expectation / Expected Value / Linearity of Expectation / 0-1 Indicator Variables
• Combinatorics / Counting / Number of Ways / Binomial Coefficients / Counting with Probability
• Graph Theory / Graph Construction / Closest Neighbour
• Random Graphs / Random Graph Construction
• Greedy / Observations / Exploit Structure and Conditions
• Math / Formulas and Proofs