Ghd

Given a set of $$n$$ integers ($$1 \leq n \leq 10^6$$), all between $$1$$ and $$10^{12}$$, find the largest integer which is a divisor of at least half of them.
  1. GCD
  2. Math
  3. Data Structures / Range Min Query / Segment Tree?
  4. DP?
  5. Prime Factorization
  6. Numbers are small, will probably need to do $$n \sqrt{n}$$ work
My initial idea was to somehow use the associativity and commutativity of gcd to simplify the calculations (i.e.: $$gcd(a,b,c) = gcd(gcd(a,b),c)$$). But this proved utterly useless, even after an hour of trying to wrap my head around the problem.

To make any useful gains, I made a problem transformation: what is the largest integer which is a gcd of some two of the given numbers? (For some reason, I thought that this would be an easier, or equivalently hard, problem.) For this simplified problem I noticed the following: if there are two elements $$a_i$$ and $$a_j$$ in your array that you have not explicitly checked against each other, then it is impossible to be certain of the answer. Why? Suppose you have computed the gcd of every pair of integers except $$gcd(a_i,a_j)$$. Then we could multiply both $$a_i$$ and $$a_j$$ by some sufficiently large prime that is not a factor of any of the numbers. This would completely change the final answer (the new answer would be a multiple of that huge prime), but it would not change any of the previously computed gcd's of the other pairs. So, unless we explicitly check every pair $$a_i,a_j$$ and find their gcd, there is always some "missing information", and there is no way to know for sure whether our current answer is correct (because there are infinitely many inputs, with different answers, that are consistent with our current knowledge).

The argument above is what some people call an "information-theoretic" (or adversary) lower bound. I have just argued that any correct algorithm must take $$\Omega(N^2)$$ time (i.e.: it must check all pairs of integers), if the integers can be arbitrarily large. This is similar to the proof that any comparison-based sorting algorithm on $$n$$ elements must use $$\Omega(n \log n)$$ comparisons (I encourage the reader to go research this if they have never seen this type of argument before).

But obviously, $$N^2$$ time is far too slow for this problem. So this implies two things:
  1. We probably need to exploit the fact that all of our numbers are bounded (i.e.: no greater than $$10^{12}$$)
  2. We might need a non-deterministic algorithm (i.e.: randomized, etc.), or something similar
The first point implies that we will probably use a "square-root" trick: to completely prime-factorize a number $$n$$, or to list all of its factors, we only need to try divisors up to $$\sqrt{n}$$.
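
To make this concrete, here is a minimal sketch (my own illustration, not part of any submitted solution) of enumerating all divisors of a number by trial division up to its square root; for values up to $$10^{12}$$ this is at most about $$10^6$$ iterations per number:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// All divisors of n, found by trial division up to sqrt(n).
std::vector<int64_t> divisors(int64_t n) {
    std::vector<int64_t> divs;
    for (int64_t d = 1; d * d <= n; ++d) {
        if (n % d == 0) {
            divs.push_back(d);
            if (d != n / d) divs.push_back(n / d);  // paired divisor above sqrt(n)
        }
    }
    std::sort(divs.begin(), divs.end());
    return divs;
}
```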

The last little bit of insight comes from the definition of the problem. We know that our final answer will be a factor of at least half of the given items. So, if we were to choose a non-deterministic algorithm, we only need to check a small number of elements (or pairs, or whatever).

This led to my major idea (which was "Time-Limit-Exceeded", at best): check some fixed number of random pairs of items and compute their gcd's. With enough pairs (i.e.: like 10 to 20), it becomes very likely that at least one of these pairs consists of two multiples of the final answer (each random pair works with probability at least $$1/4$$). So, with high probability, the final answer is a factor of one of these gcd's. So, take these gcd's and factorize them. For each factor we find, we check it against all $$N$$ numbers and see how many it divides. We take the largest factor that divides at least half of them, and we print the result.
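
A rough reconstruction of this (too slow) idea is sketched below; it reuses the `divisors` helper from the previous snippet, and the constants (20 pairs, the RNG seed) are just placeholders for whatever I actually used:

```cpp
#include <cstdint>
#include <numeric>   // std::gcd (C++17)
#include <random>
#include <set>
#include <vector>

// TLE sketch: gcd a handful of random pairs, collect every divisor of those
// gcd's as a candidate, then test each candidate against all N numbers.
int64_t ghd_slow(const std::vector<int64_t>& a, int pairs = 20) {
    std::mt19937_64 rng(123456789);
    std::uniform_int_distribution<size_t> pick(0, a.size() - 1);

    std::set<int64_t> candidates;
    for (int t = 0; t < pairs; ++t) {
        int64_t g = std::gcd(a[pick(rng)], a[pick(rng)]);
        for (int64_t d : divisors(g)) candidates.insert(d);  // hundreds of divisors per gcd
    }

    int64_t best = 1;
    for (int64_t d : candidates) {                  // ~10^3 candidates...
        size_t cnt = 0;
        for (int64_t v : a) if (v % d == 0) ++cnt;  // ...times up to 10^6 elements: too slow
        if (2 * cnt >= a.size() && d > best) best = d;
    }
    return best;
}
```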

As mentioned, this idea resulted in TLE. Factorizing 20 numbers didn't take that much time; however, each number added hundreds of factors. So, in the end we had something like 1000 different factors. And we checked these against all $$N$$ given numbers, where $$N$$ can be up to $$1,000,000$$. Hence, we were doing approximately $$1,000,000,000$$ divisibility checks, which was far too slow (even with the 4-second time limit).

I tried many different variants of this algorithm. I also tried adding more probabilistic checks (for example, with some probability, we just ignore some factors). But every (probability-based) change I made to improve the time resulted in Wrong Answer instead. So, I was utterly stuck.

In the end (after half a day of attempting to solve the problem), I gave up and peeked at the judges' solution. Happily, I found that they too implemented a randomized algorithm using the very same observations as me. Their core ideas, however, were slightly different, and yielded a much better running time with the same (or better) correctness guarantees. In hindsight, I should have focused more on coming up with a "better algorithm" rather than just trying to tweak my own. That might have led to more success on this problem.

The final solution is as follows: pick a random element $$x$$ in the array. With probability 50% or greater, the final answer is a factor of this number $$x$$ (since at least half of the elements are multiples of the answer). So, for each other element $$a_i$$ in the array, compute $$gcd(x,a_i)$$. By definition, all of these gcd's are factors of $$x$$. Also, if $$y$$ is a factor of $$x$$ and $$y$$ is a factor of some $$a_i$$, then $$y$$ is a factor of $$gcd(x,a_i)$$. So, for each factor $$y$$ of $$x$$, we check it against all the computed gcd's and count how many of them are multiples of $$y$$. This count is exactly the number of elements in the original array that are multiples of $$y$$. And if this count is at least half of $$N$$, then $$y$$ is a candidate for the final answer. Now, notice that there will be a lot of duplicates among the gcd's: even with 1,000,000 elements, there are at most $$d(x)$$ distinct gcd values, since each one is a divisor of $$x$$. So, when comparing $$y$$ against all the gcd's, we should group the duplicates: check $$y$$ against each distinct value once, and add that value's multiplicity to the count.

For a fixed $$x$$, this means we check each factor of $$x$$ against each distinct gcd value, which is itself a factor of $$x$$. So, for a fixed $$x$$, we do $$O(N \log N) + O(d(x)^2)$$ work, where $$d(x)$$ is the number of factors/divisors of $$x$$. The $$O(N \log N)$$ comes from computing the gcd's with all the other numbers (Euclid's algorithm takes a logarithmic number of steps per gcd), and there is also an $$O(\sqrt{x})$$ term for enumerating the divisors of $$x$$ by trial division, which is at most about $$10^6$$ operations.

With probability 50% or greater, the answer returned from one such round is the correct final answer. If we repeat this with a few different choices for $$x$$ (maybe 6 or 7 times) and keep the largest candidate found, we will almost certainly get the correct answer in at least one of these rounds. So, if we let $$K$$ be the number of rounds, the whole algorithm takes $$O(K \cdot (N \log N + d(x)^2))$$ time. $$N \log N$$ is usually much bigger than $$d(x)^2$$ (a typical $$x$$ has only a few dozen divisors, although highly composite values up to $$10^{12}$$ can have a few thousand). So the running time is dominated by $$O(K N \log N)$$, and we get the correct answer with probability at least $$1 - \frac{1}{2^K}$$.
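
Putting the pieces together, here is a minimal self-contained sketch of my reading of this approach (the round count $$K = 7$$ and the seed are arbitrary choices of mine, not taken from the judges' code):

```cpp
#include <cstdint>
#include <map>
#include <numeric>   // std::gcd (C++17)
#include <random>
#include <vector>

// Randomized Ghd sketch: K rounds; each round pins a random element x and,
// using only the gcd's with x, counts for every divisor y of x how many
// array elements y divides.
int64_t ghd(const std::vector<int64_t>& a, int K = 7) {
    std::mt19937_64 rng(123456789);
    std::uniform_int_distribution<size_t> pick(0, a.size() - 1);
    const size_t need = (a.size() + 1) / 2;  // "at least half" of the elements

    int64_t best = 1;
    for (int round = 0; round < K; ++round) {
        const int64_t x = a[pick(rng)];

        // Distinct gcd values with multiplicities; every key is a divisor of x.
        std::map<int64_t, size_t> gcd_count;
        for (int64_t v : a) ++gcd_count[std::gcd(x, v)];

        // Divisors of x by trial division (the square-root trick from above).
        std::vector<int64_t> divs;
        for (int64_t d = 1; d * d <= x; ++d)
            if (x % d == 0) { divs.push_back(d); if (d != x / d) divs.push_back(x / d); }

        // y divides a_i  <=>  y divides gcd(x, a_i), so scan the distinct gcd's.
        for (int64_t y : divs) {
            if (y <= best) continue;  // cannot improve the current best candidate
            size_t cnt = 0;
            for (const auto& [g, c] : gcd_count)
                if (g % y == 0) cnt += c;
            if (cnt >= need) best = y;
        }
    }
    return best;
}
```

The choice of `std::map` here is incidental: an unordered map, or sorting the gcd's, works just as well. The only point is that the number of distinct gcd values is at most $$d(x)$$.
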
  1. If there were no limit to the size of the items, this problem would have to take $$\Omega(N^2)$$ time. This was the "information-theoretic" argument I used above. It was useful in showing that we need to exploit the fact that the numbers are (relatively) "small", or some other structure specific to this problem.
  2. Let $$g$$ be the supposed "Ghd" of the list (i.e.: the largest integer which divides at least half of the elements). If we pick an element $$x$$ from the list at random, then there is at least a 50/50 chance that $$g$$ is a factor of $$x$$. This is the observation needed to make any "randomized" algorithm work. At least half of the elements in the array are multiples of the final answer $$g$$, so we can randomly pick an element, find its factors, and check each one against the rest of the elements to see if it is a candidate for "Ghd".
  3. Suppose we have an element $$x$$. We can find, for each factor of $$x$$, the number of elements in the array that it divides. Once the gcd's are computed, this can be done in $$O(d^{2}(x))$$ time, where $$d(x)$$ is the number of divisors/factors of $$x$$. This was the observation I needed to make the algorithm run in time. Instead of checking each divisor explicitly against every element in the list, we notice that a divisor $$y$$ of $$x$$ also divides some other element $$a_i$$ if and only if $$y$$ divides $$gcd(x,a_i)$$. This means I don't have to explicitly check each $$y$$ against each $$a_i$$; I compute $$gcd(x,a_i)$$ for all $$a_i$$, which generates many duplicate gcd's. Then I check each factor $$y$$ of $$x$$ against each distinct gcd value (each of which is also a factor of $$x$$). Each time this other factor is a multiple of $$y$$, I add to my count the number of items it was the gcd for. This takes $$O(d^{2}(x))$$ time instead of $$O(d(x) \cdot N)$$ time. A small worked example follows this list.
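
To make observation 3 concrete, here is a tiny worked example (the numbers are mine, not from the problem): take $$x = 12$$, with divisors $$1, 2, 3, 4, 6, 12$$, and the array $$\{12, 18, 10, 9, 8, 15\}$$. The gcd's with $$x$$ are $$12, 6, 2, 3, 4, 3$$, i.e. the distinct values $$12, 6, 2, 4$$ each with multiplicity $$1$$, and the value $$3$$ with multiplicity $$2$$. For the divisor $$y = 3$$, the distinct gcd's that are multiples of $$3$$ are $$12$$, $$6$$ and $$3$$, giving a total count of $$1 + 1 + 2 = 4 \geq 6/2$$, which matches the four array elements ($$12, 18, 9, 15$$) divisible by $$3$$, so $$3$$ is a candidate. For $$y = 6$$ the count is only $$2$$ ($$12$$ and $$6$$), so $$6$$ is not a candidate. Scanning all six divisors this way touches only the $$5$$ distinct gcd values instead of all the elements, and the best candidate found from this $$x$$ is $$3$$, which is indeed the Ghd of this small array.
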
  1. I first tried to see if I could solve the problem: Find the largest integer which is a gcd of two elements in the list. This was a useful transformation because it yielded insight into the original problem, and it turned out (I think) to be of comparable difficulty to the original problem.
  2. This problem-solving process came up in a variety of forms. The major occurrence came from assuming we already know the "Ghd" and drawing information from that (for example, the simple fact that at least half the integers are multiples of this number).
  3. Although I wasn't "sure" that a randomized solution was needed, I decided to implement some basic forms of the randomized solution. Writing out this "naive" algorithm did indeed give me insights into the problem (in fact, all of the key insights needed to solve the problem); and it turned out that the approach I was using was the correct one, but my specific implementation was just too slow.

    In general, when I'm running out of time, it may be beneficial to say "well, this is the best I've got", and just write out the solution. Oftentimes this will yield an "almost correct" solution that can be tweaked into the correct one. Other times it may just yield some valuable insights into the structure of the problem. In many cases these gains would not be possible just by "staring at the problem"; they only reveal themselves when one starts to actually write out the solution.
  4. I looked at some examples early on because I knew I didn't really understand the problem.
  • "Randomized algorithms in a contest!?" - This is probably the first time I've had to use a randomized algorithm in a contest setting (most real contest-problems are deterministic, or require heuristics to improve speed, but are always guaranteed to find the correct answer). Actually, maybe "Hashing" algorithms technically count as "randomized", to some degree. In any case, it is cool to see that I should sometimes be thinking about this. In hind-sight it seems like the problem-setter of this contest enjoyed "randomized", non-deterministic, or "strategy" based problems (which I am terrible at), which usually don't have a single correct solution, but can be solved in a variety of different ways. So I probably shouldn't expect too many more problems like this one. But in general, I should understand when randomized structure is useful. In this problem, it was useful because (by the very definition of the problem), we were guaranteed something like: "50% of all integers will be multiples of your final answer". And so, probability became useful here.
  • "State bashing" - In DP problems, I have learned to "write out the solution", even if it is an inefficient algorithm, and focus afterward on improving the efficiency of the DP by exploiting its structure. The same principle -- which I have dubbed "State bashing" -- can be applied here. Specifically, I should assume that the major observations I've made are "almost enough" to get to the correct solution, and derive an algorithm based on that assumption. If it is too slow, I should focus on picking away at specific structure in the formulas or code that I have written, and find specific parts that can be easily augmented to provide the efficiency guarantees I want.
  • "Divisors of A which divide B are the same as divisors of gcd(A,B)" - this simple fact is all that was needed to make the final solution run in time. Instead of checking all divisors of A against B, we could simply just compute gcd(A,B) and then check all divisors of A which are also divisors of gcd(A,B) (or instead just check all divisors of gcd(A,B), which is the same). Anytime we have a problem that involves finding "combined" divisibility, I should be thinking about how to use the gcd's of the numbers in question, because the gcd has that nice property that ALL common divisors are divisors of the GREATEST common divisor (which may or may not be intuitive to most people). See the "related problems" below.
  • "Irregular ==> Brute force ish" - There were too many interrelated variables to do any kind of DP. These interrelated variables (which I like to call "irregularities"), imply that a nice clean "DP" or "Greedy" solution would not hold. This usually implies a graph, flow, or brute-force solution. In this case, I would call this "randomized" algorithm a (somewhat) brute-force one. However, it does require very nice argumentation to make it pass in time.
  • Number Theory / Factorization / Square-Root Trick
  • Probabilities / Randomized Algorithms / Monte-Carlo Algorithms
  • Number Theory / Divisibility / GCD / Greatest Common Divisor
  • Ad-hoc / Irregularity / Brute-Force