Index Sampling

Task: given pseudorandomness, sample pseudorandomly $k$ distinct but otherwise uniform indices from the interval $[0 : U)$ . This index sampling task is needed in several locations, not just in the STARK engine. Outside of the VM, this task can be achieved by rejection sampling until the list of collected samples has the requisite size (namely, $k$ ). Inside the VM, we like to avoid variable-length execution paths. This note explains how to do it using non-determinism.

Solution 1: Out-of-Order Divination

We want to sample exactly $k$ unique indices, but the sampler in the VM is allowed to divine counter values.

Outside VM

By rejection sampling until the list of samples has the requisite length I mean something like this:

    function sample_indices(count, upper_bound, randomness):
        list <- []
        counter <- 0
        while length(list) < count:
            integer <- Hash(randomness || counter)
            candidate_index <- integer % upper_bound
            if candidate_index in list:
                continue // reject
            else:
                list <- list || candidate_index
        return list

The loop needs to iterate at least count number of times. It is possible to parallelize this first step by separating the count first iterations from the variable-number follow-ups.

This python-like pseudocode hides important complexity behind the list membership test. To expand this logic, we need sorting and consecutive deduplication.

    function sample_indices(count, upper_bound, randomness):
        list <- []
        counter <- 0
        while length(list) < count:
            integer <- Hash(randomness || counter)
            candidate_index <- integer % upper_bound
            list <- sort(list || candidate_index)
            list.deduplicate()
        return list

Inside VM

In a nutshell, the VM nondeterministically guesses ("divines") the values of the counters out of order, checks that they are within allowed bounds, and then checks that the resulting list of indices is sorted and has no duplicates.

    function sample_indices(count, upper_bound, randomness):
        list <- []
        for i in [0:count):
            counter <- divine()
            assert(counter >= 0)
            assert(counter < count + margin)
            integer <- Hash(randomness || counter)
            index <- integer % upper_bound
            list <- list || index
            if i > 0:
                assert(list[i] < list[i-1])
        return list

The Virtual Machine supplies the right counters through divine(), having observed them from the out-of-VM execution.

The parameter margin should be large enough so that, with overwhelming probability, there is a set of count-many counters in [0:(count+margin)) that map to distinct indices. At the same time, the parameter margin should not be too large as it gives the malicious adversary more flexibility in selecting favorable indices.

Failure Probability

In order for the honest user to fail to find a good counter set, the list [0:(count+margin)) must contain at least margin+1 collisions under x -> Hash(randomness || x) % upper_bound. We continue by modeling this map as uniformly random.

The event margin+1 collisions or more implies that margin+1 samples were drawn from a subset of at most count-1 marked indices. The probability of the implied event is $Pr [# collisions \geq μ + 1] = (\frac{( k - 1 )}{U})^{μ + 1},$ where $U$ denotes upper_bound, $k$ denotes count, and $μ$ denotes margin.

Given the count and the upper_bound, the appropriate value for margin that pushes this failure probability below $2^{- λ}$ is $(\frac{( k - 1 )}{U})^{μ + 1} \leq 2^{- λ}$ $(μ + 1) \cdot lo g_{2} (\frac{( k - 1 )}{U}) \leq - λ$ $μ \geq \frac{- λ}{lo g _{2} ( \frac{( k - 1 )}{U} )} - 1 .$

For a concrete example, set $k = λ = 160$ and $U = 2^{32}$ . Then $μ$ needs to be at least 6.

Security Degradation

Suppose the user is malicious and hopes to conceal his fraud by selecting a set of indices that do not expose it. Suppose that the proportion of subsets of $[0 : U)$ of size $k$ that are suitable for the adversary is $ρ$ . Then clearly with the standard old index sampling method the attacker's success probability is bounded by $ρ \approx 2^{- λ}$ . The question is whether the improved index sampler enables a higher success probability (and if so, how much higher).

The attacker has access to at most $(μ k + μ)$ subsets of $[0 : U)$ of size $k$ . The probability that a given subset is suitable for the attack is $ρ$ , and so:

The probability that one subset is unsuitable for attack is $1 - ρ$ .
The probability that all $(μ k + μ)$ subsets are unsuitable for attack is $(1 - ρ)^{(μ k + μ)}$ . In this step we assume that whether a subset is suitable for attack, is independent from whether different subset is suitable for attack, even if the intersection of the two subsets is nonzero.
The probability that at least one subset is suitable for attack is $1 - (1 - ρ)^{(μ k + μ)}$ .

The binomial formula expands $(1 - ρ)^{(μ k + μ)} = i = 0 \sum (μ k + μ) (- 1)^{i} (i ( μ k + μ )) ρ^{i} .$ We assume that each subsequent term is smaller in absolute value than the previous because $ρ$ dominates. Indeed, plugging the well-known upper bound on the binomial coefficient $(k n) < (\frac{n \cdot e}{k})^{k}$ , we get $(i ( μ k + μ )) < (i ( \frac{( k + μ ) \cdot e}{μ} ) ^{μ}) < \frac{( \frac{( k + μ ) \cdot e}{μ} ) ^{μ} \cdot e}{i}^{i}$ and the assumption holds already when $(\frac{( k + μ ) \cdot e}{μ})^{μ} < ρ^{- 1}$ . For a concrete example, set $k = 160$ and $μ = 6$ , then the left hand side of this expression is roughly $2^{37.4}$ whereas these parameters target a security level of $λ \approx lo g_{2} ρ^{- 1}$ .

Using this assumption (on the shrinking absolute value of each successive term) we can derive an expression to quantify the security degradation. $Pr [attack] \leq 1 - (1 - ρ)^{(μ k + μ)}$ $= 1 - 1 + ρ \cdot (μ k + μ) - ρ^{2} \cdot (μ k + μ) \cdot (μ k + μ) \cdot \frac{1}{2} + \dots$ $\leq (μ k + μ) \cdot ρ .$

For a concrete example, set $k = 160$ and $μ = 6$ . Then $(μ k + μ) \approx 2^{34.6}$ and so we lose about $33$ bits of security.

Solution 2: Stupid Safety Margin

How about we sample $k + μ$ indices from the start, and use them all no matter what? We only need $k$ for security. The margin parameter $μ$ is chosen such that finding more than $μ$ collisions, which is required to undermine the security guarantee, is cryptographically improbable.

The benefit of this solution is that both the index samplers, i.e., inside and outside the VM, have a simple description. Furthermore, there is no longer a nonzero failure probability for honest users. The drawback is that more work is done than necessary, and the proofs are larger too.

So what is the right value of $μ$ ? It turns out that this probability matches with the correctness error derived for the previous solution. For redundancy this derivation is repeated here.

Sampling $μ + 1$ or more collisions requires hitting a marked subset of at most $k - 1$ indices $μ + 1$ times. The probability of this event is therefore $(\frac{k - 1}{U})^{μ + 1}$ . This probability is smaller than $2^{- λ}$ when $(\frac{k - 1}{U})^{μ + 1} \leq 2^{- λ}$ $(μ + 1) \cdot lo g_{2} (\frac{k - 1}{U}) \leq - λ$ $μ \geq \frac{- λ}{lo g _{2} ( \frac{k - 1}{U} )} - 1 .$