Latent Symbols & Converging Approximations

Deriving the Poisson Distribution

Apr 02, 2025

Let’s look at another example of deriving a standard probability distribution by starting with a finite premise expressing a finite approximation of our available information, then taking the limit as the amount of detail increases to infinity.

Events of some sort—say, requests arriving at a web server, or emails arriving in some mailbox—occur at an average rate of 𝜆 per unit time. Let 𝑘 be the number of events that occur during some given time interval of duration 𝑡, which we’ll call the query interval; what probabilities should we assign to each of the possible values of 𝑘?

Defining the premise and query

We analyze this problem as follows:

We interpret “average rate of 𝜆” to mean that 𝜆𝑇 events occur over some large time interval of duration 𝑇, which we’ll call the reference interval to distinguish it from the smaller query interval contained within.
We divide the reference interval into 𝑇/𝜀 small time slices of duration 𝜀, which is chosen to be small enough that we can assume at most one event occurs in any given time slice.
Let 𝑦ᵢ, 0 ≤ 𝑖 < 𝑇/𝜀, be a propositional symbol intended to mean “an event occurs during time slice 𝑖” (which begins at time 𝑖𝜀 and ends at time (𝑖+1)𝜀).
Let 𝑑ₖ, 𝑘 ≥ 0, be a propositional symbol intended to mean that exactly 𝑘 of the 𝑦ᵢ, for 𝑖 in the query interval, are true.
Let 𝑠 be the start of the query interval.
We assume that 𝜆,𝑠,𝑡 ≥ 0, 𝑠 + 𝑡 ≤ 𝑇, and 𝑇,𝜀 > 0.
After analyzing the case for finite 𝑇 and nonzero 𝜀, we let 𝑇 → ∞ and 𝜀 → 0.

A technical complication is that 𝜆𝑇, 𝑇/𝜀, 𝑠/𝜀, and (𝑠+𝑡)/𝜀 are not necessarily integers, although they are used as such. To address this we define:

\(\begin{align*} n & \triangleq\mathrm{round}\left(\lambda T\right)\\ & \quad\mbox{“number of events in reference interval”}\\ M & \triangleq\mathrm{round}\left(T/\epsilon\right)\\ & \quad\mbox{“number of time slices in reference interval”}\\ m & \triangleq\mathrm{round}\left(\left(s+t\right)/\epsilon\right)-\mathrm{round}\left(s/\epsilon\right)\\ & \quad\mbox{“number of time slices in query interval”}\\ \mathrm{round}(x) & \triangleq x\mbox{ rounded to the nearest integer} \end{align*}\)
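As a concrete illustration, these definitions can be computed directly. This is only a minimal sketch: the names `round_nearest` and `slice_counts` and the example values are hypothetical, and since the definition of round does not say how exact halves are treated, the helper assumes they round up (Python's built-in `round` would instead send them to the nearest even integer).

```python
import math


def round_nearest(x: float) -> int:
    """Round x to the nearest integer; ties round up (an assumption)."""
    return math.floor(x + 0.5)


def slice_counts(lam: float, T: float, eps: float, s: float, t: float):
    """Return (n, M, m) following the definitions in the text.

    n: number of events in the reference interval
    M: number of time slices in the reference interval
    m: number of time slices in the query interval
    """
    n = round_nearest(lam * T)
    M = round_nearest(T / eps)
    m = round_nearest((s + t) / eps) - round_nearest(s / eps)
    return n, M, m


# Hypothetical values: rate 2 per unit time, reference interval of length 100,
# slices of width 0.01, query interval starting at 10 with duration 3.
n, M, m = slice_counts(lam=2.0, T=100.0, eps=0.01, s=10.0, t=3.0)
print(n, M, m)    # events, reference slices, query slices
print(n * m / M)  # should be close to lam * t
```

Taking 𝑚 as a difference of two rounded endpoints, rather than rounding 𝑡/𝜀 directly, keeps the query slices aligned with the slice boundaries of the reference interval.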

We also define these:

𝑌 is a propositional formula stating that exactly 𝑛 of the 𝑦ᵢ, 0 ≤ 𝑖 < 𝑀, are true.
For each 𝑘 ≥ 0, 𝐷ₖ is a propositional formula stating that exactly 𝑘 of the 𝑦ᵢ in the query interval are true.
Our premise 𝑋 is the propositional formula
\(Y\wedge\bigwedge_{k=0}^{m}\left(d_{k}\leftrightarrow D_{k}\right),\)
which adds to 𝑌 definitions of the symbols 𝑑ₖ.

Our potential queries are the 𝑑ₖ.

Probabilities in the finite case

There are

\({M \choose n}\)

ways of satisfying 𝑋: there are 𝑀 time slices, 𝑛 of which must contain an event. Furthermore, there are

\({m \choose k}{M-m \choose n-k}\)

ways of satisfying 𝑋 ∧ 𝑑ₖ: there are 𝑚 time slices in the query interval, 𝑘 of which must contain an event, and there are 𝑀-𝑚 remaining time slices in the reference interval, 𝑛-𝑘 of which must contain an event. Therefore, by the EPL Theorem,

\(\Pr\left(d_{k}\mid X\right)=\frac{{m \choose k}{M-m \choose n-k}}{{M \choose n}}.\)

This, it turns out, is just the hypergeometric distribution probability

\(\mathrm{Hypergeom}\left(k\mid n,m,M\right)\)

for 𝑘 successes when doing 𝑛 draws without replacement from a population of size 𝑀 containing 𝑚 success cases. To see why, think of the 𝑛 events as random draws, the 𝑀 time slices in the reference interval as the population from which we draw, and the 𝑚 time slices in the query interval as the success cases.
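This urn-style reading can be checked by simulation. The sketch below (all names and parameter values are hypothetical, chosen small for speed) scatters 𝑛 events over 𝑀 slices uniformly without replacement, counts how many land in the 𝑚 query slices, and compares the observed frequencies with the formula.

```python
import math
import random


def hypergeom_pmf(k: int, n: int, m: int, M: int) -> float:
    """Ratio of binomial coefficients from the text; 0 outside the valid range."""
    if k < 0 or k > m or n - k < 0 or n - k > M - m:
        return 0.0
    return math.comb(m, k) * math.comb(M - m, n - k) / math.comb(M, n)


def simulate(n: int, m: int, M: int, first: int, trials: int, rng: random.Random):
    """Estimate the distribution of k by placing n events in M slices at random.

    The query interval is taken to be slices first, ..., first + m - 1.
    """
    counts = [0] * (m + 1)
    for _ in range(trials):
        occupied = rng.sample(range(M), n)  # n distinct slices, no replacement
        k = sum(first <= i < first + m for i in occupied)
        counts[k] += 1
    return [c / trials for c in counts]


rng = random.Random(0)  # fixed seed so the run is repeatable
n, m, M = 20, 30, 200
estimate = simulate(n, m, M, first=50, trials=20_000, rng=rng)
for k in range(8):
    # columns: k, simulated frequency, formula value
    print(k, round(estimate[k], 4), round(hypergeom_pmf(k, n, m, M), 4))
```

Where the query interval sits inside the reference interval (the `first` argument) should make no difference to the result, which matches the fact that 𝑠 enters the formula only through 𝑚.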

Technical note: for the above to make sense, we need 𝑘 ≤ 𝑚, 𝑛 ≤ 𝑀, and 0 ≤ 𝑛-𝑘 ≤ 𝑀-𝑚, all of which hold true for 𝑇 > 𝑡 sufficiently large and 𝜀 sufficiently small.
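For small cases the counting argument can also be verified exhaustively. The following sketch (hypothetical names and sizes) enumerates every way of making exactly 𝑛 of the 𝑀 slice symbols true, tallies how many of these have exactly 𝑘 true symbols inside the query interval, and compares the resulting ratio with the closed form. It takes for granted, as we did when invoking the EPL Theorem, that every satisfying assignment gets equal weight.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def brute_force_pmf(n: int, M: int, query: range) -> list[Fraction]:
    """Pr(d_k | X) for k = 0..m, found by enumerating all models of Y."""
    m = len(query)
    tally = [0] * (m + 1)
    total = 0
    for true_slices in combinations(range(M), n):  # each tuple is one model of Y
        k = sum(i in query for i in true_slices)
        tally[k] += 1
        total += 1
    return [Fraction(c, total) for c in tally]


def closed_form(k: int, n: int, m: int, M: int) -> Fraction:
    """The ratio of binomial coefficients from the text, as an exact fraction."""
    if k < 0 or k > m or n - k < 0 or n - k > M - m:
        return Fraction(0)
    return Fraction(comb(m, k) * comb(M - m, n - k), comb(M, n))


n, M = 4, 10
query = range(2, 5)  # slices 2, 3, 4, so m = 3
pmf = brute_force_pmf(n, M, query)
for k, p in enumerate(pmf):
    # columns: k, enumerated probability, closed-form probability
    print(k, p, closed_form(k, n, len(query), M))
```

Exact fractions are used so that the comparison is an equality rather than a floating-point approximation.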

Take it to the limit

Now let 𝑇,1/𝜀 → ∞. We have 𝑛,𝑚,𝑀 → ∞ and furthermore that

\(\frac{nm}{M}\sim\frac{(t/\epsilon)\left(\lambda T\right)}{(T/\epsilon)}=\lambda t\)

where 𝑎 ∼ 𝑏 means 𝑎/𝑏 → 1 (𝑎 and 𝑏 are asymptotically equal). As proven in the attached note “Hypergeom Converges To Poisson” (PDF), the hypergeometric distribution converges to the Poisson distribution under the above conditions:

\(\begin{align*} \Pr\left(d_{k}\mid X\right) &= \mathrm{Hypergeom}\left(k\mid n,m,M\right) \\ &\to\mathrm{Poisson}\left(k\mid\lambda t\right) \\ &=\frac{e^{-\lambda t}\left(\lambda t\right)^{k}}{k!}. \end{align*}\)
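A numerical check makes the convergence tangible, though it is no substitute for the proof. The sketch below uses hypothetical names and parameter values, assumes ties in round go upward, and evaluates the binomial coefficients through `math.lgamma` to avoid enormous integers. It reports the largest gap between the hypergeometric and Poisson probabilities as 𝑇 grows and 𝜀 shrinks.

```python
import math


def log_comb(a: int, b: int) -> float:
    """Natural log of the binomial coefficient C(a, b), via log-gamma."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)


def hypergeom_pmf(k: int, n: int, m: int, M: int) -> float:
    """Hypergeom(k | n, m, M), computed in log space; 0 outside the valid range."""
    if k < 0 or k > m or n - k < 0 or n - k > M - m:
        return 0.0
    return math.exp(log_comb(m, k) + log_comb(M - m, n - k) - log_comb(M, n))


def poisson_pmf(k: int, mu: float) -> float:
    """Poisson(k | mu)."""
    return math.exp(-mu) * mu**k / math.factorial(k)


def max_gap(lam: float, t: float, s: float, T: float, eps: float,
            kmax: int = 30) -> float:
    """Largest |Hypergeom - Poisson| over k = 0..kmax for the given T and eps."""
    rnd = lambda x: math.floor(x + 0.5)  # ties round up (an assumption)
    n, M = rnd(lam * T), rnd(T / eps)
    m = rnd((s + t) / eps) - rnd(s / eps)
    return max(abs(hypergeom_pmf(k, n, m, M) - poisson_pmf(k, lam * t))
               for k in range(kmax + 1))


lam, t, s = 2.0, 3.0, 1.0
for T, eps in [(10.0, 0.1), (100.0, 0.01), (1000.0, 0.001)]:
    # columns: T, eps, largest gap between the two distributions
    print(T, eps, max_gap(lam, t, s, T, eps))
```

The printed gap should shrink from one row to the next, since both 𝑚/𝑀 and 𝑛/𝑀 get smaller while 𝑛𝑚/𝑀 stays near 𝜆𝑡.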

Alternative: average over possible worlds instead of time

The preceding used the notion of the long-run rate at which events occur. This may be inappropriate—events may be limited to a time interval too short for taking 𝑇 → ∞ to make sense. Let’s look at another approach based on averaging over possible states of the world:

There are 𝑆 possible states of the world.
We interpret “average rate of 𝜆” to mean that the number of events occurring in the reference interval, averaged over the 𝑆 possible world states, is 𝜆𝑇.
Let 𝑦(𝑖,𝑗), 0 ≤ 𝑖 < 𝑇/𝜀 and 1 ≤ 𝑗 ≤ 𝑆, be a propositional symbol intended to mean “in possible world 𝑗 an event occurs during time slice 𝑖”.
Let 𝑤(𝑗), 1 ≤ 𝑗 ≤ 𝑆, be a propositional symbol intended to mean “possible world 𝑗 is the actual world.”
Let 𝑑ₖ, 𝑘 ≥ 0, be a propositional symbol intended to mean that there are 𝑘 indices 𝑖 in the query interval for which 𝑦(𝑖,𝑗) is true, where 𝑗 is the actual world.
We assume that 𝜆,𝑠,𝑡 ≥ 0, 𝑠+𝑡 ≤ 𝑇, and 𝑆,𝑇,𝜀 > 0.
After analyzing the case for finite 𝑆 and nonzero 𝜀, we let 𝑆 → ∞ and 𝜀 → 0.

We define 𝑛 and 𝑀 much as before, but larger by a factor of (about) 𝑆; we define 𝑚 exactly as before:

\(\begin{align*} n & \triangleq\mathrm{round}\left(\lambda ST\right)\\ & \quad\mbox{“sum of number of events in reference interval, over all worlds”}\\ M & \triangleq S \cdot\mathrm{round}\left(T/\epsilon\right)\\ & \quad\mbox{“sum of number of time slices in reference interval, over all worlds”}\\ m & \triangleq\mathrm{round}\left(\left(s+t\right)/\epsilon\right)-\mathrm{round}\left(s/\epsilon\right)\\ & \quad\mbox{“number of time slices in query interval”} \end{align*}\)

The revised definitions of 𝑌 and 𝐷ₖ are these:

𝑌 is a propositional formula stating that exactly 𝑛 of the 𝑦(𝑖,𝑗), 0 ≤ 𝑖 < round(𝑇/𝜀) and 1 ≤ 𝑗 ≤ 𝑆, are true.
For each 𝑘 ≥ 0, 𝐷ₖ is a propositional formula stating that 𝑤(𝑗) ∧ 𝑦(𝑖,𝑗) is true for exactly 𝑘 of the indices 𝑖 and 𝑗, round(𝑠/𝜀) ≤ 𝑖 < round((𝑠+𝑡)/𝜀) and 1 ≤ 𝑗 ≤ 𝑆.

We also define

𝑍 ≜ ⟨𝑤(1), …, 𝑤(𝑆)⟩, i.e. “𝑤(𝑗) is true for exactly one index 𝑗.”

Our revised premise 𝑋 is then the propositional formula

\(Y\wedge Z\wedge\bigwedge_{k=0}^{m}\left(d_{k}\leftrightarrow D_{k}\right).\)

Probabilities for the alternative analysis

The revised definitions for 𝑛, 𝑀, and 𝑚 were chosen so that

\(\Pr\left(d_{k}\mid X\right)=\frac{{m \choose k}{M-m \choose n-k}}{{M \choose n}}=\mathrm{Hypergeom}\left(k\mid n,m,M\right)\)

as before, by identical reasoning. Furthermore, as 𝑆,1/𝜀 → ∞ we have

\(\begin{align*} m & \sim t/\epsilon\\ M & \sim ST/\epsilon\\ n & \sim\lambda ST \end{align*}\)

yielding

\(\begin{align*} n,m,M & \to\infty\\ nm/M & \to\lambda t \end{align*}\)

and so the conditions for

\(\mathrm{Hypergeom}\left(k\mid n,m,M\right)\to\frac{e^{-\lambda t}\left(\lambda t\right)^{k}}{k!}\)

still hold.
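Both claims in this section can be probed numerically. The sketch below is illustrative only: the names and sizes are hypothetical, worlds are indexed from 0 rather than 1, every model of the premise is assumed to carry equal weight, and ties in round are assumed to go upward. The first part enumerates a tiny instance of the possible-worlds premise exactly; the second part holds 𝑇 fixed and measures the gap to the Poisson probabilities as 𝑆 grows and 𝜀 shrinks.

```python
from fractions import Fraction
from itertools import combinations
from math import comb, exp, factorial, floor, lgamma


def brute_force_pmf(S: int, R: int, n: int, query: range) -> list[Fraction]:
    """Pr(d_k | X) by enumeration: S worlds, R time slices per world.

    A model chooses which n of the S*R symbols y(i, j) are true and which
    single world is actual; the truth of each d_k is then determined.
    """
    cells = [(i, j) for j in range(S) for i in range(R)]
    tally = [0] * (len(query) + 1)
    total = 0
    for true_cells in combinations(cells, n):
        for actual in range(S):
            k = sum(j == actual and i in query for i, j in true_cells)
            tally[k] += 1
            total += 1
    return [Fraction(c, total) for c in tally]


def closed_form(k: int, n: int, m: int, M: int) -> Fraction:
    """Exact Hypergeom(k | n, m, M)."""
    if k < 0 or k > m or n - k < 0 or n - k > M - m:
        return Fraction(0)
    return Fraction(comb(m, k) * comb(M - m, n - k), comb(M, n))


def log_comb(a: int, b: int) -> float:
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)


def limit_gap(lam, t, s, T, S, eps, kmax=30) -> float:
    """Largest |Hypergeom - Poisson| over k = 0..kmax, with T held fixed."""
    rnd = lambda x: floor(x + 0.5)  # ties round up (an assumption)
    n = rnd(lam * S * T)
    M = S * rnd(T / eps)
    m = rnd((s + t) / eps) - rnd(s / eps)
    gap = 0.0
    for k in range(kmax + 1):
        if k > m or n - k < 0 or n - k > M - m:
            h = 0.0
        else:
            h = exp(log_comb(m, k) + log_comb(M - m, n - k) - log_comb(M, n))
        p = exp(-lam * t) * (lam * t) ** k / factorial(k)
        gap = max(gap, abs(h - p))
    return gap


# Part 1: 2 worlds, 4 slices each, 3 events in total, query slices 1 and 2.
exact = brute_force_pmf(S=2, R=4, n=3, query=range(1, 3))
print(exact == [closed_form(k, 3, 2, 2 * 4) for k in range(3)])

# Part 2: short reference interval (T = 5), increasing S and decreasing eps.
for S, eps in [(10, 0.1), (1000, 0.001)]:
    print(S, eps, limit_gap(2.0, 3.0, 1.0, 5.0, S, eps))
```

Note that the second part never increases 𝑇: the growth of 𝑛 and 𝑀 comes entirely from 𝑆 and 1/𝜀, which is the point of the alternative analysis.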

Commentary

This derivation contrasts with our previous derivation of the uniform distribution over the unit interval in two ways:

It makes use of latent symbols: the propositional symbols 𝑦ᵢ (or 𝑦(𝑖,𝑗) and 𝑤(𝑗)), which appear in the premise but not in any query of interest. (The term “latent symbol” corresponds to the term “latent variable” in statistics.) They may be thought of as unobserved variables that are nonetheless important in describing the situation under consideration. With the approach of averaging over possible worlds, it is obvious that only the actual world can be observed. With the approach of averaging over time, it might be, for example, that only the total number of web server requests in each defined time period is logged, not the individual times at which they arrived.
As 𝑇 (or 𝑆) and 1/𝜀 increase we do not simply add additional conjuncts to the premise, as occurred in the previous derivation. Instead, we have a more general situation: closer and closer approximations to the desired ideal.

We also found that whether we characterized our knowledge as a long-run average over time, or as an average over a limited time and a large number of possible worlds, we got the same distribution for the number of events 𝑘.
