Non-Uniform Marginal Distribution in 30-Character HFGCS Broadcast Payloads: Evidence for Format Structure Beyond a Uniform Ciphertext Model

A ciphertext-only statistical analysis of base32 symbol marginals and within-prefix structural probes across 5,377 unique broadcast payloads.

Abstract. We analyse N = 5,377 unique 30-character HFGCS broadcast payloads, retained as the first-broadcast instance of each distinct string observed between 2022 and 2026, against the working hypothesis that the payload body is statistically indistinguishable from uniform random base32 — the output distribution any IND-CPA-secure AEAD primitive would produce to a ciphertext-only observer. Three independent positional signatures contradict this hypothesis.

First, two specific base32 symbols (M and 5) are systematically under-represented at nearly every body position by approximately 35–40% relative to the uniform expectation, with aggregate Poisson z-scores of −21.2 and −20.9 against a synthetic uniform base32 control that passes the same test. The deficit persists within every prefix cell examined — median per-cell z ≈ −4.7 for M and −4.7 for 5 across the seventeen largest cells — and therefore cannot be attributed to stratum mixing. Second, positions 19 through 21 (bit offsets 91 through 105) show sharply attenuated deficits (|z| < 1.7 for both symbols), indicating a mid-body regime transition; positions 3 through 4 show a similar attenuation in the marginal channel only. Third, the rate of character equality at pairs of positions (doublets when adjacent; inter-position equalities in general) exhibits two regions of strong suppression: a co-localised suppression at the mid-body transition (position pairs 19–20 and 20–21 at z = −7.6 and −6.7) that independently confirms the marginal-deficit localisation there, and a much stronger near-deterministic suppression spanning the final six characters — not just at the five adjacent pairs but at every one of the fifteen pair-positions within positions 25–30 — with equality counts of 5–8 observed against ~105 expected per pair (a 15-fold reduction) persisting uniformly in every prefix cell. The six trailing characters are therefore near-certainly pairwise distinct within each message, not merely free of consecutive repeats.

Within each prefix, probes of pairwise positional mutual information, general compressibility, and inter-message differential structure return no further anomalies: pairwise z-scores sit within the shuffle-null envelope, compression ratios sit within the synthetic-random noise floor across all 41 prefix cells, and modular-difference distributions between date-ordered consecutive payloads are uniform. The conjunction of persistent non-uniform marginals, two regions of attenuated marginal deficit (positions 3–4 and 19–21), co-localised pairwise-equality suppression at positions 19–21, and 15-fold pairwise-equality suppression across all fifteen position pairs within 25–30 establishes that the 150-bit body carries at least four statistically distinguishable regions with transitions at specific character-grid boundaries. We deliberately avoid identifying any region with a specific cryptographic primitive: the ~35% M/5 deficit at main-body positions is itself inconsistent with the uniform-output behaviour that standard authenticator primitives (GMAC, Poly1305, truncated HMAC, SipHash) are designed to produce, and we accordingly decline to interpret the mid-body transition as the terminus of any particular standard-width authenticator. The three positional signatures appear in Group 1 only; Group 2 carries only a mild body-level M/5 deficit with no detectable internal boundary. Whatever generates the two groups' broadcasts, the two generators are not producing output with the same internal structure.

In plain language. Every day, United States military radio stations broadcast short strings of thirty letters and numbers over shortwave radio. Anyone with a receiver can hear them, but nobody outside the military knows what they mean. These broadcasts are called Emergency Action Messages.

You might assume each broadcast is scrambled by strong encryption — that each string looks completely random, with every one of the thirty-two possible characters (the letters A–Z and the digits 2–7) showing up about equally often. That is what good modern encryption produces. A listener watching the broadcasts from outside should see no favouritism toward any particular character and no particular pattern in how characters follow one another.

We looked at five years of collected broadcasts and found three ways in which this is not what happens — three independent fingerprints of structure that a properly random encryption would not leave.

First, two specific characters are unusually rare. Out of all thirty-two characters, the letter M and the digit 5 show up only about two-thirds as often as they should. Every other character behaves normally. This two-character deficit is not a fluke of one prefix group or one year: it appears in every sufficiently large prefix group of broadcasts we examined, across the whole time window.

Second, the deficit temporarily weakens in the middle of the broadcast. If you look position by position through the thirty characters, the shortage of M and 5 is present almost everywhere — except at positions nineteen, twenty, and twenty-one, where it gets noticeably milder. At those three positions, independently, the rate at which consecutive characters match each other also drops sharply. Two separate statistical probes agreeing on the same three positions is strong evidence that something in the underlying broadcast format changes there. We don't claim to know what that "something" is, and we specifically do not read this as evidence for any particular cryptographic component: the pattern of rare characters we see in the surrounding region is itself inconsistent with the kind of uniform-looking output that standard cryptographic components produce.

Third — and most striking — the last six characters of the broadcast almost never contain two identical characters, at any pair of positions within that six-character window. In a truly random string, any two chosen characters match by chance about 3% of the time. In the final six positions of these broadcasts, that rate drops to roughly 0.2% for every pair of positions — not only for neighbouring positions, but for characters one apart, two apart, up to the full six-character span. The rule is stronger than a no-doubled-letters rule: the six final characters are near-certainly all distinct from each other. We see this pattern in every single two-letter prefix group we examined, without exception. Whatever process builds the last six characters chooses them to be mutually different.

There is also a quieter fourth observation: at positions three and four — the two characters immediately after the prefix — the shortage of M and 5 vanishes. Those two characters look like a normal uniform random draw, unlike the rest of the body. It suggests a short ten-bit slot between the prefix and the main cryptographic region, carrying something like a counter, a short random number, or a version indicator.

Meanwhile, several other tests — looking for repeated patterns, hidden compressibility, and relationships between nearby broadcasts — find that, within any single prefix group's set of messages, the characters pass all the "does this look random?" checks we applied, once you account for the rare-characters deficit and the forbidden-doubles rule. So the non-randomness we find is not in chaotic noise but in specific places: the shortage of M and 5 almost everywhere except two short slots, the mid-broadcast transition at positions nineteen to twenty-one, and the forbidden-doubles rule at the tail. The simplest account is that the thirty-character string is made of at least four distinguishable pieces — a prefix, a short head slot, a long body, and a structured six-character tail — rather than a single uniform block of encrypted output. We stop short of saying which cryptographic primitive builds any given piece. The observations are enough to rule out "it's just a standard AEAD ciphertext", but not enough to say what it is instead.

Important caveat: all three of these structural patterns appear in Group 1 broadcasts. Group 2 broadcasts carry only the weak M/5 shortage and none of the positional structure. Group 2 broadcasts look much more like a single block of uniformly random-looking text. The two groups do not appear to use the same format.

1.Introduction

The High-Frequency Global Communications System (HFGCS) broadcasts short fixed-length alphanumeric strings over shortwave radio as part of its Emergency Action Message (EAM) traffic. These broadcasts are publicly audible, routinely logged by open-source listener networks, and appear — to an observer without key material — as opaque base32 payloads. A natural working hypothesis is that each observed 30-character broadcast carries the output of a modern authenticated-encryption primitive under a per-session fresh nonce, and that the payload is therefore statistically indistinguishable from a uniform random string of 150 bits by any polynomial-time observer [1, 2].

This paper evaluates that hypothesis quantitatively against a long-collected corpus of observed broadcasts. The evaluation uses only ciphertext — no key material, no plaintext correspondences, no implementation-level fingerprints — and asks a single question: do the observable statistics of the broadcast payload match the distribution that any IND-CPA-secure AEAD scheme under fresh nonces must produce?

The answer is no. The empirical marginal distribution of base32 symbols in the 30-character payload body is not uniform: two specific symbols (M and 5) are reproducibly under-represented by approximately one-third at nearly every body position, and this deficit persists within every prefix cell examined. The position-wise profile of the deficit is not flat: it attenuates in two specific regions, and the mid-body region at positions 19–21 carries an independent doublet-rate suppression co-localised with the marginal attenuation. Conditional on the marginal deviation, probes of within-key pairwise positional dependence, general compressibility, and differential structure between temporally-adjacent broadcasts return no further anomalies.

The synthesis is that the broadcast payload is not a uniform draw on Σ30 but a concatenation of at least four statistically distinguishable regions along its 150-bit length, with transitions at specific character-grid boundaries. In particular, the final six characters of every broadcast are near-certainly pairwise distinct — every one of the fifteen pair-positions within the trailing window shows a roughly 15-fold suppression of character equality, not only the five adjacent pairs. This is a stronger rule than "no consecutive repetitions": the six final characters collectively form a six-element subset of the 32-symbol alphabet in each message. We deliberately avoid identifying any region with a specific cryptographic primitive. The observed main-body M/5 deficit is itself inconsistent with the uniform-output behaviour of standard authenticator constructions, which rules out the most natural ciphertext-only reading of the mid-body transition (namely, as the terminus of a standard-width authenticator). What the observations establish is the presence of the transitions, not their cryptographic meaning.

The paper is organised as follows. Section 2 formalises the observation model and the null hypothesis. Section 3 describes the corpus. Section 4 tests first-order symbol marginals against the uniform null. Section 5 analyses the position-wise profile of the marginal deviation. Section 6 reports an independent positional probe — the rate of consecutive-character repetition — that reveals two regimes of strong suppression and cross-validates the position-level localisation. Section 7 presents three within-key structural probes constraining the internal regularity of the payload. Section 8 synthesises the evidence into a region-level description of the payload and states explicitly what the observations do and do not license. Section 9 discusses limitations, alternative explanations, and follow-up probes. Section 10 concludes.

2.Preliminaries and threat model

2.1Observation model

An observer with a shortwave receiver and a transcription pipeline captures, for each broadcast event, a tuple (date, time, callsign, prefix, addressee-flag, message, station-id). The observer does not hold key material, does not possess cleartext corresponding to any broadcast, and does not have access to implementation-level fingerprints (timing, power, electromagnetic emanation). The dataset used in this paper is the output of such a pipeline aggregated over multiple listeners.

We denote the observed message field as c ∈ Σ30 where Σ is the base32 alphabet {A, B, …, Z, 2, 3, 4, 5, 6, 7} of size 32 specified in RFC 4648. The first two characters of c constitute the prefix (the PR), which is not a uniform random value but a short identifier that rotates in and out of service on an operational schedule: each prefix is introduced, carries traffic for an active period (typically weeks to months, sometimes crossing a calendar-year boundary), and is eventually retired in favour of a new prefix identifier. The finest cryptographic partition the observable data supports is therefore "one prefix", treated as one operational key regime. The remaining 28 characters of c constitute the body c[3:30], carrying 28·log2(32) = 140 bits.

2.2Null hypothesis: uniform AEAD output

The working hypothesis under which the corpus would be "cryptographic" in the sense most often assumed by non-specialist readers is that each broadcast carries the output of an authenticated-encryption-with-associated-data (AEAD) primitive producing ciphertext indistinguishable from uniform on Σ30 to any polynomial-time observer without the key [2, 3, 4]. Formally:

ci = AEAD.EncK(prefix)(noncei, associated_datai, plaintexti) (1)

Under IND-CPA [1], the ciphertext distribution is computationally indistinguishable from uniform on Σ30, independently of the underlying plaintext and conditional on the key. The directly testable implication is:

The empirical marginal distribution of base32 symbols at each body position is uniform on Σ up to finite-sample estimator bias of order N−1/2, identical in form across positions and across key-distinct sub-populations.

This implication is directly testable from the observation tuples. Sections 4 and 5 test it.

2.3Alternative hypotheses

If the working hypothesis fails, the cryptographic interpretations of the broadcast payload that remain are:

  1. Deterministic keyed-function output. The broadcast carries a keyed deterministic output — a MAC, keyed hash, or truncation thereof — computed over an underlying input (which need not be the transmitted payload itself). Formally, c = FK(prefix)(inputs), where F is a pseudorandom function [5, 6]. The output is cryptographic but deterministic in its inputs, and its empirical marginals may reflect format-level structure (truncation, concatenation with non-uniform fields, alphabet mappings) rather than the uniform marginals of a full-width AEAD ciphertext.
  2. Concatenated-field format. The broadcast is a concatenation of two or more fields, only some of which are drawn from a uniform distribution. The remaining fields — e.g., a counter, identifier, version token, or structured nonce — introduce distributional features that depart from the AEAD null at characteristic positions.
  3. Codebook token. The broadcast is a non-cryptographic identifier drawn from a fixed operational codebook. Observed statistics would reflect the codebook's generation method rather than any per-session primitive.

Alternatives (1) and (2) are not mutually exclusive: a canonical authenticated-message format of the form authenticator ‖ field belongs to both. The body of this paper tests the working hypothesis against these alternatives. The evidence we report supports alternative (2) strongly — the payload has internal field boundaries — and is consistent with alternative (1) only in the weak sense that the within-prefix probes do not exclude a deterministic keyed-function component. We argue in §8.4 and §9.1 that identifying any specific region with a specific cryptographic primitive is not licensed by ciphertext-only observations, and that the observed main-body marginal deficit is in fact inconsistent with standard-width authenticator identification.

3.Observation corpus

3.1Source and preprocessing

The corpus is a catalogue of 12,870 transcribed broadcast observations collected between 2022 and 2026 by a coordinated network of open-source listeners. Each observation carries the fields listed in §2.1. Rows whose quality flag contains flagging characters indicating uncertain capture are excluded, as are rows whose message text contains placeholder characters indicating incomplete transcription. Restricting to messages of length 30 whose prefix resolves to Group 1 or Group 2 in the active prefix assignment at the broadcast date yields 9,769 broadcast-level observations, after removing exact duplicates on the composite key (date, UTC time, callsign, message) to eliminate accidental double-entry.

3.2First-broadcast itemization

Subsequent statistical tests require approximately independent observations. Operational retransmission — the same 30-character string broadcast repeatedly from the same or different stations within a short time window — violates this assumption. We therefore itemize the corpus by retaining, for each unique message string, the observation with the earliest (date, UTC) ordering, and discarding all later broadcasts of that same string. This produces N = 5,377 distinct messages; all statistical tests in this paper operate on this first-broadcast itemization.

3.3Slice structure

Two prefix-derived strata and one addressee-flag derived stratum are defined a priori. The cohort comprises NG1 = 3,519 Group-1 and NG2 = 1,858 Group-2 unique messages. A finer stratification by individual prefix yields 41 cells with at least 50 unique first-broadcast messages; each such cell holds all traffic from a single prefix across its operational lifetime, which we treat as the finest cryptographic partition the data supports. Three Group-1 prefixes in this cohort straddle a calendar-year boundary (MQ 2022–2023, WR 2023–2024, ND 2024–2025); these are kept as single prefix cells rather than split artificially across years.

levelcountinterpretation
raw broadcast observations9,769full corpus after filtering
unique 30-character strings5,377first-broadcast itemization; base of all tests
  of which Group 13,519
  of which Group 21,858
retransmissions excluded4,392excluded from all statistical tests
prefix cells with N ≥ 5041per-prefix (within-one-key) analysis
Table 1. Corpus dimensions used throughout the analysis.

4.First-order symbol distribution

The testable implication of the uniform-AEAD null is that the empirical distribution of base32 symbols at each body position is uniform up to finite-sample bias. This is equivalent to the NIST SP 800-22 frequency (monobit) test [7] applied to the 5-bit expansion of each base32 symbol, and equivalent to a chi-square goodness-of-fit test against the uniform on Σ at the symbol level.

4.1Aggregate symbol-level test

For each base32 symbol c ∈ Σ, we count its occurrences at body positions 3 through 30 across all Group 1 first-broadcast messages (N = 3,519 messages × 28 body positions = 98,532 body characters), compute the Poisson z-score against the uniform expectation (98,532 / 32 ≈ 3,079 per symbol), and rank. Table 2 reports the ten most deviant symbols, compared against a synthetic uniform base32 control of matched size generated with a CSPRNG.

symbol base32 idx bit pattern G1 count G1 pct G1 z control z
M12011001,9041.932%−21.18+0.34
529111011,9201.949%−20.89+1.24
A0000003,2453.293%+2.99+0.32
S18100103,2423.290%+2.94−0.58
327110113,2393.287%+2.88−0.92
L11010113,2323.280%+2.76−1.05
F5001013,2103.258%+2.36+2.09
E4001003,1993.247%+2.16−0.15
T19100113,1903.238%+2.00+2.18
X23101113,1873.234%+1.94+0.00
Table 2. The ten most deviant base32 symbols in Group 1 body positions, by Poisson z-score. The final column reports the same statistic on a synthetic uniform base32 corpus of matched size (28·3,519 = 98,532 characters) drawn from a CSPRNG, showing z-scores in the expected ±2σ range. Symbols M and 5 on real data exceed the CSPRNG noise envelope by roughly an order of magnitude.
Figure 1. Empirical frequency of each base32 symbol across all body positions of the Group 1 first-broadcast cohort (N=3,519 messages, 98,532 body characters). Dashed line: the uniform expectation 1/32 ≈ 3.125%. Thirty of thirty-two symbols land within a narrow band around the expected rate (each within |z| < 3.0). Two symbols — M and 5, highlighted — fall far below the rest, each at approximately 1.94% against an expected 3.125%, a deficit of roughly one-third. The visual gap between those two bars and the rest of the distribution is what a synthetic uniform random draw never produces.

Symbols M and 5 deviate from uniform by z = −21.18 and z = −20.89 respectively. Each occurs at roughly 1.94% of body positions, against an expected 3.125% (1/32). No other symbol deviates by more than |z| = 3.0. The bimodal structure — two extreme outliers and a near-uniform bulk — cannot be attributed to finite-sample noise, as demonstrated by the synthetic control. The bit patterns of M (01100) and 5 (11101) share no obvious common substructure that would implicate a specific bit-level bias in the underlying generator.

4.2Persistence across individual prefixes

The aggregate deficit could in principle arise from the mixing of sub-populations with slightly different marginals — the Simpson's-paradox pathway discussed for pairwise mutual information in the statistical literature [8]. The appropriate test is whether the deficit persists within individual prefix cells. Table 3 reports the per-cell z-scores of M and 5 in the seventeen largest Group 1 prefix cells.

prefix year N M pct M z 5 pct 5 z
GF20221721.765−5.341.806−5.18
MQ2022–20231511.845−4.711.774−4.97
2420231662.582−2.101.936−4.58
YL20231431.948−4.211.998−4.03
YT20231821.923−4.851.864−5.09
PJ20232291.684−6.532.105−4.62
CZ20232411.897−5.712.238−4.12
WR2023–20242411.823−6.051.971−5.36
AS20241612.063−4.031.841−4.88
TD20241191.651−4.812.191−3.05
Q520241441.860−4.542.108−3.65
MK20241582.238−3.341.763−5.12
C620241421.610−5.412.037−3.88
FV20241401.735−4.921.480−5.83
AA20251461.908−4.402.153−3.52
SE20251031.976−3.491.491−4.96
LH20261001.929−3.581.857−3.80
Table 3. Per-cell z-scores of base32 symbols M and 5 in the seventeen Group 1 prefix cells with N ≥ 100 unique first-broadcast messages, ordered chronologically by the earliest broadcast date observed in each cell. Prefixes MQ (active 2022–2023) and WR (active 2023–2024) straddle calendar-year boundaries and are reported as single cells covering their full active period. Every cell has M z ≤ −2.10 (median −4.71) and 5 z ≤ −3.05 (median −4.62). The deficit is not a pooling artifact; it is a persistent property of each prefix, reproducing across every successive prefix rollover from 2022 through 2026.
Key finding. The under-representation of M and 5 is not stratum mixing. It is present in every single prefix examined, with magnitudes that accumulate across prefixes to produce the extreme aggregate z-scores. A Fisher's-combined-p test over the seventeen cells for systematic under-representation gives p < 10−50 for each symbol. Whatever mechanism produces this bias, it operates uniformly across prefixes.

4.3Group 2 corroboration

The same analysis applied to Group 2 body positions (N = 1,858 × 28 = 52,024 characters) returns the same two symbols as the most under-represented, though with attenuated magnitudes: M at z = −3.57, 5 at z = −1.88. The finding is therefore not restricted to Group 1; it is present in both groups with the same identity of the affected symbols. This is more consistent with a mechanism shared across groups than with a group-specific generator property.

5.Position-wise bias profile

If the M/5 deficit reflects a structural property of the broadcast format — for example, a bit-level mask applied at specific positions, or a concatenated-field layout where certain positions are drawn from a different distribution — the deficit should concentrate at those positions. If the deficit reflects a uniform generator property, it should distribute evenly across positions.

Table 4 reports the per-position z-score of M and 5 counts across the 3,519 Group 1 first-broadcast messages at each of the thirty character positions.

pos bit range M z 5 z note
11–5+20.12−4.96prefix char 1 (PR-distribution)
26–10−7.15+3.25prefix char 2 (PR-distribution)
311–15−0.47−0.86deficit attenuates
416–20−0.47+0.00deficit attenuates
521–25−4.57−5.05
626–30−4.67−4.77
731–35−4.67−4.67
836–40−3.91−4.86
941–45−4.67−5.24
1046–50−4.86−4.00
1151–55−5.53−4.00
1256–60−4.38−4.67
1361–65−5.24−4.29
1466–70−6.00−6.29
1571–75−5.24−4.86
1676–80−4.29−3.24
1781–85−3.62−3.91
1886–90−5.15−4.96
1991–95−0.47−1.24deficit attenuates
2096–100−1.14−0.38deficit attenuates
21101–105−0.47−1.62deficit attenuates
22106–110−4.77−3.53
23111–115−5.05−4.48
24116–120−5.24−5.24
25121–125−5.34−4.67
26126–130−4.77−4.67
27131–135−4.48−4.38
28136–140−6.00−4.86
29141–145−3.91−5.43
30146–150−2.67−4.38
Table 4. Per-position z-scores of symbols M and 5 in Group 1 first-broadcast messages (N=3,519). The deficit is present at nearly every body position. Two regions deviate from the surrounding pattern: positions 3–4 (bit offsets 11–20) and positions 19–21 (bit offsets 91–105) both show attenuated deficits at |z| < 1.7 for both symbols, while the intervening positions 5–18 and the trailing positions 22–30 carry the full 4–6σ deficit. The mid-body attenuation at 19–21 is co-localised with an independent doublet-rate suppression (§6), so that region has two statistical channels agreeing; the head-of-body attenuation at 3–4 appears only in this marginal channel. Position 1 reflects the G1 prefix-character distribution (M-prefixed PRs such as MQ and MK contribute disproportionately to position 1) rather than a body-generator property.
Figure 2. Per-position Poisson z-score of the two under-represented symbols M (solid red) and 5 (dashed blue) across the thirty character positions of Group 1 first-broadcast messages. Solid horizontal line: the uniform expectation (z = 0). Dashed red line: the per-family Bonferroni threshold for 435 position pairs, |z| > 3.86. Shaded bands: the head-attenuation region at positions 3–4 (bits 11–20) and the mid-body attenuation region at positions 19–21 (bit offsets 91–105). In both regions M and 5 sit near the uniform expectation (|z| < 0.9 at positions 3–4; |z| < 1.7 at 19–21), unlike positions 5–18 and 22–30 where the deficit is fully active. Positions 1–2 reflect the G1 PR-character distribution (e.g., M at position 1 is massively over-represented at z = +20.12, off-chart, because MQ and MK are frequent G1 prefixes); these are not body-generator properties.
Two regions of attenuated deficit. The position-wise M/5 profile is not uniform: two regions break from the surrounding pattern. (i) Positions 3–4, immediately after the prefix, show M and 5 at essentially the uniform expectation (|z| < 0.9) while the surrounding body positions show a 4–6σ deficit. (ii) Positions 19–21 show a similar attenuation (|z| < 1.7); crucially, this second region is jointly localised with an independent structural signature — the doublet rate at pairs (19, 20) and (20, 21) also drops sharply there (§6, z = −7.63, −6.67). The positions-3–4 region shows attenuation in the marginal channel only; pair (3, 4) doublets are within noise at z = −1.05. Two statistical channels agreeing at positions 19–21 is a stronger signal than one channel at positions 3–4, but both regions establish that the body has internal distributional boundaries.
What the boundaries do not establish. The attenuation at positions 19–21 brackets the canonical 96-bit crypto-tag width, which is suggestive, but the match alone does not identify the primitive. Standard authenticator primitives — GMAC, Poly1305, truncated HMAC, SipHash, etc. — are specifically designed to produce output indistinguishable from uniform random bits across their full output width. Under such a primitive the surrounding body region would show no M/5 deficit, which is precisely the opposite of what we observe at positions 5–18 and 22–24. Identification with a specific standard-width authenticator is therefore not supported by the data; what the data does support is a weaker statement, that the body carries internal structural boundaries whose cryptographic origin is not established here.

The positions-3–4 region could in principle carry a sequence counter, short nonce, version token, or sub-key index — a 10-bit slot admits 1,024 distinct values — but the data does not constrain its semantic role any further than "uniform-looking draw from the base32 alphabet, distinct from the surrounding body." Symmetrically, the positions-19–21 region could mark a concatenation boundary, a post-processing artefact, or a bit-width discontinuity in the underlying generator; the observations distinguish it as structurally significant without identifying its source.

The inference is limited to the position-level statement: the M/5 deficit has internal positional structure, with attenuation at positions 3–4 and 19–21 that breaks from the surrounding pattern. This is difficult to account for under any hypothesis of a single uniform-output generator operating across all 150 bits. It is suggestive of a concatenation format in which different regions are produced by different mechanisms, but it does not on its own identify any region with a specific cryptographic primitive, and in particular does not support identifying the pre-transition region as a standard-width authenticator — the surrounding M/5 deficit is the wrong statistical signature for that identification (see §8.4, §9.1).

6.Doublet suppression profile

A second positional probe, independent of the marginal analysis of §4–§5, tests the rate at which consecutive characters coincide — the probability that message position i carries the same base32 symbol as position i+1. Under the uniform null, this doublet probability is exactly 1/32 = 3.125% at every position pair, regardless of how the marginals are distributed across symbols. A format imposing a no-run constraint at specific positions would produce localised suppression of the doublet rate.

Table 5 reports the empirical doublet count at each of the twenty-nine position pairs in the Group 1 first-broadcast cohort (N = 3,519), the same statistic in Group 2 (N = 1,858), and in a synthetic uniform base32 control of matched Group 1 size. The Poisson standard deviation under the null is N1/2/321/2, i.e., approximately 10.49 doublets per position pair in the G1 column. Column z-scores are reported against this null.

pair bit range G1 count G1 z G2 count G2 z syn z
1–21–10146+3.44101+5.63+0.29
2–36–15114+0.3853−0.66+1.05
3–411–2099−1.0563+0.65+1.62
4–516–25125+1.4356−0.27+0.96
5–621–30115+0.4852−0.80+0.10
6–726–35104−0.5749−1.19+0.86
7–831–40125+1.4366+1.04+0.77
8–936–45132+2.1048−1.32+0.96
9–1041–50102−0.7658−0.01−0.28
10–1146–55123+1.2458−0.01−1.52
11–1251–60114+0.3866+1.04−0.28
12–1356–65113+0.2959+0.12−0.47
13–1461–7097−1.2459+0.12+0.67
14–1566–75109−0.0968+1.30−0.19
15–1671–80113+0.2956−0.27+0.19
16–1776–85118+0.7763+0.65+0.19
17–1881–90100−0.9553−0.66+0.19
18–1986–95119+0.8658−0.01+0.58
19–2091–10030−7.6351−0.93−0.47
20–2196–10540−6.6766+1.04+0.00
21–22101–110115+0.4857−0.14+0.19
22–23106–115111+0.1060+0.25+0.67
23–24111–120132+2.1062+0.52+0.38
24–25116–125109−0.0959+0.12+0.00
25–26121–1307−9.8269+1.44+1.05
26–27126–1355−10.0152−0.80+1.24
27–28131–1408−9.7268+1.30−0.28
28–29136–1457−9.8270+1.57+0.29
29–30141–1507−9.8269+1.44−1.43
Table 5. Doublet (character-repetition) count at each of the 29 consecutive-position pairs in Group 1 first-broadcast messages (N=3,519), Group 2 (N=1,858), and a synthetic uniform base32 control of matched Group 1 size. Expected count per position pair under the uniform null: ~110 for G1 and synthetic, ~58 for G2; Poisson z-scores reported. Position pair (1–2) carries a prefix-driven over-representation in both real groups. Highlighted rows in the G1 column mark two regimes of strong doublet suppression: pairs (19–20) and (20–21) at the mid-body transition — co-localised with the marginal attenuation of §5 — and pairs (25–26) through (29–30) at the trailing no-run region, where the doublet count drops to 5–8 observed against ~110 expected, a 15-fold reduction. G2 exhibits neither regime; the synthetic control shows none.
Figure 3. For each character position from 2 through 30, the number of Group 1 first-broadcast messages in which the character at that position is identical to the character at the immediately preceding position. Dashed line: the uniform expectation of approximately 110 doublets per position (rate 1/32 on N=3,519). Most positions land within one Poisson standard deviation (about ±10.5) of the expected rate and are effectively indistinguishable from uniform. Position 2, which completes the two-character prefix, sits well above the expected line, reflecting the known prefix distribution. Two distinct regions of strong suppression are visible: positions 20 and 21 drop to roughly 30 and 40 doublets respectively — co-localised with the mid-body marginal attenuation of §5 — while positions 26 through 30 drop to between 5 and 8 doublets, a 15-fold reduction from the expected rate and the clearest structural signature observed in the corpus.

6.1Per-prefix persistence of the trailing-region suppression

The aggregate doublet suppression at position pairs (25–26) through (29–30) would be produced by stratum mixing only if individual prefix cells carried normal doublet rates that averaged to the observed suppression — which is arithmetically impossible. Direct verification nevertheless shows that every individual prefix cell in Group 1 with N ≥ 100 independently exhibits the suppression, with doublet counts at trailing positions of 0, 1, 2, or 3 against expected counts of 3.1 to 7.5 per cell.

prefix year N expected 25–26 26–27 27–28 28–29 29–30
GF20221725.401103
MQ2022–20231514.700200
2420231665.200020
YL20231434.510000
YT20231825.710001
PJ20232297.201000
CZ20232417.502010
WR2023–20242417.520001
AS20241615.010000
TD20241193.710001
Q520241444.500010
MK20241584.900000
C620241424.400000
FV20241404.400000
AA20251464.600100
SE20251033.200200
LH20261003.110000
Table 6. Per-prefix doublet counts at each of the five trailing position pairs in Group 1 messages, ordered chronologically by first broadcast. Across 85 cell-by-position observations (17 prefix cells × 5 position pairs), the observed doublet counts are 0, 1, 2, or 3 (one cell at pos 29–30); the mean count is 0.32 against an expected mean near 5. Three prefix cells (MK, C6, FV) contribute zero doublets across all five trailing positions. The suppression is present in every prefix without exception and is therefore a persistent structural property rather than a pooling artifact.
Key finding. The last six character positions of Group 1 broadcast bodies (spanning bits 121–150 of the 150-bit payload) carry a near-deterministic no-consecutive-repetition constraint. Doublets, which should occur at a 1/32 rate under uniform output, occur at an aggregate rate of about 1/500 across the trailing region — roughly a fifteen-fold suppression — with the constraint persisting in every prefix cell examined. Group 2 exhibits no such suppression: its trailing-region doublet rate is consistent with uniform. This constitutes an independent structural signature distinguishing the two groups at the format level.

6.2Boundary-region doublet suppression

A weaker but still positively detected suppression appears at position pairs (19–20) and (20–21) — the same region of the payload at which the M/5 marginal deficit attenuates (§5). At the aggregate level these two pairs give z-scores of −7.63 and −6.67 respectively, with doublet counts of 30 and 40 against the expected ~110. Per-prefix disaggregation (not tabulated) confirms that the majority of large prefix cells contribute zero or one doublet at position pair (19–20), with the remaining cells contributing small positive counts; the pattern is less uniform than the trailing-region suppression but cleanly cross-validates the 96-bit boundary hypothesis introduced in §5.

The two signatures — marginal M/5 attenuation in one direction and doublet suppression in another — are jointly localised at the same positional region, arising from different statistical channels, and therefore constitute independent evidence for a format boundary in that region.

6.3Generalised inter-position equality probe

The doublet probe tests adjacent position pairs (i, i+1) for identical-character rates. The natural generalisation tests every position pair (i, j) with j > i — including non-adjacent pairs. This is the inter-position equality (IPE) probe. Under the uniform null the expected match count at each pair is N/32 regardless of the gap, so the probe extends the doublet statistic to the full triangular matrix of position-pair coincidences. At 30 characters this covers C(30, 2) = 435 pairs; the per-family Bonferroni threshold at α=0.05 is |z| > 3.86.

Applied to Group 1 first-broadcast messages (N = 3,348), the probe returns 18 pairs above the Bonferroni threshold on the deficit side. These pairs fall cleanly into two regions:

  1. All 15 pairs within the trailing region (25–30). Every pair (i, j) with i, j ∈ {25, 26, 27, 28, 29, 30} and i ≠ j shows a deficit at z = −9.8 to −10.2, with observed match counts of 5–9 against an expected count of ~110.
  2. All 3 pairs within the mid-body boundary region (19–21). The pairs (19, 20), (20, 21), and (19, 21) show deficits at z = −7.9, −7.2, and roughly −4 respectively.
The tail constraint is stronger than the doublet probe alone reveals. The adjacent-doublet suppression reported in §6 and §6.1 is one slice of a more general rule: every pair of positions within the six-character trailing region (25–30) shows suppressed equality, not only consecutive pairs. The six characters at positions 25–30 are near-certainly all six distinct from each other — not merely that no two adjacent ones match. The 15 tail-region pairs attain equality rates of roughly 1/500 against the uniform 1/32, a rate consistent across every pair in the region. Group 2 exhibits no such suppression in any tail pair, adjacent or non-adjacent: the strengthened finding preserves the group contrast.
Figure 4. Inter-position equality z-score heatmap for all 435 position pairs (i, j) in Group 1 first-broadcast messages. Red cells indicate match-rate excess; blue cells indicate match-rate deficit; grey cells are within the chance envelope. The most saturated blue region sits in the upper-right corner (small i, large j), specifically the triangle spanned by positions 25–30 crossed with themselves — every one of the 15 pairs within that region is individually at z < −9.7. A secondary blue cluster at positions 19–21 crossed with themselves (the boundary region from §5) is also visible. Group 2, by contrast, produces an essentially uniform grey matrix with no Bonferroni-crossing pairs; the IPE probe therefore cross-validates the adjacent-doublet finding and extends it to the full pair matrix.

The generalisation has a concrete implication: the tail finding is better described as a no-any-duplicate rule on the final six characters, not as a no-consecutive-repetition rule. The distinction matters because the two constraints differ in format implication. "No adjacent duplicates" is consistent with a generator that simply rejects immediate repeats. "No pairs of equal characters at any offset within the six-character tail" is a stronger constraint: the six characters collectively form a set with cardinality close to 6, drawn from the 32-symbol alphabet without replacement within this window. The number of such 6-character sequences is 32·31·30·29·28·27 = 6.5·108 (about 29.6 bits of information capacity), compared with 326 = 10.7·108 under the uniform null (30 bits) — roughly 0.4 bits lost to the no-duplicate rule, distributed as a uniform constraint on the whole six-character block rather than as a per-pair restriction.

7.Within-key structural probes

The preceding sections reject uniformity at the marginal level and reject freshness at the collision level. To complete the statistical portrait, we ask whether the observed payload, considered as output of a deterministic function on unseen inputs, behaves like a strong pseudorandom function within a single prefix. This is equivalent to testing whether, conditional on the first-order marginal deviations and the operational replay structure, the joint structure of body positions is consistent with a PRF.

Three within-key tests are reported: pairwise mutual information, LZMA compressibility, and first-order differential structure between consecutive messages.

7.1Per-position pairwise mutual information

For each prefix cell with N ≥ 100, we compute the empirical pairwise mutual information between the terminal pair of positions, I(X29; X30), and its 200-replicate column-shuffle null. Table 7 summarises the result for the ten Group 1 prefix cells with the largest unique-message counts. The same test on the full pooled cohort returns z = +6.23 at this position pair; within any single prefix, the signal disappears.

prefix year N I(X29; X30) μnull σnull z
GF20221722.4442.4250.037+0.52
MQ2022–20231512.6242.5590.040+1.63
2420231662.4332.4650.044−0.73
YT20231822.4492.3960.037+1.44
PJ20232292.2032.1940.039+0.24
CZ20232412.1532.1390.040+0.36
WR2023–20242412.1622.1090.035+1.52
AS20241612.4872.5040.037−0.46
MK20241582.6132.5580.038+1.43
AA20251462.7332.6840.039+1.23
Table 7. Within-cell pairwise mutual information at the terminal position pair (29, 30) for the ten Group 1 prefix cells with the largest unique-message counts, ordered chronologically by first observed broadcast. All z-scores lie in the range [−0.73, +1.63]; none exceeds the Bonferroni threshold of |z| > 2.58 for 10 tests at α = 0.05. The pooled-cohort signal at this position pair is attributable entirely to stratum mixing; the within-prefix structure is independent.

The same pattern holds at every other position pair we examined: within a single prefix, the pairwise joint distribution of any two body positions is indistinguishable from the product of the marginals up to the noise floor of the shuffle null.

7.2Per-prefix compressibility

Across all 41 prefix cells with N ≥ 50, we compute the LZMA compression ratio of the shuffled and prefix-stripped 28-character bodies and compare against a ten-seed synthetic uniform base32 control of matched size. Under the Bonferroni threshold for 41 tests (|z| > 3.24), zero cells exhibit significant excess compressibility, and the cell-level z-score distribution is approximately centred near zero with no cell exceeding +3. These results are consistent with a strong PRF output within each prefix.

7.3Differential structure between consecutive messages

If the broadcast payload were generated from a counter or state-machine, position-wise character differences between temporally-adjacent messages within the same prefix would show non-uniform structure. For each prefix cell with N ≥ 100, we compute the distribution of modular differences Δ = (ci+1[p] − ci[p]) mod 32 at body positions p ∈ [3, 30] between consecutive messages in date-ordered sequence, and compute the chi-square statistic against the uniform on {0, 1, …, 31}. The five largest Group 1 cells give z-scores (chi-square standardised) of {+0.37, +2.30, +0.00, −1.59, +2.01}. No cell clears the Bonferroni threshold. The temporal-difference distribution is consistent with uniformity, excluding simple counter-based or state-machine structures within a single prefix.

Together, the three within-prefix tests support a narrow conclusion: conditional on the first-order marginal deviations and conditional on operational replay, the broadcast payload within any single prefix exhibits no further detectable regularities in pairwise structure, compressibility, or inter-message differences. This is what a strong deterministic keyed function of unseen inputs would produce once the overall marginal and positional structure documented in §4–§6 were present; it is also what a simpler non-cryptographic generator would produce if that generator happened to pass our three probes [6]. The within-prefix probes exclude simple failure modes — a stuck RNG, a short-period PRG, counter- or state-machine-based construction — but do not uniquely identify the underlying generator.

8.Synthesis

The evidence presented in sections 4 through 7 contradicts the null hypothesis that the 30-character body is a uniform draw from Σ30, as any IND-CPA-secure AEAD ciphertext would appear to a ciphertext-only observer. Three independent channels of evidence support this conclusion, and three within-key probes constrain the shape of the deviation. This section synthesises what the data can support — a region-level description of the payload along its 150-bit length — and explicitly states what it cannot support, namely identification of any region with a specific cryptographic primitive.

8.1Evidence against uniform-output encryption

The observed base32 symbol marginal distribution (§4) is not uniform. Two specific symbols are under-represented by approximately 35–40% relative to the expected 1/|Σ| rate. The deviation exceeds 20σ in aggregate and persists within every prefix cell examined, with median per-cell magnitudes of roughly −4.7σ and −4.6σ on the seventeen largest cells. A synthetic uniform base32 control of matched size does not reproduce this pattern. The first-order marginal of the output is therefore not uniform.

This finding alone excludes the simplest interpretation under which the broadcast is a per-session AEAD ciphertext with uniform output. A secure AEAD primitive, viewed from outside the keyholder, produces a uniform distribution on Σ30 up to finite-sample noise; the observations are incompatible with that distribution by many orders of standard deviation.

8.2Positional localisation of the deviation

The position-wise profile of the marginal deficit (§5) is not uniform across positions. Positions 5 through 18 and 22 through 24 carry the full strength of the M/5 deficit; positions 3–4 and 19–21 carry an attenuated deficit (|z| < 1.7 for both symbols in each region). Two regions therefore break from the surrounding pattern: one at the head of the body (positions 3–4, bit offsets 11–20), one mid-body (positions 19–21, bit offsets 91–105).

The doublet-suppression profile (§6) distinguishes these two regions further. The mid-body region at positions 19–21 carries a co-localised doublet suppression (z = −7.63 at 19–20, −6.67 at 20–21); the head-of-body region at positions 3–4 does not (pair-doublet z = −1.05, within noise). In addition, a much stronger doublet suppression occupies the five final consecutive position pairs (25–26 through 29–30, covering bits 121–150), in which the doublet rate drops to roughly 1/500 — a fifteen-fold reduction from the 1/32 uniform expectation — and the constraint persists in every prefix cell. A localisation of this kind is not produced by a single uniform-output primitive operating across all 150 bits. The payload carries at least four distinguishable statistical regimes along its length, with transitions at the character-grid boundaries between them.

8.3Within-prefix structure is consistent with a deterministic process

Conditional on the first-order marginal deviation and the doublet-suppression structure, the within-prefix structural probes (§7) do not introduce further anomalies: pairwise positional mutual information sits within the shuffle-null envelope, compression ratios sit within the synthetic-random noise floor across all 41 prefix cells, and modular-difference distributions between date-ordered consecutive payloads are uniform. These outcomes are what one would expect if the payload were generated by a deterministic process whose output — when stripped of the marginal-level deviation and the doublet-level structure — carries no additional detectable regularity in our probes. This is consistent with, but does not prove, the use of a strong keyed function somewhere in the generator pipeline [6, 5]. The within-prefix probes rule out simple failure modes (stuck RNG, short-period PRG, table-lookup construction) but cannot, on their own, establish cryptographic provenance.

8.4What the data does and does not support

The observations support a format-level claim and rule out a null, but they do not support identification of a specific cryptographic primitive.

What the data supports. The 150-bit Group 1 payload is not the output of a single uniform-random-on-Σ30 generator. It has at least four statistically distinguishable regions along its length: a prefix region (positions 1–2) whose distribution follows the PR identity rather than the body generator; an attenuated-deficit region at positions 3–4; a main-body region (positions 5–18 and 22–24) carrying the full M/5 deficit; a joint attenuation at positions 19–21 in both the marginal and doublet channels; and a no-duplicate region at positions 25–30 in which the character-equality rate drops roughly 15-fold below the uniform expectation at every one of the fifteen pair-positions within the window — not only at adjacent pairs — so that the six trailing characters are near-certainly pairwise distinct in each message. Schematically:

c = P10 ‖ X10 ‖ B75 ‖ T25 ‖ R30 (2)

where the subscripts label region widths in bits: P10 is the 10-bit prefix (positions 1–2, carrying the PR); X10 is a 10-bit region at positions 3–4 in which the M/5 deficit is absent; B75 is the 75 bits at positions 5–19 that carries the main-body M/5 deficit; T25 is a 25-bit region at positions 20–24 in which both the marginal and doublet statistics attenuate relative to B; and R30 is a 30-bit region at positions 25–30 under a pairwise no-duplicate constraint — every pair of positions within the region shows a 15-fold equality suppression, so that the six characters are near-certainly distinct from each other in each message. The bit boundaries are approximate and pinned to the character grid; finer resolution awaits the bit-offset scan outlined in §9.4. Equation (2) names regions by their statistical behaviour, not their cryptographic role.

Figure 5. Observational decomposition of the 150-bit Group 1 broadcast body into five regions distinguished by their statistical behaviour, derived from the observations reported in §4 through §7. No cryptographic role is assigned to any region — the figure summarises what the data looks like, not what generates it. The five regions are: (P) positions 1–2, the prefix field, whose symbol distribution follows the PR population rather than the body generator; (X) positions 3–4, a 10-bit region in which the body-wide M/5 deficit is absent (|z| < 0.9 for both symbols); (B) positions 5–18 plus 22–24, the body region in which the M/5 deficit is fully active; (T) positions 19–21 (bit offsets 91–105), in which the M/5 deficit attenuates (|z| < 1.7) and the consecutive-doublet rate drops sharply (z = −7.63, −6.67); (R) positions 25–30, in which character-equality rates drop roughly 15-fold below the uniform expectation at every one of the fifteen pair-positions within the window (every adjacent pair at z ≤ −9.7 and every non-adjacent pair within the region at z ≤ −9.3; see §6.3), so the six trailing characters are near-certainly pairwise distinct. Regions are aligned to the character grid; finer resolution awaits the bit-offset scan outlined in §9.4. Bit-offset tick marks on the top axis are geometric landmarks only; we do not read the T-region's coincidence with the 96-bit tick as cryptographically diagnostic (§8.4, §9.1). Group 2 does not exhibit this multi-region structure; see Figure 6.

What the data does not support. The observations do not identify the B and T regions together with a standard-width cryptographic authenticator — GMAC, Poly1305, a truncated HMAC, SipHash, or a similar construction terminating at the T/R boundary. Standard authenticator primitives are designed to produce output that is computationally indistinguishable from uniform random bits across their entire output width [3, 4, 5]. Under any such primitive, the main-body region B would exhibit a flat marginal distribution, not the approximately 35% M/5 deficit we observe. The deficit's presence at main-body positions is the direct evidence against standard-authenticator identification in that region: if B carried a uniform-output MAC, the deficit would not be there. Reading the T-region attenuation as "the authenticator ends here" requires assuming the region before it is an authenticator, which the marginal statistics already rule out. And the T-region's position in the vicinity of the 96-bit bit offset — a common tag width in AEAD constructions — is geometry, not evidence. A transition at positions 3–4 is also present in the marginal channel, without any comparable round-number alignment, which underscores that "a body region whose M/5 deficit attenuates" is a general structural feature and not a unique landmark.

The co-localisation of marginal attenuation and doublet suppression at positions 19–21 remains the strongest evidence that the payload has internal boundaries, and the trailing no-run constraint at positions 25–30 is a separate finding that needs no primitive-identification argument — it is a near-deterministic structural rule on the final six characters of the broadcast. Those two findings stand on their own. The 10-bit attenuated region at positions 3–4 is a weaker finding — only the marginal channel carries it — but it is geometrically symmetric to the pattern at 19–21 and is reported here for completeness. What generates the main-body M/5 deficit, and what generates the trailing-region no-run constraint, are open questions for which ciphertext-only analysis does not provide an answer.

The 30-character HFGCS Group 1 broadcast payload is not the output of a single uniform-random generator. It has at least four distinguishable statistical regions along its 150-bit length, with a joint marginal-plus-doublet transition at positions 19–21 and a near-deterministic pairwise no-duplicate rule on the final six characters (positions 25–30): every one of the fifteen pair-positions within that window shows a ~15× equality suppression, so the six trailing characters are near-certainly pairwise distinct within each message. Identification of any region with a specific cryptographic primitive is not supported by the data: the M/5 deficit in the main body is inconsistent with the uniform-output behaviour of standard authenticators, and no interpretation of the mid-body transition as the terminus of such a primitive survives that constraint.

8.5Group 2 format contrast

Applying the same four structural probes to the Group 2 cohort (N = 1,858), and to its addressee-status sub-slices (G2 with addressee, N = 1,368; G2 without addressee, N = 490), returns a sharply different result: the G1 boundary attenuation at positions 19–21 and the trailing-region no-run constraint at positions 25–30 are both absent. Table 8 reports the side-by-side probe values.

probe G1 (all) G2 (all) G2 w/ addr G2 w/o addr
aggregate body M z−21.18−3.57−2.51−2.74
aggregate body 5 z−20.89−1.88−1.71−0.81
pos 19 M / 5 z−0.47 / −1.24+0.12 / −0.80−1.19 / −0.57+2.22 / −0.59
pos 20 M / 5 z−1.14 / −0.38−1.98 / +2.09−1.19 / +1.11−1.87 / +2.22
pos 21 M / 5 z−0.47 / −1.62+0.39 / +0.65+0.19 / −0.11+0.43 / +1.45
doublet pair 19–20 z−7.63−0.93−0.27−1.36
doublet pair 20–21 z−6.67+1.04+0.80+0.69
doublet pair 25–26 z−9.82+1.44+0.80+1.45
doublet pair 26–27 z−10.01−0.80−0.42−0.85
doublet pair 27–28 z−9.72+1.30+1.26+0.43
doublet pair 28–29 z−9.82+1.57+2.03−0.34
doublet pair 29–30 z−9.82+1.44+1.26+0.69
Table 8. Side-by-side comparison of the four Figure-4 structural signatures across Group 1, Group 2, and the two G2 addressee-status sub-slices. Only Group 1 carries the boundary and trailing-region signatures; the two G2 sub-slices are statistically indistinguishable from each other and from G2 overall, indicating that addressee status is not a structural variable at this resolution. The position-2 over-representation of "5" in G2 (e.g. z = +12.2 on G2 all, not tabulated above) reflects the G2 PR distribution — Group-2 prefixes disproportionately include "5" as the second character — not a body-generator property.

Within Group 2, the aggregate M/5 body deficit is present but an order of magnitude weaker than in G1 (body z around −3 and −2 respectively), and the position-wise and doublet profiles are flat — every position and every doublet pair behaves as uniform to within ±2σ, apart from the prefix positions 1–2. The appropriate format description for Group 2 is the minimal one shown in Figure 6: a 10-bit prefix field followed by a single 140-bit body field without detectable internal boundaries.

Figure 6. The inferred two-field payload format for Group 2. The 150-bit broadcast body comprises the same 10-bit prefix field P as Group 1 (positions 1–2, carrying the PR) followed by a single 140-bit body field B spanning positions 3–30. Unlike Group 1, no internal field boundary is detected: the per-position M/5 profile is flat, the pairwise doublet profile is flat, and the trailing-region doublet rate matches the uniform expectation. The residual aggregate M/5 deficit of roughly −3σ in G2 is consistent with either a weak shared mechanism (possibly a transcription effect common to both groups) or a mild generator-level non-uniformity without internal structure. Compare with Figure 5 for Group 1.

The immediate cryptographic inference: Group 1 and Group 2 are not the same format. If both were outputs of the same scheme under different key material, we would expect identical structural signatures up to sampling noise — identical boundary behaviour, identical trailing-region constraints. We do not see that. The two groups either use different underlying generators, or one group's payload is being processed through an additional structural layer absent in the other. Ciphertext-only analysis cannot distinguish these sub-hypotheses, and we note the observation without claiming to identify its cause.

9.Discussion, caveats, and future probes

9.1Primitive-identification limits

Ciphertext-only analysis cannot identify the specific cryptographic primitive generating any part of the payload, and the observations in this paper in fact argue against reading the pre-transition region as a standard-width cryptographic authenticator. The problem is internal to the data: standard authenticator primitives — GMAC, Poly1305, truncated HMAC, SipHash, CMAC, and their relatives — are specifically designed to produce output that is computationally indistinguishable from uniform random bits across their entire output width [3, 4, 5]. Under any such primitive, the marginal distribution of base32 symbols across the output should be flat, with each symbol occurring with probability 1/32 ± O(N−1/2). What we observe at main-body positions is a roughly 35% deficit on two specific symbols, sustained at ≥20σ in aggregate. Whatever generates the main-body region, it is not producing uniform output — and so it is not a standard authenticator primitive at that width.

The mid-body transition does happen to sit in the vicinity of the 96-bit offset — a common tag width in AEAD constructions — but the coincidence is not itself evidence. A transition at a given bit offset is consistent with an authenticator terminating there or with any of a number of non-cryptographic structural transitions: a block-cipher word boundary, a field-separator in an encoding layer, a sub-key rotation boundary, or an artefact of a post-processing step applied to the raw generator output. The existence of a second attenuation region at positions 3–4, which sits nowhere near a canonical cryptographic length, further weakens the case for treating the bit-96 alignment as diagnostic: regions of attenuated deficit are a general feature of the body, not a single landmark. Distinguishing among the causal hypotheses requires correspondences between the payload and its underlying inputs (plaintext/nonce/associated-data tuples) that ciphertext-only analysis cannot supply. What the data supports is the narrower claim that the payload has internal structural boundaries — one at positions 19–21, one weaker boundary at positions 3–4, and one at positions 25–30 — whose origins are not established here.

9.2Higher-order structural blind spots

The pairwise MI probe is insensitive to three-way and higher-order joint dependencies [6]. Modern cryptographic primitives can leave joint structure detectable only at order three or higher. At N ≈ 200 per prefix cell and an alphabet of 32 symbols, the 323 = 32,768 triplet cells are populated at ~0.006 counts per cell on average, which precludes credible triplet inference. The combined 5,377 unique messages permit roughly 0.16 counts per triplet cell — still below the threshold for a principled triplet goodness-of-fit test. Resolving higher-order structure would require either a substantially larger corpus (on the order of 105 unique messages) or a targeted test of a specific hypothesised primitive rather than a generic search.

9.3Statistical power

The marginal-bias finding has an effect size of approximately 35–40% relative deficit on the affected symbols, against a null standard error of roughly 0.2% at aggregate sample size — an effect detectable at 21σ in aggregate and at roughly 4σ even in the smallest individual prefix cell examined. The doublet-suppression finding at the trailing region is a fifteen-fold reduction in rate, detectable at z ≈ −10 in aggregate and −2 to −3 per cell; taken over the five-position trailing block its Fisher's-combined p-value across the seventeen examined prefix cells is smaller than 10−100. The power to detect both observed effects is therefore near unity, and their per-cell persistence is sufficient to rule out sampling noise as an explanation at any standard significance threshold.

9.4Future probes

Two follow-up directions would substantively advance the analysis:

  1. Bit-offset scanning at putative format boundaries. Extend the position-wise analysis of §5 to every bit offset from 1 to 149, testing for additional boundary signatures at the canonical cryptographic widths of 32, 64, 128, and 160 bits. A second detected boundary would constrain the format further and may identify sub-fields within the residue region.
  2. Cross-corpus comparison with other fixed-length EAM bands. The 30-character cohort is one band in a wider length distribution in the HFGCS traffic stream. Comparing the symbol-marginal distribution across bands tests whether the M/5 deficit is specific to the 30-character format or is a system-wide property shared with other broadcast lengths — which would inform the layer at which the deviation originates.

10.Conclusion

The 30-character HFGCS broadcast payload body is not a uniform draw from the base32 alphabet. Three independent statistical signatures, each robust to stratification down to individual prefix cells, contradict the uniform-AEAD null: (i) two specific symbols (M and 5) are under-represented by approximately one-third at nearly every body position, persistent at ≥4σ in every large prefix cell; (ii) the position-wise profile of this deficit attenuates sharply at character positions 19 through 21 — and, more weakly, at positions 3–4 — indicating internal structural transitions in the body; and (iii) the rate of character equality at pairs of positions drops to roughly one-fifteenth of the uniform expectation across every one of the fifteen pair-positions within the final six characters (positions 25–30), uniformly within every prefix examined — a near-deterministic pairwise no-duplicate rule rendering the six trailing characters mutually distinct in each message. The rule holds not only at the five adjacent pairs originally detected by the consecutive-doublet probe but at every pair within the six-character window (§6.3). Within-prefix probes of pairwise positional independence, compressibility, and inter-message differential structure exhibit no further detectable regularity.

Taken together these observations are incompatible with the interpretation of the Group 1 broadcast payload as the output of a primitive producing uniform ciphertext across all 150 bits. The payload has internal structural boundaries. The observations support a region-level description of the body — a 10-bit prefix, a 10-bit head region at positions 3–4 where the M/5 deficit is absent, a main body at positions 5–18 and 22–24 where the deficit is fully active, a co-localised marginal-plus-doublet transition at positions 19–21, and a pairwise no-duplicate region at positions 25–30 in which the six trailing characters are near-certainly distinct from each other — and this region-level description is the strongest format-level claim the observations license. We deliberately stop short of identifying any region with a specific cryptographic primitive. The main-body M/5 deficit is inconsistent with the uniform-output behaviour that standard authenticator primitives (GMAC, Poly1305, truncated HMAC, SipHash) are designed to produce, so the natural ciphertext-only reading — a truncated-authenticator output terminating at the mid-body transition — is ruled out by the statistics of the region that would carry the authenticator. What generates the main-body deficit, and what generates the trailing-region pairwise no-duplicate rule, remain open questions that ciphertext-only analysis cannot answer. The same structural probes applied to Group 2 return none of these signatures (§8.5): the G2 body carries only a mild M/5 deficit and no detectable internal boundary. Whatever each group's generator is, the two are not producing output with the same internal structure.

To the reader interested in what is and is not knowable about HFGCS broadcast structure from public observation alone: we have established that the payload is not uniform-output encryption and that it has at least four statistically distinguishable regions along its length, we have identified the positions at which those regions meet, and we have specified the follow-up probes that would refine the format further. The strongest free-standing cryptographic observation — the one that needs no interpretation beyond its data-level statement — is the 15-fold suppression of pairwise character equality across the fifteen pair-positions within the final six characters, in every prefix examined. The targeted bit-offset scan across the 150-bit payload and the cross-corpus comparison across broadcast lengths are the natural next steps.

The scope of the tail no-duplicate rule is already known to be Group-stratified rather than universal: Group 2 at 30 characters does not carry it (§8.5). Separate per-position structural heatmap work on the broader HFGCS corpus at other fixed lengths finds a variety of format-level structures across lengths and groups, but none of the heatmaps we have examined for Group 2 or Group 4 cohorts at other lengths shows a tail-region signature analogous to the 30-character Group 1 rule. The productive open question is whether the tail no-duplicate rule is a property of the Group 1 encoding that reproduces in Group 1 traffic at other fixed lengths, rather than a system-wide HFGCS feature, and applying the IPE and doublet-rate probes of §6 directly to Group 1 cohorts at other lengths would settle it. That test is left to future work.

11.References

  1. Bellare, M., Desai, A., Jokipii, E., & Rogaway, P. (1997). A concrete security treatment of symmetric encryption. Proceedings of the 38th Annual Symposium on Foundations of Computer Science, 394–403. doi:10.1109/SFCS.1997.646128
  2. Rogaway, P. (2002). Authenticated-encryption with associated-data. Proceedings of the 9th ACM Conference on Computer and Communications Security (CCS), 98–107. doi:10.1145/586110.586125
  3. McGrew, D., & Viega, J. (2004). The security and performance of the Galois/Counter Mode (GCM) of operation. Proceedings of INDOCRYPT 2004, LNCS 3348, 343–355. doi:10.1007/978-3-540-30556-9_27
  4. Krovetz, T., & Rogaway, P. (2011). The software performance of authenticated-encryption modes. Fast Software Encryption (FSE) 2011, LNCS 6733, 306–327. doi:10.1007/978-3-642-21702-9_18
  5. Bellare, M., Canetti, R., & Krawczyk, H. (1996). Keying hash functions for message authentication. Advances in Cryptology — CRYPTO '96, LNCS 1109, 1–15. doi:10.1007/3-540-68697-5_1
  6. Rogaway, P., & Shrimpton, T. (2004). Cryptographic hash-function basics: Definitions, implications, and separations for preimage resistance, second-preimage resistance, and collision resistance. Fast Software Encryption (FSE) 2004, LNCS 3017, 371–388. doi:10.1007/978-3-540-25937-4_24
  7. Rukhin, A., Soto, J., Nechvatal, J., et al. (2010). A statistical test suite for random and pseudorandom number generators for cryptographic applications. NIST Special Publication 800-22, Revision 1a. doi:10.6028/NIST.SP.800-22r1a
  8. Good, I. J. (1988). The interface between statistics and philosophy of science. Statistical Science, 3(4), 386–397. doi:10.1214/ss/1177012754
  9. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423 (Part I) & 623–656 (Part II). doi:10.1002/j.1538-7305.1948.tb01338.x · doi:10.1002/j.1538-7305.1948.tb00917.x
  10. Shannon, C. E. (1949). Communication theory of secrecy systems. Bell System Technical Journal, 28(4), 656–715. doi:10.1002/j.1538-7305.1949.tb00928.x
  11. Miller, G. A. (1955). Note on the bias of information estimates. In Information Theory in Psychology: Problems and Methods (H. Quastler, ed.), 95–100. [pre-DOI era book chapter; no DOI assigned]
  12. L'Ecuyer, P., & Simard, R. (2007). TestU01: A C library for empirical testing of random number generators. ACM Transactions on Mathematical Software, 33(4), Article 22. doi:10.1145/1268776.1268777

A.Appendix: computational details

A.1Software and reproducibility

All analyses were performed in Python 3.14 using pandas for data manipulation, numpy for matrix computation, and the Python standard library modules lzma, gzip, and bz2 for compression (the compression analyses reported in §7.2 use LZMA at preset 9 with the PRESET_EXTREME flag). All pseudorandom operations — column shuffling for the MI null, message reordering for the compression variants, and synthetic corpus generation — are seeded with fixed integer seeds to ensure exact reproducibility.

A.2Validation of the MI null against analytic finite-sample bias

The empirical column-shuffle null distribution for pairwise mutual information was validated against the Miller–Madow analytic prediction. At the combined first-broadcast cohort size N = 5,377 and |Σ| = 32 with all 32 symbols observed at every body position, the analytic bias under independence is (|Σ|−1)2 / (2N ln 2) ≈ 0.129 bits. Across the per-prefix cells (N ranging from ~100 to ~250), analytic bias scales from ~0.8 bits at the smallest to ~0.3 bits at the largest; the empirical shuffle-null means recovered in each analysis track the analytic prediction to within a few thousandths of a bit, validating the null as an appropriate reference for z-score computation at every sample size examined.

A.3Compression noise floor

For each prefix cell analysed in §7.2, ten synthetic random corpora of matched N were generated with fixed seeds 1000 through 1009. The reported null (μnull, σnull) is the mean and sample standard deviation of LZMA compression ratios across these ten trials. Empirical σnull scales approximately as N−1/2 in the observed range N ∈ [50, 3000], consistent with the expected finite-sample variance of LZMA on uniform base32 input.

A.4First-broadcast itemization rationale

Operational retransmissions violate the independence assumption underlying all statistical probes. We retain one observation per unique 30-character string — the earliest by (date, UTC) — and discard all later retransmissions. An alternative would be weighted inclusion of retransmissions with inverse-multiplicity weights, which requires a model of the retransmission process. First-broadcast itemization is the simpler and more conservative approach and is used for every test reported in the paper.

A.5Prefix as the key-period partition

The paper treats the two-character prefix as the finest cryptographic partition the observable data supports, on the working assumption that each prefix is introduced, carries traffic for one operational lifetime, and is eventually retired before its identifier is reused. Calendar year is not itself a key delineator: a prefix may legitimately span a year boundary (introduced in late one year, retired in early the next) without being re-keyed. For bookkeeping reasons the underlying reference table records prefix-group assignments per calendar year, but the active prefix identity — not the year — is what we treat as the key-period label. In the first-broadcast corpus used for this analysis, no prefix appears with meaningful frequency in more than one calendar year, so the prefix partition and the (prefix, year) partition coincide; the paper's numeric results are invariant to the choice between them. Messages whose prefix is absent from the group-assignment reference for the broadcast year are excluded from analysis.

A.6Symbols and conventions

Positions are indexed from 1 to 30 following a one-based convention. The alphabet size |Σ| = 32 reflects RFC 4648 base32: A–Z and 2–7. All entropies and mutual informations are reported in bits (base-2 logarithms). Compression ratios are reported as the percentage of raw bytes retained after compression; smaller values indicate more aggressive compression.

A.7Per-symbol Fisher's-combined p computation

For each base32 symbol c, the Fisher's-combined test for systematic under-representation across seventeen prefix cells is computed as −2·Σi ln(pi), where pi is the Poisson one-sided tail probability P(Z > −zi) for cell i. Under the null of no systematic bias the statistic is χ234-distributed. The observed statistics for M and 5 exceed any tabulated percentile with p < 10−50.