The Red Team track requires selecting 1,000 images that minimize the average pairwise CKA of representations across ~141 vision models.
A naive approach might select highly diverse or anomalous images to “confuse” models. This is wrong. CKA computes the cosine similarity of vectorized, doubly-centered Gram matrices, so the score is governed by whatever structure dominates those Gram matrices. On a diverse image set, the dominant structure is the coarse class clustering that virtually every trained model shares, which pushes CKA up rather than down.
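As a concrete reference, here is a minimal linear-CKA sketch in NumPy. This is illustrative only; the competition's actual scoring code may differ in details such as kernel choice or numerical handling:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA: cosine similarity of the vectorized,
    doubly-centered Gram matrices of two embedding sets.

    X: (n, d1) embeddings of n images from model 1
    Y: (n, d2) embeddings of the same n images from model 2
    """
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # double-centering matrix
    Kc = H @ (X @ X.T) @ H                # centered Gram, model 1
    Lc = H @ (Y @ Y.T) @ H                # centered Gram, model 2
    # Frobenius inner product over the product of Frobenius norms
    return float((Kc * Lc).sum()
                 / (np.linalg.norm(Kc) * np.linalg.norm(Lc)))
```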
Let the embedding matrix $X \in \mathbb{R}^{n \times d}$ decompose as $X = M + W$, where $M$ captures between-class means and $W$ captures within-class deviations. After centering, the Gram matrix becomes:
\[K_c = H(MM^T + MW^T + WM^T + WW^T)H\]

When images span $k$ well-separated classes, the $HMM^TH$ term dominates. When $k = 1$ (all images come from one class), every row of $M$ is the same class mean, so the centering matrix $H$ maps $M$ to zero and every term involving $M$ vanishes. The centered Gram becomes $K_c = HWW^TH$, which depends entirely on within-class variation.
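A quick numerical check of the $k = 1$ collapse (a sketch with made-up dimensions; `mu` stands in for an arbitrary class mean):

```python
import numpy as np

n, d = 6, 4
rng = np.random.default_rng(0)
mu = rng.normal(size=d)               # the single class mean
M = np.tile(mu, (n, 1))               # every row is that same mean
H = np.eye(n) - np.ones((n, n)) / n   # centering matrix

print(np.allclose(H @ M, 0))               # True: HM = 0
print(np.allclose(H @ (M @ M.T) @ H, 0))   # True: HMM^T H vanishes
```

Since $HM = 0$, the cross terms $HMW^TH$ and $HWM^TH$ vanish for the same reason.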
Within a single fine-grained superclass, different architectures genuinely disagree: CNNs tend to organize by local texture and background statistics, while transformer-based and self-supervised models lean more on global shape and part structure.
We selected domestic dogs and wild canids as the target superclass: 118 dog breeds plus 7 wild canids from ImageNet validation, yielding 6,250 candidate images. This provides high within-class variation (breed, pose, lighting, background) while maintaining strict semantic uniformity.
We used 25 proxy models spanning all major architecture families and resolution regimes.
The resolution-diverse set was added in V2/V3 to close the generalization gap we observed in V1, where all proxies used 224px crops but the evaluation pool includes models operating at 160–1024px.
The cluster jobs were repeatedly preempted by Slurm. We added a checkpoint system that writes the best selection to persistent storage every 10,000 SA iterations and on SIGTERM, allowing runs to resume from where they left off.
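A stripped-down sketch of that pattern, assuming the selection lives in a NumPy index array; all names here (`save_checkpoint`, `CKPT_PATH`, the loop body) are illustrative, not our actual code:

```python
import signal
import sys
import numpy as np

CKPT_PATH = "best_selection.npy"   # would point at persistent storage

def save_checkpoint(selection: np.ndarray) -> None:
    np.save(CKPT_PATH, selection)  # persist the best-so-far indices

rng = np.random.default_rng(0)
selection = rng.choice(6250, size=1000, replace=False)  # initial pick

def handle_sigterm(signum, frame):
    save_checkpoint(selection)     # flush before Slurm kills the job
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

for it in range(200_000):
    # ... SA proposal/acceptance would update `selection` here ...
    if it % 10_000 == 0:
        save_checkpoint(selection)
```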
Our V3 submission was built from a checkpoint after one completed SA restart (200K iterations plus a local-search polish), reaching a proxy CKA of 0.441 across the 25 proxy models.
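For reference, a minimal sketch of the annealing loop over image subsets, reusing the `linear_cka` function above. The objective, cooling schedule, and helper names are illustrative assumptions, not our exact implementation:

```python
import numpy as np

def mean_pairwise_cka(embeds: list[np.ndarray], idx: np.ndarray) -> float:
    """Mean linear CKA over all proxy-model pairs on the subset `idx`."""
    m = len(embeds)
    vals = [linear_cka(embeds[i][idx], embeds[j][idx])
            for i in range(m) for j in range(i + 1, m)]
    return float(np.mean(vals))

def sa_select(embeds, n_pool, n_pick=1000, n_iters=200_000, t0=0.02, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(n_pool, size=n_pick, replace=False)
    cur = mean_pairwise_cka(embeds, idx)
    best, best_score = idx.copy(), cur
    for it in range(n_iters):
        t = t0 * (1.0 - it / n_iters)          # linear cooling schedule
        cand = idx.copy()
        outside = np.setdiff1d(np.arange(n_pool), idx)
        cand[rng.integers(n_pick)] = rng.choice(outside)  # single swap
        score = mean_pairwise_cka(embeds, cand)
        # Always accept improvements; accept regressions with
        # Boltzmann probability so the search can escape local minima.
        if score < cur or rng.random() < np.exp((cur - score) / max(t, 1e-12)):
            idx, cur = cand, score
            if cur < best_score:
                best, best_score = idx.copy(), cur
    return best, best_score
```

In practice, recomputing all $\binom{25}{2}$ CKA pairs per proposal is the bottleneck, so a real implementation would cache per-model Gram statistics and update them incrementally rather than recompute from scratch.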
| Submission | Proxy models | Proxy CKA | Proxy score (1 − proxy CKA) | Actual score |
|---|---|---|---|---|
| V1 (11 proxies, 224px only) | 11 | 0.416 | 0.584 | 0.547 |
| V3 (25 proxies, multi-resolution, partial) | 25 | 0.441 | 0.559 | 0.554 |
The V3 submission scored 0.5544, placing 1st on the Red Team leaderboard. V1 (submitted earlier as nathan-test) scored 0.5472 at rank 3.
| Rank | Submitter | Score |
|---|---|---|
| 1 | nathan-tryingsomething (us) | 0.5544 |
| 2 | tehruhn_imn1000_23 | 0.5499 |
| 3 | nathan-test | 0.5472 |
| 4 | moonshine-r92 | 0.5439 |
| 5 | moonshine-93 | 0.5439 |
| 6 | moonshine-r92 | 0.5439 |
| 7 | express-double-3 | 0.5266 |
| 8 | kencan-1st-attempt | 0.5177 |
| 9 | kencan-2nd-submit | 0.5177 |
| 10 | kencan-1st-submit | 0.5177 |
| … | ||
| 72 | express-double-5 | 0.393 |
The proxy score (0.559) and actual score (0.554) are much closer than in V1 (0.584 proxy vs 0.547 actual). The 25-model multi-resolution proxy set substantially narrowed the generalization gap compared to the 11-model 224px-only V1 proxy.
Interestingly, V3’s proxy CKA (0.441) is higher than V1’s (0.416), yet V3 scores better on the actual evaluation. The two numbers are not directly comparable because they are computed over different proxy sets: V3’s 25-model, multi-resolution proxy is a harder, more representative approximation of the true evaluation, so a nominally worse proxy score still translates into a better actual score.
Our result demonstrates a practical consequence of the known dataset-sensitivity of CKA: the metric’s value depends as much on the class structure of the stimulus set as on the models being compared, and that dependence can be exploited deliberately.
If the community wants CKA to measure genuine representational alignment rather than shared categorical knowledge, evaluation datasets should be composed of semantically uniform stimuli where coarse class structure cannot dominate the signal.