We present our Blue Team submission to the Re-Align Challenge, selecting 20 models from a curated registry of 141 to maximize mean pairwise Centered Kernel Alignment (CKA) on held-out evaluation images. Our key finding is that architecture family homogeneity—restricting to either attention-based models or a tightly aligned CNN cluster—predicts real-evaluation alignment better than proxy-optimized raw CKA across architecturally diverse sets. Our best submission achieves 0.7294 real portal CKA using a blur-CNN cluster of anti-aliased and blur-augmented convolutional networks (EcaResNet, ResNetBlur, DenseNet, CSPDarkNet) alongside lightweight attention-over-convolution hybrids (HaloNet, BotNet). Earlier all-attention sets reached 0.70, while imagenette-optimized diverse sets collapsed to 0.68 on the real evaluation. We show that CIFAR-100 is a better proxy than Imagenette for calibration, that homogeneous CNN clusters can match or exceed all-ViT alignment when selected appropriately, and offer actionable proxy design principles for future alignment studies.
A central question in representation learning is whether independently trained neural networks converge to similar internal feature geometries—a property sometimes called representational universality (Sucholutsky et al., 2023). The Re-Align Challenge operationalizes this question: the Blue Team selects 20 models from a registry of 141 whose pairwise CKA on held-out images is jointly maximized (Kornblith et al., 2019).
We compute CKA on proxy datasets (CIFAR-100, Imagenette), optimize model selection via greedy search with local refinement, and submit to the portal for real evaluation. Our iterative process, guided by four rounds of leaderboard feedback, yielded the following insights:

- Architecture family homogeneity predicts real-evaluation alignment better than proxy-optimized raw CKA over architecturally diverse sets.
- CIFAR-100 is a better-calibrated proxy than Imagenette.
- Homogeneous CNN clusters can match or exceed all-ViT alignment when selected appropriately.
These results offer partial support for the Platonic Representation Hypothesis (Huh et al., 2024)—representations converge, but the path of convergence is shaped by architectural inductive biases. Within a shared inductive bias family (whether self-attention or anti-aliased convolution), models trained independently on ImageNet-1k arrive at remarkably similar feature geometries.
The Blue Team objective is to maximize the mean pairwise CKA over a selected set $\mathcal{M}$ of 20 models:
\[\bar{\rho}(\mathcal{M}) = \frac{2}{|\mathcal{M}|(|\mathcal{M}|-1)} \sum_{i < j} \text{CKA}(F_i, F_j)\]where $F_i$ is the penultimate-layer feature matrix of model $i$ over the evaluation images.
We use linear CKA (Kornblith et al., 2019):
\[\text{CKA}(X, Y) = \frac{\|Y^\top X\|_F^2}{\|X^\top X\|_F \cdot \|Y^\top Y\|_F}\]where $X$ and $Y$ are column-centered feature matrices. CKA is invariant to orthogonal transformations and isotropic scaling, making it appropriate for comparing features across architectures with varying embedding dimensionalities (256–3072 in the registry). We extract penultimate-layer features from 138 models (3 failed due to architectural incompatibilities) on two proxy datasets: CIFAR-100 and Imagenette (500 images each).
We compute full 138 × 138 pairwise CKA matrices on both proxies.
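For reference, linear CKA on pooled features follows directly from the formula above. The sketch below is a minimal NumPy version (not the challenge's official implementation); it column-centers the feature matrices, as linear CKA assumes:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between feature matrices X (n x d1) and Y (n x d2)."""
    # Column-center each feature matrix.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return float(num / den)
```

Because the score is invariant to orthogonal transformations and isotropic scaling, `linear_cka(X, c * X @ Q)` equals 1 for any orthogonal `Q` and nonzero scalar `c`, which is what makes comparisons across embedding widths of 256–3072 meaningful.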
We solve the combinatorial subset selection problem using a three-stage algorithm combining greedy search with local swap refinement.
We run this procedure separately on architecture-restricted candidate pools: all-attention (ViT + hybrids), CNN-only (ResNet/DenseNet/DarkNet variants), and the full registry.
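The exact stages are not spelled out above; the following is one plausible instantiation of greedy search with local swap refinement, assuming `C` is a precomputed pairwise CKA matrix (the seeding and stage boundaries here are illustrative assumptions, not the submitted code):

```python
import numpy as np

def mean_pairwise(C: np.ndarray, idx: list) -> float:
    """Mean off-diagonal CKA within the subset idx."""
    sub = C[np.ix_(idx, idx)]
    m = len(idx)
    return (sub.sum() - np.trace(sub)) / (m * (m - 1))

def select_models(C: np.ndarray, k: int = 20) -> list:
    n = C.shape[0]
    # Stage 1: seed with the single highest-CKA pair.
    iu = np.triu_indices(n, k=1)
    best = int(np.argmax(C[iu]))
    sel = [int(iu[0][best]), int(iu[1][best])]
    # Stage 2: greedily add the model with the highest mean CKA to the current set.
    while len(sel) < k:
        rest = [i for i in range(n) if i not in sel]
        sel.append(rest[int(np.argmax([C[i, sel].mean() for i in rest]))])
    # Stage 3: hill-climb with single-model swaps until no swap improves the mean.
    improved = True
    while improved:
        improved = False
        for out in list(sel):
            for cand in range(n):
                if cand in sel:
                    continue
                trial = [cand if m == out else m for m in sel]
                if mean_pairwise(C, trial) > mean_pairwise(C, sel) + 1e-12:
                    sel, improved = trial, True
                    break
    return sel
```

Run on a candidate pool's CKA matrix, this returns the indices of a locally optimal 20-model set; restricting `C` to an architecture family implements the pool restriction described above.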
Four portal evaluations reveal systematic proxy miscalibration:
| Submission | CIFAR-100 (proxy) | Imagenette (proxy) | Real Portal CKA | Calibration ratio |
|---|---|---|---|---|
| S1: Imagenette-optimized (diverse CNN+ViT) | 0.7502 | 0.9592 | 0.68 | 0.907 |
| S2: All-ViT (manually curated) | 0.5919 | 0.7600 | 0.70 | 1.183 |
| S3: All-Attention (combined-proxy opt) | 0.7612 | 0.9499 | 0.69 | 0.906 |
| S4: Blur-CNN (best) | 0.7613 | 0.8637 | 0.7294 | 0.958 |
Calibration ratio = Real / CIFAR-100 proxy.
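As a quick sanity check, the calibration ratios follow directly from the definition; the values below are transcribed from the table above:

```python
# (CIFAR-100 proxy CKA, real portal CKA) per submission, from the table above.
submissions = {
    "S1": (0.7502, 0.68),
    "S2": (0.5919, 0.70),
    "S3": (0.7612, 0.69),
    "S4": (0.7613, 0.7294),
}
# Calibration ratio = real / CIFAR-100 proxy.
ratios = {name: real / proxy for name, (proxy, real) in submissions.items()}
```

S4 is the only submission whose ratio is close to 1, i.e. whose proxy score nearly matched the real evaluation.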
Figures 2 & 3 — CKA heatmaps (full resolution)
Figure 2: Pairwise CIFAR-100 CKA heatmap for the blur-CNN submitted set (mean = 0.761). The matrix is uniformly warm with minimum pairwise CKA of approximately 0.55 and most pairs above 0.70.
Figure 3: Comparison of CIFAR-100 CKA heatmaps. Left: Blur-CNN set (real = 0.7294, proxy mean = 0.761). Right: Imagenette-optimized diverse set (real = 0.68, proxy mean = 0.750). The blur-CNN set is more uniformly high-CKA; the diverse set contains CNN–ViT blocks that appear warm on Imagenette but diverge on richer evaluations.
Figure 4: Intra-family average pairwise CKA on CIFAR-100 and Imagenette. All three families—blur/ResNet, hybrid attention, and pure ViT—show high intra-family alignment, with hybrid attention leading on both proxies. Imagenette consistently inflates CKA relative to CIFAR-100 across all families.
Figure 5: 2D MDS embedding of 138 vision models using CIFAR-100 CKA dissimilarity. Stars (★) mark the blur-CNN submitted set, which clusters tightly in the ResNet/blur region of the embedding space. The tight clustering confirms that the selection exploits high within-cluster alignment rather than spanning diverse regions.
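A layout like Figure 5's can be reproduced with classical (Torgerson) MDS on the dissimilarity matrix $1 - \text{CKA}$; the figure may have used a different MDS variant, so treat this self-contained NumPy version as a sketch:

```python
import numpy as np

def classical_mds(D: np.ndarray, dim: int = 2) -> np.ndarray:
    """Embed points in `dim` dimensions from a dissimilarity matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dim]  # top-`dim` eigenpairs
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

# Usage with a pairwise CKA matrix C: coords = classical_mds(1.0 - C)
```

For a Euclidean dissimilarity matrix this recovers the original point configuration exactly (up to rotation and reflection); for CKA-derived dissimilarities it gives the approximate 2D layout used to visualize model clusters.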
Figure 6: Per-model average CKA to the other 19 models in the blur-CNN set. ResNetBlur50, CSPResNeXt50, and EcaResNetLight contribute most to set alignment. The spread is narrow (0.73–0.80), indicating no severe outliers.
Figure 7: Greedy selection trajectory for the blur-CNN set on CIFAR-100. Mean pairwise CKA converges to 0.7613 by model 12, with negligible improvement from subsequent additions, indicating the 20-model solution is near-optimal for this architecture pool.
Our results across four submissions reveal a consistent pattern: architecture-homogeneous sets outperform mixed sets on real evaluation, even when mixed sets score higher on proxy datasets.
CNN homogeneity: The blur-CNN models share anti-aliasing, ResNet bottlenecks, and similar training recipes. Convolutional processing is resolution-agnostic in a way that patch-based attention is not, yielding consistent CKA across image distributions. The CIFAR-100 proxy calibrates well (0.958×) because the convolutional pipeline processes upsampled low-resolution images through the same computational graph as natural images.
ViT homogeneity: All-attention sets (S2, S3) also benefit from shared inductive biases—patch-based attention and global aggregation—consistent with the Platonic Representation Hypothesis (Huh et al., 2024), which posits that sufficiently capable models converge to a shared statistical model of reality. However, ViT sets calibrate worse on CIFAR-100 (real/proxy ≈ 0.91–1.18) because low-resolution upsampling disrupts the spatial structure that patch embeddings rely on.
Mixed sets fail: S1 (imagenette-optimized diverse) has high proxy scores but collapses on real evaluation because CNN↔ViT pairs align on 10 coarse classes but diverge on 1,000 fine-grained classes. The representational strategies of convolution (local texture) and self-attention (global shape) produce similar outputs for easily separable categories but diverge when fine-grained discrimination is required (Raghu et al., 2021; Geirhos et al., 2019).
Our empirical analysis yields actionable principles for future proxy-based alignment studies: prefer richer, finer-grained proxies (CIFAR-100 over Imagenette), favor architecture-homogeneous candidate pools over proxy-optimized diverse ones, and check proxy-to-real calibration whenever evaluation feedback is available.
- Proxy sample size: We use only 500 images per proxy. CKA variance with 500 samples is non-negligible; scaling to 5,000–10,000 images would reduce estimation error.
- Layer selection: We use the registry-specified penultimate layer; earlier layers may exhibit different alignment patterns (Raghu et al., 2021).
- Limited submissions: With only four portal evaluations, our calibration analysis is necessarily coarse; more evaluation rounds would yield tighter calibration estimates.
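The sample-size concern can be quantified with a simple row bootstrap over the proxy images. This hypothetical sketch resamples the image set with replacement and re-estimates linear CKA each time:

```python
import numpy as np

def _cka(X: np.ndarray, Y: np.ndarray) -> float:
    # Linear CKA on column-centered features.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    return float(np.linalg.norm(Y.T @ X, ord="fro") ** 2 /
                 (np.linalg.norm(X.T @ X, ord="fro") *
                  np.linalg.norm(Y.T @ Y, ord="fro")))

def cka_bootstrap_std(X: np.ndarray, Y: np.ndarray,
                      n_boot: int = 200, seed: int = 0) -> float:
    """Bootstrap std. dev. of linear CKA under resampling of the proxy images."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    vals = [_cka(X[idx], Y[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    return float(np.std(vals))
```

Comparing this bootstrap spread at 500 images versus a larger sample would make the estimation-error claim concrete for any given model pair.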
We achieve 0.7294 real portal CKA with a blur-CNN cluster selected via CIFAR-100 proxy optimization—a substantial improvement over our earlier diverse (0.68) and all-attention (0.69–0.70) submissions. Our key finding is that architecture family homogeneity predicts real-evaluation alignment better than proxy-optimized diversity. Homogeneous CNN clusters with shared anti-aliasing and bottleneck architectures achieve both high absolute CKA and tight proxy calibration. For future work, full ImageNet-1k CKA computation and larger proxy sample sizes would improve calibration, and adaptive proxy construction—iteratively selecting images that maximize disagreement between candidate models—could bridge the proxy-to-real gap more effectively.
| # | Model | Layer | Architecture type |
|---|---|---|---|
| 1 | ecaresnetlight.miil_in1k | global_pool | ECA-ResNet |
| 2 | cspdarknet53.ra_in1k | head.global_pool | CSP-DarkNet |
| 3 | resnetblur50.bt_in1k | global_pool | ResNet + blur pool |
| 4 | ecaresnet101d.miil_in1k | global_pool | ECA-ResNet |
| 5 | tresnet_l.miil_in1k | head.global_pool | TResNet |
| 6 | cs3darknet_focus_l.c2ns_in1k | head.global_pool | CSP-DarkNet v3 |
| 7 | darknetaa53.c2ns_in1k | head.global_pool | DarkNet + anti-alias |
| 8 | bat_resnext26ts.ch_in1k | head.global_pool | BAT-ResNeXt (attn hybrid) |
| 9 | cspresnext50.ra_in1k | head.global_pool | CSP-ResNeXt |
| 10 | botnet26t_256.c1_in1k | head.global_pool | BotNet (attn hybrid) |
| 11 | lambda_resnet26rpt_256.c1_in1k | head.global_pool | Lambda-ResNet (attn hybrid) |
| 12 | halo2botnet50ts_256.a1h_in1k | head.global_pool | HaloBotNet (attn hybrid) |
| 13 | cspresnet50.ra_in1k | head.global_pool | CSP-ResNet |
| 14 | repvit_m0_9.dist_300e_in1k | head.head.bn | RepViT (mobile hybrid) |
| 15 | eca_botnext26ts_256.c1_in1k | head.global_pool | ECA-BotNeXt (attn hybrid) |
| 16 | densenet121.ra_in1k | global_pool | DenseNet |
| 17 | densenetblur121d.ra_in1k | global_pool | DenseNet + blur pool |
| 18 | halonet26t.a1h_in1k | head.global_pool | HaloNet (attn hybrid) |
| 19 | darknet53.c2ns_in1k | head.global_pool | DarkNet |
| 20 | gernet_l.idstcv_in1k | head.global_pool | GERNet |
We also participated in the Red Team track, selecting 1,000 ObjectNet images to maximize inter-model divergence. ObjectNet’s non-standard viewpoints and backgrounds suppress familiar texture cues, exploiting the texture-vs-shape bias difference between CNNs and transformers (Geirhos et al., 2019). Full methodology is omitted for brevity; we focus on the Blue Team in this report.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., & Brendel, W. (2019). ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. ICLR. arXiv:1811.12231
Huh, M., Cheung, B., Wang, T., & Isola, P. (2024). The Platonic Representation Hypothesis. arXiv preprint. arXiv:2405.07987
Kornblith, S., Norouzi, M., Lee, H., & Hinton, G. (2019). Similarity of Neural Network Representations Revisited. ICML. arXiv:1905.00414
Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., & Dosovitskiy, A. (2021). Do Vision Transformers See Like Convolutional Neural Networks? NeurIPS. arXiv:2108.08810
Sucholutsky, I., Muttenthaler, L., Weller, A., et al. (2023). Getting aligned on representational alignment. Transactions on Machine Learning Research. OpenReview
Wightman, R. (2019). PyTorch Image Models. GitHub