🟦 Architecture-Homogeneous Model Selection for Representational Alignment

Abstract

We present our Blue Team submission to the Re-Align Challenge, selecting 20 models from a curated registry of 141 to maximize mean pairwise Centered Kernel Alignment (CKA) on held-out evaluation images. Our key finding is that architecture-family homogeneity—restricting to either attention-based models or a tightly aligned CNN cluster—predicts real-evaluation alignment better than raw CKA optimized on a proxy over architecturally diverse sets. Our best submission achieves 0.7294 real portal CKA using a blur-CNN cluster of anti-aliased and blur-augmented convolutional networks (EcaResNet, ResNetBlur, DenseNet, CSPDarkNet) alongside lightweight attention-over-convolution hybrids (HaloNet, BotNet). Earlier all-attention sets reached 0.70, while Imagenette-optimized diverse sets collapsed to 0.68 on the real evaluation. We show that CIFAR-100 is a better calibration proxy than Imagenette and that homogeneous CNN clusters can match or exceed all-ViT alignment when selected appropriately, and we offer actionable proxy-design principles for future alignment studies.


1. Introduction

A central question in representation learning is whether independently trained neural networks converge to similar internal feature geometries—a property sometimes called representational universality (Sucholutsky et al., 2023). The Re-Align Challenge operationalizes this question: the Blue Team selects 20 models from a registry of 141 whose pairwise CKA on held-out images is jointly maximized (Kornblith et al., 2019).

We compute CKA on proxy datasets (CIFAR-100, Imagenette), optimize model selection via greedy search with local refinement, and submit to the portal for real evaluation. Our iterative process—guided by four rounds of leaderboard feedback—yielded the following insights:

  1. Imagenette overestimates alignment for diverse CNN+ViT sets (0.96 proxy → 0.68 real).
  2. CIFAR-100 is a better proxy but underestimates ViT-to-ViT alignment on natural 224px images.
  3. Architecture homogeneity matters: both all-attention sets and homogeneous CNN clusters achieve higher real scores than mixed sets.
  4. A blur-CNN cluster (anti-aliased ResNet/DenseNet variants with lightweight attention hybrids) achieves our best result (0.7294 real).

These results offer partial support for the Platonic Representation Hypothesis (Huh et al., 2024)—representations converge, but the path of convergence is shaped by architectural inductive biases. Within a shared inductive bias family (whether self-attention or anti-aliased convolution), models trained independently on ImageNet-1k arrive at remarkably similar feature geometries.


2. Methods

2.1 Objective

Maximize mean pairwise CKA over 20 models:

\[\bar{\rho}(\mathcal{M}) = \frac{2}{|\mathcal{M}|(|\mathcal{M}|-1)} \sum_{i < j} \text{CKA}(F_i, F_j)\]

where $F_i$ is the penultimate-layer feature matrix of model $i$ over the evaluation images.
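Given a precomputed pairwise CKA matrix, this objective is just the average of the strict upper triangle of the selected submatrix. A minimal sketch (the helper name is our own, not from the challenge code):

```python
import numpy as np

def mean_pairwise_cka(C: np.ndarray, subset: list[int]) -> float:
    """Objective: mean of CKA(F_i, F_j) over unordered pairs i < j in `subset`.

    C is a symmetric precomputed pairwise CKA matrix; `subset` indexes
    the selected models.
    """
    idx = np.asarray(subset)
    sub = C[np.ix_(idx, idx)]
    m = len(idx)
    # Sum the strict upper triangle and normalize by m*(m-1)/2 pairs.
    return float(sub[np.triu_indices(m, k=1)].sum() / (m * (m - 1) / 2))
```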

2.2 CKA Computation

We use linear CKA (Kornblith et al., 2019):

\[\text{CKA}(X, Y) = \frac{\|Y^\top X\|_F^2}{\|X^\top X\|_F \cdot \|Y^\top Y\|_F}\]

CKA is invariant to orthogonal transformations and isotropic scaling, making it appropriate for comparing features across architectures with varying embedding dimensionalities (256–3072 in the registry). We extract penultimate-layer features from 138 models (3 failed due to architectural incompatibilities) on two proxy datasets, CIFAR-100 and Imagenette (500 images each), and compute the full 138 × 138 pairwise CKA matrix on each proxy.
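For reference, linear CKA on column-centered features takes only a few lines of NumPy. This is an illustrative sketch following the formula above, not our exact pipeline code:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between feature matrices X (n, d1) and Y (n, d2).

    Features are column-centered first, as in Kornblith et al. (2019).
    d1 and d2 may differ, so models with different embedding widths are
    directly comparable.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)
```

The invariances claimed above are easy to check numerically: CKA(X, XQ) = 1 for any orthogonal Q, and rescaling either argument leaves the score unchanged.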

2.3 Optimization

We solve the combinatorial subset selection problem using a three-stage algorithm:

  1. Greedy initialization: Seed with the highest-CKA pair, then iteratively add the model maximizing marginal contribution to the set average.
  2. 1-swap local search: Scan all (current model, replacement) pairs and execute the best-improving swap; repeat until convergence.
  3. Multi-restart: Apply steps 1–2 from 500–2000 random initializations; keep the global best.

We run this procedure separately on architecture-restricted candidate pools: all-attention (ViT + hybrids), CNN-only (ResNet/DenseNet/DarkNet variants), and the full registry.
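Stages 1 and 2 can be sketched as follows (a first-improving variant of the 1-swap search for brevity; our runs used the best-improving swap, and stage 3 simply reruns this from random initial subsets and keeps the best result). Function names are our own:

```python
import numpy as np

def set_mean_cka(C, subset):
    """Mean pairwise CKA of `subset` (the objective)."""
    sub = C[np.ix_(subset, subset)]
    m = len(subset)
    return sub[np.triu_indices(m, k=1)].sum() / (m * (m - 1) / 2)

def greedy_select(C, k, pool):
    """Stage 1: seed with the highest-CKA pair in `pool`, then add the
    model with the largest marginal contribution (mean CKA to the set)."""
    sub = C[np.ix_(pool, pool)]
    np.fill_diagonal(sub, -np.inf)
    i, j = np.unravel_index(np.argmax(sub), sub.shape)
    chosen = [pool[i], pool[j]]
    rest = [m for m in pool if m not in chosen]
    while len(chosen) < k:
        best = max(rest, key=lambda m: C[m, chosen].mean())
        chosen.append(best)
        rest.remove(best)
    return chosen

def one_swap_refine(C, chosen, pool):
    """Stage 2: execute improving (member, replacement) swaps until no
    swap raises the set mean."""
    chosen = list(chosen)
    improved = True
    while improved:
        improved = False
        cur = set_mean_cka(C, chosen)
        for out in list(chosen):
            for inn in pool:
                if inn in chosen or out not in chosen:
                    continue
                trial = [inn if x == out else x for x in chosen]
                gain = set_mean_cka(C, trial) - cur
                if gain > 1e-12:
                    chosen, cur, improved = trial, cur + gain, True
    return chosen
```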


3. Results

3.1 Proxy Calibration

Four portal evaluations reveal systematic proxy miscalibration:

| Submission | CIFAR-100 (proxy) | Imagenette (proxy) | Real portal CKA | Calibration ratio |
|---|---|---|---|---|
| S1: Imagenette-optimized (diverse CNN+ViT) | 0.7502 | 0.9592 | 0.68 | 0.907 |
| S2: All-ViT (manually curated) | 0.5919 | 0.7600 | 0.70 | 1.183 |
| S3: All-attention (combined-proxy opt.) | 0.7612 | 0.9499 | 0.69 | 0.906 |
| S4: Blur-CNN (best) | 0.7613 | 0.8637 | 0.7294 | 0.958 |

Calibration ratio = Real / CIFAR-100 proxy.

Figures 2 and 3 — CKA heatmaps

Figure 2: Pairwise CIFAR-100 CKA heatmap for the blur-CNN submitted set (mean = 0.761). The matrix is uniformly warm with minimum pairwise CKA of approximately 0.55 and most pairs above 0.70.

Figure 3: Comparison of CIFAR-100 CKA heatmaps. Left: Blur-CNN set (real = 0.7294, proxy mean = 0.761). Right: Imagenette-optimized diverse set (real = 0.68, proxy mean = 0.750). The blur-CNN set is more uniformly high-CKA; the diverse set contains CNN–ViT blocks that appear warm on Imagenette but diverge on richer evaluations.

Figure 4: Intra-family average pairwise CKA on CIFAR-100 and Imagenette. All three families—blur/ResNet, hybrid attention, and pure ViT—show high intra-family alignment, with hybrid attention leading on both proxies. Imagenette consistently inflates CKA relative to CIFAR-100 across all families.

Figure 5: 2D MDS embedding of 138 vision models using CIFAR-100 CKA dissimilarity. Stars (★) mark the blur-CNN submitted set, which clusters tightly in the ResNet/blur region of the embedding space. The tight clustering confirms that the selection exploits high within-cluster alignment rather than spanning diverse regions.
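An embedding of this kind can be reproduced from the CKA matrix alone. A NumPy-only classical-MDS sketch, assuming dissimilarity d = 1 − CKA (the exact transform behind Figure 5 is not specified here):

```python
import numpy as np

def classical_mds(cka: np.ndarray, dim: int = 2) -> np.ndarray:
    """Classical MDS embedding of models from a pairwise CKA matrix.

    Uses d_ij = 1 - CKA_ij as dissimilarity (an assumption), double-
    centers the squared distances, and keeps the top `dim` eigenvectors.
    """
    d = 1.0 - cka
    np.fill_diagonal(d, 0.0)
    n = d.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (d ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:dim]
    # Scale eigenvectors by sqrt(eigenvalue); clip tiny negatives to zero.
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))
```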

Figure 6: Per-model average CKA to the other 19 models in the blur-CNN set. ResNetBlur50, CSPResNeXt50, and EcaResNetLight contribute most to set alignment. The spread is narrow (0.73–0.80), indicating no severe outliers.

Figure 7: Greedy selection trajectory for the blur-CNN set on CIFAR-100. Mean pairwise CKA converges to 0.7613 by model 12, with negligible improvement from subsequent additions, indicating the 20-model solution is near-optimal for this architecture pool.


4. Discussion

4.1 Why Homogeneity Helps

Our results across four submissions reveal a consistent pattern: architecture-homogeneous sets outperform mixed sets on real evaluation, even when mixed sets score higher on proxy datasets.

CNN homogeneity: The blur-CNN models share anti-aliasing, ResNet bottlenecks, and similar training recipes. Convolutional processing is resolution-agnostic in a way that patch-based attention is not, yielding consistent CKA across image distributions. The CIFAR-100 proxy calibrates well (0.958×) because the convolutional pipeline processes upsampled low-resolution images through the same computational graph as natural images.

ViT homogeneity: All-attention sets (S2, S3) also benefit from shared inductive biases—patch-based attention and global aggregation—consistent with the Platonic Representation Hypothesis (Huh et al., 2024), which posits that sufficiently capable models converge to a shared statistical model of reality. However, ViT sets calibrate worse on CIFAR-100 (real/proxy ≈ 0.91–1.18) because low-resolution upsampling disrupts the spatial structure that patch embeddings rely on.

Mixed sets fail: S1 (Imagenette-optimized diverse) scores high on the proxy but collapses on real evaluation because CNN↔ViT pairs align on 10 coarse classes yet diverge on 1,000 fine-grained classes. The representational strategies of convolution (local texture) and self-attention (global shape) produce similar outputs for easily separable categories but diverge when fine-grained discrimination is required (Raghu et al., 2021; Geirhos et al., 2019).

4.2 Proxy Design Principles

Our empirical analysis yields actionable principles for future proxy-based alignment studies:

  1. Match proxy resolution to model inductive bias: Use 224px natural images for ViTs; CIFAR-100 upsampling suppresses ViT alignment but calibrates well for CNNs.
  2. Match proxy class diversity to evaluation scope: 10 classes cannot reveal differences visible on 1,000 classes; use 100+ classes for general alignment estimation.
  3. Restrict candidate pool by architecture: Homogeneous pools (blur-CNN, all-ViT) generalize better than mixed optima because within-family alignment transfers robustly across image distributions.
  4. Report calibration ratios: Future work should report real/proxy ratios per architecture family to enable cross-study comparison.

4.3 Limitations

Proxy sample size: We use only 500 images per proxy. CKA variance with 500 samples is non-negligible; scaling to 5,000–10,000 images would reduce estimation error.

Layer selection: We use the registry-specified penultimate layer; earlier layers may exhibit different alignment patterns (Raghu et al., 2021).

Limited submissions: With only four portal evaluations, our calibration analysis is necessarily coarse; more evaluation rounds would yield tighter calibration estimates.
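The sample-size concern can be quantified with a bootstrap over proxy images: resample the 500 image rows with replacement and recompute CKA each time. An illustrative sketch (we did not run this analysis for the submissions above):

```python
import numpy as np

def bootstrap_cka_std(X, Y, n_boot=200, seed=0):
    """Bootstrap estimate of the sampling std of linear CKA under a
    finite image proxy. X, Y are (n_images, dim) feature matrices."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    def cka(A, B):
        A = A - A.mean(axis=0)
        B = B - B.mean(axis=0)
        return np.linalg.norm(B.T @ A, "fro") ** 2 / (
            np.linalg.norm(A.T @ A, "fro") * np.linalg.norm(B.T @ B, "fro"))

    # Resample image rows with replacement and recompute CKA each round.
    vals = [cka(X[idx], Y[idx])
            for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    return float(np.std(vals))
```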


5. Conclusion

We achieve 0.7294 real portal CKA with a blur-CNN cluster selected via CIFAR-100 proxy optimization—a substantial improvement over our earlier diverse (0.68) and all-attention (0.69–0.70) submissions. Our key finding is that architecture family homogeneity predicts real-evaluation alignment better than proxy-optimized diversity. Homogeneous CNN clusters with shared anti-aliasing and bottleneck architectures achieve both high absolute CKA and tight proxy calibration. For future work, full ImageNet-1k CKA computation and larger proxy sample sizes would improve calibration, and adaptive proxy construction—iteratively selecting images that maximize disagreement between candidate models—could bridge the proxy-to-real gap more effectively.


Appendix A: Submitted Model Set (Blur-CNN)

| # | Model | Layer | Architecture type |
|---|-------|-------|-------------------|
| 1 | ecaresnetlight.miil_in1k | global_pool | ECA-ResNet |
| 2 | cspdarknet53.ra_in1k | head.global_pool | CSP-DarkNet |
| 3 | resnetblur50.bt_in1k | global_pool | ResNet + blur pool |
| 4 | ecaresnet101d.miil_in1k | global_pool | ECA-ResNet |
| 5 | tresnet_l.miil_in1k | head.global_pool | TResNet |
| 6 | cs3darknet_focus_l.c2ns_in1k | head.global_pool | CSP-DarkNet v3 |
| 7 | darknetaa53.c2ns_in1k | head.global_pool | DarkNet + anti-alias |
| 8 | bat_resnext26ts.ch_in1k | head.global_pool | BAT-ResNeXt (attn hybrid) |
| 9 | cspresnext50.ra_in1k | head.global_pool | CSP-ResNeXt |
| 10 | botnet26t_256.c1_in1k | head.global_pool | BotNet (attn hybrid) |
| 11 | lambda_resnet26rpt_256.c1_in1k | head.global_pool | Lambda-ResNet (attn hybrid) |
| 12 | halo2botnet50ts_256.a1h_in1k | head.global_pool | HaloBotNet (attn hybrid) |
| 13 | cspresnet50.ra_in1k | head.global_pool | CSP-ResNet |
| 14 | repvit_m0_9.dist_300e_in1k | head.head.bn | RepViT (mobile hybrid) |
| 15 | eca_botnext26ts_256.c1_in1k | head.global_pool | ECA-BotNeXt (attn hybrid) |
| 16 | densenet121.ra_in1k | global_pool | DenseNet |
| 17 | densenetblur121d.ra_in1k | global_pool | DenseNet + blur pool |
| 18 | halonet26t.a1h_in1k | head.global_pool | HaloNet (attn hybrid) |
| 19 | darknet53.c2ns_in1k | head.global_pool | DarkNet |
| 20 | gernet_l.idstcv_in1k | head.global_pool | GERNet |

Appendix B: Red Team (Brief)

We also participated in the Red Team track, selecting 1,000 ObjectNet images to maximize inter-model divergence. ObjectNet’s non-standard viewpoints and backgrounds suppress familiar texture cues, exploiting the texture-vs-shape bias difference between CNNs and transformers (Geirhos et al., 2019). Full methodology is omitted for brevity; we focus on the Blue Team in this report.


References

Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., & Brendel, W. (2019). ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. ICLR.

Huh, M., Cheung, B., Wang, T., & Isola, P. (2024). The Platonic Representation Hypothesis. ICML.

Kornblith, S., Norouzi, M., Lee, H., & Hinton, G. (2019). Similarity of neural network representations revisited. ICML.

Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., & Dosovitskiy, A. (2021). Do vision transformers see like convolutional neural networks? NeurIPS.

Sucholutsky, I., et al. (2023). Getting aligned on representational alignment. arXiv:2310.13018.