Invariant Features in Language Models: Geometric Characterization and Model Attribution
Under review
We propose a local geometric framework for identifying invariant semantic subspaces in transformer-based language models. Using a contrastive generalized eigenvalue decomposition over semantic-preserving and semantic-changing perturbations, we localize the layers where semantic meaning concentrates and validate these representations causally through hidden-state interventions. We further apply the invariant representations to zero-shot model attribution, achieving over 92% accuracy across base, fine-tuned, and distilled variants of nine open-source LLMs spanning diverse architectures and parameter scales.
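The abstract does not spell out the exact formulation, but a contrastive generalized eigenvalue decomposition of this kind can be sketched as follows: collect hidden-state difference vectors for semantic-preserving and semantic-changing prompt pairs at a given layer, form their covariance-like matrices, and keep the directions whose variance under meaning changes is large relative to their variance under paraphrases. The function and variable names below (`contrastive_ged`, `h_preserve_pairs`, `h_change_pairs`, the ridge term, and the choice of `k`) are illustrative assumptions, not the authors' implementation; this is a minimal sketch built on `scipy.linalg.eigh`.

```python
import numpy as np
from scipy.linalg import eigh

def contrastive_ged(h_preserve_pairs, h_change_pairs, k=10, ridge=1e-6):
    """Sketch of a contrastive generalized eigenvalue decomposition.

    h_preserve_pairs, h_change_pairs: arrays of shape (n, 2, d) holding
    layer hidden states for (original, perturbed) prompt pairs, where the
    perturbation either preserves or changes the semantics.
    Returns the top-k generalized eigenvalues and eigenvectors, i.e. a
    candidate basis for the invariant semantic subspace at this layer.
    """
    # Difference vectors capture how the hidden state moves under each perturbation type.
    d_pres = h_preserve_pairs[:, 1] - h_preserve_pairs[:, 0]   # (n_p, d)
    d_chg = h_change_pairs[:, 1] - h_change_pairs[:, 0]        # (n_c, d)

    # Second-moment matrices of the two perturbation families.
    C_pres = d_pres.T @ d_pres / len(d_pres)
    C_chg = d_chg.T @ d_chg / len(d_chg)

    # Ridge regularization keeps the denominator matrix positive definite.
    C_pres += ridge * np.eye(C_pres.shape[0])

    # Generalized eigenproblem C_chg v = lambda C_pres v: large lambda marks
    # directions that respond to meaning changes but stay stable under paraphrase.
    evals, evecs = eigh(C_chg, C_pres)
    order = np.argsort(evals)[::-1]
    return evals[order[:k]], evecs[:, order[:k]]
```

Under these assumptions, the same subspace basis could then be reused downstream, e.g. projecting held-out hidden states onto it to produce the per-model signatures used for attribution.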