Hybrid Vision Architecture Sets Record
A novel hybrid architecture that combines vision transformers with Kolmogorov–Arnold networks has achieved state-of-the-art performance in surface crack classification. The result demonstrates the potential of blending different deep learning paradigms to solve practical computer vision problems.
- The architecture is based on Kolmogorov–Arnold Networks (KANs), which are themselves a modern application of the Kolmogorov-Arnold representation theorem, a mathematical concept developed in 1957 by Andrey Kolmogorov and Vladimir Arnold. Unlike traditional neural networks with fixed activation functions on nodes, KANs feature learnable activation functions on the edges, allowing them to achieve higher accuracy with fewer parameters. - In this hybrid model, the standard Multi-Layer Perceptron (MLP) layers within the Vision Transformer (ViT) architecture are replaced with KAN modules. This approach combines the ViT's powerful self-attention mechanism, which excels at capturing global relationships between image patches, with the KAN's enhanced ability to model complex non-linear dependencies. - The model detailed in the *Scientific Reports* paper achieved a classification accuracy of 99.4% for non-cracked surfaces and 99.2% for cracked surfaces. Its high precision rates indicate a low number of false positives, which is critical for efficiently managing infrastructure maintenance. - Automated surface crack detection is a significant application of computer vision, as traditional visual inspection of infrastructure is labor-intensive, slow, and subject to human error. Datasets for training these models can be extensive, with one common public dataset containing 40,000 images of concrete surfaces. - Beyond this specific application, research into hybrid KAN-ViT models has shown other benefits, such as significantly mitigating "catastrophic forgetting"—the tendency of a model to lose knowledge of a previous task after learning a new one. This is due to KANs' use of localized, spline-based activations, which means only a subset of parameters are updated for each new sample. - While promising, the integration of KANs can introduce a trade-off between performance and computational cost. In experiments with a "ViKANformer" on the MNIST dataset, models achieved over 97% accuracy but with significantly increased training times compared to standard architectures.