xLSTM distillation checkpoints incoming
A new paper and community thread outline effective distillation pipelines for hybrid xLSTM architectures derived from transformer LLMs like Llama, and the authors say checkpoints with vLLM support are coming to Hugging Face. The work points to continued innovation in architectures that blend recurrence and transformer strengths for efficient long‑context and streaming inference. (x.com)
The paper "Effective Distillation to Hybrid xLSTM Architectures" was posted to arXiv on March 16, 2026 and lists Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied and co‑authors affiliated with NX‑AI and Johannes Kepler University Linz. (arxiv.org) Their distillation pipeline adds an explicit merging stage that combines individually linearized experts into a single xLSTM student, and the experiments cover distillation of both base and instruction‑tuned teachers from the Llama, Qwen, and Olmo families. (arxiv.org) The authors quantify fidelity with a "Win‑and‑Tie" rate and report that, across benchmarks spanning math, code, STEM and chat domains, xLSTM students recover most of the teacher performance and exceed teachers on some downstream tasks. (arxiv.org)
The paper’s Hugging Face discussion thread shows active community engagement, including an author comment that references ongoing integration work and forthcoming model artifacts. (huggingface.co) vLLM’s documentation states the runtime can read Hugging Face model configs and weights, supports paged attention and streaming outputs, and advertises quantization pathways including GPTQ, INT8 and FP8 that are commonly used for high‑throughput serving. (docs.vllm.ai) NX‑AI already publishes an xLSTM‑7B checkpoint on Hugging Face that the model card says was pretrained on roughly 2.3 trillion tokens, and the repository shows the checkpoint split across multiple safetensors files (several ~4.8–5.0 GB parts plus a ~3.0 GB part). (huggingface.co)
Taken together, these public artifacts line up with a practical serving workflow for xLSTM checkpoints: the paper documents the merging stage, NX‑AI’s GitHub hosts xLSTM code and optimized Triton kernels for the architecture, and vLLM’s docs list paged‑memory and quantization features. (arxiv.org)
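To make the "Win‑and‑Tie" rate mentioned above more concrete, here is a minimal Python sketch. It assumes the metric is the share of benchmarks on which the student scores at least as high as the teacher, within an optional tolerance; the paper's exact definition may differ. The function name `win_and_tie_rate`, the `tie_tol` parameter, and all scores are hypothetical illustrations, not values from the paper.

```python
from typing import Mapping


def win_and_tie_rate(
    teacher: Mapping[str, float],
    student: Mapping[str, float],
    tie_tol: float = 0.0,
) -> float:
    """Fraction of shared benchmarks where the student wins or ties.

    ASSUMPTION: a "tie" means the student is within `tie_tol` points of the
    teacher. This is an illustrative definition, not the paper's.
    """
    shared = sorted(set(teacher) & set(student))
    if not shared:
        raise ValueError("teacher and student share no benchmarks")
    # Count benchmarks where the student matches or beats the teacher.
    hits = sum(1 for name in shared if student[name] >= teacher[name] - tie_tol)
    return hits / len(shared)


# Hypothetical scores in percentage points (made up for illustration).
teacher_scores = {"math": 52.0, "code": 61.0, "stem": 70.0, "chat": 66.0}
student_scores = {"math": 50.0, "code": 63.0, "stem": 70.0, "chat": 60.0}

strict_rate = win_and_tie_rate(teacher_scores, student_scores)
lenient_rate = win_and_tie_rate(teacher_scores, student_scores, tie_tol=2.5)
print(f"strict: {strict_rate:.2f}, with 2.5-point tolerance: {lenient_rate:.2f}")
```

With these made-up numbers the student wins on code, ties on STEM, and falls short on math and chat. Loosening the tolerance turns the 2-point math gap into a tie, which shows how sensitive such a rate is to the tie threshold chosen.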