Hugging Models Introduces Fast Multilingual AI for Apple Silicon
Hugging Models has introduced GLM-4.7-Flash-8bit, a fast, multilingual AI model quantized specifically for Apple Silicon. The model is designed to run high-quality text generation locally on Apple devices, expanding the toolkit for developers building on-device AI applications within the Apple ecosystem.
- The GLM-4.7-Flash model utilizes a Mixture of Experts (MoE) architecture with 30 billion total parameters, but only activates 3 billion parameters for any given token, significantly speeding up inference and reducing memory usage compared to dense models.
- This specific version is an 8-bit quantized model optimized for Apple Silicon using Apple's own MLX framework, a machine learning library designed for the unified memory architecture of M-series chips.
- The underlying GLM-4 architecture was pretrained on 15 trillion tokens of high-quality multilingual data, providing a strong foundation in language understanding, mathematical reasoning, and code generation.
- On-device performance for quantized models of a similar size (7-8 billion parameters) on Apple Silicon can range from approximately 30 to over 65 tokens per second, with the MLX framework often providing a significant performance uplift over other backends.
- The 8-bit quantization is a key enabler for local execution, compressing the model's weights to reduce its memory footprint, which is critical for fitting within the unified RAM of devices like a MacBook Pro.
- The broader GLM-4 model family, developed by Zhipu AI, also includes multimodal variants like GLM-4-Voice, which integrates speech recognition and generation for real-time voice conversations in multiple languages.
- For developers, running models like this locally on Apple Silicon removes reliance on cloud APIs, reducing latency and improving data privacy for applications built within the Apple ecosystem.
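To make the quantization and MoE figures above concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the parameter counts quoted above (30 billion total, 3 billion active per token). The function name `weight_footprint_gb` and the list of precisions are illustrative assumptions, and the estimate covers raw weight storage only: it ignores quantization scales, the KV cache, and activations, and it assumes all experts stay resident in memory.

```python
# Rough estimate of weight storage at different precisions.
# Assumption: raw weights only; quantization metadata, KV cache, and
# activations are ignored, so real usage will be somewhat higher.

TOTAL_PARAMS = 30e9   # total parameters, as quoted in the article
ACTIVE_PARAMS = 3e9   # parameters activated per token, as quoted


def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical set of precisions chosen for comparison.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = weight_footprint_gb(TOTAL_PARAMS, bits)
    print(f"{label}: ~{gb:.0f} GB of weights")

# Share of the model that does work on any single token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active per token: {active_fraction:.0%} of parameters")
```

Under these assumptions, 8-bit weights take roughly 30 GB versus roughly 60 GB at 16 bits, which is the gap that decides whether the model fits in a given machine's unified memory. The 10% active fraction lowers per-token compute, not the amount of weight data that has to be stored.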