Hugging Models Introduces Fast Multilingual AI for Apple Silicon

Hugging Models has introduced GLM-4.7-Flash-8bit, a fast, multilingual AI model quantized specifically for Apple Silicon. The model is designed to run high-quality text generation locally on Apple devices, expanding the toolkit for developers building on-device AI applications within the Apple ecosystem.

- The GLM-4.7-Flash model utilizes a Mixture of Experts (MoE) architecture with 30 billion total parameters, but only activates 3 billion parameters for any given token, significantly speeding up inference and reducing memory usage compared to dense models.
- This specific version is an 8-bit quantized model optimized for Apple Silicon using Apple's own MLX framework, a machine learning library designed for the unified memory architecture of M-series chips.
- The underlying GLM-4 architecture was pretrained on 15 trillion tokens of high-quality multilingual data, providing a strong foundation in language understanding, mathematical reasoning, and code generation.
- On-device performance for quantized models of a similar size (7-8 billion parameters) on Apple Silicon can range from approximately 30 to over 65 tokens per second, with the MLX framework often providing a significant performance uplift over other backends.
- The 8-bit quantization is a key enabler for local execution, compressing the model's weights to reduce its memory footprint, which is critical for fitting within the unified RAM of devices like a MacBook Pro.
- The broader GLM-4 model family, developed by Zhipu AI, also includes multimodal variants like GLM-4-Voice, which integrates speech recognition and generation for real-time voice conversations in multiple languages.
- For developers, running models like this locally on Apple Silicon removes reliance on cloud APIs, reducing latency and improving data privacy for applications built within the Apple ecosystem.
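
To make the memory argument concrete, here is a rough back-of-the-envelope sketch using the parameter counts quoted above (30 billion total, 3 billion active per token). The helper `weight_memory_gib` is a hypothetical name introduced for illustration. The estimate covers weights only: it assumes exactly 8 or 16 bits per weight and ignores quantization metadata such as scales, the KV cache, and activations, so real usage will be somewhat higher.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB.

    Assumption: every parameter is stored at exactly `bits_per_weight` bits,
    with no overhead for quantization scales, KV cache, or activations.
    """
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 2**30


# Figures taken from the article; everything else is illustrative.
TOTAL_PARAMS = 30e9   # total parameters in the MoE model
ACTIVE_PARAMS = 3e9   # parameters activated per token

fp16_gib = weight_memory_gib(TOTAL_PARAMS, 16)
int8_gib = weight_memory_gib(TOTAL_PARAMS, 8)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f" 8-bit weights: ~{int8_gib:.1f} GiB")
print(f"Active per token: {active_fraction:.0%} of parameters")
```

Under these assumptions, 8-bit quantization halves the weight footprint relative to 16-bit storage. Note that all 30 billion parameters still have to reside in memory; the MoE design reduces the compute per token, not the storage requirement.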
