Meta Launches Llama 4 Multimodal AI

Meta has officially unveiled Llama 4, its new flagship family of large language models with robust multimodal capabilities. Meta positions the family as a more efficient next-generation foundation, but users in the EU will not receive full feature parity, signaling ongoing friction with regional AI regulations.

Llama 4's architecture marks Meta's first use of a Mixture-of-Experts (MoE) design in its flagship models. An MoE model routes each input to a small subset of specialized sub-networks ("experts"), so only a fraction of the model's total parameters is active at any time. The Llama 4 Maverick variant, for instance, has 400 billion total parameters but activates only 17 billion for any single input, a technique aimed at balancing performance with computational cost.

The model family includes specialized versions: "Scout," a lighter model designed to run on a single NVIDIA H100 GPU, and "Maverick," a larger, more capable generalist model. A third, even larger model named "Behemoth" is still in training and is intended to act as a "teacher" model that improves the smaller variants through a process called distillation.

For training Llama 4, Meta confirmed using a cluster of over 100,000 NVIDIA H100 GPUs. The training data included a mix of publicly available sources along with publicly shared, non-private user data from Facebook and Instagram. This investment in compute infrastructure is part of a broader strategy that includes the development of Meta's own custom silicon, the Meta Training and Inference Accelerator (MTIA), to optimize for its specific AI workloads.

The Llama 4 Maverick model is positioned to compete with other leading models like OpenAI's GPT-4o and Google's Gemini series. Benchmarks show Maverick outperforming GPT-4o and Gemini 2.0 Flash in some key areas. However, the open-weight model's accessibility comes with a significant caveat for large-scale operators: companies with more than 700 million monthly active users must obtain a special license from Meta.

The exclusion of the EU from the Llama 4 community license is a direct response to the region's AI Act. Specifically, the restrictions apply to the multimodal capabilities of the Llama 4 family, a sticking point related to the EU's stricter regulations on AI systems that can process multiple types of data inputs. The restriction has forced EU-based developers and companies to rely on older, non-multimodal versions of Llama or seek alternatives.

From a cost perspective, the MoE architecture significantly lowers inference expenses. Estimates place Maverick's inference cost at approximately $0.19 to $0.49 per million tokens, a fraction of the cost of running models like GPT-4o. This efficiency is critical for developers and enterprises building applications on top of the model, because it directly affects the economics of deploying AI at scale.

The release is part of Meta's broader go-to-market strategy to commoditize the AI model layer by openly releasing its most powerful models. By building a large ecosystem of developers and companies using Llama, Meta aims to establish its architecture as an industry standard, which in turn could drive demand for its own hardware and cloud services in the long run. This open approach contrasts with the closed, API-only models from competitors like OpenAI and Anthropic.

For MLOps teams, deploying multimodal models like Llama 4 requires more complex infrastructure capable of handling diverse data types for training, validation, and monitoring. Multi-model serving, where multiple specialized models are managed within a single containerized environment, is becoming a key strategy for optimizing resource utilization and cost in production environments.
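
To make the efficiency and cost figures above concrete, here is a minimal Python sketch of the arithmetic. The parameter counts and the per-million-token price range are the ones quoted in this article; the function names and the monthly token volume are hypothetical, chosen only for illustration, and the result is a rough estimate rather than a pricing guarantee.

```python
# Figures quoted in the article above.
MAVERICK_TOTAL_PARAMS = 400e9    # total parameters
MAVERICK_ACTIVE_PARAMS = 17e9    # parameters activated per input
COST_PER_MILLION_TOKENS_USD = (0.19, 0.49)  # low and high estimates


def active_fraction(active_params: float, total_params: float) -> float:
    """Share of an MoE model's parameters used for a single input."""
    return active_params / total_params


def cost_range_usd(tokens: float, price_range=COST_PER_MILLION_TOKENS_USD):
    """Low and high inference cost for a token volume at per-million-token prices."""
    low, high = price_range
    millions = tokens / 1_000_000
    return (low * millions, high * millions)


if __name__ == "__main__":
    frac = active_fraction(MAVERICK_ACTIVE_PARAMS, MAVERICK_TOTAL_PARAMS)
    print(f"Active share per input: {frac:.2%}")

    # Hypothetical workload (an assumption, not a figure from the article).
    monthly_tokens = 2_000_000_000
    low, high = cost_range_usd(monthly_tokens)
    print(f"Estimated monthly cost: ${low:,.0f} to ${high:,.0f}")
```

With the quoted figures, roughly 4% of Maverick's parameters are active for a given input, which is the source of the lower per-token compute cost. Swapping in a different token volume or price range shows how sensitive a deployment budget is to those two inputs.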
