New Tool 'Cortex' Runs LLMs Locally Through a Lightweight C++ Engine
A new tool called Cortex has launched to address MLOps bloat. It has been described as compiling LLMs directly into zero-dependency C++ binaries, though in practice it works as a lightweight C++ runtime. The goal is smaller, faster, and more portable deployments that do not need oversized Docker images or complex runtime environments, which could significantly streamline running language models in resource-constrained settings.
Cortex.cpp is the underlying C++ engine for Jan, an open-source alternative to ChatGPT developed by Menlo Research. The project aims to let users run large language models locally, with full control and without relying on third-party cloud services. Menlo Research, the lab behind Jan and Cortex, recently rebranded from Homebrew to focus on broader AI infrastructure challenges.
The tool provides a command-line interface inspired by Ollama and is designed to run on a range of hardware, including both CPUs and GPUs. Cortex supports multiple inference engines, with `llama.cpp` being a primary one for running models in formats like GGUF. Delegating inference to these engines lets models run with hardware-specific optimizations.
Despite the goal of smaller, more portable deployments, Cortex.cpp functions as a C++-based runtime environment rather than compiling LLMs into standalone binaries. It still reduces deployment overhead compared with large Docker images by offering a lighter-weight way to run models locally.
More recently, the Jan team has moved to integrate `llama.cpp` directly, removing the Cortex abstraction layer. The change was made to reduce latency, simplify maintenance, and give users more direct control over the underlying inference engine settings.
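For readers unfamiliar with GGUF, the model format mentioned above, the following is a minimal Python sketch of how a file's fixed-size header can be inspected. It is not taken from Cortex or `llama.cpp`. The function name `read_gguf_header` is hypothetical, and the sketch assumes a little-endian file at GGUF version 2 or later, where the 4-byte `GGUF` magic is followed by a 32-bit version and two 64-bit counts.

```python
import struct

GGUF_MAGIC = b"GGUF"   # first four bytes of every GGUF file
HEADER_SIZE = 24       # magic (4) + version (4) + two 64-bit counts (16)


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) for a GGUF file.

    Hypothetical helper for illustration only. Assumes little-endian
    byte order and the GGUF v2+ layout with 64-bit counts.
    """
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)

    # Reject files that are too short or do not start with the magic bytes.
    if len(header) < HEADER_SIZE or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")

    # "<IQQ": little-endian uint32 version, uint64 tensor count,
    # uint64 metadata key/value count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:HEADER_SIZE])
    return version, tensor_count, kv_count
```

A check like this can be a cheap sanity test before handing a downloaded model file to an inference engine, since it reads only the first 24 bytes and never loads the weights.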