Local 70B LLM setups

People are sharing real-world notes on running very large language models locally. One thread describes a self-sovereign 70‑billion‑parameter model setup that avoids cloud exposure, and an open-source project for local LLM memory has been published to support privacy-first apps. (x.com) (x.com)

A 70 billion-parameter language model now fits on a desk setup, and developers are pairing it with local memory software to keep prompts and data off cloud servers. (ollama.com) (github.com)

A parameter is one of the model’s learned knobs, and 70.6 billion of them usually means a much larger file and much heavier hardware. Ollama lists its Llama 3.1 70B build at 43GB in a Q4_K_M quantized format, a compressed version that cuts memory use by storing weights at lower precision. (ollama.com) (github.com)

Quantization is the trick that makes these setups practical: llama.cpp says it converts high-precision weights such as F32 or BF16 into smaller formats such as 4-bit integers, trading some accuracy for lower RAM use and faster inference. The same project uses the GGUF file format to package those compressed models for local runtimes. (github.com) (huggingface.co)

The model itself is not new. Meta released Llama 3.1 on July 23, 2024 in 8B, 70B and 405B sizes, and the 70B version is part of the line’s instruction-tuned text models. (github.com) (ollama.com)

What changed is the surrounding software stack. Tools such as Ollama and llama.cpp now let developers pull, quantize and serve large open models locally on consumer or prosumer hardware instead of wiring every request to a hosted application programming interface. (ollama.com) (github.com) (huggingface.co)

A local model still has a short working memory unless an app stores context outside the model. That is the gap projects such as OpenMemory are trying to fill with a self-hosted layer that saves user preferences, past interactions and other recallable facts in SQLite or Postgres. (github.com 1) (github.com 2)

OpenMemory’s GitHub repository says it is “currently being fully rewritten,” and the project describes itself as a cognitive memory engine rather than a simple retrieval-augmented generation database. The repo lists Python and Node software development kits, integrations with LangChain, CrewAI, AutoGen, Streamlit, Model Context Protocol and Visual Studio Code, plus connectors for GitHub, Notion, Google Drive, OneDrive and web crawling. (github.com 1) (github.com 2)

The pitch for both pieces is control. A local model keeps inference on the machine, and a local memory layer keeps the app’s long-term context on the same machine or a self-managed database instead of a vendor’s hosted memory service. (huggingface.co) (github.com) (openmemory.cavira.app)

That does not remove the tradeoffs. The 70B model file is still tens of gigabytes, OpenMemory warns of breaking changes and bugs during its rewrite, and quantization reduces size by sacrificing some model quality. (ollama.com) (github.com 1) (github.com 2)

The result is a more complete local stack than “run a chatbot on your laptop” meant a year ago: one layer generates text, another stores memory, and both can stay under the user’s control. (ollama.com) (github.com)
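
The two layers described above, one that generates text and one that stores memory, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not OpenMemory's API: the `MemoryStore` class, its table layout and the `build_prompt` helper are hypothetical, and the request assumes an Ollama server on its default local port with the `llama3.1:70b` model already pulled.

```python
import json
import sqlite3
import urllib.request

# Assumptions: Ollama's default local endpoint and the llama3.1:70b model tag.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.1:70b"


class MemoryStore:
    """Hypothetical minimal memory layer backed by SQLite (not OpenMemory's API)."""

    def __init__(self, path=":memory:"):
        # ":memory:" keeps the demo self-contained; pass a file path to persist.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memories ("
            "id INTEGER PRIMARY KEY, user TEXT NOT NULL, fact TEXT NOT NULL)"
        )

    def remember(self, user, fact):
        """Save one recallable fact for a user."""
        self.db.execute(
            "INSERT INTO memories (user, fact) VALUES (?, ?)", (user, fact)
        )
        self.db.commit()

    def recall(self, user, limit=5):
        """Return the user's most recent facts, newest first."""
        rows = self.db.execute(
            "SELECT fact FROM memories WHERE user = ? ORDER BY id DESC LIMIT ?",
            (user, limit),
        ).fetchall()
        return [fact for (fact,) in rows]


def build_prompt(memory, user, question):
    """Prepend stored facts so the model sees context it would otherwise lack."""
    context = "\n".join(f"- {fact}" for fact in memory.recall(user))
    return f"Known facts about the user:\n{context}\n\nQuestion: {question}"


def build_request(prompt):
    """Build a non-streaming POST request for the local Ollama server."""
    payload = {"model": MODEL, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def generate(prompt):
    """Send the prompt to the local model and return its text reply."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]


# Demo of the memory layer only; no network call is made here.
memory = MemoryStore()
memory.remember("alice", "prefers concise answers")
memory.remember("alice", "is writing a Rust CLI tool")
prompt = build_prompt(memory, "alice", "How should I parse arguments?")
print(prompt)
```

Calling `generate(prompt)` would send the request to the local server and return the model's reply, so both the prompt and the stored facts stay on the machine. A fuller memory layer would likely rank facts by relevance rather than recency, which this sketch does not attempt.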
