Small‑model how‑tos published
- MarkTechPost published hands‑on tutorials for Qwen3.6-35B-A3B and Microsoft's Phi-4-mini covering RAG, tool calling, and quantized inference.
- The guides include LoRA fine‑tuning, session persistence, MoE routing examples, and multimodal inference patterns.
- The pieces emphasise deployment patterns useful for production‑oriented small models, like quantization, adapters, and retrieval workflows. (marktechpost.com)

MarkTechPost published two code-first guides on April 20 and April 21 that show how developers can run small open models with retrieval, tool use, and low-memory inference. (marktechpost.com 1) (marktechpost.com 2) A small model is a language model tuned to fit tighter hardware budgets, often by shrinking memory use or activating only part of the network on each query.

The Phi-4-mini guide loads Microsoft’s 3.8 billion-parameter model in 4-bit quantization, a compression method that stores weights with fewer bits so the model can run on lighter setups. (huggingface.co) (marktechpost.com) The Qwen guide covers a different way to cut compute: mixture of experts (MoE), which routes each token to a few specialist sub-models instead of using the full network every time. Qwen’s model card says Qwen3.6-35B-A3B has 35 billion total parameters but activates 3 billion during inference, with 256 experts in total and 8 routed experts plus 1 shared expert activated. (huggingface.co) (marktechpost.com)

Both tutorials stay close to deployment work that teams actually ship. The Phi-4-mini notebook walks through streaming chat, tool calling, retrieval-augmented generation, and LoRA fine-tuning, while the Qwen notebook adds multimodal input, structured JSON (JavaScript Object Notation) output, session persistence, and routing inspection. (marktechpost.com 1) (marktechpost.com 2) Retrieval-augmented generation, or RAG, is the pattern where a model looks up outside documents before answering, like giving it an open-book test instead of relying only on memory. LoRA, short for low-rank adaptation, is a fine-tuning method that trains small adapter layers instead of rewriting the whole model, which cuts cost and keeps the base weights intact. (marktechpost.com)

The Microsoft side already points developers in the same direction.
When it introduced Phi-4-mini on February 26, 2025, Microsoft said the model supports function calling, has a 128,000-token context window, and is aimed at memory- and latency-constrained environments, including edge deployments. (techcommunity.microsoft.com) (huggingface.co)

The Qwen tutorial also leans into hardware limits instead of ignoring them. Its setup code switches between bfloat16, 8-bit, and 4-bit loading based on available GPU memory, and the notebook says the first download of the model files is about 70 gigabytes. (marktechpost.com)

Microsoft and Alibaba both publish official model materials, but guides like these translate model cards into runnable workflows. Microsoft maintains a PhiCookBook repository with examples for function calling, retrieval, and edge deployment, and Qwen’s Hugging Face page lists support across Transformers, vLLM, SGLang, and KTransformers. (github.com) (huggingface.co)

The immediate takeaway is not that smaller models replaced the biggest systems this week. It is that two April tutorials put the current playbook in one place: compress the model, add retrieval, bolt on tools, and keep the application state so a compact model can handle production-style work. (marktechpost.com 1) (marktechpost.com 2)
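
The precision switch described for the Qwen notebook comes down to simple arithmetic: weight storage is roughly parameter count times bytes per weight. The sketch below is not the tutorial's code. The function names, the 1.2x headroom factor, and the "pick the highest precision that fits" rule are assumptions made for illustration, and the estimate ignores activations and the key-value cache. It does show why 35 billion parameters at 2 bytes each works out to about 70 gigabytes, in line with the download size the notebook mentions, and why 4-bit loading of a 3.8 billion-parameter model needs only about 1.9 gigabytes for weights.

```python
from typing import Optional

# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billions: float, precision: str) -> float:
    """Approximate weight storage in decimal gigabytes.

    Billions of parameters times bytes per parameter gives gigabytes
    directly, because the two factors of 1e9 cancel.
    """
    return params_billions * BYTES_PER_PARAM[precision]


def pick_precision(
    params_billions: float,
    free_gb: float,
    headroom: float = 1.2,  # assumed safety margin, not taken from the tutorial
) -> Optional[str]:
    """Return the highest precision whose weights fit, or None if none fit."""
    for precision in ("bfloat16", "int8", "int4"):
        if weight_gb(params_billions, precision) * headroom <= free_gb:
            return precision
    return None


if __name__ == "__main__":
    for name, size in (("Qwen3.6-35B-A3B", 35.0), ("Phi-4-mini", 3.8)):
        for precision in BYTES_PER_PARAM:
            print(f"{name} {precision}: ~{weight_gb(size, precision):.1f} GB")
    # Hypothetical 24 GB GPU: which loading mode would this rule choose?
    print(pick_precision(35.0, free_gb=24.0))
```

One design note: the estimate uses the total parameter count rather than the 3 billion active ones. In a standard setup without offloading, a mixture-of-experts model still has to hold every expert in memory, so routing reduces compute per token more than it reduces the memory footprint.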