Edge models on satellites and FP4 inference news
A post describes a fine‑tuned 1.6B vision‑language model running onboard satellites, noting inference costs measured in pennies per call versus dollars for cloud processing and the latency benefits of local inference. (x.com) Separately, a researcher announced FP4 quantization‑aware training for attention that enables end‑to‑end FP4 models on NVIDIA Rubin hardware with roughly 3× throughput over FP8—an advance aimed at long‑context, edge inference workloads. (x.com)
A satellite image is just pixels until a model turns it into a sentence, label, or alert, ideally without waiting for a ground station. NVIDIA and academic researchers are pushing the same idea at the chip level, shrinking inference from eight-bit to four-bit math so that more of that work can run locally. (arxiv.org) (developer.nvidia.com)

Vision-language models are systems that read images and text together, like software that can look at a crop field or floodplain and answer a question in plain language. Falcon, a remote-sensing model posted to arXiv in March 2025, has 1.6 billion parameters and was trained on more than 10 million image-text pairs, including a 9.3 million-pair dataset called Remote Sensing Image Captioning. (arxiv.org)

Running those models on a satellite is hard because spacecraft in low Earth orbit have tight power budgets, limited onboard compute, and short windows to send data back to Earth. A 2025 paper from Fudan University said satellite-ground collaboration cut average latency by 76 to 95 percent by keeping compact models onboard and sending harder tasks to ground stations. (arxiv.org)

That is the backdrop for a recent post describing a fine-tuned 1.6 billion-parameter vision-language model running onboard satellites, with inference priced in pennies per call instead of dollars for cloud processing. The post also said local inference avoids the round trip to ground infrastructure, which is the main reason latency falls when a spacecraft can classify or caption imagery before downlink. (x.com)

The chip-side story is about quantization, which means storing numbers with fewer bits, like rounding prices to fewer decimals so they take less space and are faster to handle. NVIDIA says its four-bit floating-point format, NVFP4, cuts model memory about 3.5 times versus sixteen-bit floating point and about 1.8 times versus eight-bit floating point. (developer.nvidia.com)

Attention is the part of a transformer model that decides which earlier words or image regions matter most, and it has been the stubborn piece for four-bit inference because its values swing widely. The paper “Attn-QAT: 4-Bit Attention With Quantization-Aware Training,” revised on March 6, 2026, said reliable four-bit attention is the prerequisite for end-to-end FP4 computation on emerging FP4-capable graphics processors. (arxiv.org) The authors said ordinary “drop-in” quantization-aware training was unstable, so they changed the backward pass to recompute attention scores in low precision and removed hidden high-precision assumptions in Flash Attention gradients. Hao Zhang then said in a recent post that this recipe enables end-to-end FP4 models on NVIDIA Rubin hardware with roughly three times the throughput of FP8 for the target workloads. (arxiv.org) (x.com)

NVIDIA has separately said Blackwell Ultra graphics processors deliver peak dense NVFP4 throughput of up to 15 petaFLOPS, or three times FP8 on the same chips, and that Rubin raises that to 50 petaFLOPS of NVFP4 Transformer Engine inference compute. NVIDIA’s TensorRT-LLM documentation also now lists FP4 support on Blackwell and Hopper and NVFP4 key-value cache support for long-context serving. (developer.nvidia.com) (nvidia.github.io) (developer.nvidia.com)

Put together, the two announcements point at the same target: push more image and language inference to the edge, where bandwidth is scarce and delays are expensive. For satellites, drones, robots, and other remote systems, the useful model is increasingly the one that fits on the device and answers before the link comes back. (arxiv.org) (developer.nvidia.com)
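
For readers who want to see where memory ratios like "about 3.5 times" and "about 1.8 times" can come from, the following is a minimal Python sketch of block-scaled four-bit quantization. It is an illustration under stated assumptions, not NVIDIA's implementation: it assumes a four-bit value grid of the E2M1 type, blocks of 16 values, and one eight-bit scale per block, and it omits the FP8 encoding of the scale and any per-tensor scale. All function and constant names are hypothetical.

```python
# Illustrative sketch only: block-scaled 4-bit "fake quantization" in plain Python.
# Assumptions (not taken from the article): an E2M1-style value grid,
# 16-value blocks, and one 8-bit scale stored per block.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # assumed 4-bit magnitudes
BLOCK_SIZE = 16   # assumed number of values sharing one scale
VALUE_BITS = 4    # bits per stored value
SCALE_BITS = 8    # bits per stored block scale


def quantize_block(block):
    """Scale a block so its largest magnitude maps to 6.0, then snap to the grid."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    scale = max_abs / E2M1_GRID[-1]
    out = []
    for x in block:
        # Pick the nearest representable magnitude, then restore sign and scale.
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(nearest * scale if x >= 0 else -nearest * scale)
    return out


def fake_quantize(values):
    """Quantize a flat list block by block and return the dequantized values."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        out.extend(quantize_block(values[start:start + BLOCK_SIZE]))
    return out


def bits_per_value():
    """Average storage cost per value, including the shared block scale."""
    return VALUE_BITS + SCALE_BITS / BLOCK_SIZE


def model_gigabytes(num_params, bits):
    """Approximate weight storage in gigabytes for a given bit width."""
    return num_params * bits / 8 / 1e9


weights = [0.03 * i - 0.2 for i in range(32)]  # made-up example weights
approx = fake_quantize(weights)
print("max rounding error:", max(abs(a - b) for a, b in zip(weights, approx)))
print("vs 16-bit:", round(16 / bits_per_value(), 2))
print("vs 8-bit:", round(8 / bits_per_value(), 2))
print("1.6B params at 16-bit (GB):", model_gigabytes(1.6e9, 16))
print("1.6B params at 4-bit blocks (GB):", model_gigabytes(1.6e9, bits_per_value()))
```

Under these assumptions each value costs 4.5 bits on average, which gives ratios close to the figures NVIDIA quotes. The rounding error printed for the made-up weights shows the trade: every block keeps its largest value exactly and approximates the rest.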
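
The satellite-ground split described earlier, with compact models onboard and harder tasks sent down, can be pictured as a simple routing rule. The sketch below is hypothetical and is not the Fudan paper's method: it assumes the onboard model reports a confidence score, uses an arbitrary 0.8 cutoff, and plugs in invented latency numbers purely to show how the average is computed.

```python
# Hypothetical routing sketch: act on confident onboard answers, defer the rest.
# The confidence score, the threshold, and all latency numbers are assumptions.
from dataclasses import dataclass


@dataclass
class OnboardResult:
    label: str
    confidence: float  # 0.0 to 1.0, as reported by an assumed onboard model


CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff, not taken from any cited source


def route(result, threshold=CONFIDENCE_THRESHOLD):
    """Return 'onboard' to use the local answer, or 'ground' to defer the task."""
    return "onboard" if result.confidence >= threshold else "ground"


def average_latency(results, onboard_seconds, ground_seconds,
                    threshold=CONFIDENCE_THRESHOLD):
    """Mean latency when deferred tasks pay the onboard cost plus the ground trip."""
    if not results:
        raise ValueError("results must not be empty")
    total = 0.0
    for result in results:
        total += onboard_seconds
        if route(result, threshold) == "ground":
            total += ground_seconds
    return total / len(results)


results = [
    OnboardResult("flooded field", 0.95),
    OnboardResult("cloud cover", 0.90),
    OnboardResult("unclear object", 0.40),
]
print([route(r) for r in results])
print(average_latency(results, onboard_seconds=1.0, ground_seconds=9.0))
```

The fewer tasks that fall below the cutoff, the closer the average gets to the onboard-only latency, which is the intuition behind keeping a compact model on the spacecraft.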