Neo Kim lists 40 scaling fixes

- Neo Kim on March 8 published a social post listing 40 system-scaling techniques for handling growth, spanning caching, sharding, autoscaling, CDNs and replicas.
- The clearest detail was the count itself: 40 tactics, from horizontal scaling and queues to read replicas and write batching. (substack.com)
- Anyscale’s May 28 webinar and AWS’s current SageMaker guidance give the next places to watch for implementation details. (anyscale.com)


Neo Kim’s March 8 post was framed as a practical checklist: “If I had to scale a system, here are 40 techniques I’d consider.” The list ran from familiar infrastructure moves such as horizontal scaling, caching and load balancing to more failure-oriented controls including rate limits, circuit breakers, backpressure and graceful degradation. It also bundled data-layer tactics such as sharding, replication, partitioning, read replicas and write batching. (substack.com)

The thread drew attention because it compressed several layers of production engineering into a single operating playbook. (anyscale.com) Kim’s list mixed throughput tools, resilience controls and architectural choices rather than treating “scaling” as only a compute problem. That matters for teams running media, video or model-serving systems, where bottlenecks often move between storage, networking, queues and accelerators instead of staying in one place. The post itself did not present benchmark data, but it did provide a map of where engineers usually look first when traffic jumps. (substack.com)

### Which fixes in Kim’s list are about traffic spikes, and which are about failure containment?

The first cluster in Kim’s list is aimed at absorbing more demand: horizontal scaling, vertical scaling, caching, load balancing, autoscaling, CDNs and read replicas. Those are the standard levers for serving more requests or offloading hot paths before a database or application tier becomes the choke point. Kim also included indexing and prefetching, which are usually used to reduce repeated work and lower request latency. (substack.com)

A second cluster is designed to keep systems standing when load becomes erratic. Kim named timeouts, retries, rate limits, circuit breakers, backpressure, failover, high availability, bulkheads and graceful degradation. Those patterns do not increase raw capacity on their own. They reduce blast radius, shed work deliberately and stop one overloaded component from taking down the rest of the stack. (substack.com)

### Why does a 40-point checklist resonate now with AI and video workloads?

Anyscale’s current event materials make the same point in a different domain. The company says multimodal pipelines combine CPU-bound preprocessing with GPU-bound inference, and that traditional batch architectures can leave GPUs idle more than 50% of the time. Anyscale’s materials for Microsoft Build say the bottleneck is often CPU-to-GPU I/O, not only the model itself. That framing lines up with Kim’s inclusion of queueing, event-driven design, stateless services, orchestration, monitoring and tracing. (substack.com)

Inference systems and high-load media systems both depend on keeping each stage fed without overprovisioning every stage. AWS’s SageMaker documentation similarly says teams can cut costs by choosing the inference mode that matches workload shape, including real-time, serverless, asynchronous and batch options. (anyscale.com)

### What are the clearest cost signals from AWS and Anyscale?

AWS has been explicit about the cost side of inference tuning. Amazon said its SageMaker inference optimization capability can deliver up to roughly 2x higher throughput while reducing costs by up to roughly 50% for generative AI models including Llama 3, Mistral and Mixtral. AWS documentation also says Batch on Amazon Bedrock is priced at 50% below on-demand inference for supported models. (substack.com)

AWS has also pushed hardware and deployment selection as a first-order decision. Its Well-Architected guidance says managed, serverless and self-hosted inference choices should be evaluated against workload cost and performance, while SageMaker’s Inference Recommender is positioned as a way to automate load testing and configuration selection for lowest-cost deployment. (aws.amazon.com)

### Which items on Kim’s list are most relevant when GPUs are expensive?

Read replicas, write batching and partitioning are database-side answers to traffic growth, but Kim’s queueing, autoscaling and stateless-service entries are the ones that map most directly to expensive inference fleets. Those patterns let teams smooth bursts, add capacity incrementally and avoid pinning every request to a warm dedicated worker. That is also the logic behind AWS’s guidance to match inference type to workload and behind SageMaker features meant to reduce idle capacity. (docs.aws.amazon.com)

CDNs, lazy loading and prefetching sit one layer higher. They matter because every request avoided at the edge is one request that never reaches a model endpoint or origin service. Kim’s list does not single out AI, but the same mechanics apply when the costly resource is a GPU instead of a database connection. (substack.com)

### Where does this go next?

Anyscale has a webinar scheduled for May 28, 2026, titled “Building a Multimodal Video Processing Pipeline with Ray,” according to its events page. AWS’s current SageMaker cost-optimization and inference-guidance pages remain the main public references for teams comparing real-time, serverless, asynchronous and batch deployment patterns. (anyscale.com)
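
For readers who want a concrete picture of the queueing and backpressure entries on Kim’s list, here is a minimal Python sketch of a bounded buffer placed in front of an expensive worker such as a model endpoint. It does not come from Kim’s post, AWS or Anyscale. The class name `BoundedInbox`, its methods and the capacity and batch sizes are all hypothetical, chosen only for illustration. It uses the standard library’s `queue.Queue` and assumes a single process.

```python
import queue


class BoundedInbox:
    """Hypothetical request buffer in front of a costly worker (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        # A fixed capacity is what turns a plain queue into a backpressure point.
        self._q: queue.Queue = queue.Queue(maxsize=capacity)
        self.rejected = 0

    def offer(self, request: str) -> bool:
        """Try to enqueue without blocking; return False when the buffer is full."""
        try:
            self._q.put_nowait(request)
            return True
        except queue.Full:
            # Shed the request deliberately instead of letting the backlog grow.
            self.rejected += 1
            return False

    def drain(self, batch_size: int) -> list:
        """Pull up to batch_size queued requests so the worker can process them together."""
        batch = []
        while len(batch) < batch_size:
            try:
                batch.append(self._q.get_nowait())
            except queue.Empty:
                break
        return batch


# A burst of five requests arrives at a buffer sized for three (assumed numbers).
inbox = BoundedInbox(capacity=3)
accepted = [inbox.offer(f"req-{i}") for i in range(5)]
first_batch = inbox.drain(batch_size=2)

print(accepted)        # which requests were admitted
print(inbox.rejected)  # how many were shed
print(first_batch)     # what the worker picks up first
```

In this run the buffer admits the first three requests, sheds the last two, and hands the worker the two oldest requests as its first batch. A caller that receives `False` could retry later or fall back to a degraded response, which is where the list’s rate-limit and graceful-degradation entries would come in.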