Keep AI off the critical path
- Daniel Oh and other enterprise AI speakers said on June 2 that LLM features should run outside core transaction paths, not inside synchronous booking flows.
- The clearest design rule was operational: use queues, bounded concurrency, caching and feature-level cost attribution, while keeping strict synchronous SLAs for load booking.
- The source video, “Scaling LLMs, Open Ecosystems, Enterprise AI,” remains available on YouTube with Daniel Oh discussing enterprise deployment patterns.

Daniel Oh’s June 2 talk on scaling LLMs framed enterprise AI as a distributed-systems problem, not a feature-layer add-on. In the session, Oh argued that teams should not wire model calls directly into the critical path of transactional software when latency, retries and token costs are hard to bound. The design advice was practical: isolate AI workloads, control concurrency, and decide explicitly which jobs can run asynchronously. For freight and logistics platforms, that means treating AI like a separate service tier rather than an invisible dependency inside booking, pricing or execution flows.

### Why keep AI out of the booking path?

Load booking systems usually carry fixed response expectations, customer-facing timeouts and downstream commitments. If a model call sits inside that path, one slow response, retry storm or provider outage can delay the transaction that actually moves freight.

Oh’s session described a cleaner split: keep the synchronous path narrow and deterministic, then hand off AI work to queues or background workers where possible. That approach lets a platform confirm a booking, write the operational record, and then run enrichment tasks afterward instead of making the customer wait on an unbounded inference call.

### Which AI jobs can be asynchronous, and which cannot?

Document classification is a straightforward example of work that can usually run after the main transaction commits. A bill of lading, proof-of-delivery image or rate confirmation can be queued for extraction, labeling or summarization without blocking the shipment from being created in the system.

Booking confirmation is different because it often sits under a strict service-level target and can trigger carrier commitments, customer notifications and pricing locks. (saltmarch.com) The rule described in the talk was not “never use AI”; it was to separate tasks by operational requirement and only leave a model in the synchronous path when the SLA, fallback and failure mode are defined in advance.

### What controls stop AI latency and cost from spreading?

Queues were one of the core controls because they absorb bursty demand and let teams smooth inference traffic. Bounded concurrency was another because it prevents a surge in requests from exhausting model endpoints, GPUs or budget all at once. Caching matters when the same prompts, documents or reference lookups recur across workflows. Cost attribution matters because teams otherwise struggle to tell which feature, customer segment or workflow is driving token spend. (saltmarch.com)

In practice, those controls turn AI from a hidden dependency into an observable workload with limits, backpressure and ownership.

### What does this look like inside a freight platform?

A freight platform can keep tender creation, load booking and core status writes on conventional synchronous services backed by strict timeouts. The same platform can push document extraction, exception summarization, email drafting and support copilot tasks onto separate workers that consume from queues and write results back later.

That separation also helps incident response. If an AI provider slows down, operators can degrade only the enrichment layer instead of taking down shipment creation or customer-facing tracking. (saltmarch.com) Oh’s broader body of work centers on cloud-native architectures and production-ready LLM deployment, which fits that pattern of isolating workloads by reliability requirement.

### What should engineers audit first?

A useful first check is whether any customer transaction now depends on an LLM response without a hard timeout, fallback or budget cap. A second is whether retries are idempotent and whether queue lag is visible in monitoring.
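
As a rough illustration of what that audit is looking for, the sketch below keeps a booking write synchronous and pushes document classification onto a queue, where each model call runs under a concurrency cap, a hard timeout and a fallback value. It is a minimal sketch under stated assumptions, not code from the talk: `fake_model_call`, `book_load`, the limits and the in-memory dictionaries are all hypothetical stand-ins for a real LLM client, booking service and datastore.

```python
import asyncio

# All limits below are illustrative assumptions, not values from the talk.
MAX_CONCURRENT_CALLS = 2       # cap on in-flight model calls
CALL_TIMEOUT_SECONDS = 0.2     # hard timeout per model call
FALLBACK_LABEL = "needs_manual_review"


async def fake_model_call(document: str) -> str:
    """Hypothetical stand-in for a real LLM client call."""
    # Documents whose name contains "slow" simulate a stalled provider.
    await asyncio.sleep(2.0 if "slow" in document else 0.001)
    return f"label:{document}"


async def classify_with_limits(document: str, limiter: asyncio.Semaphore) -> str:
    """Run one model call under the concurrency cap and the hard timeout."""
    async with limiter:
        try:
            return await asyncio.wait_for(
                fake_model_call(document), timeout=CALL_TIMEOUT_SECONDS
            )
        except asyncio.TimeoutError:
            # Fallback keeps the enrichment layer degraded, not broken.
            return FALLBACK_LABEL


async def worker(queue: asyncio.Queue, limiter: asyncio.Semaphore, results: dict) -> None:
    """Background worker: consume enrichment jobs and write results back."""
    while True:
        load_id, document = await queue.get()
        try:
            # Keyed by load ID, so a retried job overwrites instead of duplicating.
            results[load_id] = await classify_with_limits(document, limiter)
        finally:
            queue.task_done()


def book_load(load_id: str, document: str, queue: asyncio.Queue, bookings: dict) -> str:
    """Synchronous path: write the booking, enqueue enrichment, return at once."""
    bookings[load_id] = "confirmed"
    queue.put_nowait((load_id, document))
    return bookings[load_id]


async def run_demo():
    queue = asyncio.Queue()
    limiter = asyncio.Semaphore(MAX_CONCURRENT_CALLS)
    bookings, results = {}, {}
    workers = [
        asyncio.create_task(worker(queue, limiter, results)) for _ in range(4)
    ]

    # Both bookings are confirmed before any model call has finished.
    book_load("L1", "bill_of_lading.pdf", queue, bookings)
    book_load("L2", "slow_pod_image.png", queue, bookings)

    await queue.join()  # wait for the enrichment backlog to drain
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return bookings, results


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

In this sketch the stalled document ends up with the fallback label while both bookings are already confirmed, which is the property the audit is checking for. A production version would also need durable queues, queue-lag metrics and per-feature cost tracking, none of which are shown here.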
The June 2 source video remains the clearest next reference point for teams reviewing those questions. Daniel Oh’s session page and related conference material focus on production-ready LLM applications, enterprise deployment and cloud-native controls, giving engineers a concrete place to compare their own architecture against those patterns. (youtube.com) (saltmarch.com)