LLM serving infrastructure
LLM serving infrastructure with KV-cache optimization, mixed-precision quantization, and dynamic request batching for low-latency token streaming over an API endpoint.
This invention
This invention is a serving system for large language models (LLMs) that makes generating text faster and cheaper. It combines three techniques. KV-cache optimization stores the attention keys and values computed for earlier tokens so they are not recomputed at every generation step. Mixed-precision quantization runs parts of the model at lower numerical precision to save memory and compute. Dynamic request batching groups incoming requests so they can share each pass through the model. Together, these stream tokens back to users with low latency over an API endpoint. The work belongs to the field of AI/ML inference infrastructure and distributed serving systems.
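To make the quantization idea concrete, here is a minimal sketch of one common scheme: symmetric per-tensor int8 quantization of weights, with activations and accumulation kept in floating point. The function names and sample values are hypothetical and chosen for illustration only; the invention summary does not say which precision mix it uses.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization so that w is roughly q * scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]


def mixed_precision_dot(q_weights: List[int], scale: float,
                        activations: List[float]) -> float:
    """Weights stay as small integers; the accumulation is done in float."""
    return scale * sum(qw * a for qw, a in zip(q_weights, activations))


# Hypothetical sample data.
weights = [0.42, -1.27, 0.05, 0.90]
activations = [1.0, 0.5, -2.0, 0.25]

q, scale = quantize_int8(weights)
exact = sum(w * a for w, a in zip(weights, activations))
approx = mixed_precision_dot(q, scale, activations)
max_err = max(abs(w - d) for w, d in zip(weights, dequantize(q, scale)))
```

With rounding to the nearest step, the error on any single weight is at most half a quantization step (`scale / 2`), which is the usual trade made for a smaller memory footprint.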
Where it fits
Your idea sits at the intersection of a few well-developed corners of computing. The strongest threads are Digital Computing (G06F), with 29 patents at about 4.3× the corpus baseline, and Data Networking (H04L), with 23 patents at 5.4× baseline. Together these reflect the request-routing, batching, and API-serving heart of your invention. A smaller but more concentrated cluster appears in AI & Machine Learning (G06N), with 10 patents at 11.4× baseline. Filings are spread across two decades, with active years around 2012–2020 and fresh entries in 2025–2026, a sign that this direction is still actively pursued. Groups active in this neighborhood include Amazon Technologies (four patents here on low-latency compute and request management), Microsoft, and IBM. Computer Vision (G06V) also appears in these results, but less prominently.
Closest related work
US-2025390703-A1 — Optimizing key value cache for large language model inference (Character Technologies Inc · 0 citations · 1-member family)
Filed in 2025, this is the nearest match to your KV-cache angle. It performs LLM inference across transformer layers using hybrid attention, multi-query attention, and cross-layer key-value sharing, then streams generated tokens back to a client. It shows how one team squeezed latency and memory out of the attention and KV-cache path — the same core problem your invention tackles — and offers a concrete vocabulary for the cache-sharing techniques in this space.
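To see why multi-query attention and cross-layer sharing matter for memory, the following back-of-the-envelope sketch estimates KV-cache size for one sequence. It uses the standard accounting (one key and one value vector per cached layer, per KV head, per token). The model dimensions are hypothetical, and the `layers_per_shared_cache` parameter is an illustrative simplification, not a description of the patent's claimed method.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2,
                   layers_per_shared_cache: int = 1) -> int:
    """Estimate KV-cache bytes for a single sequence.

    layers_per_shared_cache > 1 models groups of layers that store
    their keys and values only once (a simplified view of cross-layer sharing).
    """
    # Ceiling division: number of distinct caches that must be stored.
    cached_layers = -(-num_layers // layers_per_shared_cache)
    # Factor of 2 covers keys plus values.
    return 2 * cached_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical 32-layer model, 128-dim heads, 4096-token context, 16-bit values.
multi_head = kv_cache_bytes(32, num_kv_heads=32, head_dim=128, seq_len=4096)
multi_query = kv_cache_bytes(32, num_kv_heads=1, head_dim=128, seq_len=4096)
shared = kv_cache_bytes(32, num_kv_heads=1, head_dim=128, seq_len=4096,
                        layers_per_shared_cache=2)

print(multi_head // 2**20, "MiB with one KV head per query head")
print(multi_query // 2**20, "MiB with a single shared KV head")
print(shared // 2**20, "MiB when pairs of layers also share a cache")
```

Under these assumed dimensions the cache shrinks from 2 GiB per sequence to 64 MiB with a single KV head, and to 32 MiB when pairs of layers also share one cache, which leaves room for more concurrent requests on the same hardware.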
US-11442775-B1 — Dynamic batching for inference system for transformer-based generation tasks (FriendliAI Inc · 8 citations · 11-member family)
This patent maps directly onto your dynamic-batching component. It applies a transformer model to a batch of requests with variable input, target, and internal-state lengths, selectively batching some operations while processing others individually per request. It shows how FriendliAI handled the tricky problem of batching generation requests that finish at different times — exactly the efficiency challenge your low-latency serving system addresses.
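The problem of requests finishing at different times can be illustrated with a small simulation of iteration-level (sometimes called continuous) batching, where the batch is re-formed before every decoding step. This is a generic sketch of that scheduling idea, not the patent's claimed mechanism; `Request`, `tokens_needed`, and `run_iteration_level_batching` are hypothetical names, and each step is assumed to produce exactly one token per running request.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    name: str
    tokens_needed: int  # hypothetical: tokens to generate before finishing
    generated: int = 0


def run_iteration_level_batching(requests: List[Request],
                                 max_batch_size: int) -> Dict[str, int]:
    """Return the decoding step at which each request finishes."""
    waiting = deque(requests)
    running: List[Request] = []
    finished_at: Dict[str, int] = {}
    step = 0
    while waiting or running:
        # Admit waiting requests into any free slots before each step.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        step += 1
        # One decoding step yields one token for every running request.
        for req in running:
            req.generated += 1
            if req.generated >= req.tokens_needed:
                finished_at[req.name] = step
        # Finished requests leave right away, freeing slots for the next step.
        running = [r for r in running if r.generated < r.tokens_needed]
    return finished_at


finish_steps = run_iteration_level_batching(
    [Request("a", 2), Request("b", 5), Request("c", 1), Request("d", 3)],
    max_batch_size=2,
)
print(finish_steps)
```

In this toy run the last request finishes at step 6. If the same four requests were instead processed as two fixed batches, [a, b] and then [c, d], each held until its longest member completes, the last result would arrive at step 8.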
US-11183178-B2 — Adaptive batching to reduce recognition latency (Microsoft Technology Licensing LLC · 2 citations · 17-member family)
Though framed around speech recognition, this work shares your latency-versus-batching tradeoff. It collects batches of feature frames of varying sizes and feeds them adaptively into a recognition network to balance throughput and responsiveness. It shows how Microsoft approached adaptive batch sizing in a streaming inference setting — a useful parallel to the dynamic batching you apply to token generation.
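One simple way to express the latency-versus-batching tradeoff is a size-or-timeout policy: dispatch a batch when it is full, or when its oldest item has waited too long. The sketch below simulates that policy over a list of arrival times. It is a generic illustration rather than the method in the Microsoft patent, and `form_batches`, `max_batch_size`, and `max_wait_ms` are hypothetical names.

```python
from typing import List, Tuple


def form_batches(arrival_ms: List[int], max_batch_size: int,
                 max_wait_ms: int) -> List[Tuple[int, List[int]]]:
    """Group sorted arrival times into (dispatch_time, arrivals) batches."""
    batches: List[Tuple[int, List[int]]] = []
    current: List[int] = []
    deadline = 0
    for t in arrival_ms:  # assumed to be sorted in ascending order
        # A pending batch whose deadline passed before this arrival
        # is dispatched at its deadline.
        if current and t > deadline:
            batches.append((deadline, current))
            current = []
        if not current:
            # The first item in a batch starts the wait timer.
            deadline = t + max_wait_ms
        current.append(t)
        # A full batch is dispatched immediately.
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        batches.append((deadline, current))
    return batches


batches = form_batches([0, 1, 2, 30, 31, 100], max_batch_size=3, max_wait_ms=10)
for dispatch_time, members in batches:
    print(dispatch_time, members)
```

Raising `max_wait_ms` tends to produce fuller batches and better throughput at the cost of extra waiting for the first request in each batch; lowering it does the opposite.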
US-9830193-B1 — Automatic management of low latency computational capacity (Amazon Technologies Inc · 113 citations · 3-member family)
This patent covers the serving-infrastructure side. It maintains pools of compute instances, spots trends in incoming execution requests, and adjusts capacity to keep latency low. It shows how Amazon handled elastic, low-latency request handling at the system level — complementary to your model-level optimizations and helpful for understanding how the endpoint and capacity layers fit around an LLM serving stack.
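The idea of watching request trends and adjusting capacity can be sketched as a small sizing rule: extrapolate the recent request rate one interval ahead, add headroom, and convert the result to an instance count. This is an illustrative heuristic under assumed inputs, not Amazon's method; `target_pool_size`, `per_instance_rate`, and `headroom` are hypothetical names and values.

```python
import math
from typing import Sequence


def target_pool_size(recent_rates: Sequence[float], per_instance_rate: float,
                     headroom: float = 1.2, min_instances: int = 1) -> int:
    """Pick an instance count from recent request rates (requests per second)."""
    if len(recent_rates) < 2:
        forecast = recent_rates[-1] if recent_rates else 0.0
    else:
        # Linear trend between the oldest and newest samples,
        # projected one interval past the newest sample.
        slope = (recent_rates[-1] - recent_rates[0]) / (len(recent_rates) - 1)
        forecast = max(0.0, recent_rates[-1] + slope)
    # Headroom keeps spare capacity so bursts do not queue.
    needed = math.ceil(forecast * headroom / per_instance_rate)
    return max(min_instances, needed)


rising = target_pool_size([100, 120, 140], per_instance_rate=50)
falling = target_pool_size([140, 120, 100], per_instance_rate=50)
idle = target_pool_size([0, 0, 0], per_instance_rate=50)
print(rising, falling, idle)
```

A rule like this sits outside the model itself, which is why it complements model-level techniques such as caching, quantization, and batching rather than replacing them.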
What you can do next
- Explore & build on it. Browse the related work above — new, differentiated ideas often come from combining or improving on existing approaches (a specific KV-cache sharing scheme, quantization precision mix, batching policy, or scheduling mechanism others haven't pinned down).
- If you'd like to protect it. Filing a provisional application (usually with a patent attorney) is a common first step. Most inventions can be protected in some form — what matters is how broad and defensible that protection is, which is where a patent attorney adds value (a very narrow claim may be granted but protect very little).
- If you'd like to make or sell it. The patents above point to who holds rights in this space; if your product would use a protected approach, licensing is a path worth exploring.
Top assignees
| Assignee | Patents | Citations |
|---|---|---|
| PHOENIX SOLUTIONS INC | 2 | 1154 |
| NUANCE COMMUNICATIONS INC | 2 | 528 |
| NORTEL NETWORKS LIMITED | 1 | 528 |
| INTERNATIONAL BUSINESS MACHINES CORPORATION | 2 | 481 |
| MICROSOFT CORPORATION | 2 | 463 |
| AMAZON TECHNOLOGIES INC | 4 | 376 |
| INTERNATIONAL BUSINESS MACHINES CORP | 1 | 329 |
| PANASONIC CORPORATION | 1 | 266 |
| AT HOME CORPORATION | 1 | 202 |
| SCIENCE APPLICATIONS INTERNATIONAL CORPORATION | 1 | 162 |
AI-generated research, not legal advice. One of PatentLens.AI's free sample reports.