LLM serving infrastructure

LLM serving infrastructure with KV-cache optimization, mixed-precision quantization, and dynamic request batching for low-latency token streaming over an API endpoint.

This invention

This invention is a serving system for large language models (LLMs) that makes generating text faster and cheaper. It combines three techniques. KV-cache optimization reuses attention computations. Mixed-precision quantization runs parts of the model at lower numerical precision. Dynamic request batching groups incoming requests efficiently. Together, these stream tokens back to users with low latency over an API endpoint. The work belongs to the field of AI/ML inference infrastructure and distributed serving systems.
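To make the quantization idea concrete, here is a minimal sketch of symmetric int8 quantization for a single weight tensor. The function names and sample weights are hypothetical and are not taken from the invention or any cited patent. A mixed-precision setup would apply something like this to some tensors while leaving others at higher precision.

```python
# Illustrative sketch only: symmetric per-tensor int8 quantization.
# Function names and sample values are assumptions made for this example.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.0, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer keeps the error within half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each stored value takes one byte instead of two or four, in exchange for a rounding error of at most half the scale.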

Where it fits

Your idea sits at the intersection of a few well-developed corners of computing. The strongest threads are Digital Computing (G06F), with 29 patents at about 4.3× the corpus baseline, and Data Networking (H04L), with 23 patents at 5.4× baseline. Together these reflect the request-routing, batching, and API-serving heart of your invention. A smaller but more concentrated cluster appears in AI & Machine Learning (G06N), with 10 patents at 11.4× baseline. Filings appear across two decades, with active years around 2012–2020 and fresh entries in 2025–2026, a sign that this is a real, actively pursued direction. Groups active in this neighborhood include Amazon Technologies (four patents here on low-latency compute and request management), Microsoft, and IBM. Computer Vision (G06V) is an adjacent field that these results touch only lightly.

Closest related work

US-2025390703-A1 — Optimizing key value cache for large language model inference (Character Technologies Inc · 0 citations · 1-member family)

Filed in 2025, this is the nearest match to your KV-cache angle. It performs LLM inference across transformer layers using hybrid attention, multi-query attention, and cross-layer key-value sharing, then streams generated tokens back to a client. It shows how one team squeezed latency and memory out of the attention and KV-cache path — the same core problem your invention tackles — and offers a concrete vocabulary for the cache-sharing techniques in this space.
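To see why multi-query attention and cross-layer sharing matter for memory, the sketch below estimates KV-cache size for one sequence using the standard accounting: keys plus values, per layer, per KV head, per position. The model dimensions are hypothetical round numbers, not figures from this patent, and `layers_per_shared_cache` is an assumed parameter used only to model layers that share one cache.

```python
# Illustrative sketch only: back-of-the-envelope KV-cache sizing.
# All dimensions below are assumed values, not taken from the patent.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, layers_per_shared_cache=1):
    """Bytes needed to cache keys and values for one sequence."""
    # Layers that share a cache count once (ceiling division).
    distinct_caches = -(-num_layers // layers_per_shared_cache)
    # Factor of 2 covers the key tensor and the value tensor.
    return 2 * distinct_caches * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Multi-head attention: every one of 32 heads keeps its own keys and values.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)

# Multi-query attention: all query heads share a single KV head.
mqa = kv_cache_bytes(num_layers=32, num_kv_heads=1, head_dim=128, seq_len=4096)

# Multi-query attention plus pairs of layers sharing one cache.
shared = kv_cache_bytes(num_layers=32, num_kv_heads=1, head_dim=128,
                        seq_len=4096, layers_per_shared_cache=2)
```

With these assumed dimensions and 16-bit values, the cache shrinks from 2 GiB per sequence to 64 MiB with a single KV head, and to 32 MiB when pairs of layers also share a cache.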

US-11442775-B1 — Dynamic batching for inference system for transformer-based generation tasks (FriendliAI Inc · 8 citations · 11-member family)

This patent maps directly onto your dynamic-batching component. It applies a transformer model to a batch of requests with variable input, target, and internal-state lengths, selectively batching some operations while processing others individually per request. It shows how FriendliAI handled the tricky problem of batching generation requests that finish at different times — exactly the efficiency challenge your low-latency serving system addresses.
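The problem of requests finishing at different times can be sketched with a toy scheduler that refills batch slots after every decode step. This is a generic illustration of step-level batching, not FriendliAI's claimed method. The function names are hypothetical, and each request is assumed to need at least one token.

```python
# Illustrative sketch only: a toy scheduler, not the patented method.
from collections import deque

def run_step_level_batching(request_lengths, max_batch):
    """Simulate decoding; return {request_id: step at which it finished}."""
    waiting = deque(enumerate(request_lengths))
    active = {}        # request id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or active:
        # Fill any free batch slots before the next decode step.
        while waiting and len(active) < max_batch:
            rid, remaining = waiting.popleft()
            active[rid] = remaining
        step += 1
        # One decode step: every active request emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                finished_at[rid] = step
    return finished_at

def static_batching_steps(request_lengths, max_batch):
    """Steps needed if each batch must fully finish before the next starts."""
    chunks = [request_lengths[i:i + max_batch]
              for i in range(0, len(request_lengths), max_batch)]
    return sum(max(chunk) for chunk in chunks)

lengths = [3, 1, 2]   # tokens each request needs (assumed values)
finished = run_step_level_batching(lengths, max_batch=2)
static_steps = static_batching_steps(lengths, max_batch=2)
```

In this toy run the third request takes the slot freed by the one-token request, so all three finish within 3 steps, compared with 5 steps when each batch has to drain completely first.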

US-11183178-B2 — Adaptive batching to reduce recognition latency (Microsoft Technology Licensing LLC · 2 citations · 17-member family)

Though framed around speech recognition, this work shares your latency-versus-batching tradeoff. It collects batches of feature frames of varying sizes and feeds them adaptively into a recognition network to balance throughput and responsiveness. It shows how Microsoft approached adaptive batch sizing in a streaming inference setting — a useful parallel to the dynamic batching you apply to token generation.
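One simple way to picture the latency-versus-batching tradeoff is a rule that picks the largest batch whose estimated processing time still fits a latency budget. The linear cost model below (a fixed overhead plus a per-item cost) and all parameter names are assumptions for illustration, not details from the Microsoft patent.

```python
# Illustrative sketch only: latency-budgeted batch sizing under an
# assumed linear cost model (overhead_ms + per_item_ms * batch_size).

def choose_batch_size(queue_depth, latency_budget_ms, overhead_ms,
                      per_item_ms, max_batch):
    """Largest batch that fits the budget, capped by queue depth and max_batch."""
    if queue_depth <= 0:
        return 0
    fit = int((latency_budget_ms - overhead_ms) // per_item_ms)
    # Always make progress with at least one item, even on a tight budget.
    return max(1, min(queue_depth, max_batch, fit))

busy = choose_batch_size(queue_depth=20, latency_budget_ms=50,
                         overhead_ms=10, per_item_ms=5, max_batch=16)
quiet = choose_batch_size(queue_depth=3, latency_budget_ms=50,
                          overhead_ms=10, per_item_ms=5, max_batch=16)
tight = choose_batch_size(queue_depth=20, latency_budget_ms=12,
                          overhead_ms=10, per_item_ms=5, max_batch=16)
```

A deep queue yields bigger batches for throughput, while a short queue or a tight budget yields small batches that return quickly.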

US-9830193-B1 — Automatic management of low latency computational capacity (Amazon Technologies Inc · 113 citations · 3-member family)

This patent covers the serving-infrastructure side. It maintains pools of compute instances, spots trends in incoming execution requests, and adjusts capacity to keep latency low. It shows how Amazon handled elastic, low-latency request handling at the system level — complementary to your model-level optimizations and helpful for understanding how the endpoint and capacity layers fit around an LLM serving stack.
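The idea of spotting request trends and adjusting capacity can be sketched as a small sizing function that extrapolates the most recent change in request rate. The extrapolation rule, the headroom factor, and every name here are hypothetical choices for illustration and do not describe Amazon's patented mechanism. The list of recent rates is assumed to be non-empty.

```python
# Illustrative sketch only: trend-based pool sizing under assumed rules.
import math

def target_instances(recent_rps, per_instance_rps, headroom=0.2, min_instances=1):
    """Estimate pool size from a one-step linear extrapolation of request rate."""
    # Trend is the change between the two most recent samples.
    trend = recent_rps[-1] - recent_rps[-2] if len(recent_rps) >= 2 else 0
    projected = max(0.0, recent_rps[-1] + trend)
    # Add headroom so short bursts do not immediately raise latency.
    needed = math.ceil(projected * (1 + headroom) / per_instance_rps)
    return max(min_instances, needed)

rising = target_instances([100, 120, 150], per_instance_rps=50)
falling = target_instances([150, 100], per_instance_rps=50)
```

With rising traffic the pool grows ahead of demand, and with falling traffic it shrinks but never below the configured minimum.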

What you can do next

Explore & build on it. Browse the related work above — new, differentiated ideas often come from combining or improving on existing approaches (a specific KV-cache sharing scheme, quantization precision mix, batching policy, or scheduling mechanism others haven't pinned down).
If you'd like to protect it. Filing a provisional application (usually with a patent attorney) is a common first step. Most inventions can be protected in some form — what matters is how broad and defensible that protection is, which is where a patent attorney adds value (a very narrow claim may be granted but protect very little).
If you'd like to make or sell it. The patents above point to who holds rights in this space; if your product would use a protected approach, licensing is a path worth exploring.

Top assignees

Assignee	Patents	Citations
PHOENIX SOLUTIONS INC	2	1154
NUANCE COMMUNICATIONS INC	2	528
NORTEL NETWORKS LIMITED	1	528
INTERNATIONAL BUSINESS MACHINES CORPORATION	2	481
MICROSOFT CORPORATION	2	463
AMAZON TECHNOLOGIES INC	4	376
INTERNATIONAL BUSINESS MACHINES CORP	1	329
PANASONIC CORPORATION	1	266
AT HOME CORPORATION	1	202
SCIENCE APPLICATIONS INTERNATIONAL CORPORATION	1	162

Closest related work

US-2025390703-A1 · 2025

Optimizing key value cache for large language model inference

CHARACTER TECHNOLOGIES INC

US-11442775-B1 · 2022

Dynamic batching for inference system for transformer-based generation tasks

FRIENDLIAI INC

US-10282237-B1 · 2019

Systems and methods for implementing an intelligent application program interface for an intelligent optimization platform

SIGOPT INC

US-2015324690-A1 · 2015

Deep Learning Training System

MICROSOFT CORPORATION

US-9910713-B2 · 2018

Code execution request routing

AMAZON TECHNOLOGIES INC

US-9830193-B1 · 2017

Automatic management of low latency computational capacity

AMAZON TECHNOLOGIES INC

US-8914497-B1 · 2014

System and method for throttling service requests having non-uniform workloads

AMAZON TECHNOLOGIES INC

US-9846771-B2 · 2017

Low latency, high payload, high volume API gateway

TWC PATENT TRUST LLT

US-10599460-B2 · 2020

Analytic model execution engine with instrumentation for granular performance analysis for metrics and diagnostics for troubleshooting

MODELOP INC

US-9031087-B2 · 2015

Method and apparatus for optimizing response time to events in queue

GENESYS TELECOMMUNICATIONS LABORATORIES INC

US-11369873-B2 · 2022

Methods and systems for rendering and encoding content for online interactive gaming sessions

GOOGLE LLC

US-9361344-B2 · 2016

System and method for distributed database query engines

FACEBOOK INC

AI-generated research, not legal advice. One of PatentLens.AI's free sample reports.