PatentLens.AI: turn patents into decisions

LLM serving infrastructure

LLM serving infrastructure with KV-cache optimization, mixed-precision quantization, and dynamic request batching for low-latency token streaming over an API endpoint.

This invention

This invention is a serving system for large language models (LLMs) that makes generating text faster and cheaper. It combines three techniques. KV-cache optimization stores the attention keys and values computed for earlier tokens so they are not recomputed at every decoding step. Mixed-precision quantization runs parts of the model at lower numerical precision to save memory and compute. Dynamic request batching groups incoming requests into shared model passes so the hardware stays busy. Together, these stream tokens back to users with low latency over an API endpoint. The work belongs to the field of AI/ML inference infrastructure and distributed serving systems.
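As a rough illustration of the mixed-precision idea, the Python sketch below quantizes some layers to 8-bit integers and leaves a chosen layer at full precision. The layer names, the symmetric per-tensor scheme, and the helper names (`quantize_int8`, `dequantize`, `apply_mixed_precision`) are assumptions made for this example; the invention description does not say which scheme it uses.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: returns (codes, scale)."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from int8 codes."""
    return [c * scale for c in codes]


def apply_mixed_precision(layers, keep_full_precision):
    """Quantize every layer except those named in keep_full_precision."""
    result = {}
    for name, weights in layers.items():
        if name in keep_full_precision:
            result[name] = ("fp", list(weights))
        else:
            result[name] = ("int8", quantize_int8(weights))
    return result


# Hypothetical two-layer model: keep the output head in full precision.
layers = {
    "mlp": [0.5, -1.27, 0.0, 1.0],
    "output_head": [0.031, -0.002, 0.44],
}
mixed = apply_mixed_precision(layers, keep_full_precision={"output_head"})
for name, (precision, payload) in mixed.items():
    print(name, precision, payload)
```

The rounding error per weight is at most half of `scale`, which is why layers with a wide value range or high sensitivity are often the ones left at higher precision.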

Where it fits

Your idea sits at the intersection of a few well-developed corners of computing. The strongest threads by patent count are Digital Computing (G06F), with 29 patents at about 4.3× the corpus baseline, and Data Networking (H04L), with 23 patents at 5.4× baseline. Together these reflect the request-routing, batching, and API-serving heart of your invention. A smaller but more concentrated cluster appears in AI & Machine Learning (G06N), with 10 patents at 11.4× baseline. Filings span roughly two decades, with the most active years around 2012–2020 and fresh entries in 2025–2026, a sign that this direction is still being actively pursued. Groups active in this neighborhood include Amazon Technologies (four patents here on low-latency compute and request management), Microsoft, and IBM. The results also touch Computer Vision (G06V), but less often.
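The report does not define how the "× baseline" multiples are computed. One plausible reading, assumed here, is that each multiple is the share of a class among the results divided by its share in the whole corpus. Under that assumption, and also assuming a result set of 50 patents, the short Python sketch below backs out the corpus share each figure would imply. The function name `implied_corpus_share` is made up for this example, and the outputs are illustrative rather than figures from the report.

```python
RESULT_SET_SIZE = 50  # assumption: the number of ranked patents in the report

# (patent count, reported multiple of corpus baseline), as quoted in the text.
reported = {
    "G06F": (29, 4.3),
    "H04L": (23, 5.4),
    "G06N": (10, 11.4),
}


def implied_corpus_share(count, multiple, result_set_size=RESULT_SET_SIZE):
    """Corpus share implied by: multiple = result_share / corpus_share."""
    result_share = count / result_set_size
    return result_share / multiple


for cpc, (count, multiple) in reported.items():
    share = implied_corpus_share(count, multiple)
    print(f"{cpc}: {count / RESULT_SET_SIZE:.0%} of results, "
          f"implied corpus share {share:.1%}")
```

Because one patent can carry several classification codes, the per-class counts can add up to more than the size of the result set.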

Closest related work

US-2025390703-A1 — Optimizing key value cache for large language model inference (Character Technologies Inc · 0 citations · 1-member family)

Filed in 2025, this is the nearest match to your KV-cache angle. It performs LLM inference across transformer layers using hybrid attention, multi-query attention, and cross-layer key-value sharing, then streams generated tokens back to a client. It shows how one team squeezed latency and memory out of the attention and KV-cache path — the same core problem your invention tackles — and offers a concrete vocabulary for the cache-sharing techniques in this space.
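To make the memory stakes of these cache-sharing techniques concrete, here is a small Python sketch of the usual back-of-the-envelope KV-cache sizing formula. It is not taken from the patent: the model dimensions are hypothetical, and `layers_per_shared_cache` is a made-up parameter that stands in for cross-layer key-value sharing.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, layers_per_shared_cache=1):
    """Approximate KV-cache size for one sequence.

    layers_per_shared_cache > 1 models groups of layers that reuse one cache.
    """
    # Ceiling division: how many distinct caches are actually stored.
    distinct_caches = -(-num_layers // layers_per_shared_cache)
    # The leading 2 accounts for storing both keys and values.
    return 2 * distinct_caches * num_kv_heads * head_dim * seq_len * bytes_per_value


MIB = 1024 ** 2

# Hypothetical model: 32 layers, 32 heads of size 128, 4096-token context, fp16.
full = kv_cache_bytes(32, 32, 128, 4096)
# Multi-query attention: all query heads share a single key/value head.
multi_query = kv_cache_bytes(32, 1, 128, 4096)
# Multi-query attention plus pairs of adjacent layers sharing one cache.
shared = kv_cache_bytes(32, 1, 128, 4096, layers_per_shared_cache=2)

print(full // MIB, "MiB")         # 2048 MiB
print(multi_query // MIB, "MiB")  # 64 MiB
print(shared // MIB, "MiB")       # 32 MiB
```

The cache grows linearly with sequence length and with the number of concurrent sequences, so reductions like these translate directly into more requests served per accelerator.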

US-11442775-B1 — Dynamic batching for inference system for transformer-based generation tasks (FriendliAI Inc · 8 citations · 11-member family)

This patent maps directly onto your dynamic-batching component. It applies a transformer model to a batch of requests with variable input, target, and internal-state lengths, selectively batching some operations while processing others individually per request. It shows how FriendliAI handled the tricky problem of batching generation requests that finish at different times — exactly the efficiency challenge your low-latency serving system addresses.
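The problem of requests that finish at different times can be illustrated with a toy scheduler. The Python sketch below is a simplified simulation of iteration-level batching, in which finished requests leave the batch and waiting ones join between decoding steps, compared with fixed batches that run until the longest request is done. It is an illustration of the general idea under stated assumptions, not the method claimed in the patent; the request IDs, token counts, and function names are hypothetical.

```python
from collections import deque


def run_continuous_batching(requests, max_batch_size):
    """requests: list of (request_id, tokens_to_generate). Returns finish step per ID."""
    waiting = deque(requests)
    active = {}        # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or active:
        # Fill free batch slots before every decoding step.
        while waiting and len(active) < max_batch_size:
            req_id, tokens = waiting.popleft()
            active[req_id] = tokens
        step += 1
        # One decoding iteration: every active request emits one token.
        for req_id in list(active):
            active[req_id] -= 1
            if active[req_id] <= 0:
                finished_at[req_id] = step
                del active[req_id]
    return finished_at


def run_static_batching(requests, max_batch_size):
    """Fixed batches: each batch runs until its longest request completes."""
    finished_at = {}
    step = 0
    for start in range(0, len(requests), max_batch_size):
        batch = requests[start:start + max_batch_size]
        step += max(tokens for _, tokens in batch)
        for req_id, _ in batch:
            finished_at[req_id] = step
    return finished_at


requests = [("a", 2), ("b", 5), ("c", 1), ("d", 3)]
print(run_continuous_batching(requests, max_batch_size=2))
print(run_static_batching(requests, max_batch_size=2))
```

In this toy run the short request "c" completes at step 3 under iteration-level scheduling instead of step 8 under fixed batches, because it takes over the slot that "a" frees up.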

US-11183178-B2 — Adaptive batching to reduce recognition latency (Microsoft Technology Licensing LLC · 2 citations · 17-member family)

Though framed around speech recognition, this work shares your latency-versus-batching tradeoff. It collects batches of feature frames of varying sizes and feeds them adaptively into a recognition network to balance throughput and responsiveness. It shows how Microsoft approached adaptive batch sizing in a streaming inference setting — a useful parallel to the dynamic batching you apply to token generation.
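One simple way to picture the throughput-versus-responsiveness tradeoff is a policy that picks the largest batch whose estimated processing time still fits a latency budget. The Python sketch below assumes a linear cost model (a fixed overhead plus a per-item cost); that model, the parameter names, and the function `pick_batch_size` are assumptions for illustration and are not drawn from the Microsoft patent.

```python
def pick_batch_size(queue_depth, latency_budget_ms, overhead_ms,
                    per_item_ms, max_batch):
    """Largest batch whose estimated time fits the budget (at least 1 if work exists)."""
    if queue_depth <= 0:
        return 0
    # Estimated batch time = overhead_ms + batch_size * per_item_ms.
    affordable = int((latency_budget_ms - overhead_ms) // per_item_ms)
    # Always make progress, even when the budget is already blown.
    return max(1, min(queue_depth, max_batch, affordable))


# Light load: take everything that is waiting.
print(pick_batch_size(queue_depth=3, latency_budget_ms=50,
                      overhead_ms=10, per_item_ms=4, max_batch=32))
# Heavy load: the latency budget, not the queue, caps the batch.
print(pick_batch_size(queue_depth=40, latency_budget_ms=50,
                      overhead_ms=10, per_item_ms=4, max_batch=32))
```

A real system would measure the overhead and per-item cost rather than hard-code them, and the same policy could be applied per decoding step when batching token generation.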

US-9830193-B1 — Automatic management of low latency computational capacity (Amazon Technologies Inc · 113 citations · 3-member family)

This patent covers the serving-infrastructure side. It maintains pools of compute instances, spots trends in incoming execution requests, and adjusts capacity to keep latency low. It shows how Amazon handled elastic, low-latency request handling at the system level — complementary to your model-level optimizations and helpful for understanding how the endpoint and capacity layers fit around an LLM serving stack.
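The capacity-management idea (watch the trend in incoming requests, then size the pool ahead of demand) can be sketched in a few lines of Python. The linear trend projection, the 20% headroom default, and the function name `target_pool_size` are assumptions chosen for illustration; the patent summary above does not specify a formula.

```python
import math


def target_pool_size(recent_counts, requests_per_instance,
                     headroom=1.2, min_instances=1):
    """Suggest a warm-instance count from recent per-interval request counts."""
    if not recent_counts:
        return min_instances
    latest = recent_counts[-1]
    if len(recent_counts) > 1:
        # Average change per interval across the observation window.
        growth = (latest - recent_counts[0]) / (len(recent_counts) - 1)
    else:
        growth = 0.0
    # Project one interval ahead, ignoring downward trends to avoid
    # shrinking the pool too eagerly.
    projected = latest + max(growth, 0.0)
    needed = math.ceil(projected * headroom / requests_per_instance)
    return max(min_instances, needed)


# Rising load: 100 -> 120 -> 140 requests per interval, 50 requests per instance.
print(target_pool_size([100, 120, 140], requests_per_instance=50))
# Flat load at 100 requests per interval.
print(target_pool_size([100, 100, 100], requests_per_instance=50))
```

For an LLM endpoint, `requests_per_instance` would itself depend on the batching and KV-cache choices made at the model level, which is one way the two layers interact.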

What you can do next

  • Explore & build on it. Browse the related work above — new, differentiated ideas often come from combining or improving on existing approaches (a specific KV-cache sharing scheme, quantization precision mix, batching policy, or scheduling mechanism others haven't pinned down).
  • If you'd like to protect it. Filing a provisional application (usually with a patent attorney) is a common first step. Most inventions can be protected in some form — what matters is how broad and defensible that protection is, which is where a patent attorney adds value (a very narrow claim may be granted but protect very little).
  • If you'd like to make or sell it. The patents above point to who holds rights in this space; if your product would use a protected approach, licensing is a path worth exploring.

Top assignees

Assignee · Patents · Citations
PHOENIX SOLUTIONS INC · 2 · 1154
NUANCE COMMUNICATIONS INC · 2 · 528
NORTEL NETWORKS LIMITED · 1 · 528
INTERNATIONAL BUSINESS MACHINES CORPORATION · 2 · 481
MICROSOFT CORPORATION · 2 · 463
AMAZON TECHNOLOGIES INC · 4 · 376
INTERNATIONAL BUSINESS MACHINES CORP · 1 · 329
PANASONIC CORPORATION · 1 · 266
AT HOME CORPORATION · 1 · 202
SCIENCE APPLICATIONS INTERNATIONAL CORPORATION · 1 · 162

Closest related work

US-2025390703-A1 · 2025 · Optimizing key value cache for large language model inference · CHARACTER TECHNOLOGIES INC
US-11442775-B1 · 2022 · Dynamic batching for inference system for transformer-based generation tasks · FRIENDLIAI INC
US-10282237-B1 · 2019 · Systems and methods for implementing an intelligent application program interface for an intelligent optimization platform · SIGOPT INC
US-2015324690-A1 · 2015 · Deep Learning Training System · MICROSOFT CORPORATION
US-9910713-B2 · 2018 · Code execution request routing · AMAZON TECHNOLOGIES INC
US-9830193-B1 · 2017 · Automatic management of low latency computational capacity · AMAZON TECHNOLOGIES INC
US-8914497-B1 · 2014 · System and method for throttling service requests having non-uniform workloads · AMAZON TECHNOLOGIES INC
US-9846771-B2 · 2017 · Low latency, high payload, high volume API gateway · TWC PATENT TRUST LLT
US-10599460-B2 · 2020 · Analytic model execution engine with instrumentation for granular performance analysis for metrics and diagnostics for troubleshooting · MODELOP INC
US-9031087-B2 · 2015 · Method and apparatus for optimizing response time to events in queue · GENESYS TELECOMMUNICATIONS LABORATORIES INC
US-11369873-B2 · 2022 · Methods and systems for rendering and encoding content for online interactive gaming sessions · GOOGLE LLC
US-9361344-B2 · 2016 · System and method for distributed database query engines · FACEBOOK INC

View all 50 ranked patents in the interactive report →

AI-generated research, not legal advice. One of PatentLens.AI's free sample reports.