Optimizing key value cache for large language model inference

Patent number	US-2025390703-A1 — Pending application
Assignee	CHARACTER TECHNOLOGIES INC
Inventors	LIANG, Bowen; SHAZEER, Noam Mordechai; OTT, Myle
Forward citations	0
Patent family	1 member in 1 country
CPC	G06N

About this patent

This patent application describes a method for accelerating large language model (LLM) inference by shrinking the key-value (KV) cache, the memory structure that grows with sequence length during autoregressive generation and is often the dominant memory cost when serving transformers at long context lengths. As recited in claim 1, the method receives an input sequence from a client device and performs inference by tokenizing the input, converting the tokens into embeddings, and passing those embeddings through a series of transformer layers. The novelty is concentrated in three jointly applied mechanisms: hybrid attention, multi-query attention, and cross-layer key-value sharing. Hybrid attention uses local attention across a plurality of consecutive layers and inserts global attention at regular intervals between blocks of local-attention layers. Each local layer only needs cached keys and values for its attention window, which bounds that layer's cache footprint, while the periodic global layers preserve long-range information flow. Multi-query attention (one set of key/value projections shared by all attention heads in a layer) and cross-layer KV sharing (reusing cached keys and values across layers) compress the cache further. The combination targets lower memory bandwidth and latency at serving time.
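The three mechanisms can be made concrete with a rough cache-size estimate. The sketch below is illustrative only and is not taken from the application: the names `CacheConfig`, `layer_schedule`, `visible_keys`, and `optimized_cache_bytes` are hypothetical, the model dimensions are made up, and it assumes that cross-layer sharing groups layers of the same kind (local with local, global with global), which the summary above does not specify.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class CacheConfig:
    num_layers: int           # transformer layers
    num_heads: int            # query heads per layer
    head_dim: int             # dimension of each head
    seq_len: int              # tokens currently in context
    bytes_per_value: int = 2  # assumption: 16-bit cache entries


def layer_schedule(num_layers: int, global_every: int) -> list[str]:
    """Hybrid attention: every `global_every`-th layer is global, the rest local."""
    return ["global" if (i + 1) % global_every == 0 else "local"
            for i in range(num_layers)]


def visible_keys(query_pos: int, kind: str, window: int) -> range:
    """Causal key positions a query may attend to in a local or global layer."""
    start = 0 if kind == "global" else max(0, query_pos - window + 1)
    return range(start, query_pos + 1)


def baseline_cache_bytes(cfg: CacheConfig) -> int:
    """Multi-head attention: every layer and head caches keys and values (the 2)."""
    return (2 * cfg.num_layers * cfg.num_heads * cfg.head_dim
            * cfg.seq_len * cfg.bytes_per_value)


def optimized_cache_bytes(cfg: CacheConfig, global_every: int,
                          window: int, share_group: int) -> int:
    """Estimate with all three mechanisms applied (assumptions in the lead-in)."""
    kinds = layer_schedule(cfg.num_layers, global_every)
    # Cross-layer sharing: each group of `share_group` same-kind layers keeps one cache.
    local_caches = ceil(kinds.count("local") / share_group)
    global_caches = ceil(kinds.count("global") / share_group)
    # Multi-query attention: a single key/value head instead of `num_heads`.
    bytes_per_token = 2 * cfg.head_dim * cfg.bytes_per_value
    # Local layers cache at most `window` tokens; global layers cache everything.
    cached_tokens = (local_caches * min(window, cfg.seq_len)
                     + global_caches * cfg.seq_len)
    return bytes_per_token * cached_tokens


# Made-up dimensions, chosen only to show the arithmetic.
cfg = CacheConfig(num_layers=24, num_heads=16, head_dim=128, seq_len=8192)
baseline = baseline_cache_bytes(cfg)
optimized = optimized_cache_bytes(cfg, global_every=6, window=1024, share_group=2)
print(f"baseline:  {baseline / 2**20:.0f} MiB per sequence")
print(f"optimized: {optimized / 2**20:.0f} MiB per sequence")
```

With these made-up settings the estimate falls from 1536 MiB to 13 MiB per sequence. The numbers say nothing about the savings the application itself claims; they only show how the three factors (fewer KV heads, bounded windows, fewer distinct caches) multiply together.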

The applicant is Character Technologies Inc (Character.AI), with named inventors Bowen Liang, Noam Mordechai Shazeer, and Myle Ott. Shazeer is notable here: he co-authored the original transformer paper and authored the paper that introduced multi-query attention, so his name on this filing is directly relevant to the claimed techniques. The application was published in 2025 as US-2025390703-A1 and is still pending, so the claims may narrow during prosecution. The family has a single member in a single country (US). That makes it a domestic-only filing so far rather than a broad international effort, which is consistent with an early-stage application whose foreign counterparts may not yet have been filed or published. Forward citations stand at 0, which is expected for a 2025 publication and says nothing about importance; citation-weighted rankings systematically under-reward filings this recent.

Where this patent stands: This is a focused, technique-level filing rather than a foundational architecture patent. It composes several known KV-cache efficiency primitives (local/global hybrid attention, multi-query attention, cross-layer sharing) into a specific inference pipeline. The inventor roster, particularly Shazeer, ties it to the serving-cost problem at the center of Character.AI's consumer-facing conversational product, where inference cost at scale is the binding constraint. In the broader field it sits alongside the large filers (Google, Microsoft, Salesforce) whose older patents cover the transformer and attention foundations. The filing is best read as practical, product-driven IP protecting a specific efficient-serving configuration, not a claim over the underlying attention mechanism itself.

Patent landscape

The surrounding space is recent, fast-moving, and tightly clustered around transformer inference efficiency. The nearest patents are almost uniformly 2024–2026 filings with single-member families and few or no citations, a clear sign of an emerging subfield where citation history has not yet accumulated. On a citation-weighted basis the ranking is led by Microsoft Corporation (2 patents, 738 citations), Salesforce (5 patents, 176 citations), Microsoft Technology Licensing LLC (3 patents, 140 citations), and Google Inc (1 patent, 129 citations), but those totals are driven by older, broadly foundational NLP and attention patents. The patents nearest in subject matter to this one are recent KV-cache filings that citation weighting under-rewards.

The closest related patents:

US-2026099695-A1: Pyramid key-value cache compression for transformer models (Microsoft Technology Licensing LLC · 0 citations · 1-member family). Directly addresses the same KV-cache memory bottleneck via a pyramid (layer-varying) compression scheme; the most on-point competitor filing, and strategic IP that the citation-weighted score under-rewards.
US-2026094232-A1: Systems and methods for key-value (kv) cache pruning (Salesforce Inc · 0 citations · 1-member family). Pursues cache reduction through pruning rather than the sharing and sparse-attention approach claimed here; same objective, different mechanism.
US-2025094712-A1: Multi-granular clustering-based solution for key-value cache compression (Intel Corporation · 0 citations · 1-member family). Another contemporaneous cache-compression filing, framing the problem as clustering of cached entries; recent strategic IP that citations under-reward.
US-10956819-B2: Attention-based sequence transduction neural networks (Google LLC · 4 citations · 63-member family). A patent in the original transformer family and the architectural foundation that all of the above, including the subject patent, build upon; its 63-member family marks it as the anchor IP of the field.

Technology concentration confirms how tightly this cluster sits: AI & Machine Learning (G06N) appears at 42.3× the corpus baseline and Speech Processing (G10L) at 26.3×, with Digital Computing (G06F) at 5.8× — the surrounding result set is overwhelmingly machine-learning method patents, not a diffuse mix. The competitive picture here is corporate rather than academic, so the practical concern is overlap with active commercial filers (Microsoft, Salesforce, Intel, Google) racing on the same KV-cache efficiency problem, several of whom filed near-identical-objective applications in the same publication window.
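The report does not define how these multiples are computed. One plausible reading, stated here as an assumption, is a ratio of shares: the fraction of patents in the result set that carry a CPC class, divided by the fraction of the whole corpus that carries it. The function name and the counts below are hypothetical and are not the data behind the figures above.

```python
def concentration_ratio(class_in_set: int, set_size: int,
                        class_in_corpus: int, corpus_size: int) -> float:
    """Share of a CPC class in the result set divided by its share in the corpus."""
    if min(set_size, corpus_size, class_in_corpus) <= 0:
        raise ValueError("set size, corpus size, and corpus count must be positive")
    share_in_set = class_in_set / set_size
    share_in_corpus = class_in_corpus / corpus_size
    return share_in_set / share_in_corpus


# Hypothetical counts: 30 of 40 nearby patents carry the class,
# against 2 of every 100 patents in the corpus.
ratio = concentration_ratio(class_in_set=30, set_size=40,
                            class_in_corpus=2, corpus_size=100)
print(f"{ratio:.1f}x the corpus baseline")
```

Under this reading, a multiple well above 1 means the class is over-represented near the subject patent relative to the corpus as a whole.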

Related patents

US-2025390703-A1 · 2025 · Optimizing key value cache for large language model inference · CHARACTER TECHNOLOGIES INC
US-2026037786-A1 · 2026 · Key value neural network architecture · WRITER INC
US-2026099695-A1 · 2026 · Pyramid key-value cache compression for transformer models · MICROSOFT TECHNOLOGY LICENSING LLC
US-2026094232-A1 · 2026 · Systems and methods for key-value (kv) cache pruning · SALESFORCE INC
US-2025094712-A1 · 2025 · Multi-granular clustering-based solution for key-value cache compression · INTEL CORPORATION
US-2025200374-A1 · 2025 · Transformer models with optimized first layer · NILS GRAEF
US-2025342555-A1 · 2025 · Hardware-aware attention mechanism with dynamic workload distribution for transformer models · MICROSOFT TECHNOLOGY LICENSING LLC
US-11941356-B2 · 2024 · Systems and methods for multi-scale pre-training with densely connected transformer · SALESFORCE INC
US-2024152770-A1 · 2024 · Neural network search method and related device · HUAWEI TECHNOLOGIES CO LTD
US-12443851-B2 · 2025 · Augmenting attention-based neural networks to selectively attend to past inputs · GDM HOLDING LLC
US-10956819-B2 · 2021 · Attention-based sequence transduction neural networks · GOOGLE LLC
US-2026023992-A1 · 2026 · System and Method for Large Language Model with Integrated Memory During Inference Using Manifold Traversal Architecture · ATOMBEAM TECHNOLOGIES INC

AI-generated patent analysis. Not legal advice: consult a registered patent attorney before any filing decision. One of PatentLens.AI's free sample reports.