How we measure token savings

The data collection pipeline, tokenization scripts, and QA loops that underpin our published 40–80% savings range.

Jorgo Bardho

Founder, Thread Transfer

February 24, 2025 · 11 min read
token economics · benchmarks · research
[Figure: Visualization of token savings benchmarks]

Whenever we publish a claim like “Bundles cut context windows by 40–80%,” the very next question on every sales call is the same: show me the math. This post captures the exact protocol our applied research crew runs before a number lands on the website. It covers the tooling, how we guard against optimistic samples, and what to watch for if you're benchmarking savings inside your own stack.

1. Capture a clean baseline

Every study starts with the raw thread. We export the full conversation—including author IDs, timestamps, channel metadata, and attachments—into a signed JSON archive. That file is immutable and tied to the Request ID that produced it, which means we can always reproduce the run.

The baseline is stored in our research bucket with a manifest (sketched after this list) describing:

  • The source system (Slack, Zendesk, Intercom, or custom API)
  • The reason the conversation matters (incident review, enterprise QBR, deal desk escalation, etc.)
  • PII sensitivity tier and any redaction filters requested by the customer
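
Here is a minimal sketch of what such a manifest can look like. The field names, example values, and file paths are assumptions drawn from the list above, not a published schema.

```python
# Minimal sketch of the baseline manifest written next to each signed archive.
# Field names and example values are assumptions, not a published schema.
import hashlib
import json

archive_path = "baseline_archive.json"
with open(archive_path, "rb") as f:
    archive_digest = hashlib.sha256(f.read()).hexdigest()

manifest = {
    "request_id": "req_01HEXAMPLE",        # hypothetical Request ID from the export
    "archive_sha256": archive_digest,      # ties the manifest to the immutable archive
    "source_system": "slack",              # slack | zendesk | intercom | custom_api
    "reason": "incident review",
    "pii_tier": 2,
    "redaction_filters": ["email", "phone"],
}

with open("baseline_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```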

2. Choose the tokenizer that actually bills you

Token savings only matter if they align with the model that ultimately consumes the bundle. For the majority of customers in 2025 that means gpt-4.1, claude-3.5-sonnet, or gpt-4o-mini. We run the raw thread and the distilled bundle through the exact same tokenizer version their production workflows call. When teams experiment with a mix of vendors, we version the results so finance can see how savings shift with each model.
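
As a concrete example, a counting helper for OpenAI-family models via tiktoken might look like the sketch below. The file names and the o200k_base fallback are assumptions; Anthropic models would be counted through their own token-counting endpoint instead.

```python
# Sketch: count raw vs. bundle tokens with the tokenizer that actually bills you.
# Covers OpenAI-family models via tiktoken.
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Assumed fallback: the o200k_base encoding used by recent GPT-4o-era models.
        enc = tiktoken.get_encoding("o200k_base")
    return len(enc.encode(text))

raw_tokens = count_tokens(open("raw_thread.txt").read())
bundle_tokens = count_tokens(open("bundle.txt").read())
print(f"raw={raw_tokens} bundle={bundle_tokens} "
      f"reduction={1 - bundle_tokens / raw_tokens:.0%}")
```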

3. Distill with production settings

There is no special “internal mode.” We pipe the baseline archive through the same distillation service our API exposes, so the study is subject to the same rate limits, retention policies, redaction filters, and integrity signatures as production traffic. If a bundle fails validation for any reason, the run is marked incomplete and never makes it into the dataset.
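
From the outside, that replay looks roughly like the sketch below. The endpoint URL, payload fields, and response keys are illustrative assumptions, not the documented API.

```python
# Hypothetical replay of a baseline archive through the production distillation
# service. Endpoint URL, payload fields, and response keys are assumptions.
import json
import requests

with open("baseline_archive.json") as f:
    archive = json.load(f)

resp = requests.post(
    "https://api.thread-transfer.example/v1/distill",   # placeholder URL
    headers={"Authorization": "Bearer <API_KEY>"},
    json={"thread": archive, "redaction_profile": "customer-default"},
    timeout=120,
)
resp.raise_for_status()
bundle = resp.json()

# Runs only enter the dataset when the bundle clears validation.
if not bundle.get("integrity", {}).get("valid", False):
    raise RuntimeError("bundle failed validation; marking run incomplete")
```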

4. Count, annotate, and persist

Once the bundle clears validation, we fire a counting job that records:

  • Total tokens (raw thread vs. bundle)
  • Breakdown by block type (facts, decisions, references, metadata)
  • Compression ratio per section

Results land in an internal dataset backed by DuckDB. Each record carries a study_id, the tokenizer hash, and any redaction filters applied. Researchers can slice by industry, team size, or conversation length to see how savings fluctuate.
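
A sketch of persisting one counting run follows; the table and column names are assumptions modeled on the fields described above (study_id, tokenizer hash, redaction filters, per-block counts).

```python
# Persist one counting run to DuckDB. Schema and example values are assumed.
import datetime
import duckdb

con = duckdb.connect("token_savings.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS runs (
        study_id        TEXT,
        run_date        TIMESTAMP,
        tokenizer_hash  TEXT,
        redaction       TEXT,
        raw_tokens      INTEGER,
        bundle_tokens   INTEGER,
        block_breakdown TEXT,      -- JSON string: facts/decisions/references/metadata
        reduction       DOUBLE
    )
""")
con.execute(
    "INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        "study-2025-02-001",
        datetime.datetime.now(),
        "o200k_base:sha256:ab12",  # tokenizer digest (see the snapshotting sketch later)
        "pii-tier-2",
        32140,
        13420,
        '{"facts": 5200, "decisions": 3100, "references": 2900, "metadata": 2220}',
        1 - 13420 / 32140,
    ],
)
```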

5. Stress-test outliers

Every run marked as an outlier (±2 standard deviations from the mean) is replayed manually. We inspect the raw transcript side-by-side with the bundle to confirm there isn’t a hallucinated summary, a missing decision, or a formatting quirk that inflated savings. Roughly 12% of runs require a replay with tweaked heuristics—for example, loosening entity resolution thresholds for multilingual threads.
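
Flagging the runs to replay is a simple query over that dataset; here is a sketch against the assumed DuckDB table from step 4.

```python
# Flag runs whose reduction sits more than 2 standard deviations from the mean.
import duckdb

con = duckdb.connect("token_savings.duckdb")
outliers = con.execute("""
    SELECT study_id, reduction
    FROM runs
    WHERE abs(reduction - (SELECT avg(reduction) FROM runs))
          > 2 * (SELECT stddev_samp(reduction) FROM runs)
""").fetchall()

for study_id, reduction in outliers:
    print(f"replay {study_id}: reduction={reduction:.0%}")
```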

Benchmarks from the last 90 days

Conversation type                    | Median raw tokens | Median bundle tokens | Median reduction
Enterprise support escalations       | 32,140            | 13,420               | 58%
Weekly engineering incident reviews  | 18,905            | 9,880                | 48%
Sales-to-success hand-offs           | 14,330            | 5,470                | 62%
Research stand-ups w/ code snippets  | 11,210            | 8,940                | 20%

The low end might look underwhelming until you consider what stays in. Code-heavy transcripts often need entire snippets preserved verbatim so future debugging stays honest. Bundles still enforce consistent headings and link every excerpt back to the originating message ID, which pays off in audits even when token counts barely move.

How we calculate the published range

We publish three figures: the modal range (40–60%), the 90th-percentile high end (80%+), and the conservative floor (≈20%). These are recomputed monthly. Any time the modal range shifts more than five percentage points, the product marketing team receives an automated alert so copy across the site stays accurate.
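
A sketch of that monthly recompute, assuming the DuckDB dataset from step 4. Treating the modal range as the densest pair of adjacent 10-point histogram bins and the floor as the 10th percentile are assumptions about the exact definitions.

```python
# Recompute the published figures from per-run reductions.
import duckdb
import numpy as np

con = duckdb.connect("token_savings.duckdb")
rows = con.execute("SELECT reduction FROM runs").fetchall()
reductions = np.array([r for (r,) in rows])

p90 = np.percentile(reductions, 90)    # 90th-percentile high end
floor = np.percentile(reductions, 10)  # conservative floor (assumed definition)

# Densest pair of adjacent 10-point bins as a stand-in for the modal range.
counts, edges = np.histogram(reductions, bins=np.linspace(0.0, 1.0, 11))
i = int(np.argmax(counts[:-1] + counts[1:]))
modal_low, modal_high = edges[i], edges[i + 2]

print(f"modal range {modal_low:.0%}-{modal_high:.0%}, "
      f"p90 {p90:.0%}, floor {floor:.0%}")
```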

Integrating the methodology into your workflow

  1. Baseline every new conversation type. The first time you add a new channel—say, legal review calls—run 5–10 transcripts through the benchmarking pipeline before you promise savings internally.
  2. Automate integrity checks. Persist your raw threads and bundles with IDs so you can replay anomalies and prove nothing was altered after export.
  3. Make finance a subscriber. Finance teams love the weekly CSV we push with raw vs. bundle counts (a sketch of the export follows this list). It gives them a forward-looking model that tracks context-window growth before invoices spike.
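
A sketch of that weekly push, reading from the assumed DuckDB table above; the output path and the seven-day window are illustrative.

```python
# Export last week's raw vs. bundle counts to a CSV for finance.
import duckdb

con = duckdb.connect("token_savings.duckdb")
con.execute("""
    COPY (
        SELECT study_id, run_date, raw_tokens, bundle_tokens,
               round(reduction * 100, 1) AS reduction_pct
        FROM runs
        WHERE run_date >= now() - INTERVAL 7 DAY
    ) TO 'weekly_token_savings.csv' (HEADER, DELIMITER ',')
""")
```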

Where this methodology can break

Token counting is still brittle if you ignore metadata. Common failure modes include:

  • Comparing different tokenizer versions (OpenAI updates can swing counts by 3–5%).
  • Including attachments in the raw thread but stripping them from the bundle.
  • Sampling only “happy path” transcripts, which underestimates bundle overhead.

We mitigate this by snapshotting tokenizer digests, enforcing consistent manifest schemas, and keeping auditors in the loop whenever the distillation recipe changes.
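
For the tokenizer-digest part, one way to fingerprint an encoding is to hash the library version, encoding name, and vocabulary size, as sketched below; this is an illustrative scheme, not necessarily the exact one behind our manifests.

```python
# Fingerprint a tokenizer so later runs can prove they used the same encoding.
# Hashing the library version, encoding name, and vocab size is one reasonable
# scheme; a stronger digest could hash the merge table itself.
import hashlib
import tiktoken

def tokenizer_digest(encoding_name: str = "o200k_base") -> str:
    enc = tiktoken.get_encoding(encoding_name)
    h = hashlib.sha256()
    h.update(f"tiktoken=={tiktoken.__version__};{encoding_name}".encode())
    h.update(str(enc.n_vocab).encode())
    return f"{encoding_name}:sha256:{h.hexdigest()[:12]}"

print(tokenizer_digest())
```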

Next steps

Want to run this yourself? Grab our open-source notebook, drop in your transcripts, and compare the counts. If you need help wiring the benchmarking job into your data warehouse, ping me at info@thread-transfer.com.