Python Embeddings sentence-transformers Screaming Frog

Information Density Auditor

How do you measure information density? Most people can visit a page and tell whether it contains a lot of marketing fluff. That works if your website has a dozen posts, but once you reach hundreds or thousands of pages, as our clients' sites do, you need a way to quantify and automate the process.

The approach

At PromptMarketing we tried multiple approaches to measure information density — lexical density, propositional density — but the approach that gave the most consistent results while preserving scalability turned out to be embedding-based.

Chunk pages on sentence level
Embed each sentence
Compare to the centroid of a list of marketing fluff sentences
If the average sentence of a page is close to this centroid, you are most likely dealing with low information density

We call the result the specificity score. It is computed as 1 - avg_similarity, where avg_similarity is the mean cosine similarity between each sentence embedding and the fluff centroid. Higher specificity generally means higher information density.
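To make the definition concrete, here is a minimal sketch of the score in plain Python. It assumes the embeddings already exist as lists of floats; the helper names (`centroid`, `cosine_similarity`, `specificity_score`) and the toy 2-D vectors are made up for this example and stand in for real model output.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def specificity_score(sentence_embeddings, fluff_embeddings):
    """1 - mean cosine similarity between each sentence and the fluff centroid."""
    fluff_centroid = centroid(fluff_embeddings)
    similarities = [
        cosine_similarity(embedding, fluff_centroid)
        for embedding in sentence_embeddings
    ]
    return 1 - sum(similarities) / len(similarities)


# Toy 2-D vectors stand in for real embeddings (hypothetical data).
fluff = [[1.0, 0.0], [1.0, 0.0]]
fluffy_page = [[1.0, 0.0], [1.0, 0.0]]      # points the same way as the fluff
specific_page = [[0.0, 1.0], [0.0, 2.0]]    # orthogonal to the fluff

print(specificity_score(fluffy_page, fluff))    # 0.0
print(specificity_score(specific_page, fluff))  # 1.0
```

The sketch does not guard against empty pages or all-zero vectors; a real audit would need to skip or flag those before dividing.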

Why measure this in the first place?

Web-searching LLMs rely heavily on embedding-based approaches, both on page level and on chunk level. This means that to perform well in AI search, the embeddings of your pages and their chunks need to be close to those of the search queries you are optimizing for.

Fluffy marketing sentences tend to dilute your embeddings, pulling them away from your target keywords. And every sentence on your page is a citation opportunity — if your sentences don't contain much information on average, your overall citation probability will tank.

— Fluff dilutes similarity. Generic sentences pull your page and chunk embeddings away from the queries you're targeting.
— Every sentence is a citation opportunity. If your sentences are weak on average, LLMs have less reason to cite you.

Workflow

Extract markdown using Screaming Frog
Split each article into sentences
Embed each sentence
Compare each sentence embedding to the centroid of a collection of marketing fluff sentences
Take the average similarity per page, then compute 1 - avg
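The splitting and embedding steps could look roughly like the sketch below. It assumes the markdown of one page is already loaded as a string (how you get it out of your crawler is up to you), uses a deliberately naive regex splitter, and introduces two hypothetical helpers, `split_sentences` and `embed_sentences`. The only library call is `SentenceTransformer(...).encode(...)` from sentence-transformers, imported lazily so the splitter can be used without it.

```python
import re


def split_sentences(markdown_text):
    """Naive splitter: drop headings and blank lines, then break after . ! or ?"""
    lines = [
        line.strip()
        for line in markdown_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    text = " ".join(lines)
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]


def embed_sentences(sentences, encoder=None):
    """Return one embedding per sentence.

    `encoder` is any object with an `encode(list_of_str)` method. If omitted,
    the multilingual MiniLM model mentioned in this post is loaded, which
    requires `pip install sentence-transformers` and a model download.
    """
    if encoder is None:
        from sentence_transformers import SentenceTransformer  # lazy import

        encoder = SentenceTransformer(
            "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
        )
    return encoder.encode(sentences)


# Hypothetical page text, for illustration only.
page = "# Title\n\nWe cut load time by 40%. Ask us how! Our platform is best-in-class."
print(split_sentences(page))
# ['We cut load time by 40%.', 'Ask us how!', 'Our platform is best-in-class.']
```

A regex splitter like this breaks on abbreviations such as "e.g." and ignores lists and tables, so for a real audit a proper sentence tokenizer is probably worth the extra dependency.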

Thresholds

The exact numbers will depend on your embedding model and fluff sentence collection, but as a rule of thumb:

Below 0.70: definitely needs a rewrite

0.70–0.75: underperforming; prioritize for review

0.75–0.80: acceptable; revisit later

Above 0.80: generally fine

These thresholds are based on our setup: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 with a reference set of 40 marketing fluff sentences (13 English, 9 Dutch, 18 mixed/bilingual). Different embedding models or reference sentences will shift these numbers — calibrate against your own content before treating them as gospel.
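As a sketch of how these bands can drive a worklist, the snippet below maps a score to a label and sorts pages from worst to best. The function name `classify`, the label strings, and the URLs and scores are all hypothetical. The boundary handling is also an assumption: each band includes its lower bound, so a score of exactly 0.80 counts as fine.

```python
def classify(score):
    """Map a specificity score to one of the rule-of-thumb bands."""
    if score < 0.70:
        return "rewrite"
    if score < 0.75:
        return "review"
    if score < 0.80:
        return "revisit later"
    return "fine"


# Hypothetical URLs and scores, for illustration only.
pages = {
    "/about-us": 0.63,
    "/blog/pricing-guide": 0.82,
    "/blog/launch-news": 0.73,
}

# Lowest score first, so the weakest pages surface at the top of the list.
for url, score in sorted(pages.items(), key=lambda item: item[1]):
    print(f"{score:.2f}  {classify(score):<14} {url}")
```

If you recalibrate for a different model or fluff set, only the three cut-off values should need to change.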

How to run an audit

Plot the specificity scores as a histogram with 0.05-wide bins: the x-axis is specificity, the y-axis is the number of pages in each bin. The distribution immediately tells you how much of your content library needs attention.
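A minimal sketch of that binning, using only the standard library and printing a text histogram instead of a chart. The names `bin_scores` and `text_histogram` and the sample scores are invented for this example; the resulting counts could just as well be handed to a plotting library.

```python
import math
from collections import Counter

BIN_WIDTH = 0.05


def bin_scores(scores):
    """Count scores per 0.05-wide bin, keyed by the bin's lower edge."""
    counts = Counter()
    for score in scores:
        # The epsilon keeps values such as 0.70 from dropping into the bin
        # below because of floating-point rounding (0.70 / 0.05 < 14).
        index = math.floor(score / BIN_WIDTH + 1e-9)
        counts[round(index * BIN_WIDTH, 2)] += 1
    return dict(sorted(counts.items()))


def text_histogram(scores):
    """Render one line per non-empty bin: range, bar, and page count."""
    lines = []
    for lower, count in bin_scores(scores).items():
        upper = lower + BIN_WIDTH
        lines.append(f"{lower:.2f}-{upper:.2f} | {'#' * count} ({count})")
    return "\n".join(lines)


# Hypothetical per-page specificity scores.
scores = [0.62, 0.64, 0.70, 0.71, 0.74, 0.78, 0.83]
print(text_histogram(scores))
```

Empty bins are simply omitted here, which is fine for a quick look but hides gaps in the distribution; a real chart would show them.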

A heavy left tail means a large fraction of pages are pulling down your overall GEO (generative engine optimization) performance. Pages in the 0.60–0.65 range almost always share the same problems: no third-party citations, few concrete action points, and declarative sentences that say nothing. You'll also find older posts padded to hit a word count, or content that exists because someone needed to publish something that week.

In our experience, even a distribution that looks healthy overall often hides one outlier, and that outlier is frequently the "About us" page. It tends to be packed with fluff, which is exactly what you don't want for a page like that in the age of AI search.

Whether you see a heavy left tail or a single outlier, one chart gives you an immediate overview of which pages need attention, even across hundreds of posts. No manual review, no reading through every page. Just plot, sort, and prioritize.

What we tried before

— Lexical density — inconsistent, didn't give a clear signal on what was fluff versus substance.
— Propositional density — same problem, results were all over the place.

The embedding-based approach was the first to give a satisfactory, consistent result. The logical next step would have been using LLMs to score each page directly, but that would be expensive, slow, and nondeterministic — overkill for what is essentially a triage step.

AI search is embedding-driven, and your content is competing at the vector level. Pages stuffed with generic marketing language get pulled away from the queries that matter, and LLMs have less to cite when every other sentence says nothing.

The specificity score gives you a fast, scalable way to find the weakest pages in your content library. Run the audit, sort by score, and start rewriting from the bottom up. It won't tell you how to fix a page, but it reliably tells you which pages need fixing — and at scale, that's the harder problem to solve.