Python Regex LLM Streamlit CSV

Brand Mention Extractor

When analyzing how a brand is described across the web, you need to extract the relevant sentences first. This tool takes the full markdown of a website, finds every sentence that mentions the brand, and filters them down to the ones worth keeping, so nobody has to read through thousands of pages manually.

The problem

Understanding how an LLM describes a brand requires understanding how the brand is described in its training data. That means finding the sentences across competitor sites, review platforms, and industry publications that actually define what the brand does — not the partnership announcements, not the feature release notes, and definitely not the "Scale smarter, not harder" slogans.

At scale, doing this manually is not an option. But a pure regex approach returns too much noise, and asking an LLM to read every sentence of every page is slow and expensive.

The approach

The tool combines two steps: broad regex recall, followed by targeted LLM filtering.

Input: a CSV file with a URL column and a markdown text column, typically exported from Screaming Frog's markdown extraction feature.
Sentence extraction: frontmatter is stripped, markdown syntax (images, links, bold/italic markers, headings) is cleaned, and the text is split into sentences.
Regex matching: each sentence is checked against a word-boundary pattern built from the brand name and any provided synonyms. This step prioritizes recall: it keeps every sentence that contains the brand name or a synonym, whether or not the sentence is useful.
Boilerplate suppression: sentences that appear on more than a configurable fraction of pages (default: 30%) are dropped. This removes repeated footer text, cookie notices, and sitewide navigation copy before any LLM sees it.
LLM filtering: the remaining sentences are sent to a cheap LLM in batches of 40, and the model keeps only the sentences that provide a macro-level, definitional description of the brand.
Output: a clean CSV with one row per mention and three columns (brand_mention_sentence, page_url, matched_term), ready to drop into Claude for further analysis.
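
The pre-LLM steps above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function names, the simplified markdown cleanup, the naive punctuation-based sentence splitter, and the case-insensitive matching are all assumptions made for the example. Only the 30% default threshold and the three output column names come from the description above.

```python
import re
from collections import defaultdict


def clean_markdown(text: str) -> str:
    """Strip frontmatter and common markdown syntax (simplified, assumed rules)."""
    text = re.sub(r"\A---\n.*?\n---\n", "", text, flags=re.DOTALL)   # YAML frontmatter
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)                 # images
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)             # links -> anchor text
    text = re.sub(r"^#{1,6}[ \t]*", "", text, flags=re.MULTILINE)    # heading markers
    text = re.sub(r"\*+|(?<!\w)_+|_+(?!\w)", "", text)               # bold/italic markers
    return text


def split_sentences(text: str) -> list[str]:
    """Naive splitter: each line is a block, split after ., ! or ? plus whitespace."""
    sentences = []
    for line in text.splitlines():
        for part in re.split(r"(?<=[.!?])\s+", line.strip()):
            if part:
                sentences.append(part)
    return sentences


def build_brand_pattern(brand: str, synonyms: list[str]) -> re.Pattern:
    """Word-boundary pattern over the brand name and its synonyms."""
    terms = sorted({brand, *synonyms}, key=len, reverse=True)  # longest term first
    alternation = "|".join(re.escape(t) for t in terms)
    # Lookarounds behave like \b but also work for terms ending in punctuation.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)


def extract_mentions(pages: dict[str, str], brand: str, synonyms: list[str],
                     max_page_fraction: float = 0.30) -> list[dict]:
    """Return one row per matched sentence, minus sitewide boilerplate."""
    pattern = build_brand_pattern(brand, synonyms)
    pages_by_sentence = defaultdict(set)
    mentions = []
    for url, markdown in pages.items():
        for sentence in split_sentences(clean_markdown(markdown)):
            match = pattern.search(sentence)
            if match:
                pages_by_sentence[sentence].add(url)
                mentions.append({
                    "brand_mention_sentence": sentence,
                    "page_url": url,
                    "matched_term": match.group(0),
                })
    # Drop sentences that appear on more than the allowed fraction of pages.
    limit = max_page_fraction * len(pages)
    return [m for m in mentions
            if len(pages_by_sentence[m["brand_mention_sentence"]]) <= limit]
```

Here `pages` maps each URL to its markdown text, which is what the two CSV columns would give you after loading. Counting pages rather than raw occurrences means a sentence repeated several times on a single page is not mistaken for sitewide boilerplate.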

The batching trick

The LLM filtering step is where most of the cost would normally go. We kept it cheap with two decisions:

— Batches of 40 sentences at a time. Each API call receives a numbered list of sentences and a strict prompt describing what counts as a valid brand definition. The model classifies all 40 in one pass.
— The model only returns indexes. Instead of asking the LLM to repeat back the sentences it wants to keep, we instruct it to respond with a JSON array of integer indexes — e.g. [0, 2, 5]. This keeps the output token count minimal and avoids any hallucination on the content itself.
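
A minimal sketch of the two decisions, under stated assumptions: `classify_batch` is a hypothetical callable standing in for whatever API call sends one batch to the model and returns its raw text reply, and the validation rules in `parse_kept_indexes` are illustrative rather than the tool's actual behavior.

```python
import json
from typing import Callable, Iterator

BATCH_SIZE = 40  # batch size stated in the write-up


def batched(items: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parse_kept_indexes(raw: str, batch_len: int) -> list[int]:
    """Parse a reply such as "[0, 2, 5]"; ignore anything malformed or out of range."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    kept = []
    for i in data:
        is_int = isinstance(i, int) and not isinstance(i, bool)
        if is_int and 0 <= i < batch_len and i not in kept:
            kept.append(i)
    return kept


def filter_sentences(sentences: list[str],
                     classify_batch: Callable[[list[str]], str]) -> list[str]:
    """Keep only the sentences whose indexes the model returned for each batch."""
    kept = []
    for batch in batched(sentences):
        raw_reply = classify_batch(batch)  # hypothetical LLM call, one per batch
        kept.extend(batch[i] for i in parse_kept_indexes(raw_reply, len(batch)))
    return kept
```

Because the kept text is always looked up from the original batch, the output can only contain sentences that were actually extracted from the pages. A malformed reply drops that batch instead of crashing the run; retrying would be a reasonable alternative.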

The result is that you can filter thousands of brand mentions down to the definitional ones in seconds, at a cost that is effectively negligible.

What the LLM keeps

The filtering prompt instructs the model to keep only sentences that answer "What is [Brand] at a high level?" — sentences that define the brand's identity, category, and core function. Everything else is filtered out:

× Granular features and sub-tool descriptions
× Partnership and integration announcements
× Corporate news, personnel changes, and PR
× Customer lists and case studies
× Slogans, fluff, and calls-to-action
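
To make the criteria concrete, here is one way such a prompt could be assembled. The wording is a hypothetical reconstruction from the description above, not the prompt the tool actually uses, and `build_filter_prompt` is a name invented for this sketch.

```python
EXCLUDED_CATEGORIES = [
    "Granular features and sub-tool descriptions",
    "Partnership and integration announcements",
    "Corporate news, personnel changes, and PR",
    "Customer lists and case studies",
    "Slogans, fluff, and calls-to-action",
]


def build_filter_prompt(brand: str, sentences: list[str]) -> str:
    """Build an illustrative filtering prompt for one batch of sentences."""
    excluded = "\n".join(f"- {category}" for category in EXCLUDED_CATEGORIES)
    # Zero-based numbering so the returned indexes map directly onto the list.
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(sentences))
    return (
        f'Keep only sentences that answer "What is {brand} at a high level?", '
        "meaning they define the brand's identity, category, and core function.\n"
        f"Discard everything else, including:\n{excluded}\n\n"
        "Respond with a JSON array of the integer indexes to keep, "
        "e.g. [0, 2, 5], and nothing else.\n\n"
        f"Sentences:\n{numbered}"
    )
```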

Output

The final CSV is clean enough to paste directly into Claude for deeper analysis — tracking how a brand's description has shifted across sources, identifying gaps in knowledge graph coverage, or benchmarking how competitors are positioned in the training data.

The tool runs as a Streamlit app. You upload a CSV, specify a brand name and any synonyms, configure the boilerplate threshold, and download the result. No infrastructure, no pipeline to maintain.
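
For the download step, the filtered rows need to be serialized into the three-column CSV described earlier. A small sketch using only the standard library, assuming each row is a dict keyed by those column names (`rows_to_csv` is a hypothetical helper, not part of the tool):

```python
import csv
import io

OUTPUT_COLUMNS = ["brand_mention_sentence", "page_url", "matched_term"]


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize mention rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    # extrasaction="ignore" silently drops any keys outside the three columns.
    writer = csv.DictWriter(buffer, fieldnames=OUTPUT_COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

In a Streamlit app, the returned string could be passed as the data argument of st.download_button. The csv module quotes sentences that contain commas or quotation marks, so they survive the round trip intact.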