Why semantic deduplication beats keyword filtering

The problem with keyword matching is that it assumes language is consistent. It isn't. A story about the Federal Reserve raising interest rates will appear across dozens of sources as "Fed hikes rates," "FOMC lifts benchmark," "central bank tightens policy," and "rate decision announced" — all referring to the same event, none sharing more than a stop word.

When we started building the deduplication layer for Digestr, we tried the obvious approach first: exact URL matching, then title similarity measured with the Jaccard index. It caught the easy cases, such as syndicated wire copy from AP and Reuters that gets republished verbatim. But it missed everything interesting: the long tail of paraphrased coverage that makes up 80% of what a news reader actually sees.
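To make the failure concrete, here is a minimal sketch of word-level Jaccard similarity applied to the four headlines from the opening paragraph. The tokenizer and the function names are illustrative assumptions, not Digestr's actual code.

```python
import re
from itertools import combinations

def tokenize(title):
    # Lowercase and keep alphanumeric runs; a deliberately simple tokenizer.
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def jaccard_similarity(a, b):
    # Size of the shared word set divided by the size of the combined word set.
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

headlines = [
    "Fed hikes rates",
    "FOMC lifts benchmark",
    "Central bank tightens policy",
    "Rate decision announced",
]

for a, b in combinations(headlines, 2):
    print(f"{jaccard_similarity(a, b):.2f}  {a!r} vs {b!r}")
```

Every pair scores 0.00. Without stemming, even "rates" and "rate" fail to match, while a verbatim republished title would score 1.00.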

What deduplication actually means

There's an important distinction between exact deduplication and semantic deduplication. Exact deduplication removes byte-level or near-byte-level copies. That's a solved problem. Semantic deduplication asks a harder question: do these two articles, written differently, tell the same story?

Two articles can share zero words and still be about the same event. Two articles can share most of their words and be about entirely different things.

These are the boundary cases that kill keyword approaches. A financial brief and an editorial might both use the word "inflation" extensively. A Reuters dispatch and a Bloomberg analysis might describe the same press conference without a single shared phrase beyond proper nouns.

The embedding approach

Layer 0 of the Polari pipeline generates a 384-dimensional embedding for each article using all-MiniLM-L6-v2. These vectors encode semantic meaning — documents about similar topics cluster together in the high-dimensional space regardless of surface vocabulary.

The basic deduplication pass computes cosine similarity between incoming articles and recent candidates:

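# Assumes scikit-learn's cosine_similarity, which matches the call below.
from sklearn.metrics.pairwise import cosine_similarity

def is_duplicate(new_embedding, candidate_embeddings, threshold=0.92):
    # len() works for lists and NumPy arrays; `not array` would raise.
    if len(candidate_embeddings) == 0:
        return False
    # One row of scores: the new article against every candidate.
    similarities = cosine_similarity(
        [new_embedding], candidate_embeddings
    )[0]
    return float(similarities.max()) >= threshold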

A threshold of 0.92 is strict: two articles count as duplicates only if they are nearly semantically identical. This is intentional. At Layer 0, we want to err toward keeping articles rather than discarding them. The clustering pass at Layer 2 handles the softer grouping problem.
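The text does not say how "recent candidates" are stored, so the following is only a sketch of one plausible arrangement: a fixed-size rolling buffer checked with a dependency-free cosine function. The RecentWindow class, its max_size default, and the 3-dimensional toy vectors (standing in for 384-dimensional embeddings) are all assumptions made for illustration.

```python
import math
from collections import deque

def cosine(u, v):
    # Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

class RecentWindow:
    """Hypothetical rolling buffer of recently accepted article embeddings."""

    def __init__(self, max_size=5000, threshold=0.92):
        # max_size is an arbitrary placeholder; the oldest entries fall off.
        self.embeddings = deque(maxlen=max_size)
        self.threshold = threshold

    def admit(self, embedding):
        # Reject the article if it is too close to anything still in the window.
        if any(cosine(embedding, seen) >= self.threshold
               for seen in self.embeddings):
            return False
        self.embeddings.append(embedding)
        return True

window = RecentWindow(max_size=3)
print(window.admit([1.0, 0.0, 0.0]))    # first article is always kept
print(window.admit([0.99, 0.05, 0.0]))  # nearly the same direction: rejected
print(window.admit([0.6, 0.8, 0.0]))    # similarity 0.6 to the first: kept
```

Only accepted articles enter the buffer, so a burst of near-identical copies is compared against the first one kept rather than against each other.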

Choosing the right threshold

The threshold is not arbitrary. We ran precision/recall curves across a hand-labeled dataset of 2,000 article pairs and found 0.92 gives us about 94% precision at 88% recall. Below 0.90, we start collapsing articles that are genuinely different stories with overlapping subject matter — a profile of a CEO and a news article about their company's earnings, for example. Above 0.95, we keep too many near-duplicates and the clustering layer has to work harder.
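For readers who want to reproduce this kind of sweep, here is a rough sketch of how precision and recall can be computed at a given threshold from labeled pairs. The function name and the toy data are assumptions; the numbers below are invented and are not drawn from the 2,000-pair dataset.

```python
def precision_recall(pairs, threshold):
    """pairs: iterable of (similarity, is_same_story) from a labeled set."""
    tp = fp = fn = 0
    for similarity, is_same_story in pairs:
        predicted_duplicate = similarity >= threshold
        if predicted_duplicate and is_same_story:
            tp += 1  # correctly collapsed
        elif predicted_duplicate and not is_same_story:
            fp += 1  # distinct stories wrongly collapsed
        elif not predicted_duplicate and is_same_story:
            fn += 1  # duplicate that slipped through
    # Convention: report 1.0 when the denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

# Invented toy data for illustration only.
labeled_pairs = [
    (0.97, True), (0.95, True), (0.93, True), (0.93, False),
    (0.91, True), (0.89, False), (0.86, True), (0.80, False),
]

for t in (0.88, 0.90, 0.92, 0.95):
    p, r = precision_recall(labeled_pairs, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold trades recall for precision, which is the same tension described above between collapsing distinct stories and letting near-duplicates through.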

Where keyword matching fails

We kept the keyword approach running in parallel for three weeks to compare. The failure modes were consistent:

Synonym blindness. "Rate hike" and "tightening cycle" refer to the same monetary policy action. Keyword matching scores them as unrelated.
Entity sensitivity. Articles about different companies in the same sector share a lot of vocabulary. "Q3 earnings beat expectations" is a phrase that appears across hundreds of genuinely different stories every quarter.
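Negation blindness. "The bill passed" and "the bill failed" are almost identical at the keyword level and opposite in meaning. N-gram overlap cannot tell them apart. Embeddings can in principle, though small sentence-embedding models often still score such pairs as close, so treat this as an improvement rather than a guarantee.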

The semantic approach is not perfect either. It struggles with articles that discuss the same story through very different lenses — a technical analysis of a geopolitical event might embed far from the news coverage of the same event. This is where the entity overlap component at Layer 2 adds signal: even if two articles embed differently, if they mention the same people and organizations, they're probably related.
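One way to picture the entity signal is as a fallback check that runs when embedding similarity alone is inconclusive. The sketch below is an assumption about how such a rule could look, not a description of Layer 2: the function names, the "either signal is enough" rule, and both thresholds are hypothetical, and entity extraction is assumed to have happened upstream.

```python
def entity_overlap(entities_a, entities_b):
    # Jaccard overlap of two sets of named entities (people, organizations).
    a, b = set(entities_a), set(entities_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def probably_related(embedding_similarity, entities_a, entities_b,
                     sim_threshold=0.75, entity_threshold=0.5):
    # Hypothetical rule: either signal alone is enough to group two articles.
    # Both thresholds are placeholders, not tuned values.
    if embedding_similarity >= sim_threshold:
        return True
    return entity_overlap(entities_a, entities_b) >= entity_threshold

news = {"Jerome Powell", "Federal Reserve", "FOMC"}
analysis = {"Jerome Powell", "Federal Reserve", "FOMC", "Treasury"}
unrelated = {"Apple", "Tim Cook"}

# Same low embedding similarity in both calls; only the entities differ.
print(probably_related(0.55, news, analysis))   # True: 3 of 4 entities shared
print(probably_related(0.55, news, unrelated))  # False: no shared entities
```

A real system would more likely blend the two scores than apply a hard either/or rule, but the example shows how shared people and organizations can rescue a pair that embeds far apart.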

What this means for feed quality

The practical outcome is that Digestr users see fewer variations of the same story without seeing fewer stories. A major news event might generate 40 articles in a 12-hour window. A keyword-based feed shows them all. A semantically deduplicated feed surfaces the 4 to 6 most distinct: the original breaking news, a key analysis piece, an international angle, an editorial response.

That's the bet: that signal density matters more than volume, and that the reader's attention is worth protecting.