Ngrams distribution highly skewed with data quality problems
I stumbled upon this analyzing our DB with LLM. It's related to #518 (very large context<->ngrams table). Brief summary:
Distribution
- 73% of ngrams (9.4M) appear in only 1 context, these seem to be mostly 2-grams or 3-grams like
fair localization using,data mining communitiesetc - p95 = 6 contexts, p99 = 33 contexts
- Extreme long tail (max: 519k contexts)
Data quality issues found
-
Nothingis top ngram (519k contexts) - likely null/empty field parsing bug - Empty string
second (217k contexts) - Bare numbers stored as ngrams:
122777,92455 - HTML/encoding artifacts:
</i>,_,% - Single letters:
c,n,p
Recommendations
- Filter nulls/empty strings during import
- Strip HTML entities and metadata
- Remove single characters and bare numbers
- Consider lemmatization to reduce tail
- Add min frequency threshold (e.g., ≥2 occurrences)
Queries used for analysis
-- 1. Count ngrams by frequency bucket
SELECT
CASE
WHEN context_count = 1 THEN '1 context'
WHEN context_count BETWEEN 2 AND 5 THEN '2-5 contexts'
WHEN context_count BETWEEN 6 AND 20 THEN '6-20 contexts'
WHEN context_count BETWEEN 21 AND 100 THEN '21-100 contexts'
WHEN context_count > 100 THEN '>100 contexts'
END as bucket,
COUNT(*) as ngram_count
FROM (
SELECT ngrams_id, COUNT(DISTINCT context_id) as context_count
FROM context_node_ngrams
GROUP BY ngrams_id
) sub
GROUP BY bucket
ORDER BY MIN(context_count);
-- 1 context 9383885
-- 2-5 contexts 2661530
-- 6-20 contexts 533581
-- 21-100 contexts 157869
-- >100 contexts 46150
-- 2. Sample single-context ngrams
SELECT ng.terms, ng.id
FROM ngrams ng
WHERE ng.id IN (
SELECT ngrams_id
FROM context_node_ngrams
GROUP BY ngrams_id
HAVING COUNT(DISTINCT context_id) = 1
)
LIMIT 100;
-- 3. Top high-frequency ngrams
SELECT ng.terms, COUNT(DISTINCT cng.context_id) as context_count
FROM context_node_ngrams cng
JOIN ngrams ng ON cng.ngrams_id = ng.id
GROUP BY ng.id, ng.terms
ORDER BY context_count DESC
LIMIT 100;
-- 4. Distribution percentiles
SELECT
percentile_cont(0.25) WITHIN GROUP (ORDER BY context_count) as p25,
percentile_cont(0.50) WITHIN GROUP (ORDER BY context_count) as p50,
percentile_cont(0.75) WITHIN GROUP (ORDER BY context_count) as p75,
percentile_cont(0.90) WITHIN GROUP (ORDER BY context_count) as p90,
percentile_cont(0.95) WITHIN GROUP (ORDER BY context_count) as p95,
percentile_cont(0.99) WITHIN GROUP (ORDER BY context_count) as p99
FROM (
SELECT COUNT(DISTINCT context_id) as context_count
FROM context_node_ngrams
GROUP BY ngrams_id
) sub;
Edited by Fabien MANIERE