Skip to content

Ngrams distribution highly skewed with data quality problems

I stumbled upon this analyzing our DB with LLM. It's related to #518 (very large context<->ngrams table). Brief summary:

Distribution

  • 73% of ngrams (9.4M) appear in only 1 context, these seem to be mostly 2-grams or 3-grams like fair localization using, data mining communities etc
  • p95 = 6 contexts, p99 = 33 contexts
  • Extreme long tail (max: 519k contexts)

Data quality issues found

  1. Nothing is top ngram (519k contexts) - likely null/empty field parsing bug
  2. Empty string second (217k contexts)
  3. Bare numbers stored as ngrams: 122777, 92455
  4. HTML/encoding artifacts: </i>, _, %
  5. Single letters: c, n, p

Recommendations

  • Filter nulls/empty strings during import
  • Strip HTML entities and metadata
  • Remove single characters and bare numbers
  • Consider lemmatization to reduce tail
  • Add min frequency threshold (e.g., ≥2 occurrences)

Queries used for analysis

-- 1. Count ngrams by frequency bucket
SELECT 
  CASE 
    WHEN context_count = 1 THEN '1 context'
    WHEN context_count BETWEEN 2 AND 5 THEN '2-5 contexts'
    WHEN context_count BETWEEN 6 AND 20 THEN '6-20 contexts'
    WHEN context_count BETWEEN 21 AND 100 THEN '21-100 contexts'
    WHEN context_count > 100 THEN '>100 contexts'
  END as bucket,
  COUNT(*) as ngram_count
FROM (
  SELECT ngrams_id, COUNT(DISTINCT context_id) as context_count
  FROM context_node_ngrams
  GROUP BY ngrams_id
) sub
GROUP BY bucket
ORDER BY MIN(context_count);

-- 1 context    9383885
-- 2-5 contexts    2661530
-- 6-20 contexts    533581
-- 21-100 contexts    157869
-- >100 contexts    46150


-- 2. Sample single-context ngrams
SELECT ng.terms, ng.id
FROM ngrams ng
WHERE ng.id IN (
  SELECT ngrams_id
  FROM context_node_ngrams
  GROUP BY ngrams_id
  HAVING COUNT(DISTINCT context_id) = 1
)
LIMIT 100;

-- 3. Top high-frequency ngrams
SELECT ng.terms, COUNT(DISTINCT cng.context_id) as context_count
FROM context_node_ngrams cng
JOIN ngrams ng ON cng.ngrams_id = ng.id
GROUP BY ng.id, ng.terms
ORDER BY context_count DESC
LIMIT 100;

-- 4. Distribution percentiles
SELECT 
  percentile_cont(0.25) WITHIN GROUP (ORDER BY context_count) as p25,
  percentile_cont(0.50) WITHIN GROUP (ORDER BY context_count) as p50,
  percentile_cont(0.75) WITHIN GROUP (ORDER BY context_count) as p75,
  percentile_cont(0.90) WITHIN GROUP (ORDER BY context_count) as p90,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY context_count) as p95,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY context_count) as p99
FROM (
  SELECT COUNT(DISTINCT context_id) as context_count
  FROM context_node_ngrams
  GROUP BY ngrams_id
) sub;
Edited by Fabien MANIERE