Data Story

The Subtitle Split

Subtitle tracks disagree: two sources lay down language sets and divergence becomes a tasteful glitch.

Tags: film, imdb, tmdb, languages, data-quality, international
Dataset scope
  • Films: 6557
  • Years: 1914–2024
  • Original-language codes: 45
  • Sources: 2
The same title can have multiple language truths — original, spoken, release, dubbed. Divergence is a signal, not always an error.
Hypothesis

We expect language disagreement to be higher for older films and co-production-heavy titles, reflecting differences in metadata practices and internationalization.

Question: How often do TMDB and IMDB disagree about a film’s languages, and what predicts disagreement?

Method: Parse language sets and compute Jaccard similarity; analyze divergence by decade.
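As a rough illustration of this method, the sketch below parses two language strings into sets and computes their Jaccard similarity. The comma-separated input format, the helper names `parse_languages` and `jaccard`, the convention that two empty sets count as identical, and the definition of divergence as one minus similarity are all assumptions made for the example, not details confirmed by the datasets.

```python
def parse_languages(raw):
    """Split a delimited language string into a set of lowercase labels.

    Assumes a comma-separated string; the real fields may be shaped differently.
    """
    if not raw:
        return set()
    return {part.strip().lower() for part in raw.split(",") if part.strip()}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as perfectly similar.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical inputs standing in for one film's TMDB and IMDB language fields.
tmdb = parse_languages("English, French, German")
imdb = parse_languages("english, french")

similarity = jaccard(tmdb, imdb)
divergence = 1.0 - similarity  # assumption: divergence = 1 - similarity

print(round(similarity, 3), round(divergence, 3))
```

Here the two sets share two of three distinct languages, so the similarity is 2/3. How empty or missing language fields should be scored is a design choice that the limitations on missingness make worth revisiting.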

Prediction: Mainstream recent films align more; older and international films diverge more.

Test: Compare divergence distributions across decades and inspect the most divergent frames.
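One way the test could be carried out is sketched below, using only the Python standard library. The records are made up purely for illustration (they are not rows from the dataset), and the helper names `decade_of`, `summarize_by_decade`, and `most_divergent` are hypothetical. Each record is assumed to carry a precomputed divergence score between 0 and 1.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up records for illustration only: (film id, release year, divergence).
records = [
    ("film_a", 1931, 0.50),
    ("film_b", 1938, 1.00),
    ("film_c", 1974, 0.25),
    ("film_d", 2011, 0.00),
    ("film_e", 2019, 0.00),
    ("film_f", 2015, 0.50),
]


def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1938 -> 1930."""
    return (year // 10) * 10


def summarize_by_decade(rows):
    """Group divergence scores by decade and report count, mean, and median."""
    buckets = defaultdict(list)
    for _, year, div in rows:
        buckets[decade_of(year)].append(div)
    return {
        decade: {"n": len(vals), "mean": mean(vals), "median": median(vals)}
        for decade, vals in sorted(buckets.items())
    }


def most_divergent(rows, k=3):
    """Return the k records with the highest divergence, for manual inspection."""
    return sorted(rows, key=lambda r: r[2], reverse=True)[:k]


summary = summarize_by_decade(records)
for decade, stats in summary.items():
    print(decade, stats)
print(most_divergent(records, k=2))
```

Reporting the count alongside the mean and median matters here, since decades with few films can look extreme for reasons unrelated to metadata practice.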

Narrative Arc
Act I

Frames appear with subtitle tracks underneath.

Act II

TMDB and IMDB stack their language blocks — sometimes aligned, sometimes split.

Act III

Divergence glitches the frame: a map of international complexity and data messiness.

Datasets
  • imdb.film_languages
  • tmdb.movies
  • 23_subtitle_split.json
Limitations
  • Language meaning differs across sources (original vs release vs dubbed).
  • Normalization and naming are imperfect.
  • Missingness is not random across eras and markets.
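To make the normalization limitation concrete, the sketch below shows how two sources that name the same languages differently can look fully divergent until labels are mapped to a common form. The alias table is hypothetical and deliberately tiny; it is not the mapping used for this story, and a real one would need far more entries and care.

```python
# Hypothetical alias table mapping ISO 639 codes to lowercase English names.
# A real pipeline would need a far more complete mapping.
ALIASES = {
    "en": "english",
    "eng": "english",
    "fr": "french",
    "fra": "french",
    "fre": "french",
}


def normalize(label):
    """Lowercase and trim a label, then resolve it through the alias table."""
    key = label.strip().lower()
    return ALIASES.get(key, key)  # unknown labels pass through unchanged


def normalize_set(labels):
    """Normalize every non-empty label in a collection."""
    return {normalize(label) for label in labels if label.strip()}


raw_a = {"en", "fr"}            # one source uses codes
raw_b = {"English", "French"}   # the other uses names

print(raw_a & raw_b)                                   # no overlap before normalization
print(normalize_set(raw_a) == normalize_set(raw_b))    # identical after normalization
```

Labels missing from the table pass through unchanged, so gaps in the mapping show up as apparent disagreement rather than as errors.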