Data Story

The Subtitle Split

Subtitle tracks disagree: two sources lay down language sets and divergence becomes a tasteful glitch.

film, imdb, tmdb, languages, data-quality, international

Dataset scope

6557 films

1914–2024 years

orig codes

sources

The same title can have multiple language truths — original, spoken, release, dubbed. Divergence is a signal, not always an error.


Hypothesis

Language disagreement is higher for older films and co-production-heavy titles, reflecting metadata practices and internationalization.

Question: How often do TMDB and IMDB disagree about a film’s languages, and what predicts disagreement?

Method: Parse each source's languages into a set, compute the Jaccard similarity of the two sets for every film, and analyze divergence by decade.

Prediction: Mainstream recent films align more; older and international films diverge more.

Test: Compare divergence distributions across decades and inspect the most divergent frames.
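The method and test above can be sketched in a few lines of Python. This is a minimal illustration, not the story's actual pipeline: the field names (`tmdb_languages`, `imdb_languages`, `year`), the delimiters, the three sample rows, and the choice to define divergence as one minus the Jaccard similarity are all assumptions made for the example. It also assumes both sources have already been normalized to the same language codes.

```python
import re
from collections import defaultdict
from statistics import mean


def parse_languages(raw):
    """Split a delimited language string into a set of lowercase tokens."""
    if not raw:
        return set()
    # Assumed delimiters: comma, semicolon, or pipe.
    return {tok.strip().lower() for tok in re.split(r"[,;|]", raw) if tok.strip()}


def jaccard(a, b):
    """Return |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def divergence_by_decade(films):
    """Mean divergence (1 - Jaccard similarity) per decade, in decade order."""
    buckets = defaultdict(list)
    for film in films:
        tmdb = parse_languages(film["tmdb_languages"])
        imdb = parse_languages(film["imdb_languages"])
        decade = film["year"] // 10 * 10
        buckets[decade].append(1.0 - jaccard(tmdb, imdb))
    return {decade: mean(values) for decade, values in sorted(buckets.items())}


# Hypothetical rows for illustration only; not taken from the dataset.
films = [
    {"year": 1931, "tmdb_languages": "de", "imdb_languages": "de, fr"},
    {"year": 1938, "tmdb_languages": "fr", "imdb_languages": "fr"},
    {"year": 2019, "tmdb_languages": "en, ko", "imdb_languages": "en, ko"},
]
result = divergence_by_decade(films)
print(result)
```

Sorting films by their individual divergence score, rather than averaging per decade, would surface the most divergent frames for manual inspection.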

Narrative Arc

Act I

Frames appear with subtitle tracks underneath.

Act II

TMDB and IMDB stack their language blocks — sometimes aligned, sometimes split.

Act III

Divergence glitches the frame: a map of international complexity and data messiness.

Datasets

Limitations
