The Subtitle Split
Subtitle tracks disagree: two sources lay down language sets and divergence becomes a tasteful glitch.
Language disagreement is higher for older films and for titles with heavy co-production, reflecting differences in metadata practices and in how international a film is.
Question: How often do TMDB and IMDB disagree about a film’s languages, and what predicts disagreement?
Method: Parse each source's language set per film and compute the Jaccard similarity between the two sets; analyze divergence by release decade.
Prediction: Mainstream recent films align more; older and international films diverge more.
Test: Compare divergence distributions across decades and inspect the most divergent frames.
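The method and test above can be sketched roughly as follows. This is a minimal illustration, not the pipeline behind this story: the field names (`tmdb_languages`, `imdb_languages`, `year`), the comma or pipe delimiters, and the choice to define divergence as one minus the Jaccard similarity are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def parse_languages(raw):
    """Split a delimited language string into a lowercase set.

    Assumes languages are separated by commas or pipes (hypothetical format).
    """
    if not raw:
        return set()
    parts = raw.replace("|", ",").split(",")
    return {part.strip().lower() for part in parts if part.strip()}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; None when both sets are empty."""
    union = a | b
    if not union:
        return None  # undefined: neither source lists a language
    return len(a & b) / len(union)


def divergence_by_decade(films):
    """Mean divergence (assumed here to be 1 - Jaccard) per release decade.

    `films` is an iterable of dicts with the hypothetical keys
    'year', 'tmdb_languages' and 'imdb_languages'.
    """
    buckets = defaultdict(list)
    for film in films:
        sim = jaccard(
            parse_languages(film["tmdb_languages"]),
            parse_languages(film["imdb_languages"]),
        )
        if sim is None:
            continue  # skip films with no language data in either source
        decade = film["year"] // 10 * 10
        buckets[decade].append(1.0 - sim)
    return {decade: mean(values) for decade, values in sorted(buckets.items())}
```

Films with no language data in either source are skipped rather than counted as perfect agreement, since an empty-versus-empty comparison says nothing about whether the sources agree.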
Frames appear with subtitle tracks underneath.
TMDB and IMDB stack their language blocks — sometimes aligned, sometimes split.
Divergence glitches the frame: a map of international complexity and data messiness.
Data sources:
- imdb.film_languages
- tmdb.movies
- 23_subtitle_split.json
Caveats:
- Language meaning differs across sources (original vs release vs dubbed).
- Normalization and naming are imperfect.
- Missingness is not random across eras and markets.
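To illustrate the normalization caveat, here is a small sketch of mapping language names and codes from both sources onto one canonical code before comparing sets. The alias table is a hypothetical, deliberately tiny example and not the mapping used for this story. Labels it does not recognize are returned separately so they can be inspected rather than silently counted as disagreement.

```python
# Hypothetical alias table: maps names and ISO 639 codes to a two-letter code.
ALIASES = {
    "english": "en", "eng": "en", "en": "en",
    "french": "fr", "fra": "fr", "fre": "fr", "fr": "fr",
    "german": "de", "deu": "de", "ger": "de", "de": "de",
    "spanish": "es", "spa": "es", "es": "es",
}


def normalize_language(label):
    """Return the canonical code for a label, or None if it is unknown."""
    return ALIASES.get(label.strip().lower())


def normalize_set(labels):
    """Split labels into (known canonical codes, unrecognized raw labels)."""
    known, unknown = set(), set()
    for label in labels:
        code = normalize_language(label)
        if code is None:
            unknown.add(label.strip().lower())
        else:
            known.add(code)
    return known, unknown
```

With this in place, "English" in one source and "eng" in the other both collapse to `en`, so a naming difference is not mistaken for a real disagreement.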