Extracting VobSub subtitles

What is the VobSub format, and how can I extract the subtitle text from it?

Origin of VobSub

VobSub was developed in the late 1990s, coinciding with the rise of DVD technology. It was created primarily to address the need for high-quality subtitles that could be seamlessly integrated with DVD video content. The format was designed to overcome limitations of text-based subtitle formats, particularly in preserving the original appearance and positioning of subtitles as seen in professional DVD releases.

How VobSub Works

VobSub is an image-based subtitle format, which sets it apart from many text-based alternatives. Here’s how it works:

  1. File Structure: VobSub subtitles consist of two main files:
  • An .idx file: This is a text file containing timing information and other metadata, such as the frame size and color palette.
  • A .sub file: This binary file contains the actual subtitle images.
  2. Image-Based Approach: Instead of storing subtitles as text, VobSub stores them as pre-rendered images. Each subtitle frame is essentially a small picture that is overlaid on the video at the appropriate time.
  3. Timing Mechanism: The .idx file records when each subtitle image should appear, along with the byte offset of its data in the .sub file. How long the image stays on screen is stored with the image data in the .sub file. Together, these synchronize the subtitles with the video playback.
  4. Multiple Language Support: VobSub allows for multiple language tracks within the same set of files, making it versatile for multi-language releases.
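
To make the role of the .idx file concrete, here is a minimal Python sketch that reads the index entries. It assumes the commonly seen line layout (“id: en, index: 0” to start a language track, and “timestamp: hh:mm:ss:ms, filepos: <hex offset>” for each subtitle). The names IdxEntry and parse_idx are made up for this illustration, and real files may contain fields this sketch simply skips.

```python
import re
from dataclasses import dataclass


@dataclass
class IdxEntry:
    language: str   # language id of the track this entry belongs to
    start_ms: int   # start time in milliseconds
    filepos: int    # byte offset of the subtitle packet in the .sub file


# Assumed line layouts; other lines (size, palette, comments) are ignored.
LANG_RE = re.compile(r"^id:\s*([^,\s]+),\s*index:\s*(\d+)")
ENTRY_RE = re.compile(
    r"^timestamp:\s*(\d+):(\d+):(\d+):(\d+),\s*filepos:\s*([0-9a-fA-F]+)"
)


def parse_idx(text):
    """Return a list of IdxEntry objects found in the text of an .idx file."""
    entries = []
    language = "und"  # used until the first "id:" line is seen
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        lang_match = LANG_RE.match(line)
        if lang_match:
            language = lang_match.group(1)
            continue
        entry_match = ENTRY_RE.match(line)
        if entry_match:
            hours, minutes, seconds, millis = (
                int(entry_match.group(i)) for i in range(1, 5)
            )
            start_ms = ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis
            # filepos is written in hexadecimal
            entries.append(IdxEntry(language, start_ms, int(entry_match.group(5), 16)))
    return entries
```

Calling parse_idx(open("subtitles.idx", encoding="utf-8", errors="replace").read()) would give one entry per subtitle, grouped by the language line that precedes it.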

Prevalence and Usage

VobSub gained significant popularity during the DVD era and remained common in the early days of digital video distribution. However, its usage has declined in recent years due to:

  1. The rise of streaming services preferring text-based formats
  2. Increased adoption of more flexible subtitle formats like ASS/SSA
  3. The shift towards higher resolution displays, which can expose quality issues in image-based subtitles

Despite this decline, VobSub is still encountered in some DVD rips, older digital video files, and certain niche applications where preserving the exact appearance of subtitles is crucial.

Using ffprobe to Detect VobSub Subtitles

To detect VobSub subtitles, use the ffprobe utility and look for subtitle streams with the codec “dvd_subtitle”, which is the identifier for VobSub subtitles.
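
As a rough sketch of that check, the Python snippet below asks ffprobe for the subtitle streams as JSON and keeps the ones whose codec is “dvd_subtitle”. The function names are made up for this example, and it assumes ffprobe is installed and on the PATH.

```python
import json
import subprocess


def ffprobe_subtitle_streams(path):
    """Run ffprobe and return the subtitle streams of `path` as a list of dicts."""
    cmd = [
        "ffprobe", "-v", "error",
        "-select_streams", "s",  # subtitle streams only
        "-show_entries", "stream=index,codec_name:stream_tags=language",
        "-of", "json",
        path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout).get("streams", [])


def find_vobsub_streams(streams):
    """Return (stream index, language) pairs for streams with codec dvd_subtitle."""
    return [
        (stream["index"], stream.get("tags", {}).get("language", "und"))
        for stream in streams
        if stream.get("codec_name") == "dvd_subtitle"
    ]
```

For example, find_vobsub_streams(ffprobe_subtitle_streams("movie.mkv")) returns an empty list when the file has no VobSub tracks.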

Extracting Text from VobSub Subtitles

Extracting the text is more complex than with text-based subtitle formats, because the subtitles are stored as images rather than characters, but it’s possible with the right tools and techniques. Here’s a step-by-step guide to extracting text from VobSub subtitles:

Step 1: Extracting .idx and .sub Files

First, we need to extract the .idx and .sub files from the video container (if they’re not already separate). For this, we’ll use the mkvextract tool, which is part of the MKVToolNix suite. FFmpeg, while versatile, can’t be used for this specific task.

mkvextract tracks {input_file} {track_idx}:{output_file}

For example:

mkvextract tracks movie.mkv 2:subtitles.idx

This command extracts subtitle track 2 from ‘movie.mkv’ into the ‘subtitles.idx’ and ‘subtitles.sub’ files. The track ID is the one reported by mkvmerge --identify for that file.

Step 2: Extracting Images from the .sub File

Once we have the .idx and .sub files, we need to extract the individual subtitle images. This process involves:

  1. Reading the .idx file to get timing information and byte offsets.
  2. Using these offsets to locate and extract compressed image data from the .sub file.
  3. Decompressing the image data, which uses a run-length encoding (RLE) compression scheme.
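
The decompression step can be sketched in Python as follows. This is an illustration of the nibble-based run-length scheme used for DVD subpictures as it is usually described (codes of 4, 8, 12 or 16 bits, each ending in a 2-bit color index, with a zero run length meaning “fill the rest of the line”). It is not taken from any particular tool, and decode_rle_line is a made-up name. It decodes a single line only. A real decoder also has to parse the packet headers and control sequences, and to handle the two interlaced fields, which are stored separately.

```python
def decode_rle_line(data, width, nibble_pos=0):
    """Decode one line of 2-bit pixels from RLE data.

    `nibble_pos` counts 4-bit units from the start of `data`.
    Returns (pixels, next_nibble_pos); each pixel is a color index 0-3.
    """

    def next_nibble():
        nonlocal nibble_pos
        byte = data[nibble_pos // 2]
        # even positions are the high nibble, odd positions the low nibble
        nibble = (byte >> 4) if nibble_pos % 2 == 0 else (byte & 0x0F)
        nibble_pos += 1
        return nibble

    pixels = []
    while len(pixels) < width:
        value = next_nibble()
        threshold = 0x4
        # Small values mean the code is longer: keep appending nibbles,
        # up to a 16-bit code.
        while value < threshold and threshold <= 0x40:
            value = (value << 4) | next_nibble()
            threshold <<= 2
        color = value & 0x3
        run = value >> 2
        if run == 0:
            run = width - len(pixels)  # zero run length: fill the rest of the line
        run = min(run, width - len(pixels))  # guard against overlong runs
        pixels.extend([color] * run)

    # Each line starts on a byte boundary, so skip a trailing half byte.
    if nibble_pos % 2:
        nibble_pos += 1
    return pixels, nibble_pos
```

For instance, the three bytes 0x5E 0x00 0x03 decode, for a line width of 8, to one pixel of color 1, three pixels of color 2, and color 3 for the remaining four pixels.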

Step 3: Optical Character Recognition (OCR)

The final step is to perform OCR on the extracted images to get the text content. This can be done using various OCR libraries or tools, such as Tesseract. The effectiveness of this step depends on the quality of the original subtitles and the OCR software used.
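
As a hedged sketch of this step, the Python snippet below turns decoded rows of color indexes into a black-on-white bitmap and passes it to Tesseract through the pytesseract wrapper. It assumes the Pillow and pytesseract packages and the Tesseract binary are installed. Which color index holds the text varies between discs, so text_index=1 is only a placeholder, and both function names are made up for this example.

```python
def to_bitmap(rows, text_index=1):
    """Map rows of color indexes to grayscale: text becomes black (0), all else white (255).

    `text_index` is an assumption; inspect the palette to find the right value.
    """
    return [[0 if pixel == text_index else 255 for pixel in row] for row in rows]


def ocr_bitmap(bitmap, lang="eng"):
    """Run Tesseract on a grayscale bitmap (list of rows) and return the text."""
    # Imported here so that to_bitmap stays usable without the OCR dependencies.
    from PIL import Image
    import pytesseract

    height, width = len(bitmap), len(bitmap[0])
    flat = bytes(pixel for row in bitmap for pixel in row)
    image = Image.frombytes("L", (width, height), flat)
    # --psm 6 asks Tesseract to treat the image as a single block of text
    return pytesseract.image_to_string(image, lang=lang, config="--psm 6").strip()
```

Dark text on a light background generally suits Tesseract better than the outlined, colored text typical of DVD subtitles, which is why the sketch keeps only one color index as text.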

Putting it all together

A few open source programs implement both extracting the images and performing OCR on them:

VobSub-ML-OCR

VobSub2SRT

Subtitle Edit

For those who prefer not to install software, subtitleextractor.com offers a web-based alternative. This online tool handles the entire process of extracting text from VobSub subtitles, including separating files, extracting images, and performing OCR, all through a simple user interface.