Timed Text Formatting

The validator implements checks for conformance to ISO/IEC 14496-30, the timed text and subtitle format for MP4 (ISO base media) files.

Should Fix: 14496-12 8.3.2.3 and ISO 14496-30 4.1 

Unless specified by an embedding environment (e.g. an HTML page), the track header box information (i.e. width, height) shall be used to size the subtitle or timed text track content with respect to the associated track(s) as follows:

If the flag track_size_is_aspect_ratio is not set, and the track width and height are set to values different from 0, the size of the timed text track shall be the track width and height.
If the flag track_size_is_aspect_ratio is not set, and the track width and height are set to 0, the size of the timed text track shall match the reference size.
If the flag track_size_is_aspect_ratio is set, it indicates that the content of the track was authored to an aspect ratio equal to the track header width/height. In this case, neither width nor height shall be 0. The timed text track shall be sized to the maximum size that will fit within the reference size and should equal its width or height, while preserving the indicated aspect ratio.

If only one track is associated with the timed text track, the reference size is the size of the associated track. If multiple tracks are associated, the reference size is the size of the composition of tracks as described by the matrices in the track headers of the associated tracks.

Upon file creation, the width and height of the subtitle or timed text track should be set appropriately according to the width and height of the associated track(s), as declared in their track header. A typical usage is that the timed text or subtitle track has the same width and height as an associated visual track, and no translation. If the track it is supposed to overlay is not stored in an ISOBMFF file or if it is stored as a track in a different ISOBMFF file, the values 0x0 may be used; or the track_size_is_aspect_ratio flag may be used and the width and height set to the desired aspect ratio.

For some timed text documents, the region as defined by the width, height and track_size_is_aspect_ratio corresponds to the visual area filled by the rendering of the timed text documents. When the track width and height attributes are set to a value different from 0 and the track_size_is_aspect_ratio flag is not used, additional region positioning using the translation values tx and ty from the track header matrix, as defined for 3GPP Timed Text tracks, may be used (3GPP TS 26.245:2017, 5.7 defines the text track region using tx, ty, and the track width and height).

NOTE 1 The 3GPP region is not the same as a WebVTT region.

Unless specified by an embedding environment (e.g. an HTML page), visually composed tracks including video, subtitle, and timed text shall be stacked or layered using the ‘layer’ value in the track header box.
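The three sizing rules at the start of this clause can be sketched in a few lines of Python. This is only an illustration, not the validator's implementation: the function name and parameters are hypothetical, the inputs are assumed to be plain integer pixel values (the track header stores them as 16.16 fixed-point), and rounding down when fitting the aspect ratio is an assumption.

```python
from fractions import Fraction
from typing import Tuple


def timed_text_size(width: int, height: int, is_aspect_ratio: bool,
                    ref_width: int, ref_height: int) -> Tuple[int, int]:
    """Return the (width, height) at which the timed text track is sized.

    Hypothetical helper: width/height come from the track header,
    ref_width/ref_height are the reference size of the associated track(s).
    """
    if ref_width <= 0 or ref_height <= 0:
        raise ValueError("reference size must be positive")
    if not is_aspect_ratio:
        if width != 0 and height != 0:
            return width, height          # rule 1: use the track size as-is
        if width == 0 and height == 0:
            return ref_width, ref_height  # rule 2: match the reference size
        raise ValueError("only one of width/height is 0; not covered by the quoted rules")
    # Rule 3: width/height express an aspect ratio, so neither may be 0.
    if width == 0 or height == 0:
        raise ValueError("width and height shall not be 0 when the flag is set")
    aspect = Fraction(width, height)
    if aspect >= Fraction(ref_width, ref_height):
        # Wider than the reference: the width is the limiting dimension.
        return ref_width, int(ref_width / aspect)   # rounded down (assumption)
    # Narrower than the reference: the height is the limiting dimension.
    return int(ref_height * aspect), ref_height     # rounded down (assumption)
```

For example, a track authored at 4:3 over a 1920x1080 reference is limited by the height and is sized to 1440x1080.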

When validating a single track, the validator checks only the last condition, because the corresponding visual tracks are not available. It raises a warning when -1 is not used for the layer value in the track header box.

Should Fix: 14496-30 4.3 

Timed text tracks should be marked with a suitable language in the media header box, indicating the audience for whom the track is appropriate. In the case where it is suitable for a single language, the media header must match that declared language. The value ‘mul’ may be used for a multi-lingual text.

Validator raises a should fix warning when the text track does not contain a valid language code.
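As an illustration of what such a check involves, the sketch below decodes the language field of the media header box, which ISO/IEC 14496-12 stores as three 5-bit values, each being the ISO 639-2/T character minus 0x60. The function names are hypothetical, and treating 'und' (undetermined) as "no suitable language" is an assumption about the validator's behaviour rather than something stated above.

```python
def decode_mdhd_language(packed: int) -> str:
    """Unpack the 15-bit language code from the media header box."""
    # Three 5-bit fields, most significant first; each is (character - 0x60).
    chars = [((packed >> shift) & 0x1F) + 0x60 for shift in (10, 5, 0)]
    return "".join(chr(c) for c in chars)


def has_valid_language(packed: int) -> bool:
    """True if the field decodes to three lowercase letters other than 'und'."""
    code = decode_mdhd_language(packed)
    # Assumption: 'und' is treated as a missing language for this warning.
    return code.isalpha() and code.islower() and code != "und"
```

With this sketch, the packed value 0x15C7 decodes to 'eng', and 0x55C4 decodes to 'und'.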

Must Fix: ISO 14496-30 4.2 

The general processing of timed text or subtitle tracks is that the text content of the sample is delivered to the decoder at the sample decode time, at the latest. The rendering of the sample happens at the composition time, taking into account edit lists if any, and for the whole sample duration, without timing behaviour. However, timed text or subtitle sample data of specific formats may contain internal timing values. Internal timing values may alter the rendering of the sample during its duration as specified by the timed text or subtitle format.

NOTE If an internal timing value does not fall in the time interval corresponding to the sample composition time and sample composition time plus sample duration, the rendering of the sample can be different from the rendering of the same sample data with a composition time such that the internal timing value lies in the associated composition interval.

The subclauses defining the storage of specific formats in the ISOBMFF specify how internal timing values relate to the track time or to the sample decode or composition time (see subclauses 5.3 and 6.3). For instance, start or end times may be relative to the start of the sample, or the start of the track. For sections of the track timeline that have no associated subtitles or timed text content, ‘empty’ samples may be used, as defined for each format, or the duration of the preceding sample extended. Samples with a size of zero are not used. The timescale field in the media header box should be set appropriately to achieve the desired timing accuracy. It is recommended to be set to the value of the timescale field in the media header box of (one of) the associated track(s).

Validator checks that samples with a size of zero are not used. Checking whether TTML internal timing values fall within the sample times is out of scope and is not performed by the validator.
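A minimal sketch of the zero-size check, assuming the sample sizes are read from the sample size box ('stsz') as laid out in ISO/IEC 14496-12. The function names are hypothetical, and the compact sample size box ('stz2') is not handled.

```python
import struct
from typing import List, Sequence


def sample_sizes_from_stsz(payload: bytes) -> List[int]:
    """Read sample sizes from an 'stsz' payload (bytes after the box header)."""
    # Layout: version+flags (4 bytes), sample_size (4 bytes), sample_count (4 bytes).
    _version_flags, sample_size, sample_count = struct.unpack_from(">III", payload, 0)
    if sample_size != 0:
        # A non-zero default means every sample has this size; no table follows.
        return [sample_size] * sample_count
    # Otherwise one 32-bit entry per sample follows the 12-byte header.
    return list(struct.unpack_from(f">{sample_count}I", payload, 12))


def zero_size_samples(sample_sizes: Sequence[int]) -> List[int]:
    """Return the 1-based sample numbers whose size is zero."""
    return [n for n, size in enumerate(sample_sizes, start=1) if size == 0]
```

Any sample number returned by zero_size_samples would correspond to a Must Fix finding.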

ISO 14496-30 5.1 TTML sample conformance

This subclause describes how documents based on TTML, as defined by the W3C, and derived specifications (for example SMPTE-TT), are carried in files based on the ISO base media file format. A TTML Track is a track carrying TTML documents, which can be documents that correspond to a specification based on TTML.

Validator checks conformance to the XML schemas of TTML 1. Conformance checking of TTML 2 is not yet supported.

Must Fix: 14496-30 5.4 

TTML streams shall be carried in subtitle tracks, and as a consequence according to ISOBMFF, the media handler type is ‘subt’, and the track uses a subtitle media header, and associated sample entry and sample group base class.

Validator checks that the handler type (hdlr) is ‘subt’ for TTML tracks, and raises a must fix if this is not the case.
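The handler check can be sketched as follows, assuming the handler reference box layout from ISO/IEC 14496-12 and using the sample entry code to decide which handler is expected. The names are hypothetical, and 'stpp' (the XMLSubtitleSampleEntry code) is taken from the standard rather than from the text above. The same lookup also covers the ‘text’ handler required for WebVTT tracks.

```python
from typing import Optional

# Expected handler type per sample entry four-character code.
EXPECTED_HANDLER = {"stpp": "subt", "wvtt": "text"}


def handler_type(hdlr_payload: bytes) -> str:
    """Return handler_type from an 'hdlr' payload (bytes after the box header)."""
    # Layout: version+flags (4 bytes), pre_defined (4 bytes), handler_type (4 bytes), ...
    if len(hdlr_payload) < 12:
        raise ValueError("hdlr payload too short")
    return hdlr_payload[8:12].decode("latin-1")


def check_handler(sample_entry_code: str, hdlr_payload: bytes) -> Optional[str]:
    """Return a finding message, or None if the handler type is as expected."""
    expected = EXPECTED_HANDLER.get(sample_entry_code)
    if expected is None:
        return None  # not a track type covered by this sketch
    actual = handler_type(hdlr_payload)
    if actual != expected:
        return f"Must Fix: handler type is '{actual}', expected '{expected}'"
    return None
```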

Must Fix: 14496-30 5.5. Sample entry format 

TTML streams shall use the XMLSubtitleSampleEntry format. The namespace field shall be set to at least one unique namespace. It should be set to indicate the primary TTML-based namespace of the document, and should be set to all namespaces in use in the document (e.g. TTML + TTML-Styling + SMPTE-TT). The schema_location field should be set to schema pathnames that uniquely identify the profile or constraint set of the namespaces included in the namespace field. When sub-samples are present (see 5.6), then the auxiliary_mime_types field shall be set to the mime types used in the sub-samples — e.g. “image/png”

Validator checks that the XMLSubtitleSampleEntry format is used and, when sub-samples are present, that the auxiliary_mime_types field is set.

Should Fix: 14496-30 5.5 schema location

The schema_location field should be set.
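The 5.5 checks on the sample entry fields could look roughly like the sketch below. It assumes that namespace, schema_location and auxiliary_mime_types are stored as three consecutive null-terminated UTF-8 strings after the common sample entry fields, which is how ISO/IEC 14496-12 defines XMLSubtitleSampleEntry as far as we understand it. The function names and message texts are hypothetical.

```python
from typing import List, Tuple


def read_entry_strings(fields: bytes) -> Tuple[str, str, str]:
    """Split namespace, schema_location and auxiliary_mime_types.

    `fields` is assumed to start right after the common sample entry
    fields (6 reserved bytes and the 2-byte data_reference_index).
    """
    parts = fields.split(b"\x00", 3)
    if len(parts) < 4:
        raise ValueError("expected three null-terminated strings")
    namespace, schema_location, mime_types = (p.decode("utf-8") for p in parts[:3])
    return namespace, schema_location, mime_types


def check_entry(fields: bytes, has_subsamples: bool) -> List[str]:
    """Return the findings for the three string fields."""
    namespace, schema_location, mime_types = read_entry_strings(fields)
    findings = []
    if not namespace.strip():
        findings.append("Must Fix: namespace is not set")
    if not schema_location.strip():
        findings.append("Should Fix: schema_location is not set")
    if has_subsamples and not mime_types.strip():
        findings.append("Must Fix: auxiliary_mime_types is not set")
    return findings
```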

Should Fix: 14496-30 5.6 Sample format 

A TTML subtitle sample shall consist of an XML document, optionally with resources such as images referenced by the XML document. Every sample is therefore a sync sample in this format; hence, the sync sample table is not present.

Validator checks conformance to the TTML 1 schemas and checks that the sync sample table is absent.

Must Fix: 14496-30 6.1 

WebVTT text content in tracks is encoded using UTF-8, and the data-type boxstring indicates an array of UTF-8 bytes, to fill the enclosing box, with neither a leading character count nor a trailing null terminator.

Each WebVTT cue, as defined in W3C Community Group Report, WebVTT, is stored de-constructed, partly to emphasize that the textual timestamps one would normally find in a WebVTT file do not determine presentation timing; the ISO file structures do. It also separates the text of the actual cue from the structural information that the WebVTT file carries (positioning, timing, and so on). WebVTT cues are stored in a typical ISO boxed structured approach to enable interfacing an ISO file reader with a WebVTT renderer without the need to serialize the sample content as WebVTT text and to parse it again. Boxes shall not contain trailing CR or LF characters, or trailing CRLF sequences (where ‘trailing’ means that they occur last in the payload of the box).

Validator checks that no trailing CR, LF, or CRLF sequences exist in VTT boxes and triggers a must fix if one is detected.
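The trailing line terminator rule reduces to a check on the last byte of each box payload. A minimal sketch, with a hypothetical function name:

```python
def has_trailing_line_terminator(payload: bytes) -> bool:
    """True if the box payload ends with CR or LF (which also covers CRLF)."""
    return payload.endswith((b"\r", b"\n"))
```

Line terminators inside the payload are not affected; only one in the final position is flagged.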

Must Fix: 14496-30 6.2 

Subclause 4.1 defines the general layout behaviour for timed text and subtitle tracks, which is applicable to WebVTT tracks.

Validator checks that timed text tracks use the null media header (nmhd) and that the handler type (hdlr) is ‘text’.

Must Fix: 14496-30 6.3 Timing 

The text from 6.3:

Following the general timing processing defined in 4.2, each cue shall be passed to the WebVTT renderer at the time from the time-to-sample table, as mapped by the edit list (if any). The times derived for a sample from the durations in the time-to-sample table reflect the start and end-time of all cues in that sample. All samples are sync samples; the sync sample table is not used. If there is internal timing value in a cue, each sample must be labelled with the VTT time that corresponds to the sample start time on the VTT time line.

Validator checks that all samples are sync samples and that the sync sample table is not used.

Must Fix: 14496-30 6.4 Track Format 

WebVTT streams shall be carried as timed text tracks, and as a consequence according to ISOBMFF, use the ‘text’ media handler type, and the associated media header, sample entry, and sample group base class.

Validator checks that the handler type for WebVTT tracks is ‘text’.

Must Fix: 14496-30 6.5 WVTT Sample Entry 

WebVTT streams shall use the WVTTSampleEntry format. In the sample entry, a WebVTT configuration box must occur, carrying exactly the lines of the WebVTT file header, i.e. all text lines up to but excluding the ‘two or more line terminators’ that end the header.

NOTE Other boxes may be defined for the sample entry in future revisions of this document (e.g. carrying optional CSS style sheets, font information, and so on).

A WebVTT source label box should occur in the sample entry. It contains a suitable string identifier of the ‘source’ of this WebVTT content, such that if a file is made by editing together two pieces of content, the timed text track would need two sample entries because this source label differs. A URI is recommended for the source label; however, the URI is not interpreted and it is not required there be a resource at the indicated location when a URL form is used.

Validator checks the presence of WVTTSampleEntry and the required presence of boxes in this entry.

Should Fix: 14496-30 6.5 source label 

A WebVTT source label box should occur in the sample entry. A warning is triggered if this is not the case.

Must Fix: 14496-30 6.6 Sample 

According to 14496-30, each sample in a WebVTT track is either:

exactly one VTTEmptyCueBox box (representing a period of non-zero duration in which there is no cue data), or
one or more VTTCueBox boxes that share the same start time and end time, each containing the following boxes. Only the CuePayloadBox is mandatory, all others are optional.

A sample containing cue boxes may also contain zero or more VTTAdditionalTextBox boxes, interleaved between VTTCueBox boxes and carrying any other text in between cues, in the order required by the processing of the additional text, if any. The VTTCueBox boxes must be in presentation order, i.e. if imported from a WebVTT file, the cues in any given sample must be in the order they were in the WebVTT file. It is recommended that the contents of the VTTCueBox boxes occur in the order shown in the syntax, but the order is not mandatory. If a cue has WebVTT Cue Settings, they are placed into a CueSettingsBox without the leading space that separates timing and settings.

When a WebVTT source label box is present in the sample entry and a cue is written into multiple samples, it must be represented in a set of VTTCueBoxes all containing the same source_ID. All VTTCueBoxes that originate from the same VTT cue must have the same source_ID, and that source_ID must be unique within the set of cues that share the same source_label. This means that when stepping from one sample to another (possibly after a seek, as well as during sequential play), a match of source_ID under the same source_label is diagnostic that the same cue is still active. Cues with no CueSourceIDBox are independent from all other cues; a source ID may be assigned to all cues. When there is no WebVTT source label in the sample entry, there must be no CueSourceIDBox in the associated samples. In this way the presence of the WebVTT source label indicates whether source IDs are assigned to cues split over several samples, or not.

When a cue has internal timing values (i.e. WebVTT cue timestamp as defined in W3C Community Group Report, WebVTT) then each VTTCueBox must contain a CueTimeBox which gives the VTT timestamp associated with the start time of sample. When the cue content of a sample is passed to a VTT renderer, timestamps within the cues in the sample must be interpreted relative to the time given in this box, or adjusted considering this time and the sample start time.

The CuePayloadBox must contain exactly one WebVTT Cue. Other text, such as WebVTT Comments are placed into VTTAdditionalText boxes. NOTE The sample entry code is ‘vttC’; in contrast the VTTCueBox is ‘vttc’ and their container is also different. In the CuePayloadBox there must be no blank lines (but there may be multiple lines)

Validator checks that each sample is either exactly one VTTEmptyCueBox or one or more VTTCueBoxes.
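The sample structure rule can be sketched by walking the top-level boxes of a sample. This is an illustration under assumptions, not the validator's code: the function names are hypothetical, the four-character codes 'vtte' (VTTEmptyCueBox), 'vttc' (VTTCueBox) and 'vtta' (VTTAdditionalTextBox) are taken from ISO/IEC 14496-30 rather than from the text above, and 64-bit box sizes are not handled.

```python
import struct
from typing import Iterator, Tuple


def make_box(fourcc: bytes, payload: bytes = b"") -> bytes:
    """Build a box with a 32-bit size (helper for trying the check out)."""
    return struct.pack(">I4s", 8 + len(payload), fourcc) + payload


def iter_boxes(data: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (type, payload) for each box stored back to back in `data`."""
    offset = 0
    while offset < len(data):
        if offset + 8 > len(data):
            raise ValueError("truncated box header")
        size, fourcc = struct.unpack_from(">I4s", data, offset)
        # Sizes 0 and 1 (to end of data, 64-bit size) are not supported here.
        if size < 8 or offset + size > len(data):
            raise ValueError("invalid or unsupported box size")
        yield fourcc.decode("latin-1"), data[offset + 8:offset + size]
        offset += size


def sample_structure_ok(sample: bytes) -> bool:
    """True for exactly one 'vtte', or one or more 'vttc' with optional 'vtta'."""
    types = [box_type for box_type, _ in iter_boxes(sample)]
    if types == ["vtte"]:
        return True
    return "vttc" in types and all(t in ("vttc", "vtta") for t in types)
```

A sample built as make_box(b"vtte") passes, while a sample that mixes an empty cue box with a cue box does not.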

Should Fix: 14496-30 6.6 Cue settings 

If a cue has WebVTT cue settings, they are placed in the CueSettingsBox without a leading space.

Validator checks that no leading space is used.

Should Fix: 14496-30 6.6 Cue Source ID 

There is no CueSourceIDBox in samples if there is no source label.

Validator raises a warning when a CueSourceIDBox is present but there is no source label in the sample entry.
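A sketch of this consistency check, working on box types that have already been parsed. The function name and the input shape are hypothetical, and the four-character codes 'vlab' (WebVTT source label box) and 'vsid' (CueSourceIDBox) are taken from ISO/IEC 14496-30 rather than from the text above.

```python
from typing import List, Sequence


def unexpected_source_ids(sample_entry_children: Sequence[str],
                          samples: Sequence[Sequence[Sequence[str]]]) -> List[int]:
    """Return 1-based sample numbers that carry a 'vsid' without a 'vlab'.

    `sample_entry_children` lists the box types inside the sample entry.
    `samples` holds, per sample, one list of child box types per VTTCueBox.
    """
    if "vlab" in sample_entry_children:
        return []  # a source label is present, so source IDs are allowed
    return [n for n, cues in enumerate(samples, start=1)
            if any("vsid" in cue for cue in cues)]
```

For instance, with only a 'vttC' box in the sample entry, a second sample whose cue contains 'vsid' is reported as sample 2.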