% -*- mode: latex; TeX-master: "Vorbis_I_spec"; -*-
%!TEX root = Vorbis_I_spec.tex
\section{Introduction and Description} \label{vorbis:spec:intro}

\subsection{Overview}

This document provides a high-level description of the Vorbis codec's
construction. The bit-by-bit specification begins in
\xref{vorbis:spec:codec}.
The later sections assume a high-level
understanding of the Vorbis decode process, which is
provided here.

\subsubsection{Application}
Vorbis is a general-purpose perceptual audio CODEC intended to allow
maximum encoder flexibility, thus allowing it to scale competitively
over an exceptionally wide range of bitrates. At the high
quality/bitrate end of the scale (CD or DAT rate stereo, 16/24 bits)
it is in the same league as MPEG-2 and MPC. Similarly, the 1.0
encoder can encode high-quality CD and DAT rate stereo at below 48kbps
without resampling to a lower rate. Vorbis is also intended for
lower and higher sample rates (from 8kHz telephony to 192kHz digital
masters) and a range of channel representations (monaural,
polyphonic, stereo, quadraphonic, 5.1, ambisonic, or up to 255
discrete channels).


\subsubsection{Classification}
Vorbis I is a forward-adaptive monolithic transform CODEC based on the
Modified Discrete Cosine Transform. The codec is structured to allow
addition of a hybrid wavelet filterbank in Vorbis II to offer better
transient response and reproduction using a transform better suited to
localized time events.


\subsubsection{Assumptions}

The Vorbis CODEC design assumes a complex, psychoacoustically-aware
encoder and a simple, low-complexity decoder. Vorbis decode is
computationally simpler than MP3, although it does require more
working memory as Vorbis has no static probability model; the vector
codebooks used in the first stage of decoding from the bitstream are
packed in their entirety into the Vorbis bitstream headers. In
packed form, these codebooks occupy only a few kilobytes; the extent
to which they are pre-decoded into a cache is the dominant factor in
decoder memory usage.


Vorbis provides none of its own framing, synchronization or protection
against errors; it is solely a method of accepting input audio,
dividing it into individual frames and compressing these frames into
raw, unformatted `packets'. The decoder then accepts these raw
packets in sequence, decodes them, synthesizes audio frames from
them, and reassembles the frames into a facsimile of the original
audio stream. Vorbis is a free-form variable bit rate (VBR) codec and packets have no
minimum size, maximum size, or fixed/expected size. Packets
are designed so that they may be truncated (or padded) and remain
decodable; this is not to be considered an error condition and is used
extensively in bitrate management in peeling. Both the transport
mechanism and decoder must allow that a packet may be any size, or
end before or after packet decode expects.

Vorbis packets are thus intended to be used with a transport mechanism
that provides free-form framing, sync, positioning and error correction
in accordance with these design assumptions, such as Ogg (for file
transport) or RTP (for network multicast). For purposes of a few
examples in this document, we will assume that Vorbis is to be
embedded in an Ogg stream specifically, although this is by no means a
requirement or fundamental assumption in the Vorbis design.

The specification for embedding Vorbis into
an Ogg transport stream is in \xref{vorbis:over:ogg}.


\subsubsection{Codec Setup and Probability Model}

Vorbis' heritage is as a research CODEC and its current design
reflects a desire to allow multiple decades of continuous encoder
improvement before running out of room within the codec specification.
For these reasons, configurable aspects of codec setup intentionally
lean toward the extreme of forward adaptive.

The single most controversial design decision in Vorbis (and the most
unusual for a Vorbis developer to keep in mind) is that the entire
probability model of the codec, the Huffman and VQ codebooks, is
packed into the bitstream header along with extensive CODEC setup
parameters (often several hundred fields). This makes it impossible,
unlike with the MPEG audio layers, to embed a simple frame type
flag in each audio packet, or to begin decode at any frame in the stream
without having previously fetched the codec setup header.


\begin{note}
Vorbis \emph{can} initiate decode at any arbitrary packet within a
bitstream so long as the codec has been initialized/setup with the
setup headers.
\end{note}

Thus, Vorbis headers are both required for decode to begin and
relatively large as bitstream headers go. The header size is
unbounded, although for streaming a rule-of-thumb of 4kB or less is
recommended (and Xiph.Org's Vorbis encoder follows this suggestion).

Our own design work indicates that the primary liability of the
required header is in mindshare; it is an unusual design and thus
causes some amount of complaint among engineers as it runs against
current design trends (and also points out limitations in some
existing software/interface designs, such as Windows' ACM codec
framework). However, we find that it does not fundamentally limit
Vorbis' suitable application space.


\subsubsection{Format Specification}
The Vorbis format is well-defined by its decode specification; any
encoder that produces packets that are correctly decoded by the
reference Vorbis decoder described below may be considered a proper
Vorbis encoder. A decoder must faithfully and completely implement
the specification defined below (except where noted) to be considered
a proper Vorbis decoder.

\subsubsection{Hardware Profile}
Although Vorbis decode is computationally simple, it may still run
into specific limitations of an embedded design. For this reason,
embedded designs are allowed to deviate in limited ways from the
`full' decode specification yet still be certified compliant. These
optional omissions are labelled in the spec where relevant.


\subsection{Decoder Configuration}

Decoder setup consists of configuration of multiple, self-contained
component abstractions that perform specific functions in the decode
pipeline. Each different component instance of a specific type is
semantically interchangeable; decoder configuration consists both of
internal component configuration and of the arrangement of specific
instances into a decode pipeline. Componentry arrangement is roughly
as follows:

\begin{center}
\includegraphics[width=\textwidth]{components}
\captionof{figure}{decoder pipeline configuration}
\end{center}

\subsubsection{Global Config}
Global codec configuration consists of a few audio-related fields
(sample rate, channels), Vorbis version (always `0' in Vorbis I),
bitrate hints, and the lists of component instances. All other
configuration is in the context of specific components.

\subsubsection{Mode}

Each Vorbis frame is coded according to a master `mode'. A bitstream
may use one or many modes.

The mode mechanism is used to encode a frame according to one of
multiple possible methods with the intention of choosing a method best
suited to that frame. Switching modes is, for example, how the frame size
is changed from frame to frame. The mode number of a frame serves as a
top level configuration switch for all other specific aspects of frame
decode.

A `mode' configuration consists of a frame size setting, window type
(always 0, the Vorbis window, in Vorbis I), transform type (always
type 0, the MDCT, in Vorbis I) and a mapping number. The mapping
number specifies which mapping configuration instance to use for
low-level packet decode and synthesis.

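As a reading aid, the per-mode fields can be summarized in the notation
used by the setup header description later in this specification. This is
a sketch from our reading of that section, not a replacement for it; the
bracketed names and bit widths below should be checked against the
normative text. The frame size setting is a single flag selecting the
short or the long block size declared in the identification header.

\begin{Verbatim}[commandchars=\\\{\}]
  [vorbis\_mode\_blockflag]     = read 1 bit    (0: short block, 1: long block)
  [vorbis\_mode\_windowtype]    = read 16 bits  (must be 0 in Vorbis I)
  [vorbis\_mode\_transformtype] = read 16 bits  (must be 0 in Vorbis I)
  [vorbis\_mode\_mapping]       = read 8 bits   (index into the mapping list)
\end{Verbatim}
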
\subsubsection{Mapping}

A mapping contains a channel coupling description and a list of
`submaps' that bundle sets of channel vectors together for grouped
encoding and decoding. These submaps are not references to external
components; the submap list is internal and specific to a mapping.

A `submap' is a configuration/grouping that applies to a subset of
floor and residue vectors within a mapping. The submap functions as a
last layer of indirection such that specific special floor or residue
settings can be applied not only to all the vectors in a given mode,
but also specific vectors in a specific mode. Each submap specifies
the proper floor and residue instance number to use for decoding that
submap's spectral floor and spectral residue vectors.

As an example:

Assume a Vorbis stream that contains six channels in the standard 5.1
format. The sixth channel, as is normal in 5.1, is bass only.
Therefore it would be wasteful to encode a full-spectrum version of it
as with the other channels. The submapping mechanism can be used to
apply a full range floor and residue encoding to channels 0 through 4,
and a bass-only representation to the bass channel, thus saving space.
In this example, channels 0-4 belong to submap 0 (which indicates use
of a full-range floor) and channel 5 belongs to submap 1, which uses a
bass-only representation.

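Written out as data, the 5.1 example amounts to one submap number per
channel plus one floor/residue pair per submap. The layout below is only
an illustration; in particular the floor and residue instance numbers are
hypothetical and not taken from any real stream.

\begin{Verbatim}[commandchars=\\\{\}]
  channel:         0  1  2  3  4  5
  submap number:   0  0  0  0  0  1

  submap 0 -> floor instance 0, residue instance 0   (full range)
  submap 1 -> floor instance 1, residue instance 1   (bass only)
\end{Verbatim}
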
\subsubsection{Floor}

Vorbis encodes a spectral `floor' vector for each PCM channel. This
vector is a low-resolution representation of the audio spectrum for
the given channel in the current frame, generally used akin to a
whitening filter. It is named a `floor' because the Xiph.Org
reference encoder has historically used it as a unit-baseline for
spectral resolution.

A floor encoding may be one of two types. Floor 0 uses a packed LSP
representation on a dB amplitude scale and Bark frequency scale.
Floor 1 represents the curve as a piecewise linear interpolated
representation on a dB amplitude scale and linear frequency scale.
The two floors are semantically interchangeable in
encoding/decoding. However, floor type 1 provides more stable
inter-frame behavior, and so is the preferred choice in all
coupled-stereo and high bitrate modes. Floor 1 is also considerably
less expensive to decode than floor 0.

Floor 0 is not to be considered deprecated, but it is of limited
modern use. No known Vorbis encoder past Xiph.Org's own beta 4 makes
use of floor 0.

The values coded/decoded by a floor are both compactly formatted and
make use of entropy coding to save space. For this reason, a floor
configuration generally refers to multiple codebooks in the codebook
component list. Entropy coding is thus provided as an abstraction,
and each floor instance may choose from any and all available
codebooks when coding/decoding.


\subsubsection{Residue}
The spectral residue is the fine structure of the audio spectrum
once the floor curve has been subtracted out. In simplest terms, it
is coded in the bitstream using cascaded (multi-pass) vector
quantization according to one of three specific packing/coding
algorithms numbered 0 through 2. The packing algorithm details are
configured by residue instance. As with the floor components, the
final VQ/entropy encoding is provided by external codebook instances
and each residue instance may choose from any and all available
codebooks.

\subsubsection{Codebooks}

Codebooks are a self-contained abstraction that performs entropy
decoding and, optionally, uses the entropy-decoded integer value as an
offset into an index of output value vectors, returning the indicated
vector of values.

The entropy coding in a Vorbis I codebook is provided by a standard
Huffman binary tree representation. This tree is tightly packed using
one of several methods, depending on whether codeword lengths are
ordered or unordered, or the tree is sparse.

The codebook vector index is similarly packed according to index
characteristic. Most commonly, the vector index is encoded as a
single list of possible values that are then permuted into
a list of n-dimensional rows (lattice VQ).


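As a sketch of that permutation (ignoring the value scaling and optional
running sum that the codebook section defines), suppose the list holds $V$
scalar values $m_0 \ldots m_{V-1}$ and the codebook has dimension $d$.
Entry number $k$ then unpacks as the base-$V$ digits of $k$:
\[ v_i = m_{\lfloor k / V^i \rfloor \bmod V}, \qquad i = 0 \ldots d-1 . \]
For example, with $V=3$, $(m_0, m_1, m_2) = (-1, 0, 1)$, $d=2$ and $k=5$,
we get $v_0 = m_{5 \bmod 3} = m_2 = 1$ and
$v_1 = m_{\lfloor 5/3 \rfloor \bmod 3} = m_1 = 0$, so the entry decodes to
the row $(1, 0)$. The symbols $V$, $m$, $d$, $k$ and $v$ are introduced
here for illustration only.
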
\subsection{High-level Decode Process}

\subsubsection{Decode Setup}

Before decoding can begin, a decoder must initialize using the
bitstream headers matching the stream to be decoded. Vorbis uses
three header packets; all are required, in-order, by this
specification. Once set up, decode may begin at any audio packet
belonging to the Vorbis stream. In Vorbis I, all packets after the
three initial headers are audio packets.

The header packets are, in order, the identification
header, the comment header, and the setup header.

\paragraph{Identification Header}
The identification header identifies the bitstream as Vorbis and gives
the Vorbis version and the simple audio characteristics of the stream, such as
sample rate and number of channels.

\paragraph{Comment Header}
The comment header includes user text comments (``tags'') and a vendor
string for the application/library that produced the bitstream. The
encoding and proper use of the comment header are described in \xref{vorbis:spec:comment}.

\paragraph{Setup Header}
The setup header includes extensive CODEC setup information as well as
the complete VQ and Huffman codebooks needed for decode.


\subsubsection{Decode Procedure}

The decoding and synthesis procedure for all audio packets is
fundamentally the same.
\begin{enumerate}
\item decode packet type flag
\item decode mode number
\item decode window shape (long windows only)
\item decode floor
\item decode residue into residue vectors
\item inverse channel coupling of residue vectors
\item generate floor curve from decoded floor data
\item compute dot product of floor and residue, producing audio spectrum vector
\item inverse monolithic transform of audio spectrum vector, always an MDCT in Vorbis I
\item overlap/add left-hand output of transform with right-hand output of previous frame
\item store right-hand data from transform of current frame for future lapping
\item if not first frame, return results of overlap/add as audio result of current frame
\end{enumerate}

Note that clever rearrangement of the synthesis arithmetic is
possible; as an example, one can take advantage of symmetries in the
MDCT to store the right-hand transform data of a partial MDCT for a
50\% inter-frame buffer space savings, and then complete the transform
later before overlap/add with the next frame. This optimization
produces entirely equivalent output and is naturally perfectly legal.
The decoder must be \emph{entirely mathematically equivalent} to the
specification; it need not be a literal semantic implementation.

\paragraph{Packet type decode}

Vorbis I uses four packet types. The first three packet types mark each
of the three Vorbis headers described above. The fourth packet type
marks an audio packet. All other packet types are reserved; packets
marked with a reserved type should be ignored.

Following the three header packets, all packets in a Vorbis I stream
are audio. The first step of audio packet decode is to read and
verify the packet type; \emph{a non-audio packet when audio is expected
indicates stream corruption or a non-compliant stream. The decoder
must ignore the packet and not attempt decoding it to
audio}.

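For orientation, the type values assigned in the header and audio packet
sections of this specification are, to the best of our reading, as
follows; the table is a summary only and the normative text governs.

\begin{center}
\begin{tabular}{ll}
packet & type value \\ \hline
identification header & 1 \\
comment header & 3 \\
setup header & 5 \\
audio packet & 0 \\
\end{tabular}
\end{center}

All header types are odd and the audio type is even, so a decoder can
separate headers from audio by testing the low bit of the type field.
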
\paragraph{Mode decode}
Vorbis allows an encoder to set up multiple, numbered packet `modes',
as described earlier, all of which may be used in a given Vorbis
stream. The mode is encoded as an integer used as a direct offset into
the mode instance index.

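The width of the mode field follows from the number of modes declared in
the setup header. As a sketch, assuming the \texttt{ilog} helper and the
rule from the audio packet decode section (read
$\mathrm{ilog}(\mathit{mode\_count}-1)$ bits), where $\mathit{mode\_count}$
is our shorthand for the number of configured modes:
\[
\mathrm{ilog}(x) = \left\{ \begin{array}{ll}
  0 & x \le 0 \\
  \lfloor \log_2 x \rfloor + 1 & x > 0
\end{array} \right.
\]
A stream with a single mode therefore spends $\mathrm{ilog}(0)=0$ bits on
the mode number, a stream with two modes spends $\mathrm{ilog}(1)=1$ bit,
and a stream with three or four modes spends
$\mathrm{ilog}(2)=\mathrm{ilog}(3)=2$ bits.
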
\paragraph{Window shape decode (long windows only)} \label{vorbis:spec:window}

Vorbis frames may be one of two PCM sample sizes specified during
codec setup. In Vorbis I, legal frame sizes are powers of two from 64
to 8192 samples. Aside from coupling, Vorbis handles channels as
independent vectors and these frame sizes are in samples per channel.

Vorbis uses an overlapping transform, namely the MDCT, to blend one
frame into the next, avoiding most inter-frame block boundary
artifacts. The MDCT output of one frame is windowed according to MDCT
requirements, overlapped 50\% with the output of the previous frame and
added. The window shape ensures seamless reconstruction.

This is easy to visualize in the case of equal-sized windows:

\begin{center}
\includegraphics[width=\textwidth]{window1}
\captionof{figure}{overlap of two equal-sized windows}
\end{center}

And slightly more complex in the case of overlapping unequal-sized
windows:

\begin{center}
\includegraphics[width=\textwidth]{window2}
\captionof{figure}{overlap of a long and a short window}
\end{center}

In the unequal-sized window case, the window shape of the long window
must be modified for seamless lapping as above. It is possible to
correctly infer the window shape to be applied to the current window from
knowing the sizes of the current, previous and next window. It is
legal for a decoder to use this method. However, in the case of a long
window (short windows require no modification), Vorbis also codes two
flag bits to specify pre- and post-window shape. Although not
strictly necessary for function, this minor redundancy allows a packet
to be fully decoded to the point of lapping entirely independently of
any other packet, allowing easier abstraction of decode layers as well
as allowing a greater level of easy parallelism in encode and
decode.

A description of valid window functions for use with an inverse MDCT
can be found in \cite{Sporer/Brandenburg/Edler}. Vorbis windows
all use the slope function
\[ y = \sin\left(\frac{\pi}{2} \, \sin^2\left(\frac{(x+0.5)\,\pi}{n}\right)\right) . \]


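A short check shows why this slope gives seamless reconstruction for
equal-sized windows. The sketch assumes, as is usual for MDCT codecs,
that the same window is applied once before the forward transform and
once after the inverse transform, so that the squared window is what must
sum to one across the overlap. Writing $\theta = (x+0.5)\pi/n$ (our
shorthand), the sample that overlaps position $x$ lies $n/2$ samples
further along the window, where the angle is $\theta + \pi/2$:
\[
y(x)^2 + y(x+n/2)^2
 = \sin^2\left(\frac{\pi}{2}\sin^2\theta\right)
 + \sin^2\left(\frac{\pi}{2}\cos^2\theta\right)
 = \sin^2\left(\frac{\pi}{2}\sin^2\theta\right)
 + \cos^2\left(\frac{\pi}{2}\sin^2\theta\right)
 = 1 ,
\]
using $\cos^2\theta = 1 - \sin^2\theta$ and
$\sin(\pi/2 - a) = \cos a$.
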
\paragraph{floor decode}
Each floor is encoded/decoded in channel order; however, each floor
belongs to a `submap' that specifies which floor configuration to
use. All floors are decoded before residue decode begins.


\paragraph{residue decode}

Although the number of residue vectors equals the number of channels,
channel coupling may mean that the raw residue vectors extracted
during decode do not map directly to specific channels. When channel
coupling is in use, some vectors will correspond to coupled magnitude
or angle. The coupling relationships are described in the codec setup
and may differ from frame to frame, due to different mode numbers.

Vorbis codes residue vectors in groups by submap; the coding is done
in submap order from submap 0 through n-1. This differs from floors,
which are coded using a configuration provided by submap number, but
are coded individually in channel order.

\paragraph{inverse channel coupling}

A detailed discussion of stereo in the Vorbis codec can be found in
the document \href{stereo.html}{Stereo Channel Coupling in the
Vorbis CODEC}. Vorbis is not limited to only stereo coupling, but
the stereo document also gives a good overview of the generic coupling
mechanism.

Vorbis coupling applies to pairs of residue vectors at a time;
decoupling is done in-place a pair at a time in the order and using
the vectors specified in the current mapping configuration. The
decoupling operation is the same for all pairs, converting square
polar representation (where one vector is magnitude and the second
angle) back to Cartesian representation.

After decoupling, in order, each pair of vectors on the coupling list,
the resulting residue vectors represent the fine spectral detail
of each output channel.

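For a single magnitude/angle pair of scalars, the conversion can be
sketched as follows. This follows our reading of the decoupling steps in
the audio packet decode section, which remains normative; [M] and [A] are
the corresponding elements of the magnitude and angle vectors, and
[new\_M] and [new\_A] replace them in place.

\begin{Verbatim}[commandchars=\\\{\}]
  if ( [M] > 0 )
    if ( [A] > 0 )   [new\_M] = [M];  [new\_A] = [M] - [A]
    else             [new\_A] = [M];  [new\_M] = [M] + [A]
  else
    if ( [A] > 0 )   [new\_M] = [M];  [new\_A] = [M] + [A]
    else             [new\_A] = [M];  [new\_M] = [M] - [A]
\end{Verbatim}

For example, $M=5$, $A=2$ yields the pair $(5, 3)$, while $M=5$, $A=-2$
yields $(3, 5)$.
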
\paragraph{generate floor curve}

The decoder may choose to generate the floor curve at any appropriate
time. It is reasonable to generate the output curve when the floor
data is decoded from the raw packet, or it can be generated after
inverse coupling and applied to the spectral residue directly,
combining generation and the dot product into one step and eliminating
some working space.

Both floor 0 and floor 1 generate a linear-range, linear-domain output
vector to be multiplied (dot product) by the linear-range,
linear-domain spectral residue.

\paragraph{compute floor/residue dot product}

This step is straightforward; for each output channel, the decoder
multiplies the floor curve and residue vectors element by element,
producing the finished audio spectrum of each channel.

% TODO/FIXME: The following two paragraphs have identical twins
% in section 4 (under "dot product")
One point is worth mentioning about this dot product; a common mistake
in a fixed point implementation might be to assume that a 32-bit
fixed-point representation for floor and residue and direct
multiplication of the vectors is sufficient for acceptable spectral
depth in all cases because it happens to mostly work with the current
Xiph.Org reference encoder.

However, floor vector values can span \~{}140dB (\~{}24 bits unsigned), and
the audio spectrum vector should represent a minimum of 120dB (\~{}21
bits with sign), even when output is to a 16-bit PCM device. For the
residue vector to represent full scale if the floor is nailed to
$-140$dB, it must be able to span 0 to $+140$dB. For the residue vector
to reach full scale if the floor is nailed at 0dB, it must be able to
represent $-140$dB to $+0$dB. Thus, in order to handle full range
dynamics, a residue vector may span $-140$dB to $+140$dB entirely within
spec. A 280dB range is approximately 48 bits with sign; thus the
residue vector must be able to represent a 48-bit range and the dot
product must be able to handle an effective 48-bit times 24-bit
multiplication. This range may be achieved using large (64-bit or
larger) integers, or by implementing a movable binary point
representation.

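Two small sketches may help here; the symbols are ours. First, despite
the name, the `dot product' yields a vector rather than a scalar. For a
frame of block size $n$, with $F_c$ the floor curve and $R_c$ the
decoupled residue of channel $c$, the spectrum $S_c$ is
\[ S_c[i] = F_c[i] \cdot R_c[i], \qquad i = 0 \ldots n/2 - 1 . \]
Second, the bit widths quoted above follow from one bit of amplitude
resolution being worth $20\log_{10}2 \approx 6.02$\,dB:
\[
\frac{140}{6.02} \approx 23.3, \qquad
\frac{120}{6.02} \approx 19.9, \qquad
\frac{280}{6.02} \approx 46.5 .
\]
Rounding up gives 24 bits for the floor, 20 bits plus sign (21) for the
audio spectrum, and 47 bits plus sign (48) for the residue.
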
\paragraph{inverse monolithic transform (MDCT)}

The audio spectrum is converted back into time domain PCM audio via an
inverse Modified Discrete Cosine Transform (MDCT). A detailed
description of the MDCT is available in \cite{Sporer/Brandenburg/Edler}.

Note that the PCM produced directly from the MDCT is not yet finished
audio; it must be lapped with surrounding frames using an appropriate
window (such as the Vorbis window) before the MDCT can be considered
orthogonal.

\paragraph{overlap/add data}
Windowed MDCT output is overlapped and added with the right-hand data
of the previous window such that the 3/4 point of the previous window
is aligned with the 1/4 point of the current window (as illustrated in
the window overlap diagram). At this point, the audio data between the
center of the previous frame and the center of the current frame is
finished and ready to be returned.

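For two equal-sized windows of block size $n$, the lapping step can be
written out as follows (a sketch; $\hat{p}$ and $\hat{c}$ are our names
for the windowed inverse-transform output of the previous and current
frame, each indexed $0 \ldots n-1$):
\[ o[i] = \hat{p}[n/2 + i] + \hat{c}[i], \qquad i = 0 \ldots n/2 - 1 . \]
The $n/2$ samples $o[i]$ are the finished audio, and
$\hat{c}[n/2] \ldots \hat{c}[n-1]$ is the right-hand data kept for the
next frame.
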
\paragraph{cache right hand data}
The decoder must cache the right-hand portion of the current frame to
be lapped with the left-hand portion of the next frame.

\paragraph{return finished audio data}

The overlapped portion produced from overlapping the previous and
current frame data is finished data to be returned by the decoder.
This data spans from the center of the previous window to the center
of the current window. In the case of same-sized windows, the amount
of data to return is one-half block consisting of and only of the
overlapped portions. When overlapping a short and long window, much of
the returned range is not actually overlap. This does not damage
transform orthogonality. Pay attention, however, to returning the
correct data range; the amount of data to be returned is:

\begin{Verbatim}[commandchars=\\\{\}]
window\_blocksize(previous\_window)/4+window\_blocksize(current\_window)/4
\end{Verbatim}

from the center of the previous window to the center of the current
window.

Data is not returned from the first frame; it must be used to `prime'
the decode engine. The encoder accounts for this priming when
calculating PCM offsets; after the first frame, the proper PCM output
offset is `0' (as no data has been returned yet).

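As a worked instance of the return-size formula above, take a stream whose
short and long block sizes are 256 and 2048 samples (values chosen only
for illustration):
\[
\begin{array}{lll}
\mbox{previous} & \mbox{current} & \mbox{samples returned} \\ \hline
2048 & 2048 & 2048/4 + 2048/4 = 1024 \\
256 & 2048 & 256/4 + 2048/4 = 576 \\
256 & 256 & 256/4 + 256/4 = 128
\end{array}
\]
In the equal-sized rows the result is exactly half a block, as stated
above; in the mixed row it is neither half of the short block nor half of
the long one.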