Data track handling in AccurateRip disc ID calculation
Data tracks present in enhanced and mixed mode CDs require special treatment during calculation of AccurateRip disc IDs. While mixed mode CDs are not popular anymore, enhanced CDs are still widespread.
To figure out how to deal with this problem I ripped several CDs of different
types using various AccurateRip-aware programs, and compared the results I got.
I consider dBpoweramp to be the reference implementation: it was created by the same entity that operates the AccurateRip database.
AccurateRip database lookup is based on a disc fingerprint of the following format: nnn-aaaaaaaa-bbbbbbbb-ffffffff. The fields are:
- nnn: number of tracks on CD (decimal),
- aaaaaaaa and bbbbbbbb: two types of AccurateRip disc IDs (hex),
- ffffffff: FreeDB disc ID (hex).
All fields are padded with zeros if they would be shorter than 3 or 8 digits. The three types of disc IDs are calculated based on the CD table of contents (TOC).
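The padding and field order can be illustrated with a short sketch. `format_fingerprint` is a hypothetical helper written for this example, not a function of ARver or of any AccurateRip tool; it assumes the three disc IDs are already available as integers. The sample values are the ones dBpoweramp and EAC report for the enhanced CD discussed in the next section.

```python
def format_fingerprint(num_tracks, ar_id1, ar_id2, freedb_id):
    """Build an nnn-aaaaaaaa-bbbbbbbb-ffffffff fingerprint string.

    Hypothetical helper for illustration only: the track count is
    zero-padded to 3 decimal digits, each ID to 8 hex digits.
    """
    return f"{num_tracks:03d}-{ar_id1:08x}-{ar_id2:08x}-{freedb_id:08x}"


print(format_fingerprint(10, 0x00164419, 0x00B9F6E2, 0x9E11600B))
# prints 010-00164419-00b9f6e2-9e11600b
```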
Enhanced CD (Blue Book)
Let's use "Liquid" by Recoil as an example. It has 10 audio tracks followed by one data track.
Calculated full AccurateRip disc IDs are:
- 010-00164419-00b9f6e2-9e11600b (calculated by dBpoweramp and EAC - correct)
- 010-00154c2f-00af4fd4-890e120a (calculated by Whipper and ARver prior to v0.5.0 - wrong!)
Only the number of tracks (010) is the same; all three disc IDs are different.
dBpoweramp and EAC verify the checksums of my ripped files correctly, but Whipper and old ARver versions do not (I get no matching checksums). dBpoweramp is developed by the same entity that operates the AccurateRip database, so its results can be considered authoritative. It is interesting that Whipper and old ARver versions get any AccurateRip results at all, instead of a 404 error.
Disc data
MusicBrainz provides the following information about the disc:
- MB Disc ID: 9AQWqQ0eCCwktwPPrIUIYkUw2Uo-
- FreeDB disc ID: 890e120a
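track | start time | start sector | length time | length sectors | end time | end sector |
---|---|---|---|---|---|---|
1 | 0:02 | 150 | 2:48 | 12617 | 2:50 | 12767 |
2 | 2:50 | 12767 | 6:10 | 27720 | 9:00 | 40487 |
3 | 9:00 | 40487 | 5:03 | 22738 | 14:03 | 63225 |
4 | 14:03 | 63225 | 6:42 | 30185 | 20:45 | 93410 |
5 | 20:45 | 93410 | 5:29 | 24705 | 26:15 | 118115 |
6 | 26:15 | 118115 | 7:30 | 33750 | 33:45 | 151865 |
7 | 33:45 | 151865 | 7:13 | 32475 | 40:58 | 184340 |
8 | 40:58 | 184340 | 6:52 | 30920 | 47:50 | 215260 |
9 | 47:50 | 215260 | 7:09 | 32195 | 54:59 | 247455 |
10 | 54:59 | 247455 | 5:05 | 22880 | 1:00:04 | 270335 |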
discid provides equivalent data from reading a physical CD:
{
"disc": {
"id": "9AQWqQ0eCCwktwPPrIUIYkUw2Uo-",
"sectors": "270335",
"offset-list": [
150,
12767,
40487,
63225,
93410,
118115,
151865,
184340,
215260,
247455
],
"offset-count": 10
}
}
There is no information about a data track in MusicBrainz and discid output. That information can be extracted using pycdio:
$ ./tracks.py
CD-ROM /dev/sr0 has 11 track(s) and 281585 session(s).
Track format is CD-DA.
Media Catalog Number: 0000000000000
#: LSN Format
1: 000000 00:02:00 audio
2: 012617 02:50:17 audio
3: 040337 08:59:62 audio
4: 063075 14:03:00 audio
5: 093260 20:45:35 audio
6: 117965 26:14:65 audio
7: 151715 33:44:65 audio
8: 184190 40:57:65 audio
9: 215110 47:50:10 audio
10: 247305 54:59:30 audio
11: 281585 62:36:35 data
AA: 333651 leadout
(tracks.py is a sample program distributed with pycdio. Note that LSN rather than LBA is shown in the output, so add 150 sectors to convert from LSN to LBA.)
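The conversion is simple enough to sketch. The helper names below are made up for this example; the only facts used are the 150-sector offset mentioned above and the 75 sectors per second of CD audio.

```python
SECTORS_PER_SECOND = 75  # one second of CD audio is 75 sectors
LSN_TO_LBA_OFFSET = 150  # offset used throughout this document


def lsn_to_lba(lsn):
    """Convert an LSN reported by pycdio to the LBA used in this document."""
    return lsn + LSN_TO_LBA_OFFSET


def lba_to_msf(lba):
    """Convert an LBA to a (minutes, seconds, sectors) tuple."""
    seconds, sectors = divmod(lba, SECTORS_PER_SECOND)
    minutes, seconds = divmod(seconds, 60)
    return minutes, seconds, sectors


# Track 11 (the data track) from the tracks.py output above:
print(lsn_to_lba(281585))              # 281735
print(lba_to_msf(lsn_to_lba(281585)))  # (62, 36, 35), shown as 62:36:35
```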
Analysis
Let's examine FreeDB disc IDs first, as they are easiest to understand. The format of FreeDB disc ID is XXSSSSTT where XX is a checksum, SSSS is the total number of seconds on the disc, and TT is the number of tracks on CD, all in hexadecimal. Let's break these two FreeDB disc IDs down:
XX SSSS TT
MB: 89 0e12 0a (0a = 10 tracks)
EAC: 9e 1160 0b (0b = 11 tracks!)
The total CD length is 1h 00m 04s = 3604 s (3602 s excluding lead in).
- MB length (sec): 0e12 hex = 3602 dec
- EAC length (sec): 1160 hex = 4448 dec
delta = 4448 s - 3602 s = 846 s
Where does this 846 seconds difference come from? The data track is offset from the end of last audio track by the usual 11400 sectors (152 seconds, or 2:32.00). The data track is 333651 - 281585 = 52066 sectors long (694 seconds and 16 sectors, or 11:34.16).
152 s + 694 s = 846 s
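The arithmetic above can be replayed in a few lines of Python. `split_freedb_id` is a hypothetical helper that only splits the XXSSSSTT string into its fields; all numbers are taken from the disc data shown earlier.

```python
def split_freedb_id(disc_id):
    """Split an XXSSSSTT FreeDB disc ID into (checksum, seconds, tracks)."""
    value = int(disc_id, 16)
    return value >> 24, (value >> 8) & 0xFFFF, value & 0xFF


_, mb_seconds, mb_tracks = split_freedb_id("890e120a")
_, eac_seconds, eac_tracks = split_freedb_id("9e11600b")

gap_sectors = 11400                 # gap between last audio track and data track
data_sectors = 333651 - 281585      # lead out LSN minus data track LSN

print(mb_seconds, mb_tracks)        # 3602 10
print(eac_seconds, eac_tracks)      # 4448 11
print(eac_seconds - mb_seconds)     # 846
print(gap_sectors // 75 + data_sectors // 75)  # 152 + 694 = 846
```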
So it appears one must know the length of data track to calculate AccurateRip disc ID of an enhanced CD. discid doesn't provide this, and MusicBrainz disc ID does not encode this information. This means it is not possible to calculate AccurateRip disc ID just from MusicBrainz disc ID. Reading a physical CD with pycdio is necessary. :(
Examples
The following examples demonstrate that it is possible to calculate correct disc IDs by using correct input data. No changes are necessary in functions which perform the calculations. freedb_id() and accuraterip_ids() are internal ARver functions.
lba_offsets are LBA offsets of audio tracks. sectors is the LBA of the first sector following the last audio track. In a regular CDDA disc this is equivalent to the lead out offset. But this is not the lead out offset when another session follows audio tracks!
lba_offsets and sectors are acquired by reading a physical CD using discid, or from MusicBrainz based on disc ID.
true_leadout is the LBA offset of track 0xAA (170). This offset, along with the information about existence of data tracks, can be acquired by reading a physical CD using pycdio.
The numbers in examples follow the data from "Disc data" section above, with LSN offsets reported by pycdio converted to LBA (LBA = LSN + 150).
FreeDB disc ID
This is what the calculation looks like if we are not aware of the data track:
>>> from arver.disc.fingerprint import freedb_id
>>> lba_offsets = [150, 12767, 40487, 63225, 93410, 118115, 151865, 184340, 215260, 247455]
>>> sectors = 270335
>>> freedb_id(lba_offsets, sectors)
'890e120a' # same as MusicBrainz or Whipper - NOK
We can obtain the correct result only if we know that there is a data track and we know its offset (taking into account the usual 11400 sectors gap after the last audio track). We must also know the true lead out offset:
>>> from arver.disc.fingerprint import freedb_id
>>> lba_offsets = [150, 12767, 40487, 63225, 93410, 118115, 151865, 184340, 215260, 247455, 281735]
>>> true_leadout = 333801
>>> freedb_id(lba_offsets, true_leadout)
'9e11600b' # same as EAC - OK
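For readers without ARver at hand, here is a minimal stand-alone sketch of the FreeDB calculation. It is not ARver's freedb_id() implementation; it follows the commonly described FreeDB/CDDB scheme (sum of decimal digits of each track start in seconds, modulo 255) and is named `freedb_id_sketch` to keep the two apart. It reproduces both results shown above.

```python
def freedb_id_sketch(lba_offsets, leadout):
    """Calculate an XXSSSSTT FreeDB disc ID from LBA offsets.

    Sketch of the commonly described algorithm, not ARver's own code.
    """
    def digit_sum(number):
        return sum(int(digit) for digit in str(number))

    # XX: digit sums of track start times (whole seconds), modulo 255
    checksum = sum(digit_sum(offset // 75) for offset in lba_offsets) % 255
    # SSSS: seconds between the first track and the lead out
    total_seconds = leadout // 75 - lba_offsets[0] // 75
    # TT: number of tracks
    return f"{checksum:02x}{total_seconds:04x}{len(lba_offsets):02x}"


audio = [150, 12767, 40487, 63225, 93410, 118115, 151865, 184340, 215260, 247455]
print(freedb_id_sketch(audio, 270335))             # 890e120a
print(freedb_id_sketch(audio + [281735], 333801))  # 9e11600b
```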
AccurateRip disc IDs
AccurateRip ID calculation requires knowing the true lead out offset too:
>>> from arver.disc.fingerprint import accuraterip_ids
>>> lba_offsets = [150, 12767, 40487, 63225, 93410, 118115, 151865, 184340, 215260, 247455]
>>> sectors = 270335
>>> accuraterip_ids(lba_offsets, sectors)
('00154c2f', '00af4fd4') # same as Whipper - NOK
Specifying the true lead out offset corrects the calculated AccurateRip disc IDs of an enhanced CD. No manipulation of offset list is necessary:
>>> from arver.disc.fingerprint import accuraterip_ids
>>> lba_offsets = [150, 12767, 40487, 63225, 93410, 118115, 151865, 184340, 215260, 247455]
>>> true_leadout = 333801
>>> accuraterip_ids(lba_offsets, true_leadout)
('00164419', '00b9f6e2') # same as EAC - OK
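The same can be done for AccurateRip disc IDs. The sketch below is not ARver's accuraterip_ids() implementation; it is my reading of the commonly described scheme, which works on LSN values (LBA minus 150): the first ID is a plain sum of track offsets and lead out, the second weights each offset by its track number. It reproduces the values shown above for the enhanced CD.

```python
def accuraterip_ids_sketch(lba_offsets, leadout):
    """Calculate both AccurateRip disc IDs from LBA offsets.

    Sketch of the commonly described algorithm, not ARver's own code.
    """
    lsn_offsets = [offset - 150 for offset in lba_offsets]
    lsn_leadout = leadout - 150

    # First ID: sum of all track offsets plus the lead out offset
    id1 = sum(lsn_offsets) + lsn_leadout

    # Second ID: offsets weighted by track number; a zero offset counts as 1
    id2 = sum(max(lsn, 1) * number
              for number, lsn in enumerate(lsn_offsets, start=1))
    id2 += lsn_leadout * (len(lsn_offsets) + 1)

    # Both IDs are 32-bit values printed as 8 hex digits
    return f"{id1 & 0xFFFFFFFF:08x}", f"{id2 & 0xFFFFFFFF:08x}"


audio = [150, 12767, 40487, 63225, 93410, 118115, 151865, 184340, 215260, 247455]
print(accuraterip_ids_sketch(audio, 270335))  # ('00154c2f', '00af4fd4')
print(accuraterip_ids_sketch(audio, 333801))  # ('00164419', '00b9f6e2')
```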
Mixed mode CD (Yellow Book)
Let's use "Mortal Kombat Trilogy" game CD as an example. It contains 29 tracks: the first one is game data, the rest are audio tracks.
The AccurateRip disc ID calculated by EAC is 028-00517a54-05a845d2-af0da31d. The initial number in the disc ID appears to be the number of audio tracks, not the total number of tracks.
Calculation of FreeDB disc ID requires offsets of all tracks, regardless of their type. discid is able to provide them for mixed mode CDs, because the data track and audio tracks share the same CD session. Lead out offset is correct too, because there is just one session and the last audio track is not followed by a data track. This means the FreeDB disc ID can be calculated based on information provided by discid alone:
>>> from arver.disc.fingerprint import freedb_id
>>> offsets = [150, 66728, 76502, 85963, 93760, 104048, 116066, 124743, 134206, 142916, 151852, 160720,
... 169425, 178731, 188459, 196711, 206354, 214845, 223686, 225746, 226384, 232668, 239447, 239931,
... 244989, 254264, 260066, 261222, 261674]
>>> leadout = 261976
>>> freedb_id(offsets, leadout)
'af0da31d' # same as EAC - OK
Calculation of AccurateRip disc IDs is the tricky part. Using the same data as above results in one fingerprint being incorrect:
>>> from arver.disc.fingerprint import accuraterip_ids
>>> accuraterip_ids(offsets, leadout)
('00517a54', '05f9c027') # 00517a54 is OK, 05f9c027 is not
To calculate them correctly the data track offset must be omitted:
>>> accuraterip_ids(offsets[1:], leadout)
('00517a54', '05a845d2') # both are correct now
This demonstrates that handling mixed mode CDs requires information discid cannot provide: one needs to know which tracks are data tracks to omit their offsets from the list. Track type information can be acquired with pycdio package.
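Once track types are known, dropping data track offsets is a one-line filter. The sketch below assumes a hypothetical list of (LBA offset, is_audio) pairs built from whatever the CD reading library reports; it does not use the pycdio API itself.

```python
def audio_offsets(tracks):
    """Return LBA offsets of audio tracks only.

    `tracks` is a hypothetical list of (lba_offset, is_audio) pairs.
    """
    return [offset for offset, is_audio in tracks if is_audio]


# First three TOC entries of the mixed mode CD: data track, then audio tracks
tracks = [(150, False), (66728, True), (76502, True)]
print(audio_offsets(tracks))  # [66728, 76502]
```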
Sample AccurateRip response for a mixed mode CD
As indicated above, "Mortal Kombat Trilogy" CD contains 28 audio tracks. The AccurateRip response for that CD is as follows:
disc ID: 028-00517a54-05a845d2-af0da31d
track 1: 00000000 (confidence: 0)
track 2: f7372315 (confidence: 1)
track 3: 953b5f00 (confidence: 1)
track 4: 7cc9a20d (confidence: 1)
track 5: 83f58407 (confidence: 1)
track 6: 33754ce7 (confidence: 1)
track 7: efe4fc28 (confidence: 1)
track 8: bd1fe2a7 (confidence: 1)
track 9: a1ff10c6 (confidence: 1)
track 10: a84d3b16 (confidence: 1)
track 11: 00a31f62 (confidence: 1)
track 12: 4c856e88 (confidence: 1)
track 13: ec88131e (confidence: 1)
track 14: 7e72262f (confidence: 1)
track 15: 5c5b588f (confidence: 1)
track 16: 62a24ed2 (confidence: 1)
track 17: bc6ef90f (confidence: 1)
track 18: 026de5e6 (confidence: 1)
track 19: 982fddd8 (confidence: 1)
track 20: 2a2734ed (confidence: 1)
track 21: 4dabe6e0 (confidence: 1)
track 22: 0083e0c8 (confidence: 1)
track 23: 00000000 (confidence: 0)
track 24: 060d3d73 (confidence: 1)
track 25: 75c6048b (confidence: 1)
track 26: dd6ee61a (confidence: 1)
track 27: e5ecb215 (confidence: 1)
track 28: 00000000 (confidence: 0)
The first track has checksum 00000000 with zero confidence. This corresponds to CD track 1, i.e. to the data track. Since AccurateRip response contains only 28 tracks, it means the checksum of one audio track is missing: there is no checksum of the final audio track (track 29). This seems to be a limitation of AccurateRip database and there is nothing arver can do to work around it.
Note: tracks 23 and 28 have zero checksums with zero confidence as well, but this is because they are too short. arver handles these tracks correctly.
Mixed mode CD verification in ARver
The table below presents differences between track numbers in the CD table of contents (TOC) and in the set of audio files extracted from a CD (a rip). The data track is not extracted, so there is no file corresponding to CD TOC track 1 in the rip. This is why file names start from track02.
TOC # | rip # | type | file name |
---|---|---|---|
1 / 29 | -- | data | -- |
2 / 29 | 1 / 28 | audio | track02.cdda.wav |
3 / 29 | 2 / 28 | audio | track03.cdda.wav |
4 / 29 | 3 / 28 | audio | track04.cdda.wav |
... | ... | ... | ... |
29 / 29 | 28 / 28 | audio | track29.cdda.wav |
It appears that the checksum of the first audio track (i.e. CD TOC track 2) in AccurateRip response is calculated as if samples were omitted at the beginning. This is fortunate: it means that checksum calculation can be based on the rip index, just like in other supported disc types. Basing the calculation on the rip index omits some samples at the beginning of the first track and at the end of the last track. Whether the latter is actually required for mixed mode CDs is not clear (and irrelevant, because the response contains no checksum for the last audio track anyway).
When comparing AccurateRip checksums of local files with database response, it is useful to think in terms of two different indexes:
- the rip index used for calculating AccurateRip checksums,
- the CD TOC index used for lookups in the database response.
These indexes are equivalent when verifying an audio or enhanced CD, but their values become different when verifying a mixed mode CD: the value of the rip index will be equal to CD TOC index minus one.
Putting all of this together, to verify audio tracks ripped from a mixed mode CD arver must:
- iterate over CD TOC index starting from track 2,
- calculate AccurateRip checksums of an audio file using:
  - the rip index as the track number,
  - the number of files in the rip as the total number of tracks,
- look up expected checksums in the database response using CD TOC index.
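The relationship between the two indexes in the steps above can be sketched as a small generator. `verification_plan` is a hypothetical name, not an ARver function, and the sketch assumes the only data track is CD TOC track 1, as in the example disc.

```python
def verification_plan(toc_track_count):
    """Yield (toc_index, rip_index, rip_total) for each audio track.

    Assumes a mixed mode CD whose single data track is TOC track 1.
    """
    rip_total = toc_track_count - 1  # the data track is not ripped
    for toc_index in range(2, toc_track_count + 1):
        rip_index = toc_index - 1
        yield toc_index, rip_index, rip_total


plan = list(verification_plan(29))
print(plan[0])   # (2, 1, 28): track02.cdda.wav
print(plan[-1])  # (29, 28, 28): track29.cdda.wav
```

The rip index and rip total feed the checksum calculation, while the TOC index is used only for the lookup in the database response.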
As discussed previously, AccurateRip response for a mixed mode CD contains a zero confidence checksum of the data track, but no checksums of the last audio track. In other words, a useless data track checksum occupies a slot that could have been used by an audio track. This results in an edge case: there is no AccurateRip checksum in the response that corresponds to the final value of CD TOC index. This would lead to a runtime error on dictionary lookup.
The workaround for that is inserting an "artificial" track with no available checksums to the end of AccurateRip response when converting database response to a dictionary. This achieves two things: first, it avoids the lookup error. Second, it makes arver treat the final track as one which has no checksums in the database. Track verification result is then correctly shown as N/A in the summary.
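A minimal sketch of that padding step is shown below. It is not ARver's actual code: the function name, the (checksum, confidence) tuples and the use of None for "no checksums available" are all assumptions made for this illustration.

```python
def response_to_dict(entries, toc_track_count):
    """Map CD TOC index to a response entry, padding missing tracks.

    `entries` is a hypothetical list of (checksum, confidence) tuples in
    response order. Tracks missing from the response map to None.
    """
    by_toc_index = dict(enumerate(entries, start=1))
    # Append "artificial" tracks so every TOC index can be looked up
    for toc_index in range(len(entries) + 1, toc_track_count + 1):
        by_toc_index[toc_index] = None
    return by_toc_index


# Shortened example: response with 3 entries for a disc with 4 TOC tracks
entries = [("00000000", 0), ("f7372315", 1), ("953b5f00", 1)]
response = response_to_dict(entries, 4)
print(response[2])  # ('f7372315', 1)
print(response[4])  # None, reported as N/A instead of raising KeyError
```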