
I'm attempting to sync recorded audio (from an AVAudioEngine inputNode) to an audio file that was playing during the recording process. The result should be like multitrack recording, where each new track is synced with the tracks that were already playing when it was recorded.

Because sampleTime differs between the AVAudioEngine's output and input nodes, I use hostTime to determine the offset of the original audio and the input buffers.
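For reference, the secondsToTicks factor used in the code below is just the mach timebase conversion; a minimal sketch of how it can be derived (the name matches the sample project):

import AVFoundation

// Conversion between seconds and mach host ticks, via mach_timebase_info
var timebaseInfo = mach_timebase_info_data_t()
mach_timebase_info(&timebaseInfo)
let ticksToSeconds = Double(timebaseInfo.numer) / Double(timebaseInfo.denom) / 1_000_000_000.0
let secondsToTicks = 1.0 / ticksToSeconds
// AVAudioTime.hostTime(forSeconds:) and AVAudioTime.seconds(forHostTime:) perform the same conversion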

On iOS, I would assume that I'd have to use AVAudioSession's various latency properties (inputLatency, outputLatency, ioBufferDuration) to reconcile the tracks as well as the host time offset, but I haven't figured out the magic combination to make them work. The same goes for the various AVAudioEngine and Node properties like latency and presentationLatency.
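Reading those values and converting them to frames is straightforward; it's the combination (and signs) that I haven't figured out. A minimal sketch of the kind of adjustment I mean, with the exact formula left open:

import AVFoundation

// iOS: the session's latency values are TimeIntervals (seconds); convert to frames
let session = AVAudioSession.sharedInstance()
let sampleRate = session.sampleRate
let outputLatencyFrames = session.outputLatency * sampleRate
let inputLatencyFrames  = session.inputLatency * sampleRate
let ioBufferFrames      = session.ioBufferDuration * sampleRate
// Candidate correction to apply to the host-time-derived sample offset computed below --
// which of these terms to include, and with what sign, is exactly the question
let candidateOffsetFrames = outputLatencyFrames + inputLatencyFrames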

On macOS, AVAudioSession doesn't exist (outside of Catalyst), meaning I don't have access to those numbers. Meanwhile, the latency/presentationLatency properties on the AVAudioNodes report 0.0 in most circumstances. On macOS, I do have access to AudioObjectGetPropertyData and can ask the system about kAudioDevicePropertyLatency, kAudioDevicePropertyBufferSize, kAudioDevicePropertySafetyOffset, etc., but am again at a bit of a loss as to what the formula is to reconcile all of these.
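The querying itself isn't the problem; a minimal sketch of the call I'm using (the helper name is just for illustration):

import CoreAudio

// Read a UInt32 device property (latency, safety offset, buffer size) from the HAL
func deviceProperty(_ deviceID: AudioObjectID,
                    _ selector: AudioObjectPropertySelector,
                    _ scope: AudioObjectPropertyScope) -> UInt32 {
    var address = AudioObjectPropertyAddress(mSelector: selector,
                                             mScope: scope,
                                             mElement: kAudioObjectPropertyElementMaster)
    var value: UInt32 = 0
    var size = UInt32(MemoryLayout<UInt32>.size)
    AudioObjectGetPropertyData(deviceID, &address, 0, nil, &size, &value)
    return value
}

// e.g. deviceProperty(someDeviceID, kAudioDevicePropertySafetyOffset, kAudioObjectPropertyScopeInput)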

I have a sample project at https://github.com/jnpdx/AudioEngineLoopbackLatencyTest that runs a simple loopback test (on macOS, iOS, or Mac Catalyst) and shows the result. On my Mac, the offset between tracks is ~720 samples. On others' Macs, I've seen as much as 1500 samples offset.

On my iPhone, I can get it close to sample-perfect by using AVAudioSession's outputLatency + inputLatency. However, the same formula leaves things misaligned on my iPad.

What's the magic formula for syncing the input and output timestamps on each platform? I know it may be different on each, which is fine, and I know I won't get 100% accuracy, but I would like to get as close as possible before going through my own calibration process.

Here's a sample of my current code (full sync logic can be found at https://github.com/jnpdx/AudioEngineLoopbackLatencyTest/blob/main/AudioEngineLoopbackLatencyTest/AudioManager.swift):

//Schedule playback of original audio during initial playback
let delay = 0.33 * state.secondsToTicks
let audioTime = AVAudioTime(hostTime: mach_absolute_time() + UInt64(delay))
state.audioBuffersScheduledAtHost = audioTime.hostTime

...

//in the inputNode's inputTap, store the first timestamp
audioEngine.inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (pcmBuffer, timestamp) in
            if self.state.inputNodeTapBeganAtHost == 0 {
                self.state.inputNodeTapBeganAtHost = timestamp.hostTime
            }
}

...

//after playback, attempt to reconcile/sync the timestamps recorded above

let timestampToSyncTo = state.audioBuffersScheduledAtHost
let inputNodeHostTimeDiff = Int64(state.inputNodeTapBeganAtHost) - Int64(timestampToSyncTo)
let inputNodeDiffInSamples = Double(inputNodeHostTimeDiff) / state.secondsToTicks * inputFileBuffer.format.sampleRate //secondsToTicks is calculated using mach_timebase_info

//play the original metronome audio at sample position 0 and try to sync everything else up to it
let originalAudioTime = AVAudioTime(sampleTime: 0, atRate: renderingEngine.mainMixerNode.outputFormat(forBus: 0).sampleRate)
originalAudioPlayerNode.scheduleBuffer(metronomeFileBuffer, at: originalAudioTime, options: []) {
  print("Played original audio")
}

//play the tap of the input node at its determined sync time -- this _does not_ appear to line up in the result file
let inputAudioTime = AVAudioTime(sampleTime: AVAudioFramePosition(inputNodeDiffInSamples), atRate: renderingEngine.mainMixerNode.outputFormat(forBus: 0).sampleRate)
recordedInputNodePlayer.scheduleBuffer(inputFileBuffer, at: inputAudioTime, options: []) {
  print("Input buffer played")
}


When running the sample app, here's the result I get:

[Image: result of sync test]

jnpdx

2 Answers


This answer is applicable to native macOS only

General Latency Determination

Output

In the general case the output latency for a stream on a device is determined by the sum of the following properties:

  1. kAudioDevicePropertySafetyOffset
  2. kAudioStreamPropertyLatency
  3. kAudioDevicePropertyLatency
  4. kAudioDevicePropertyBufferFrameSize

The device safety offset, stream, and device latency values should be retrieved for kAudioObjectPropertyScopeOutput.

On my Mac for the audio device MacBook Pro Speakers at 44.1 kHz this equates to 71 + 424 + 11 + 512 = 1018 frames.

Input

Similarly, the input latency is determined by the sum of the following properties:

  1. kAudioDevicePropertySafetyOffset
  2. kAudioStreamPropertyLatency
  3. kAudioDevicePropertyLatency
  4. kAudioDevicePropertyBufferFrameSize

The device safety offset, stream, and device latency values should be retrieved for kAudioObjectPropertyScopeInput.

On my Mac for the audio device MacBook Pro Microphone at 44.1 kHz this equates to 114 + 2404 + 40 + 512 = 3070 frames.
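A minimal sketch of computing those per-direction totals in code, assuming the four properties are simply summed per scope as described above (note that kAudioStreamPropertyLatency is read from the device's stream object, obtained via kAudioDevicePropertyStreams, not from the device itself; error handling omitted):

import CoreAudio

// Safety offset + stream latency + device latency + buffer size for one direction
func totalLatencyFrames(device: AudioObjectID, scope: AudioObjectPropertyScope) -> UInt32 {
    func read(_ object: AudioObjectID, _ selector: AudioObjectPropertySelector) -> UInt32 {
        var address = AudioObjectPropertyAddress(mSelector: selector, mScope: scope,
                                                 mElement: kAudioObjectPropertyElementMaster)
        var value: UInt32 = 0
        var size = UInt32(MemoryLayout<UInt32>.size)
        AudioObjectGetPropertyData(object, &address, 0, nil, &size, &value)
        return value
    }

    // kAudioStreamPropertyLatency lives on the stream, so fetch the first stream for this scope
    var streamsAddress = AudioObjectPropertyAddress(mSelector: kAudioDevicePropertyStreams,
                                                    mScope: scope,
                                                    mElement: kAudioObjectPropertyElementMaster)
    var firstStream = AudioStreamID(0)
    var streamSize = UInt32(MemoryLayout<AudioStreamID>.size)
    AudioObjectGetPropertyData(device, &streamsAddress, 0, nil, &streamSize, &firstStream)

    return read(device, kAudioDevicePropertySafetyOffset)
         + read(firstStream, kAudioStreamPropertyLatency)
         + read(device, kAudioDevicePropertyLatency)
         + read(device, kAudioDevicePropertyBufferFrameSize)
}

// e.g. totalLatencyFrames(device: outputDeviceID, scope: kAudioObjectPropertyScopeOutput)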

AVAudioEngine

How the information above relates to AVAudioEngine is not immediately clear. Internally AVAudioEngine creates a private aggregate device and Core Audio essentially handles latency compensation for aggregate devices automatically.
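One way to get hold of that aggregate device, so the properties listed above can be queried on it, is to ask the engine's output unit for its current device. A sketch (this may or may not be how the values reported below were obtained):

import AVFoundation
import AudioToolbox
import CoreAudio

// Ask the engine's output AudioUnit which AudioDeviceID it is attached to --
// on macOS this is AVAudioEngine's private aggregate device
func engineDeviceID(_ engine: AVAudioEngine) -> AudioDeviceID? {
    guard let audioUnit = engine.outputNode.audioUnit else { return nil }
    var deviceID = AudioDeviceID(0)
    var size = UInt32(MemoryLayout<AudioDeviceID>.size)
    let status = AudioUnitGetProperty(audioUnit,
                                      kAudioOutputUnitProperty_CurrentDevice,
                                      kAudioUnitScope_Global,
                                      0,
                                      &deviceID,
                                      &size)
    return status == noErr ? deviceID : nil
}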

During experimentation for this answer I've found that some (most?) audio devices don't report latency correctly. At least that is how it seems, which makes accurate latency determination nigh impossible.

I was able to get fairly accurate synchronization using my Mac's built-in audio using the following adjustments:

// Some non-zero value to get AVAudioEngine running
let startDelay = 0.1

// The original audio file start time
let originalStartingFrame: AVAudioFramePosition = AVAudioFramePosition(playerNode.outputFormat(forBus: 0).sampleRate * startDelay)

// The output tap's first sample is delivered to the device after the buffer is filled once
// A number of zero samples equal to the buffer size is produced initially
let outputStartingFrame: AVAudioFramePosition = Int64(state.outputBufferSizeFrames)

// The first output sample makes it way back into the input tap after accounting for all the latencies
let inputStartingFrame: AVAudioFramePosition = outputStartingFrame - Int64(state.outputLatency + state.outputStreamLatency + state.outputSafetyOffset + state.inputSafetyOffset + state.inputLatency + state.inputStreamLatency)

On my Mac the values reported by the AVAudioEngine aggregate device were:

// Output:
// kAudioDevicePropertySafetyOffset:    144
// kAudioDevicePropertyLatency:          11
// kAudioStreamPropertyLatency:         424
// kAudioDevicePropertyBufferFrameSize: 512

// Input:
// kAudioDevicePropertySafetyOffset:     154
// kAudioDevicePropertyLatency:            0
// kAudioStreamPropertyLatency:         2404
// kAudioDevicePropertyBufferFrameSize:  512

which equated to the following offsets:

originalStartingFrame =  4410
outputStartingFrame   =   512
inputStartingFrame    = -2625
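For clarity, the arithmetic behind those offsets (44.1 kHz, 512-frame buffer, using the formulas above):

// originalStartingFrame = 44100 * 0.1                              =  4410
// outputStartingFrame   = outputBufferSizeFrames                   =   512
// inputStartingFrame    = 512 - (144 + 11 + 424 + 154 + 0 + 2404)  = -2625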
sbooth
  • Interesting -- on my machine (also a MBP), my numbers are similar, but it still seems to yield an offset of ~300 samples (assuming I'm doing the calculations right). Not terrible, but certainly not as close as I'd like. Getting someone else to run it on theirs so I can see. My `kAudioStreamPropertyLatency` reports 0 on my machine, which I find suspicious. Will comment again once I hear my tester's numbers. – jnpdx Jan 15 '21 at 01:07
  • BTW, I've updated my repo to incorporate these numbers in the branch feature/printLowLevelLatencies (https://github.com/jnpdx/AudioEngineLoopbackLatencyTest) – jnpdx Jan 15 '21 at 01:13
  • My tester's numbers are similar to yours (1596 output, 150 input) on a MBA. On his machine, this seems to lead to an even bigger offset than mine at ~500 samples. Do you happen to know why stream latency and buffer frame size should be accounted for on the output side, but not the input side? – jnpdx Jan 15 '21 at 01:41
  • It took me a few read-throughs, but I think I understand what you're saying. The numbers my Mac reports are similar to yours (-70 adjusted input vs 66 kAudioDevicePropertySafetyOffset, and 1112 adjusted output vs 1117 for inBuffer + outBuffer + out safety). The piece that I'm missing and I'm not clear on from your post is if these numbers can be somehow used to align the loopback audio - my test (wo/ accounting for latency) shows about ~750 frames. I can't seem to massage these numbers to work into that number. Think it's possible? Did you manage to align the audio? – jnpdx Jan 17 '21 at 21:16
  • P.S. Thank you so much for the work you've put into this -- amazing detail and research. Happy to give you the bounty even though it's just the Mac side, but I'd like to try to clear up my last questions about alignment. Also would very much welcome the opportunity to do a quick chat about this if you were up for it. – jnpdx Jan 17 '21 at 21:18
  • https://chat.stackoverflow.com/rooms/227464/room-for-jn-pdx-and-sbooth – jnpdx Jan 17 '21 at 21:25
  • For anyone who comes across this in the future, it's probably worth noting that the solution here varies by machine and still leads to roughly ~300-700 sample offsets – jnpdx Jan 18 '21 at 18:20
  • @jn_pdx Check out the edits and see if that method works better for you. I lost faith in the latency values the device I was using before handed out and used my Mac's built-in audio with seemingly better results. – sbooth Jan 20 '21 at 04:28

I may not be able to answer your question, but I believe there is a property not mentioned in your question that does report additional latency information.

I've only worked at the HAL/AUHAL layers (never AVAudioEngine), but in discussions about computing the overall latencies, some audio device/stream properties come up: kAudioDevicePropertyLatency and kAudioStreamPropertyLatency.

Poking around a bit, I see those properties mentioned in the documentation for AVAudioIONode's presentationLatency property (https://developer.apple.com/documentation/avfoundation/avaudioionode/1385631-presentationlatency). I expect that the hardware latency reported by the driver will be reflected there. (I suspect that the standard latency property reports the latency for an input sample to appear in the output of a "normal" node, and that the IO case is special.)
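A minimal sketch of reading that property and converting it to frames (it's a TimeInterval in seconds; per the comments below it may report 0.0 depending on platform and configuration):

import AVFoundation

// presentationLatency is reported in seconds; convert to frames to compare
// against the sample offsets discussed in the question
let engine = AVAudioEngine()
let sampleRate = engine.inputNode.outputFormat(forBus: 0).sampleRate
let inputPresentationFrames  = engine.inputNode.presentationLatency  * sampleRate
let outputPresentationFrames = engine.outputNode.presentationLatency * sampleRate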

It's not in the context of AVAudioEngine, but here's one message from the CoreAudio mailing list that talks a bit about using the low level properties that may provide some additional background: https://lists.apple.com/archives/coreaudio-api/2017/Jul/msg00035.html

user1325158
  • `presentationLatency` reports 0.0 for input and output nodes in Catalyst. On the Mac, it reports the same 399 samples that `AVAudioSession.sharedInstance().outputLatency` does (as well as `mainMixerNode.outputPresentationLatency`). So, it's useful to know that those properties line up. The regular `latency` properties all report 0.0 (making me wonder why they exist in the first place). So, that leaves me with about `300`+ samples to account for still on my machine... Looking into the mailing list link now... – jnpdx Jan 08 '21 at 23:00
  • Your link eventually pointed me towards a thread in January 2020 where people discussed these issues on iOS. The general consensus was that the user would have to calibrate their system in order to get close to sample-perfect. Seems surprising given that multitrack recording software would always have to do this. https://lists.apple.com/archives/coreaudio-api/2020/Jan/index.html – jnpdx Jan 08 '21 at 23:48