VOXOS: bringing Daft Punk to the FPGA

Dec 2023

TLDR: I made a digital vocoder with MIDI support, an instrument that imparts your voice onto any source audio, as heard in the music of Daft Punk, Kraftwerk, and others. It involved implementing DSP without third-party libraries, writing SPI/UART drivers, and wrestling with USB on a Xilinx FPGA.

The Plan

I’m not really a singer and my mouth trumpet1 doesn’t really qualify. Let’s just say there’s a reason I stuck with instruments. So for my 6.111 (now 6.205) final project, I wanted to make something that could give me an extra boost. The vocoder, a historical instrument originally meant for “downsampling” spoken audio during wartime for efficient transmission, has found new use in electronic music. It’s how Daft Punk sounds like robots in Harder, Better, Faster, Stronger. It essentially takes your voice and “matches” it to a specific frequency, a “carrier.” Back in the day, this carrier was beyond audible frequencies and sent over radio since sending the raw voice audio was infeasible. Nowadays, that carrier is another audible pitch, and the output is an almost autotune-like sound.


The theory of operation is pretty simple. Sounds sound the way they do because of the unique set of frequencies they contain, each at a different prominence. Human speech, and in particular certain letter combinations or inflections within a language, has universal characteristics no matter the speaker: “s” sounds have higher frequencies to get that crackle, and in English, the ends of questions carry more high-frequency content as you go up in pitch.


The idea is to “shape” or “impart” these spectral features from speech onto our carrier. If you’re tight with Fourier, you might take an FFT of both your speech and carrier, then scale each bucket of the carrier FFT by the magnitude of the corresponding speech FFT bucket. Since I wanted to hand-write all my hardware, I decided to do a more lo-fi version:


The overall plan. Drawing is my passion
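For reference, the full-Fourier approach I skipped would look something like this (a minimal NumPy sketch of the idea, not anything from the actual project):

```python
import numpy as np

def vocode_frame(speech: np.ndarray, carrier: np.ndarray) -> np.ndarray:
    """One frame of the FFT-style vocoder: scale each carrier bin by the
    magnitude of the matching speech bin (frames must be the same length;
    no normalization, just the idea)."""
    S = np.fft.rfft(speech)
    C = np.fft.rfft(carrier)
    shaped = C * np.abs(S)                    # impart the speech envelope
    return np.fft.irfft(shaped, n=len(carrier))
```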

In the end I had a few design requirements:

  1. Sound like I’m singing any note on the standard 88 piano keys (since I’m going to be controlling this with a keyboard, anything higher or lower would be unpleasant anyways)
  2. Have decent audio quality, so nothing 8-bit. The benchmark was being on par with the OG analog Roland VP-330
  3. Controllable over MIDI so I could choose my favorite DAW/controller and pitchbend/generate envelopes as I desired

And an FPGA shines here because we have a large parallel problem (lots of filters in our filterbank) that can be pipelined at a relatively high frequency.

Making Sound

Like any self-respecting synthesizer, I wanted to be able to generate sine, sawtooth, triangle, and square waves, up to at least 5kHz to meet requirement (1). Further, to reach cent-resolution2 in the worst case (at 20Hz) for requirement (2), we need at least 21 bits to represent our audio. I rounded this up to 24-bit for a round 3 bytes.


The triangle, square, and saw are easy to generate on the FPGA. For the sine wave, instead of doing anything floating-point or CORDIC on the FPGA, I opted for Direct Digital Synthesis (DDS). All this means is storing a big table of precomputed sine values scaled by $2^{23}-1$ and indexing into it based on the current phase. Exploiting symmetry, I only needed to save a quarter-period of these values. I wrote a Python script to do that and save it to a file Verilog could use to fill up BRAM on the FPGA.
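The generator script was roughly this shape (a sketch; the table depth, file name, and hex format are illustrative, not the project’s exact values):

```python
import math

DEPTH = 1024          # LUT entries covering one quarter period (illustrative)
SCALE = 2**23 - 1     # full-scale value for 24-bit signed samples

with open("sine_quarter.mem", "w") as f:
    for i in range(DEPTH):
        # Sample sin(x) over [0, pi/2); the hardware reconstructs the other
        # three quarters of the period by symmetry.
        value = round(SCALE * math.sin((math.pi / 2) * i / DEPTH))
        f.write(f"{value:06x}\n")   # one hex entry per line, a la $readmemh
```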

Designing Pipelined Filters

Next, I went about designing my filterbank for our cheapo-pseudo-FFT. For digital filtering, we can choose an IIR or FIR3 filter, each with its own benefits and tradeoffs. I went with an IIR for its (a) easier implementation, (b) flatter response outside the band of interest, and (c) lower latency. The tradeoff is worse filter stability and nonlinear phase: these matter to some audiophiles, but I figured implementation ease won out.


A basic digital IIR filter is the biquad, which can be parameterized as the bandpass I needed and implemented easily via the difference equation below, where $y[n]$ is our output and $x[n-i]$ are the current and previous input samples.

$$y[n] = b_0 x[n] + b_1 x[n-1] + b_2 x[n-2] - a_1 y[n-1] - a_2 y[n-2]$$
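In software terms, the whole filter boils down to a few multiplies and a short delay line; a plain-Python sketch of the same difference equation (not the pipelined fixed-point Verilog):

```python
def biquad(samples, b0, b1, b2, a1, a2):
    """Direct Form I biquad implementing the difference equation above
    (coefficients assumed normalized so that a0 = 1)."""
    x1 = x2 = y1 = y2 = 0.0      # delay line, filter starts at rest
    out = []
    for xn in samples:
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, xn          # shift previous inputs
        y2, y1 = y1, yn          # shift previous outputs
        out.append(yn)
    return out
```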

Next, I chose to make these bandpasses maximally flat to faithfully reproduce speech characteristics. Via the $z$-transform, we can generate coefficients $a_i, b_i$ for Butterworth IIR filters that have this response. Learning MATLAB came in handy here, and I could easily generate the coefficients for however many frequency bands I wanted in my filterbank, logarithmically spaced between 50Hz and 7kHz (the normal human speech range). Above 7kHz, I had a highpass handle the rest.
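I used MATLAB, but the same thing is easy to sketch with SciPy if that’s more your speed (the band count and sample rate below are placeholders, not the project’s exact numbers):

```python
import numpy as np
from scipy.signal import butter

FS = 48_000     # sample rate (placeholder)
N_BANDS = 16    # number of bandpass channels (placeholder)

# Logarithmically spaced band edges from 50 Hz to 7 kHz
edges = np.logspace(np.log10(50), np.log10(7_000), N_BANDS + 1)

filterbank = []
for lo, hi in zip(edges[:-1], edges[1:]):
    # Order-2 Butterworth bandpass -> 4th order overall, i.e. two cascaded
    # biquads, returned as second-order sections (b0 b1 b2 a0 a1 a2 per row).
    sos = butter(2, [lo, hi], btype="bandpass", fs=FS, output="sos")
    filterbank.append(sos)
```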


16 Butterworth IIR bandpasses from 50Hz-7kHz, logarithmically spaced (plotted in MATLAB). These are also 4th order (2 cascaded biquads) for better rolloff.

These were saved in a BRAM with fixed-point arithmetic (no hardware floats!). Next, I wanted as many filters in my filterbank as possible for higher output quality (more buckets = more speech features captured). But the FPGA has only so many resources, so to push it to its limits, I timeshared every 4th-order biquad: speech and synth audio were filtered serially, then the same filter hardware was reconfigured as a lowpass filter for envelope detection. But each of these involved 20-bit by 24-bit multiplications plus headroom, so I had to pipeline each operation of the biquad to meet timing. Given the clock and sample rate, this put a further constraint on how many filters I could fit, with a theoretical maximum of 29. I ended up implementing 28.
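Converting the coefficients for the BRAM is just a scale-and-round into a signed fixed-point word; something like this (the widths here are assumptions, not necessarily the exact format I used):

```python
def to_fixed(coeff: float, frac_bits: int = 20, width: int = 24) -> int:
    """Quantize a float coefficient to a signed fixed-point word for BRAM
    (illustrative widths; raises if the value doesn't fit)."""
    scaled = round(coeff * (1 << frac_bits))
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if not lo <= scaled <= hi:
        raise ValueError(f"{coeff} does not fit in {width} bits")
    return scaled & ((1 << width) - 1)   # two's-complement encoding
```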

Talking to the World

For my microphone and audio interface, I wrote UART/I2S drivers that I testbenched and also verified with a logic analyzer. I used the Pmod I2S2 peripheral, which came with the nice addition of an audio line-in. I could change any song to any key!



Battling USB (and Losing)

To accomplish (3), I actually spent the first few weeks of the project trying to get the MAX3421E chip on the board to act as a USB host, driving it from the FPGA over SPI. Then I could plug any MIDI keyboard directly into the FPGA and truly play it on-the-go. In fact, I spent a week or so reading up on this excellent wiki, USB in a Nutshell, and implementing the entire process in a driver:


The pain of USB, in commits

I got the chip to detect a peripheral, negotiate USB 1.1, send 5V to the keyboard, and set up endpoints and configurations, but it crapped out as soon as I asked for a data packet. It’d keep spinning in this un-further-documented 0x0E “timeout” state. Enlightening. Joe Steinmeyer, who runs the class, even brought out this monster $40k scope which could decode USB packets. Unfortunately, that feature is behind a subscription :sob:


It runs Windows???

I ended up writing a small Python script that would transparently forward MIDI packets from the keyboard, through my computer, to the FTDI chip on the FPGA over UART. I guess I know that datasheet by heart now.
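The forwarding itself is tiny; a sketch using mido and pyserial (the port name and baud rate are placeholders for whatever your setup uses):

```python
import mido      # MIDI input from the keyboard
import serial    # pyserial: talks to the FTDI UART on the board

# Placeholder port name and baud rate
uart = serial.Serial("/dev/ttyUSB0", baudrate=31250)

with mido.open_input() as keyboard:          # first available MIDI input
    for msg in keyboard:
        # Forward the raw MIDI bytes (status + data) straight to the FPGA
        uart.write(bytes(msg.bytes()))
```

Here’s a demo of the synth and MIDI features: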



How does it sound?



So, while you definitely sound like a robot, there’s still something left to be desired. Some small off-by-one timing issues crept up, and clipping was an issue I couldn’t fully solve. On other fronts, though, the project was a great success: I learned a lot about DSP, USB, and hardware debugging. It was really satisfying to design the DSP algorithm end-to-end and come out with a convincing result!


  1. Put your top teeth on your bottom lip and hum. Your teeth should introduce some higher harmonics. Best if you have an underbite and flip the placement. 

  2. A cent is 1/100 of a semitone (a semitone being a twelfth of an octave), i.e. 1/1200 of an octave. This is kind of an arbitrary definition for “hi-fi” audio, but a good bar for what I can distinguish with my ears

  3. Infinite impulse or finite impulse response.