Nate's Blog

ApOPL3xy Emulator

Nate Barney — Wed, 24 Sep 2025 01:35:34 +0000

Last week, I posted about the ApOPL3xy reaching the version 1.0 milestone. This time, I want to discuss the emulator I wrote for it. I think it’s pretty cool. The emulator was initially conceived as a way to run the code on a computer with all the debugging tools that I’m used to having at my disposal. It was certainly useful for that, but it grew into a project of its own.

Demo Videos

Before I get into some of the more technical details, I want to show some video captures of the emulator running. And if you happen to find yourself thinking that it looks like something you’d want to play with, I’m releasing it all as open source. There will be some links later in the post where you can download the emulator, and for the ambitious, all the files needed to build your own hardware ApOPL3xy are available as well.

VGM Player

One of the most iconic pieces composed for the OPL3 is “At Doom’s Gate,” the music for the first level (Episode 1 Mission 1, or E1M1) of Doom. vgmrips.net has the VGM capture of this song available for download, and I can’t think of a better song to demonstrate the VGM player feature. Note that most, if not all, VGM files only use the first two output channels, since no 4-channel OPL3 sound cards were commonly available.

The numbered buttons below the display correspond to the colored buttons on the hardware, and the two groups of three buttons, labeled ↺, ↻, and ☟, represent the rotary encoders being turned counter-clockwise, turned clockwise, and pressed like a button, respectively. At some point, I may improve these to be a more skeuomorphic control, but the buttons get the job done.

Doom: At Doom’s Gate (E1M1)

MIDI Player

My last post showed the new MIDI file player feature on the hardware ApOPL3xy, but due to difficulties filming the backlit LCD screen, it wasn’t easy to see what was on the display. The next video shows the emulator playing a MIDI version of the Doctor Who theme music. As before, I arbitrarily assigned MIDI channels to output channels to make the VU meters a bit more interesting.

Doctor Who Theme

Loop Controls

Between the last post and this one, I added loop controls to both the VGM and MIDI players. It’s a small feature, but potentially useful. The loop indicator is just to the left of the total time in the player interface. It starts out as a right-pointing arrow, meaning no loop. The other loop modes are single loop, represented by an arrow that curves from the right to the left, and continual loop, represented by two arrows pointing at each other’s tail. The 5 button cycles through these modes.

This video demonstrates the single loop mode being used for a MIDI file with a version of the Pac-Man music. It also shows the page down menu navigation function, which is mapped to button 9.

Pac-Man

MIDI Input

My primary motivation for building the ApOPL3xy was to enable MIDI input to control an OPL3 chip. So far, I’ve demonstrated playback functionality, but I want to show the live MIDI capability as well. Unfortunately, I’m not very skilled at the piano, so I just play a few scales in the following video. The piano keyboard in the video is an open source tool called VMPK (Virtual MIDI Piano Keyboard), which I did not write, but am simply using.

The video shows selecting a MIDI input source for the emulator and using the Channel Editor feature to show the patches assigned to MIDI channels. Then, a few scales are played using a few different patches.

MIDI Input

MIDI Output

The ApOPL3xy also has a MIDI output port, which sends MIDI messages from the MIDI player to an external device. It also optionally echoes messages received from the MIDI input port, but that echoing is not demonstrated here. What is demonstrated is using VMPK as an external device to visualize the notes being played. In this video, VMPK is configured to display each MIDI channel with a different color. The MIDI file being played is the battle music from Final Fantasy VII.

Final Fantasy VII Battle Theme

Implementation Details

The ApOPL3xy firmware interacts with several pieces of hardware, all of which need to be emulated to be able to run the firmware purely in software.

Microcontroller

The hardware ApOPL3xy has an ATmega1284p microcontroller to run the firmware. Because the main goal of the emulator is to support debugging, this microcontroller is not emulated. Instead of running the compiled AVR firmware on an emulated ATmega1284p, the firmware is recompiled to native code for the computer running the emulator, with some conditionally-compiled hooks in a few places to tie in the other emulated hardware.

OPL3

This is probably the most important piece of hardware to emulate, and also the most daunting. Fortunately, the open source project DOSBox has code to emulate the OPL2 and OPL3 chips, so I adapted it for my use. This code exposes a function to set registers on the emulated OPL chip, which I used instead of the SPI interface I built for the hardware ApOPL3xy for that purpose. The DOSBox code also exposes a function to get output audio samples, which I used to populate the audio buffers (more on this later).

The OPL3 supports up to 4 channels of audio output, but sound cards incorporating the chip rarely if ever used more than 2. DOSBox didn’t bother implementing support for channels 3 and 4, so I modified it to add that support. It was actually easier than I expected. Since they had already gone to the trouble to implement 2 channels, most of the work involved increasing some array sizes and loop bounds from 2 to 4.

LCD Character Display

The standard 20×4 LCD character display used by the ApOPL3xy is based on a driver chip called the HD44780. I broke the emulation of this device up into two parts: the HD44780 emulation and the drawing of the LCD panel itself.

The HD44780 supports several commands which are sent by writing to its control and data registers. To emulate this chip, I wrote a C++ class to represent all of its internal registers, RAM, and ROM, and then implemented each command as a separate function that operates on the internal state. I then implemented a wrapper function which accepts commands or data, and dispatches to the correct underlying function(s). This is called by the emulated firmware instead of using SPI, similar to the OPL3.
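To make the command-dispatch idea concrete, here is a minimal Python sketch. It is not the emulator's C++ class: the class and method names are hypothetical, only a few commands are handled, and the address counter is simplified (a real HD44780 skips from 0x27 to 0x40 rather than counting straight through). The command encoding (highest set bit selects the command) and the 20×4 row offsets follow the usual HD44780 conventions.

```python
class HD44780:
    """Tiny, partial model of an HD44780 driving a 20x4 panel (illustrative only)."""

    # DDRAM start address of each visible row on a typical 20x4 module.
    ROW_OFFSETS = (0x00, 0x40, 0x14, 0x54)

    def __init__(self):
        self.ddram = bytearray(b" " * 0x80)  # simplified: one flat 7-bit address space
        self.address = 0                     # address counter
        self.display_on = False

    # One method per command, operating on the internal state.
    def clear_display(self):
        self.ddram[:] = b" " * len(self.ddram)
        self.address = 0

    def return_home(self):
        self.address = 0

    def display_control(self, bits):
        self.display_on = bool(bits & 0x04)

    def set_ddram_address(self, addr):
        self.address = addr & 0x7F

    def write_data(self, value):
        self.ddram[self.address] = value & 0xFF
        self.address = (self.address + 1) & 0x7F

    # Wrapper that emulated firmware would call in place of the SPI transfer.
    def write(self, value, is_data):
        if is_data:
            self.write_data(value)
        elif value & 0x80:
            self.set_ddram_address(value & 0x7F)
        elif value & 0x40:
            pass  # set CGRAM address (not modeled in this sketch)
        elif value & 0x20:
            pass  # function set (not modeled)
        elif value & 0x10:
            pass  # cursor/display shift (not modeled)
        elif value & 0x08:
            self.display_control(value)
        elif value & 0x04:
            pass  # entry mode set (not modeled)
        elif value & 0x02:
            self.return_home()
        elif value & 0x01:
            self.clear_display()

    def rows(self):
        """Return the four visible rows as text, for a panel widget to draw."""
        return [self.ddram[o:o + 20].decode("ascii", "replace")
                for o in self.ROW_OFFSETS]


lcd = HD44780()
lcd.write(0x0C, is_data=False)         # display on, cursor off
lcd.write(0x80 | 0x40, is_data=False)  # move to the start of the second row
for ch in b"ApOPL3xy":
    lcd.write(ch, is_data=True)
```

Keeping each command in its own method means the wrapper is only a decoder, so every command can be exercised in isolation.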

For the HD44780 emulation, I also wrote a function to retrieve the output state, e.g., which pixels are on and which are off. This is used by the LCD panel emulation to actually display the pixels. I used the open-source, cross-platform GUI library wxWidgets to implement the emulator’s GUI, and the LCD panel is implemented as a custom widget that does its own drawing with the graphics routines provided by the library.

Audio Output

To implement audio output for the emulator, I used the open-source, cross-platform library PortAudio. This library abstracts all the platform-specific audio APIs and presents a unified interface to interact with the audio system. With it, the emulator is able to select audio output devices, sample rate, latency, and number of channels. (These options are exposed to the user via the Audio Settings dialog window.) To play the synthesized audio, the emulator extracts audio samples from the emulated OPL3 and sends them to the host audio system using PortAudio’s interface.

VU Meters and Gain Controls

Each output channel on the physical ApOPL3xy has a VU meter (well, not a true VU meter, just a linear amplitude display) and a gain control. Emulating this in wxWidgets was pretty straightforward. These are in separate tool windows that the user can display and dismiss.

The gain controls are slider widgets. To implement gain, the emulator reads the values from the sliders and uses them to adjust the audio samples before sending them to PortAudio for playback.

The VU meters are custom widgets that draw themselves based on a given amplitude. The emulator determines the current amplitude of each channel from the audio samples before sending them to PortAudio, and sends the amplitude values to their respective VU meter for display.
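A rough Python sketch of the two per-channel operations described above. The function names are hypothetical, and it assumes signed 16-bit samples, a linear gain multiplier, and a simple peak measurement; the emulator may measure amplitude differently or at a different point in the chain.

```python
def apply_gain(samples, gain):
    """Scale signed 16-bit samples by a linear gain, clamping to the valid range."""
    return [max(-32768, min(32767, int(s * gain))) for s in samples]


def peak_amplitude(samples):
    """Largest absolute sample as a fraction of full scale (0.0 to 1.0)."""
    if not samples:
        return 0.0
    return min(1.0, max(abs(s) for s in samples) / 32767)


buffer = [1000, -2000, 30000]
louder = apply_gain(buffer, 2.0)   # the last sample clamps at 32767
level = peak_amplitude(louder)     # value a VU meter widget could draw
```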

MIDI Input and Output

To emulate the MIDI input and output ports, I used the open-source, cross-platform library RtMidi. This library abstracts the host computer’s MIDI system, and enables enumeration of, reading from, and writing to system MIDI ports. The MIDI In and MIDI Out dialogs allow the user to select which, if any, MIDI ports to read from and write to.

If a MIDI input port is selected, MIDI events are read from it and injected into the firmware’s MIDI queue, instead of reading from the microcontroller’s serial interface. Similarly, instead of writing output MIDI events to the microcontroller’s serial interface, the emulator writes them to the host MIDI port selected for output.

MicroSD Card

The hardware ApOPL3xy includes a MicroSD card reader, which is read by the firmware using the open-source SdFat library. This library abstracts both the SPI messaging needed to interact with the card itself, as well as handling the FAT filesystem that keeps track of which files are where in the card’s storage.

To emulate this, I wrote a mock implementation of the subset of SdFat’s interface that the firmware uses, but instead of accessing an SD card, it uses the standard C++ std::filesystem library to access the host’s filesystem through the host OS. The user can “insert” a virtual SD card by using the Mount SD Card dialog window to select the directory to serve as the root directory of the virtual card. The user can “eject” a virtual SD card by using the Unmount SD Card menu option.
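The mount and unmount idea can be sketched in a few lines of Python. This is an illustration only: the class and its methods are hypothetical and do not mirror SdFat's interface or the emulator's C++ mock. It just shows a host directory standing in for the card's root.

```python
from pathlib import Path


class VirtualSdCard:
    """Maps card-style paths onto a host directory (hypothetical interface)."""

    def __init__(self):
        self.root = None  # no card "inserted" yet

    def mount(self, directory):
        root = Path(directory).resolve()
        if not root.is_dir():
            raise NotADirectoryError(str(directory))
        self.root = root

    def unmount(self):
        self.root = None

    def _host_path(self, card_path):
        if self.root is None:
            raise RuntimeError("no card mounted")
        host = (self.root / card_path.lstrip("/")).resolve()
        # Refuse paths that escape the mounted directory, e.g. via "..".
        if host != self.root and self.root not in host.parents:
            raise PermissionError(card_path)
        return host

    def exists(self, card_path):
        return self._host_path(card_path).exists()

    def read(self, card_path):
        return self._host_path(card_path).read_bytes()

    def listdir(self, card_path="/"):
        return sorted(p.name for p in self._host_path(card_path).iterdir())
```

Because every access goes through one path-translation helper, "ejecting" the card is just clearing the root.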

SRAM and EEPROM

The ATmega1284p microcontroller has 16 kiB of RAM and 4 kiB of EEPROM storage built in. This isn’t enough for everything I wanted the ApOPL3xy to be able to do, so I added SPI SRAM and EEPROM chips (128 kiB each) to the design. The SRAM on the microcontroller itself doesn’t need to be emulated; the emulator just uses the host’s system RAM for that. The microcontroller’s EEPROM isn’t used by the firmware at all, so that doesn’t need to be emulated either. The add-on chips, however, do need to be emulated.

The SRAM chip was very simple to emulate. I simply allocated an array of the right size and provided functions to read from and write to it. The emulator uses these functions instead of sending SPI commands like the firmware does on physical hardware. The EEPROM was very similar, except that a file is used instead of an in-memory array, since EEPROMs are nonvolatile.

If the emulator finds no EEPROM file in the expected location on startup, it generates a default one and uses that. This has turned out to be useful for setting up a hardware ApOPL3xy. Since the format of the EEPROM is identical between the emulator and the hardware, one can copy the EEPROM file from the emulator to a MicroSD card and load it onto the hardware.
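Here is a small Python sketch of both add-on chips, under stated assumptions: the class names are hypothetical, and the default EEPROM image is filled with 0xFF purely as a placeholder (the emulator generates its own default contents, whose format is not described here).

```python
import os

SRAM_SIZE = 128 * 1024    # 128 KiB external SRAM
EEPROM_SIZE = 128 * 1024  # 128 KiB external EEPROM


class Sram:
    """Volatile storage: just an in-memory array."""

    def __init__(self):
        self.data = bytearray(SRAM_SIZE)

    def read(self, addr, length):
        return bytes(self.data[addr:addr + length])

    def write(self, addr, payload):
        if addr < 0 or addr + len(payload) > SRAM_SIZE:
            raise IndexError("write outside SRAM")
        self.data[addr:addr + len(payload)] = payload


class Eeprom:
    """Nonvolatile storage: backed by a file so contents survive restarts."""

    def __init__(self, path):
        self.path = path
        if not os.path.exists(path):
            # Placeholder default image (assumption): all bytes 0xFF.
            with open(path, "wb") as f:
                f.write(b"\xff" * EEPROM_SIZE)

    def read(self, addr, length):
        with open(self.path, "rb") as f:
            f.seek(addr)
            return f.read(length)

    def write(self, addr, payload):
        if addr < 0 or addr + len(payload) > EEPROM_SIZE:
            raise IndexError("write outside EEPROM")
        with open(self.path, "r+b") as f:
            f.seek(addr)
            f.write(payload)
```

Since the backing file is a plain byte image of the chip, copying it elsewhere is all that is needed to move settings around.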

Input Controls

The ApOPL3xy has 10 buttons and 2 rotary encoders (which can also act as buttons) which are used to navigate menus and issue commands. The hardware versions of these are connected to a shift register, which the firmware reads via SPI, interprets, debounces, and adds to an input event queue. The emulator instead provides GUI buttons for these, and pressing a button injects an input event directly into the queue. These buttons are also mapped to (hopefully) intuitive keyboard shortcuts.

Miscellaneous

There were a few other things that needed to be replaced or mocked to be able to compile and run the firmware in an emulated context.

The microcontroller framework used by the ApOPL3xy’s firmware provides functions for timing that use the microcontroller’s on-chip timers. These were replaced with versions that use the standard C++ std::chrono library instead.
The AVR architecture used by the ATmega1284p requires the use of special functions to read data out of program code (e.g. constant string data) instead of SRAM. These were replaced with simple pass-through functions, since the host machine for the emulator doesn’t have this restriction.
Various utility functions like min() and max() provided by the framework needed to be reimplemented.

Download Links

Emulator

If you’re interested in trying out the ApOPL3xy emulator, you can download pre-compiled binaries here:

Note that on Windows and macOS, the fact that this binary isn’t signed may cause the operating system to complain. To get past this on macOS, see https://support.apple.com/guide/mac-help/open-a-mac-app-from-an-unknown-developer-mh40616/mac. Microsoft doesn’t seem to have a good page on this, but here’s a potentially helpful Google search.

Music Files

Here are some music files that you can play using the ApOPL3xy:

Source Code

The combined source code for the emulator and firmware is available below. It’s released under the terms of the MIT License, except for the OPL emulation code from DOSBox, which is released under the LGPL 2.1 license. I’ll probably post this to GitHub at some point, but the code is a bit messy at the moment, and I might like to clean it up first.

ApOPL3xy Emulator and Firmware v1.0.1 (Source Code)

Hardware Design Files

If you’re interested and brave enough to want to build your own hardware ApOPL3xy, here are the files you’ll need. Note that the documentation is rather lacking at the moment, but I think someone determined enough could figure it out. If you try and have difficulties, let me know and I’ll be happy to help.

Closing Thoughts

I’m quite pleased with how this project turned out. What began as an attempt to make debugging easier grew into something kinda cool on its own. It also enables me to share the ApOPL3xy with more people than I would have otherwise. If you do try it out, I’d love to hear what you think. I have my own ideas of what could be improved, but there’s no substitute for real user feedback.

ApOPL3xy 1.0

Nate Barney — Mon, 15 Sep 2025 01:04:55 +0000

It’s been a long journey, but I finally have the ApOPL3xy hardware and software to the point where I can call it version 1.0. I’ll make another post in a few days with links to the firmware source code, the PCB design files, and emulator binaries. (Update: here’s that post.) But for now, I just want to show it off a little.

MIDI Player

One of the things I’ve added to the firmware recently is the ability to play MIDI files from the SD card. Here are a few videos of this in action. These videos are a little dark. It was hard to find a balance between blowing out the display and making the videos way too dark, so I did the best I could. I mapped the MIDI channels to somewhat arbitrary output audio channels to make the VU meters a little more interesting.

Video Killed the Radio Star

Star Wars Medley

Super Mario World

PCB Photos

Here are some high(ish) resolution photos of the assembled Rev 3 board, for those who are into that sort of thing. Also included is a photo of the bodges necessary on the Rev 2 board. To view the images at full resolution, right-click (or long-press) the image and select “Open Image in New Tab” or similar.

ApOPL3xy Rev 3 PCB (Front)

ApOPL3xy Rev 3 PCB (Front, Modules Removed)

ApOPL3xy Rev 3 PCB (Back)

ApOPL3xy Rev 2 PCB Bodges

Future Work

Now that the ApOPL3xy has reached this milestone, I’ll probably leave it alone for a while. But, such things are never truly finished. I have several ideas for user interface improvements, such as patch and bank copying, M3U playlist support, and most interesting to me, a drum machine mode (suggested by a co-worker). I’m sure I’ll come back to this eventually to make these and other improvements, but it sure feels good to be able to call it done (for now)!

Damon PCM

Nate Barney — Mon, 31 Mar 2025 04:54:51 +0000

I’ve made and posted a new version of Damon: The Rocket Jockey which adds an updated music track.

Updated Music

Off and on over the past several months, I’ve been working on a 3D voxel-art version of Damon for modern computers. It’s almost finished, but it has been for a while. One day I’ll finish it and make a post about it. It just needs a menu screen for configuring the controls. But, that’s not what this post is about.

Damon 3D Screenshot

I mention the 3D game because as part of that project, I wrote an updated version of the game’s music. The original music, being for a 1980’s 8-bit computer, was a bit simple and repetitive. I thought I might be able to jazz it up a bit, so I gave it a shot in Ableton Live. I’m pretty pleased with the result.

Back-Porting

But, as I said, this post isn’t about the 3D game; it’s about the version for the Commander X16. I was looking through the VERA documentation recently, and I learned that it supports PCM audio. (VERA stands for Versatile Embedded Retro Adapter, and it’s the primary graphics and sound chip used by the Commander X16. PCM stands for Pulse-Code Modulation, which is a way of encoding digital audio.) I thought it would be fun to see if I could get the Commander X16 version of the game to play the music from the 3D game. There were a few challenges to overcome, but I was ultimately able to get it working.

Resolution Reduction

PCM audio is made up of a sequence of “samples”, each of which represents the value of the audio waveform at an instant of time. For example, CD-quality audio and the Damon 3D music both use two channels of 16-bit samples (-32,768 to 32,767) at 44,100 Hz (samples per second). Doing the math, this ends up taking 176,400 bytes per second, or just over 10.5 MB for a minute of audio. While this is trivial for modern computers, it’s way too much for something like the Commander X16.

The VERA PCM system supports a sample rate of up to 48,828.125 Hz (25 MHz / 512), and the rate can be configured by specifying a value between 0 and 128, inclusive, scaling linearly between 0 Hz (no playback) and the maximum rate. After playing around a bit, I settled on the value 21, which gives a rate of approximately 8,010.86 Hz. I used Audacity to resample the audio from 44,100 Hz to 8,011 Hz, which is close enough.

With this sample rate reduction, by mixing the stereo down to mono, and by using 8-bit samples instead of 16-bit, 1 second of audio can be brought from 176,400 bytes/second down to 8,011 bytes/second, a roughly 95.5% reduction in size! This obviously reduces the audio quality, but surprisingly, it sounds pretty okay despite the severe reduction in resolution.
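The arithmetic in this section can be checked with a few lines of Python. The constant and function names are just for illustration; the numbers come from the text above.

```python
# CD-quality data rate: sample rate x channels x bytes per sample.
CD_BYTES_PER_SECOND = 44_100 * 2 * 2

# VERA's maximum PCM rate: 25 MHz / 512 = 48,828.125 Hz.
VERA_MAX_RATE_HZ = 25_000_000 / 512


def vera_rate_hz(rate_value):
    """Sample rate for a VERA rate setting, scaled linearly from 0 to 128."""
    if not 0 <= rate_value <= 128:
        raise ValueError("rate value must be between 0 and 128")
    return VERA_MAX_RATE_HZ * rate_value / 128


# Reduced format: 8,011 Hz, mono, one byte per sample.
REDUCED_BYTES_PER_SECOND = 8_011 * 1 * 1
reduction = 1 - REDUCED_BYTES_PER_SECOND / CD_BYTES_PER_SECOND

print(CD_BYTES_PER_SECOND)             # bytes per second at CD quality
print(CD_BYTES_PER_SECOND * 60 / 1e6)  # megabytes per minute at CD quality
print(round(vera_rate_hz(21), 2))      # rate for the setting used here
print(round(reduction * 100, 1))       # percent reduction in size
```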

One interesting thing to note is that the WAV file format treats 8-bit samples as unsigned offset binary, but VERA treats 8-bit samples as signed two’s complement. To handle this difference, it was necessary to write a Python script to parse and convert the PCM data from the WAV files.
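The conversion itself comes down to flipping the top bit of every sample. The sketch below shows only that step; it is not the script used for the game, the function name is made up, and reading the sample bytes out of the WAV file (for instance with Python's standard wave module) is left out.

```python
def wav_u8_to_signed8(data: bytes) -> bytes:
    """Convert unsigned offset-binary samples to two's complement.

    Offset binary stores silence as 0x80; two's complement stores it as 0x00.
    Subtracting 128 modulo 256 is the same as XOR-ing each byte with 0x80.
    """
    return bytes(b ^ 0x80 for b in data)


# Minimum, silence, and maximum in offset binary...
unsigned = bytes([0x00, 0x80, 0xFF])
# ...become -128, 0, and +127 when the result is read as signed bytes.
signed = wav_u8_to_signed8(unsigned)
```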

Tempo Adjustment

The new version of the music was rendered at 120 bpm. The original music was a bit faster, at 148 bpm. Since I wanted the player to be able to switch between the two versions of the music, I needed the two versions’ tempos to match up. With all the extra percussion in the new version, I felt like it sounded right at 120 bpm, so I adjusted the game code to generate the older music at that tempo as well.

Damon X16 Original Music (148 bpm)

Damon X16 Original Music (120 bpm)

I think the older music is a little better at the faster tempo, but that would probably be too fast for the new music, and it seems more important to have the two versions sync up.

Clip Sequences

To reduce duplication of redundant sections of audio, and to have more control over what parts of the music get played when, I broke the audio up into 1-measure clips, which at 120 bpm are 2 seconds each. For example, the music includes several measures of the bass line playing with no melody over it. There’s no need to store more than one of those. Similarly, other musical phrases are repeated in the song, and the game only stores a single copy of each distinct phrase.

There are a few other special-purpose audio clips, such as the ominous drone as the letters slide in on the title screen, and the level start and level complete music. Each of these is stored in its own file as well.

The audio clip files are stored in the same filesystem directory as the main program, and the first thing the game does is load each of them from disk into high memory. A stock X16 has 512 kiB of memory, and this is expandable to 2 MiB. All the clips together, after the resolution reduction, use about 220 kiB, so there was no problem fitting them all in. High memory is banked in the Commander X16, each 8 kiB bank appearing at $A000–$BFFF when it’s selected. This didn’t present significant difficulty, as the PCM data is mostly read sequentially.
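To illustrate why sequential reads make banking painless, here is a Python sketch that maps a linear offset into the clip data onto a bank number and an address. It assumes 8 kiB banks in a window starting at $A000, and it assumes the data begins in bank 1; the starting bank is a guess for the sake of the example, not something taken from the game's code.

```python
BANK_WINDOW_START = 0xA000
BANK_SIZE = 0x2000  # 8 KiB per bank


def locate(offset, first_bank=1):
    """Map a linear offset to (bank number, CPU address inside the window)."""
    bank, within = divmod(offset, BANK_SIZE)
    return first_bank + bank, BANK_WINDOW_START + within


def banks_needed(total_bytes):
    """Number of banks required to hold total_bytes (rounded up)."""
    return -(-total_bytes // BANK_SIZE)


print(banks_needed(220 * 1024))                 # banks for roughly 220 KiB of clips
print([hex(v) for v in locate(BANK_SIZE - 1)])  # last byte of the first bank
print([hex(v) for v in locate(BANK_SIZE)])      # first byte of the next bank
```

A sequential reader only has to bump the bank number and reset the address to the start of the window whenever it runs off the end.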

To handle the sequencing of audio clips, I wrote a small interpreter with instructions to do things like “play this clip” or “jump to this other part of the sequence”, which made it much simpler to manage the clips and play the right one at the right time. I discussed the interpreter technique briefly in my previous post. If you’d like more information about this, you’re welcome to check out the source code on GitHub, or ping me directly.
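A toy version of such a sequence interpreter, in Python rather than the game's 6502 assembly. The instruction names, the tuple encoding, and the clip names are all invented for this sketch; the real instruction set lives in the game's source.

```python
PLAY, JUMP, STOP = "play", "jump", "stop"


def run_sequence(program, max_clips, max_steps=10_000):
    """Return the clip names in playback order, up to max_clips clips.

    program is a list of (opcode, argument) pairs. PLAY takes a clip name,
    JUMP takes the index of the instruction to continue from, and STOP ends
    the sequence. max_steps guards against jump loops that never play a clip.
    """
    played = []
    pc = 0  # index of the next instruction
    for _ in range(max_steps):
        if len(played) >= max_clips or pc >= len(program):
            break
        op, arg = program[pc]
        if op == PLAY:
            played.append(arg)
            pc += 1
        elif op == JUMP:
            pc = arg
        elif op == STOP:
            break
        else:
            raise ValueError(f"unknown instruction: {op!r}")
    return played


# An intro followed by a two-clip loop that repeats indefinitely.
song = [
    (PLAY, "intro"),
    (PLAY, "bass_only"),
    (PLAY, "melody_a"),
    (JUMP, 1),
]
order = run_sequence(song, max_clips=6)
```

Because "bass_only" appears once in the program but plays on every pass through the loop, only one copy of that measure needs to be stored.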

Gameplay Video

Here’s a short recording I made with the X16 emulator of the game using the new music. To compare this to the previous music, check out the gameplay video in my previous post.

Damon: The Rocket Jockey Gameplay with PCM Music

Conclusion

I’m quite pleased with how this experiment turned out. I think it adds a bit more fun to the game. If you’d like to try it yourself, you can find it at natebarney.com/damon or on the Commander X16 forums. The source code is available at github.com/natebarney/damon-the-rocket-jockey.

Damon: The Rocket Jockey

Nate Barney — Sun, 02 Jun 2024 23:16:01 +0000

Introduction

About two months ago, I received in the mail my very own Commander X16 computer. This is a modern 8-bit computer envisioned and produced by David Murray (a.k.a. The 8-Bit Guy) along with a thriving community. It’s inspired by Commodore computers from the 1980’s, primarily the Commodore 64. I hooked it up and started playing with it, and immediately fell in love with it. It’s so Commodore-y. But it runs at 8 MHz instead of 1 MHz, has VGA output, and has a whole bunch of other modern improvements. It came with several games and other pieces of software, and I enjoyed messing around with those for a while.

What I really wanted to do, however, was write something for the machine. But what to write? A game would be the most fun to work on, but game design is not my strong suit, nor are graphics or sound design. After giving it some thought, I decided to make a clone of an existing game, relying on the game, graphics, and sound design from that game. I chose to clone Nomad: The Space Pirate, a Commodore 64 game I played a lot as a kid.

I thought this was a good game to clone for several reasons. First, it’s a fun game (at least I think so). Second, it’s relatively simple; there’s a single static screen per level, and only a few moving objects, so no scrolling or crazy sprite multiplexing would be required. Third, it’s a pretty obscure game. I haven’t met anyone outside my immediate family that’s ever heard of it. This obscurity meant that I would be unlikely to be beaten to the punch while my version was still in development. If I’d chosen something like Pac-Man or Tetris, on the other hand, my game would probably only be one of many similar games available on the platform.

I also needed a name for the project. I couldn’t simply call it Nomad: The Space Pirate; that would be at best confusing, and at worst copyright infringement. I noticed that Nomad spelled backwards was Damon. That’s a good start, but if Nomad is a Space Pirate, what should Damon be? I decided that he should be a Rocket Jockey, which sounds cool and is fun to say. And, he does ride a rocket around, so it’s not inaccurate. Thus, Damon: The Rocket Jockey was conceived.

The Game

The goal of the game is to fly your ship around a grid, picking up pellets. Every level has one or more enemies, which the player must avoid or shoot, as crashing into them causes the player to lose a ship. Shooting an enemy destroys it, but it immediately respawns in a corner of the screen unless all the pellets have already been collected.

Damon: The Rocket Jockey – Gameplay Screenshot
(White Ship: Player, Purple Ship: Enemy)

Once all pellets are collected, and all enemies destroyed, the player advances to the next level. Collecting pellets and destroying enemies award points, and a bonus ship is awarded every 10,000 points. Each level, another enemy is added (up to a maximum of 5), and in every level after the first, barrier blocks are placed in different locations around the play area. Neither the player nor enemies can fly through the barriers, but the player can shoot through them.

If you’d like to play the game, I’ve made it available at natebarney.com/damon. This page uses the Commander X16 web emulator to run the game right in your web browser. It doesn’t really work on smartphones or tablets, however, so you’ll need to access it from a desktop or laptop computer. (For Science, I tried it on the PlayStation 4 web browser. Unsurprisingly, it didn’t even load.) There are also links there to download the game if you’d prefer to play it using the standalone Commander X16 Emulator or even a real Commander X16.

If I’m having a really good game, I can sometimes make it to level 5, with my top score being something like 13,000 points. I’ve never made it to level 6, usually running out of ships at level 3 or 4, and my average score is something like 7,500 points. Here’s a video of me having a decent game, although I struggled a bit during level 3. It was captured directly from the VGA output of my physical Commander X16.

Gameplay Video

Development

The rest of this post will cover the development process for the game and discuss how or why I did things the way I did them. I probably won’t dive much into the actual code, but if you’d like to have a look at that, it’s available on GitHub.

Goals

My primary goal for this project was authenticity. I wanted my version to look, sound, and feel like the original. I think I largely succeeded, but there are, of course, some minor differences.

Another goal, perhaps related to the first, was to write the game entirely in assembly language. The original was almost certainly written this way, and I wanted to do the same. I did allow myself the luxury of using a modern cross-assembler (ca65) and linker (ld65), but I think that’s a minor concession. If I try my hand at writing another X16 game, I will probably try writing it in C. The cc65 suite includes a C compiler, and I haven’t done C for the 6502 before, so that should be interesting. But for this project, I stuck to assembly language.

My third goal was to write everything myself, from scratch. The author of the original game wrote what he wrote, and it’s copyrighted. Plus, it’s more fun to figure everything out myself. So, I didn’t look at the disassembled code for the original. I did use the machine language monitor in the VICE emulator to locate the memory address where the number of the player’s remaining ships was stored, so I could cheat by giving myself extra lives to see what the higher levels looked like. (This game gets hard pretty fast!) But, I didn’t look at the code.

Finally, I set myself a challenge goal of making my game smaller (in bytes) than the original. If I met the other goals but didn’t meet this one, I’d still consider the project a success, but it’s fun to see how tight one can make one’s code. As it turns out, the original was approximately 15K, and when finished, my version was just over 11K, so I count that as a win.

Graphics

The graphics for this game are primarily tile-based, with stationary graphical elements and text being composed of a 40×30 grid of 8×8 pixel tiles with 1 bpp (bit per pixel), giving a total screen resolution of 320×240. This is actually one of the minor differences I mentioned before. The Commodore 64 has a screen resolution of 320×200 pixels, or 40×25 tiles, but both systems use a 4:3 aspect ratio. This makes the C64 have a more vertically stretched appearance, and is the reason my clone has black bars at the top and bottom of the screen.

The elements that move (e.g. ships and bullets) are made using sprites instead of tiles. The Commander X16’s video chip, the VERA, supports up to 128 hardware sprites, of varying sizes and color depths, which it can overlay onto the stationary tiles at more-or-less arbitrary positions.

One of the first steps I took when developing this game was to grab a bunch of screenshots of the original game using VICE, and hand-copy out all of the graphics onto graph paper. I then used these notes to enter the tile and sprite data into assembly source code files, to be used by the rest of the game.

Tiles

The border around the screen, the pellets, and the blocks in the play area are all static graphical elements drawn using tiles. The blocks, and somewhat surprisingly, the pellets, are actually 16×16 pixels each, so they’re made of 2×2 arrangements of tiles.

Graphic Tiles

The level 1 (green) and level 5 (light red) blocks are made of four different tiles each, but the blocks for levels 2-4 (brown, medium grey, and orange) and the barrier (light grey) blocks simply repeat the same tile 4 times. Levels 6-10 use the same block graphics as 1-5, but with different colors and barrier block layouts. Levels 11-99 are exactly the same as level 10. The border (purple) is a single tile thick. Interestingly, some of the individual pellet tiles do double duty as single quotes and periods on the title screen.

Text in the game is also displayed using tiles. I was able to capture almost all of these tiles from the original game, since most of them were used in various places, especially the title screen. I did have to guess at a few letters and symbols, J, W, X, Z, and :, that weren’t used anywhere in the original.

Font Tiles

Sprites

A sprite is an image that the video chip can overlay on top of the tiles, at arbitrary positions not necessarily constrained to the tile grid. Each sprite has an independent position attribute, and simply updating the position moves the whole sprite image. This allows movement to be implemented in an efficient way.

Sprites on the Commodore 64 are 24×21 pixels at 1 bit per pixel. (There is a 2 bpp mode, but Nomad didn’t use it.) The VERA chip supports sprites that are 8, 16, 32, or 64 pixels wide, and 8, 16, 32, or 64 pixels high. These dimensions may be mixed, so one could define a 64×8 sprite if one wanted to. Each sprite on the VERA may also be either 4 or 8 bpp.

There’s quite a bit of mismatch between the sprite capabilities of the two systems. Fortunately, both the player and enemy ship sprites fit nicely within 16×16, and the bullet sprites similarly fit within 8×8. That works well, but there’s still a color depth mismatch. Since VERA doesn’t support 1 bpp sprites, I wrote a routine to read 1 bpp sprite definitions out of system RAM, and pad them out to 4 bpp when copying them to video RAM (VRAM). That way, I didn’t need to store a bunch of unused color depth data in the program itself.
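The 1 bpp to 4 bpp expansion can be sketched in Python as follows. This illustrates the idea and is not a port of the game's assembly routine: the function name and the row-as-integer representation are invented, and it assumes the 4 bpp layout packs two pixels per byte with the left pixel in the high nibble.

```python
def expand_1bpp_to_4bpp(rows, width, color=1):
    """Expand 1 bpp sprite rows into packed 4 bpp bytes.

    rows  -- one integer per sprite row; the most significant of `width`
             bits is the leftmost pixel
    width -- sprite width in pixels (must be even)
    color -- palette index written for every set pixel (clear pixels get 0)
    """
    if width % 2:
        raise ValueError("width must be even")
    out = bytearray()
    for row in rows:
        for x in range(0, width, 2):
            left = (row >> (width - 1 - x)) & 1
            right = (row >> (width - 2 - x)) & 1
            # Assumed packing: left pixel in the high nibble, right in the low.
            out.append(((color if left else 0) << 4) | (color if right else 0))
    return bytes(out)


# One 8-pixel row with only the two outermost pixels set.
packed = expand_1bpp_to_4bpp([0b10000001], width=8)

# A 16x16 sprite takes 32 bytes at 1 bpp and 128 bytes at 4 bpp.
ship = expand_1bpp_to_4bpp([0] * 16, width=16)
```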

VERA has the ability to flip each individual sprite either horizontally, or vertically, or both (or neither). This was useful, as each ship needs to be able to face 4 different directions. By making use of this feature, I only needed to store a horizontal definition and a vertical definition per ship, instead of all four.

Ship and Bullet Sprites

Animation

Sprites are also useful for animation. Each sprite entry in the video chip holds a pointer to the image data in VRAM that will be used to draw the sprite. It’s trivial to update the pointer and instantly change what the sprite looks like. The game uses this technique for player and enemy death animations, and for the large, animated block letters on the title screen.

When either an enemy or the player is destroyed, an explosion animation replaces the ship image. In the case of the player’s destruction, this is also followed by a winking skull and crossbones. Each of these animations consists of three separate frames. The explosion frames are played sequentially, and the winking skull frames are played sequentially and then in reverse.

Unfortunately, neither of these animations fits inside a 16×16 pixel sprite; both make use of most of the C64’s 24×21 sprite resolution. To display them with the VERA, they need to be defined as 32×32 sprites. But, to save space, I defined them in the code as 24×21, and wrote a routine to pad them out to the required 32×32 when copying them to VRAM.

Explosion and Skull Animation Sprites

Explosion and Skull Animations

Title Screen

The block letters on the title screen both move and animate, so they’re also drawn using sprites. These sprites presented two new challenges, both related to features present on the C64 but not on the Commander X16, one much simpler than the other to resolve.

Title Screen Block Letter Sprites

Title Screen Block Letter Animation

The first and simpler of the two challenges relates to sprite size. Recall that C64 sprites are 24×21 pixels. On the C64, but not the X16, sprites can be rendered as double-width, double-height, or both (or neither). This doesn’t add any extra detail to the sprite image. It merely doubles each pixel in the appropriate dimension(s) to increase the overall size.

The original game used this feature for the title screen block letters to make them double-width, for a total size of 48×21 pixels each. This means that, on the X16, the sprites need to be horizontally doubled and then padded out to 64×32. But, storing the doubled pixels directly in the game code would be wasteful. So, I wrote another sprite loading routine to double each pixel horizontally from 24×21 to 48×21 and pad the whole thing out to 64×32 when copying a sprite from system RAM to VRAM.
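The doubling and padding can be sketched in a few lines of Python. This is only an illustration, not the game’s loading routine. It works on lists of 0/1 pixel values rather than packed bytes, and padding on the right and bottom is an assumption.

```python
def double_and_pad(rows, out_width=64, out_height=32):
    """Double each pixel horizontally, then pad with blank pixels.

    `rows` is a list of rows, each a list of 0/1 pixel values
    (24 per row for a C64 sprite). Padding goes on the right and bottom.
    """
    result = []
    for row in rows:
        doubled = [p for p in row for _ in range(2)]   # 24 -> 48 pixels
        result.append(doubled + [0] * (out_width - len(doubled)))
    # Add blank rows below the image until the sprite is out_height tall.
    while len(result) < out_height:
        result.append([0] * out_width)
    return result
```

Calling it with 21 rows of 24 pixels yields 32 rows of 64 pixels, with the image occupying the top-left 48×21 region.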

Normal and Width-Doubled Sprites

The second challenge was trickier. If you watched the gameplay video above, you might have noticed that the letters slide in from off-screen. The C64 allows sprites to be positioned partially or even completely off-screen, which makes achieving this effect relatively straightforward. However, the Commander X16 doesn’t support this. Setting a sprite’s X or Y position to 0 will put the sprite against the left or top edge of the screen, but the sprite will be fully visible. The coordinates are unsigned integers, so trying to set them to negative values results in large positive values instead, and the sprite isn’t displayed at all.

The key to solving this puzzle is that, on the X16, you don’t always have to set the sprite image pointer at the beginning of the sprite image data. So, I padded the sprite image data with extra blank lines at the bottom, and as a new letter is just starting to emerge from off-screen, I position the sprite at the very top of the screen. I then set the sprite image pointer to the first blank line past the actual sprite image data, so all that shows is a blank sprite. Then, each frame, I move the sprite image pointer one line up, to give the illusion of the sprite coming in from off-screen. As soon as the sprite image pointer is at the top of the sprite image data, I switch from moving the sprite image pointer up to moving the sprite position down. This gives a seamless transition as the letter continues to move down the screen.
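Here’s a small Python sketch of that per-frame logic. The names and the example address are hypothetical, and the real game does this in assembly against the VERA’s sprite registers.

```python
def slide_in_step(image_ptr, y, image_base, row_bytes):
    """Advance the slide-in effect by one frame.

    While the pointer is below the top of the image data, reveal one more
    line by moving the pointer up; afterwards, move the sprite down.
    """
    if image_ptr > image_base:
        return image_ptr - row_bytes, y
    return image_ptr, y + 1


# Example: a 32-line image followed by blank lines, 32 bytes per line
# (64 pixels at 4 bpp). The base address is made up for illustration.
IMAGE_BASE = 0x4000
ROW_BYTES = 32
ptr, y = IMAGE_BASE + 32 * ROW_BYTES, 0   # start at the first blank line
for _ in range(33):
    ptr, y = slide_in_step(ptr, y, IMAGE_BASE, ROW_BYTES)
```

After 32 steps the pointer reaches the top of the image with the sprite still at the top of the screen, and from step 33 onward only the Y position changes.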

Sprite Emergence from Off-Screen via Sprite Image Pointer Manipulation

Sound and Music

The sound effects in the game are pretty simple. There are only three sound effects: the player’s bullet sound, the ship destruction crash sound, and the bonus ship sound. I implemented all of these using the Programmable Sound Generator (PSG) that’s built into the VERA. To determine the parameters of the sounds, I recorded them from the original running in VICE, and loaded them into Audacity, an open-source audio editor. This allowed me to see the waveforms used and measure durations, and the Fourier analysis tool it provides enabled me to roughly determine the frequencies to use.

The bullet sound uses a triangle waveform and does a linear sweep from about 880 Hz down to about 546 Hz over roughly half a second. It does stop whenever the bullet disappears, such as when it hits a wall or an enemy, so it might not make it to the end frequency. The ship destruction sound uses a noise waveform at about 4,400 Hz, and the base frequency doesn’t change, but the volume drops off linearly from full to zero over about half a second. The bonus ship sound uses a sawtooth waveform, with a constant frequency of about 1,134 Hz at full volume for about a sixth of a second.
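As a rough illustration of those two ramps, here’s a Python sketch that computes the bullet frequency and the crash volume for a given frame. It assumes one update per frame at 60 frames per second (so half a second is 30 frames) and a volume range of 0 to 63. The function names are made up, and converting a frequency in Hz to a PSG register value is left out.

```python
def lerp(start, end, t):
    """Linear interpolation from start (t = 0) to end (t = 1)."""
    return start + (end - start) * t


def bullet_frequency(frame, frames_total=30):
    """Frequency in Hz of the bullet sweep at the given frame."""
    t = min(frame, frames_total) / frames_total
    return lerp(880.0, 546.0, t)


def crash_volume(frame, frames_total=30, full=63):
    """Volume of the ship destruction noise at the given frame."""
    t = min(frame, frames_total) / frames_total
    return round(lerp(full, 0, t))
```

Halfway through the sweep, `bullet_frequency(15)` gives 713 Hz, the midpoint between the start and end frequencies.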

Bullet Sound

Ship Destruction Sound

Bonus Ship Sound

The music is also pretty simple. The title/background music is the same 8 notes repeated endlessly, although the in-game tempo is slightly faster than the title screen tempo. It reminds me of the bass line from the Peter Gunn Theme. It is pretty repetitive, and in fact, a member of the Commander X16 forums asked for a button to turn it off! I kinda like it myself, but maybe that’s just nostalgia talking.

Title/Background Music

Title Music

Background Music

The other two pieces of music in the game are the level start and level complete jingles. The level start music is simply a rapid chromatic scale from A♯3 to G4. The level complete music is the most complex of all of them, and my friend Matt pointed out that it’s actually a sped-up, minor version of the opening measures of Pictures at an Exhibition: Promenade by Modest Mussorgsky.

Level Start Music

Level Complete Music

For all of the music, I used the Yamaha YM2151 (OPM) FM synthesis chip built into the Commander X16. The X16 comes with a full set of synthesizer patches for this chip, burned into the ROM, and I played with some of those for a bit, but none of them sounded quite right. I ended up making my own patches by tediously tweaking the parameters until I got a sound that I felt was close enough to the original. The OPM chip’s parameters felt unusual to me, since my experience with FM is primarily with the OPL line of chips. For example, the envelope generator on the OPM has two separate decay phases. But, after playing around for a while, I think I’ve figured it out well enough.

The patch for the title/background music uses FM to approximate a sawtooth waveform, and the envelope has a sharp attack and decay, but a slow release, sort of like a plucked string. The patch for the level start and level complete jingles uses a similar sawtooth waveform, but has a slightly softer attack and decay.

Enemy AI

I fretted about the enemy AI for weeks before implementing it. After all, it’s what really makes the game challenging and fun to play. It seemed like a very complex part of the game to write, so I kept putting it off and worked on other parts, letting the problem simmer in the back of my mind. Soon enough, the time came when I couldn’t make any more progress without addressing this problem. Fortunately, a thought had occurred to me that might make it tractable: Markov chains.

A Markov chain is a process that makes decisions based only on its current state. Any previous history is irrelevant to what happens next. Instead of following some complicated overarching strategy, the enemies could simply make decisions in the moment, with whatever information they have at the time. I realized that the only time an enemy needs to make a decision is when it’s at an intersection. Any other time, it just keeps moving forward. So I watched the enemies move in the original game with this viewpoint in mind, to try to determine what rules it used. I came away with a set of rules that seems to be pretty close.

These are the rules I implemented for an enemy in its normal state:

If the enemy is at a wall, start moving in the reverse direction.
If the enemy has line-of-sight to the player, to the left, right, or straight ahead, move toward the player.
If the enemy is facing a barrier block, turn left or right with equal probability.
Otherwise, decide whether to turn with 50% probability, and if so, turn left or right with equal probability.
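
The rules above can be sketched as a single decision function, evaluated in order each time an enemy needs to make a decision. This is a Python illustration with hypothetical names, not the game’s assembly code.

```python
import random


def choose_action(at_wall, player_direction, facing_barrier, rng=random):
    """Return one of 'reverse', 'left', 'right', or 'straight'.

    player_direction is 'left', 'right', 'straight', or None when the
    enemy has no line of sight to the player.
    """
    if at_wall:
        return "reverse"
    if player_direction is not None:
        return player_direction
    if facing_barrier:
        return rng.choice(["left", "right"])
    if rng.random() < 0.5:              # 50% chance to turn at all
        return rng.choice(["left", "right"])
    return "straight"
```

Because the function looks only at the enemy’s current situation and a random number, it has no memory of earlier moves.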

That’s all there is to it. If there are multiple enemies, they don’t coordinate. Each follows the same set of rules independently. It’s surprising how complex and apparently organized the resulting behavior can seem. However, this is for enemies in their normal state. Respawning enemies follow a different set of rules:

Select the corner of the screen furthest from the player and respawn there.
Start moving vertically away from the corner.
If the player is visible at an intersection, turn in there and enter normal state.
Otherwise, at each intersection but the last, turn in and enter normal state with 25% probability.
At the last intersection, always turn in and enter normal state.
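
Two pieces of the respawn rules lend themselves to a short Python sketch: picking the corner and deciding whether to turn in. The names are hypothetical, and using squared straight-line distance to pick the furthest corner is an assumption.

```python
import random


def furthest_corner(player, corners):
    """Pick the corner with the greatest squared distance to the player."""
    px, py = player
    return max(corners, key=lambda c: (c[0] - px) ** 2 + (c[1] - py) ** 2)


def should_turn_in(player_visible, is_last_intersection, rng=random):
    """Decide at an intersection whether a respawning enemy turns in."""
    if player_visible or is_last_intersection:
        return True
    return rng.random() < 0.25          # 25% chance otherwise
```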

However, there is one more aspect of enemy behavior that I have yet to cover. Some readers may have noticed that in the ship and bullet sprite figure above, there appears to be a bullet for the enemy. That’s exactly what that is. The enemies can shoot back.

It only starts happening at level 8, so it’s kind of ridiculous to even have bothered implementing it, since I don’t think anyone will ever reach level 8. But I’ve been wrong before, and in any case, the original game does it, so I wanted my clone to do it too. I only discovered it myself when cheating to see the higher level layouts.

The enemies only have one bullet to share among themselves. The Commodore 64 can only display 8 sprites at a time, unless the game does something fancy like sprite multiplexing, which Nomad apparently doesn’t do. So, we have the player, the player’s bullet, and 5 enemies, which comes to 7 sprites, leaving one for the enemy bullet. I wonder if the enemies max out at 5 because the programmer wanted to save a sprite for the enemy bullet, or if 5 enemies was selected for other reasons, and since there was one sprite left over, he decided to use it for this. I guess there’s really no way to know for sure.

The first time I implemented enemy fire, the enemies were deadly snipers, immediately shooting and killing the player as soon as they had a shot. In order to tone this down a bit, I added a couple of constraints. The first was a reload time. At the beginning of a round, and after each time an enemy fires, the enemies have to wait for their bullet to reload before they can fire again. I set this to 5 seconds, because that seemed to approximate the rate of enemy fire in the original game. The second constraint was reaction time. Once an enemy sees the player and decides to fire, it won’t actually fire until 400 ms later.

Here are the resulting enemy fire rules, processed every frame in which the enemy isn’t busy making other decisions:

If the level is less than 8, don’t fire.
If the bullet is still reloading, don’t fire.
If the bullet is already in flight, don’t fire.
If the enemy is pointing at the player, start a 400 ms countdown and fire when it elapses, if possible.
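
Here’s one way the fire rules could be sketched in Python, with a shared bullet and a per-enemy reaction countdown. The class and field names are hypothetical. The sketch assumes 60 updates per second, so 5 seconds is 300 frames and 400 ms is 24 frames. It also assumes the countdown keeps running once started, and that the reload timer (decremented elsewhere once per frame) restarts when the bullet is fired.

```python
RELOAD_FRAMES = 5 * 60        # 5 s at 60 frames per second
REACTION_FRAMES = 24          # 400 ms at 60 frames per second


class SharedBullet:
    def __init__(self):
        self.reload_frames = RELOAD_FRAMES   # starts each round reloading
        self.in_flight = False


class Enemy:
    def __init__(self):
        self.reaction_frames = None          # None: no countdown running


def update_enemy_fire(enemy, bullet, level, aiming_at_player):
    """Run the fire rules for one enemy for one frame; True if it fires."""
    if level < 8 or bullet.reload_frames > 0 or bullet.in_flight:
        return False
    if enemy.reaction_frames is None:
        if aiming_at_player:
            enemy.reaction_frames = REACTION_FRAMES   # start the countdown
        return False
    enemy.reaction_frames -= 1
    if enemy.reaction_frames > 0:
        return False
    # Countdown elapsed: fire and start reloading.
    enemy.reaction_frames = None
    bullet.in_flight = True
    bullet.reload_frames = RELOAD_FRAMES
    return True
```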

These rules seem to result in behavior that loosely approximates that of the original game. Since it will likely never happen in practice, I think that’s close enough.

Collision Detection

Sprite collision occurs when two sprites overlap on the screen. It’s important to be able to detect when this happens, and between which sprites, to know when an enemy or player ship should be destroyed. The VERA chip used by the Commander X16 supports hardware collision detection, but the way it works can be a little confusing at first.

Hardware Collision Detection

Each sprite has a 4-bit collision mask, which can be thought of as defining 4 independent collision groups, one per bit. Each bit in the sprite collision mask is set to 1 for groups the sprite is a member of. Other bits are set to 0. The VERA keeps track of a 4-bit overall collision result, which it sets to all 0’s at the start of each frame. As the VERA renders each pixel, if it draws a sprite on top of another sprite, it checks their collision masks by bitwise-ANDing them together. If the result of the AND is non-zero (i.e. the two sprites share at least one collision group), the previous overall collision result is bitwise-ORed with the result of the AND to give the updated overall collision result. At the end of the frame, the overall collision result contains a 1 bit for each collision group that experienced a collision during that frame.

The program/game can register to receive a CPU interrupt for sprite collisions by specifying a 4-bit value (let’s call it the collision interrupt mask) indicating which collision groups it’s interested in. If, at the end of the frame, the overall collision result is non-zero, it is bitwise-ANDed with the collision interrupt mask. The result of this AND isn’t stored anywhere, but if it’s non-zero, the VERA generates a CPU interrupt and reports the overall collision result as part of its interrupt status register.

When the CPU receives an interrupt, it stops what it’s doing and jumps to a special routine called the Interrupt Service Routine (ISR), commonly known as an interrupt handler. The handler checks various hardware registers to see what generated the interrupt, and performs any actions that are needed to handle that interrupt. When it’s done, the interrupt handler returns the CPU to what it was doing before the interrupt occurred.

It’s good practice not to spend too much time in an interrupt handler. The interrupt handler for this game, when it receives a sprite collision interrupt, simply stores the overall collision result from the interrupt status register into a location in memory, to be processed as part of the normal game loop.

This game uses 3 different collision groups, one for the player’s ship, one for the player’s bullet, and one for the enemies’ bullet. Here’s a table of the collision group memberships:

Sprite	Player Bullet Group	Player Ship Group	Enemy Bullet Group
Player Bullet	1	0	0
Enemy Ship	1	1	0
Player Ship	0	1	1
Enemy Bullet	0	0	1

Collision Groups

This works really well for a single enemy. The VERA checks for collisions, and the set of groups with collisions tells the game exactly what needs to happen. If the player ship group or enemy bullet group has a collision, the player ship is destroyed. If the player bullet group has a collision, the enemy ship is destroyed. However, this runs into difficulties when multiple enemies are introduced. To resolve them, some software collision detection is also needed.
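To see how the masks in the table combine, here’s a small Python model of the AND/OR logic described above. The bit positions assigned to each group are assumptions, and the model works from a list of overlapping sprite pairs rather than from rendered pixels.

```python
PLAYER_BULLET_GROUP = 0b001
PLAYER_SHIP_GROUP = 0b010
ENEMY_BULLET_GROUP = 0b100

# Collision masks taken from the table of group memberships.
MASKS = {
    "player_bullet": PLAYER_BULLET_GROUP,
    "enemy_ship": PLAYER_BULLET_GROUP | PLAYER_SHIP_GROUP,
    "player_ship": PLAYER_SHIP_GROUP | ENEMY_BULLET_GROUP,
    "enemy_bullet": ENEMY_BULLET_GROUP,
}


def frame_collision_result(overlapping_pairs):
    """OR together the AND of the masks of every overlapping sprite pair."""
    result = 0
    for a, b in overlapping_pairs:
        result |= MASKS[a] & MASKS[b]
    return result


def raises_interrupt(result, interrupt_mask):
    """True if any group the program registered for saw a collision."""
    return (result & interrupt_mask) != 0
```

In this model, the player’s bullet overlapping the player’s ship gives 0, so nothing is reported. Two overlapping enemy ships, on the other hand, give `0b011`, which sets both the player bullet and player ship groups even though the player isn’t involved.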

Software Collision Detection

When there is more than one enemy, the collision groups alone don’t provide enough information, so the game needs to do some additional checks to determine what needs to happen next. For example, if the player bullet group has a collision, that doesn’t specify which enemy was hit.

To resolve this issue, the bullets and ships are assigned “hit boxes,” which are axis-aligned rectangles relative to the sprite’s coordinate space. (Axis-aligned means each side of the rectangle is parallel to either the X- or the Y-axis.) Hit boxes are entirely a software concept; the VERA chip has no notion of them. When the VERA informs the game of a collision in the player bullet group, for example, the game does an intersection test between the bullet’s hit box and each enemy’s hit box. If it finds an intersection, that’s the enemy that should be destroyed.

Ship and Bullet Sprite Hit Boxes

The player bullet group isn’t the only group that needs these extra checks, however. Every enemy is also a member of the player ship collision group. This means they can collide with the player ship, as expected. Some readers might be wondering why it would matter which enemy the player collided with. It, of course, doesn’t matter, as far as that goes. But since enemies are all in the same collision group, when they collide with each other, that registers a collision in that group. Without extra checks to determine whether the player was involved, any time two or more enemies overlapped, the player’s ship would spontaneously explode. I actually tried that experiment, For Science, as I was implementing multiple enemies. It was pretty funny to watch.

Without Software Collision Detection

The game is able to rely solely on hardware detection for collisions between the player ship and enemy bullet, however, since there’s only one of each. It’s only when there are multiple sprites of a given type (e.g. enemies) that the software algorithm is required. Therefore, there’s no hit box defined for the enemy bullet.

There are a couple of reasons to use axis-aligned hit boxes. The first is that they only require four numbers to fully specify: top, left, bottom, and right. The second is that it makes intersection testing really simple. The trick to doing an intersection test between axis-aligned rectangles is to check all the ways that they might not intersect. If the left coordinate of one rectangle is greater than the right coordinate of the other, or vice-versa, they clearly don’t intersect. The same holds true in the other direction with the top and bottom coordinates. If none of these four checks returns true, then the rectangles do intersect. Neat, huh?
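Here’s the intersection test as a short Python sketch. The `HitBox` type and the function name are made up for illustration, and the coordinates are assumed to be inclusive screen coordinates with Y increasing downward.

```python
from typing import NamedTuple


class HitBox(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def intersects(a, b):
    """True unless one of the four separating conditions holds."""
    if a.left > b.right or b.left > a.right:
        return False                    # separated horizontally
    if a.top > b.bottom or b.top > a.bottom:
        return False                    # separated vertically
    return True
```

With inclusive coordinates, boxes that merely share an edge count as intersecting, because none of the four comparisons is strictly greater.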

Vertically Non-Intersecting Rectangles

Horizontally Non-Intersecting Rectangles

Intersecting Rectangles

The significant advantage to using hit boxes for software collision detection is that it’s much faster than checking each sprite against each other sprite, pixel by pixel. This could be done, but it would be very computationally expensive (and harder to write as well). The obvious disadvantage is that it’s less precise than the pixel-perfect hardware collision detection. To deal with this, I shrank the player ship’s hit box a little. Otherwise, there could be spurious collisions when two empty sprite corners “overlap.” Since the game is hard enough already, I erred on the side of forgiving.

If the VERA supported seven or more collision groups, this software collision detection would be unnecessary (for this game). In that case, the one player ship collision group could be replaced with five new collision groups, one for each enemy ship, and the collision groups themselves would provide enough information about what to do. In fact, as I’m writing this, it occurs to me that if the enemies were partitioned into two groups, that could reduce the average number of checks the software collision detection would need to make. Fortunately, the game runs well enough without this optimization that I don’t feel the need to implement it.

Miscellaneous Techniques

In this section, I’ll describe several programming techniques that I found useful when developing this game.

Game Loop and Hierarchical Init / Update

The core of any game is the game loop. This is the bit of code that runs over and over again until the game exits, handling everything—input, graphics, sound, music, game state, etc. When the game first starts, it calls some initialization routines, often shortened to “init.” Once init is complete, the game enters the game loop, calling the update routine repeatedly.

I found it useful to establish multiple, hierarchical init and update routine pairs. There are, of course, the master init and update routines at the top of the hierarchy. But, for example, the title screen has its own init and update routines, only called when the game is displaying the title screen. There are also pairs of routines for the player, the enemies, the bullets, the music, the sound effects, and several others. Each layer of the hierarchy knows when and whether to call its subordinate sets of routines. In this way, the game is able to manage multiple independent operations apparently simultaneously, without becoming a huge mess of spaghetti code.

V-Blank Interrupt

If the game ran its update loop as fast as it possibly could, this would cause multiple problems. Some iterations of the loop take less time than others, depending on what needs to be done each time. So, the game would speed up and slow down erratically. The Commander X16 itself can run at different speeds. It defaults to 8 MHz, but it can also run at 4 MHz or 2 MHz, if the user desires. Changing this would alter the frame rate and behavior of the game as well. Finally, the graphics would be updated at effectively random times throughout the frame update, so at any one time, the screen would display part of one frame, and part of another. Possibly more than two, depending on how fast each loop iteration is.

To fix all of these issues, a common technique is to synchronize the game loop to the vertical blanking interval, or V-blank. A V-blank happens only once per frame, after the entire frame has been drawn. It also happens very regularly. The Commander X16 uses a frame rate of 60 Hz, so a V-blank happens pretty much exactly 60 times a second.

The VERA chip can be configured to cause a CPU interrupt on V-blank, so I enabled that, and set up my interrupt handler to set a flag variable to 1 when a V-blank interrupt occurs. The main game loop waits for that flag to become 1, performs a single update, sets the flag to 0, and waits for it to become 1 again. This allows the game to update at a smooth 60 frames-per-second, without any of the issues described above.
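The flag handshake can be modeled in a few lines of Python. This is a simulation with hypothetical names. On the real hardware, the flag is set by the interrupt handler and the main loop spins while waiting for it.

```python
class VBlankSync:
    def __init__(self):
        self.flag = 0

    def on_vblank_interrupt(self):
        """What the interrupt handler does: just raise the flag."""
        self.flag = 1

    def poll(self, update):
        """One pass of the wait loop: update only if a V-blank was flagged."""
        if self.flag == 1:
            update()
            self.flag = 0
            return True
        return False
```

However many times `poll` runs between interrupts, the update routine runs at most once per V-blank.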

State Machines

Several of the entities in the game have complicated behavior. For example, an enemy can be moving, deciding whether to turn, exploding, or respawning. All of these behaviors could be implemented in a monolithic update routine, but that would make keeping track of all of the variables that might affect things a nightmare. Instead, entities with complicated behaviors were given a state machine.

With a state machine approach, the entity has a state variable, which roughly corresponds to “what it’s doing right now.” The behavior of the entity differs depending on what state it’s in, and some events can move the entity to a new state. For example, if the enemy is in the “moving” state, and it’s hit by the player’s bullet, it transitions to the “exploding” state, and begins playing the exploding animation. Once the animation is complete, it transitions to the “respawning” state, unless all the pellets in the level have been collected, in which case it transitions to the “retired” state instead.

State machines fit nicely into the hierarchical init/update scheme described above. Each state has an init routine, which is called when the entity transitions to that state, and an update routine, which defines the behavior of the entity while it’s in that state. Entities’ main update routines are often nothing more than a lookup of the state value and a dispatch to the appropriate state-specific routine.
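Here’s a minimal Python sketch of that dispatch pattern, using the enemy states mentioned above. The routine names, the `events` dictionary, and the three-frame explosion timer are assumptions. In assembly, the lookup table would hold routine addresses rather than Python functions.

```python
class Enemy:
    def __init__(self):
        self.state = None
        self.frames_left = 0


def init_moving(enemy):
    pass


def update_moving(enemy, events):
    if events.get("hit_by_bullet"):
        set_state(enemy, "exploding")


def init_exploding(enemy):
    enemy.frames_left = 3               # three explosion frames


def update_exploding(enemy, events):
    enemy.frames_left -= 1
    if enemy.frames_left == 0:
        done = events.get("pellets_left", 1) == 0
        set_state(enemy, "retired" if done else "respawning")


def init_idle(enemy):
    pass


def update_idle(enemy, events):
    pass


# One (init, update) pair per state.
STATES = {
    "moving": (init_moving, update_moving),
    "exploding": (init_exploding, update_exploding),
    "respawning": (init_idle, update_idle),
    "retired": (init_idle, update_idle),
}


def set_state(enemy, name):
    """Transition: record the new state and run its init routine."""
    enemy.state = name
    STATES[name][0](enemy)


def update(enemy, events):
    """Main update: look up the current state and dispatch."""
    STATES[enemy.state][1](enemy, events)
```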

Entities don’t have to be moving objects to benefit from state machines. For example, the game itself has states corresponding to displaying the title screen, displaying the “Get Ready” screen, playing the level start music, and running the main game. The title screen has states for sliding letters in and animating the letters. It’s an extremely useful concept for organizing game code.

Object-Oriented Design

In general, when developing software, I follow the motto “make it work, then make it pretty.” This means that, in the beginning, exploratory phase of development, code organization is given second priority to getting something working at all. Once the code works, then it’s time to refactor the code and make it nice and neat. Often, with larger projects, this cycle repeats more than once. At some point, I’ve usually gotten enough things working (but not pretty) that adding anything new is challenging. That’s when I know it’s time for a “make it pretty” phase. Once that’s done, I can add more features to the software.

During the development of this game, I went through several such cycles, and during one of the “make it pretty” phases, I noticed I had a lot of repeated code for entities that move around the screen (e.g. ships and bullets). I wanted to try to collapse this duplicated code down to one common set of routines, and even though I was working in assembly language, I found utility in the principles of object-oriented programming (OOP).

Of course, the OOP techniques I employed were rudimentary compared to the capabilities of languages like C++ or Java. Data encapsulation (e.g. private or protected), for example, is more of a high-level language feature, and in assembly language, there is no enforcement of such things. But, objects themselves can be done, or at least approximated, in assembly language just as well as in any high-level language.

Going back to the moving entity example, I identified all of the common variables used by these entities and grouped them together into a single data definition (struct). ca65’s .STRUCT directive was very useful for this. I was then able to declare multiple instances of that struct, one for each moving entity, and pass a pointer to that struct into the routines that implemented motion.

I even have one example of simple inheritance. The struct for enemies has all of the data members that a moving entity struct has, but it adds another after those to keep track of the reaction time delay for firing its gun. Because the enemy struct starts with all of the same data as the moving entity struct, I can pass a pointer to an enemy struct to a routine expecting a moving entity struct, and everything works.
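The shared-prefix trick can be illustrated in Python with the `struct` module and a byte array standing in for system RAM. The field layout (x, y, dx, dy, then a reaction timer) is hypothetical and much smaller than whatever the game’s real structs contain.

```python
import struct

MOVING_ENTITY = struct.Struct("<hhbb")   # x, y, dx, dy
ENEMY = struct.Struct("<hhbbB")          # same fields, then reaction timer


def move(memory, address):
    """Apply velocity to position for the moving entity at `address`."""
    x, y, dx, dy = MOVING_ENTITY.unpack_from(memory, address)
    MOVING_ENTITY.pack_into(memory, address, x + dx, y + dy, dx, dy)


# An enemy record works with move() because it starts with the same fields.
memory = bytearray(32)
ENEMY.pack_into(memory, 8, 100, 50, -1, 2, 24)
move(memory, 8)
```

After the call, the enemy record reads back as `(99, 52, -1, 2, 24)`: the position has changed, and the extra field past the shared prefix is untouched.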

Interpreters

A few things in the game needed to be “scripted,” or follow a predetermined sequence of steps. For example, the music follows a fixed sequence of notes, and animations follow a fixed sequence of frames. For each of these, to make changing the “scripts” easier, I implemented a rudimentary interpreter. Such an interpreter is given a list of commands, each with its parameters, stored in read-only memory (ROM).

The interpreter looks at the first command, executes it (i.e. performs the specified action), then goes to the next command in the list to execute that one. Some commands can change which instruction is next, to enable the creation of loops. This was useful for the title/background music as well as the title screen animations, both of which loop. For the animation interpreter, I implemented several commands, including sprite positioning, setting the sprite image, setting the sprite color, and a delay command. I was impressed by the utility of this approach, and I expect to use it again if/when I write another game.
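A toy version of such an interpreter fits in a few lines of Python. The command names and the tuple format are invented for this sketch, and instead of driving the video chip it just records which actions it would perform.

```python
def run_script(script, max_steps):
    """Execute up to max_steps commands; return the actions performed."""
    actions = []
    pc = 0                               # index of the next command
    for _ in range(max_steps):
        if pc >= len(script):
            break
        op, arg = script[pc]
        pc += 1
        if op == "jump":
            pc = arg                     # loop by jumping to an earlier command
        else:
            actions.append((op, arg))
    return actions


# A two-frame animation that loops forever.
BLINK = [
    ("set_image", 0),
    ("delay", 10),
    ("set_image", 1),
    ("delay", 10),
    ("jump", 0),
]
```

Changing the animation means editing the `BLINK` table, not the interpreter.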

Closing Thoughts

I am quite proud of what I achieved with this project. I feel I met all of my goals, learned a lot, and added something to the Commander X16 community. I think the game is fun to play too! I’ve spent several hours playing it already. As usual, if you have questions on any particular topic, feel I haven’t explained something well enough, have encountered bugs in the game, have a suggestion for improvement, or even have ideas for future projects or topics, I’d love to hear from you. Feel free to leave a comment below, or contact me directly if you have my contact info.

ApOPL3xy Hardware Design

Nate Barney — Mon, 09 Oct 2023 02:03:48 +0000

I’ve been working on designing and building a MIDI synthesizer (called the ApOPL3xy) based on the OPL3 FM synthesis chip and the ATmega1284 microcontroller. I’ve made a couple of posts about it (here and here) and have gotten some good questions from some people about how this or that works under the hood. So, for this post, I thought I’d dig a little deeper into the technical details. I’ll go over the various components and how they’re connected together. If needed, I may make other posts focusing on some of the components in further detail.

Microcontroller

The microcontroller that drives the system is the Microchip ATmega1284. (It used to be made by Atmel, before Microchip bought them.) The ATmega1284 is an 8-bit AVR microcontroller, similar to but more powerful than the ATmega328P at the heart of the Arduino Uno R3.

Hardware

Chips in the AVR line of microcontrollers have integrated flash memory for storing programs, SRAM for program variables and stack, and an EEPROM for non-volatile data storage. Most have hardware implementations of several communication protocols, such as SPI, I²C, and TTL Serial (basically RS-232 but with different voltage levels).

Whereas the ATmega328P has 32 kB of flash, 2 kB of SRAM, and 1 kB of EEPROM, the ATmega1284 has 128 kB of flash, 16 kB of SRAM, and 4 kB of EEPROM. I had initially designed the ApOPL3xy around the ATmega328P, but soon found the memory constraints too limiting, especially the 2 kB of SRAM. So, I upgraded to the ATmega1284. 128 kB of program flash and 16 kB of on-chip SRAM should be more than enough for the initial version, and leave plenty of room for future expansion.

The microcontroller in the ApOPL3xy runs at 5V and 20 MHz. A quartz crystal is used to generate the clock signal. 20 MHz is the maximum clock frequency for which the ATmega1284 is rated, and so far, it seems to be sufficient.

Programming

Arduino boards each have a USB port connected to a USB-to-serial bridge chip, which is used by special bootloader code on the microcontroller to receive new program code, so that all one needs to program an Arduino is a USB cable. When using a bare microcontroller as the ApOPL3xy does, the way to program it is ISP (In-System Programming), sometimes called ICSP (In-Circuit Serial Programming). This uses the SPI bus to upload code to the microcontroller (configured as the slave). However, to do this, one needs a separate device to act as the SPI master and upload the code to the microcontroller. I’m using the AVRISP-mkII programmer for this project. There are many others, but I have read that some of them have trouble with chips that have more than 64 kB of flash. I have, so far, not had any problems with the AVRISP-mkII, but note that the compiled firmware is still less than 64 kB in size, so that’s not entirely conclusive.

I’m using the PlatformIO IDE to edit, compile, and upload code to the microcontroller. This could be done with the official Arduino IDE as well (with the MightyCore board definitions added), and I was doing it that way for a while. However, the Arduino IDE is not very developer-friendly for large projects, and it makes simple things like having multiple source files much more difficult than they should be. PlatformIO, on the other hand, is much easier to use and I find myself fighting with it much less than with the Arduino IDE. PlatformIO is a Visual Studio Code extension, so you’ll need that installed, but it’s cross-platform, supports a large number of microcontrollers, and can use libraries written for the Arduino IDE. In theory, it even supports debugging over JTAG if the chip supports it. I haven’t tried that yet, but the ATmega1284 does claim to support it. PlatformIO is still a tad rough around the edges, but it’s so much better than the Arduino IDE experience that I highly recommend it.

Peripheral Devices

The ApOPL3xy has several peripheral devices attached to the microcontroller. “Peripheral” here means that the device is separate from the microcontroller chip, not necessarily that it’s detachable from the circuit board. Each of these peripherals communicates with the microcontroller using the SPI protocol. The one exception to this is the MIDI input port, which communicates using TTL Serial and is therefore connected to one of the microcontroller’s USART (i.e. hardware serial) interfaces.

The ApOPL3xy contains the following peripheral devices:

OPL3 FM synthesis chip (Yamaha YMF262)
20×4 LCD character display module (with Hitachi HD44780-compatible controller)
Input controls
- 2 EC11 incremental rotary encoders
- 10 tactile push buttons
MicroSD card module
128 kB SRAM (Microchip 23LCV1024)
128 kB EEPROM (Microchip 25AA1024)
MIDI 5-pin DIN input port
- I’m thinking about adding a MIDI thru port as well
Reset button

Reset Circuit

The reset button is connected to the reset pins of both the ATmega1284 and the OPL3. Since this signal is not being handled by a GPIO pin, software debouncing is not really an option. I also wanted the reset signal to start out active, so that the chips connected to it would be reset at power on. So, I designed a simple RC (resistor-capacitor) circuit, with its analog output converted to digital by means of two gates from a 74HC14 (six Schmitt-trigger inverters).

ApOPL3xy Reset Circuit

The reset signal for both chips is active low (as it is for most chips with a reset pin), so the way this circuit works is as follows. One lead of a 10 μF capacitor is connected to GND (0V), so the other lead starts out at 0V as well. The second lead is connected to V_CC (+5V) through a 10kΩ resistor and a 1kΩ resistor in series, so the capacitor gradually charges to +5V over time. After about 60 ms, the voltage at the second lead becomes high enough to trigger the first inverter, which goes low, and then the second inverter goes high, deactivating the reset signal.

When the reset button is pressed, it connects the second lead of the capacitor to ground through just the 1kΩ resistor, allowing it to discharge quickly, bringing the reset signal low (active). The signal stays low until the reset button is released, at which time, it takes about another 60 ms for the capacitor to charge up enough to bring the reset signal high again. If the button bounces when pressed, that high frequency oscillation is filtered out by the slow-charging capacitor, which is acting as a low-pass filter.
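
As a rough check on the timing described above, here is a small Python sketch of the standard RC charging formula, t = -RC·ln(1 - V_T/V_CC). The resistor, capacitor, and supply values come from the circuit description. The threshold `V_T_PLUS` is an assumption chosen for illustration, not a measured value; the actual positive-going threshold of a 74HC14 varies from part to part.

```python
import math

def rc_charge_time(r_ohms, c_farads, v_supply, v_threshold):
    """Seconds for a capacitor charging from 0 V through r_ohms to reach v_threshold."""
    return -r_ohms * c_farads * math.log(1.0 - v_threshold / v_supply)

R_CHARGE = 10_000 + 1_000  # the 10 kilohm and 1 kilohm resistors in series
C = 10e-6                  # 10 microfarads
V_CC = 5.0

# Assumed Schmitt-trigger threshold (hypothetical, not measured on the real board).
V_T_PLUS = 2.1

t = rc_charge_time(R_CHARGE, C, V_CC, V_T_PLUS)
print(f"time constant: {R_CHARGE * C * 1e3:.0f} ms")
print(f"time to reach {V_T_PLUS} V: {t * 1e3:.0f} ms")
```

With these numbers the time constant is 110 ms, and a threshold near 2.1 V is reached after roughly 60 ms, which is consistent with the delay quoted above.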

Here are a couple of captures from my oscilloscope illustrating the behavior of the circuit. In these images, the yellow trace measures the voltage across the capacitor, the pink trace measures the output of the first inverter, and the blue trace measures the output of the second inverter, which is the reset signal itself. The first image shows the behavior as the system is powered on. You can see the capacitor charging and how the Schmitt-trigger inverters react to it.

ApOPL3xy Power-On Reset Oscilloscope Capture

The second image shows the behavior when the reset button is pressed. In this capture, you can see the capacitor discharge quickly as soon as the button is pressed, stay discharged until the button is released, and then start slowly charging again.

ApOPL3xy Reset Button Oscilloscope Capture

Custom SPI Interface

Three of the peripherals (MicroSD, SRAM, EEPROM) connected to the microcontroller natively communicate using SPI. The others do not, but in order to conserve GPIO pins, I built an SPI interface for them (except for the MIDI port) using shift registers. In a recent post, I described how this works in more detail. I could have used something a bit fancier, like the Microchip MCP23S17 SPI I/O expander, but those chips are more expensive and more proprietary than standard 74HC shift registers, and the shift registers work just fine.

The SPI interface is built from one 74HC595 (serial-in / parallel-out shift register), two 74HC165’s (parallel-in / serial-out shift registers), and one gate each from a 74HC14 and a 74HC125 (four tri-state buffers). I could have used a 74HC04 instead of a 74HC14, as I’m not using the Schmitt trigger functionality, but I already had a 74HC14 in the design for the reset button circuit, and I wasn’t using all of its gates. This setup provides eight output pins and sixteen input pins, and uses only three of the microcontroller’s pins. Two of these pins, SCK and MOSI, are shared among all SPI devices, so this really only uses one extra pin: SS.

Since the LCD character display and the OPL3 sound chip are controlled independently from one another, I was able to use the same eight output pins from the 74HC595 for both, and so they’re connected to each of these peripherals’ data busses. The sixteen input pins are connected to the two rotary encoders (three pins each) and the ten push buttons (one pin each).

OPL3 FM Synthesis Chip

If the ATmega1284 is the heart of the ApOPL3xy (and if you’ll forgive my briefly waxing poetic), then the Yamaha YMF262 (a.k.a. OPL3) is its soul. Well, perhaps brain and larynx would be a better metaphor; it just doesn’t have the same ring. But I digress…

The OPL3 is a sound synthesizer chip that implements frequency modulation (FM) synthesis. It was used in Creative Labs’ popular Sound Blaster 16 sound card released in 1992 for IBM-compatible PC’s, and many PC games of that era used it to produce their in-game music.

FM Synthesis

FM synthesis is a method of producing sound waves by combining simpler waves. The basic sound-producing unit in a synthesizer is the oscillator. An oscillator takes a few parameters: frequency/pitch, amplitude/volume, and waveform, and produces the sound wave with those characteristics. In the OPL line of synthesizer chips, and often with FM in general, oscillators are referred to as operators. The reason for this is the way FM synthesis uses oscillators.

Each voice, or independently controlled sound source, in FM is built from two or more oscillators. However, usually only one of these oscillators, called the carrier, actually produces sound. The other oscillators, called modulators for reasons which will soon be apparent, modulate the carrier by adjusting the carrier’s frequency up or down based on the amplitude of the waveform produced by the modulator. This can produce a wide variety of sounds, many of which approximate different musical instruments fairly well. With more than two oscillators, there may be multiple carriers sounding at once, and/or multiple modulators, modulating carriers or even other modulators. The reason FM oscillators are often called operators is that they operate on each other in this way.
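
To make the carrier/modulator relationship concrete, here is a minimal Python sketch of a two-operator voice using sine waves. It is an idealized floating-point model, not how the OPL3 computes its output internally. The function names and the `mod_index` parameter (how strongly the modulator affects the carrier) are made up for this example, and the modulator is applied to the carrier's phase, which is a common way to realize FM digitally.

```python
import math

SAMPLE_RATE = 49_716  # OPL3 output sample rate, in samples per second

def fm_sample(t, carrier_hz, modulator_hz, mod_index):
    """One sample of a two-operator voice at time t (seconds), in the range -1..1."""
    modulator = math.sin(2 * math.pi * modulator_hz * t)
    # The modulator's output shifts the carrier's phase; mod_index sets how strongly.
    return math.sin(2 * math.pi * carrier_hz * t + mod_index * modulator)

def render(n_samples, carrier_hz=440.0, modulator_hz=880.0, mod_index=2.0):
    """Render n_samples of the voice as a list of floats."""
    return [fm_sample(n / SAMPLE_RATE, carrier_hz, modulator_hz, mod_index)
            for n in range(n_samples)]
```

Setting `mod_index` to 0 removes the modulation and leaves a plain sine wave at the carrier frequency; raising it adds progressively more harmonics to the tone.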

The OPL3 contains thirty-six operators, which are paired together to provide eighteen independent two-operator voices. Up to six pairs of voices can be combined to create four-operator sounds, at the cost of a corresponding decrease in the number of simultaneous voices. If percussion mode is enabled, three two-operator melodic voices are exchanged for five percussion voices—one two-operator voice and four one-operator voices, for a total of twenty independent voices. (One-operator voices are not technically FM, since there’s no modulation happening, but they can be produced by the OPL3.)

Microcontroller Interface

The microcontroller interface to the OPL3 consists of a chip select pin (CS), a write enable pin (WR), a read enable pin (RD), two address pins (A0, A1), eight data pins (D0–D7), an interrupt request output pin (IRQ), and a reset pin (IC, or Initial Clear). To control the behavior of the synthesizer (e.g. play a note, adjust sound settings, etc.), the microcontroller uses these pins to set the value(s) of one or more registers within the chip. I won’t go into detail about the specific registers and their values, because the unofficial OPL3 Programmer’s Guide has already done an excellent job of that. They’re also described in the datasheet, but I find that does a rather poorer job.

To set a register, the microcontroller first needs to set the CS pin low (if it’s not already) to select the chip, the A0 pin low to indicate it’s selecting a register, the A1 pin either high or low to select one of the two banks of registers, and the D0–D7 pins to the index of the register to set. The WR pin is then pulsed low then high to write the register index. Next, A0 is set high to indicate the register’s value is being set, and D0–D7 are set to the new value for the register. The WR pin is once again pulsed low then high, this time to write the register value. Finally, if the microcontroller is done setting registers, it can set the CS pin high to deselect the chip.
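
The write sequence above can be sketched in Python as follows. This is only a model: `PinLog` is a hypothetical stand-in that records pin changes instead of driving real hardware, the pin names are plain strings, and `"D"` represents all eight data pins at once. It is not the ApOPL3xy firmware, which permanently selects the chip and drives the data pins through shift registers.

```python
class PinLog:
    """Hypothetical stand-in for GPIO: records every pin change in order."""
    def __init__(self):
        self.events = []

    def set(self, pin, value):
        self.events.append((pin, value))

def opl3_write(pins, bank, register, value):
    """Model the two-phase OPL3 register write: index first, then value."""
    pins.set("CS", 0)               # select the chip
    pins.set("A0", 0)               # A0 low: the data pins carry a register index
    pins.set("A1", bank & 1)        # A1 picks one of the two register banks
    pins.set("D", register & 0xFF)
    pins.set("WR", 0)               # pulse WR low then high to latch the index
    pins.set("WR", 1)
    pins.set("A0", 1)               # A0 high: the data pins carry the register value
    pins.set("D", value & 0xFF)
    pins.set("WR", 0)               # pulse WR again to latch the value
    pins.set("WR", 1)
    pins.set("CS", 1)               # deselect the chip

log = PinLog()
opl3_write(log, bank=0, register=0x20, value=0x01)  # arbitrary example values
```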

The ApOPL3xy connects the OPL3’s pins as follows. CS is connected to GND to permanently select the chip. A0, A1, IC, and WR are connected to individual GPIO pins on the ATmega1284, configured as output pins. (A0 actually shares a GPIO pin with the LCD module’s RS pin.) D0–D7 are connected to the eight shared output pins from the custom SPI interface built from shift registers. This pin sharing works because the OPL3’s WR is only brought low (active) when any other device using the shared pins is inactive. When WR is high, the OPL3 doesn’t care what the values of the address and data pins are (except for setup and hold times, but those are so short that, even at 20 MHz, it’s hard to violate them; see the datasheet for more details on that).

The IRQ and RD pins are rarely used. The OPL3 can generate timer interrupt signals on the IRQ pin to let the microcontroller know when a certain amount of time has elapsed, and the RD pin is only able to be used to get information about these interrupts. The Sound Blaster 16 did not use this feature, nor does the ApOPL3xy. Therefore, the ApOPL3xy connects the RD pin to V_CC to permanently disable reads, and leaves the IRQ pin disconnected.

Digital-to-Analog Conversion

The OPL3 doesn’t generate sound directly. Rather, it produces digital representations of waveform amplitude (called samples) at a rate of 49,716 samples per second. The samples are represented as 16-bit offset binary numbers, where 0 represents the most negative value, and 65,535 represents the most positive value. These samples are sent serially to a Digital-to-Analog Converter (DAC) chip, the YAC512, which was designed as a companion chip to the YMF262.
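
For readers unfamiliar with offset binary, this small Python sketch shows how a 16-bit offset binary sample maps to a signed value: the midpoint 32,768 corresponds to zero. The function name is made up for this example.

```python
def offset_binary_to_signed(sample):
    """Convert a 16-bit offset binary sample (0..65535) to a signed value (-32768..32767)."""
    if not 0 <= sample <= 0xFFFF:
        raise ValueError("sample must fit in 16 bits")
    return sample - 0x8000

# The extremes and the midpoint of the range.
lowest = offset_binary_to_signed(0)
middle = offset_binary_to_signed(0x8000)
highest = offset_binary_to_signed(0xFFFF)
```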

The YAC512 takes the digital samples and converts them to an analog waveform which can be sent to an amplifier and then to a loudspeaker. Each YAC512 supports two audio channels, and the YMF262 can connect to two YAC512’s, giving four audio channels. Each of the OPL3’s voices can be configured to be output to any combination of the four audio channels.

LCD Character Display Module

This is a pretty standard component for a lot of homebrew projects. It’s a twenty-column by four-row character display module with an LED backlight and an integrated controller (Hitachi HD44780). It also comes in a sixteen-column by two-row variety, but I wanted the little bit of extra space afforded by the larger module. Ben Eater has an excellent video (to be honest, all of his videos are excellent) in which he connects a 16×2 version to his 65C02-based breadboard computer. The datasheet for the HD44780 is surprisingly good, and details all of the instructions that can be sent to the module.

The microcontroller interface consists of eight data pins (D0–D7) and three control pins: enable (E), register select (RS), and read/write (RW). There are five other pins, two for power, two for backlight power, and one to set the contrast of the display, for a total of sixteen pins, but only eleven need to be connected to the microcontroller.

To send instructions to the LCD module, a microcontroller first needs to set RW low, to signal a write, and RS low, to signal an instruction. Next, D0–D7 are set to the instruction to send, and finally E is pulsed high then low to send the command. Sending data (i.e. characters) follows the same process except that RS is set high to signal data rather than an instruction, and D0–D7 are set to the ASCII value of the character to send.

The LCD module also supports a four-bit mode, in which only D4–D7 need to be connected. This would be one way to conserve GPIO pins, but the ApOPL3xy doesn’t take this approach. This mode takes twice as long to send instructions and data. Because each instruction and character is still eight bits, each one takes two cycles to send. However, the main reason the ApOPL3xy uses the eight-bit mode is that it simplifies the code to do so, and the GPIO pressure is dealt with another way (see below).

Character data can be read back out of the LCD module, if desired, by setting RS and RW high, pulsing E high then low, and reading the value of D0–D7. There is also a “busy” flag which can be read by setting RS low, setting RW high, pulsing E high then low, and reading the value of D7. The busy flag indicates that the HD44780 is still processing the last instruction it received.

If an instruction is sent while the busy flag is high, the instruction will not execute, and the HD44780 will take longer to complete its current action, so this should be avoided. However, the datasheet lists the maximum duration for each instruction, so another way to avoid this situation is simply to wait long enough between instructions. This can be slightly slower, but the longest delay needed is about 1.5 ms, and most are less than 50 µs, so it’s not terrible, and it simplifies the connections if reading doesn’t need to be supported. Therefore, this is the approach the ApOPL3xy takes.
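
The fixed-delay approach can be sketched in Python like this. The delay values (1,520 µs for the slow instructions, 37 µs for the rest) are commonly quoted HD44780 figures and should be treated as assumptions to check against the datasheet for a particular module. The `write` and `sleep_us` callbacks are hypothetical placeholders for whatever actually drives the pins and waits.

```python
CLEAR_DISPLAY = 0x01
RETURN_HOME = 0x02   # the lowest bit is "don't care", so 0x03 is also Return Home

SLOW_DELAY_US = 1520  # assumed worst case for the slow instructions
FAST_DELAY_US = 37    # assumed worst case for everything else

def instruction_delay_us(instruction):
    """How long to wait after sending an instruction instead of polling the busy flag."""
    if instruction == CLEAR_DISPLAY or (instruction & 0xFE) == RETURN_HOME:
        return SLOW_DELAY_US
    return FAST_DELAY_US

def send_instructions(instructions, write, sleep_us):
    """Send each instruction, then wait long enough for it to finish."""
    for instruction in instructions:
        write(instruction)
        sleep_us(instruction_delay_us(instruction))

# Example: record what would be sent and how long each wait would be.
sent, waits = [], []
send_instructions([0x38, 0x0C, 0x01, 0x06], sent.append, waits.append)
```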

The ApOPL3xy connects the LCD module’s pins as follows. RS and E are connected to GPIO pins on the ATmega1284, configured as outputs. (RS shares a pin with the OPL3’s A0 pin.) D0–D7 are connected to the eight shared output pins from the custom SPI interface built from shift registers. As with the OPL3, this pin sharing works because the LCD’s E pin is only brought high when any other device using the shared pins is inactive. When E is low, the LCD doesn’t care what the values of RS, RW, and D0–D7 are.

Some readers might be wondering whether the ApOPL3xy uses the Arduino LiquidCrystal library to control the LCD module, and if not, why not. The ApOPL3xy doesn’t use this library because the library expects all of the LCD’s pins to be connected directly to the microcontroller’s GPIO pins, and that’s not how the module is connected in this case. To do so would have used more GPIO pins than I would have liked. The LCD control code in the ApOPL3xy’s firmware has an API modeled after LiquidCrystal, because it has a decent design, and because that might be more familiar to some developers.

Input Controls

The user interface for the ApOPL3xy consists of the LCD character display module previously discussed and a number of input controls: two EC11 rotary encoders and ten momentary tactile push buttons. Each encoder has three output pins, and the ten push buttons have one each, for a total of sixteen. These outputs are connected to the sixteen inputs provided by the 74HC165’s in the custom SPI interface.

The shift registers in the SPI interface are part of the reason there are so many buttons. With six inputs needed for the two encoders, a single 74HC165 would only provide enough inputs for two more buttons. I felt that this wouldn’t be enough, so I added another 74HC165 with another eight inputs. Rather than let some of them go unused, I connected a button to each of them, for a total of ten. At present, the firmware only uses five of the buttons, but I’m sure I’ll be able to find uses for the others.

The buttons each have two pairs of pins, but the pins in each pair are shorted together, so there are effectively only two pins per button. I believe, but I’m not 100% certain, that the redundant pins are there to provide structural support when soldered to a circuit board. While the button is pressed, all four pins are shorted together.

EC11 rotary encoders are knobs that can be turned in arbitrarily many discrete steps in either direction, unlike a potentiometer, which turns continuously, but has limited extents. The encoders send a signal for each clockwise and counter-clockwise step, and they also serve as push buttons when pressed. If you’re curious to know how these work in more detail, check out my recent post on the topic.

These controls are all wired up in the ApOPL3xy to produce active-low signals. This means that each shift register input pin sees a low signal while, for example, the connected button is pressed, and a high signal otherwise. I don’t have a strong reason for making the controls’ outputs active-low rather than active-high. It would have worked just as well the other way.

The ApOPL3xy deals with the problem of contact bounce by applying a software debouncing algorithm. This algorithm acts as a state machine which only transitions states after a configured minimum time is spent in each state. In this case, the states are (effectively) idle, idle wait, active, and active wait. The wait states are the ones that require a minimum duration before transitioning to their non-wait counterparts. If there’s interest, I could make another post going into detail about the various hardware and software debouncing solutions, and the pros and cons of each.
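
Here is one way such a four-state debouncer could look, sketched in Python. The state names follow the description above, but the exact transitions, the 5 ms default, and the class and method names are assumptions made for this example; the firmware's actual logic may differ.

```python
from enum import Enum

class State(Enum):
    IDLE = 0
    ACTIVE_WAIT = 1
    ACTIVE = 2
    IDLE_WAIT = 3

class Debouncer:
    def __init__(self, min_ms=5):
        self.min_ms = min_ms      # minimum time to stay in a wait state
        self.state = State.IDLE
        self.entered_ms = 0

    def _enter(self, state, now_ms):
        self.state = state
        self.entered_ms = now_ms

    def update(self, raw_low, now_ms):
        """Feed one raw sample (True = pin reads low, i.e. pressed).

        Returns the debounced "pressed" state.
        """
        elapsed = now_ms - self.entered_ms
        if self.state is State.IDLE and raw_low:
            self._enter(State.ACTIVE_WAIT, now_ms)
        elif self.state is State.ACTIVE_WAIT and elapsed >= self.min_ms:
            self._enter(State.ACTIVE, now_ms)
        elif self.state is State.ACTIVE and not raw_low:
            self._enter(State.IDLE_WAIT, now_ms)
        elif self.state is State.IDLE_WAIT and elapsed >= self.min_ms:
            self._enter(State.IDLE, now_ms)
        # While in a wait state, the raw input is ignored, which hides the bounce.
        return self.state in (State.ACTIVE_WAIT, State.ACTIVE)

# A press that bounces at 1 ms and a release that bounces at 11 ms.
d = Debouncer(min_ms=5)
trace = [(0, True), (1, False), (2, True), (6, True), (10, False), (11, True), (16, False)]
pressed = [d.update(raw, t) for t, raw in trace]
```

In this trace, the debounced output goes high once at 0 ms and low once at 10 ms, even though the raw input changes several more times.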

MicroSD Card Module

The ApOPL3xy includes a MicroSD card module to allow it to read VGM (and eventually MIDI) files for playback, as well as to load and save data to and from the EEPROM. I’m looking into the possibility of using a bootloader for the ATmega1284 that can load new firmware from the SD card, so an ISP programmer wouldn’t be required, but my investigation into that is not yet complete.

The MicroSD module I’m using was meant for Arduinos. It’s a small board with a slot for the card, a pin header to connect to the microcontroller, and a few extra components to shift voltage levels between the 5V the microcontroller uses and the 3.3V that the card uses. I may, in some future version of the ApOPL3xy, ditch the separate module and just incorporate a card slot and a level shifter chip directly. The card itself contains all the circuitry needed for the actual control and data interface, so it would be relatively straightforward. The module is convenient however, so that’s what I’m using for now.

SD (and MicroSD) cards natively use a four-bit parallel protocol, and that’s the way most modern devices (e.g. computers, smartphones, etc.) talk to them. Using this mode enables the highest transfer rates that the card can support. However, these cards also provide an SPI interface, which is how most smaller microcontrollers, including the ApOPL3xy, talk to them. This method is slower, but still fast enough for how the ApOPL3xy uses the card, it’s supported in hardware by the microcontroller, and it only uses up one additional GPIO pin for the SPI SS signal.

One issue I ran into is that SD cards are not particularly well-behaved when it comes to SPI. Specifically, the cards don’t seem to release the MISO line when their SS line is brought high, interfering with all the other devices on the SPI bus. At least this is the case with the module I’m using, but I believe this is a general phenomenon. To fix this, I used a 74HC125 tri-state buffer chip to disconnect the SD card’s MISO line when its SS line is high. In fact, I already had a 74HC125 with unused gates in the circuit, because I needed to do the same thing with the 74HC165 shift registers as well.

The SPI interface to the SD card provides raw read/write access to the bytes stored on the card. However, that’s not really sufficient to be able to read and write files on the card. At least, not if the card needs to be able to be read and written by other devices as well. Most storage devices, including SD cards, organize the large amount of data stored on them by using a filesystem. Filesystems keep track of file metadata like filenames and file size and provide controllers with a way to work with files instead of just raw data.

The topic of filesystems is vast, and well outside the scope of this article, but a common filesystem used on SD cards is called FAT32. FAT32 was developed by Microsoft for Windows 95, as an extension to their earlier filesystems FAT16 and FAT12, developed for MS-DOS. FAT32 has become a de facto standard for removable media, as its relative simplicity compared to other filesystems has enabled many different computer operating systems and embedded systems to implement support for it. This is slowly being superseded by Microsoft’s more modern exFAT filesystem, but support for this is still far from ubiquitous.

The ApOPL3xy expects its SD card to be formatted with the FAT32 filesystem, and rather than implementing a FAT32 filesystem driver from scratch, it makes use of the SdFat library. This library takes care of both the low-level SD card interface for handling raw data and the FAT32 filesystem interface. It’s been in development for many years, and at the time of this writing, is still in active development. It manages to pack a tremendous amount of capability into a surprisingly small amount of space, and does so very efficiently. It even has optional support for exFAT, but ApOPL3xy doesn’t enable that, at least not yet.

External SRAM and EEPROM

Even though the ATmega1284 has much larger memory capacities than the ATmega328P, it’s still not enough to store as many synthesizer patches (i.e. sound settings to emulate various instruments) as I want to be able to. Each patch for the OPL3 (at least as I have currently implemented it) is 23 bytes, plus a 24-byte name string, giving 47 bytes per patch. The General MIDI specification includes 128 melodic instruments, and 61 percussive instruments, for a total of 189 * 47 = 8,883 bytes, more than half of the available SRAM space, and more than double the available EEPROM space. Furthermore, I’d eventually like to be able to store and select from multiple banks of patches.
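
The storage arithmetic above can be laid out in a few lines of Python. The patch sizes and instrument counts come from the text; the internal memory sizes (16 KiB of SRAM and 4 KiB of EEPROM) are the ATmega1284's datasheet figures, and the 128 KiB figure is the size of each external chip.

```python
PATCH_DATA_BYTES = 23
PATCH_NAME_BYTES = 24
MELODIC_INSTRUMENTS = 128
PERCUSSIVE_INSTRUMENTS = 61

INTERNAL_SRAM_BYTES = 16 * 1024    # ATmega1284 internal SRAM
INTERNAL_EEPROM_BYTES = 4 * 1024   # ATmega1284 internal EEPROM
EXTERNAL_CHIP_BYTES = 128 * 1024   # each external SRAM/EEPROM chip

patch_bytes = PATCH_DATA_BYTES + PATCH_NAME_BYTES
bank_bytes = (MELODIC_INSTRUMENTS + PERCUSSIVE_INSTRUMENTS) * patch_bytes

# How many complete banks of patches fit in one external chip.
banks_per_external_chip = EXTERNAL_CHIP_BYTES // bank_bytes

print(f"{bank_bytes} bytes per bank, {banks_per_external_chip} banks per external chip")
```

One bank takes 8,883 bytes, so a single 128 kB chip has room for 14 complete banks at this patch size.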

To deal with this, the ApOPL3xy includes an external (i.e. not part of the microcontroller) SRAM chip (Microchip 23LCV1024) and an external EEPROM chip (Microchip 25AA1024), each with 128 kB of space. These chips, like most of the other peripherals in the ApOPL3xy, communicate via the SPI protocol. The specific details of the command interface can be found in the datasheets for these chips, but the basic structure for both chips is: write a command byte (i.e. “read”, “write”), write a three-byte address, then read or write as many bytes as desired. Because these chips hold 128 kB of data, only seventeen bits are needed to encode an address, and the seven most significant bits of the three-byte address are ignored.

Writing to the EEPROM is slightly more complicated. Its storage is organized into 256-byte pages, and each write operation can only modify a single page. To write to multiple pages, multiple write operations must be performed. To write to a page, first the status register must be checked to ensure that the chip has completed its last operation. Then a “write enable” command byte must be sent, then the chip select line must be brought high and then low before the “write” command is sent, followed by the 3-byte address, followed by up to 256 bytes of data. Once the last byte of the page is written, the chip select is brought high, and the cycle is repeated for any additional pages.
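
The page handling can be sketched in Python as below. This only shows how a write could be split at 256-byte page boundaries and what bytes one write would contain; it does not talk to any hardware. The function names are made up for this example, and the `WRITE` opcode value of 0x02 is my recollection of the datasheet, so verify it before relying on it.

```python
PAGE_SIZE = 256
WRITE = 0x02  # assumed opcode value; check the 25AA1024 datasheet

def split_into_pages(address, data):
    """Split a write into (address, chunk) pairs that never cross a page boundary."""
    chunks = []
    offset = 0
    while offset < len(data):
        start = address + offset
        room = PAGE_SIZE - (start % PAGE_SIZE)   # bytes left in the current page
        chunk = data[offset:offset + room]
        chunks.append((start, chunk))
        offset += len(chunk)
    return chunks

def write_frame(address, chunk):
    """Bytes sent for one page write: opcode, three-byte address (MSB first), data."""
    return bytes([WRITE]) + address.to_bytes(3, "big") + bytes(chunk)

# Writing 40 bytes starting 16 bytes before a page boundary takes two operations.
chunks = split_into_pages(0x1F0, bytes(range(40)))
```

Each chunk would still need the status check and the "write enable" command described above before its frame is sent.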

MIDI Port

The ApOPL3xy uses the MIDI protocol to allow an instrument (like a keyboard) to control the FM synthesis chip. MIDI is a fairly straightforward protocol, and the electrical interface is reasonably simple to implement. The MIDI port itself is a female 5-pin DIN jack, and there are a few other components, such as an optocoupler and a few resistors, to provide isolation and level-shifting. The data signals transmitted via this port use the same asynchronous serial framing as RS-232 (although MIDI carries them on a current loop rather than at RS-232 voltage levels), at 31,250 baud with one start bit, eight data bits, one stop bit, and no parity bits per frame. The USARTs built into the ATmega1284 can directly process this kind of signal once the optocoupler has converted it to logic levels, so the data line from the MIDI-In port is connected to one of these.
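
To give a feel for the numbers, here is a short Python sketch that works out how long one MIDI byte takes on the wire (assuming the usual one start bit in addition to the data and stop bits) and decodes a three-byte Note On message. The function names are invented for this example, and the parser is deliberately simplified: it ignores running status and the convention that a Note On with velocity 0 means Note Off.

```python
BAUD = 31_250
BITS_PER_FRAME = 1 + 8 + 1   # start bit, eight data bits, one stop bit

def byte_time_us():
    """Microseconds needed to transmit one byte."""
    return BITS_PER_FRAME * 1_000_000 / BAUD

def parse_note_on(message):
    """Return (channel, note, velocity) for a 3-byte Note On message, else None."""
    if len(message) != 3:
        return None
    status, note, velocity = message
    if status & 0xF0 != 0x90 or note > 127 or velocity > 127:
        return None
    return (status & 0x0F, note, velocity)

# Middle C (note 60) on the first channel at velocity 100.
event = parse_note_on(bytes([0x90, 60, 100]))
```

At 320 µs per byte, a complete three-byte Note On message takes just under a millisecond to arrive.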

At present, the breadboard version of the ApOPL3xy includes a single port for MIDI-In, but I will likely add a MIDI-Thru port as well. MIDI-Thru ports are ports which simply output whatever signals came in on the MIDI-In port. They’re also pretty simple electrically, and require no software support in the microcontroller, so it’s probably worth adding one, just in case it turns out to be useful.

Conclusion

This covers almost all of the hardware used in the ApOPL3xy. I did leave out the amplifier circuits, because I’m still fiddling around with their design. Once I sort that out to my satisfaction, I’ll likely make another post about that. As always, I hope this has been interesting and informative. Let me know if anything is unclear, or if you’d like more detail on anything in particular.

I Drove a Lamborghini

Nate Barney — Sun, 08 Oct 2023 19:44:21 +0000

Last Christmas, my girlfriend Donnett got me a gift certificate to a company that offers supercar driving experiences, and I recently cashed it in. I could choose from 8 different cars, but it’s been a dream of mine for a long time to drive a Lamborghini, so it wasn’t a difficult decision to go with the Lamborghini Huracán. Yesterday, we went out to the Pocono Raceway and I drove it three laps around the track.

The Lamborghini Huracán I drove

The car has a top speed of about 200 mph, and I had naïvely hoped to get close to that, but it took me a little more time than I thought it would to get a feel for the car and build up confidence. It was also unfortunately raining, and I was a little nervous about skidding, though I probably didn’t need to be. The instructor was encouraging me to go faster around the corners, and when I did, the car was extremely stable. I got up to a top speed of about 85 mph, and while I had hoped to do better, it really was so much fun.

Donnett took a few videos with her phone, and here’s a good shot of the exterior of the car on the straightaway.

I also got a really nice video from the company running the event that shows the view from inside the car, with all sorts of metrics superimposed. (The brakes are really squeaky, but it turns out that’s just how they are on a Lamborghini.)

I’m definitely going to do this again (hopefully while it’s sunny) and with the experience I have under my belt now, I think I’ll really be able to push some limits next time.

Moon Blaster

Nate Barney — Sat, 19 Aug 2023 04:28:24 +0000

When I was about 11 or 12 years old, I wrote a simple game in Commodore 128 BASIC. It’s not very fancy, but it is a complete game, with sound, graphics, a title screen, and even a backstory. The object of the game is to fire missiles at a moon moving across the top of the screen. The player can move the missile launcher to one of three bases using the joystick, and (of course) launch missiles with the fire button. By any standard, it’s not a great game, but I was pretty proud of it at the time.

The C128 included Commodore BASIC 7.0, which was much better than the older Commodore BASIC 2.0 that was on the C64. BASIC 7.0 had lots of additional commands to control the graphics and sound chips in the machine, read the joysticks, save and load binary files between RAM and disk, etc., and the computer came with a really nice system guide [PDF] that described all of them. 6502 machine language was out of my reach at the time—I didn’t have an assembler, or even any reference material. The C128 included a machine language monitor with a rudimentary assembler, but I had no idea what it was for. Nevertheless, I decided to try to make a game using the enhanced C128 BASIC, and Moon Blaster was the result.

Moon Blaster screenshot

Machine Language Port

My recent foray into C64 game reverse-engineering gave me an idea: re-write Moon Blaster in assembly language for the C64. I’ve never written a game in machine language for the C64 before, and this game is pretty simple—great for a first project. I had a couple main goals: 1) stay as true as possible to the original game, and 2) build both disk and cartridge versions. I’m pleased to report that I was successful. Here’s a short video of the new version of the game (headphone warning, it might be a bit loud):

Moon Blaster ML gameplay

If you’d like to play the game, you can download it here. The download contains cartridge and disk images for the new version, as well as the assembly source code for it. It also includes a disk image of the original BASIC game, if you’d like to compare the two. You’ll need a C64/C128 emulator (VICE is a good one), or a real C64/C128. (If you’re using real hardware, I’ll assume you know how to get the images to it. If not, let me know.) The easiest way to start the game in VICE is to attach the cartridge image, either by using the menu or by pressing Alt+C. You’ll need to set up the joystick as well. VICE works with a game controller, or can emulate a joystick using the keyboard. It’s pretty straightforward to set up, but if you have trouble, let me know.

It’s Not a Bug

An amusing story (at least to me) about Moon Blaster is the way I inadvertently introduced the concept of a critical hit. The moving objects on the screen—the moon, missile launcher, and missiles—were implemented as sprites. C128 BASIC included commands for sprite collision detection, which I used to detect when the player got a hit. When two sprites touch, the program jumps to a specified line, which in this case plays the hit sound, updates the score, etc.

Unfortunately, BASIC is so slow that, in the original game, if the player hit the moon close to dead center, the missile collided twice before it was moved back to the launcher. This had the effect of doubling the hit sound and animation, and scoring twice instead of once. I recall trying to fix it, and being stumped. Ultimately, in the grand tradition of “It’s not a bug; it’s a feature,” I decided just to add it to the instructions as an “intentional” bonus. However, the machine language version runs much faster, so this didn’t happen, and to stay true to the original, I actually had to write code specifically to replicate this ~~bug~~ feature.

Technical Stuff

The rest of this post contains technical details intended to be of interest to other programmers. If you’re not a programmer, you’re welcome to stick around, but if you want to bail at this point, I won’t mind. I won’t go over all of the game code, but there were a few things I was pretty happy with, and I thought they were worth talking about. The assembler I used for this project is ca65, which is part of the cc65 compiler suite. In case it’s helpful, here’s a link to a 6502 instruction set reference.

Linear Feedback Shift Register

The original game initially used BASIC drawing commands to draw the star field. This was painfully slow, so I ended up changing it to load the bitmap memory from a disk file, which I had saved after running the drawing routines. This was still pretty slow, but was a noticeable improvement. When doing the assembly port, I had the raw power of a 1 MHz processor at my command, so I decided the game should draw the stars at startup every time. Doing this in machine code is way faster than loading a screen image from disk. Plus, with the cartridge version, there wouldn’t even be a disk.

However, I wanted the stars to be the same every time, too, so I needed a pseudorandom number generator. The SID sound chip in the C64 can be used to generate random numbers, but there’s not a good way to seed it. So using that approach, the stars would change every game. Instead, I implemented a 16-bit Linear Feedback Shift Register (LFSR) routine, and sampled bits from it to get pseudorandom coordinates for drawing the stars. I used the constants from the Wikipedia article, since that seemed as good a set as any. I don’t claim it’s the most optimal LFSR ever written for the 6502, but I think it’s pretty slick:

.EXPORTZP LFSR_STATE = $fd

.CODE

.PROC lfsr_16

    ; get XOR of bits 15 and 13
    lda LFSR_STATE+1
    lsr a
    lsr a
    eor LFSR_STATE+1

    ; get XOR of bit 12
    lsr a
    eor LFSR_STATE+1

    ; get XOR of bit 10
    lsr a
    lsr a
    eor LFSR_STATE+1

    ; put result of XORs into carry flag
    lsr a
    lsr a
    lsr a

    ; rotate carry flag into LFSR_STATE as the LSB
    rol LFSR_STATE
    rol LFSR_STATE+1

    rts
.ENDPROC ; lfsr_16
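
For readers who would rather experiment on a PC, here is a Python model of the same shift register. It is a sketch derived from reading the assembly above (XOR of bits 15, 13, 12, and 10, shifted in as the new least significant bit), not something taken from the game's source. The function names and the seed value 0xACE1 are arbitrary choices for this example.

```python
def lfsr16_step(state):
    """Advance the 16-bit LFSR by one step, mirroring the assembly routine."""
    # XOR together bits 15, 13, 12, and 10 of the current state.
    bit = ((state >> 15) ^ (state >> 13) ^ (state >> 12) ^ (state >> 10)) & 1
    # Shift left by one and bring the XOR result in as the new LSB.
    return ((state << 1) | bit) & 0xFFFF

def lfsr16_period(seed):
    """Count steps until the register returns to its starting (nonzero) value."""
    state = lfsr16_step(seed)
    count = 1
    while state != seed:
        state = lfsr16_step(state)
        count += 1
    return count

period = lfsr16_period(0xACE1)
```

Starting from any nonzero seed, the register cycles through every nonzero 16-bit value before repeating, and the same seed always produces the same sequence, which is what makes the star field identical on every run. A seed of zero stays at zero forever, so it should be avoided.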

Raster Interrupt Split-Screen

The game is mostly graphical, but there are text elements to display the score and number of shots fired. C128 BASIC has a command (CHAR) to draw characters on a bitmap screen. I could have taken the same approach with the machine language port, but that would involve banking in the character ROM and copying the bitmap data for each character. It would be slow (although I had plenty of cycles to do it since the game is so simple) and the code would be ugly. Instead, I decided to go a different direction: changing between graphics and text mode in the middle of the frame.

The VIC-II video chip in the C64 can be configured to cause an interrupt when it reaches a specified scan line, a so-called raster interrupt. To achieve a split-screen effect, one can set the raster interrupt to fire at the top of the screen and have the interrupt handler turn on graphics mode there. The handler then moves the raster interrupt to a line further down the screen, and when it fires again, the handler turns on text mode and moves the interrupt back to the top. There are a few other details to take care of, but this works surprisingly well. To print the score, I can simply place the characters in the right location in screen memory. Here’s the code I used to enable the interrupt, and the interrupt handler that implements the split-screen:

.INCLUDE "vic2.inc"

IRQ_VECTOR = $0314
KERNAL_ISR = $ea31
CIA1_IRQ = $dc0d
CIA2_IRQ = $dd0d

.DATA

FRAME_SYNC: .res 1

.CODE

.PROC setup_irq
    sei

    ; disable CIA-1 interrupts
    lda #%01111111
    sta CIA1_IRQ

    ; clear high bit of raster counter
    and VIC2::CR1
    sta VIC2::CR1

    ; acknowledge pending interrupts
    lda CIA1_IRQ
    lda CIA2_IRQ

    ; set raster interrupt
    lda #0 ; interrupt on this raster line
    sta VIC2::RST

    ; clear frame sync variable
    sta FRAME_SYNC

    ; set interrupt vector
    lda #<split_screen_isr
    sta IRQ_VECTOR
    lda #>split_screen_isr
    sta IRQ_VECTOR+1

    ; enable raster interrupts
    lda #%00000001
    sta VIC2::IMA

    cli
    rts
.ENDPROC ; setup_irq

.PROC teardown_irq
    sei

    ; disable raster interrupts
    lda #%00000000
    sta VIC2::IMA

    ; acknowledge pending interrupts
    asl VIC2::IRQ ; acknowledge VIC-II raster interrupt by clearing low bit

    ; enable CIA-1 interrupts
    lda #%11111111
    sta CIA1_IRQ

    ; restore interrupt vector
    lda #<KERNAL_ISR
    sta IRQ_VECTOR
    lda #>KERNAL_ISR
    sta IRQ_VECTOR+1

    cli
    rts
.ENDPROC ; teardown_irq

.PROC split_screen_isr

    ; clear decimal flag
    cld

    ; look at current raster line to see if we should turn bitmap on or off
    lda VIC2::CR1
    ldx VIC2::RST
    cpx #100
    bcs bitmap_off

    ; turn bitmap mode on
bitmap_on:
    inc FRAME_SYNC
    ora #%00100000  ; bitmap on
    sta VIC2::CR1
    lda VIC2::PTR
    ora #%00001100  ; set bitmap memory to $2000-$3FFF
    and #%11111100
    sta VIC2::PTR
    lda #209        ; next interrupt raster line
    jmp return

    ; turn bitmap mode off
bitmap_off:
    and #%11011111  ; bitmap off
    sta VIC2::CR1
    lda VIC2::PTR
    ora #%00000100  ; set bitmap memory to $0000-$1FFF
    and #%11110100
    sta VIC2::PTR
    lda #0          ; next interrupt raster line

return:
    sta VIC2::RST
    asl VIC2::IRQ   ; acknowledge VIC-II raster interrupt by clearing low bit
    jmp KERNAL_ISR
.ENDPROC ; split_screen_isr
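
Stripped of the register details, the handler is a two-state toggle. Here’s a minimal Python sketch of just that decision, assuming the same constants as the assembly (the threshold of 100 and the split at line 209). The function name and the example register value are mine, and the video memory pointer updates are left out.

```python
from typing import Tuple

BITMAP_BIT = 0b00100000  # bit 5 of VIC-II control register 1
SPLIT_LINE = 209         # where the text area begins
TOP_LINE = 0             # top of the frame


def on_raster_irq(cr1: int, raster_line: int) -> Tuple[int, int]:
    """Return (new control register value, raster line for the next interrupt)."""
    if raster_line >= 100:                      # cpx #100 / bcs bitmap_off
        return cr1 & ~BITMAP_BIT & 0xFF, TOP_LINE
    return cr1 | BITMAP_BIT, SPLIT_LINE         # bitmap_on
```

Comparing against 100 rather than testing for an exact line means the handler doesn’t care exactly which raster line the register reports by the time the code gets around to reading it; that reading of the design is my guess, not something the code states.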

Exit to BASIC

C64 games rarely include an option to exit the game and return to BASIC. Even though 64K of RAM was pretty roomy for the time, most games needed a great deal of it, and so banked out the BASIC ROM and clobbered BASIC memory areas. It’s also possible this was used as a rudimentary form of copy protection. However, the original Moon Blaster included an option to quit; being a BASIC program itself, this was simple to do.

I wanted the machine language port to have this same capability. It was in the original game, and I’m not worried about copy protection. The game doesn’t take much memory (less than 5K of code and data, not counting the 8K bitmap and 1K text screen memory areas), and I was able to place everything so as not to interfere with BASIC. To exit to BASIC, the disk-based game can simply jump to the BASIC warm-start routine $E38B. However, there are two problems with this.

The first problem is that just jumping to BASIC warm-start doesn’t work at all if you’re running the cartridge version, because BASIC isn’t initialized in that case. To address this, I call a few BASIC ROM routines from the cartridge entry point to initialize BASIC to the point that it can be warm-started. Here’s the code I used to do that:

.INCLUDE "main.inc"
.SEGMENT "START"

; entry point which jumps away to BASIC, so it works right from a cartridge
.PROC start

    ; initialize the KERNAL
    jsr $fda3 ; initialize I/O devices
    jsr $fd50 ; test RAM and initialize memory pointers
    jsr $fd15 ; initialize KERNAL vectors
    jsr $e518 ; initialize the screen and keyboard
    cli ; enable interrupts

    ; initialize BASIC
    jsr $e453 ; initialize BASIC vectors
    jsr $e3bf ; initialize BASIC RAM

    ; run the game
    jsr main

    ; jump to BASIC
    ldx #$80 ; no error code
    jmp $e38b ; BASIC warm start

.ENDPROC ; start

The second problem is that I wanted the user to be able to re-start the game simply by typing RUN. This kind of works with the disk version, but that runs the BASIC loader program and loads the whole game from disk again, even though it’s still in memory. With a cartridge, there would be no BASIC program to RUN at all. To deal with this, I included a simple BASIC program to restart the game in the constant data section. Before exiting, I replace whatever BASIC program is loaded (possibly none) with this program, which immediately starts the game again when run. To do this, not only the BASIC program area at $0801 needs to be populated, but also several pointers that BASIC uses, starting at $2B in zero page. (The details about these locations can be found in a C64 memory map.) Here’s the code I used to do that:

.RODATA

; BASIC program to re-start the game
;
; 10 SYS 32804
;
BASIC_LOADER_ADDR = $0801
BASIC_LOADER:
.byte $0d, $08, $0a, $00, $9e, $20, $33, $32, $38, $30, $34, $00, $00, $00
BASIC_LOADER_SIZE = * - BASIC_LOADER

BASIC_POINTERS_ADDR = $2b
BASIC_POINTERS:
.byte $01, $08, $0f, $08, $0f, $08, $0f, $08, $00, $80, $00, $00, $00, $80
BASIC_POINTERS_SIZE = * - BASIC_POINTERS

.SEGMENT "MAIN"

; entry point which does an RTS, so it works right from BASIC's SYS command
.PROC main

    ; initialize SID
    jsr init_sid

    ; initialize screen
    jsr init_screen

    ; show title screen
    jsr title_screen

    ; restore screen
    jsr restore_screen

    ; put the BASIC loader into BASIC program memory
    lda #<BASIC_LOADER
    sta $fb
    lda #>BASIC_LOADER
    sta $fc
    lda #<BASIC_LOADER_ADDR
    sta $fd
    lda #>BASIC_LOADER_ADDR
    sta $fe
    ldy #0
loop1:
    lda ($fb),y
    sta ($fd),y
    iny
    cpy #BASIC_LOADER_SIZE
    bne loop1

    ; update the BASIC pointers for the new BASIC program just copied
    lda #<BASIC_POINTERS
    sta $fb
    lda #>BASIC_POINTERS
    sta $fc
    lda #<BASIC_POINTERS_ADDR
    sta $fd
    lda #>BASIC_POINTERS_ADDR
    sta $fe
    ldy #0
loop2:
    lda ($fb),y
    sta ($fd),y
    iny
    cpy #BASIC_POINTERS_SIZE
    bne loop2

    rts

.ENDPROC ; main
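
The two byte tables are easier to trust once they’re decoded. This Python sketch unpacks them the way BASIC would read them. It’s only an illustration: the helper names are mine, the token table knows about SYS and nothing else, and the pointer labels are the conventional memory-map names, added here as annotations.

```python
BASIC_LOADER = bytes([0x0d, 0x08, 0x0a, 0x00, 0x9e, 0x20, 0x33, 0x32,
                      0x38, 0x30, 0x34, 0x00, 0x00, 0x00])
BASIC_POINTERS = bytes([0x01, 0x08, 0x0f, 0x08, 0x0f, 0x08, 0x0f, 0x08,
                        0x00, 0x80, 0x00, 0x00, 0x00, 0x80])
TOKENS = {0x9E: "SYS"}  # only the one token this program uses
POINTER_NAMES = ["TXTTAB", "VARTAB", "ARYTAB", "STREND",
                 "FRETOP", "FRESPC", "MEMSIZ"]  # $2B, $2D, ... $37


def word(data, offset):
    """Read a little-endian 16-bit value."""
    return data[offset] | (data[offset + 1] << 8)


def decode_line(data):
    """Decode one stored BASIC line: link pointer, line number, text."""
    next_line = word(data, 0)
    line_number = word(data, 2)
    parts = []
    i = 4
    while data[i] != 0x00:  # a zero byte ends the line
        parts.append(TOKENS.get(data[i], chr(data[i])))
        i += 1
    return next_line, line_number, "".join(parts)


def decode_pointers(data):
    """Pair each 16-bit pointer with its conventional name."""
    values = [word(data, i) for i in range(0, len(data), 2)]
    return dict(zip(POINTER_NAMES, values))
```

Decoding the loader gives line 10 with the text `SYS 32804` and a link to $080D, where the two zero bytes mark the end of the program. The first pointer is the program start at $0801, and the next three all sit at $080F, the first byte after the 14-byte program.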

With those two problems solved, the user can cleanly exit the game and then jump right back in, with no load time, regardless of whether the game is in cartridge or disk form. It’s not earth-shattering or anything, but I think it’s pretty cool.

Closing Thoughts

I had a lot of fun doing this project, and learned quite a few things. It was neat to revisit something I wrote so many years ago, and to breathe new life into it. The game won’t win any awards, but that’s okay. If you download and play the game, or look at the code, and have any questions about how I did this or that, feel free to ask. If you have any suggestions on how I could have done things better, I’d love to hear about that too.

Quest for Tires

Nate Barney — Sat, 05 Aug 2023 08:32:35 +0000

This is a story of nostalgia, retro-computing, gaming, reverse-engineering, and fixing a decades-old bug. It might be a bit long, and can get rather technical in places (a lot of places), but if you’re into that sort of thing, I think you may find it enjoyable.

The Game

When I was a kid, my family had a Commodore 128 (C128) computer. The C128 was the successor to the famous Commodore 64 (C64), and it could be booted into a C64-compatible mode. I used this computer all the time, for playing games, writing papers for school, and, of course, programming. Probably most of my time on the computer was spent playing games. One of my favorites was a game called B.C.’s Quest for Tires. It’s a pretty simple side-scrolling game in which the player moves steadily to the right, and has to jump and duck obstacles to proceed.

Quest for Tires screenshot

I had, and still have, a copy (of dubious provenance) of the C64 version of the game on a floppy disk. A couple of years ago, I used a ZoomFloppy with my Commodore 1571 disk drive to make images of many of my old Commodore disks, including the one containing Quest for Tires. I loaded the game up in the VICE emulator and started playing, but the nostalgia I felt was quickly and rudely interrupted when the game crashed.

Quest for Tires Gameplay and Crash

One of the things the player can do in the game is to change the main character’s speed, with a minimum speed of 10 and a maximum of 80. Partway through the game, the minimum speed is increased to 40. (There are no units specified, but I assume it’s meant to be in miles per hour.) The faster you go, the more difficult the game is, but the more points you get along the way. Unfortunately, every time I got the speed up to 80, the game crashed. Everything else seemed to work fine, so I just kept my speed below 80, but it was annoying that the game crashed at all.

A few days ago, I was re-watching 8-Bit Show and Tell’s YouTube video about fixing an old bug in the C64 version of Pac-Man, and I was inspired to see if I could resolve the mystery of the Quest for Tires crash, and possibly even fix it. (As an aside, if you have any interest in retro-computing in general, and Commodore computers in particular, you should definitely check out that channel. Right now. I’ll be here when you get back.)

Finding the Bug

To figure out what’s going on, let’s use VICE’s built-in machine language monitor, which is a debugging tool that can examine and modify code and data in the machine’s memory. Some reference material to keep handy will be:

Joe Forster’s Commodore 64 memory map
Michael Steil’s Ultimate Commodore 64 Reference, especially the C64 KERNAL API section

The game on disk consists of two files: "QUEST FOR TIRES", which is a short BASIC program that loads and executes the second file, "QFT.8000-C010", which is the machine code for the game itself. Based on the filename, it seems pretty likely that the machine code gets loaded to the addresses $8000 through $C010. (Note: the convention with 6502-based computers, like the C64, is to represent hexadecimal numbers like addresses with a leading $, and binary numbers with a leading %.)

Since the crash is clearly related to changing the player’s speed, a good place to start might be to look for the code that updates the speed. To change the character’s speed, the player presses and holds the joystick button and moves the joystick left (to slow down) or right (to speed up), so we should look for instructions that read and process the joystick state.

Using the Monitor

VICE’s machine language monitor, when active, pauses the computer and presents a prompt where the user can enter commands to examine or modify the state of the computer, or to resume normal execution. Here’s a summary of the monitor commands we’ll be using, with abbreviations where applicable:

a – assemble instructions into memory
backtrace (bt) – display the chain of subroutine calls leading to the current instruction
bank – control which devices and memory ranges are visible in the monitor
break (bk) – set a breakpoint on a memory location
disass (d) – disassemble the contents of a memory range
goto (g) – resume execution, optionally jumping to a new location
hunt (h) – search through a memory range for a byte sequence
load (l) – load a file from disk into memory
mem (m) – display raw contents of a memory range
next (n) – execute a single instruction, stepping over subroutine calls
save (s) – save a region of memory to a file on disk
step (z) – execute a single instruction, stepping into subroutine calls

Reading the Joystick

On the C64, the state of joystick port #1 can be read from memory address $DC01, and the state of joystick port #2 can be read from $DC00. The value read from the joystick port is encoded in the lowest five bits, and each bit is active low, so 0 if pressed and 1 if not pressed. The five bits from right to left are:

Bit 0: joystick up
Bit 1: joystick down
Bit 2: joystick left
Bit 3: joystick right
Bit 4: fire button

For example, if the joystick is idle, the binary value read from the joystick port would be %00011111, or $1F in hexadecimal. If the joystick is pointed down and to the left, and the fire button is being pressed, the binary value read from the joystick port would be %00001001, or $09 in hexadecimal. (Actually, the top three bits will probably be different. They’re related to the state of the keyboard.)
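
Here’s the same decoding as a few lines of Python, in case the active-low logic is easier to read that way. The function and table names are made up for this sketch.

```python
JOY_BITS = {"up": 0, "down": 1, "left": 2, "right": 3, "fire": 4}


def decode_joystick(value):
    """Return the set of active inputs; a 0 bit means pressed (active low)."""
    return {name for name, bit in JOY_BITS.items() if not (value >> bit) & 1}
```

With this, `decode_joystick(0x1F)` is the empty set and `decode_joystick(0x09)` is down, left, and fire. The top three bits are ignored, so whatever the keyboard is doing to them doesn’t change the result.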

The game is played with the joystick connected to port #2, so it must have some code somewhere to read from $DC00. Let’s see if we can find it.

Bank Switching

There’s one more thing we need to deal with before running our search—bank switching. The C64’s 6510 CPU has a 16-bit address bus, which means it can address a maximum of 64k (65,536) different memory locations. The C64 has a full 64k of RAM, but it also has ROMs and I/O devices that need to fit into the same address space, which the C64 handles by bank switching. This strategy involves mapping different devices into and out of the address space so that all of the memory and devices can be accessed, though not all at the same time.

The C64, by default, has a BASIC ROM mapped into the address range $A000–$BFFF, masking the RAM at those locations, which in our case will contain the second half of the game code. If a program writes to those addresses, the data will be written to the RAM. As a result, with the default bank setup, we can load the game into memory, but if we try to read that code, we’ll get the contents of the BASIC ROM instead.

Memory location $01 controls what devices are banked in and out, and location $00 controls writes to $01. (By convention, addresses less than $0100 are written with just two hex digits instead of four.) We could write to those addresses with the monitor to bank out the BASIC ROM. However, we can use the monitor’s bank command to read from the RAM without changing the computer’s state, so let’s stick to that approach, at least for now.

Searching the Code

Let’s load the machine code file into memory and search for instructions that refer to $DC00. The 6510 processor in the C64, like the 6502 it’s based on, uses little-endian byte ordering, so instead of searching for the byte sequence $DC $00, we’ll need to search for the sequence $00 $DC.
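
As a quick aside on the byte order, here’s a small Python sketch of the same kind of search. The names `hunt` and `absolute_operand` are made up for this example, and the five-byte buffer is a stand-in for illustration, not the real game image.

```python
def absolute_operand(address):
    """Return a 16-bit address as it appears in memory: low byte first."""
    return address.to_bytes(2, "little")


def hunt(memory, base, pattern):
    """Return every address in `memory` (loaded at `base`) where `pattern` starts."""
    hits = []
    start = 0
    while True:
        index = memory.find(pattern, start)
        if index < 0:
            return hits
        hits.append(base + index)
        start = index + 1


# LDA $DC00 / AND #$10, placed at $A088 for the sake of the example
sample = bytes([0xAD, 0x00, 0xDC, 0x29, 0x10])
matches = hunt(sample, 0xA088, absolute_operand(0xDC00))
```

Note that a match points at the operand bytes, not at the opcode, so the instruction itself starts one byte earlier.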

(C:$e5d4) l "qft.8000-c010" 8
Loading qft.8000-c010 from 8000 to C00F (4010 bytes)
(C:$e5d4) bank ram
(C:$e5d4) h 8000 c00f 00 dc
a089
ad71

Great, we found two memory addresses which reference that address. Let’s disassemble and look at the code around the first one, $A089. To get some context about the code you’re looking at, a good tactic is to disassemble several bytes before and after the location of interest. So, let’s disassemble starting at $A07E and going through $A0A5:

(C:$e5d4) d a07e a0a5
.C:a07e  AD 0F 12    LDA $120F
.C:a081  D0 3C       BNE $A0BF
.C:a083  AD 5E 40    LDA $405E
.C:a086  D0 37       BNE $A0BF
.C:a088  AD 00 DC    LDA $DC00
.C:a08b  29 10       AND #$10
.C:a08d  F0 14       BEQ $A0A3
.C:a08f  A5 C5       LDA $C5
.C:a091  C9 04       CMP #$04
.C:a093  F0 0E       BEQ $A0A3
.C:a095  C9 06       CMP #$06
.C:a097  F0 07       BEQ $A0A0
.C:a099  C9 05       CMP #$05
.C:a09b  D0 3F       BNE $A0DC
.C:a09d  4C 06 BF    JMP $BF06
.C:a0a0  4C 30 BF    JMP $BF30
.C:a0a3  EE 0F 12    INC $120F

We can see the instruction LDA $DC00 at address $A088. This instruction reads the value at address $DC00 (the joystick port) and loads it into the CPU’s accumulator register, also called the A register. The next two instructions, AND #$10 and BEQ $A0A3, check if the fire button is being pressed. However, none of the other joystick bits are being checked here, so it’s unlikely that this is in-game code. More likely, it’s part of code that runs at the “game over” screen, waiting for the player to press the fire button to start the next game.

Perhaps we’ll have better luck with the other address where we found the joystick port being accessed, $AD71. Again, let’s expand out a bit to get context, and disassemble $AD61 to $AD80.

(C:$a0a6) d ad61 ad80
.C:ad61  D0 AD       BNE $AD10
.C:ad63  70 40       BVS $ADA5
.C:ad65  69 00       ADC #$00
.C:ad67  A2 00       LDX #$00
.C:ad69  20 13 B3    JSR $B313
.C:ad6c  4C 1C AD    JMP $AD1C
.C:ad6f  60          RTS
.C:ad70  AD 00 DC    LDA $DC00
.C:ad73  8D 3D 40    STA $403D
.C:ad76  60          RTS
.C:ad77  AD 39 40    LDA $4039
.C:ad7a  C9 20       CMP #$20
.C:ad7c  D0 03       BNE $AD81
.C:ad7e  4C F6 B2    JMP $B2F6

This is interesting. The LDA $DC00 instruction reads the value of the joystick port and the next instruction, STA $403D, stores that value into a memory address. Presumably, this is so that it can be read multiple times by the code without it potentially changing between reads. Let’s see how many places in the game code read that location. Let’s look for instances of the instruction LDA $403D, which assembles to the three-byte sequence $AD $3D $40. (If we don’t find much, we can expand the search to other instructions that use that address.)

(C:$ad81) h 8000 c00f ad 3d 40
9fd0
9fd7
9fde
afee
b001
b03d
b044
b05b
b14d
b161

That instruction is found in ten different places in the code. Not a huge number, but not a small number either. Notice, however, that they seem kind of clustered into groups. This would make sense if we expect that the joystick state might be read multiple times in a subroutine. Let’s start going through them. To look at the first group of three, let’s disassemble and examine $9FC0 to $9FFD:

(C:$ad81) d 9fc0 9ffd
.C:9fc0  18          CLC
.C:9fc1  69 0F       ADC #$0F
.C:9fc3  8D 30 40    STA $4030
.C:9fc6  CE 25 40    DEC $4025
.C:9fc9  10 20       BPL $9FEB
.C:9fcb  A9 0A       LDA #$0A
.C:9fcd  8D 25 40    STA $4025
.C:9fd0  AD 3D 40    LDA $403D
.C:9fd3  29 10       AND #$10
.C:9fd5  D0 14       BNE $9FEB
.C:9fd7  AD 3D 40    LDA $403D
.C:9fda  29 08       AND #$08
.C:9fdc  F0 10       BEQ $9FEE
.C:9fde  AD 3D 40    LDA $403D
.C:9fe1  29 04       AND #$04
.C:9fe3  D0 06       BNE $9FEB
.C:9fe5  CE 61 40    DEC $4061
.C:9fe8  EE 24 40    INC $4024
.C:9feb  4C 03 A0    JMP $A003
.C:9fee  EE 61 40    INC $4061
.C:9ff1  CE 24 40    DEC $4024
.C:9ff4  AD 24 40    LDA $4024
.C:9ff7  C9 04       CMP #$04
.C:9ff9  B0 08       BCS $A003
.C:9ffb  4C 00 A0    JMP $A000

This is very promising. The joystick state ($403D) is read three times, each time to check a different bit. The three bits checked are bit 4 ($10, fire button), bit 3 ($08, joystick right), and bit 2 ($04, joystick left). These are exactly the relevant bits for controlling the speed. The first check, at $9FD0, looks to see if the fire button is pressed, and if not, jumps ahead to $9FEB, which immediately jumps to $A003.

The next two checks only occur if the fire button is pressed. The second check, at $9FD7, looks to see if the joystick is pointing right (to speed up), and if it is, jumps ahead to $9FEE, which increments location $4061 and decrements location $4024. It then checks if the value stored at $4024 is less than 4. If it’s greater than or equal to 4, it jumps ahead to $A003, but if it’s less than 4, it jumps to $A000 instead.

If the second check doesn’t find the joystick pointing right, the code falls through to the third check, at $9FDE. This looks to see if the joystick is pointing left (to slow down), and if it’s not, jumps ahead to $9FEB, which as we’ve already seen, immediately jumps to $A003. If the joystick is pointing left, then $4061 is decremented, and $4024 is incremented, and then the code jumps to $A003.

This very much appears to be the subroutine that updates the speed based on player input. It seems like $A003 is the location to jump to when we’re done updating the speed, and $4061 and/or $4024 are good candidate locations for where some representation of the speed is stored. $4061 is incremented when the character speeds up and decremented when the character slows down, so that at least moves in the right direction. $4024 goes in the opposite direction, but it is compared against some kind of limit (4), which fits with the fact that the in-game speed is limited. Maybe they’re both different representations of the speed. Let’s look at the disassembly for $A000 to $A020 and see if we can learn anything more.

(C:$9ffe) d a000 a020
.C:a000  EE 24 40    INC $4024
.C:a003  AD 24 40    LDA $4024
.C:a006  CD 30 40    CMP $4030
.C:a009  90 06       BCC $A011
.C:a00b  AD 30 40    LDA $4030
.C:a00e  8D 24 40    STA $4024
.C:a011  4A          LSR A
.C:a012  4A          LSR A
.C:a013  AC 1B 40    LDY $401B
.C:a016  AE 1A 40    LDX $401A
.C:a019  C0 07       CPY #$07
.C:a01b  D0 04       BNE $A021
.C:a01d  E0 95       CPX #$95
.C:a01f  F0 0C       BEQ $A02D

Let’s look at $A003 first, as it’s the target of several jump and branch instructions. This section of code loads the value at $4024 into the A register, then compares it to the value at $4030. If the value at $4024 is less than the value at $4030, the code jumps ahead to $A011. Otherwise, if the value at $4024 is greater than or equal to the value at $4030, the value at $4030 is copied into $4024, and the next instruction is at $A011. So, this seems to be setting an upper limit for $4024, with an adjustable limit value stored at $4030.

$A000 is the address jumped to when the value at $4024 is less than 4. This instruction increments $4024, and therefore enforces a hard-coded lower limit of 4 for this value. The next instruction is $A003, which, as we’ve just seen, contains the upper-limit code for the same location. Since $A000 has just enforced the lower limit, the upper-limit code will likely have nothing to do. After this, the code once again reaches $A011.
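
Putting the pieces from $9FD0 through $A00E together, the logic can be sketched in Python like this. It’s my reading of the disassembly, not a byte-exact emulation: the countdown at $4025 that gates the whole check is left out, the variable names are invented, and the limit value used in any example call is hypothetical.

```python
def update_speed(joy, v4024, v4061, limit):
    """Model one pass of the speed code; returns the new ($4024, $4061)."""
    fire = (joy & 0x10) == 0    # bits are active low
    right = (joy & 0x08) == 0
    left = (joy & 0x04) == 0

    if fire and right:              # $9FEE: speed up
        v4061 = (v4061 + 1) & 0xFF
        v4024 = (v4024 - 1) & 0xFF
        if v4024 < 4:               # CMP #$04 / BCS skips the fix-up
            v4024 = (v4024 + 1) & 0xFF  # $A000: INC $4024
    elif fire and left:             # $9FE5: slow down
        v4061 = (v4061 - 1) & 0xFF
        v4024 = (v4024 + 1) & 0xFF

    if v4024 >= limit:              # $A003: CMP $4030 / BCC skips the clamp
        v4024 = limit
    return v4024, v4061
```

One thing the model makes easy to see is that $4061 is changed before the floor on $4024 is applied, so it keeps counting even when $4024 is pinned at 4.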

$A011 seems not to have much more to do with setting the speed, and likely just carries on with other game logic. Nevertheless, I think we’ve finally achieved our first milestone, which is to find the code which handles the speed. That is, after all, where the bug seems to be. In particular, the crash happens when the player accelerates past 80.

Stepping Through the Crash

The VICE monitor has the capability to set breakpoints, which stop the computer and open the monitor prompt when the breakpoint’s condition is satisfied. These conditions are usually something like “stop when execution reaches a certain address”, but they can also break on reading/writing memory locations, and can be restricted to break only when other memory locations have specific values. It’s a very powerful debugging tool, because while the computer is paused at a breakpoint, the user can examine and/or modify memory, and step execution forward one instruction at a time.

Since the crash we’re looking for happens when the player tries to speed up past the limit, and since the instruction at $9FFB is executed if and only if a limit is exceeded during a speed-up action, let’s do the following:

start the game,
break into the monitor,
set a breakpoint at $9FFB,
resume the game, and
accelerate past 80.

Then, if/when our breakpoint is hit, we can trace through the code and see what might be causing the crash.

(C:$bf5e) bk 9ffb
BREAK: 1  C:$9ffb  (Stop on exec)
(C:$bf5e) g
#1 (Stop on  exec 9ffb)   69/$045,  55/$37
.C:9ffb  4C 00 A0    JMP $A000      - A:03 X:00 Y:00 SP:ff N.-.....   63751795

Nice, we hit our breakpoint, right when we expected to. Now let’s trace through the next few instructions to see if we can determine what’s going wrong.

(C:$9ffb) z
.C:a000  55 24       EOR $24,X      - A:03 X:00 Y:00 SP:ff N.-.....   63751798
(C:$a000) z
.C:a002  40          RTI            - A:03 X:00 Y:00 SP:ff ..-.....   63751802
(C:$a002)

That’s definitely very wrong. RTI is an instruction meant to return from an interrupt handler. Executing that when not in an interrupt handler is a good way to corrupt the stack and crash the computer. We have our smoking gun. But why is this instruction being executed? And why is $A000 now EOR $24,X instead of the INC $4024 we saw before? The three bytes starting at $A000 are currently $55 $24 $40. But before we started the game, they were $EE $24 $40, which disassembles to the expected INC $4024. Only the first byte is different. Somehow, the byte at $A000 is getting corrupted and breaking the code.
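
To see why a single changed byte turns one instruction into two, here’s a toy disassembler in Python. It’s a sketch that knows only the three opcodes involved here, and the table layout and function name are my own.

```python
# opcode -> (mnemonic, addressing mode, instruction length in bytes)
OPCODES = {
    0xEE: ("INC", "abs", 3),
    0x55: ("EOR", "zp,x", 2),
    0x40: ("RTI", "impl", 1),
}


def disassemble(code, origin):
    """Return a list of (address, text) pairs for the bytes in `code`."""
    listing = []
    pc = 0
    while pc < len(code):
        mnemonic, mode, size = OPCODES[code[pc]]
        operand = code[pc + 1:pc + size]
        if mode == "abs":        # 16-bit operand, low byte first
            text = f"{mnemonic} ${operand[1]:02X}{operand[0]:02X}"
        elif mode == "zp,x":     # 8-bit zero-page operand
            text = f"{mnemonic} ${operand[0]:02X},X"
        else:                    # no operand
            text = mnemonic
        listing.append((origin + pc, text))
        pc += size
    return listing
```

The original bytes $EE $24 $40 decode as one three-byte instruction. With $55 in the first position, the opcode only consumes one operand byte, so the leftover $40 gets decoded as an instruction of its own.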

Understanding the Bug

The Source of the Corruption

As I mentioned, the breakpoints in VICE’s monitor are pretty powerful, and they can be set up to break when a specific memory location is written to. We can use that to try to find the culprit. But there’s a small complication to deal with. We don’t want to break when loading the machine code into RAM. We only want to break when $A000 changes after that initial load.

So, instead of the typical LOAD"*",8 followed by RUN to start the game, we’re going to do things a bit more manually. First we’ll load the BASIC loader program to look and see what steps it takes to load and start the game. Then we’ll load the machine code into RAM ourselves, and then set our breakpoint. Finally, we’ll run the command from the loader program that actually starts the game.

Quest for Tires BASIC Loader Listing

The relevant lines of code for us are 50 and 60. The rest are boilerplate code to display a loading screen and handle some quirks when a BASIC program loads another program. Line 50 loads the machine code with LOAD"QFT.8000-C010",8,1. Line 60 then executes that code with SYS64738. SYS is a BASIC command that jumps to an address and starts executing the machine code there. BASIC doesn’t use hexadecimal, so the address is specified in decimal.

64,738 in decimal is $FCE2 in hexadecimal, which some readers may recognize as the address of the KERNAL routine to reset the computer (without clearing the RAM). The loader program apparently loads the machine code into RAM and then resets the computer, which somehow causes the game to start. We’ll revisit this a bit later. For now, we know what steps we need to take to set our breakpoint and find out what’s corrupting memory at address $A000. Let’s do that now.

(C:$e5cf) l "qft.8000-c010" 8
Loading qft.8000-c010 from 8000 to C00F (4010 bytes)
(C:$e5cf) bk store a000
WATCH: 1  C:$a000  (Stop on store)
(C:$e5cf) bank default
(C:$e5cf) g fce2
#1 (Stop on store a000)    2/$002,  10/$0a
.C:fd73  91 C1       STA ($C1),Y    - A:55 X:94 Y:00 SP:fd ..-..I.C 2315364035
(C:$fd75) m c1 c2
>C:00c1  00 a0                                                ..

And there’s our culprit. For some reason, the instruction at $FD73 is being executed, and that instruction stores the value of the A register ($55) into the location pointed to by $C1 and $C2, which is $A000 (remember, the CPU is little-endian). The value of the Y register is added to the destination address before the store happens, but Y is currently 0, so that doesn’t change anything. What is address $FD73, and why is it writing into our program’s code? Why is it even executing at all? The monitor’s backtrace command might help us find out.

(C:$00c3) bt
(0) 800f
(C:$00c3)

Interesting. The caller of this routine is at $800F, which is very near the start of our game’s machine code. Let’s disassemble the code from the beginning and look at $800F.

(C:$00c3) d 8000 8020
.C:8000  00          BRK
.C:8001  C0 20       CPY #$20
.C:8003  80 C3       NOOP #$C3
.C:8005  C2 CD       NOOP #$CD
.C:8007  38          SEC
.C:8008  30 8E       BMI $7F98
.C:800a  16 D0       ASL $D0,X
.C:800c  20 A3 FD    JSR $FDA3
.C:800f  20 50 FD    JSR $FD50
.C:8012  20 15 FD    JSR $FD15
.C:8015  20 18 E5    JSR $E518
.C:8018  58          CLI
.C:8019  78          SEI
.C:801a  4C 47 FE    JMP $FE47
.C:801d  8D 18 03    STA $0318
.C:8020  20 BC F6    JSR $F6BC
(C:$8023)

RAMTAS

At $800F, we find the instruction JSR $FD50. This instruction calls the subroutine at the specified address. $FD50 is pretty close to the offending instruction, $FD73, so this makes sense. $FD50 is within the KERNAL ROM’s address space, and is in fact a documented routine, named RAMTAS, that can be called by other programs. The pagetable.com KERNAL API reference says this about RAMTAS:

Perform RAM test

Communication registers: A, X, Y

Preparatory routines: None

Error returns: None

Stack requirements: 2

Registers affected: A, X, Y

Description: This routine is used to test RAM and set the top and bottom of memory pointers accordingly. It also clears locations $00 to $0101 and $0200 to $03FF. It also allocates the cassette buffer, and sets the screen base to $0400. Normally, this routine is called as part of the initialization process of a Commodore 64 program cartridge.

and also this:

RAMTAS

This routine clears zero-page RAM (locations $02-$FF) and initializes Kernal memory pointers in zero page. For the 64 only, the routine also clears pages 2 and 3 (locations $0200-$03FF), tests all RAM locations from $0400 upwards until ROM is encountered, and sets the top-of-memory pointer. For the 128, the routine sets the BASIC restart vector ($0A00) to point to BASIC’s cold-start entry address, $4000.

This line is particularly interesting: “Normally, this routine is called as part of the initialization process of a Commodore 64 program cartridge.” Quest for Tires was released both on floppy disk and on ROM cartridge, and there were some utility programs available to convert a cartridge program to be able to load and run from disk. Furthermore, ROM cartridges use the memory space $8000–$9FFF for 8k cartridges, and $8000–$BFFF for 16k cartridges. This latter range matches our code’s address range almost exactly. It seems probable that this floppy disk copy of the game was originally created by converting from the cartridge version. The extra 16 bytes at the end could be some glue code to help with the conversion.

Cartridge Conversion

Why does it matter that this code was converted from a cartridge? First, it explains why the BASIC loader resets the computer to start the game. Cartridges are inserted before powering on the computer, and when the booting system notices that a cartridge is plugged in, it starts executing code from that cartridge. Also, since the game was originally a cartridge, it needs to call the RAMTAS routine itself, as the KERNAL doesn’t do that automatically when a cartridge is inserted.

The RAMTAS routine, among other things, writes a $55 byte (sound familiar?) to each memory location and reads it back to make sure it matches. If the byte matches, the original byte that was stored at the location is restored, and testing moves on to the next address. If the byte doesn’t match, the routine assumes it’s found a ROM, stops testing RAM, and updates some KERNAL pointers to indicate where the top and bottom of RAM are. Take a look at the C64 BASIC & KERNAL ROM Disassembly section of Michael Steil’s reference site for full details on this routine.

When the game is run from a ROM cartridge, it’s mapped into memory at $8000–$BFFF. As part of its initialization, it calls RAMTAS, which then starts testing RAM, but stops at $8000, because it hits the ROM. But, when the game is in RAM instead, RAMTAS keeps going when it hits $8000. This is generally non-destructive, since each byte is put back after it passes the RAM test. However, recall that the BASIC ROM is mapped in at $A000–$BFFF, the first byte of which is exactly our problem address.

Here’s the crucial point. RAMTAS writes $55 to $A000, and reads back a different value (the first byte in the BASIC ROM), so it (correctly) concludes that it’s found a ROM. However, it has already written to the RAM underneath the ROM at that address, which is holding our program! Furthermore, RAMTAS doesn’t try to restore the byte, since it thinks it’s from a ROM, and what would be the point of trying to write to ROM? This is the root cause of our corrupted byte.
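
The whole failure fits in a short Python model. This is a deliberately simplified sketch: it only models the BASIC ROM overlay, it only uses the $55 test pattern described above, and the ROM stand-in value is a placeholder that just needs to differ from $55. The class and function names are mine.

```python
BASIC_ROM_START = 0xA000
BASIC_ROM_END = 0xBFFF
ROM_STAND_IN = 0x94  # placeholder; any value other than $55 would do


class Memory:
    """64K of RAM with the BASIC ROM optionally mapped over $A000-$BFFF."""

    def __init__(self, basic_rom_visible):
        self.ram = bytearray(0x10000)
        self.basic_rom_visible = basic_rom_visible

    def read(self, addr):
        if self.basic_rom_visible and BASIC_ROM_START <= addr <= BASIC_ROM_END:
            return ROM_STAND_IN        # reads see the ROM
        return self.ram[addr]

    def write(self, addr, value):
        self.ram[addr] = value & 0xFF  # writes always land in the RAM underneath


def ram_test(mem, start=0x0400):
    """Return the first address that fails the test (the assumed top of RAM)."""
    for addr in range(start, 0x10000):
        saved = mem.read(addr)
        mem.write(addr, 0x55)
        if mem.read(addr) != 0x55:
            return addr                # looks like ROM: stop without restoring
        mem.write(addr, saved)         # passed: put the original byte back
    return 0x10000


def simulate():
    """Load two game instructions into RAM, run the test, report what survived."""
    mem = Memory(basic_rom_visible=True)
    mem.ram[0x9FFB:0x9FFE] = bytes([0x4C, 0x00, 0xA0])  # JMP $A000
    mem.ram[0xA000:0xA003] = bytes([0xEE, 0x24, 0x40])  # INC $4024
    top = ram_test(mem)
    return top, bytes(mem.ram[0x9FFB:0x9FFE]), bytes(mem.ram[0xA000:0xA003])
```

Running `simulate()` reports the top of RAM as $A000, leaves the JMP at $9FFB untouched, and leaves $55 $24 $40 where the INC used to be.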

But shouldn’t the BASIC ROM be banked out already? It’s masking half of the game code, so it will need to be banked out for the game to be able to run. Let’s take a look at that glue code at $C000 to $C00F.

Glue Code

When the C64 boots up, it looks at addresses $8004 to $8008 for the byte sequence $C3 $C2 $CD $38 $30. That’s PETSCII for “CBM80”. “CBM” stands for Commodore Business Machines. I’m not sure what the “80” signifies. If that sequence is found, the system sets the NMI (Non-Maskable Interrupt) vector to point to the address found at $8002 and $8003. The C64’s Restore key is wired to the CPU’s NMI line, so pressing that key will trigger an NMI and jump to the address pointed to by the NMI vector. After setting up the vector, the system jumps to the address pointed to by $8000 and $8001 and starts executing code there.
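
The check is simple enough to sketch in a few lines of Python. The function name is hypothetical, and the sample header is the first nine bytes shown in the earlier disassembly of $8000. Masking off the high bit is just a quick way to make the signature printable here, not a general PETSCII conversion.

```python
SIGNATURE = bytes([0xC3, 0xC2, 0xCD, 0x38, 0x30])


def parse_cartridge_header(image):
    """Return (start address, NMI address) if the signature is present, else None."""
    if image[4:9] != SIGNATURE:
        return None
    start = image[0] | (image[1] << 8)  # pointer at $8000/$8001, low byte first
    nmi = image[2] | (image[3] << 8)    # pointer at $8002/$8003
    return start, nmi


header = bytes([0x00, 0xC0, 0x20, 0x80, 0xC3, 0xC2, 0xCD, 0x38, 0x30])
printable = "".join(chr(b & 0x7F) for b in SIGNATURE)
```

For the sample header, the parsed pair is $C000 and $8020, and `printable` comes out as `CBM80`.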

If we look at memory at $8000 to $8008, we find the CBM80 signature at $8004, and the two pointers at $8000 and $8002:

(C:$8023) m 8004 8008
>C:8004  c3 c2 cd 38  30                                      ...80
(C:$8009) m 8000 8001
>C:8000  00 c0                                                ..
(C:$8002) m 8002 8003
>C:8002  20 80                                                 .

This is definitely a cartridge conversion, or at the very least, is using the cartridge mechanism. $8000 points to $C000, so that’s where the system will start executing code once it’s finished booting. Notably, this is just outside the 16k ROM cartridge address range, so it must have been added by the conversion utility. Let’s look at that code:

(C:$8004) d c000 c00f
.C:c000  A9 36       LDA #$36
.C:c002  85 01       STA $01
.C:c004  4C 09 80    JMP $8009
.C:c007  AA          TAX
.C:c008  AA          TAX
.C:c009  AA          TAX
.C:c00a  AA          TAX
.C:c00b  AA          TAX
.C:c00c  AA          TAX
.C:c00d  AA          TAX
.C:c00e  AA          TAX
.C:c00f  AA          TAX

Aha, the first two instructions set the memory bank register to $36, or %00110110 in binary. The default value for this register is $37, or %00110111 in binary, so this clears the lowest-order bit. That happens to be the bit that controls the BASIC ROM, and setting it to 0 banks out that ROM. This is the sort of thing we should expect to see here.
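For anyone who wants to check the bit arithmetic, here is a small C++ sketch that decodes the low bits of that register. `BankState` and `decodeBanks` are hypothetical names. The sketch assumes no cartridge is pulling the expansion port's control lines, and it ignores bit 2, which selects between the character ROM and I/O at $D000.

```cpp
#include <cstdint>

struct BankState {
    bool basicRom;   // BASIC ROM visible at $A000-$BFFF
    bool kernalRom;  // KERNAL ROM visible at $E000-$FFFF
};

// Decodes bits 0 and 1 of the value written to address $01.
// Assumes no cartridge is attached; bit 2 is not modeled.
BankState decodeBanks(std::uint8_t port) {
    bool loram = (port & 0x01) != 0;  // bit 0
    bool hiram = (port & 0x02) != 0;  // bit 1
    // BASIC is only visible when both bits are set.
    return BankState{loram && hiram, hiram};
}
```

Decoding $37 gives both ROMs visible, while $36 leaves the KERNAL in place and exposes the RAM under BASIC.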

The next instruction jumps to the start of the game code. It’s likely that the original cartridge had the entry point at $8000 set to $8009, and the conversion utility redirected execution to $C000, where it put code to bank out the BASIC ROM (because that’s necessary when running from RAM) and then jump back to where the code would have started. The remaining $AA bytes are simply padding to ensure that the file size is a multiple of 16 bytes. That’s not really necessary, but it doesn’t hurt anything either.

This technique is sometimes referred to as patching, although that term is more broadly applicable than this specific approach. The additional code itself is sometimes called patch code, for hopefully obvious reasons, or glue code, since its purpose is to stick together things that normally don’t go together (in this case, code from a ROM cartridge running from RAM).

If the glue code sets the bank register to bank out the BASIC ROM, why is the RAMTAS routine hitting it and messing up the game code? Let’s trace through the glue code and see if we can find out.

(C:$c010) l "qft.8000-c010" 8
Loading qft.8000-c010 from 8000 to C00F (4010 bytes)
(C:$c010) bk c000
BREAK: 1  C:$c000  (Stop on exec)
(C:$c010) g fce2
#1 (Stop on  exec c000)   60/$03c,  17/$11
.C:c000  A9 36       LDA #$36       - A:C3 X:00 Y:91 SP:ff ..-..IZC    6482922
(C:$c000) m 00 01
>C:0000  2f 37                                                /7
(C:$0002) z
.C:c002  85 01       STA $01        - A:36 X:00 Y:91 SP:ff ..-..I.C    6482924
(C:$c002) z
.C:c004  4C 09 80    JMP $8009      - A:36 X:00 Y:91 SP:ff ..-..I.C    6482927
(C:$c004) m 00 01
>C:0000  2f 36                                                /6
(C:$0002) z
.C:8009  8E 16 D0    STX $D016      - A:36 X:00 Y:91 SP:ff ..-..I.C    6482930
(C:$8009) z
.C:800c  20 A3 FD    JSR $FDA3      - A:36 X:00 Y:91 SP:ff ..-..I.C    6482934
(C:$800c) m 00 01
>C:0000  2f 36                                                /6
(C:$0002) n
.C:800f  20 50 FD    JSR $FD50      - A:D7 X:FF Y:91 SP:ff N.-..I.C    6483073
(C:$800f) m 00 01
>C:0000  2f 37                                                /7

As we can see from this trace, the glue code does, in fact, successfully bank out the BASIC ROM, and it stays banked out right up until the JSR $FDA3 subroutine call at $800C. That routine must have banked it back in, just before the call to RAMTAS ($FD50). What is this routine?

IOINIT

According to pagetable.com’s KERNAL ROM disassembly and KERNAL API reference, $FDA3 contains a KERNAL routine named IOINIT. This is what it has to say about that routine:

Initialize I/O devices

Communication registers: None

Preparatory routines: None

Error returns:

Stack requirements: None

Registers affected: A, X, Y

Description: This routine initializes all input/output devices and routines. It is normally called as part of the initialization procedure of a Commodore 64 program cartridge.

EXAMPLE:

JSR IOINIT

and also this:

Initialize I/O devices

Called by: None.

JMP FDA3 to initialize the CIA registers, the 6510 I/O data registers, the SID chip registers, start CIA #1 timer A, enable CIA #1 timer A interrupts, and set the serial clock output line high.

The closest equivalent to this routine on the VIC is JMP FDF9 to initialize the 6522 registers, set VIA #2 timer 1 value, start timer 1, and enable timer 1 interrupts.

Since these routines are called during system reset, the main use for IOINIT is for an autostart cartridge that wants to use the same I/O settings that the Kernal normally uses.

So, ultimately, the cartridge conversion utility used a patch and some glue code to bank out the BASIC ROM to let the game run, but wasn’t clever enough to notice that IOINIT banks it back in right away, and then RAMTAS comes along and clobbers the code at $A000, leading to the crash bug we’ve been diagnosing. This is almost everything we need to know to understand why the game crashes when it does, but one small question remains. If the BASIC ROM has been banked in over the top of the second half of the game code, how does the game run at all?

The answer is that, a few instructions later, the game code itself banks the BASIC ROM back out again, and so it’s only banked in for a brief moment, during which RAMTAS gets confused and introduces the bug.

(C:$0002) d 8009 8033
.C:8009  8E 16 D0    STX $D016
.C:800c  20 A3 FD    JSR $FDA3
.C:800f  20 50 FD    JSR $FD50
.C:8012  20 15 FD    JSR $FD15
.C:8015  20 18 E5    JSR $E518
.C:8018  58          CLI
.C:8019  78          SEI
.C:801a  4C 47 FE    JMP $FE47
.C:801d  8D 18 03    STA $0318
.C:8020  20 BC F6    JSR $F6BC
.C:8023  20 15 FD    JSR $FD15
.C:8026  20 A3 FD    JSR $FDA3
.C:8029  20 18 E5    JSR $E518
.C:802c  58          CLI
.C:802d  A9 36       LDA #$36
.C:802f  85 01       STA $01
.C:8031  4C CD 8F    JMP $8FCD

At $802D–$8030, the game sets $01 back to $36, banking out the BASIC ROM (for good this time) and then jumps off to $8FCD, which presumably shows the title screen and then starts the game itself.

Fixing the Bug

Now that we’ve figured out what’s going wrong, how do we fix it? Through my journey diagnosing this issue, I’ve tried several different approaches, some of which were successful. In this section, I’ll describe some of the successful fixes and discuss their pros and cons.

Fix 1: Avoid the Corrupted Instruction

The first fix I tried that actually worked was to adjust the code to skip over the instruction at $A000, instead achieving the same effect by jumping somewhere else. Recall that the correct instruction is INC $4024, which is followed by LDA $4024 at location $A003. Let’s see if we can find another place in the code that does that:

(C:$8034) h 8000 bfff ee 24 40
9fe8
a000

There are only two instances of that instruction in the entire program. One of them is $A000, which is what we’re trying to avoid. Let’s look at the other one at $9FE8:

(C:$8034) d 9fe8 9ffd
.C:9fe8  EE 24 40    INC $4024
.C:9feb  4C 03 A0    JMP $A003
.C:9fee  EE 61 40    INC $4061
.C:9ff1  CE 24 40    DEC $4024
.C:9ff4  AD 24 40    LDA $4024
.C:9ff7  C9 04       CMP #$04
.C:9ff9  B0 08       BCS $A003
.C:9ffb  4C 00 A0    JMP $A000

As expected, the code at $9FE8 executes the same increment instruction as the one that got corrupted. However, somewhat miraculously, it then unconditionally jumps to $A003, which is immediately after the corrupted instruction, and just where we need to go next. So, if we replace every jump to $A000 with a jump to $9FE8 instead, that should avoid the corruption and do what the original program intended. Let’s look for the places we need to change:

(C:$9ffe) h 8000 bfff 4c 00 a0
9ffb
bffb

Only two places found. The first ($9FFB), we just saw above. What about the second?

(C:$9ffe) d bfe0 bfff
.C:bfe0  01 80       ORA ($80,X)
.C:bfe2  01 80       ORA ($80,X)
.C:bfe4  01 80       ORA ($80,X)
.C:bfe6  01 80       ORA ($80,X)
.C:bfe8  01 1E       ORA ($1E,X)
.C:bfea  1F 20 46    SLO $4620,X
.C:bfed  47 48       SRE $48
.C:bfef  6E 6F 70    ROR $706F
.C:bff2  97 98       SAX $98,Y
.C:bff4  BF C0 E6    LAX $E6C0,Y
.C:bff7  E7 E8       ISB $E8
.C:bff9  B0 08       BCS $C003
.C:bffb  4C 00 A0    JMP $A000
.C:bffe  F9 07 A9    SBC $A907,Y

This is basically gibberish, so it’s probably not code. It’s likely some sort of data, perhaps graphics or sound data. We shouldn’t need to worry about this, so the only JMP $A000 instruction we’ve found that we need to update is the one at $9FFB. Note that there could be conditional branch instructions or indirect jump instructions that target $A000.
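The conditional-branch case, at least, can be checked mechanically. The C++ sketch below is one hedged way to do it, not something the monitor provides: `findBranchesTo` is a hypothetical helper that treats every byte as a potential opcode, so it will also report data that merely looks like a branch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the addresses of 6502 relative-branch instructions that land on
// `target`. `mem` holds the bytes loaded at address `base`.
// Naive linear scan: instruction boundaries are not tracked.
std::vector<std::uint16_t> findBranchesTo(const std::vector<std::uint8_t>& mem,
                                          std::uint16_t base,
                                          std::uint16_t target) {
    std::vector<std::uint16_t> hits;
    for (std::size_t i = 0; i + 1 < mem.size(); ++i) {
        // Branch opcodes are $10, $30, $50, $70, $90, $B0, $D0, $F0.
        if ((mem[i] & 0x1F) != 0x10) continue;

        // The operand is a signed offset from the address after the branch.
        int offset = static_cast<std::int8_t>(mem[i + 1]);
        std::uint16_t addr = static_cast<std::uint16_t>(base + i);
        std::uint16_t dest = static_cast<std::uint16_t>(addr + 2 + offset);
        if (dest == target) hits.push_back(addr);
    }
    return hits;
}
```

Run over the bytes at $9FF7, for example, this reports the BCS at $9FF9 as a branch to $A003 and finds nothing that branches to $A000.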

The branch instructions would be within the 256 bytes around $A000, since they can only jump a limited distance. That’s still a lot of disassembly to show, however, so perhaps you’ll take my word for it that I looked and didn’t see any. Indirect jumps to a specific location are hard to find, since it would require dynamic analysis of the code to determine the jump target. If we miss one, the game might still crash, but let’s try it and see what happens. The following commands make the change and then save the updated code to a new file on disk called "FIX1.8000-C010".

(C:$c000) a 9ffb
.9ffb  jmp $9fe8
.9ffe  
(C:$9ffe) d 9fe8 9ffd
.C:9fe8  EE 24 40    INC $4024
.C:9feb  4C 03 A0    JMP $A003
.C:9fee  EE 61 40    INC $4061
.C:9ff1  CE 24 40    DEC $4024
.C:9ff4  AD 24 40    LDA $4024
.C:9ff7  C9 04       CMP #$04
.C:9ff9  B0 08       BCS $A003
.C:9ffb  4C E8 9F    JMP $9FE8
(C:$9ffe) bank ram
(C:$9ffe) s "fix1.8000-c010" 8 8000 c00f
Saving file `fix1.8000-c010' from $8000 to $c00f

Next, we need to update the BASIC loader to load our fixed version of the code:

Quest for Tires BASIC Loader (Fix 1)

Now, all that’s left to do is try it! I started this section by saying I would only discuss successful fixes, so there’s not much suspense, but here’s a video of this fixed code running, with the speed going all the way up to 80 without crashing!

Quest for Tires Gameplay (Fix 1)

Clearly, this fix works, which is great. One downside, however, is that it slightly changes the number of cycles it takes to execute the speed update code during the main game loop. In practice, it’s unlikely to matter, but it might subtly change the gameplay. Another drawback is more aesthetic—it’s not the same control flow as what was originally published. Let’s see how we might improve on this.

Fix 2: Undo the Corruption

Instead of modifying the main game loop, maybe there’s a way we can repair the damage RAMTAS did to our code. We know exactly which byte it clobbered and what the value was before said clobbering. Perhaps we can borrow the trick that the cartridge conversion utility used, and insert some code right after RAMTAS returns that restores the byte’s original value. We have nine padding bytes to work with, and since this is stored on a disk, we could add quite a bit of extra code if we needed to. But, I think we can fit what we need to do into the existing padding bytes.

What we want to do is store the value $EE into the location $A000 as soon as possible after the call to RAMTAS returns. One way to do that is to write our own subroutine that calls RAMTAS and then immediately fixes up memory before returning. That would look something like this:

(C:$9ffe) l "qft.8000-c010" 8
Loading qft.8000-c010 from 8000 to C00F (4010 bytes)
(C:$9ffe) d c000 c00f
.C:c000  A9 36       LDA #$36
.C:c002  85 01       STA $01
.C:c004  4C 09 80    JMP $8009
.C:c007  AA          TAX
.C:c008  AA          TAX
.C:c009  AA          TAX
.C:c00a  AA          TAX
.C:c00b  AA          TAX
.C:c00c  AA          TAX
.C:c00d  AA          TAX
.C:c00e  AA          TAX
.C:c00f  AA          TAX
(C:$c010) a c007
.c007  jsr $fd50
.c00a  lda #$ee
.c00c  sta $a000
.c00f  rts
.c010
(C:$c010) d c000 c00f
.C:c000  A9 36       LDA #$36
.C:c002  85 01       STA $01
.C:c004  4C 09 80    JMP $8009
.C:c007  20 50 FD    JSR $FD50
.C:c00a  A9 EE       LDA #$EE
.C:c00c  8D 00 A0    STA $A000
.C:c00f  60          RTS

We used up all the padding bytes, but we now have the routine we need stored in memory. Next, we need to patch the game code to call our routine instead of calling RAMTAS directly:

(C:$c010) d 8009 8018
.C:8009  8E 16 D0    STX $D016
.C:800c  20 A3 FD    JSR $FDA3
.C:800f  20 50 FD    JSR $FD50
.C:8012  20 15 FD    JSR $FD15
.C:8015  20 18 E5    JSR $E518
.C:8018  58          CLI
(C:$8019) a 800f
.800f  jsr $c007
.8012
(C:$8012) d 8009 8018
.C:8009  8E 16 D0    STX $D016
.C:800c  20 A3 FD    JSR $FDA3
.C:800f  20 07 C0    JSR $C007
.C:8012  20 15 FD    JSR $FD15
.C:8015  20 18 E5    JSR $E518
.C:8018  58          CLI

And that should do it. Let’s save the code, set some breakpoints, start the game, and trace through to make sure our patch restores the value of $A000 to $EE and then resumes normal execution, as we expect:

(C:$8019) bank ram
(C:$8019) s "fix2.8000-c010" 8 8000 c00f
Saving file `fix2.8000-c010' from $8000 to $c00f
(C:$8019) bk 800f
BREAK: 1  C:$800f  (Stop on exec)
(C:$8019) bk c00a
BREAK: 2  C:$c00a  (Stop on exec)
(C:$8019) g fce2
#1 (Stop on  exec 800f)   77/$04d,  49/$31
.C:800f  20 07 C0    JSR $C007      - A:D7 X:FF Y:A0 SP:ff N.-..I.C   49939549
(C:$800f) m a000 a000
>C:a000  ee                                                   .
(C:$a001) z
.C:c007  20 50 FD    JSR $FD50      - A:D7 X:FF Y:A0 SP:fd N.-..I.C   49939555
(C:$c007) n
#2 (Stop on  exec c00a)  156/$09c,  62/$3e
.C:c00a  A9 EE       LDA #$EE       - A:04 X:00 Y:A0 SP:fd ..-..I..   52115762
(C:$c00a) m a000 a000
>C:a000  55                                                   U
(C:$a001) z
.C:c00c  8D 00 A0    STA $A000      - A:EE X:00 Y:A0 SP:fd N.-..I..   52115764
(C:$c00c) z
.C:c00f  60          RTS            - A:EE X:00 Y:A0 SP:fd N.-..I..   52115768
(C:$c00f) m a000 a000
>C:a000  ee                                                   .
(C:$a001) z
.C:8012  20 15 FD    JSR $FD15      - A:EE X:00 Y:A0 SP:ff N.-..I..   52115774

It worked! The trace shows that execution gets diverted to our new routine. Before we call RAMTAS, $A000 has the correct value. After RAMTAS returns, the value is corrupted. But after the next two instructions, it’s restored to the correct value again. Finally, our routine returns, and the game continues as if nothing happened.

This is much nicer than the first fix, because it doesn’t modify code in the middle of the game loop. It’s still annoying, however, that we have to let the code get corrupted in the first place. Perhaps we can improve this further.

Fix 3: Prevent the Corruption

Recall that the reason RAMTAS corrupts our program is that, even though the glue code banked out the BASIC ROM, the call to IOINIT banked it back in right before RAMTAS is called. What if we inserted some code between those two calls to bank it out again? Then RAMTAS should be able to complete without screwing anything up. Let’s give it a shot. First we set up our routine in the padding area, as before. This time, however, instead of calling RAMTAS right away, we set the bank register first. Then we jump directly to the RAMTAS routine, so that when it returns, it will return to our routine’s caller.

(C:$8012) l "qft.8000-c010" 8
Loading qft.8000-c010 from 8000 to C00F (4010 bytes)
(C:$8012) d c000 c00f
.C:c000  A9 36       LDA #$36
.C:c002  85 01       STA $01
.C:c004  4C 09 80    JMP $8009
.C:c007  AA          TAX
.C:c008  AA          TAX
.C:c009  AA          TAX
.C:c00a  AA          TAX
.C:c00b  AA          TAX
.C:c00c  AA          TAX
.C:c00d  AA          TAX
.C:c00e  AA          TAX
.C:c00f  AA          TAX
(C:$c010) a c007
.c007  lda #$36
.c009  sta $01
.c00b  jmp fd50
.c00e  
(C:$c00e) d c000 c00f
.C:c000  A9 36       LDA #$36
.C:c002  85 01       STA $01
.C:c004  4C 09 80    JMP $8009
.C:c007  A9 36       LDA #$36
.C:c009  85 01       STA $01
.C:c00b  4C 50 FD    JMP $FD50
.C:c00e  AA          TAX
.C:c00f  AA          TAX

Then, as before, we replace the original call to RAMTAS with a call to our routine instead:

(C:$c010) d 8009 8018
.C:8009  8E 16 D0    STX $D016
.C:800c  20 A3 FD    JSR $FDA3
.C:800f  20 50 FD    JSR $FD50
.C:8012  20 15 FD    JSR $FD15
.C:8015  20 18 E5    JSR $E518
.C:8018  58          CLI
(C:$8019) a 800f
.800f  jsr $c007
.8012  
(C:$8012) d 8009 8018
.C:8009  8E 16 D0    STX $D016
.C:800c  20 A3 FD    JSR $FDA3
.C:800f  20 07 C0    JSR $C007
.C:8012  20 15 FD    JSR $FD15
.C:8015  20 18 E5    JSR $E518
.C:8018  58          CLI

All right, let’s try this out.

(C:$8019) bank ram
(C:$8019) s "fix3.8000-c010" 8 8000 c00f
Saving file `fix3.8000-c010' from $8000 to $c00f
(C:$8019) bk 800f
BREAK: 1  C:$800f  (Stop on exec)
(C:$8019) bk 8012
BREAK: 2  C:$8012  (Stop on exec)
(C:$8019) g fce2
#1 (Stop on  exec 800f)  184/$0b8,  60/$3c
.C:800f  20 07 C0    JSR $C007      - A:D7 X:FF Y:0A SP:ff N.-..I.C    4200295
(C:$800f) bank default
(C:$800f) m 00 01
>C:0000  2f 37                                                /7
(C:$0002) bank ram
(C:$0002) m a000 a000
>C:a000  ee                                                   .
(C:$a001) z
.C:c007  A9 36       LDA #$36       - A:D7 X:FF Y:0A SP:fd N.-..I.C    4200301
(C:$c007) z
.C:c009  85 01       STA $01        - A:36 X:FF Y:0A SP:fd ..-..I.C    4200303
(C:$c009) z
.C:c00b  4C 50 FD    JMP $FD50      - A:36 X:FF Y:0A SP:fd ..-..I.C    4200306
(C:$c00b) bank default
(C:$c00b) m 00 01
>C:0000  2f 36                                                /6
(C:$0002) g
#2 (Stop on  exec 8012)   39/$027,   2/$02
.C:8012  20 15 FD    JSR $FD15      - A:04 X:11 Y:D0 SP:ff ..-..I..    6994392
(C:$8012) bank default
(C:$8012) m 00 01
>C:0000  2f 36                                                /6
(C:$0002) bank ram
(C:$0002) m a000 a000
>C:a000  ee                                                   .

This worked nicely. Because we set the bank register before calling RAMTAS, $A000 was never corrupted. As before, the game continues after our patch, none the wiser. This is a pretty good fix, but there’s one thing I can think of that might be even better.

Fix 4: Prevent the Corruption, but Even Better

The reason the game code calls IOINIT and RAMTAS is that, as a ROM cartridge, these routines would not be called by the KERNAL during bootup. However, we’re not running this code from ROM anymore, and there’s no cartridge in the expansion port. Maybe these calls are redundant, and could be entirely eliminated. Let’s remove the calls, and set some breakpoints to see if the routines still get called from somewhere else.

To remove these subroutine calls, the simplest thing to do is overwrite them with the NOP instruction. That’s a one-byte instruction that tells the processor to do nothing for a couple of clock cycles. Each subroutine call assembles to three bytes of code, so we need to put in six NOPs in total, starting at $800C.
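The same patch could be applied to the file outside the monitor. Here is a minimal C++ sketch of the idea, where `nopOut` is a hypothetical helper and `mem` is assumed to hold the bytes loaded at `base`:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Overwrites `count` bytes starting at address `addr` with NOP ($EA).
// `mem` holds the bytes loaded at address `base`.
// Returns false and leaves `mem` untouched if the range does not fit.
bool nopOut(std::vector<std::uint8_t>& mem, std::uint16_t base,
            std::uint16_t addr, std::size_t count) {
    if (addr < base) return false;
    std::size_t start = static_cast<std::size_t>(addr - base);
    if (start > mem.size() || count > mem.size() - start) return false;

    for (std::size_t i = 0; i < count; ++i)
        mem[start + i] = 0xEA;  // NOP opcode
    return true;
}
```

For this fix, the call would be `nopOut(mem, 0x8000, 0x800C, 6)`, covering both three-byte JSR instructions.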

(C:$e5cd) l "qft.8000-c010" 8
Loading qft.8000-c010 from 8000 to C00F (4010 bytes)
(C:$e5cd) d 8009 8018
.C:8009  8E 16 D0    STX $D016
.C:800c  20 A3 FD    JSR $FDA3
.C:800f  20 50 FD    JSR $FD50
.C:8012  20 15 FD    JSR $FD15
.C:8015  20 18 E5    JSR $E518
.C:8018  58          CLI
(C:$8019) a 800c
.800c  nop
.800d  nop
.800e  nop
.800f  nop
.8010  nop
.8011  nop
.8012  
(C:$8012) d 8009 8018
.C:8009  8E 16 D0    STX $D016
.C:800c  EA          NOP
.C:800d  EA          NOP
.C:800e  EA          NOP
.C:800f  EA          NOP
.C:8010  EA          NOP
.C:8011  EA          NOP
.C:8012  20 15 FD    JSR $FD15
.C:8015  20 18 E5    JSR $E518
.C:8018  58          CLI

Okay, let’s try it:

(C:$8019) bank ram
(C:$8019) s "fix4.8000-c010" 8 8000 c00f
Saving file `fix4.8000-c010' from $8000 to $c00f
(C:$8019) bk fda3
BREAK: 1  C:$fda3  (Stop on exec)
(C:$8019) bk fd50
BREAK: 2  C:$fd50  (Stop on exec)
(C:$8019) g fce2
#1 (Stop on  exec fda3)   66/$042,   2/$02
.C:fda3  A9 7F       LDA #$7F       - A:31 X:30 Y:FF SP:fa N.-..I..   34382337
(C:$fda3) g

The breakpoint at IOINIT was hit, so the KERNAL must be calling that for us. The breakpoint at RAMTAS wasn’t hit. Interestingly, however, the game still started and ran just fine, and did not exhibit the crash bug. Apparently neither of these calls was needed. So, we have a fix that consists solely of removing two subroutine calls in the initialization code for the game, and replacing them with NOPs. That feels pretty elegant to me, and so I’m going to call this solved. There may be even more elegant solutions, and I’d love to hear about them, but I’m pretty happy with this.

Conclusion

When I started the process of diagnosing and fixing this crash, I had no idea what I’d find, or whether it would be something I could fix. It was a lot of fun though. There were quite a few twists and turns, and I’ve now fixed the copy of a game I’ve had since childhood, which is pretty cool.

Out of curiosity, I tried a few different copies of Quest for Tires that I acquired from places (ahem), both disk and cartridge images, and none of them exhibited this bug. So, it seems that it was a quirk of my particular copy. (I’m honestly not even sure exactly where the disk came from.) I suppose I could just play those copies, but it feels very satisfying to be able to play the version I grew up with, albeit slightly altered, without any more crashing.

This is the first time I’ve dug into the code of a commercial C64 game to try to make modifications, and I might not have been as efficient as someone more experienced in this domain. If I’ve made mistakes, or made inaccurate statements, or done things the hard way, I’d love to hear about it. I’d also very much like to hear about other solutions people might come up with to fix the crash. In any case, I hope you found this at least a bit entertaining, and possibly even educational. I definitely learned a lot.

EC11 Rotary Encoders

Nate Barney — Mon, 24 Jul 2023 06:39:56 +0000

EC11 incremental rotary encoders are user-interface controls for electronic devices. They’re particularly useful for quickly adjusting settings through a range of possible values, or for scrolling through lists or menus. They’re also relatively inexpensive, and pretty easy to find. The way these devices work is rather interesting, and somewhat surprising if you’re unfamiliar with it. In the remainder of this post, for brevity, I’ll simply refer to them as EC11’s.

EC11 Rotary Encoder

EC11’s look similar to potentiometers, but they’re very different devices. A potentiometer turns smoothly through a limited range of motion, but an EC11 turns in discrete steps, and its range of rotation is unlimited. A potentiometer is a variable resistor divider, but an EC11 has nothing to do with resistance. Instead, when it rotates, it produces signals that indicate the direction of rotation. EC11’s can also function as push buttons.

As a side note, I’m not sure why they’re called “EC11” encoders. My web searches on this topic have turned up nothing useful. I’ve also seen references to EC12 and EC16 encoders, which presumably differ in some important way. There are likely other types as well. The best I’ve been able to come up with is that “EC” stands for “EnCoder”, and the 11 means 11 mm, for (I guess) the shaft length. However, apparently, EC12’s and EC16’s don’t have the push button feature that EC11’s have, so I’m not convinced that’s the whole story. If you know more about this, please let me know.

Electrical Interface

An EC11 has five pins. There is no standard pin labeling, at least not that I could find, but it’s common for the pins to be labeled A, B, C, D, and E. The pinout diagram below is one possible labeling, and is the one I’ll use for this post. Note, however, that some documents have the A and B pin labels swapped.

EC11 Rotary Encoder Pinout (Top View)

Pins D and E are used for the button functionality. They are normally disconnected from each other, but are shorted together when the shaft is pressed like a button. This is relatively straightforward, and behaves just like almost any other push button. The other signals are a little more surprising. Pins A, B, and C provide information about rotations, but they’re not simply “clockwise,” “counterclockwise,” and “ground” pins, as I initially expected.

C serves as a common pin, and A and B work together to indicate whether the shaft is being turned, and in which direction. They do this by connecting and disconnecting from C using something called quadrature encoding. When the EC11 is not being turned, A, B, and C are all disconnected from each other. As the shaft is turned clockwise through one step, the following connections and disconnections happen in sequence:

A is connected to C,
B is connected to C,
A is disconnected from C,
B is disconnected from C.

For counter-clockwise rotations, the roles of A and B are reversed. Note that only one of the two signals changes at any one time. This encoding enables a simple mechanical design for the device, and also eliminates ambiguous transition states that could arise if both signals changed at nearly, but not exactly, the same time.
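One way to turn that sequence into code is a small lookup table indexed by the previous and current states. The C++ sketch below is only an illustration: `quadratureStep` is a hypothetical function, and the state encoding (bit 1 for A, bit 0 for B, with 1 meaning "connected to C") is my own choice.

```cpp
#include <cstdint>

// States are encoded as (A << 1) | B, where 1 means "connected to C".
// Returns +1 for one clockwise transition, -1 for one counter-clockwise
// transition, and 0 for no change or an invalid jump where both signals
// changed at once.
int quadratureStep(std::uint8_t prev, std::uint8_t curr) {
    static const int kTable[16] = {
        // curr: 00  01  10  11
                  0, -1, +1,  0,  // prev 00
                 +1,  0,  0, -1,  // prev 01
                 -1,  0,  0, +1,  // prev 10
                  0, +1, -1,  0,  // prev 11
    };
    return kTable[((prev & 3) << 2) | (curr & 3)];
}
```

A full clockwise click walks through 00, 10, 11, 01, 00 and sums to +4. The counter-clockwise walk through 00, 01, 11, 10, 00 sums to -4.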

To illustrate these signals a little better, I connected an EC11 to an oscilloscope and took some captures. (An oscilloscope shows a graph of voltage on the vertical axis vs. time on the horizontal axis.) In all of these images, A is the yellow trace, and B is the blue trace. For the first two images, C was connected to V_CC (+5V), and A and B were each pulled down to GND (0V) with 10kΩ pull-down resistors. This means that A and B are normally low, but are high while connected to C. The first image shows clockwise rotation, and the second image shows counter-clockwise rotation.

EC11, Positive Polarity, Clockwise Rotation

EC11, Positive Polarity, Counter-Clockwise Rotation

For the next two images, C was connected to GND, and A and B were each pulled up to V_CC with 10kΩ pull-up resistors. This means that A and B are normally high, but are low while connected to C. As before, the first image shows clockwise rotation, and the second image shows counter-clockwise rotation.

EC11, Negative Polarity, Clockwise Rotation

EC11, Negative Polarity, Counter-Clockwise Rotation

The choice of connection method may depend on various factors in the larger circuit, but often it will work just fine either way, and the choice is made arbitrarily or by personal preference of the circuit designer.

Mechanism

Before we dig into this section, I should point out that I haven’t actually taken one of these devices apart, so what follows about the specifics of their construction is speculation and educated guesses on my part. The precise nature and arrangement of the internal parts may differ from what I describe. However, I believe the basic principles are correct.

My understanding is that inside the body of an EC11 are two wipers, or conductive contacts—one connected to pin A and one to pin B. The shaft is attached to a rotating plate with a pattern of conductive pads that can make electrical connections to the wipers. These pads are connected to pin C. It’s a bit hard to explain with words, so here’s a diagram:

Hypothetical EC11 Rotary Encoder Mechanism (Top View)

This diagram is very approximate. For instance, actual EC11’s have many more than four positions. The ones I’m using have twenty. However, it should communicate the basic idea. As the shaft and plate rotate while the wipers stay stationary, one wiper will connect to a pad, then the other wiper will, and then the first wiper will disconnect, then the second will, just as described above. The mechanism also includes detents that cause the shaft and plate to snap into place with the wipers between pads.

As I said, some of the details above may not be exactly correct. It’s possible that the wipers move with the shaft over stationary pads. It’s possible that the pads are not on a plate perpendicular to the shaft, but actually placed on the surface of the shaft itself, with the wipers oriented sideways. It’s possible all these methods, and others, could be used by different models of encoders. However, I’m reasonably confident that the general principles outlined above are correct. I welcome feedback and/or corrections from anyone who knows more about this than I do.

Contact Bounce

Things like switches, buttons, and rotary encoders are all subject to a phenomenon known as contact bounce. This means that the physical conductors inside the mechanism can actually bounce off of each other and introduce spurious connections and disconnections before finally settling down to the correct state. This can make it appear as though, for example, a button was pressed several times when the user only meant to press it once.

To demonstrate contact bounce with the EC11, I took another oscilloscope capture, this time significantly zoomed in on the time (horizontal) axis. In the previous captures, the oscilloscope was set to 20 ms per division (i.e. background grid cell). To be able to show the bouncing, I set the oscilloscope to 100 μs, or 0.1 ms, per division. There is some inherent randomness to the bouncing. Sometimes the contacts don’t bounce at all. Sometimes they bounce a great deal. I ended up turning the EC11 a few times to get a good example of bouncing to capture. The following image was taken with the EC11 connected with positive polarity, and the shaft was rotated clockwise.

EC11 Contact Bounce

There are several ways to deal with contact bounce, including both hardware and software solutions, with varying levels of complexity. It might be worth going into this topic in more detail in a future post. However, one simple software solution is to keep track of the previously read value of the input pin, and introduce a delay between reads. The software can then react when there’s a difference between the previous value and the current value. If the delay is sufficiently short, the user won’t notice it, and if the delay is sufficiently long, the bouncing will have stopped by the time the pin is read again. In practice, I’ve found that delays of a few milliseconds work pretty well.
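Here is a hedged C++ sketch of that idea. It is a variation rather than the exact scheme described above: instead of blocking in a delay, it uses timestamps and ignores any sample that arrives too soon after the last accepted one. `DebouncedInput` and `Edge` are hypothetical names, and the caller is assumed to supply a millisecond clock.

```cpp
#include <cstdint>

enum class Edge { None, Rising, Falling };

class DebouncedInput {
public:
    explicit DebouncedInput(std::uint32_t intervalMs)
        : intervalMs_(intervalMs), lastMs_(0), started_(false), prev_(false) {}

    // Call with the current time in milliseconds and the raw pin level.
    // Samples arriving sooner than intervalMs_ after the last accepted
    // sample are ignored, which hides short bursts of contact bounce.
    Edge update(std::uint32_t nowMs, bool level) {
        if (started_ && (nowMs - lastMs_) < intervalMs_) return Edge::None;
        started_ = true;
        lastMs_ = nowMs;

        bool was = prev_;
        prev_ = level;
        if (!was && level) return Edge::Rising;
        if (was && !level) return Edge::Falling;
        return Edge::None;
    }

private:
    std::uint32_t intervalMs_;
    std::uint32_t lastMs_;
    bool started_;
    bool prev_;
};
```

On an Arduino, `nowMs` would typically come from `millis()` and `level` from comparing `digitalRead()` against `HIGH`.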

Arduino Sketch

I’ve put together a simple circuit and Arduino sketch to demonstrate how to use these devices. Feel free to use or adapt this code in your own projects. We’ll first take a look at the hardware setup, then we’ll discuss the code.

This example uses an Arduino Uno R3, but it should work with almost any Arduino. Similarly, while this example (somewhat arbitrarily) uses Arduino pins 2, 3, and 4 to connect to the EC11, any free GPIO pins should do. The EC11 is connected with positive polarity. (Changing the hardware connections and sketch code to use the EC11 with negative polarity is relatively straightforward, and is left as an exercise for the reader.)

Arduino EC11 Example Circuit

Connect the Arduino’s 5V pin to the positive breadboard rails and connect the Arduino’s GND to the negative breadboard rails. Connect EC11 pins C and E to the positive rails as well. Connect EC11 pins A, B, and D to the negative breadboard rails through 10kΩ resistors. (Since these are pull-down resistors, the resistor values themselves don’t matter too much, as long as they’re large-ish.) Connect EC11 pin A to Arduino pin 3, EC11 pin B to Arduino pin 4, and EC11 pin D to Arduino pin 2.

Now that the circuit is hooked up, we can move on to the code. The following code uses a simple polling loop with a basic software debouncing scheme, and simply prints messages over the serial connection when the EC11 is rotated or pressed. It should be straightforward to adapt this code to do other things on these events instead.

To read the rotation state, we first check to see if pin A is going from low to high. This rising edge occurs every time the EC11 is rotated. The direction is then determined by looking at the B pin at the moment of A’s rising edge. In a clockwise rotation, A goes high before B does, so while A is going high, B will still be low. In a counter-clockwise rotation, B goes high before A, so while A is going high, B will already be high. If this isn’t clear, it might help to take another look at the oscilloscope captures above.

We only need to check B when we’ve already determined there’s a rising edge on A, so the code below omits that check in other cases. Also note that we don’t need to debounce B, because bouncing happens right after a state change, and we only read B well before or after its state changes.

The button press events are handled slightly differently from the rotation events. We check for both rising and falling edges on the button input. A rising edge corresponds to the button being pressed, and a falling edge corresponds to the button being released. It’s sometimes useful to track both types of button events, so they’re included in the example.

//
// ec11_example.ino
//
// Example Arduino sketch to demonstrate reading an EC11 rotary encoder
//

// Pin definitions
#define EC11_PIN_A 3
#define EC11_PIN_B 4
#define EC11_PIN_D 2

// Number of milliseconds to wait between pin reads, for software debouncing
#define DEBOUNCE_DELAY_MILLIS 3

// Global variables to hold the previous state of the EC11 pins, initialized
// to LOW. We don't need to track the previous state of B to debounce it
// because we don't read it near an edge, and any bouncing should already
// have stopped by the time we do read it.
int prevA = LOW;
int prevD = LOW;

// Runs once at beginning of sketch
void setup()
{
  // Set pins to input
  pinMode(EC11_PIN_A, INPUT);
  pinMode(EC11_PIN_B, INPUT);
  pinMode(EC11_PIN_D, INPUT);

  // Initialize serial connection
  Serial.begin(9600);
}

// Runs continually in a loop after setup is complete
void loop()
{
  // Local variables to hold the current state of the pins
  int currentA;
  int currentB;
  int currentD;

  // Wait a bit to allow any contact bounce to settle
  delay(DEBOUNCE_DELAY_MILLIS);

  // Process rotation pins
  currentA = digitalRead(EC11_PIN_A);
  if ((prevA == LOW) && (currentA == HIGH))
  {
    // There's a rising edge on A, so the EC11 is being rotated. The state
    // of B at the rising edge of A indicates the direction. If B is low,
    // then A went high first, and the rotation is clockwise. If B is high,
    // then B went high first, and the rotation is counter-clockwise.
    currentB = digitalRead(EC11_PIN_B);

    if (currentB == LOW)
    {
      Serial.println("Clockwise");
    }
    else
    {
      Serial.println("Counter-clockwise");
    }
  }

  // Process button pin
  currentD = digitalRead(EC11_PIN_D);
  if ((prevD == LOW) && (currentD == HIGH))
  {
    Serial.println("Pressed");
  }
  else if ((prevD == HIGH) && (currentD == LOW))
  {
    Serial.println("Released");
  }

  // update previous state globals
  prevA = currentA;
  prevD = currentD;
}

After building this circuit and compiling and uploading the sketch to the Arduino, open the serial monitor and set it to 9600 baud. Then, turn and press the EC11. You should see output similar to the following in the serial monitor:

Clockwise
Counter-clockwise
Pressed
Released
Pressed
Clockwise
Counter-clockwise
Released

Summary

So, that’s about as much as I can say about the EC11 rotary encoder. It’s a cool little device, and can be used to build pretty nice user interfaces for hobbyist projects using Arduinos, Raspberry Pis, or any other microcontroller. Feel free to reach out if you have any questions, comments, or corrections about anything in this post.

GPIO Pins, Shift Registers, and SPI

Nate Barney — Sun, 16 Jul 2023 15:35:56 +0000

Microcontroller GPIO Pins

When using microcontrollers, like the ATmega line of chips, or boards based on them, like the Arduino series, General Purpose Input/Output (GPIO) pins are often at a premium. The Arduino Uno R3 (based on the ATmega328P), for example, has only fourteen digital GPIO pins (twenty if the six analog input pins are also used as digital pins), and that’s if you’re not using some of the pins’ special functions.

It’s very easy to run out of these GPIO pins if multiple devices are connected directly to them. Each LED uses one pin, configured as an output. Each button uses one pin, configured as an input. A seven-segment display uses eight outputs (there’s a decimal point that’s not counted in the name for some reason). In its most minimal configuration, an LCD character display module uses six pins, but can be configured to use as many as eleven pins. All this quickly adds up.

Shift Registers

A common solution to the problem of too few GPIO pins is to use one or more shift registers. Simply put, a shift register is a chip that converts a single signal changing over time (i.e. serial) to multiple signals at the same time (i.e. parallel), or vice-versa. This solution is so common that the Arduino core library has functions specifically to support their use, and there are lots of Arduino/shift-register tutorials to be found on the internet. (I guess I’m adding to that list.)

There are many different types of shift register available, but two of the most popular are the 74HC595 (to add output pins) and the 74HC165 (to add input pins). These chips are easy to use, inexpensive, and readily available. You can get them from electronics suppliers like Mouser and DigiKey, and they’re even available from Amazon.

The basic principle behind shift registers, and the reason for their name, is that each chip holds a number of bits (typically eight) in a small memory circuit called a register, and these bits shift from one to the next when a clock pulse is applied. Bit 0 shifts into bit 1, bit 1 shifts into bit 2, and so on. The new value for bit 0 comes from a serial input pin at the time of the clock pulse. The last bit falls out of the shift register, and is made available on one of the chip’s output pins, called serial output. Typically, the shift happens on the rising edge of the clock pulse (i.e. from low to high).

The fact that each shift register has a serial input pin that feeds into bit 0 and a serial output pin that holds the value of the last bit as it’s shifted out means that these chips may be cascaded, or chained together, by connecting the serial output of one chip to the serial input of another, and by connecting their clocks together. Two 8-bit shift registers cascaded together, for example, behave as if they were a single 16-bit shift register. The number of chips that can be cascaded into a single effective register is typically limited only by things like available power and the capacitance of the connecting wires, but the chain would need to be pretty long for those things to start to matter.
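The shifting and cascading behavior described above can be sketched as a small software model. This is only an illustration: `ShiftRegister8` and `Cascade16` are made-up names, not part of any library, and the model ignores everything electrical.

```cpp
#include <stdint.h>

// Hypothetical software model of one 8-bit shift register.
struct ShiftRegister8 {
  uint8_t bits = 0;

  // Simulates one rising clock edge: every bit moves up one position, the
  // serial input becomes the new bit 0, and the old bit 7 "falls out" and
  // is returned as the serial output.
  uint8_t clock(uint8_t serialIn) {
    uint8_t serialOut = (bits >> 7) & 1;
    bits = (uint8_t)((bits << 1) | (serialIn & 1));
    return serialOut;
  }
};

// Two models cascaded: they share a clock, and the serial output of the
// first feeds the serial input of the second.
struct Cascade16 {
  ShiftRegister8 first;
  ShiftRegister8 second;

  uint8_t clock(uint8_t serialIn) {
    // The second chip receives the bit that was leaving the first chip
    // at the moment of the clock edge.
    uint8_t carry = first.clock(serialIn);
    return second.clock(carry);
  }

  // The pair read as a single 16-bit register.
  uint16_t value() const {
    return (uint16_t)(((uint16_t)second.bits << 8) | first.bits);
  }
};
```

Clocking sixteen bits into a `Cascade16`, most significant bit first, leaves the same 16-bit word in `value()`, with the first byte sent sitting in the second chip.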

There are a few disadvantages to using shift registers for GPIO expansion. First, it’s slower. Instead of reading or writing several pins at once, the microcontroller has to process them one at a time. Second, the pins can’t be read or written independently. To read/write bit 4, for example, bits 0-3 must first be read/written. When the shift register is being used to add output pins, this means the software needs to keep track of what bits have been written. Third, most shift registers are unidirectional. That is, each type of chip provides either inputs or outputs, but not both. (The 74HC299 is one exception to this.) In practice, however, these disadvantages are often easily overcome and are more than made up for by the added inputs and/or outputs.
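One common way to handle the second disadvantage for outputs is to keep a copy of the last byte written (sometimes called a shadow copy) and change individual bits in that copy before re-sending the whole byte. Here’s a minimal sketch of the idea; `outputShadow` and `setOutputBit()` are hypothetical names chosen for this example.

```cpp
#include <stdint.h>

// Copy of the byte most recently sent to the output shift register.
static uint8_t outputShadow = 0;

// Changes one bit in the shadow copy and returns the full byte, which the
// caller then shifts out to the chip in its entirety.
uint8_t setOutputBit(uint8_t bit, bool high) {
  if (high) {
    outputShadow |= (uint8_t)(1u << bit);
  } else {
    outputShadow &= (uint8_t)~(1u << bit);
  }
  return outputShadow;
}
```

The returned byte is what would be passed to whatever routine shifts data out to the chip, so the other seven outputs keep their previous values.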

74HC595

The 74HC595 is an 8-bit Serial-In / Parallel-Out (SIPO) shift register. It shifts in one bit from its serial input pin (pin 14) on each rising edge of its serial clock pin (pin 11), and has eight parallel output pins (pins 15 and 1-7). However, the bits in the shift register are not immediately made available on the output pins.

The 74HC595 internally has two registers: the shift register and the storage register (sometimes called the latch register). The shift register behaves as described above. However, the parallel output pins are not connected directly to the shift register, but rather to the storage register. The contents of the shift register are transferred to the storage register (and therefore the output pins) on the rising edge of a second clock pin (pin 12), called the storage clock pin (or sometimes the latch pin).

This dual-register configuration may seem like a needless complication. Worse, instead of just two GPIO pins (serial input and serial clock), it uses those plus a third (storage clock). However, there is a good reason for this design. Some devices connected to the 74HC595’s output pins may not react well to the data shifting along one bit at a time. The two-register approach allows data bits to be shifted in while keeping the output pins unchanged, then all made available effectively simultaneously once the complete set of bits is ready.

If a device can tolerate seeing the data shift one bit at a time, the serial clock and storage clock pins may be connected together. This causes the 74HC595 to behave (more or less) as if the outputs were connected directly to the internal shift register, and it only requires a total of two GPIO pins instead of three.

To interface a 74HC595 with an Arduino (or similar microcontroller), connect the serial input, serial clock, and storage clock pins of the 74HC595 to GPIO pins on the microcontroller, with all three configured as outputs. Then, in the microcontroller code (i.e. Arduino sketch), write each bit to the shift register as follows.

1. Set the serial input pin to the value of the data bit.
2. Bring the serial clock pin low, then high, to create a rising edge and shift the bit.
3. Repeat steps 1 and 2 until all bits have been shifted.
4. Once all bits have been shifted, bring the storage clock low then high to transfer the bits to the parallel output pins.

Step 4 can be omitted if the serial clock and storage clock are tied together. The Arduino core library function shiftOut() performs steps 1 and 2 in a loop to write eight bits.
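As a rough illustration, the four steps might look like this in a sketch. The pin numbers and the name `write595()` are arbitrary choices for this example, and the three pins are assumed to have been set to OUTPUT with pinMode() in setup(). The `#ifndef ARDUINO` parts are hypothetical stand-ins that simulate a 74HC595 so the logic can be tried on a desktop compiler; on real hardware they compile away.

```cpp
#ifdef ARDUINO
#include <Arduino.h>
#else
// Desktop-only stand-ins: digitalWrite() drives a tiny simulated 74HC595.
#include <stdint.h>
const uint8_t LOW = 0, HIGH = 1;
uint8_t simPins[20];     // last level written to each pin
uint8_t simShift = 0;    // simulated shift register
uint8_t simOutputs = 0;  // simulated storage register (the output pins)
void digitalWrite(uint8_t pin, uint8_t level);
#endif

// Arbitrary pin assignments for this example
const uint8_t DATA_PIN = 5;     // to 74HC595 serial input (pin 14)
const uint8_t SHIFT_CLK = 6;    // to 74HC595 serial clock (pin 11)
const uint8_t STORAGE_CLK = 7;  // to 74HC595 storage clock (pin 12)

// Writes one byte to the 74HC595, most significant bit first.
void write595(uint8_t value) {
  for (int8_t i = 7; i >= 0; i--) {
    // Step 1: present the data bit
    digitalWrite(DATA_PIN, ((value >> i) & 1) ? HIGH : LOW);
    // Step 2: rising edge on the serial clock shifts the bit in
    digitalWrite(SHIFT_CLK, LOW);
    digitalWrite(SHIFT_CLK, HIGH);
  }
  // Step 4: rising edge on the storage clock updates the output pins
  digitalWrite(STORAGE_CLK, LOW);
  digitalWrite(STORAGE_CLK, HIGH);
}

#ifndef ARDUINO
void digitalWrite(uint8_t pin, uint8_t level) {
  bool rising = (simPins[pin] == LOW) && (level == HIGH);
  simPins[pin] = level;
  if (rising && pin == SHIFT_CLK) {
    simShift = (uint8_t)((simShift << 1) | simPins[DATA_PIN]);
  }
  if (rising && pin == STORAGE_CLK) {
    simOutputs = simShift;
  }
}
#endif
```

Sending the most significant bit first is a choice, not a requirement; which output pin each bit lands on depends on the bit order used.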

The 74HC595 has a few other useful features, including tri-state outputs and a reset pin. The datasheet has more information about these if you’re interested, but they’re not particularly relevant to this post, so that’s all I’ll say about them here.

74HC165

The 74HC165 is an 8-bit Parallel-In / Serial-Out (PISO) shift register. Data bits are loaded into the chip all at once from eight parallel input pins (pins 11-14 and 3-6) while the parallel load pin (pin 1) is set low. Once the bits are loaded, they can be shifted out one at a time to the serial output pin (pin 9) by pulsing the serial clock pin (pin 2). Bits are shifted on the rising edge of the clock pulse, and the serial input pin (pin 10) provides a new value for bit 0 on each pulse.

The process to interface the 74HC165 with a microcontroller is very similar to the process for the 74HC595. Connect the 74HC165’s serial clock and parallel load pins to output pins on the microcontroller, and connect the 74HC165’s serial output pin to a microcontroller input pin. Then, in software, perform the following:

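1. Bring the serial clock pin high. (Any shift this causes is harmless, because the register is about to be reloaded.)
2. Bring the parallel load pin low, then high to latch the inputs into the shift register. The first bit is now already present on the serial output pin.
3. Bring the serial clock pin high. On the first pass it’s already high, so nothing shifts; on every later pass, this creates a rising edge and shifts out the next bit.
4. Read and process the data bit from the serial output pin.
5. Bring the serial clock pin back low to prepare for the next bit.
6. Repeat steps 3 through 5 for each bit.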

The Arduino core library function shiftIn() performs steps 3 through 5 in a loop to read eight bits.
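Here’s a minimal sketch of one way to read a byte from the 74HC165 by hand. The pin numbers and the name `read165()` are arbitrary choices for this example, and the pins are assumed to have been configured with pinMode() in setup(). The important detail is that the first bit is already on the serial output right after the load, so it has to be read before the first rising clock edge. The `#ifndef ARDUINO` parts are hypothetical stand-ins that simulate the chip on a desktop compiler.

```cpp
#ifdef ARDUINO
#include <Arduino.h>
#else
// Desktop-only stand-ins: a tiny simulated 74HC165.
#include <stdint.h>
const uint8_t LOW = 0, HIGH = 1;
uint8_t simInputs = 0;  // parallel inputs, H in bit 7 down to A in bit 0
uint8_t simShift = 0;   // simulated shift register
uint8_t simClock = LOW, simLoad = HIGH;
void digitalWrite(uint8_t pin, uint8_t level);
int digitalRead(uint8_t pin);
#endif

// Arbitrary pin assignments for this example
const uint8_t LOAD_PIN = 8;   // to 74HC165 parallel load (pin 1)
const uint8_t CLOCK_PIN = 9;  // to 74HC165 serial clock (pin 2)
const uint8_t DATA_PIN = 10;  // from 74HC165 serial output (pin 9)

// Reads all eight inputs; input H ends up in bit 7, input A in bit 0.
uint8_t read165() {
  uint8_t value = 0;
  digitalWrite(CLOCK_PIN, HIGH);  // so the first pass below makes no edge
  digitalWrite(LOAD_PIN, LOW);    // load the parallel inputs...
  digitalWrite(LOAD_PIN, HIGH);   // ...and latch them
  for (uint8_t i = 0; i < 8; i++) {
    digitalWrite(CLOCK_PIN, HIGH);  // rising edge on every pass but the first
    value = (uint8_t)((value << 1) | (digitalRead(DATA_PIN) == HIGH ? 1 : 0));
    digitalWrite(CLOCK_PIN, LOW);
  }
  return value;
}

#ifndef ARDUINO
void digitalWrite(uint8_t pin, uint8_t level) {
  if (pin == LOAD_PIN) {
    simLoad = level;
    if (level == LOW) simShift = simInputs;
  }
  if (pin == CLOCK_PIN) {
    bool rising = (simClock == LOW) && (level == HIGH);
    if (rising && simLoad == HIGH) simShift = (uint8_t)(simShift << 1);
    simClock = level;
  }
}
int digitalRead(uint8_t pin) {
  return (pin == DATA_PIN && (simShift & 0x80)) ? HIGH : LOW;
}
#endif
```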

Like the 74HC595, the 74HC165 has additional features, such as clock inhibit and complementary serial output, that I’m not going to discuss here. For more information, see the datasheet.

Serial Peripheral Interface (SPI)

The Serial Peripheral Interface (SPI) protocol is another way to reduce the number of microcontroller GPIO pins needed for a project. There are many devices that support SPI natively, and there are quite a few available on Amazon, for instance. The protocol is relatively simple, and it’s straightforward to implement in software. Ben Eater has an excellent video in which he interfaces an SPI device with a 65C02 microprocessor using assembly code. However, most microcontrollers, including Arduinos, have hardware support for communicating via SPI. This enables much faster speeds than would be possible with a software implementation.

SPI is, as its name implies, a serial protocol. This means that one bit is transferred at a time. Actually, that’s not quite true, since SPI supports full-duplex bidirectional communication. This means that one bit is sent and one bit is received at the same time. It’s also a synchronous protocol, which means that the timing is controlled by a dedicated clock signal. Finally, it’s an asymmetric protocol. This means that the communication is entirely controlled by one device, called the master (sometimes controller). This is usually the microcontroller. The other device is called the slave (sometimes peripheral), and it sends and receives data only as directed by the master. Multiple slave devices can be connected to a single master.

The basic setup for SPI uses four different signals, each on a separate pin:

Serial Clock (SCK): This is the signal that controls the rate of the data transfer, and is driven by the master. One bit is transferred in each direction when this signal is pulsed. SPI doesn’t specify whether the bit is transferred on the rising or falling edge of the clock signal. Instead, that’s left up to the individual implementation. Most microcontrollers will have an option to specify this as part of the SPI mode setting. SPI modes 0 and 3 transfer on the rising edge, and modes 1 and 2 transfer on the falling edge. Mode 0 differs from 3, and 1 from 2, in the level of the clock signal when idle. Modes 0 and 1 keep the clock low when idle, and 2 and 3 keep it high. The datasheet for the slave device will specify which mode(s) it expects, but mode 0 is typical.
Master Out / Slave In (MOSI): This is a data transmission signal over which the master sends data bits to the slave, at the rate dictated by the SCK signal. Note that this way of naming the signal is unambiguous about direction, and can be labeled the same on both master and slave devices. If the slave device is read-only (e.g. a sensor of some sort), this signal may be omitted. This signal is also sometimes called Controller Out / Peripheral In (COPI). On slave devices, this signal may also be called Data In (DI, DIN) or Serial Data In (SDI).
Master In / Slave Out (MISO): This is a data transmission signal over which the slave sends data bits to the master, at the rate dictated by the SCK signal. If the slave device is write-only (e.g. a display of some sort), this signal may be omitted. This signal is also sometimes called Controller In / Peripheral Out (CIPO). On slave devices, this signal may also be called Data Out (DO, DOUT) or Serial Data Out (SDO).
Slave Select (SS, conventionally written with an overbar): This is a control signal which activates the slave (i.e. tells it to pay attention to the other signals). The overbar indicates that this signal is active low, and is pronounced by adding “bar” after the abbreviation (i.e. “ess ess bar”). In contexts where overbars are typographically difficult, this may be rendered as /SS or SSB (“B” for “Bar”). Active low means that the slave is selected when this signal is low, and deselected when the signal is high. When multiple slave devices are connected, each device is assigned its own SS signal. The other signals are shared among all slave devices. Usually, only one slave is selected at any one time. This signal is also sometimes called Chip Select (CS, /CS, CSB).
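The four SPI modes are usually described in terms of two bits, clock polarity (CPOL) and clock phase (CPHA), with the mode number being CPOL in the upper bit and CPHA in the lower. Those two names are standard SPI terminology rather than something introduced earlier in this post. As a small sketch, with the hypothetical names `SpiModeInfo` and `describeSpiMode()`, the mode number can be decoded like this:

```cpp
#include <stdint.h>

// What a given SPI mode number means for the clock signal.
struct SpiModeInfo {
  bool clockIdlesHigh;      // true for modes 2 and 3
  bool sampleOnRisingEdge;  // true for modes 0 and 3
};

// Decodes an SPI mode number (0 through 3).
SpiModeInfo describeSpiMode(uint8_t mode) {
  uint8_t cpol = (mode >> 1) & 1;  // clock polarity: idle level of SCK
  uint8_t cpha = mode & 1;         // clock phase: which edge samples data
  SpiModeInfo info;
  info.clockIdlesHigh = (cpol == 1);
  // Data is sampled on the rising edge when polarity and phase match.
  info.sampleOnRisingEdge = (cpol == cpha);
  return info;
}
```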

To perform SPI transfers in software, make sure the relevant signals are connected between the master and slave, and perform the following:

1. Bring the SS pin low to select the slave device.
2. Set the MOSI pin to the value of the outgoing bit. (Omit for read-only slave.)
3. Pulse the SCK pin (high then low, or low then high, depending on mode).
4. Read the incoming bit from the MISO pin. (Omit for write-only slave.)
5. Repeat steps 2 through 4 until all bits are transferred.
6. Bring the SS pin high to deselect the slave device.
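Here’s a minimal sketch of those steps as a software (bit-banged) transfer of one byte, assuming SPI mode 0 and most-significant-bit-first order. The pin numbers and the name `softSpiTransfer()` are arbitrary choices for this example, and the pins are assumed to have been configured with pinMode() in setup(). In mode 0 the incoming bit is sampled while the clock is high, so the read happens in the middle of the pulse. The `#ifndef ARDUINO` parts are hypothetical desktop stand-ins that loop MOSI back to MISO.

```cpp
#ifdef ARDUINO
#include <Arduino.h>
#else
// Desktop-only stand-ins: pins are just remembered levels.
#include <stdint.h>
const uint8_t LOW = 0, HIGH = 1;
uint8_t simPins[20];  // last level written to each pin
void digitalWrite(uint8_t pin, uint8_t level) { simPins[pin] = level; }
int digitalRead(uint8_t pin);
#endif

// Arbitrary pin assignments for this example
const uint8_t SOFT_SS = 4;
const uint8_t SOFT_MOSI = 5;
const uint8_t SOFT_MISO = 6;
const uint8_t SOFT_SCK = 7;

// Sends one byte and returns the byte received at the same time.
uint8_t softSpiTransfer(uint8_t out) {
  uint8_t in = 0;
  digitalWrite(SOFT_SCK, LOW);  // mode 0: clock idles low
  digitalWrite(SOFT_SS, LOW);   // select the slave
  for (int8_t i = 7; i >= 0; i--) {
    digitalWrite(SOFT_MOSI, ((out >> i) & 1) ? HIGH : LOW);  // outgoing bit
    digitalWrite(SOFT_SCK, HIGH);                            // rising edge
    in = (uint8_t)((in << 1) | (digitalRead(SOFT_MISO) == HIGH ? 1 : 0));
    digitalWrite(SOFT_SCK, LOW);                             // back to idle
  }
  digitalWrite(SOFT_SS, HIGH);  // deselect the slave
  return in;
}

#ifndef ARDUINO
// Loopback: MISO reads back whatever was last put on MOSI.
int digitalRead(uint8_t pin) {
  return (pin == SOFT_MISO) ? simPins[SOFT_MOSI] : simPins[pin];
}
#endif
```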

As previously mentioned, most microcontrollers have SPI support built in hardware. The SPI hardware in the Arduino Uno’s ATmega328P can run the serial clock at up to half the rate of the system clock, which is 16 MHz on an Arduino Uno, so the Arduino can do SPI at up to 8 MHz. This is significantly faster than would be possible with a pure software implementation. Some microcontrollers can run the SPI serial clock at tens or even hundreds of MHz.

To use the microcontroller’s SPI hardware to perform transfers, MOSI, MISO, and SCK on the slave should be connected to the corresponding SPI pins on the microcontroller. On the Arduino Uno R3, MOSI is pin 11, MISO is pin 12, and SCK is pin 13. The slave’s SS pin can be connected to any free GPIO pin on the master, configured as an output. The Arduino has a pin labeled SS (pin 10), but it’s only necessary to use that when connecting the Arduino as a slave to another SPI master.

Once everything is connected, issue the microcontroller instructions to enable and use the SPI hardware. In an Arduino sketch, this can be done with the SPI library, specifically the functions SPI.begin() (called once to enable the SPI hardware), SPI.beginTransaction() (called once before each batch of transfers), SPI.transfer() (called any number of times to transfer a batch of eight bits each time), and SPI.endTransaction() (called once after each batch of transfers). If you need to disable the SPI hardware again for some reason, that can be done with function SPI.end(). The SS pin must still be handled explicitly by the software, but the others will be taken care of by the hardware.
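Put together, a single-byte exchange using the SPI library might look something like the following. The pin number, the 8 MHz clock, mode 0, and the function names `spiSetup()` and `spiExchange()` are assumptions for this example; the right settings depend on the slave device. The `#else` branch is a hypothetical desktop stand-in whose fake slave answers each byte with its bitwise complement.

```cpp
#ifdef ARDUINO
#include <Arduino.h>
#include <SPI.h>
#else
// Desktop-only stand-ins for the few Arduino names used below.
#include <stdint.h>
const uint8_t LOW = 0, HIGH = 1, OUTPUT = 1, MSBFIRST = 1, SPI_MODE0 = 0;
uint8_t simSS = HIGH;
void pinMode(uint8_t, uint8_t) {}
void digitalWrite(uint8_t, uint8_t level) { simSS = level; }
struct SPISettings { SPISettings(uint32_t, uint8_t, uint8_t) {} };
struct FakeSPI {
  void begin() {}
  void beginTransaction(SPISettings) {}
  // The fake slave only answers while it is selected.
  uint8_t transfer(uint8_t out) { return simSS == LOW ? (uint8_t)~out : 0xFF; }
  void endTransaction() {}
} SPI;
#endif

// Any free GPIO pin will do; 9 is an arbitrary choice.
const uint8_t SLAVE_SELECT_PIN = 9;

// Call once, e.g. from setup().
void spiSetup() {
  pinMode(SLAVE_SELECT_PIN, OUTPUT);
  digitalWrite(SLAVE_SELECT_PIN, HIGH);  // start with the slave deselected
  SPI.begin();
}

// Sends one byte and returns the byte received at the same time.
uint8_t spiExchange(uint8_t out) {
  SPI.beginTransaction(SPISettings(8000000, MSBFIRST, SPI_MODE0));
  digitalWrite(SLAVE_SELECT_PIN, LOW);   // SS is handled by the software
  uint8_t in = SPI.transfer(out);        // the hardware handles the rest
  digitalWrite(SLAVE_SELECT_PIN, HIGH);
  SPI.endTransaction();
  return in;
}
```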

Using Shift Registers with SPI

You may have noticed that the protocols for shift registers and SPI are remarkably similar, and wondered if they could be used together. In fact, they can! (If you didn’t notice, don’t worry too much; it took me way longer than I care to admit to notice it myself.) With just a few extra chips, 74HC595 and 74HC165 can be used to build a hardware SPI interface for a device which normally uses a parallel interface, thereby significantly reducing the number of GPIO pins used by the design, while only sacrificing a little bit of speed.

Shift registers and SPI each have serial clock signals, so those can simply be connected together. Similarly, SPI’s MOSI signal can be connected to the serial input signal of a 74HC595 shift register, and SPI’s MISO signal can be connected to the serial output signal of a 74HC165 shift register (although there’s a complication with MISO that we’ll discuss a little later). This takes care of three of the four SPI signals, and two of the three signals needed for each shift register.

The remaining signals, SPI’s SS, the 74HC595’s storage clock, and the 74HC165’s parallel load, require just a little more thought. Recall that SS is brought low before an SPI transfer, and high after the transfer is complete. Recall also that the 74HC595’s storage clock needs a rising edge to move the data from the shift register to the storage register. That rising edge can be provided at just the right time by connecting SS to the 74HC595’s storage clock pin. This does mean that the slave device must be deselected after transferring each set of bits, but in practice, that’s not a significant disadvantage, just something to be aware of.

Before each transfer, the 74HC165’s parallel load pin needs to be set low to load data into the shift register, and high to enable that data to be shifted out. This is the inverse of the SS signal, which is held high between transfers and brought low for the duration of each transfer. To correct this, we can use an inverter (a.k.a. a NOT gate) to flip the SS signal to be what we need for the parallel load pin. The 74HC04 has six inverters, which is more than we need, but will work for our purposes. (The inverters we’re not using could be used in other parts of a larger circuit, or their inputs can simply be connected to GND or V_CC to keep them from floating and introducing noise into the circuit.) If we connect SS to the input of one of these inverters, and the output of that inverter to the 74HC165’s parallel load pin, the 74HC165 will continually load data into its shift register until it’s selected, at which point it will latch that data and make it ready to be shifted out by the other SPI signals.

If this SPI interface that we’ve built out of shift registers is the only slave connected to the SPI pins, then we’re done; all that’s left is to connect the shift registers’ parallel inputs and outputs to the device. However, recall that if the SPI hardware is shared, MOSI, MISO, and SCK are also shared among all slave devices. For the 74HC595, this is no problem. It will shift bits in and out of its shift register even while another slave is selected, but its parallel output pins will stay latched, because they are controlled by its SS signal. The 74HC165 is another matter. If another slave is selected and SCK is pulsed, the 74HC165 will still shift bits out into the MISO pin, interfering with the signal coming from the other slave.

To fix this, we need to connect the 74HC165’s serial output to MISO only while its SS signal is low. This can be accomplished with a tri-state buffer. A tri-state buffer is a logic gate that has a data input, a control input, and a data output. The control input is active low, and when it’s low, the value on the data input is propagated to the data output. When the control input is high, the data output is set to high impedance, which basically means it’s not connected. The 74HC125 contains four tri-state buffers. Again, we only need one, but that’s okay. We connect the 74HC165’s serial output to the data input of one of the 74HC125’s tri-state buffers, connect the output of that tri-state buffer to MISO, and connect SS to the tri-state buffer’s control input. Now, the 74HC165 can only send data on MISO when it’s selected, which is exactly what we want.

As previously mentioned, shift registers of the same type may be cascaded together to create shift registers that hold more bits. This remains true when using them with SPI. If you’re building an SPI interface for a device that has sixteen inputs and twenty-four outputs, you can cascade two 74HC595s together and three 74HC165s. The principles are the same; there are just more bits. If the cascaded chains are of unequal length, then each transfer should include enough bits for the longest chain. The extra bits for the shorter chain can safely be ignored by the software. If the device is read-only or write-only, the interface can be built using only one of the two types of shift registers.
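As a rough sketch of the sixteen-input, twenty-four-output example, a transfer needs three bytes to cover the longer 74HC165 chain. The first outgoing byte is a dummy that falls off the end of the two-chip 74HC595 chain. The pin number, the 4 MHz clock, mode 0, and the names `interfaceSetup()` and `exchange()` are assumptions for this example, and which bit lands on which pin depends on how the chips are wired. The `#else` branch is a hypothetical desktop stand-in that models the two chains.

```cpp
#ifdef ARDUINO
#include <Arduino.h>
#include <SPI.h>
#else
// Desktop-only stand-ins modeling the two shift register chains.
#include <stdint.h>
const uint8_t LOW = 0, HIGH = 1, OUTPUT = 1, MSBFIRST = 1, SPI_MODE0 = 0;
uint32_t simDeviceOutputs = 0;  // 24 bits feeding the 74HC165 inputs
uint16_t simDeviceInputs = 0;   // 16 bits driven by the 74HC595 outputs
uint32_t simShiftIn = 0;        // contents of the 74HC165 chain
uint16_t simShiftOut = 0;       // contents of the 74HC595 chain
void pinMode(uint8_t, uint8_t) {}
void digitalWrite(uint8_t, uint8_t level) {
  if (level == LOW) simShiftIn = simDeviceOutputs;  // inverted SS ends the load
  else simDeviceInputs = simShiftOut;  // rising SS clocks the storage registers
}
struct SPISettings { SPISettings(uint32_t, uint8_t, uint8_t) {} };
struct FakeSPI {
  void begin() {}
  void beginTransaction(SPISettings) {}
  uint8_t transfer(uint8_t out) {
    uint8_t in = (uint8_t)(simShiftIn >> 16);
    simShiftIn = (simShiftIn << 8) & 0xFFFFFF;
    simShiftOut = (uint16_t)((simShiftOut << 8) | out);
    return in;
  }
  void endTransaction() {}
} SPI;
#endif

const uint8_t INTERFACE_SS_PIN = 9;  // arbitrary free GPIO pin

// Call once, e.g. from setup().
void interfaceSetup() {
  pinMode(INTERFACE_SS_PIN, OUTPUT);
  digitalWrite(INTERFACE_SS_PIN, HIGH);
  SPI.begin();
}

// Writes 16 bits to the 74HC595 chain and reads 24 bits from the
// 74HC165 chain in a single three-byte transfer.
uint32_t exchange(uint16_t outputs) {
  SPI.beginTransaction(SPISettings(4000000, MSBFIRST, SPI_MODE0));
  digitalWrite(INTERFACE_SS_PIN, LOW);
  uint32_t inputs = SPI.transfer(0x00);  // dummy byte for the shorter chain
  inputs = (inputs << 8) | SPI.transfer((uint8_t)(outputs >> 8));
  inputs = (inputs << 8) | SPI.transfer((uint8_t)(outputs & 0xFF));
  digitalWrite(INTERFACE_SS_PIN, HIGH);  // rising edge updates the outputs
  SPI.endTransaction();
  return inputs;
}
```

If the 74HC595 chain were the longer one instead, the padding would move to the receiving side: the extra bytes read after the 74HC165 chain has emptied would simply be discarded.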

Conclusion

Both SPI and shift registers are useful ways to reduce the number of microcontroller GPIO pins needed for a design. Combining them (and adding a bit of glue logic) is a great way to do so without sacrificing much speed. I’ve used this technique in some of my projects and it’s worked very well. If you use this in your projects, I’d enjoy hearing about what went well or not so well. I hope that this post has been informative and enjoyable. If I’ve made any errors or left out important details, please do let me know.