Chapter B5. Microcode

B5.1. What is Microcode?

Microcode is to a processor what the cylinder in a music box is to the music box: it encodes simple instructions that activate parts of the machine in sequence to produce some desired effect.

Microcode in computers is usually encoded as a huge table and stored in some sort of memory. Steps are normally executed in sequence, but the microcode engine can also jump from one step to another. A processor's microcode will usually include steps to fetch and execute each of the processor's instructions. Each step controls the various units of the processor to do what we want. For example, microcode to fetch an instruction from memory would look like this:

Read the Program Counter.
Write the value to the Address Bus.
Assert memory read signals.
Wait for the data bus to stabilise.
Read the data bus.
Write the value to the Instruction Register.
Jump to the micro-program indicated by the Instruction Register.

It can be this simple. Prepare to be shocked: though some complex microcode formats are Turing Complete, this is not a requirement at all! The CFT's microcode format is not, and doesn't even allow explicit jumps!

B5.2. Why Microcode?

A processor's control unit is a complicated state machine, but it's very feasible to implement most of them with logic rather than microcode. Many processors did it this way. The Kenbak-1 implements all its states using discrete logic chips.

This works great for production machines, and is very cheap once you've debugged your design and you're convinced it works perfectly.

But having the states stored in some sort of memory is a great boon when a bug is found, an adjustment is necessary, or you get a great idea for a new extension to the processor. If you've been reading, you'll have noticed I'm very prone to mistakes. I'm also prone to changing my mind, and the CFT's instruction set has gone through several major revisions since its inception.

With microcode, all I need to do any of this is to pull the microcode ROMs and reprogram them. Without microcode, it's down to Kynar wire, loupes, and having circuit boards refabricated every week. Expensive and slow!

B5.3. Microcode Theory

Microcode is usually seen as a slightly magic thing. In many cases, rightly so: it takes a collection of state machines, multiplexers and registers and uplifts them into a magical machine that can simulate itself. Is it any wonder it can carry philosophical and metaphysical undertones?

Let's explain the theory behind it, and hopefully that won't dispel the magic in it. If you need some background on microcode, I recommend this very useful write-up on understanding and writing microcode by Dieter Müller (Mueller). It also covers how to abuse an assembler to produce microcode.

Microcode formats are usually split into three major groups:

Horizontal:: control signals in the processor get their own dedicated bits in the microcode store. The microcode store ends up being very wide but every unit can be controlled in parallel. Downside: bugs in the microcode can do some nasty stuff because there are no safety interlocks. Two units could attempt to drive the same bus simultaneously.
Vertical:: control signals are encoded as binary numbers. Microcode stores are considerably narrower, but the control unit must decode the micro-instruction into individual control signals for every unit. Each micro-instruction will usually have multiple such encoded fields. Additional circuitry is needed for everything, and you get less parallelism, but every field is naturally interlocked.
Hybrid:: each micro-instruction contains both horizontal and vertical fields, depending on the machine's needs.

To better understand this, look at the truth table of a horizontal microcode format controlling the output of units in a hypothetical processor:

A	B	C	What happens
1	1	1	Nothing. All units idle.
0	1	1	Unit A drives the bus.
1	0	1	Unit B drives the bus.
1	1	0	Unit C drives the bus.
0	0	0	All units drive the bus! Smoke generator!

The first four cases are valid cases of microcode use. In the fifth one, however, all units end up driving the bus and we get bus contention at best and a fried processor at worst. We can build microcode validation tools that stop this from happening, but there are bad things that could happen to a micro-instruction signal on the way from the Microcode Store to the units: bad bits in the ROM, noise, metastability, etc. Better to build some sort of interlock that makes sure only one unit can drive the bus. Also, note how wasteful this format can be if we only need one unit to be active at any time. We need three bits for three units, seven bits for seven units, 15 bits for 15 units!
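A microcode validation tool of the kind mentioned above can be very small. The following Python sketch is illustrative only and is not part of the CFT toolchain: it assumes the three active-low enable bits A, B and C from the table, and the function names are made up for this example. It lists which units drive the bus for each of the eight values the field can take, and flags every value where more than one does.

```python
def drivers(a: int, b: int, c: int) -> list[str]:
    """Units driving the bus, given active-low enable bits A, B and C."""
    return [name for name, bit in (("A", a), ("B", b), ("C", c)) if bit == 0]


def has_contention(a: int, b: int, c: int) -> bool:
    """True when more than one unit would drive the bus at once."""
    return len(drivers(a, b, c)) > 1


if __name__ == "__main__":
    # Walk through all eight values the three-bit field can take.
    for word in range(8):
        a, b, c = (word >> 2) & 1, (word >> 1) & 1, word & 1
        verdict = "CONTENTION" if has_contention(a, b, c) else "ok"
        print(a, b, c, drivers(a, b, c), verdict)
```

Note that any value with two or more zero bits is unsafe, not just 000, so half of the eight possible values would cause contention.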

One solution to this problem is to go for vertical microcode. In vertical microcode, we'd do this:

BUS-DRIVER	What happens
00	Nothing. All units idle.
01	Unit A drives the bus.
10	Unit B drives the bus.
11	Unit C drives the bus.

Then, we use a decoder like the 74HC139 to take those two bits and produce four control signals (we'll ignore the first one, where we want units to be idle). This uses the same extra circuitry as an interlock for a horizontal microcode field, but one fewer bit. Vertical microcode gives us a free interlock, plus we need two bits for three units, three bits for seven units, four bits for 15 units!

The savings are pretty high. In fact, n units need only ⌈log₂(n + 1)⌉ bits, counting one extra code for ‘all units idle’.

The cost is in parallelism. You can no longer do more than one thing at once.
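The bit counts quoted above are easy to check. This is a minimal Python sketch, assuming (as the BUS-DRIVER table does) that code 0 is reserved for ‘all units idle’; the function names are hypothetical. It also models a 1-of-N decoder with active-low outputs, in the style of one half of a 74HC139, to show where the free interlock comes from: exactly one output line is ever low.

```python
def horizontal_bits(units: int) -> int:
    """Horizontal encoding: one dedicated bit per unit."""
    return units


def vertical_bits(units: int) -> int:
    """Vertical encoding: codes 0..units, where code 0 means 'all idle'."""
    return units.bit_length()


def decode(field: int, width: int) -> list[int]:
    """1-of-N decoder with active-low outputs: only the selected line is 0."""
    return [0 if line == field else 1 for line in range(1 << width)]


if __name__ == "__main__":
    for units in (3, 7, 15):
        print(units, "units:", horizontal_bits(units), "bits horizontal,",
              vertical_bits(units), "bits vertical")
    print(decode(0b10, 2))  # BUS-DRIVER = 10 pulls only output line 2 low
```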

To get the best of both worlds, CFT Microcode is hybrid. A number of fields in the format are vertically encoded to save bits, but also to avoid bus contention issues. The rest are horizontal so they can happen in parallel.

B5.4. The CFT's Microcode

On the CFT, we treat microcode as an immense multivariate truth table: a function with 15 parameters that outputs 24 values. (On recent versions of the processor with a Microcode Banking Extension (µCB), that's 19 parameters.) Collectively, I call the 15 (or 19) input parameters the Microcode Address Vector. The output signals are known as the Micro-Instruction Control Vector.

The microcode includes sequences (micro-programs) for:

Resetting the processor.
Responding to an interrupt request by saving registers and jumping to the address of the interrupt handler.
Fetching instructions.
Executing every instruction the processor understands, in every addressing mode applicable to this instruction.

B5.5. The Microcode Address Vector

Bits	Field
18–15	UCB
14	RSTHOLD
13	IRQS
12	FV
11	FL
10–6	IR11–15
5	SKIP
4	AINDEX
3–0	uPC

UCB:: Identifies the currently used microcode bank, if the µCB is present. Defaults to 0000, and is always 0000 if the µCB isn't present.
RSTHOLD:: active low. Indicates the processor is being reset.
IRQS:: active low. Indicates an interrupt has been seen and must be serviced.
FV:: the current value of the overflow flag.
FL:: the current value of the link flag.
IR_11–15:: the five most significant bits of the IR, including the indirection field (least significant bit) and the op-code.
SKIP:: if a conditional test was selected in the previous micro-instruction using the OPIF field, this signal will be the result of that conditional check.
AINDEX:: active low. Indicates that the IR's operand field contains a value in the range 080–0FF.
UPC:: all four bits of the Microprogram Counter (µPC).
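To make the layout concrete, here is a small Python sketch that packs named fields into a 19-bit microcode address. It assumes the fields map onto ROM address bits exactly in the order shown above, which is an assumption about the wiring rather than something stated here; `pack_address` and the `IR11_15` field name are invented for this example.

```python
# (field name, width in bits), most significant field first.
ADDRESS_FIELDS = [
    ("UCB", 4), ("RSTHOLD", 1), ("IRQS", 1), ("FV", 1), ("FL", 1),
    ("IR11_15", 5), ("SKIP", 1), ("AINDEX", 1), ("UPC", 4),
]


def pack_address(**values: int) -> int:
    """Pack named fields into a 19-bit address; omitted fields are 0."""
    unknown = set(values) - {name for name, _ in ADDRESS_FIELDS}
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    address = 0
    for name, width in ADDRESS_FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bit(s)")
        address = (address << width) | value
    return address


if __name__ == "__main__":
    # Not resetting, no interrupt, op-code 0010 with the indirection bit
    # clear, AINDEX inactive (high), third micro-instruction (uPC = 2).
    print(hex(pack_address(RSTHOLD=1, IRQS=1, IR11_15=0b00100, AINDEX=1, UPC=2)))
```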

B5.6. The Micro-Instruction Control Vector

The output of the microcode ROMs looks like this:

Bits	Field
23	END
22	WEN
21	R
20	IO
19	MEM
18	DEC
17	STPAC
16	STPDR
15	INCPC
14	CLI
13	STI
12	CLL
11	CPL
10–7	OPIF
6–4	WUNIT
3–0	RUNIT

RUNIT:: four bits. Read from unit. Selects a unit to read from. This enables the appropriate tri-state IBUS driver. The values of this field are discussed in The Read Unit Decoder.
WUNIT:: three bits. Write to unit. Selects the unit that reads (latches) the value on the IBUS. The values of this field are discussed in The Write Unit Decoder.
OPIF:: four bits. This field instructs the Skip and Branch Logic (SBL) to perform an operation, the result of which will be checked in the next processor cycle. The exact values of this field are discussed below.
CPL:: complements The L Register when low.
CLL:: clears The L Register when low.
STI:: sets Interrupt Flag (I) when low.
CLI:: clears I when low.
INCPC:: increments the Program Counter (PC) when low.
STPDR:: steps the Data Register (DR) when low.
STPAC:: steps the Accumulator (AC) when low.
DEC:: if low, STPDR and STPAC decrement the DR and Accumulator (AC) respectively. Otherwise, they increment them.
MEM:: if low, request a memory cycle.
IO:: if low, request an I/O cycle.
R:: if low, request a read cycle from memory or I/O space.
WEN:: if low, request a write cycle to memory or I/O space.
END:: if low, ends execution of this microprogram.
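As an illustration of how these fields sit in a 24-bit word, the following Python sketch splits a control vector into named fields and lists the active-low signals it asserts. It is a sketch only: the helper names and the `IDLE` constant are hypothetical, and the example word (END asserted, WUNIT = 011) is a guess at what a ‘load PC, end microprogram’ micro-instruction could look like, not a value taken from the real ROMs.

```python
# (field name, width in bits), most significant field first.
CONTROL_FIELDS = [
    ("END", 1), ("WEN", 1), ("R", 1), ("IO", 1), ("MEM", 1), ("DEC", 1),
    ("STPAC", 1), ("STPDR", 1), ("INCPC", 1), ("CLI", 1), ("STI", 1),
    ("CLL", 1), ("CPL", 1), ("OPIF", 4), ("WUNIT", 3), ("RUNIT", 4),
]

# All thirteen active-low signals high (inactive), vertical fields at zero.
IDLE = 0xFFF800


def unpack_control(word: int) -> dict[str, int]:
    """Split a 24-bit control vector into its named fields."""
    fields = {}
    shift = 24
    for name, width in CONTROL_FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


def active_signals(word: int) -> list[str]:
    """Names of the single-bit, active-low signals asserted by this word."""
    fields = unpack_control(word)
    return [name for name, width in CONTROL_FIELDS
            if width == 1 and fields[name] == 0]


if __name__ == "__main__":
    word = (IDLE & ~(1 << 23)) | (0b011 << 4)  # assumed: END low, WUNIT = 011
    print(hex(word), active_signals(word), unpack_control(word)["WUNIT"])
```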

The RUNIT vertical field is decoded like this:

RUNIT	Signal	Meaning
0000		Nothing: the IBUS is tri-stated.
0001	R1	Reserved.
0010	RAGL	Read from the Address Generation Logic (AGL).
0011	RPC	Read from the PC.
0100	RDR	Read from DR.
0101	RAC	Read from AC.
0110	R6	Reserved.
0111	R7	Reserved.
1000	N/A	Arithmetic and Logic Unit (ALU): read from the Adder sub-unit.
1001	N/A	ALU: read from the AND sub-unit.
1010	N/A	ALU: read from the OR sub-unit.
1011	N/A	ALU: read from the XOR sub-unit.
1100	N/A	ALU: read from the Roll sub-unit.
1101	N/A	ALU: read from the NOT sub-unit.
1110	N/A	ALU: read Constant 1.
1111	N/A	ALU: read Constant 2.

The WUNIT vertical field is decoded like this:

WUNIT	Signal	Meaning
000		Nothing.
001	W1	Reserved.
010	WAR	Write to the AR.
011	WPC	Write to the PC.
100	WIR	Write to the IR.
101	WDR	Write to the DR.
110	WAC	Write to the AC.
111	WALU	Write to the ALU's B Port.

And the final vertical field, OPIF:

OPIF	Meaning
0000	No operation. Always returns false.
0001	Test bit 3 of the IR.
0010	Test bit 4 of the IR.
0011	Test bit 5 of the IR.
0100	Test bit 6 of the IR.
0101	Test bit 7 of the IR.
0110	Test bit 8 of the IR.
0111	Test bit 9 of the IR.
1000	No operation. Reserved for expansion.
1001	No operation. Reserved for expansion.
1010	Test Overflow flag (V).
1011	Test The L Register.
1100	Test Zero Flag (Z).
1101	Test Negative Flag (N).
1110	Check if the least significant 3 bits of IR are non-zero.
1111	Check if the least significant 4 bits of IR are non-zero.
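The OPIF table can be read as a small function. The Python sketch below is one possible reading, under two assumptions that the table does not spell out: a ‘test’ yields true when the bit or flag is set, and the reserved codes behave like 0000 and yield false. The function and parameter names are made up for the example.

```python
def opif_result(opif: int, ir: int, *, v: bool = False, link: bool = False,
                z: bool = False, n: bool = False) -> bool:
    """Evaluate one OPIF code. Assumes a test is true when the bit is set."""
    if 0b0001 <= opif <= 0b0111:
        return bool((ir >> (opif + 2)) & 1)  # codes 0001..0111: IR bits 3..9
    if opif == 0b1110:
        return (ir & 0b0111) != 0            # low three bits of IR
    if opif == 0b1111:
        return (ir & 0b1111) != 0            # low four bits of IR
    flags = {0b1010: v, 0b1011: link, 0b1100: z, 0b1101: n}
    return bool(flags.get(opif, False))      # 0000 and reserved codes: false
```

Whatever this function returns is what the SKIP input of the Microcode Address Vector would carry during the following micro-instruction.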

B5.7. Microprograms

We (somewhat arbitrarily) define a microprogram to be a subset of microcode with a distinct vector of <RSTHOLD, IRQS, IR_11–15, AINDEX>. Normal microprogram flow consists of just µPC incrementing at every step, so a microprogram has up to 16 instructions.

Every microprogram is aligned with a 16-instruction boundary; it starts with a microcode address that is a multiple of 16, ending with the binary sequence 0000. Every microcode address that is a multiple of 16 in the ROM is a valid microprogram. (Many of them might be very short, though.)

The following truth table shows all the microprograms available:

RSTHOLD	IRQS	IR_12–15	IR₁₁	AINDEX	Microprogram
0	X	X	X	X	Reset the processor
1	0	X	X	X	Jump to interrupt handler
1	1	0000	0	X	TRAP
1	1	0000	1	1	TRAP Indirect
1	1	0000	1	0	TRAP Autoindex
1	1	0001	0	X	IOT
1	1	0001	1	1	IOT Indirect
1	1	0001	1	0	IOT Autoindex
1	1	0010	0	X	LOAD
1	1	0010	1	1	LOAD Indirect
1	1	0010	1	0	LOAD Autoindex
1	1	0011	0	X	STORE
1	1	0011	1	1	STORE Indirect
1	1	0011	1	0	STORE Autoindex
1	1	0100	0	X	IN
1	1	0100	1	1	IN Indirect
1	1	0100	1	0	IN Autoindex
1	1	0101	0	X	OUT
1	1	0101	1	1	OUT Indirect
1	1	0101	1	0	OUT Autoindex
1	1	0110	0	X	JMP
1	1	0110	1	1	JMP Indirect
1	1	0110	1	0	JMP Autoindex
1	1	0111	0	X	JSR
1	1	0111	1	1	JSR Indirect
1	1	0111	1	0	JSR Autoindex
1	1	1000	0	X	ADD
1	1	1000	1	1	ADD Indirect
1	1	1000	1	0	ADD Autoindex
1	1	1001	0	X	AND
1	1	1001	1	1	AND Indirect
1	1	1001	1	0	AND Autoindex
1	1	1010	0	X	OR
1	1	1010	1	1	OR Indirect
1	1	1010	1	0	OR Autoindex
1	1	1011	0	X	XOR
1	1	1011	1	1	XOR Indirect
1	1	1011	1	0	XOR Autoindex
1	1	1100	0	X	OP1
1	1	1100	1	X	Reserved
1	1	1101	0	X	OP2
1	1	1101	1	X	POP
1	1	1110	0	X	ISZ
1	1	1110	1	1	ISZ Indirect
1	1	1110	1	0	ISZ Autoindex
1	1	1111	0	X	LIA
1	1	1111	1	1	JMPII
1	1	1111	1	0	JMPII Autoindex
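The priority order in this table can be expressed as a short lookup. This Python sketch is only a model of the table, not of how the hardware does it (the hardware simply uses these signals as ROM address bits); the function name and argument names are invented here.

```python
MNEMONICS = ["TRAP", "IOT", "LOAD", "STORE", "IN", "OUT", "JMP", "JSR",
             "ADD", "AND", "OR", "XOR", "OP1", "OP2", "ISZ", "LIA"]

# Op-codes whose indirection bit selects something other than 'Indirect'.
SPECIAL_INDIRECT = {0b1100: "Reserved", 0b1101: "POP"}


def microprogram(rsthold: int, irqs: int, opcode: int, indirect: int,
                 aindex: int) -> str:
    """Name the microprogram selected by these address vector fields."""
    if rsthold == 0:                 # active low, highest priority
        return "Reset the processor"
    if irqs == 0:                    # active low, next highest priority
        return "Jump to interrupt handler"
    if not indirect:
        return MNEMONICS[opcode]
    if opcode in SPECIAL_INDIRECT:
        return SPECIAL_INDIRECT[opcode]
    if opcode == 0b1111:
        return "JMPII" if aindex else "JMPII Autoindex"
    # AINDEX is active low: 0 selects the autoindex variant.
    return MNEMONICS[opcode] + (" Indirect" if aindex else " Autoindex")
```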

There are some ‘interesting’ side-effects of this design due to its hierarchical operation:

Since the computer must be able to reset from any state, RSTHOLD is the highest priority signal. This implies that half of the Microcode ROM space contains copies of the reset microprogram.
When the computer isn't resetting, it has to be able to service an interrupt no matter the rest of its state. That makes IRQS the next highest priority signal, and half of the remaining space on the Microcode ROMs contains the microprogram that jumps to the interrupt handler.
The remaining microprograms that fetch and execute instructions occupy just 25% of the ROM space!

This is wasteful, but there was excessive space in the ROMs anyway. And to do it this way saves plenty of logic to prioritise the reset and interrupt states and select appropriate microprograms.
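The 50/25/25 split follows directly from using RSTHOLD and IRQS as plain address bits. The Python sketch below counts it by brute force over a 15-bit address space (no Microcode Banking Extension); the bit positions are taken from the Microcode Address Vector layout, and everything else is a name made up for the example.

```python
from collections import Counter

RSTHOLD_BIT, IRQS_BIT = 14, 13   # bit positions in the address vector
ADDRESS_BITS = 15                # without the Microcode Banking Extension


def owner(address: int) -> str:
    """Which kind of microprogram occupies this ROM address."""
    if not (address >> RSTHOLD_BIT) & 1:
        return "reset"
    if not (address >> IRQS_BIT) & 1:
        return "interrupt"
    return "instructions"


usage = Counter(owner(a) for a in range(1 << ADDRESS_BITS))

if __name__ == "__main__":
    for kind, count in usage.items():
        print(f"{kind}: {count} locations ({100 * count / (1 << ADDRESS_BITS):.0f}%)")
```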

B5.8. Microprogram Flow and its Repercussions

With so many signals in the micro-address vector, the state transitions can be very complex. However, most transitions happen in a very controlled fashion, and most can only happen at certain times. There are three groups, in terms of their timing:

Asynchronous jumps. Only RSTHOLD does this.
At the end of each micro-program.
Possibly at every microprogram step.

Here is a description of how micro-address vector fields change:

Condition	How and what
RSTHOLD	Asserted asynchronously. Can jump to the middle of the reset micro-program.
IRQS	End of instruction. Force execution of the interrupt handler.
FV	During micro-program execution. Used by only a few instructions.
FL	During micro-program execution. Used by only a few instructions.
IR_11–15	Changes at the end of the fetch part of every micro-program, on assertion of WIR.
SKIP	During micro-program execution, when OPIF ≠ 0000.
AINDEX	Changes at the end of the fetch part of every micro-program, on assertion of WIR.
UPC	Increments or resets to zero at every micro-program step.

B5.8.1. Asynchronous Jump: the Reset Microprogram

Asserting RESET sets RSTHOLD, but RSTHOLD holds its value for some time, to allow for units to reset gracefully, the clock to stabilise (if the computer is starting from cold) and any metastability to go away. Hopefully.

This means RSTHOLD can be asserted at any time. Luckily, the UPC is also reset to 0000 when RSTHOLD is true. Also luckily, the reset sequence is mostly autonomic. The boot address is put on the Internal Bus (IBUS) automatically while RSTHOLD is active. All we need to do is load it into the PC and wait out the reset sequence.

So, the reset microprogram consists of copies of the same micro-instruction, over and over again: ‘load PC, end microprogram’. And since half of all microprograms are reset microprograms, this implies that half of all locations in the microcode ROMs have just this micro-instruction.

B5.8.2. Jump to Interrupt Handler

The IRQS signal is registered, and can only change at the end of a microprogram. The interrupt microprogram lacks a fetch part. It performs just two tasks:

Write the current value of the PC to address 0002 so we can return to it later.
Set the PC to FFF8 and mask interrupts, clearing the interrupt seen flag. End the microprogram.

Clearing the interrupt seen flag and ending the microprogram deasserts IRQS (it returns to 1), and with the PC at FFF8, an instruction-handling microprogram will fetch an instruction and jump to its microprogram, starting the Interrupt Service Routine.

B5.8.3. Normal Microprogram End and the Fetch Cycle

When a microprogram signals END, the UPC resets to 0000. Weirdly enough, the same microprogram starts executing again.

How does that not cause an endless loop? Simple! (or not)

The first two micro-instructions of every instruction perform a Fetch operation:

Write PC to Address Register (AR).
Assert MEM and R to start a memory read transaction. Load Instruction Register (IR). Increment PC.

When the IR is loaded with a new instruction, the microcode address implicitly and instantly changes to that of the micro-program handling that instruction. The UPC is still 0001 (the second step), and about to increment to 0010. All instruction micro-programs start with the same Fetch sequence, and all instruction micro-programs start the Execution part at microcode step 3. So the previous instruction's Fetch operation will be responsible for fetching the next one.

Again, that would be less wasteful with a simplistic state machine directly representing these five states:

Figure 5. CFT Processor Major States: Reset (R), Fetch (F), Execute (E), Stop (S), and Interrupt (I).

But that would need extra logic, and the whole point of the ROMs is to reduce chip count and avoid as many state machines as possible.
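The hand-over between microprograms is easier to see in a trace. Below is a toy Python model of it, under loud assumptions: micro-instructions are just descriptive strings, the execute steps for LOAD and ADD are invented placeholders, END is modelled as ‘the last execute step resets the µPC’, and the IR is assumed to hold a stale TRAP at the start. It is not a model of the real CFT sequencer.

```python
# Hypothetical execute steps per instruction; the last one also signals END.
EXECUTE_STEPS = {
    "LOAD": ["LOAD step 1", "LOAD step 2"],
    "ADD": ["ADD step 1"],
}


def trace(program: list[str], cycles: int, stale_ir: str = "TRAP"):
    """Return (selected microprogram, uPC, action) for each clock cycle."""
    pc, ar, ir, upc = 0, 0, stale_ir, 0
    log = []
    for _ in range(cycles):
        if upc == 0:      # fetch, step 1: identical in every microprogram
            log.append((ir, upc, "AR <- PC"))
            ar, upc = pc, 1
        elif upc == 1:    # fetch, step 2: loading IR switches microprogram
            log.append((ir, upc, "IR <- mem[AR]; PC++"))
            ir, pc, upc = program[ar], pc + 1, 2
        else:             # execute part of the newly selected microprogram
            steps = EXECUTE_STEPS[ir]
            log.append((ir, upc, steps[upc - 2]))
            upc = 0 if upc - 2 == len(steps) - 1 else upc + 1
    return log


if __name__ == "__main__":
    for row in trace(["LOAD", "ADD"], cycles=7):
        print(row)
```

In the trace, the two fetch steps that bring in ADD are logged under the LOAD microprogram: the µPC has reset to 0000, but the IR still selects LOAD until the second fetch step overwrites it.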

B5.8.4. Auto-Indexing

The AINDEX input changes when IR is written to, at the end of the Fetch part of every micro-program. It persists for the duration of the whole micro-program so that the auto-index variant of the micro-program can run.

The microcode sequencer contains a bug-like limitation to keep circuitry simple: if an indirect-mode instruction is fetched from an address in the range 0080–00FF and the instruction at the previous address was also an indirect-mode instruction, autoindex mode will be set regardless of the newly fetched instruction’s operand.

This is an acceptable limitation: since the autoindex locations are a limited, highly useful resource, there is no good reason to be executing code in those addresses.

Is this still the case?

The old Auto-Index Logic used to read the AR. This one decodes the IBUS directly when WIR is asserted. I think this flaw is long gone, but it also looks like the Verilog description of the AIL and the physically implemented logic are very different.

To Do: Fix the Verilog description of the AIL!

Title says all. Fix the Verilog description of the AIL so it matches that of the implemented processor and retest. The C and JS emulators' implementation of the AIL might also need to be checked.

B5.8.5. Skips

The SKIP signal changes at the end of every processor clock. It's controlled by the OPIF field of the micro-instruction control vector. As a result, a micro-instruction uses OPIF to check something and can then act on it in the next micro-instruction, one processor clock later.

Micro-instructions that don't act on the SKIP signal exist in two identical copies: one with SKIP=1, and one with SKIP=0. So far so good, this is how we implement don't care values in the CFT's ROM-based microcode.

To act on a tested conditional in SKIP, the micro-instruction must exist in two non-identical copies, so the micro-code equivalent of if has a mandatory else section. If need be, that else section would have to be a NOP micro-instruction. There hasn't been a need for this one so far, thankfully. It would really grate on my OCD.

Since SKIP is not guaranteed to remain in the same state one processor cycle after that, conditional actions can only consist of one micro-instruction. This is limiting, but it works just fine for the CFT.

A side-effect of this is that entire micro-programs must exist in two copies, to implement conditionals. On every conditional check, control might jump from one copy to the next. The main users of OPIF, the OP1 and OP2 instructions, do this a lot.
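The ‘two copies’ idea can be sketched in a few lines of Python. This is an illustration only: the ROM is reduced to a dictionary keyed by (SKIP, µPC), micro-instructions are placeholder strings, and SKIP = 1 is assumed to mean ‘the test succeeded’. The helper names are hypothetical and do not come from the CFT's assembler.

```python
def emit(rom: dict, upc: int, word: str) -> None:
    """Micro-instruction that ignores SKIP: identical word in both copies."""
    for skip in (0, 1):
        rom[(skip, upc)] = word


def emit_if(rom: dict, upc: int, if_set: str, if_clear: str) -> None:
    """Micro-instruction that acts on SKIP: the 'else' copy is mandatory."""
    rom[(1, upc)] = if_set
    rom[(0, upc)] = if_clear


rom = {}
emit(rom, 2, "select a test with OPIF")       # result arrives one clock later
emit_if(rom, 3, "conditional action", "NOP")  # the one-step if/else
emit(rom, 4, "END")

if __name__ == "__main__":
    for (skip, upc), word in sorted(rom.items(), key=lambda kv: (kv[0][1], kv[0][0])):
        print(f"uPC={upc} SKIP={skip}: {word}")
```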

B5.8.6. Negative and Overflow skips

The microcode ROM had a few spare address pins, and I connected these to the negative and overflow flag registers to allow for some unforeseen skip instructions. For instance, this eventually allowed the IFV and IFN minor operations to be microcoded to make arithmetic a little easier.

These are handled and behave in a way almost identical to that of the SKIP signal. The unfortunate side effect is that microprograms that handle both flag conditionals need to be coded four times, one for each combination of the two flags. Bitmap instructions like OP1 need eight copies in total.

B5.8.7. The UPC

Last, but not least. Well, least-significant for sure. But not quite least. By far the most common microcode ‘jump’ is an incrementation of the UPC. This should be very obvious by now.

B5.8.8. Paranoia

Glitches happen. The CFT is a craptastic processor designed by a high-functioning cretin with zero experience in processor design, and a superpower for making even a single-instruction program buggy. A glitchy jump to a weird microcode address is possible. We can't avoid this, but at least we can avoid the processor being in an indeterminate state, so every unused location in the ROMs asserts just the END signal, which will (hopefully) start the microprogram again.

To make this work, the beginning of those glitchy microprograms must have some means of moving away from them. To do this, every unused microprogram also starts with a fetch cycle. Hopefully, the fetch cycle will alter the IR and move us to a proper microprogram. Even if not, though, the PC will be stepped and this will increase our chances dramatically. The executed program is likely to be hosed at that point, but at least a test harness will detect this and the microcode bug will be fixed.
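A ROM image builder could apply this padding policy automatically. The Python sketch below shows the idea with placeholder strings instead of real control vectors; `build_rom` and its arguments are hypothetical names, and it assumes that defined microprograms already begin with their own fetch steps.

```python
STEPS = 16  # micro-instructions per microprogram (four-bit uPC)
FETCH = ["AR <- PC", "IR <- mem[AR]; PC++"]  # placeholder micro-instructions
END_ONLY = "END"


def build_rom(defined: dict[int, list[str]], microprograms: int) -> list[list[str]]:
    """Lay out every microprogram slot, padding unused locations with END."""
    rom = []
    for index in range(microprograms):
        # An unused microprogram still gets a fetch cycle so it can escape.
        steps = list(defined.get(index, FETCH))
        if len(steps) > STEPS:
            raise ValueError(f"microprogram {index} is longer than {STEPS} steps")
        steps += [END_ONLY] * (STEPS - len(steps))
        rom.append(steps)
    return rom
```

For example, `build_rom({1: FETCH + ["execute", "execute; END"]}, 4)` yields four 16-step microprograms, three of which consist of a fetch cycle followed by fourteen END-only locations.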

B5.9. The Microcode Assembler

Microcode is usually assembled using helper software of various types. The CFT microcode was prepared with mcasm, a microcode assembler I wrote for this express task, but probably of use in other similar tasks.

The assembler is written in Python. It provides macro and preprocessing facilities using the GNU C Preprocessor, which is provided with the free GNU C Compiler. To make syntax highlighting easier, and also to play well with the C Preprocessor, the assembler's syntax borrows numerous elements from C.

The macro facilities allow common tasks like memory cycles to be repeated without the possibility of minor errors that could introduce processor bugs.

The assembler is capable of providing diagnostic output about the quality of the code, and generates binary images of as many ROMs as are required to store the wide micro-instructions.

B5.10. Testing

Testing of the microcode happens in triplicate:

I use a Verilog model of the processor (and some basic peripherals) to run a large number of tests, in hardware testing fashion: test all combinations of everything, as far as that's possible. Errors are fixed in the Verilog models and microcode, and also fixed on the processor.
The same tests are run on the emulated computer to fix potential issues with the emulator and keep the Verilog model and the C emulator in sync. This is much faster than the Verilog version.
Finally, the same tests are pushed to the working processor using the Debugging Front Panel (DFP) system for in-system testing to keep the hardware in sync with its virtual counterparts.