A register file is an array of processor registers in a central processing unit (CPU). Modern integrated circuit-based register files are usually implemented by way of fast static RAMs with multiple ports. Such RAMs are distinguished by having dedicated read and write ports, whereas ordinary multiported SRAMs will usually read and write through the same ports.
The instruction set architecture of a CPU will almost always define a set of registers which are used to stage data between memory and the functional units on the chip. In simpler CPUs, these architectural registers correspond one-for-one to the entries in a physical register file within the CPU. More complicated CPUs use register renaming, so that the mapping of which physical entry stores a particular architectural register changes dynamically during execution. The register file is part of the architecture and visible to the programmer, as opposed to the concept of transparent caches.
Register bank switching
Register files may be grouped together as register banks.
Some processors have several register banks.
x86 processors use context switches and fast interrupts to switch between the instruction, decoder, GPRs and register files (if there is more than one) before the instruction is issued, but this exists only on processors that support superscalar execution. However, a context switch is a totally different mechanism from ARM's register banking within the registers.
The usual layout convention is that a simple array is read out vertically. That is, a single word line, which runs horizontally, causes a row of bit cells to put their data on bit lines, which run vertically. Sense amps, which convert low-swing read bitlines into full-swing logic levels, are usually at the bottom (by convention). Larger register files are then sometimes constructed by tiling mirrored and rotated simple arrays.
Register files have one word line per entry per port, one bit line per bit of width per read port, and two bit lines per bit of width per write port. Each bit cell also has a Vdd and Vss. Therefore, the wire pitch area increases as the square of the number of ports, and the transistor area increases linearly. At some point, it may be smaller and/or faster to have multiple redundant register files, with smaller numbers of read ports, rather than a single register file with all the read ports. The MIPS R8000's integer unit, for example, had a 9 read 4 write port 32 entry 64-bit register file implemented in a 0.7 µm process, which could be seen when looking at the chip from arm's length.
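The wire counts above can be tallied with a short calculation. This is a minimal sketch: the function names are hypothetical, and the area model (horizontal tracks times vertical tracks per cell, ignoring Vdd/Vss and transistor area) is a simplifying assumption used only to illustrate the quadratic growth.

```python
def wire_counts(entries, width, read_ports, write_ports):
    """Count word lines and bit lines for a multiported register file."""
    # One word line per entry per port.
    word_lines = entries * (read_ports + write_ports)
    # One bit line per bit per read port, two per bit per write port.
    bit_lines = width * (read_ports + 2 * write_ports)
    return word_lines, bit_lines


def relative_cell_wire_area(read_ports, write_ports):
    """Unitless wire-limited cell area: horizontal tracks x vertical tracks.

    Assumption: Vdd/Vss wires and transistor area are ignored.
    """
    horizontal_tracks = read_ports + write_ports      # word lines crossing a cell
    vertical_tracks = read_ports + 2 * write_ports    # bit lines crossing a cell
    return horizontal_tracks * vertical_tracks


# Port and size figures quoted above for the MIPS R8000 integer register file.
print(wire_counts(entries=32, width=64, read_ports=9, write_ports=4))  # (416, 1088)

# Doubling every port count quadruples the wire-limited cell area.
print(relative_cell_wire_area(2, 1), relative_cell_wire_area(4, 2))    # 12 48
```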
In principle anything that could be done with a 64-bit-wide register file with many read and write ports could be done with a single 8-bit-wide register file with a single read port and a single write port. However, the bit-level parallelism of wide register files with many ports allows them to run much faster—they can do things in a single cycle that would take many cycles with fewer ports or a narrower bit width or both.
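As a rough illustration of that trade-off, the sketch below estimates how many cycles a narrow file with one read port and one write port would need to match a single cycle of a wide multiported file. The function name is hypothetical, and it assumes that reads and writes overlap perfectly and that each access moves one narrow chunk per cycle.

```python
import math


def cycles_needed(width_bits, reads, writes, narrow_bits=8):
    """Cycles for a 1-read/1-write file of narrow_bits to do the same work.

    Assumption: the read port and the write port are used in parallel,
    so the busier of the two determines the total.
    """
    chunks = math.ceil(width_bits / narrow_bits)  # narrow accesses per wide word
    read_cycles = reads * chunks
    write_cycles = writes * chunks
    return max(read_cycles, write_cycles)


# Nine 64-bit reads and four 64-bit writes, one cycle on a 9R/4W 64-bit file.
print(cycles_needed(64, reads=9, writes=4))  # 72
```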
The width in bits of the register file is usually the number of bits in the processor word size. Occasionally it is slightly wider in order to attach "extra" bits to each register, such as the poison bit. If the width of the data word is different than the width of an address—or in some cases, such as the 68000, even when they are the same width—the address registers are in a separate register file than the data registers.
- The decoder is often broken into predecoder and decoder proper.
- The decoder is a series of AND gates that drive word lines.
- There is one decoder per read or write port. If the array has four read and two write ports, for example, it has six word lines per bit cell in the array, and six AND gates per row in the decoder. Note that the decoder has to be pitch matched to the array, which forces those AND gates to be wide and short.
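The decoder behaviour described in the list above can be modelled in software as one one-hot decode per port. This is a behavioural sketch only: the function names are hypothetical, and it does not model predecoding or pitch matching.

```python
def decode(address, entries):
    """One-hot decode: the AND gate of row i fires only when every address
    bit (true or complemented, depending on i) matches."""
    if not 0 <= address < entries:
        raise ValueError("address out of range")
    bits = max(1, (entries - 1).bit_length())
    word_lines = []
    for row in range(entries):
        match = all(((address >> b) & 1) == ((row >> b) & 1) for b in range(bits))
        word_lines.append(int(match))
    return word_lines


def decode_all_ports(port_addresses, entries):
    """One independent decoder per read or write port."""
    return [decode(addr, entries) for addr in port_addresses]


# Four read ports and two write ports: six decoders, so six word lines per row.
lines = decode_all_ports([0, 1, 2, 3, 4, 5], entries=8)
print(len(lines))    # 6
print(decode(5, 8))  # [0, 0, 0, 0, 0, 1, 0, 0]
```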
The basic scheme for a bit cell:
- State is stored in a pair of inverters.
- Data is read out by an NMOS transistor to a bit line.
- Data is written by shorting one side or the other to ground through a two-NMOS stack.
- So: read ports take one transistor per bit cell, write ports take four.
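The per-port transistor costs in the list above give a simple linear estimate of cell size. The sketch below assumes each storage inverter is a two-transistor static CMOS inverter, which the text does not state, and the function name is hypothetical.

```python
STORAGE_TRANSISTORS = 4      # assumption: two CMOS inverters, two transistors each
READ_PORT_TRANSISTORS = 1    # one NMOS read transistor per read port
WRITE_PORT_TRANSISTORS = 4   # two two-NMOS stacks per write port


def transistors_per_cell(read_ports, write_ports):
    """Transistor count of one bit cell; grows linearly with the port count."""
    return (STORAGE_TRANSISTORS
            + READ_PORT_TRANSISTORS * read_ports
            + WRITE_PORT_TRANSISTORS * write_ports)


# A 9-read, 4-write cell, and a 32-entry by 64-bit array built from it.
per_cell = transistors_per_cell(9, 4)
print(per_cell)            # 29
print(per_cell * 32 * 64)  # 59392
```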
Many optimizations are possible:
- Sharing lines between cells, for example, Vdd and Vss.
- Read bit lines are often precharged to something between Vdd and Vss.
- Read bit lines often swing only a fraction of the way to Vdd or Vss. A sense amplifier converts this small-swing signal into a full logic level. Small swing signals are faster because the bit line has little drive but a great deal of parasitic capacitance.
- Write bit lines may be braided, so that they couple equally to the nearby read bitlines. Because write bitlines are full swing, they can cause significant disturbances on read bitlines.
- If Vdd is a horizontal line, it can be switched off, by yet another decoder, if any of the write ports are writing that line during that cycle. This optimization increases the speed of the write.
- Techniques that reduce the energy used by register files are useful in low-power electronics.
Most register files make no special provision to prevent multiple write ports from writing the same entry simultaneously. Instead, the instruction scheduling hardware ensures that only one instruction in any particular cycle writes a particular entry. If multiple instructions targeting the same register are issued, all but one have their write enables turned off.
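The write-enable gating can be sketched as follows. The text does not say which writer survives; this example assumes the last instruction in program order wins, and the function name is hypothetical.

```python
def resolve_write_enables(dest_regs):
    """Return one write enable per issue slot.

    dest_regs lists destination register numbers in program order, with
    None for slots that write nothing. Assumption: when several slots
    target the same register, only the youngest keeps its write enable.
    """
    last_writer = {}
    for slot, reg in enumerate(dest_regs):
        if reg is not None:
            last_writer[reg] = slot
    return [reg is not None and last_writer[reg] == slot
            for slot, reg in enumerate(dest_regs)]


# Slots 0 and 2 both target r3, so slot 0 has its write enable turned off.
print(resolve_write_enables([3, 5, 3, None]))  # [False, True, True, False]
```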
The crossed inverters take some finite time to settle after a write operation, during which a read operation will either take longer or return garbage. It is common to have bypass multiplexers that bypass written data to the read ports when a simultaneous read and write to the same entry is commanded. These bypass multiplexers are often part of a larger bypass network that forwards results which have not yet been committed between functional units.
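A behavioural sketch of the bypass multiplexers follows. The class and method names are hypothetical; the model treats one call as one clock cycle and ignores timing and the wider forwarding network.

```python
class BypassedRegisterFile:
    """Register file model whose reads see same-cycle writes via a bypass."""

    def __init__(self, entries):
        self.cells = [0] * entries

    def cycle(self, read_addrs, writes):
        """Perform one cycle.

        read_addrs: entry numbers to read.
        writes: dict mapping entry number to the value written this cycle.
        """
        results = []
        for addr in read_addrs:
            if addr in writes:
                # Bypass mux selects the incoming write data, not the cell.
                results.append(writes[addr])
            else:
                results.append(self.cells[addr])
        # The cells are updated after the reads of this cycle are served.
        for addr, value in writes.items():
            self.cells[addr] = value
        return results


rf = BypassedRegisterFile(entries=4)
print(rf.cycle(read_addrs=[1], writes={1: 7}))   # [7]  (bypassed)
print(rf.cycle(read_addrs=[1, 2], writes={}))    # [7, 0]
```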
The register file is usually pitch-matched to the datapath that it serves. Pitch matching avoids having the many busses passing over the datapath turn corners, which would use a lot of area. But since every unit must have the same bit pitch, every unit in the datapath ends up with the bit pitch forced by the widest unit, which can waste area in the other units. Register files, because they have two wires per bit per write port, and because all the bit lines must contact the silicon at every bit cell, can often set the pitch of a datapath.
Area can sometimes be saved, on machines with multiple units in a datapath, by having two datapaths side-by-side, each of which has smaller bit pitch than a single datapath would have. This case usually forces multiple copies of a register file, one for each datapath.
The Alpha 21264 (EV6), for instance, was the first microarchitecture to implement a "Shadow Register File Architecture". It had two copies of the integer register file and two copies of the floating-point register file located in its front end (a future file and a scaled file, each with 2 read and 2 write ports), and took an extra cycle to propagate data between the two during a context switch. The issue logic tried to reduce the number of operations forwarding data between the two, which greatly improved integer performance and helped reduce the impact of the limited number of GPRs on superscalar and speculative execution. The design was later adopted by SPARC, MIPS and some later x86 implementations.
MIPS uses multiple register files as well. The R8000 floating-point unit had two copies of the floating-point register file, each with four write and four read ports, and wrote both copies at the same time with a context switch. However, it did not support integer operations, and there was still only one integer register file. The shadow register file was later abandoned in newer designs in favor of the embedded market.
SPARC also uses a "Shadow Register File Architecture" for its high-end line. It had up to 4 copies of the integer register file (future, retired, scaled and scratch, each with 7 read and 4 write ports) and 2 copies of the floating-point register file. Unlike in Alpha and x86, however, they are located in the back end as a retire unit, right after the out-of-order unit and the renaming register files; they do not load instructions during the instruction fetch and decode stages, and no context switch is needed in this design.
IBM uses the same mechanism as many major microprocessors, deeply merging the register file with the decoder, but its register files work independently on the decoder side and do not involve a context switch, which is different from Alpha and x86. Most of its register files serve not just a dedicated decoder but a whole thread. For example, POWER8 has up to 8 instruction decoders but up to 32 register files of 32 general-purpose registers each (4 read and 4 write ports) to facilitate simultaneous multithreading; because there is no context switch, an instruction cannot be used across any other register file.
In the x86 processor line, a typical pre-486 CPU did not have an individual register file, as all general-purpose registers worked directly with its decoder, and the x87 push stack was located within the floating-point unit itself. Starting with the Pentium, a typical Pentium-compatible x86 processor integrates one copy of a single-port architectural register file containing 8 architectural registers, 8 control registers, 8 debug registers, 8 condition code registers, 8 unnamed base registers, one instruction pointer, one flag register and 6 segment registers in one file.
There is one copy of the 8-entry x87 FP push-down stack by default; MMX registers were virtually simulated on the x87 stack, requiring x86 registers to supply MMX instructions and aliasing onto the existing stack. On P6, instructions can independently be stored and executed in parallel in early pipeline stages before being decoded into micro-operations and renamed for out-of-order execution. Beginning with P6, the register files do not require an additional cycle to propagate data. Register files such as the architectural and floating-point files are located between the code buffer and the decoders, in what is called the "retire buffer", together with the reorder buffer and the out-of-order engine, and are connected by a 16-byte ring bus. There is still one x86 register file and one x87 stack, and both serve as retirement storage. The x86 register file became dual-ported to increase bandwidth for result storage. Registers such as the debug, condition code, control, unnamed and flag registers were stripped from the main register file and placed into individual files between the micro-op ROM and the instruction sequencer. Only inaccessible registers such as the segment registers are now separated from the general-purpose file (except the instruction pointer); they are located between the scheduler and the instruction allocator, in order to facilitate register renaming and out-of-order execution. The x87 stack was later merged with the floating-point register file after the 128-bit XMM registers debuted in the Pentium III, but the XMM register file is still located separately from the x86 integer register files.
Later P6 implementations (Pentium M, Yonah) introduced a "Shadow Register File Architecture" that expanded to 2 copies of the dual-ported integer architectural register file, with a context switch between the future and retired file and the scaled file, using the same trick that was used between integer and floating point. This was intended to solve the register bottleneck that existed in the x86 architecture after micro-op fusion was introduced, but each file, serving as a speculative file, still has 8 entries of 32-bit architectural registers for a total capacity of 32 bytes (the segment registers and instruction pointer remain within the file, though they are inaccessible to programs). The second file serves as a scaled shadow register file; without a context switch, the scaled file cannot store some instructions independently. Some SSE2/SSE3/SSSE3 instructions require this feature for integer operations; for example, instructions such as PSHUFB, PMADDUBSW, PHSUBW, PHSUBD, PHSUBSW, PHADDW, PHADDD and PHADDSW would require loading EAX/EBX/ECX/EDX from both register files. It was uncommon, though, for an x86 processor to use another register file with the same instruction; most of the time the second file serves as a scaled retired file. The Pentium M architecture still has one dual-ported FP register file (8 entries MM/XMM) shared by three decoders, and the FP registers have no shadow register file, as its Shadow Register File Architecture did not include floating-point functions. In processors after P6, the architectural register file is external and located in the processor's back end after retirement, as opposed to the internal register file located in the inner core for register renaming and the reorder buffer. However, in Core 2 it is within a unit called the "register alias table" (RAT), located with the instruction allocator but with the same register size as retirement.
Core 2 increased the inner ring bus to 24 bytes (allowing more than 3 instructions to be decoded) and extended its register file from dual-ported (one read/one write) to quad-ported (two read/two write). The file still has 8 entries in 32-bit mode, for a total file size of 32 bytes (not including the 6 segment registers and one instruction pointer, as they cannot be accessed in the file by any code or instruction), expanded to 16 entries in x64 mode for a total size of 128 bytes per file. Compared with the Pentium M, the number of pipeline ports and decoders increased, but they are located with the allocator table instead of the code buffer. Its FP XMM register file was also extended to quad-ported (2 read/2 write); it still has 8 entries in 32-bit mode, extended to 16 entries in x64 mode, and there is still only 1 file, as its shadow register file architecture does not include floating-point/SSE functions.
In later x86 implementations, such as Nehalem and later processors, both integer and floating-point registers are incorporated into a unified octa-ported (6 read/2 write) general-purpose register file (8 + 8 in 32-bit mode and 16 + 16 in x64 mode per file), while the number of register files was extended to 2 with an enhanced "Shadow Register File Architecture" in favor of hyper-threading, with each thread using independent register files for its decoder. Sandy Bridge and later processors replaced the shadow register table and architectural registers with a much larger and more advanced physical register file before decoding to the reorder buffer. It is rumored that Sandy Bridge and later processors no longer carry an architectural register file.
The Atom line is a modern, simplified revision of P5. It includes a single copy of the register file shared between threads and the decoder. The register file is a dual-port design; 8/16-entry GPRs, 8/16-entry debug registers and 8/16-entry condition codes are integrated in the same file. However, it has 8 entries of 64-bit shadow base registers and 8 entries of 64-bit unnamed registers that, unlike in the original P5 design, are separated from the main GPRs and located after the execution unit. The file holding these registers is single-ported and not exposed to instructions, unlike the scaled shadow register file found on Core/Core 2 (shadow register files are made of architectural registers, which Bonnell lacks because it does not have a "Shadow Register File Architecture"); however, the file can be used for renaming purposes due to the lack of out-of-order execution in the Bonnell architecture. It also has one copy of the XMM floating-point register file per thread. The difference from Nehalem is that Bonnell does not have a unified register file and has no dedicated register file for hyper-threading; instead, Bonnell uses separate rename registers for each thread, despite not being out-of-order. Similar to Bonnell, Larrabee and Xeon Phi each have only one general-purpose integer register file, but Larrabee has up to 16 XMM register files (8 entries per file), and Xeon Phi has up to 128 AVX-512 register files, each containing 32 512-bit ZMM registers for vector instruction storage, which can be as big as an L2 cache.
Some other x86 lines do not have a register file in their internal design, such as Geode GX, Vortex86 and many embedded processors that are not Pentium-compatible or are reverse-engineered early 80x86 processors. Most of them therefore do not have a register file for the decoder; their GPRs are used individually. The Pentium 4, on the other hand, does not have a register file for its decoder, as its x86 GPRs did not exist within its structure, due to the introduction of a physical unified renaming register file (similar to Sandy Bridge, but slightly different in that the Pentium 4 cannot use a register before naming) intended to replace the architectural register file and skip the x86 decoding scheme. Instead it uses SSE for integer execution and storage before the ALU and after the result; SSE2/SSE3/SSSE3 use the same mechanism for integer operations.
AMD's early designs such as the K6 do not have a register file like Intel's and do not support a "Shadow Register File Architecture", as they lack the context switch and bypass inverter necessary for a register file to function appropriately. Instead they use separate GPRs that link directly to a rename register table for the out-of-order CPU, with a dedicated integer decoder and floating-point decoder. The mechanism is similar to Intel's pre-Pentium processor line. For example, the K6 processor has 4 integer rename register files (one 8-entry temporary scratch register file, one 8-entry future register file, one 8-entry fetched register file and one 8-entry unnamed register file) and 2 FP rename register files (two 8-entry x87 ST files, one for fadd and one for fmov) that link directly with its x86 EAX for integer renaming and its XMM0 register for floating-point renaming. The later Athlon included a "shadow register" in its front end, scaled up to a 40-entry unified register file for in-order integer operations before decoding; the register file contains 8 scratch registers, a 16-entry future GPR register file and a 16-entry unnamed GPR register file. Later AMD designs abandoned the shadow register design in favor of the K6 architecture with its individual, directly linked GPRs. The Phenom, for example, has 3 integer register files and 2 SSE register files located in the physical register file and directly linked with the GPRs; this was scaled down to 1 integer and 1 floating-point file on Bulldozer. Like early AMD designs, most x86 manufacturers such as Cyrix, VIA, DM&P and SiS used the same mechanism, resulting in poor integer performance without register renaming for their in-order CPUs. Companies like Cyrix and AMD had to increase cache size in the hope of reducing the bottleneck. AMD's SSE integer operations work differently from those of Core 2 and Pentium 4: AMD uses its separate renaming integer registers to load the value directly before the decode stage.
Though in theory this needs a shorter pipeline than Intel's SSE implementation, the cost of branch prediction is generally much greater and the miss rate higher than Intel's, and an SSE instruction would take at least two cycles to execute regardless of instruction width, as early AMD implementations could not execute both FP and integer operations in the SSE instruction set as Intel's implementation did.
Alpha, SPARC and MIPS allow one register file to load or fetch only one operand at a time, so they require multiple register files to achieve superscalar execution. ARM processors, on the other hand, do not integrate multiple register files to load or fetch instructions. No GPR holds a special purpose in the instruction set (the ARM ISA does not require accumulator, index, or stack/base pointer registers; registers have no accumulator role, and base/stack pointers can only be used in Thumb mode). Any GPR can propagate and store multiple instructions independently, in a code size small enough to fit in one register, and the architectural registers act as a table shared among all decoders and instructions with a simple bank switch between decoders. The major difference between ARM and other designs is that ARM allows execution on the same general-purpose registers with a quick bank switch, without requiring an additional register file for superscalar execution. Although x86 shares with ARM the mechanism by which its GPRs can store any data individually, x86 will run into data dependencies if more than 3 unrelated instructions are stored, as its GPRs per file are too few (8 in 32-bit mode and 16 in 64-bit mode, compared with ARM's 13 in 32-bit and 31 in 64-bit) for the data, and it is impossible to have superscalar execution without multiple register files to feed its decoder (x86 code is big and complex compared with ARM's). This causes most x86 front ends to become much larger and much more power-hungry than those of ARM processors in order to be competitive (e.g. Pentium M and Core 2 Duo, Bay Trail). Some third-party x86-equivalent processors even became uncompetitive with ARM because they had no dedicated register file architecture, particularly those from AMD, Cyrix and VIA, which could not deliver reasonable performance without register renaming and out-of-order execution; this left the Intel Atom as the only in-order x86 processor core in the mobile competition.
This remained the case until x86, with Nehalem, merged its integer and floating-point registers into one single file and introduced a large physical register table and an enhanced allocator table in its front end, before renaming in its out-of-order internal core.
Processors that do register renaming can arrange for each functional unit to write to a subset of the physical register file. This arrangement can eliminate the need for multiple write ports per bit cell, for large savings in area. The resulting register file, effectively a stack of register files with single write ports, then benefits from replication and subsetting the read ports. At the limit, this technique would place a stack of 1-write, 2-read regfiles at the inputs to each functional unit. Since regfiles with a small number of ports are often dominated by transistor area, it is best not to push this technique to this limit, but it is useful all the same.
The SPARC ISA defines register windows, in which the 5-bit architectural names of the registers actually point into a window on a much larger register file, with hundreds of entries. Implementing multiported register files with hundreds of entries requires a lot of area. The register window slides by 16 registers when moved, so that each architectural register name can refer to only a small number of registers in the larger array, e.g. architectural register r20 can only refer to physical registers #20, #36, #52, #68, #84, #100, #116, if there are just seven windows in the physical file.
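The r20 example can be reproduced with a one-line mapping. This sketch follows only the numbering used in the example above: the function name is hypothetical, and it ignores the global registers and any wrap-around at the end of the physical file.

```python
WINDOW_STEP = 16  # the window slides by 16 registers per move


def physical_index(arch_reg, window):
    """Physical entry addressed by a 5-bit architectural name in a given window."""
    if not 0 <= arch_reg < 32:
        raise ValueError("architectural register names are 5 bits wide")
    return arch_reg + WINDOW_STEP * window


# Entries that r20 can refer to when the physical file holds seven windows.
print([physical_index(20, w) for w in range(7)])
# [20, 36, 52, 68, 84, 100, 116]
```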
To save area, some SPARC implementations implement a 32-entry register file, in which each cell has seven "bits". Only one is read and writeable through the external ports, but the contents of the bits can be rotated. A rotation accomplishes in a single cycle a movement of the register window. Because most of the wires accomplishing the state movement are local, tremendous bandwidth is possible with little power.
This same technique is used in the R10000 register renaming mapping file, which stores a 6-bit virtual register number for each of the physical registers. In the renaming file, the renaming state is checkpointed whenever a branch is taken, so that when a branch is detected to be mispredicted, the old renaming state can be recovered in a single cycle. (See Register renaming.)
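The checkpoint-and-recover behaviour can be sketched in software as below. The class, its method names and the branch identifiers are hypothetical; a list copy stands in for the in-cell shadow bits that let hardware save and restore the whole table in one cycle, and nested mispredictions are not modelled.

```python
class RenameMap:
    """Mapping table with per-branch checkpoints (behavioural sketch)."""

    def __init__(self, size=32):
        self.table = list(range(size))  # assumption: identity mapping at reset
        self.checkpoints = {}

    def rename(self, index, new_number):
        self.table[index] = new_number & 0x3F  # entries are 6 bits wide

    def checkpoint(self, branch_id):
        # Hardware copies every entry into shadow bits inside the cell;
        # here the whole table is copied instead.
        self.checkpoints[branch_id] = list(self.table)

    def recover(self, branch_id):
        # On a misprediction the saved state replaces the current table.
        self.table = self.checkpoints.pop(branch_id)


m = RenameMap()
m.rename(3, 40)
m.checkpoint("b0")   # a branch is encountered
m.rename(3, 41)      # speculative renames past the branch
m.rename(5, 42)
m.recover("b0")      # the branch turns out to be mispredicted
print(m.table[3], m.table[5])  # 40 5
```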
- Wikibooks: Microprocessor Design/Register File#Register Bank.
- "Energy efficient asymmetrically ported register files" by Aneesh Aggarwal and M. Franklin. 2003.
- Register file design considerations in dynamically scheduled processors - Farkas, Jouppi, Chow - 1995