Reduced instruction set computing
From Wikipedia, the free encyclopedia
Reduced instruction set computing, a design approach for computer CPUs, is commonly abbreviated RISC. Common RISC microprocessors include AVR, PIC, ARM, DEC Alpha, PA-RISC, SPARC, MIPS, and Power Architecture.
Early on, instruction sets of this kind were characterized by a small number of instructions, a standard fixed length for every instruction, short execution times, and CPU implementation details that were visible to machine-level programs.
In later development, RISC and CISC borrowed from each other even as the debate continued; today's RISC instruction sets have also grown to several hundred instructions, and execution times are no longer fixed. Even so, the fundamental principle of RISC design, optimizing the processor for pipelining, has not changed.
Design philosophy before RISC
In the early computer industry, compiler technology did not yet exist. Programs were written in machine language or assembly language. To make programming easier, computer architects created ever more complex instructions that could directly express the high-level functions of high-level programming languages. The prevailing view at the time was that hardware was easier to design than a compiler, so anything complex went into the hardware.
Another factor driving this complexity was the lack of large memories. In an environment with little memory, programs with very high information density were advantageous. When every byte of memory was precious, for example when an entire system had to fit in a few kilobytes, the industry moved toward highly encoded instructions, variable-length instructions, instructions that performed multiple operations, and instructions that combined data transfer with computation. At the time, packing instructions tightly mattered far more than making them easy to decode.
Memory was not only small but also slow, since it used magnetic technology at the time. This was another reason to maintain very high information density: with densely packed instructions, accesses to the slow resource could be made less frequent.
There were two reasons CPUs had only a few registers:
- Internal CPU registers were far more expensive than external memory. With the integrated-circuit technology of the day, a large register set was simply a wasteful use of chip or board area.
- A large number of registers would have required a large number of instruction bits (consuming precious RAM) to serve as register specifiers.
For these reasons, CPU designers tried to make each instruction do as much work as possible. This led to instructions that did the whole job in one step: read two numbers, add them, and store the result directly in memory. Another variant would read two numbers from memory but store the result in a register. Yet another would read one number from memory and one from a register and write the result back to memory, and so on. This design philosophy eventually became known as the complex instruction set computer (CISC).
The goal at the time was to provide every addressing mode for every instruction, a property called "orthogonality". This added complexity to the CPU, but in theory each possible instruction could be tuned individually, allowing programmers to write faster code than they could with simpler instructions.
Designs of this kind can be characterized by the two ends of a spectrum, with the 6502 at one end and the VAX at the other. The $25, 1 MHz 6502 had only a single general-purpose register, but its extremely lean single-cycle memory interface let its byte-wide operations perform almost as well as much higher-clocked designs, such as a 4 MHz Zilog Z80, when using the same slow memory chips (roughly 300 ns). The VAX was a minicomputer whose initial implementation required 3 racks of equipment for a single CPU, and was notable for the amazing variety of memory access styles it supported, and the fact that every one of them was available for every instruction.
RISC design philosophy
In the late 1970s, researchers at IBM (and similar projects elsewhere) demonstrated that the majority of these "orthogonal" addressing modes were ignored by most programmers. This was a side effect of the growing use of compilers in place of assembly language. The compilers in use at the time only had a limited ability to take advantage of the features provided by CISC CPUs; this was largely a result of the difficulty of writing a compiler. The market was clearly moving to even wider use of compilers, diluting the usefulness of these orthogonal modes even more.
Another discovery was that since these operations were rarely used, they in fact tended to be slower than a number of smaller operations doing the same thing. This seeming paradox was a side effect of the time spent designing the CPUs: designers simply did not have time to tune every possible instruction, and instead tuned only the most used ones. One famous example of this was the VAX's INDEX instruction, which ran slower than a loop implementing the same code.
At about the same time CPUs started to run even faster than the memory they talked to. Even in the late 1970s it was apparent that this disparity was going to continue to grow for at least the next decade, by which time the CPU would be tens to hundreds of times faster than the memory. It became apparent that more registers (and later caches) would be needed to support these higher operating frequencies. These additional registers and cache memories would require sizeable chip or board areas that could be made available if the complexity of the CPU was reduced.
Yet another part of RISC design came from practical measurements on real-world programs. Andrew Tanenbaum summed up many of these, demonstrating that most processors were vastly overdesigned. For instance, he showed that 98% of all the constants in a program would fit in 13 bits, yet almost every CPU design dedicated some multiple of 8 bits to storing them, typically 8, 16 or 32, one entire word. Taking this fact into account suggests that a machine should allow for constants to be stored in unused bits of the instruction itself, decreasing the number of memory accesses. Instead of loading up numbers from memory or registers, they would be "right there" when the CPU needed them, and therefore much faster. However this required the operation itself to be very small, otherwise there would not be enough room left over in a 32-bit instruction to hold reasonably sized constants.
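Tanenbaum's observation can be made concrete with a small sketch. The field layout below is illustrative (a signed 13-bit immediate, as used for example by SPARC, with the remaining bits of a 32-bit word assumed free for the op-code and register fields); it shows how a constant that fits the field rides along inside the instruction itself, avoiding a separate memory access.

```python
# Hypothetical 32-bit instruction layout: low 13 bits hold a signed
# immediate constant, upper 19 bits hold opcode and register fields.
IMM_BITS = 13
IMM_MIN = -(1 << (IMM_BITS - 1))          # -4096
IMM_MAX = (1 << (IMM_BITS - 1)) - 1       #  4095

def fits_in_imm(value):
    """True if the constant fits the signed 13-bit immediate field."""
    return IMM_MIN <= value <= IMM_MAX

def pack_imm(upper_fields, value):
    """Pack a small signed constant into the low 13 bits of the word."""
    if not fits_in_imm(value):
        raise ValueError("constant too large; needs a separate load")
    return (upper_fields << IMM_BITS) | (value & ((1 << IMM_BITS) - 1))

def unpack_imm(word):
    """Recover the signed constant from the instruction word."""
    raw = word & ((1 << IMM_BITS) - 1)
    return raw - (1 << IMM_BITS) if raw >= (1 << (IMM_BITS - 1)) else raw
```

Any constant outside the -4096..4095 range would still need to be loaded from memory or built up in a register, which is why the operation encoding itself has to stay small.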
Since real-world programs spent most of their time executing very simple operations, some researchers decided to focus on making those common operations as simple and as fast as possible. Since the clock rate of the CPU is limited by the time it takes to execute the slowest instruction, speeding up that instruction -- perhaps by reducing the number of addressing modes it supports -- also speeds up the execution of every other instruction. The goal of RISC was to make instructions so simple, each one could be executed in a single clock cycle [1]. The focus on "reduced instructions" led to the resulting machine being called a "reduced instruction set computer" (RISC).
Unfortunately, the term "reduced instruction set computer" is often misunderstood as meaning that there are fewer instructions in the instruction set of the processor. Instead, RISC designs often have huge command sets of their own. Inspired by the desire for simpler designs, some people have developed some interesting MISC and OISC machines such as Transport Triggered Architectures, while others have walked into a Turing tarpit.
The real difference between RISC and CISC is the philosophy of doing everything in registers and loading and saving the data to and from them. To avoid that misunderstanding, many researchers prefer the term load-store.
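The load-store distinction can be sketched on a toy machine. The function names and the three-register decomposition below are illustrative, not any real instruction set: the CISC-style version touches memory in one complex instruction, while the RISC version computes only in registers, with explicit loads and stores around the add.

```python
def cisc_add(mem, dst, src1, src2):
    """One complex instruction: read both operands from memory,
    add, and write the result straight back to memory."""
    mem[dst] = mem[src1] + mem[src2]

def risc_add(mem, regs, dst, src1, src2):
    """The same work expressed as four simple load-store instructions."""
    regs[0] = mem[src1]           # LOAD  r0, src1
    regs[1] = mem[src2]           # LOAD  r1, src2
    regs[2] = regs[0] + regs[1]   # ADD   r2, r0, r1
    mem[dst] = regs[2]            # STORE r2, dst

mem_a = {"x": 2, "y": 3, "z": 0}
mem_b = dict(mem_a)
cisc_add(mem_a, "z", "x", "y")
risc_add(mem_b, [0, 0, 0], "z", "x", "y")
assert mem_a["z"] == mem_b["z"] == 5
```

Both produce the same result; the difference is that each of the four RISC steps is simple enough to be fast, uniform, and easy to pipeline.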
Over time the older design technique became known as Complex Instruction Set Computer, or CISC, although this was largely to give them a different name for comparison purposes.
Code was implemented as a series of these simple instructions, instead of a single complex instruction that had the same result. This had the side effect of leaving more room in the instruction to carry data with it, meaning that there was less need to use registers or memory. At the same time the memory interface was considerably simpler, allowing it to be tuned.
RISC has its drawbacks, however. When a series of instructions is needed to accomplish what used to be a very simple operation, the total number of instructions fetched from memory grows, and programs become larger as a result. At the time there was considerable debate over the pros and cons of RISC.
Ways to increase CPU performance
While the RISC philosophy was coming into its own, new ideas about how to dramatically increase performance of the CPUs were starting to develop.
In the early 1980s it was thought that existing design was reaching theoretical limits. Future improvements in speed would be primarily through improved semiconductor "process", that is, smaller features (transistors and wires) on the chip. The complexity of the chip would remain largely the same, but the smaller size would allow it to run at higher clock rates. A considerable amount of effort was put into designing chips for parallel computing, with built-in communications links. Instead of making faster chips, a large number of chips would be used, dividing up problems among them. However history has shown that the original fears were not valid, and there were a number of ideas that dramatically improved performance in the late 1980s.
One idea was to include a pipeline which would break down instructions into steps, and work on one step of several different instructions at the same time. A normal processor might read an instruction, decode it, fetch the memory the instruction asked for, perform the operation, and then write the results back out. The key to pipelining is the observation that the processor can start reading the next instruction as soon as it finishes reading the last, meaning that there are now two instructions being worked on (one is being read, the next is being decoded), and after another cycle there will be three. While no single instruction is completed any faster, the next instruction would complete right after the previous one. The result was a much more efficient utilization of processor resources.
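The payoff described above can be captured in a minimal timing sketch, assuming an idealized five-stage pipeline (read, decode, fetch operands, execute, write back) with one instruction issued per cycle and no stalls. Once the pipeline is full, each additional instruction costs one cycle instead of a full trip through all the stages.

```python
def sequential_cycles(n_instructions, n_stages=5):
    """Unpipelined: each instruction runs through every stage
    before the next instruction starts."""
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages=5):
    """Ideal pipeline: fill the stages once, then retire one
    instruction per cycle thereafter."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)
```

For 100 instructions this gives 500 cycles unpipelined versus 104 pipelined, even though no single instruction finishes any faster.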
Yet another solution was to use several processing elements inside the processor and run them in parallel. Instead of working on one instruction to add two numbers, these superscalar processors would look at the next instruction in the pipeline and attempt to run it at the same time in an identical unit. However, this can be difficult to do, as many instructions in computing depend on the results of some other instruction.
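The dependence problem mentioned above is the core of the issue logic in a superscalar design. The sketch below is a simplified model, not any real processor's issue stage: two adjacent instructions may issue in the same cycle only if the second neither reads nor overwrites the register the first one writes.

```python
def can_dual_issue(first, second):
    """Each instruction is modeled as (dest_reg, src_regs).
    Allow same-cycle issue only when there is no read-after-write
    or write-after-write dependence from `first` to `second`."""
    dest1, _ = first
    dest2, srcs2 = second
    return dest1 not in srcs2 and dest1 != dest2

# ADD r1 <- r2, r3 followed by SUB r4 <- r1, r5:
# the SUB reads r1, so it must wait for the ADD.
assert not can_dual_issue(("r1", ("r2", "r3")), ("r4", ("r1", "r5")))
# Fully independent instructions can go down two units at once.
assert can_dual_issue(("r1", ("r2", "r3")), ("r4", ("r5", "r6")))
```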
Both of these techniques relied on increasing speed by adding complexity to the basic layout of the CPU, as opposed to the instructions running on them. With chip space being a finite quantity, in order to include these features something else would have to be removed to make room. RISC was tailor-made to take advantage of these techniques, because the core logic of a RISC CPU was considerably simpler than in CISC designs. Although the first RISC designs had marginal performance, they were able to quickly add these new design features and by the late 1980s they were significantly outperforming their CISC counterparts. In time this would be addressed as process improved to the point where all of this could be added to a CISC design and still fit on a single chip, but this took most of the late-80s and early 90s.
The long and short of it is that for any given level of general performance, a RISC chip will typically have many fewer transistors dedicated to the core logic. This allows the designers considerable flexibility; they can, for instance:
- increase the size of the register set
- improve internal parallelism
- increase the size of caches
- add other functionality, such as I/O and timers
- add vector (SIMD) processors, such as AltiVec or Streaming SIMD Extensions (SSE)
- build the chips on older fabrication lines, which would otherwise go unused
- add none of these, instead offering the chip for battery-constrained or size-limited applications
Features commonly found in RISC designs:
- uniform instruction encoding (for example, the op-code is always in the same bit positions in every instruction, and all instructions are the same length), allowing faster decoding;
- general-purpose registers: any register can be used in any context, simplifying compiler design (although integer and floating-point registers are usually kept separate);
- simple addressing modes (complex addressing modes are replaced by sequences of simple arithmetic instructions);
- few data types supported in hardware (for example, some CISC machines have instructions for operating on byte strings; such instructions are unlikely to appear on a RISC machine).
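The decoding benefit of a uniform encoding can be sketched directly. The field widths below are illustrative, loosely modeled on a MIPS-style 32-bit R-type word: because the op-code always sits in bits 31..26 and the register fields have fixed positions, every field falls out with a couple of shifts and masks, with no sequential length decoding.

```python
def encode(opcode, rs, rt, rd):
    """Assemble a 32-bit word from fixed-position fields
    (6-bit opcode, three 5-bit register specifiers)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11)

def decode(word):
    """Split the word back into its fields; every mask and shift
    is a constant, so all fields decode in parallel in hardware."""
    return {
        "opcode": (word >> 26) & 0x3F,  # always bits 31..26
        "rs":     (word >> 21) & 0x1F,
        "rt":     (word >> 16) & 0x1F,
        "rd":     (word >> 11) & 0x1F,
    }
```

Contrast this with a variable-length CISC encoding, where the decoder cannot even locate the second instruction until it has worked out the length of the first.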
RISC designs are also more likely to feature a Harvard memory model, where the instruction stream and the data stream are conceptually separated; this means that modifying the addresses where code is held might not have any effect on the instructions executed by the processor (because the CPU has a separate instruction and data cache), at least until a special synchronization instruction is issued. On the upside, this allows both caches to be accessed simultaneously, which can often improve performance.
Many of these early RISC designs also shared a not-so-nice feature, the branch delay slot. A branch delay slot is an instruction space immediately following a jump or branch. The instruction in this space is executed whether or not the branch is taken (in other words the effect of the branch is delayed). This instruction keeps the ALU of the CPU busy for the extra time normally needed to perform a branch. Nowadays the branch delay slot is considered an unfortunate side effect of a particular strategy for implementing some RISC designs, and modern RISC designs generally do away with it (such as PowerPC, more recent versions of SPARC, and MIPS).
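The delay-slot rule can be demonstrated with a toy interpreter. The instruction format here is invented for illustration: the instruction immediately after a taken branch still executes before control actually transfers.

```python
def run(program):
    """Execute a list of ("add", reg, value) or ("branch", target)
    instructions on a machine with one register and delay slots."""
    regs = {"r1": 0}
    pc = 0
    while pc < len(program):
        op = program[pc]
        if op[0] == "add":
            regs[op[1]] += op[2]
            pc += 1
        elif op[0] == "branch":
            # The delay-slot instruction executes first...
            slot = program[pc + 1]
            if slot[0] == "add":
                regs[slot[1]] += slot[2]
            # ...and only then is the branch taken.
            pc = op[1]
    return regs

# The add at index 1 sits in the branch's delay slot and still runs;
# the add at index 2 is skipped by the branch.
prog = [("branch", 3), ("add", "r1", 10), ("add", "r1", 100), ("add", "r1", 1)]
assert run(prog)["r1"] == 11
```

On designs without delay slots, the same program would leave r1 at 1, since nothing after a taken branch would execute.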
References
For example, Intel's Pentium series CPUs are complex instruction set CPUs, while IBM's PowerPC 970 (used in the Apple Mac G5) is a reduced instruction set CPU.