=== Early research and development ===
{{unreferenced section|date=July 2023}}
 
Vector processing development began in the early 1960s at the [[Westinghouse Electric Corporation]] in their ''Solomon'' project. Solomon's goal was to dramatically increase math performance by using a large number of simple [[coprocessor]]s under the control of a single master [[Central processing unit]] (CPU). The CPU fed a single common instruction to all of the [[arithmetic logic unit]]s (ALUs), one per cycle, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single [[algorithm]] to a large [[data set]], fed in the form of an array.{{cn|date=July 2023}}
 
In 1962, Westinghouse cancelled the project, but the effort was restarted by the [[University of Illinois at Urbana–Champaign]] as the [[ILLIAC IV]]. Their version of the design originally called for a 1 [[GFLOPS]] machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless, it showed that the basic concept was sound, and, when used on data-intensive applications, such as [[computational fluid dynamics]], the ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element is not common to later designs, and is often referred to under a separate category, [[massively parallel]] computing. Around this time Flynn categorized this type of processing as an early form of [[single instruction, multiple threads]] (SIMT).{{cn|date=July 2023}}
 
[[International Computers Limited]] sought to avoid many of the difficulties with the ILLIAC concept with its own [[Distributed Array Processor]] (DAP) design, categorising the ILLIAC and DAP as cellular array processors that potentially offered substantial performance benefits over conventional vector processor designs such as the CDC STAR-100 and Cray 1.<ref name="newscientist19760617_dap">{{ cite magazine | url=https://archive.org/details/bub_gb_m8S4bXj3dcMC/page/n11/mode/2up | title=Computers by the thousand | magazine=New Scientist | last1=Parkinson | first1=Dennis | date=17 June 1976 | access-date=7 July 2024 | pages=626–627 }}</ref>
 
===Computer for operations with functions===
{{unreferenced section|date=July 2023}}
 
The first vector supercomputers are the [[Control Data Corporation]] [[CDC STAR-100|STAR-100]] and [[Texas Instruments]] [[Advanced Scientific Computer]] (ASC), which were introduced in 1974 and 1972, respectively.<!--The STAR was announced before the ASC-->
 
The basic ASC (i.e., "one pipe") ALU used a pipeline architecture that supported both scalar and vector computations, with peak performance reaching approximately 20 MFLOPS, readily achieved when processing long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with a corresponding 2X or 4X performance gain. Memory bandwidth was sufficient to support these expanded modes.
[[File:Cray J90 CPU module.jpg|thumb|[[Cray J90]] processor module with four scalar/vector processors]]
 
Other examples followed. [[Control Data Corporation]] tried to re-enter the high-end market with its [[ETA-10]] machine, but it sold poorly, and the company took that as an opportunity to leave the supercomputing field entirely. In the early and mid-1980s, Japanese companies ([[Fujitsu]], [[Hitachi, Ltd.|Hitachi]] and [[Nippon Electric Corporation]] (NEC)) introduced register-based vector machines similar to the Cray-1, typically slightly faster and much smaller. [[Oregon]]-based [[Floating Point Systems]] (FPS) built add-on array processors for [[minicomputer]]s, later building their own [[minisupercomputer]]s.
 
Throughout, Cray continued to be the performance leader, continually beating the competition with a series of machines that led to the [[Cray-2]], [[Cray X-MP]] and [[Cray Y-MP]]. Since then, the supercomputer market has focused much more on [[massively parallel]] processing rather than better implementations of vector processors. However, recognising the benefits of vector processing, IBM developed [[IBM ViVA|Virtual Vector Architecture]] for use in supercomputers coupling several scalar processors to act as a vector processor.
 
Although vector supercomputers resembling the Cray-1 are less popular these days, NEC has continued to make this type of computer up to the present day with their [[NEC SX architecture|SX series]] of computers. Most recently, the [[SX-Aurora TSUBASA]] places the processor and either 24 or 48 gigabytes of memory on an [[High Bandwidth Memory|HBM]] 2 module within a card that physically resembles a graphics coprocessor, but instead of serving as a co-processor, it is the main computer with the PC-compatible computer into which it is plugged serving support functions.
{{As of | 2016}} most commodity CPUs implement architectures that feature fixed-length SIMD instructions. On first inspection these can be considered a form of vector processing because they operate on multiple (vectorized, explicit length) data sets, and borrow features from vector processors. However, by definition, the addition of SIMD cannot, by itself, qualify a processor as an actual ''vector processor'', because SIMD is {{em|fixed-length}}, and vectors are {{em|variable-length}}. The difference is illustrated below with examples, showing and comparing the three categories: Pure SIMD, Predicated SIMD, and Pure Vector Processing.{{citation needed|date=June 2021}}
 
* '''Pure (fixed) SIMD''' - also known as "Packed SIMD",<ref>{{cite conference|first1=Y.|last1=Miyaoka|first2=J.|last2=Choi|first3=N.|last3=Togawa|first4=M.|last4=Yanagisawa|first5=T.|last5=Ohtsuki|title=An algorithm of hardware unit generation for processor core synthesis with packed SIMD type instructions|conference=Asia-Pacific Conference on Circuits and Systems|date=2002|pages=171–176|volume=1|doi=10.1109/APCCAS.2002.1114930|hdl=2065/10689|hdl-access=free}}</ref> [[SIMD within a register]] (SWAR), and [[Flynn's taxonomy#Pipelined processor|Pipelined Processor]] in Flynn's Taxonomy. Common examples using SIMD with features inspired by vector processors include: Intel x86's [[MMX (instruction set)|MMX]], [[Streaming SIMD Extensions|SSE]] and [[Advanced Vector Extensions|AVX]] instructions, AMD's [[3DNow!]] extensions, [[ARM architecture#Advanced SIMD (Neon)|ARM NEON]], Sparc's [[Visual Instruction Set|VIS]] extension, [[PowerPC]]'s [[AltiVec]] and MIPS' [[MIPS_architecture#Application-specific_extensions|MSA]]. In 2000, [[IBM]], [[Toshiba]] and [[Sony]] collaborated to create the [[Cell (microprocessor)|Cell processor]], which is also SIMD.
* '''Predicated SIMD''' - also known as [[Flynn's taxonomy#Associative processor|associative processing]]. Two notable examples that have per-element (lane-based) predication are [[Scalable Vector Extension|ARM SVE2]] and [[AVX-512]].
* '''Pure Vectors''' - as categorised in [[Duncan's taxonomy#Pipelined vector processors|Duncan's taxonomy]] - these include the original [[Cray-1]], [[Convex Computer|Convex C-Series]], [[NEC SX]], and [[RISC-V#Vector set|RISC-V RVV]]. Although memory-based, the [[CDC STAR-100]] was also a vector processor.
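The fixed- versus variable-length distinction can be illustrated with a plain-Python sketch. No real SIMD hardware is involved; the function names and the <code>maxvl</code> parameter are illustrative only, loosely modelled on a RISC-V RVV-style "set vector length" operation:

```python
# Contrast fixed-length SIMD (which needs a separate scalar tail loop
# for leftover elements) with variable-length vector processing, where
# a setvl-style operation clamps the active length each iteration.

def simd_add(a, b, width=4):
    """Fixed-length SIMD: full chunks of `width`, then a scalar tail."""
    n = len(a)
    out = [0] * n
    i = 0
    while i + width <= n:               # full SIMD chunks
        out[i:i + width] = [x + y for x, y in zip(a[i:i + width], b[i:i + width])]
        i += width
    while i < n:                        # leftover elements: scalar code
        out[i] = a[i] + b[i]
        i += 1
    return out

def vector_add(a, b, maxvl=4):
    """Variable-length vectors: the 'hardware' picks the active length,
    so no separate tail loop is needed."""
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        vl = min(maxvl, n - i)          # setvl: clamp to remaining elements
        out[i:i + vl] = [x + y for x, y in zip(a[i:i + vl], b[i:i + vl])]
        i += vl
    return out
```

Both functions compute the same result, but only the SIMD version needs two distinct code paths, which is precisely the programming burden that variable-length vector ISAs remove.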
 
Other CPU designs include multiple instructions for vector processing on multiple (vectorized) data sets, an approach typically known as [[Multiple instruction, multiple data|MIMD]] (Multiple Instruction, Multiple Data) and realized with [[VLIW]] (Very Long Instruction Word) and [[Explicitly parallel instruction computing|EPIC]] (Explicitly Parallel Instruction Computing). The [[Fujitsu FR-V]] VLIW/vector processor combines both technologies.
 
=== Difference between SIMD and vector processors ===
== Description ==
 
In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, most CPUs have an instruction that essentially says "add A to B and put the result in C". The data for A, B and C could be—in theory at least—encoded directly into the instruction. However, in efficient implementations things are rarely that simple. The data is rarely sent in raw form, and is instead "pointed to" by passing in an address to a memory location that holds the data. Decoding this address and getting the data out of the memory takes some time, during which the CPU traditionally would sit idle waiting for the requested data to show up. As CPU speeds have increased, this [[memory latency]] has historically become a large impediment to performance; see {{slink|Random-access memory|Memory wall}}.
 
In order to reduce the amount of time consumed by these steps, most modern CPUs use a technique known as [[instruction pipelining]] in which the instructions pass through several sub-units in turn. The first sub-unit reads the address and decodes it, the next "fetches" the values at those addresses, and the next does the math itself. With pipelining the "trick" is to start decoding the next instruction even before the first has left the CPU, in the fashion of an [[assembly line]], so the [[address decoder]] is constantly in use. Any particular instruction takes the same amount of time to complete, a time known as the ''[[Latency (engineering)|latency]]'', but the CPU can process an entire batch of operations, in an overlapping fashion, much faster and more efficiently than if it did so one at a time.
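The latency-versus-throughput trade-off described above can be shown with a toy cycle-count model (a simplification for illustration; real pipelines have hazards and stalls that this ignores):

```python
# Toy model of a 3-stage pipeline (decode address, fetch operands,
# execute). Each instruction still takes `stages` cycles of latency,
# but once the pipeline is full, one instruction completes per cycle.

def unpipelined_cycles(n_instructions, stages=3):
    # Each instruction passes through every stage before the next starts.
    return n_instructions * stages

def pipelined_cycles(n_instructions, stages=3):
    # Fill the pipeline once (stages cycles), then one result per cycle.
    return stages + (n_instructions - 1)

print(unpipelined_cycles(100))  # 300
print(pipelined_cycles(100))    # 102
```

For a long batch of instructions the pipelined throughput approaches one instruction per cycle, even though the latency of any single instruction is unchanged.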
With many 3D [[shader]] applications needing [[trigonometric]] operations as well as short vectors for common operations (RGB, ARGB, XYZ, XYZW) support for the following is typically present in modern GPUs, in addition to those found in vector processors:
 
* '''Sub-vectors''' – elements may typically contain two, three or four sub-elements (vec2, vec3, vec4) where any given bit of a predicate mask applies to the whole vec2/3/4, not the elements in the sub-vector. Sub-vectors are also introduced in RISC-V RVV (termed "LMUL").<ref>[https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#mapping-for-lmul-1-2 LMUL > 1 in RVV]</ref> Subvectors are a critical integral part of the [[Vulkan (API)|Vulkan]] [[SPIR-V]] spec.
* '''Sub-vector Swizzle''' – aka "Lane Shuffling", which allows sub-vector inter-element computations without needing extra (costly, wasteful) instructions to move the sub-elements into the correct SIMD "lanes", and also saves predicate mask bits. Effectively this is an in-flight [[permute instruction|mini-permute]] of the sub-vector. It features heavily in 3D shader binaries and is sufficiently important as to be part of the [[Vulkan (API)|Vulkan]] [[SPIR-V]] spec. The Broadcom [[Videocore]] IV uses the terminology "Lane rotate"<ref>[https://patents.google.com/patent/US20110227920 Abandoned US patent US20110227920-0096]</ref> where the rest of the industry uses the term [[Swizzling (computer graphics)|"swizzle"]].<ref>[https://github.com/hermanhermitage/videocoreiv-qpu Videocore IV QPU]</ref>
* '''Transcendentals''' – [[trigonometric]] operations such as [[sine]], [[cosine]] and [[logarithm]] feature much more prominently in 3D than in many demanding [[High-performance computing|HPC]] workloads. Of interest, however, is that speed is far more important than accuracy in 3D for GPUs, where computation of pixel coordinates simply does not require high precision. The [[Vulkan (API)|Vulkan]] specification recognises this and sets surprisingly low accuracy requirements, so that GPU hardware can reduce power usage. The concept of reducing accuracy where it is simply not needed is explored in the [[MIPS-3D]] extension.
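The swizzle operation above can be emulated in plain Python to show what it does semantically (a GPU performs this in-flight as part of an instruction; the function name and lane naming here are illustrative only, following the GLSL-style <code>xyzw</code> convention):

```python
# Emulate a sub-vector "swizzle" on a vec4: rearrange named components
# (x, y, z, w) in a single logical step, e.g. converting ARGB to RGBA
# without separate move instructions per component.

def swizzle(vec, pattern, lanes="xyzw"):
    """Return vec's components reordered by the name pattern."""
    return [vec[lanes.index(c)] for c in pattern]

argb = [255, 10, 20, 30]         # A, R, G, B stored in lanes x, y, z, w
rgba = swizzle(argb, "yzwx")     # rotate lanes: R, G, B, A
print(rgba)                      # [10, 20, 30, 255]
```

On a GPU the equivalent reordering (e.g. <code>v.yzwx</code> in shader source) costs no extra data-movement instructions, which is why swizzling appears so often in compiled shader binaries.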
 
=== Fault (or Fail) First ===