{{Short description|Computer processor which works on arrays of several numbers at once}}
In [[computing]], a '''vector processor''' or '''array processor''' is a [[central processing unit]] (CPU) that implements an [[instruction set]] where its [[Instruction (computer science)|instructions]] are designed to operate efficiently and effectively on large [[Array data structure|one-dimensional array]]s of data called ''vectors''. This is in contrast to [[scalar processor]]s, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional [[single instruction, multiple data]] (SIMD) or [[SWAR|SIMD within a register (SWAR)]] arithmetic units. Vector processors can greatly improve performance on certain workloads, notably [[numerical simulation]] and similar tasks.
Vector machines appeared in the early 1970s and dominated [[supercomputer]] design through the 1970s into the 1990s, notably the various [[Cray]] platforms. The rapid fall in the [[price-to-performance ratio]] of conventional [[microprocessor]] designs led to a decline in vector supercomputers during the 1990s.
== History ==
=== Early research and development ===
Vector processing development began in the early 1960s at [[Westinghouse Electric Corporation|Westinghouse]] in their "Solomon" project. Solomon's goal was to dramatically increase math performance by using a large number of simple [[coprocessor|math co-processors]] under the control of a single master [[Central processing unit|CPU]]. The CPU fed a single common instruction to all of the [[arithmetic logic unit]]s (ALUs), one per cycle, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single [[algorithm]] to a large [[data set]], fed in the form of an array.▼
In 1962, Westinghouse cancelled the project, but the effort was restarted at the [[University of Illinois at Urbana–Champaign|University of Illinois]] as the [[ILLIAC IV]]. Their version of the design originally called for a 1 [[GFLOPS]] machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless, it showed that the basic concept was sound, and, when used on data-intensive applications, such as [[computational fluid dynamics]], the ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element is not common to later designs, and is often referred to under a separate category, [[massively parallel]] computing. Around this time Flynn categorised this type of processing as an early form of [[Flynn's taxonomy#Array processor|SIMT]].▼
[[International Computers Limited]] sought to avoid many of the difficulties with the ILLIAC concept with its own [[Distributed Array Processor]] (DAP) design, categorising the ILLIAC and DAP as cellular array processors that potentially offered substantial performance benefits over conventional vector processor designs such as the CDC STAR-100 and Cray 1.<ref name="newscientist19760617_dap">{{ cite magazine | url=https://archive.org/details/bub_gb_m8S4bXj3dcMC/page/n11/mode/2up | title=Computers by the thousand | magazine=New Scientist | last1=Parkinson | first1=Dennis | date=17 June 1976 | access-date=7 July 2024 | pages=626–627 }}</ref>
===Computer for operations with functions===
A [[computer for operations with functions]] was presented and developed by Kartsev in 1967.<ref name="Malinovsky">{{cite book|title=The history of computer technology in their faces (in Russian)|last=Malinovsky|first=B.N.|publisher=Firm "KIT"|year=1995|isbn=5-7707-6131-8|location=Kiew}}</ref>
=== Supercomputers ===
{{unreferenced section|date=July 2023}}
The first vector supercomputers are the [[Control Data Corporation]] [[CDC STAR-100|STAR-100]] and [[Texas Instruments]] [[Advanced Scientific Computer]] (ASC), which were introduced in 1974 and 1972, respectively.<!--The STAR was announced before the ASC-->▼
The basic ASC (i.e., "one pipe") ALU used a pipeline architecture that supported both scalar and vector computations, with peak performance reaching approximately 20 MFLOPS, readily achieved when processing long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with a corresponding 2X or 4X performance gain. Memory bandwidth was sufficient to support these expanded modes.
[[File:Cray J90 CPU module.jpg|thumb|[[Cray J90]] processor module with four scalar/vector processors]]
Other examples followed. [[Control Data Corporation]] tried to re-enter the high-end market again with its [[ETA-10]] machine, but it sold poorly and they took that as an opportunity to leave the supercomputing field entirely. In the early and mid-1980s Japanese companies ([[Fujitsu]], [[Hitachi]] and [[NEC|Nippon Electric Corporation]] (NEC)) introduced register-based vector machines similar to the Cray-1, typically being slightly faster and much smaller.
Throughout, Cray continued to be the performance leader, continually beating the competition with a series of machines that led to the [[Cray-2]], [[Cray X-MP]] and [[Cray Y-MP]]. Since then, the supercomputer market has focused much more on [[massively parallel]] processing rather than better implementations of vector processors. However, recognising the benefits of vector processing, IBM developed [[IBM ViVA|Virtual Vector Architecture]] for use in supercomputers coupling several scalar processors to act as a vector processor.
Although vector supercomputers resembling the Cray-1 are less popular these days, NEC has continued to make this type of computer up to the present day with their [[NEC SX architecture|SX series]] of computers. Most recently, the [[SX-Aurora TSUBASA]] places the processor and either 24 or 48 gigabytes of memory on an [[High Bandwidth Memory|HBM]] 2 module within a card that physically resembles a graphics coprocessor, but instead of serving as a co-processor, it is the main computer with the PC-compatible computer into which it is plugged serving support functions.
=== GPU ===
Modern graphics processing units ([[GPUs]]) include an array of [[shaders|shader pipelines]] which may be driven by [[compute kernel]]s, and can be considered vector processors (using a similar strategy for hiding memory latencies).
=== Recent development ===
Several modern CPU architectures are being designed as vector processors. The [[RISC-V#Vector_set|RISC-V vector extension]] follows similar principles as the early vector processors, and is being implemented in commercial products such as the [[Andes Technology]] AX45MPV.<ref>{{cite press release | title = Andes Announces RISC-V Multicore 1024-bit Vector Processor: AX45MPV | url = https://www.globenewswire.com/en/news-release/2022/12/07/2569216/0/en/Andes-Announces-RISC-V-Multicore-1024-bit-Vector-Processor-AX45MPV.html | publisher = GlobeNewswire | date = 7 December 2022 | access-date = 23 December 2022}}</ref> There are also several [[open source]] vector processor architectures being developed, including [[Agner Fog#ForwardCom_instruction_set|ForwardCom]] and [[Libre-SOC]].
== Comparison with modern architectures ==
{{As of | 2016}} most commodity CPUs implement architectures that feature fixed-length SIMD instructions. On first inspection these can be considered a form of vector processing because they operate on multiple (vectorized, explicit length) data sets, and borrow features from vector processors. However '''by definition''' the addition of SIMD cannot by itself qualify a processor ''as'' an actual Vector Processor because SIMD is fixed-length and Vectors are variable. The difference is illustrated below with examples, showing and comparing the three categories: Pure SIMD, Predicated SIMD, and Pure Vector Processing.{{citation needed|date=June 2021}}▼
* '''Pure (fixed) SIMD''' - also known as "Packed SIMD",<ref>{{cite conference|first1=Y.|last1=Miyaoka|first2=J.|last2=Choi|first3=N.|last3=Togawa|first4=M.|last4=Yanagisawa|first5=T.|last5=Ohtsuki|title=An algorithm of hardware unit generation for processor core synthesis with packed SIMD type instructions|conference=Asia-Pacific Conference on Circuits and Systems|date=2002|pages=171–176|volume=1|doi=10.1109/APCCAS.2002.1114930|hdl=2065/10689|hdl-access=free}}</ref> [[SWAR|SIMD within a Register (SWAR)]], and [[Flynn's taxonomy#Pipelined processor|Pipelined Processor]] in Flynn's Taxonomy. Common examples using SIMD with features inspired ''by'' Vector processors include Intel x86's [[MMX (instruction set)|MMX]], [[Streaming SIMD Extensions|SSE]] and [[Advanced Vector Extensions|AVX]] instructions, AMD's [[3DNow!]] extensions, [[ARM architecture#Advanced SIMD (Neon)|ARM NEON]], Sparc's [[Visual Instruction Set|VIS]] extension, [[PowerPC]]'s [[AltiVec]] and MIPS' [[MIPS SIMD Architecture|MSA]]. In 2000, [[IBM]], [[Toshiba]] and [[Sony]] collaborated to create the [[Cell (microprocessor)|Cell processor]], which is also SIMD.▼
* '''Predicated SIMD''' - also known as [[Flynn's taxonomy#Associative processor|associative processing]]. Two notable examples which have per-element (lane-based) predication are [[Scalable Vector Extension|ARM SVE2]] and [[AVX-512]].
* '''Pure Vectors''' - as categorised in [[Duncan's taxonomy#Pipelined vector processors|Duncan's taxonomy]] - these include the original [[Cray-1]], [[Convex Computer|Convex C-Series]], [[NEC SX]], and [[RISC-V#Vector set|RISC-V RVV]].
Other CPU designs include some multiple instructions for vector processing on multiple (vectorized) data sets, typically known as [[MIMD]] (multiple instruction, multiple data) and realized with [[Very long instruction word|VLIW]] (very long instruction word) and [[Explicitly parallel instruction computing|EPIC]] (explicitly parallel instruction computing). The [[FR-V (microprocessor)|Fujitsu FR-V]] VLIW/vector processor combines both technologies.
=== Difference between SIMD and vector processors ===
SIMD instruction sets lack crucial features when compared to vector instruction sets. The most important of these missing features are:
* a way to set the vector length,
* iteration and reduction over elements {{em|within}} vectors.
Predicated SIMD (part of [[Flynn's taxonomy]]), which provides comprehensive individual element-level predicate masks on every vector instruction, as is now available in ARM SVE2<ref>{{Cite web|url=https://developer.arm.com/tools-and-software/server-and-hpc/compile/arm-instruction-emulator/resources/tutorials/sve/sve-vs-sve2/single-page|title = Documentation – Arm Developer}}</ref> and AVX-512, almost qualifies as vector processing.
SIMD, because it uses fixed-width batch processing, is unable by design to cope with iteration and reduction. This is illustrated further with examples, below.
[[File:Simd vs vector.png|thumb|500px]]
Additionally, vector processors can be more resource-efficient by using slower hardware and saving power, but still achieving throughput with less latency than SIMD, through vector chaining.
Consider both a SIMD processor and a vector processor working on 4 64-bit elements, doing a LOAD, ADD, MULTIPLY and STORE sequence.
Having to perform 4-wide simultaneous 64-bit LOADs and 64-bit STOREs is very costly in hardware (256 bit data paths to memory). Having 4x 64-bit ALUs, especially MULTIPLY, likewise.
To avoid these high costs, a SIMD processor would have to have 1-wide 64-bit LOAD, 1-wide 64-bit STORE, and only 2-wide 64-bit ALUs. As shown in the diagram, which assumes a [[Superscalar processor|multi-issue execution model]], the consequences are that the operations now take longer to complete. If multi-issue is not possible, then the operations take even longer because the LD may not be issued (started) at the same time as the first ADDs, and so on. If there are only 4-wide 64-bit SIMD ALUs, the completion time is even worse: only when ''all four'' LOADs have completed may the SIMD operations start, and only when all ALU operations have completed may the STOREs begin.
A vector processor, by contrast, even if it is ''single-issue'' and uses no SIMD ALUs, only having 1-wide 64-bit LOAD, 1-wide 64-bit STORE (and, as in the [[Cray-1]], the ability to run MULTIPLY simultaneously with ADD), may complete the four operations faster than a SIMD processor with 1-wide LOAD, 1-wide STORE, and 2-wide SIMD. This more efficient resource utilization, due to vector chaining, is a key advantage and difference compared to SIMD. SIMD, by design and definition, cannot perform chaining except to the entire group of results.
== Description ==
In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, most CPUs have an instruction that essentially says "add A to B and put the result in C". The data for A, B and C could be—in theory at least—encoded directly into the instruction. However, in efficient implementation things are rarely that simple. The data is rarely sent in raw form, and is instead "pointed to" by passing in an address to a memory location that holds the data. Decoding this address and getting the data out of the memory takes some time, during which the CPU traditionally would sit idle waiting for the requested data to show up. As CPU speeds have increased, this [[memory latency]] has historically become a large impediment to performance; see [[Random-access memory#Memory wall|Memory wall]].▼
In order to reduce the amount of time consumed by these steps, most modern CPUs use a technique known as [[instruction pipelining]] in which the instructions pass through several sub-units in turn. The first sub-unit reads the address and decodes it, the next "fetches" the values at those addresses, and the next does the math itself. With pipelining the "trick" is to start decoding the next instruction even before the first has left the CPU, in the fashion of an [[assembly line]], so the [[address decoder]] is constantly in use. Any particular instruction takes the same amount of time to complete, a time known as the ''[[Latency (engineering)|latency]]'', but the CPU can process an entire batch of operations, in an overlapping fashion, much faster and more efficiently than if it did so one at a time.▼
Vector processors take this concept one step further. Instead of pipelining just the instructions, they also pipeline the data itself. The processor is fed instructions that say not just to add A to B, but to add all of the numbers "from here to here" to all of the numbers "from there to there". Instead of constantly having to decode instructions and then fetch the data needed to complete them, the processor reads a single instruction from memory, and it is simply implied in the definition of the instruction ''itself'' that the instruction will operate again on another item of data, at an address one increment larger than the last. This allows for significant savings in decoding time.
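The implied iteration can be made concrete with a short C sketch. This is illustrative only: the function name {{code|vadd}} and its signature are invented, and stand in for the semantics of a ''single'' hypothetical vector-add instruction, where one instruction fetch and decode covers an entire range of elements.

<syntaxhighlight lang=c>
#include <stddef.h>

/* Semantics of one hypothetical vector instruction "vadd c, a, b, n":
   the hardware behaves as if it ran this whole loop itself.  The
   iteration is implied by the instruction, so only one fetch and one
   decode is needed; the address increments are implicit in hardware. */
void vadd(double *c, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
</syntaxhighlight>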
<syntaxhighlight lang=gas>
; Hypothetical RISC machine
; add 10 numbers in a to 10 numbers in b, storing results in c
; assume a, b, and c are memory locations in their respective registers
move $10, count ; count := 10
loop:
; ...
</syntaxhighlight>
But to a vector processor, this task looks considerably different:
<syntaxhighlight lang="gas">
; assume we have vector registers v1-v3
; with size equal or larger than 10
move $10, count ; count = 10
; ...
</syntaxhighlight>
But more than that, a high performance vector processor may have multiple [[functional unit]]s adding those numbers in parallel. The checking of dependencies between those numbers is not required as a vector instruction specifies multiple independent operations. This simplifies the control logic required, and can further improve performance by avoiding stalls. The math operations thus completed far faster overall, the limiting factor being the time required to fetch the data from memory.
Not all problems can be attacked with this sort of solution. Including these types of instructions necessarily adds complexity to the core CPU. That complexity typically makes ''other'' instructions run slower—i.e., whenever it is '''not''' adding up many numbers in a row. The more complex instructions also add to the complexity of the decoders, which might slow down the decoding of the more common instructions such as normal adding.
Vector processors were traditionally designed to work best only when there are large amounts of data to be worked on. For this reason, these sorts of CPUs were found primarily in [[supercomputer]]s, as the supercomputers themselves were, in general, found in places such as weather prediction centers and physics labs, where huge amounts of data are "crunched".
=== Vector instructions ===
The vector pseudocode example above comes with a big assumption that the vector computer can process more than ten numbers in one batch. For a greater quantity of numbers in the vector register, it becomes unfeasible for the computer to have a register that large. As a result, the vector processor either gains the ability to perform loops itself, or exposes some sort of vector control (status) register to the programmer, usually known as a vector length (VL) register.
The self-repeating instructions are found in early vector computers like the STAR-100, where the above action would be described in a single instruction (somewhat like {{code|1=vadd c, a, b, $10}}). They are also found in the [[x86]] architecture as the {{code|REP}} prefix. However, only very simple calculations can be done effectively in hardware this way without a very large cost increase. Since all operands have to be in memory for the STAR-100 architecture, the latency caused by access became huge too.
Broadcom included space in all vector operations of the [[Videocore]] IV ISA for a {{code|REP}} field, but unlike the STAR-100, which uses memory for its repeats, the Videocore IV repeats apply to all operations, including arithmetic vector operations.
The [[Cray-1]] introduced the idea of using [[processor register]]s to hold vector data in batches. The batch lengths (vector length, VL) could be dynamically set with a special instruction, the significance compared to Videocore IV (and, crucially as will be shown below, SIMD as well) being that the repeat length does not have to be part of the instruction encoding. This way, significantly more work can be done in each batch; the instruction encoding is much more elegant and compact as well. The only drawback is that in order to take full advantage of this extra batch processing capacity, the memory load and store speed correspondingly had to increase as well. This is sometimes claimed{{By whom|date=November 2021}} to be a disadvantage of Cray-style vector processors: in reality it is part of achieving high performance throughput, as seen in [[GPU]]s, which face exactly the same issue.
Modern SIMD computers claim to improve on early Cray by directly using multiple ALUs, for a higher degree of parallelism compared to only using the normal scalar pipeline. Modern vector processors (such as the [[SX-Aurora TSUBASA]]) combine both, by issuing multiple data to multiple internal pipelined SIMD ALUs, the number issued being dynamically chosen by the vector program at runtime.
SIMD instruction sets that lack per-element predication ([[MMX (instruction set)|MMX]], [[Streaming SIMD Extensions|SSE]], [[AltiVec]]) categorically do not qualify as vector processing.
=== Vector instruction example ===
This example starts with an algorithm ("IAXPY"), showing it first in scalar instructions, then SIMD, then predicated SIMD, and finally vector instructions. This incrementally helps illustrate the difference between a traditional vector processor and a modern SIMD one.
<syntaxhighlight lang=c>
void iaxpy(size_t n, int a, const int x[], int y[]) {
    /* ... */
}
</syntaxhighlight>
In each iteration, every element of y has an element of x multiplied by a and added to it. The program is expressed in scalar form first, for clarity.
==== Scalar assembler ====
The STAR-like code remains concise, but because the STAR-100's vectorisation was by design based around memory accesses, an extra slot of memory is now required to process the information. Two times the latency is also needed due to the extra requirement of memory access.
<syntaxhighlight lang=gas>
; Assume tmp is pre-allocated
; ...
</syntaxhighlight>
==== Pure (non-predicated, packed) SIMD ====
A modern packed SIMD architecture, known by many names (listed in [[Flynn's taxonomy#Pipelined processor|Flynn's taxonomy]]), can do most of the operation in batches. The code is mostly similar to the scalar version. It is assumed that both x and y are [[Data structure alignment|properly aligned]] here (only start on a multiple of 16) and that n is a multiple of 4, as otherwise some setup code would be needed to calculate a mask or to run a scalar version.
<syntaxhighlight lang=gas>
; ...
</syntaxhighlight>
Unfortunately for SIMD, the clue was in the assumption above, "that n is a multiple of 4" as well as "aligned access", which, clearly, is a limited specialist use-case.
Realistically, for general-purpose loops such as in portable libraries, where n cannot be limited in this way, the overhead of setup and cleanup for SIMD, needed in order to cope with non-multiples of the SIMD width, can far exceed the instruction count inside the loop itself.
* first have a preparatory section which works on the beginning unaligned data, up to the first point where SIMD memory-aligned operations can take over. This will either involve (slower) scalar-only operations or smaller-sized packed SIMD operations. Each copy implements the full algorithm inner loop.
* perform the aligned SIMD loop at the maximum SIMD width up until the last few elements (those remaining that do not fit the fixed SIMD width)
Eight-wide SIMD requires repeating the inner loop algorithm first with four-wide SIMD elements, then two-wide SIMD, then one (scalar), with a test and branch in between each one, in order to cover the first and last remaining SIMD elements (0 <= n <= 7).
This more than ''triples'' the size of the code; in extreme cases it results in an ''order of magnitude'' increase in instruction count.
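The multi-path structure just described can be sketched in C. This is a sketch only: {{code|simd4_iaxpy}} is a hypothetical stand-in for a real 4-wide intrinsic, and the unaligned prologue is elided because, as in the packed-SIMD example above, aligned input is assumed.

<syntaxhighlight lang=c>
#include <stddef.h>

/* Hypothetical 4-wide intrinsic: models one packed-SIMD instruction
   performing y[0..3] += a * x[0..3]. */
static void simd4_iaxpy(int a, const int *x, int *y) {
    for (int lane = 0; lane < 4; lane++)
        y[lane] += a * x[lane];
}

/* Fixed-width SIMD needs multiple code paths for arbitrary n:
   a 4-wide main body plus a scalar epilogue for the 0..3 leftovers
   (and, in general, an unaligned prologue, elided here). */
void iaxpy_simd(size_t n, int a, const int x[], int y[]) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4)     /* main 4-wide body */
        simd4_iaxpy(a, x + i, y + i);
    for (; i < n; i++)             /* scalar cleanup: duplicate of the loop */
        y[i] += a * x[i];
}
</syntaxhighlight>

Note that the inner-loop algorithm appears twice (once per width), which is exactly the code duplication the text describes.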
Over time, as the ISA evolves to keep increasing performance, ISA architects add 2-wide SIMD, then 4-wide SIMD, then 8-wide and upwards. It can therefore be seen why [[AVX-512]] exists in x86.
Without predication, the wider the SIMD width the worse the problems get, leading to massive opcode proliferation, degraded performance, extra power consumption and unnecessary software complexity.<ref>[https://www.sigarch.org/simd-instructions-considered-harmful/ SIMD considered harmful]</ref>
Vector processors on the other hand are designed to issue computations of variable length for an arbitrary count, n, and thus require very little setup, and no cleanup. Even compared to those SIMD ISAs which have masks (but no {{code|setvl}} instruction), Vector processors produce much more compact code because they do not need to perform explicit mask calculation to cover the last few elements (illustrated below).
==== Predicated SIMD ====
One additional potential complication: some RISC ISAs do not have a "min" instruction, needing instead to use a branch or scalar predicated compare.
It is clear how predicated SIMD at least merits the term "vector capable", because it can cope with variable-length vectors by using predicate masks.
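The tail-handling idea can be sketched in C. The names are invented; real predicated ISAs compute such masks with dedicated instructions (for example SVE2's WHILELT), but the effect is the same: a mask with min(remaining, width) low bits set disables the lanes past the end of the data, so no scalar cleanup loop is needed.

<syntaxhighlight lang=c>
#include <stddef.h>
#include <stdint.h>

/* Mask for a 4-wide predicated operation: one bit per lane, set only
   for lanes that still have data.  E.g. remaining=3 -> 0b0111. */
static uint8_t tail_mask(size_t remaining) {
    size_t active = remaining < 4 ? remaining : 4;
    return (uint8_t)((1u << active) - 1);
}

/* One loop covers every element; the mask, not a second code path,
   handles the final partial batch.  The guarded access models a
   masked load/store: disabled lanes touch no memory. */
void iaxpy_predicated(size_t n, int a, const int x[], int y[]) {
    for (size_t i = 0; i < n; i += 4) {
        uint8_t m = tail_mask(n - i);
        for (int lane = 0; lane < 4; lane++)   /* one masked 4-wide op */
            if (m & (1u << lane))
                y[i + lane] += a * x[i + lane];
    }
}
</syntaxhighlight>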
==== Pure (true) vector ISA ====
For Cray-style vector ISAs such as RVV, an instruction called "{{Not a typo|setvl}}" (set vector length) is used.
On calling {{Not a typo|setvl}} with the number of outstanding data elements to be processed, "{{Not a typo|setvl}}" is permitted (essentially required) to limit that to the Maximum Vector Length (MVL) and thus returns the ''actual'' number that can be processed by the hardware in subsequent vector instructions, and sets the internal special register, "VL", to that same amount.
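The semantics of {{Not a typo|setvl}} and the strip-mined loop it enables can be sketched in C. This is illustrative only: the Maximum Vector Length is assumed to be 8 here, and the function names are invented; the value returned by {{code|setvl}} plays the role of t0 in the assembly.

<syntaxhighlight lang=c>
#include <stddef.h>

enum { MVL = 8 };   /* Maximum Vector Length: a hardware property (assumed 8) */

/* setvl: clamp the count of outstanding elements to MVL and return the
   amount the hardware will actually process, which becomes VL. */
size_t setvl(size_t n) {
    return n < MVL ? n : MVL;
}

/* Strip-mined IAXPY: each trip processes VL elements, so the same
   binary runs unchanged on hardware with any MVL, including MVL = 1. */
void iaxpy_vector(size_t n, int a, const int x[], int y[]) {
    while (n > 0) {
        size_t vl = setvl(n);
        for (size_t i = 0; i < vl; i++)   /* one vector instruction's work */
            y[i] += a * x[i];
        x += vl;  y += vl;  n -= vl;
    }
}
</syntaxhighlight>

Note that a call with n = 0 simply performs no work, matching the observation below that the element count can start at zero.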
Below is the Cray-style vector assembler for the same SIMD style loop, above. Note that t0 (which, containing a convenient copy of VL, can vary) is used instead of hard-coded constants:
<syntaxhighlight lang=gas>
; ...
</syntaxhighlight>
This is essentially not very different from the SIMD version (processes 4 data elements per loop), or from the initial Scalar version (processes just the one). n still contains the number of data elements remaining to be processed, but t0 contains the copy of VL – the number that is ''going'' to be processed in each iteration.
A number of things to note, when comparing against the Predicated SIMD assembly variant:
# Where the SIMD variant hard-coded both the width (4) into the creation of the mask ''and'' in the SIMD width (load32x4 etc.), the vector ISA equivalents have no such limit. This makes vector programs portable, vendor-independent, and future-proof.
# Setting VL effectively ''creates a hidden predicate mask'' that is automatically applied to the vectors.
# Where with predicated SIMD the mask bitlength is limited to that which may be held in a scalar (or special mask) register, vector ISA's mask registers have no such limitation.
Thus it can be seen, very clearly, how vector ISAs reduce the number of instructions.
Also note that, just like the predicated SIMD variant, the pointers to x and y are advanced by t0 times four because they both point to 32-bit data, but that n is decremented by t0 directly.
Not only is it a much more compact program (saving on L1 Cache size), but as previously mentioned, the vector version can issue far more data processing to the ALUs, again saving power because Instruction Decode and Issue can sit idle.
Additionally, the number of elements going into the function can start at zero.
=== Vector reduction example ===
This example starts with an algorithm which involves reduction.
<syntaxhighlight lang=c>
void reduce(size_t n, int a, const int x[]) {
    /* ... */
}
</syntaxhighlight>
==== SIMD reduction ====
This is where the problems start.
<syntaxhighlight lang=gas>
; ...
</syntaxhighlight>
but with 4-wide SIMD being incapable '''by design''' of adding {{code|x[0]+x[1]}} for example, things go rapidly downhill just as they did with the general case of using SIMD for general-purpose IAXPY loops. To sum the four partial results, two-wide SIMD can be used, followed by a single scalar add, to finally produce the answer, but, frequently, the data must be transferred out of dedicated SIMD registers before the last scalar computation can be performed.
Even with a general loop (n not fixed), the only way to use 4-wide SIMD is to assume four separate "streams", each offset by four elements.
Examples can be found online of how to do a "horizontal sum" with [[AVX-512]].<ref>{{Cite web|url=https://stackoverflow.com/questions/52782940/1-to-4-broadcast-and-4-to-1-reduce-in-avx-512|title = Sse - 1-to-4 broadcast and 4-to-1 reduce in AVX-512}}</ref><ref>{{Cite web|url=https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction/35270026#35270026|title = Assembly - Fastest way to do horizontal SSE vector sum (Or other reduction)}}</ref>
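The partial-sums workaround can be sketched in C. {{code|sum_simd4}} is illustrative, not a real intrinsic, and n is assumed to be a multiple of 4, just as in the SIMD example above.

<syntaxhighlight lang=c>
#include <stddef.h>

/* Four per-lane partial sums (the four "streams"), then a final
   horizontal pass - the part a 4-wide SIMD ALU cannot do in one
   operation, since it can never add x[0]+x[1] within a register. */
int sum_simd4(size_t n, const int x[]) {
    int lane[4] = {0, 0, 0, 0};
    for (size_t i = 0; i < n; i += 4)
        for (int l = 0; l < 4; l++)    /* one 4-wide add per trip */
            lane[l] += x[i + l];
    /* horizontal sum: 2-wide step, then a scalar add */
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
</syntaxhighlight>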
The simplicity of the algorithm is stark in comparison to SIMD. Again, just as with the IAXPY example, the algorithm is length-agnostic (even on Embedded implementations where maximum vector length could be only one).
Implementations in hardware may, if they are certain that the right answer will be produced, perform the reduction in parallel.
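One such parallel schedule can be sketched in C: repeatedly fold the top half of the vector into the bottom half, taking a logarithmic number of steps rather than n−1 serial adds. This is only legal when the operation is associative, which is the "certain that the right answer will be produced" caveat; the scratch buffer stands in for a vector register.

<syntaxhighlight lang=c>
#include <stddef.h>

/* Tree (recursive-halving) reduction: within each step every add is
   independent, so hardware may issue them in parallel.  Sketch only;
   assumes 1 <= n <= 64. */
int sum_tree(size_t n, const int x[]) {
    int v[64];
    for (size_t i = 0; i < n; i++)
        v[i] = x[i];
    for (size_t w = n; w > 1; w = (w + 1) / 2)
        for (size_t i = 0; i < w / 2; i++)
            v[i] += v[w - 1 - i];   /* fold top half into bottom half */
    return v[0];
}
</syntaxhighlight>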
This example again highlights a key critical fundamental difference between true vector processors and those SIMD processors, including most commercial GPUs, which are inspired by features of vector processors.
* '''Masked Operations''' – [[Predication (computer architecture)|predicate masks]] allow parallel if/then/else constructs without resorting to branches. This allows code with conditional statements to be vectorized.
* '''Compress and Expand''' – usually using a bit-mask, data is linearly compressed or expanded (redistributed) based on whether bits in the mask are set or clear, whilst always preserving the sequential order and never duplicating values (unlike Gather-Scatter aka permute). These instructions feature in [[AVX-512#Compress and expand|AVX-512]].
* '''Register Gather, Scatter (aka permute)'''<ref>[https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#vector-register-gather-instructions RVV register gather-scatter instructions]</ref> – a less restrictive more generic variation of the compress/expand theme which instead takes one vector to specify the indices to use to "reorder" another vector.
* '''Splat and Extract''' – useful for interaction between scalar and vector, these broadcast a single value across a vector, or extract one item from a vector, respectively.
* '''Iota''' – a very simple and strategically useful instruction which drops sequentially-incrementing immediates into successive elements. Usually starts from zero.
* '''Reduction and [[Iteration#Computing|Iteration]]''' – operations on vectors that reduce to a single scalar result, such as summing all elements or finding the maximum value.
* '''Matrix Multiply support''' – either by way of algorithmically loading data from memory, or reordering (remapping) the normally linear access to vector elements, or providing "Accumulators", arbitrary-sized matrices may be efficiently processed. IBM POWER10 provides MMA instructions<ref>{{Cite web|url=https://m.youtube.com/watch?v=27VRdI2BGWg&t=1260 |archive-url=https://ghostarchive.org/varchive/youtube/20211211/27VRdI2BGWg| archive-date=2021-12-11 |url-status=live|title = IBM's POWER10 Processor - William Starke & Brian W. Thompto, IBM|website = [[YouTube]]}}{{cbignore}}</ref> although for arbitrary Matrix widths that do not fit the exact SIMD size data repetition techniques are needed which is wasteful of register file resources.<ref>{{Cite arXiv <!-- unsupported parameter |url=https://arxiv.org/pdf/2104.03142 --> |eprint=2104.03142 |last1 = Moreira|first1 = José E.|last2 = Barton|first2 = Kit|last3 = Battle|first3 = Steven|last4 = Bergner|first4 = Peter|last5 = Bertran|first5 = Ramon|last6 = Bhat|first6 = Puneeth|last7 = Caldeira|first7 = Pedro|last8 = Edelsohn|first8 = David|last9 = Fossum|first9 = Gordon|last10 = Frey|first10 = Brad|last11 = Ivanovic|first11 = Nemanja|last12 = Kerchner|first12 = Chip|last13 = Lim|first13 = Vincent|last14 = Kapoor|first14 = Shakti|author15 = Tulio Machado Filho|author16 = Silvia Melitta Mueller|last17 = Olsson|first17 = Brett|last18 = Sadasivam|first18 = Satish|last19 = Saleil|first19 = Baptiste|last20 = Schmidt|first20 = Bill|last21 = Srinivasaraghavan|first21 = Rajalakshmi|last22 = Srivatsan|first22 = Shricharan|last23 = Thompto|first23 = Brian|last24 = Wagner|first24 = Andreas|last25 = Wu|first25 = Nelson|title = A matrix math facility for Power ISA(TM) processors|year = 2021|class = cs.AR}}</ref><ref>{{Cite book|chapter-url=https://link.springer.com/chapter/10.1007/978-1-4471-1011-8_8|doi=10.1007/978-1-4471-1011-8_8|chapter=A Modular Massively Parallel Processor for Volumetric Visualisation Processing|title=High Performance Computing for Computer Graphics and Visualisation|year=1996|last1=Krikelis|first1=Anargyros|pages=101–124|isbn=978-3-540-76016-0}}</ref> NVidia provides a high-level Matrix [[CUDA]] API although the internal details are not available.<ref>{{Cite web|url=https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma|title = CUDA C++ Programming Guide}}</ref> The most resource-efficient technique is in-place reordering of access to otherwise linear vector data.
* '''Advanced Math formats''' – often includes [[Galois field]] arithmetic, but can also include [[binary-coded decimal]] or decimal fixed-point, as well as much larger (arbitrary-precision) arithmetic operations, by providing parallel carry-in and carry-out
With many 3D [[shader]] applications needing [[trigonometric]] operations as well as short vectors for common operations (RGB, ARGB, XYZ, XYZW), support for the following is typically present in modern GPUs, in addition to those found in vector processors:
* '''Sub-vectors''' – elements may typically contain two, three or four sub-elements (vec2, vec3, vec4) where any given bit of a predicate mask applies to the whole vec2/3/4, not the elements in the sub-vector. Sub-vectors are also introduced in RISC-V RVV (termed "LMUL").<ref>[https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#mapping-for-lmul-1-2 LMUL > 1 in RVV]</ref> Sub-vectors are a critical, integral part of the [[SPIR-V]] spec.
* '''Sub-vector Swizzle''' – a.k.a. "lane shuffling", which allows sub-vector inter-element computations without needing extra (costly, wasteful) instructions to move the sub-elements into the correct SIMD "lanes", and which also saves predicate mask bits. Effectively this is an in-flight [[permute instruction|mini-permute]] of the sub-vector; it features heavily in 3D shader binaries and is sufficiently important as to be part of the [[OpenGL Shading Language]] specification.
* '''Transcendentals''' – operations such as [[sine]], [[cosine]] and [[logarithm]] obviously feature much more prominently in 3D than in many demanding [[High-performance computing|HPC]] workloads. Of interest, however, is that in 3D speed is far more important than accuracy for GPUs, as computation of pixel coordinates simply does not require high precision; the [[Vulkan (API)|Vulkan]] specification recognises this and sets surprisingly low accuracy requirements, so that GPU hardware can reduce power usage.
=== Fault (or Fail) First ===
Introduced in ARM SVE2 and RISC-V RVV is the concept of speculative sequential Vector Loads.
The basic principle of {{Not a typo|ffirst}} is to attempt a large sequential Vector Load, but to allow the hardware to arbitrarily truncate the ''actual'' amount loaded to either the amount that would succeed without raising a memory fault or simply to an amount (greater than zero) that is most convenient.
Contrast this situation with SIMD, which has a fixed (inflexible) load width and fixed data-processing width and cannot cope with loads that cross page boundaries; even if it could, it cannot adapt to what actually succeeded. Paradoxically, if a SIMD program were to attempt to find out in advance (in each inner loop, every time) what might optimally succeed, those instructions would only hinder performance because they would, by necessity, be part of the critical inner loop.
This begins to hint at why {{Not a typo|ffirst}} is so innovative, and is best illustrated by memcpy or strcpy when implemented with standard 128-bit non-predicated {{Not a typo|non-ffirst}} SIMD.
By contrast, a strncpy routine in hand-optimised RVV assembler takes a mere 22 instructions.<ref>[https://github.com/riscv/riscv-v-spec/blob/master/example/strncpy.s RVV strncpy example]</ref>
A fixed-width SIMD implementation could potentially fault at the end of memory, due to attempting to read too many values; it could also cause significant numbers of page faults or misaligned accesses by similarly crossing over boundaries.
== Performance and speed up ==
Let '''''r''''' be the vector speed ratio and '''''f''''' be the vectorization ratio. If the time taken for the vector unit to add an array of 64 numbers is 10 times faster than its equivalent scalar counterpart, then r = 10. Similarly, if the total number of operations in a program is 100, of which only 10 are scalar (after vectorization), then f = 0.9, i.e., 90% of the work is done by the vector unit. The achievable speedup is then:
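Modelling this as an instance of [[Amdahl's law]], with the fraction ''f'' of the work sped up by a factor of ''r'' and the remainder running at scalar speed:

<math>\text{speedup} = \frac{1}{(1 - f) + \dfrac{f}{r}} = \frac{r}{(1 - f)\,r + f}</math>

For r = 10 and f = 0.9 this gives 10 / 1.9 ≈ 5.3; even with an infinitely fast vector unit (r → ∞), the speedup cannot exceed 1 / (1 − f).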
* [[RISC-V]], an open ISA standard with an associated variable width [[RISC-V#Vector set|vector extension]].
* [[Barrel processor]]
* [[Tensor Processing Unit]]
* [[History of supercomputing]]
* [[Supercomputer architecture]]
== References ==
{{Reflist}}
{{Parallel computing}}