컴퓨터_구조
====== 4.3 SIMD Instruction Set Extensions for Multimedia ======
Figure 4.8 summarizes typical multimedia SIMD instructions.
Like vector instructions, a SIMD instruction specifies the vector operation on vectors of data.
■ Multimedia SIMD usually does not offer the mask registers to support conditional execution of elements as in vector architectures.
These omissions make it harder for the compiler to generate SIMD code and increase the difficulty of programming in SIMD assembly language.
For the x86 architecture, the MMX instructions added in 1996 repurposed the 64-bit floating-point registers, so the basic instructions could perform eight 8-bit operations or four 16-bit operations simultaneously.
These were joined by parallel MAX and MIN operations, a wide variety of masking and conditional instructions, operations usually found in digital signal processors, and ad hoc instructions that were believed to be useful in important media libraries.
Note that MMX reused the floating-point data transfer instructions to access memory.
The Streaming SIMD Extensions (SSE) successor in 1999 added separate registers that were 128 bits wide, so now instructions could simultaneously perform sixteen 8-bit operations, eight 16-bit operations, or four 32-bit operations.
It also performed parallel single-precision floating-point arithmetic.
Since SSE had separate registers, it needed separate data transfer instructions.
Intel soon added double-precision SIMD floating-point data types via SSE2 in 2001, SSE3 in 2004, and SSE4 in 2007.
Instructions with four single-precision floating-point operations or two parallel double-precision operations increased the peak floating-point performance of the x86 computers, as long as programmers place the operands side by side.
With each generation, they also added ad hoc instructions whose aim is to accelerate specific multimedia functions perceived to be important.
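The packing arithmetic above follows directly from the register width: dividing it by the element size gives the number of operations one instruction performs. A minimal sketch (a behavioral illustration, not how the hardware encodes anything):

```python
# Number of element-wise operations one SIMD instruction performs:
# register width (bits) divided by element width (bits).
def simd_lanes(register_bits: int, element_bits: int) -> int:
    return register_bits // element_bits

# 128-bit SSE registers, per the text:
assert simd_lanes(128, 8) == 16    # sixteen 8-bit operations
assert simd_lanes(128, 16) == 8    # eight 16-bit operations
assert simd_lanes(128, 32) == 4    # four 32-bit operations
```

Doubling the register width, as each later extension did, doubles every count in this table at once.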
The Advanced Vector Extensions (AVX), added in 2010, doubles the width of the registers again to 256 bits and thereby offers instructions that double the number of operations on all narrower data types.
Figure 4.9 shows AVX instructions useful for double-precision floating-point computations.
AVX includes preparations to extend the width to 512 bits and 1024 bits in future generations of the architecture.
In general, the goal of these extensions has been to accelerate carefully written libraries rather than for the compiler to generate them (see Appendix H), but recent x86 compilers are trying to generate such code, particularly for floating-point-intensive applications.
Given these weaknesses, why are Multimedia SIMD Extensions so popular?
First, they cost little to add to the standard arithmetic unit and they were easy to implement.
Second, they require little extra state compared to vector architectures, which is always a concern for context switch times.
Third, you need a lot of memory bandwidth to support a vector architecture, which many computers don't have.
Fourth, SIMD does not have to deal with problems in virtual memory when a single instruction that can generate 64 memory accesses can get a page fault in the middle of the vector.
^ AVX Instruction ^ Description ^
| VBROADCASTSD | Broadcast one double-precision operand to four locations in a 256-bit register |
| Figure 4.9 AVX instructions for x86 architecture useful in double-precision floating-point programs. Packed double for 256-bit AVX means four 64-bit operands executed in SIMD mode. As the width increases with AVX, it is increasingly important to add data permutation instructions that allow combinations of narrow operands from different parts of the wide registers. AVX includes instructions that shuffle 32-bit, 64-bit, or 128-bit operands within a 256-bit register. For example, BROADCAST replicates a 64-bit operand 4 times in an AVX register. AVX also includes a large variety of fused multiply-add/subtract instructions. |
SIMD extensions use separate data transfer per SIMD group of operands that are aligned in memory, and so they cannot cross page boundaries.
Another advantage of short, fixed-length "vectors" of SIMD is that it is easy to introduce instructions that can help with new media standards, such as instructions that perform permutations or instructions that consume either fewer or more operands than vectors can produce.
Finally, there was concern about how well vector architectures can work with caches.
More recent vector architectures have addressed all of these problems, but the legacy of past flaws shaped the skeptical attitude toward vectors among architects.
===== Example =====
To give an idea of what multimedia instructions look like, assume we added 256-bit SIMD multimedia instructions to MIPS.
We concentrate on floating-point in this example.
We add the suffix "4D" on instructions that operate on four double-precision operands at once.
Like vector architectures, you can think of a SIMD processor as having lanes, four in this case.
MIPS SIMD will reuse the floating-point registers as operands for 4D instructions, just as double-precision reused single-precision registers in the original MIPS.
This example shows MIPS SIMD code for the DAXPY loop.
Assume that the starting addresses of X and Y are in Rx and Ry, respectively.
Underline the changes to the MIPS code for SIMD.
===== Answer =====
The changes were replacing every MIPS double-precision instruction with its 4D equivalent, increasing the increment from 8 to 32, and changing the registers from F2 and F4 to F4 and F8 to get enough space in the register file for four sequential double-precision operands.
So that each SIMD lane would have its own copy of the scalar a, we copied the value of F0 into registers F1, F2, and F3.
(Real SIMD instruction extensions have an instruction to broadcast a value to all other registers in a group.)
Thus, the multiply does F4*F0, F5*F1, F6*F2, and F7*F3.
While not as dramatic as the 100x reduction of dynamic instruction bandwidth of VMIPS, SIMD MIPS does get a 4x reduction: 149 versus 578 instructions executed for MIPS.
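The effect of widening DAXPY to four lanes can be sketched as a behavioral model in plain Python (not MIPS assembly): each "4D" iteration does the work of four scalar iterations, so the loop body executes a quarter as many times.

```python
def daxpy_scalar(a, x, y):
    """One element per iteration, like the scalar double-precision loop."""
    iterations = 0
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
        iterations += 1
    return iterations

def daxpy_4d(a, x, y):
    """Four elements per iteration, modeling the hypothetical MIPS SIMD
    '4D' instructions. Assumes len(x) is a multiple of the lane count."""
    iterations = 0
    for i in range(0, len(x), 4):
        for lane in range(4):   # one 4D instruction covers 4 lanes
            y[i + lane] = a * x[i + lane] + y[i + lane]
        iterations += 1
    return iterations

n = 64
x = [float(i) for i in range(n)]
y1 = [1.0] * n
y2 = [1.0] * n
s = daxpy_scalar(2.0, x, y1)
v = daxpy_4d(2.0, x, y2)
assert y1 == y2     # identical results
assert s == 4 * v   # 4x fewer loop iterations, as in the text
```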
====== Programming Multimedia SIMD Architectures ======
Given the ad hoc nature of the SIMD multimedia extensions, the easiest way to use these instructions has been through libraries or by writing in assembly language.
Recent extensions have become more regular, giving the compiler a more reasonable target.
By borrowing techniques from vectorizing compilers, compilers are starting to produce SIMD instructions automatically.
For example, advanced compilers today can generate SIMD floating-point instructions to deliver much higher performance for scientific codes.
However, programmers must be sure to align all the data in memory to the width of the SIMD unit on which the code is run to prevent the compiler from generating scalar instructions for otherwise vectorizable code.
====== The Roofline Visual Performance Model ======
One visual, intuitive way to compare potential floating-point performance of variations of SIMD architectures is the Roofline model [Williams et al. 2009].
^ ^
| Figure 4.10 Arithmetic intensity, specified as the number of floating-point operations to run a program divided by the number of bytes accessed in main memory [Williams et al. 2009]. Some kernels have an arithmetic intensity that scales with problem size, such as dense matrix, but there are many kernels with arithmetic intensities independent of problem size. |
It ties together floating-point performance, memory performance, and arithmetic intensity in a two-dimensional graph.
Arithmetic intensity is the ratio of floating-point operations per byte of memory accessed.
It can be calculated by taking the total number of floating-point operations for a program divided by the total number of data bytes transferred to main memory during program execution.
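That ratio is straightforward to compute; a minimal sketch with made-up counts (not measurements of any real kernel):

```python
def arithmetic_intensity(total_flops: float, total_bytes: float) -> float:
    """FLOPs per byte of main-memory traffic."""
    return total_flops / total_bytes

# Hypothetical kernel: 1e9 floating-point operations while moving
# 4e9 bytes to and from main memory.
ai = arithmetic_intensity(1e9, 4e9)
assert ai == 0.25   # 1/4 FLOP per byte
```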
Figure 4.10 shows the relative arithmetic intensity of several example kernels.
Peak floating-point performance can be found using the hardware specifications.
Many of the kernels in this case study do not fit in on-chip caches, so peak memory performance is defined by the memory system behind the caches.
Note that we need the peak memory bandwidth that is available to the processors, not just at the DRAM pins as in Figure 4.27 on page 325.
One way to find the (delivered) peak memory performance is to run the Stream benchmark.
Figure 4.11 shows the Roofline model for the NEC SX-9 vector processor on the left and the Intel Core i7 920 multicore computer on the right.
The vertical Y-axis is achievable floating-point performance from 2 to 256 GFLOP/sec.
The horizontal X-axis is arithmetic intensity, varying from 1/8 FLOP/DRAM byte accessed to 16 FLOP/DRAM byte accessed in both graphs.
Note that the graph is a log-log scale, and that Rooflines are done just once for a computer.
For a given kernel, we can find a point on the X-axis based on its arithmetic intensity.
If we drew a vertical line through that point, the performance of the kernel on that computer must lie somewhere along that line.
We can plot a horizontal line showing peak floating-point performance of the computer.
Obviously, the actual floating-point performance can be no higher than the horizontal line, since that is a hardware limit.
How could we plot the peak memory performance?
Since the X-axis is FLOP/byte and the Y-axis is FLOP/sec, bytes/sec is just a diagonal line at a 45-degree angle in this figure.
Hence, we can plot a third line that gives the maximum floating-point performance that the memory system of that computer can support for a given arithmetic intensity.
We can express the limits as a formula to plot these lines in the graphs in Figure 4.11:
Attainable GFLOPs/sec = Min(Peak Memory BW x Arithmetic Intensity, Peak Floating-Point Perf.)
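The Min() above translates directly into code. A minimal sketch, using illustrative values rather than any specific machine's specifications:

```python
def attainable_gflops(peak_mem_bw, arithmetic_intensity, peak_fp):
    # Attainable GFLOPs/sec =
    #   Min(Peak Memory BW x Arithmetic Intensity, Peak Floating-Point Perf.)
    # peak_mem_bw in GB/sec, peak_fp in GFLOP/sec (illustrative numbers).
    return min(peak_mem_bw * arithmetic_intensity, peak_fp)

# Below the ridge point the diagonal (memory) roof is the limit...
assert attainable_gflops(16.0, 0.25, 40.0) == 4.0
# ...above it, the flat (compute) roof is.
assert attainable_gflops(16.0, 8.0, 40.0) == 40.0
```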
The horizontal and diagonal lines give this simple model its name and indicate its value.
- | The " | + | The " |
- | " | + | " |
If we think of arithmetic intensity as a pole that hits the roof, either it hits the flat part of the roof, which means performance is computationally limited, or it hits the slanted part of the roof, which means performance is ultimately limited by memory bandwidth.
In Figure 4.11, the vertical dashed line on the right (arithmetic intensity of 4) is an example of the former and the vertical dashed line on the left (arithmetic intensity of 1/4) is an example of the latter.
Given a Roofline model of a computer, you can apply it repeatedly, since it doesn't vary by kernel.
Note that the "ridge point," where the diagonal and horizontal roofs meet, offers an interesting insight into the computer.
If it is far to the right, then only kernels with very high arithmetic intensity can achieve the maximum performance of that computer.
If it is far to the left, then almost any kernel can potentially hit the maximum performance.
As we shall see, this vector processor has both much higher memory bandwidth and a ridge point far to the left when compared to other SIMD processors.
Figure 4.11 shows that the peak computational performance of the SX-9 is 2.4x faster than Core i7, but the memory performance is 10x faster.
For programs with an arithmetic intensity of 0.25, the SX-9 is 10x faster (40.5 versus 4.1 GFLOP/sec).
The higher memory bandwidth moves the ridge point from 2.6 in the Core i7 to 0.6 on the SX-9, which means many more programs can reach peak computational performance on the vector processor.
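The ridge point is simply where the two roofs cross: the arithmetic intensity at which Peak Memory BW x Intensity equals Peak Floating-Point Perf. A sketch using the figures quoted in the text (40.5 and 4.1 GFLOP/sec at intensity 0.25 imply bandwidths of roughly 162 and 16.4 GB/sec; the peak rates below are approximate values assumed for illustration):

```python
def ridge_point(peak_fp_gflops, peak_bw_gbytes):
    """Arithmetic intensity where the diagonal and horizontal roofs meet."""
    return peak_fp_gflops / peak_bw_gbytes

# Approximate peaks (assumed): Core i7 920 ~42.66 GFLOP/s at ~16.4 GB/s,
# NEC SX-9 ~102.4 GFLOP/s at ~162 GB/s.
assert round(ridge_point(42.66, 16.4), 1) == 2.6   # Core i7
assert round(ridge_point(102.4, 162.0), 1) == 0.6  # SX-9
```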