A native tensor-vector multiplication algorithm for high performance computing

Tensor computations are important mathematical operations for applications that rely on multidimensional data. The tensor-vector multiplication (TVM) is the most memory-bound tensor contraction in this class of operations. This paper proposes an open-source TVM algorithm which is much simpler and more efficient than previous approaches, making it suitable for integration into the most popular BLAS libraries available today. Our algorithm has been written from scratch and features unit-stride memory accesses, cache awareness, mode obliviousness, full vectorization and multi-threading, as well as NUMA awareness for non-hierarchically stored dense tensors. Numerical experiments are carried out on tensors up to order 10 with various compilers and hardware architectures equipped with traditional DDR and high-bandwidth memory (HBM). For large tensors the average performance of the TVM ranges between 62% and 76% of the theoretical bandwidth for NUMA systems with DDR memory and remains independent of the contraction mode. On NUMA systems with HBM the TVM exhibits some mode dependency but reaches performance figures close to peak values. Finally, the higher-order power method is benchmarked with the proposed TVM kernel and delivers on average between 58% and 69% of the theoretical bandwidth for large tensors.


INTRODUCTION
The numerical application of multilinear algebra is present in a wide variety of scientific domains such as data mining and analysis [1], [2]. It is based on the notion of tensors (or multidimensional arrays), their properties and the operations defined on them. The tensor contraction is a fundamental operation since it is ubiquitous in many algorithms [3]. Examples of tensor contraction operations include the tensor-tensor multiplication (TTM), the tensor-matrix multiplication (TMM) and the tensor-vector multiplication (TVM). The TTM and TMM algorithms can be generalized to BLAS level 3 functions while the TVM can be classified as a BLAS level 2 kernel. This paper focuses on the TVM algorithm applied to dense tensors as well as its implementation within the higher-order power method (HOPM) algorithm [4], which relies upon a series of TVM operations. The TVM involves a contraction over a unique mode of the tensor, hence turning out to be the most memory-bound numerical kernel among the three core tensor operations mentioned above. The TTM/TMM is analogous to the matrix-matrix multiplication (MMM) and the TVM to the matrix-vector multiplication (MVM). The MMM and MVM are widely known by the function names gemm and gemv, respectively, where the former is compute-bound and the latter is memory-bound. For a TVM contraction over mode k the arithmetic intensity (AI) is 2n/(n + n/n_k + n_k) FLOPs per item, n and n_k being the total number of elements of the involved tensor and vector, respectively. This results in an AI bounded between 1 and 2, corresponding to the most extreme cases n ≫ n_k and n = n_k^2, respectively. For this reason, the throughput of a TVM execution is often measured in terms of bandwidth (e.g. GB/s). Another aspect worth considering in tensor computations, although it also applies to matrix computations to a much lesser extent, is mode obliviousness. This property guarantees that a tensor operation, e.g. a TVM, yields roughly similar performance independently of the contraction mode it is applied to.
It can be readily seen that both performance and mode obliviousness are crucial for efficient tensor algorithms. In the case of the TVM, previous studies focused on sequential execution have demonstrated that even the most widespread "loop" and "unfold" TVM implementations are mode-dependent and deliver poor performance overall [5]. This is due to how tensors are stored in system memory. While matrices are commonly stored following a row-major or column-major ordering, a tensor of order d can be stored as a multidimensional array using d! different combinations although, in practice, only the first-order or last-order layouts are considered.
Another important observation with regard to the TVM efficiency is that, regardless of the tensor ordering, researchers do express the TVM algorithm as a series of calls to gemm [6] or gemv [7], [8] kernels over contiguous matrices that represent the tensor. In multi-threaded scenarios this approach is prone to load imbalance as it may not be possible to evenly distribute these matrices among all the available threads [9]. This constitutes a potential source of performance loss. The present study contributes to enhancing the TVM performance by:
• Designing from scratch an open-source algorithm exclusive to the TVM that is different from previous implementations based on external BLAS routines.
• Applying specific optimizations to this algorithm to enable full vectorization, mode obliviousness, multithreading and NUMA awareness.
• Performing exhaustive benchmarks on several architectures using different compilers to compare its parallel performance against the theoretical bandwidth values reported by the manufacturers.
The remainder of this paper is organized as follows. Sections 2 and 3 present the background and related work, respectively. The proposed TVM algorithm is introduced in Section 4 together with its integration into the HOPM benchmark. Section 5 evaluates in detail the performance of the TVM and HOPM algorithms and Section 6 presents the main conclusions and future work.

BACKGROUND
This section has been made intentionally concise and the reader is referred to Kolda & Bader [3] for a more detailed description of tensors and their operations. In this work scalars, vectors, matrices and tensors follow the notation x, x, X and X, respectively. The d-order dense tensors considered herein can be interpreted as d-dimensional arrays which are stored in system memory according to the last-order storage format layout. This decision is determined by the C++ programming language: we adopt the row-major order for storing multi-dimensional arrays and start counting or indexing elements from zero.

Consider a tensor A ∈ R^(n_0×n_1×...×n_(d−1)) composed of d modes, that is a d-order tensor, and a vector x ∈ R^(n_k) of n_k elements with 0 ≤ k ≤ d − 1. The k-mode tensor-vector multiplication is defined as Y = A ×_k x and results in another tensor Y ∈ R^(n_0×...×n_(k−1)×1×n_(k+1)×...×n_(d−1)) composed of d − 1 modes and whose kth mode is of size unity. Let n = ∏_(i=0)^(d−1) n_i, l = n/n_k and A_(l×n_k) (and its transpose A_(n_k×l)) be the matricization of A which reinterprets that tensor as an l×n_k matrix (an n_k×l matrix, respectively). A particularity of the matricized form of a tensor is that its elements do not need to be ordered consecutively (see for instance Fig. 1). A TVM algorithm can be thought of as one or more instances of a matrix-vector multiplication on contiguous regions of the matricized form of a tensor. Since tensors are stored following a last-order layout, left-hand sided vector-matrix multiplications (VMMs, Y = x × A_(n_k×l)) are necessary for modes k < d − 1 while a single matrix-vector multiplication (MVM, Y = A_(l×n_k) × x) is required for the last contraction mode k = d − 1.
We refer to getvm_loop and getvm_unfold to designate two common algorithms used to compute the TVM; see Pawłowski et al. [5] for details. The main difference between the two is that the former applies a VMM on consecutive portions of the matricized tensor A_(n_k×l) while the latter reorders (i.e. unfolds) the tensor in memory to guarantee that all data is contiguous in system memory before carrying out a single VMM over the entire matrix. However, such unfolding incurs an overhead over the looped algorithm and hence we discard it in this study. From a high performance computing perspective, a parallel implementation of the loop TVM imposes two nested levels: (i) the outer level corresponds to a loop over the aforementioned consecutive parts of the matricized tensor and (ii) the inner level corresponds to the VMM algorithm itself. In order to exploit nested parallelism, researchers employ OpenMP to parallelize the outer loop and rely on parallel BLAS level 2 routines available in heavily optimized libraries [9]. Nested parallelism helps to increase the overall performance of the TVM loop algorithm but it also has caveats. For instance, one must depend on two different parallel strategies due to nesting, making the algorithm prone to load-balancing issues.

RELATED WORK
The literature is populated with examples of TVM implementations that simply consist of interfaces to either BLAS level 3 [10] or level 2 kernels [5], [11]. Such implementations largely benefit from libraries like Intel MKL [12], LIBXSMM [13] and also BLIS [14] which offer heavily optimized BLAS kernels that are crucial for high performance.
As explained in the previous section, it is fairly common for some TVM implementations to resort to tensor unfolding operations with their subsequent overheads while others employ a looped algorithm to operate on parts of the tensor so that no unfolding is necessary. All things considered, the aforementioned TVM algorithms have been mainly motivated by performance gains and cache-oblivious behaviour. Pawłowski et al. [5] were the first authors to assess mode obliviousness for the TVM algorithm by storing tensors following a Morton-order layout.
On the other hand, there has been extensive work put into libraries that deal natively with tensors. For example, TACO is a library for performing tensor algebra computations that automatically generates efficient code [15]. However, the TVM contraction defined therein can sometimes lack performance by not reverting to optimized BLAS level 2 kernels within its generated codes [5]. More recent libraries such as TBLIS [16] apply the BLIS philosophy to generate efficient tensor algebraic reductions, thus eliminating the possible storage and performance overheads of resorting to external BLAS routines. Similarly, the authors of the framework GETT [6] propose a tensor-contraction generator to systematically reduce a tensor contraction to loops around a highly tuned matrix-matrix multiplication kernel (gemm) that delivers great single-threaded performance for single and double precision computations. Motivated by the fact that level-3 tensor contractions based on gemm do not perform equally well for tensor-vector multiplications, Bassoy [11] suggests a series of contraction algorithms that either directly call gemv kernels or recursively apply gemv on tensor slices multiple times. It is worth noting that such algorithms depend on an external BLAS level 2 routine provided by the OpenBLAS library.
To the best of our knowledge there are no bibliographic works proposing a TVM algorithm independent of third-party BLAS libraries featuring mode obliviousness, multi-threading and NUMA awareness for non-hierarchically ordered dense tensors. In this regard the present study not only seeks to compare the TVM performance against the theoretical bandwidth of different hardware architectures, it can also be seen as a software tool to be used by compiler designers to optimize their generated binaries.

Fig. 1: Memory addresses following a last-order layout for the matricized form A_(5×39) derived from the 1-mode TVM applied to the third-order tensor A_(13×5×3). The execution pattern of getvm shown here arises from parallelizing the entire TVM operation over columns (n) using three threads.

ALGORITHM DESCRIPTION
We proceed by describing our algorithm on a single NUMA node, extending it to multiple NUMA nodes and, finally, integrating it into the HOPM benchmark.
Contrary to previous approaches, a native algorithm for the k-mode TVM expresses the operation on subsets of both the input and output tensors. This is analogous to applying the BLAS level-2 routine gemv on portions of the input matrix and output vector. Based on this premise we propose a TVM native function getvm whose definition adopts the following schematic form in C++ syntax:

template<class objT>
void getvm(intT layout, intT trans, intT m, intT n,
           objT alpha, objT *a, intT lda, objT *x, intT incx,
           objT *y0, objT beta, objT *y, intT incy);

Such a function shares the same list of arguments as gemv, plus an additional term y0 which is a pointer to the very first element of the output tensor Y. The variable a refers to the input tensor A and x corresponds to the input vector x. The primitive data type intT represents an integer of any size while objT can be either an integer or a floating-point number of any given precision (for the sake of briefness, only floating-point tensors are considered in this work). This native definition brings us the opportunity to build our own VMM and MVM kernels from scratch, i.e. vecMat and matVec, in order to exploit novel optimization techniques for further single-threaded performance.

To illustrate the proposed algorithm, consider the following example of a 1-mode TVM on a third-order dense tensor A_(13×5×3) of doubles or 8-byte integers. Its matricized form has 5 rows and 39 columns whose layout in system memory is partially represented in Fig. 1. Recall that the tensor is stored in a generalized row-major format and thus Fig. 1 shows a logical view of A under a 1-mode TVM. Assume that the TVM is only to be applied to a subset of A; for instance, it is executed over the first 13 columns of that matrix. Figure 1 represents indeed that subset composed of 5 × 13 elements. The corresponding portion of Y, not shown here for the sake of conciseness, contains the first 13 elements of that tensor. Regardless of the size of the subset, each one of them is composed of up to three clearly differentiated zones: (i) a left zone marked in Fig. 1 with a pattern filled with packed circles, (ii) a certain number of central zones represented with a checkerboard pattern and (iii) a right zone marked with a wavy pattern.
Pointers y and y0 are employed to index the zones of the tensor, which permits our single-threaded algorithm getvm to traverse all of them, one by one and from left to right, performing a vector-matrix multiplication in this case. While doing so, our vecMat routine exploits vectorization with the aid of OpenMP directives as well as unroll & jam techniques. The black (e.g. 128-bit SIMD/SVE) and gray (e.g. unroll=4) boxes represent the memory touched by these two techniques on each zone during the first vector iteration. An important optimization for the TVM native algorithm arises from adding a second unroll to the central zones (e.g. unrollz=3, light gray boxes). Should all the elements touched along this third dimension remain within the L1 CPU cache, this technique economizes function calls, which directly translates into greater performance, especially for higher-order tensors and large contraction modes. In this regard, the proposed function vecMatz can offer a competitive advantage over the conventional gemv routine for all internal modes 0 < k < d − 1.
For the first mode a single VMM is carried out via vecMat and for the last mode a single MVM is executed via matVec. The matrix-vector multiplication algorithm shares similar vectorization and unroll & jam optimizations. Furthermore, both algorithms benefit from C++ template arguments to generate, at compile time, versioned functions which deal with remainder loops issued from unrolling (as seen in Fig. 1), thus mitigating any performance degradation in these particular cases.
It is worth mentioning that our TVM library already provides optimized getvm versions for both aligned and unaligned memory accesses and the two most common usages, that is α = 1 and β = 0 or β = 1, while it also supports the general expression Y := αA ×_k x + βY which can access the involved arrays with non-unit strides (incx, incy). At this moment, the library only supports dense tensors stored with last- or row-major layout (layout=0) but extending it to first- or column-major ordering should be straightforward. Lastly, changing the floating-point precision can be simply done by redefining the primitive object floatT (e.g. via a preprocessor directive). The number of elements that fit within the vector length is automatically adjusted to guarantee fully vectorized code via OpenMP directives and the same applies to the memory alignment. This open-source TVM library (Version 1.0; Martinez-Ferrer [17]) is released under the GPLv3 license.
The multi-threaded implementation of getvm is pretty straightforward. Using the previous example as a reference, the matricized form of the input tensor A_(5×39) can be viewed as a collection of three subsets, each one composed of 13 columns. One can assign a unique thread to each portion of A in order to carry out the three TVM executions simultaneously since they constitute an embarrassingly parallel operation. Code 1 shows the multi-threaded implementation of the getvm algorithm via the OpenMP "parallel for" clause for both the matrix-vector (trans=0) and vector-matrix (trans=1) multiplication variants. The MVM case is covered in lines 8-10 and is similar to a gemv kernel with the extra argument y at the beginning of line 10. For the VMM operation one needs to set the appropriate offset to the input tensor a (line 21).

Code 1: Multi-threaded implementation of the native algorithm getvm by means of OpenMP "parallel for" pragma directives for the matrix-vector and vector-matrix multiplication variants.

 1  assert(layout == iZero); // Ensure row-major layout: last-order
 2  if (trans == iZero) {    // Matrix-vector (non-transposed vector)
 3    const intT bsM = splitInterval(m, nbA, unroll);
 4
 5    #pragma omp parallel for firstprivate(x, a, y)
 6    for (intT i = iZero; i < m; i += bsM) {
 7      const intT bsMA = MIN(bsM, m - i);
 8      getvm(layout, trans, bsMA, n,
 9            alpha, a + i * lda, lda, x, incx,
10            y, beta, y + i, incy);
11    }
12  } else {
13
14    const intT bsN = splitInterval2(n, nbA, simdVMsize/objTsize),
15               zsize = m * lda;
16
17    #pragma omp parallel for firstprivate(x, a, y)
18    for (intT j = iZero; j < n; j += bsN) {
19      const intT bsNA = MIN(bsN, n - j);
20      getvm(layout, trans, m, bsNA,
21            alpha, a + zsize * (j/lda) + j % lda, lda, x, incx,
22            y, beta, y + j, incy);
23    }
24  }
The functions splitInterval attempt to divide the total number of rows (m) or columns (n) by the number of available threads (nbA) so that the result is a multiple of the unroll factor or the vector length, respectively, in order to maximize the algorithm performance. Critically, the multi-threaded implementation of the native algorithm presents a flat loop structure in contrast to the TVM loop algorithm which relies on 2-level nested parallelism [5]. This ultimately renders getvm easier to implement, easier to execute in parallel and more robust vis-à-vis load imbalance.

NUMA awareness support for TVM
The algorithm described in Code 1 is executed in parallel in a shared memory system via a single OpenMP parallel for loop. When dealing with NUMA architectures one can expect performance penalties if inter-NUMA communication takes place during the TVM operation [9]. The amount of communication is closely related to (i) the contraction mode of the TVM and (ii) how the system allocates the arrays a and y given by their first-touch policy, taking into account that memory is allocated on the node belonging to the thread that first accesses a memory page [18]. The first contraction mode is especially critical because, if the array a is uniformly allocated across all NUMA nodes, applying getvm on different partitions of a would yield a significant amount of inter-NUMA communication and therefore the algorithm will not scale [9]. For the remaining contraction modes, communication reduces so dramatically that the impact on performance is negligible. In this study we propose two different strategies to render our TVM algorithm NUMA-aware.

Fig. 2: The two disjointed subtensors (up and down) are allocated on NUMA nodes 0 and 1, respectively. This particular matricized form corresponds to the 1-mode TVM.

The first method is the most straightforward, it remains transparent or implicit to the user and also yields better scalability results. By means of the function mmap we align all buffers to the system page size, which is typically 2 MiB if transparent huge pages (THP) are active. Next, we carry out a first-touch policy for the actual allocation of buffers which modestly consists of initializing them to zero by following the exact same memory access pattern involved in the actual TVM operation (see Fig. 1). This first touch is done in parallel using the same hardware resources that are exploited later on by the k-mode TVM, thus removing inter-NUMA communication on a for a given value of k.
This method does not modify the original memory layout of the tensor and is supported by the multi-threaded implementation of getvm shown in Code 1.
The second approach explicitly splits or disjoints the original tensor across different NUMA nodes. This modifies the original memory layout in order to carry out the computation in parallel without NUMA interference and is partially based on earlier work [9]. Its implementation in a global address space is far more complex than the first approach. What is more, we only consider one-dimensional splitting: that is, tensors are disjointed along one of their modes. Depending on the contraction mode, tensors are disjointed on their last mode for k < d − 1 and on their first mode for k = d − 1. To better illustrate this, consider disjointing the input tensor from the previous 1-mode TVM (Section 4.1) across two NUMA nodes. Such splitting yields two subtensors, A_0 and A_1, whose matricized form in system memory is represented in Fig. 2. To achieve this, the first 130 elements of buffer a need to be first-touched by threads belonging to the first NUMA node while the remaining elements are touched by threads of the other node. The same decomposition is carried out on buffer y. Note also that there is only one single buffer per tensor independently of the number of NUMA partitions. Translating between such an SPMD-like disjointed view and a standard last-order layout is by no means trivial. For example, the third column of A in Fig. 1 (addresses 2, 5, 8, 11, 14) corresponds to the first column of A_1 located at node 1 in Fig. 2 (addresses 130, 131, 132, 133, 134). Furthermore, the load imbalance this method can incur has, in our experience, often resulted in subpar performance with respect to the first NUMA-aware strategy presented. Lastly, we note that both strategies presented achieve the asymptotically optimal data movement bound for a sequence of (0, 1, . . . , d−1)-mode TVMs derived by Pawłowski et al. [9].
The multi-threaded function getvmd adds support for disjointed tensors. It accepts the same arguments as getvm although variables m, n and lda are no longer primitive objects; instead, they are pointers to lists of integers with a size equal to the number of NUMA partitions. This multi-threaded function relies on two-level nested parallelism in which the first level corresponds to an outer loop along NUMA nodes. Here we rely on OpenMP to specify the correct affinity in order to ensure that each iteration of the loop is assigned to a distinct node. At the same time, each iteration of this outer loop contains a call to the regular getvm function which itself utilizes all the hardware resources available on that particular NUMA partition. This section concludes with a brief mention of the role of the input vector x in achieving NUMA awareness. In practice, this variable is orders of magnitude smaller than the tensors involved in the TVM operation and, for this reason, its influence on performance is negligible. In this work, the array x is uniformly distributed among all nodes and thus it is subject to inter-NUMA communication. Nevertheless, the communication of buffer x remains minimal, due in part to its efficient reuse in CPU cache, which prevents us from measuring any performance penalty.

TVM integration in the HOPM benchmark
For the sake of completeness, this work also contemplates the higher-order power method (HOPM) [4], which is employed to find the best rank-one approximation of a given tensor. Henceforth it is utilized as a benchmark since, given a tensor A of order d, the algorithm performs d(d − 1) TVM operations across all tensor modes of A. Therefore, the HOPM can be seen as an excellent tool to measure the overall performance of new TVM implementations.
Our sequential implementation requires up to two additional buffers to store intermediate tensors resulting from TVM operations. It also contains two in-house optimized functions dot and axpby that normalize the final output vectors. Since the first-touch based NUMA-aware algorithm outperforms the one using disjointed tensors, we achieve a NUMA-aware HOPM by means of the former approach.
If there are no memory constraints, one can allocate a third buffer to store a copy of the input tensor A with a different touch policy in relation to NUMA locality. The original tensor stored in buffer a is first-touched following the (d − 1)-mode TVM memory access pattern and is later on employed in the HOPM for the 1-mode TVM over A. At the same time, another copy of the original tensor is stored in buffer aIT, which has been previously first-touched according to the 0-mode TVM memory access pattern, and is employed in every 0-mode TVM over A taking place during the HOPM operation. The two additional buffers used to store intermediate tensors do follow the same layout as the original buffer a since it yields minimal inter-NUMA communication for all modes except the first one, in which case the buffer aIT is used. This HOPM implementation based on three temporary buffers maximizes performance at the expense of doubling the memory requirements for storing the input tensor A. In NUMA systems with limited memory resources, as can be the case for systems equipped with small high-bandwidth memories, it might be worth considering one single buffer for the input tensor with an inverse first-touch policy. In such scenarios, any performance degradation due to NUMA effects will be limited to the 1-mode TVM over A, which occurs every d(d − 1) TVM operations, that is, once per HOPM external iteration.

Table 1 shows the tensors used in this work. These are hypersquare, dense tensors filled with 8-byte floating-point numbers, which is typical of studies of this kind [5], [9]. Taking the Intel Xeon Platinum 8160 CPU as a reference, two families of small and large tensors are considered. On the one hand, small tensors are selected so that all the associated working buffers fit within the 33 MiB of L3 cache memory.
On the other hand, large tensors are constrained to occupy less than 8 GB so that all buffers used during a TVM or HOPM operation can be comfortably allocated in the 16 GB of high bandwidth memory (HBM) installed in the KNL. The length or size of each tensor dimension has been intentionally chosen to prevent it from being a multiple of the vector length (e.g. 8 doubles for a 512-bit vector). The dimension size corresponding to the fifth-order L3 tensor is rather motivated by its memory footprint, which decreases monotonically with the tensor order. The tenth-order L3 tensor does not follow this rule because the minimum size per dimension is kept at 4. This particular choice of dimension sizes presented in Table 1 is intended to maximize unaligned memory accesses. Similarly, the fact that large tensors are kept relatively small so as to fit in HBM, compared to other studies dealing with tensors of hundreds of gigabytes [9], means that the performance figures presented in this work correspond again to unfavourable scenarios. In this respect, much larger tensors with SIMD/SVE-friendly dimension sizes will certainly yield better performance figures.

PERFORMANCE EVALUATION
With regard to compilation flags and taking the GCC compiler as a reference, -march=native is used to enable CPU-specific optimizations, -Ofast combined with -mprefer-vector-width=512 ensures that full 512-bit vector instructions are generated (except in the case of AMD, which remains a 256-bit vector architecture) and -fopenmp enables the OpenMP pragma directives used for both vectorization and parallelization. Other code adjustments are made possible via preprocessor directives: L1 cache memory (e.g. -DL1C=32768, except for ARM which is 64 KiB), transparent huge pages memory (e.g. -DTHP=2097152), unroll factor (fixed at -DUNROLL=8 at all times) and default vector length (-DSIMD=512, except for AMD which reduces to 256). Each benchmark runs a particular TVM or HOPM kernel during 5 seconds, which is enough to generate multiple kernel calls (between 10^2 and 10^6 instances) and extract statistical figures.

L3 and DDR bandwidths
The first experiment conducted in this section is the popular STREAM benchmark. We utilize McCalpin's source code [19] and an in-house implementation written in C++ following the same optimization techniques included in our TVM algorithm. Contrary to McCalpin's, the size of the buffers is not known at compile time in the in-house implementation. Both benchmarks are executed using all the 24 cores of the 8160 CPU and, for the sake of conciseness, only the triad function z := x + ky is considered. Figure 3 reports on the average memory bandwidth for increasing values of the touched memory (the sum of buffers x, y and z) with various compilers. The L3- and DDR-bound scenarios are of great importance for the benchmarks that will be shown later. As expected, the peak bandwidth value above 1000 GB/s is reached at about half the L3 cache capacity. For very large buffers, all curves except one converge to an average bandwidth of about 80 GB/s. Indeed, the Intel compiler takes advantage of knowing the buffer size at compile time in McCalpin's static code and hence uses streaming stores on the output buffer z. These stores bypass the CPU cache in order to write directly into system memory, achieving 104.5 GB/s of bandwidth, a value that is closer to the theoretical 128 GB/s that the manufacturer reports for a single socket of MareNostrum 4 with two Intel CPUs.
We discuss other aspects of Fig. 3. First, the Intel compiler seems a victim of its own success since it causes a premature performance drop for L3-bound buffers due to an early use of streaming stores. Second, GCC falls behind the other two compilers in terms of performance for L3-bound scenarios and smaller, especially when the buffer size is unknown at compile time. This leads us to believe that either the binaries generated by GCC or its OpenMP runtime are definitively not optimal (the Intel and Clang compilers both use the same OpenMP runtime, KMP, while GCC has its own implementation). As a result, when the memory bottleneck moves from system RAM to CPU cache, it manifests in the form of overhead.

Table 2 shows the average bandwidth for the L3 and DDR tensors of Table 1. For a given tensor of order d, this is in fact the average TVM performance over all contraction modes k ∈ {0, 1, . . . , d − 1}. The standard deviation over these modes is also specified in percentage terms. Best results correspond to greater average performance values (bold characters) and smaller deviation rates (in italics). Finally, the arithmetic mean of these two quantities over all tensors is reported on the last row in order to give a better idea of the average TVM performance for different compilers.

In an L3-bound scenario, Table 2(a), Intel reports the best results, followed closely by Clang which is about 3% slower when looking at the arithmetic mean. At the same time, GCC is 30% slower than Intel. This performance regression affecting GCC coincides with the observations made on Fig. 3. Taking the Intel results as a reference (although the same applies to the other compilers), the average TVM performance across modes varies significantly with the tensor order, which is expected of such tiny tensors. The average bandwidth per tensor is 812.3 GB/s and, although it may be deemed small in comparison to the peak value of 1165.4 GB/s measured in the STREAM benchmark with the same compiler, the tensors used here are indeed many times smaller than the L3 cache. Under these circumstances the parallel performance decreases with the touched memory, see Fig. 3, which ultimately explains that difference in performance. Lastly, the standard deviation rates reflect strong variations (of up to 45.8%) across modes, which is also expected of such small tensors. It is worth noting that it is mandatory to limit the vector length to 256 bits on this processor in order to get competitive L3 results over the last contraction mode k = d − 1. This is done via the preprocessor directive -DSIMDMV=256, which only adjusts the vector length of the kernels inside the function matVec.

When considering DDR-bound scenarios, the TVM native algorithm is now compared against an in-house TVM loop algorithm (getvm_loop) that relies on the Intel MKL optimized kernel gemv. In order to maximize its parallel performance, this looped implementation combines the parallelism already integrated in the MKL function with an OpenMP "parallel for" directive at the outer loop over contiguous matrices.
With the aim of keeping this part concise, only the arithmetic mean figures of the looped algorithm are reported here: 576.9 GB/s (43.2%) and 95.4 GB/s (17.4%) for the L3 and DDR tensors, respectively, using the same notation as Table 2. From these figures it can be inferred that the native algorithm is between 1.41× and 1.03× faster than its looped counterpart, on average, and also reduces the standard deviation rates by about half. A closer inspection of each contraction mode, not shown here for the sake of conciseness, reveals that the looped implementation remains mode-dependent in all scenarios and quickly begins losing performance with increasing tensor orders and contraction modes. This is clearly depicted in Fig. 4 for the tenth-order tensor with 7^10 elements of Table 1. The axis of abscissae represents the core count in order to evidence any scalability issues present at each k-mode TVM. The results speak for themselves and demonstrate once more the mode-obliviousness properties of the native algorithm when applied to dense tensors stored following a non-hierarchical, last-order layout.

HOPM performance
The last benchmark to be run is the higher-order power method, which is another candidate for measuring the ultimate performance of a given TVM implementation. Table 3 shows the results corresponding to the HOPM with calls to getvm for the small (L3) and large (DDR) tensors executed on the Intel Platinum CPU. The reported bandwidths are smaller than those of Table 2, which is consistent with the HOPM benchmark performing consecutive TVM operations on successively smaller input tensors. As the touched memory involved in the HOPM decreases, so does its parallel performance. When that memory is smaller than the L1 cache size (32 KiB for this CPU), the corresponding kernels getvm, dot or axpby are executed sequentially for best throughput. Comparing the arithmetic mean results against those of Table 2, the HOPM bandwidth is halved with respect to the TVM bandwidth for L3 tensors, while it is only reduced by less than 10% when considering large tensors. In the latter case, the benchmark is bounded by the system memory bandwidth, which is where all three compilers deliver similar performance. The HOPM standard deviation across DDR tensors stays below 10% on average, rendering the HOPM benchmark based on our native TVM algorithm oblivious to the tensor order.

To complete this section, the HOPM benchmark based on calls to the TVM loop algorithm with Intel MKL optimized kernels (gemv, dot and axpby) is studied. The final results obtained by arithmetic mean are 166.9 GB/s and 68.1 GB/s for the small and large tensors, respectively. These figures clearly demonstrate the superiority of the native algorithm when integrated in the HOPM, making this benchmark between 2.4× (small tensors) and 1.31× (large tensors) faster than its looped counterpart.

[Figure caption: TVM performance for the tensors of Table 1 provided by the getvm and getvmd algorithms; the arithmetic mean (Σ) is also provided at the end for completeness. Results from different compilers on a two-socket system with Intel Xeon Platinum 8160 CPUs.]

This two-socket system has a theoretical bandwidth of 256 GB/s. However, McCalpin's STREAM triad function reports an average bandwidth of 155.5 GB/s and 205.6 GB/s when using the GCC and Intel compilers, respectively. This represents roughly 60% and 80% of the theoretical peak value and demonstrates again how the Intel binary benefits from using streaming stores in this particular benchmark. Hereafter, only the large (DDR) tensors of Table 1 will be considered. By looking at the results corresponding to the TVM disjointed algorithm getvmd, it can be readily seen that GCC lacks support for OpenMP parallel nesting (two levels are required), at least up to version 10.1.0. Furthermore, this explicit NUMA-aware implementation somehow experiences performance regression on certain tensors. For instance, the measured bandwidth of the seventh-order tensor is barely better than that of the single-socket configuration, see Table 2(b). The same applies, but to a lesser extent, to the eighth- and tenth-order tensors.

Two-socket system
On the other hand, Fig. 5 also reports on the TVM performance of the implicit NUMA-aware implementation provided by getvm. Since parallel nesting is no longer required, GCC is able to fully compete with Intel and Clang, as evidenced by the mean bandwidth values. Indeed, the three compilers deliver once again similar performance. For lower-order tensors, the measured bandwidths are slightly larger than those reported by STREAM. What is more, the results also indicate that the standard deviation remains independent of the number of NUMA nodes and stays, on average, below 8%, while the speedup reaches 1.99×.
All things considered, getvm is 7% faster than getvmd and manages to attain more than 76% of the theoretical bandwidth when considering the arithmetic mean over all tensors. The implicit NUMA-aware implementation of getvm, which is fully supported by the GCC OpenMP runtime, is faster and much simpler than its disjointed counterpart. Hence, from now on, we study only the implicit NUMA-aware getvm version.

[Figure caption: results for the tensors of Table 1; the arithmetic mean (Σ) is also provided at the end for completeness. Results from different compilers on one- and two-socket systems equipped with Intel Xeon Platinum 8160 CPUs.]

Figure 6 reports on the HOPM average bandwidth based on calls to getvm for this two-socket system. The three compilers are tied, and the arithmetic mean figures reflect an overall speedup of 1.9× over the single-socket results. Although the HOPM bandwidth results remain competitive, these figures are on average about 10% lower than those reported by the TVM algorithm; indeed, the HOPM benchmark attains more than 66% of the theoretical bandwidth compared to the 76% of the TVM. This is again attributed to the getvm, dot or axpby functions whose memory footprint is smaller than the L1 cache and which are hence executed sequentially within the HOPM. Although this results in overall best performance, it does not necessarily yield the best NUMA scalability, and therefore we can no longer claim that the HOPM performance for more than one NUMA node is oblivious to the tensor order.

Single-socket systems
The AMD Epyc 7742, Intel Xeon Phi 7230 (KNL) and Fujitsu ARM A64FX CPUs are now tested. The first one is made of chiplets and uses conventional DDR, while the other two benefit from HBM. These three CPU architectures are composed of 4 NUMA nodes and therefore it is extremely important to set up their firmware in such a way that all the nodes are available to the programmer, in order to minimize inter-NUMA communication as much as possible.
In terms of compilers, if one considers performance scenarios limited by the system memory (DDR or HBM), as is the case in this section, then GCC delivers the best performance. The STREAM measurements represent roughly 53%, 72% and 59% of the theoretical bandwidth of the AMD, KNL and ARM platforms, respectively. This is because the corresponding GCC binaries do not use streaming stores. As was the case for the Intel Platinum CPU, one should expect TVM and HOPM performance figures ranging between the values measured via STREAM and the theoretical ones reported in Table 4.

Table 5 summarizes the average parallel performance of the TVM native algorithm for different architectures and NUMA partitions. Starting with the AMD platform, Table 5(a), it can be noted that the performance scales well with the number of nodes. The accumulated standard deviation percentages are similar to those previously reported on the Intel Platinum CPU, which also utilizes conventional DDR. Looking at the 4-NUMA results, the average bandwidth per tensor is 136.2 GB/s, which is above the 110.2 GB/s reported by STREAM. The TVM algorithm reaches, on average, 66% of the theoretical bandwidth on the AMD platform. This is about 10% less than what was achieved on the previous two-socket Intel system with monolithic CPUs.
The effects of using fast RAM (HBM in the case of the KNL and HBM2 for the ARM CPU) are shown in Table 5(b)-(c). For example, it can be readily seen that the standard deviation rates obtained by arithmetic mean increase by almost a factor of two compared to the same values reported on the previous two systems with conventional DDR. A closer inspection reveals that mode obliviousness on higher-order tensors can no longer be assumed for the KNL. This is even worse for the ARM architecture, since the 0-mode TVM clearly indicates that there is a major performance gap of up to 55.6% between the VMM and MVM kernels. We believe that this may be due to the binaries generated by the experimental version of GCC installed on this platform. On the other hand, the NUMA scalability remains good, especially on the ARM platform. The KNL delivers an average bandwidth per tensor of 307 GB/s, which is about 66% of the theoretical value and slightly below the one reported by STREAM. On ARM this value is 637.3 GB/s, just above the STREAM measurement, and represents 62% of the theoretical bandwidth. All things considered, these three architectures exhibit similar performance figures in relative terms and remain below the average value reported on the previous two-socket Intel platform (see Fig. 7(a)). This is perhaps an indication that more sophisticated hardware architectures, whether or not equipped with high bandwidth memory, become more challenging to fully exploit. In this regard, compilation parameters such as the unroll factor have remained unchanged for all cases and hence no additional effort has been made to fine-tune the compilation for a particular architecture. Finally, Table 6 summarizes the average bandwidth achieved by the HOPM native algorithm for different architectures and NUMA partitions.
One may expect the HOPM performance to always stay below the TVM performance but, surprisingly, this is not the case for the KNL using 4 NUMA nodes and the ARM CPU with 2 nodes. The arithmetic mean figures corresponding to the 4-NUMA case indicate that the HOPM reaches, on average, 59%, 69% and 58% of the theoretical bandwidth on the AMD, KNL and ARM processors, respectively. In relative terms, the KNL performs slightly better than the two-socket Intel platform. For the AMD and ARM architectures, the approximate 10% performance drop-off observed in the TVM algorithm is carried forward in the HOPM benchmark.
The results presented in this section have demonstrated the performance improvements brought by the TVM native algorithm in various scenarios. All the mathematical operations have been carried out using double precision arithmetic over the dense tensors described in Table 1. Figure 7 presents the average performance of the TVM and HOPM for the same tensors filled with single precision floating-point numbers when using all the NUMA nodes available on the four architectures considered in this work. Note that, since the number of elements per tensor does not change, their memory footprint is reduced by half. In order to facilitate the comparison between different architectures, the performance is expressed as a percentage of the theoretical bandwidth. Similarly, the double precision performance figures previously obtained are included in the comparison.

[Figure caption: performance for the tensors of Table 1 provided by the (a) TVM algorithm and (b) HOPM benchmark. The arithmetic mean over all tensors (Σ) and the STREAM performance figures are also provided at the end for completeness. Results correspond to the four architectures considered in this work exploiting their full potential.]

Figure 7(a) shows that the TVM performance is invariant with respect to the arithmetic precision for the Intel and AMD processors, especially when looking at the arithmetic mean values. There is a slight reduction of bandwidth on the KNL CPU when using 4-byte floats and a definitive performance gap on the ARM architecture. We take the ARM results with a grain of salt, since we have experienced inconsistent behaviour over certain contraction modes where the multi-threaded performance was barely better than the single-threaded one; it may be that the experimental version of the GCC compiler is not able to generate proper SVE vector instructions in this particular scenario. It is worth mentioning that the principal consequence of reducing the floating-point precision is an overall increase of the standard deviation rates of 69% and 73% for the KNL and ARM architectures, respectively. We attribute this to the HBM installed in these systems.
Finally, Figure 7(b) shows that, on average, the HOPM performance also tends to decrease for the KNL and ARM CPUs due to the use of HBM. In this particular case, the performance drop is of the same order of magnitude for the two architectures. However, only on the ARM CPU does the standard deviation across all tensors increase by almost a factor of two when using single precision arithmetic, while it remains practically unaltered on the KNL system.

CONCLUSION AND FUTURE WORK
This work has presented a novel tensor-vector multiplication (TVM) algorithm in the form of an exclusive BLAS level 2 function. It has been built from scratch, uses a non-hierarchical layout, contains specialized optimizations and shares the same list of arguments as the well-known BLAS matrix-vector multiplication function plus an additional term. Moreover, its multi-threaded implementation is straightforward and, because it presents a flat parallel structure, delivers better scalability and performance than other popular approaches such as the TVM unfold and loop algorithms. Two first-touch strategies have been proposed with the objective of bringing NUMA awareness to the TVM, although working with disjointed tensors has proven to yield worse results. Lastly, this TVM native algorithm has been integrated in the higher-order power method (HOPM), which serves as a real-world benchmark.
The TVM and HOPM algorithms have been satisfactorily tested using up to three different compilers and four hardware architectures, two of which feature high bandwidth memory. For large tensors, the performance results are limited by the system memory bandwidth and remain compiler agnostic. Moreover, mode obliviousness is achieved on architectures with conventional DDR, contrary to what is observed in the case of the TVM loop algorithm, even when relying on heavily optimized BLAS libraries with architecture-tailored kernels. However, the proposed TVM exhibits some mode dependency in the presence of HBM, since the associated TVM deviation rates tend to double. In general, the bandwidth achieved by the proposed algorithm oscillates between 62% and 76% for the TVM and from 58% to 69% for the HOPM, relative to the theoretical system memory speeds reported by the manufacturers. These percentages diminish slightly on systems equipped with HBM when performing single precision arithmetic computations.

Our results also confirm that the overall performance tends to diminish as a consequence of using more complex architectures composed of several NUMA partitions.
Although all the figures shown in the present study are related to worst-case scenarios due to the particular selection of tensors, they remain very competitive and clearly assess the effectiveness of the proposed TVM algorithm, making it suitable for integration in the most popular BLAS libraries available today. Future work will be carried out to integrate task-based shared memory programming models as well as the message passing interface (MPI) for distributed parallelism. Porting the TVM library to GPUs and other accelerators is also envisaged. Finally, the fundamental principles applied to our TVM algorithm will also be extended to the BLAS level 3 tensor-matrix multiplication.