Exploiting in-Hub Temporal Locality in SpMV-based Graph Processing

The skewed degree distribution of real-world graphs is the main source of poor locality in traversing all edges of the graph, known as Sparse Matrix-Vector (SpMV) multiplication. Conventional graph traversal methods, such as push and pull, traverse all vertices in the same manner, and we show that applying a uniform traversal direction to all edges leads to sub-optimal memory locality and hence poor efficiency. This paper argues that different vertices in power-law graphs have different locality characteristics and that the traversal method should be adapted to these characteristics. To this end, we propose to inspect the number of destination and source vertices when selecting a cache-compatible traversal direction for each type of vertex. We introduce in-Hub Temporal Locality (iHTL), a structure-aware SpMV that combines push and pull in one graph traversal, applying each to a different vertex type. iHTL exploits temporal locality by traversing incoming edges to in-hubs in push direction, while processing other edges in pull direction. The evaluation shows iHTL is 1.5×-2.4× faster than pull and 4.8×-9.5× faster than push in state-of-the-art graph processing frameworks such as GraphGrind, GraphIt and Galois. More importantly, iHTL is 1.3×-1.5× faster than pull traversal of graphs relabeled by state-of-the-art locality optimizing reordering algorithms such as SlashBurn, GOrder, and Rabbit-Order.


Introduction
Among data-intensive problems, graph processing is particularly challenging due to high memory bandwidth requirements and irregular memory access patterns that make it hard to optimize locality of memory accesses, especially when traversing all edges of the graph (i.e., Sparse Matrix-Vector (SpMV) multiplication), where caches cannot contain the data of all vertices. Graph traversal is performed in two major patterns or directions: (1) In pull direction, a vertex pulls the data of its in-neighbours, which results in random read memory accesses and sequential write memory accesses. (2) In push direction, a vertex updates the data of its out-neighbours, so read accesses are sequential but write accesses are random.
Push traversal requires protecting the data of out-neighbours from concurrent updates by parallel threads, which is achieved through (1) atomic instructions, (2) buffering [29], or (3) partitioning edges by destination [35]. Pull traversal is faster than push because it traverses edges grouped by their unique destinations, so write memory accesses do not require protection. Pull traversal underpins several analytics such as Hyperlink Induced Topic Search [20], Belief Propagation [19], Graph Neural Networks [40], Recurrent Neural Networks [27], PageRank [11], and Community Detection [47].
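As a concrete illustration of the two directions, the following minimal sketch shows one SpMV iteration in pull and in push form, assuming a CSR/CSC layout with an offsets array (idx) and a neighbour-ID array (nbr); the array names and the double-valued vertex data are illustrative, not the code of any particular framework.

```c
#include <stddef.h>

/* Pull direction: every vertex v sums the previous data of its
   in-neighbours. Reads of d_prev[] are random, the write of d_new[v]
   is sequential, and no write protection is needed. */
void spmv_pull(size_t n, const size_t *in_idx, const unsigned *in_nbr,
               const double *d_prev, double *d_new)
{
    for (size_t v = 0; v < n; v++) {
        double sum = 0.0;
        for (size_t e = in_idx[v]; e < in_idx[v + 1]; e++)
            sum += d_prev[in_nbr[e]];        /* random read      */
        d_new[v] = sum;                      /* sequential write */
    }
}

/* Push direction: every vertex u adds its previous data to all of its
   out-neighbours. Reads of d_prev[] are sequential, writes of d_new[]
   are random and must be protected (atomics, buffering, or destination
   partitioning) when executed by parallel threads. */
void spmv_push(size_t n, const size_t *out_idx, const unsigned *out_nbr,
               const double *d_prev, double *d_new)
{
    for (size_t u = 0; u < n; u++) {
        double du = d_prev[u];               /* sequential read */
        for (size_t e = out_idx[u]; e < out_idx[u + 1]; e++)
            d_new[out_nbr[e]] += du;         /* random write    */
    }
}
```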
The structure of many real-world graphs poses challenges to efficient pull traversal. Graphs derived from social networks, the internet, and the world-wide web show a skewed or heavy-tailed degree distribution, often following a power-law distribution: a very small fraction of vertices, known as hubs, are connected to a disproportionately large fraction of edges. The impact of hubs on temporal locality becomes problematic when a massive amount of vertex data is pulled into the cache by pull processing of an in-hub (a vertex with a large in-degree), which displaces much of the cache contents and severely reduces the opportunity for future reuse.
Locality optimizing graph relabeling algorithms [2,9,24,36,41,42] use different techniques such as graph clustering, community detection, and cache simulation to improve memory locality by rearranging the order in which vertices are numbered. However, we found that graph reordering algorithms succeed in improving the locality of non-hub vertices, but not of the hubs. Crucially, hubs have so many neighbours that reuse between the neighbours of one hub and the neighbours of the next vertex is unlikely.
Our solution to this problem is based on the property of skewed graphs that a very small fraction of the vertices (the hubs) connect to a disproportionately high number of edges: when traversing the in-edges of the in-hubs, the set of possible destinations (in-hubs) is very small, yet the set of possible sources is very large (due to the large in-degree of the in-hubs). Our insight is that, for such a sub-graph, the cache-compatible traversal direction is push and not the intended pull: while the cache cannot satisfy random accesses to the numerous source vertices of the edges to in-hubs in pull traversal, push traversal results in random memory accesses to a small set of destination vertices (in-hubs) that can be satisfied by the cache. Importantly, these destination-based random accesses can be captured in full in the on-chip caches if we control the number of hubs correctly. Consequently, we can traverse an important fraction of the edges with no random accesses to main memory. This paper develops this insight and presents in-Hub Temporal Locality (iHTL), which optimizes SpMV-based graph analytics by using push direction for processing incoming edges to in-hubs and pull direction for incoming edges to non-hubs.
The contributions of this paper are thus:
• We analyze the challenges imposed by hubs during graph traversal and demonstrate that locality optimizing graph relabeling algorithms fail to address the locality issues of hubs.
• We introduce the iHTL algorithm, which applies a bespoke mix of push and pull traversal in order to maximize cache reuse. We demonstrate how to efficiently prepare the iHTL graph in a light-weight preprocessing step by designing an algorithm that determines hubs based on graph structure.
• We evaluate iHTL on 10 real-world graphs with up to 7.9 billion edges, demonstrating that iHTL significantly improves locality and outperforms the pull traversal of state-of-the-art graph processing systems such as GraphGrind [35], GraphIt [46] and Galois [16] by 1.5×-2.4×. Moreover, the evaluation of iHTL against state-of-the-art relabeling algorithms such as SlashBurn [24], GOrder [41] and Rabbit-Order [2] shows that iHTL is faster than pull traversal of the relabeled graphs by 1.3×-1.5× while reducing the preprocessing time by 780×.

Section 2 explains key background material and motivates the iHTL approach. Section 3 presents iHTL. The evaluation of iHTL is discussed in Section 4 and Section 5 studies related work. Section 6 presents future work.

Terminology
A graph G = (V, E) has a set of vertices V and a set of directed edges E. The adjacency matrix A is a binary matrix representing the graph: the element at row u and column v is 1 if E contains an edge from vertex u to vertex v, and 0 otherwise. Graphs are represented in Compressed Sparse Rows and Columns (CSR and CSC) [30]. N⁻(v) and N⁺(v) are the sets of in-neighbours and out-neighbours of vertex v, respectively.
The hub, in-hub, and out-hub vertices are vertices with the highest degree, in-degree, and out-degree, respectively. We do not present a formulaic definition of hubs. Instead, we reserve the term hub only for those vertices identified by iHTL as meriting an opposite traversal direction.

High Cache Miss Rate of in-Hubs
We first demonstrate that the cache misses incurred during pull traversal are concentrated in the highest-degree vertices. We use SpMV multiplication (Algorithm 1), which iteratively calculates the new data of a vertex as the summation of the previous data of its in-neighbours. Using the notation D_v^i for the data of vertex v in iteration i, SpMV calculates: D_v^i = Σ_{u ∈ N⁻(v)} D_u^{i−1}. Figure 1 depicts the miss rate conditional on the degree of a vertex for a social network graph (Twitter MPI) and a web graph (SK-Domain). It shows that the initial graphs incur substantial miss rates for hubs, which are the destination of the majority of the edges in power-law graphs [21]; consequently, a significant part of the execution time is consumed by processing hub vertices. Figure 1 also illustrates the result of optimizing locality using relabeling algorithms: SlashBurn, GOrder, and Rabbit-Order. Relabeling algorithms change the order of vertices, which affects hubs' locality by changing (1) the cache contents immediately before the processing of hubs starts, and (2) the spatial locality of hubs' neighbours. However, locality optimizing relabeling algorithms cannot change the structure of the graph and the number of hubs' neighbours. Figure 1 shows that locality optimizing relabeling algorithms can improve the cache miss rates of non-hub vertices. However, hubs experience a high cache miss rate even after reordering [21]. In contrast, iHTL significantly reduces the cache miss rates of the hubs, which translates to substantial performance gain.

Inefficient Cache Utilization in Pull Traversal
To explain this high miss rate of hubs, Figure 2.(a) presents an example graph, and Figure 2.(c) shows its pull traversal execution timeline. Vertices 3 and 7 are in-hubs and the effective cache size is 2. For an LRU cache with one vertex data per cache line, the notation [x, y] shows that the data of vertices x and y is in the cache, and ⇒ denotes the resulting cache state. Before processing vertex 3, vertices 1 and 2 have been processed and ⇒ [1,7].
For pull traversal of the first in-hub (vertex 3), the data of vertices 2, 5, 6, 7, and 8 must be read. Starting with vertex 2, its data is not in the cache and is fetched from memory, and ⇒ [2,1]. Then the data of vertex 5 is required, which is not in the cache and is read from memory, and ⇒ [5,2]. In the same way, the data of vertices 6, 7, and 8 is also read from main memory, and no reuse happens for processing the 5 in-edges of vertex 3. The same occurs for pull traversal of the second in-hub (vertex 7), and no reuse is experienced for processing its 4 in-edges. This shows that the high degree of an in-hub limits the reuse of the cache contents in pull traversal of in-hubs.

iHTL Idea
In a pull traversal (Algorithm 1), each vertex reads the data of its in-neighbours (Line 4) and writes its new data. Therefore, most of the cache capacity is dedicated to random read accesses to the data of source vertices. In a push traversal (Algorithm 2), the new data of out-neighbours (Line 3) is randomly updated using the data of source vertices, so most of the cache capacity is dedicated to random write accesses to destination vertices.
On the other hand, real-world graphs with power-law structure have a few in-hub vertices, each having a large number of in-neighbours. In a pull traversal of an in-hub, the cache is dedicated to source vertices, and the number of these source vertices is greater than the cache capacity, resulting in a high rate of cache misses (Figure 1).
iHTL observes that for incoming edges to in-hubs, the number of destination vertices (in-hubs) is much smaller than the number of source vertices; therefore, the cache can be used efficiently only if it is dedicated to the destination vertices. In other words, push traversal is the suitable direction for traversing incoming edges to in-hubs.
In order to facilitate this, iHTL relabels the graph and assigns the lowest vertex IDs to in-hubs: vertices 3 and 7 of the example graph in Figure 2.(a) become vertices 1 and 2 in Figure 2.(b). Moreover, iHTL processes the graph in two steps. In the first step incoming edges to in-hubs are processed in push direction. In the second step, other edges are processed in pull direction. Figure 2.(c) shows the iHTL execution timeline for Figure 2.(b). For push traversal of incoming edges to in-hubs, first, vertex 1 updates data of vertex 2 by its data and vertex 2 updates data of vertex 1 ⇒ [1,2]. Then, vertex 3 reads its data from main memory and updates the data of in-hubs (vertices 1, 2), and one reuse out of two updates is achieved (cache content is [2,1]). In the same way, two more reuses are achieved for push traversal of outgoing edges of vertices 4 and 5 to in-hubs.
The comparison of the pull and iHTL execution timelines shows that reuse is more frequent in iHTL because, by push traversal of incoming edges to in-hubs, random accesses are targeted at a small number of in-hubs that are maintained in the cache. In iHTL every edge is traversed exactly once, as it should be, even though iHTL mixes push and pull traversals. The timeline also shows that reuse increases with the number of common in-hubs.

In-Hub Temporal Locality
iHTL performs push for incoming edges to in-hubs, and applies pull for incoming edges to non-hubs. To perform efficient push and pull, iHTL creates subgraphs for groups of edges. We explain the iHTL graph in Section 3.1. Sections 3.2 and 3.3 discuss the process of creating an iHTL graph, and Section 3.4 shows how SpMV is performed in iHTL.

iHTL Graph
In order to facilitate push and pull traversals in iHTL, we distinguish blocks (subgraphs) within the graph adjacency matrix. Figure 3 shows the adjacency matrix structure of iHTL graph. Note that we use the convention that a pull traversal corresponds to a column-major traversal of the adjacency sub-matrix, while a push traversal corresponds to a row-major traversal.
The iHTL graph is comprised of three major parts:
• A number of flipped blocks that contain incoming edges to in-hubs,
• A sparse block that contains edges to non-hubs, and
• A zero block that contains no edges.
iHTL uses push traversal for processing incoming edges to in-hubs, and it is necessary to ensure that the data of in-hubs is kept in the cache. For a graph that has more in-hubs than the cache can hold, iHTL creates multiple flipped blocks.
Due to the skewed degree distribution of graphs, flipped blocks are very dense (contain few hubs, but many edges). We will show in the evaluation section that flipped blocks in iHTL contain 40-70% of the edges for 8 out of 10 graphs.
The sparse block, on the other hand, contains edges to non-hubs, and iHTL processes it with pull traversal, which dedicates the cache to the source vertices of edges; since there are no in-hubs in the sparse block, reuse of the cache contents is improved.
To create these blocks, iHTL categorizes vertices into:
• in-hubs,
• VWEH: Vertices With Edges to Hubs, and
• FV: Fringe Vertices, which have no edges to in-hubs.
In iHTL, all edges in the flipped blocks are either edges from VWEH to in-hubs, or from in-hubs to in-hubs. Fringe vertices do not link to in-hubs. As such, they do not appear in flipped blocks and a zero block (∅) appears in the adjacency matrix ( Figure 3). We separate out fringe vertices in order to (1) avoid loading their vertex data from main memory during processing of flipped blocks, and also to (2) shrink the size of topology data of flipped blocks.

Creating iHTL Graph
The iHTL graph (Figure 3) is created in 3 steps:
(1) Creating the Relabeling Array: To enforce the new arrangement of vertices, the iHTL relabeling array is created such that all in-hubs have smaller labels than VWEH and all VWEH have smaller labels than FV. iHTL brings vertices of the same type (in-hubs, VWEH, and FV) close to each other by assigning consecutive IDs. However, it keeps the initial order between vertices of the same type within VWEH and FV. In this way, iHTL makes a minimal change to the initial neighbourhood of the vertices. Figure 4 shows the creation of the relabeling array for the example graph in Figure 2.(a). First, the in-hubs are selected as a number of vertices with the highest in-degree, and the smallest IDs are dedicated to the in-hubs. The number of in-hubs depends on the number of flipped blocks and is discussed in Section 3.3. Second, the VWEH are identified by traversing the CSC representation of the main graph for the selected in-hub vertices. The remaining vertices are FV. Figure 5 shows the adjacency matrix of the example graph, and Figure 6 shows the adjacency matrix of its iHTL graph after relabeling.
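A minimal sketch of this step is shown below. It assumes the in-hubs have already been selected (e.g., as the vertices with the highest in-degree) and that the CSC arrays of the main graph are available; the function and variable names are illustrative, not the actual iHTL implementation.

```c
#include <stdlib.h>

/* new_id[v] gives the iHTL label of original vertex v. */
void build_relabeling(size_t n, size_t n_hubs,
                      const size_t *hub,          /* selected in-hubs        */
                      const size_t *in_idx,       /* CSC offsets, n+1 values */
                      const unsigned *in_nbr,     /* CSC sources             */
                      unsigned *new_id)
{
    char *is_hub  = calloc(n, 1);
    char *is_vweh = calloc(n, 1);

    for (size_t h = 0; h < n_hubs; h++)
        is_hub[hub[h]] = 1;

    /* A vertex is VWEH if it is the source of at least one in-edge of
       an in-hub (one pass over the in-edges of the selected hubs). */
    for (size_t h = 0; h < n_hubs; h++)
        for (size_t e = in_idx[hub[h]]; e < in_idx[hub[h] + 1]; e++)
            if (!is_hub[in_nbr[e]])
                is_vweh[in_nbr[e]] = 1;

    /* Hubs get the smallest IDs, then VWEH, then FV; the original
       relative order is preserved within VWEH and within FV. */
    unsigned next = 0;
    for (size_t h = 0; h < n_hubs; h++)
        new_id[hub[h]] = next++;
    for (size_t v = 0; v < n; v++)
        if (is_vweh[v]) new_id[v] = next++;
    for (size_t v = 0; v < n; v++)
        if (!is_hub[v] && !is_vweh[v]) new_id[v] = next++;

    free(is_hub);
    free(is_vweh);
}
```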
It is worth mentioning that contrary to locality optimizing relabeling algorithms like GOrder and Rabbit-Order, the relabeling array in iHTL does not improve locality and is used to form the blocks required in iHTL adjacency matrix. Locality is improved in iHTL by increasing reuse in push traversal for incoming edges to in-hub vertices.
(2) Creating Flipped Blocks: Flipped blocks in iHTL contain the in-edges of in-hubs. If a flipped block contains B in-hubs, then the k-th flipped block contains the edges to the in-hubs with IDs in the range [(k − 1)·B, k·B). Creating flipped blocks requires a pass over the outgoing edges of {in-hubs ∪ VWEH} in the CSR representation of the main graph and selecting the edges with in-hub destinations (which are identified using the iHTL relabeling array).
(3) Creating A Sparse Block: The sparse block of iHTL contains the edges to non-hubs, which are processed in pull direction. It is formed by a pass over the CSC representation of the main graph for all in-edges to {VWEH ∪ FV} and relabeling the sources of the edges using the iHTL relabeling array.
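The edge-routing decision behind steps (2) and (3) can be sketched as the counting pass below, under the same assumed arrays and relabeling as above; the real implementation would build the per-block CSR/CSC structures rather than only counting edges.

```c
#include <stddef.h>
#include <string.h>

/* Count how many edges fall into each flipped block and into the
   sparse block: an edge whose relabeled destination is an in-hub goes
   to flipped block dst / hubs_per_block, all other edges go to the
   sparse block. */
void count_block_edges(size_t n, size_t n_hubs, size_t hubs_per_block,
                       size_t n_flipped,
                       const size_t *out_idx, const unsigned *out_nbr,
                       const unsigned *new_id,
                       size_t *flipped_edges,   /* n_flipped counters */
                       size_t *sparse_edges)
{
    memset(flipped_edges, 0, n_flipped * sizeof *flipped_edges);
    *sparse_edges = 0;
    for (size_t u = 0; u < n; u++)
        for (size_t e = out_idx[u]; e < out_idx[u + 1]; e++) {
            unsigned dst = new_id[out_nbr[e]];
            if (dst < n_hubs)
                flipped_edges[dst / hubs_per_block]++;  /* in-edge of an in-hub */
            else
                (*sparse_edges)++;                      /* edge to a non-hub    */
        }
}
```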

Number of in-Hubs and Flipped Blocks
The main benefit of iHTL is to traverse the flipped blocks such that random accesses are made to the few hubs that are maintained in the cache. To accomplish this, the number of hubs is dimensioned based on a combination of cache size and graph structure. Taking cache size into account is critical to catch the random accesses to the in-hubs on chip. However, graph data sets may require more hubs than contemporary processor cache sizes can handle. Because of this, iHTL fixes the number of in-hubs in a flipped block based on cache size and constructs multiple flipped blocks as needed based on graph structure.
The number of in-hubs in a flipped block is determined by the on-chip cache size. We identified that the level 2 cache is the best location for holding the vertex data of in-hubs (Section 4.7). As such, we specify the number of hubs per flipped block, B, by dividing the level 2 cache size by the size of the vertex data.
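As a worked example, assuming 8-byte vertex data (e.g., one double per vertex for PageRank) and the 1 MB private L2 cache of the evaluation machine in Section 4, this dimensioning yields 131,072 in-hubs per flipped block; the snippet below only illustrates the arithmetic.

```c
#include <stdio.h>

int main(void)
{
    size_t l2_bytes  = 1u << 20;        /* 1 MB private L2 cache         */
    size_t vertex_sz = sizeof(double);  /* 8-byte vertex data (PageRank) */
    size_t hubs_per_block = l2_bytes / vertex_sz;
    printf("in-hubs per flipped block: %zu\n", hubs_per_block);  /* 131072 */
    return 0;
}
```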
If the graph structure mandates more hubs, we increase the number of flipped blocks. Therefore, iHTL needs to balance the benefit of creating more flipped blocks against the drawbacks. The benefit is improved locality; however, there are two drawbacks to increasing the number of flipped blocks: (1) While all members of {in-hubs ∪ VWEH} have edges to in-hubs in the first flipped block, this fraction diminishes in subsequent flipped blocks, reducing efficiency as some fetched vertex data will not be used during push traversal.
(2) Flipped blocks, moreover, increase the size of the graph topology data, as each block requires its own metadata. Based on these observations, iHTL allows a new flipped block to be formed only if its hubs have edges from at least 50% of the {in-hubs ∪ VWEH}.
iHTL increases the number of flipped blocks (#FB) while |VWEH_#FB| > 0.5 · |VWEH_1|. In order to calculate |VWEH_k|, a pass over the in-edges of the in-hub vertices in the k-th flipped block is required to mark the members, and one other pass is needed to count the marked vertices.
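A sketch of this decision is shown below. It assumes the candidate in-hubs already occupy the lowest IDs in decreasing in-degree order and that the CSC arrays of the graph are available; the names and the scratch array are illustrative.

```c
#include <stddef.h>
#include <string.h>

/* |VWEH_k|: the number of distinct source vertices that have an edge to
   the in-hubs [first_hub, last_hub). One marking pass over the in-edges,
   one counting pass over the mark array. */
static size_t count_vweh(size_t first_hub, size_t last_hub, size_t n,
                         const size_t *in_idx, const unsigned *in_nbr,
                         char *mark)
{
    memset(mark, 0, n);
    for (size_t h = first_hub; h < last_hub; h++)
        for (size_t e = in_idx[h]; e < in_idx[h + 1]; e++)
            mark[in_nbr[e]] = 1;
    size_t cnt = 0;
    for (size_t v = 0; v < n; v++)
        cnt += mark[v];
    return cnt;
}

/* Grow the number of flipped blocks while the k-th block's hubs receive
   edges from more than half as many vertices as the first block's hubs
   (i.e. while |VWEH_k| > 0.5 * |VWEH_1|). Assumes max_blocks *
   hubs_per_block candidate hubs exist. */
size_t choose_num_flipped_blocks(size_t n, size_t hubs_per_block,
                                 size_t max_blocks,
                                 const size_t *in_idx, const unsigned *in_nbr,
                                 char *mark /* scratch array of n bytes */)
{
    size_t base = count_vweh(0, hubs_per_block, n, in_idx, in_nbr, mark);
    size_t k = 1;
    while (k < max_blocks) {
        size_t first = k * hubs_per_block;
        size_t vweh_k = count_vweh(first, first + hubs_per_block, n,
                                   in_idx, in_nbr, mark);
        if (2 * vweh_k <= base)   /* stop once |VWEH_k| <= 0.5 * |VWEH_1| */
            break;
        k++;
    }
    return k;
}
```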

iHTL Processing
In parallel processing of a flipped block, concurrent threads perform random updates to the vertex data of in-hubs. To avoid race conditions, we opt for a buffering technique (where each thread operates on copies of the vertex data which are later merged [29]), as it is more efficient in the setting of iHTL, which uses the private and fast L2 cache of each thread. As we see in the evaluation, buffer merging in iHTL does not take more than 3% of iHTL execution time (each thread buffers B · #FB vertex data). Algorithm 3 shows the SpMV execution for iHTL. Flipped blocks (Lines 1-4) use push traversal in iHTL. For each flipped block, the old data of a vertex that has edges to hubs is read and the corresponding index of the local buffer of the thread is updated. Since threads write updates locally while processing flipped blocks, the parallel for loop in Line 1 does not require synchronization between threads and different threads can process vertices of different flipped blocks. However, each thread should process only one flipped block at a time.
After completion of processing flipped blocks, thread buffers are merged (Lines 5-7) to specify data of hubs. Finally, pull traversal is used for processing the sparse block (Lines 8-10).
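The following sequential sketch captures the shape of this computation, using a single buffer in place of the per-thread buffers; array names are illustrative, and the actual implementation parallelizes the flipped-block and sparse-block loops with one private buffer per thread.

```c
#include <stddef.h>
#include <string.h>

/* Sequential sketch of one iHTL SpMV iteration (the shape of Algorithm 3).
   fb_idx[b]/fb_nbr[b]: CSR of flipped block b over the n_src source
   vertices {in-hubs ∪ VWEH}; stored destinations are relabeled in-hub
   IDs (< n_hubs). sp_idx/sp_nbr: CSC of the sparse block over all n
   vertices. buf: one buffer of n_hubs entries; the parallel version
   keeps one such buffer per thread and sums them in the merge step. */
void ihtl_spmv(size_t n, size_t n_src, size_t n_hubs, size_t n_flipped,
               const size_t *const *fb_idx, const unsigned *const *fb_nbr,
               const size_t *sp_idx, const unsigned *sp_nbr,
               const double *d_prev, double *d_new, double *buf)
{
    memset(buf, 0, n_hubs * sizeof *buf);

    /* (1) Push traversal of the flipped blocks: random writes go to the
       in-hubs of one block at a time, which fit in the private L2. */
    for (size_t b = 0; b < n_flipped; b++)
        for (size_t u = 0; u < n_src; u++) {
            double du = d_prev[u];
            for (size_t e = fb_idx[b][u]; e < fb_idx[b][u + 1]; e++)
                buf[fb_nbr[b][e]] += du;
        }

    /* (2) Merge: with per-thread buffers, d_new[h] is the sum over all
       threads; with a single buffer this is a copy. */
    for (size_t h = 0; h < n_hubs; h++)
        d_new[h] = buf[h];

    /* (3) Pull traversal of the sparse block for non-hub destinations. */
    for (size_t v = n_hubs; v < n; v++) {
        double sum = 0.0;
        for (size_t e = sp_idx[v]; e < sp_idx[v + 1]; e++)
            sum += d_prev[sp_nbr[e]];
        d_new[v] = sum;
    }
}
```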

Evaluation Method and Datasets
We use a 2-socket machine with 768 GB main memory. Each socket has an Intel® Xeon® Gold 6130 with 16 cores, 32 KB L1 cache, 1 MB L2 cache, and a 22 MB shared L3 cache.
iHTL has been implemented in the C language using pthread, libnuma, and PAPI [37], and is compiled with the -O3 flag. The implementation uses an interleaved NUMA memory policy and applies work-stealing [6] for parallel processing of graph partitions created by vertex and edge partitioning [35,44]. The master-worker model is used for managing parallel threads. Table 1 shows the datasets and their sources: "Konect" (KN) [7,22,26], "NetworkRepository" (NR) [10,13,15,28,32], and "Laboratory for Web Algorithmics" (LWA) [7-10,23]. The first 4 datasets are social networks, and the others are web graphs. Numbers of edges are in billions and numbers of vertices are in millions, counted after removing zero-degree vertices because of their destructive effect [35]. Graphs are represented in CSR and CSC with |V| + 1 index values of 8 bytes each and |E| neighbour IDs of 4 bytes each, as |V| < 2^32.
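For concreteness, the sketch below spells out the assumed CSR/CSC layout and its size formula; the struct name is illustrative.

```c
#include <stdint.h>

/* Assumed CSR/CSC layout: |V|+1 offsets of 8 bytes and |E| neighbour
   IDs of 4 bytes each, which is valid while |V| < 2^32. */
typedef struct {
    uint64_t  n_vertices;   /* |V|                                  */
    uint64_t  n_edges;      /* |E|                                  */
    uint64_t *idx;          /* |V| + 1 offsets, 8 bytes per entry   */
    uint32_t *nbr;          /* |E| neighbour IDs, 4 bytes per entry */
} csx_t;

/* Topology size in bytes: 8 * (|V| + 1) + 4 * |E|. */
static inline uint64_t csx_bytes(const csx_t *g)
{
    return 8 * (g->n_vertices + 1) + 4 * g->n_edges;
}
```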
We evaluate iHTL using the PageRank application, which has been implemented in all graph processing frameworks and iteratively performs SpMV-type calculations. We compare against several frameworks as each applies a different set of optimizations. GraphGrind performs an edge-balanced partitioning for a pull traversal. GraphIt includes the Cagra [45] locality optimizations (Section 5.4), which make it faster than Galois for some graphs. Figure 7 demonstrates the effectiveness of the iHTL locality optimizations, as iHTL is faster than the different implementations of pull traversal by 1.5×-2.4×. Figure 7 also shows that iHTL preserves the initial locality of graphs well, even for graphs like "SK-Domain" with high initial locality.
Table 2 shows the overhead of iHTL preprocessing as a multiple of PageRank iterations in different frameworks. It shows that, on average, iHTL requires 7.3-11.7 SpMV iterations of the other frameworks as preprocessing time. The preprocessing overhead can be completely amortized across different executions if the iHTL graph is stored on disk in its binary format (similar to the special file formats that each framework uses) after preprocessing. In Section 6, we present avenues for future work that may reduce the preprocessing time of iHTL.
Table 3 compares the memory accesses (loads and stores of data) and the level 3 cache misses of pull traversal vs. iHTL, captured using PAPI. iHTL incurs additional memory accesses due to: (1) increased volume of topology data, (2) updates to local buffers when processing flipped blocks, (3) merging buffers, and (4) resetting buffers. Types 1, 3, and 4 are sequential, i.e., assisted by prefetching. Type 2 includes random writes; however, these are captured by the level 2 cache. As such, the key distinction in cache misses that impacts performance occurs when processing in-hubs: where the pull traversal performs random reads that result in L3 cache misses, iHTL performs random writes captured by the L2 cache. This large difference in L3 cache misses is a key explainer for the performance of iHTL.

iHTL Memory Overhead
Table 4 compares the memory size of the CSC representation of the main graphs vs. their iHTL graphs. The topology data grows in iHTL compared to a standard compressed sparse columns representation. This results from replication of the index array for each block. However, topology data is read sequentially from main memory, as the graph topology is too large to fit in on-chip caches. The size increase is therefore not a major problem. Section 6 explains methods for reducing the topology data.

iHTL vs Relabeling Algorithms
To put the scale of iHTL's locality optimization in perspective, we compare iHTL against locality optimizing relabeling algorithms. Relabeling algorithms rearrange the vertices to provide better reuse of vertex data, and as Figure 1 shows, they can provide better locality for non-hub vertices. However, a structure-agnostic pull traversal does not allow relabeling algorithms to improve the locality of hubs (Figure 1). In contrast, iHTL targets the locality of hubs (Figure 1), which capture a significant portion of the edges (Table 5). Thus, iHTL outperforms the relabeling algorithms.
Figure 8 compares the preprocessing time of iHTL to the relabeling algorithms. GOrder has a sequential implementation. SlashBurn and Rabbit-Order have parallel implementations; however, the complexity of their algorithms makes them much slower than iHTL. iHTL has a simple preprocessing algorithm (Section 3.2) and does not need to investigate the neighbourhood of each vertex in detail. This gives iHTL a very short preprocessing time.
Table 5 characterizes the iHTL graph and the relative processing speed of flipped blocks. For social networks, flipped blocks contain 45%-65% of the edges. The push traversal of flipped blocks makes good use of the sequentially fetched vertex data, as a high percentage of the vertices link to the hubs (column VWEH). As a result, iHTL spends just 22%-40% of its time processing the flipped blocks of social networks. Web graphs contain only one flipped block that contains 40% of the edges on average and is processed in just 25% of the processing time, on average. The relatively high processing speed of flipped blocks compared to the whole graph is captured by the flipped block speed (column "FB speed"). It is calculated as the percentage of edges in the flipped blocks divided by the relative time spent in flipped blocks. Values higher than 1 indicate that an edge in a flipped block is processed more efficiently than the average across the graph. This is a consequence of containing the random memory accesses in the on-chip caches during processing of the flipped blocks, which cannot be guaranteed for the sparse block.
Table 5 also shows that buffer aggregation in iHTL requires less than 2.5% of the total processing time. Each flipped block implies buffer merging overhead; however, by inspecting the graph structure, iHTL incurs this overhead only when there is a corresponding gain in locality.
Table 6 shows the impact of the iHTL buffer size. Note that the buffer size determines the number of hubs per flipped block. The aim is for random accesses to the buffers to be serviced fast. Aligning the buffers to the L1 cache size is inefficient as its 32 KB capacity is too small to accommodate many hubs. The L2 cache is private to each core, which implies unfettered access.
Consequently, as Table 6 shows, L2 cache is the best choice for accommodating data of in-hub vertices.

Low-Degree vs. High-Degree
The idea of using different traversals for different vertices goes back to the AYZ algorithm [1] for triangle counting, where low-degree and high-degree vertices are differentiated to reduce computational complexity.
PowerLyra [14] reduces the communication cost in a distributed graph processing system by using edge-cut partitioning for low-degree vertices and vertex-cut partitioning for high-degree vertices. In this way, PowerLyra ensures that replicas of low-degree vertices do not increase, and processing of high-degree vertices experiences better load balance.

Push OR Pull
The effectiveness of push or pull traversal is discussed in [3,5,25,33,39]. These studies apply the same traversal direction to all vertices within a single traversal of all edges. The push or pull direction is selected in these works based on the density of the frontier or on a convergence optimization that can only be applied in a specific direction. Also, push and pull locality have different effects on traversal performance [21].
On the other hand, iHTL applies different traversal directions for different vertex types in one traversal of all edges of the graph.

Locality Optimizing Graph Reordering
Community detection algorithms like Scan [42] provide better locality. Scan isolates hubs and outliers (vertices marginally appended to clusters) from clusters to prevent unrelated communities from being merged because of a single shared hub neighbour. ScaleScan [31] removes unnecessary computations in Scan and parallelizes the execution. Graph relabeling algorithms like SlashBurn [24], GOrder [41] and Rabbit-Order [2] rearrange vertices in order to improve locality.
In contrast to vertex reordering algorithms, iHTL concentrates on reordering edges, with the primary goal of providing temporal locality, and as Section 4.5 shows, iHTL provides better locality while reducing the preprocessing time.

Blocking Strategies
Starting from [18], blocking techniques have been widely used to achieve different goals. GraphGrind [34] and Graptor [38] apply vertical blocking in their push traversals in order to prevent race conditions made by concurrent updates.
Cagra [45] applies horizontal blocking of the adjacency matrix in pull traversal, which limits the range of random memory accesses during processing of a block and thereby reduces cache misses. Per-thread buffers are used in Cagra to contain the intermediate updates of the data of all vertices. iHTL provides an efficient buffering limited to in-hubs.
Lav [43] reduces the overheads of Cagra by creating horizontal dense blocks only for those out-hubs that capture 80% of the out-edges. To avoid buffer merging and to reduce cache misses, Lav prevents concurrent processing of blocks, which may introduce load imbalance. In contrast, iHTL's flipped blocks are easily load-balanced and are processed concurrently (Section 3.4).
Moreover, since real-world graphs are not truly power-law graphs [12], it is not always possible to select the number of dense blocks using estimated degree distribution statistics. Instead, iHTL identifies the number of flipped blocks by assessing the relation between hubs independently of their degree (Section 3.3), and flipped blocks contain a wide range of 13%-68% of the edges (Table 5).
Efficient horizontal blocking based on out-degrees is fundamentally impossible in some graphs. Figure 9 compares the asymmetricity of vertices grouped by degree for a social network (Twitter MPI) and a web graph (UK-Union). The asymmetricity of a vertex v is defined as the fraction of in-neighbours that are not out-neighbours: asym(v) = |N⁻(v) \ N⁺(v)| / |N⁻(v)|. Figure 9 shows that in-hubs are almost symmetric in social networks (in-hubs are also out-hubs), but web graphs do not have symmetric in-hubs. Therefore, the lack of very high out-degrees in a graph implies that horizontal blocking cannot create dense blocks, which is most prominent in web graphs and can increase the overhead of reading topology data. Similarly, if the graph does not have very high in-degree vertices, it is not possible to create vertical dense blocks.
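A direct way to compute this metric, assuming sorted neighbour lists, is sketched below; the helper is illustrative and not part of the paper's implementation.

```c
#include <stddef.h>

/* Asymmetricity of a vertex: the fraction of its in-neighbours that are
   not also out-neighbours. Assumes both neighbour lists are sorted by
   vertex ID. */
double asymmetricity(const unsigned *in_nbr, size_t in_deg,
                     const unsigned *out_nbr, size_t out_deg)
{
    size_t j = 0, only_in = 0;
    for (size_t i = 0; i < in_deg; i++) {
        while (j < out_deg && out_nbr[j] < in_nbr[i])
            j++;                                  /* advance to candidate match    */
        if (j >= out_deg || out_nbr[j] != in_nbr[i])
            only_in++;                            /* in-neighbour with no back edge */
    }
    return in_deg ? (double)only_in / (double)in_deg : 0.0;
}
```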
As an example, SK-Domain has in-hubs and no out-hubs (Table 1), and for creating horizontal dense blocks based on out-hubs, 36% of vertices are required to capture 80% of edges, but iHTL creates a single vertical flipped block that contains 68% of the edges by selecting 0.3% of the vertices as in-hubs (Section 4.6).
iHTL creates flipped blocks using the same type of hubs that experience low locality: in a pull traversal, in-hubs do not experience locality, and iHTL creates vertical flipped blocks based on the in-hubs that do exist.
Moreover, iHTL maintains the relative order of vertices within the VWEH and FV categories, while other locality optimizing algorithms apply degree sorting throughout [4,43,45]. This destroys locality expressed in the initial assignment of vertex labels [36].

Conclusion and Future Work
This paper presents iHTL, which improves temporal locality by using both push and pull traversals in one graph traversal, but for different vertex types. The evaluation on 10 real-world graph datasets shows that iHTL is much faster than the pull and push traversals of graph processing frameworks. Furthermore, iHTL outperforms state-of-the-art locality optimizing relabeling algorithms.
This paper concentrates on improving locality in pull traversal, which is widely used in several graph analytics. However, the idea that irregular datasets require irregular traversals is not limited to pull traversal and can be useful for improving locality in other graph analytics like Triangle Counting, Single Source Shortest Path, and Connected Components. Moreover, iHTL can be improved in the following ways:
- The number of flipped blocks (Section 3.3) can be identified by an algorithm with lower complexity, by limiting the maximum number of flipped blocks and applying a single pass over the out-edges of VWEH_1 to identify all |VWEH_k|.
- The size of the topology data of the iHTL graph (Section 4.4) can be reduced using light-weight graph compression techniques [9,10] and vectorization [17,38,43].
- iHTL reduces the cache misses of hubs and high-degree vertices. Locality optimizing relabeling algorithms like Rabbit-Order improve the spatial locality of low-degree vertices [21]. In this way, the locality of the sparse block may be improved by applying Rabbit-Order.

Code Availability
The source code repository and further discussions relating to this paper are available online at https://blogs.qub.ac.uk/GraphProcessing/Exploiting-in-Hub-Temporal-Locality-in-SpMV-based-Graph-Processing/.