TY - GEN
T1 - Accelerating key-value data structures using AVX-512 SIMD extensions
AU - Hoseinyfarahabady, Mohammad Reza
AU - Taheri, Javid
AU - Zomaya, Albert Y.
PY - 2025/10/7
Y1 - 2025/10/7
N2 - Advanced Vector Extensions 512 (AVX-512), a modern SIMD instruction set for x86 architectures, enables data-level parallelism through 512-bit wide ZMM registers capable of processing multiple data elements concurrently within a single instruction cycle. In this study, we present a high-throughput, lock-free, in-memory architecture for key-value data-stores that exploits AVX-512 vector operations to accelerate fundamental operations such as insertion and lookup. Our design introduces an optimized memory layout that partitions the key space into two disjoint regions (primary and secondary) and employs three independent hash functions to identify candidate slots. This asymmetric layout improves key distribution, reduces collision probability, and enhances overall lookup efficiency. Experimental evaluation shows that this strategy yields the lowest insertion failure rate among the tested memory partitioning schemes. By leveraging AVX-512 instructions in combination with the most optimized memory layout, our implementation achieves insertion throughput within 6% of Intel TBB's highly optimized multithreaded hash map, despite avoiding explicit synchronization or thread-level parallelism. Under workloads with 550 million entries and a 90% miss rate, our approach delivers a 4.0-5.1x speedup over standard STL, Boost, Robin-Hood, and Abseil hash maps, and up to a 2.5x improvement relative to TBB and Abseil. These gains are consistently observed for both 32-bit and 64-bit floating-point key types. The results confirm the viability of AVX-512-centric designs as a cost-effective alternative to thread-level parallelism, particularly in environments where minimizing synchronization overhead and ensuring deterministic execution are critical. Our findings suggest a paradigm shift in CPU and system architecture, emphasizing wider vector units and improved memory bandwidth utilization as primary levers for scalable high-performance computing.
These findings suggest that future extensions of AVX-512 capabilities, such as non-blocking memory loads, expanded vector registers, and asynchronous prefetching, could enhance the efficiency of data-intensive workloads.
AB - Advanced Vector Extensions 512 (AVX-512), a modern SIMD instruction set for x86 architectures, enables data-level parallelism through 512-bit wide ZMM registers capable of processing multiple data elements concurrently within a single instruction cycle. In this study, we present a high-throughput, lock-free, in-memory architecture for key-value data-stores that exploits AVX-512 vector operations to accelerate fundamental operations such as insertion and lookup. Our design introduces an optimized memory layout that partitions the key space into two disjoint regions (primary and secondary) and employs three independent hash functions to identify candidate slots. This asymmetric layout improves key distribution, reduces collision probability, and enhances overall lookup efficiency. Experimental evaluation shows that this strategy yields the lowest insertion failure rate among the tested memory partitioning schemes. By leveraging AVX-512 instructions in combination with the most optimized memory layout, our implementation achieves insertion throughput within 6% of Intel TBB's highly optimized multithreaded hash map, despite avoiding explicit synchronization or thread-level parallelism. Under workloads with 550 million entries and a 90% miss rate, our approach delivers a 4.0-5.1x speedup over standard STL, Boost, Robin-Hood, and Abseil hash maps, and up to a 2.5x improvement relative to TBB and Abseil. These gains are consistently observed for both 32-bit and 64-bit floating-point key types. The results confirm the viability of AVX-512-centric designs as a cost-effective alternative to thread-level parallelism, particularly in environments where minimizing synchronization overhead and ensuring deterministic execution are critical. Our findings suggest a paradigm shift in CPU and system architecture, emphasizing wider vector units and improved memory bandwidth utilization as primary levers for scalable high-performance computing.
These findings suggest that future extensions of AVX-512 capabilities, such as non-blocking memory loads, expanded vector registers, and asynchronous prefetching, could enhance the efficiency of data-intensive workloads.
KW - AVX-512 Intrinsics
KW - CPU-based Key-Value Data Structures
KW - Hash Table Acceleration
KW - High Performance Computing
KW - Low-Latency Data Access
KW - Memory Layout Design
KW - Single Instruction Multiple Data (SIMD) Parallelism
KW - Vectorized Hashing
U2 - 10.1109/CLUSTER59342.2025.11186494
DO - 10.1109/CLUSTER59342.2025.11186494
M3 - Conference contribution
AN - SCOPUS:105019791230
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
BT - Proceedings of the 2025 IEEE International Conference on Cluster Computing, CLUSTER 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2025 IEEE International Conference on Cluster Computing, CLUSTER 2025
Y2 - 3 September 2025 through 5 September 2025
ER -