SocksDirect: Datacenter Sockets can be Fast and Compatible
Bojie Li*(1,2), Tianyi Cui*(3), Zibo Wang(1,2), Wei Bai(1), Lintao Zhang(1)
(1) Microsoft Research  (2) USTC  (3) University of Washington
* Co-first authors

The Socket Communication Primitive
Server: socket() -> bind() -> listen() -> accept() -> recv() / send() -> close()
Client: socket() -> connect() -> send() / recv() -> close()
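The call sequence above maps directly onto the standard POSIX socket API. A minimal sketch of the two sides (error handling omitted; the port number and payload are arbitrary choices for illustration):

```c
/* Minimal TCP echo pair illustrating the socket call sequence above. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

static void run_server(void) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(5000),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(s, (struct sockaddr *)&addr, sizeof(addr));
    listen(s, 16);
    int c = accept(s, NULL, NULL);        /* serve one connection, then exit */
    char buf[64];
    ssize_t n = recv(c, buf, sizeof(buf), 0);
    send(c, buf, (size_t)n, 0);           /* echo the payload back */
    close(c);
    close(s);
}

static void run_client(void) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(5000) };
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    connect(s, (struct sockaddr *)&addr, sizeof(addr));
    send(s, "ping", 4, 0);
    char buf[64];
    recv(s, buf, sizeof(buf), 0);
    close(s);
}

int main(void) {
    if (fork() == 0) { run_server(); return 0; }
    sleep(1);                              /* crude: give the listener time to start */
    run_client();
    return 0;
}
```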

Socket: a Performance Bottleneck
Socket syscall time >> user application time:
- NSD DNS server: 92% kernel / 8% application
- Redis key-value store: 87% / 13%
- Nginx HTTP load balancer: 77% / 23%
- Lighttpd HTTP server: 78% / 22%
Socket latency >> hardware transport latency:
[Bar chart (us): inter-host, Linux socket ~30 vs. RDMA ~1.6; intra-host, Linux socket ~11 vs. shared memory (SHM) ~0.25.]

High Performance Socket Systems
Three existing approaches:
- Kernel optimization (FastSocket, MegaPipe, StackMap): the application still crosses into the kernel VFS and TCP/IP stack and uses the NIC packet API.
- User-space TCP/IP stack (IX, Arrakis, Sandstorm, mTCP, LibVMA, OpenOnload): VFS and TCP/IP move into user space on top of a user-space NIC packet API.
- Offload to RDMA NIC (RSocket, SDP, FreeFlow): a user-space VFS sits directly on the NIC's RDMA API; the transport runs in hardware.

Linux Socket: from send() to recv()

Sender host: the application calls send(); the C library issues the send() syscall; the kernel locks the socket FD, performs the VFS send, copies the data into the TCP send buffer (allocating memory), runs the TCP/IP stack, applies packet processing (netfilter, tc), and hands the packet to the NIC.
Receiver host: the NIC delivers the packet; the kernel runs packet processing (netfilter, tc) and TCP/IP and places the payload in the TCP recv buffer; an event notification and process scheduling wake up the receiving process; the application's recv() socket call goes through the C library and the recv() syscall, the kernel locks the FD, performs the VFS recv, and copies the data out (freeing the buffer).

Round-Trip Time Breakdown
[Table: round-trip latency breakdown (ns) for Linux, LibVMA, RSocket, and SocksDirect, each measured inter-host and intra-host, split into per-operation, per-packet, and per-kbyte costs. Overheads broken out: kernel crossing (syscall), socket FD locking, C library shim, buffer management, TCP/IP protocol, packet processing, NIC doorbell and DMA, NIC processing and wire, handling the NIC interrupt, process wakeup, payload copy, and wire transfer.]

SocksDirect Design Goals
- Compatibility: drop-in replacement, no application modification.
- Isolation: security isolation among containers and applications; enforce access control (ACL) policies.
- High performance: high throughput, low latency, and scalability with the number of CPU cores.
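For the compatibility goal, one standard way to be a drop-in replacement is to interpose on the libc socket functions from a preloaded user-space library. A minimal sketch of such a shim; the is_sd_fd() and sd_send() helpers are hypothetical placeholders, not SocksDirect's actual internals:

```c
/* Sketch: intercept send() in a preloaded shared library so unmodified
 * applications transparently use a user-space socket implementation.
 * Build: gcc -shared -fPIC shim.c -o libshim.so -ldl
 * Run:   LD_PRELOAD=./libshim.so ./app
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

static ssize_t (*libc_send)(int, const void *, size_t, int);

/* Placeholder: does this fd belong to the user-space stack? */
static int is_sd_fd(int fd) { (void)fd; return 0; }

/* Placeholder: fast-path send over shared memory or RDMA. */
static ssize_t sd_send(int fd, const void *buf, size_t len, int flags) {
    (void)fd; (void)buf; (void)flags; return (ssize_t)len;
}

ssize_t send(int sockfd, const void *buf, size_t len, int flags) {
    if (!libc_send)                              /* resolve the real symbol once */
        libc_send = (ssize_t (*)(int, const void *, size_t, int))
                    dlsym(RTLD_NEXT, "send");
    if (is_sd_fd(sockfd))                        /* bypass the kernel entirely */
        return sd_send(sockfd, buf, len, flags);
    return libc_send(sockfd, buf, len, flags);   /* fall back to the kernel path */
}
```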

SocksDirect: Fast and Compatible Socket in User Space
Each application process links libsd. Control operations (connect, accept) go through per-process request and event queues to a monitor; data (send, recv) flows directly between processes through shared buffers. Everything runs in user mode; the kernel is not on the data path.
Monitor: a user-space daemon process that coordinates global resources and enforces ACL rules.
Processes form a shared-nothing distributed system: they coordinate purely by message passing over shared-memory queues.
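To make the shared-memory message passing concrete, here is a minimal single-producer/single-consumer queue that two processes related by fork() could poll. The layout, slot size, and names are illustrative assumptions, not libsd's actual queue format:

```c
/* Minimal SPSC message queue in memory shared across fork(). */
#define _DEFAULT_SOURCE
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define SLOTS    1024                 /* power of two */
#define MSG_SIZE 64

struct shm_queue {
    _Atomic uint32_t head;            /* next slot the consumer reads  */
    _Atomic uint32_t tail;            /* next slot the producer writes */
    char slot[SLOTS][MSG_SIZE];
};

/* Map one queue into memory that a later fork() will share. */
static struct shm_queue *queue_create(void) {
    struct shm_queue *q = mmap(NULL, sizeof(*q), PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    atomic_store(&q->head, 0);
    atomic_store(&q->tail, 0);
    return q;
}

/* Producer: returns 0 on success, -1 if the ring is full. */
static int queue_send(struct shm_queue *q, const void *msg, size_t len) {
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == SLOTS || len > MSG_SIZE)
        return -1;
    memcpy(q->slot[tail % SLOTS], msg, len);
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return 0;
}

/* Consumer: busy-polls the queue instead of sleeping on a kernel wakeup. */
static int queue_recv(struct shm_queue *q, void *msg, size_t len) {
    uint32_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    while (atomic_load_explicit(&q->tail, memory_order_acquire) == head)
        ;                             /* poll; no syscall on the data path */
    memcpy(msg, q->slot[head % SLOTS], len);
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return 0;
}
```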

SocksDirect Supports Different Transports for Data
Within a host, libsd instances and the monitor communicate over shared-memory queues. Between hosts with RDMA-capable NICs, libsd uses the NIC's RDMA API directly. When the peer host does not run SocksDirect, libsd falls back to kernel TCP/IP over the NIC packet API, so unmodified peers still work.

Remove the Overheads (1)
[Table: the round-trip latency breakdown, now comparing Linux against SocksDirect (inter-host and intra-host, ns). In SocksDirect the kernel crossing, socket FD locking, TCP/IP protocol, packet processing, NIC interrupt handling, and process wakeup entries drop to N/A, and buffer management and payload copy shrink; the total falls from roughly 15,000 ns inter-host / 5,800 ns intra-host to roughly 850 ns / 150 ns.]

Remove the Overheads (2)
[Same Linux vs. SocksDirect breakdown table as above.]

Token-based Socket Sharing
A socket may be shared among threads and among forked processes. Optimize for the common cases; remain correct for all cases.
- Many senders to one receiver: instead of locking the shared queue, a send token designates the one thread currently allowed to send.
- One sender to many receivers: instead of locking, a receive token designates the one thread currently allowed to receive.
Ownership of a token is transferred between threads via the monitor.
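A rough sketch of how a per-socket send token keeps the common case lock-free, with a takeover only when another thread actually wants to send. The struct layout and the handshake below are illustrative assumptions; SocksDirect's actual takeover goes through the monitor and covers receive tokens as well:

```c
/* Sketch: token-based sharing of a socket's send side.
 * Requires glibc >= 2.30 for gettid(). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

struct sd_socket {
    _Atomic int  send_owner;        /* tid holding the send token, -1 = free */
    _Atomic bool takeover_wanted;   /* some thread is asking for the token   */
};

/* Placeholder for appending to the per-socket ring buffer; no lock is
 * needed because only the token holder ever calls it. */
static void ring_enqueue(struct sd_socket *s, const void *buf, size_t len) {
    (void)s; (void)buf; (void)len;
}

void sd_send(struct sd_socket *s, const void *buf, size_t len) {
    int me = (int)gettid();
    if (atomic_load(&s->send_owner) != me) {
        /* Rare path: ask for the token and wait until the owner releases it. */
        int expected;
        do {
            atomic_store(&s->takeover_wanted, true);
            expected = -1;
            sched_yield();              /* cooperative wait, no kernel sleep */
        } while (!atomic_compare_exchange_weak(&s->send_owner, &expected, me));
    }
    /* Common path: we hold the token, so the enqueue is lock-free. */
    ring_enqueue(s, buf, len);

    /* Release the token only if another thread actually asked for it. */
    if (atomic_exchange(&s->takeover_wanted, false))
        atomic_store(&s->send_owner, -1);
}
```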

Handling Fork
Transport limitation: an RDMA queue pair (QP) cannot be shared between processes.
Linux semantics requirements: file descriptors must remain small sequential integers (1, 2, 3, 4, 5), and sockets created before fork() are shared by parent and child.
Approach: the FD table is copy-on-write; the shared-memory queues of pre-fork sockets live on shared pages visible to both parent and child; sockets created after fork() use private pages; RDMA QPs are re-created on demand in the child.

Remove the Overheads (3)
[Same Linux vs. SocksDirect breakdown table as above.]

Per-socket Ring Buffer
Traditional ring buffer: many sockets share one ring buffer, and the receiver segregates packets arriving from the NIC; this costs a buffer allocation per packet and suffers internal fragmentation.
SocksDirect per-socket ring buffer: one ring buffer per socket; the sender segregates packets simply by targeting the right socket's RDMA or SHM address; packets are placed back-to-back, minimizing buffer-management overhead.
Data path: there are two copies of each ring buffer, one on the sender and one on the receiver. The sender uses one-sided RDMA writes to push data (advancing send_next) into the receiver's copy, and the receiver returns credits (i.e., the amount of freed buffer space) in batches. An RDMA shared completion queue amortizes polling overhead across sockets.
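A host-side sketch of the sender bookkeeping for such a ring: data is appended back-to-back, and a send stalls only when the receiver has not yet returned enough credits. The rdma_write_to_peer() call and the field names are placeholders for illustration, not SocksDirect's wire format:

```c
/* Sender-side sketch of a per-socket ring buffer with credit-based flow
 * control. rdma_write_to_peer() stands in for a one-sided RDMA write (or a
 * memcpy into a SHM ring for an intra-host peer). */
#include <stdint.h>
#include <string.h>

#define RING_SIZE (1u << 20)              /* 1 MiB ring, power of two */

struct ring_sender {
    uint8_t  buf[RING_SIZE];              /* local copy of the ring           */
    uint32_t send_next;                   /* next free byte offset            */
    uint32_t credits;                     /* bytes the receiver has freed,    */
                                          /* returned in batched credit writes */
};

/* Placeholder for the transport: push [off, off+len) to the peer's ring. */
static void rdma_write_to_peer(uint32_t off, const void *data, uint32_t len) {
    (void)off; (void)data; (void)len;
}

/* Append one message back-to-back; returns 0, or -1 if out of credits. */
static int ring_send(struct ring_sender *r, const void *msg, uint32_t len) {
    if (len > r->credits)
        return -1;                        /* wait for a batched credit return */
    uint32_t off = r->send_next % RING_SIZE;
    /* For brevity, assume the message does not wrap around the ring end. */
    memcpy(&r->buf[off], msg, len);
    rdma_write_to_peer(off, &r->buf[off], len);
    r->send_next += len;
    r->credits  -= len;
    return 0;
}
```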

Remove the Overheads (4)
[Same Linux vs. SocksDirect breakdown table as above.]

Payload Copy Overhead
Why must both sender and receiver copy the payload? On the sender, send(buf, size) copies the data into the socket buffer so the NIC can DMA a stable copy; if the NIC read the user buffer directly, the application could overwrite it (memcpy(buf, data, size)) before the DMA read completes, putting wrong data on the wire. On the receiver, the NIC DMAs the packet into the socket buffer, an event (epoll) notifies the application, which allocates a buffer (user_buf = malloc(size)) and calls recv(user_buf, size), copying the payload from the socket buffer into the user buffer.

Page Remapping
Idea: instead of copying, send the physical data page and remap the receiver's virtual address from its old page to the new page.
Problem: page remapping needs syscalls. Mapping 1 page takes 0.78 us, while copying 1 page takes only 0.40 us.
Solution: batch the page remapping for large messages. Mapping 32 pages takes 1.2 us, while copying 32 pages takes 13.0 us.
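The trade-off above can be reproduced with a small micro-benchmark: copying N pages scales linearly with N, while remapping them is a single (batched) syscall. The sketch below uses mremap(2) within one process purely to illustrate the cost model; SocksDirect's actual mechanism remaps pages between sender and receiver:

```c
/* Micro-benchmark sketch: copying 32 pages vs. moving them with one
 * mremap(2) call. Linux-specific; error handling omitted. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define PAGE 4096

static double us(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
}

int main(void) {
    const size_t npages = 32, len = npages * PAGE;
    struct timespec t0, t1;

    char *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *dst = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(src, 1, len);                      /* fault both regions in first */
    memset(dst, 2, len);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, len);                    /* the copy path */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("copy  %2zu pages: %6.2f us\n", npages, us(t0, t1));

    /* Reserve a destination range, then move all 32 pages onto it with a
       single syscall: the batched remap path. */
    void *target = mmap(NULL, len, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    void *moved = mremap(src, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, target);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("remap %2zu pages: %6.2f us (moved to %p)\n", npages, us(t0, t1), moved);
    return 0;
}
```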

Remove the Overheads (5)
[Same Linux vs. SocksDirect breakdown table as above.]

Process Wakeup Overhead
Problem: multiple processes share a CPU core. A Linux process wakeup (mutex, semaphore, blocking read) costs 2.8 to 5.5 us, while a cooperative context switch (sched_yield) costs about 0.2 us.
Solution: pin each thread to a core; each thread polls for a time slice and then calls sched_yield, so all threads on the core run in round-robin order.
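A minimal sketch of that cooperative polling loop: pin the thread, poll for a bounded slice, then yield the core. poll_queues_once() is a placeholder for draining the shared-memory and RDMA queues, and the 10 us slice is an arbitrary choice:

```c
/* Sketch of cooperative scheduling for threads sharing a core. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>
#include <time.h>

static bool poll_queues_once(void) { return false; }   /* placeholder */

static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);            /* 0 = calling thread */
}

static void polling_loop(int core) {
    pin_to_core(core);
    const long slice_ns = 10 * 1000;                    /* poll ~10 us, then yield */
    for (;;) {
        struct timespec start, now;
        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
            poll_queues_once();                         /* no blocking, no wakeup */
            clock_gettime(CLOCK_MONOTONIC, &now);
        } while ((now.tv_sec - start.tv_sec) * 1000000000L
                 + (now.tv_nsec - start.tv_nsec) < slice_ns);
        sched_yield();                                  /* hand the core to peers */
    }
}

int main(void) { polling_loop(0); }
```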

Summary: Overheads & Techniques
1. Kernel crossing (syscall): monitor and library in user space; shared-nothing design using message passing for communication.
2. TCP/IP protocol: hardware-based transports (RDMA / SHM).
3. Locking of socket FDs: token-based socket sharing; optimize common cases, stay correct for all cases.
4. Buffer management: per-socket ring buffer.
5. Payload copy: batched page remapping.
6. Process wakeup: cooperative context switch.

Evaluation Setting
Two hosts with RDMA NICs; on each host, applications link libsd, communicate with the local monitor and with each other over shared-memory queues, and reach the remote host over RDMA.
[Figures: latency (intra-host and inter-host), throughput (intra-host and inter-host), multi-core scalability (intra-host and inter-host), and application performance (Nginx HTTP server request latency).]

Limitations
- Scale to many connections: under high concurrency, NIC cache misses degrade RDMA throughput (recent NICs have larger caches); connection setup latency is future work.
- Congestion control and QoS: rely on emerging RDMA congestion control (e.g. DCQCN, MP-RDMA, HPCC) and loss recovery (e.g. MELO) mechanisms in hardware; for QoS, use OVS offload in RDMA NICs or programmable NICs.
- Scale to many threads: monitor polling overhead.

Conclusion
Contributions of this work:
- An analysis of the performance overheads in the Linux socket stack.
- The design and implementation of SocksDirect, a high-performance user-space socket system that is compatible with Linux and preserves isolation among applications.
- Techniques to support fork, token-based connection sharing, an allocation-free ring buffer, and zero copy, which may be useful in many scenarios beyond sockets.
- Evaluations show that SocksDirect achieves performance comparable to raw RDMA and SHM queues and significantly speeds up existing applications.

Thank you!

Backup slides

High Performance Socket Systems
- Kernel optimization (FastSocket, MegaPipe, StackMap): good compatibility, but leaves many overheads on the table.
- User-space TCP/IP stack (IX, Arrakis, Sandstorm, mTCP, LibVMA, OpenOnload): does not support fork, container live migration, or ACLs; uses the NIC to forward intra-host packets (SHM is faster); fails to remove the payload copy, locking, and buffer management overheads.
- Offload to RDMA NIC (RSocket, SDP, FreeFlow).

The Slowest Part in Linux Socket Stack
Socket path (application -> socket send/recv -> VFS -> TCP/IP protocol -> loopback interface): 10 us RTT, 0.9 M op/s throughput.
Pipe path (application -> pipe read/write -> VFS): 8 us RTT, 1.2 M op/s throughput.
The TCP/IP protocol is NOT the slowest part!

Payload Copy Overhead
Solution: page remapping. The sender's data page is delivered to the receiver (via RDMA write) and mapped into the receiver's address space in place of its old page, instead of copying the payload.

Socket Queues
The monitor records which queue serves each socket FD between sender and receiver processes, e.g. FD 3: S1->R1, FD 4: S1->R2, FD 5: S2->R2; correspondingly S1 owns FDs 3 and 4, S2 owns FD 5, R1 owns FD 3, and R2 owns FDs 4 and 5.

Connection Management
Connection setup procedure (diagram).

Token-based Socket Sharing
Takeover: transferring a token among threads. The requesting sender issues a takeover request through the monitor, and the current holder hands the send token over, after which the new holder can write to the data queue to the receiver.

Rethink the Networking Stack
[Table comparing SocksDirect, eRPC, and LITE by the communication primitive exposed to the application (socket, RPC, or LITE's RDMA-like interface) and by the hardware/software work split (software stack over a packet NIC vs. RDMA NIC offload).]
