<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Manank Patel | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/manank-patel/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/manank-patel/index.xml" rel="self" type="application/rss+xml"/><description>Manank Patel</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/manank-patel/avatar_hu2d9e0dcae77518c9aee7e231d85bf8c2_721322_270x270_fill_q75_lanczos_center.jpg</url><title>Manank Patel</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/manank-patel/</link></image><item><title>KV store final Blog</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230825-manank/</link><pubDate>Fri, 25 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230825-manank/</guid><description>&lt;p>Hello again!
Before we get started, take a look at my previous blogs, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230526-manank">Introduction&lt;/a> and
&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank">Mid Term&lt;/a>. The goal of the project was to implement an io_uring-based backend driver for the client side, which at
that time used traditional sockets. The objective was to improve performance through the zero-copy capabilities of io_uring. In the process, I learned a lot
about &lt;a href="https://gitlab.com/kinetic-storage/libkinetic/-/tree/develop" target="_blank" rel="noopener">libkinetic&lt;/a> and KV stores in general.&lt;/p>
&lt;p>I started by writing a separate io_uring driver in libkinetic/src/ktli_uring.c, most of which mirrors the sockets backend in ktli_socket.c. The only
difference is in the send and receive functions. For a more detailed description of the implementation, refer to the mid-term blog.&lt;/p>
&lt;p>After the implementation, it was time to put it to the test. We ran extensive benchmarks with &lt;a href="https://fio.readthedocs.io/en/latest/fio_doc.html" target="_blank" rel="noopener">fio&lt;/a>, a tool
generally used to benchmark filesystems and other IO paths. Thanks to Philip, who had already written an IO engine for testing the kinetic KV store (&lt;a href="https://github.com/pkufeldt/fio" target="_blank" rel="noopener">link&lt;/a>), I had little trouble setting up the testbench. Philip also set up an Ubuntu server running the kinetic server
and gave me access over SSH. We ran extensive tests on that server, with both the socket and uring backends, across several different block sizes. The benchmark spreadsheet can be found &lt;a href="https://docs.google.com/spreadsheets/d/1HE7-KbxSqYZ3vmTZiJYoq21P7zfymU7N/edit?usp=sharing&amp;amp;ouid=116274960434137108384&amp;amp;rtpof=true&amp;amp;sd=true" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>We spent a lot of time reading and discussing the numbers, which was probably the most time-consuming part of the project. We had several long discussions analyzing the results
and their implications. For example, in the initial tests we saw a very high standard deviation in mean send times; it turned out to be a network
bottleneck, since we were using large block sizes and quickly saturating the 2.5G network bandwidth.&lt;/p>
&lt;p>In conclusion, we found that several other major factors affect the performance of the KV store, such as the network and the server side of the KV
store. So although io_uring offers a performance benefit at the userspace-kernel boundary, in this case other factors had a more significant effect than the
kernel IO stack on the client side. To increase performance further, we need to look at the server side.&lt;/p>
&lt;p>I would like to thank Philip and Aldrin for their unwavering support and the in-depth discussions in our weekly meetings; I learned a lot from them
throughout the entire duration of the project.&lt;/p></description></item><item><title>Implemented IO uring for Key-Value Drives</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/</link><pubDate>Mon, 31 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/</guid><description>&lt;p>Hi everyone!&lt;/p>
&lt;p>I&amp;rsquo;m Manank Patel, (&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230526-manank">link&lt;/a> to my Introduction post) and am currently working on &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/kvstore">Efficient Communication with Key/Value Storage Devices&lt;/a>. The goal of the project was to leverage the capabilities of io_uring and implement a new backend driver.&lt;/p>
&lt;p>In the existing sockets backend, we use non-blocking sockets and loop until all the data is written; a simplified flow diagram is shown
below. The reasoning behind using non-blocking sockets and TCP_NODELAY is to get proper network utilization. This snippet from the code explains it further:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">NODELAY means that segments are always sent as soon as possible,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">even if there is only a small amount of data. When not set,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">data is buffered until there is a sufficient amount to send out,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">thereby avoiding the frequent sending of small packets, which
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">results in poor utilization of the network. This option is
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">overridden by TCP_CORK; however, setting this option forces
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">an explicit flush of pending output, even if TCP_CORK is
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">currently set.
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Sockets flow" srcset="
/report/osre23/ucsc/kvstore/20230730-manank/ktli_socket_huf9f86d17a6f220de349bb1b61ce1052f_93743_fe3f3d8030752b92e5fb87ea1d67e0c2.webp 400w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_socket_huf9f86d17a6f220de349bb1b61ce1052f_93743_44c789c0dc2dbae770c40595d35ae941.webp 760w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_socket_huf9f86d17a6f220de349bb1b61ce1052f_93743_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/ktli_socket_huf9f86d17a6f220de349bb1b61ce1052f_93743_fe3f3d8030752b92e5fb87ea1d67e0c2.webp"
width="469"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>In the above figure, we have a &lt;a href="https://gitlab.com/kinetic-storage/libkinetic/-/blob/manank/src/ktli_socket.c?ref_type=heads#L436" target="_blank" rel="noopener">loop&lt;/a> around a writev call. We check the return value: if not all the data has been written, we adjust the
offsets and loop again; once everything has been written, we exit the loop and return from the function. This works well with traditional sockets, since writev hands us its return value as soon as it returns. If we try to follow the same design with io_uring, we get the
following flow diagram.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="uring flow" srcset="
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_nonb_huf47400b8be9e2650586ffc8c37d95fc6_108831_eaf262f65651ce613bf0a033f897afde.webp 400w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_nonb_huf47400b8be9e2650586ffc8c37d95fc6_108831_bc898fc227145dff9464f87e8f66363f.webp 760w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_nonb_huf47400b8be9e2650586ffc8c37d95fc6_108831_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_nonb_huf47400b8be9e2650586ffc8c37d95fc6_108831_eaf262f65651ce613bf0a033f897afde.webp"
width="417"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Here, as you can see, there is considerable additional overhead if we want to check the return value before issuing the
next writev: we need to know how many bytes have been written so far in order to adjust the offsets and issue
the next request accordingly. Thus, in every iteration of the loop we need to get an SQE, prep it for writev,
submit it, and then wait for the corresponding CQE to obtain the return value of the writev call.&lt;/p>
&lt;p>The alternative approach is to write the full message/iovec in a single call, as shown in the following diagram.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="possible uring flow" srcset="
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_ideal_hu2d99f0bee974127b66eb083c255358d0_60614_df20a0788e55e56bf7af70d91c7275c6.webp 400w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_ideal_hu2d99f0bee974127b66eb083c255358d0_60614_056949985d6ef71540ba0c4992f11376.webp 760w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_ideal_hu2d99f0bee974127b66eb083c255358d0_60614_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_ideal_hu2d99f0bee974127b66eb083c255358d0_60614_df20a0788e55e56bf7af70d91c7275c6.webp"
width="535"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>However, when we tried this method and ran the fio tests, we noticed that it worked well with smaller block sizes such as
16k, 32k and 64k, but failed consistently with larger block sizes such as 512k or 1m, because it was not able to
write all the data to the socket in one go. For small block sizes, this method compared favorably with the sockets
backend. We tried increasing the send/recv buffers to 1 MiB and up to 10 MiB, but it still struggled with larger block sizes.&lt;/p>
&lt;p>Going forward, we discussed a few ideas to understand the performance trade-offs. One is to use a static counter incremented on
every loop iteration, so we can find out whether the loop overhead really is the contributing factor to our problem. Another idea
is to break the message into smaller chunks, say 256k, set up io_uring with SQ polling, and then link and submit
those requests in a loop without calling io_uring_submit and waiting for a CQE each time. The plan is to try these ideas, discuss, and
come up with new ideas on how we can leverage io_uring for ktli backend.&lt;/p></description></item><item><title>Efficient Communication with Key/Value Storage Devices</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230526-manank/</link><pubDate>Fri, 26 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230526-manank/</guid><description>&lt;p>Hi everyone!&lt;/p>
&lt;p>I&amp;rsquo;m Manank Patel, currently an undergraduate student at Birla Institute of Technology and Science - Pilani, KK Birla Goa Campus. As part of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/kvstore">Efficient Communication with Key/Value Storage Devices&lt;/a>, my &lt;a href="https://drive.google.com/file/d/1iJIlHuCpnvDeOyr5DphDDimqdl9s4hKH/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/aldrin-montana/">Aldrin Montana&lt;/a> and &lt;strong>Philip Kufeldt&lt;/strong>, aims to implement an io_uring-based communication backend for a network-based key-value store.&lt;/p>
&lt;p>io_uring is a new kernel interface that can improve performance by avoiding system-call overhead and by enabling zero-copy network transmission. The KV store clients currently use traditional network sockets and the POSIX API to communicate with the KV store. io_uring, introduced in the past two years, can be used in place of the POSIX API: it employs shared memory queues to facilitate communication between the kernel and user space, enabling data transfer without system calls and promoting zero-copy transfer of data. By circumventing the overhead associated with system calls, this approach has the potential to enhance performance significantly.&lt;/p></description></item></channel></rss>