Writing to an erasure coded pool in Ceph Rados

Lately, we’ve been working very closely with Red Hat’s Ceph Rados storage and its librados API, seeking ever closer integration with the backend storage to take advantage of many of Ceph’s benefits.

Recently, however, we hit an issue where one of our customers had configured their pool to be erasure coded. Erasure coding is a form of data protection and redundancy whereby the original file or object is split into a number of parts and distributed across a number of storage nodes, either within the same data centre or across multiple data centres and regions.

This is not an uncommon practice in the storage space; however, when our implementation was tested on an EC pool, we observed some differences between how you can write to an erasure-coded pool and a non-erasure-coded one.

Since the Storage Made Easy Appliance acts as a gateway or access layer to your backend storage, when we’re handling files we prefer not to keep the full file in memory, as this consumes valuable memory on the machine. For example, when a user uploads a file, we stream the data straight onto the backend storage as we receive it from the client.

We were using the same approach for Ceph Rados through its rados_append API function, which behaves much like the rados_write API method called with an offset at the end of the object.
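
To illustrate, here is a minimal sketch of that streaming approach. read_from_client() is a hypothetical stand-in for however the upload stream delivers data, and error handling is reduced to the essentials:

```c
#include <stdlib.h>
#include <sys/types.h>
#include <rados/librados.h>

/* hypothetical: fills buf with up to len bytes, returns the count,
 * 0 on end of stream, negative on error */
extern ssize_t read_from_client(char *buf, size_t len);

/* Append each chunk to the object as it arrives, so the full file is
 * never held in memory. On an EC pool, the second rados_append() here
 * fails with "Operation not supported" unless the writes happen to be
 * stripe-width aligned. */
int stream_upload(rados_ioctx_t io, const char *oid)
{
    char buf[65536];
    ssize_t n;

    while ((n = read_from_client(buf, sizeof(buf))) > 0) {
        int ret = rados_append(io, oid, buf, (size_t)n);
        if (ret < 0)
            return ret;
    }
    return (n < 0) ? (int)n : 0;
}
```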

With non-erasure-coded pools, we found this to be the perfect balance between memory usage and upload speed. However, when we transitioned to an erasure-coded pool, our second append operation would return “Error 95: Operation not supported”. Despite looking around for a while, we found little information on how to solve this issue.

The leading solution people suggested was to put a replicated cache tier in front of the erasure-coded pool; however, given we don’t control our customers’ storage, we opted against this approach. We took the issue to the Ceph mailing list, where we were pointed towards writing in multiples of the stripe width. Striping is the process of “storing sequential pieces of information across multiple storage devices to increase throughput and performance”.

As we found out, the trick is to write in multiples of the stripe width. For example, if you have an object that is 10000 bytes in size and your stripe width is 4000 bytes, the most efficient way is to first append/write the first 8000 bytes, followed by the remaining 2000 bytes. Be aware, however, that no further writes or appends can be made to the object once you have written data that is not a multiple of the stripe width; in that case, you would have to re-write the whole object from the beginning.
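
To make this concrete, here is a rough sketch of how the aligned-append loop could look, building on the same hypothetical read_from_client() helper as above. It appends only stripe-width multiples, carries any remainder over to the next chunk, and flushes the unaligned tail exactly once at the end; it assumes the stripe width is no larger than the 4 MiB chunk size used here:

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <rados/librados.h>

#define CHUNK_SIZE (4 * 1024 * 1024)

/* hypothetical: fills buf with up to len bytes, returns the count, 0 on EOF */
extern ssize_t read_from_client(char *buf, size_t len);

int stream_upload_aligned(rados_ioctx_t io, const char *oid, uint64_t width)
{
    /* room for one chunk plus a carried remainder; assumes width <= CHUNK_SIZE */
    char *buf = malloc(2 * CHUNK_SIZE);
    size_t held = 0;   /* unaligned bytes carried over from the last chunk */
    ssize_t n;
    int ret = 0;

    if (buf == NULL)
        return -ENOMEM;

    while ((n = read_from_client(buf + held, CHUNK_SIZE)) > 0) {
        size_t total = held + (size_t)n;
        size_t aligned = total - (total % width); /* largest stripe multiple */

        if (aligned > 0) {
            ret = rados_append(io, oid, buf, aligned);
            if (ret < 0)
                goto out;
        }
        held = total - aligned;
        memmove(buf, buf + aligned, held);  /* carry the remainder forward */
    }

    /* Flush the unaligned tail exactly once; after this write, the object
     * cannot be appended to again. */
    if (held > 0)
        ret = rados_append(io, oid, buf, held);
out:
    free(buf);
    return ret;
}
```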

To find out the stripe width of your erasure-coded pool, you can use the rados_ioctx_pool_required_alignment API method, which returns the stripe width in bytes.
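
For example, a small helper along these lines could be called before the upload begins; rados_ioctx_pool_requires_alignment lets you first check whether the pool imposes an alignment requirement at all:

```c
#include <stdint.h>
#include <rados/librados.h>

/* Returns the pool's required write alignment (the stripe width) in
 * bytes, or 0 when the pool, e.g. a replicated one, imposes no
 * alignment requirement. "io" is an open ioctx on the pool. */
static uint64_t pool_stripe_width(rados_ioctx_t io)
{
    if (!rados_ioctx_pool_requires_alignment(io))
        return 0;
    return rados_ioctx_pool_required_alignment(io);
}
```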

We had hoped this would be an implementation detail managed by the Ceph cluster, but unfortunately it is not. Hopefully, this information will help anyone who runs into the same issue!
