Increasingly, unstructured data stored on S3 object storage is not just a backup or archive: there is a growing requirement to give end users real-time access to it.
Some of the unstructured data sets we see moved to S3 compatible systems are huge, literally hundreds of millions to billions of objects, and they are often copied or moved in the same hierarchical structure in which they were originally stored (on file or block based storage).
Often the way in which such data is copied or moved does not take into account object storage design best practices for storing large data sets. The Amazon S3 object storage data model is a flat structure: a bucket is created and the bucket stores objects. There is no real hierarchy of sub-buckets or subfolders, but a logical hierarchy can be inferred using key name prefixes and delimiters (this is exactly what the S3 console does).
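To make the prefix and delimiter idea concrete, here is a minimal Python sketch of how a folder view can be inferred from flat keys. It works on an in-memory list rather than calling S3, and the function name `list_level` and the sample keys are made up for illustration.

```python
def list_level(keys, prefix="", delimiter="/"):
    """Group flat keys into pseudo-folders and objects for one level.

    Keys that contain the delimiter after the prefix are rolled up into a
    single "common prefix" (a pseudo-folder); the rest are plain objects.
    """
    common_prefixes = set()
    objects = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something deeper exists: expose only the next path segment.
            common_prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return sorted(common_prefixes), sorted(objects)


# Hypothetical keys, stored flat in a single bucket.
keys = [
    "scans/2020/01/a.dcm",
    "scans/2020/02/b.dcm",
    "scans/2021/c.dcm",
    "scans/readme.txt",
    "logs/app.log",
]

folders, files = list_level(keys, prefix="scans/")
print(folders)  # ['scans/2020/', 'scans/2021/']
print(files)    # ['scans/readme.txt']
```

Every such view has to be derived from the keys themselves, which is why very deep pseudo-hierarchies with huge numbers of keys can be slow to browse on demand.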
In a scenario where deeply nested hierarchies are copied without thought to best practice design patterns, on-demand access can be slow and can quickly become a problem.
The File Fabric has many features, but one that is often overlooked is its caching of metadata when indexing a storage provider. The cached metadata contains, amongst other things, the object name, location and size. Having this information to hand means that the File Fabric can quickly construct the S3 hierarchical tree when it builds its global file manager listing. In practice, this means that an end user working with a large hierarchical object data estate gets relatively quick access, even with hundreds of millions of (pseudo) nested folders and files.
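The general idea of a metadata cache can be sketched in a few lines of Python. This is not the File Fabric's implementation, only an illustration under simple assumptions: the crawl yields `(key, size)` pairs, the delimiter is `/`, and the names `build_index` and `index` are hypothetical.

```python
from collections import defaultdict


def build_index(objects):
    """Build a folder -> children map from a flat crawl of (key, size) pairs.

    Pseudo-folders are stored with a trailing "/" and a value of None;
    objects are stored with their size.
    """
    index = defaultdict(dict)
    for key, size in objects:
        parts = key.split("/")
        parent = ""
        for part in parts[:-1]:
            # Register each intermediate path segment as a pseudo-folder.
            index[parent].setdefault(part + "/", None)
            parent += part + "/"
        if parts[-1]:  # skip keys that are only a folder marker
            index[parent][parts[-1]] = size
    return index


# Hypothetical crawl result.
crawl = [
    ("scans/2020/a.dcm", 1024),
    ("scans/2020/b.dcm", 2048),
    ("scans/readme.txt", 12),
]
index = build_index(crawl)

print(index["scans/"])       # {'2020/': None, 'readme.txt': 12}
print(index["scans/2020/"])  # {'a.dcm': 1024, 'b.dcm': 2048}
```

Once the index exists, showing the contents of any folder is a single lookup rather than a fresh listing against the object store, which is the trade-off the cache makes: a slow initial crawl in exchange for fast browsing afterwards.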
So how does this magic work? When a new storage provider is connected, the File Fabric crawls the object estate for the metadata and caches it. Doesn't this take a long time, I hear you ask? It can, perhaps a few days when there are hundreds of millions of objects, but once the metadata is cached ongoing access is fast. OK, but what about new objects being added? Well, the File Fabric has various methods to deal with this:
(i) There is no need to sync at all if all access goes through the File Fabric, because once the index is created metadata is extracted automatically for all future operations made through the File Fabric. If the S3 compatible storage is being used in a bi-modal fashion, i.e. both through the File Fabric and directly, then after the initial metadata sync you can use points (ii) through (v) to decide the best strategy for discovering data that was added directly.
(ii) It can simply resync in the background to ensure it picks up the latest metadata.
(iii) It can work in what we call real-time mode, in which the current user view of the data is refreshed on demand (rather than the whole data estate).
(iv) A schedule can be set to resync just particular elements of the data estate as needed.
(v) If the S3 compatible object storage provider is Amazon S3, it can hook into CloudTrail, Lambda and SNS to update, remove or add objects on demand. In fact below is a template that shows this:
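Setting the Amazon-specific template aside, the resync ideas in (ii) through (iv) can be illustrated with a small Python sketch. This is not File Fabric code: it assumes the cache is a plain dictionary of key to size, that `fresh_listing` holds only the objects currently found under the given prefix, and the name `resync_prefix` is hypothetical.

```python
def resync_prefix(cache, prefix, fresh_listing):
    """Reconcile cached entries under `prefix` with a fresh listing.

    cache:          dict of key -> size, updated in place
    fresh_listing:  dict of key -> size for objects now under `prefix`
    Returns sorted lists of (added, updated, removed) keys.
    """
    cached_keys = {k for k in cache if k.startswith(prefix)}
    fresh_keys = set(fresh_listing)

    added = sorted(fresh_keys - cached_keys)
    removed = sorted(cached_keys - fresh_keys)
    updated = sorted(
        k for k in fresh_keys & cached_keys if cache[k] != fresh_listing[k]
    )

    for k in removed:
        del cache[k]
    cache.update(fresh_listing)  # applies both additions and size changes
    return added, updated, removed


# Hypothetical cache and a fresh listing of just the "scans/" prefix.
cache = {
    "scans/a.dcm": 10,
    "scans/b.dcm": 20,
    "scans/old.dcm": 7,
    "logs/app.log": 5,
}
fresh = {"scans/a.dcm": 10, "scans/b.dcm": 25, "scans/c.dcm": 30}

changes = resync_prefix(cache, "scans/", fresh)
print(changes)  # (['scans/c.dcm'], ['scans/b.dcm'], ['scans/old.dcm'])
```

Passing an empty prefix reconciles the whole estate, as in a background resync, while passing the prefix the user is currently viewing touches only that part of the cache, which is the spirit of the real-time and scheduled partial resync options.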
We are currently engaged with a large healthcare provider in the United States that has billions of nested objects and is using the File Fabric as a secure means of (quick) access to this data estate.
Similarly, in the United Kingdom we are working with a defense company that uses on-premises object storage. Direct access using the S3 protocol was taking too long, so they are using the File Fabric to speed up access (in addition to providing CIFS / NFS user desktop access to the data).
If you would like to know more about what the Enterprise File Fabric can do for your Amazon S3 or S3 compatible storage please feel free to contact us.