The File Fabric has two main modes of operation with regard to knowing what files are on a storage provider: cached mode and real-time mode. In cached mode the File Fabric maintains, in its database, metadata about each file on the underlying storage. As files are created, updated and deleted, the changes flow through the File Fabric and the File Fabric updates its metadata accordingly.
Cached mode is often used when:
- Companies are working directly through the File Fabric with no access to the underlying storage.
- The remote data store holds so much data that it is impossible to pull back in real time. Millions of files on Amazon S3 are a good example of this. In cached mode the files can be accessed and searched very quickly compared to real-time mode.
Cached mode presents two obvious challenges.
- If a storage provider that already contains files is added to the File Fabric, how does the File Fabric initialize the metadata for that provider?
- If the storage is being used in a bi-modal way (i.e. changes are also made directly to the storage), how does the File Fabric learn about those changes?
The answers to the two questions are somewhat similar. When a storage provider is added, the File Fabric runs an initial sync process that discovers the content of the storage and records the appropriate metadata in the File Fabric’s metadata database. It is worth pointing out that the File Fabric does not copy or move the files that exist on the remote storage. The files stay in a single location, so their integrity is maintained.
If changes have been made directly to the storage, the user can initiate the File Fabric’s re-sync process, which compares file information retrieved from the storage with the metadata in the File Fabric’s metadata database and updates the database as needed.
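To make the re-sync idea concrete, here is a minimal Python sketch of the comparison step. It is an illustration only, not the File Fabric's actual implementation: the `FileMeta` record, the function names and the use of size plus modification time to detect changes are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical metadata record; the real schema is not described here."""
    size: int
    modified: float  # last-modified time, seconds since the epoch


def plan_resync(cached, listed):
    """Compare cached metadata with a fresh storage listing.

    Both arguments map a file path to a FileMeta.
    Returns (to_add, to_update, to_remove) as sorted lists of paths.
    """
    to_add = sorted(p for p in listed if p not in cached)
    to_remove = sorted(p for p in cached if p not in listed)
    to_update = sorted(
        p for p in listed if p in cached and listed[p] != cached[p]
    )
    return to_add, to_update, to_remove


def apply_resync(cached, listed):
    """Bring the cached metadata in line with the listing.

    Only metadata is touched; no file content is read, copied or moved.
    """
    to_add, to_update, to_remove = plan_resync(cached, listed)
    for path in to_add + to_update:
        cached[path] = listed[path]
    for path in to_remove:
        del cached[path]
    return to_add, to_update, to_remove
```

An initial sync is the special case where `cached` starts out empty, so every listed file ends up in `to_add`.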
There are other options to the above that are also worth mentioning:
- The File Fabric can operate in real-time mode, which negates the need for a re-synchronization process because new data added directly to the storage is discovered in real time as users browse directories.
- The File Fabric can be scheduled to spider the underlying storage at set intervals and update the metadata. This is particularly useful when the underlying data set is very large.
These options are not mutually exclusive and can be used in combination.
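As a rough illustration of the scheduled option, the Python sketch below picks out the storage providers that are due for another spidering pass. The function name, the provider names and the idea of tracking a last-synced timestamp per provider are assumptions made for the example; the File Fabric's own scheduler is not described here.

```python
def providers_due(last_synced, interval_seconds, now):
    """Return the providers whose last sync is at least one interval old.

    last_synced maps a (hypothetical) provider name to the time its
    previous sync finished, in seconds since the epoch.
    """
    return sorted(
        name for name, finished in last_synced.items()
        if now - finished >= interval_seconds
    )


# Example: an hourly schedule checked at t = 3600 seconds.
last_synced = {"s3-archive": 0.0, "smb-share": 3000.0}
due_now = providers_due(last_synced, 3600, 3600.0)
```

Here only `s3-archive` is due, because `smb-share` was synced 600 seconds ago.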
Initial sync and re-sync are bread-and-butter operations for the File Fabric and have been available since the File Fabric’s inception. With digital transformation, and with a company’s digital assets doubling every year, companies are now dealing with very large unstructured datasets that can require billions of files and objects to be indexed. It had therefore become clear to the engineering team that these metadata sync operations would benefit from being optimized.
In the latest major File Fabric release, v1906, both types of sync operations have become much faster. How much faster? The actual performance depends on a host of variables such as CPU speed, network capacity, storage speed (because the storage has to tell the File Fabric about its contents), number of changes (re-sync only), background load and so on, so there is no single answer, but here are two hard facts:
- In some situations we have seen throughput (measured as number of files synced per second) increase by more than 50x.
- We have also seen sustained throughput of more than 1,000 files per second.
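To put a files-per-second figure into perspective, the short Python sketch below converts a file count and a sustained throughput into an estimated duration. It is plain arithmetic and the function name is made up for the example. Real sync times will vary with all of the variables listed above.

```python
def sync_duration_hours(file_count, files_per_second):
    """Estimate how long a sync takes at a constant sustained throughput."""
    if files_per_second <= 0:
        raise ValueError("files_per_second must be positive")
    return file_count / files_per_second / 3600


# At 1,000 files per second, 3.6 million files take about one hour.
hours = sync_duration_hours(3_600_000, 1_000)
```

By the same arithmetic, a billion files at that rate works out to roughly 278 hours, or a little under 12 days, assuming the throughput could be sustained for the whole run.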
How did we optimize this? Through a combination of code refactoring and updated algorithms for synchronizing metadata.
These performance improvements have enabled File Fabric to deal with extremely large datasets with ease.
If you are planning to evaluate the File Fabric for use by your organization, be sure to include these operations in your evaluation. We think you will be impressed.