Removing layers of abstraction over procedure calls makes total sense. So if I was in a stack of 5 virtual calls and it's elided to 3, that's brilliant. That's not dissimilar to the stuff we talked about in shared-lib vs static-lib runtimes and the kernel; IIRC OSF/1 did a preload thing which reduced the cost of shared-link calls significantly, at the cost of requiring a reboot to make changes. This is "fix the software stack in userspace", and to me it's the same as the virtio improvements for device access when running a virtualised OS on Linux or BSD.
What I miss is some sense of how to aggregate things the way Ceph does. I wish that ZFS had a better abstraction to say "all these discrete runtime things you have, this chunk of them can now contribute to an ECC-protected datastore".
IIRC the point is that each NBD device is backed by a different S3 endpoint, probably in different zones/regions/whatever for resiliency.
Edit: Oops, "zpool create global-pool mirror /dev/nbd0 /dev/nbd1" is a better example for that. If it's not that, I'm not sure what that first example is doing.
FS metrics without a random-IO benchmark are near meaningless; sequential read is the best case for basically every file system, and in this case it's essentially just "how fast can you get things from S3".
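To make that concrete, here's the sort of random-IO run that would be more telling (paths and sizes are made up, fio flags are standard):

    # 4 kB random reads against a file on the pool; sequential numbers mostly
    # measure S3 throughput, this measures per-request latency instead
    fio --name=randread --filename=/global-pool/bench/testfile --size=4G \
        --rw=randread --bs=4k --iodepth=32 --ioengine=libaio \
        --runtime=60 --time_based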
Just going by the submitted article, it seems very similar in what it achieves, but implemented slightly differently. As I recall, the Delphix solution did not use a character device to communicate with the user-space S3 service, and it relied on a local NVMe-backed write cache to make 16 kB blocks performant by coalescing them into large objects (10 MB IIRC).
This solution instead seems to rely on using 1 MB blocks and storing those directly as objects, doing away with the intermediate caching and indirection layer. A larger number of objects, but less local overhead.
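If it works the way it reads, I'd guess the one-record-per-object mapping lines up with a 1M recordsize on the dataset, something like (dataset name invented):

    # make ZFS write 1 MB records so each record maps to one S3 object
    zfs set recordsize=1M global-pool/data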
Delphix's rationale for 16 kB blocks was that their primary use case was PostgreSQL database storage. I presume this one is geared towards other workloads.
And, importantly since we're on HN: Delphix's user-space service was written in Rust as I recall it; this one uses Go.
Why would I use ZFS for this? Isn't the power of ZFS that it's a filesystem with checksums and stuff like encryption?
Why would I use it for S3?
You have it the wrong way around. Here, ZFS uses many small S3 objects as the storage substrate rather than physical disks. The value proposition is that this should definitely be cheaper, and perhaps more durable, than EBS.
See s3backer, a FUSE implementation of something similar: https://github.com/archiecobbs/s3backer
See the prior in-kernel ZFS work by Delphix, which AFAIK was closed by Delphix management: https://www.youtube.com/watch?v=opW9KhjOQ3Q
BTW this appears to be closed too!
zfs-share already implements SMB and NFS.
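i.e. the usual property one-liners (dataset name invented):

    # export a dataset over NFS and SMB with ZFS's built-in share properties
    zfs set sharenfs=on tank/export
    zfs set sharesmb=on tank/export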
Out of my ignorance I'm not sure what the use case is, but I guess one can use it to `zfs send` backups to S3 in a very neat manner.
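Something along these lines, assuming an S3-backed pool named s3pool (all names invented):

    # snapshot locally, then replicate the stream into the S3-backed pool
    zfs snapshot tank/data@backup
    zfs send tank/data@backup | zfs recv s3pool/data-backup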