Linux

Will New Object Storage Protocol Mean the End For POSIX? (enterprisestorageforum.com) 76

"POSIX has been the standard file system interface for Unix-based systems (which includes Linux) since its launch more than 30 years ago," writes Enterprise Storage Forum, noting the POSIX-compliant Lustre file system "powers most supercomputers."

Now Slashdot reader storagedude writes: POSIX has scalability and performance limitations that will become increasingly important in data-intensive applications like deep learning, but until now it has retained one key advantage over the infinitely scalable object storage: the ability to process data in memory. That advantage is now gone with the new mmap_obj() function, which paves the way for object storage to become the preferred approach to Big Data applications.
POSIX features like statefulness, prescriptive metadata, and strong consistency "become a performance bottleneck as I/O requests multiply and data scale..." claims the article.

"The mmap_obj() developers note that one piece of work still needs to be done: there needs to be a munmap_obj() function to release data from the user space, similar to the POSIX function."
This discussion has been archived. No new comments can be posted.

  • by phantomfive ( 622387 ) on Saturday October 03, 2020 @05:09PM (#60569154) Journal
    The article is quite confusing. I think what this is saying is that they built a function() that allows you to emulate mmap on a cloud store, like S3.
    • by fahrbot-bot ( 874524 ) on Saturday October 03, 2020 @05:21PM (#60569180)

      The article is quite confusing. I think what this is saying is that they built a function() that allows you to emulate mmap on a cloud store, like S3.

      And TFA has nonsensical gems like this:

      Using memory mapping to copy object data into the device means that all the data is temporarily stored and processed on the device rather than in POSIX.

      Having used mmap() many times over the years, as well as the more usual read/write APIs, I simply have to ask, "What?"

      • Words used to mean something. Obviously they don't have to any more...
        • So does your sentence mean anything? Or not mean anything?? Of what do my sentences mean or not??? God I am so confused!!!

          Back to reality, these semantic paradoxes really shook up the foundation of mathematics about 100 years ago, and trying to solve them directly led to the development of computer science.

      • by Pravetz-82 ( 1259458 ) on Sunday October 04, 2020 @02:01AM (#60570348)
        Someone unfamiliar with a subject is trying to write an article about it.
        What they've done here is create a mmap() like function, which can map a remotely stored "file/object" to local memory.
        This however implies that you now have an HTTP client, a JSON parser and god knows how many libraries in the kernel. What could go wrong with that....
        • What need is there for a mmap-like function when mmap() itself works on NBDs (Network Block Devices)? You can even use NBDs as swap.

          Since NBD itself works similarly to FUSE (only of course much simpler), you can implement all the HTTP code in userland, and since you don't have to support all the intricacies of the HTTP protocol (the only thing you should support is fetching ranges), all that could be very simply done, right now.

          I'm no Linux historian, but NBDs have probably been supported for at least two decades.
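
To make the "only fetch ranges" point above concrete, here is a minimal userland sketch in Python. It is not the mmap_obj() from TFA and it is not NBD; `RangeMappedObject`, `fake_fetch` and the 4096-byte page size are all hypothetical names and values chosen for illustration. The `fetch_range` callable stands in for an HTTP range request against an object store, and pages are pulled in lazily and cached, roughly the way a page fault would populate a mapping.

```python
from typing import Callable, Dict

PAGE_SIZE = 4096  # assumed page size for this sketch


class RangeMappedObject:
    """Read-only, lazily paged view of a remote object (hypothetical helper)."""

    def __init__(self, size: int, fetch_range: Callable[[int, int], bytes]):
        self.size = size
        self._fetch_range = fetch_range  # fetch_range(start, end) -> bytes of [start, end)
        self._pages: Dict[int, bytes] = {}
        self.fetch_count = 0  # how many range requests were actually issued

    def _page(self, index: int) -> bytes:
        # Fetch a page only the first time it is touched, then serve it from cache.
        if index not in self._pages:
            start = index * PAGE_SIZE
            end = min(start + PAGE_SIZE, self.size)
            self._pages[index] = self._fetch_range(start, end)
            self.fetch_count += 1
        return self._pages[index]

    def read(self, offset: int, length: int) -> bytes:
        end = min(offset + length, self.size)
        out = bytearray()
        pos = offset
        while pos < end:
            index, within = divmod(pos, PAGE_SIZE)
            chunk = self._page(index)[within:within + (end - pos)]
            if not chunk:  # backend returned less than expected; stop rather than spin
                break
            out += chunk
            pos += len(chunk)
        return bytes(out)


# In-memory stand-in for the object store: 16384 bytes, i.e. four pages.
blob = bytes(range(256)) * 64


def fake_fetch(start: int, end: int) -> bytes:
    return blob[start:end]


obj = RangeMappedObject(len(blob), fake_fetch)
data = obj.read(4090, 12)        # straddles the boundary between pages 0 and 1
requests_after_first = obj.fetch_count
obj.read(4090, 12)               # same bytes again: served from the cache
requests_after_second = obj.fetch_count
```

Swapping `fake_fetch` for a function that issues a real range request would be the only change needed on the backend side; write-back, invalidation and consistency are deliberately left out.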

        • Someone unfamiliar with a subject is trying to write an article about it.

          What they've done here is create a mmap() like function, which can map a remotely stored "file/object" to local memory.

          This however implies that you now have an HTTP client, a JSON parser and god knows how many libraries in the kernel.

          That seems a longer jump even than an underpants gnome could make.

          It has nothing to do with any of that, it is like switching from local storage with hand-coded metadata to in-memory ProtocolBuffers or something.

          And even in the awful article, it explains that the advantage of mmap() is that it doesn't go through the kernel, and that this is adding a mmap_obj() that also doesn't go through the kernel. To replace the POSIX code, which does. So you're on your head there.

          But this is already what we do in embe

      • rather than in POSIX.

        I don't even understand how a compatibility standard is a place!

    • "The article is quite confusing. "

      You read the article? With that uid?
      Wait until your dad finds out you're using his account.

      • I thought it was all the kids with uids > 3M that didn’t read the articles ever???

        --

        I lost access to this account for over 10 years starting in 2007. Every day, every hour, every minute, johntheripper worked its ass off. I knew the password was within 8 alphanumerics plus a “-” and “@” and started with a specific number. 2,176,782,336 possibilities.

        I dutifully attempted to crack my slashdot.org password, every 15 seconds, for years. Their support team never ever responded.

        On

  • by Hawke ( 1719 ) <kilpatds@oppositelock.org> on Saturday October 03, 2020 @05:10PM (#60569156) Homepage Journal
    The problem for adapting applications that assume posix semantics to object storage isn't mmap()... it's rename(). Unix systems use rename as a core atomic primitive, which is possible because of an implied "directory object". A nameserver or other system that implements an atomic rename would cover around 80% of the use cases... and many of the others could be forgotten about. (atime. Wat?. Posix advisory locking isn't really what anyone exactly wanted either. The lifecycle around that is ... a bit crazy and not what you think it is)
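
A minimal sketch of the rename-as-atomic-primitive pattern described above, in Python. `atomic_write` is a hypothetical helper name; `os.replace` is the real standard-library call and maps to rename(2) on POSIX systems, where the swap is atomic as long as both paths are on the same filesystem. Readers see either the old file or the new one, never a half-written one, and this is exactly the step that has no direct equivalent on a plain key/value object store.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem) as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # push the contents to disk before publishing
        os.replace(tmp_path, path)   # the atomic step: rename over the old name
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "config.json")
atomic_write(target, b'{"version": 1}')
atomic_write(target, b'{"version": 2}')
with open(target, "rb") as f:
    final = f.read()
remaining = os.listdir(workdir)
```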
    • by Z00L00K ( 682162 )

      I get the feeling that stuff like this is "The old is bad, let's throw it out completely and build something new and completely different".

      Then you'll discover that you only create discontent among everyone that's going to use it.

      Breaking backwards compatibility is one of the worst things you can do in a system because it kills well-working old solutions and ways of working with little or no benefit.

      • by Hawke ( 1719 ) <kilpatds@oppositelock.org> on Saturday October 03, 2020 @11:18PM (#60569974) Homepage Journal
        No, atime is completely crazy. It has two uses I'm aware of (and I assume a few I don't)
        • Mailbox-like "did this change since I last read it" (that requires monotonic timestamps instead of real ones to work correctly)
        • Tiering/cleanup. "No one's read this in two years: I think it can be deleted/archived offline/etc"

        And in return, every read becomes a write, and you lose all parallelism of read-primary workloads. Nope, atime's crazy. Relatime is a good hack, but better would be throwing that misfeature away.

        Similarly I can go on about Posix locking, fcntl(..., F_[GS]ETLK(W)?, ...) vs. flock. flock has the lock owned by the open file description, so if you fork while you hold the lock, your child holds the lock too... and it doesn't support range locks. fcntl has the lock owned by the process, so a forked child doesn't inherit it... and it does support range locks. fcntl() lock calls "set the lock state" instead of taking a lock... so if you lock [0-10] and [5-15] and then unlock [5-10], you have [0-4] and [11-15] locked... don't lose your state. And so on...

        rename? Rename we should keep. And hard links while we're at it.
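
The fcntl() range behaviour in the locking paragraph above ("lock [0-10] and [5-15], unlock [5-10]") can be checked with a short Python sketch. `fcntl.lockf` is the standard-library wrapper around fcntl() record locks; `locked_bytes_seen_by_child` is a hypothetical helper written for this example. A forked child does the probing because a process never conflicts with its own record locks. This assumes a POSIX system with fork().

```python
import fcntl
import os
import tempfile


def locked_bytes_seen_by_child(path: str, upto: int) -> list:
    """Fork a child and report which byte offsets below `upto` it cannot lock."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: a different process, so the parent's locks conflict
        os.close(read_end)
        blocked = []
        with open(path, "r+b") as f:
            for offset in range(upto):
                try:
                    fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, offset)
                    fcntl.lockf(f, fcntl.LOCK_UN, 1, offset)
                except OSError:
                    blocked.append(offset)
        os.write(write_end, bytes(blocked))
        os._exit(0)
    os.close(write_end)
    with os.fdopen(read_end, "rb") as pipe:
        result = list(pipe.read())
    os.waitpid(pid, 0)
    return result


fd, path = tempfile.mkstemp()
os.write(fd, b"\0" * 32)
with os.fdopen(fd, "r+b") as f:
    fcntl.lockf(f, fcntl.LOCK_EX, 11, 0)  # lock bytes 0..10
    fcntl.lockf(f, fcntl.LOCK_EX, 11, 5)  # lock bytes 5..15, merged with the first range
    fcntl.lockf(f, fcntl.LOCK_UN, 6, 5)   # unlock bytes 5..10
    still_locked = locked_bytes_seen_by_child(path, 20)
os.unlink(path)
```

The unlock call does not "release the second lock"; it punches a hole in the merged range, leaving two separate locked regions behind.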

  • by fahrbot-bot ( 874524 ) on Saturday October 03, 2020 @05:11PM (#60569158)

    No.

  • Great News! (Score:5, Funny)

    by Waffle Iron ( 339739 ) on Saturday October 03, 2020 @05:26PM (#60569188)

    From TFA:

    The need for a POSIX interface could be bypassed altogether with object storage by using a REST interface for applications.

    For many years I've been wishing that they'd replace the bloated, slow and hard-to-understand POSIX API with a simple, streamlined, high-performance interface like REST.

    The only downside I see is having to spend dozens of hours in meetings deliberating over which calls should be "POST" vs "PUT". But nevertheless, that will be well worth it for this upgrade!

    • Version 1.0 was simple, you could just make a rest API call like this to copy a file cp: {"original_file_name": "path", "copied_file_name": path}.

      In 2.0 we're expecting an upgrade. There will be no new functionality but the rest API will be like this: cp: {"originalFileName": "path", "copiedFileName": path}.

      The new way is more correct and anyone who doesn't "cling to the old" will have no problem spending two days rewriting parts of their code for the update.
      • Can you see how insanely verbose and cumbersome this is?
        And I mean "insane" as in literally mentally insane.

        A plain text parser?? UTF-8 at the bottom. Basically a compiler at the top. Tons of escaping and variant data types in the middle. Data & CPU waste level: Over 9000.

        If you absolutely need variable length fields, at least use binary markup! You can still have the editor translate binary numeric tokens to plain text tokens back and forth, using a simple map file. Unicode and ASCII/ANSI already do th

    • They did this back in the early 90's, created 9P and made it so that any process could serve a filesystem in Plan9. Also, per-process namespaces. It works nicely and even 9P is relatively simple to understand.
    • by ebvwfbw ( 864834 )

      From TFA:

      The need for a POSIX interface could be bypassed altogether with object storage by using a REST interface for applications.

      For many years I've been wishing that they'd replace the bloated, slow and hard-to-understand POSIX API with a simple, streamlined, high-performance interface like REST.

      The only downside I see is having to spend dozens of hours in meetings deliberating over which calls should be "POST" vs "PUT". But nevertheless, that will be well worth it for this upgrade!

      Use REST with RUST. That'll lead to DECAY.... right?

      Thanks, I'm here all week.

  • by Gravis Zero ( 934156 ) on Saturday October 03, 2020 @05:28PM (#60569196)

    What they are talking about is adding a function that would allow proper utilization of object storage. Honestly, this is like saying epoll would be the end of POSIX. Frankly, if they standardized how object storage worked then they could even get it into a future version of POSIX.

    Everything about this article is hype, even if object storage is a major component of what Big Data uses in the future.

    • Object storage makes me think of serialisation. I am currently working on a binary serialisation format and API that can represent lists, tuples, records, dictionaries, etc. Think of it as a binary JSON, with the emphasis on access speed. That is what object storage means to me: structured data on disk. I may be way off here, so I won't mind being called an idiot, well not much, anyway.

  • I mean, the API will still exist, if only for the massive amount of legacy code that expects it to exist. Much like Win32 isn't going anywhere any time soon.

    • I agree with you. It's actually a pretty useful thing too; it's handle-based and C-friendly. The idea of a message-loop and event-driven programming is also very useful.

  • At one time object-oriented databases were all the rage - destined to make SQL databases obsolete. Where are they now?

    Object storage? Snake oil, methinks.

    • Object storage (like S3) is a thing and it's here to stay, but it's not a replacement for POSIX. That doesn't even make sense: S3 is built on POSIX.
  • Betteridge (Score:5, Informative)

    by PPH ( 736903 ) on Saturday October 03, 2020 @06:02PM (#60569302)

    No.

    POSIX has been the standard file system

    POSIX isn't a file system. POSIX is also a lot more than the file I/O spec. Perhaps an object storage spec will be added to POSIX. It's been done for DBMS systems already.

    Next step: Whose object model shall we adopt? Let the competition begin. I'll get the popcorn.

  • by broknstrngz ( 1616893 ) on Saturday October 03, 2020 @06:20PM (#60569326)

    The TFA was written by a marketing bot or human drone and contains many nuggets of wisdom such as:

    "POSIX has been the standard file system interface for Unix-based systems (which includes Linux) since its launch more than 30 years ago. Its usefulness in processing data in the user address space, or memory, has given POSIX-compliant file systems and storage a commanding presence in applications like deep learning that require significant data processing"

    "POSIX has its limits, though and features like statefulness, prescriptive metadata, and strong consistency become a performance bottleneck as I/O requests multiply and data scales, limiting the scalability of POSIX-compliant systems. That's often an issue in deep learning[...]"

    "Object storage is the most scalable of the three forms of storage (file and block are the others) because it allows enormous amounts of data in any form to be stored and accessed. "

    "Using memory mapping to copy object data into the device means that all the data is temporarily stored and processed on the device rather than in POSIX."

    "The SSD or other external device has much more available space for computing. The external device (a form of secondary storage for that computer) connects directly to the computer system and the CPU has a path to the data in the device: it is available almost as main memory while attached. Memory stays in the SSD during computing, and actively accessing the data—particularly the metadata—becomes much faster."

    "Network computing power and speed will skyrocket. Though this may have its limitations - transferring data in file and block storage to object storage, for one - it will mean new developments for data-intensive computing."

    • It really is just gibberish. What the fuck would an object-oriented filesystem be other than some vast linked list with oodles of metadata. This is how the Presentation Manager worked on top of HPFS on OS/2, so that you could use inheritance to make special kinds of files and folders.

      • by Etcetera ( 14711 )

        It really is just gibberish. What the fuck would an object-oriented filesystem be other than some vast linked list with oodles of metadata. This is how the Presentation Manager worked on top of HPFS on OS/2, so that you could use inheritance to make special kinds of files and folders.

        Hell, I feel like classic MacOS system software did this better with the Resource Manager and empty data forks by the '80s.

    • by Junta ( 36770 )

      I've been told /dev/null is webscale, but I don't know if it supports sharding...

      Object store is kind of a cult. They have some points that are frequently valid and may justify a 'POSIX-lite' where some POSIX guarantees that are expensive could be relaxed, but in general in a local context an object store model doesn't generally outdo a POSIX filesystem. POSIX over remote data stores is where things get messy, and why over-the-network software sometimes benefits by skipping POSIX guarantees to get some perf

    • Possibly by GPT-3 or similar. Expect more of this kind of nonsense in the future.

      https://www.theguardian.com/co... [theguardian.com]

      What's more concerning is that it slipped past the editors (not too surprising though, since they are probably also bots) and that people here are discussing the content at face value.

      The article is spam. Slashdotters (at least those who are not bots nor Russian trolls) should know better.

  • Poettering really missed out here. But seriously, why is /. posting C-level jargon pap?
  • Long live File::read!
  • by BAReFO0t ( 6240524 ) on Saturday October 03, 2020 @07:27PM (#60569480)

    A file system is a database is an object storage is a network is a graph is a structured binary file is a whatever.

    It's all just different interfaces optimized for different use cases.

    And humble files are not going away anytime soon.

    Also, seriously, look up what "POSIX" actually is. Because I doubt you really know.

  • by Todd Knarr ( 15451 ) on Saturday October 03, 2020 @08:07PM (#60569598) Homepage

    First problem: map the object into what representation in memory? C++ has a different in-memory representation than Ruby, which in turn differs from Javascript. In fact it probably varies depending on which flavor of the language you're using, not just the language. And some parts of the representation that, for instance, tie the object to the code needed to implement its class can't really be represented in the storage representation because they aren't known until an application goes to access the object. POSIX file-access functions may go away for some application programmers who're working in a specific language within a specific framework and with a specific object-storage system implemented for that language and framework, but the fundamental calls to deal with physical storage will still be there and the only question will be how many layers of the stack exist between the application programmer and the physical storage access.

    What will make for a game-changer is a new method of physical storage that follows different rules from address-based random-access storage (eg. content-addressable memory). Developers have, with the rise of fast hard drives (or devices that look/act like hard drives), forgotten the fun of dealing with different kinds of physical storage (eg. ones that have to be physically accessed sequentially, you can access the next or previous bit but you can't jump around except by repetitively scanning across each bit in the desired direction in turn).

    • by Junta ( 36770 )

      I've been told by a few people I find credible that the problem is generally not with the POSIX calls people are accustomed to making, but with the guarantees the filesystem side must comply with, particularly in a remote cached context. An 'object store' approach is simpler to implement in a fast way than NFS.

      Of course, 'object store' is a bit vague and devoid of standardization, and generally only useful to a human after another layer of software has abstracted it somehow, so it's a bit silly to imagine

      • Those semantics are there to guarantee that the filesystem behaves the way people expect it to. The problems if you relax those constraints are the same problems you get in relational databases if you relax the rules surrounding transactions or eliminate transactions entirely. We've already seen the results when we started implementing RESTful services in front of databases: SQL record locking became impossible because the Read and Update operations had to be in separate transactions, so someone had to come

  • im not sure what they are saying in this article but it seems like they want to get rid of all these complicated subdirectories and filesystems and write data directly to storage.

    DOS 1.0 also did not have a hierarchy, because there were no subdirectories. file names were 11 characters long. that was it. no fancy schmancy redundant metadata. they had drive letters though.

  • ... lalalalalLALALlal!

    Don't spoil it I haven't seen "Piece of Shit 8" yet!

  • by 4wdloop ( 1031398 ) on Saturday October 03, 2020 @11:01PM (#60569948)

    What TF is this about? End of POSIX? In one particular use case perhaps maybe possibly create a new API ... but this is eFFing generalization that is worthy of Fox news!

    Now can someone explain if there is even the tiniest beam of reason here? How is mmap() different from mmapobj()? (No, I do not have a weekend to spend trying to understand mmapobj -> NVMeOF -> RDMA.) I suspect it is about resource discovery, not the actual write/read verbs perhaps?

    I see it as simple mmap() that accesses non-local "files"? (file = "an entity consisting of sequence of bytes" with random/block access capability perhaps?). I know I am missing something here so help me please.
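
For reference on the mmap() side of that question, this is what plain memory mapping of a local file looks like, sketched with Python's standard `mmap` module (a thin wrapper over the POSIX call). The file contents and variable names are made up for the example. Once mapped, the file is read and modified with ordinary slice operations rather than read()/write() calls; an object-store variant would have to provide the same view over something that is not a local file.

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"hello object storage")
os.close(fd)

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as view:  # length 0 maps the whole file
        first_word = bytes(view[0:5])       # a load from the mapping, no read() call
        view[0:5] = b"HELLO"                # a store through the mapping, no write() call
        view.flush()                        # ask for dirty pages to be written back

with open(path, "rb") as f:
    on_disk = f.read()
os.unlink(path)
```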

  • by sjames ( 1099 ) on Saturday October 03, 2020 @11:03PM (#60569952) Homepage Journal

    To borrow a turn of phrase, it's not even wrong.

    I have no idea what they think is so magical about object storage. Inodes are objects.

    As for the whole thing about fabrics somehow magically working, they don't. Sure, you can share memory over a fabric, but the overhead tends to eat you alive. There are fantastically expensive systems that can do so without terrible performance, but even then, they tend to be fragile and nowhere near as fast as local memory. Certainly that has nothing to do with the end of POSIX, most experiments in that direction take place on POSIX systems.

    As for the rest, I guess they've never heard of the AS/400?!?

  • Is this about how quickly one can mmap an object (still, an object is a "file" == sequence of bytes) w/o having to go through the POSIX "file system"?

    Why would it not be possible by creating a specialized file system, that does not have semantics of directories? (say every file is identified by some unique key w/o any semantics of the key value?). So how is the mmapobj() different from using mmap() with such file system? (I claim 5th in terms of understanding the POSIX overhead of open() and other APIs).

    What is the

  • Aren't we just sliding back to the days of mainframes with different access methods? VSAM, and such like?

    The ideological obsession with Unix "everything is a file", and the more general "everything is one" simply breaks down in contact with the real world.
  • by LostMyBeaver ( 1226054 ) on Sunday October 04, 2020 @02:48AM (#60570418)
    There are many major issues with this article.

    - The problem associated with the POSIX file system is that Unix file systems such as Btrfs, ZFS, ext4, XFS, etc. are translation layers to store information on block devices. There are definitely many technologies in these different file systems that add resiliency and even performance; for example, ZFS has RAID, write logs and read caches. XFS has excellent hashing for integrity and also has a write log that can be stored on low-latency devices (though it's quite limited). All that said, on the multi-exabyte storage systems I'm working on in high performance computing for scientific processing, we tend to simply place XFS on top of RAID for massive storage. It's reliable and it's safe. In the core of the HPC, we would never consider this.

    In high performance computing, we tend to have massive near-line storage systems as our data sets are... well huge. The project I'm working on generates 2TB of data per second every second 24/7 for decades at a time. We then have to keep that data online and accessible for 25 years (current mandate, looks like we'll be getting a grant for another 25). So we go for massive and cheap.

    XFS and RAID are not a great option for this, but using technologies like dCache, we get geo-replication on top of pretty much anything. In fact, it allows us to scale pretty well between disk and tape. At the moment, we have at least several petabytes of hard disk storage in about 100 countries and we often have much, much more in online tape carousels.

    RAID is quite terrible for performance since RAID-6 writes (and we always use RAID6) are painfully expensive and resyncing after a single disk failure can easily take weeks. So we tend to waste a huge amount of space by making smaller RAIDs.. typically only 10-14 disks each. In my little project currently, I have 22x13 drive RAIDs. It takes 2-3 days to resync a disk.
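
The space cost of those smaller arrays is easy to put a number on. A quick Python calculation, assuming "22x13 drive RAIDs" means 22 separate RAID-6 arrays of 13 drives each, and that each RAID-6 array gives up two drives' worth of capacity to parity (the variable names are just for this sketch):

```python
ARRAYS = 22             # assumed reading of "22x13": 22 arrays
DRIVES_PER_ARRAY = 13
PARITY_PER_ARRAY = 2    # RAID-6: two drives' worth of parity per array

total_drives = ARRAYS * DRIVES_PER_ARRAY
parity_drives = ARRAYS * PARITY_PER_ARRAY
data_drives = total_drives - parity_drives
usable_fraction = data_drives / total_drives

# Hypothetical comparison: the same drives as one giant RAID-6 array.
single_array_fraction = (total_drives - PARITY_PER_ARRAY) / total_drives

print(total_drives, parity_drives, data_drives)
print(round(usable_fraction, 3), round(single_array_fraction, 3))
```

Under those assumptions, 44 of the 286 drives hold parity, so about 85% of raw capacity is usable, versus about 99% for a single (impractically slow to resync) array.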

    - For online storage (rather than the nearline mentioned above), we generally have petabytes of RAM. Using tools like Slurm, which is an HPC job scheduler, data is copied from disk (almost always spinning disk) into RAM or onto SSD if the data set exceeds a few petabytes. In biological computing, it's not uncommon for a single job to need 10 or more petabytes of storage. Often this SSD currently is SAS 12Gb/s connected, though we're seeing a lot more use of Intel Optane or similar technologies coming in-between. If you're attempting to compare the characteristics of a single strand of DNA against a few hundred other strands, the original strand is typically stored in RAM and the other strands are loaded as segments across nodes.

    - High performance computers, and this may come as a shock... tend to make use of high performance technologies. As such, we use Infiniband for clustering. We are investigating RDMA over Converged Ethernet at this time, it is attractive since Ethernet tends to be a little bit ahead of Infiniband in bandwidth... generally at a high cost of latency. Converged Ethernet is actually a ruinous disaster based on 802.3 flow control with an 802.1q class of service to "prioritize it". The problem is that even with the best Ethernet switches on earth, the generally flawed design of Ethernet almost always requires store and forward for packet forwarding. Infiniband is almost always cut-thru... Ethernet tended to get around this by over provisioning, but it's still not a very good solution. Infiniband is probably going to be around for a while... even though it's much more expensive.

    - This brings us to file systems on the HPC nodes themselves. The author of the article seemed to get very confused. He was under the impression that NVMe over fabric would be a good solution. It is a truly awful solution in HPC. First of all, we use structured data in high performance computing. This can be a file as you'd find on a file system like Lustre. Or it could be an object as he is representing.

    - Also, NVMe over Fabric is a fabulously stupid design as the NVMe proto
    • Besides being crappy, it also looks rather like someone created it out of thin air...

      The citation points to an Enterprise Storage Forum (ie, Eweek) article, which eventually points to the Prior Art Database, at https://priorart.ip.com/IPCOM/... [ip.com]

      • - which cautions that it was scanned and OCR'd from a PDF and it only got 43% of the text
      • - has a list of authors and a title, but no publication information.

      A search of Google and Google scholar for "User-level low-latency access via memory semantics to objects in

    • I find your comment much more understandable than the article, which I absolutely don't understand! Coming from coding in 8086 hex machine code to ASM to PL/M (a C-like from Intel) to (Borland) Pascal to BP7 (protected mode) to Delphi to (ANSI) C to C++ with the last skills at C99, just retiring in the wake of .NET. The most memorable changes were going from hex to ASM, only to find out that ASM uses only a subset of the instruction set; the same sort of thing happened all the way up. Also coming from (lowlevel) D
    • Hi LostMyBeaver,

      Given everything you say, you sound like you're working in the same industry and locale as me.
      I'm trying to better understand the use cases involving fast object storage access in scientific research and HPC, specifically in genetics.
      I'm not at the same scale compared to the numbers you're using but is there any chance you would be willing to get in touch to talk?

      ---
      2783b3d7-9aef-4f15-804f-7de056ef4204@anonaddy.me

  • This will cut down on mining and green house gases and vastly reduce HD wear and tear.
  • But does it have blockchain?

    If there is no blockchain then it is doomed.

  • Implementing mutable objects faster is not as important as really supporting functional map/reduce style of computation.
  • Object stores have been the next big thing for 25 years or so. They'll get there, but to say that they'll obsolete the POSIX file-system interface indicates a gross misunderstanding of what people use filesystems for. Kind of comparable to saying that bitcoin will obsolete credit cards, or that iPhone will obsolete automobiles.
