X11 is just a network protocol that specifies how a client (e.g., your browser) can send messages to a server (e.g., /usr/bin/Xorg) so the server will draw stuff on the screen. A complicated part of this process is deciding where each client's windows will be shown on the screen, so in X11 we have a special client, the "window manager" (WM), which does stuff like drawing maximize/minimize buttons everywhere, coordinating which apps are being shown, and, these days, also compositing (which is a more complicated version of what I just said, except the apps don't even know where they are on the screen or whether they are being displayed). The fact that the WM/compositor is just another client and not part of the server creates all sorts of problems, especially in the security area, but also with synchronization. The protocol was created in the mid-1980s (X11 itself dates from 1987), so there are all sorts of insecure design decisions and other decisions that made a lot more sense at that time.
Since then, the X11 protocol has been extended again and again, but the "core protocol" is not allowed to change (by design). There are a zillion requests that let applications tell the server how to draw stuff, but modern applications use basically none of them anymore.
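To make those old drawing requests a bit more concrete, here is a rough sketch of how one of them, the core protocol's PolySegment request, is laid out on the wire. It only builds the bytes and never talks to a server. The assumptions are mine: little-endian byte order (the client picks the byte order at connection setup), and the drawable and GC IDs are made-up values for illustration.

```python
import struct

POLY_SEGMENT_OPCODE = 66  # core-protocol opcode for PolySegment


def encode_poly_segment(drawable, gc, segments):
    """Encode a core X11 PolySegment request (little-endian assumed)."""
    # The request length is counted in 4-byte units:
    # 3 units of header (opcode/pad/length, drawable, gc) plus 2 per segment.
    length = 3 + 2 * len(segments)
    header = struct.pack("<BxHII", POLY_SEGMENT_OPCODE, length, drawable, gc)
    # Each segment is four signed 16-bit integers: x1, y1, x2, y2.
    body = b"".join(
        struct.pack("<hhhh", x1, y1, x2, y2) for x1, y1, x2, y2 in segments
    )
    return header + body


# Hypothetical resource IDs; a real client gets its ID range from the server.
request = encode_poly_segment(
    drawable=0x00400001,
    gc=0x00400002,
    segments=[(20, 20, 40, 20)],  # one horizontal line
)
print(len(request), request.hex())
```

The point is that the client describes *what to draw* and the server does the rasterizing, which is exactly the part modern toolkits stopped relying on.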
These days, when it comes to drawing, what the clients want to do is very different from the traditional X11 network protocol model. Clients want to create the buffers themselves, draw on them (using GL/Vulkan), and just be able to tell the compositor: "hey, I finished drawing on my buffer, please display it on the screen", without even having to send pixels over the network. This is accomplished by the DRI and Composite extensions (and a lot of glue everywhere), and these clients can skip a vast part of the X11 protocol.
So based on all that, Wayland's idea is to promote that new model to be the base level. Wayland is also a protocol, but there are no protocol calls such as "draw this line from x:20,y:20 to x:40,y:20". The operations match the new model: "hey, I just finished drawing on my window that has handle 0x30984, can you please display it for me?" (well, actually, it's more like: "when syncobj handle 234 gets signaled, the buffer is yours to display"). The big difference here is that instead of having a server, a client and a WM/compositor, there is just a client and a server/WM/compositor. This simplifies a lot of things and allows for better security, but creates the problem that, well, GNOME wants its compositor to behave in one way, but KDE wants its compositor to behave in another way, so each one has to be a whole server and not just an application anymore. And that causes the fragmentation that has been haunting us a lot.
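Here is a toy sketch of that "attach a buffer, then commit" flow, just to show the shape of the model. This is not the real Wayland API: the class and method names are hypothetical (the names attach and commit only loosely mirror the wl_surface requests), the buffer handle is made up, and everything runs in one process with no sockets or GPU involved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Buffer:
    """Stands in for client-allocated memory; only the handle is shared."""
    handle: int
    width: int
    height: int


@dataclass
class Surface:
    pending: Optional[Buffer] = None  # attached, but not yet committed
    current: Optional[Buffer] = None  # what the compositor may display


class ToyCompositor:
    """Hypothetical server, WM and compositor rolled into one."""

    def __init__(self):
        self.surfaces = {}   # surface id -> Surface
        self.positions = {}  # surface id -> (x, y), never sent to clients

    def create_surface(self, surface_id):
        self.surfaces[surface_id] = Surface()
        # Placement is the compositor's private decision.
        self.positions[surface_id] = (100 * len(self.positions), 50)

    def attach(self, surface_id, buffer):
        # The client hands over a handle, not pixels.
        self.surfaces[surface_id].pending = buffer

    def commit(self, surface_id):
        # "I finished drawing, please display it."
        surface = self.surfaces[surface_id]
        surface.current, surface.pending = surface.pending, None

    def scene(self):
        """What would be composited: (surface id, buffer handle, position)."""
        return [
            (sid, s.current.handle, self.positions[sid])
            for sid, s in self.surfaces.items()
            if s.current is not None
        ]


compositor = ToyCompositor()
compositor.create_surface(0x30984)
compositor.attach(0x30984, Buffer(handle=7, width=640, height=480))
before_commit = compositor.scene()  # empty: nothing shows until commit
compositor.commit(0x30984)
after_commit = compositor.scene()
print(before_commit, after_commit)
```

Note that the client never sees `positions`; in this model only the compositor knows where (or whether) a surface ends up on screen.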
On top of all that, input handling and accessibility were sorta underestimated on the Wayland side, and things have been evolving rather slowly. A lot of stuff that is possible to accomplish these days with X11 is still not covered by Wayland. A lot of super-specialized X11 apps will never be ported to Wayland, and there is probably not even a Wayland protocol in existence to support them. This is all, of course, a solvable problem: throw enough money at these issues and they will all disappear (but you may want to set aside part of this money for bribes, because gathering consensus for some stuff without bribery may be impossible, so your extension proposals may get stuck in limbo). For example: even a filthy rich behemoth such as Valve has been having problems merging extensions that would allow its games to work better on Linux. Good luck with anything you may need.
I am also deeply disappointed that Wayland has been taking so long to develop, but I don't have anybody specifically to blame. I believe part of the issue is that I-don't-know-who should be throwing more money at it. The compositor fragmentation is also a painful issue, but going back to insecure X11 protocols doesn't seem like a good alternative. Anyway: I just don't know how to make things better. Let's just try not to blame the developers who have been making herculean efforts to keep all these cards stacked on top of each other. Without them, perhaps we'd still be doing insecure, inefficient 1980s-style window drawing and wrapping, moving pixels all over network sockets.
I'd love to be corrected on my views by anybody on this forum.