hackernews client

TLA+ Modeling Tips

129 pointsposted 2 months ago

(muratbuffalo.blogspot.com)

33 Comments

agentultra

2 months ago

I really like TLA+ and am glad it exists.

But every time I suggest to a team I’m working on that we should try modelling the problem with it… people are less than enthusiastic.

These are all great tips. Especially starting with the most simple, abstract model and extending it with refinement if you need to.

That basically means that you write your first spec as abstractly and simply as possible with as little detail about the actual implementation as you can.

If details matter, you use the first spec as a module and write a new spec with the added details. The new spec will use implication that if this spec holds then the first one holds too.

Anyway, I was kind of mad when I learned TLA+ that it and systems like it aren’t part of the standard curriculum and common practices.

“We built this distributed database that guarantees quorum if you have at least 3 nodes!” someone will excitedly claim. And when you ask how you get an informal paper and 100k lines of C++ code. Wouldn’t it be nicer if you just had 1k lines of plain old pure maths?

And pure maths that can be exhaustively checked by the computer? It’s not as perfect as a proof but it’s great for a lot of practical applications.

For me it’s nicer I guess.

pron

2 months ago

I agree with everything you've said, and about refinements in particular. TLA+ newcomers often think of refinement as an advanced tool, but it's one of the most pragmatic and effective ways to specify.

But I'd like to add some nuance to this:

> And pure maths that can be exhaustively checked by the computer? It’s not as perfect as a proof but it’s great for a lot of practical applications.

1. A run of a model checker is exactly as perfect as a proof for the checked instance (which, indeed, is a weaker theorem than one that extends to an unbounded set of instances). In fact, it is a proof of the theorem when applied to the instance.

2. A proof is also not perfect for real software. A proof can perfectly convince us of a correctness of an algorithm, which is a mathematical construct, but a software system is not an algorithm; it's ultimately a physical system of silicone and metal elements that carry charge, and a physical system, unlike an algorithm, cannot be proven correct for the simple reason it's not a mathematical object. Virtually all correctness proofs assume that the hardware behaves correctly, but it only behaves correctly with high probability. My point is only that, unlike for algorithms or abstract theorems, there can be no perfect correctness in physical systems. Increasing the confidence in the algorithm beyond the confidence in the hardware (or human operators) may not actually increase the confidence in the system.

agentultra

2 months ago

Absolutely, I was waving my hands a lot there. Thanks for adding that.

And I’d also add the small model theory from Software Abstractions: if there is an error in your spec in a large instance then it is likely also going to appear in a small one. Most of errors can be caught by small instances of a model.

I only mention it because the common retort when people find out that the model checker only exhaustively checks a limited instance is, software is complex and you can’t model everything.

It’s true. But we don’t need models that account for the gravitational waves in our local region of space to get value from checking them. And neither do most programmers when they use unit tests or type checkers.

And proofs… I think I might disagree but I don’t have the professional experience to debate it fully. I do know from a colleague who does proof engineering on a real system written in C++ that it’s possible however… and likely more difficult than model checking.

goostavos

2 months ago

I find the same. Even those who are interested in it in theory hit a pretty unforgiving wall when they try to put it in practice. Learning TLA+ is way harder than leaning another programming language. I failed repeatedly while trying to "program" via PlusCal. To use TLA you have to (re)learn some high-school math and you have to learn to use that math to think abstractly. It takes time and a lot (a lot!) of effort.

Now is a great time to dive in, though. LLMs take a lot of the syntactical pain out of the learning experience. Hallucinations are annoying, but you can formally prove they're wrong with the model checker ^_^

I think it's going to be a learn these tools or fall behind thing in the age of AI.

hwayne

2 months ago

I think the "high school math" slogan is untrue and ultimately scares people away from TLA+, by making it sound like it's their fault for not understanding a tough tool. I don't think you could show an AP calculus student the equation `<>[](ENABLED <<A>>_v) => []<><<A>>_v` and have them immediately go "ah yes, I understand how that's only weak fairness"

goostavos

2 months ago

Oh, hey -- you're that guy. I learned a lot of what I know about TLA from your writings ^_^

Consider my behavior changed. I thought the "high school math" was an encouraging way to sell it (i.e. "if you can get past the syntax and new way of thinking, the 'math' is ultimately straight forward"), but I can see your point, and how the perception would be poor when they hit that initial wall.

plainOldText

2 months ago

This Scott Wlaschin talk [1] is a good introduction to TLA+. And the slides [2].

[1] https://www.youtube.com/watch?v=qs_mmezrOWs

[2] https://speakerdeck.com/swlaschin/tla-plus-for-programmers

walski

2 months ago

> TLA+ is a formal specification language developed by Leslie Lamport. It is used for designing, modelling, documentation, and verification of programs, especially concurrent systems and distributed systems. TLA+ is considered to be exhaustively-testable pseudocode, and its use likened to drawing blueprints for software systems;

https://en.wikipedia.org/wiki/TLA+

tpoacher

2 months ago

In theory (according to Wikipedia at least) the acronym stands for "Temporal Logic for Actions" ... but the pun in the standard meaning of the acronym TLA (i.e. "Three Letter Acronym") is far too enticing for me to believe the choice was fully accidental. :D

hwayne

2 months ago

Also:

- Have a clear notion of what part of the specs represents the system under your control (the "machine"), and what part represents the broader world it interacts with. The world can do more than the machine, and the properties on the world are much more serious.

- Make lots of helpers. You need them more than you think.

- Add way more comments than you normally would. Specs are for analyzing very high-level ideas, and you should be explaining your high-level ideas.

- Make the spec assumptions clear. What has to be true about the operating environment for the spec to be sensible in the first place?

- Use lots of model values, use lots of constants, use lots of ASSUME statements. Constrain your fairness clauses as narrowly as possible.

- Understand the difference between the semantics of TLA+ as an abstract notation and the semantics of TLA+ as something concretely model-checked. For example, TLA+ is untyped, but the model checker is typed. Also, a lot of TLA+ features are not available in different model checkers. OTOH, you can break TLA+ semantics with use of TLCSet and TLCGet.

The last tip applies to whatever modeling language you use: most have the same distinction.

fl4tul4

2 months ago

The strange case of "TLA+" topics reaching HN's Front Page without any reason.

baq

2 months ago

The reason is obvious - LLMs got good at writing them, initial cost to write went down 99%, toil to keep them up to date with code went down 99% and some people noticed. Value proposition went from ‘you must be joking unless we’re AWS and you’re proposing an IAM code change’ to ‘plan mode may also emit TLA+ spec as an intermediate implementation artifact’.

skydhash

2 months ago

Just like with code, the issue was never about writing a TLA+ spec. It was about writing a good TLA+ spec. Anyone can press a piano's key, but not everyone can write the Moonlight Sonata.

elcapitan

2 months ago

Yes, but the accessibility of TLA+, if you come from traditional programming languages, is quite harsh. I have 2 books about it and keep coming back to it every once in a while, but every single time there is a lot of investment into the exact syntax of certain things. Better tooling like AI autocomplete is a huge step there. I can still think about the spec, but not get stuck every single time on minor details.

skydhash

2 months ago

The devil is in the details. That's the whole point of formalism. There should be no ambiguity meaning what's written is what you want to be written. If you generate and can't prove what's generated is what you intended, the whole point of even writing it is moot.

lo_zamoyski

2 months ago

Yes, and no. Relevant details. If the formalism bogs you down in tangential or irrelevant stuff, it means it's a poor fit.

But I don't think that's the case here. Most programmers don't have much experience with specs, and much less experience with formalized specs, especially if they come from an imperative background. I think those from declarative backgrounds will be much more at home with TLA+. The imperative style causes people to obsess over the "how", while the declarative style focuses more on the "what", and specs are more of a matter of "what" than "how".

skydhash

2 months ago

It can be both. Here is one example that provides the how instead of the what:

https://www.rfc-editor.org/rfc/rfc5321.html#section-3.1

A lot of declarative languages do a lot of hard work designing an imperative kernel that will support the declarative construct. It just takes skill to separate the two together, just like it takes skill to split code between states, interfaces and implementations.

elcapitan

2 months ago

I understand the semantics, I forget the syntax. So better tooling is great because it allows even retards like me to use TLA+ after forgetting the syntax details after 2 weeks.

lo_zamoyski

2 months ago

> if you come from traditional programming languages, is quite harsh

This may be the result of failing to adequately distinguish between the spec and the implementation. PlusCal, as I understand it, is an attempt to bridge this gap for programmers who don't yet have a good grasp of this distinction. So where in TLA+ you would model the observable behavior of a system in terms of transitions in a state machine, PlusCal gives users a language that looks more like implementation. (Frankly, I find the PlusCal approach more distracting and opaque, but I'm also a more abstract thinker than most.)

plainOldText

2 months ago

It's not strange at all.

The second story, and highly upvoted, on HN right now is: "AI will make formal verification go mainstream"[1].

[1] https://news.ycombinator.com/item?id=46294574

elcapitan

2 months ago

It's interesting btw that Martin Kleppmann lists a couple of proof assistants, but not model checkers like TLA+. With him working extensively on distributed systems, I would have thought that would be on the list as well.

ngruhn

2 months ago

To me the modeling is not that hard but I struggle to come up with properties/invariants.

Take something like Google Docs. I'm sure they have some websocket protocol to handle collaborative text editing:

- establish connection

- do a few retries if connection is lost

- always accept edits even when there is no connection

- send queued edits when connection is back

- etc

But what properties can you imagine here?

We never enter an error state? Not really. Error states are unavoidable / outside the control of the system.

We always recover from errors? Not really. Connection might never come back. Or the error requires action from the user (e.g. authentication issues).

dwohnitmok

2 months ago

I can imagine a huge number of properties.

1. Eventual consistency: if no new edits are generated, then eventually all connected viewers see the same document.

2. Durability: if the system acknowledges an edit, then that edit is stored permanently in the undo/redo chain of a document.

3. Causal consistency: if the system acknowledges an edit B that depends (for some definition of depend) on edit A, then it must have acknowledged A (instead of e.g. throwing away A due to a conflict and then acknowledging B).

4. Eventual connection: if, after a certain point, the user never fails any part of the handshake process, eventually the user can successfully connect to the document (there are definitely bugs in collaborative tools where users end up never able to connect to a document even with no problems in the connection)

5. Connection is idempotent: Connecting once vs connecting n times has the same result (ensuring e.g. that the process of connecting doesn't corrupt the document)

6. Security properties hold: any user who doesn't have the permissions to view a document is always denied access to the document (because there are sometimes security bugs where an unforeseen set of steps can actually lead to viewing the doc)

6. Order preservation of edits: for any user, even a user with intermittent connection problems, the document they see always has an ordered subset of the edits that user has made (i.e. the user never sees their edits applied out of order)

7. Data structure invariants hold: these documents are ultimately backed by data structures, sometimes complex ones that require certain balancing properties to be true. Make sure that those hold under all edits of the document.

Etc. There's probably dozens of properties at least you could write and check even for an abstract Google Doc-like system (to say nothing of the particulars of a specific implementation). That's not to say you have to write or verify all of these properties! Even just choosing one or two can give a huge boost in reliability confidence.

ngruhn

2 months ago

Thanks that's very helpful. That all requires that you model both server and client together, right? You can't just treat "connection established" as a nondeterministic event that just "happens" to the client at some point.

dwohnitmok

2 months ago

No you can also model things that are client-only or server-only using such non-deterministic events.

For example you could have the following client-only properties where you can treat "connection established" as an entirely non-deterministic event that randomly just pops in and happens to the client.

1. Every edit that is stored on the client is preserved in the client's undo/redo log (e.g. a remote edit after a connection established can't clobber any client's local edits that are already preserved in history)

2. Document invariants always hold no matter what connection established/remote edits come in (e.g. even if the document shrinks as a result of remote deletions, the cursor is always in a valid location)

3. Causal consistency and order preservation can be rescoped to also be client-only properties

4. Undoing and redoing with no other local edits always returns back to the original state (e.g. making sure that remote edits coming in don't violate user's expectations about how to get back to a known state) for some definition of "original state" (this depends a lot on how you want to handle merging of remote edits)

5. Remote edits are always applied after a multi-component edit not in the middle (e.g. we never apply a remote edit in the middle of a grapheme cluster such as an emoji or when we're in the middle of writing one).

6. We handle connection affinity properly (e.g. we might have multiple servers for redundancy that can all be connected to, but over the course of a session we cache data for a single endpoint to improve performance that will need to be recalculated if we connect to a new endpoint), so checking that the initial connection to a new endpoint from the client always has the data that the new endpoint needs and we're able to re-establish the connection that doesn't corrupt our client's local data etc.

Etc.

ngruhn

2 months ago

Very helpful. Thank you!

skydhash

2 months ago

TCP already provides a model for that one [0] and you can also use logical clocks [1], which is a more generic concept. Now that the whole reliability is out of the way and you're already using a queue for the ordering of the edits, then all you need to care is about the collaborative aspects.

[0]: https://en.wikipedia.org/wiki/Transmission_Control_Protocol#...

[1]: https://dl.acm.org/doi/abs/10.1145/359545.359563

hwayne

2 months ago

I've had to help a client with something not exactly like, but with similar properties as, Google Docs. One of the big properties they had to engineer in was "the doc should eventually look the same for all open browser tabs on the same computer".

The other big question: what happens if user A makes change X and user B makes change Y? There's a lot of outcomes the product can pick between, but whatever they pick it should be consistent. That consistency in conflict resolution is a good property to model.

erichocean

2 months ago

Start with property-based testing, that'll help you learn how to specify properties without the overhead of TLA+.

theknarf

2 months ago

Google Docs uses Operational Transformations (OT). You could try and model OT in TLA+.

RealityVoid

2 months ago

We never enter error state as long as XYZ properties are satisfied, I guess... I will admit, this is probably a naive take, I'm currently trying to go through the TLA+ videos and have trouble wrapping my head around it.

anonymousDan

2 months ago

Some kind of casual or eventual consistency related invariants perhaps?

tonyarkles

2 months ago

The times I've used TLA+ in anger have been generally related to lower-level "communicating state machines". One was a LoRaWAN handshake with the possibility of lost radio packets, another was a slightly-higher-level Bluetooth message exchange between a phone and a device.

In the first situation I wanted to ensure that the state machine I'd put together couldn't ever deadlock. There's a fairness constraint you can turn in on TLA+ that basically says "at some point in every timeline it will choose every path instead of just looping on a specific failure path". The model I put together ensured that, assuming at some point there was a successful transmission and reception of packets, the state machines on both sides could handle arbitrary packet loss without getting into a bad state indefinitely.

On the Bluetooth side it was the converse: a vendor had built the protocol and I believed that it was possible to undetectably lose messages based on their protocol. The model I put together was quite simple and I used it to give me the sequence of events that would cause a message to be lost; I then wrapped that sequence of events up in an email to the vendor to explain a minimal test case to prove that the protocol was broken.

For your specific example, I'd suggest you think about invariants instead of actions. I'd probably split your example into two distinct models though: one for the connection retry behaviour and one for the queued edits.

The invariant for the first model would be pretty straightforward: starting from the disconnected state, ensure that all (fair) paths eventually lead to the connected state. The fairness part is necessary because without it you can end up in an obvious loop of AttemptConnect -> Fail -> AttemptConnect -> Fail... that never ends up connected. A riff on that that gets a little more interesting was a model I put together for an iOS app that was talking to an Elixir Phoenix backend over Phoenix Channels. Not only did it need to end up with a connected websocket but it also needed to successfully join a channel. There were a few more steps to it and the model ended up being somewhat interesting.

The second model in your example is the edits model. What are the invariants here? We completely ignore the websocket part of it and just model it as two (or three!) state machines with a queue in between them. The websocket model handles the reconnect logic. The key invariant here is that edits aren't lost. In the two-state-machine model you're modelling a client that's providing edits and a server that's maintaining the overall state. An obvious invariant is that the server's state eventually reflects the client's queued edits.

The three-state-machine model is where you're going to get more value out of it. Now you've got TWO clients making edits, potentially on stale copies of the document. How do you handle conflicts? There's going to be a whole sequence of events like "client 1 receives document v123", "client 2 receives document v123", "client 2 makes an edit to v123 and sends to server", "server accepts edit and returns v124", "client 1 makes an edit to v123", "server does...?!"

I haven't reviewed this model at all but a quick-ish google search found it. This is likely somewhat close to how Google Docs works under the hood: https://github.com/JYwellin/CRDT-TLA