From Balls to Consciousness: Testing Natural Abstractions
Alignment theories need more than elegant mathematics; they need test cases that range from simple objects to concepts like monogamy and consciousness.
One thing has caught my attention recently: how do we actually test our theories about what concepts are?
There is a mathematical framework called natural abstractions. It tries to explain why different agents carve reality into roughly the same pieces. Why both of us see a “tree” rather than a chaos of atoms. Why we can agree on the meaning of a word without matching every detail. It is a beautiful theory, and it promises a great deal. But it has a problem.
Right now, it is being tested on five examples. An ideal gas in a closed volume. Dogs as a category. Trees, coins, teacups. That is like testing a compiler on a single Hello World. The program compiles and runs, but that tells you almost nothing about what happens when the code becomes complex.
And complex concepts are exactly why this work matters. Friendship, loyalty, beauty, good. Concepts that are not directly anchored in physics, yet shape how AI systems will make decisions. If the framework only works for balls and dogs, but breaks on “justice,” then it is not worth much for alignment. There, the question is not whether a system can recognize a teacup. The question is whether it can work with concepts like “harm” or “intent.”
So someone finally tried to fix this. Not with another abstract discussion about concept typologies, but with a concrete list of examples to test against. A working prototype, not a theory of theories.
The list is still short. On one end: a ball, a particular ball named Bluey, an orange, a volume of gas. On the other: a pecking order, monogamy, consciousness. In between: dogs in general and a particular dog named Fido. That is all. No verbs, no relations, no parts and wholes. The text says directly: this is not enough. It is not even close to what is needed. But it is a beginning.
The most interesting part is not the list itself. It is how the list is built. Between “ball” and “consciousness” there is not only a difference in complexity. A ball as a category and a ball named Bluey are two different kinds of concepts. The first generalizes over many objects. The second points to one particular object. A theory has to explain both. The same is true for dogs in general and the particular dog Fido: the same physical reality, but two different ways of thinking about it.
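To make the two axes concrete, here is a minimal sketch of the test set as data. It is my own illustration, not anything from the source text: the names `Concept`, `Kind`, `Grounding`, and `select` are hypothetical, and so are the labels. In particular, tagging "orange" and "volume of gas" as categories, and "dogs" as physical, is my guess at the intent.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    CATEGORY = "category"  # generalizes over many objects
    INSTANCE = "instance"  # points to one particular object


class Grounding(Enum):
    PHYSICAL = "physical"  # anchored fairly directly in physics
    ABSTRACT = "abstract"  # relations between agents or internal states


@dataclass(frozen=True)
class Concept:
    name: str
    kind: Kind
    grounding: Grounding


# The nine examples named in the essay. The labels are assumptions
# made for illustration, not part of the original list.
TEST_SET = [
    Concept("ball", Kind.CATEGORY, Grounding.PHYSICAL),
    Concept("Bluey", Kind.INSTANCE, Grounding.PHYSICAL),
    Concept("orange", Kind.CATEGORY, Grounding.PHYSICAL),
    Concept("volume of gas", Kind.CATEGORY, Grounding.PHYSICAL),
    Concept("dogs", Kind.CATEGORY, Grounding.PHYSICAL),
    Concept("Fido", Kind.INSTANCE, Grounding.PHYSICAL),
    Concept("pecking order", Kind.CATEGORY, Grounding.ABSTRACT),
    Concept("monogamy", Kind.CATEGORY, Grounding.ABSTRACT),
    Concept("consciousness", Kind.CATEGORY, Grounding.ABSTRACT),
]


def select(cases, kind=None, grounding=None):
    """Return the cases matching the given kind and/or grounding."""
    return [
        c for c in cases
        if (kind is None or c.kind is kind)
        and (grounding is None or c.grounding is grounding)
    ]
```

Calling `select(TEST_SET, kind=Kind.INSTANCE)` returns only Bluey and Fido. Under this labeling, no instance is abstract. That is a gap in my sketch of the list, not a claim about the framework.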
Then comes the hard part. Pecking order, monogamy, consciousness. These concepts are only loosely tied to physics. They exist as relations between agents or as internal states. Can we build a natural latent, the framework's formal stand-in for a concept, for “consciousness”? I do not know. But if we cannot, we should say so honestly: here is the boundary where the framework stops working. That would be a valuable result in itself.
An approach that does not defend a theory, but tests it against increasingly difficult examples, is rare in the alignment community. Usually it happens the other way around: first people choose an ontology, then they fit examples to it. Here, someone takes a working prototype, assembles a test set, and watches where it breaks. From simple to complex. From a ball to consciousness.
If the theory does not explain “monogamy,” that does not mean the theory is worthless. It means we do not yet know how to apply it. And that is fine. That is how science works.
I think this is an important precedent. Alignment has too many conversations at the level of metaphor and too few testable claims. Any attempt to replace “well, intuitively, it is kind of obvious” with “here is a concrete example, here is the mathematics, here is where it breaks” deserves attention. Even if there are fewer than ten examples so far.