This blog post is insightful but, in my opinion, partly misleading about the purpose of dynamic tokenisation (DT). The point of DT is to jointly learn to segment/compress and model sequences, as a way to get rid of the hardwired constraints that come with subword tokenisers.
These constraints include placing boundaries:
- forcibly between words
- irrespective of the task/instruction
- preserved identically across layers (i.e., a fixed-granularity assumption)
- only once at the input layer
- optimised separately from language modelling
You are absolutely right that UTF-8 bytes are by no means natural or optimal units. However, you are blending together two questions that are best treated as separate:
- What are the appropriate basic units?
- How to compress sequences end-to-end to learn (hierarchical) abstractions?
The latter is what DT addresses, and in principle it can be applied on top of any units (subwords/bytes/pixels/etc.). Naturally, if the base units are chosen small enough, dynamic tokenisation can recover subword tokenisation as a special case. At least, this is the perspective we took in our Dynamic Token Pooling paper; a rough sketch of the idea follows below.
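To make "jointly learn to segment and model" concrete, here is a minimal, hypothetical sketch in PyTorch: a learned boundary scorer over base-unit embeddings, followed by pooling of each segment into a single coarser token. The class name, the hard 0.5 threshold, and the use of mean-pooling are illustrative assumptions on my part, not the exact implementation from the paper.

```python
import torch
import torch.nn as nn

class DynamicTokenPooler(nn.Module):
    """Toy sketch: predict segment boundaries over base units (bytes,
    subwords, ...) and mean-pool each segment into one higher-level token."""

    def __init__(self, dim: int):
        super().__init__()
        self.boundary_scorer = nn.Linear(dim, 1)  # per-position boundary logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, dim) embeddings of the base units
        logits = self.boundary_scorer(x).squeeze(-1)      # (seq_len,)
        # Hard thresholding for illustration only; in practice a
        # differentiable relaxation lets the segmentation be learned
        # end-to-end together with the language model.
        boundaries = torch.sigmoid(logits) > 0.5
        pooled, start = [], 0
        for i, is_boundary in enumerate(boundaries):
            if is_boundary or i == len(boundaries) - 1:
                pooled.append(x[start:i + 1].mean(dim=0))  # pool one segment
                start = i + 1
        return torch.stack(pooled)                         # (num_segments, dim)

# Usage: compress 16 base-unit embeddings into fewer, coarser tokens.
pooler = DynamicTokenPooler(dim=32)
shortened = pooler(torch.randn(16, 32))
print(shortened.shape)  # torch.Size([k, 32]) with k <= 16
```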
Still and all, this was a great read. Thanks for bringing attention to tokenisation, which is one of the most fascinating research areas in language modelling!