This blog post is insightful but, in my opinion, partly misleading about the purpose of dynamic tokenisation (DT). The point of DT is to jointly learn to segment/compress and model sequences, as a way to get rid of the hardwired constraints that come with subword tokenisers.
These constraints include placing boundaries:
- forcibly between words
- irrespective of the task/instruction
- preserved identically across layers (i.e., a fixed-granularity assumption)
- only once at the input layer
- optimised separately from language modelling
You are absolutely right that UTF-8 bytes are by no means natural or optimal units. However, you are blending together two questions that are best treated as separate:
- What are the appropriate basic units?
- How to compress sequences end-to-end to learn (hierarchical) abstractions?
The latter is what DT addresses, and in principle it can be applied on top of any units (subwords/bytes/pixels/etc.). Naturally, if the base units are chosen small enough, dynamic tokenisation can recover subword tokenisation as a special case. At least, this is the perspective we took in our Dynamic Token Pooling paper; a rough sketch of the idea follows below.
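To make "jointly learn to segment and model" concrete, here is a minimal, hypothetical sketch in PyTorch: a learned boundary scorer over base-unit embeddings, followed by pooling of each segment into a single coarser token. The class name, the hard 0.5 threshold, and the use of mean-pooling are illustrative assumptions on my part, not the exact implementation from the paper.

```python
import torch
import torch.nn as nn

class DynamicTokenPooler(nn.Module):
    """Toy sketch: predict segment boundaries over base units (bytes,
    subwords, ...) and mean-pool each segment into one higher-level token."""

    def __init__(self, dim: int):
        super().__init__()
        self.boundary_scorer = nn.Linear(dim, 1)  # per-position boundary logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, dim) embeddings of the base units
        logits = self.boundary_scorer(x).squeeze(-1)      # (seq_len,)
        # Hard thresholding for illustration only; in practice a
        # differentiable relaxation lets the segmentation be learned
        # end-to-end together with the language model.
        boundaries = torch.sigmoid(logits) > 0.5
        pooled, start = [], 0
        for i, is_boundary in enumerate(boundaries):
            if is_boundary or i == len(boundaries) - 1:
                pooled.append(x[start:i + 1].mean(dim=0))  # pool one segment
                start = i + 1
        return torch.stack(pooled)                         # (num_segments, dim)

# Usage: compress 16 base-unit embeddings into fewer, coarser tokens.
pooler = DynamicTokenPooler(dim=32)
shortened = pooler(torch.randn(16, 32))
print(shortened.shape)  # torch.Size([k, 32]) with k <= 16
```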
Still and all, this was a great read. Thanks for bringing attention to tokenisation, which is one of the most fascinating research areas in language modelling!