I've played around with MLM at the UTF8 byte level to train unorthodox models on full sequence translation tasks. Mostly using curriculum learning and progressive random corruption. If you just want to add noise, setting random indices to random byte values might be all you need. For example:
I feed the model the following input pattern:
[Source UTF8 bytes] => [Corrupted Target UTF8 bytes]
I expect it to output the full corrected target bytes. The overall training process follows this curriculum:
Curriculum Level 0: Corrupt nothing and wait until the population/model masters simple repetition.
Curriculum Level 1: Corrupt 1 random byte per target and wait until the population/model stabilizes.
Curriculum Level N: Corrupt N random bytes per target.
Rinse & repeat until all target sequences are fully saturated with noise.
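A minimal sketch of the corruption step and the input pattern, not my exact setup. The names `corrupt_bytes` and `make_example` are made up for illustration, and the literal ` => ` separator bytes are an assumption about how the pattern above gets encoded:

```python
import random

def corrupt_bytes(target: bytes, n: int, rng: random.Random) -> bytes:
    """Overwrite n distinct random positions of target with random byte values."""
    buf = bytearray(target)
    n = min(n, len(buf))  # cannot corrupt more positions than exist
    for i in rng.sample(range(len(buf)), n):
        # The random value may coincide with the original byte,
        # so at most n positions actually change.
        buf[i] = rng.randrange(256)
    return bytes(buf)

def make_example(source: str, target: str, level: int, rng: random.Random):
    """Build (model_input, expected_output) for one curriculum level.

    Hypothetical helper: level N corrupts N random target bytes.
    """
    src = source.encode("utf-8")
    tgt = target.encode("utf-8")
    model_input = src + b" => " + corrupt_bytes(tgt, level, rng)
    return model_input, tgt  # the model should always emit the clean target
```

At level 0 this reduces to plain repetition of the target, and once the level reaches the target length every byte has been resampled.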
An important aspect is to always score the entire target sequence each time so that we build upon prior success. If we only evaluated on the masked tokens, the jump in difficulty from one level to the next would be highly discontinuous.
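To make the difference concrete, here is a rough sketch of the two scoring choices. Both function names are hypothetical, and per-byte accuracy is just a stand-in for whatever fitness or loss you actually use:

```python
def score_full(predicted: bytes, target: bytes) -> float:
    """Fraction of positions reproduced correctly over the whole target."""
    if not target:
        return 1.0 if not predicted else 0.0
    matches = sum(p == t for p, t in zip(predicted, target))
    # Dividing by the longer length penalizes outputs that are too short or too long.
    return matches / max(len(predicted), len(target))

def score_corrupted_only(predicted: bytes, target: bytes, positions):
    """Accuracy restricted to the corrupted positions; None if there are none."""
    if not positions:
        return None  # level 0 has nothing to score under this scheme
    hits = sum(
        1 for i in positions
        if i < len(predicted) and predicted[i] == target[i]
    )
    return hits / len(positions)
```

With a single corrupted byte, the restricted score can only be 0 or 1 and is undefined at level 0, while the full-sequence score still rewards everything the model already copies correctly.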
I've stopped caring about a lot of the jargon & definitions. I find that trying to stick things into buckets like "is this diffusion" gets in the way of thinking and trying new ideas. I am more concerned with whether or not it works than what it is called.
The problem with that is we want the model to learn to deal with its own mistakes. With continuous diffusion, mistakes mostly look like noise, but with what you’re proposing, mistakes are just incorrect words that are semantically pretty similar to the real text, so the model wouldn’t learn to consider those “noise”. The noising function would have to generate semantically similar text (e.g., correct tokens in the wrong order, maybe? Tokens from a paraphrased version?)
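As a rough illustration of the "correct tokens in the wrong order" idea only: the sketch below shuffles tokens inside small windows, so every token is still a real one from the text but may sit in the wrong place. The function name `local_shuffle`, the window size, and whitespace tokenization are all assumptions for the example, and whether this kind of noise resembles a model's own mistakes is untested here.

```python
import random

def local_shuffle(tokens, window: int, rng: random.Random):
    """Shuffle tokens within consecutive windows of the given size."""
    out = []
    for start in range(0, len(tokens), window):
        chunk = list(tokens[start:start + window])
        rng.shuffle(chunk)  # reorder in place; the set of tokens is unchanged
        out.extend(chunk)
    return out

# Whitespace splitting stands in for a real tokenizer.
noisy = local_shuffle("the cat sat on the mat".split(), 3, random.Random(0))
```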