In a Training Loop 🔄
Stefan Schweter
stefan-it
3597 followers · 366 following
https://schweter.bayern
AI & ML interests
Flair Library 💕, NER & PoS Tagging, LM Pretraining (mostly encoder-only & encoder-decoder), Historical Language Models, German Language Models, Bavarian NLP 🥨
Recent Activity
liked a dataset about 18 hours ago: minilingua-ai/mcqa-minilingua-sft
liked a model about 18 hours ago: minilingua-ai/MiniLingua-1b
reacted to martinsu's post with 🔥 4 days ago:
I wasted days on a GPU node on a bug that shouldn't exist.

I was fine-tuning TildeOPEN-30B and the outputs were... weird. Token ID 179 (<0x00>) kept appearing between almost every token pair. It took me a while to figure out what was going on: I had used the fast tokenizer for training, but the model was trained on the slow one. Silent failure.

Long story short: TGI forces the fast tokenizer, no questions asked. If the model was trained on the slow one, you get agile's kryptonite, a silent failure.

I got curious and wrote a quick script to check how common this is, then ran it on 6,014 LLM HF models overnight. Roughly 10% of HF model downloads have mismatched tokenizers. Not all mismatches are catastrophic, but some are brutal, like chat template markers inflating from 1 token to 3, silently wrecking context windows and making the model act weird. This wasn't rigorous research, but the drift is real.

And the worst part? 968 models (out of those with 500+ downloads) have both fast and slow tokenizers present, but they still produce different outputs. No missing files, no errors, just silent degradation.

TGI defaults to the fast tokenizer, as does AutoTokenizer.from_pretrained(). If a fast tokenizer doesn't exist, it auto-generates one. If your model was trained on slow, you get silent degradation: the output looks fine, but the model performs worse, sometimes much worse, and you'd never know. If the model was trained on the fast tokenizer, it's fine, but how do you know?

The root cause? Either model authors run the HF conversion and upload both tokenizers without verifying them, or users run TGI, which always converts to the fast tokenizer.

The result of this fight with tokenizers is https://huggingface.co/martinsu/tildeopen-30b-mu-instruct
It's based on TildeOPEN-30B (a solid EU HPC multilingual base). Nothing fancy, just a proper instruction fine-tune where I didn't mess up the tokenizer this time.

Full article: https://github.com/martins-u/tokenmagedon
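The fast/slow mismatch described in the post can be checked by encoding the same strings with both tokenizers and comparing the token IDs. The following is a minimal sketch of that idea, not the script the post mentions: `find_mismatches`, `load_encoders`, and the two toy encoders are hypothetical names introduced here for illustration. `load_encoders` assumes the `transformers` library is installed and that the model repository can load both tokenizer variants; it is not called by the demo, which uses toy encoders so the block runs without any downloads.

```python
from typing import Callable, Iterable, List, Tuple

Encoder = Callable[[str], List[int]]


def find_mismatches(
    encode_a: Encoder, encode_b: Encoder, samples: Iterable[str]
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, ids_a, ids_b) for each sample the two encoders tokenize differently."""
    mismatches = []
    for text in samples:
        ids_a, ids_b = list(encode_a(text)), list(encode_b(text))
        if ids_a != ids_b:
            mismatches.append((text, ids_a, ids_b))
    return mismatches


def load_encoders(model_id: str) -> Tuple[Encoder, Encoder]:
    """Hypothetical helper: build (fast, slow) encoders for a Hub model.

    Assumes `transformers` is installed and both variants load for `model_id`.
    """
    from transformers import AutoTokenizer  # imported lazily; the demo below does not need it

    fast = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    slow = AutoTokenizer.from_pretrained(model_id, use_fast=False)
    return (
        lambda text: fast.encode(text, add_special_tokens=False),
        lambda text: slow.encode(text, add_special_tokens=False),
    )


# Toy stand-ins (not real tokenizers): one ID per character.
def toy_slow(text: str) -> List[int]:
    return [ord(ch) for ch in text]


# Toy "drifted" encoder: inserts a spurious 0 between every token pair,
# loosely imitating the <0x00> symptom described in the post.
def toy_fast(text: str) -> List[int]:
    ids: List[int] = []
    for ch in text:
        ids += [ord(ch), 0]
    return ids[:-1] if ids else ids


if __name__ == "__main__":
    for text, ids_a, ids_b in find_mismatches(toy_slow, toy_fast, ["ab", "a"]):
        print(repr(text), ids_a, ids_b)
```

With a real model, the two callables returned by `load_encoders("some-org/some-model")` (a placeholder ID) would be passed to `find_mismatches` together with sample strings that include the chat template markers, since the post singles those out as a place where the two variants can diverge.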
stefan-it's datasets (22)
stefan-it/xlstm-transformers-bug-data • Viewer • Updated Nov 8 • 62.5k • 16
stefan-it/grokipedia-urls • Viewer • Updated Oct 28 • 885k • 35 • 2
stefan-it/nanochat-german-city-populations • Viewer • Updated Oct 26 • 706 • 17
stefan-it/nanochat-german-wordlist • Viewer • Updated Oct 25 • 9.06M • 58
stefan-it/nanochat-german-openhermes • Viewer • Updated Oct 25 • 239k • 30
stefan-it/nanochat-german-alpaca • Viewer • Updated Oct 25 • 50.5k • 37
stefan-it/nanochat-german-data • Viewer • Updated Oct 23 • 51.2M • 750
stefan-it/nanochat-german-eval-data • Viewer • Updated Oct 21 • 7 • 49
stefan-it/awesome-tagesschau • Updated Jun 26 • 515 • 1
stefan-it/turblimp-evaluations • Updated Jun 23 • 168
stefan-it/senti-anno • Viewer • Updated Nov 29, 2024 • 929 • 98
stefan-it/offenseval2020_tr • Viewer • Updated Nov 22, 2024 • 35.3k • 1.54k
stefan-it/dewiki-20230701-nltk-corpus • Viewer • Updated Sep 6, 2024 • 39.4M • 33 • 2
stefan-it/germeval14_no_wikipedia • Preview • Updated May 29, 2024 • 49
stefan-it/histnero • Viewer • Updated May 10, 2024 • 217k • 62
stefan-it/HisGermaNER • Preview • Updated Mar 28, 2024 • 1.09k • 2
stefan-it/co-funer • Preview • Updated Mar 25, 2024 • 70
stefan-it/german-dbmdz-bert-corpus • Viewer • Updated Dec 22, 2023 • 52.8M • 103 • 3
stefan-it/span-marker-base-model-detection • Viewer • Updated Sep 5, 2023 • 28 • 25
stefan-it/flair-base-model-detection • Viewer • Updated Sep 5, 2023 • 52 • 39 • 1
stefan-it/autotrain-flair-hipe2022-fr-hmbert • Updated Sep 4, 2023 • 240
stefan-it/autotrain-flair-hipe2022-de-hmbert • Updated Sep 4, 2023 • 522