omarkamali/wikipedia-monthly
How to use AhmetSemih/merged_dataset-32k-bpe-tokenizer with Transformers:
# Load the tokenizer directly (AutoTokenizer, since this repository hosts a tokenizer)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("AhmetSemih/merged_dataset-32k-bpe-tokenizer")

This is a Byte Pair Encoding (BPE) tokenizer trained specifically for Turkish text. It was trained on a curated subset of several Turkish datasets (about 30 MB from each), covering news, academic texts, legal Q&A, medical articles, books, and user reviews. The goal is to provide a high-quality subword tokenizer suitable for training or fine-tuning Turkish language models.
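To make the BPE idea concrete, here is a minimal character-level sketch of how merge rules are learned and applied. It is a toy illustration only, not the procedure used to train this tokenizer: the function names `learn_bpe_merges` and `apply_bpe` are hypothetical, and real tokenizers typically add pre-tokenization, normalization, and often byte-level handling.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy sketch)."""
    # Represent each distinct word as a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties go to the lexicographically larger pair
        # so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(_merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def _merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def apply_bpe(word, merges):
    """Segment `word` by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = _merge_pair(symbols, pair)
    return symbols


# Tiny Turkish corpus: "hava" (weather/air), "havalar" (plural), "hal" (state).
merges = learn_bpe_merges(["hava", "hava", "havalar", "hal"], num_merges=3)
print(apply_bpe("havalar", merges))  # ['hava', 'l', 'a', 'r']
```

Frequent strings such as "hava" end up as single tokens, while rarer suffixes stay split into smaller pieces. A trained tokenizer with a large vocabulary works on the same principle, with many more merges.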
Training datasets (~30 MB from each):
Total: ~360 MB
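One simple way to cap each source at a fixed size is to accumulate documents until a byte budget is reached. The sketch below assumes plain head-of-stream truncation measured in UTF-8 bytes; the card does not say how the subsets were actually curated, and `take_up_to_bytes` is a hypothetical helper, not part of any library.

```python
def take_up_to_bytes(texts, budget_bytes):
    """Collect texts from an iterable until adding the next one would exceed the budget.

    Sizes are measured in UTF-8 bytes, so Turkish characters such as "ç" or "ğ"
    count as two bytes each.
    """
    selected, used = [], 0
    for text in texts:
        size = len(text.encode("utf-8"))
        if used + size > budget_bytes:
            break
        selected.append(text)
        used += size
    return selected, used


# Assumption: "30 MB" is taken as 30 * 1024 * 1024 bytes per dataset.
PER_DATASET_BUDGET = 30 * 1024 * 1024

sample, used = take_up_to_bytes(["abc", "de", "fghij"], budget_bytes=5)
print(sample, used)  # ['abc', 'de'] 5
```

Because the helper accepts any iterable, it could be applied to a streamed dataset without loading the full corpus into memory.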
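from transformers import AutoTokenizer
# Load the fast tokenizer implementation from the Hub
fast_tokenizer = AutoTokenizer.from_pretrained("AhmetSemih/merged_dataset-32k-bpe-tokenizer", use_fast=True)
# Encode a Turkish sentence ("The weather is very nice today.") into a list of token IDs
fast_tokenizer.encode("Bugün hava çok güzel.")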