Training pipeline
Hi! Thanks for sharing this model.
I was wondering if you could provide more details about the training pipeline. In particular, it would be really helpful to know:
• Which Word2Vec architecture was used (CBOW or Skip-gram)?
• Training hyperparameters (vector size, window size, negative sampling, epochs, etc.)
• The corpus used (e.g., full Wikipedia dump, preprocessing steps)
• Any text normalization applied (tokenization, lowercasing, handling of punctuation, etc.)
• Gensim version or training framework used
These details would make it much easier to properly interpret and reuse the embeddings.
Thanks in advance!
Hi Antonino,
Honestly, I don't remember the exact training setup of this model; it has been 3 years since I uploaded it.
However, I am pretty sure I used the CBOW method and, as described in the model card, I trained the model for 10 epochs with a window size of 5 and 100-dimensional vectors. The training set was the Italian split of the full Wikipedia dump released here: https://huggingface.co/datasets/wikimedia/wikipedia (I think I used a 2020 dump when I trained the model; it now looks like only the 2023 version is available). The dump is already mostly clean, so I only applied lowercasing, removed punctuation, and performed word-level tokenization.
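For reference, here is a minimal sketch of what a comparable setup could look like. It is not the original training script. The function names `preprocess` and `train_cbow` are hypothetical, the regex-based punctuation handling is an assumption (it replaces punctuation with spaces, so "l'italia" becomes two tokens), and the Gensim 4.x parameter names (`vector_size`, `epochs`) are assumed since I can't confirm the version. Anything not mentioned above, such as `min_count` or negative sampling, is left at Gensim's defaults.

```python
import re

# Assumption: "punctuation" means anything that is neither a word
# character nor whitespace. The original rule is not known.
_PUNCT = re.compile(r"[^\w\s]")


def preprocess(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return _PUNCT.sub(" ", text.lower()).split()


def train_cbow(texts):
    """Train a CBOW Word2Vec model on an iterable of raw text strings.

    Hypothetical helper; requires gensim >= 4.0 (older versions use
    `size` and `iter` instead of `vector_size` and `epochs`).
    """
    from gensim.models import Word2Vec  # third-party, imported lazily

    sentences = [preprocess(t) for t in texts]
    return Word2Vec(
        sentences=sentences,
        vector_size=100,  # 100-dimensional vectors
        window=5,         # window size of 5
        sg=0,             # 0 selects CBOW, 1 would select Skip-gram
        epochs=10,        # 10 training epochs
    )
```

For a corpus the size of Wikipedia you would probably want to stream the tokenized articles from a restartable iterable rather than build a list in memory, as `train_cbow` does here.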
Hope this helps!