Training pipeline

#1
by agreco - opened

Hi! Thanks for sharing this model.

I was wondering if you could provide more details about the training pipeline. In particular, it would be really helpful to know:
• Which Word2Vec architecture was used (CBOW or Skip-gram)?
• Training hyperparameters (vector size, window size, negative sampling, epochs, etc.)
• The corpus used (e.g., full Wikipedia dump, preprocessing steps)
• Any text normalization applied (tokenization, lowercasing, handling of punctuation, etc.)
• Gensim version or training framework used

These details would make it much easier to properly interpret and reuse the embeddings.

Thanks in advance!

Hi Antonino,

Honestly, I don't remember the exact training setup of this model; it has been 3 years since I uploaded it.

However, I am fairly sure I used the CBOW architecture and, as described in the model card, I trained the model for 10 epochs with a window size of 5 and 100-dimensional vectors. The training set was the Italian split of the full Wikipedia dump released here: https://huggingface.co/datasets/wikimedia/wikipedia (I think I used a 2020 dump when I trained the model; it looks like only the 2023 version is available now). The dump is already mostly clean, so I only applied lowercasing, removed punctuation, and performed word-level tokenization.
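
For anyone who wants to approximate this setup, here is a minimal sketch. It is not the original training script. It assumes a gensim 4.x API (`vector_size`, `epochs`; older releases named these `size` and `iter`), and the names `PARAMS`, `preprocess`, and `train` are made up for illustration. The exact punctuation set is also an assumption, and anything not mentioned above (negative sampling, `min_count`, and so on) is left at gensim's defaults because the original values are unknown.

```python
import re
import string

# Hyperparameters stated above; sg=0 selects CBOW in gensim.
# Everything else falls back to gensim defaults (original values unknown).
PARAMS = {"vector_size": 100, "window": 5, "epochs": 10, "sg": 0}

# Assumption: "punctuation" means ASCII punctuation plus a few
# typographic marks that are common in Italian text.
_PUNCT = re.compile("[" + re.escape(string.punctuation + "«»“”‘’…") + "]")


def preprocess(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return _PUNCT.sub(" ", text.lower()).split()


def train(texts):
    """Train a CBOW Word2Vec model on an iterable of raw text strings."""
    # Imported here so the preprocessing can be used without gensim installed.
    from gensim.models import Word2Vec

    sentences = [preprocess(t) for t in texts]
    return Word2Vec(sentences=sentences, **PARAMS)
```

Note that replacing the apostrophe with a space splits elided forms, so `preprocess("L'Italia è bella, no?")` gives `["l", "italia", "è", "bella", "no"]`. Whether the original preprocessing handled apostrophes this way is not known.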

Hope this helps!
