# Changelog
## NVIDIA Neural Modules 2.6.0
### Highlights
- Speech
- Add Timestamps to streaming ASR [PR](https://github.com/NVIDIA-NeMo/NeMo/pull/14766) (see the sketch at the end of these highlights)
- Add Streaming decoding policies (Wait-K and AlignAtt) for Canary model [PR](https://github.com/NVIDIA-NeMo/NeMo/pull/14765)
- Add NeMo Voice Agent [PR](https://github.com/NVIDIA-NeMo/NeMo/pull/14325)
- Hybrid RNNT-CTC Prompted Parakeet Model support [PR](https://github.com/NVIDIA-NeMo/NeMo/pull/14561)
- [New] MT-Parakeet Streaming Models [release](https://huggingface.co/nvidia/multitalker-parakeet-streaming-0.6b-v1)
- Removed the Automodel module. Automodel is available in the repo https://github.com/NVIDIA-NeMo/Automodel.
- Removed the Deploy module. Export & Deploy is available in the repo https://github.com/NVIDIA-NeMo/Export-Deploy.
- Non-Speech NeMo 2.0 collections are deprecated and will be removed in a later release. Their functionality is available in the Megatron Bridge repo at https://github.com/NVIDIA-NeMo/Megatron-Bridge.
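The timestamp and Parakeet items above map onto the existing high-level ASR API. Below is a minimal, hedged sketch of requesting timestamps from an offline `transcribe()` call; the checkpoint name and result fields are assumptions and may differ between models and releases, and the streaming-specific timestamp support from PR #14766 follows a different code path.

```python
# Minimal sketch (assumptions noted): offline transcription with timestamps
# via the high-level ASR API. The checkpoint name is illustrative and the
# exact fields on the returned hypotheses may differ between NeMo releases.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
hyps = asr_model.transcribe(["sample.wav"], timestamps=True)

print(hyps[0].text)                   # decoded text
print(hyps[0].timestamp["word"][:5])  # first few word-level timestamps (field name may vary)
```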
### Known Issues
- NeMo Voice Agent has known connection issues with pipecat
### Detailed Changelogs:
#### ASR
- fixing kernel restarting when transcribing by @weiqingw4ng :: PR: #14665
- Downgrade "datasets" library version in ASR tutorial to ensure compatibility with HF Datasets used by @KunalDhawan :: PR: #14679
- Fixing Sortformer training tutorial notebook by @tango4j :: PR: #14680
- Fix for "EncDecRNNTBPEModel transcribe() failed with TypeError" by @andrusenkoau :: PR: #14698
- Force activations and weights cast to FP32 Jasper Encoder Squeeze-Excite (merge to main) by @erastorgueva-nv :: PR: #14743
- Use lhotse dataloader for ASR models to support in-manifest channel selection for multichannel recordings by @racoiaws :: PR: #14586
- add transducer timestamps without alignments, timestamps to streaming by @lilithgrigoryan :: PR: #14766
- Adding bf16 Sortformer train and inference by @tango4j :: PR: #14627
- Replace texterrors with kaldialign library by @andrusenkoau :: PR: #14775
- fix: Use shutil.copy fallback to handle file metadata permission errors by @vipnydav :: PR: #14639
- Add Customization Capabilities to Cache-Aware Models by @artbataev :: PR: #14757
- Documentation for gpu-based phrase boosting by @andrusenkoau :: PR: #14800
- Streaming decoding policies (Wait-K and AlignAtt) for Canary model by @andrusenkoau :: PR: #14765
- Add tests for streaming buffered and cache-aware transducer models by @artbataev :: PR: #14823
- Merge updates of Multi-Talker Parakeet Model, Modules, Dataloader and Utils PR 01 by @weiqingw4ng :: PR: #14905
- Merge updates of Multi-Talker Parakeet - Unit tests and CI tests PR 02 by @weiqingw4ng :: PR: #14932
- Add Parakeet Hybrid RNNT CTC BPE Model with Prompt support by @ealbasiri :: PR: #14561
- fix notebooks by @nithinraok :: PR: #15079
- cherry pick #15070 by @nithinraok :: PR: #15082
#### TTS
- Remove outdated TTS Tutorials by @blisc :: PR: #14660
- Add KokoroTTS support for voice agent framework by @tango4j :: PR: #14910
- remove language_modeling by @dimapihtar :: PR: #14192
#### NLP / NMT
- Add gpt-oss by @cuichenx :: PR: #14457
- Fix sequence packing loss calculation by @rayandasoriya :: PR: #14437
- [Perf script] Llama and GPT3 perf script use mlp cast fusion by @guyueh1 :: PR: #14575
- Delete tutorials/llm/llama/biomedical-qa directory by @cuichenx :: PR: #14653
- Add gpt-oss lora exporter by @cuichenx :: PR: #14589
- Replace MegatronTokenizer with MegatronLegacyTokenizer by @chtruong814 :: PR: #14721
- Update ModelCommPGs API from megatron-core by @yaoyu-33 :: PR: #14578
- feat: Compatibility modification of megatron-fsdp by @shjwudp :: PR: #14593
- imported get_moe_layer_wise_logging_tracker from megatron core moe_utils by @prathamk-tw :: PR: #14694
- Fix gpt-oss yarn_original_max_position_embeddings value by @cuichenx :: PR: #14706
- Update docs per guidance by @pablo-garay :: PR: #14841
- Fixing three mcore links by @aschilling-nv :: PR: #14839
- Documentation for gpu-based phrase boosting by @andrusenkoau :: PR: #14800
- Update gpt-oss configs by @cuichenx :: PR: #14674
- remove language_modeling by @dimapihtar :: PR: #14192
- cp: `remove ExportDeploy` into `r2.6.0` by @pablo-garay :: PR: #15053
- cherry pick #15070 by @nithinraok :: PR: #15082
#### Export
- fix: fix missing rope scaling in exporting llama embedding model by @ZhiyuLi-Nvidia :: PR: #14523
- Add gpt-oss lora exporter by @cuichenx :: PR: #14589
- Skip trt-llm and vllm install in install test by @chtruong814 :: PR: #14663
- Fix deepseek export dtype by @cuichenx :: PR: #14307
- Remove export-deploy, automodel, and eval tutorials by @chtruong814 :: PR: #14790
- cp: `remove ExportDeploy` into `r2.6.0` by @pablo-garay :: PR: #15053
#### Uncategorized:
- Version bump to `2.6.0rc0.dev0` by @github-actions[bot] :: PR: #14512
- [Audio]: added conformer U-Net model for SE by @nasretdinovr :: PR: #14442
- hyena/evo2: Make sure to convert to real after fp32 conversion by @antonvnv :: PR: #14515
- Force-set restore path for student in KD mode by @AAnoosheh :: PR: #14532
- Skip PTQ if PTQ model path exists by @jenchen13 :: PR: #14536
- Support QwenVL for inference API by @meatybobby :: PR: #14534
- Hyena: Allow to use unfused RMSNorm + TELinear to restore accuracy and some speed by @antonvnv :: PR: #14542
- [Audio]: added streaming mode to SpectrogramToAudio by @nasretdinovr :: PR: #14524
- Update evo2 defaults so converted checkpoints have the right parameters by @jstjohn :: PR: #14514
- deprecate t0 scripts by @dimapihtar :: PR: #14585
- cfg typo correction by @malay-nagda :: PR: #14588
- [Perf script] Add use_te_activation_func and activation_func_fp8_input_store flags by @guyueh1 :: PR: #14522
- Modify logging message to signal that RestoreConfig will be used by @balvisio :: PR: #14469
- Bump TE and Mcore by @chtruong814 :: PR: #14568
- Avoid host-device sync in PTL logging by @WanZzzzzz :: PR: #14489
- Integrate implicit filter kernel with Hyena layer by @farhadrgh :: PR: #14621
- Fix kv_channels configuration for Gemma2 27b by @ananthsub :: PR: #14590
- [Flux] small fixes by @CarlosGomes98 :: PR: #14333
- [Flux] Add MXFP8 Support by @alpha0422 :: PR: #14473
- Use hugginface_hub for downloading the FLUX checkpoint by @suiyoubi :: PR: #14638
- Fine-tune embedding models (E5-Large-V2 and LLaMA-3.2-1B) on the allnli triplet dataset with NeMo Framework by @girihemant19 :: PR: #14584
- remove service launch scripts by @dimapihtar :: PR: #14647
- Warn instead of error when chat template doesn't contain generation keyword by @jenchen13 :: PR: #14641
- Fix function calling notebook by @cuichenx :: PR: #14643
- [Audio]: fixed bug in conformer unet by @nasretdinovr :: PR: #14626
- Fix code checkout during test by @chtruong814 :: PR: #14658
- Fix Flux seed as optional Arg by @suiyoubi :: PR: #14652
- Remove PEFT scheme condition from recipe by @JRD971000 :: PR: #14661
- Add NeMo Voice Agent by @stevehuang52 :: PR: #14325
- Update get_tensor_shapes function whose signature was refactored by @AAnoosheh :: PR: #14594
- Delete nemo1 notebooks by @cuichenx :: PR: #14677
- Bump latest Mcore 020abf01 by @chtruong814 :: PR: #14676
- [Flux] correct vae_downscale_factor by @CarlosGomes98 :: PR: #14425
- Bump modelopt to 0.35.0 and remove `safe_import("modelopt")` in llm collection by @kevalmorabia97 :: PR: #14656
- Canary tutorial fix by @nune-tadevosyan :: PR: #14699
- Add option for LoRA with Transformer Engine op fuser by @timmoon10 :: PR: #14411
- add load-in-4bit param by @dimapihtar :: PR: #14636
- Support NVFP4 recipe by @WanZzzzzz :: PR: #14625
- Fix broken link in Reasoning-SFT.ipynb by @cuichenx :: PR: #14716
- Remove artificial block to vortex fp8 TP by @jstjohn :: PR: #14684
- Drop speech_llm example suite by @yaoyu-33 :: PR: #14683
- remove env var by @malay-nagda :: PR: #14739
- detach arg option for run scripts by @malay-nagda :: PR: #14722
- Randomized shard slicing for tarred data by @pzelasko :: PR: #14558
- Data prediction objective for flow matching speech enhancement models by @racoiaws :: PR: #14749
- Fix Some Failures by @alpha0422 :: PR: #14763
- Support additional Slurm parameters (#14701) by @bdubauski :: PR: #14742
- [Flux] Remove Redundant Host & Device Sync by @alpha0422 :: PR: #14711
- [Flux] Full Iteration CUDA Graph by @alpha0422 :: PR: #14744
- Update prune-distill notebooks to Qwen3 + simplify + mmlu eval by @kevalmorabia97 :: PR: #14785
- ci: Automodel deprecation warning by @thomasdhc :: PR: #14787
- Bug in MXFP8 recipe by @adityavavreNVDA :: PR: #14793
- feat: Disable blank Issues by @pablo-garay :: PR: #14788
- ci: Add community label bot by @chtruong814 :: PR: #14796
- Add mistral small3 24B config and recipe by @eagle705 :: PR: #14784
- Update changelog for `r2.3.0` by @github-actions[bot] :: PR: #14812
- QWEN2.5-VL 7B FP8 Recipe by @tomlifu :: PR: #14801
- Feat: Disk space management: for nemo install test by @pablo-garay :: PR: #14822
- Evo2 address rare over-masking in 1m context dataset by @jstjohn :: PR: #14821
- Update cherry-pick workflow to use version 0.63.0 by @pablo-garay :: PR: #14832
- Removing automodel items by @aschilling-nv :: PR: #14840
- Update changelog for `v2.4.1` by @github-actions[bot] :: PR: #14828
- Fix lm_eval installation in pruning tutorial for 25.09 container by @kevalmorabia97 :: PR: #14865
- Add nemotron-nano-v2 support to voice agent by @stevehuang52 :: PR: #14704
- Update changelog for 2.5.0 by @chtruong814 :: PR: #14890
- [Qwen3] Fix the flop cal for Qwen3 by @gdengk :: PR: #14897
- [lhotse][aistore] added support input_cfg.yaml directly from aistore bucket by @XuesongYang :: PR: #14891
- Harden _is_target_allowed by adding runtime class validation on top of prefix checks to prevent unsafe target resolution by @KunalDhawan :: PR: #14540
- Enable simplified DistOpt checkpoint formats by @mikolajblaz :: PR: #14428
- Fix the load checkpointing issue -- onelogger callback gets called multiple time in some case. by @liquor233 :: PR: #14945
- Revert "new changelog-build" by @pablo-garay :: PR: #14949
- feat: new changelog-build by @pablo-garay :: PR: #14950
- Update llama4 utils kwargs by @yaoyu-33 :: PR: #14924
- Update README.md by @snowmanwwg :: PR: #14917
- Update all outdated NeMo Curator links by @sarahyurick :: PR: #14760
- Freeze tags in in `r2.6.0` by @github-actions[bot] :: PR: #14957
- cp: `Bump MCore, TE, Pytorch, and modelopt for 25.11 (14946)` into `r2.6.0` by @chtruong814 :: PR: #14976
- cp: `Update ctc-segmentation (14991)` into `r2.6.0` by @chtruong814 :: PR: #14998
- cherry-pick of #14962 by @dimapihtar :: PR: #15000
- cp: `Pass timeout when running speech functional tests (15012)` into `r2.6.0` by @chtruong814 :: PR: #15013
- cp: `check asr models (14989)` into `r2.6.0` by @chtruong814 :: PR: #15002
- cp: `Enable EP in PTQ (15015)` into `r2.6.0` by @chtruong814 :: PR: #15026
- cp: `Update numba to numba-cuda and update cuda python bindings usage (15018)` into `r2.6.0` by @chtruong814 :: PR: #15024
- cp: `Add import guards for mcore lightning module (14970)` into `r2.6.0` by @chtruong814 :: PR: #14981
- cp: `fix loading of hyb ctc rnnt bpe models when using from pretrained (15042)` into `r2.6.0` by @chtruong814 :: PR: #15045
- cp: `fix: fix update-buildcache workflow after ED remove (15051)` into `r2.6.0` by @chtruong814 :: PR: #15052
- cp: `chore: update Lightning requirements version (15004)` into `r2.6.0` by @chtruong814 :: PR: #15049
- cp: `update notebook (15093)` into `r2.6.0` by @chtruong814 :: PR: #15094
- cp: `Fix: Obsolete Attribute [SDE] (15105)` into `r2.6.0` by @chtruong814 :: PR: #15106
- cp: `Upgrade NeMo ASR tutorials from Mozilla/CommonVoice to Google/FLEURS (15103)` into `r2.6.0` by @chtruong814 :: PR: #15107
- cp: `chore: Remove Automodel module (15044)` into `r2.6.0` by @chtruong814 :: PR: #15084
- cp: `Add deprecation notice to modules (15050)` into `r2.6.0` by @chtruong814 :: PR: #15110
## NVIDIA Neural Modules 2.5.3
### Highlights
- This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, see the NVIDIA Product Security page; for acknowledgement, please reach out to the NVIDIA PSIRT team.
- Update nv-one-logger
- Update ctc-segmentation
### Detailed Changelogs:
#### Text Normalization / Inverse Text Normalization
- chore: update Lightning requirement by @liquor233 :: PR: #15005
#### Uncategorized:
- cp: `Update ctc-segmentation (14991)` into `r2.5.0` by @chtruong814 :: PR: #15020
- Bump to 2.5.3 by @chtruong814 :: PR: #15022
## NVIDIA Neural Modules 2.5.2
### Detailed Changelogs:
#### Text Normalization / Inverse Text Normalization
- cp: `Add import guards for mcore lightning module` (#14970) into `r2.5.0` by @chtruong814 :: PR: #14982
#### Uncategorized:
- Bump to 2.5.2 by @chtruong814 :: PR: #14983
## NVIDIA Neural Modules 2.5.1
### Highlights
- This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, see the NVIDIA Product Security page; for acknowledgement, please reach out to the NVIDIA PSIRT team.
- Adds nv-one-logger
- Adds fixes related to Megatron FSDP
### Detailed Changelogs:
#### ASR
- Patch: r2.5.0 with onelogger changes. by @PeiyuanQi :: PR: #14811
#### TTS
- Patch: r2.5.0 with onelogger changes. by @PeiyuanQi :: PR: #14811
#### NLP / NMT
- Patch: r2.5.0 with onelogger changes. by @PeiyuanQi :: PR: #14811
- Megatron FSDP r2.5.0 cherry-pick by @BoxiangW :: PR: #14922
#### Uncategorized:
- Bump to 2.5.1 by @chtruong814 :: PR: #14898
- Cherry pick `Feat: Disk space management: for nemo install test (14822)` into `r2.5.0` by @chtruong814 :: PR: #14937
- cp: `Fix the load checkpointing issue -- onelogger callback gets called multiple time in some case. (14945)` into `r2.5.0` by @chtruong814 :: PR: #14948
## NVIDIA Neural Modules 2.5.0
### Highlights
- Collections:
- LLM
- Nano v2 12B and 9B
- Speech
- New SpeechLM2 collection
- Streaming Sortformer model
- Deprecate Confidence Ensemble models
- parakeet-tdt-0.6b-v3 and canary-1b-v2 models
- Added chunked inference support with .transcribe() for Canary-based models (see the sketch at the end of these highlights)
- Enable prediction of timestamps with streaming ASR
- Improve ASR models’ invariance to padding/batch size
- Qwen prompt format support, SALM generation fixes
- High-level SALM model.generate API closely resembling HF models
- SALM model initialization with time/memory optimization
- SpeechLM2: fixed excessive padding, support on-the-fly resampling for SALM
- Automodel and Export-Deploy functionality is deprecated in NeMo 2.0; each is now maintained in its own standalone repository
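As a rough illustration of the chunked `.transcribe()` highlight, the sketch below passes a long recording directly to `canary-1b-v2`; treat the keyword arguments and return type as assumptions, since the chunking knobs vary between releases.

```python
# Minimal sketch, assuming the "nvidia/canary-1b-v2" checkpoint from this release:
# with chunked inference support, long-form audio can be handed straight to
# .transcribe() instead of being split manually.
import nemo.collections.asr as nemo_asr

canary = nemo_asr.models.ASRModel.from_pretrained("nvidia/canary-1b-v2")
results = canary.transcribe(["long_recording.wav"], batch_size=1)

print(results[0])  # decoded text (or a Hypothesis object, depending on the release)
```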
### Detailed Changelogs:
#### ASR
- Modernize logger interface by @emmanuel-ferdman :: PR: #13783
- Higher-level API for SALM.generate by @pzelasko :: PR: #14034
- add/refactor docs for asr lm customization by @lilithgrigoryan :: PR: #14088
- Improve NEST GPU Utilization 1/N by @MahmoudAshraf97 :: PR: #14086
- Improve ASR models' invariance to padding/batch size by @pzelasko :: PR: #13827
- Clean up transducer decoding initialization by @artbataev :: PR: #14112
- Improve NEST GPU Utilization 2/N by @MahmoudAshraf97 :: PR: #14089
- GPU-accelerated Phrase-Boosting (GPU-PB) for AED decoding by @andrusenkoau :: PR: #14108
- Fix decoding with ngpu-lm when training (#13994) by @hoangtran9122 :: PR: #13995
- fix eval_beamsearch_ngram_ctc script by @lilithgrigoryan :: PR: #14238
- fix wrong typing for ctc-ws context graph by @andrusenkoau :: PR: #14262
- fix frame vad by @stevehuang52 :: PR: #14337
- Improve NEST GPU Utilization 3/N by @MahmoudAshraf97 :: PR: #14234
- remove confidence ensemble models by @lilithgrigoryan :: PR: #14343
- Fix ASR decoding issues with CUDA graphs in training by @artbataev :: PR: #14184
- Streaming Sortformer release PR01: uploading bugfixes, refactored variables and yaml file name changes by @tango4j :: PR: #14416
- Streaming Sortformer release PR02: unit tests for streaming models and modules by @tango4j :: PR: #14417
- GPU-accelerated Phrase-Boosting (GPU-PB) for CTC, RNN-T, and TDT decoding by @andrusenkoau :: PR: #14277
- Fix subsampling chunking test by @monica-sekoyan :: PR: #14452
- Canary2 with NFA by @monica-sekoyan :: PR: #14121
- Initial Chunking by @nune-tadevosyan :: PR: #14321
- Chunking fix by @nune-tadevosyan :: PR: #14482
- Tutorial and doc update by @nune-tadevosyan :: PR: #14484
- Streaming Sortformer release PR03: NeMo documentations and tutorial notebook by @tango4j :: PR: #14388
- Add wget_from_nemo by @nune-tadevosyan :: PR: #14623
- Downgrade "datasets" library version in ASR tutorial to ensure compatibility with HF Datasets used by @KunalDhawan :: PR: #14685
- Canary tutorial fix by @nune-tadevosyan :: PR: #14708
- Force activations and weights cast to FP32 Jasper Encoder Squeeze-Excite by @erastorgueva-nv :: PR: #14715
#### TTS
- Improve ASR models' invariance to padding/batch size by @pzelasko :: PR: #13827
- remove nlp modules by @dimapihtar :: PR: #14127
- Temporarily Remove Encoder PP Support by @yaoyu-33 :: PR: #14167
- Remove T5-TTS by @blisc :: PR: #14252
#### NLP / NMT
- add extra params for MegatronDataSampler by @dimapihtar :: PR: #13956
- Modernize logger interface by @emmanuel-ferdman :: PR: #13783
- remove dialogue collection by @dimapihtar :: PR: #14087
- remove QA collection by @dimapihtar :: PR: #14092
- remove text nlp collection by @dimapihtar :: PR: #14110
- remove nlp modules by @dimapihtar :: PR: #14127
- remove rag collection by @dimapihtar :: PR: #14157
- remove nmt collection by @dimapihtar :: PR: #14191
- Fix importerror in transformer_lm_model after nlp module removals by @chtruong814 :: PR: #14199
- fix QA comments NVBug by @huvunvidia :: PR: #14196
- Temporarily Remove Encoder PP Support by @yaoyu-33 :: PR: #14167
- remove mixins collections by @dimapihtar :: PR: #14281
- feat: print expert groups on megatron init by @clumsy :: PR: #13874
- [speechlm2] [lhotse] sharegpt data and testloader by @huckiyang :: PR: #14294
- Add notebook for LoRA on GPT-OSS-20B by @shashank3959 :: PR: #14439
- Sketch dist-ckpt content versioning by @mikolajblaz :: PR: #13839
- Change to enable full iteration CUDA graph for LLMs by @vasunvidia :: PR: #14077
#### Text Normalization / Inverse Text Normalization
- Check lightning and core imports in install test by @chtruong814 :: PR: #14403
#### Export
- ci: Set L2_NeMo_2_Export_Deploy_Query_In_Framework to be optional by @chtruong814 :: PR: #13946
- Remove old export doc by @oyilmaz-nvidia :: PR: #14292
- Llama4 Export: Remove outdated MLP weight transform by @suiyoubi :: PR: #14297
- Update mllama hf import/export for transformers 4.53 by @meatybobby :: PR: #14327
#### Bugfixes
- Bugfix for Hyena to the get_t function which comes up when doing longer context inference by @jstjohn :: PR: #14256
- fix skipped cuHyena kernel while training by @farhadrgh :: PR: #14365
- Remove flaky Evo2 dataset performance test by @jstjohn :: PR: #14371
- Use module prefix in restore_modelopt_state by @jenchen13 :: PR: #14384
#### Uncategorized:
- Version bump to `2.5.0rc0.dev0` by @github-actions[bot] :: PR: #13944
- [Llama4] Enable tp comm overlap for llama4 by @gdengk :: PR: #13940
- Fix for Squad Dataset Download by @rhmukundan :: PR: #13893
- add nmh HF conversion by @JRD971000 :: PR: #13941
- Speechlm2 SALM improvements by @pzelasko :: PR: #13829
- fix dataset issue by @dimapihtar :: PR: #13953
- Editing MMLU to pull from the correct repo by @ruchaa-apte :: PR: #13991
- move classes to module to use __target__ feature (#14023) by @nithinraok :: PR: #14031
- Add Nemotron-H prompt format, fix cut-to-conversation custom attr propagation by @pzelasko :: PR: #13963
- Bump release_library template to v0.40.0 by @chtruong814 :: PR: #14046
- [automodel] add support for layer-freezing by @akoumpa :: PR: #14000
- [Qwen3] Recipe config bug fix by @gdengk :: PR: #14084
- Add TE import guard in qwen2vl vision module by @chtruong814 :: PR: #14091
- Update bitsandbytes dependency to v0.46.0 by @pramodk :: PR: #14050
- Update FSDP2 docstring by @BoxiangW :: PR: #14105
- Interface to enable fsdp-double-buffer without enabling NCCL-UB by @youngeunkwon0405 :: PR: #14076
- SpeechLM2 SALM: load ckpt faster, with less GPU memory by @pzelasko :: PR: #14113
- Add object_storage_cache_path to PreTrainingDataModule by @shunjiad :: PR: #14103
- Update changelog for `r2.3.0` by @github-actions[bot] :: PR: #14160
- Fix FLUX test with correct env var by @suiyoubi :: PR: #14149
- add mmap_bin_files param by @dimapihtar :: PR: #14122
- Add option to suppress import checks in `Dockerfile.speech` by @artbataev :: PR: #14185
- Safely import optional python packages by @roclark :: PR: #13936
- Set flux test as optional by @chtruong814 :: PR: #14190
- Revert "Safely import optional python packages (#13936)" by @chtruong814 :: PR: #14197
- Fix "Safely import optional python packages (#13936)" by @chtruong814 :: PR: #14198
- Add fix for evo2 generate/inference by @jwilber :: PR: #14027
- Fixing file path suffix by @gautham-kollu :: PR: #14179
- Update AVLM finetune example for vanilla fine-tuning by @huvunvidia :: PR: #14232
- [finetune] Add dataset_kwargs to prepare packed sequence data by @jiajunly :: PR: #14169
- Allow exception in hf ckpt load attempt before fallback to standard l… by @trvachov :: PR: #14214
- Load master weights from checkpoint by @kunlunl :: PR: #14072
- Add deploy lora adapter portion by @ruchaa-apte :: PR: #14255
- fix speechlm lhotse loading nemo_tarred by @stevehuang52 :: PR: #14314
- Update changelog for `r2.4.0` by @github-actions[bot] :: PR: #14334
- Flaky test timing out: @pytest.mark.pleasefixme by @pablo-garay :: PR: #14351
- Support dump perf recipe diff from base recipe by @guyueh1 :: PR: #14206
- Bugfix degenerate bases evo2 dataset by @jstjohn :: PR: #14359
- Hyena support for flash decode API by @jstjohn :: PR: #14315
- Fix Gemma2/3 & Llava (Next) & Llama4 conversion issue with latest transformers by @suiyoubi :: PR: #14367
- fix: reduce the excessive test time of test_msdd_diar_inference by @tango4j :: PR: #14366
- SpeechLM2: S2S->S2T data reader, excessive padding fixes by @pzelasko :: PR: #14124
- chore: Release 2.5.0rc0 by @ko3n1g :: PR: #14389
- Add pyxis flag for container writable. by @sudostock :: PR: #14395
- [MoE] Partial Cudagraph support for MoE by @gdengk :: PR: #14362
- Revert "[MoE] Partial Cudagraph support for MoE (#14362)" by @chtruong814 :: PR: #14402
- Update AVLM recipes for NeMo-CI runs by @huvunvidia :: PR: #14397
- Remove nemo1 multimodal and vision by @yaoyu-33 :: PR: #14095
- Fix LazyNeMoIterator supervision for multi-channel cuts by @anteju :: PR: #14409
- Bump Mcore to 7f7439f by @chtruong814 :: PR: #14373
- Use cuhyena rearrange when available. by @moradza :: PR: #14383
- Fix model training/eval state after PTL validation loop by @paul-gibbons :: PR: #14152
- Add deprecation notice to eval code by @athitten :: PR: #14316
- Streaming Sortformer release PR04: Adding functional tests for streaming sortformer by @tango4j :: PR: #14435
- QWEN2.5-VL 7B Performance Recipe by @tomlifu :: PR: #14401
- Discount FLOPs in dot-product att by @erhoo82 :: PR: #14424
- Bump to pytorch 25.06 and newer TE commit by @chtruong814 :: PR: #14423
- Enable precision aware optimizer for dsv3 by @guyueh1 :: PR: #14444
- Make VBoost activation conditional by @bdubauski :: PR: #14458
- cuHyena FFTConv support for Hyena Long Implicit (LI) Layer by @farhadrgh :: PR: #14396
- Alit/nano v2 by @JRD971000 :: PR: #14464
- Fix reuse_grad_buf_for_mxfp8_param_ag for mxfp8 by @guyueh1 :: PR: #14445
- Fix loss mask for chat datasets by @cuichenx :: PR: #14369
- Rename to subquadratic_ops by @farhadrgh :: PR: #14486
- Allows using other signals (than SIGTERM) with PreemptionPlugin by @zachmoshe :: PR: #14248
- Qwen2.5-VL 32B Performance Recipe by @tomlifu :: PR: #14485
- Alit/nanov2 12b by @JRD971000 :: PR: #14483
- Freeze tags in in `r2.5.0` by @github-actions[bot] :: PR: #14513
- deprecate t0 by @dimapihtar :: PR: #14599
- Cherry pick `Use hugginface_hub for downloading the FLUX checkpoint (14638)` into `r2.5.0` by @chtruong814 :: PR: #14640
- Cherry pick `Fix function calling notebook (14643)` into `r2.5.0` by @chtruong814 :: PR: #14650
- Cherry pick `remove service launch scripts (14647)` into `r2.5.0` by @chtruong814 :: PR: #14648
- Cherry pick `Delete tutorials/llm/llama/biomedical-qa directory (14653)` into `r2.5.0` by @chtruong814 :: PR: #14654
- Cherry pick `Remove PEFT scheme condition from recipe (14661)` into `r2.5.0` by @chtruong814 :: PR: #14662
- Cherry pick `fixing kernel restarting when transcribing (14665)` into `r2.5.0` by @chtruong814 :: PR: #14672
- Delete nemo 1 notebooks by @cuichenx :: PR: #14675
- Cherry pick `Fixing Sortformer training tutorial notebook (14680)` into `r2.5.0` by @chtruong814 :: PR: #14681
- Cherry-pick `Update get_tensor_shapes function whose signature was refactored` (14594) into `r2.5.0` by @chtruong814 :: PR: #14678
- Cherry pick `Skip trt-llm and vllm install in install test (14663)` into `r2.5.0` by @chtruong814 :: PR: #14697
- Cherry pick `Fix for \EncDecRNNTBPEModel transcribe() failed with TypeError\ (14698)` into `r2.5.0` by @chtruong814 :: PR: #14709
- Cherry pick `Fix broken link in Reasoning-SFT.ipynb (14716)` into `r2.5.0` by @chtruong814 :: PR: #14717
- cherry-pick add load-in-4bit param (14636) into r2.5.0 by @dimapihtar :: PR: #14719
- Cherry pick `Fix deepseek export dtype (14307)` into `r2.5.0` by @chtruong814 :: PR: #14682
- Cherry pick `remove env var (14739)` into `r2.5.0` by @chtruong814 :: PR: #14746
- Cherry-pick 'Bump modelopt to 0.35.0 and remove `safe_import("modelopt")` in llm collection (#14656)' into 'r2.5.0' by @chtruong814 :: PR: #14771
- Cherry pick `Update prune-distill notebooks to Qwen3 + simplify + mmlu eval (14785)` into `r2.5.0` by @chtruong814 :: PR: #14789
- Cherry pick `Remove export-deploy, automodel, and eval tutorials (14790)` into `r2.5.0` by @chtruong814 :: PR: #14792
- Cherry pick `ci: Automodel deprecation warning (14787)` into `r2.5.0` by @chtruong814 :: PR: #14791
## NVIDIA Neural Modules 2.4.1
### Detailed Changelogs:
#### Uncategorized:
- Update package_info.py by @ko3n1g :: PR: #14400
- Patch to address issue 14392 by @youngeunkwon0405 :: PR: #14398
- Cherry pick `Fix callbacks in DSV3 script (14350)` into `r2.4.0` by @chtruong814 :: PR: #14370
- Cherry pick `Change Llama Embedding Tutorial to use SFT by default (14231)` into `r2.4.0` by @chtruong814 :: PR: #14303
- Cherrypick `calculate_per_token_loss requirement for context parallel` (#14065) (#14282) into `r2.4.0` by @chtruong814 :: PR: #14448
- Pin nvidia-lm-eval to 25.6.1 by @chtruong814 :: PR: #14470
## NVIDIA Neural Modules 2.3.3
- This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, see the NVIDIA Product Security page; for acknowledgement, please reach out to the NVIDIA PSIRT team.
- Pin nvidia-lm-eval to 25.5
## NVIDIA Neural Modules 2.4.0
### Highlights
- Collections:
- Speech
- Batched beam search for transducers (RNN-T and TDT); see the decoding sketch at the end of these highlights
- RNN-T/TDT buffered/streaming inference + batched decoding support in cache-aware models
- Add support for CTC batched beam search with GPU-LM
- Key fixes
- Punctuation Marks in Timestamps
- Fix timestamps when cuda graphs enabled
- Fix masking of \ tokens in AED inference
- TDT streaming inference fix
- LLM
- Qwen 3 235B-A22B Perf Optimized
- DeepSeek V3 Perf Optimized
- Gemma3 support from Google
- Embedding and Reranker models
- MM
- Llama 4
- AVLM
- Training performance (speed)
- NVL SHARP + IB SHARP for DP/FSDP communications on H100 and B200
- MXFP8 with TP communication overlap
- MXFP8 with reduced memory allocation
- FP8 sub-channel recipe (128x128 for weight and 1x128 for activation)
- cudnn fused attention for MLA (both Hopper and Blackwell)
- Advanced custom asymmetric pipelining (for MTP, loss func, and embd)
- BF16 optimizer for model memory saving
- CUDA graph fix for fine-tuning benchmarks
- CUDA graph support for Llama 4
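For the batched beam-search highlight, the usual entry point is `change_decoding_strategy()`; the sketch below switches a transducer model to beam decoding. The checkpoint name, strategy string, and config fields are assumptions, and the exact strategy name for the new batched beam-search decoder may differ, so check the release's decoding configs.

```python
# Minimal sketch (assumptions noted): switching a transducer model to beam-search
# decoding via change_decoding_strategy(). "beam" is the long-standing strategy
# name; the batched beam-search decoder added in this release may use a different
# strategy string.
from copy import deepcopy

from omegaconf import open_dict

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-rnnt-1.1b")

decoding_cfg = deepcopy(model.cfg.decoding)
with open_dict(decoding_cfg):
    decoding_cfg.strategy = "beam"
    decoding_cfg.beam.beam_size = 4

model.change_decoding_strategy(decoding_cfg)
print(model.transcribe(["sample.wav"])[0])
```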
### Detailed Changelogs
#### ASR
- ci: Fix ASR container by @ko3n1g :: PR: #13288
- Set L2_Segmentation_Tool_Parallel_ctc_segmentation test to be optional by @chtruong814 :: PR: #13296
- Revert "WebDataset URL refactoring" by @ko3n1g :: PR: #13421
- Update flagged docs links by @erastorgueva-nv :: PR: #13391
- [Docs] Fix incorrectly formatted reference tags by @erastorgueva-nv :: PR: #13445
- Update CP by @pablo-garay :: PR: #13532
- Tdt buffered inference fix by @hainan-xv :: PR: #13500
- Fix transcribe when nbest hypotheses are returned by @lilithgrigoryan :: PR: #13540
- Set ASR test to be optional by @chtruong814 :: PR: #13633
- Enabling chunked inference for AED models in asr_evaluator by @melllinia :: PR: #13674
- Ko3n1g/chore/asr only by @ko3n1g :: PR: #13704
- decompressing joblib file before checking it by @Ssofja :: PR: #13732
- Revert "decompressing joblib file before checking it (#13732)" by @chtruong814 :: PR: #13791
- Punctuation Marks in Timestamps by @monica-sekoyan :: PR: #13353
- AIStore with Webdataset by @monica-sekoyan :: PR: #13604
- Update to add default for dataclass variables by @nithinraok :: PR: #13814
- This PR addresses to known security issues by @Ssofja :: PR: #13804
- remove model_stride var by @nithinraok :: PR: #13867
- add CTC batched beam search by @lilithgrigoryan :: PR: #13337
- Clean up streaming ASR script and tests by @artbataev :: PR: #13894
- add NGPU-LM fusion during CTC greedy by @lilithgrigoryan :: PR: #13917
#### TTS
- Revert "WebDataset URL refactoring" by @ko3n1g :: PR: #13421
- Update flagged docs links by @erastorgueva-nv :: PR: #13391
- [Docs] Fix incorrectly formatted reference tags by @erastorgueva-nv :: PR: #13445
- Update CP by @pablo-garay :: PR: #13532
- fix: vpp stage refactoring to match mcore by @ZhiyuLi-Nvidia :: PR: #13673
- AIStore with Webdataset by @monica-sekoyan :: PR: #13604
#### NLP / NMT
- Migrate Hyena to Megatron inference_context. by @cspades :: PR: #13436
- Update CP by @pablo-garay :: PR: #13532
- fix broken links by @dimapihtar :: PR: #13544
- Add nlp import checks by @thomasdhc :: PR: #13563
- PTQ model support, quant_cfg, and documentation updates by @janekl :: PR: #13519
- feat - GPTSFTChatDataset alignment with OpenAI Messages, compatibility with packed sequences by @soluwalana :: PR: #13367
- fix: vpp stage refactoring to match mcore by @ZhiyuLi-Nvidia :: PR: #13673
- Fix resume with MegatronPretrainingBatchSampler by @ashors1 :: PR: #13565
- Punctuation Marks in Timestamps by @monica-sekoyan :: PR: #13353
- Revert `Adding more doc-strings to megatron_parallel.py #12767` by @ko3n1g :: PR: #13824
- reasoning model evaluation mmlu gpqa by @ruchaa-apte :: PR: #13880
- Remove unused DynamicRetrievalServer and Bert dataset loader classes by @dimapihtar :: PR: #14209
- Huvu/avlm qafix cherrypick from by @huvunvidia :: PR: #14253
#### Export
- Improve Nemo2Exporter for Models Using Custom Modelling Files on HF by @suiyoubi :: PR: #13400
- Adding more export tests by @oyilmaz-nvidia :: PR: #13410
- Add Warning to Export when output_path exists by @suiyoubi :: PR: #13465
- Move libsox-fmt-all from Dockerfile.ci.export_deploy to Dockerfile.ci by @chtruong814 :: PR: #13452
- ci: Remove trt-llm breakpoint by @ko3n1g :: PR: #13499
- Add Qwen2VL export_ckpt by @AtsunoriFujita :: PR: #13398
- Add MLlama export_ckpt by @AtsunoriFujita :: PR: #13346
- Update vLLMExporter to use vLLM V1 by @janekl :: PR: #13498
- Add vLLM Mixtral and TRT-LLM qnemo export tests (plus a couple of bugfixes) by @janekl :: PR: #13697
- Fix Qwen3 export + misc by @cuichenx :: PR: #13679
- Extra int cast for successful tracing during ONNX export by @janekl :: PR: #13782
- FP8 lora export by @cuichenx :: PR: #13748
- Add PEFT export check by @cuichenx :: PR: #13835
- Update llm api import_ckpt/export_ckpt docstring by @meatybobby :: PR: #13714
- Use modelopt export and disable dataset calibration for weight only PTQ by @jenchen13 :: PR: #13756
#### Bugfixes
- [automodel] move liger kernel patching by @akoumpa :: PR: #13579
#### Uncategorized
- build: various bumps by @ko3n1g :: PR: #13285
- ci: Fixes to selective triggering by @ko3n1g :: PR: #13287
- ci: Set timeout by @ko3n1g :: PR: #13294
- Set L2_NeMo_2_T5_Pretraining test as optional by @chtruong814 :: PR: #13282
- Add test environment approval step for CI by @chtruong814 :: PR: #13297
- update num nodes in deepseek v3 finetune recipe by @cuichenx :: PR: #13314
- ci: Increase cache pool by @ko3n1g :: PR: #13306
- Rename adam_with_cosine_annealing as adam since cosin LR is not setup by @ShriyaRishab :: PR: #13315
- ci: Update test queue bot to not assume a workflow is launched from a PR by @chtruong814 :: PR: #13318
- Fix TE pytorch attention doc link by @thomasdhc :: PR: #13327
- ci: Add all recent buildcaches to update-buildcache job by @ko3n1g :: PR: #13289
- Fix neva notebook by @yaoyu-33 :: PR: #13334
- Fix transformer offline for CI/CD llama4 tests by @yaoyu-33 :: PR: #13339
- [automodel] convert lm head to full tensor before passing to lce by @yuanzhedong :: PR: #13319
- ci: No dups in queue by @ko3n1g :: PR: #13352
- ci(hotfix): VLM CPU unit tests by @ko3n1g :: PR: #13348
- vLLM==0.8.5 update by @janekl :: PR: #13350
- ci: Allow bypassing approval by @ko3n1g :: PR: #13365
- Avoid the need to specify optional attributes for lhotse/nemo reader functions by @pzelasko :: PR: #13307
- ci: Fix selective-triggering for non-PR events by @ko3n1g :: PR: #13374
- ci: Revert `no-concurrency-group-on-main` by @ko3n1g :: PR: #13375
- ci: Improve no-fail-fast mechanism by @ko3n1g :: PR: #13370
- 2d buckets estimation fix by @monica-sekoyan :: PR: #13377
- ci: Fix scheduled runs by @ko3n1g :: PR: #13378
- Ko3n1g/ci/fix nightly runs by @ko3n1g :: PR: #13382
- [automodel] fix none issue in dataset for qwen model by @yuanzhedong :: PR: #13311
- update table by @akoumpa :: PR: #13397
- Improve test coverage for audio modules by @anteju :: PR: #13333
- Disable failing maxine loss test by @anteju :: PR: #13361
- Ko3n1g/ci/no notification on cancel by @ko3n1g :: PR: #13403
- document fp8_recipe by @akoumpa :: PR: #13405
- Weekly bump main by @ko3n1g :: PR: #13408
- Handle boolean args for performance scripts and log received config by @guyueh1 :: PR: #13291
- [automodel] add FirstRankPerNode by @akoumpa :: PR: #13373
- tests: Disable flaky audio test by @ko3n1g :: PR: #13429
- ci: Disable flaky audio test by @ko3n1g :: PR: #13435
- Fix loss compute and reduction by @xrennvidia :: PR: #13295
- ci: Skip link check on github links by @chtruong814 :: PR: #13425
- Add NCCL cfg interface to perf scripts by @erhoo82 :: PR: #13407
- ci: Success only if `Run CICD` label attached by @ko3n1g :: PR: #13430
- ci: Add tests to selective triggering by @ko3n1g :: PR: #13404
- ci: Remove jq by @ko3n1g :: PR: #13440
- ci: Fix deps tree for tests by @ko3n1g :: PR: #13443
- Ko3n1g/ci/fix dependency tree by @ko3n1g :: PR: #13448
- Adding additional unit tests for the deploy module by @pthombre :: PR: #13411
- [Audio] fix a flaky test (and also make some tests run faster) by @racoiaws :: PR: #13439
- [automodel] ignore tail padding in TPS calculation by @akoumpa :: PR: #13329
- Ko3n1g/ci/selective triggering 3 by @ko3n1g :: PR: #13460
- ci: Disable broken neva tests by @ko3n1g :: PR: #13461
- fix speechlm data module by @stevehuang52 :: PR: #13362
- ci: Enter queue only with passing linting by @ko3n1g :: PR: #13462
- Adding tests for Schroedinger Bridge model by @nasretdinovr :: PR: #13401
- add more detailed description by @dimapihtar :: PR: #13464
- [Audio] tests for score-based and flow matching enhancement models by @racoiaws :: PR: #13406
- Use expandable cuda memory segmentation by @erhoo82 :: PR: #13418
- Fix llava tokenizer caused nan issue by @yaoyu-33 :: PR: #13466
- Remove cuda method from ModelPT by @erastorgueva-nv :: PR: #13394
- Fix BNR 2 unit test + input, case where input length was not specified by @nitin9252 :: PR: #13467
- ci: Do not run any tests if no match is found by @ko3n1g :: PR: #13479
- Ko3n1g/ci/selective triggering 4 by @ko3n1g :: PR: #13489
- Fix typo in the performance script by @youngeunkwon0405 :: PR: #13487
- ci: No runs on main by @ko3n1g :: PR: #13490
- ci: Upload on schedule by @ko3n1g :: PR: #13491
- ci: Run selective triggering on dockerfiles and dependencies by @ko3n1g :: PR: #13493
- [automodel] fallback FP8 + LCE -> FP8 + CE by @akoumpa :: PR: #13349
- Update changelog for `r2.3.0` by @github-actions[bot] :: PR: #13501
- Update 2.3.0 changelog by @chtruong814 :: PR: #13504
- Enabling flash decode for float16 precision only by @pthombre :: PR: #13471
- Fix changelog formatting by @chtruong814 :: PR: #13505
- Updating the long context performance number for B200 by @youngeunkwon0405 :: PR: #13468
- ci: Add more files to filter by @ko3n1g :: PR: #13517
- Improve error message when HF checkpoint cannot be loaded by @ashors1 :: PR: #13513
- Add Resume_path to llama_nemotron models by @suiyoubi :: PR: #13515
- Add Llama4 GHA by @suiyoubi :: PR: #13442
- add memory profile interface to perf scripts by @erhoo82 :: PR: #13413
- Add fp8_param argument back to mixed precision plugin for backward compatibility by @guyueh1 :: PR: #13522
- [automodel] add find_unused_parameters=True for DDP by @akoumpa :: PR: #13366
- ci: Update success message by @ko3n1g :: PR: #13541
- [Audio] TransformerUNet: predictive model support added by @nasretdinovr :: PR: #13470
- Test Hyena mixer CP equivalency by @farhadrgh :: PR: #13330
- use null tokenizer by @malay-nagda :: PR: #13480
- ci: Remove optional marker by @ko3n1g :: PR: #13469
- Update extra_requires and requirements by @thomasdhc :: PR: #13359
- Fix default config for LlamaNemotron Ultra by @suiyoubi :: PR: #13542
- [audio] Improve test coverage for audio losses by @anteju :: PR: #13309
- deepseek finetuning callback error change by @SDcodehub :: PR: #13483
- ci(fix): Add `__init__` to selective-triggering by @ko3n1g :: PR: #13577
- nsys profile filename ranks info by @malay-nagda :: PR: #13576
- chore: Update setup.py by @ko3n1g :: PR: #13566
- Fix Llama importer by @suiyoubi :: PR: #13583
- [automodel] fix --mbs/gbs dtype and chat-template by @akoumpa :: PR: #13602
- Reconfigure 'limit__batches' by @maanug-nv :: PR: #13523
- ci: Optional speech tests by @ko3n1g :: PR: #13606
- [Automodel] Fix CP device_mesh issue, use PTL distsampler by @BoxiangW :: PR: #13473
- [automodel] fix log message by @akoumpa :: PR: #13612
- Tests for evaluation with NVIDIA Evals Factory by @chtruong814 :: PR: #13627
- Fix ptl import in notebooks by @maanug-nv :: PR: #13608
- [automodel] dist.abort -> dist.destroy_process_group by @akoumpa :: PR: #13578
- Skip eval unit test by @chtruong814 :: PR: #13635
- Fix image_processor config in Energon path by @AtsunoriFujita :: PR: #13618
- Add Gemma3 VL model by @xiangxu-google :: PR: #13536
- Set L2_NeMo_2_EVAL as optional by @chtruong814 :: PR: #13644
- Update install to use pip install by @thomasdhc :: PR: #13605
- Multi node settings for evaluation nemo-run script by @athitten :: PR: #13568
- [Llama4] Fix the missing args in the recipe by @gdengk :: PR: #13649
- Bump nvidia-modelopt to 0.29.0 by @AAnoosheh :: PR: #13599
- Update README.md for 25.04 release by @snowmanwwg :: PR: #13654
- [automodel] consolidate sft peft scripts by @akoumpa :: PR: #13634
- Qwen3 by @cuichenx :: PR: #13554
- Set env variables for eval tests by @marta-sd :: PR: #13658
- build: multimodal-only by @ko3n1g :: PR: #13665
- [Audio] TransformerUNet: predictive model tests added by @nasretdinovr :: PR: #13648
- [automodel] consolidate vllm scripts by @akoumpa :: PR: #13670
- build: Pin transformers by @ko3n1g :: PR: #13675
- ci: Enable codecov checks by @ko3n1g :: PR: #13497
- ci: Add `init-file-checker` by @ko3n1g :: PR: #13684
- Add use_sharp and use user buffer registration args in perf scripts by @youngeunkwon0405 :: PR: #13521
- Remove is-optional marker for L2_NeMo_2_EVAL by @marta-sd :: PR: #13669
- gpu type and #devices CLI args by @malay-nagda :: PR: #13620
- perf scripts updates by @malay-nagda :: PR: #13456
- Use audio codec without discriminators in SpeechLM2 tests by @pzelasko :: PR: #13711
- Update changelog for `r2.3.1` by @github-actions[bot] :: PR: #13719
- Recipe default value fix for Llama4 by @suiyoubi :: PR: #13696
- build: Lift numba by @ko3n1g :: PR: #13735
- New key override for timestamps by @melllinia :: PR: #13743
- Fixed Mllama Energon config by @AtsunoriFujita :: PR: #13574
- Update convert_to_tarred_audio_dataset.py by @ssh-meister :: PR: #13755
- Enable dropout recompute in LoRA by @michal2409 :: PR: #13745
- Address VDR feedback for NeMo FW evaluations by @athitten :: PR: #13701
- remove blocks unused to increase coverage by @romanbrickie :: PR: #13511
- Fix Flux Recipe for FSDP/DDP by @suiyoubi :: PR: #13715
- Try soften protobuf version requirement by @pablo-garay :: PR: #13747
- Flux FP8 recipe by @Victor49152 :: PR: #13584
- Gemma3 Fix and Tests by @suiyoubi :: PR: #13661
- Disable local gradient checker in performance scripts by @erhoo82 :: PR: #13768
- [Audio] Tests: training for mask, pred and SB models by @nasretdinovr :: PR: #13736
- Refactor MSC integration in exp manager by @shunjiad :: PR: #13626
- [fix] vpp error in Gemma3 by @ZhiyuLi-Nvidia :: PR: #13784
- ci: Ensure approval queue fetches all CICD workflows using pagnation by @chtruong814 :: PR: #13798
- ci: make_request in approval test queue appends next url for status checks only by @chtruong814 :: PR: #13802
- Remove guard for masking tests and improve coverage by @anteju :: PR: #13787
- fix: After mcore bump by @ko3n1g :: PR: #13781
- Fix Gemma3VL training bugs by @sharanmayank :: PR: #13766
- [NeMo 2.0] Remove the restriction of load_model_state_dict for cfsdp by @shjwudp :: PR: #13512
- Add option to construct Llama model with Transformer Engine op fuser by @timmoon10 :: PR: #13776
- [Evaluation] Add support for simple-evals and tasks that require logprobs by @marta-sd :: PR: #13647
- remove stale section by @akoumpa :: PR: #13759
- fix moe_router_pre_softmax for Mixtral by @akoumpa :: PR: #13678
- fix: improve sequence length handling to fix nan in loss when turning on cudagraph by @katec846 :: PR: #13779
- Gemma3 Energon Dataset by @suiyoubi :: PR: #13813
- Rectify BLEU evaluation by @ankitapasad :: PR: #13762
- ci: Moved workflows by @ko3n1g :: PR: #13828
- ci: Moved templates by @ko3n1g :: PR: #13830
- [Build] Bump bitsandbytes dependency to 0.45.5 (ubuntu 22.04 compatibility) by @pramodk :: PR: #13789
- update for `PYTORCH_CUDA_ALLOC_CONF` env var by @malay-nagda :: PR: #13837
- [Llama4] Enable VLM Dec cudagraph by @gdengk :: PR: #13767
- Support MSC URL in LLM checkpointing by @shunjiad :: PR: #13805
- additional metrics by @dimapihtar :: PR: #13754
- Expand modelopt version range by @chtruong814 :: PR: #13850
- Alit/nmh4b by @JRD971000 :: PR: #13481
- [Tutorial] Train your own reasoning model in 48 hours on a single GPU by @Maghoumi :: PR: #13853
- Enabled C2C-PCie bridge through NCCL by @sanandaraj5597 :: PR: #13621
- Added safe loading of models by @nithinraok :: PR: #13607
- Add NemotronH Performance Script by @guyueh1 :: PR: #13528
- Hyena SE/MR B2B Kernel integration by @farhadrgh :: PR: #13518
- chore: Destroy buildcache by @ko3n1g :: PR: #13869
- tests: Fix Qwen test by @ko3n1g :: PR: #13888
- fix: improve error handling in `is_multistorageclient_url` by @shunjiad :: PR: #13885
- feat(eval): adds benchmark adapters that allow specisal reasoning models by @agronskiy :: PR: #13709
- perf scripts 25.07 refactor by @malay-nagda :: PR: #13875
- Fix E5 and LlamaEmbedding Conversion by @suiyoubi :: PR: #13890
- Bug fix for NCCL vars by @sanandaraj5597 :: PR: #13908
- Reranker Model Support by @suiyoubi :: PR: #13876
- numa cmd in bash by @malay-nagda :: PR: #13914
- Fix BERT issue with PP by @suiyoubi :: PR: #13916
- [Llama4] Fix Vp_stage to enable VP for VLM llama4 by @gdengk :: PR: #13873
- Enable NVTX profiling in MCore by @minitu :: PR: #13820
- [Qwen3-MoE] Add Qwen3 MoE perf recipe for 30b and 235b by @gdengk :: PR: #13895
- lazy import bnbconfig by @akoumpa :: PR: #13919
- Set TRANSFORMERS_OFFLINE=1 and HF_HUB_OFFLINE=1 in CI tests by @chtruong814 :: PR: #13932
- [peft] align adapter output shape with wrapped module output shape by @guyueh1 :: PR: #13922
- [automodel] move only lora adapters to cpu by @akoumpa :: PR: #13931
- Fix vp_stage not found when fsdp by @gautham-kollu :: PR: #13817
- Fix single optional import if ModelOpt not installed by @AAnoosheh :: PR: #13923
- Revert "Set TRANSFORMERS_OFFLINE=1 and HF_HUB_OFFLINE=1 in CI tests by @chtruong814 :: PR: #13938
- Enable LoRA for TELinear layers by @cuichenx :: PR: #13929
- Freeze tags in in `r2.4.0` by @github-actions[bot] :: PR: #13945
- Cherry pick `Use jiwer less than 4.0.0 (13997)` into `r2.4.0` by @ko3n1g :: PR: #13998
- Cherry pick `Remove container license reference (14010)` into `r2.4.0` by @ko3n1g :: PR: #14017
- move classes to module to use __target__ feature by @nithinraok :: PR: #14023
- Cherry pick `bf16 grads for bf16 jobs (14016)` into `r2.4.0` by @ko3n1g :: PR: #14020
- Cherry pick `Remove nemo1 stable diffusion test (14018)` into `r2.4.0` by @ko3n1g :: PR: #14019
- Version bump to `2.4.0rc1.dev0` by @github-actions[bot] :: PR: #14047
- Cherry pick `Fix Loading Custom Quantization Config (13934)` into `r2.4.0` by @ko3n1g :: PR: #13950
- Cherry pick `[automodel] fix sft notebook (14002)` into `r2.4.0` by @ko3n1g :: PR: #14003
- Cherry pick `Use average reduction in FSDP grad reduce-scatter when grad dtype is … (13981)` into `r2.4.0` by @ko3n1g :: PR: #14004
- Cherry pick `GPU memory logging update (13982)` into `r2.4.0` by @ko3n1g :: PR: #14021
- Cherry pick `Remove kaldiio (14006)` into `r2.4.0` by @ko3n1g :: PR: #14032
- Cherry pick `Set L2_NeMo_2_Flux_Import_Test to be optional (14056)` into `r2.4.0` by @ko3n1g :: PR: #14058
- Cherry pick `Bump protobuf to 5.29.5 (14045)` into `r2.4.0` by @ko3n1g :: PR: #14060
- Cherry pick `Detect hardware before enabling DeepEP (14022)` into `r2.4.0` by @ko3n1g :: PR: #14068
- Version bump to `2.4.0rc2.dev0` by @github-actions[bot] :: PR: #14115
- Cherry pick `Fix SFT Dataset Bug (13918)` into `r2.4.0` by @ko3n1g :: PR: #14074
- Cherry pick `Align adapter shape with base linear output shape (14009)` into `r2.4.0` by @ko3n1g :: PR: #14083
- Cherry pick `[MoE] Update the fp8 precision interface for llama4 and qwen3 (14094)` into `r2.4.0` by @ko3n1g :: PR: #14104
- Cherry pick `[Llama4] Tokenizer naming update (14114)` into `r2.4.0` by @ko3n1g :: PR: #14123
- Cherry pick `Bump to pytorch 25.05 container along with TE update (13899)` into `r2.4.0` by @ko3n1g :: PR: #14145
- Cherry pick `Perf scripts updates (14005)` into `r2.4.0` by @ko3n1g :: PR: #14129
- Cherry pick `Remove unstructured (14070)` into `r2.4.0` by @ko3n1g :: PR: #14147
- Version bump to `2.4.0rc3.dev0` by @github-actions[bot] :: PR: #14165
- Cherry pick `Add checkpoint info for NIM Embedding Expor Tutorial (14177)` into `r2.4.0` by @ko3n1g :: PR: #14178
- Cherry pick `Fix dsv3 script (14007)` into `r2.4.0` by @ko3n1g :: PR: #14182
- Cherry pick `405b perf script updates (14176)` into `r2.4.0` by @chtruong814 :: PR: #14195
- Cherry pick `Fix nemotronh flops calculator (14161)` into `r2.4.0` by @chtruong814 :: PR: #14202
- Cherry pick `Add option to disable gloo process groups` (#14156) into `r2.4.0` by @chtruong814 :: PR: #14220
- Cherry pick `Remove g2p_en (14204)` into `r2.4.0` by @chtruong814 :: PR: #14212
- Cherry pick `diffusion mock data null args (14173)` into `r2.4.0` by @chtruong814 :: PR: #14217
- Cherry pick `perf-scripts: Change b200 config to EP8 (14207)` into `r2.4.0` by @chtruong814 :: PR: #14223
- Cherry pick `Change RerankerSpecter Dataset question key (14200)` into `r2.4.0` by @chtruong814 :: PR: #14224
- Cherry pick `Fix the forward when final_loss_mask is not present (14201)` into `r2.4.0` by @chtruong814 :: PR: #14225
- Cherry pick `Fix Llama Nemotron Nano Importer (14222)` into `r2.4.0` by @chtruong814 :: PR: #14226
- Cherry pick `[automodel] fix loss_mask pad token (14150)` into `r2.4.0` by @chtruong814 :: PR: #14227
- [Performance script] FSDP-UBR related recipe update (#14208) by @youngeunkwon0405 :: PR: #14233
- Fix for MCore dist ckpt loading #14229 by @stevehuang52 :: PR: #14239
- cherry-pick fix eval beam search ctc script by @lilithgrigoryan :: PR: #14242
- Cherry pick `Moving export security fixes over here (14254)` into `r2.4.0` by @chtruong814 :: PR: #14261
- Cherry pick `Confidence fix for tutorial (14250)` into `r2.4.0` by @chtruong814 :: PR: #14266
- Cherry pick `added new models to documentation (14264)` into `r2.4.0` by @chtruong814 :: PR: #14278
- Cherry-pick `FIx Flux & Flux_Controlnet initialization issue` (#14263) into `r2.4.0` by @chtruong814 :: PR: #14273
- Cherry pick `update ffmpeg install (14237)` into `r2.4.0` by @chtruong814 :: PR: #14279
## NVIDIA Neural Modules 2.3.2
This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information, see the NVIDIA Product Security page; for acknowledgement, please reach out to the NVIDIA PSIRT team.
## NVIDIA Neural Modules 2.3.1
### Highlights
- Collections
- LLM
- Llama 4: Fixed an accuracy issue caused by MoE probability normalization. Improved pre-train and fine-tune performance.
- Export & Deploy
- Updated vLLMExporter to use vLLM V1 to address a security vulnerability.
- AutoModel
- Improved chat-template handling.
- Fault Tolerance
- Local checkpointing: Fixed support for auto-inserted metric names for resuming from local checkpoints.
### Detailed Changelogs
#### Export
- Cherry-pick `Update vLLMExporter to use vLLM V1` (#13498) into `r2.3.0` by @chtruong814 :: PR: #13631
#### Uncategorized
- Bump to 2.3.1 by @chtruong814 :: PR: #13507
- Cherry pick `Use explicitly cached canary-1b-flash in CI tests (13237)` into `r2.3.0` by @ko3n1g :: PR: #13508
- Cherry pick `[automodel] bump liger-kernel to 0.5.8 + fallback (13260)` into `r2.3.0` by @ko3n1g :: PR: #13308
- Cherry-pick `Add recipe and ci scripts for qwen2vl` to `r2.3.0` by @romanbrickie :: PR: #13336
- Cherry pick `Fix skipme handling (13244)` into `r2.3.0` by @ko3n1g :: PR: #13376
- Cherry pick `Allow fp8 param gather when using FSDP (13267)` into `r2.3.0` by @ko3n1g :: PR: #13383
- Cherry pick `Handle boolean args for performance scripts and log received config (13291)` into `r2.3.0` by @ko3n1g :: PR: #13416
- Cherry pick `new perf configs (13110)` into `r2.3.0` by @ko3n1g :: PR: #13431
- Cherry pick `Adding additional unit tests for the deploy module (13411)` into `r2.3.0` by @ko3n1g :: PR: #13449
- Cherry pick `Adding more export tests (13410)` into `r2.3.0` by @ko3n1g :: PR: #13450
- Cherry pick `[automodel] add FirstRankPerNode (13373)` into `r2.3.0` by @ko3n1g :: PR: #13559
- Cherry pick `[automodel] deprecate global_batch_size dataset argument (13137)` into `r2.3.0` by @ko3n1g :: PR: #13560
- Cherry-pick `[automodel] fallback FP8 + LCE -> FP8 + CE` (#13349) into `r2.3.0` by @chtruong814 :: PR: #13561
- Cherry pick `[automodel] add find_unused_parameters=True for DDP (13366)` into `r2.3.0` by @ko3n1g :: PR: #13601
- Cherry pick `Add CI test for local checkpointing (#13012)` into `r2.3.0` by @ananthsub :: PR: #13472
- Cherry pick `[automodel] fix --mbs/gbs dtype and chat-template (13598)` into `r2.3.0` by @akoumpa :: PR: #13613
- Cherry-pick `Update t5.py` (#13082) to `r2.3.0` and `bump mcore to f98b1a0` by @chtruong814 :: PR: #13642
- [Automodel] Fix CP device_mesh issue, use PTL distsampler (#13473) by @akoumpa :: PR: #13636
- [Llama4] Fix the recipe bug - cherrypick #13649 by @gdengk :: PR: #13650
- build: Pin transformers (#13675) by @ko3n1g :: PR: #13692
## NVIDIA Neural Modules 2.3.0
### Highlights
- Export & Deploy
- NeMo 2.0 export path for NIM
- ONNX and TensorRT Export for NIM Embedding Container
- In-framework deployment for HF Models
- TRT-LLM deployment for HF Models in NeMo Framework
- Evaluation
- Integrate nvidia-lm-eval to NeMo FW for evaluations with OpenAI API compatible in-framework deployment
- AutoModel
- VLM AutoModelForImageTextToText
- FP8 for AutoModel
- Support CP with FSDP2
- Support TP with FSDP2
- Performance Optimization
- add support for cut cross entropy & liger kernel
- Gradient Checkpointing
- Fault Tolerance
- Integrate NVRx v0.3 Local checkpointing
- Collections
- LLM
- Llama4
- Llama Nemotron Ultra
- Llama Nemotron Super
- Llama Nemotron Nano
- Nemotron-h/5
- DeepSeek V3 Pretraining
- Evo2
- Qwen 2.5 (see the import sketch at the end of these highlights)
- LoRA for Qwen3-32B and Qwen3-30B-A3B
- MultiModal
- FLUX
- Gemma 3
- Qwen2-VL
- ASR
- NeMo Run support for ASR training
- N-Gram LM on GPU for AED
- N-Gram LM on GPU + Transducer greedy decoding (RNN-T, TDT)
- Timestamps support for AED timestamp supported models
- Migrate SpeechLM to NeMo 2.0
- Canary-1.1
- Replace ClassificationModels class with LabelModels
- Performance
- Functional MXFP8 support for (G)B200
- Current scaling recipe with TP communication overlap and FP8 param gathers
- Custom FSDP support that fully utilizes GB200 NVL72
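To make the newly added LLM checkpoints usable in NeMo 2.0, Hugging Face weights are typically converted with `llm.import_ckpt`. The sketch below is a hedged example for Qwen 2.5; the model/config class names are assumptions, so check the `nemo.collections.llm` API of this release for the exact spelling.

```python
# Minimal sketch (class names are assumptions): converting a Hugging Face
# checkpoint into NeMo 2.0 format with llm.import_ckpt.
from nemo.collections import llm

if __name__ == "__main__":
    llm.import_ckpt(
        model=llm.Qwen2Model(config=llm.Qwen25Config7B()),  # assumed Qwen 2.5 classes
        source="hf://Qwen/Qwen2.5-7B",
    )
```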
### Detailed Changelogs
#### ASR
- Added model config params for Canary-1B-Flash, Canary-180M-Flash models by @KunalDhawan :: PR: #12588
- Canary tutorial by @ankitapasad :: PR: #12613
- Canary tutorial fix timestamp by @ankitapasad :: PR: #12677
- revert config by @nithinraok :: PR: #12689
- canary longform inference script with timestamps option by @krishnacpuvvada :: PR: #12653
- Fix default timestamps value for Hybrid ASR models by @artbataev :: PR: #12681
- Fix k2 installation with PyTorch 2.6.0 by @artbataev :: PR: #12686
- Improve time and RTFx report for ASR by @artbataev :: PR: #12680
- Modify train args by @ankitapasad :: PR: #12700
- Fix asr doc warnings by @nithinraok :: PR: #12720
- Rename `FastNGramLM` -> `NGramGPULanguageModel` by @artbataev :: PR: #12755
- transcribe fix for new hypotheses by @nune-tadevosyan :: PR: #12801
- Fix timestamps when cuda graphs enabled by @monica-sekoyan :: PR: #12808
- update streaming conformer by @stevehuang52 :: PR: #12846
- AED Decoding with N-Gram LM by @artbataev :: PR: #12730
- update notebook by @nithinraok :: PR: #13088
- bugfix ASR_Context_Biasing.ipynb by @lilithgrigoryan :: PR: #13109
- Change branch for installation from main to r2.3.0 by @ankitapasad :: PR: #13266
#### TTS
- Add Magpie-TTS and Updates NeMo Audio Codecs by @blisc :: PR: #12606
- fix bug from prior commit (#13264) by @blisc :: PR: #13328
#### NLP / NMT
- Remove old peft docs by @cuichenx :: PR: #12675
- Add code coverage for llm gpt models conversion tests by @suiyoubi :: PR: #12665
- Make BERT TransformerBlockWithPostLNSupport accept more inputs from Mcore by @suiyoubi :: PR: #12685
- remove gifs from documentation by @dimapihtar :: PR: #12732
- Rename `FastNGramLM` -> `NGramGPULanguageModel` by @artbataev :: PR: #12755
- fix NeMo documentation by @dimapihtar :: PR: #12754
- GPT Model/Data/Recipe Unit Test by @suiyoubi :: PR: #12757
- ci: Exclude nlp, mm, vision collections by @ko3n1g :: PR: #12816
- Add vocab size as attr to GPT and T5 Configs, use file name based logger in llm.gpt.data by @hemildesai :: PR: #12862
- Fix transformer layer api with megatron cbc89b3 by @yaoyu-33 :: PR: #12885
#### Text Normalization / Inverse Text Normalization
- Rename `FastNGramLM` -> `NGramGPULanguageModel` by @artbataev :: PR: #12755
#### Export
- GHA Conversion Test and Importer/Exporter Refactor by @suiyoubi :: PR: #12597
- Fix Llama Embedding Model Exporting keys by @suiyoubi :: PR: #12691
- build: Add trtllm by @ko3n1g :: PR: #12672
- Fix trt-llm install by @chtruong814 :: PR: #12827
- Update LLaVA's next HF exporter to load ViT checkpoint from YAML by @eagle705 :: PR: #12841
- Support huggingface export to tensorrtllm by @pthombre :: PR: #12889
- Adds a built stage for the trt-llm wheel to reduce the overall test image size by @chtruong814 :: PR: #12883
#### Uncategorized
- Update changelog-build.yml by @ko3n1g :: PR: #12584
- Update changelog for `r2.2.0` by @github-actions[bot] :: PR: #12585
- Add comments for requirements by @thomasdhc :: PR: #12603
- [automodel] FSDP2Strategy: move to device if using a single-device by @akoumpa :: PR: #12593
- build: Remove numba pin by @ko3n1g :: PR: #12604
- docs: Update installation guides by @ko3n1g :: PR: #12596
- Change Llama Scaling Factor type to Float by @suiyoubi :: PR: #12616
- ci: Test multiple python versions by @ko3n1g :: PR: #12619
- ci: Disable reformat by @ko3n1g :: PR: #12620
- Updating ModelOpt to 0.25.0 by @janekl :: PR: #12633
- [automodel] add additional hf_dataset tests by @akoumpa :: PR: #12646
- [automodel] add jit_transform tests by @akoumpa :: PR: #12645
- [automodel] init eos_token_id inside data module by @yuanzhedong :: PR: #12610
- [automodel] grad ckpt by @akoumpa :: PR: #12644
- bugfix(llm/LLaMa) - dropout_position can never be equal to extended string by @soluwalana :: PR: #12649
- Fix inference pipeline quality issue by @Victor49152 :: PR: #12639
- [automodel] switch to direct=True to propagate return codes in nemorun by @akoumpa :: PR: #12651
- add Auto Conf support for bert, t5, qwen, starcoder models by @dimapihtar :: PR: #12601
- ci: Upload coverage by @ko3n1g :: PR: #12668
- ci: Re-enable changed-files action by @ko3n1g :: PR: #12683
- build: Pin sox by @ko3n1g :: PR: #12701
- add neva quantization by @linnanwang :: PR: #12698
- Clip coverage by @abhinavg4 :: PR: #12696
- GHA CI test: Remove unnecessary directive by @pablo-garay :: PR: #12714
- minor perf fixes by @malay-nagda :: PR: #12656
- Add DeepSeek V2 Lite into llm __init__.py by @suiyoubi :: PR: #12664
- Add Llama-Nemotron Nano and 70B models by @suiyoubi :: PR: #12712
- Save batch norm running stats in PEFT checkpoints by @cuichenx :: PR: #12666
- Fix document Readme under nemo to add more information by @yaoyu-33 :: PR: #12699
- Fix ub_overlap_ag by @cuichenx :: PR: #12721
- Toggle fast tokenizer if error occurs by @cuichenx :: PR: #12722
- Update README.md for blackwell and AutoModel by @snowmanwwg :: PR: #12612
- Raise error on import_ckpt with overwrite=False plus README for checkpoint_converters by @janekl :: PR: #12693
- [automodel] fix validation_step by @soluwalana :: PR: #12659
- [automodel] vlm tests by @akoumpa :: PR: #12716
- Auto Configurator code coverage by @dimapihtar :: PR: #12694
- [automodel] fix automodel benchmark script by @yuanzhedong :: PR: #12605
- Remove unnecessary directives by @pablo-garay :: PR: #12743
- Add recipe tests for coverage by @cuichenx :: PR: #12737
- Add Qwen2.5 in NeMo2 by @suiyoubi :: PR: #12731
- add fallback_module to safe_import_from by @akoumpa :: PR: #12726
- Update quantization scripts & relax modelopt requirement specifier by @janekl :: PR: #12709
- Import guard fasttext by @thomasdhc :: PR: #12758
- [automodel] chunked cross entropy by @akoumpa :: PR: #12752
- Add fsdp automodel test by @BoxiangW :: PR: #12718
- [automodel] if peft move only adapters to cpu by @akoumpa :: PR: #12735
- [automodel] update hf mockdataset by @akoumpa :: PR: #12643
- [automodel] remove unused cell in multinode notebook by @yuanzhedong :: PR: #12624
- Yash/llava next coverage by @yashaswikarnati :: PR: #12745
- Tidy code: remove unneeded statements/lines by @pablo-garay :: PR: #12771
- Pass tensor instead of raw number in _mock_loss_function in PTQ by @janekl :: PR: #12769
- ci: Run on nightly schedule by @ko3n1g :: PR: #12775
- Add logs for checkpoint saving start and finalization by @lepan-google :: PR: #12697
- Alit/test coverage by @JRD971000 :: PR: #12762
- Fix loss mask with packed sequence by @ashors1 :: PR: #12642
- Add pruning recipe by @kevalmorabia97 :: PR: #12602
- Update qwen2-v1 to use NeMo quick_gelu by @thomasdhc :: PR: #12787
- [doc] Fixes for audio doc warnings by @anteju :: PR: #12736
- ci: Measure multiprocessing by @ko3n1g :: PR: #12778
- ci: Fix flaky LLM tests by @ko3n1g :: PR: #12807
- Add BERT/Qwen2.5 Unit test and Refactor all GHA Conversion Tests by @suiyoubi :: PR: #12785
- Fix TransformerBlock cuda_graphs compatibility with MCore by @buptzyb :: PR: #12779
- ci: Remove `--branch` by @ko3n1g :: PR: #12809
- ci: Move scripts fully down to files by @ko3n1g :: PR: #12802
- add __init__.py to make this a package by @akoumpa :: PR: #12814
- Update changelog for `r2.2.1` by @github-actions[bot] :: PR: #12818
- add finetune support for Auto Configurator by @dimapihtar :: PR: #12770
- [automodel] add cpu:gloo to backend by @akoumpa :: PR: #12832
- add missing call to _apply_liger_kernel_to_instance by @akoumpa :: PR: #12806
- Prune docker images in GHA older than 8hrs by @chtruong814 :: PR: #12838
- [audio] Adding tests for predictive models by @anteju :: PR: #12823
- Update resiliency example notebook readme and add links to the brev launchable by @ShriyaRishab :: PR: #12843
- [automodel] qlora peft by @yzhang123 :: PR: #12817
- ci: Increase prune time by @ko3n1g :: PR: #12860
- Update base container in `Dockerfile.speech` by @artbataev :: PR: #12859
- Fix qwen2.5 1.5b configuration inheritance bug by @Aprilistic :: PR: #12852
- Update modelopt upperbound to 0.27 by @thomasdhc :: PR: #12788
- Non-blocking checkpoint cleanup failure by @jstjohn :: PR: #12804
- Improve evo2 dataset test and testability by @jstjohn :: PR: #12857
- Expand test coverage neva / mllama by @yaoyu-33 :: PR: #12715
- Weekly bump by @ko3n1g :: PR: #12891
- ci: Optional_L2_NeMo_2_SSM_Finetuning by @ko3n1g :: PR: #12893
- docs: Update guide to PEP508 by @ko3n1g :: PR: #12890
- Replace lm-eval with nvidia-lm-eval by @chtruong814 :: PR: #12888
- Handle CUDA_DEVICE_MAX_CONNECTIONS before job launch by @guyueh1 :: PR: #12833
- add nemotron5 by @JRD971000 :: PR: #12660
- Bump vllm 0.8.2 by @Laplasjan107 :: PR: #12753
- DeepseekV3 SFT finetuning perf config by @gdengk :: PR: #12829
- add apply_chat_template method to TokenizerSpec + AutoTokenizer by @akoumpa :: PR: #12878
- add accelerate to dependencies by @akoumpa :: PR: #12871
- [automodel] Add FSDPv2-compatible context parallelism support. by @cspades :: PR: #12821
- [fault tolerance] Add local checkpointing support by @ananthsub :: PR: #12839
- ci: Bump release-freeze by @ko3n1g :: PR: #12914
- ci: Use PAT for code-freeze by @ko3n1g :: PR: #12915
- ci: Use correct environment by @ko3n1g :: PR: #12917
- Freeze tags in `r2.3.0` by @github-actions[bot] :: PR: #12919
- chore: Bump version to 2.3.0.rc2 by @chtruong814 :: PR: #12920
- Version bump to `2.3.0rc3.dev0` by @github-actions[bot] :: PR: #12921
- Cherry pick `[automodel] Add linear ce loss support (12825)` into `r2.3.0` by @ko3n1g :: PR: #12922
- Cherry pick `DeepSeek V3 Multi Token Prediction (12550)` into `r2.3.0` by @ko3n1g :: PR: #12928
- Cherry pick `Set L2_NeMo_2_EVAL test to be optional (12949)` into `r2.3.0` by @ko3n1g :: PR: #12951
- Cherry pick `GB200 LLM performance scripts tuning (12791)` into `r2.3.0` by @ko3n1g :: PR: #12923
- Cherry pick `Allow configuration of PP communication backend to UCC in nemo2 (11755)` into `r2.3.0` by @ko3n1g :: PR: #12946
- Cherry pick `guard bitsandbytes based on cuda availability (12937)` into `r2.3.0` by @ko3n1g :: PR: #12958
- Cherry pick `Hugging Face model deployment support (12628)` into `r2.3.0` by @ko3n1g :: PR: #12962
- Cherry pick `fix macro-acc for pair-audio eval (12908)` into `r2.3.0` by @ko3n1g :: PR: #12963
- Cherry pick `Add energon dataset support for Qwen2VL (12831)` into `r2.3.0` by @ko3n1g :: PR: #12966
- Cherry pick `Make TETransformerLayerAutocast Support Cuda Graph (12075)` into `r2.3.0` by @ko3n1g :: PR: #12967
- Cherry pick `Use nvidia-lm-eval for evaluation (12902)` into `r2.3.0` by @ko3n1g :: PR: #12971
- Cherry pick `[NeMo 2.0] Interface for using MXFP8 and FP8 current scaling recipes (12503)` into `r2.3.0` by @ko3n1g :: PR: #12974
- Cherry pick `Fix trtllm and lightning conflict (12943)` into `r2.3.0` by @ko3n1g :: PR: #12981
- Cherry pick `Update v3 finetuning recipe (12950)` and `Specify PP first/last in strategy (12992)` into `r2.3.0` by @ko3n1g :: PR: #12984
- Cherry pick `Resolve an issue in custom megatron FSDP config setting (12948)` into `r2.3.0` by @ko3n1g :: PR: #12987
- Cherry pick `Remove getattr_proxy to avoid problematic edge cases (12176)` into `r2.3.0` by @ko3n1g :: PR: #12990
- Cherry pick `Enable async requests for in-fw deployment with OAI compatible server (12980)` into `r2.3.0` by @ko3n1g :: PR: #12994
- Cherry pick `initialize model with metadata (12496)` into `r2.3.0` by @ko3n1g :: PR: #12997
- Cherry pick `Bugfix for logits support for hf deployment (12965)` into `r2.3.0` by @ko3n1g :: PR: #13001
- Cherry pick `Update nvidia-resiliency-ext to be >= 0.3.0 (12925)` into `r2.3.0` by @ko3n1g :: PR: #13000
- Cherry-pick Fix params_dtype for distillation and GPT HF Exporter head_dim for pruning to r2.3.0 by @kevalmorabia97 :: PR: #13002
- Install nvidia-pytriton on arm (#13011) by @thomasdhc :: PR: #13013
- Version bump to `2.3.0rc4.dev0` by @github-actions[bot] :: PR: #13041
- Cherry pick `Alit/nemotron h (12942)` into `r2.3.0` by @ko3n1g :: PR: #13007
- Cherry pick `[Automodel] Add TP/SP support with default llama-like sharding plan (12796)` into `r2.3.0` by @ko3n1g :: PR: #13017
- Cherry pick `Add initial docs broken link check (12977)` into `r2.3.0` by @ko3n1g :: PR: #13045
- Cherry pick `Fix MoE Init to not use Bias in test_strategy_lib.py (13009)` into `r2.3.0` by @ko3n1g :: PR: #13014
- Cherry pick `cleaner tflops log name (13005)` into `r2.3.0` by @ko3n1g :: PR: #13024
- Cherry pick `Improve t5 test coverage (12803)` into `r2.3.0` by @ko3n1g :: PR: #13025
- Cherry pick `put the warning on the right place (12909)` into `r2.3.0` by @ko3n1g :: PR: #13035
- Cherry pick `Temporary disable CUDA graphs in DDP mode for transducer decoding (12907)` into `r2.3.0` by @ko3n1g :: PR: #13036
- Cherry pick `[automodel] peft fix vlm (13010)` into `r2.3.0` by @ko3n1g :: PR: #13037
- Cherry pick `Only run the docs link check on the container (13068)` into `r2.3.0` by @ko3n1g :: PR: #13070
- Cherry pick `Add fp8 recipe option to perf script (13032)` into `r2.3.0` by @ko3n1g :: PR: #13055
- Cherry pick `Unified ptq export (12786)` into `r2.3.0` by @ko3n1g :: PR: #13062
- Cherry pick `Fix VP list index out of range from Custom FSDP (13021)` into `r2.3.0` by @ko3n1g :: PR: #13077
- Cherry pick `Add logging to cancel out PTL's warning about dataloader not being resumable (13072)` into `r2.3.0` by @ko3n1g :: PR: #13100
- Cherry pick `Fix long sequence generation after new arg introduced in mcore engine (13049)` into `r2.3.0` by @ko3n1g :: PR: #13104
- Cherry pick `Support Mamba models quantization (12631)` into `r2.3.0` by @ko3n1g :: PR: #13105
- Cherry pick `Add track_io to user buffer configs (13071)` into `r2.3.0` by @ko3n1g :: PR: #13111
- ci: Onboard 8-GPU runner (#13115) by @ko3n1g :: PR: #13121
- Cherry pick `Add fine-tuning dataset function for FineWeb-Edu and update automodel… (13027)` into `r2.3.0` by @ko3n1g :: PR: #13118
- Cherry pick `Re-add sox to asr requirements (13092)` into `r2.3.0` by @ko3n1g :: PR: #13120
- Cherry pick `Update Mllama cross attn signature to match update MCore (13048)` into `r2.3.0` by @ko3n1g :: PR: #13122
- Cherry pick `Fix Exporter for baichuan and chatglm (13095)` into `r2.3.0` by @ko3n1g :: PR: #13126
- ci: Faster builds (#13142) by @ko3n1g :: PR: #13144
- Version bump to `2.3.0rc5.dev0` by @github-actions[bot] :: PR: #13146
- ci: Fix mcore install in test container (#13152) by @ko3n1g :: PR: #13159
- ci: Fix race-condition of container setup (#13162) by @ko3n1g :: PR: #13163
- Cherry pick `Guard decord and triton import (12861)` into `r2.3.0` by @ko3n1g :: PR: #13132
- Cherry pick `Bump TE version and apply patch (13087)` into `r2.3.0` by @ko3n1g :: PR: #13139
- Cherry pick `Update Llama-Minitron pruning-distillation notebooks from NeMo1 to NeMo2 + NeMoRun (12968)` into `r2.3.0` by @ko3n1g :: PR: #13141
- Cherry pick `Export and Deploy Tests (13076)` into `r2.3.0` by @ko3n1g :: PR: #13150
- Cherry pick `ub fp8 h100 fixes (13131)` into `r2.3.0` by @ko3n1g :: PR: #13153
- Cherry pick `Fix Transducer Decoding with CUDA Graphs in DDP with Mixed Precision (12938)` into `r2.3.0` by @ko3n1g :: PR: #13154
- Cherry pick `build: Pin modelopt (13029)` into `r2.3.0` by @chtruong814 :: PR: #13170
- Cherry pick `add fixes for nemotron-h` (13073) into `r2.3.0` by @JRD971000 :: PR: #13165
- Add dsv3 pretrain script, support flops calculation (previous #12947) by @guyueh1 :: PR: #13186
- ci: Allow running CI on weekly bump branch by @ko3n1g :: PR: #13233
- Cherry pick `Add Llama Nemotron Super/Ultra models (13044)` into `r2.3.0` by @ko3n1g :: PR: #13212
- Cherry pick `Add Blockwise FP8 to PTQ & EP to modelopt resume (12670)` into `r2.3.0` by @ko3n1g :: PR: #13239
- Cherry pick `[OAI Serving] Validate greedy generation args (redo) (13216)` into `r2.3.0` by @ko3n1g :: PR: #13242
- Cherry pick `drop sample_alpha in speechlm (13208)` into `r2.3.0` by @ko3n1g :: PR: #13246
- Cherry pick `[Eval bugfix] Move global eval-related imports inside the evaluate function (13166)` into `r2.3.0` by @ko3n1g :: PR: #13249
- Cherry pick `[Eval bugfix] Change default val of parallel_requests in eval script (13247)` into `r2.3.0` by @ko3n1g :: PR: #13253
- Cherry pick `Add tutorial for evaluation with Evals Factory (13259)` into `r2.3.0` by @ko3n1g :: PR: #13271
- Cherry pick `Fix default token durations (13168)` into `r2.3.0` by @ko3n1g :: PR: #13261
- Cherry pick `[Evaluation] Add support for nvidia-lm-eval==25.04 (13230)` into `r2.3.0` by @ko3n1g :: PR: #13274
- Cherry pick `[bug fix] set inference max seq len in inference context (13245)` into `r2.3.0` by @ko3n1g :: PR: #13276
- Cherry pick `More export and deploy unit tests (13178)` into `r2.3.0` by @ko3n1g :: PR: #13283
- Cherry pick `Reopen 13040 (13199)` into `r2.3.0` by @ko3n1g :: PR: #13303
- Cherry pick `Fix nemo1's neva notebook (13218)` into `r2.3.0` by @ko3n1g :: PR: #13312
- Cherry pick `build: various bumps (13285)` into `r2.3.0` by @ko3n1g :: PR: #13313
- Cherry-pick `ci: Increase cache pool` into `r2.3.0` by @chtruong814 :: PR: #13317
- Cherry pick `update num nodes in deepseek v3 finetune recipe (13314)` into `r2.3.0` by @ko3n1g :: PR: #13316
- Cherry pick `Fix neva notebook (13334)` into `r2.3.0` by @ko3n1g :: PR: #13335
- Cherry-pick `Add Llama4 Scout and Maverick Support (#12898)` by @ko3n1g :: PR: #13331
- Cherry pick `Fix handling Llama Embedding dimensions param and prompt type in the ONNX export tutorial (13262)` into `r2.3.0` by @ko3n1g :: PR: #13326
- Cherry-pick `Fix transformer offline for CI/CD llama4 tests` (#13339) to `r2.3.0` by @chtruong814 :: PR: #13340
- Fix llama4 test names by @chtruong814 :: PR: #13358
- Cherry pick `vLLM==0.8.5 update (13350)` into `r2.3.0` by @ko3n1g :: PR: #13354
- Cherry-pick a test and doc fix to r2.3.0 by @chtruong814 :: PR: #13338
- Cherry pick `Add llama4 training recipe (12952)` into `r2.3.0` by @ko3n1g :: PR: #13386
## NVIDIA Neural Modules 2.2.1
### Highlights
- Training
- Fix training instability in MoE-based models.
- Fix bug in Llama exporter for Llama 3.2 1B and 3B.
- Fix bug in the LoRA `linear_fc1` adapter when different TP sizes are used for saving and loading the adapter checkpoint.
### Detailed Changelogs
#### Uncategorized
Changelog
- Re-add reverted commits after 2.2.0 and set next version to be 2.2.1 by @chtruong814 :: PR: #12587
- Cherry pick `Fix exporter for llama models with shared embed and output layers (12545)` into `r2.2.0` by @ko3n1g :: PR: #12608
- Cherry pick `Fix TP for LoRA adapter on linear_fc1 (12519)` into `r2.2.0` by @ko3n1g :: PR: #12607
- Bump mcore to use 0.11.1 by @chtruong814 :: PR: #12634
## NVIDIA Neural Modules 2.2.0
### Highlights
- Training
- Blackwell and Grace Blackwell support
- Pipeline parallel support for distillation
- Improved NeMo Framework installation
- Export & Deploy
- vLLM export for NeMo 2.0
- Evaluations
- Integrate lm-eval-harness
- Collections
- LLM
- DAPT example and best practices in NeMo 2.0
- [NeMo 2.0] Enable Tool Learning and add a tutorial
- Support GPT Embedding Model (Llama 3.2 1B/3B)
- Qwen2.5, Phi4 (via AutoModel)
- SFT for Llama 3.3 model (via AutoModel)
- Support BERT Embedding Model with NeMo 2.0
- DeepSeek SFT & PEFT Support
- MultiModal
- Clip
- SP for NeVA
- CP for NeVA
- Intern-VIT
- Automodel
- Preview release.
- PEFT and SFT support for LLMs available via Hugging Face’s AutoModelForCausalLM (see the sketch after this list).
- Support for Hugging Face-native checkpoints (full model and adapter only).
- Support for distributed training via DDP and FSDP2.
- ASR/TTS
- Lhotse: TPS-free 2D bucket estimation and filtering
- Update model outputs so that all ASR outputs are in a consistent format
- Sortformer Release Model
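
For context on the Automodel preview highlighted above, the sketch below illustrates the plain Hugging Face `AutoModelForCausalLM` + PEFT LoRA primitives that workflow builds on. It is an illustrative stand-in rather than the NeMo Automodel API itself, and the checkpoint name and LoRA settings are placeholders.

```python
# Illustrative stand-in for the Automodel preview: plain Hugging Face
# transformers + peft, NOT the NeMo Automodel API itself. The checkpoint
# name and LoRA hyperparameters below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.2-1B"  # placeholder HF checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Attach LoRA adapters for parameter-efficient fine-tuning (PEFT);
# the target modules depend on the model architecture.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```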
### Detailed Changelogs
#### ASR
Changelog
- removed the line which caused a problem in nfa_tutorial by @Ssofja :: PR: #11710
- TPS-free 2D bucket estimation and filtering by @pzelasko :: PR: #11738
- Update transcribe_utils.py by @stevehuang52 :: PR: #11984
- Sortformer Diarizer 4spk v1 model PR Part 4: Sortformer Documents and Notebook Tutorials by @tango4j :: PR: #11707
- fix the issue during batched inference of Sortformer diarizer by @tango4j :: PR: #12047
- changed asr models outputs to be consistent by @Ssofja :: PR: #11818
- chore: Update notebooks by @ko3n1g :: PR: #12161
- add ctc segmentation by @ko3n1g :: PR: #12312
- clean up VAD tutorial by @stevehuang52 :: PR: #12410
- copy from main by @nithinraok :: PR: #12423
- ci: Disable ASR tests for now (#12443) by @ko3n1g :: PR: #12466
- ASR_CTC_Language_Finetuning.ipynb bugfix by @lilithgrigoryan :: PR: #12538
#### TTS
Changelog
- Add New Transformer Backbone for TTS Models by @blisc :: PR: #11911
- changed asr models outputs to be consistent by @Ssofja :: PR: #11818
- chore: Update notebooks by @ko3n1g :: PR: #12161
#### NLP / NMT
Changelog
- Use explicit imports from megatronllm_deployable.py by @janekl :: PR: #11705
- Bug fix minor bug in TRT-LLM deployment by @oyilmaz-nvidia :: PR: #11714
- gpt moe perf scripts by @malay-nagda :: PR: #11760
- Bump mcore by @ko3n1g :: PR: #11740
- Enable packed seqs for validation by @jiemingz :: PR: #11748
- Revert Mcore update since it caused regression by @pablo-garay :: PR: #11791
- Fix Gemma2 Attention Init Args by @suiyoubi :: PR: #11792
- Add null tokenizer by @erhoo82 :: PR: #11789
- Fix DistCP inference issue by @suiyoubi :: PR: #11801
- Add BERT Embedding Models E5 Recipe by @suiyoubi :: PR: #11787
- Add rope scaling configs for NeMo 1 by @BoxiangW :: PR: #11807
- Fix calculating num_available_samples by @huvunvidia :: PR: #11830
- fix sentencepiece tokenizer special tokens by @akoumpa :: PR: #11811
- add chat sft dataset to support agent tool calling by @chenrui17 :: PR: #11759
- Revert "Revert Mcore update since it caused regression (#11791)" by @ko3n1g :: PR: #11799
- fix checkpoint load issue by @dimapihtar :: PR: #11859
- Fix nemo 1 packed sequence TE version error by @cuichenx :: PR: #11874
- enable loading older TE checkpoints by @dimapihtar :: PR: #11930
- ci: Use single runner machines for unit tests by @ko3n1g :: PR: #11937
- llm performance scripts by @malay-nagda :: PR: #11736
- [MoE] add expert tensor parallelism support for NeMo2.0 MoE by @gdengk :: PR: #11880
- add exception when loading ckpt saved by TE < 1.13 by @dimapihtar :: PR: #11988
- remove renormalize_blend_weights flag by @dimapihtar :: PR: #11975
- Llama3.2 1B Embedding Model Support by @suiyoubi :: PR: #11909
- Weekly bump by @ko3n1g :: PR: #11896
- Debug Apex distributed optimizer to handle Transformer Engine 2.0 by @timmoon10 :: PR: #12004
- throw MegatronOptimizerModule warning only with mcore models by @akoumpa :: PR: #12085
- fix nmt dataclass issue by @dimapihtar :: PR: #12081
- Propagate dp last changes from mcore by @ryantwolf :: PR: #12012
- Add error message when downloading failed. by @yuanzhedong :: PR: #12139
- interface for asymmetric pipeline schedule by @erhoo82 :: PR: #12039
- chore: Update notebooks by @ko3n1g :: PR: #12161
- Cherrypick #12382, #12415 and #12424 by @cuichenx :: PR: #12425
- ASR_CTC_Language_Finetuning.ipynb bugfix by @lilithgrigoryan :: PR: #12538
#### Text Normalization / Inverse Text Normalization
Changelog
- surface attn_implementation option by @akoumpa :: PR: #11873
- attn_implementation eager fallback by @akoumpa :: PR: #12060
#### NeMo Tools
Changelog
- build: Add `sox` to SDE by @ko3n1g :: PR: #11882
- add ctc segmentation by @ko3n1g :: PR: #12312
#### Export
Changelog
- Bug fix minor bug in TRT-LLM deployment by @oyilmaz-nvidia :: PR: #11714
- In-framework deployment NeMo 2.0 nemo_export.py test by @janekl :: PR: #11749
- Fix starcoder2 missing bias in nemo2 config for TRTLLM by @meatybobby :: PR: #11809
- Autodetect dtype on exporting to TensorRT-LLM by @janekl :: PR: #11907
- PTQ & TRT-LLM updates related to upcoming PyTorch 25.01 bump by @janekl :: PR: #11941
- Run Flake8 for nemo.export module by @janekl :: PR: #11728
- Skip initialization in hf export by @cuichenx :: PR: #12136
- update export io call by @akoumpa :: PR: #12144
- add default kwargs for trtllm model runner by @pablo-garay :: PR: #12248
- cherry-pick: fix[export]: reshard model correctly handles extra_state when it's a tensor (#12132) by @terrykong :: PR: #12335
#### Bugfixes
Changelog
- added required installation for sox to process mp3 files by @Ssofja :: PR: #11709
- removed the line which caused a problem in nfa_tutorial by @Ssofja :: PR: #11710
- Bug fix minor bug in TRT-LLM deployment by @oyilmaz-nvidia :: PR: #11714
#### Uncategorized
Changelog
- Allow using vocab size from config by @shanmugamr1992 :: PR: #11718
- Fix baseline recipes by @erhoo82 :: PR: #11725
- Update changelog for `r2.1.0` by @github-actions[bot] :: PR: #11745
- ci: Fix changelog generator by @ko3n1g :: PR: #11744
- Fix 'http_port' parameter name in DeployPyTriton usages and update .qnemo compress=True path by @janekl :: PR: #11747
- Conversion NeMo and HF checkpoint script for T5 by @huvunvidia :: PR: #11739
- Add BERT Embedding Models by @suiyoubi :: PR: #11737
- Add server ready check before starting evaluation by @athitten :: PR: #11731
- only install bitsandbytes on x86 by @akoumpa :: PR: #11781
- [Bugfix] Skip processing if extra_state loads as None by @janekl :: PR: #11778
- chore(beep boop 🤖): Bump `MCORE_TAG=4dc8977...` (2025-01-07) by @ko3n1g :: PR: #11768
- make progress printer compatible with PTL v2.5.0 by @ashors1 :: PR: #11779
- Fix Mistral Conversion Issue by @suiyoubi :: PR: #11786
- build: Fix build-arg by @ko3n1g :: PR: #11815
- Lora ckpt in HF format for NeMo AutoModel by @oyilmaz-nvidia :: PR: #11712
- 8x22b seq len by @malay-nagda :: PR: #11788
- Bugfix for output_generation_logits in tensorrtllm by @athitten :: PR: #11820
- handle mistralai/Mistral-7B-Instruct-v0.3 tokenizer correctly by @akoumpa :: PR: #11839
- remove tensorstore pin in requirements*.txt by @pstjohn :: PR: #11777
- Do not load context for model transform in llm inference by @hemildesai :: PR: #11751
- update nemo2sftpeft tutorial container version by @HuiyingLi :: PR: #11832
- Latest News updated for Cosmos by @lbliii :: PR: #11806
- Removes tensorstore 0.1.45 pin from requirements_deploy.txt by @pstjohn :: PR: #11858
- ci: Prune dangling images by @ko3n1g :: PR: #11885
- Disable tests that download datasets from web by @akoumpa :: PR: #11878
- Add context_logits for eval accuracy calculation in case of multi token prediction tasks by @athitten :: PR: #11753
- add dataset_root to SpecterDataModule by @suiyoubi :: PR: #11837
- Support both Path and str for APIs by @maanug-nv :: PR: #11865
- Run nsys callback on GBS not on MBS by @akoumpa :: PR: #11861
- ci: Set bump-branch to weekly by @ko3n1g :: PR: #11889
- chore: Update mcore-tag-bump-bot.yml by @ko3n1g :: PR: #11891
- ci: Bump Mcore in weekly PR by @ko3n1g :: PR: #11897
- check restore_config first by @akoumpa :: PR: #11890
- LinearAdapter: propagate args to _init_adapter by @akoumpa :: PR: #11902
- NeMo 2.0 fp8 conversion by @Laplasjan107 :: PR: #11845
- nemo ux expert tensor parallel by @akoumpa :: PR: #11903
- Add CP support to Neva in NeMo2 by @yaoyu-33 :: PR: #11850
- build: Move dependencies by @ko3n1g :: PR: #11790
- Add Flux and Flux Controlnet Support to Diffusion folder by @Victor49152 :: PR: #11794
- ci: Adjust bump mcore workflow by @ko3n1g :: PR: #11918
- ci: Small fix to bump workflow by @ko3n1g :: PR: #11919
- Revert #11890 and add a test that would have caught the error by @cuichenx :: PR: #11914
- ci: Adjust input argument by @ko3n1g :: PR: #11921
- Create test_phi3.py by @mayani-nv :: PR: #11843
- Enable NeMo importer and loading dist CKPT for training by @Victor49152 :: PR: #11927
- build: Pin `triton` by @ko3n1g :: PR: #11938
- Add sharding for speechlm and vlm by @BoxiangW :: PR: #11876
- Update torch load for load from disk by @thomasdhc :: PR: #11963
- Add options to add mp_policy and parallel_fn for NeMo automodel fsdp2 by @BoxiangW :: PR: #11956
- ci: Add coverage reports by @ko3n1g :: PR: #11912
- Add batching support for evaluation by @athitten :: PR: #11934
- add use_fast option by @akoumpa :: PR: #11976
- improve error and debug messages in model connector by @cuichenx :: PR: #11979
- [checkpoint][docs] Fix typos in dist checkpointing docs by @ananthsub :: PR: #11983
- callbacks and bf16 grad by @malay-nagda :: PR: #11985
- remove --disable-ckpt from tests by @akoumpa :: PR: #11996
- nemo automodel sft squad data prep fix by @akoumpa :: PR: #11994
- Introduce evaluation API by @Glorf :: PR: #11895
- Remove deprecated tests/infer_data_path.py by @janekl :: PR: #11997
- Checkpoint saving for automodels via ModelCheckpoint by @akoumpa :: PR: #11998
- Mask vocab padding token ids from CE loss by @maanug-nv :: PR: #11999
- Add the NeMo2 memory profiling plugin by @gdengk :: PR: #12009
- chore(ci): Disable VMs cron job on forks by @mikemckiernan :: PR: #12020
- Adding speechlm AutoModel test by @oyilmaz-nvidia :: PR: #11990
- minor fix and simplify by @akoumpa :: PR: #12007
- ci: Build wheel workflow by @ko3n1g :: PR: #12021
- ci: Release workflow by @ko3n1g :: PR: #12022
- Version bump to `2.2.0rc1` by @github-actions[bot] :: PR: #12023
- ci: Run unit tests on main by @ko3n1g :: PR: #11986
- [Audio] Fix extra step in Euler sampler for flow matching inference by @racoiaws :: PR: #11989
- Set zarr range to >=2.18.2 and <3.0.0 by @chtruong814 :: PR: #12005
- ci: Run linting per domain by @ko3n1g :: PR: #12027
- Replace reference of requirements_infer.txt with requirements_deploy.txt by @chtruong814 :: PR: #12029
- ci: Always run linting by @ko3n1g :: PR: #12035
- ci: Retry on timeout by @ko3n1g :: PR: #11974
- [MoE] fix run err in mixtral22B recipe and update its perf config by @gdengk :: PR: #12036
- Version bump to `2.2.0rc2.dev0` by @github-actions[bot] :: PR: #12040
- ci: Update weekly brain by @ko3n1g :: PR: #12043
- ci: Update workflow by @ko3n1g :: PR: #12044
- nemo-automodel: fsdp2 support for peft by @akoumpa :: PR: #12008
- fix llama-3.1 hf model_id by @AtsunoriFujita :: PR: #11774
- Clip Model in Nemo2 by @abhinavg4 :: PR: #11980
- Adding TFLOPs callback for Multimodal models and NeVA calculator by @parthmannan :: PR: #11969
- ci: Allow skipping docs by @ko3n1g :: PR: #12048
- avoid mismatch error when loading older TE checkpoints by @dimapihtar :: PR: #12028
- Add padding in mllama vision encoder to align with HF by @meatybobby :: PR: #11808
- chore: Add warning for rebase by @ko3n1g :: PR: #12061
- ci: Lint Python files only by @ko3n1g :: PR: #12064
- Recipe changes for performance by @guyueh1 :: PR: #11763
- Pipeline-parallel support for Knowledge Distillation (NeMo 2) by @AAnoosheh :: PR: #11766
- add cp_comm_type param to Mistral config by @dimapihtar :: PR: #12049
- Conformer-based spectrogram estimator by @anteju :: PR: #12002
- Adding nemo CI by @abhinavg4 :: PR: #12052
- Update optimization features readme from nemo1 to nemo2 by @yaoyu-33 :: PR: #12071
- Add Llama Embedding Tutorial by @suiyoubi :: PR: #12042
- Fix Linting by @suiyoubi :: PR: #12079
- Fix hf_dataset bug by @BoxiangW :: PR: #12072
- set TOKENIZERS_PARALLELISM=True by @akoumpa :: PR: #12083
- minor fix in model's summary indentation during logging by @akoumpa :: PR: #12084
- Refactor VLM modules / Add InternVit submodule support by @yaoyu-33 :: PR: #11851
- Fix SBERT with sequence_len_offset by @suiyoubi :: PR: #12057
- ci: codecov by @ko3n1g :: PR: #12030
- build: Improve installer by @ko3n1g :: PR: #12016
- ci: Modular unit tests by @ko3n1g :: PR: #12104
- ci: Update bump workflow by @ko3n1g :: PR: #12106
- etp docs by @akoumpa :: PR: #12111
- build: Better caching by @ko3n1g :: PR: #12109
- ci: Fix flaky test by @ko3n1g :: PR: #12113
- Ensure nemo.collections.vlm does not strictly require transformer engine by @chtruong814 :: PR: #12108
- build: Optimize by @ko3n1g :: PR: #12112
- refactor peft module matching; introduce exclude_modules by @akoumpa :: PR: #12066
- Update mcore commit (02.06.25) by @pablo-garay :: PR: #12114
- ci: Bump Mcore inplace by @ko3n1g :: PR: #12115
- ci: Bump bot by @ko3n1g :: PR: #12117
- Add neva pretrain script by @yaoyu-33 :: PR: #12033
- DAPT playbooks - with NeMo 2.0 by @jvamaraju :: PR: #12067
- Malay/bw scripts by @malay-nagda :: PR: #11961
- [MoE] Add type annotation for mixtral configs by @gdengk :: PR: #12126
- ci: Disable checks by @ko3n1g :: PR: #12129
- Add performance-optimized example for llama2 70b LoRA by @vysarge :: PR: #12055
- Add Automodel support for Deepseek v3 model by @BoxiangW :: PR: #12099
- Bug fix with generation of expert_tensor_parallel_rank by @guyueh1 :: PR: #12125
- Rename neva datamodule by @yaoyu-33 :: PR: #12121
- Update vLLM to 0.7.2 by @Laplasjan107 :: PR: #12078
- Prevent downloading dataset every time in ci test by @cuichenx :: PR: #12095
- AudioToAudioModel: fix model->dataloader sample_rate parameter injection by @racoiaws :: PR: #12092
- Minor Bug Fixes - LLaMa Embedding by @soluwalana :: PR: #12146
- build: Force re-install VCS dependencies by @ko3n1g :: PR: #12155
- Cherry pick `build: Force re-install VCS dependencies (12155)` into `r2.2.0` by @ko3n1g :: PR: #12191
- Cherry pick `Add function calling SFT NeMo2.0 tutorial (11868)` into `r2.2.0` by @ko3n1g :: PR: #12180
- Cherry pick `Update TTS code to remove calls to deprecated functions (12153)` into `r2.2.0` by @ko3n1g :: PR: #12201
- Cherry pick `Fix multi-GPU in-framework deployment (12090)` into `r2.2.0` by @ko3n1g :: PR: #12172
- Cherry pick `disable moe logging to avoid deepseek hang (12168)` into `r2.2.0` by @ko3n1g :: PR: #12192
- Cherry pick `build: Pin down transformers (12229)` into `r2.2.0` by @ko3n1g :: PR: #12230
- Cherry pick `Fix loading extra states from torch tensor (12185)` into `r2.2.0` by @ko3n1g :: PR: #12226
- Cherry pick `nemo-automodel checkpoint-io refactor (12070)` into `r2.2.0` by @ko3n1g :: PR: #12234
- ci: Flaky tests release by @ko3n1g :: PR: #12293
- Cherry pick `Set L2_Speech_Batch_Size_OOMptimizer_Canary to be optional (12299)` into `r2.2.0` by @ko3n1g :: PR: #12300
- build: Editable nemo install (#12304) by @ko3n1g :: PR: #12308
- ci: Fix test workflow by @ko3n1g :: PR: #12311
- Cherry pick `build: Exclude tensorstore 0.1.72 (12317)` into `r2.2.0` by @ko3n1g :: PR: #12318
- Cherry pick `Fix the local path in Sortformer diarizer training tutorial (12135)` into `r2.2.0` by @ko3n1g :: PR: #12316
- Cherry pick `Add eval requirement to setup.py (12152)` into `r2.2.0` by @ko3n1g :: PR: #12277
- Cherry pick `Add modelopt to requirements_nlp.txt (12261)` into `r2.2.0` by @ko3n1g :: PR: #12278
- cherry pick 12209 by @akoumpa :: PR: #12240
- Cherry pick `Energon ckpt multimodal (12245)` into `r2.2.0` by @ko3n1g :: PR: #12307
- Cherry pick `[nemo1] Fix Mamba/Bert loading from checkpoint after TE extra states were introduced (12275)` into `r2.2.0` by @ko3n1g :: PR: #12314
- Cherry pick `fix masked loss calculation (12255)` into `r2.2.0` by @ko3n1g :: PR: #12286
- chore: Cherry pick deepseek by @ko3n1g :: PR: #12324
- build: Bump PyT to 25.01 (#11973) by @ko3n1g :: PR: #12323
- Cherry pick `build: Bump mcore (12320)` into `r2.2.0` by @ko3n1g :: PR: #12328
- Cherry pick `[automodel] re-enable FSDP2 tests (12325)` into `r2.2.0` by @ko3n1g :: PR: #12331
- Cherry pick `[automodel] fix loss reporting (12303)` into `r2.2.0` by @ko3n1g :: PR: #12334
- build: Bump Mcore by @ko3n1g :: PR: #12340
- Cherry-pick Asr fixes 2.2 (#12227) by @ko3n1g :: PR: #12345
- Cherry-pick Bug fixes (#12315) by @chtruong814 :: PR: #12346
- Cherry pick `[automodel] remove fix_progress_bar from fsdp2 strategy (12339)` into `r2.2.0` by @ko3n1g :: PR: #12347
- Cherry pick `Fix NeMo1 Bert Embedding Dataset args (12341)` into `r2.2.0` by @ko3n1g :: PR: #12349
- Cherry pick `Fix NeMo1 sequence_len_offset in Bert fwd (12350)` into `r2.2.0` by @ko3n1g :: PR: #12359
- Cherry pick `Add nemo-run recipe for evaluation (12301)` into `r2.2.0` by @ko3n1g :: PR: #12352
- Cherry pick `Add DeepSeek-R1 Distillation NeMo 2.0 tutorial (12187)` into `r2.2.0` by @ko3n1g :: PR: #12355
- chore: Update package_info.py by @ko3n1g :: PR: #12362
- Version bump to `2.2.0rc4.dev0` by @github-actions[bot] :: PR: #12363
- Bump mcore to latest commit on release branch by @chtruong814 :: PR: #12360
- Cherry pick `[automodel] add lr scheduler (12351)` into `r2.2.0` by @ko3n1g :: PR: #12361
- Cherry pick `[automodel] add distributed data sampler (12326)` into `r2.2.0` by @ko3n1g :: PR: #12373
- Cherry pick `[NeVA] Fix for CP+THD (12366)` into `r2.2.0` by @ko3n1g :: PR: #12375
- Cherry pick `Ignore attribute error when serializing mcore specs (12353)` into `r2.2.0` by @ko3n1g :: PR: #12383
- Cherry pick `Avoid init_ddp for inference (12011)` into `r2.2.0` by @ko3n1g :: PR: #12385
- Cherry pick `[docs] fix notebook render (12374)` into `r2.2.0` by @ko3n1g :: PR: #12394
- Cherry pick `Neva finetune scripts and PP fix (12387)` into `r2.2.0` by @ko3n1g :: PR: #12397
- Cherry pick `[automodel] update runner tags for notebooks (12428)` into `r2.2.0` by @ko3n1g :: PR: #12431
- Cherry pick `[automodel] update examples (12411)` into `r2.2.0` by @ko3n1g :: PR: #12432
- Cherry pick `Evaluation docs (12348)` into `r2.2.0` by @ko3n1g :: PR: #12460
- Cherry pick `Update prompt format (12452)` into `r2.2.0` by @ko3n1g :: PR: #12455
- Cherry pick `Fixing a wrong Sortformer Tutorial Notebook path. (12479)` into `r2.2.0` by @ko3n1g :: PR: #12480
- Cherry pick `added a needed checks and changes for bugfix (12400)` into `r2.2.0` by @Ssofja :: PR: #12447
- Cherry pick `[automodel] fix loss/tps reporting across ranks (12389)` into `r2.2.0` by @ko3n1g :: PR: #12413
- Cherry pick `enable fsdp flag for FSDP2Strategy (12392)` into `r2.2.0` by @ko3n1g :: PR: #12429
- Cherry pick `Fix lita notebook issue (12474)` into `r2.2.0` by @ko3n1g :: PR: #12476
- Cherrypick multinode tut changes by @BoxiangW :: PR: #12501
- Cherry pick `Changed the argument types passed to metrics calculation functions (12500)` into `r2.2.0` by @ko3n1g :: PR: #12502
- Cherry pick `added needed fixes (12495)` into `r2.2.0` by @ko3n1g :: PR: #12509
- Cherry pick `update transformers version requirements (12475)` into `r2.2.0` by @ko3n1g :: PR: #12507
- Cherry pick `[checkpoint] Log timings for checkpoint IO save and load (11972)` into `r2.2.0` by @ko3n1g :: PR: #12520
- Cherry pick `few checkings needed because of the change of asr models output (12499)` into `r2.2.0` by @ko3n1g :: PR: #12513
- Oyilmaz nvidia/chore/cherry pick 12242 by @oyilmaz-nvidia :: PR: #12523
- Cherry pick `Remove _attn_implementation in LlamaBidirectionalModel constructor (12364)` into `r2.2.0` by @ko3n1g :: PR: #12525
- Cherry pick `Configure FSDP to keep module params (12074)` into `r2.2.0` by @ko3n1g :: PR: #12524
- Cherry pick `[automodel] docs (11942)` into `r2.2.0` by @ko3n1g :: PR: #12530
- Cherry pick `[automodel] update examples' comments (12518)` and `[automodel] Move PEFT to configure_model (#12491)` into `r2.2.0` by @ko3n1g :: PR: #12529
- Cherry pick `update readme to include latest pytorch version (12539)` into `r2.2.0` by @ko3n1g :: PR: #12577
- Publish r2.2.0 by @chtruong814 :: PR: #12583
## NVIDIA Neural Modules 2.1.0
### Highlights
- Training
- Fault Tolerance
- Straggler Detection
- Auto Relaunch
- LLM & MM
- MM models
- Llava-next
- Llama 3.2
- Sequence Model Parallel for NeVa
- Enable Energon
- SigLIP (NeMo 1.0 only)
- LLM 2.0 migration
- Starcoder2
- Gemma 2
- T5
- Baichuan
- BERT
- Mamba
- ChatGLM
- DoRA support
- Export
- Nemo 2.0 base model export path for NIM
- PTQ in Nemo 2.0
- ASR
- Timestamps with TDT decoder
- Timestamps option with .transcribe()
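
A minimal usage sketch of the `.transcribe()` timestamp option mentioned above, assuming a recent NeMo ASR build; the pretrained model name, audio path, and output field names are illustrative and may vary across versions.

```python
# Minimal sketch (not taken from the release itself) of the
# "Timestamps option with .transcribe()" highlight. The model name and
# the exact timestamp field layout are assumptions.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b")

# timestamps=True requests time-aligned hypotheses instead of plain strings
hypotheses = asr_model.transcribe(["audio_sample.wav"], timestamps=True)

hyp = hypotheses[0]
print(hyp.text)                           # full transcript
for word_entry in hyp.timestamp["word"]:  # word-level alignment entries
    print(word_entry)                     # e.g. word plus start/end offsets
```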
### Detailed Changelogs
#### ASR
Changelog
- [Fix] Fixed sampler override and audio_key in prepare_audio_data by @anteju :: PR: #10980
- Akoumparouli/mixtral recipe fix r2.0.0 by @akoumpa :: PR: #10994
- TDT compute timestamps option and Extra Whitespace handling for SPE by @monica-sekoyan :: PR: #10875
- ci: Switch to CPU only runner by @ko3n1g :: PR: #11035
- Fix timestamps tests by @monica-sekoyan :: PR: #11053
- ci: Pin release freeze by @ko3n1g :: PR: #11143
- Fix RNN-T loss memory usage by @artbataev :: PR: #11144
- Added deprecation notice by @Ssofja :: PR: #11133
- Fixes for Canary adapters tutorial by @pzelasko :: PR: #11184
- add ipython import guard by @nithinraok :: PR: #11191
- Self Supervised Pre-Training tutorial Fix by @monica-sekoyan :: PR: #11206
- update the return type by @nithinraok :: PR: #11210
- Timestamps to transcribe by @nithinraok :: PR: #10950
- [Doc fixes] update file names, installation instructions, bad links by @erastorgueva-nv :: PR: #11045
- Beam search algorithm implementation for TDT models by @lilithgrigoryan :: PR: #10903
- Update import 'pytorch_lightning' -> 'lightning.pytorch' by @maanug-nv :: PR: #11252
- Remove pytorch-lightning by @maanug-nv :: PR: #11306
- update hypothesis when passed through cfg by @nithinraok :: PR: #11366
- Revert "update hypothesis when passed through cfg" by @pablo-garay :: PR: #11373
- Fix transcribe speech by @nithinraok :: PR: #11379
- Lhotse support for transcribe_speech_parallel by @nune-tadevosyan :: PR: #11249
- Sortformer Diarizer 4spk v1 model PR Part 1: models, modules and dataloaders by @tango4j :: PR: #11282
- Removing unnecessary lines by @nune-tadevosyan :: PR: #11408
- Support for initializing lhotse shar dataloader via field: list[path] mapping by @pzelasko :: PR: #11460
- New extended prompt format for Canary, short utterances inference fix, and training micro-optimizations by @pzelasko :: PR: #11058
- Fixing Multi_Task_Adapters.ipynb by replacing canary2 with canary_custom by @weiqingw4ng :: PR: #11636
#### TTS
Changelog
- [Doc fixes] update file names, installation instructions, bad links by @erastorgueva-nv :: PR: #11045
- Add T5TTS by @blisc :: PR: #11193
- Update import 'pytorch_lightning' -> 'lightning.pytorch' by @maanug-nv :: PR: #11252
- Remove pytorch-lightning by @maanug-nv :: PR: #11306
- Add nvidia/low-frame-rate-speech-codec-22khz model on docs by @Edresson :: PR: #11457
#### NLP / NMT
Changelog
- Move collections.nlp imports inline for t5 by @marcromeyn :: PR: #10877
- Use a context-manager when opening files by @akoumpa :: PR: #10895
- Packed sequence bug fixes by @cuichenx :: PR: #10898
- ckpt convert bug fixes by @dimapihtar :: PR: #10878
- remove deprecated ci tests by @dimapihtar :: PR: #10922
- Update T5 tokenizer (adding additional tokens to tokenizer config) by @huvunvidia :: PR: #10972
- Add support and recipes for HF models via AutoModelForCausalLM by @akoumpa :: PR: #10962
- gpt3 175b cli by @malay-nagda :: PR: #10985
- Fix for crash with LoRA + tp_overlap_comm=false + sequence_parallel=true by @vysarge :: PR: #10920
- Update `BaseMegatronSampler` for compatibility with PTL's `_BatchProgress` by @ashors1 :: PR: #11016
- add deprecation note by @dimapihtar :: PR: #11024
- Update ModelOpt Width Pruning example defaults by @kevalmorabia97 :: PR: #10902
- switch to NeMo 2.0 recipes by @dimapihtar :: PR: #10948
- NeMo 1.0: upcycle dense to moe by @akoumpa :: PR: #11002
- Gemma2 in Nemo2 with Recipes by @suiyoubi :: PR: #11037
- Add Packed Seq option to GPT based models by @suiyoubi :: PR: #11100
- Fix MCoreGPTModel import in llm.gpt.model.base by @hemildesai :: PR: #11109
- TP+MoE peft fix by @akoumpa :: PR: #11114
- GPT recipes to use full te spec by @JimmyZhang12 :: PR: #11119
- Virtual pipeline parallel support for LoRA in NLPAdapterModelMixin by @vysarge :: PR: #11128
- update nemo args for mcore flash decode arg change by @HuiyingLi :: PR: #11138
- Call `ckpt_to_weights_subdir` from `MegatronCheckpointIO` by @ashors1 :: PR: #10897
- [Doc fixes] update file names, installation instructions, bad links by @erastorgueva-nv :: PR: #11045
- fix(export): GPT models w/ bias=False convert properly by @terrykong :: PR: #11255
- Use MegatronDataSampler in HfDatasetDataModule by @akoumpa :: PR: #11274
- Add T5TTS by @blisc :: PR: #11193
- ci: Exclude CPU machines from scan by @ko3n1g :: PR: #11300
- Revert "fix(export): GPT models w/ bias=False convert properly" by @terrykong :: PR: #11301
- remove redundant docs by @sharathts :: PR: #11302
- Update import 'pytorch_lightning' -> 'lightning.pytorch' by @maanug-nv :: PR: #11252
- Add `attention_bias` argument in transformer block and transformer layer modules, addressing change in MCore by @yaoyu-33 :: PR: #11289
- Remove pytorch-lightning by @maanug-nv :: PR: #11306
- Update T5 attention-mask shapes to be compatible with all attention-backend in new TE versions by @huvunvidia :: PR: #11059
- Add support for restoring from 2.0 checkpoint in 1.0 by @hemildesai :: PR: #11347
- Fix Gemma2 Attention Args by @suiyoubi :: PR: #11365
- mlm conversion & tiktokenizer support by @dimapihtar :: PR: #11349
- [Nemo1] Generate sharded optimizer state dicts only if needed for saving by @ananthsub :: PR: #11451
- add hindi tn/itn coverage by @mgrafu :: PR: #11382
- chore(beep boop 🤖): Bump `MCORE_TAG=67a50f2...` (2024-11-28) by @ko3n1g :: PR: #11427
- Handle exception when importing RetroGPTChunkDatasets by @guyueh1 :: PR: #11415
- Update restore from config for gpt type continual training in NeMo1 by @yaoyu-33 :: PR: #11471
- ci: Re-enable `L2_Megatron_LM_To_NeMo_Conversion` by @ko3n1g :: PR: #11484
- Apply packed sequence params change for fused rope compatibility by @ananthsub :: PR: #11506
- Huvu/tiktoken tokenizer update by @huvunvidia :: PR: #11494
#### Text Normalization / Inverse Text Normalization
Changelog
- Adding support for LightningDataModule inside Fabric-API by @marcromeyn :: PR: #10879
- Add registry to register all needed classes with artifacts in nemo.lightning.io by @hemildesai :: PR: #10861
- Update import 'pytorch_lightning' -> 'lightning.pytorch' by @maanug-nv :: PR: #11252
- Remove pytorch-lightning by @maanug-nv :: PR: #11306
- add hindi tn/itn coverage by @mgrafu :: PR: #11382
#### Export
Changelog
- Update engine build step for TRT-LLM 0.13.0 by @janekl :: PR: #10880
- Nemo 2.0 ckpt support in TRT-LLM export by @oyilmaz-nvidia :: PR: #10891
- Fix TRTLLM parallel_embedding by @meatybobby :: PR: #10975
- Export & deploy updates (part I) by @janekl :: PR: #10941
- Add doc-strings to import & export + improve logging by @marcromeyn :: PR: #11078
- NeMo-UX: fix nemo-ux export path by @akoumpa :: PR: #11081
- Fix TRTLLM nemo2 activation parsing by @meatybobby :: PR: #11062
- Support exporting Nemotron-340B for TensorRT-LLM by @jinyangyuan-nvidia :: PR: #11015
- vLLM Hugging Face exporter by @oyilmaz-nvidia :: PR: #11124
- Fix export of configuration parameters to Weights and Biases by @soluwalana :: PR: #10995
- Change activation parsing in TRTLLM by @meatybobby :: PR: #11173
- Remove builder_opt param from trtllm-build for TensorRT-LLM >= 0.14.0 by @janekl :: PR: #11259
- fix(export): GPT models w/ bias=False convert properly by @terrykong :: PR: #11255
- fix(export): update API for disabling device reassignment in TRTLLM for Aligner by @terrykong :: PR: #10863
- Add openai-gelu in gated activation for TRTLLM export by @meatybobby :: PR: #11293
- Revert "fix(export): GPT models w/ bias=False convert properly" by @terrykong :: PR: #11301
- Adding aligner export by @shanmugamr1992 :: PR: #11269
- Export & deploy updates (part II) by @janekl :: PR: #11344
- Introducing TensorRT lazy export and caching option with trt_compile() by @borisfom :: PR: #11266
- fix: export converts properly if no model_prefix by @terrykong :: PR: #11477
#### Bugfixes
Changelog
- Change default ckpt name by @maanug-nv :: PR: #11277
- Fix patching of NeMo tokenizers for correct Lambada evaluation by @janekl :: PR: #11326
#### Uncategorized
Changelog
- ci: Use Slack group by @ko3n1g :: PR: #10866
- Bump `Dockerfile.ci` (2024-10-14) by @ko3n1g :: PR: #10871
- Fix peft resume by @cuichenx :: PR: #10887
- call __post_init__ after altering config values by @akoumpa :: PR: #10885
- Late import prettytable by @maanug-nv :: PR: #10912
- Bump `Dockerfile.ci` (2024-10-17) by @ko3n1g :: PR: #10919
- Warning for missing FP8 checkpoint support for vLLM deployment by @janekl :: PR: #10906
- Fix artifact saving by @hemildesai :: PR: #10914
- Lora improvement by @cuichenx :: PR: #10918
- Huvu/t5 nemo2.0 peft by @huvunvidia :: PR: #10916
- perf recipes and Mcore DistOpt params by @malay-nagda :: PR: #10883
- ci: Fix cherry pick team by @ko3n1g :: PR: #10945
- Fix requirements for MacOS by @artbataev :: PR: #10930
- Fix nemo 2.0 recipes by @BoxiangW :: PR: #10915
- Akoumparouli/nemo ux fix dir or string artifact by @akoumpa :: PR: #10936
- Fix typo in docstring by @ashors1 :: PR: #10955
- [Nemo CICD] Remove deprecated tests by @pablo-garay :: PR: #10960
- Restore NeMo 2.0 T5 pretraining CICD test by @huvunvidia :: PR: #10952
- Convert perf plugin env vars to strings by @hemildesai :: PR: #10947
- disable dynamo for ddp checker by @akoumpa :: PR: #10961
- Bump `Dockerfile.ci` (2024-10-21) by @ko3n1g :: PR: #10965
- respect warnings' filters by @akoumpa :: PR: #10953
- Alit/mamba recipe by @JRD971000 :: PR: #10935
- Long context performance doc hot fix by @youngeunkwon0405 :: PR: #10946
- Performance mode by @malay-nagda :: PR: #10926
- Bump `Dockerfile.ci` (2024-10-22) by @ko3n1g :: PR: #10979
- Add more recipes by @cuichenx :: PR: #10957
- ci: Update tests by @ko3n1g :: PR: #10987
- Bump `Dockerfile.ci` (2024-10-23) by @ko3n1g :: PR: #11001
- llm.generate fixes by @HuiyingLi :: PR: #10983
- use __dict__ in check by @akoumpa :: PR: #11012
- LoRA support for HF::AutoModelForCausalLM by @akoumpa :: PR: #10982
- Change default for always_save_context to True by @athitten :: PR: #11014
- Fix pip install by @marcromeyn :: PR: #11026
- Change dist ckpt defaults by @ShriyaPalsamudram :: PR: #10913
- Fix _strategy_lib tests by @maanug-nv :: PR: #11033
- Basic online dynamic FP8 quantization with vLLM by @janekl :: PR: #10904
- Expose packed seq in finetuning recipes by @cuichenx :: PR: #11006
- PEFT Inference by @cuichenx :: PR: #11030
- added Lhotse online augmentation tutorial for SE by @nasretdinovr :: PR: #10944
- Bump `Dockerfile.ci` (2024-10-27) by @ko3n1g :: PR: #11051
- ci: Send team alerts on specific keywords by @ko3n1g :: PR: #10986
- Qwen2 Recipe by @suiyoubi :: PR: #10974
- Bump `Dockerfile.ci` (2024-10-28) by @ko3n1g :: PR: #11054
- Generalizing Inference pipeline in NeMo 2.0 to support encoder-decoder models by @huvunvidia :: PR: #10924
- [Bug fix] In energon MultiModalSampleConfig use default_factory in dataclass by @guyueh1 :: PR: #11041
- fix: Resolve mutable default issue in MultiModalSampleConfig dataclass by @michal2409 :: PR: #11061
- SC1/SC2 Recipe by @suiyoubi :: PR: #10971
- Wrap batch_sampler with _IndexBatchSamplerWrapper by @farhadrgh :: PR: #10934
- Performance fine-tuning recipes for llama3 8b + 70b by @vysarge :: PR: #11046
- Set TE spec name for NeMo to HF checkpoint converters by @kevalmorabia97 :: PR: #11036
- ci: Re-add secrets detector by @ko3n1g :: PR: #11038
- Adding nemo-run recipes for NeMo 2.0 T5 by @huvunvidia :: PR: #10964
- Minor fixes for NeMo 2.0 PTQ by @Laplasjan107 :: PR: #11079
- Add copyright check by @pablo-garay :: PR: #11048
- Fix finalize model grad for PEFT by @cuichenx :: PR: #11065
- ci: Less verbose infra alerts by @ko3n1g :: PR: #11080
- Add copyright notice by @pablo-garay :: PR: #11085
- ci: Fix cron schedule by @ko3n1g :: PR: #11076
- ci: Use code-freeze via Nemo-FW-Templates by @ko3n1g :: PR: #11073
- Akoumparouli/hf lit module peft ckpt bugfix by @akoumpa :: PR: #11022
- PEFT perf and TE spec fixes by @JimmyZhang12 :: PR: #11070
- Bump `Dockerfile.ci` (2024-10-30) by @ko3n1g :: PR: #11092
- NeMorun for NeMo 2.0 T5 finetuning by @huvunvidia :: PR: #11040
- fix model_checkpoint.py by @ethanhe42 :: PR: #11057
- Update PTQ tests and ModelOpt version by @janekl :: PR: #11095
- Fix datasets in CLI by @marcromeyn :: PR: #11097
- Fix yaml serialization in io mixin by @hemildesai :: PR: #11106
- disable overlap_param_gather_with_optimizer_step by @JimmyZhang12 :: PR: #11102
- nemo1 to nemo2 checkpoint convert by @HuiyingLi :: PR: #10937
- fix expert regex filter by @akoumpa :: PR: #11103
- Remove stale checkpoint deletion on checkpoint saving failure by @akoumpa :: PR: #11116
- NeMo-UX: Mistral/mixtral peft ci test by @akoumpa :: PR: #11094
- Make nemo.collections.llm PreTrainingDataModule num samples configurable by @hemildesai :: PR: #11088
- Fix packed seq path by @cuichenx :: PR: #11121
- Allow arguments passed to dataset class + Gemma recipe fix by @cuichenx :: PR: #11125
- Nemotron Recipe by @suiyoubi :: PR: #11118
- NeMo-UX: HF PeFT fix by @akoumpa :: PR: #11096
- Remove deprecated tests by @pablo-garay :: PR: #11134
- Recipe Fix for NeMo CI by @suiyoubi :: PR: #11127
- Fix freeze_model call in peft by @cuichenx :: PR: #11146
- Bump `Dockerfile.ci` (2024-11-05) by @ko3n1g :: PR: #11159
- NeMo-UX: Add sgd optim by @akoumpa :: PR: #11157
- Update copyright check by @pablo-garay :: PR: #11168
- add lora recipt for 405b by @JRD971000 :: PR: #10991
- dit training diagrams by @zpx01 :: PR: #10873
- ci: Switch to FW templates for build by @ko3n1g :: PR: #11077
- Bump `Dockerfile.ci` (2024-11-06) by @ko3n1g :: PR: #11174
- feat: Run PyLint by @ko3n1g :: PR: #11147
- Add Alpaca Finetune Datamodule by @suiyoubi :: PR: #11185
- Updated Diffusion Collection README by @zpx01 :: PR: #11179
- Add support for Cosmos Tokenizers by @jojennin :: PR: #11194
- Run formatting only if files changed. Echo message if pylint fails. by @artbataev :: PR: #11188
- Bump `Dockerfile.ci` (2024-11-07) by @ko3n1g :: PR: #11196
- Fix rotary_percentage parsing in nemo2 config by @meatybobby :: PR: #11197
- ci: Update cherry pick workflow by @ko3n1g :: PR: #11202
- ci: Build, test, publish a wheel by @ko3n1g :: PR: #11183
- Bump `Dockerfile.ci` (2024-11-08) by @ko3n1g :: PR: #11222
- update default pipeline_parallelism_type by @akoumpa :: PR: #11213
- check actual value of vocab_file by @akoumpa :: PR: #11228
- Fix VP Initialization Issue with Latest MCore by @suiyoubi :: PR: #11209
- ci: Run Pylint strictly on new files, softly on history by @ko3n1g :: PR: #11212
- ci: Add release workflow by @ko3n1g :: PR: #11180
- Fix llm.generate by @hemildesai :: PR: #11217
- Bump `Dockerfile.ci` (2024-11-11) by @ko3n1g :: PR: #11247
- Bump `Dockerfile.ci` (2024-11-12) by @ko3n1g :: PR: #11254
- Handling tokenizer in PTQ for Nemo 2.0 by @janekl :: PR: #11237
- Fix finetuning datamodule resume by @cuichenx :: PR: #11187
- ci: Move `bump mcore` to templates by @ko3n1g :: PR: #11229
- ci: Fix secrets detector by @ko3n1g :: PR: #11205
- chore(beep boop 🤖): Bump `MCORE_TAG=aded519...` (2024-11-12) by @ko3n1g :: PR: #11260
- ci: Run secrets detector on `pull_request_target` by @ko3n1g :: PR: #11263
- Advanced Diffusion Training Features by @zpx01 :: PR: #11246
- Update pruning and distillation tutorial notebooks by @gvenkatakris :: PR: #11091
- update nemo1->2 conversion according to changes in main by @HuiyingLi :: PR: #11253
- Add llama 3.1 recipes by @cuichenx :: PR: #11273
- Fix Finetune Recipe by @suiyoubi :: PR: #11267
- Configure no restart validation loop in nl.Trainer by @hemildesai :: PR: #11029
- Handle _io_unflatten_object when _thread_local.output_dir is not available by @hemildesai :: PR: #11199
- Remove opencc upperbound by @thomasdhc :: PR: #10909
- Fix head_size in NeMo to HF checkpoint converters for width pruned model support by @eagle705 :: PR: #11230
- Fixes per comments by @gvenkatakris :: PR: #11280
- Create phi3mini.py by @mayani-nv :: PR: #11281
- ci: Fix release workflow by @ko3n1g :: PR: #11286
- fix perf plugin CUDA_DEVICE_MAX_CONNECTIONS setting by @JimmyZhang12 :: PR: #11299
- PTQ via NeMo-Run CLI by @janekl :: PR: #10984
- PTQ memory optimization by @Laplasjan107 :: PR: #11257
- Update README.md for collection page by @yaoyu-33 :: PR: #11223
- Adding multimodal examples by @shanmugamr1992 :: PR: #11279
- Add HF untrusted code toggle by @akoumpa :: PR: #11313
- P2p chunk size setting in nemo 2.0 by @erhoo82 :: PR: #11312
- Nemo2 batcheval by @HuiyingLi :: PR: #11158
- DoRA by @cuichenx :: PR: #11104
- Profiling - support Chakra & Kineto trace dumping by @lilyw97 :: PR: #11115
- NeMo 2.0 SFT PEFT notebooks by @HuiyingLi :: PR: #10874
- Update symlink option for save_last in ModelCheckpoint by @paul-gibbons :: PR: #11319
- ci: Pass-through of `workflow_event` by @ko3n1g :: PR: #11322
- Add StragglerDetection and auto-relaunch to NeMo2.0 by @ShriyaPalsamudram :: PR: #11328
- Huvu/t5 nemo2.0 nemoci by @huvunvidia :: PR: #11291
- TE acceleration using callbacks by @oyilmaz-nvidia :: PR: #11261
- Leave target_module as default in PEFT Recipes by @cuichenx :: PR: #11334
- More robust tar file loading from AIStore by @pzelasko :: PR: #11323
- Fix CLIP transformer layer api by @yaoyu-33 :: PR: #11337
- pass trust_remote_code to AutoTokenizer by @akoumpa :: PR: #11343
- Fix linear layer replacement by @oyilmaz-nvidia :: PR: #11356
- fix typo by @JRD971000 :: PR: #11351
- Add torchrun local executor to recipes by @marcromeyn :: PR: #11342
- Add PP support in NeVA along with few bug fixes by @yaoyu-33 :: PR: #11170
- nemo2 peft merge by @HuiyingLi :: PR: #11017
- Add dora recipes by @cuichenx :: PR: #11330
- add fix to recipe by @JRD971000 :: PR: #11368
- Add missing test to CICD needed list by @pablo-garay :: PR: #11376
- update SquadDataModule to use run.config by @huvunvidia :: PR: #11358
- Add llama 3.2 1b and 3b by @cuichenx :: PR: #11335
- calculate metrics for nemo2 sftpeft notebook by @HuiyingLi :: PR: #11381
- Enable packed dataset for validation; add a2a_experimental argument by @michal2409 :: PR: #11378
- Fix DDP unused param error when TE is enabled in NeMo Lite by @oyilmaz-nvidia :: PR: #11364
- Update llama32 vision (mllama) use attention bias by @yaoyu-33 :: PR: #11316
- Fix environment variables in torchrun executor by @hemildesai :: PR: #11363
- Add sample generate to PTQ for NeMo 2.0 by @Laplasjan107 :: PR: #11339
- Fix selective restore by explicitly verifying keys by @hemildesai :: PR: #11377
- Minor fix by @gvenkatakris :: PR: #11353
- Add a fix for single-GPU nsys. by @tfogal :: PR: #11354
- capitalize HF as HF instead of Hf by @akoumpa :: PR: #11384
- ci: Add HF cache by @ko3n1g :: PR: #11398
- Remove logic to skip checkpoint save if checkpoint exists by @ashors1 :: PR: #11362
- Rewire tokenizer exception handling in model resume by @cuichenx :: PR: #11375
- Adding LLava-Next model class by @yashaswikarnati :: PR: #11399
- Fix vllm test issue when run_accuracy is enabled by @oyilmaz-nvidia :: PR: #11413
- data modules for llava_next by @yashaswikarnati :: PR: #11400
- Fix strategies saving unsharded optimizer states by @ananthsub :: PR: #11392
- Adjust CLI support for PTQ by @janekl :: PR: #11421
- Nemo run recipe's and example scripts for Llava Next by @yashaswikarnati :: PR: #11405
- Huvu/t5 nemo2.0 nemoci 3b11b by @huvunvidia :: PR: #11388
- ci: Allow dry-run of release by @ko3n1g :: PR: #11418
- fix dtype when init HF model from config by @akoumpa :: PR: #11420
- Handle import errors in virtual environment when running vLLM tests by @janekl :: PR: #11435
- Fix loss mask when answer_only_loss=True by @ashors1 :: PR: #11444
- [audio] Keep input directory structure when saving processed files by @anteju :: PR: #11403
- Add different recipe examples to NeMo 2.0 by @BoxiangW :: PR: #11317
- [Scripts] Remove fixed seed for adding noise by @anteju :: PR: #11401
- Add option to provide prior NeMo 2 ckpt path to convert_nemo1_to_nemo… by @hemildesai :: PR: #11452
- PTQ CLI and param updates by @janekl :: PR: #11459
- Add tests for resiliency feature integration by @maanug-nv :: PR: #11406
- ci: Disable HexHighEntropyString plugin by @ko3n1g :: PR: #11470
- Fix broken links by @shashank3959 :: PR: #11294
- Nemo 2.0 canonical lora by @cuichenx :: PR: #11416
- ci: Run secrets detector on merge-commit by @ko3n1g :: PR: #11479
- Formatting (minor) by @pablo-garay :: PR: #11485
- Fix bug related to naming by @pablo-garay :: PR: #11487
- Add BERT Model To NeMo2.0 by @suiyoubi :: PR: #11333
- Update Nemo Distributed Checkpoint User Guide by @FortunaZhang :: PR: #11489
- fix: regular torch optims (e.g., sgd) no longer error with closure spec by @terrykong :: PR: #11189
- Add recipe configs validating by @BoxiangW :: PR: #10954
- Fix finetuning PP by @cuichenx :: PR: #11474
- [docs] Documentation for audio collection by @anteju :: PR: #11426
- config hierarchy by @malay-nagda :: PR: #11145
- Force param sync when using distributed optimizer and overlap_param_gather by @hemildesai :: PR: #11486
- chore(beep boop 🤖): Bump `MCORE_TAG=bd677bf...` (2024-12-06) by @ko3n1g :: PR: #11492
- Remove default mutable arguments from AbstractEmbModel constructor by @ananthsub :: PR: #11348
- minor fix for nemo2 sftpeft readme by @HuiyingLi :: PR: #11502
- Update Llama3 Fine-Tuning Notebook by @roclark :: PR: #11522
- Fix CI issue on validation config by @BoxiangW :: PR: #11521
- Freeze tags in `r2.1.0` by @github-actions[bot] :: PR: #11556
- Cherrypick all + R2.1.0 fix cicd by @pablo-garay :: PR: #11622
- Cherry pick `Add fix docstring for speech commands (11638)` into `r2.1.0` by @ko3n1g :: PR: #11639
- Cherrypick #11628 to r2.1.0 by @nasretdinovr :: PR: #11630
- Update package_info.py by @ko3n1g :: PR: #11646
- Cherry pick `Add fix docstring for VAD (11659)` into `r2.1.0` by @ko3n1g :: PR: #11660
- Fix tokenizer trust_remote_code by @cuichenx :: PR: #11657
- Cherrypick 11568 by @cuichenx :: PR: #11656
- Cherry pick `Downgrading the 'datasets' package from 3.0.0 to 2.21.0 for Multilang_ASR.ipynb and ASR_CTC_Language_Finetuning.ipynb (11675)` into `r2.1.0` by @ko3n1g :: PR: #11677
- r2.1.0 cherrypick by @pablo-garay :: PR: #11680
- Cherry pick `Rename multimodal data module - EnergonMultiModalDataModule (11654)` into `r2.1.0` by @ko3n1g :: PR: #11685
- chore: Bump to `r2.1.0rc2` by @ko3n1g :: PR: #11693
- r2.1.0 ptl fix by @pablo-garay :: PR: #11694
## NVIDIA Neural Modules 2.1.0rc2
Prerelease: NVIDIA Neural Modules 2.1.0rc2 (2024-12-21)
## NVIDIA Neural Modules 2.1.0rc1
Prerelease: NVIDIA Neural Modules 2.1.0rc1 (2024-12-20)
## NVIDIA Neural Modules 2.1.0rc0
Prerelease: NVIDIA Neural Modules 2.1.0rc0 (2024-12-12)
## NVIDIA Neural Modules 2.0.0rc1
### Highlights
#### Large language models
- PEFT: QLoRA support, LoRA/QLora for Mixture-of-Experts (MoE) dense layer
- State Space Models & Hybrid Architecture support (Mamba2 and NV-Mamba2-hybrid)
- Support Nemotron, Minitron, Gemma2, Qwen, RAG
- Custom Tokenizer training in NeMo
- Update the Auto-Configurator for EP, CP and FSDP
#### Multimodal
- NeVA: Add SOTA LLM backbone support (Mixtral/LLaMA3) and suite of model parallelism support (PP/EP)
- Support Language Instructed Temporal-Localization Assistant (LITA) on top of video NeVA
#### ASR
- SpeechLM and SALM
- Adapters for Canary Customization
- Pytorch allocator in PyTorch 2.2 improves training speed up to 30% for all ASR models
- Cuda Graphs for Transducer Inference
- Replaced webdataset with Lhotse - gives up to 2x speedup
- Transcription Improvements - Speedup and QoL Changes
- ASR Prompt Formatter for multimodal Canary
#### Export & Deploy
- In framework PyTriton deployment with backends (a hedged sketch follows this list):
  - PyTorch
  - vLLM
  - TRT-LLM update to 0.10
- TRT-LLM C++ runtime
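As a rough illustration of the TRT-LLM backend path listed above, the sketch below exports a NeMo checkpoint to a TensorRT-LLM engine and serves it with PyTriton. It is not taken from these release notes: the class and argument names (`TensorRTLLM`, `DeployPyTriton`, `nemo_checkpoint_path`, `model_type`, `n_gpus`) and all paths are assumptions based on the `nemo.export` / `nemo.deploy` modules and should be checked against the Export & Deploy documentation for your NeMo version.
```python
# Hedged sketch: export a NeMo LLM checkpoint to a TensorRT-LLM engine and
# serve it with PyTriton. Paths and argument names are placeholders.
from nemo.export import TensorRTLLM
from nemo.deploy import DeployPyTriton

# Build a TensorRT-LLM engine from a .nemo checkpoint.
exporter = TensorRTLLM(model_dir="/tmp/trt_llm_engine")  # engine output dir (placeholder)
exporter.export(
    nemo_checkpoint_path="/models/llama-7b.nemo",  # placeholder checkpoint path
    model_type="llama",                            # assumed model_type string
    n_gpus=1,
)

# Expose the exported model behind a PyTriton endpoint.
server = DeployPyTriton(model=exporter, triton_model_name="llama")
server.deploy()  # start the Triton server
server.serve()   # block and serve requests
```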
### Detailed Changelogs
#### ASR
Changelog
- Support dataloader as input to `audio` for transcription by @titu1994 :: PR: #9201
- Clean up dev docs collection section by @yaoyu-33 :: PR: #9205
- Fix Online_Offline_Microphone_VAD_Demo.ipynb by @stevehuang52 :: PR: #9251
- Remove .nemo instead of renaming by @mikolajblaz :: PR: #9281
- Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer. by @galv :: PR: #9347
- Revert "Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer." by @titu1994 :: PR: #9351
- Prompt formatter API and canary transcribe tensor input support by @pzelasko :: PR: #9206
- Fix prompt formatter's defaults=None case in multi-task model by @pzelasko :: PR: #9366
- move AED chunked infer script by @stevehuang52 :: PR: #9367
- Use model-cast-to-bfloat16 rather than AMP-to-bfloat16 for inference. by @galv :: PR: #9198
- ci: Fix `L2_Segmentation_Tool_Parallel_ctc_segmentation_test_L2_Eng_C… by @ko3n1g :: PR: #9399
- Fix logging message for ASR by @titu1994 :: PR: #9469
- Add support to change Multi task model prompt by @titu1994 :: PR: #9542
- Enable encoder adapters for Canary and MultiTaskAED models by @titu1994 :: PR: #9409
- Audio model collection by @anteju :: PR: #9263
- TitaNet Batch Verify Speaker by @monica-sekoyan :: PR: #9337
- Fix the arguments of forward_for_export function in msdd_models by @tango4j :: PR: #9624
- chore: Pin branch in notebooks by @ko3n1g :: PR: #9697
- refactor: notebook branch release by @ko3n1g :: PR: #9711
- Canary Adapters tutorial (#9670) by @nithinraok :: PR: #9777
- typos and branch name update to r2.0.0rc1 by @nithinraok :: PR: #9846
- Fix RNNT alignments test by @artbataev :: PR: #9770
- By default trust remote code from HF Datasets by @nithinraok :: PR: #9886
- Temporarily disable cuda graph based RNN-T greedy inference for r2.0.0rc1 by @galv :: PR: #9904
- Enable CUDA graphs by default, but require CUDA 12.6 for full graphs by @artbataev :: PR: #9919
- update branch name for script by @nithinraok :: PR: #9936
- update branch by @nithinraok :: PR: #9942
#### TTS
Changelog
- Clean up dev docs collection section by @yaoyu-33 :: PR: #9205
- Add mel codec checkpoints by @anteju :: PR: #9228
- GPU unit tests: Mark flaky tests to be fixed by @pablo-garay :: PR: #9559
- chore: Pin branch in notebooks by @ko3n1g :: PR: #9697
- refactor: notebook branch release by @ko3n1g :: PR: #9711
#### LLM/Multimodal
Changelog
- Update nemo.export module for quantized models by @janekl :: PR: #9218
- Add save option to the TRT-LLM export test script by @oyilmaz-nvidia :: PR: #9221
- Checkpoint resuming compatible for 2403 container by @suiyoubi :: PR: #9199
- Clean up dev docs collection section by @yaoyu-33 :: PR: #9205
- use get with fallback when reading checkpoint_callback_params by @akoumpa :: PR: #9223
- Revert rope fusion defaults by @cuichenx :: PR: #9237
- fix import by @akoumpa :: PR: #9240
- Add TRT-LLM params like max_num_tokens and opt_num_tokens by @oyilmaz-nvidia :: PR: #9210
- sum-reduce grad_norm in DP+CP domain by @erhoo82 :: PR: #9262
- Alit/bert convert fix by @JRD971000 :: PR: #9285
- conv1d stable version by @JRD971000 :: PR: #9330
- Fix trainer builder when exp_manager is not in config by @yaoyu-33 :: PR: #9293
- Fix Peft Weights Loading in NeVA by @yaoyu-33 :: PR: #9341
- Skip sequence_parallel allreduce when using Mcore DistOpt by @akoumpa :: PR: #9344
- Fix FSDP gradient calculation with orig params by @janEbert :: PR: #9335
- TRT-LLM Export Code Cleanup by @oyilmaz-nvidia :: PR: #9270
- support null/None truncation field by @arendu :: PR: #9355
- NeVa token fusion by @paul-gibbons :: PR: #9245
- bugfix if using mcore distOpt with sft by @akoumpa :: PR: #9356
- Re-org export code by @oyilmaz-nvidia :: PR: #9353
- QLoRA by @cuichenx :: PR: #9340
- PeFT fix for distOpt by @akoumpa :: PR: #9392
- [NeMo-UX] Integrating mcore's DistributedDataParallel into MegatronStrategy by @marcromeyn :: PR: #9387
- cherry pick of #9266 by @dimapihtar :: PR: #9411
- Enable specifying alpha for PTQ INT8 SmoothQuant method by @janekl :: PR: #9423
- add support for new mcore ds features by @dimapihtar :: PR: #9388
- LoRA for MoE Layer by @cuichenx :: PR: #9396
- Mistral-7B: apply user's precision to output checkpoint by @akoumpa :: PR: #9222
- Add option to merge distributed optimizer buckets by @timmoon10 :: PR: #9414
- TRT-LLM 0.10 Update by @oyilmaz-nvidia :: PR: #9402
- In-framework deployment by @oyilmaz-nvidia :: PR: #9438
- Bugfix missing variables and argument changes to MegatronPretrainingRandomSampler by @jstjohn :: PR: #9458
- Hyena Operator by @guyjacob :: PR: #9264
- Refactor Quantizer for reusing in QAT by @kevalmorabia97 :: PR: #9276
- move load state dict after initialize parallel state in nlp_model by @ryxli :: PR: #9382
- Enable user to optionally upgrade Megatron by @jstjohn :: PR: #9478
- Fix unwrap model by @cuichenx :: PR: #9480
- fix operator precedence by @akoumpa :: PR: #9403
- [NeMo-UX] Adding context- & expert-parallelism to MegatronStrategy by @marcromeyn :: PR: #9525
- update mcoreddp call by @akoumpa :: PR: #9345
- mcore distOpt restore fix by @akoumpa :: PR: #9421
- vLLM Export Support by @apanteleev :: PR: #9381
- PL: Delete precision if using plugin. TODO switch to MegatronTrainerB… by @akoumpa :: PR: #9535
- extend get_gpt_layer_modelopt_spec to support MoE by @akoumpa :: PR: #9532
- fix mock data generation for legacy dataset by @dimapihtar :: PR: #9530
- add reset learning rate functionality by @dimapihtar :: PR: #9372
- Use closed-formula to round by multiple by @akoumpa :: PR: #9307
- GPU unit tests: Mark flaky tests to be fixed by @pablo-garay :: PR: #9559
- Consolidate gpt continue training script into pretraining script by @yaoyu-33 :: PR: #9413
- Enable encoder adapters for Canary and MultiTaskAED models by @titu1994 :: PR: #9409
- PTQ refinements by @janekl :: PR: #9574
- Add ModelOpt QAT example for Llama2 SFT model by @kevalmorabia97 :: PR: #9326
- Multimodal projection layer adapter fix for PP>1 by @paul-gibbons :: PR: #9445
- Add offline quantization script for QLoRA deployment by @cuichenx :: PR: #9455
- Make QLoRA more model-agnostic by @cuichenx :: PR: #9488
- Set n_gpu to None in nemo export by @oyilmaz-nvidia :: PR: #9593
- [NeMo-UX] Fix Megatron-optimizer by @marcromeyn :: PR: #9599
- Chat template support for megatron_gpt_eval.py by @akoumpa :: PR: #9354
- [NeMo-UX] Add PEFT by @cuichenx :: PR: #9490
- Alit/mamba tmp by @JRD971000 :: PR: #9612
- Enable MCore checkpointing optimizations by @mikolajblaz :: PR: #9505
- Change mixtral moe key name for trt-llm by @oyilmaz-nvidia :: PR: #9620
- fix ckpt load bug by @dimapihtar :: PR: #9621
- Alit/mamba by @JRD971000 :: PR: #9575
- Unwrap ckpt_io for model opt (async save) by @mikolajblaz :: PR: #9622
- MCore T5 support for NeMo - Training by @huvunvidia :: PR: #9432
- [Nemo-UX] Expose transformer_layer_spec inside GPTConfig by @marcromeyn :: PR: #9592
- Update NeMo Clip to Use MCore Modules by @yaoyu-33 :: PR: #9594
- Mistral + Mixtral Support for NeVa by @paul-gibbons :: PR: #9459
- Adding support for mcore generate by @shanmugamr1992 :: PR: #9566
- Improve error messaging during trt-llm export by @oyilmaz-nvidia :: PR: #9638
- [Cherrypick] support lora when kv_channel != hidden_size / num_heads by @cuichenx :: PR: #9644
- Parametrize FPS group by @mikolajblaz :: PR: #9648
- Cherry-pick megatron export fix from main by @borisfom :: PR: #9643
- add documentation for reset_lr feature by @dimapihtar
- chore: Pin branch in notebooks by @ko3n1g :: PR: #9697
- Cherry pick: LITA Integration by @Slyne :: PR: #9684
- SDXL improvements (and support for Draft+) by @rohitrango :: PR: #9654
- Gemma 2 by @cuichenx :: PR: #9672
- Allows non-strict load with distributed checkpoints by @mikolajblaz :: PR: #9613
- refactor: notebook branch release by @ko3n1g :: PR: #9711
- [NeMo-UX] Make TE and Apex dependencies optional by @ashors1 :: PR: #9550
- Alit/r2.0.0 by @JRD971000 :: PR: #9718
- Manually cherry-pick from PR 9679 (PR to main - Support SFT/Eval/PEFT for mcore T5) by @huvunvidia :: PR: #9737
- In framework export by @oyilmaz-nvidia :: PR: #9658
- T5 changes based on mcore changes by @pablo-garay :: PR: #9829
- [NeMo-UX] Use single instance of loss reductions in GPTModel by @hemildesai :: PR: #9801
- deprecate NeMo NLP tutorial by @dimapihtar :: PR: #9864
- Disable nvFuser setup with PyTorch 23.11 and later by @athitten :: PR: #9837
- make torch_dist ckpt strategy as default by @dimapihtar :: PR: #9852
- add rampup bs documentation by @dimapihtar :: PR: #9884
- copy of #9576 by @dimapihtar :: PR: #9986
- Support Nvidia Torch and Arch versions by @thomasdhc :: PR: #9897
- Bug fix for pooler causing dist checkpointing exception by @shanmugamr1992 :: PR: #10008
#### Export
Changelog
- Update nemo.export module for quantized models by @janekl :: PR: #9218
- Add save option to the TRT-LLM export test script by @oyilmaz-nvidia :: PR: #9221
- Add TRT-LLM params like max_num_tokens and opt_num_tokens by @oyilmaz-nvidia :: PR: #9210
- TRT-LLM Export Code Cleanup by @oyilmaz-nvidia :: PR: #9270
- Re-org export code by @oyilmaz-nvidia :: PR: #9353
- Use TensorRT-LLM native parameter names in nemo.export module by @janekl :: PR: #9424
- TRT-LLM 0.10 Update by @oyilmaz-nvidia :: PR: #9402
- vLLM Export Support by @apanteleev :: PR: #9381
- Add page context fmha option in TensorRTLLM export by @meatybobby :: PR: #9526
- Test C++ runtime on demand in nemo_export.py to avoid possible OOMs by @janekl :: PR: #9544
- Fix nemo export test by @oyilmaz-nvidia :: PR: #9547
- Add tps and pps params to the export script by @oyilmaz-nvidia :: PR: #9558
- Add Multimodal Exporter by @meatybobby :: PR: #9256
- Set n_gpu to None in nemo export by @oyilmaz-nvidia :: PR: #9593
- Inflight nemo model export support by @JimmyZhang12 :: PR: #9527
- vLLM Export Improvements by @apanteleev :: PR: #9596
- Akoumparouli/nemo ux mixtral export by @akoumpa :: PR: #9603
- Change mixtral moe key name for trt-llm by @oyilmaz-nvidia :: PR: #9620
- Fix the arguments of forward_for_export function in msdd_models by @tango4j :: PR: #9624
- Improve error messaging during trt-llm export by @oyilmaz-nvidia :: PR: #9638
- Cherry-pick megatron export fix from main by @borisfom :: PR: #9643
- In framework export by @oyilmaz-nvidia :: PR: #9658
- Add missing imports for torch dist ckpt in export by @oyilmaz-nvidia :: PR: #9826
#### Bugfixes
Changelog
- use get with fallback when reading checkpoint_callback_params by @akoumpa :: PR: #9223
- fix import by @akoumpa :: PR: #9240
- Remove .nemo instead of renaming by @mikolajblaz :: PR: #9281
- call set_expert_model_parallel_world_size instead of set_cpu_expert_m… by @akoumpa :: PR: #9275
- Fix typos in Mixtral NeMo->HF and Starcoder2 NeMo->HF conversion scripts by @evellasques :: PR: #9325
- Skip sequence_parallel allreduce when using Mcore DistOpt by @akoumpa :: PR: #9344
- Add OpenAI format response to r2.0.0rc1 by @athitten :: PR: #9796
- [NeMo UX] Support generating datasets using different train/valid/test distributions by @ashors1 :: PR: #9771
- Add missing imports for torch dist ckpt in export by @oyilmaz-nvidia :: PR: #9826
#### General Improvements
Changelog
- [Nemo CICD] run_cicd_for_release_branches_also by @pablo-garay :: PR: #9213
- rename paths2audiofiles to audio by @github-actions[bot] :: PR: #9220
- Fix ASR_Context_Biasing.ipynb contains FileNotFoundError by @github-actions[bot] :: PR: #9234
- ci: Remove duplicated job by @ko3n1g :: PR: #9258
- Fix document links by @yaoyu-33 :: PR: #9260
- Pin transformers by @github-actions[bot] :: PR: #9273
- Fix loading github raw images on notebook by @github-actions[bot] :: PR: #9283
- Accept None as an argument to decoder_lengths in GreedyBatchedCTCInfer::forward by @github-actions[bot] :: PR: #9278
- Refactor Sequence Packing Script by @cuichenx :: PR: #9271
- [Nemo-UX] Move code to collections + fix some small bugs by @marcromeyn :: PR: #9277
- Fix typo in HF tutorial by @github-actions[bot] :: PR: #9304
- Expand documentation for data parallelism and distributed optimizer by @timmoon10 :: PR: #9227
- Install alerting by @ko3n1g :: PR: #9311
- typos by @github-actions[bot] :: PR: #9315
- FP8 feature documentation by @ksivaman :: PR: #9265
- [Nemo CICD] Comment out flaky tests by @pablo-garay :: PR: #9333
- Fixed typos in README.rst by @gdevakumar :: PR: #9322
- Update README.rst to clarify installation via Conda by @SimonCW :: PR: #9323
- [Nemo CICD] update flaky test by @pablo-garay :: PR: #9339
- fix lora and ptuning and isort/black by @github-actions[bot] :: PR: #9295
- Fix P-tuning for Llama based models by @github-actions[bot] :: PR: #9300
- add large model stable training fix and contrastive loss update for variable seq by @github-actions[bot] :: PR: #9348
- Guard cuda memory allocator update by @github-actions[bot] :: PR: #9313
- [Nemo CICD] Remove unnecessary commented out code by @pablo-garay :: PR: #9364
- Update Gemma conversion script by @yaoyu-33 :: PR: #9365
- Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer. (#9347) by @github-actions[bot] :: PR: #9371
- Re-enable cuda graphs in training modes. by @github-actions[bot] :: PR: #9343
- fix typo infer_seq_lenght -> infer_seq_length by @akoumpa :: PR: #9370
- Make a backward compatibility for old MSDD configs in label models by @github-actions[bot] :: PR: #9378
- Dgalvez/fix greedy batch strategy name r2.0.0rc0 by @github-actions[bot] :: PR: #9253
- Update README.rst by @jgerh :: PR: #9393
- Force diarizer to use CUDA if cuda is available and if device=None. by @github-actions[bot] :: PR: #9390
- ci: Properly catch failed tests by introduction of workflow templates by @ko3n1g :: PR: #9324
- Fix T5 G2P Input and Output Types by @github-actions[bot] :: PR: #9269
- Huvu/rag pipeline citest by @huvunvidia :: PR: #9384
- Fix circular import for MM dataprep notebook by @github-actions[bot] :: PR: #9292
- add check if num layers is divisible by pp size by @github-actions[bot] :: PR: #9298
- [Nemo CICD] timeouts fix by @pablo-garay :: PR: #9407
- [NeMo-UX] Removing un-used ModelConfig class by @marcromeyn :: PR: #9389
- Add tutorial for Llama-3-8B lora training and deployment by @shashank3959 :: PR: #9359
- [NeMo-UX] Removing default_path from ModelConnector by @marcromeyn :: PR: #9401
- Fix README by @ericharper :: PR: #9415
- [SD] Fix SD CUDA Graph Failure by @alpha0422 :: PR: #9319
- [NeMo-UX] Adding file-lock to Connector by @marcromeyn :: PR: #9400
- Add Dev Container Bug Report by @pablo-garay :: PR: #9430
- Akoumparouli/profiling docs by @akoumpa :: PR: #9420
- ci: Enrich notifications by @ko3n1g :: PR: #9412
- Fix failing RIR unit test with lhotse 1.24+ by @pzelasko :: PR: #9444
- [NeMo-UX] Adding support for mcore distributed optimizer by @marcromeyn :: PR: #9435
- Use ModelOpt build_tensorrt_llm for building engines for qnemo checkpoints by @janekl :: PR: #9452
- ci(notifications): Fix extraction of last 2K chars by @ko3n1g :: PR: #9450
- Update readme with mlperf news by @ericharper :: PR: #9457
- [NeMo-UX] Add nsys callback by @ashors1 :: PR: #9461
- [NeMo UX] Introducing optimizer module by @marcromeyn :: PR: #9454
- Fix minor import bug in deploy module by @oyilmaz-nvidia :: PR: #9463
- ci(notifications): Fetch all jobs by @ko3n1g :: PR: #9465
- Update build_dataset.py by @stevehuang52 :: PR: #9467
- bionemo: bn2/add pipelineparallel dtype by @skothenhill-nv :: PR: #9475
- [NeMo-UX] Integrate experiment manager features with NeMo-UX APIs by @ashors1 :: PR: #9460
- Add python_requires by @galv :: PR: #9431
- [NeMo-UX] Fixing imports of NeMoLogging, AutoResume & ModelCheckpoint by @marcromeyn :: PR: #9476
- Modelopt Refactor for SDXL Quantization by @suiyoubi :: PR: #9279
- [NeMo-UX] Fixing defaults in llm.train & Mistral7BModel by @marcromeyn :: PR: #9486
- In framework deploy using deploy script by @oyilmaz-nvidia :: PR: #9468
- [NeMo-UX] Integrate tokenizer import into model.import_ckpt by @marcromeyn :: PR: #9485
- append to file by @malay-nagda :: PR: #9483
- [NeMo-UX] Fix bug in import_ckpt by @marcromeyn :: PR: #9492
- Add nemotron news by @ericharper :: PR: #9510
- Add CICD test for Stable Diffusion by @michal2409 :: PR: #9464
- Akoumparouli/nemo ux mixtral by @akoumpa :: PR: #9446
- [NeMo-UX] Llama and Gemma by @cuichenx :: PR: #9528
- [NeMo-UX] minor logging bug fixes by @ashors1 :: PR: #9529
- Update neva conversion script from and to HF by @yaoyu-33 :: PR: #9296
- [Nemo-UX] IO fixes by @marcromeyn :: PR: #9512
- Fix lhotse tests for v1.24.2 by @pzelasko :: PR: #9546
- [Nemo CICD] Make GPU Unit Tests non-optional by @pablo-garay :: PR: #9551
- Add Python AIStore SDK to container and bump min Lhotse version by @pzelasko :: PR: #9537
- [NeMo-UX] Fix tokenizer IO by @marcromeyn :: PR: #9555
- [NeMo UX] Move mistral_7b.py to mistral.py by @akoumpa :: PR: #9545
- ci: Do not attempt to send slack on fork by @ko3n1g :: PR: #9556
- Fix SDXL incorrect name in Docs by @suiyoubi :: PR: #9534
- Bump PTL version by @athitten :: PR: #9557
- [Resiliency] Straggler detection by @jbieniusiewi :: PR: #9473
- [NeMo-UX] Switch to torch_dist as default distributed checkpointing backend by @ashors1 :: PR: #9541
- [NeMo-UX] Checkpointing bug fixes by @ashors1 :: PR: #9562
- Expose MCore path_to_cache option by @maanug-nv :: PR: #9570
- [NeMo-UX] Fix Trainer serialization by @marcromeyn :: PR: #9571
- Update click version requirement by @thomasdhc :: PR: #9580
- [Fault tolerance] Heartbeat detection by @maanug-nv :: PR: #9352
- [Nemo-UX] Add fabric-API for manual forward-pass by @marcromeyn :: PR: #9577
- [Nemo-UX] Add SDK-factories to llm-collection by @marcromeyn :: PR: #9589
- [NeMo-UX] Some improvements to NeMoLogger by @marcromeyn :: PR: #9591
- Set no_sync_func & grad_sync_func by @akoumpa :: PR: #9601
- [NeMo-UX] Fix nemo logger when trainer has no loggers by @ashors1 :: PR: #9607
- Fix the dictionary format returned by the `scheduler` method by @sararb :: PR: #9609
- [NeMo-UX] Dataloading enhancements and bug fixes by @ashors1 :: PR: #9595
- Fix serialization of AutoResume by @sararb :: PR: #9616
- Jsonl support by @adityavavre :: PR: #9611
- Akoumparouli/mistral import instruct chat template fix by @akoumpa :: PR: #9567
- Remove .cuda calls, use device instead by @akoumpa :: PR: #9602
- fix converter default args by @akoumpa :: PR: #9565
- fix: remove non_blocking from PTL's .cuda call by @akoumpa :: PR: #9618
- NeVA Minor Fixes by @yaoyu-33 :: PR: #9608
- [NeMo-UX] fix pretraining data sizes and weights by @cuichenx :: PR: #9627
- [NeMo-UX] async checkpointing support by @ashors1 :: PR: #9466
- Change default parallel_save to False by @mikolajblaz :: PR: #9632
- Add REST API to deploy module by @athitten :: PR: #9539
- ci: Timeout per step, not job by @ko3n1g :: PR: #9635
- [NeMo-UX] Fix when optimizers are setup for PEFT by @marcromeyn :: PR: #9619
- [NeMo-UX] Fix pipeline parallel bug by @ashors1 :: PR: #9637
- Fixing import error for llama-index (RAG pipeline) by @pablo-garay :: PR: #9662
- llama CI fix by @rohitrango :: PR: #9663
- [NeMo-UX] Make 'load_directly_on_device' configurable by @ashors1 :: PR: #9657
- [Nemo-UX] Including all trainable-params in a PEFT-checkpoint by @marcromeyn :: PR: #9650
- [NeMo-UX] Fix imports so local configuration of runs works again by @marcromeyn :: PR: #9690
- Set TE flag in legacy -> mcore conversion script by @terrykong :: PR: #9722
- Update starthere docs text by @erastorgueva-nv :: PR: #9724
- TorchAudio installation workaround for incorrect `PYTORCH_VERSION` variable by @artbataev :: PR: #9736
- [NeMo-UX] Match nemo 1's default behavior for drop_last and pad_samples_to_global_batch_size by @ashors1 :: PR: #9707
- add a bit more for timeout (#9702) by @pablo-garay :: PR: #9754
- Fix missing parallelisms by @maanug-nv :: PR: #9725
- update branch by @nithinraok :: PR: #9764
- Fix data preprocessing script by @cuichenx :: PR: #9759
- vLLM 0.5.1 update by @apanteleev :: PR: #9779
- upper bound hf-hub by @akoumpa :: PR: #9805
- Fix few issues and docs for neva and clip in r2.0.0rc1 by @yaoyu-33 :: PR: #9681
- add dummy vision and text transformer config (assumed mcore to be false) by @rohitrango :: PR: #9699
- fix lita bugs by @Slyne :: PR: #9810
- [NeMo-UX] Log `val_loss` by @ashors1 :: PR: #9814
- [NeMo-UX] Fix some dataloading bugs by @ashors1 :: PR: #9807
- [NeMo-UX] Adding recipes by @marcromeyn :: PR: #9720
- [NeMo-UX] Set async_save from strategy rather than ModelCheckpoint by @ashors1 :: PR: #9800
- Fix hf hub for 0.24+ by @titu1994 :: PR: #9806
- [NeMo-UX] Fix a minor bug with async checkpointing by @ashors1 :: PR: #9856
- [NeMo-UX] make progress bar easier to parse by @ashors1 :: PR: #9877
- Docs: add "Nemo Fundamentals" page by @erastorgueva-nv :: PR: #9835
- Create __init__.py by @stevehuang52 :: PR: #9892
- [NeMo-UX] Fixes to make PreemptionCallback work by @hemildesai :: PR: #9830
- Fix Docker build. Make Dockerfile consistent with CI by @artbataev :: PR: #9784
- Multimodal data prep notebook fix by @cuichenx :: PR: #9910
- [NeMo-UX] Add distributed checkpointing unit tests by @ashors1 :: PR: #9794
- r2.0.0rc1 fix for dist checkpoint loading by @yaoyu-33 :: PR: #9854
- [NeMo-UX] Rename sdk references to NeMo Run by @hemildesai :: PR: #9872
- [NeMo-UX] Fix some serialization bugs by @ashors1 :: PR: #9868
- add mixtral neva tutorial (moe + token fusion + siglip) by @paul-gibbons :: PR: #9926
- [NeMo-UX] Add more NeMo Logger tests by @ashors1 :: PR: #9795
- Akoumparouli/mixtral fixes for r2.0.0rc1 by @akoumpa :: PR: #9911
- R2.0.0rc1 clip fix by @Slyne :: PR: #9871
- [NeMo-UX] Add missing docstrings and update some defaults by @ashors1 :: PR: #9895
- Add REST service requirements.txt by @oyilmaz-nvidia :: PR: #9923
- add bert latest fix by @JRD971000 :: PR: #9921
- remove empty reconfigure_limit_batches by @akoumpa :: PR: #9934
- fix mem by @terrykong :: PR: #9964
- Run a sample query for a quantized model conditionally by @janekl :: PR: #9965
- Add pydantic-settings by @oyilmaz-nvidia :: PR: #9961
- Resiliency features update by @jbieniusiewi :: PR: #9714
- [NeMo-UX] Wrap task config save in a try/except by @ashors1 :: PR: #9956
- [NeMo-UX] Update default PTL logging `save_dir` by @ashors1 :: PR: #9954
- Fix lita tutorial by @Slyne :: PR: #9980
- Add deploy and REST API support to NeMo 2.0 by @athitten :: PR: #9834
- ci: Allow changelog manual (#10156) by @ko3n1g :: PR: #10157
- docs: Add changelog by @ko3n1g :: PR: #10155
- add manifest file by @ko3n1g :: PR: #10161
## NVIDIA Neural Modules 2.0.0rc0
### Highlights
#### LLM and MM
##### Models
- Megatron Core RETRO
  - Pre-training
  - Zero-shot Evaluation
- Pretraining, conversion, evaluation, SFT, and PEFT for:
  - Mixtral 8X22B
  - Llama 3
  - SpaceGemma
- Embedding Models Fine Tuning
  - Mistral
  - BERT
- BERT models
  - Context Parallel
  - Distributed checkpoint
- Video capabilities with NeVa
##### Performance
- Distributed Checkpointing
  - Torch native backend
  - Parallel read/write
  - Async write
- Multimodal LLM (LLAVA/NeVA)
  - Pipeline Parallelism support
  - Sequence packing support
##### Export
- Integration of Export & Deploy Modules into NeMo Framework container
- Upgrade to TRT-LLM 0.9
#### Speech (ASR & TTS)
##### Models
- AED Multi Task Models (Canary) - Multi-Task Multi-Lingual Speech Recognition / Speech Translation model
- Multimodal Domain - Speech LLM supporting SALM Model
- Parakeet-tdt_ctc-1.1b Model - RTFx of > 1500 (can transcribe 1500 seconds of audio in 1 second)
- Audio Codec 16kHz Small - NeMo Neural Audio Codec for discretizing speech for use in LLMs
- mel_codec_22khz_medium
- mel_codec_44khz_medium
##### Perf Improvements
- Transcribe() upgrade - Enables one line transcribe with files, tensors, data loaders (see the sketch after this list)
- Frame looping algorithm for RNNT faster decoding - Improves Real Time Factor (RTF) by 2-3x
- Cuda Graphs + Label-Looping algorithm for RNN-T and TDT Decoding - Transducer Greedy decoding at over 1500x RTFx, on par with CTC Non-Autoregressive models
- Semi Sorted Batching support - External User contribution that speeds up training by 15-30%.
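As a minimal sketch of the one-line transcription workflow highlighted above (not part of the release notes; the model identifier `nvidia/parakeet-tdt_ctc-1.1b`, the audio file name, and the `batch_size` value are placeholders), loading a pretrained ASR model and calling `transcribe()` on a list of files looks roughly like this:
```python
# Hedged sketch of one-line transcription with a pretrained NeMo ASR model.
# The model identifier and audio path below are placeholders, not from the notes.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt_ctc-1.1b")

# transcribe() takes a list of audio files; per the highlight above, tensors and
# dataloaders can also be supplied as input.
transcriptions = asr_model.transcribe(["sample.wav"], batch_size=4)
print(transcriptions[0])
```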
##### Customization
- Context biasing for CTC word stamping - Improve accuracy for custom vocabulary and pronunciation
- Longform Inference
- Longform inference support for AED models
- Transcription of multi-channel audio for AED models
##### Misc
- Upgraded webdataset - Speech and LLM / Multimodal unified container
### Detailed Changelogs
#### ASR
Changelog
- Enable using hybrid asr models in CTC Segmentation tool by @erastorgueva-nv :: PR: #8828
- TDT confidence fix by @GNroy :: PR: #8982
- Fix union type annotations for autodoc+mock-import rendering by @pzelasko :: PR: #8956
- NeMo dev doc restructure by @yaoyu-33 :: PR: #8896
- Improved random seed configuration for Lhotse dataloaders with docs by @pzelasko :: PR: #9001
- Fix #8948, allow preprocessor to be stream captured to a cuda graph when doing per_feature normalization by @galv :: PR: #8964
- [ASR] Support for transcription of multi-channel audio for AED models by @anteju :: PR: #9007
- Add ASR latest news by @titu1994 :: PR: #9073
- Fix docs errors and most warnings by @erastorgueva-nv :: PR: #9006
- PyTorch CUDA allocator optimization for dynamic batch shape dataloading in ASR by @pzelasko :: PR: #9061
- RNN-T and TDT inference: use CUDA graphs by default by @artbataev :: PR: #8972
- Fix #8891 by supported GPU-side batched CTC Greedy Decoding by @galv :: PR: #9100
- Update branch for notebooks and ci in release by @ericharper :: PR: #9189
- Enable CUDA graphs by default only for transcription by @artbataev :: PR: #9196
- rename paths2audiofiles to audio by @nithinraok :: PR: #9209
- Fix ASR_Context_Biasing.ipynb contains FileNotFoundError by @andrusenkoau :: PR: #9233
- Cherrypick: Support dataloader as input to `audio` for transcription (#9201) by @titu1994 :: PR: #9235
- Update Online_Offline_Microphone_VAD_Demo.ipynb by @stevehuang52 :: PR: #9252
- Dgalvez/fix greedy batch strategy name r2.0.0rc0 by @galv :: PR: #9243
- Accept None as an argument to decoder_lengths in GreedyBatchedCTCInfer::forward by @galv :: PR: #9246
- Fix loading github raw images on notebook by @nithinraok :: PR: #9282
- typos by @nithinraok :: PR: #9314
- Re-enable cuda graphs in training modes. by @galv :: PR: #9338
- add large model stable training fix and contrastive loss update for variable seq by @nithinraok :: PR: #9259
- Fix conv1d package in r2.0.0rc0 by @pablo-garay :: PR: #9369
- Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer. (#9347) by @titu1994 :: PR: #9350
- Make a backward compatibility for old MSDD configs in label models by @tango4j :: PR: #9377
- Force diarizer to use CUDA if cuda is available and if device=None. by @tango4j :: PR: #9380
#### TTS
Changelog
- [TTS] Add tutorial for training audio codecs by @rlangman :: PR: #8723
- Update radtts.py by @blisc :: PR: #9097
- [Nemo CICD] RADTTS test optional by @pablo-garay :: PR: #9112
- Remove Radtts CI test by @blisc :: PR: #9144
- Fix T5 G2P Input and Output Types by @blisc :: PR: #9224
#### LLM and MM
Changelog
- Rachitg/dpa by @rachitgarg91 :: PR: #8911
- Remove precision args in trainer due to PTL update by @yaoyu-33 :: PR: #8908
- Huvu/mcore retro by @huvunvidia :: PR: #8861
- fsdp tp > 1 bug fix by @dimapihtar :: PR: #8947
- Fix memory leak at loss func by @minitu :: PR: #8868
- change the condition for get qkv tensor from linear_qkv output in mcoremixin by @HuiyingLi :: PR: #8965
- Add safety checks for 'data' key in MegatronGPTModel cfg by @HuiyingLi :: PR: #8991
- [NeMo-UX] Adding MegatronParallel by @cuichenx :: PR: #8987
- Skip top_p computations when set to 1.0 by @odelalleau :: PR: #8905
- Gemma bug by @cuichenx :: PR: #8962
- [NeMo-UX] Adding megatron strategy by @marcromeyn :: PR: #8995
- Quantized checkpoint support in export and deploy modules by @janekl :: PR: #8859
- add geglu to mlp swap by @JRD971000 :: PR: #8999
- add timeout for new_group by @acphile :: PR: #8998
- Zero-shot evaluation pipeline for mcore RETRO by @huvunvidia :: PR: #8941
- Added fusion for squared relu by @sanandaraj5597 :: PR: #8963
- Developer Documents for mcore RETRO by @huvunvidia :: PR: #9026
- [NeMo-UX] Adding GPTModel & MockDataModule by @marcromeyn :: PR: #9011
- Adding unit test for mcore RETRO model by @huvunvidia :: PR: #9022
- docs and simplification of cmd args by @arendu :: PR: #8979
- [NeMo-UX] Add checkpoint-io to MegatronStrategy by @marcromeyn :: PR: #9057
- Enable Sequence Packing and Pipeline Parallel in NeVA by @yaoyu-33 :: PR: #8957
- Mingyuanm/add back fp8 support to sd by @Victor49152 :: PR: #9070
- unfused lora by @arendu :: PR: #9004
- Handle case where num_query_groups is set to null for LoRA config setup by @vysarge :: PR: #9075
- Alit/griffin by @JRD971000 :: PR: #9021
- Implement DistributedCheckpointIO by @mikolajblaz :: PR: #9016
- Video Neva Pretraining + Inference Implementation by @paul-gibbons :: PR: #9095
- HF to .nemo for Mixtral-8x22B-instruct by @akoumpa :: PR: #9060
- mcore ds updates by @dimapihtar :: PR: #8951
- Alit/griffin perf by @JRD971000 :: PR: #9107
- Add assert for max_steps to be positive in MegatronGPTSFTModel by @athitten :: PR: #9110
- Extend sequence length padding for GPT SFT to account for context parallel by @vysarge :: PR: #8869
- Update gpt dataset config parameter for mock by @thomasdhc :: PR: #9118
- Add Mcore DistributedDataParallel and distributed optimizer into Nemo by @gdengk :: PR: #9034
- Revert "Add assert for max_steps to be positive in MegatronGPTSFTMode… by @pablo-garay :: PR: #9128
- scripts to convert HF lora to nemo by @arendu :: PR: #9102
- Prevent duplicated checkpoints by @mikolajblaz :: PR: #9015
- add TN/ITN link in speech tools list by @erastorgueva-nv :: PR: #9142
- Cleanup deprecated files and temporary changes by @cuichenx :: PR: #9088
- Use DP+CP groups as the FSDP sharding domain by @erhoo82 :: PR: #9145
- CUDA memory profile by @erhoo82 :: PR: #9096
- Fix missing func for T5 model by @gdengk :: PR: #9141
- Add knob for load_directly_on_device by @mikolajblaz :: PR: #9125
- Revert rope fusion defaults by @cuichenx :: PR: #9238
- Update nemo.export module for quantized models by @janekl :: PR: #9250
- Fix circular import for MM dataprep notebook by @cuichenx :: PR: #9287
- neva media_type + text generation default fix by @paul-gibbons :: PR: #9257
- fix lora and ptuning and isort/black by @oyilmaz-nvidia :: PR: #9290
- add check if num layers is divisible by pp size by @dimapihtar :: PR: #9208
- Fix P-tuning for Llama based models by @apanteleev :: PR: #9297
- add deprecation warnings by @pablo-garay :: PR: #9266
- move pooler under post_process by @dimapihtar :: PR: #9328
- add deprecation note for nmt by @dimapihtar :: PR: #9342
- Fix incorrect checkpoint removal logic (#9192) by @mikolajblaz :: PR: #9204
- fix fp16 precision issue by @dimapihtar :: PR: #9376
- Fix module.training for Neva in FusedAttn backward which causes nan by @yaoyu-33 :: PR: #8877
#### Export
Changelog
- Updates for TRT-LLM 0.9 by @oyilmaz-nvidia :: PR: #8873
- Mingyuanm/sdxl export by @Victor49152 :: PR: #8926
- Avoid unpacking NeMo checkpoints before exporting to TRT-LLM by @apanteleev :: PR: #8866
- Update gemma for trt-llm 0.9 by @oyilmaz-nvidia :: PR: #8974
- TRT-LLM export P-tuning related fixes by @apanteleev :: PR: #8863
#### General Improvements
Changelog
- Update package info by @ericharper :: PR: #8793
- [Nemo CICD] Update mcore 4.13.24 by @pablo-garay :: PR: #8917
- Akoumparouli/low mem mixtral ckpt converter by @akoumpa :: PR: #8895
- Adding RETRO tests to Action Tests (cicd-main.yml) by @huvunvidia :: PR: #8942
- Akoumparouli/fix sd train 2 by @akoumpa :: PR: #8883
- Update te install for jenkins by @ericharper :: PR: #8954
- [Nemo CICD] Add last job depending on others for blocking check by @pablo-garay :: PR: #8959
- Minor quantization pipeline updates by @janekl :: PR: #8924
- Fix External CLIP Converter by @yaoyu-33 :: PR: #8960
- PP support in LoRA merge script by @cuichenx :: PR: #8934
- Update PR template by @ericharper :: PR: #8978
- Update Latest News by @shashank3959 :: PR: #8837
- Fix incorrect link to latest news in README by @shashank3959 :: PR: #8985
- Update dependency install for LLM and MM by @ericharper :: PR: #8990
- Temporarily remove mcore dep by @ericharper :: PR: #9010
- [Nemo CICD] further specialize runners for more parallelism by @pablo-garay :: PR: #9036
- Update mm dataprep notebook based on feedback by @cuichenx :: PR: #9029
- Fix import in lora merge script by @cuichenx :: PR: #9032
- [Nemo CICD] Run when labeled:Run CICD by @pablo-garay :: PR: #9044
- [Nemo CICD] Add tag/label for 1-gpu runner by @pablo-garay :: PR: #9046
- [Nemo CICD] checkout v4 by @pablo-garay :: PR: #9048
- [Nemo CICD] Remove temp test change by @pablo-garay :: PR: #9049
- remove in-place addition for dreambooth train with text encoder by @Victor49152 :: PR: #8825
- Mingyuanm/sdxl quantization notebook by @Victor49152 :: PR: #9042
- [Nemo CICD] Trigger on comment issued by @pablo-garay :: PR: #9062
- zarr ckpt to torch_dist ckpt converter by @dimapihtar :: PR: #8842
- Restore PTQ tests for Llama2 (reopened) by @janekl :: PR: #9064
- add clip H config by @JRD971000 :: PR: #9082
- [NeMo-UX] Add mixed-precision plugin by @marcromeyn :: PR: #9065
- Comment baichuan test and update pr template by @ericharper :: PR: #9085
- Add safe extraction of nemo tar files by @athitten :: PR: #8976
- Improved `shard_id` parsing in `LazyNemoTarredIterator`, enables AIS dataloading by @pzelasko :: PR: #9077
- [NeMo-UX] Add mistral-7b model by @marcromeyn :: PR: #9066
- Llama3 Conversion Script Update by @suiyoubi :: PR: #9089
- dehardcode test string by @JimmyZhang12 :: PR: #8865
- [Nemo CICD] Try trigger cicd run on comment by @pablo-garay :: PR: #9111
- Lhotse dataloading: RIR augmentation and nemo/tarred input support for RIR and noise aug by @pzelasko :: PR: #9109
- mixtral evaluation PR by @Slyne :: PR: #8989
- [Nemo CICD] Revert: run GHA cicd on comment by @pablo-garay :: PR: #9119
- [Nemo CICD] Comment out flaky test: running too long by @pablo-garay :: PR: #9123
- [Nemo CICD] Add timeout to unit tests by @pablo-garay :: PR: #9132
- [Nemo CICD] Indicate optional test in name (prefix) by @pablo-garay :: PR: #9139
- video neva null image+video folder path fix by @paul-gibbons :: PR: #9116
- [NeMo-UX] Add data module by @cuichenx :: PR: #9133
- NeMo Inference Requirements by @oyilmaz-nvidia :: PR: #9093
- Remove debug print by @maanug-nv :: PR: #9074
- Remove legacy CI by @pablo-garay :: PR: #9149
- Update support for push_to_hf_hub() by @titu1994 :: PR: #9159
- [Nemo CICD] comment out flaky PTQ tests by @pablo-garay :: PR: #9160
- Update branch by @ericharper :: PR: #9211
- dist adam transpose fix by @dimapihtar :: PR: #9239
- [Nemo CICD] Increase time limit for Speech_Checkpoints_tests (#9186) by @pablo-garay :: PR: #9247
- Pin transformers by @ericharper :: PR: #9261
- Fix typo in HF tutorial by @titu1994 :: PR: #9302
## NVIDIA Neural Modules 1.23.0
### Highlights
#### Models
##### Nvidia Starcoder 2 - 15B
- Announcement
- AI Foundation Model Inference
##### NeMo Canary
- Announcement
#### NeMo LLM
- Falcon
- Code Llama
- StarCoder
- GPT perf improvements
- Context parallelism
- Mistral
- Mixtral (without expert parallelism)
- Mcore GPT Dataset integration
#### NeMo MM
- CLIP
- Stable Diffusion (supporting LoRA)
- Imagen
- ControlNet (for SD)
- Instruct pix2pix (for SD)
- LLAVA
- NeVA
- DreamFusion++
- NSFW filtering
#### NeMo ASR
- Lhotse Dataloading support #7880
- Canary: Multi task multi lingual ASR #8242
- LongForm Audio for Diarization #7737
- Faster algorithm for RNN-T Greedy #7926
- Cache-Aware streaming notebook #8296
#### NeMo TTS
#### NeMo Vision
#### Known Issues
##### ASR
###### RNNT WER calculation when fused batch size > 1 during validation / test step()
Previously, the RNNT metric was stateful while the CTC one was not ([r1.22.0](https://github.com/NVIDIA/NeMo/blob/r1.22.0/nemo/collections/asr/metrics/rnnt_wer_bpe.py#L419-L420), [r1.23.0](https://github.com/NVIDIA/NeMo/blob/r1.23.0/nemo/collections/asr/metrics/wer.py#L333))
Therefore, this calculation in the RNNT joint for the fused operation previously worked properly. However, with the unification of metrics in r1.23.0, a bug was introduced where only the last sub-batch calculates the scores and they are not accumulated. This has been patched and will be fixed in the next release.
__Workaround__: Explicitly disable fused batch size during inference using the following snippet:
```python
from omegaconf import open_dict
model = ...
decoding_cfg = model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.fused_batch_size = -1
    model.change_decoding_strategy(decoding_cfg)
```
Note: This bug does not affect scores calculated via model.transcribe() (since it does not calculate metrics during inference, just text), or using the `transcribe_speech.py` or `speech_to_text_eval.py` in `examples/asr`.
###### Two failing unit tests due to a change in expected results, caused by lhotse version update
#### Container
For additional information regarding NeMo containers, please visit:
`docker pull nvcr.io/nvidia/nemo:24.01.speech`
#### ASR
Changelog
- Update link to yaml file in ASR_with_Transducers.ipynb by @Faith-Nchifor :: PR: #8014
- Use convert_hf_dataset_to_nemo by @karpnv :: PR: #8017
- Update asr_language_modeling.rst: Add a missing word by @martin0258 :: PR: #8007
- spelling mistake by @orena1 :: PR: #7903
- update asr eval by @stevehuang52 :: PR: #8045
- fix noise aug by @stevehuang52 :: PR: #8057
- Various fixes for typos and urls by @titu1994 :: PR: #8066
- [Fix] Increase length check tolerance to prevent test failing by @anteju :: PR: #8067
- Add text metrics to asr eval by @stevehuang52 :: PR: #8087
- fix device setting to allow using accelerator cpu by @orena1 :: PR: #8084
- .ctm in data simulator annotator compliant with RT-09 specification by @popcornell :: PR: #8004
- Fix AST eval by @stevehuang52 :: PR: #8112
- fix: numba.*_num_threads resets torch num_threads #8141 by @itzsimpl :: PR: #8145
- Update dependencies by @titu1994 :: PR: #8156
- NeMo + Lhotse integration by @pzelasko :: PR: #7880
- Speedup RNN-T greedy decoding by @artbataev :: PR: #7926
- [docker] Install k2 before NeMo for faster image rebuilding by @pzelasko :: PR: #8204
- [docs] Add --force_codec to tarred dataset creation examples by @pzelasko :: PR: #8227
- Temporarily use the previous RNN-T decoding algorithm as default by @artbataev :: PR: #8226
- Make TDT inference not require duration params by @hainan-xv :: PR: #8207
- Cache Aware Streaming tutorial notebook by @erastorgueva-nv :: PR: #8296
- fix path location and branch by @nithinraok :: PR: #8304
- Attention encoder-decoder models for multiple speech-to-text tasks … by @titu1994 :: PR: #8324
- Remove asr webapp by @titu1994 :: PR: #8347
- remove _target_ at model level in aed model config [ASR] by @krishnacpuvvada :: PR: #8351
- Add change_vocabulary and save_tokenizers() support to Multitask ASR models by @titu1994 :: PR: #8357
- Change default beam size by @titu1994 :: PR: #8371
- adding jenkins test for speech_to_text_aed model by @krishnacpuvvada :: PR: #8368
- Add Finetuning tutorial with HF Datasets by @nithinraok :: PR: #8356
- wer fix by @tbartley94 :: PR: #8404
- add ensemble decoding fix by @nithinraok :: PR: #8427
- Update k2 by @artbataev :: PR: #8492
#### TTS
Changelog
- [TTS] Scale sampler steps by number of devices by @rlangman :: PR: #7947
- Add All Multimodal Source Code Part 2: Text to image, x to nerf by @yaoyu-33 :: PR: #7970
- [TTS] Add period discriminator and feature matching loss to codec recipe by @rlangman :: PR: #7884
- Added VectorQuantizer base class by @anteju :: PR: #8011
#### LLMS
Changelog
- Add interface to set NCCL options of each process group by @erhoo82 :: PR: #7923
- Support O2 training of PEFT and SFT by @cuichenx :: PR: #7971
- [NLP] Access scaler only in FP16 case by @janekl :: PR: #7916
- [NLP] Minor improvements in Llama conversion script by @janekl :: PR: #7978
- [NLP] Use helpers from utils_funcs.py in Llama conversion by @janekl :: PR: #7979
- [NLP] Remove replace_sampler_ddp (deprecated in Trainer) by @janekl :: PR: #7981
- Reworked MegatronPretrainingRandomBatchSampler to correctly handle epochs > 1 by @trias702 :: PR: #7920
- Remove deprecated arguments from TE's TransformerLayer by @jbaczek :: PR: #7917
- Add All Multimodal Source Code by @yaoyu-33 :: PR: #7791
- First draft of mcore bert model in NeMo by @shanmugamr1992 :: PR: #7814
- Support Falcon Variants (7B/40B/180B) in Mcore NeMo by @xuanzic :: PR: #7666
- FSDP + Tensor Parallelism by @erhoo82 :: PR: #7897
- Packed Sequence by @cuichenx :: PR: #7945
- Adding method back that was removed accidentally by @ericharper :: PR: #8038
- [NLP] ArtifactItem with init=True to make it debuggable by @janekl :: PR: #7980
- SFT patch: (1) enable sequence parallelism and (2) enable profile by @erhoo82 :: PR: #7963
- migration to PTL 2.0 for spellmapper model by @bene-ges :: PR: #7924
- Change the megatron config lr scheduler default and fix to change partitions script by @shan18 :: PR: #8094
- (1) Add SHARP interface to M-CORE, (2) use send/recv to send train loss to the first rank instead of b-cast by @erhoo82 :: PR: #7793
- Reconfigure limit_val_batches only for int by @athitten :: PR: #8099
- Fixing wrapper and moving it to base class by @shanmugamr1992 :: PR: #8055
- fix gated_linear_unit bug by @Agoniii :: PR: #8042
- Fix Adapter for MCore models by @cuichenx :: PR: #8124
- add war fix for sync issues by @gshennvm :: PR: #8130
- Improve PEFT UX by @cuichenx :: PR: #8131
- Enhance flexibility by passing callbacks as method argument by @michal2409 :: PR: #8015
- context parallelism by @xrennvidia :: PR: #7739
- Make pipelined TP comm overlap available with mcore by @erhoo82 :: PR: #8005
- remove deprecated scripts by @arendu :: PR: #8138
- adding OnlineSampleMapping by @arendu :: PR: #8137
- Add distopt support for FP8 params and BF16 optimizer state by @timmoon10 :: PR: #7909
- Revert adding OnlineSampleMapping by @pablo-garay :: PR: #8164
- Token count and sequence length logging for MegatronGPTSFTModel by @vysarge :: PR: #8136
- Use latest apex internal API by @jbaczek :: PR: #8129
- tune specific params in the base model by @arendu :: PR: #7745
- Virtual pipeline parallel support for MegatronGPTSFTModel by @vysarge :: PR: #7964
- removed deprecated peft model by @arendu :: PR: #8183
- remove more deprecated files by @arendu :: PR: #8169
- Pre-generate cu_seqlens argmin and max_seqlen to remove host-to-device sync by @erhoo82 :: PR: #8108
- Add the interface to use SHARP to FSDP strategy by @erhoo82 :: PR: #8202
- Multimodal required NLP base model changes by @yaoyu-33 :: PR: #8188
- [NLP] Improve and unify loading state_dict for community models by @janekl :: PR: #7977
- Rename Finetuning Scripts by @cuichenx :: PR: #8201
- Final multimodal PR with our recent developments on MM side by @yaoyu-33 :: PR: #8127
- Add include_text parameter to SFT dataloaders by @Kipok :: PR: #8198
- Add random_seed argument to generate by @Kipok :: PR: #8162
- Added support for neptune logger by @harishankar-gopalan :: PR: #8210
- Pre-compute max_seqlen and cu_seqlens_argmin in all model-parallel cases by @erhoo82 :: PR: #8222
- Use PackedSeqParams in accordance with changes in Megatron-LM by @cuichenx :: PR: #8205
- Fix to peft & virtual pipeline parallel unsupported check by @vysarge :: PR: #8216
- Fixed the tp overlap switch by @sanandaraj5597 :: PR: #8195
- add knobs for rope/swiglu fusion by @lhb8125 :: PR: #8184
- Added sample cpu_offloading switch to YAML by @sanandaraj5597 :: PR: #8148
- Syncing random seed between ranks in generate by @Kipok :: PR: #8230
- add first_val_step to mcore scheduler by @JimmyZhang12 :: PR: #8150
- Correct padding for SFT input data to account for sequence parallel + TE's fp8 op dimension requirements by @vysarge :: PR: #8240
- Mistral 7b conversion script by @akoumpa :: PR: #8052
- switch to mcore dataset [with FIM support] by @dimapihtar :: PR: #8149
- Mixtral to NeMo conversion script. by @akoumpa :: PR: #8155
- fixes to accommodate mcore changes by @HuiyingLi :: PR: #8261
- Allow MegatronPretrainingRandomSampler to do multi-epoch training by @trias702 :: PR: #8239
- Add dist ckpt support for regular optimizers by @mikolajblaz :: PR: #7749
- add deallocate pipeline output optimization by @JimmyZhang12 :: PR: #8279
- Fix memory leak caused by context parallelism hanging references by omegaconf by @JimmyZhang12 :: PR: #8299
- distributed fused adam + rampup bs support by @dimapihtar :: PR: #8302
- Update PEFT Doc by @cuichenx :: PR: #8262
- Converter script fixes for mixtral/mistral by @akoumpa :: PR: #8272
- Keep max_seqlen and cu_seqlens_argmin for later micro-batches when PP>1 by @erhoo82 :: PR: #8334
- Enable megatron core loggers for GPT pretraining by @ashbhandare :: PR: #8354
- mcore ds fix by @dimapihtar :: PR: #8283
- release updates by @dimapihtar :: PR: #8378
- Mcore customization doc by @HuiyingLi :: PR: #8298
- updated link to pubmed by @nithinraok :: PR: #8402
- mcore customization doc minor fix by @HuiyingLi :: PR: #8421
- Fixing mcore bert for TP, PP and SP by @shanmugamr1992 :: PR: #8336
- Add settings to suppress bf16 compile errors in CI on V100 by @athitten :: PR: #8481
- MoE parameter passing by @akoumpa :: PR: #8255
- Add fp8 support for SD/Update notebook paths by @Victor49152 :: PR: #8489
#### NeMo Tools
Changelog
- SDE bugfix log by @Jorjeous :: PR: #8430
#### General Improvements
Changelog
- Add news section to README by @ericharper :: PR: #7984
- Fixing conversion script to work for code llama by @shanmugamr1992 :: PR: #7997
- Fix crash when converting to mcore a model using rotary embeddings by @odelalleau :: PR: #7998
- Added a procedure for Windows users, README by @Jorjeous :: PR: #7942
- Update manifest.py to speedup loading tarred datasets by @stevehuang52 :: PR: #7900
- [Fix] Fixed name of a test by @anteju :: PR: #7986
- Fix lora merge script by @cuichenx :: PR: #8113
- Support transcoding audio formats when saving tarred datasets (FLAC, OPUS) by @pzelasko :: PR: #8102
- README edit to change Apple Silicon install instructions (to fix a break introduced by pytorch 2) by @stephenmcconnachie :: PR: #8122
- Fixes NVIDIA/apex installation to not erroneously install the pkg by @terrykong :: PR: #8126
- Graphviz fix by @GNroy :: PR: #7843
- Update README.rst by @fayejf :: PR: #8154
- Fix TP>1 issue for conversion script by @cuichenx :: PR: #8144
- Support torch jit script by @artbataev :: PR: #8027
- NeMo Multimodal Docs and Tests Initial PR by @yaoyu-33 :: PR: #8028
- Remove left-over prints in NeMo+Lhotse code by @pzelasko :: PR: #8180
- Upgrade to DLFW PyTorch 23.12 by @ericharper :: PR: #8163
- Add Lhotse support for key in NeMo manifests by @pzelasko :: PR: #8197
- Fix CPU Initialization and TP>1 for LoRA Merge Script by @cuichenx :: PR: #8199
- Add support in Neural Typecheck to disable semantic checks by @titu1994 :: PR: #8212
- Pin lhotse=1.19.2 in r1.23.0 by @pzelasko :: PR: #8303
- Multimodal r1.23.0 bug fix by @yaoyu-33 :: PR: #8315
- MCore dataset compatibility for tokenizers by @vysarge :: PR: #8390
- Update NFA video download link by @erastorgueva-nv :: PR: #8406
- Update MM Dataprep Tutorial by @cuichenx :: PR: #8410
- Fix dreambooth data sampler issue by @yaoyu-33 :: PR: #8400
- Fix a bug in CTM line processing function for multi-speaker data simulations by @tango4j :: PR: #8416
- Akoumparouli/mistral bugfix by @akoumpa :: PR: #8353
- pin to 0.5.0 by @ericharper :: PR: #8465
- Update NeMo Multimodal Requirements by @yaoyu-33 :: PR: #8515
- Fix link in multimodal dataprep tutorial by @cuichenx :: PR: #8517