# Changelog

## NVIDIA Neural Modules 2.6.0

### Highlights

- Speech
  - Add Timestamps to streaming ASR [PR](https://github.com/NVIDIA-NeMo/NeMo/pull/14766)
  - Add Streaming decoding policies (Wait-K and AlignAtt) for Canary model [PR](https://github.com/NVIDIA-NeMo/NeMo/pull/14765)
  - Add NeMo Voice Agent [PR](https://github.com/NVIDIA-NeMo/NeMo/pull/14325)
  - Hybrid RNNT-CTC Prompted Parakeet Model support [PR](https://github.com/NVIDIA-NeMo/NeMo/pull/14561)
  - [New] MT-Parakeet Streaming Models [release](https://huggingface.co/nvidia/multitalker-parakeet-streaming-0.6b-v1)
- Removed the Automodel module. Automodel is available in the repo https://github.com/NVIDIA-NeMo/Automodel.
- Removed the Deploy module. Export & Deploy is available in the repo https://github.com/NVIDIA-NeMo/Export-Deploy.
- Non-Speech NeMo 2.0 collections are deprecated and will be removed in a later release. Their functionality is available in the Megatron Bridge repo at https://github.com/NVIDIA-NeMo/Megatron-Bridge.

### Known Issues

- NeMo voice agent pipecat connecting issues

### Detailed Changelogs:

#### ASR
Changelog

- fixing kernel restarting when transcribing by @weiqingw4ng :: PR: #14665
- Downgrade "datasets" library version in ASR tutorial to ensure compatibility with HF Datasets used by @KunalDhawan :: PR: #14679
- Fixing Sortformer training tutorial notebook by @tango4j :: PR: #14680
- Fix for "EncDecRNNTBPEModel transcribe() failed with TypeError" by @andrusenkoau :: PR: #14698
- Force activations and weights cast to FP32 Jasper Encoder Squeeze-Excite (merge to main) by @erastorgueva-nv :: PR: #14743
- Use lhotse dataloader for ASR models to support in-manifest channel selection for multichannel recordings by @racoiaws :: PR: #14586
- add transducer timestamps without alignments, timestamps to streaming by @lilithgrigoryan :: PR: #14766
- Adding bf16 Sortformer train and inference by @tango4j :: PR: #14627
- Replace texterrors with kaldialign library by @andrusenkoau :: PR: #14775
- fix: Use shutil.copy fallback to handle file metadata permission errors by @vipnydav :: PR: #14639
- Add Customization Capabilities to Cache-Aware Models by @artbataev :: PR: #14757
- Documentation for gpu-based phrase boosting by @andrusenkoau :: PR: #14800
- Streaming decoding policies (Wait-K and AlignAtt) for Canary model by @andrusenkoau :: PR: #14765
- Add tests for streaming buffered and cache-aware transducer models by @artbataev :: PR: #14823
- Merge updates of Multi-Talker Parakeet Model, Modules, Dataloader and Utils PR 01 by @weiqingw4ng :: PR: #14905
- Merge updates of Multi-Talker Parakeet - Unit tests and CI tests PR 02 by @weiqingw4ng :: PR: #14932
- Add Parakeet Hybrid RNNT CTC BPE Model with Prompt support by @ealbasiri :: PR: #14561
- fix notebooks by @nithinraok :: PR: #15079
- cherry pick #15070 by @nithinraok :: PR: #15082
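
Several of the ASR entries above extend timestamp support (transducer timestamps without alignments, timestamps in streaming decoding). For orientation, here is a minimal sketch of the existing offline timestamp API that these changes build on; the checkpoint name and audio path are placeholders only, and the streaming path added in PR #14766 is driven by its own scripts rather than this call.

```python
# Hedged sketch: offline transcription with timestamps via the public NeMo ASR API.
# The checkpoint name and audio path below are placeholders, not taken from this release.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# timestamps=True returns word/segment-level timing alongside the transcript.
output = asr_model.transcribe(["audio.wav"], timestamps=True)

print(output[0].text)
for seg in output[0].timestamp["segment"]:
    print(f"{seg['start']:.2f}s - {seg['end']:.2f}s : {seg['segment']}")
```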
#### TTS
Changelog

- Remove outdated TTS Tutorials by @blisc :: PR: #14660
- Add KokoroTTS support for voice agent framework by @tango4j :: PR: #14910
- remove language_modeling by @dimapihtar :: PR: #14192
#### NLP / NMT
Changelog

- Add gpt-oss by @cuichenx :: PR: #14457
- Fix sequence packing loss calculation by @rayandasoriya :: PR: #14437
- [Perf script] Llama and GPT3 perf script use mlp cast fusion by @guyueh1 :: PR: #14575
- Delete tutorials/llm/llama/biomedical-qa directory by @cuichenx :: PR: #14653
- Add gpt-oss lora exporter by @cuichenx :: PR: #14589
- Replace MegatronTokenizer with MegatronLegacyTokenizer by @chtruong814 :: PR: #14721
- Update ModelCommPGs API from megatron-core by @yaoyu-33 :: PR: #14578
- feat: Compatibility modification of megatron-fsdp by @shjwudp :: PR: #14593
- imported get_moe_layer_wise_logging_tracker from megatron core moe_utils by @prathamk-tw :: PR: #14694
- Fix gpt-oss yarn_original_max_position_embeddings value by @cuichenx :: PR: #14706
- Update docs per guidance by @pablo-garay :: PR: #14841
- Fixing three mcore links by @aschilling-nv :: PR: #14839
- Documentation for gpu-based phrase boosting by @andrusenkoau :: PR: #14800
- Update gpt-oss configs by @cuichenx :: PR: #14674
- remove language_modeling by @dimapihtar :: PR: #14192
- cp: `remove ExportDeploy` into `r2.6.0` by @pablo-garay :: PR: #15053
- cherry pick #15070 by @nithinraok :: PR: #15082
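
Several entries above add or update Hugging Face importers and recipes (gpt-oss model and LoRA exporter, updated gpt-oss configs, etc.). As a reminder of the conversion entry point such importers plug into, below is a hedged sketch of the generic NeMo 2.0 `llm.import_ckpt` call; the Llama model and config names are illustrative only and are not the gpt-oss classes added in these PRs.

```python
# Hedged sketch (illustrative model/config names): converting a Hugging Face checkpoint
# into NeMo 2.0 format via the generic import_ckpt API that model importers plug into.
from nemo.collections import llm

if __name__ == "__main__":
    llm.import_ckpt(
        model=llm.LlamaModel(llm.Llama32Config1B()),
        source="hf://meta-llama/Llama-3.2-1B",  # placeholder source checkpoint
    )
```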
#### Export
Changelog

- fix: fix missing rope scaling in exporting llama embedding model by @ZhiyuLi-Nvidia :: PR: #14523
- Add gpt-oss lora exporter by @cuichenx :: PR: #14589
- Skip trt-llm and vllm install in install test by @chtruong814 :: PR: #14663
- Fix deepseek export dtype by @cuichenx :: PR: #14307
- Remove export-deploy, automodel, and eval tutorials by @chtruong814 :: PR: #14790
- cp: `remove ExportDeploy` into `r2.6.0` by @pablo-garay :: PR: #15053
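
The exporter fixes above go through the high-level conversion API in the opposite direction. A hedged sketch of exporting a NeMo 2.0 checkpoint back to Hugging Face format is shown below; the paths are placeholders and the per-model exporter behavior (e.g. for the llama embedding or gpt-oss LoRA cases) lives in the corresponding PRs.

```python
# Hedged sketch (placeholder paths): exporting a NeMo 2.0 checkpoint to Hugging Face
# format with the generic export_ckpt API that per-model exporters register into.
from pathlib import Path
from nemo.collections import llm

if __name__ == "__main__":
    llm.export_ckpt(
        path=Path("/checkpoints/my_model_nemo2"),    # NeMo 2.0 checkpoint directory
        target="hf",                                 # export to Hugging Face format
        output_path=Path("/checkpoints/my_model_hf"),
    )
```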
#### Uncategorized:
Changelog - Version bump to `2.6.0rc0.dev0` by @github-actions[bot] :: PR: #14512 - [Audio]: added conformer U-Net model for SE by @nasretdinovr :: PR: #14442 - hyena/evo2: Make sure to convert to real after fp32 conversion by @antonvnv :: PR: #14515 - Force-set restore path for student in KD mode by @AAnoosheh :: PR: #14532 - Skip PTQ if PTQ model path exists by @jenchen13 :: PR: #14536 - Support QwenVL for inference API by @meatybobby :: PR: #14534 - Hyena: Allow to use unfused RMSNorm + TELinear to restore accuracy and some speed by @antonvnv :: PR: #14542 - [Audio]: added streaming mode to SpectrogramToAudio by @nasretdinovr :: PR: #14524 - Update evo2 defaults so converted checkpoints have the right parameters by @jstjohn :: PR: #14514 - deprecate t0 scripts by @dimapihtar :: PR: #14585 - cfg typo correction by @malay-nagda :: PR: #14588 - [Perf script] Add use_te_activation_func and activation_func_fp8_input_store flags by @guyueh1 :: PR: #14522 - Modify logging message to signal that RestoreConfig will be used by @balvisio :: PR: #14469 - Bump TE and Mcore by @chtruong814 :: PR: #14568 - Avoid host-device sync in PTL logging by @WanZzzzzz :: PR: #14489 - Integrate implicit filter kernel with Hyena layer by @farhadrgh :: PR: #14621 - Fix kv_channels configuration for Gemma2 27b by @ananthsub :: PR: #14590 - [Flux] small fixes by @CarlosGomes98 :: PR: #14333 - [Flux] Add MXFP8 Support by @alpha0422 :: PR: #14473 - Use hugginface_hub for downloading the FLUX checkpoint by @suiyoubi :: PR: #14638 - Fine-tune embedding models (E5-Large-V2 and LLaMA-3.2-1B) on the allnli triplet dataset with NeMo Framework by @girihemant19 :: PR: #14584 - remove service launch scripts by @dimapihtar :: PR: #14647 - Warn instead of error when chat template doesn't contain generation keyword by @jenchen13 :: PR: #14641 - Fix function calling notebook by @cuichenx :: PR: #14643 - [Audio]: fixed bug in conformer unet by @nasretdinovr :: PR: #14626 - Fix code checkout during test by @chtruong814 :: PR: #14658 - Fix Flux seed as optional Arg by @suiyoubi :: PR: #14652 - Remove PEFT scheme condition from recipe by @JRD971000 :: PR: #14661 - Add NeMo Voice Agent by @stevehuang52 :: PR: #14325 - Update get_tensor_shapes function whose signature was refactored by @AAnoosheh :: PR: #14594 - Delete nemo1 notebooks by @cuichenx :: PR: #14677 - Bump latest Mcore 020abf01 by @chtruong814 :: PR: #14676 - [Flux] correct vae_downscale_factor by @CarlosGomes98 :: PR: #14425 - Bump modelopt to 0.35.0 and remove `safe_import("modelopt")` in llm collection by @kevalmorabia97 :: PR: #14656 - Canary tutorial fix by @nune-tadevosyan :: PR: #14699 - Add option for LoRA with Transformer Engine op fuser by @timmoon10 :: PR: #14411 - add load-in-4bit param by @dimapihtar :: PR: #14636 - Support NVFP4 recipe by @WanZzzzzz :: PR: #14625 - Fix broken link in Reasoning-SFT.ipynb by @cuichenx :: PR: #14716 - Remove artificial block to vortex fp8 TP by @jstjohn :: PR: #14684 - Drop speech_llm example suite by @yaoyu-33 :: PR: #14683 - remove env var by @malay-nagda :: PR: #14739 - detach arg option for run scripts by @malay-nagda :: PR: #14722 - Randomized shard slicing for tarred data by @pzelasko :: PR: #14558 - Data prediction objective for flow matching speech enhancement models by @racoiaws :: PR: #14749 - Fix Some Failures by @alpha0422 :: PR: #14763 - Support additional Slurm parameters (#14701) by @bdubauski :: PR: #14742 - [Flux] Remove Redundant Host & Device Sync by @alpha0422 :: PR: #14711 - [Flux] Full Iteration CUDA Graph by 
@alpha0422 :: PR: #14744 - Update prune-distill notebooks to Qwen3 + simplify + mmlu eval by @kevalmorabia97 :: PR: #14785 - ci: Automodel deprecation warning by @thomasdhc :: PR: #14787 - Bug in MXFP8 recipe by @adityavavreNVDA :: PR: #14793 - feat: Disable blank Issues by @pablo-garay :: PR: #14788 - ci: Add community label bot by @chtruong814 :: PR: #14796 - Add mistral small3 24B config and recipe by @eagle705 :: PR: #14784 - Update changelog for `r2.3.0` by @github-actions[bot] :: PR: #14812 - QWEN2.5-VL 7B FP8 Recipe by @tomlifu :: PR: #14801 - Feat: Disk space management: for nemo install test by @pablo-garay :: PR: #14822 - Evo2 address rare over-masking in 1m context dataset by @jstjohn :: PR: #14821 - Update cherry-pick workflow to use version 0.63.0 by @pablo-garay :: PR: #14832 - Removing automodel items by @aschilling-nv :: PR: #14840 - Update changelog for `v2.4.1` by @github-actions[bot] :: PR: #14828 - Fix lm_eval installation in pruning tutorial for 25.09 container by @kevalmorabia97 :: PR: #14865 - Add nemotron-nano-v2 support to voice agent by @stevehuang52 :: PR: #14704 - Update changelog for 2.5.0 by @chtruong814 :: PR: #14890 - [Qwen3] Fix the flop cal for Qwen3 by @gdengk :: PR: #14897 - [lhotse][aistore] added support input_cfg.yaml directly from aistore bucket by @XuesongYang :: PR: #14891 - Harden _is_target_allowed by adding runtime class validation on top of prefix checks to prevent unsafe target resolution by @KunalDhawan :: PR: #14540 - Enable simplified DistOpt checkpoint formats by @mikolajblaz :: PR: #14428 - Fix the load checkpointing issue -- onelogger callback gets called multiple time in some case. by @liquor233 :: PR: #14945 - Revert "new changelog-build" by @pablo-garay :: PR: #14949 - feat: new changelog-build by @pablo-garay :: PR: #14950 - Update llama4 utils kwargs by @yaoyu-33 :: PR: #14924 - Update README.md by @snowmanwwg :: PR: #14917 - Update all outdated NeMo Curator links by @sarahyurick :: PR: #14760 - Freeze tags in in `r2.6.0` by @github-actions[bot] :: PR: #14957 - cp: `Bump MCore, TE, Pytorch, and modelopt for 25.11 (14946)` into `r2.6.0` by @chtruong814 :: PR: #14976 - cp: `Update ctc-segmentation (14991)` into `r2.6.0` by @chtruong814 :: PR: #14998 - cherry-pick of #14962 by @dimapihtar :: PR: #15000 - cp: `Pass timeout when running speech functional tests (15012)` into `r2.6.0` by @chtruong814 :: PR: #15013 - cp: `check asr models (14989)` into `r2.6.0` by @chtruong814 :: PR: #15002 - cp: `Enable EP in PTQ (15015)` into `r2.6.0` by @chtruong814 :: PR: #15026 - cp: `Update numba to numba-cuda and update cuda python bindings usage (15018)` into `r2.6.0` by @chtruong814 :: PR: #15024 - cp: `Add import guards for mcore lightning module (14970)` into `r2.6.0` by @chtruong814 :: PR: #14981 - cp: `fix loading of hyb ctc rnnt bpe models when using from pretrained (15042)` into `r2.6.0` by @chtruong814 :: PR: #15045 - cp: `fix: fix update-buildcache workflow after ED remove (15051)` into `r2.6.0` by @chtruong814 :: PR: #15052 - cp: `chore: update Lightning requirements version (15004)` into `r2.6.0` by @chtruong814 :: PR: #15049 - cp: `update notebook (15093)` into `r2.6.0` by @chtruong814 :: PR: #15094 - cp: `Fix: Obsolete Attribute [SDE] (15105)` into `r2.6.0` by @chtruong814 :: PR: #15106 - cp: `Upgrade NeMo ASR tutorials from Mozilla/CommonVoice to Google/FLEURS (15103)` into `r2.6.0` by @chtruong814 :: PR: #15107 - cp: `chore: Remove Automodel module (15044)` into `r2.6.0` by @chtruong814 :: PR: #15084 - cp: `Add deprecation notice to 
modules (15050)` into `r2.6.0` by @chtruong814 :: PR: #15110
## NVIDIA Neural Modules 2.5.3

### Highlights

- This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information visit , for acknowledgement please reach out to the NVIDIA PSIRT team at
- Update nv-one-logger
- Update ctc-segmentation

### Detailed Changelogs:

#### Text Normalization / Inverse Text Normalization
Changelog

- chore: update Lightning requirement by @liquor233 :: PR: #15005
#### Uncategorized:
Changelog

- cp: `Update ctc-segmentation (14991)` into `r2.5.0` by @chtruong814 :: PR: #15020
- Bump to 2.5.3 by @chtruong814 :: PR: #15022
## NVIDIA Neural Modules 2.5.2

### Detailed Changelogs:

#### Text Normalization / Inverse Text Normalization
Changelog

- cp: `Add import guards for mcore lightning module` (#14970) into `r2.5.0` by @chtruong814 :: PR: #14982
#### Uncategorized:
Changelog

- Bump to 2.5.2 by @chtruong814 :: PR: #14983
## NVIDIA Neural Modules 2.5.1

### Highlights

- This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information visit , for acknowledgement please reach out to the NVIDIA PSIRT team at
- Adds nv-one-logger
- Adds fixes related to Megatron FSDP

### Detailed Changelogs:

#### ASR
Changelog

- Patch: r2.5.0 with onelogger changes. by @PeiyuanQi :: PR: #14811
#### TTS
Changelog

- Patch: r2.5.0 with onelogger changes. by @PeiyuanQi :: PR: #14811
#### NLP / NMT
Changelog

- Patch: r2.5.0 with onelogger changes. by @PeiyuanQi :: PR: #14811
- Megatron FSDP r2.5.0 cherry-pick by @BoxiangW :: PR: #14922
#### Uncategorized:
Changelog

- Bump to 2.5.1 by @chtruong814 :: PR: #14898
- Cherry pick `Feat: Disk space management: for nemo install test (14822)` into `r2.5.0` by @chtruong814 :: PR: #14937
- cp: `Fix the load checkpointing issue -- onelogger callback gets called multiple time in some case. (14945)` into `r2.5.0` by @chtruong814 :: PR: #14948
## NVIDIA Neural Modules 2.5.0

### Highlights

- Collections:
  - LLM
    - Nano v2 12B and 9B
  - Speech
    - New SpeechLM2 collection
    - Streaming Sortformer model
    - Deprecate Confidence Ensemble models
    - parakeet-tdt-0.6b-v3 and canary-1b-v2 models
    - Added chunk inference support with .transcribe() for canary based models
    - Enable prediction of timestamps with streaming ASR
    - Improve ASR models' invariance to padding/batch size
    - Qwen prompt format support, SALM generation fixes
    - High-level SALM model.generate API closely resembling HF models
    - SALM model initialization with time/memory optimization
    - SpeechLM2: fixed excessive padding, support on-the-fly resampling for SALM
- Automodel and Export-Deploy functionality is available in their respective standalone repositories and is deprecated in NeMo 2.

### Detailed Changelogs:

#### ASR
Changelog

- Modernize logger interface by @emmanuel-ferdman :: PR: #13783
- Higher-level API for SALM.generate by @pzelasko :: PR: #14034
- add/refactor docs for asr lm customization by @lilithgrigoryan :: PR: #14088
- Improve NEST GPU Utilization 1/N by @MahmoudAshraf97 :: PR: #14086
- Improve ASR models' invariance to padding/batch size by @pzelasko :: PR: #13827
- Clean up transducer decoding initialization by @artbataev :: PR: #14112
- Improve NEST GPU Utilization 2/N by @MahmoudAshraf97 :: PR: #14089
- GPU-accelerated Phrase-Boosting (GPU-PB) for AED decoding by @andrusenkoau :: PR: #14108
- Fix decoding with ngpu-lm when training (#13994) by @hoangtran9122 :: PR: #13995
- fix eval_beamsearch_ngram_ctc script by @lilithgrigoryan :: PR: #14238
- fix wrong typing for ctc-ws context graph by @andrusenkoau :: PR: #14262
- fix frame vad by @stevehuang52 :: PR: #14337
- Improve NEST GPU Utilization 3/N by @MahmoudAshraf97 :: PR: #14234
- remove confidence ensemble models by @lilithgrigoryan :: PR: #14343
- Fix ASR decoding issues with CUDA graphs in training by @artbataev :: PR: #14184
- Streaming Sortformer release PR01: uploading bugfixes, refactored variables and yaml file name changes by @tango4j :: PR: #14416
- Streaming Sortformer release PR02: unit tests for streaming models and modules by @tango4j :: PR: #14417
- GPU-accelerated Phrase-Boosting (GPU-PB) for CTC, RNN-T, and TDT decoding by @andrusenkoau :: PR: #14277
- Fix subsampling chunking test by @monica-sekoyan :: PR: #14452
- Canary2 with NFA by @monica-sekoyan :: PR: #14121
- Initial Chunking by @nune-tadevosyan :: PR: #14321
- Chunking fix by @nune-tadevosyan :: PR: #14482
- Tutorial and doc update by @nune-tadevosyan :: PR: #14484
- Streaming Sortformer release PR03: NeMo documentations and tutorial notebook by @tango4j :: PR: #14388
- Add wget_from_nemo by @nune-tadevosyan :: PR: #14623
- Downgrade "datasets" library version in ASR tutorial to ensure compatibility with HF Datasets used by @KunalDhawan :: PR: #14685
- Canary tutorial fix by @nune-tadevosyan :: PR: #14708
- Force activations and weights cast to FP32 Jasper Encoder Squeeze-Excite by @erastorgueva-nv :: PR: #14715
#### TTS
Changelog

- Improve ASR models' invariance to padding/batch size by @pzelasko :: PR: #13827
- remove nlp modules by @dimapihtar :: PR: #14127
- Temporarily Remove Encoder PP Support by @yaoyu-33 :: PR: #14167
- Remove T5-TTS by @blisc :: PR: #14252
#### NLP / NMT
Changelog

- add extra params for MegatronDataSampler by @dimapihtar :: PR: #13956
- Modernize logger interface by @emmanuel-ferdman :: PR: #13783
- remove dialogue collection by @dimapihtar :: PR: #14087
- remove QA collection by @dimapihtar :: PR: #14092
- remove text nlp collection by @dimapihtar :: PR: #14110
- remove nlp modules by @dimapihtar :: PR: #14127
- remove rag collection by @dimapihtar :: PR: #14157
- remove nmt collection by @dimapihtar :: PR: #14191
- Fix importerror in transformer_lm_model after nlp module removals by @chtruong814 :: PR: #14199
- fix QA comments NVBug by @huvunvidia :: PR: #14196
- Temporarily Remove Encoder PP Support by @yaoyu-33 :: PR: #14167
- remove mixins collections by @dimapihtar :: PR: #14281
- feat: print expert groups on megatron init by @clumsy :: PR: #13874
- [speechlm2] [lhotse] sharegpt data and testloader by @huckiyang :: PR: #14294
- Add notebook for LoRA on GPT-OSS-20B by @shashank3959 :: PR: #14439
- Sketch dist-ckpt content versioning by @mikolajblaz :: PR: #13839
- Change to enable full iteration CUDA graph for LLMs by @vasunvidia :: PR: #14077
#### Text Normalization / Inverse Text Normalization
Changelog

- Check lightning and core imports in install test by @chtruong814 :: PR: #14403
#### Export
Changelog

- ci: Set L2_NeMo_2_Export_Deploy_Query_In_Framework to be optional by @chtruong814 :: PR: #13946
- Remove old export doc by @oyilmaz-nvidia :: PR: #14292
- Llama4 Export: Remove outdated MLP weight transform by @suiyoubi :: PR: #14297
- Update mllama hf import/export for transformers 4.53 by @meatybobby :: PR: #14327
#### Bugfixes
Changelog

- Bugfix for Hyena to the get_t function which comes up when doing longer context inference by @jstjohn :: PR: #14256
- fix skipped cuHyena kernel while training by @farhadrgh :: PR: #14365
- Remove flaky Evo2 dataset performance test by @jstjohn :: PR: #14371
- Use module prefix in restore_modelopt_state by @jenchen13 :: PR: #14384
#### Uncategorized:
Changelog - Version bump to `2.5.0rc0.dev0` by @github-actions[bot] :: PR: #13944 - [Llama4] Enable tp comm overlap for llama4 by @gdengk :: PR: #13940 - Fix for Squad Dataset Download by @rhmukundan :: PR: #13893 - add nmh HF conversion by @JRD971000 :: PR: #13941 - Speechlm2 SALM improvements by @pzelasko :: PR: #13829 - fix dataset issue by @dimapihtar :: PR: #13953 - Editing MMLU to pull from the correct repo by @ruchaa-apte :: PR: #13991 - move classes to module to use __target__ feature (#14023) by @nithinraok :: PR: #14031 - Add Nemotron-H prompt format, fix cut-to-conversation custom attr propagation by @pzelasko :: PR: #13963 - Bump release_library template to v0.40.0 by @chtruong814 :: PR: #14046 - [automodel] add support for layer-freezing by @akoumpa :: PR: #14000 - [Qwen3] Recipe config bug fix by @gdengk :: PR: #14084 - Add TE import guard in qwen2vl vision module by @chtruong814 :: PR: #14091 - Update bitsandbytes dependency to v0.46.0 by @pramodk :: PR: #14050 - Update FSDP2 docstring by @BoxiangW :: PR: #14105 - Interface to enable fsdp-double-buffer without enabling NCCL-UB by @youngeunkwon0405 :: PR: #14076 - SpeechLM2 SALM: load ckpt faster, with less GPU memory by @pzelasko :: PR: #14113 - Add object_storage_cache_path to PreTrainingDataModule by @shunjiad :: PR: #14103 - Update changelog for `r2.3.0` by @github-actions[bot] :: PR: #14160 - Fix FLUX test with correct env var by @suiyoubi :: PR: #14149 - add mmap_bin_files param by @dimapihtar :: PR: #14122 - Add option to suppress import checks in `Dockerfile.speech` by @artbataev :: PR: #14185 - Safely import optional python packages by @roclark :: PR: #13936 - Set flux test as optional by @chtruong814 :: PR: #14190 - Revert "Safely import optional python packages (#13936)" by @chtruong814 :: PR: #14197 - Fix "Safely import optional python packages (#13936)" by @chtruong814 :: PR: #14198 - Add fix for evo2 generate/inference by @jwilber :: PR: #14027 - Fixing file path suffix by @gautham-kollu :: PR: #14179 - Update AVLM finetune example for vanilla fine-tuning by @huvunvidia :: PR: #14232 - [finetune] Add dataset_kwargs to prepare packed sequence data by @jiajunly :: PR: #14169 - Allow exception in hf ckpt load attempt before fallback to standard l… by @trvachov :: PR: #14214 - Load master weights from checkpoint by @kunlunl :: PR: #14072 - Add deploy lora adapter portion by @ruchaa-apte :: PR: #14255 - fix speechlm lhotse loading nemo_tarred by @stevehuang52 :: PR: #14314 - Update changelog for `r2.4.0` by @github-actions[bot] :: PR: #14334 - Flaky test timing out: @pytest.mark.pleasefixme by @pablo-garay :: PR: #14351 - Support dump perf recipe diff from base recipe by @guyueh1 :: PR: #14206 - Bugfix degenerate bases evo2 dataset by @jstjohn :: PR: #14359 - Hyena support for flash decode API by @jstjohn :: PR: #14315 - Fix Gemma2/3 & Llava (Next) & Llama4 conversion issue with latest transformers by @suiyoubi :: PR: #14367 - fix: reduce the excessive test time of test_msdd_diar_inference by @tango4j :: PR: #14366 - SpeechLM2: S2S->S2T data reader, excessive padding fixes by @pzelasko :: PR: #14124 - chore: Release 2.5.0rc0 by @ko3n1g :: PR: #14389 - Add pyxis flag for container writable. 
by @sudostock :: PR: #14395 - [MoE] Partial Cudagraph support for MoE by @gdengk :: PR: #14362 - Revert "[MoE] Partial Cudagraph support for MoE (#14362)" by @chtruong814 :: PR: #14402 - Update AVLM recipes for NeMo-CI runs by @huvunvidia :: PR: #14397 - Remove nemo1 multimodal and vision by @yaoyu-33 :: PR: #14095 - Fix LazyNeMoIterator supervision for multi-channel cuts by @anteju :: PR: #14409 - Bump Mcore to 7f7439f by @chtruong814 :: PR: #14373 - Use cuhyena rearrange when available. by @moradza :: PR: #14383 - Fix model training/eval state after PTL validation loop by @paul-gibbons :: PR: #14152 - Add deprecation notice to eval code by @athitten :: PR: #14316 - Streaming Sortformer release PR04: Adding functional tests for streaming sortformer by @tango4j :: PR: #14435 - QWEN2.5-VL 7B Performance Recipe by @tomlifu :: PR: #14401 - Discount FLOPs in dot-product att by @erhoo82 :: PR: #14424 - Bump to pytorch 25.06 and newer TE commit by @chtruong814 :: PR: #14423 - Enable precision aware optimizer for dsv3 by @guyueh1 :: PR: #14444 - Make VBoost activation conditional by @bdubauski :: PR: #14458 - cuHyena FFTConv support for Hyena Long Implicit (LI) Layer by @farhadrgh :: PR: #14396 - Alit/nano v2 by @JRD971000 :: PR: #14464 - Fix reuse_grad_buf_for_mxfp8_param_ag for mxfp8 by @guyueh1 :: PR: #14445 - Fix loss mask for chat datasets by @cuichenx :: PR: #14369 - Rename to subquadratic_ops by @farhadrgh :: PR: #14486 - Allows using other signals (than SIGTERM) with PreemptionPlugin by @zachmoshe :: PR: #14248 - Qwen2.5-VL 32B Performance Recipe by @tomlifu :: PR: #14485 - Alit/nanov2 12b by @JRD971000 :: PR: #14483 - Freeze tags in in `r2.5.0` by @github-actions[bot] :: PR: #14513 - deprecate t0 by @dimapihtar :: PR: #14599 - Cherry pick `Use hugginface_hub for downloading the FLUX checkpoint (14638)` into `r2.5.0` by @chtruong814 :: PR: #14640 - Cherry pick `Fix function calling notebook (14643)` into `r2.5.0` by @chtruong814 :: PR: #14650 - Cherry pick `remove service launch scripts (14647)` into `r2.5.0` by @chtruong814 :: PR: #14648 - Cherry pick `Delete tutorials/llm/llama/biomedical-qa directory (14653)` into `r2.5.0` by @chtruong814 :: PR: #14654 - Cherry pick `Remove PEFT scheme condition from recipe (14661)` into `r2.5.0` by @chtruong814 :: PR: #14662 - Cherry pick `fixing kernel restarting when transcribing (14665)` into `r2.5.0` by @chtruong814 :: PR: #14672 - Delete nemo 1 notebooks by @cuichenx :: PR: #14675 - Cherry pick `Fixing Sortformer training tutorial notebook (14680)` into `r2.5.0` by @chtruong814 :: PR: #14681 - Cherry-pick `Update get_tensor_shapes function whose signature was refactored` (14594) into `r2.5.0` by @chtruong814 :: PR: #14678 - Cherry pick `Skip trt-llm and vllm install in install test (14663)` into `r2.5.0` by @chtruong814 :: PR: #14697 - Cherry pick `Fix for \EncDecRNNTBPEModel transcribe() failed with TypeError\ (14698)` into `r2.5.0` by @chtruong814 :: PR: #14709 - Cherry pick `Fix broken link in Reasoning-SFT.ipynb (14716)` into `r2.5.0` by @chtruong814 :: PR: #14717 - cherry-pick add load-in-4bit param (14636) into r2.5.0 by @dimapihtar :: PR: #14719 - Cherry pick `Fix deepseek export dtype (14307)` into `r2.5.0` by @chtruong814 :: PR: #14682 - Cherry pick `remove env var (14739)` into `r2.5.0` by @chtruong814 :: PR: #14746 - Cherry-pick 'Bump modelopt to 0.35.0 and remove `safe_import("modelopt")` in llm collection (#14656)' into 'r2.5.0' by @chtruong814 :: PR: #14771 - Cherry pick `Update prune-distill notebooks to Qwen3 + simplify + mmlu 
eval (14785)` into `r2.5.0` by @chtruong814 :: PR: #14789 - Cherry pick `Remove export-deploy, automodel, and eval tutorials (14790)` into `r2.5.0` by @chtruong814 :: PR: #14792 - Cherry pick `ci: Automodel deprecation warning (14787)` into `r2.5.0` by @chtruong814 :: PR: #14791
## NVIDIA Neural Modules 2.4.1

### Detailed Changelogs:

#### Uncategorized:
Changelog

- Update package_info.py by @ko3n1g :: PR: #14400
- Patch to address issue 14392 by @youngeunkwon0405 :: PR: #14398
- Cherry pick `Fix callbacks in DSV3 script (14350)` into `r2.4.0` by @chtruong814 :: PR: #14370
- Cherry pick `Change Llama Embedding Tutorial to use SFT by default (14231)` into `r2.4.0` by @chtruong814 :: PR: #14303
- Cherrypick `calculate_per_token_loss requirement for context parallel` (#14065) (#14282) into `r2.4.0` by @chtruong814 :: PR: #14448
- Pin nvidia-lm-eval to 25.6.1 by @chtruong814 :: PR: #14470
## NVIDIA Neural Modules 2.3.3

- This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information visit , for acknowledgement please reach out to the NVIDIA PSIRT team at
- Pin nvidia-lm-eval to 25.5

## NVIDIA Neural Modules 2.4.0

### Highlights

- Collections:
  - Speech
    - Batched beam search for transducers (RNN-T and TDT)
    - RNNT/TDT buffered/streaming inference + batched decoding support in cache-aware
    - add support for CTC batched beam search with GPU-LM
    - Key fixes
      - Punctuation Marks in Timestamps
      - Fix timestamps when cuda graphs enabled
      - Fix masking of \ tokens in AED inference
      - TDT streaming inference fix
  - LLM
    - Qwen 3 235B-A22B Perf Optimized
    - DeepSeek V3 Perf Optimized
    - Gemma3 support from Google
    - Embedding and Reranker models
  - MM
    - Llama 4
    - AVLM
- Training performance (speed)
  - NVL sharp + IB sharp for DP/FSDP-communications on H100 and B200
  - MXFP8 with TP communication overlap
  - MXFP8 with reduced memory allocation
  - FP8 sub-channel recipe (128x128 for weight and 1x128 for activation)
  - cudnn fused attention for MLA (both Hopper and Blackwell)
  - Advanced custom asymmetric pipelining (for MTP, loss func, and embd)
  - BF16 optimizer for model memory saving
  - CUDA graph fix for fine-tuning benchmarks
  - CUDA graph support for LLAMA4

### Detailed Changelogs

#### ASR
Changelog

- ci: Fix ASR container by @ko3n1g :: PR: #13288
- Set L2_Segmentation_Tool_Parallel_ctc_segmentation test to be optional by @chtruong814 :: PR: #13296
- Revert "WebDataset URL refactoring" by @ko3n1g :: PR: #13421
- Update flagged docs links by @erastorgueva-nv :: PR: #13391
- [Docs] Fix incorrectly formatted reference tags by @erastorgueva-nv :: PR: #13445
- Update CP by @pablo-garay :: PR: #13532
- Tdt buffered inference fix by @hainan-xv :: PR: #13500
- Fix transcribe when nbest hypotheses are returned by @lilithgrigoryan :: PR: #13540
- Set ASR test to be optional by @chtruong814 :: PR: #13633
- Enabling chunked inference for AED models in asr_evaluator by @melllinia :: PR: #13674
- Ko3n1g/chore/asr only by @ko3n1g :: PR: #13704
- decompressing joblib file before checking it by @Ssofja :: PR: #13732
- Revert "decompressing joblib file before checking it (#13732)" by @chtruong814 :: PR: #13791
- Punctuation Marks in Timestamps by @monica-sekoyan :: PR: #13353
- AIStore with Webdataset by @monica-sekoyan :: PR: #13604
- Update to add default for dataclass variables by @nithinraok :: PR: #13814
- This PR addresses to known security issues by @Ssofja :: PR: #13804
- remove model_stride var by @nithinraok :: PR: #13867
- add CTC batched beam search by @lilithgrigoryan :: PR: #13337
- Clean up streaming ASR script and tests by @artbataev :: PR: #13894
- add NGPU-LM fusion during CTC greedy by @lilithgrigoryan :: PR: #13917
#### TTS
Changelog

- Revert "WebDataset URL refactoring" by @ko3n1g :: PR: #13421
- Update flagged docs links by @erastorgueva-nv :: PR: #13391
- [Docs] Fix incorrectly formatted reference tags by @erastorgueva-nv :: PR: #13445
- Update CP by @pablo-garay :: PR: #13532
- fix: vpp stage refactoring to match mcore by @ZhiyuLi-Nvidia :: PR: #13673
- AIStore with Webdataset by @monica-sekoyan :: PR: #13604
#### NLP / NMT
Changelog

- Migrate Hyena to Megatron inference_context. by @cspades :: PR: #13436
- Update CP by @pablo-garay :: PR: #13532
- fix broken links by @dimapihtar :: PR: #13544
- Add nlp import checks by @thomasdhc :: PR: #13563
- PTQ model support, quant_cfg, and documentation updates by @janekl :: PR: #13519
- feat - GPTSFTChatDataset alignment with OpenAI Messages, compatibility with packed sequences by @soluwalana :: PR: #13367
- fix: vpp stage refactoring to match mcore by @ZhiyuLi-Nvidia :: PR: #13673
- Fix resume with MegatronPretrainingBatchSampler by @ashors1 :: PR: #13565
- Punctuation Marks in Timestamps by @monica-sekoyan :: PR: #13353
- Revert `Adding more doc-strings to megatron_parallel.py #12767` by @ko3n1g :: PR: #13824
- reasoning model evaluation mmlu gpqa by @ruchaa-apte :: PR: #13880
- Remove unused DynamicRetrievalServer and Bert dataset loader classes by @dimapihtar :: PR: #14209
- Huvu/avlm qafix cherrypick from by @huvunvidia :: PR: #14253
#### Export
Changelog

- Improve Nemo2Exporter for Models Using Custom Modelling Files on HF by @suiyoubi :: PR: #13400
- Adding more export tests by @oyilmaz-nvidia :: PR: #13410
- Add Warning to Export when output_path exists by @suiyoubi :: PR: #13465
- Move libsox-fmt-all from Dockerfile.ci.export_deploy to Dockerfile.ci by @chtruong814 :: PR: #13452
- ci: Remove trt-llm breakpoint by @ko3n1g :: PR: #13499
- Add Qwen2VL export_ckpt by @AtsunoriFujita :: PR: #13398
- Add MLlama export_ckpt by @AtsunoriFujita :: PR: #13346
- Update vLLMExporter to use vLLM V1 by @janekl :: PR: #13498
- Add vLLM Mixtral and TRT-LLM qnemo export tests (plus a couple of bugfixes) by @janekl :: PR: #13697
- Fix Qwen3 export + misc by @cuichenx :: PR: #13679
- Extra int cast for successful tracing during ONNX export by @janekl :: PR: #13782
- FP8 lora export by @cuichenx :: PR: #13748
- Add PEFT export check by @cuichenx :: PR: #13835
- Update llm api import_ckpt/export_ckpt docstring by @meatybobby :: PR: #13714
- Use modelopt export and disable dataset calibration for weight only PTQ by @jenchen13 :: PR: #13756
#### Bugfixes
Changelog

- [automodel] move liger kernel patching by @akoumpa :: PR: #13579
#### Uncategorized
Changelog - build: various bumps by @ko3n1g :: PR: #13285 - ci: Fixes to selective triggering by @ko3n1g :: PR: #13287 - ci: Set timeout by @ko3n1g :: PR: #13294 - Set L2_NeMo_2_T5_Pretraining test as optional by @chtruong814 :: PR: #13282 - Add test environment approval step for CI by @chtruong814 :: PR: #13297 - update num nodes in deepseek v3 finetune recipe by @cuichenx :: PR: #13314 - ci: Increase cache pool by @ko3n1g :: PR: #13306 - Rename adam_with_cosine_annealing as adam since cosin LR is not setup by @ShriyaRishab :: PR: #13315 - ci: Update test queue bot to not assume a workflow is launched from a PR by @chtruong814 :: PR: #13318 - Fix TE pytorch attention doc link by @thomasdhc :: PR: #13327 - ci: Add all recent buildcaches to update-buildcache job by @ko3n1g :: PR: #13289 - Fix neva notebook by @yaoyu-33 :: PR: #13334 - Fix transformer offline for CI/CD llama4 tests by @yaoyu-33 :: PR: #13339 - [automodel] convert lm head to full tensor before passing to lce by @yuanzhedong :: PR: #13319 - ci: No dups in queue by @ko3n1g :: PR: #13352 - ci(hotfix): VLM CPU unit tests by @ko3n1g :: PR: #13348 - vLLM==0.8.5 update by @janekl :: PR: #13350 - ci: Allow bypassing approval by @ko3n1g :: PR: #13365 - Avoid the need to specify optional attributes for lhotse/nemo reader functions by @pzelasko :: PR: #13307 - ci: Fix selective-triggering for non-PR events by @ko3n1g :: PR: #13374 - ci: Revert `no-concurrency-group-on-main` by @ko3n1g :: PR: #13375 - ci: Improve no-fail-fast mechanism by @ko3n1g :: PR: #13370 - 2d buckets estimation fix by @monica-sekoyan :: PR: #13377 - ci: Fix scheduled runs by @ko3n1g :: PR: #13378 - Ko3n1g/ci/fix nightly runs by @ko3n1g :: PR: #13382 - [automodel] fix none issue in dataset for qwen model by @yuanzhedong :: PR: #13311 - update table by @akoumpa :: PR: #13397 - Improve test coverage for audio modules by @anteju :: PR: #13333 - Disable failing maxine loss test by @anteju :: PR: #13361 - Ko3n1g/ci/no notification on cancel by @ko3n1g :: PR: #13403 - document fp8_recipe by @akoumpa :: PR: #13405 - Weekly bump main by @ko3n1g :: PR: #13408 - Handle boolean args for performance scripts and log received config by @guyueh1 :: PR: #13291 - [automodel] add FirstRankPerNode by @akoumpa :: PR: #13373 - tests: Disable flaky audio test by @ko3n1g :: PR: #13429 - ci: Disable flaky audio test by @ko3n1g :: PR: #13435 - Fix loss compute and reduction by @xrennvidia :: PR: #13295 - ci: Skip link check on github links by @chtruong814 :: PR: #13425 - Add NCCL cfg interface to perf scripts by @erhoo82 :: PR: #13407 - ci: Success only if `Run CICD` label attached by @ko3n1g :: PR: #13430 - ci: Add tests to selective triggering by @ko3n1g :: PR: #13404 - ci: Remove jq by @ko3n1g :: PR: #13440 - ci: Fix deps tree for tests by @ko3n1g :: PR: #13443 - Ko3n1g/ci/fix dependency tree by @ko3n1g :: PR: #13448 - Adding additional unit tests for the deploy module by @pthombre :: PR: #13411 - [Audio] fix a flaky test (and also make some tests run faster) by @racoiaws :: PR: #13439 - [automodel] ignore tail padding in TPS calculation by @akoumpa :: PR: #13329 - Ko3n1g/ci/selective triggering 3 by @ko3n1g :: PR: #13460 - ci: Disable broken neva tests by @ko3n1g :: PR: #13461 - fix speechlm data module by @stevehuang52 :: PR: #13362 - ci: Enter queue only with passing linting by @ko3n1g :: PR: #13462 - Adding tests for Schroedinger Bridge model by @nasretdinovr :: PR: #13401 - add more detailed description by @dimapihtar :: PR: #13464 - [Audio] tests for score-based and flow matching 
enhancement models by @racoiaws :: PR: #13406 - Use expandable cuda memory segmentation by @erhoo82 :: PR: #13418 - Fix llava tokenizer caused nan issue by @yaoyu-33 :: PR: #13466 - Remove cuda method from ModelPT by @erastorgueva-nv :: PR: #13394 - Fix BNR 2 unit test + input, case where input length was not specified by @nitin9252 :: PR: #13467 - ci: Do not run any tests if no match is found by @ko3n1g :: PR: #13479 - Ko3n1g/ci/selective triggering 4 by @ko3n1g :: PR: #13489 - Fix typo in the performance script by @youngeunkwon0405 :: PR: #13487 - ci: No runs on main by @ko3n1g :: PR: #13490 - ci: Upload on schedule by @ko3n1g :: PR: #13491 - ci: Run selective triggering on dockerfiles and dependencies by @ko3n1g :: PR: #13493 - [automodel] fallback FP8 + LCE -> FP8 + CE by @akoumpa :: PR: #13349 - Update changelog for `r2.3.0` by @github-actions[bot] :: PR: #13501 - Update 2.3.0 changelog by @chtruong814 :: PR: #13504 - Enabling flash decode for float16 precision only by @pthombre :: PR: #13471 - Fix changelog formatting by @chtruong814 :: PR: #13505 - Updating the long context performance number for B200 by @youngeunkwon0405 :: PR: #13468 - ci: Add more files to filter by @ko3n1g :: PR: #13517 - Improve error message when HF checkpoint cannot be loaded by @ashors1 :: PR: #13513 - Add Resume_path to llama_nemotron models by @suiyoubi :: PR: #13515 - Add Llama4 GHA by @suiyoubi :: PR: #13442 - add memory profile interface to perf scripts by @erhoo82 :: PR: #13413 - Add fp8_param argument back to mixed precision plugin for backward compatibility by @guyueh1 :: PR: #13522 - [automodel] add find_unused_parameters=True for DDP by @akoumpa :: PR: #13366 - ci: Update success message by @ko3n1g :: PR: #13541 - [Audio] TransformerUNet: predictive model support added by @nasretdinovr :: PR: #13470 - Test Hyena mixer CP equivalency by @farhadrgh :: PR: #13330 - use null tokenizer by @malay-nagda :: PR: #13480 - ci: Remove optional marker by @ko3n1g :: PR: #13469 - Update extra_requires and requirements by @thomasdhc :: PR: #13359 - Fix default config for LlamaNemotron Ultra by @suiyoubi :: PR: #13542 - [audio] Improve test coverage for audio losses by @anteju :: PR: #13309 - deepseek finetuning callback error change by @SDcodehub :: PR: #13483 - ci(fix): Add `__init__` to selective-triggering by @ko3n1g :: PR: #13577 - nsys profile filename ranks info by @malay-nagda :: PR: #13576 - chore: Update setup.py by @ko3n1g :: PR: #13566 - Fix Llama importer by @suiyoubi :: PR: #13583 - [automodel] fix --mbs/gbs dtype and chat-template by @akoumpa :: PR: #13602 - Reconfigure 'limit__batches' by @maanug-nv :: PR: #13523 - ci: Optional speech tests by @ko3n1g :: PR: #13606 - [Automodel] Fix CP device_mesh issue, use PTL distsampler by @BoxiangW :: PR: #13473 - [automodel] fix log message by @akoumpa :: PR: #13612 - Tests for evaluation with NVIDIA Evals Factory by @chtruong814 :: PR: #13627 - Fix ptl import in notebooks by @maanug-nv :: PR: #13608 - [automodel] dist.abort -> dist.destroy_process_group by @akoumpa :: PR: #13578 - Skip eval unit test by @chtruong814 :: PR: #13635 - Fix image_processor config in Energon path by @AtsunoriFujita :: PR: #13618 - Add Gemma3 VL model by @xiangxu-google :: PR: #13536 - Set L2_NeMo_2_EVAL as optional by @chtruong814 :: PR: #13644 - Update install to use pip install by @thomasdhc :: PR: #13605 - Multi node settings for evaluation nemo-run script by @athitten :: PR: #13568 - [Llama4] Fix the missing args in the recipe by @gdengk :: PR: #13649 - Bump nvidia-modelopt to 
0.29.0 by @AAnoosheh :: PR: #13599 - Update README.md for 25.04 release by @snowmanwwg :: PR: #13654 - [automodel] consolidate sft peft scripts by @akoumpa :: PR: #13634 - Qwen3 by @cuichenx :: PR: #13554 - Set env variables for eval tests by @marta-sd :: PR: #13658 - build: multimodal-only by @ko3n1g :: PR: #13665 - [Audio] TransformerUNet: predictive model tests added by @nasretdinovr :: PR: #13648 - [automodel] consolidate vllm scripts by @akoumpa :: PR: #13670 - build: Pin transformers by @ko3n1g :: PR: #13675 - ci: Enable codecov checks by @ko3n1g :: PR: #13497 - ci: Add `init-file-checker` by @ko3n1g :: PR: #13684 - Add use_sharp and use user buffer registration args in perf scripts by @youngeunkwon0405 :: PR: #13521 - Remove is-optional marker for L2_NeMo_2_EVAL by @marta-sd :: PR: #13669 - gpu type and #devices CLI args by @malay-nagda :: PR: #13620 - perf scripts updates by @malay-nagda :: PR: #13456 - Use audio codec without discriminators in SpeechLM2 tests by @pzelasko :: PR: #13711 - Update changelog for `r2.3.1` by @github-actions[bot] :: PR: #13719 - Recipe default value fix for Llama4 by @suiyoubi :: PR: #13696 - build: Lift numba by @ko3n1g :: PR: #13735 - New key override for timestamps by @melllinia :: PR: #13743 - Fixed Mllama Energon config by @AtsunoriFujita :: PR: #13574 - Update convert_to_tarred_audio_dataset.py by @ssh-meister :: PR: #13755 - Enable dropout recompute in LoRA by @michal2409 :: PR: #13745 - Address VDR feedback for NeMo FW evaluations by @athitten :: PR: #13701 - remove blocks unused to increase coverage by @romanbrickie :: PR: #13511 - Fix Flux Recipe for FSDP/DDP by @suiyoubi :: PR: #13715 - Try soften protobuf version requirement by @pablo-garay :: PR: #13747 - Flux FP8 recipe by @Victor49152 :: PR: #13584 - Gemma3 Fix and Tests by @suiyoubi :: PR: #13661 - Disable local gradient checker in performance scripts by @erhoo82 :: PR: #13768 - [Audio] Tests: training for mask, pred and SB models by @nasretdinovr :: PR: #13736 - Refactor MSC integration in exp manager by @shunjiad :: PR: #13626 - [fix] vpp error in Gemma3 by @ZhiyuLi-Nvidia :: PR: #13784 - ci: Ensure approval queue fetches all CICD workflows using pagnation by @chtruong814 :: PR: #13798 - ci: make_request in approval test queue appends next url for status checks only by @chtruong814 :: PR: #13802 - Remove guard for masking tests and improve coverage by @anteju :: PR: #13787 - fix: After mcore bump by @ko3n1g :: PR: #13781 - Fix Gemma3VL training bugs by @sharanmayank :: PR: #13766 - [NeMo 2.0] Remove the restriction of load_model_state_dict for cfsdp by @shjwudp :: PR: #13512 - Add option to construct Llama model with Transformer Engine op fuser by @timmoon10 :: PR: #13776 - [Evaluation] Add support for simple-evals and tasks that require logprobs by @marta-sd :: PR: #13647 - remove stale section by @akoumpa :: PR: #13759 - fix moe_router_pre_softmax for Mixtral by @akoumpa :: PR: #13678 - fix: improve sequence length handling to fix nan in loss when turning on cudagraph by @katec846 :: PR: #13779 - Gemma3 Energon Dataset by @suiyoubi :: PR: #13813 - Rectify BLEU evaluation by @ankitapasad :: PR: #13762 - ci: Moved workflows by @ko3n1g :: PR: #13828 - ci: Moved templates by @ko3n1g :: PR: #13830 - [Build] Bump bitsandbytes dependency to 0.45.5 (ubuntu 22.04 compatibility) by @pramodk :: PR: #13789 - update for `PYTORCH_CUDA_ALLOC_CONF` env var by @malay-nagda :: PR: #13837 - [Llama4] Enable VLM Dec cudagraph by @gdengk :: PR: #13767 - Support MSC URL in LLM checkpointing by @shunjiad :: 
PR: #13805 - additional metrics by @dimapihtar :: PR: #13754 - Expand modelopt version range by @chtruong814 :: PR: #13850 - Alit/nmh4b by @JRD971000 :: PR: #13481 - [Tutorial] Train your own reasoning model in 48 hours on a single GPU by @Maghoumi :: PR: #13853 - Enabled C2C-PCie bridge through NCCL by @sanandaraj5597 :: PR: #13621 - Added safe loading of models by @nithinraok :: PR: #13607 - Add NemotronH Performance Script by @guyueh1 :: PR: #13528 - Hyena SE/MR B2B Kernel integration by @farhadrgh :: PR: #13518 - chore: Destroy buildcache by @ko3n1g :: PR: #13869 - tests: Fix Qwen test by @ko3n1g :: PR: #13888 - fix: improve error handling in `is_multistorageclient_url` by @shunjiad :: PR: #13885 - feat(eval): adds benchmark adapters that allow specisal reasoning models by @agronskiy :: PR: #13709 - perf scripts 25.07 refactor by @malay-nagda :: PR: #13875 - Fix E5 and LlamaEmbedding Conversion by @suiyoubi :: PR: #13890 - Bug fix for NCCL vars by @sanandaraj5597 :: PR: #13908 - Reranker Model Support by @suiyoubi :: PR: #13876 - numa cmd in bash by @malay-nagda :: PR: #13914 - Fix BERT issue with PP by @suiyoubi :: PR: #13916 - [Llama4] Fix Vp_stage to enable VP for VLM llama4 by @gdengk :: PR: #13873 - Enable NVTX profiling in MCore by @minitu :: PR: #13820 - [Qwen3-MoE] Add Qwen3 MoE perf recipe for 30b and 235b by @gdengk :: PR: #13895 - lazy import bnbconfig by @akoumpa :: PR: #13919 - Set TRANSFORMERS_OFFLINE=1 and HF_HUB_OFFLINE=1 in CI tests by @chtruong814 :: PR: #13932 - [peft] align adapter output shape with wrapped module output shape by @guyueh1 :: PR: #13922 - [automodel] move only lora adapters to cpu by @akoumpa :: PR: #13931 - Fix vp_stage not found when fsdp by @gautham-kollu :: PR: #13817 - Fix single optional import if ModelOpt not installed by @AAnoosheh :: PR: #13923 - Revert "Set TRANSFORMERS_OFFLINE=1 and HF_HUB_OFFLINE=1 in CI tests by @chtruong814 :: PR: #13938 - Enable LoRA for TELinear layers by @cuichenx :: PR: #13929 - Freeze tags in in `r2.4.0` by @github-actions[bot] :: PR: #13945 - Cherry pick `Use jiwer less than 4.0.0 (13997)` into `r2.4.0` by @ko3n1g :: PR: #13998 - Cherry pick `Remove container license reference (14010)` into `r2.4.0` by @ko3n1g :: PR: #14017 - move classes to module to use __target__ feature by @nithinraok :: PR: #14023 - Cherry pick `bf16 grads for bf16 jobs (14016)` into `r2.4.0` by @ko3n1g :: PR: #14020 - Cherry pick `Remove nemo1 stable diffusion test (14018)` into `r2.4.0` by @ko3n1g :: PR: #14019 - Version bump to `2.4.0rc1.dev0` by @github-actions[bot] :: PR: #14047 - Cherry pick `Fix Loading Custom Quantization Config (13934)` into `r2.4.0` by @ko3n1g :: PR: #13950 - Cherry pick `[automodel] fix sft notebook (14002)` into `r2.4.0` by @ko3n1g :: PR: #14003 - Cherry pick `Use average reduction in FSDP grad reduce-scatter when grad dtype is … (13981)` into `r2.4.0` by @ko3n1g :: PR: #14004 - Cherry pick `GPU memory logging update (13982)` into `r2.4.0` by @ko3n1g :: PR: #14021 - Cherry pick `Remove kaldiio (14006)` into `r2.4.0` by @ko3n1g :: PR: #14032 - Cherry pick `Set L2_NeMo_2_Flux_Import_Test to be optional (14056)` into `r2.4.0` by @ko3n1g :: PR: #14058 - Cherry pick `Bump protobuf to 5.29.5 (14045)` into `r2.4.0` by @ko3n1g :: PR: #14060 - Cherry pick `Detect hardware before enabling DeepEP (14022)` into `r2.4.0` by @ko3n1g :: PR: #14068 - Version bump to `2.4.0rc2.dev0` by @github-actions[bot] :: PR: #14115 - Cherry pick `Fix SFT Dataset Bug (13918)` into `r2.4.0` by @ko3n1g :: PR: #14074 - Cherry pick `Align adapter 
shape with base linear output shape (14009)` into `r2.4.0` by @ko3n1g :: PR: #14083 - Cherry pick `[MoE] Update the fp8 precision interface for llama4 and qwen3 (14094)` into `r2.4.0` by @ko3n1g :: PR: #14104 - Cherry pick `[Llama4] Tokenizer naming update (14114)` into `r2.4.0` by @ko3n1g :: PR: #14123 - Cherry pick `Bump to pytorch 25.05 container along with TE update (13899)` into `r2.4.0` by @ko3n1g :: PR: #14145 - Cherry pick `Perf scripts updates (14005)` into `r2.4.0` by @ko3n1g :: PR: #14129 - Cherry pick `Remove unstructured (14070)` into `r2.4.0` by @ko3n1g :: PR: #14147 - Version bump to `2.4.0rc3.dev0` by @github-actions[bot] :: PR: #14165 - Cherry pick `Add checkpoint info for NIM Embedding Expor Tutorial (14177)` into `r2.4.0` by @ko3n1g :: PR: #14178 - Cherry pick `Fix dsv3 script (14007)` into `r2.4.0` by @ko3n1g :: PR: #14182 - Cherry pick `405b perf script updates (14176)` into `r2.4.0` by @chtruong814 :: PR: #14195 - Cherry pick `Fix nemotronh flops calculator (14161)` into `r2.4.0` by @chtruong814 :: PR: #14202 - Cherry pick `Add option to disable gloo process groups` (#14156) into `r2.4.0` by @chtruong814 :: PR: #14220 - Cherry pick `Remove g2p_en (14204)` into `r2.4.0` by @chtruong814 :: PR: #14212 - Cherry pick `diffusion mock data null args (14173)` into `r2.4.0` by @chtruong814 :: PR: #14217 - Cherry pick `perf-scripts: Change b200 config to EP8 (14207)` into `r2.4.0` by @chtruong814 :: PR: #14223 - Cherry pick `Change RerankerSpecter Dataset question key (14200)` into `r2.4.0` by @chtruong814 :: PR: #14224 - Cherry pick `Fix the forward when final_loss_mask is not present (14201)` into `r2.4.0` by @chtruong814 :: PR: #14225 - Cherry pick `Fix Llama Nemotron Nano Importer (14222)` into `r2.4.0` by @chtruong814 :: PR: #14226 - Cherry pick `[automodel] fix loss_mask pad token (14150)` into `r2.4.0` by @chtruong814 :: PR: #14227 - [Performance script] FSDP-UBR related recipe update (#14208) by @youngeunkwon0405 :: PR: #14233 - Fix for MCore dist ckpt loading #14229 by @stevehuang52 :: PR: #14239 - cherry-pick fix eval beam search ctc script by @lilithgrigoryan :: PR: #14242 - Cherry pick `Moving export security fixes over here (14254)` into `r2.4.0` by @chtruong814 :: PR: #14261 - Cherry pick `Confidence fix for tutorial (14250)` into `r2.4.0` by @chtruong814 :: PR: #14266 - Cherry pick `added new models to documentation (14264)` into `r2.4.0` by @chtruong814 :: PR: #14278 - Cherry-pick `FIx Flux & Flux_Controlnet initialization issue` (#14263) into `r2.4.0` by @chtruong814 :: PR: #14273 - Cherry pick `update ffmpeg install (14237)` into `r2.4.0` by @chtruong814 :: PR: #14279
## NVIDIA Neural Modules 2.3.2

This release addresses known security issues. For the latest NVIDIA Vulnerability Disclosure Information visit , for acknowledgement please reach out to the NVIDIA PSIRT team at

## NVIDIA Neural Modules 2.3.1

### Highlights

- Collections
  - LLM
    - Llama 4: Fixed an accuracy issue caused by MoE probability normalization. Improved pre-train and fine-tune performance.
- Export & Deploy
  - Updated vLLMExporter to use vLLM V1 to address a security vulnerability.
- AutoModel
  - Improved chat-template handling.
- Fault Tolerance
  - Local checkpointing: Fixed support for auto-inserted metric names for resuming from local checkpoints.

### Detailed Changelogs

#### Export
Changelog

- Cherry-pick `Update vLLMExporter to use vLLM V1` (#13498) into `r2.3.0` by @chtruong814 :: PR: #13631
#### Uncategorized
Changelog

- Bump to 2.3.1 by @chtruong814 :: PR: #13507
- Cherry pick `Use explicitly cached canary-1b-flash in CI tests (13237)` into `r2.3.0` by @ko3n1g :: PR: #13508
- Cherry pick `[automodel] bump liger-kernel to 0.5.8 + fallback (13260)` into `r2.3.0` by @ko3n1g :: PR: #13308
- Cherry-pick `Add recipe and ci scripts for qwen2vl` to `r2.3.0` by @romanbrickie :: PR: #13336
- Cherry pick `Fix skipme handling (13244)` into `r2.3.0` by @ko3n1g :: PR: #13376
- Cherry pick `Allow fp8 param gather when using FSDP (13267)` into `r2.3.0` by @ko3n1g :: PR: #13383
- Cherry pick `Handle boolean args for performance scripts and log received config (13291)` into `r2.3.0` by @ko3n1g :: PR: #13416
- Cherry pick `new perf configs (13110)` into `r2.3.0` by @ko3n1g :: PR: #13431
- Cherry pick `Adding additional unit tests for the deploy module (13411)` into `r2.3.0` by @ko3n1g :: PR: #13449
- Cherry pick `Adding more export tests (13410)` into `r2.3.0` by @ko3n1g :: PR: #13450
- Cherry pick `[automodel] add FirstRankPerNode (13373)` into `r2.3.0` by @ko3n1g :: PR: #13559
- Cherry pick `[automodel] deprecate global_batch_size dataset argument (13137)` into `r2.3.0` by @ko3n1g :: PR: #13560
- Cherry-pick `[automodel] fallback FP8 + LCE -> FP8 + CE` (#13349) into `r2.3.0` by @chtruong814 :: PR: #13561
- Cherry pick `[automodel] add find_unused_parameters=True for DDP (13366)` into `r2.3.0` by @ko3n1g :: PR: #13601
- Cherry pick `Add CI test for local checkpointing (#13012)` into `r2.3.0` by @ananthsub :: PR: #13472
- Cherry pick `[automodel] fix --mbs/gbs dtype and chat-template (13598)` into `r2.3.0` by @akoumpa :: PR: #13613
- Cherry-pick `Update t5.py` (#13082) to `r2.3.0` and `bump mcore to f98b1a0` by @chtruong814 :: PR: #13642
- [Automodel] Fix CP device_mesh issue, use PTL distsampler (#13473) by @akoumpa :: PR: #13636
- [Llama4] Fix the recipe bug - cherrypick #13649 by @gdengk :: PR: #13650
- build: Pin transformers (#13675) by @ko3n1g :: PR: #13692
## NVIDIA Neural Modules 2.3.0

### Highlights

- Export & Deploy
  - NeMo 2.0 export path for NIM
  - ONNX and TensorRT Export for NIM Embedding Container
  - In-framework deployment for HF Models
  - TRT-LLM deployment for HF Models in NeMo Framework
- Evaluation
  - Integrate nvidia-lm-eval to NeMo FW for evaluations with OpenAI API compatible in-framework deployment
- AutoModel
  - VLM AutoModelForImageTextToText
  - FP8 for AutoModel
  - Support CP with FSDP2
  - Support TP with FSDP2
  - Performance Optimization
    - add support for cut cross entropy & liger kernel
  - Gradient Checkpointing
- Fault Tolerance
  - Integrate NVRx v0.3 Local checkpointing
- Collections
  - LLM
    - Llama4
    - Llama Nemotron Ultra
    - Llama Nemotron Super
    - Llama Nemotron Nano
    - Nemotron-h/5
    - DeepSeek V3 Pretraining
    - Evo2
    - Qwen 2.5
    - LoRA for Qwen3-32B and Qwen3-30B-A3B
  - MultiModal
    - FLUX
    - Gemma 3
    - Qwen2-VL
  - ASR
    - NeMo Run support for ASR training
    - N-Gram LM on GPU for AED
    - N-Gram LM on GPU + Transducer greedy decoding (RNN-T, TDT)
    - Timestamps support for AED timestamp supported models
    - Migrate SpeechLM to NeMo 2.0
    - Canary-1.1
    - Replace ClassificationModels class with LabelModels
- Performance
  - Functional MXFP8 support for (G)B200
  - Current scaling recipe with TP communication overlap and FP8 param gathers
  - Custom FSDP support that fully utilizes GB200 NVL72

### Detailed Changelogs

#### ASR
Changelog

- Added model config params for Canary-1B-Flash, Canary-180M-Flash models by @KunalDhawan :: PR: #12588
- Canary tutorial by @ankitapasad :: PR: #12613
- Canary tutorial fix timestamp by @ankitapasad :: PR: #12677
- revert config by @nithinraok :: PR: #12689
- canary longform inference script with timestamps option by @krishnacpuvvada :: PR: #12653
- Fix default timestamps value for Hybrid ASR models by @artbataev :: PR: #12681
- Fix k2 installation with PyTorch 2.6.0 by @artbataev :: PR: #12686
- Improve time and RTFx report for ASR by @artbataev :: PR: #12680
- Modify train args by @ankitapasad :: PR: #12700
- Fix asr doc warnings by @nithinraok :: PR: #12720
- Rename `FastNGramLM` -> `NGramGPULanguageModel` by @artbataev :: PR: #12755
- transcribe fix for new hypotheses by @nune-tadevosyan :: PR: #12801
- Fix timestamps when cuda graphs enabled by @monica-sekoyan :: PR: #12808
- update streaming conformer by @stevehuang52 :: PR: #12846
- AED Decoding with N-Gram LM by @artbataev :: PR: #12730
- update notebook by @nithinraok :: PR: #13088
- bugfix ASR_Context_Biasing.ipynb by @lilithgrigoryan :: PR: #13109
- Change branch for installation from main to r2.3.0 by @ankitapasad :: PR: #13266
#### TTS
Changelog

- Add Magpie-TTS and Updates NeMo Audio Codecs by @blisc :: PR: #12606
- fix bug from prior commit (#13264) by @blisc :: PR: #13328
#### NLP / NMT
Changelog

- Remove old peft docs by @cuichenx :: PR: #12675
- Add code coverage for llm gpt models conversion tests by @suiyoubi :: PR: #12665
- Make BERT TransformerBlockWithPostLNSupport accept more inputs from Mcore by @suiyoubi :: PR: #12685
- remove gifs from documentation by @dimapihtar :: PR: #12732
- Rename `FastNGramLM` -> `NGramGPULanguageModel` by @artbataev :: PR: #12755
- fix NeMo documentation by @dimapihtar :: PR: #12754
- GPT Model/Data/Recipe Unit Test by @suiyoubi :: PR: #12757
- ci: Exclude nlp, mm, vision collections by @ko3n1g :: PR: #12816
- Add vocab size as attr to GPT and T5 Configs, use file name based logger in llm.gpt.data by @hemildesai :: PR: #12862
- Fix transformer layer api with megatron cbc89b3 by @yaoyu-33 :: PR: #12885
#### Text Normalization / Inverse Text Normalization
Changelog

- Rename `FastNGramLM` -> `NGramGPULanguageModel` by @artbataev :: PR: #12755
#### Export
Changelog

- GHA Conversion Test and Importer/Exporter Refactor by @suiyoubi :: PR: #12597
- Fix Llama Embedding Model Exporting keys by @suiyoubi :: PR: #12691
- build: Add trtllm by @ko3n1g :: PR: #12672
- Fix trt-llm install by @chtruong814 :: PR: #12827
- Update LLaVA's next HF exporter to load ViT checkpoint from YAML by @eagle705 :: PR: #12841
- Support huggingface export to tensorrtllm by @pthombre :: PR: #12889
- Adds a built stage for the trt-llm wheel to reduce the overall test image size by @chtruong814 :: PR: #12883
#### Uncategorized
Changelog - Update changelog-build.yml by @ko3n1g :: PR: #12584 - Update changelog for `r2.2.0` by @github-actions[bot] :: PR: #12585 - Add comments for requirements by @thomasdhc :: PR: #12603 - [automodel] FSDP2Strategy: move to device if using a single-device by @akoumpa :: PR: #12593 - build: Remove numba pin by @ko3n1g :: PR: #12604 - docs: Update installation guides by @ko3n1g :: PR: #12596 - Change Llama Scaling Factor type to Float by @suiyoubi :: PR: #12616 - ci: Test multiple python versions by @ko3n1g :: PR: #12619 - ci: Disable reformat by @ko3n1g :: PR: #12620 - Updating ModelOpt to 0.25.0 by @janekl :: PR: #12633 - [automodel] add additional hf_dataset tests by @akoumpa :: PR: #12646 - [automodel] add jit_transform tests by @akoumpa :: PR: #12645 - [automodel] init eos_token_id inside data module by @yuanzhedong :: PR: #12610 - [automodel] grad ckpt by @akoumpa :: PR: #12644 - bugfix(llm/LLaMa) - dropout_position can never be equal to extended string by @soluwalana :: PR: #12649 - Fix inference pipeline quality issue by @Victor49152 :: PR: #12639 - [automodel] switch to direct=True to propage return codes in nemorun by @akoumpa :: PR: #12651 - add Auto Conf support for bert, t5, qwen, starcoder models by @dimapihtar :: PR: #12601 - ci: Upload coverage by @ko3n1g :: PR: #12668 - ci: Re-enable changed-files action by @ko3n1g :: PR: #12683 - build: Pin sox by @ko3n1g :: PR: #12701 - add neva quantization by @linnanwang :: PR: #12698 - Clip coverage by @abhinavg4 :: PR: #12696 - GHA CI test: Remove unnecessary directive by @pablo-garay :: PR: #12714 - minor perf fixes by @malay-nagda :: PR: #12656 - Add DeepSeek V2 Lite into llm __init__.py by @suiyoubi :: PR: #12664 - Add Llama-Nemotron Nano and 70B models by @suiyoubi :: PR: #12712 - Save batch norm running stats in PEFT checkpoints by @cuichenx :: PR: #12666 - Fix document Readme under nemo to add more information by @yaoyu-33 :: PR: #12699 - Fix ub_overlap_ag by @cuichenx :: PR: #12721 - Toggle fast tokenizer if error occurs by @cuichenx :: PR: #12722 - Update README.md for blackwell and AutoModel by @snowmanwwg :: PR: #12612 - Raise error on import_ckpt with overwrite=False plus README for checkpoint_converters by @janekl :: PR: #12693 - [automodel] fix validation_step by @soluwalana :: PR: #12659 - [automodel] vlm tests by @akoumpa :: PR: #12716 - Auto Configurator code coverage by @dimapihtar :: PR: #12694 - [automodel] fix automodle benchmark script by @yuanzhedong :: PR: #12605 - Remove unnecessary directives by @pablo-garay :: PR: #12743 - Add recipe tests for coverage by @cuichenx :: PR: #12737 - Add Qwen2.5 in NeMo2 by @suiyoubi :: PR: #12731 - add fallback_module to safe_import_from by @akoumpa :: PR: #12726 - Update quantization scripts & relax modelopt requirement specifier by @janekl :: PR: #12709 - Import guard fasttext by @thomasdhc :: PR: #12758 - [automodel] chunked cross entropy by @akoumpa :: PR: #12752 - Add fsdp automodel test by @BoxiangW :: PR: #12718 - [automodel] if peft move only adapters to cpu by @akoumpa :: PR: #12735 - [automodel] update hf mockdataset by @akoumpa :: PR: #12643 - [automodel] remove unused cell in multinode notebook by @yuanzhedong :: PR: #12624 - Yash/llava next coverage by @yashaswikarnati :: PR: #12745 - Tidy code: remove unneeded statements/lines by @pablo-garay :: PR: #12771 - Pass tensor instead of raw number in _mock_loss_function in PTQ by @janekl :: PR: #12769 - ci: Run on nightly schedule by @ko3n1g :: PR: #12775 - Add logs for checkpoint saving start and finalization by 
@lepan-google :: PR: #12697 - Alit/test coverage by @JRD971000 :: PR: #12762 - Fix loss mask with packed sequence by @ashors1 :: PR: #12642 - Add pruning recipe by @kevalmorabia97 :: PR: #12602 - Update qwen2-v1 to use NeMo quick_gelu by @thomasdhc :: PR: #12787 - [doc] Fixes for audio doc warnings by @anteju :: PR: #12736 - ci: Measure multiprocessing by @ko3n1g :: PR: #12778 - ci: Fix flaky LLM tests by @ko3n1g :: PR: #12807 - Add BERT/Qwen2.5 Unit test and Refactor all GHA Conversion Tests by @suiyoubi :: PR: #12785 - Fix TransformerBlock cuda_graphs compatibility with MCore by @buptzyb :: PR: #12779 - ci: Remove `--branch` by @ko3n1g :: PR: #12809 - ci: Move scripts fully down to files by @ko3n1g :: PR: #12802 - add __init__.py to make this a package by @akoumpa :: PR: #12814 - Update changelog for `r2.2.1` by @github-actions[bot] :: PR: #12818 - add finetune support for Auto Configurator by @dimapihtar :: PR: #12770 - [automodel] add cpu:gloo to backend by @akoumpa :: PR: #12832 - add missing call to _apply_liger_kernel_to_instance by @akoumpa :: PR: #12806 - Prune docker images in GHA older than 8hrs by @chtruong814 :: PR: #12838 - [audio] Adding tests for predictive models by @anteju :: PR: #12823 - Update resiliency example notebook readme and add links to the brev launchable by @ShriyaRishab :: PR: #12843 - [automodel] qlora peft by @yzhang123 :: PR: #12817 - ci: Increase prune time by @ko3n1g :: PR: #12860 - Update base container in `Dockerfile.speech` by @artbataev :: PR: #12859 - Fix qwen2.5 1.5b configuration inheritance bug by @Aprilistic :: PR: #12852 - Update modelopt upperbound to 0.27 by @thomasdhc :: PR: #12788 - Non-blocking checkpoint cleanup failure by @jstjohn :: PR: #12804 - Improve evo2 dataset test and testability by @jstjohn :: PR: #12857 - Expand test converage neva / mllama by @yaoyu-33 :: PR: #12715 - Weekly bump by @ko3n1g :: PR: #12891 - ci: Optional_L2_NeMo_2_SSM_Finetuning by @ko3n1g :: PR: #12893 - docs: Update guide to PEP508 by @ko3n1g :: PR: #12890 - Replace lm-eval with nvidia-lm-eval by @chtruong814 :: PR: #12888 - Handle CUDA_DEVICE_MAX_CONNECTIONS before job launch by @guyueh1 :: PR: #12833 - add nemotron5 by @JRD971000 :: PR: #12660 - Bump vllm 0.8.2 by @Laplasjan107 :: PR: #12753 - DeepseekV3 SFT finetuning perf config by @gdengk :: PR: #12829 - add apply_chat_template method to TokenizerSpec + AutoTokenizer by @akoumpa :: PR: #12878 - add accelerate to dependencies by @akoumpa :: PR: #12871 - [automodel] Add FSDPv2-compatible context parallelism support. 
by @cspades :: PR: #12821 - [fault tolerance] Add local checkpointing support by @ananthsub :: PR: #12839 - ci: Bump release-freeze by @ko3n1g :: PR: #12914 - ci: Use PAT for code-freeze by @ko3n1g :: PR: #12915 - ci: Use correct environment by @ko3n1g :: PR: #12917 - Freeze tags in in `r2.3.0` by @github-actions[bot] :: PR: #12919 - chore: Bump version to 2.3.0.rc2 by @chtruong814 :: PR: #12920 - Version bump to `2.3.0rc3.dev0` by @github-actions[bot] :: PR: #12921 - Cherry pick `[automodel] Add linear ce loss support (12825)` into `r2.3.0` by @ko3n1g :: PR: #12922 - Cherry pick `DeepSeek V3 Multi Token Prediction (12550)` into `r2.3.0` by @ko3n1g :: PR: #12928 - Cherry pick `Set L2_NeMo_2_EVAL test to be optional (12949)` into `r2.3.0` by @ko3n1g :: PR: #12951 - Cherry pick `GB200 LLM performance scripts tuning (12791)` into `r2.3.0` by @ko3n1g :: PR: #12923 - Cherry pick `Allow configuration of PP communication backend to UCC in nemo2 (11755)` into `r2.3.0` by @ko3n1g :: PR: #12946 - Cherry pick `guard bitsandbytes based on cuda availability (12937)` into `r2.3.0` by @ko3n1g :: PR: #12958 - Cherry pick `Hugging Face model deployment support (12628)` into `r2.3.0` by @ko3n1g :: PR: #12962 - Cherry pick `fix macro-acc for pair-audio eval (12908)` into `r2.3.0` by @ko3n1g :: PR: #12963 - Cherry pick `Add energon dataset support for Qwen2VL (12831)` into `r2.3.0` by @ko3n1g :: PR: #12966 - Cherry pick `Make TETransformerLayerAutocast Support Cuda Graph (12075)` into `r2.3.0` by @ko3n1g :: PR: #12967 - Cherry pick `Use nvidia-lm-eval for evaluation (12902)` into `r2.3.0` by @ko3n1g :: PR: #12971 - Cherry pick `[NeMo 2.0] Interface for using MXFP8 and FP8 current scaling recipes (12503)` into `r2.3.0` by @ko3n1g :: PR: #12974 - Cherry pick `Fix trtllm and lightning conflict (12943)` into `r2.3.0` by @ko3n1g :: PR: #12981 - Cherry pick `Update v3 finetuning recipe (12950)` and `Specify PP first/last in strategy (12992)` into `r2.3.0` by @ko3n1g :: PR: #12984 - Cherry pick `Resolve an issue in custom megatron FSDP config setting (12948)` into `r2.3.0` by @ko3n1g :: PR: #12987 - Cherry pick `Remove getattr_proxy to avoid problematic edge cases (12176)` into `r2.3.0` by @ko3n1g :: PR: #12990 - Cherry pick `Enable async requests for in-fw deployment with OAI compatible server (12980)` into `r2.3.0` by @ko3n1g :: PR: #12994 - Cherry pick `initialize model with metadata (12496)` into `r2.3.0` by @ko3n1g :: PR: #12997 - Cherry pick `Bugfix for logits support for hf deployment (12965)` into `r2.3.0` by @ko3n1g :: PR: #13001 - Cherry pick `Update nvidia-resiliency-ext to be >= 0.3.0 (12925)` into `r2.3.0` by @ko3n1g :: PR: #13000 - Cherry-pick Fix params_dtype for distillation and GPT HF Exporter head_dim for pruning to r2.3.0 by @kevalmorabia97 :: PR: #13002 - Install nvidia-pytriton on arm (#13011) by @thomasdhc :: PR: #13013 - Version bump to `2.3.0rc4.dev0` by @github-actions[bot] :: PR: #13041 - Cherry pick `Alit/nemotron h (12942)` into `r2.3.0` by @ko3n1g :: PR: #13007 - Cherry pick `[Automodel] Add TP/SP support with default llama-like sharding plan (12796)` into `r2.3.0` by @ko3n1g :: PR: #13017 - Cherry pick `Add initial docs broken link check (12977)` into `r2.3.0` by @ko3n1g :: PR: #13045 - Cherry pick `Fix MoE Init to not use Bias in test_strategy_lib.py (13009)` into `r2.3.0` by @ko3n1g :: PR: #13014 - Cherry pick `cleaner tflops log name (13005)` into `r2.3.0` by @ko3n1g :: PR: #13024 - Cherry pick `Improve t5 test coverage (12803)` into `r2.3.0` by @ko3n1g :: PR: #13025 - Cherry pick 
`put the warning on the right place (12909)` into `r2.3.0` by @ko3n1g :: PR: #13035 - Cherry pick `Temporary disable CUDA graphs in DDP mode for transducer decoding (12907)` into `r2.3.0` by @ko3n1g :: PR: #13036 - Cherry pick `[automodel] peft fix vlm (13010)` into `r2.3.0` by @ko3n1g :: PR: #13037 - Cherry pick `Only run the docs link check on the container (13068)` into `r2.3.0` by @ko3n1g :: PR: #13070 - Cherry pick `Add fp8 recipe option to perf script (13032)` into `r2.3.0` by @ko3n1g :: PR: #13055 - Cherry pick `Unified ptq export (12786)` into `r2.3.0` by @ko3n1g :: PR: #13062 - Cherry pick `Fix VP list index out of range from Custom FSDP (13021)` into `r2.3.0` by @ko3n1g :: PR: #13077 - Cherry pick `Add logging to cancel out PTL's warning about dataloader not being resumable (13072)` into `r2.3.0` by @ko3n1g :: PR: #13100 - Cherry pick `Fix long sequence generation after new arg introduced in mcore engine (13049)` into `r2.3.0` by @ko3n1g :: PR: #13104 - Cherry pick `Support Mamba models quantization (12631)` into `r2.3.0` by @ko3n1g :: PR: #13105 - Cherry pick `Add track_io to user buffer configs (13071)` into `r2.3.0` by @ko3n1g :: PR: #13111 - ci: Onboard 8-GPU runner (#13115) by @ko3n1g :: PR: #13121 - Cherry pick `Add fine-tuning dataset function for FineWeb-Edu and update automodel… (13027)` into `r2.3.0` by @ko3n1g :: PR: #13118 - Cherry pick `Re-add sox to asr requirements (13092)` into `r2.3.0` by @ko3n1g :: PR: #13120 - Cherry pick `Update Mllama cross attn signature to match update MCore (13048)` into `r2.3.0` by @ko3n1g :: PR: #13122 - Cherry pick `Fix Exporter for baichuan and chatglm (13095)` into `r2.3.0` by @ko3n1g :: PR: #13126 - ci: Faster builds (#13142) by @ko3n1g :: PR: #13144 - Version bump to `2.3.0rc5.dev0` by @github-actions[bot] :: PR: #13146 - ci: Fix mcore install in test container (#13152) by @ko3n1g :: PR: #13159 - ci: Fix race-condition of container setup (#13162) by @ko3n1g :: PR: #13163 - Cherry pick `Guard decord and triton import (12861)` into `r2.3.0` by @ko3n1g :: PR: #13132 - Cherry pick `Bump TE version and apply patch (13087)` into `r2.3.0` by @ko3n1g :: PR: #13139 - Cherry pick `Update Llama-Minitron pruning-distillation notebooks from NeMo1 to NeMo2 + NeMoRun (12968)` into `r2.3.0` by @ko3n1g :: PR: #13141 - Cherry pick `Export and Deploy Tests (13076)` into `r2.3.0` by @ko3n1g :: PR: #13150 - Cherry pick `ub fp8 h100 fixes (13131)` into `r2.3.0` by @ko3n1g :: PR: #13153 - Cherry pick `Fix Transducer Decoding with CUDA Graphs in DDP with Mixed Precision (12938)` into `r2.3.0` by @ko3n1g :: PR: #13154 - Cherry pick `build: Pin modelopt (13029)` into `r2.3.0` by @chtruong814 :: PR: #13170 - Cherry pick `add fixes for nemotron-h` (13073) into `r2.3.0` by @JRD971000 :: PR: #13165 - Add dsv3 pretrain script, support flops calculation (previous #12947) by @guyueh1 :: PR: #13186 - ci: Allow running CI on weekly bump branch by @ko3n1g :: PR: #13233 - Cherry pick `Add Llama Nemotron Super/Ultra models (13044)` into `r2.3.0` by @ko3n1g :: PR: #13212 - Cherry pick `Add Blockwise FP8 to PTQ & EP to modelopt resume (12670)` into `r2.3.0` by @ko3n1g :: PR: #13239 - Cherry pick `[OAI Serving] Validate greedy generation args (redo) (13216)` into `r2.3.0` by @ko3n1g :: PR: #13242 - Cherry pick `drop sample_alpha in speechlm (13208)` into `r2.3.0` by @ko3n1g :: PR: #13246 - Cherry pick `[Eval bugfix] Move global eval-related imports inside the evaluate function (13166)` into `r2.3.0` by @ko3n1g :: PR: #13249 - Cherry pick `[Eval bugfix] Change default val 
of parallel_requests in eval script (13247)` into `r2.3.0` by @ko3n1g :: PR: #13253 - Cherry pick `Add tutorial for evaluation with Evals Factory (13259)` into `r2.3.0` by @ko3n1g :: PR: #13271 - Cherry pick `Fix default token durations (13168)` into `r2.3.0` by @ko3n1g :: PR: #13261 - Cherry pick `[Evaluation] Add support for nvidia-lm-eval==25.04 (13230)` into `r2.3.0` by @ko3n1g :: PR: #13274 - Cherry pick `[bug fix] set inference max seq len in inference context (13245)` into `r2.3.0` by @ko3n1g :: PR: #13276 - Cherry pick `More export and deploy unit tests (13178)` into `r2.3.0` by @ko3n1g :: PR: #13283 - Cherry pick `Reopen 13040 (13199)` into `r2.3.0` by @ko3n1g :: PR: #13303 - Cherry pick `Fix nemo1's neva notebook (13218)` into `r2.3.0` by @ko3n1g :: PR: #13312 - Cherry pick `build: various bumps (13285)` into `r2.3.0` by @ko3n1g :: PR: #13313 - Cherry-pick `ci: Increase cache pool` into `r2.3.0` by @chtruong814 :: PR: #13317 - Cherry pick `update num nodes in deepseek v3 finetune recipe (13314)` into `r2.3.0` by @ko3n1g :: PR: #13316 - Cherry pick `Fix neva notebook (13334)` into `r2.3.0` by @ko3n1g :: PR: #13335 - Cherry-pick `Add Llama4 Scout and Maverick Support (#12898)` by @ko3n1g :: PR: #13331 - Cherry pick `Fix handling Llama Embedding dimensions param and prompt type in the ONNX export tutorial (13262)` into `r2.3.0` by @ko3n1g :: PR: #13326 - Cherry-pick `Fix transformer offline for CI/CD llama4 tests` (#13339) to `r2.3.0` by @chtruong814 :: PR: #13340 - Fix llama4 test names by @chtruong814 :: PR: #13358 - Cherry pick `vLLM==0.8.5 update (13350)` into `r2.3.0` by @ko3n1g :: PR: #13354 - Cherry-pick a test and doc fix to r2.3.0 by @chtruong814 :: PR: #13338 - Cherry pick `Add llama4 training recipe (12952)` into `r2.3.0` by @ko3n1g :: PR: #13386
## NVIDIA Neural Modules 2.2.1 ### Highlights - Training - Fix training instability in MoE-based models. - Fix bug in Llama exporter for Llama 3.2 1B and 3B. - Fix bug in the LoRA `linear_fc1` adapter when different TP is used during saving and loading of the adapter checkpoint. ### Detailed Changelogs #### Uncategorized
Changelog - Re-add reverted commits after 2.2.0 and set next version to be 2.2.1 by @chtruong814 :: PR: #12587 - Cherry pick `Fix exporter for llama models with shared embed and output layers (12545)` into `r2.2.0` by @ko3n1g :: PR: #12608 - Cherry pick `Fix TP for LoRA adapter on `linear_fc1` (12519)` into `r2.2.0` by @ko3n1g :: PR: #12607 - Bump mcore to use 0.11.1 by @chtruong814 :: PR: #12634
## NVIDIA Neural Modules 2.2.0 ### Highlights - Training - Blackwell and Grace Blackwell support - Pipeline parallel support for distillation - Improved NeMo Framework installation - Export & Deploy - vLLM export for NeMo 2.0 - Evaluations - Integrate lm-eval-harness - Collections - LLM - DAPT Example and best practices in NeMo 2.0 - [NeMo 2.0] Enable Tool Learning and add a tutorial - Support GPT Embedding Model (Llama 3.2 1B/3B) - Qwen2.5, Phi4 (via AutoModel) - SFT for Llama 3.3 model (via AutoModel) - Support BERT Embedding Model with NeMo 2.0 - DeepSeek SFT & PEFT Support - MultiModal - Clip - SP for NeVA - CP for NeVA - Intern-VIT - Automodel - Preview release. - PEFT and SFT support for LLMs available via Hugging Face’s AutoModelForCausalLM. - Support for Hugging Face-native checkpoints (full model and adapter only). - Support for distributed training via DDP and FSDP2. - ASR/TTS - Lhotse: TPS-free 2D bucket estimation and filtering - Update model outputs so that all ASR outputs are in a consistent format - Sortformer Release Model ### Detailed Changelogs #### ASR
Changelog - removed the line which caused a problem in nfa_tutorial by @Ssofja :: PR: #11710 - TPS-free 2D bucket estimation and filtering by @pzelasko :: PR: #11738 - Update transcribe_utils.py by @stevehuang52 :: PR: #11984 - Sortformer Diarizer 4spk v1 model PR Part 4: Sortformer Documents and Notebook Tutorials by @tango4j :: PR: #11707 - fix the issue during batched inference of Sortformer diarizer by @tango4j :: PR: #12047 - changed asr models outputs to be consistent by @Ssofja :: PR: #11818 - chore: Update notebooks by @ko3n1g :: PR: #12161 - add ctc segmentation by @ko3n1g :: PR: #12312 - clean up VAD tutorial by @stevehuang52 :: PR: #12410 - copy from main by @nithinraok :: PR: #12423 - ci: Disable ASR tests for now (#12443) by @ko3n1g :: PR: #12466 - ASR_CTC_Language_Finetuning.ipynb bugfix by @lilithgrigoryan :: PR: #12538
#### TTS
Changelog - Add New Transformer Backbone for TTS Models by @blisc :: PR: #11911 - changed asr models outputs to be consistent by @Ssofja :: PR: #11818 - chore: Update notebooks by @ko3n1g :: PR: #12161
#### NLP / NMT
Changelog - Use explicit imports from megatronllm_deployable.py by @janekl :: PR: #11705 - Bug fix minor bug in TRT-LLM deployment by @oyilmaz-nvidia :: PR: #11714 - gpt moe perf scripts by @malay-nagda :: PR: #11760 - Bump mcore by @ko3n1g :: PR: #11740 - Enable packed seqs for validation by @jiemingz :: PR: #11748 - Revert Mcore update since it caused regression by @pablo-garay :: PR: #11791 - Fix Gemma2 Attention Init Args by @suiyoubi :: PR: #11792 - Add null tokenizer by @erhoo82 :: PR: #11789 - Fix DistCP inference issue by @suiyoubi :: PR: #11801 - Add BERT Embedding Models E5 Recipe by @suiyoubi :: PR: #11787 - Add rope scaling configs for NeMo 1 by @BoxiangW :: PR: #11807 - Fix calculating num_available_samples by @huvunvidia :: PR: #11830 - fix sentencepiece tokenizer special tokens by @akoumpa :: PR: #11811 - add chat sft dataset to support agent tool calling by @chenrui17 :: PR: #11759 - Revert "Revert Mcore update since it caused regression (#11791)" by @ko3n1g :: PR: #11799 - fix checkpoint load issue by @dimapihtar :: PR: #11859 - Fix nemo 1 packed sequence TE version error by @cuichenx :: PR: #11874 - enable loading older TE checkpoints by @dimapihtar :: PR: #11930 - ci: Use single runner machines for unit tests by @ko3n1g :: PR: #11937 - llm performance scripts by @malay-nagda :: PR: #11736 - [MoE] add expert tensor parallelism support for NeMo2.0 MoE by @gdengk :: PR: #11880 - add exception when loading ckpt saved by TE < 1.13 by @dimapihtar :: PR: #11988 - remove renormalize_blend_weights flag by @dimapihtar :: PR: #11975 - Llama3.2 1B Embedding Model Support by @suiyoubi :: PR: #11909 - Weekly bump by @ko3n1g :: PR: #11896 - Debug Apex distributed optimizer to handle Transformer Engine 2.0 by @timmoon10 :: PR: #12004 - throw MegatronOptimizerModule warning only with mcore models by @akoumpa :: PR: #12085 - fix nmt dataclass issue by @dimapihtar :: PR: #12081 - Propogate dp last changes from mcore by @ryantwolf :: PR: #12012 - Add error message when downloading failed. by @yuanzhedong :: PR: #12139 - interface for asymmetric pipeline schedule by @erhoo82 :: PR: #12039 - chore: Update notebooks by @ko3n1g :: PR: #12161 - Cherrypick #12382, #12415 and #12424 by @cuichenx :: PR: #12425 - ASR_CTC_Language_Finetuning.ipynb bugfix by @lilithgrigoryan :: PR: #12538
#### Text Normalization / Inverse Text Normalization
Changelog - surface attn_implementation option by @akoumpa :: PR: #11873 - attn_implementation eager fallback by @akoumpa :: PR: #12060
#### NeMo Tools
Changelog - build: Add `sox` to SDE by @ko3n1g :: PR: #11882 - add ctc segmentation by @ko3n1g :: PR: #12312
#### Export
Changelog - Bug fix minor bug in TRT-LLM deployment by @oyilmaz-nvidia :: PR: #11714 - In-framework deployment NeMo 2.0 nemo_export.py test by @janekl :: PR: #11749 - Fix starcoder2 missing bias in nemo2 config for TRTLLM by @meatybobby :: PR: #11809 - Autodetect dtype on exporting to TensorRT-LLM by @janekl :: PR: #11907 - PTQ & TRT-LLM updates related to upcoming PyTorch 25.01 bump by @janekl :: PR: #11941 - Run Flake8 for nemo.export module by @janekl :: PR: #11728 - Skip initialization in hf export by @cuichenx :: PR: #12136 - update export io call by @akoumpa :: PR: #12144 - add default kwargs for trtllm model runner by @pablo-garay :: PR: #12248 - cherry-pick: fix[export]: reshard model correctly handles extra_state when it's a tensor (#12132) by @terrykong :: PR: #12335
#### Bugfixes
Changelog - added required installation for sox to process mp3 file by @Ssofja :: PR: #11709 - removed the line which caused a problem in nfa_tutorial by @Ssofja :: PR: #11710 - Bug fix minor bug in TRT-LLM deployment by @oyilmaz-nvidia :: PR: #11714
#### Uncategorized
Changelog - Allow using vocab size from config by @shanmugamr1992 :: PR: #11718 - Fix baseline recipes by @erhoo82 :: PR: #11725 - Update changelog for `r2.1.0` by @github-actions[bot] :: PR: #11745 - ci: Fix changelog generator by @ko3n1g :: PR: #11744 - Fix 'http_port' parameter name in DeployPyTriton usages and update .qnemo compress=True path by @janekl :: PR: #11747 - Conversion NeMo and HF checkpoint script for T5 by @huvunvidia :: PR: #11739 - Add BERT Embedding Models by @suiyoubi :: PR: #11737 - Add server ready check before starting evaluation by @athitten :: PR: #11731 - only install bitsandbytes on x86 by @akoumpa :: PR: #11781 - [Bugfix] Skip processing if extra_state loads as None by @janekl :: PR: #11778 - chore(beep boop 🤖): Bump `MCORE_TAG=4dc8977...` (2025-01-07) by @ko3n1g :: PR: #11768 - make progress printer compatible with PTL v2.5.0 by @ashors1 :: PR: #11779 - Fix Mistral Conversion Issue by @suiyoubi :: PR: #11786 - build: Fix build-arg by @ko3n1g :: PR: #11815 - Lora ckpt in HF format for NeMo AutoModel by @oyilmaz-nvidia :: PR: #11712 - 8x22b seq len by @malay-nagda :: PR: #11788 - Bugfix for output_generation_logits in tensorrtllm by @athitten :: PR: #11820 - handle mistralai/Mistral-7B-Instruct-v0.3 tokenizer correctly by @akoumpa :: PR: #11839 - remove tensorstore pin in requirements*.txt by @pstjohn :: PR: #11777 - Do not load context for model transform in llm inference by @hemildesai :: PR: #11751 - update nemo2sftpeft tutorial container verison by @HuiyingLi :: PR: #11832 - Latest News updated for Cosmos by @lbliii :: PR: #11806 - Removes tensorstore 0.1.45 pin from requirements_deploy.txt by @pstjohn :: PR: #11858 - ci: Prune dangling images by @ko3n1g :: PR: #11885 - Disable tests that download datasets from web by @akoumpa :: PR: #11878 - Add context_logits for eval accuracy calculation in case of multi token prediction tasks by @athitten :: PR: #11753 - add dataset_root to SpecterDataModule by @suiyoubi :: PR: #11837 - Support both Path and str for APIs by @maanug-nv :: PR: #11865 - Run nsys callback on GBS not on MBS by @akoumpa :: PR: #11861 - ci: Set bump-branch to weekly by @ko3n1g :: PR: #11889 - chore: Update mcore-tag-bump-bot.yml by @ko3n1g :: PR: #11891 - ci: Bump Mcore in weekly PR by @ko3n1g :: PR: #11897 - check restore_config first by @akoumpa :: PR: #11890 - LinearAdapter: propagate args to _init_adapter by @akoumpa :: PR: #11902 - NeMo 2.0 fp8 conversion by @Laplasjan107 :: PR: #11845 - nemo ux expert tensor parallel by @akoumpa :: PR: #11903 - Add CP support to Neva in NeMo2 by @yaoyu-33 :: PR: #11850 - build: Move dependencies by @ko3n1g :: PR: #11790 - Add Flux and Flux Controlnet Support to Diffusion folder by @Victor49152 :: PR: #11794 - ci: Adjust bump mcore workflow by @ko3n1g :: PR: #11918 - ci: Small fix to bump workflow by @ko3n1g :: PR: #11919 - Revert #11890 and add a test that would have caught the error by @cuichenx :: PR: #11914 - ci: Adjust input argument by @ko3n1g :: PR: #11921 - Create test_phi3.py by @mayani-nv :: PR: #11843 - Enable NeMo importer and loading dist CKPT for training by @Victor49152 :: PR: #11927 - build: Pin `triton` by @ko3n1g :: PR: #11938 - Add sharding for speechlm and vlm by @BoxiangW :: PR: #11876 - Update torch load for load from disk by @thomasdhc :: PR: #11963 - Add options to add mp_policy and parallel_fn for NeMo automodel fsdp2 by @BoxiangW :: PR: #11956 - ci: Add coverage reports by @ko3n1g :: PR: #11912 - Add batching support for evaluation by @athitten :: PR: #11934 - add use_fast option 
by @akoumpa :: PR: #11976 - improve error and debug messages in model connector by @cuichenx :: PR: #11979 - [checkpoint][docs] Fix typos in dist checkpointing docs by @ananthsub :: PR: #11983 - callbacks and bf16 grad by @malay-nagda :: PR: #11985 - remove --disable-ckpt from tests by @akoumpa :: PR: #11996 - nemo automodel sft squad data prep fix by @akoumpa :: PR: #11994 - Introduce evaluation API by @Glorf :: PR: #11895 - Remove deprecated tests/infer_data_path.py by @janekl :: PR: #11997 - Checkpoint saving for automodels via ModelCheckpoint by @akoumpa :: PR: #11998 - Mask vocab padding token ids from CE loss by @maanug-nv :: PR: #11999 - Add the NeMo2 memory profiling plugin by @gdengk :: PR: #12009 - chore(ci): Disable VMs cron job on forks by @mikemckiernan :: PR: #12020 - Adding speechlm AutoModel test by @oyilmaz-nvidia :: PR: #11990 - minor fix and simplify by @akoumpa :: PR: #12007 - ci: Build wheel workflow by @ko3n1g :: PR: #12021 - ci: Release workflow by @ko3n1g :: PR: #12022 - Version bump to `2.2.0rc1` by @github-actions[bot] :: PR: #12023 - ci: Run unit tests on main by @ko3n1g :: PR: #11986 - [Audio] Fix extra step in Euler sampler for flow matching inference by @racoiaws :: PR: #11989 - Set zarr range to >=2.18.2 and <3.0.0 by @chtruong814 :: PR: #12005 - ci: Run linting per domain by @ko3n1g :: PR: #12027 - Replace reference of requirements_infer.txt with requirements_deploy.txt by @chtruong814 :: PR: #12029 - ci: Always run linting by @ko3n1g :: PR: #12035 - ci: Retry on timeout by @ko3n1g :: PR: #11974 - [MoE] fix run err in mixtral22B recipe and update its perf config by @gdengk :: PR: #12036 - Version bump to `2.2.0rc2.dev0` by @github-actions[bot] :: PR: #12040 - ci: Update weekly brain by @ko3n1g :: PR: #12043 - ci: Update workflow by @ko3n1g :: PR: #12044 - nemo-automodel: fsdp2 support for peft by @akoumpa :: PR: #12008 - fix llama-3.1 hf model_id by @AtsunoriFujita :: PR: #11774 - Clip Model in Nemo2 by @abhinavg4 :: PR: #11980 - Adding TFLOPs callback for Multimodal models and NeVA calculator by @parthmannan :: PR: #11969 - ci: Allow skipping docs by @ko3n1g :: PR: #12048 - avoid missmatch error when loading older TE checkpoints by @dimapihtar :: PR: #12028 - Add padding in mllama vision encoder to align with HF by @meatybobby :: PR: #11808 - chore: Add warning for rebase by @ko3n1g :: PR: #12061 - ci: Lint Python files only by @ko3n1g :: PR: #12064 - Recipe changes for performance by @guyueh1 :: PR: #11763 - Pipeline-parallel support for Knowledge Distillation (NeMo 2) by @AAnoosheh :: PR: #11766 - add cp_comm_type param to Mistral config by @dimapihtar :: PR: #12049 - Conformer-based spectrogram estimator by @anteju :: PR: #12002 - Adding nemo CI by @abhinavg4 :: PR: #12052 - Update optimization features readme from nemo1 to nemo2 by @yaoyu-33 :: PR: #12071 - Add Llama Embedding Tutorial by @suiyoubi :: PR: #12042 - Fix Linting by @suiyoubi :: PR: #12079 - Fix hf_dataset bug by @BoxiangW :: PR: #12072 - set TOKENIZERS_PARALLELISM=True by @akoumpa :: PR: #12083 - minor fix in model's summary identation during logging by @akoumpa :: PR: #12084 - Refactor VLM modules / Add InternVit submodule support by @yaoyu-33 :: PR: #11851 - Fix SBERT with sequence_len_offset by @suiyoubi :: PR: #12057 - ci: codecov by @ko3n1g :: PR: #12030 - build: Improve installer by @ko3n1g :: PR: #12016 - ci: Modular unit tests by @ko3n1g :: PR: #12104 - ci: Update bump workflow by @ko3n1g :: PR: #12106 - etp docs by @akoumpa :: PR: #12111 - build: Better caching by @ko3n1g :: PR: 
#12109 - ci: Fix flaky test by @ko3n1g :: PR: #12113 - Ensure nemo.collections.vlm does not strictly require transformer engine by @chtruong814 :: PR: #12108 - build: Optimize by @ko3n1g :: PR: #12112 - refactor peft module matching; introduce exclude_modules by @akoumpa :: PR: #12066 - Update mcore commit (02.06.25) by @pablo-garay :: PR: #12114 - ci: Bump Mcore inplace by @ko3n1g :: PR: #12115 - ci: Bump bot by @ko3n1g :: PR: #12117 - Add neva pretrain script by @yaoyu-33 :: PR: #12033 - DAPT playbooks - with NeMo 2.0 by @jvamaraju :: PR: #12067 - Malay/bw scripts by @malay-nagda :: PR: #11961 - [MoE] Add type annotation for mixtral configs by @gdengk :: PR: #12126 - ci: Disable checks by @ko3n1g :: PR: #12129 - Add performance-optimized example for llama2 70b LoRA by @vysarge :: PR: #12055 - Add Automodel support for Deepseek v3 model by @BoxiangW :: PR: #12099 - Bug fix with generation of expert_tensor_parallel_rank by @guyueh1 :: PR: #12125 - Rename neva datamodule by @yaoyu-33 :: PR: #12121 - Update vLLM to 0.7.2 by @Laplasjan107 :: PR: #12078 - Prevent downloading dataset every time in ci test by @cuichenx :: PR: #12095 - AudioToAudioModel: fix model->dataloader sample_rate parameter injection by @racoiaws :: PR: #12092 - Minor Bug Fixes - LLaMa Embedding by @soluwalana :: PR: #12146 - build: Force re-install VCS dependencies by @ko3n1g :: PR: #12155 - Cherry pick `build: Force re-install VCS dependencies (12155)` into `r2.2.0` by @ko3n1g :: PR: #12191 - Cherry pick `Add function calling SFT NeMo2.0 tutorial (11868)` into `r2.2.0` by @ko3n1g :: PR: #12180 - Cherry pick `Update TTS code to remove calls to deprecated functions (12153)` into `r2.2.0` by @ko3n1g :: PR: #12201 - Cherry pick `Fix multi-GPU in-framework deployment (12090)` into `r2.2.0` by @ko3n1g :: PR: #12172 - Cherry pick `disable moe logging to avoid deepseek hang (12168)` into `r2.2.0` by @ko3n1g :: PR: #12192 - Cherry pick `build: Pin down transformers (12229)` into `r2.2.0` by @ko3n1g :: PR: #12230 - Cherry pick `Fix loading extra states from torch tensor (12185)` into `r2.2.0` by @ko3n1g :: PR: #12226 - Cherry pick `nemo-automodel checkpoint-io refactor (12070)` into `r2.2.0` by @ko3n1g :: PR: #12234 - ci: Flaky tests release by @ko3n1g :: PR: #12293 - Cherry pick `Set L2_Speech_Batch_Size_OOMptimizer_Canary to be optional (12299)` into `r2.2.0` by @ko3n1g :: PR: #12300 - build: Editable nemo install (#12304) by @ko3n1g :: PR: #12308 - ci: Fix test workflow by @ko3n1g :: PR: #12311 - Cherry pick `build: Exclude tensorstore 0.1.72 (12317)` into `r2.2.0` by @ko3n1g :: PR: #12318 - Cherry pick `Fix the local path in Sortformer diarizer training tutorial (12135)` into `r2.2.0` by @ko3n1g :: PR: #12316 - Cherry pick `Add eval requirement to setup.py (12152)` into `r2.2.0` by @ko3n1g :: PR: #12277 - Cherry pick `Add modelopt to requirements_nlp.txt (12261)` into `r2.2.0` by @ko3n1g :: PR: #12278 - cherry pick 12209 by @akoumpa :: PR: #12240 - Cherry pick `Energon ckpt multimodal (12245)` into `r2.2.0` by @ko3n1g :: PR: #12307 - Cherry pick `[nemo1] Fix Mamba/Bert loading from checkpoint after TE extra states were introduced (12275)` into `r2.2.0` by @ko3n1g :: PR: #12314 - Cherry pick `fix masked loss calculation (12255)` into `r2.2.0` by @ko3n1g :: PR: #12286 - chore: Cherry pick deepseek by @ko3n1g :: PR: #12324 - build: Bump PyT to 25.01 (#11973) by @ko3n1g :: PR: #12323 - Cherry pick `build: Bump mcore (12320)` into `r2.2.0` by @ko3n1g :: PR: #12328 - Cherry pick `[automodel] re-enable FSDP2 tests (12325)` into 
`r2.2.0` by @ko3n1g :: PR: #12331 - Cherry pick `[automodel] fix loss reporting (12303)` into `r2.2.0` by @ko3n1g :: PR: #12334 - build: Bump Mcore by @ko3n1g :: PR: #12340 - Cherry-pick Asr fixes 2.2 (#12227) by @ko3n1g :: PR: #12345 - Cherry-pick Bug fixes (#12315) by @chtruong814 :: PR: #12346 - Cherry pick `[automodel] remove fix_progress_bar from fsdp2 strategy (12339)` into `r2.2.0` by @ko3n1g :: PR: #12347 - Cherry pick `Fix NeMo1 Bert Embedding Dataset args (12341)` into `r2.2.0` by @ko3n1g :: PR: #12349 - Cherry pick `Fix NeMo1 sequence_len_offset in Bert fwd (12350)` into `r2.2.0` by @ko3n1g :: PR: #12359 - Cherry pick `Add nemo-run recipe for evaluation (12301)` into `r2.2.0` by @ko3n1g :: PR: #12352 - Cherry pick `Add DeepSeek-R1 Distillation NeMo 2.0 tutorial (12187)` into `r2.2.0` by @ko3n1g :: PR: #12355 - chore: Update package_info.py by @ko3n1g :: PR: #12362 - Version bump to `2.2.0rc4.dev0` by @github-actions[bot] :: PR: #12363 - Bump mcore to latest commit on release branch by @chtruong814 :: PR: #12360 - Cherry pick `[automodel] add lr scheduler (12351)` into `r2.2.0` by @ko3n1g :: PR: #12361 - Cherry pick `[automodel] add distributed data sampler (12326)` into `r2.2.0` by @ko3n1g :: PR: #12373 - Cherry pick `[NeVA] Fix for CP+THD (12366)` into `r2.2.0` by @ko3n1g :: PR: #12375 - Cherry pick `Ignore attribute error when serializing mcore specs (12353)` into `r2.2.0` by @ko3n1g :: PR: #12383 - Cherry pick `Avoid init_ddp for inference (12011)` into `r2.2.0` by @ko3n1g :: PR: #12385 - Cherry pick `[docs] fix notebook render (12374)` into `r2.2.0` by @ko3n1g :: PR: #12394 - Cherry pick `Neva finetune scripts and PP fix (12387)` into `r2.2.0` by @ko3n1g :: PR: #12397 - Cherry pick `[automodel] update runner tags for notebooks (12428)` into `r2.2.0` by @ko3n1g :: PR: #12431 - Cherry pick `[automodel] update examples (12411)` into `r2.2.0` by @ko3n1g :: PR: #12432 - Cherry pick `Evaluation docs (12348)` into `r2.2.0` by @ko3n1g :: PR: #12460 - Cherry pick `Update prompt format (12452)` into `r2.2.0` by @ko3n1g :: PR: #12455 - Cherry pick `Fixing a wrong Sortformer Tutorial Notebook path. 
(12479)` into `r2.2.0` by @ko3n1g :: PR: #12480 - Cherry pick `added needed checks and changes for bugfix (12400)` into `r2.2.0` by @Ssofja :: PR: #12447 - Cherry pick `[automodel] fix loss/tps reporting across ranks (12389)` into `r2.2.0` by @ko3n1g :: PR: #12413 - Cherry pick `enable fsdp flag for FSDP2Strategy (12392)` into `r2.2.0` by @ko3n1g :: PR: #12429 - Cherry pick `Fix lita notebook issue (12474)` into `r2.2.0` by @ko3n1g :: PR: #12476 - Cherrypick multinode tut changes by @BoxiangW :: PR: #12501 - Cherry pick `Changed the argument types passed to metrics calculation functions (12500)` into `r2.2.0` by @ko3n1g :: PR: #12502 - Cherry pick `added needed fixes (12495)` into `r2.2.0` by @ko3n1g :: PR: #12509 - Cherry pick `update transformers version requirements (12475)` into `r2.2.0` by @ko3n1g :: PR: #12507 - Cherry pick `[checkpoint] Log timings for checkpoint IO save and load (11972)` into `r2.2.0` by @ko3n1g :: PR: #12520 - Cherry pick `few checkings needed because of the change of asr models output (12499)` into `r2.2.0` by @ko3n1g :: PR: #12513 - Oyilmaz nvidia/chore/cherry pick 12242 by @oyilmaz-nvidia :: PR: #12523 - Cherry pick `Remove `_attn_implementation` in `LlamaBidirectionalModel` constructor (12364)` into `r2.2.0` by @ko3n1g :: PR: #12525 - Cherry pick `Configure FSDP to keep module params (12074)` into `r2.2.0` by @ko3n1g :: PR: #12524 - Cherry pick `[automodel] docs (11942)` into `r2.2.0` by @ko3n1g :: PR: #12530 - Cherry pick `[automodel] update examples' comments (12518)` and `[automodel] Move PEFT to configure_model (#12491)` into `r2.2.0` by @ko3n1g :: PR: #12529 - Cherry pick `update readme to include latest pytorch version (12539)` into `r2.2.0` by @ko3n1g :: PR: #12577 - Publish r2.2.0 by @chtruong814 :: PR: #12583
## NVIDIA Neural Modules 2.1.0 ### Highlights - Training - Fault Tolerance - Straggler Detection - Auto Relaunch - LLM & MM - MM models - Llava-next - Llama 3.2 - Sequence Model Parallel for NeVa - Enable Energon - SigLIP (NeMo 1.0 only) - LLM 2.0 migration - Starcoder2 - Gemma 2 - T5 - Baichuan - BERT - Mamba - ChatGLM - DoRA support - Export - NeMo 2.0 base model export path for NIM - PTQ in NeMo 2.0 - ASR - Timestamps with TDT decoder - Timestamps option with .transcribe() ### Detailed Changelogs #### ASR
Changelog - [Fix] Fixed sampler override and audio_key in prepare_audio_data by @anteju :: PR: #10980 - Akoumparouli/mixtral recipe fix r2.0.0 by @akoumpa :: PR: #10994 - TDT compute timestamps option and Extra Whitespace handling for SPE by @monica-sekoyan :: PR: #10875 - ci: Switch to CPU only runner by @ko3n1g :: PR: #11035 - Fix timestamps tests by @monica-sekoyan :: PR: #11053 - ci: Pin release freeze by @ko3n1g :: PR: #11143 - Fix RNN-T loss memory usage by @artbataev :: PR: #11144 - Added deprecation notice by @Ssofja :: PR: #11133 - Fixes for Canary adapters tutorial by @pzelasko :: PR: #11184 - add ipython import guard by @nithinraok :: PR: #11191 - Self Supervised Pre-Training tutorial Fix by @monica-sekoyan :: PR: #11206 - update the return type by @nithinraok :: PR: #11210 - Timestamps to transcribe by @nithinraok :: PR: #10950 - [Doc fixes] update file names, installation instructions, bad links by @erastorgueva-nv :: PR: #11045 - Beam search algorithm implementation for TDT models by @lilithgrigoryan :: PR: #10903 - Update import 'pytorch_lightning' -> 'lightning.pytorch' by @maanug-nv :: PR: #11252 - Remove pytorch-lightning by @maanug-nv :: PR: #11306 - update hypothesis when passed through cfg by @nithinraok :: PR: #11366 - Revert "update hypothesis when passed through cfg" by @pablo-garay :: PR: #11373 - Fix transcribe speech by @nithinraok :: PR: #11379 - Lhotse support for transcribe_speech_parallel by @nune-tadevosyan :: PR: #11249 - Sortformer Diarizer 4spk v1 model PR Part 1: models, modules and dataloaders by @tango4j :: PR: #11282 - Removing unnecessary lines by @nune-tadevosyan :: PR: #11408 - Support for initializing lhotse shar dataloader via field: list[path] mapping by @pzelasko :: PR: #11460 - New extended prompt format for Canary, short utterances inference fix, and training micro-optimizations by @pzelasko :: PR: #11058 - Fixing Multi_Task_Adapters.ipynb by replacing canary2 with canary_custom by @weiqingw4ng :: PR: #11636
#### TTS
Changelog - [Doc fixes] update file names, installation instructions, bad links by @erastorgueva-nv :: PR: #11045 - Add T5TTS by @blisc :: PR: #11193 - Update import 'pytorch_lightning' -> 'lightning.pytorch' by @maanug-nv :: PR: #11252 - Remove pytorch-lightning by @maanug-nv :: PR: #11306 - Add nvidia/low-frame-rate-speech-codec-22khz model on docs by @Edresson :: PR: #11457
#### NLP / NMT
Changelog - Move collectiob.nlp imports inline for t5 by @marcromeyn :: PR: #10877 - Use a context-manager when opening files by @akoumpa :: PR: #10895 - Packed sequence bug fixes by @cuichenx :: PR: #10898 - ckpt convert bug fixes by @dimapihtar :: PR: #10878 - remove deprecated ci tests by @dimapihtar :: PR: #10922 - Update T5 tokenizer (adding additional tokens to tokenizer config) by @huvunvidia :: PR: #10972 - Add support and recipes for HF models via AutoModelForCausalLM by @akoumpa :: PR: #10962 - gpt3 175b cli by @malay-nagda :: PR: #10985 - Fix for crash with LoRA + tp_overlap_comm=false + sequence_parallel=true by @vysarge :: PR: #10920 - Update `BaseMegatronSampler` for compatibility with PTL's `_BatchProgress` by @ashors1 :: PR: #11016 - add deprecation note by @dimapihtar :: PR: #11024 - Update ModelOpt Width Pruning example defaults by @kevalmorabia97 :: PR: #10902 - switch to NeMo 2.0 recipes by @dimapihtar :: PR: #10948 - NeMo 1.0: upcycle dense to moe by @akoumpa :: PR: #11002 - Gemma2 in Nemo2 with Recipes by @suiyoubi :: PR: #11037 - Add Packed Seq option to GPT based models by @suiyoubi :: PR: #11100 - Fix MCoreGPTModel import in llm.gpt.model.base by @hemildesai :: PR: #11109 - TP+MoE peft fix by @akoumpa :: PR: #11114 - GPT recipes to use full te spec by @JimmyZhang12 :: PR: #11119 - Virtual pipeline parallel support for LoRA in NLPAdapterModelMixin by @vysarge :: PR: #11128 - update nemo args for mcore flash decode arg change by @HuiyingLi :: PR: #11138 - Call `ckpt_to_weights_subdir` from `MegatronCheckpointIO` by @ashors1 :: PR: #10897 - [Doc fixes] update file names, installation instructions, bad links by @erastorgueva-nv :: PR: #11045 - fix(export): GPT models w/ bias=False convert properly by @terrykong :: PR: #11255 - Use MegatronDataSampler in HfDatasetDataModule by @akoumpa :: PR: #11274 - Add T5TTS by @blisc :: PR: #11193 - ci: Exclude CPU machines from scan by @ko3n1g :: PR: #11300 - Revert "fix(export): GPT models w/ bias=False convert properly" by @terrykong :: PR: #11301 - remove redundant docs by @sharathts :: PR: #11302 - Update import 'pytorch_lightning' -> 'lightning.pytorch' by @maanug-nv :: PR: #11252 - Add `attention_bias` argument in transformer block and transformer layer modules, addressing change in MCore by @yaoyu-33 :: PR: #11289 - Remove pytorch-lightning by @maanug-nv :: PR: #11306 - Update T5 attention-mask shapes to be compatible with all attention-backend in new TE versions by @huvunvidia :: PR: #11059 - Add support for restoring from 2.0 checkpoint in 1.0 by @hemildesai :: PR: #11347 - Fix Gemma2 Attention Args by @suiyoubi :: PR: #11365 - mlm conversion & tiktokenizer support by @dimapihtar :: PR: #11349 - [Nemo1] Generate sharded optimizer state dicts only if needed for saving by @ananthsub :: PR: #11451 - add hindi tn/itn coverage by @mgrafu :: PR: #11382 - chore(beep boop 🤖): Bump `MCORE_TAG=67a50f2...` (2024-11-28) by @ko3n1g :: PR: #11427 - Handle exception when importing RetroGPTChunkDatasets by @guyueh1 :: PR: #11415 - Update restore from config for gpt type continual training in NeMo1 by @yaoyu-33 :: PR: #11471 - ci: Re-enable `L2_Megatron_LM_To_NeMo_Conversion` by @ko3n1g :: PR: #11484 - Apply packed sequence params change for fused rope compatibility by @ananthsub :: PR: #11506 - Huvu/tiktoken tokenizer update by @huvunvidia :: PR: #11494
#### Text Normalization / Inverse Text Normalization
Changelog - Adding support for LightningDataModule inside Fabric-API by @marcromeyn :: PR: #10879 - Add registry to register all needed classes with artifacts in nemo.lightning.io by @hemildesai :: PR: #10861 - Update import 'pytorch_lightning' -> 'lightning.pytorch' by @maanug-nv :: PR: #11252 - Remove pytorch-lightning by @maanug-nv :: PR: #11306 - add hindi tn/itn coverage by @mgrafu :: PR: #11382
#### Export
Changelog - Update engine build step for TRT-LLM 0.13.0 by @janekl :: PR: #10880 - Nemo 2.0 ckpt support in TRT-LLM export by @oyilmaz-nvidia :: PR: #10891 - Fix TRTLLM parallel_embedding by @meatybobby :: PR: #10975 - Export & deploy updates (part I) by @janekl :: PR: #10941 - Add doc-strings to import & export + improve logging by @marcromeyn :: PR: #11078 - NeMo-UX: fix nemo-ux export path by @akoumpa :: PR: #11081 - Fix TRTLLM nemo2 activation parsing by @meatybobby :: PR: #11062 - Support exporting Nemotron-340B for TensorRT-LLM by @jinyangyuan-nvidia :: PR: #11015 - vLLM Hugging Face exporter by @oyilmaz-nvidia :: PR: #11124 - Fix export of configuration parameters to Weights and Biases by @soluwalana :: PR: #10995 - Change activation parsing in TRTLLM by @meatybobby :: PR: #11173 - Remove builder_opt param from trtllm-build for TensorRT-LLM >= 0.14.0 by @janekl :: PR: #11259 - fix(export): GPT models w/ bias=False convert properly by @terrykong :: PR: #11255 - fix(export): update API for disabling device reassignment in TRTLLM for Aligner by @terrykong :: PR: #10863 - Add openai-gelu in gated activation for TRTLLM export by @meatybobby :: PR: #11293 - Revert "fix(export): GPT models w/ bias=False convert properly" by @terrykong :: PR: #11301 - Adding alinger export by @shanmugamr1992 :: PR: #11269 - Export & deploy updates (part II) by @janekl :: PR: #11344 - Introducing TensorRT lazy export and caching option with trt_compile() by @borisfom :: PR: #11266 - fix: export converts properly if no model_prefix by @terrykong :: PR: #11477
#### Bugfixes
Changelog - Change default ckpt name by @maanug-nv :: PR: #11277 - Fix patching of NeMo tokenizers for correct Lambada evaluation by @janekl :: PR: #11326
#### Uncategorized
Changelog - ci: Use Slack group by @ko3n1g :: PR: #10866 - Bump `Dockerfile.ci` (2024-10-14) by @ko3n1g :: PR: #10871 - Fix peft resume by @cuichenx :: PR: #10887 - call __post_init__ after altering config values by @akoumpa :: PR: #10885 - Late import prettytable by @maanug-nv :: PR: #10912 - Bump `Dockerfile.ci` (2024-10-17) by @ko3n1g :: PR: #10919 - Warning for missing FP8 checkpoint support for vLLM deployment by @janekl :: PR: #10906 - Fix artifact saving by @hemildesai :: PR: #10914 - Lora improvement by @cuichenx :: PR: #10918 - Huvu/t5 nemo2.0 peft by @huvunvidia :: PR: #10916 - perf recipes and Mcore DistOpt params by @malay-nagda :: PR: #10883 - ci: Fix cherry pick team by @ko3n1g :: PR: #10945 - Fix requirements for MacOS by @artbataev :: PR: #10930 - Fix nemo 2.0 recipes by @BoxiangW :: PR: #10915 - Akoumparouli/nemo ux fix dir or string artifact by @akoumpa :: PR: #10936 - Fix typo in docstring by @ashors1 :: PR: #10955 - [Nemo CICD] Remove deprecated tests by @pablo-garay :: PR: #10960 - Restore NeMo 2.0 T5 pretraining CICD test by @huvunvidia :: PR: #10952 - Convert perf plugin env vars to strings by @hemildesai :: PR: #10947 - disable dynamo for ddp checker by @akoumpa :: PR: #10961 - Bump `Dockerfile.ci` (2024-10-21) by @ko3n1g :: PR: #10965 - respect warnings' filters by @akoumpa :: PR: #10953 - Alit/mamba recipe by @JRD971000 :: PR: #10935 - Long context performance doc hot fix by @youngeunkwon0405 :: PR: #10946 - Performance mode by @malay-nagda :: PR: #10926 - Bump `Dockerfile.ci` (2024-10-22) by @ko3n1g :: PR: #10979 - Add more recipes by @cuichenx :: PR: #10957 - ci: Update tests by @ko3n1g :: PR: #10987 - Bump `Dockerfile.ci` (2024-10-23) by @ko3n1g :: PR: #11001 - llm.generate fixes by @HuiyingLi :: PR: #10983 - use __dict__ in check by @akoumpa :: PR: #11012 - LoRA support for HF::AutoModelForCausalLM by @akoumpa :: PR: #10982 - Change default for always_save_context to True by @athitten :: PR: #11014 - Fix pip install by @marcromeyn :: PR: #11026 - Change dist ckpt defaults by @ShriyaPalsamudram :: PR: #10913 - Fix _strategy_lib tests by @maanug-nv :: PR: #11033 - Basic online dynamic FP8 quantization with vLLM by @janekl :: PR: #10904 - Expose packed seq in finetuning recipes by @cuichenx :: PR: #11006 - PEFT Inference by @cuichenx :: PR: #11030 - added Lhotse online augmentation tutorial for SE by @nasretdinovr :: PR: #10944 - Bump `Dockerfile.ci` (2024-10-27) by @ko3n1g :: PR: #11051 - ci: Send team alerts on specific keywords by @ko3n1g :: PR: #10986 - Qwen2 Recipe by @suiyoubi :: PR: #10974 - Bump `Dockerfile.ci` (2024-10-28) by @ko3n1g :: PR: #11054 - Generalizing Inference pipeline in NeMo 2.0 to support encoder-decoder models by @huvunvidia :: PR: #10924 - [Bug fix] In energon MultiModalSampleConfig use default_factory in dataclass by @guyueh1 :: PR: #11041 - fix: Resolve mutable default issue in MultiModalSampleConfig dataclass by @michal2409 :: PR: #11061 - SC1/SC2 Recipe by @suiyoubi :: PR: #10971 - Wrap batch_sampler with_IndexBatchSamplerWrapper by @farhadrgh :: PR: #10934 - Performance fine-tuning recipes for llama3 8b + 70b by @vysarge :: PR: #11046 - Set TE spec name for NeMo to HF checkpoint converters by @kevalmorabia97 :: PR: #11036 - ci: Re-add secrets detector by @ko3n1g :: PR: #11038 - Adding nemo-run recipes for NeMo 2.0 T5 by @huvunvidia :: PR: #10964 - Minor fixes for NeMo 2.0 PTQ by @Laplasjan107 :: PR: #11079 - Add copyright check by @pablo-garay :: PR: #11048 - Fix finalize model grad for PEFT by @cuichenx :: PR: #11065 - ci: Less 
verbose infra alerts by @ko3n1g :: PR: #11080 - Add copyright notice by @pablo-garay :: PR: #11085 - ci: Fix cron schedule by @ko3n1g :: PR: #11076 - ci: Use code-freeze via Nemo-FW-Templates by @ko3n1g :: PR: #11073 - Akoumparouli/hf lit module peft ckpt bugfix by @akoumpa :: PR: #11022 - PEFT perf and TE spec fixes by @JimmyZhang12 :: PR: #11070 - Bump `Dockerfile.ci` (2024-10-30) by @ko3n1g :: PR: #11092 - NeMorun for NeMo 2.0 T5 finetuning by @huvunvidia :: PR: #11040 - fix model_checkpoint.py by @ethanhe42 :: PR: #11057 - Update PTQ tests and ModelOpt version by @janekl :: PR: #11095 - Fix datasets in CLI by @marcromeyn :: PR: #11097 - Fix yaml serialization in io mixin by @hemildesai :: PR: #11106 - disable overlap_param_gather_with_optimizer_step by @JimmyZhang12 :: PR: #11102 - nemo1 to nemo2 checkpoint convert by @HuiyingLi :: PR: #10937 - fix expert regex filter by @akoumpa :: PR: #11103 - Remove stale checkpoint deletion on checkpoint saving failure by @akoumpa :: PR: #11116 - NeMo-UX: Mistral/mixtral peft ci test by @akoumpa :: PR: #11094 - Make nemo.collections.llm PreTrainingDataModule num samples configurable by @hemildesai :: PR: #11088 - Fix packed seq path by @cuichenx :: PR: #11121 - Allow arguments passed to dataset class + Gemma recipe fix by @cuichenx :: PR: #11125 - Nemotron Recipe by @suiyoubi :: PR: #11118 - NeMo-UX: HF PeFT fix by @akoumpa :: PR: #11096 - Remove deprecated tests by @pablo-garay :: PR: #11134 - Recipe Fix for NeMo CI by @suiyoubi :: PR: #11127 - Fix freeze_model call in peft by @cuichenx :: PR: #11146 - Bump `Dockerfile.ci` (2024-11-05) by @ko3n1g :: PR: #11159 - NeMo-UX: Add sgd optim by @akoumpa :: PR: #11157 - Update copyright check by @pablo-garay :: PR: #11168 - add lora recipt for 405b by @JRD971000 :: PR: #10991 - dit training diagrams by @zpx01 :: PR: #10873 - ci: Switch to FW templates for build by @ko3n1g :: PR: #11077 - Bump `Dockerfile.ci` (2024-11-06) by @ko3n1g :: PR: #11174 - feat: Run PyLint by @ko3n1g :: PR: #11147 - Add Alpaca Finetune Datamodule by @suiyoubi :: PR: #11185 - Updated Diffusion Collection README by @zpx01 :: PR: #11179 - Add support for Cosmos Tokenizers by @jojennin :: PR: #11194 - Run formatting only if files changed. Echo message if pylint fails. 
by @artbataev :: PR: #11188 - Bump `Dockerfile.ci` (2024-11-07) by @ko3n1g :: PR: #11196 - Fix rotary_percentage parsing in nemo2 config by @meatybobby :: PR: #11197 - ci: Update cherry pick workflow by @ko3n1g :: PR: #11202 - ci: Build, test, publish a wheel by @ko3n1g :: PR: #11183 - Bump `Dockerfile.ci` (2024-11-08) by @ko3n1g :: PR: #11222 - update default pipeline_parallelism_type by @akoumpa :: PR: #11213 - check actual value of vocab_file by @akoumpa :: PR: #11228 - Fix VP Initialization Issue with Latest MCore by @suiyoubi :: PR: #11209 - ci: Run Pylint strictly on new files, softly on history by @ko3n1g :: PR: #11212 - ci: Add release workflow by @ko3n1g :: PR: #11180 - Fix llm.generate by @hemildesai :: PR: #11217 - Bump `Dockerfile.ci` (2024-11-11) by @ko3n1g :: PR: #11247 - Bump `Dockerfile.ci` (2024-11-12) by @ko3n1g :: PR: #11254 - Handling tokenizer in PTQ for Nemo 2.0 by @janekl :: PR: #11237 - Fix finetuning datamodule resume by @cuichenx :: PR: #11187 - ci: Move `bump mcore` to templates by @ko3n1g :: PR: #11229 - ci: Fix secrets detector by @ko3n1g :: PR: #11205 - chore(beep boop 🤖): Bump `MCORE_TAG=aded519...` (2024-11-12) by @ko3n1g :: PR: #11260 - ci: Run secrets detector on `pull_request_target` by @ko3n1g :: PR: #11263 - Advanced Diffusion Training Features by @zpx01 :: PR: #11246 - Update pruning and distillation tutorial notebooks by @gvenkatakris :: PR: #11091 - update nemo1->2 conversion according to changes in main by @HuiyingLi :: PR: #11253 - Add llama 3.1 recipes by @cuichenx :: PR: #11273 - Fix Finetune Recipe by @suiyoubi :: PR: #11267 - Configure no restart validation loop in nl.Trainer by @hemildesai :: PR: #11029 - Handle _io_unflatten_object when_thread_local.output_dir is not available by @hemildesai :: PR: #11199 - Remove opencc upperbound by @thomasdhc :: PR: #10909 - Fix head_size in NeMo to HF checkpoint converters for width pruned model support by @eagle705 :: PR: #11230 - Fixes per comments by @gvenkatakris :: PR: #11280 - Create phi3mini.py by @mayani-nv :: PR: #11281 - ci: Fix release workflow by @ko3n1g :: PR: #11286 - fix perf plugin CUDA_DEVICE_MAX_CONNECTIONS setting by @JimmyZhang12 :: PR: #11299 - PTQ via NeMo-Run CLI by @janekl :: PR: #10984 - PTQ memory optimization by @Laplasjan107 :: PR: #11257 - Update README.md for collection page by @yaoyu-33 :: PR: #11223 - Adding multimodal examples by @shanmugamr1992 :: PR: #11279 - Add HF untrusted code toggle by @akoumpa :: PR: #11313 - P2p chunk size setting in nemo 2.0 by @erhoo82 :: PR: #11312 - Nemo2 batcheval by @HuiyingLi :: PR: #11158 - DoRA by @cuichenx :: PR: #11104 - Profiling - support Chakra & Kineto trace dumping by @lilyw97 :: PR: #11115 - NeMo 2.0 SFT PEFT notebooks by @HuiyingLi :: PR: #10874 - Update symlink option for save_last in ModelCheckpoint by @paul-gibbons :: PR: #11319 - ci: Pass-through of `workflow_event` by @ko3n1g :: PR: #11322 - Add StragglerDetection and auto-relaunch to NeMo2.0 by @ShriyaPalsamudram :: PR: #11328 - Huvu/t5 nemo2.0 nemoci by @huvunvidia :: PR: #11291 - TE acceleration using callbacks by @oyilmaz-nvidia :: PR: #11261 - Leave target_module as default in PEFT Recipes by @cuichenx :: PR: #11334 - More robust tar file loading from AIStore by @pzelasko :: PR: #11323 - Fix CLIP transformer layer api by @yaoyu-33 :: PR: #11337 - pass trust_remote_code to AutoTokenizer by @akoumpa :: PR: #11343 - Fix linear layer replacement by @oyilmaz-nvidia :: PR: #11356 - fix typo by @JRD971000 :: PR: #11351 - Add torchrun local executor to recipes by @marcromeyn :: 
PR: #11342 - Add PP support in NeVA along with few bug fixes by @yaoyu-33 :: PR: #11170 - nemo2 peft merge by @HuiyingLi :: PR: #11017 - Add dora recipes by @cuichenx :: PR: #11330 - add fix to recipe by @JRD971000 :: PR: #11368 - Add missing test to CICD needed list by @pablo-garay :: PR: #11376 - update SquadDataModule to use run.config by @huvunvidia :: PR: #11358 - Add llama 3.2 1b and 3b by @cuichenx :: PR: #11335 - calculate metrics for nemo2 sftpeft notebook by @HuiyingLi :: PR: #11381 - Enable packed dataset for validation; add a2a_experimental argument by @michal2409 :: PR: #11378 - Fix DDP unused param error when TE is enabled in NeMo Lite by @oyilmaz-nvidia :: PR: #11364 - Update llama32 vision (mllama) use attention bias by @yaoyu-33 :: PR: #11316 - Fix environment variables in torchrun executor by @hemildesai :: PR: #11363 - Add sample generate to PTQ for NeMo 2.0 by @Laplasjan107 :: PR: #11339 - Fix selective restore by explicitly verifying keys by @hemildesai :: PR: #11377 - Minor fix by @gvenkatakris :: PR: #11353 - Add a fix for single-GPU nsys. by @tfogal :: PR: #11354 - capitalize HF as HF instead of Hf by @akoumpa :: PR: #11384 - ci: Add HF cache by @ko3n1g :: PR: #11398 - Remove logic to skip checkpoint save if checkpoint exists by @ashors1 :: PR: #11362 - Rewire tokenizer exception handling in model resume by @cuichenx :: PR: #11375 - Adding LLava-Next model class by @yashaswikarnati :: PR: #11399 - Fix vllm test issue when run_accuracy is enabled by @oyilmaz-nvidia :: PR: #11413 - data modules for llava_next by @yashaswikarnati :: PR: #11400 - Fix strategies saving unsharded optimizer states by @ananthsub :: PR: #11392 - Adjust CLI support for PTQ by @janekl :: PR: #11421 - Nemo run recipe's and example scripts for Llava Next by @yashaswikarnati :: PR: #11405 - Huvu/t5 nemo2.0 nemoci 3b11b by @huvunvidia :: PR: #11388 - ci: Allow dry-run of release by @ko3n1g :: PR: #11418 - fix dtype when init HF model from config by @akoumpa :: PR: #11420 - Handle import errors in virtual environment when running vLLM tests by @janekl :: PR: #11435 - Fix loss mask when answer_only_loss=True by @ashors1 :: PR: #11444 - [audio] Keep input directory structure when saving processed files by @anteju :: PR: #11403 - Add different recipe examples to NeMo 2.0 by @BoxiangW :: PR: #11317 - [Scripts] Remove fixed seed for adding noise by @anteju :: PR: #11401 - Add option to provide prior NeMo 2 ckpt path to convert_nemo1_to_nemo… by @hemildesai :: PR: #11452 - PTQ CLI and param updates by @janekl :: PR: #11459 - Add tests for resiliency feature integration by @maanug-nv :: PR: #11406 - ci: Disable HexHighEntropyString plugin by @ko3n1g :: PR: #11470 - Fix broken links by @shashank3959 :: PR: #11294 - Nemo 2.0 canonical lora by @cuichenx :: PR: #11416 - ci: Run secrets detector on merge-commit by @ko3n1g :: PR: #11479 - Formatting (minor) by @pablo-garay :: PR: #11485 - Fix bug related to naming by @pablo-garay :: PR: #11487 - Add BERT Model To NeMo2.0 by @suiyoubi :: PR: #11333 - Update Nemo Distributed Checkpoint User Guide by @FortunaZhang :: PR: #11489 - fix: regular torch optims (e.g., sgd) no longer error with closure spec by @terrykong :: PR: #11189 - Add recipe configs validating by @BoxiangW :: PR: #10954 - Fix finetuning PP by @cuichenx :: PR: #11474 - [docs] Documentation for audio collection by @anteju :: PR: #11426 - config hierarchy by @malay-nagda :: PR: #11145 - Force param sync when using distributed optimizer and overlap_param_gather by @hemildesai :: PR: #11486 - chore(beep 
boop 🤖): Bump `MCORE_TAG=bd677bf...` (2024-12-06) by @ko3n1g :: PR: #11492 - Remove default mutable arguments from AbstractEmbModel constructor by @ananthsub :: PR: #11348 - minor fix for nemo2 sftpeft readme by @HuiyingLi :: PR: #11502 - Update Llama3 Fine-Tuning Notebook by @roclark :: PR: #11522 - Fix CI issue on validation config by @BoxiangW :: PR: #11521 - Freeze tags in in `r2.1.0` by @github-actions[bot] :: PR: #11556 - Cherrypick all + R2.1.0 fix cicd by @pablo-garay :: PR: #11622 - Cherry pick `Add fix docstring for speech commands (11638)` into `r2.1.0` by @ko3n1g :: PR: #11639 - Cherrypick #11628 to r2.1.0 by @nasretdinovr :: PR: #11630 - Update package_info.py by @ko3n1g :: PR: #11646 - Cherry pick `Add fix docstring for VAD (11659)` into `r2.1.0` by @ko3n1g :: PR: #11660 - Fix tokenizer trust_remote_code by @cuichenx :: PR: #11657 - Cherrypick 11568 by @cuichenx :: PR: #11656 - Cherry pick `Downgrading the 'datasets' package from 3.0.0 to 2.21.0 for Multilang_ASR.ipynb and ASR_CTC_Language_Finetuning.ipynb (11675)` into `r2.1.0` by @ko3n1g :: PR: #11677 - r2.1.0 cherrypick by @pablo-garay :: PR: #11680 - Cherry pick `Rename multimodal data module - EnergonMultiModalDataModule (11654)` into `r2.1.0` by @ko3n1g :: PR: #11685 - chore: Bump to `r2.1.0rc2` by @ko3n1g :: PR: #11693 - r2.1.0 ptl fix by @pablo-garay :: PR: #11694
## NVIDIA Neural Modules 2.1.0rc2 Prerelease: NVIDIA Neural Modules 2.1.0rc2 (2024-12-21) ## NVIDIA Neural Modules 2.1.0rc1 Prerelease: NVIDIA Neural Modules 2.1.0rc1 (2024-12-20) ## NVIDIA Neural Modules 2.1.0rc0 Prerelease: NVIDIA Neural Modules 2.1.0rc0 (2024-12-12) ## NVIDIA Neural Modules 2.0.0rc1 ### Highlights #### Large language models - PEFT: QLoRA support, LoRA/QLoRA for Mixture-of-Experts (MoE) dense layer - State Space Models & Hybrid Architecture support (Mamba2 and NV-Mamba2-hybrid) - Support Nemotron, Minitron, Gemma2, Qwen, RAG - Custom Tokenizer training in NeMo - Update the Auto-Configurator for EP, CP and FSDP #### Multimodal - NeVA: Add SOTA LLM backbone support (Mixtral/LLaMA3) and suite of model parallelism support (PP/EP) - Support Language Instructed Temporal-Localization Assistant (LITA) on top of video NeVA #### ASR - SpeechLM and SALM - Adapters for Canary Customization - PyTorch allocator in PyTorch 2.2 improves training speed by up to 30% for all ASR models - CUDA Graphs for Transducer Inference - Replaced webdataset with Lhotse - gives up to 2x speedup - Transcription Improvements - Speedup and QoL Changes - ASR Prompt Formatter for multimodal Canary #### Export & Deploy - In-framework PyTriton deployment with backends: - PyTorch - vLLM - TRT-LLM update to 0.10 - TRT-LLM C++ runtime ### Detailed Changelogs #### ASR
Changelog - Support dataloader as input to `audio` for transcription by @titu1994 :: PR: #9201 - Clean up dev docs collection section by @yaoyu-33 :: PR: #9205 - Fix Online_Offline_Microphone_VAD_Demo.ipynb by @stevehuang52 :: PR: #9251 - Remove .nemo instead of renaming by @mikolajblaz :: PR: #9281 - Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer. by @galv :: PR: #9347 - Revert "Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer." by @titu1994 :: PR: #9351 - Prompt formatter API and canary transcribe tensor input support by @pzelasko :: PR: #9206 - Fix prompt formatter's defaults=None case in multi-task model by @pzelasko :: PR: #9366 - move AED chunked infer script by @stevehuang52 :: PR: #9367 - Use model-cast-to-bfloat16 rather than AMP-to-bfloat16 for inference. by @galv :: PR: #9198 - ci: Fix `L2_Segmentation_Tool_Parallel_ctc_segmentation_test_L2_Eng_C… by @ko3n1g :: PR: #9399 - Fix logging message for ASR by @titu1994 :: PR: #9469 - Add support to change Multi task model prompt by @titu1994 :: PR: #9542 - Enable encoder adapters for Canary and MultiTaskAED models by @titu1994 :: PR: #9409 - Audio model collection by @anteju :: PR: #9263 - TitaNet Batch Verify Speaker by @monica-sekoyan :: PR: #9337 - Fix the arguments of forward_for_export function in msdd_models by @tango4j :: PR: #9624 - chore: Pin branch in notebooks by @ko3n1g :: PR: #9697 - refactor: notebook branch release by @ko3n1g :: PR: #9711 - Canary Adapters tutorial (#9670) by @nithinraok :: PR: #9777 - typos and branch name update to r2.0.0rc1 by @nithinraok :: PR: #9846 - Fix RNNT alignments test by @artbataev :: PR: #9770 - By default trust remote code from HF Datasets by @nithinraok :: PR: #9886 - Temporarily disable cuda graph based RNN-T greedy inference for r2.0.0rc1 by @galv :: PR: #9904 - Enable CUDA graphs by default, but require CUDA 12.6 for full graphs by @artbataev :: PR: #9919 - update branch name for script by @nithinraok :: PR: #9936 - updte branch by @nithinraok :: PR: #9942
#### TTS
Changelog - Clean up dev docs collection section by @yaoyu-33 :: PR: #9205 - Add mel codec checkpoints by @anteju :: PR: #9228 - GPU unit tests: Mark flaky tests to be fixed by @pablo-garay :: PR: #9559 - chore: Pin branch in notebooks by @ko3n1g :: PR: #9697 - refactor: notebook branch release by @ko3n1g :: PR: #9711
#### LLM/Multimodal
Changelog - Update nemo.export module for quantized models by @janekl :: PR: #9218 - Add save option to the TRT-LLM export test script by @oyilmaz-nvidia :: PR: #9221 - Checkpoint resuming compatible for 2403 container by @suiyoubi :: PR: #9199 - Clean up dev docs collection section by @yaoyu-33 :: PR: #9205 - use get with fallback when reading checkpoint_callback_params by @akoumpa :: PR: #9223 - Revert rope fusion defaults by @cuichenx :: PR: #9237 - fix import by @akoumpa :: PR: #9240 - Add TRT-LLM params like max_num_tokens and opt_num_tokens by @oyilmaz-nvidia :: PR: #9210 - sum-reduce grad_norm in DP+CP domain by @erhoo82 :: PR: #9262 - Alit/bert convert fix by @JRD971000 :: PR: #9285 - conv1d stable version by @JRD971000 :: PR: #9330 - Fix trainer builder when exp_manager is not in config by @yaoyu-33 :: PR: #9293 - Fix Peft Weights Loading in NeVA by @yaoyu-33 :: PR: #9341 - Skip sequence_parallel allreduce when using Mcore DistOpt by @akoumpa :: PR: #9344 - Fix FSDP gradient calculation with orig params by @janEbert :: PR: #9335 - TRT-LLM Export Code Cleanup by @oyilmaz-nvidia :: PR: #9270 - support null/None truncation field by @arendu :: PR: #9355 - NeVa token fusion by @paul-gibbons :: PR: #9245 - bugfix if using mcore distOpt with sft by @akoumpa :: PR: #9356 - Re-org export code by @oyilmaz-nvidia :: PR: #9353 - QLoRA by @cuichenx :: PR: #9340 - PeFT fix for distOpt by @akoumpa :: PR: #9392 - [NeMo-UX] Integrating mcore's DistributedDataParallel into MegatronStrategy by @marcromeyn :: PR: #9387 - cherry pick of #9266 by @dimapihtar :: PR: #9411 - Enable specifying alpha for PTQ INT8 SmoothQuant method by @janekl :: PR: #9423 - add support for new mcore ds features by @dimapihtar :: PR: #9388 - LoRA for MoE Layer by @cuichenx :: PR: #9396 - Mistral-7B: apply user's precision to output checkpoint by @akoumpa :: PR: #9222 - Add option to merge distributed optimizer buckets by @timmoon10 :: PR: #9414 - TRT-LLM 0.10 Update by @oyilmaz-nvidia :: PR: #9402 - In-framework deployment by @oyilmaz-nvidia :: PR: #9438 - Bugfix missing variables and argument changes to MegatronPretrainingRandomSampler by @jstjohn :: PR: #9458 - Hyena Operator by @guyjacob :: PR: #9264 - Refactor Quantizer for reusing in QAT by @kevalmorabia97 :: PR: #9276 - move load state dict after initialize parallel state in nlp_model by @ryxli :: PR: #9382 - Enable user to optionally upgrade Megatron by @jstjohn :: PR: #9478 - Fix unwrap model by @cuichenx :: PR: #9480 - fix operator precedence by @akoumpa :: PR: #9403 - [NeMo-UX] Adding context- & expert-parallelism to MegatronStrategy by @marcromeyn :: PR: #9525 - update mcoreddp call by @akoumpa :: PR: #9345 - mcore distOpt restore fix by @akoumpa :: PR: #9421 - vLLM Export Support by @apanteleev :: PR: #9381 - PL: Delete precision if using plugin. 
TODO switch to MegatronTrainerB… by @akoumpa :: PR: #9535 - extend get_gpt_layer_modelopt_spec to support MoE by @akoumpa :: PR: #9532 - fix mock data generation for legacy dataset by @dimapihtar :: PR: #9530 - add reset learning rate functionality by @dimapihtar :: PR: #9372 - Use closed-formula to round by multiple by @akoumpa :: PR: #9307 - GPU unit tests: Mark flaky tests to be fixed by @pablo-garay :: PR: #9559 - Consolidate gpt continue training script into pretraining script by @yaoyu-33 :: PR: #9413 - Enable encoder adapters for Canary and MultiTaskAED models by @titu1994 :: PR: #9409 - PTQ refinements by @janekl :: PR: #9574 - Add ModelOpt QAT example for Llama2 SFT model by @kevalmorabia97 :: PR: #9326 - Multimodal projection layer adapter fix for PP>1 by @paul-gibbons :: PR: #9445 - Add offline quantization script for QLoRA deployment by @cuichenx :: PR: #9455 - Make QLoRA more model-agnostic by @cuichenx :: PR: #9488 - Set n_gpu to None in nemo export by @oyilmaz-nvidia :: PR: #9593 - [NeMo-UX] Fix Megatron-optimizer by @marcromeyn :: PR: #9599 - Chat template support for megatron_gpt_eval.py by @akoumpa :: PR: #9354 - [NeMo-UX] Add PEFT by @cuichenx :: PR: #9490 - Alit/mamba tmp by @JRD971000 :: PR: #9612 - Enable MCore checkpointing optimizations by @mikolajblaz :: PR: #9505 - Change mixtral moe key name for trt-llm by @oyilmaz-nvidia :: PR: #9620 - fix ckpt load bug by @dimapihtar :: PR: #9621 - Alit/mamba by @JRD971000 :: PR: #9575 - Unwrap ckpt_io for model opt (async save) by @mikolajblaz :: PR: #9622 - MCore T5 support for NeMo - Training by @huvunvidia :: PR: #9432 - [Nemo-UX] Expose transformer_layer_spec inside GPTConfig by @marcromeyn :: PR: #9592 - Update NeMo Clip to Use MCore Modules by @yaoyu-33 :: PR: #9594 - Mistral + Mixtral Support for NeVa by @paul-gibbons :: PR: #9459 - Adding support for mcore generate by @shanmugamr1992 :: PR: #9566 - Improve error messaging during trt-llm export by @oyilmaz-nvidia :: PR: #9638 - [Cherrypick] support lora when kv_channel != hidden_size / num_heads by @cuichenx :: PR: #9644 - Parametrize FPS group by @mikolajblaz :: PR: #9648 - Cherry-pick megatron export fix from main by @borisfom :: PR: #9643 - add documentation for reset_lr feature by @dimapihta - chore: Pin branch in notebooks by @ko3n1g :: PR: #9697 - Cherry pick: LITA Integration by @Slyne :: PR: #9684 - SDXL improvements (and support for Draft+) by @rohitrango :: PR: #9654 - Gemma 2 by @cuichenx :: PR: #9672 - Allows non-strict load with distributed checkpoints by @mikolajblaz :: PR: #9613 - refactor: notebook branch release by @ko3n1g :: PR: #9711 - [NeMo-UX] Make TE and Apex dependencies optional by @ashors1 :: PR: #9550 - Alit/r2.0.0 by @JRD971000 :: PR: #9718 - Manually cherry-pick from PR 9679 (PR to main - Support SFT/Eval/PEFT for mcore T5) by @huvunvidia :: PR: #9737 - In framework export by @oyilmaz-nvidia :: PR: #9658 - T5 changes based on mcore changes by @pablo-garay :: PR: #9829 - [NeMo-UX] Use single instance of loss reductions in GPTModel by @hemildesai :: PR: #9801 - deprecate NeMo NLP tutorial by @dimapihtar :: PR: #9864 - Disable nvFuser setup with PyTorch 23.11 and later by @athitten :: PR: #9837 - make torch_dist ckpt strategy as default by @dimapihtar :: PR: #9852 - add rampup bs documentation by @dimapihtar :: PR: #9884 - copy of #9576 by @dimapihtar :: PR: #9986 - Support Nvidia Torch and Arch versions by @thomasdhc :: PR: #9897 - Bug fix for pooler causing dist checkpointing exception by @shanmugamr1992 :: PR: #10008
#### Export
Changelog - Update nemo.export module for quantized models by @janekl :: PR: #9218 - Add save option to the TRT-LLM export test script by @oyilmaz-nvidia :: PR: #9221 - Add TRT-LLM params like max_num_tokens and opt_num_tokens by @oyilmaz-nvidia :: PR: #9210 - TRT-LLM Export Code Cleanup by @oyilmaz-nvidia :: PR: #9270 - Re-org export code by @oyilmaz-nvidia :: PR: #9353 - Use TensorRT-LLM native parameter names in nemo.export module by @janekl :: PR: #9424 - TRT-LLM 0.10 Update by @oyilmaz-nvidia :: PR: #9402 - vLLM Export Support by @apanteleev :: PR: #9381 - Add page context fmha option in TensorRTLLM export by @meatybobby :: PR: #9526 - Test C++ runtime on demand in nemo_export.py to avoid possible OOMs by @janekl :: PR: #9544 - Fix nemo export test by @oyilmaz-nvidia :: PR: #9547 - Add tps and pps params to the export script by @oyilmaz-nvidia :: PR: #9558 - Add Multimodal Exporter by @meatybobby :: PR: #9256 - Set n_gpu to None in nemo export by @oyilmaz-nvidia :: PR: #9593 - Inflight nemo model export support by @JimmyZhang12 :: PR: #9527 - vLLM Export Improvements by @apanteleev :: PR: #9596 - Akoumparouli/nemo ux mixtral export by @akoumpa :: PR: #9603 - Change mixtral moe key name for trt-llm by @oyilmaz-nvidia :: PR: #9620 - Fix the arguments of forward_for_export function in msdd_models by @tango4j :: PR: #9624 - Improve error messaging during trt-llm export by @oyilmaz-nvidia :: PR: #9638 - Cherry-pick megatron export fix from main by @borisfom :: PR: #9643 - In framework export by @oyilmaz-nvidia :: PR: #9658 - Add missing imports for torch dist ckpt in export by @oyilmaz-nvidia :: PR: #9826
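Most of the entries above concern the TRT-LLM path in `nemo.export` (0.10 update in #9402, tps/pps parameters in #9558). A rough sketch of the typical export flow, assuming the `TensorRTLLM` wrapper exposed by `nemo.export`; paths are placeholders and argument names should be checked against this release's docs:

```python
from nemo.export import TensorRTLLM

# Build TensorRT-LLM engines from a .nemo checkpoint (paths are placeholders).
exporter = TensorRTLLM(model_dir="/tmp/trtllm_engine_dir")
exporter.export(
    nemo_checkpoint_path="/models/llama2-7b.nemo",
    model_type="llama",
    n_gpus=1,
)
```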
#### Bugfixes
Changelog - use get with fallback when reading checkpoint_callback_params by @akoumpa :: PR: #9223 - fix import by @akoumpa :: PR: #9240 - Remove .nemo instead of renaming by @mikolajblaz :: PR: #9281 - call set_expert_model_parallel_world_size instead of set_cpu_expert_m… by @akoumpa :: PR: #9275 - Fix typos in Mixtral NeMo->HF and Starcoder2 NeMo->HF conversion scripts by @evellasques :: PR: #9325 - Skip sequence_parallel allreduce when using Mcore DistOpt by @akoumpa :: PR: #9344 - Add OpenAI format response to r2.0.0rc1 by @athitten :: PR: #9796 - [NeMo UX] Support generating datasets using different train/valid/test distributions by @ashors1 :: PR: #9771 - Add missing imports for torch dist ckpt in export by @oyilmaz-nvidia :: PR: #9826
#### General Improvements
Changelog - [Nemo CICD] run_cicd_for_release_branches_also by @pablo-garay :: PR: #9213 - rename paths2audiofiles to audio by @github-actions[bot] :: PR: #9220 - Fix ASR_Context_Biasing.ipynb contains FileNotFoundError by @github-actions[bot] :: PR: #9234 - ci: Remove duplicated job by @ko3n1g :: PR: #9258 - Fix document links by @yaoyu-33 :: PR: #9260 - Pin transformers by @github-actions[bot] :: PR: #9273 - Fix loading github raw images on notebook by @github-actions[bot] :: PR: #9283 - Accept None as an argument to decoder_lengths in GreedyBatchedCTCInfer::forward by @github-actions[bot] :: PR: #9278 - Refactor Sequence Packing Script by @cuichenx :: PR: #9271 - [Nemo-UX] Move code to collections + fix some small bugs by @marcromeyn :: PR: #9277 - Fix typo in HF tutorial by @github-actions[bot] :: PR: #9304 - Expand documentation for data parallelism and distributed optimizer by @timmoon10 :: PR: #9227 - Install alerting by @ko3n1g :: PR: #9311 - typos by @github-actions[bot] :: PR: #9315 - FP8 feature documentation by @ksivaman :: PR: #9265 - [Nemo CICD] Comment out flaky tests by @pablo-garay :: PR: #9333 - Fixed typos in README.rst by @gdevakumar :: PR: #9322 - Update README.rst to clarify installation via Conda by @SimonCW :: PR: #9323 - [Nemo CICD] update flaky test by @pablo-garay :: PR: #9339 - fix lora and ptuning and isort/black by @github-actions[bot] :: PR: #9295 - Fix P-tuning for Llama based models by @github-actions[bot] :: PR: #9300 - add large model stable training fix and contrastive loss update for variable seq by @github-actions[bot] :: PR: #9348 - Guard cuda memory allocator update by @github-actions[bot] :: PR: #9313 - [Nemo CICD] Remove unnecessary commented out code by @pablo-garay :: PR: #9364 - Update Gemma conversion script by @yaoyu-33 :: PR: #9365 - Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer. (#9347) by @github-actions[bot] :: PR: #9371 - Re-enable cuda graphs in training modes. by @github-actions[bot] :: PR: #9343 - fix typo infer_seq_lenght -> infer_seq_length by @akoumpa :: PR: #9370 - Make a backward compatibility for old MSDD configs in label models by @github-actions[bot] :: PR: #9378 - Dgalvez/fix greedy batch strategy name r2.0.0rc0 by @github-actions[bot] :: PR: #9253 - Update README.rst by @jgerh :: PR: #9393 - Force diarizer to use CUDA if cuda is available and if device=None. 
by @github-actions[bot] :: PR: #9390 - ci: Properly catch failed tests by introduction of workflow templates by @ko3n1g :: PR: #9324 - Fix T5 G2P Input and Output Types by @github-actions[bot] :: PR: #9269 - Huvu/rag pipeline citest by @huvunvidia :: PR: #9384 - Fix circular import for MM dataprep notebook by @github-actions[bot] :: PR: #9292 - add check if num layers is divisible by pp size by @github-actions[bot] :: PR: #9298 - [Nemo CICD] timeouts fix by @pablo-garay :: PR: #9407 - [NeMo-UX] Removing un-used ModelConfig class by @marcromeyn :: PR: #9389 - Add tutorial for Llama-3-8B lora training and deployment by @shashank3959 :: PR: #9359 - [NeMo-UX] Removing default_path from ModelConnector by @marcromeyn :: PR: #9401 - Fix README by @ericharper :: PR: #9415 - [SD] Fix SD CUDA Graph Failure by @alpha0422 :: PR: #9319 - [NeMo-UX] Adding file-lock to Connector by @marcromeyn :: PR: #9400 - Add Dev Container Bug Report by @pablo-garay :: PR: #9430 - Akoumparouli/profiling docs by @akoumpa :: PR: #9420 - ci: Enrich notifications by @ko3n1g :: PR: #9412 - Fix failing RIR unit test with lhotse 1.24+ by @pzelasko :: PR: #9444 - [NeMo-UX] Adding support for mcore distributed optimizer by @marcromeyn :: PR: #9435 - Use ModelOpt build_tensorrt_llm for building engines for qnemo checkpoints by @janekl :: PR: #9452 - ci(notifications): Fix extraction of last 2K chars by @ko3n1g :: PR: #9450 - Update readme with mlperf news by @ericharper :: PR: #9457 - [NeMo-UX] Add nsys callback by @ashors1 :: PR: #9461 - [NeMo UX] Introducing optimizer module by @marcromeyn :: PR: #9454 - Fix minor import bug in deploy module by @oyilmaz-nvidia :: PR: #9463 - ci(notifications): Fetch all jobs by @ko3n1g :: PR: #9465 - Update build_dataset.py by @stevehuang52 :: PR: #9467 - bionemo: bn2/add pipelineparallel dtype by @skothenhill-nv :: PR: #9475 - [NeMo-UX] Integrate experiment manager features with NeMo-UX APIs by @ashors1 :: PR: #9460 - Add python_requires by @galv :: PR: #9431 - [NeMo-UX] Fixing imports of NeMoLogging, AutoResume & ModelCheckpoint by @marcromeyn :: PR: #9476 - Modelopt Refactor for SDXL Quantization by @suiyoubi :: PR: #9279 - [NeMo-UX] Fixing defaults in llm.train & Mistral7BModel by @marcromeyn :: PR: #9486 - In framework deploy using deploy script by @oyilmaz-nvidia :: PR: #9468 - [NeMo-UX] Integrate tokenizer import into model.import_ckpt by @marcromeyn :: PR: #9485 - append to file by @malay-nagda :: PR: #9483 - [NeMo-UX] Fix bug in import_ckpt by @marcromeyn :: PR: #9492 - Add nemotron news by @ericharper :: PR: #9510 - Add CICD test for Stable Diffusion by @michal2409 :: PR: #9464 - Akoumparouli/nemo ux mixtral by @akoumpa :: PR: #9446 - [NeMo-UX] Llama and Gemma by @cuichenx :: PR: #9528 - [NeMo-UX] minor logging bug fixes by @ashors1 :: PR: #9529 - Update neva conversion script from and to HF by @yaoyu-33 :: PR: #9296 - [Nemo-UX] IO fixes by @marcromeyn :: PR: #9512 - Fix lhotse tests for v1.24.2 by @pzelasko :: PR: #9546 - [Nemo CICD] Make GPU Unit Tests non-optional by @pablo-garay :: PR: #9551 - Add Python AIStore SDK to container and bump min Lhotse version by @pzelasko :: PR: #9537 - [NeMo-UX] Fix tokenizer IO by @marcromeyn :: PR: #9555 - [NeMo UX] Move mistral_7b.py to mistral.py by @akoumpa :: PR: #9545 - ci: Do not attempt to send slack on fork by @ko3n1g :: PR: #9556 - Fix SDXL incorrect name in Docs by @suiyoubi :: PR: #9534 - Bump PTL version by @athitten :: PR: #9557 - [Resiliency] Straggler detection by @jbieniusiewi :: PR: #9473 - [NeMo-UX] Switch to torch_dist as 
default distributed checkpointing backend by @ashors1 :: PR: #9541 - [NeMo-UX] Checkpointing bug fixes by @ashors1 :: PR: #9562 - Expose MCore path_to_cache option by @maanug-nv :: PR: #9570 - [NeMo-UX] Fix Trainer serialization by @marcromeyn :: PR: #9571 - Update click version requirement by @thomasdhc :: PR: #9580 - [Fault tolerance] Heartbeat detection by @maanug-nv :: PR: #9352 - [Nemo-UX] Add fabric-API for manual forward-pass by @marcromeyn :: PR: #9577 - [Nemo-UX] Add SDK-factories to llm-collection by @marcromeyn :: PR: #9589 - [NeMo-UX] Some improvements to NeMoLogger by @marcromeyn :: PR: #9591 - Set no_sync_func & grad_sync_fucn by @akoumpa :: PR: #9601 - [NeMo-UX] Fix nemo logger when trainer has no loggers by @ashors1 :: PR: #9607 - Fix the dictionary format returned by the `scheduler` method by @sararb :: PR: #9609 - [NeMo-UX] Dataloading enhancements and bug fixes by @ashors1 :: PR: #9595 - Fix serialization of AutoResume by @sararb :: PR: #9616 - Jsonl support by @adityavavre :: PR: #9611 - Akoumparouli/mistral import instruct chat template fix by @akoumpa :: PR: #9567 - Remove .cuda calls, use device isntead by @akoumpa :: PR: #9602 - fix converter defautl args by @akoumpa :: PR: #9565 - fix: remove non_blocking from PTL's .cuda call by @akoumpa :: PR: #9618 - NeVA Minor Fixes by @yaoyu-33 :: PR: #9608 - [NeMo-UX] fix pretrianing data sizes and weights by @cuichenx :: PR: #9627 - [NeMo-UX] async checkpointing support by @ashors1 :: PR: #9466 - Change default parallel_save to False by @mikolajblaz :: PR: #9632 - Add REST API to deploy module by @athitten :: PR: #9539 - ci: Timeout per step, not job by @ko3n1g :: PR: #9635 - [NeMo-UX] Fix when optimizers are setup for PEFT by @marcromeyn :: PR: #9619 - [NeMo-UX] Fix pipeline parallel bug by @ashors1 :: PR: #9637 - Fixing import error fior llama-index (RAG pipeline) by @pablo-garay :: PR: #9662 - llama CI fix by @rohitrango :: PR: #9663 - [NeMo-UX] Make 'load_directly_on_device' configurable by @ashors1 :: PR: #9657 - [Nemo-UX] Including all trainable-params in a PEFT-checkpoint by @marcromeyn :: PR: #9650 - [NeMo-UX] Fix imports so local configuration of runs works again by @marcromeyn :: PR: #9690 - Set TE flag in legacy -> mcore conversion script by @terrykong :: PR: #9722 - Update starthere docs text by @erastorgueva-nv :: PR: #9724 - TorchAudio installation workaround for incorrect `PYTORCH_VERSION` variable by @artbataev :: PR: #9736 - [NeMo-UX] Match nemo 1's default behavior for drop_last and pad_samples_to_global_batch_size by @ashors1 :: PR: #9707 - add a bit more for timeout (#9702) by @pablo-garay :: PR: #9754 - Fix missing parallelisms by @maanug-nv :: PR: #9725 - update branch by @nithinraok :: PR: #9764 - Fix data preprocessing script by @cuichenx :: PR: #9759 - vLLM 0.5.1 update by @apanteleev :: PR: #9779 - upper bound hf-hub by @akoumpa :: PR: #9805 - Fix few issues and docs for neva and clip in r2.0.0rc1 by @yaoyu-33 :: PR: #9681 - add dummy vision and text transformer config (assumed mcore to be false) by @rohitrango :: PR: #9699 - fix lita bugs by @Slyne :: PR: #9810 - [NeMo-UX] Log `val_loss` by @ashors1 :: PR: #9814 - [NeMo-UX] Fix some dataloading bugs by @ashors1 :: PR: #9807 - [NeMo-UX] Adding recipes by @marcromeyn :: PR: #9720 - [NeMo-UX] Set async_save from strategy rather than ModelCheckpoint by @ashors1 :: PR: #9800 - Fix hf hub for 0.24+ by @titu1994 :: PR: #9806 - [NeMo-UX] Fix a minor bug with async checkpointing by @ashors1 :: PR: #9856 - [NeMo-UX] make progress bar easier to parse by 
@ashors1 :: PR: #9877 - Docs: add "Nemo Fundamentals" page by @erastorgueva-nv :: PR: #9835 - Create __init__.py by @stevehuang52 :: PR: #9892 - [NeMo-UX] Fixes to make PreemptionCallback work by @hemildesai :: PR: #9830 - Fix Docker build. Make Dockerfile consistent with CI by @artbataev :: PR: #9784 - Multimodal data prep notebook fix by @cuichenx :: PR: #9910 - [NeMo-UX] Add distributed checkpointing unit tests by @ashors1 :: PR: #9794 - r2.0.0rc1 fix for dist checkpoint loading by @yaoyu-33 :: PR: #9854 - [NeMo-UX] Rename sdk references to NeMo Run by @hemildesai :: PR: #9872 - [NeMo-UX] Fix some serialization bugs by @ashors1 :: PR: #9868 - add mixtral neva tutorial (moe + token fusion + siglip) by @paul-gibbons :: PR: #9926 - [NeMo-UX] Add more NeMo Logger tests by @ashors1 :: PR: #9795 - Akoumparouli/mixtral fixes for r2.0.0rc1 by @akoumpa :: PR: #9911 - R2.0.0rc1 clip fix by @Slyne :: PR: #9871 - [NeMo-UX] Add missing docstrings and update some defaults by @ashors1 :: PR: #9895 - Add REST service requirements.txt by @oyilmaz-nvidia :: PR: #9923 - add bert latest fix by @JRD971000 :: PR: #9921 - remove empy reconfigure_limit_batches by @akoumpa :: PR: #9934 - fix mem by @terrykong :: PR: #9964 - Run a sample query for a quantized model conditionally by @janekl :: PR: #9965 - Add pydantic-settings by @oyilmaz-nvidia :: PR: #9961 - Resiliency features update by @jbieniusiewi :: PR: #9714 - [NeMo-UX] Wrap task config save in a try/except by @ashors1 :: PR: #9956 - [NeMo-UX] Update default PTL logging `save_dir` by @ashors1 :: PR: #9954 - Fix lita tutorial by @Slyne :: PR: #9980 - Add deploy and REST API support to NeMo 2.0 by @athitten :: PR: #9834 - ci: Allow changelog manual (#10156) by @ko3n1g :: PR: #10157 - docs: Add changelog by @ko3n1g :: PR: #10155 - add manifest file by @ko3n1g :: PR: #10161
## NVIDIA Neural Modules 2.0.0rc0

### Highlights

#### LLM and MM

##### Models

- Megatron Core RETRO
  - Pre-training
  - Zero-shot Evaluation
- Pretraining, conversion, evaluation, SFT, and PEFT for:
  - Mixtral 8X22B
  - Llama 3
  - SpaceGemma
- Embedding Models Fine Tuning
  - Mistral
  - BERT
- BERT models
  - Context Parallel
  - Distributed checkpoint
- Video capabilities with NeVa

##### Performance

- Distributed Checkpointing
  - Torch native backend
  - Parallel read/write
  - Async write
- Multimodal LLM (LLAVA/NeVA)
  - Pipeline Parallelism support
  - Sequence packing support

##### Export

- Integration of Export & Deploy Modules into NeMo Framework container
- Upgrade to TRT-LLM 0.9

#### Speech (ASR & TTS)

##### Models

- AED Multi Task Models (Canary) - Multi-Task Multi-Lingual Speech Recognition / Speech Translation model
- Multimodal Domain - Speech LLM supporting SALM Model
- Parakeet-tdt_ctc-1.1b Model - RTFx of > 1500 (can transcribe 1500 seconds of audio in 1 second)
- Audio Codec 16kHz Small - NeMo Neural Audio Codec for discretizing speech for use in LLMs
- mel_codec_22khz_medium
- mel_codec_44khz_medium

##### Perf Improvements

- Transcribe() upgrade - Enables one line transcribe with files, tensors, data loaders
- Frame looping algorithm for RNNT faster decoding - Improves Real Time Factor (RTF) by 2-3x
- Cuda Graphs + Label-Looping algorithm for RNN-T and TDT Decoding - Transducer Greedy decoding at over 1500x RTFx, on par with CTC Non-Autoregressive models
- Semi Sorted Batching support - External User contribution that speeds up training by 15-30%.

##### Customization

- Context biasing for CTC word stamping - Improve accuracy for custom vocabulary and pronunciation
- Longform Inference
- Longform inference support for AED models
- Transcription of multi-channel audio for AED models

##### Misc

- Upgraded webdataset
- Speech and LLM / Multimodal unified container

### Detailed Changelogs

#### ASR
Changelog - Enable using hybrid asr models in CTC Segmentation tool by @erastorgueva-nv :: PR: #8828 - TDT confidence fix by @GNroy :: PR: #8982 - Fix union type annotations for autodoc+mock-import rendering by @pzelasko :: PR: #8956 - NeMo dev doc restructure by @yaoyu-33 :: PR: #8896 - Improved random seed configuration for Lhotse dataloaders with docs by @pzelasko :: PR: #9001 - Fix #8948, allow preprocessor to be stream captured to a cuda graph when doing per_feature normalization by @galv :: PR: #8964 - [ASR] Support for transcription of multi-channel audio for AED models by @anteju :: PR: #9007 - Add ASR latest news by @titu1994 :: PR: #9073 - Fix docs errors and most warnings by @erastorgueva-nv :: PR: #9006 - PyTorch CUDA allocator optimization for dynamic batch shape dataloading in ASR by @pzelasko :: PR: #9061 - RNN-T and TDT inference: use CUDA graphs by default by @artbataev :: PR: #8972 - Fix #8891 by supported GPU-side batched CTC Greedy Decoding by @galv :: PR: #9100 - Update branch for notebooks and ci in release by @ericharper :: PR: #9189 - Enable CUDA graphs by default only for transcription by @artbataev :: PR: #9196 - rename paths2audiofiles to audio by @nithinraok :: PR: #9209 - Fix ASR_Context_Biasing.ipynb contains FileNotFoundError by @andrusenkoau :: PR: #9233 - Cherrypick: Support dataloader as input to `audio` for transcription (#9201) by @titu1994 :: PR: #9235 - Update Online_Offline_Microphone_VAD_Demo.ipynb by @stevehuang52 :: PR: #9252 - Dgalvez/fix greedy batch strategy name r2.0.0rc0 by @galv :: PR: #9243 - Accept None as an argument to decoder_lengths in GreedyBatchedCTCInfer::forward by @galv :: PR: #9246 - Fix loading github raw images on notebook by @nithinraok :: PR: #9282 - typos by @nithinraok :: PR: #9314 - Re-enable cuda graphs in training modes. by @galv :: PR: #9338 - add large model stable training fix and contrastive loss update for variable seq by @nithinraok :: PR: #9259 - Fix conv1d package in r2.0.0rc0 by @pablo-garay :: PR: #9369 - Fix GreedyBatchedCTCInfer regression from GreedyCTCInfer. (#9347) by @titu1994 :: PR: #9350 - Make a backward compatibility for old MSDD configs in label models by @tango4j :: PR: #9377 - Force diarizer to use CUDA if cuda is available and if device=None. by @tango4j :: PR: #9380
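Several entries above flip RNN-T/TDT greedy-decoding defaults (CUDA graphs enabled by default in #8972, then restricted to transcription in #9196). Decoding behaviour like this is normally overridden through `change_decoding_strategy`; a sketch follows, where the `use_cuda_graph_decoder` field is an assumption based on these PRs and should be verified against the release config:

```python
from copy import deepcopy
from omegaconf import open_dict

# `model` is an already-loaded RNN-T/TDT model, e.g. via ASRModel.restore_from("model.nemo").
decoding_cfg = deepcopy(model.cfg.decoding)
with open_dict(decoding_cfg):
    # Assumed flag for toggling CUDA-graph based greedy decoding.
    decoding_cfg.greedy.use_cuda_graph_decoder = False
model.change_decoding_strategy(decoding_cfg)
```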
#### TTS
Changelog - [TTS] Add tutorial for training audio codecs by @rlangman :: PR: #8723 - Update radtts.py by @blisc :: PR: #9097 - [Nemo CICD] RADTTS test optional by @pablo-garay :: PR: #9112 - Remove Radtts CI test by @blisc :: PR: #9144 - Fix T5 G2P Input and Output Types by @blisc :: PR: #9224
#### LLM and MM
Changelog - Rachitg/dpa by @rachitgarg91 :: PR: #8911 - Remove precision args in trainer due to PTL update by @yaoyu-33 :: PR: #8908 - Huvu/mcore retro by @huvunvidia :: PR: #8861 - fsdp tp > 1 bug fix by @dimapihtar :: PR: #8947 - Fix memory leak at loss func by @minitu :: PR: #8868 - change the condition for get qkv tensor from linear_qkv output in mcoremixin by @HuiyingLi :: PR: #8965 - Add safety checks for 'data' key in MegatronGPTModel cfg by @HuiyingLi :: PR: #8991 - [NeMo-UX] Adding MegatronParallel by @cuichenx :: PR: #8987 - Skip top_p computations when set to 1.0 by @odelalleau :: PR: #8905 - Gemma bug by @cuichenx :: PR: #8962 - [NeMo-UX] Adding megatron strategy by @marcromeyn :: PR: #8995 - Quantized checkpoint support in export and deploy modules by @janekl :: PR: #8859 - add geglu to mlp swap by @JRD971000 :: PR: #8999 - add timeout for new_group by @acphile :: PR: #8998 - Zero-shot evaluation pipeline for mcore RETRO by @huvunvidia :: PR: #8941 - Added fusion for squared relu by @sanandaraj5597 :: PR: #8963 - Developer Documents for mcore RETRO by @huvunvidia :: PR: #9026 - [NeMo-UX] Adding GPTModel & MockDataModule by @marcromeyn :: PR: #9011 - Adding unit test for mcore RETRO model by @huvunvidia :: PR: #9022 - docs and simplification of cmd args by @arendu :: PR: #8979 - [NeMo-UX] Add checkpoint-io to MegatronStrategy by @marcromeyn :: PR: #9057 - Enable Sequence Packing and Pipeline Parallel in NeVA by @yaoyu-33 :: PR: #8957 - Mingyuanm/add back fp8 support to sd by @Victor49152 :: PR: #9070 - unfused lora by @arendu :: PR: #9004 - Handle case where num_query_groups is set to null for LoRA config setup by @vysarge :: PR: #9075 - Alit/griffin by @JRD971000 :: PR: #9021 - Implement DistributedCheckpointIO by @mikolajblaz :: PR: #9016 - Video Neva Pretraining + Inference Implementation by @paul-gibbons :: PR: #9095 - HF to .nemo for Mixtral-8x22B-instruct by @akoumpa :: PR: #9060 - mcore ds updates by @dimapihtar :: PR: #8951 - Alit/griffin perf by @JRD971000 :: PR: #9107 - Add assert for max_steps to be positive in MegatronGPTSFTModel by @athitten :: PR: #9110 - Extend sequence length padding for GPT SFT to account for context parallel by @vysarge :: PR: #8869 - Update gpt dataset config parameter for mock by @thomasdhc :: PR: #9118 - Add Mcore DistributedDataParallel and distributed optimizer into Nemo by @gdengk :: PR: #9034 - Revert "Add assert for max_steps to be positive in MegatronGPTSFTMode… by @pablo-garay :: PR: #9128 - scripts to convert HF lora to nemo by @arendu :: PR: #9102 - Prevent duplicated checkpoints by @mikolajblaz :: PR: #9015 - add TN/ITN link in speech tools list by @erastorgueva-nv :: PR: #9142 - Cleanup deprecated files and temporary changes by @cuichenx :: PR: #9088 - Use DP+CP groups as the FSDP sharding domain by @erhoo82 :: PR: #9145 - CUDA memory profile by @erhoo82 :: PR: #9096 - Fix missing func for T5 model by @gdengk :: PR: #9141 - Add knob for load_directly_on_device by @mikolajblaz :: PR: #9125 - Revert rope fusion defaults by @cuichenx :: PR: #9238 - Update nemo.export module for quantized models by @janekl :: PR: #9250 - Fix circular import for MM dataprep notebook by @cuichenx :: PR: #9287 - neva media_type + text generation default fix by @paul-gibbons :: PR: #9257 - fix lora and ptuning and isort/black by @oyilmaz-nvidia :: PR: #9290 - add check if num layers is divisible by pp size by @dimapihtar :: PR: #9208 - Fix P-tuning for Llama based models by @apanteleev :: PR: #9297 - add deprecation warnings by @pablo-garay :: PR: #9266 
- move pooler under post_process by @dimapihtar :: PR: #9328 - add deprecation note for nmt by @dimapihtar :: PR: #9342 - Fix incorrect checkpoint removal logic (#9192) by @mikolajblaz :: PR: #9204 - fix fp16 precision issue by @dimapihtar :: PR: #9376 - Fix module.training for Neva in FusedAttn backward which causes nan by @yaoyu-33 :: PR: #8877
#### Export
Changelog - Updates for TRT-LLM 0.9 by @oyilmaz-nvidia :: PR: #8873 - Mingyuanm/sdxl export by @Victor49152 :: PR: #8926 - Avoid unpacking NeMo checkpoints before exporting to TRT-LLM by @apanteleev :: PR: #8866 - Update gemma for trt-llm 0.9 by @oyilmaz-nvidia :: PR: #8974 - TRT-LLM export P-tuning related fixes by @apanteleev :: PR: #8863
#### General Improvements
Changelog - Update package info by @ericharper :: PR: #8793 - [Nemo CICD] Update mcore 4.13.24 by @pablo-garay :: PR: #8917 - Akoumparouli/low mem mixtral ckpt converter by @akoumpa :: PR: #8895 - Adding RETRO tests to Action Tests (cicd-main.yml) by @huvunvidia :: PR: #8942 - Akoumparouli/fix sd train 2 by @akoumpa :: PR: #8883 - Update te install for jenkins by @ericharper :: PR: #8954 - [Nemo CICD] Add last job depending on others for blocking check by @pablo-garay :: PR: #8959 - Minor quantization pipeline updates by @janekl :: PR: #8924 - Fix External CLIP Converter by @yaoyu-33 :: PR: #8960 - PP support in LoRA merge script by @cuichenx :: PR: #8934 - Update PR template by @ericharper :: PR: #8978 - Update Latest News by @shashank3959 :: PR: #8837 - Fix incorrect link to latest news in README by @shashank3959 :: PR: #8985 - Update dependency install for LLM and MM by @ericharper :: PR: #8990 - Temporarily remove mcore dep by @ericharper :: PR: #9010 - [Nemo CICD] further specialize runners for more parallelism by @pablo-garay :: PR: #9036 - Update mm dataprep notebook based on feedback by @cuichenx :: PR: #9029 - Fix import in lora merge script by @cuichenx :: PR: #9032 - [Nemo CICD] Run when labeled:Run CICD by @pablo-garay :: PR: #9044 - [Nemo CICD] Add tag/label for 1-gpu runner by @pablo-garay :: PR: #9046 - [Nemo CICD] checkout v4 by @pablo-garay :: PR: #9048 - [Nemo CICD] Remove temp test change by @pablo-garay :: PR: #9049 - remove in-place addition for dreambooth train with text encoder by @Victor49152 :: PR: #8825 - Mingyuanm/sdxl quantization notebook by @Victor49152 :: PR: #9042 - [Nemo CICD] Trigger on comment issued by @pablo-garay :: PR: #9062 - zarr ckpt to torch_dist ckpt converter by @dimapihtar :: PR: #8842 - Restore PTQ tests for Llama2 (reopened) by @janekl :: PR: #9064 - add clip H config by @JRD971000 :: PR: #9082 - [NeMo-UX] Add mixed-precision plugin by @marcromeyn :: PR: #9065 - Comment baichuan test and update pr template by @ericharper :: PR: #9085 - Add safe extraction of nemo tar files by @athitten :: PR: #8976 - Improved `shard_id` parsing in `LazyNemoTarredIterator`, enables AIS dataloading by @pzelasko :: PR: #9077 - [NeMo-UX] Add mistral-7b model by @marcromeyn :: PR: #9066 - Llama3 Conversion Script Update by @suiyoubi :: PR: #9089 - dehardcode test string by @JimmyZhang12 :: PR: #8865 - [Nemo CICD] Try trigger cicd run on comment by @pablo-garay :: PR: #9111 - Lhotse dataloading: RIR augmentation and nemo/tarred input support for RIR and noise aug by @pzelasko :: PR: #9109 - mixtral evaluation PR by @Slyne :: PR: #8989 - [Nemo CICD] Revert: run GHA cicd on comment by @pablo-garay :: PR: #9119 - [Nemo CICD] Comment out flaky test: running too long by @pablo-garay :: PR: #9123 - [Nemo CICD] Add timeout to unit tests by @pablo-garay :: PR: #9132 - [Nemo CICD] Indicate optional test in name (prefix) by @pablo-garay :: PR: #9139 - video neva null image+video folder path fix by @paul-gibbons :: PR: #9116 - [NeMo-UX] Add data module by @cuichenx :: PR: #9133 - NeMo Inference Requirements by @oyilmaz-nvidia :: PR: #9093 - Remove debug print by @maanug-nv :: PR: #9074 - Remove legacy CI by @pablo-garay :: PR: #9149 - Update support for push_to_hf_hub() by @titu1994 :: PR: #9159 - [Nemo CICD] comment out flaky PTQ tests by @pablo-garay :: PR: #9160 - Update branch by @ericharper :: PR: #9211 - dist adam transpose fix by @dimapihtar :: PR: #9239 - [Nemo CICD] Increase time limit for Speech_Checkpoints_tests (#9186) by @pablo-garay :: PR: #9247 - Pin 
transformers by @ericharper :: PR: #9261 - Fix typo in HF tutorial by @titu1994 :: PR: #9302
## NVIDIA Neural Modules 1.23.0

### Highlights

#### Models

##### Nvidia Starcoder 2 - 15B

- Announcement -
- AI Foundation Model Inference -

##### NeMo Canary

- Announcement -

#### NeMo LLM

- Falcon
- Code Llama
- StarCoder
- GPT perf improvements
- Context parallelism
- Mistral
- Mixtral (without expert parallelism)
- Mcore GPT Dataset integration

#### NeMo MM

- CLIP
- Stable Diffusion (supporting LoRA)
- Imagen
- ControlNet (for SD)
- Instruct pix2pix (for SD)
- LLAVA
- NeVA
- DreamFusion++
- NSFW filtering

#### NeMo ASR

- Lhotse Dataloading support #7880
- Canary: Multi task multi lingual ASR #8242
- LongForm Audio for Diarization #7737
- Faster algorithm for RNN-T Greedy #7926
- Cache-Aware streaming notebook #8296

#### NeMo TTS

#### NeMo Vision

#### Known Issues

##### ASR

###### RNNT WER calculation when fused batch size > 1 during validation / test step()

Previously, the RNNT metric was stateful while the CTC one was not ([r1.22.0](https://github.com/NVIDIA/NeMo/blob/r1.22.0/nemo/collections/asr/metrics/rnnt_wer_bpe.py#L419-L420), [r1.23.0](https://github.com/NVIDIA/NeMo/blob/r1.23.0/nemo/collections/asr/metrics/wer.py#L333)). Therefore this calculation in the RNNT joint for fused operation worked properly. However, with the unification of metrics in r1.23.0, a bug was introduced where only the last sub-batch of metrics calculates the scores and does not accumulate. This is patched via and will be fixed in the next release.

__Workaround__: Explicitly disable fused batch size during inference using the following command:

```python
from omegaconf import open_dict

model = ...
decoding_cfg = model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.fused_batch_size = -1
model.change_decoding_strategy(decoding_cfg)
```

Note: This bug does not affect scores calculated via model.transcribe() (since it does not calculate metrics during inference, just text), or using the `transcribe_speech.py` or `speech_to_text_eval.py` in `examples/asr`.

###### Two failing unit tests due to a change in expected results, caused by lhotse version update

#### Container

For additional information regarding NeMo containers, please visit:

`docker pull nvcr.io/nvidia/nemo:24.01.speech`

#### ASR
Changelog - Update link to yaml file in ASR_with_Transducers.ipynb by @Faith-Nchifor :: PR: #8014 - Use convert_hf_dataset_to_nemo by @karpnv :: PR: #8017 - Update asr_language_modeling.rst: Add a missing word by @martin0258 :: PR: #8007 - spelling mistake by @orena1 :: PR: #7903 - update asr eval by @stevehuang52 :: PR: #8045 - fix noise aug by @stevehuang52 :: PR: #8057 - Various fixes for typos and urls by @titu1994 :: PR: #8066 - [Fix] Increase length check tolerance to prevent test failing by @anteju :: PR: #8067 - Add text metrics to asr eval by @stevehuang52 :: PR: #8087 - fix device setting to allow using accelerator cpu by @orena1 :: PR: #8084 - .ctm in data simulator annotator compliant with RT-09 specification by @popcornell :: PR: #8004 - Fix AST eval by @stevehuang52 :: PR: #8112 - fix: numba.*_num_threads resets torch num_threads #8141 by @itzsimpl :: PR: #8145 - Update dependencies by @titu1994 :: PR: #8156 - NeMo + Lhotse integration by @pzelasko :: PR: #7880 - Speedup RNN-T greedy decoding by @artbataev :: PR: #7926 - [docker] Install k2 before NeMo for faster image rebuilding by @pzelasko :: PR: #8204 - [docs] Add --force_codec to tarred dataset creation examples by @pzelasko :: PR: #8227 - Temporarily use the previous RNN-T decoding algorithm as default by @artbataev :: PR: #8226 - Make TDT inference not require duration params by @hainan-xv :: PR: #8207 - Cache Aware Streaming tutorial notebook by @erastorgueva-nv :: PR: #8296 - fix path location and branch by @nithinraok :: PR: #8304 - Attention encoder-decoder models for multiple speech-to-text tasks … by @titu1994 :: PR: #8324 - Remove asr webapp by @titu1994 :: PR: #8347 - remove _target_ at model level in aed model config [ASR] by @krishnacpuvvada :: PR: #8351 - Add change_vocabulary and save_tokenizers() support to Multitask ASR models by @titu1994 :: PR: #8357 - Change default beam size by @titu1994 :: PR: #8371 - adding jenkins test for speech_to_text_aed model by @krishnacpuvvada :: PR: #8368 - Add Finetuning tutorial with HF Datasets by @nithinraok :: PR: #8356 - wer fix by @tbartley94 :: PR: #8404 - add ensemble decoding fix by @nithinraok :: PR: #8427 - Update k2 by @artbataev :: PR: #8492
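The Lhotse integration above (#7880) is enabled through the dataset config rather than a new API. A minimal sketch of the kind of override involved, assuming the `use_lhotse` and `batch_duration` fields from the Lhotse dataloading docs (field names are assumptions, not verified against r1.23.0):

```python
from omegaconf import OmegaConf

# Any ASR training config with a model.train_ds section works; this path is from the repo's examples.
cfg = OmegaConf.load("examples/asr/conf/conformer/conformer_ctc_bpe.yaml")
cfg.model.train_ds.use_lhotse = True      # switch the dataloader backend to Lhotse
cfg.model.train_ds.batch_duration = 600   # dynamic batching by total seconds of audio
```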
#### TTS
Changelog - [TTS] Scale sampler steps by number of devices by @rlangman :: PR: #7947 - Add All Multimodal Source Code Part 2: Text to image, x to nerf by @yaoyu-33 :: PR: #7970 - [TTS] Add period discriminator and feature matching loss to codec recipe by @rlangman :: PR: #7884 - Added VectorQuantizer base class by @anteju :: PR: #8011
#### LLMS
Changelog - Add interface to set NCCL options of each process group by @erhoo82 :: PR: #7923 - Support O2 training of PEFT and SFT by @cuichenx :: PR: #7971 - [NLP] Access scaler only in FP16 case by @janekl :: PR: #7916 - [NLP] Minor improvements in Llama conversion script by @janekl :: PR: #7978 - [NLP] Use helpers from utils_funcs.py in Llama conversion by @janekl :: PR: #7979 - [NLP] Remove replace_sampler_ddp (deprecated in Trainer) by @janekl :: PR: #7981 - Reworked MegatronPretrainingRandomBatchSampler to correctly handle epochs > 1 by @trias702 :: PR: #7920 - Remove deprecated arguments from TE's TransformerLayer by @jbaczek :: PR: #7917 - Add All Multimodal Source Code by @yaoyu-33 :: PR: #7791 - First draft of mcore bert model in NeMo by @shanmugamr1992 :: PR: #7814 - Support Falcon Variants (7B/40B/180B) in Mcore NeMo by @xuanzic :: PR: #7666 - FSDP + Tensor Parallelism by @erhoo82 :: PR: #7897 - Packed Sequence by @cuichenx :: PR: #7945 - Adding method back that was removed accidentally by @ericharper :: PR: #8038 - [NLP] ArtifactItem with init=True to make it debuggable by @janekl :: PR: #7980 - SFT patch: (1) enable sequence parallelism and (2) enable profile by @erhoo82 :: PR: #7963 - migration to PTL 2.0 for spellmapper model by @bene-ges :: PR: #7924 - Change the megatron config lr scheduler default and fix to change partitions script by @shan18 :: PR: #8094 - (1) Add SHARP interface to M-CORE, (2) use send/recv to send train loss to the first rank instead of b-cast by @erhoo82 :: PR: #7793 - Reconfigure limit_val_batches only for int by @athitten :: PR: #8099 - Fixing wrapper and moving it to base class by @shanmugamr1992 :: PR: #8055 - fix gated_linear_unit bug by @Agoniii :: PR: #8042 - Fix Adapter for MCore models by @cuichenx :: PR: #8124 - add war fix for sync issues by @gshennvm :: PR: #8130 - Improve PEFT UX by @cuichenx :: PR: #8131 - Enhance flexibility by passing callbacks as method argument by @michal2409 :: PR: #8015 - context parallelism by @xrennvidia :: PR: #7739 - Make pipelined TP comm overlap available with mcore by @erhoo82 :: PR: #8005 - remove deprecated scripts by @arendu :: PR: #8138 - adding OnlineSampleMapping by @arendu :: PR: #8137 - Add distopt support for FP8 params and BF16 optimizer state by @timmoon10 :: PR: #7909 - Revert adding OnlineSampleMapping by @pablo-garay :: PR: #8164 - Token count and sequence length logging for MegatronGPTSFTModel by @vysarge :: PR: #8136 - Use latest apex internal API by @jbaczek :: PR: #8129 - tune specific params in the base model by @arendu :: PR: #7745 - Virtual pipeline parallel support for MegatronGPTSFTModel by @vysarge :: PR: #7964 - removed deprecated peft model by @arendu :: PR: #8183 - remove more deprecated files by @arendu :: PR: #8169 - Pre-generate cu_seqlens argmin and max_seqlen to remove host-to-device sync by @erhoo82 :: PR: #8108 - Add the interface to use SHARP to FSDP strategy by @erhoo82 :: PR: #8202 - Multimodal required NLP base model changes by @yaoyu-33 :: PR: #8188 - [NLP] Improve and unify loading state_dict for community models by @janekl :: PR: #7977 - Rename Finetuning Scripts by @cuichenx :: PR: #8201 - Final multimodal PR with our recent developments on MM side by @yaoyu-33 :: PR: #8127 - Add include_text parameter to SFT dataloaders by @Kipok :: PR: #8198 - Add random_seed argument to generate by @Kipok :: PR: #8162 - Added support for neptune logger by @harishankar-gopalan :: PR: #8210 - Pre-compute max_seqlen and cu_seqlens_argmin in all model-parallel cases by @erhoo82 :: 
PR: #8222 - Use PackedSeqParams in accordance with changes in Megatron-LM by @cuichenx :: PR: #8205 - Fix to peft & virtual pipeline parallel unsupported check by @vysarge :: PR: #8216 - Fixed the tp overlap switch by @sanandaraj5597 :: PR: #8195 - add knobs for rope/swiglu fusion by @lhb8125 :: PR: #8184 - Added sample cpu_offloading switch to YAML by @sanandaraj5597 :: PR: #8148 - Syncing random seed between ranks in generate by @Kipok :: PR: #8230 - add first_val_step to mcore scheduler by @JimmyZhang12 :: PR: #8150 - Correct padding for SFT input data to account for sequence parallel + TE's fp8 op dimension requirements by @vysarge :: PR: #8240 - Mistral 7b conversion script by @akoumpa :: PR: #8052 - switch to mcore dataset [with FIM support] by @dimapihtar :: PR: #8149 - Mixtral to NeMo conversion script. by @akoumpa :: PR: #8155 - fixes to accomendate mcore changes by @HuiyingLi :: PR: #8261 - Allow MegatronPretrainingRandomSampler to do multi-epoch training by @trias702 :: PR: #8239 - Add dist ckpt support for regular optimizers by @mikolajblaz :: PR: #7749 - add deallocate pipeline output optimization by @JimmyZhang12 :: PR: #8279 - Fix memory leak caused by context parallelism hanging references by omegaconf by @JimmyZhang12 :: PR: #8299 - distributed fused adam + rampup bs support by @dimapihtar :: PR: #8302 - Update PEFT Doc by @cuichenx :: PR: #8262 - Converter script fixes for mixtral/mistral by @akoumpa :: PR: #8272 - Keep max_seqlen and cu_seqlens_argmin for later micro-batches when PP>1 by @erhoo82 :: PR: #8334 - Enable megatron core loggers for GPT pretraining by @ashbhandare :: PR: #8354 - mcore ds fix by @dimapihtar :: PR: #8283 - release updates by @dimapihtar :: PR: #8378 - Mcore customization doc by @HuiyingLi :: PR: #8298 - updated link to pubmed by @nithinraok :: PR: #8402 - mcore customization doc minor fix by @HuiyingLi :: PR: #8421 - Fixing mcore bert for TP, PP and SP by @shanmugamr1992 :: PR: #8336 - Add settings to suppress bf16 compile errors in CI on V100 by @athitten :: PR: #8481 - MoE parameter passing by @akoumpa :: PR: #8255 - Add fp8 support for SD/Update notebook paths by @Victor49152 :: PR: #8489
#### NeMo Tools
Changelog - SDE bugfix log by @Jorjeous :: PR: #8430
#### General Improvements
Changelog - Add news section to README by @ericharper :: PR: #7984 - Fixing conversion script to work for code llama by @shanmugamr1992 :: PR: #7997 - Fix crash when converting to mcore a model using rotary embeddings by @odelalleau :: PR: #7998 - Added a procedure for Windows users, README by @Jorjeous :: PR: #7942 - Update manifest.py to speedup loading tarred datasets by @stevehuang52 :: PR: #7900 - [Fix] Fixed name of a test by @anteju :: PR: #7986 - Fix lora merge script by @cuichenx :: PR: #8113 - Support transcoding audio formats when saving tarred datasets (FLAC, OPUS) by @pzelasko :: PR: #8102 - README edit to change Apple Silicon install instructions (to fix a break introduced by pytorch 2) by @stephenmcconnachie :: PR: #8122 - Fixes NVIDIA/apex installation to not erroneously install the pkg by @terrykong :: PR: #8126 - Graphviz fix by @GNroy :: PR: #7843 - Update README.rst by @fayejf :: PR: #8154 - Fix TP>1 issue for conversion script by @cuichenx :: PR: #8144 - Support torch jit script by @artbataev :: PR: #8027 - NeMo Multimodal Docs and Tests Initial PR by @yaoyu-33 :: PR: #8028 - Remove left-over prints in NeMo+Lhotse code by @pzelasko :: PR: #8180 - Upgrade to DLFW PyTorch 23.12 by @ericharper :: PR: #8163 - Add Lhotse support for key in NeMo manifests by @pzelasko :: PR: #8197 - Fix CPU Initialization and TP>1 for LoRA Merge Script by @cuichenx :: PR: #8199 - Add support in Neural Typecheck to disable semantic checks by @titu1994 :: PR: #8212 - Pin lhotse=1.19.2 in r1.23.0 by @pzelasko :: PR: #8303 - Multimodal r1.23.0 bug fix by @yaoyu-33 :: PR: #8315 - MCore dataset compatibility for tokenizers by @vysarge :: PR: #8390 - Update NFA video download link by @erastorgueva-nv :: PR: #8406 - Update MM Dataprep Tutorial by @cuichenx :: PR: #8410 - Fix dreambooth data sampler issue by @yaoyu-33 :: PR: #8400 - Fix a bug in CTM line processing function for multi-speaker data simulations by @tango4j :: PR: #8416 - Akoumparouli/mistral bugfix by @akoumpa :: PR: #8353 - pin to 0.5.0 by @ericharper :: PR: #8465 - Update NeMo Multimodal Requirements by @yaoyu-33 :: PR: #8515 - Fix link in multimodal dataprep tutorial by @cuichenx :: PR: #8517