diff --git a/.claude/skills/babysit-pr/SKILL.md b/.claude/skills/babysit-pr/SKILL.md index 218dfd0b3b64..89dcefe9b25e 100644 --- a/.claude/skills/babysit-pr/SKILL.md +++ b/.claude/skills/babysit-pr/SKILL.md @@ -21,9 +21,9 @@ If no PR number is clear, ask for it before proceeding. ### Step 1 — Get the full picture ```bash -gh pr view --repo NVIDIA-NeMo/NeMo -gh pr checks --repo NVIDIA-NeMo/NeMo -gh pr diff --repo NVIDIA-NeMo/NeMo +gh pr view --repo NVIDIA-NeMo/Speech +gh pr checks --repo NVIDIA-NeMo/Speech +gh pr diff --repo NVIDIA-NeMo/Speech ``` Determine the current state: @@ -47,9 +47,9 @@ The **"Isort and Black Formatting"** workflow (`reformat_with_isort_and_black` j Check out the PR branch and inspect the failure logs: ```bash -gh pr checkout --repo NVIDIA-NeMo/NeMo -gh run list --repo NVIDIA-NeMo/NeMo --branch -gh run view --repo NVIDIA-NeMo/NeMo --log-failed +gh pr checkout --repo NVIDIA-NeMo/Speech +gh run list --repo NVIDIA-NeMo/Speech --branch +gh run view --repo NVIDIA-NeMo/Speech --log-failed ``` Before attempting a fix, check `git log` for recent commits. If you see a previous fix attempt that addressed the same failure and it is still failing, **stop and tell the user** — the issue needs human attention. Do not keep retrying the same fix. @@ -67,7 +67,7 @@ git push After pushing a fix, add the "Run CICD" label to re-trigger the CI pipeline: ```bash -gh pr edit --repo NVIDIA-NeMo/NeMo --add-label "Run CICD" +gh pr edit --repo NVIDIA-NeMo/Speech --add-label "Run CICD" ``` The "CICD NeMo" workflow is triggered by this label and removes it automatically when done. diff --git a/.claude/skills/fix-issue/SKILL.md b/.claude/skills/fix-issue/SKILL.md index 54e4e53f3635..57a8afe67a6e 100644 --- a/.claude/skills/fix-issue/SKILL.md +++ b/.claude/skills/fix-issue/SKILL.md @@ -1,6 +1,6 @@ --- name: fix-issue -description: Fix a GitHub issue in NeMo Speech (NVIDIA-NeMo/NeMo). 
Read the issue, reproduce the bug with a failing test, implement the fix, and verify tests pass. Only opens a PR if the user explicitly asks for it. +description: Fix a GitHub issue in NeMo Speech (NVIDIA-NeMo/Speech). Read the issue, reproduce the bug with a failing test, implement the fix, and verify tests pass. Only opens a PR if the user explicitly asks for it. --- # fix-issue @@ -28,7 +28,7 @@ Read the issue description carefully. Identify: ## Workflow -1. Read the issue: `gh issue view --repo NVIDIA-NeMo/NeMo` +1. Read the issue: `gh issue view --repo NVIDIA-NeMo/Speech` 2. Understand the bug — identify the relevant code 3. Write a minimal reproduction test in `tests/` that demonstrates the failure 4. Run the test to confirm it fails: `pytest -v` @@ -49,7 +49,7 @@ git checkout -b fix/- git add git commit -s -m "Fix (closes #)" git push origin fix/- -gh pr create --repo NVIDIA-NeMo/NeMo \ +gh pr create --repo NVIDIA-NeMo/Speech \ --title "Fix " \ --body "$(cat <<'EOF' # What does this PR do ? diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index ca927bd24183..dcf56e47e43c 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -1,4 +1,4 @@ -> [!IMPORTANT] +> [!IMPORTANT] > The `Update branch` button must only be pressed in very rare occassions. > An outdated branch is never blocking the merge of a PR. > Please reach out to the automation team before pressing that button. @@ -18,7 +18,7 @@ Add a one line overview of what this PR aims to accomplish. 
- You can potentially add a usage example below ```python -# Add a code snippet demonstrating how to use this +# Add a code snippet demonstrating how to use this ``` # GitHub Actions CI @@ -33,12 +33,12 @@ To run CI on an untrusted fork, a NeMo user with write access must first click " **Pre checks**: -- [ ] Make sure you read and followed [Contributor guidelines](https://github.com/NVIDIA/NeMo/blob/main/CONTRIBUTING.md) +- [ ] Make sure you read and followed [Contributor guidelines](https://github.com/NVIDIA-NeMo/Speech/blob/main/CONTRIBUTING.md) - [ ] Did you write any new necessary tests? - [ ] Did you add or update any necessary documentation? - [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc) - [ ] Reviewer: Does the PR have correct import guards for all optional libraries? - + **PR Type**: - [ ] New Feature @@ -50,7 +50,7 @@ If you haven't finished some of the above items you can still open "Draft" PR. ## Who can review? Anyone in the community is free to review the PR once the checks have passed. -[Contributor guidelines](https://github.com/NVIDIA/NeMo/blob/main/CONTRIBUTING.md) contains specific people who can review PRs to various areas. +[Contributor guidelines](https://github.com/NVIDIA-NeMo/Speech/blob/main/CONTRIBUTING.md) contains specific people who can review PRs to various areas. 
# Additional Information diff --git a/.github/workflows/_build_container.yml b/.github/workflows/_build_container.yml index ab478e8be284..e55c9226d208 100644 --- a/.github/workflows/_build_container.yml +++ b/.github/workflows/_build_container.yml @@ -99,7 +99,7 @@ jobs: build-args: | IMAGE_LABEL=nemo-core NEMO_TAG=${{ github.sha }} - NEMO_REPO=https://github.com/NVIDIA/NeMo + NEMO_REPO=https://github.com/NVIDIA-NeMo/Speech PR_NUMBER=${{ github.event.pull_request.number || 0 }} cache-from: | type=registry,ref=${{ inputs.registry }}/nemo-speech:${{ inputs.image-name }}-buildcache-main,mode=max diff --git a/.github/workflows/mcore-tag-bump-bot.yml b/.github/workflows/mcore-tag-bump-bot.yml index b4fdf56ac59f..6fe681865794 100644 --- a/.github/workflows/mcore-tag-bump-bot.yml +++ b/.github/workflows/mcore-tag-bump-bot.yml @@ -15,15 +15,15 @@ jobs: - name: Get release branch names id: get-branch run: | - latest_branch=$(git ls-remote --heads https://github.com/NVIDIA/Megatron-LM.git 'refs/heads/core_r*' | - grep -o 'core_r[0-9]\+\.[0-9]\+\.[0-9]\+' | - sort -V | + latest_branch=$(git ls-remote --heads https://github.com/NVIDIA/Megatron-LM.git 'refs/heads/core_r*' | + grep -o 'core_r[0-9]\+\.[0-9]\+\.[0-9]\+' | + sort -V | tail -n1) echo "mcore_release_branch=$latest_branch" >> $GITHUB_OUTPUT - latest_branch=$(git ls-remote --heads https://github.com/NVIDIA/NeMo.git 'refs/heads/r*' | - grep -o 'r[0-9]\+\.[0-9]\+\.[0-9]\+' | - sort -V | + latest_branch=$(git ls-remote --heads https://github.com/NVIDIA-NeMo/Speech.git 'refs/heads/r*' | + grep -o 'r[0-9]\+\.[0-9]\+\.[0-9]\+' | + sort -V | tail -n1) echo "nemo_release_branch=$latest_branch" >> $GITHUB_OUTPUT diff --git a/.github/workflows/monitor-vms.yml b/.github/workflows/monitor-vms.yml index 722a4720b0e9..a94eea8402bb 100644 --- a/.github/workflows/monitor-vms.yml +++ b/.github/workflows/monitor-vms.yml @@ -22,7 +22,7 @@ jobs: -H "Accept: application/vnd.github+json" \ -H "Authorization: Bearer $GITHUB_TOKEN" \ -H 
"X-GitHub-Api-Version: 2022-11-28" \ - https://api.github.com/repos/NVIDIA/NeMo/actions/runners) + https://api.github.com/repos/NVIDIA-NeMo/Speech/actions/runners) MATRIX=$(echo $RUNNERS \ | jq -c '[ diff --git a/CITATION.cff b/CITATION.cff index 436750dd0af0..2ebe7a9b9db7 100644 --- a/CITATION.cff +++ b/CITATION.cff @@ -2,7 +2,7 @@ cff-version: 1.2.0 message: "If you use this software, please cite it as below." title: "NeMo: a toolkit for Conversational AI and Large Language Models" url: https://nvidia.github.io/NeMo/ -repository-code: https://github.com/NVIDIA/NeMo +repository-code: https://github.com/NVIDIA-NeMo/Speech authors: - family-names: Harper given-names: Eric @@ -16,7 +16,7 @@ authors: given-names: Yang - family-names: Bakhturina given-names: Evelina - - family-names: Noroozi + - family-names: Noroozi given-names: Vahid - family-names: Subramanian given-names: Sandeep diff --git a/README.md b/README.md index a7506cab6839..0978af98ac8f 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ [![Project Status: Active -- The project has reached a stable, usable state and is being actively developed.](http://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active) [![Documentation](https://readthedocs.com/projects/nvidia-nemo/badge/?version=main)](https://docs.nvidia.com/nemo/speech/nightly/) -[![CodeQL](https://github.com/nvidia/nemo/actions/workflows/codeql.yml/badge.svg?branch=main&event=push)](https://github.com/nvidia/nemo/actions/workflows/codeql.yml) -[![NeMo core license and license for collections in this repo](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://github.com/NVIDIA/NeMo/blob/master/LICENSE) +[![CodeQL](https://github.com/NVIDIA-NeMo/Speech/actions/workflows/codeql.yml/badge.svg?branch=main&event=push)](https://github.com/NVIDIA-NeMo/Speech/actions/workflows/codeql.yml) +[![NeMo core license and license for collections in this 
repo](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://github.com/NVIDIA-NeMo/Speech/blob/master/LICENSE) [![Release version](https://badge.fury.io/py/nemo-toolkit.svg)](https://badge.fury.io/py/nemo-toolkit) [![Python version](https://img.shields.io/pypi/pyversions/nemo-toolkit.svg)](https://badge.fury.io/py/nemo-toolkit) [![PyPi total downloads](https://static.pepy.tech/personalized-badge/nemo-toolkit?period=total&units=international_system&left_color=grey&right_color=brightgreen&left_text=downloads)](https://pepy.tech/project/nemo-toolkit) @@ -17,7 +17,7 @@ weight checkpoints and demos! > For the latest stable released version, please use [the 26.02 NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo?version=26.02). - 2026-06: [Nemotron-3.5-ASR-Streaming-0.6B](https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b) has been released with 40 languages supported, controllable latency 80ms-1s, and 240-2400 1xH100 concurrent streams. Built on cache-aware Fastconformer architecture. -- 2026-04: [Parakeet-unified-en-0.6b](https://huggingface.co/nvidia/parakeet-unified-en-0.6b) has been released with high-quality offline and streaming (with a minimum latency of 160ms) inference in one model for English language with punctuation and capitalization support. +- 2026-04: [Parakeet-unified-en-0.6b](https://huggingface.co/nvidia/parakeet-unified-en-0.6b) has been released with high-quality offline and streaming (with a minimum latency of 160ms) inference in one model for English language with punctuation and capitalization support. - 2026-03: [Nemotron 3 VoiceChat](https://build.nvidia.com/nvidia/nemotron-voicechat/modelcard) is now released in Early Access. Built on the Nemotron Nano v2 LLM backbone with Nemotron speech and TTS decoder, VoiceChat delivers full-duplex, natural, interruptible conversations with low latency. 
Try out [the demo](https://build.nvidia.com/nvidia/nemotron-voicechat) and apply for [early access](https://developer.nvidia.com/nemotron-voicechat-early-access). - 2026-03: [Nemotron-Speech-Streaming v2603](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b) has been updated. It has been trained on a larger and more diverse corpus, resulting in lower WER across all latency modes. @@ -31,7 +31,7 @@ weight checkpoints and demos! on the latency-accuracy Pareto curve! - 2026-01: MagpieTTS was released. - 2026: This repo has pivoted to focus on audio, speech, and multimodal LLM. For the last NeMo release with support for more - modalities, see [v2.7.0](https://github.com/NVIDIA-NeMo/NeMo/releases/tag/v2.7.0) + modalities, see [v2.7.0](https://github.com/NVIDIA-NeMo/Speech/releases/tag/v2.7.0) - 2025-08: [Parakeet V3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) and [Canary V2](https://huggingface.co/nvidia/canary-1b-v2) have been released with speech recognition and translation support for 25 European languages. @@ -77,7 +77,7 @@ The recommended way to install NeMo Speech is from source with [uv](https://docs ### From source with uv (recommended) ```bash -git clone https://github.com/NVIDIA-NeMo/NeMo.git +git clone https://github.com/NVIDIA-NeMo/Speech.git cd NeMo uv sync --extra all --extra cu13 # CUDA 13.x (recommended) — use --extra cu12 for CUDA 12.x ``` @@ -93,7 +93,7 @@ This installs our supported stack (Python 3.13, PyTorch 2.12, CUDA 13.2) into `. To build the container from source (CUDA 13 / H100+ by default): ```bash -git clone https://github.com/NVIDIA-NeMo/NeMo.git +git clone https://github.com/NVIDIA-NeMo/Speech.git cd NeMo docker buildx build -f docker/Dockerfile -t nemo-speech . # CUDA 13 / H100+ (default) docker run --rm -it --gpus all -v "$PWD:/workspace" nemo-speech bash @@ -121,8 +121,8 @@ pip install 'nemo-toolkit[asr,tts,cu12]' --extra-index-url https://download.pyto ## Contribute to NeMo We welcome community contributions! 
Please refer to -[CONTRIBUTING.md](https://github.com/NVIDIA-NeMo/NeMo/blob/main/CONTRIBUTING.md) for the process. +[CONTRIBUTING.md](https://github.com/NVIDIA-NeMo/Speech/blob/main/CONTRIBUTING.md) for the process. ## Licenses -NeMo is licensed under the [Apache License 2.0](https://github.com/NVIDIA/NeMo?tab=Apache-2.0-1-ov-file). +NeMo is licensed under the [Apache License 2.0](https://github.com/NVIDIA-NeMo/Speech?tab=Apache-2.0-1-ov-file). diff --git a/docs/source/apis.rst b/docs/source/apis.rst index a8e077902bff..38e74339e05b 100644 --- a/docs/source/apis.rst +++ b/docs/source/apis.rst @@ -6,7 +6,7 @@ NeMo APIs You can learn more about the underlying principles of the NeMo codebase in this section. -The `NeMo Toolkit codebase `__ is composed of a `core `__ section which contains the main building blocks of the framework, and various `collections `__ which help you +The `NeMo Toolkit codebase `__ is composed of a `core `__ section which contains the main building blocks of the framework, and various `collections `__ which help you build specialized AI models. You can learn more about aspects of the NeMo "core" by following the links below: diff --git a/docs/source/asr/asr_customization/legacy_language_modeling_and_customization.rst b/docs/source/asr/asr_customization/legacy_language_modeling_and_customization.rst index 5a2265a39757..42ec8fbe9061 100644 --- a/docs/source/asr/asr_customization/legacy_language_modeling_and_customization.rst +++ b/docs/source/asr/asr_customization/legacy_language_modeling_and_customization.rst @@ -7,11 +7,11 @@ N-gram Language Model Fusion In this approach, an N-gram LM is trained on text data, then it is used in fusion with beam search decoding to find the best candidates. The beam search decoders in NeMo support language models trained with KenLM library ( `https://github.com/kpu/kenlm `__). -The beam search decoders and KenLM library are not installed by default in NeMo. 
+The beam search decoders and KenLM library are not installed by default in NeMo. You need to install them to be able to use beam search decoding and N-gram LM. -Please refer to `scripts/asr_language_modeling/ngram_lm/install_beamsearch_decoders.sh `__ +Please refer to `scripts/asr_language_modeling/ngram_lm/install_beamsearch_decoders.sh `__ on how to install them. Alternatively, you can build Docker image -`scripts/installers/Dockerfile.ngramtools `__ with all the necessary dependencies. +`scripts/installers/Dockerfile.ngramtools `__ with all the necessary dependencies. Please, refer to :ref:`train-ngram-lm` for more details on how to train an N-gram LM using KenLM library. @@ -31,7 +31,7 @@ Evaluate by Beam Search Decoding and N-gram LM NeMo's beam search decoders are capable of using the KenLM's N-gram models to find the best candidates. The script to evaluate an ASR model with beam search decoding and N-gram models can be found at -`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py `__. +`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py `__. This script has a large number of possible argument overrides; therefore, it is recommended that you use ``python eval_beamsearch_ngram_ctc.py --help`` to see the full list of arguments. @@ -119,7 +119,7 @@ The width of the beam search (``--beam_width``) specifies the number of top cand and ``pyctcdecode`` via the ``decoding`` subconfig. 
To learn more about evaluating the ASR models with N-gram LM, refer to the tutorial here: Offline ASR Inference with Beam Search and External Language Model Rescoring -`Offline ASR Inference with Beam Search and External Language Model Rescoring `_ +`Offline ASR Inference with Beam Search and External Language Model Rescoring `_ Beam Search Engines ------------------- @@ -215,7 +215,7 @@ Beam Search ngram Decoding for Transducer Models (RNNT and HAT) =============================================================== You can also find a similar script to evaluate an RNNT/HAT model with beam search decoding and N-gram models at: -`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_transducer.py `_ +`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_transducer.py `_ .. code-block:: @@ -244,14 +244,14 @@ Weighted Finite-State Transducers (WFST) are finite-state machines with input an More precisely, WFST decoding is more of a greedy N-depth search with LM. Thus, it is asymptotically worse than conventional beam search decoding algorithms, but faster. -**WARNING** +**WARNING** At the moment, NeMo supports WFST decoding only for CTC models and word-based LMs. To run WFST decoding in NeMo, one needs to provide a NeMo ASR model and either an ARPA LM or a WFST LM (advanced). An ARPA LM can be built from source text with KenLM as follows: ``/lmplz -o --arpa --prune ``. The script to evaluate an ASR model with WFST decoding and N-gram models can be found at `scripts/asr_language_modeling/ngram_lm/eval_wfst_decoding_ctc.py -`__. +`__. This script has a large number of possible argument overrides, therefore it is advised to use ``python eval_wfst_decoding_ctc.py --help`` to see the full list of arguments. 
diff --git a/docs/source/asr/asr_customization/neural_rescoring.rst b/docs/source/asr/asr_customization/neural_rescoring.rst index b0c1899ed089..b38265c5b8ff 100644 --- a/docs/source/asr/asr_customization/neural_rescoring.rst +++ b/docs/source/asr/asr_customization/neural_rescoring.rst @@ -9,7 +9,7 @@ When using the neural rescoring approach, a neural network is used to score cand Train Neural Rescorer ===================== -An example script to train such a language model with Transformer can be found at `examples/nlp/language_modeling/transformer_lm.py `__. +An example script to train such a language model with Transformer can be found at `examples/nlp/language_modeling/transformer_lm.py `__. It trains a ``TransformerLMModel`` which can be used as a neural rescorer for an ASR system. For more information on language models training, see LLM/NLP documentation. @@ -21,11 +21,11 @@ Evaluation ========== Given a trained TransformerLMModel `.nemo` file or a pretrained HF model, the script available at -`scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py `__ +`scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py `__ can be used to re-score beams obtained with ASR model. You need the `.tsv` file containing the candidates produced by the acoustic model and the beam search decoding to use this script. The candidates can be the result of just the beam search decoding or the result of fusion with an N-gram LM. You can generate this file by specifying `--preds_output_folder` for -`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py `__. +`scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py `__. The neural rescorer would rescore the beams/candidates by using two parameters of `rescorer_alpha` and `rescorer_beta`, as follows: @@ -40,9 +40,9 @@ Use the following steps to evaluate a neural LM: #. Obtain `.tsv` file with beams and their corresponding scores. 
Scores can be from a regular beam search decoder or in fusion with an N-gram LM scores. For a given beam size `beam_size` and a number of examples for evaluation `num_eval_examples`, it should contain (`num_eval_examples` x `beam_size`) lines of - form `beam_candidate_text \t score`. This file can be generated by `scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py `__ + form `beam_candidate_text \t score`. This file can be generated by `scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py `__ -#. Rescore the candidates by `scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py `__. +#. Rescore the candidates by `scripts/asr_language_modeling/neural_rescorer/eval_neural_rescorer.py `__. .. code-block:: diff --git a/docs/source/asr/asr_customization/ngpulm_language_modeling_and_customization.rst b/docs/source/asr/asr_customization/ngpulm_language_modeling_and_customization.rst index fb8c22b5cee5..a78ea938c192 100644 --- a/docs/source/asr/asr_customization/ngpulm_language_modeling_and_customization.rst +++ b/docs/source/asr/asr_customization/ngpulm_language_modeling_and_customization.rst @@ -4,14 +4,14 @@ NGPU-LM (GPU-based N-gram Language Model) Language Model Fusion *************************************************************** -ASR systems can achieve significantly improved accuracy by leveraging **external language model (LM) shallow fusion** during the decoding process. +ASR systems can achieve significantly improved accuracy by leveraging **external language model (LM) shallow fusion** during the decoding process. This technique integrates knowledge from an external LM without requiring the ASR model itself to be retrained. **How Shallow Fusion Works:** During shallow fusion, the output probabilities generated by the ASR model are combined with those from a separate, external language model. -The final transcription is then determined by selecting the word sequence that yields the highest combined score. 
-These external LMs are typically trained on vast text datasets, allowing them to capture the statistical patterns, syntactic structures, and contextual dependencies of language. +The final transcription is then determined by selecting the word sequence that yields the highest combined score. +These external LMs are typically trained on vast text datasets, allowing them to capture the statistical patterns, syntactic structures, and contextual dependencies of language. This enables them to predict more plausible word sequences, thereby correcting potential errors from the ASR model. **Domain Adaptation Benefits:** @@ -26,20 +26,20 @@ Traditionally, shallow fusion has been performed during **beam search decoding** NGPU-LM ======= -A widely used library for training traditional n-gram language models is KenLM. +A widely used library for training traditional n-gram language models is KenLM. While KenLM (https://github.com/kpu/kenlm) is known for its efficient CPU-based implementation, its reliance on the CPU can limit performance in high-throughput scenarios, especially when dealing with large-scale data. -NGPU-LM on contrast is a GPU-accelerated implementation of a statistical n-gram language model. +NGPU-LM on contrast is a GPU-accelerated implementation of a statistical n-gram language model. It uses a **universal trie-based data structure**, which enables fast, batched queries. For full details, please refer to the paper [ngpulm]_. -This enables shallow fusion during **greedy decoding**, creating a middle ground between standard greedy decoding and full beam search with a language model. +This enables shallow fusion during **greedy decoding**, creating a middle ground between standard greedy decoding and full beam search with a language model. It preserves the speed and simplicity of greedy decoding while regaining much of the accuracy typically achieved with beam search with external LM fusion. 
While not as accurate as full beam search, greedy decoding with NGPU-LM fusion offers a compelling balance between speed and accuracy. -NeMo provides efficient, fully GPU-based beam search implementations for all major ASR model types, +NeMo provides efficient, fully GPU-based beam search implementations for all major ASR model types, allowing **beam decoding to operate with real-time factors (RTFx) close to those of greedy decoding**. At a batch size of 32, the RTFx difference between beam and greedy decoding is only about 20%. -These implementations incorporate NGPU-LM, enabling fast, fully GPU-based decoding and customization. +These implementations incorporate NGPU-LM, enabling fast, fully GPU-based decoding and customization. This enables users to customize decoding while maintaining reasonable speed, even in beam search mode. For full details, please refer to the [beamsearch]_. @@ -48,10 +48,10 @@ NGPU-LM fusion is supported for BPE-based ASR models (CTC, RNNT, TDT, AED) durin Train NGPU-LM ============= -NGPU-LM is built using `.ARPA` files generated by the KenLM library. You can train an n-gram LM using the following script: -`train_kenlm.py `__. +NGPU-LM is built using `.ARPA` files generated by the KenLM library. You can train an n-gram LM using the following script: +`train_kenlm.py `__. -The generated `.ARPA` files can be directly used for GPU-based decoding. +The generated `.ARPA` files can be directly used for GPU-based decoding. However, for faster performance, it is recommended to convert the model to the `.nemo` format by setting the ``save_nemo`` flag to ``true``. .. code-block:: @@ -80,7 +80,7 @@ To run inference with NGPU-LM fusion, the ``ngram_lm_model`` and ``ngram_lm_alph .. note:: - For CTC, RNNT, and TDT models, these fields should be set within the respective ``greedy`` or ``beam`` sub-configurations. + For CTC, RNNT, and TDT models, these fields should be set within the respective ``greedy`` or ``beam`` sub-configurations. 
For AED models running in greedy mode, set the beam size to 1 and specify these fields under the ``beam`` sub-configuration. Examples for different model types are provided below. @@ -208,7 +208,7 @@ Final hypotheses is chosen based on the normalized score ``final_score / seq_len *Blank Scoring in Transducer Models* -Transducer models include a blank symbol (``∅``) for frame transitions, while LMs do not model blanks. +Transducer models include a blank symbol (``∅``) for frame transitions, while LMs do not model blanks. During shallow fusion, the LM is typically applied only to non-blank tokens: .. math:: @@ -219,7 +219,7 @@ During shallow fusion, the LM is typically applied only to non-blank tokens: \ln p[\emptyset], & k = \emptyset \end{cases} -This can lead to excessive blank predictions at higher LM weights, increasing deletion errors. +This can lead to excessive blank predictions at higher LM weights, increasing deletion errors. NeMo supports a blank-aware scoring method that adjusts LM contributions to better balance predictions: .. math:: @@ -327,20 +327,20 @@ You can run NGPU-LM shallow fusion during greedy CTC decoding using the followin References ========== -.. [ngpulm] V. Bataev, A. Andrusenko, L. Grigoryan, A. Laptev, V. Lavrukhin, and B. Ginsburg. - *NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding*. +.. [ngpulm] V. Bataev, A. Andrusenko, L. Grigoryan, A. Laptev, V. Lavrukhin, and B. Ginsburg. + *NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding*. arXiv:2505.22857, 2025. Available at: https://arxiv.org/abs/2505.22857 -.. [beamsearch] L. Grigoryan, V. Bataev, A. Andrusenko, H. Xu, V. Lavrukhin, and B. Ginsburg. - *Pushing the Limits of Beam Search Decoding for Transducer-based ASR Models*. +.. [beamsearch] L. Grigoryan, V. Bataev, A. Andrusenko, H. Xu, V. Lavrukhin, and B. Ginsburg. + *Pushing the Limits of Beam Search Decoding for Transducer-based ASR Models*. 
arXiv:2506.00185, 2025. Available at: https://arxiv.org/abs/2506.00185 -.. [alsd] G. Saon, Z. Tüske, and K. Audhkhasi. - *Alignment-Length Synchronous Decoding for RNN Transducer*. - In: ICASSP 2020 – IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7804–7808, 2020. +.. [alsd] G. Saon, Z. Tüske, and K. Audhkhasi. + *Alignment-Length Synchronous Decoding for RNN Transducer*. + In: ICASSP 2020 – IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7804–7808, 2020. doi: https://doi.org/10.1109/ICASSP40776.2020.9053040 -.. [aes] J. Kim, Y. Lee, and E. Kim. - *Accelerating RNN Transducer Inference via Adaptive Expansion Search*. - IEEE Signal Processing Letters, vol. 27, pp. 2019–2023, 2020. +.. [aes] J. Kim, Y. Lee, and E. Kim. + *Accelerating RNN Transducer Inference via Adaptive Expansion Search*. + IEEE Signal Processing Letters, vol. 27, pp. 2019–2023, 2020. doi: https://doi.org/10.1109/LSP.2020.3036335 diff --git a/docs/source/asr/asr_customization/ngram_utils.rst b/docs/source/asr/asr_customization/ngram_utils.rst index caca1671ac1f..43b9d4a277fc 100644 --- a/docs/source/asr/asr_customization/ngram_utils.rst +++ b/docs/source/asr/asr_customization/ngram_utils.rst @@ -12,15 +12,15 @@ NeMo utilizes the KenLM library (`https://github.com/kpu/kenlm`) for building ef .. note:: - KenLM is not installed by default in NeMo. - Please see the installation instructions in the script: - `scripts/asr_language_modeling/ngram_lm/install_beamsearch_decoders.sh `__. + KenLM is not installed by default in NeMo. + Please see the installation instructions in the script: + `scripts/asr_language_modeling/ngram_lm/install_beamsearch_decoders.sh `__. - Alternatively, you can build a Docker image with all required dependencies using: - `scripts/installers/Dockerfile.ngramtools `__. + Alternatively, you can build a Docker image with all required dependencies using: + `scripts/installers/Dockerfile.ngramtools `__. 
-The script for training an n-gram language model with KenLM is available here: -`scripts/asr_language_modeling/ngram_lm/train_kenlm.py `__. +The script for training an n-gram language model with KenLM is available here: +`scripts/asr_language_modeling/ngram_lm/train_kenlm.py `__. This script supports training n-gram LMs on both character-level and BPE-level encodings, which are automatically detected from the model type. The resulting language models can then be used with beam search decoders integrated on top of ASR models. @@ -78,11 +78,11 @@ It is recommended that you use 6 as the order of the N-gram model for BPE-based Combine N-gram Language Models ============================== -Before combining N-gram LMs, install the required OpenGrm NGram library using `scripts/installers/install_opengrm.sh `__. -Alternatively, you can use Docker image `scripts/installers/Dockerfile.ngramtools `__ with all the necessary dependencies. +Before combining N-gram LMs, install the required OpenGrm NGram library using `scripts/installers/install_opengrm.sh `__. +Alternatively, you can use Docker image `scripts/installers/Dockerfile.ngramtools `__ with all the necessary dependencies. Alternatively, you can use the Docker image at: -`scripts/asr_language_modeling/ngram_lm/ngram_merge.py `__, which includes all the necessary dependencies. +`scripts/asr_language_modeling/ngram_lm/ngram_merge.py `__, which includes all the necessary dependencies. This script interpolates two ARPA N-gram language models and creates a KenLM binary file that can be used with the beam search decoders on top of ASR models. You can specify weights (`--alpha` and `--beta`) for each of the models (`--ngram_a` and `--ngram_b`) correspondingly: `alpha` * `ngram_a` + `beta` * `ngram_b`. 
diff --git a/docs/source/asr/asr_customization/word_boosting.rst b/docs/source/asr/asr_customization/word_boosting.rst index 62f3d2b45a47..5c1ab504d7f3 100644 --- a/docs/source/asr/asr_customization/word_boosting.rst +++ b/docs/source/asr/asr_customization/word_boosting.rst @@ -297,22 +297,22 @@ Context-biasing candidates obtained by CTC-WS are also filtered by the scores wi Scheme of the CTC-WS method: -.. image:: https://github.com/NVIDIA/NeMo/releases/download/v1.22.0/asset-post-v1.22.0-ctcws_scheme_1.png +.. image:: https://github.com/NVIDIA-NeMo/Speech/releases/download/v1.22.0/asset-post-v1.22.0-ctcws_scheme_1.png :align: center :alt: CTC-WS scheme :width: 80% High-level overview of the context-biasing words replacement with CTC-WS method: -.. image:: https://github.com/NVIDIA/NeMo/releases/download/v1.22.0/asset-post-v1.22.0-ctcws_scheme_2.png +.. image:: https://github.com/NVIDIA-NeMo/Speech/releases/download/v1.22.0/asset-post-v1.22.0-ctcws_scheme_2.png :align: center :alt: CTC-WS high level overview :width: 80% -More details about CTC-WS context-biasing can be found in the `tutorial `__. +More details about CTC-WS context-biasing can be found in the `tutorial `__. To use CTC-WS context-biasing, you need to create a context-biasing text file that contains words/phrases to be boosted, with its transcriptions (spellings) separated by underscore. -Multiple transcriptions can be useful for abbreviations ("gpu" -> "g p u"), compound words ("nvlink" -> "nv link"), +Multiple transcriptions can be useful for abbreviations ("gpu" -> "g p u"), compound words ("nvlink" -> "nv link"), or words with common mistakes in the case of our ASR model ("nvidia" -> "n video"). Example of the context-biasing file: @@ -326,7 +326,7 @@ Example of the context-biasing file: nvlink_nvlink_nv link ray tracing_ray tracing -The main script for CTC-WS context-biasing in NeMo is: +The main script for CTC-WS context-biasing in NeMo is: .. 
code-block:: @@ -346,7 +346,7 @@ The script will run the recognition with all the combinations of the parameters .. code-block:: - # Context-biasing with the CTC-WS method for CTC ASR model + # Context-biasing with the CTC-WS method for CTC ASR model python {NEMO_DIR_PATH}/scripts/asr_context_biasing/eval_greedy_decoding_with_context_biasing.py \ nemo_model_file={ctc_model_name} \ input_manifest={test_nemo_manifest} \ diff --git a/docs/source/asr/configs.rst b/docs/source/asr/configs.rst index 722696879234..680b3a9ded0b 100644 --- a/docs/source/asr/configs.rst +++ b/docs/source/asr/configs.rst @@ -10,7 +10,7 @@ for audio files, parameters for any augmentation being performed, as well as the this page cover each of these in more detail. Example configuration files for all of the NeMo ASR scripts can be found in the -`config directory of the examples `_. +`config directory of the examples `_. .. _asr-configs-dataset-configuration: @@ -223,7 +223,7 @@ BLEU score relies on TorchMetrics' SacreBLEU implementation and supports all Sac * ``"13a"`` - Default WMT tokenizer (mteval-v13a script compatible) * ``"none"`` - No tokenization applied -* ``"intl"`` - International tokenization (mteval-v14 script compatible) +* ``"intl"`` - International tokenization (mteval-v14 script compatible) * ``"char"`` - Character-level tokenization (language-agnostic) * ``"zh"`` - Chinese tokenization (separates Chinese characters, uses 13a for non-Chinese) * ``"ja-mecab"`` - Japanese tokenization using MeCab morphological analyzer @@ -751,7 +751,7 @@ conformer as encoder). Hybrid-Transducer-CTC with Prompt Conditioning Configuration ------------------------------------------------------------ -The :ref:`Hybrid-Transducer-CTC model with prompt conditioning ` +The :ref:`Hybrid-Transducer-CTC model with prompt conditioning ` (``EncDecHybridRNNTCTCBPEModelWithPrompt``) extends the base hybrid model to support prompt-based multilingual ASR/AST. 
**Key Configuration Parameters:** @@ -789,7 +789,7 @@ The model requires training data with prompt annotations when using Lhotse datas prompt_field: "target_lang" # Field name for prompt extraction prompt_dictionary: ${model.model_defaults.prompt_dictionary} num_prompts: ${model.model_defaults.num_prompts} - + validation_ds: use_lhotse: true initialize_prompt_feature: true diff --git a/docs/source/asr/examples/kinyarwanda_asr.rst b/docs/source/asr/examples/kinyarwanda_asr.rst index d7a6588de85d..a0f8a8df10f7 100644 --- a/docs/source/asr/examples/kinyarwanda_asr.rst +++ b/docs/source/asr/examples/kinyarwanda_asr.rst @@ -341,7 +341,7 @@ We used the following script from NeMo toolkit to create `Sentencepiece `_. +Most of the arguments are similar to those explained in the `ASR with Subword Tokenization tutorial `_. The resulting tokenizer is a folder like that: @@ -436,8 +436,8 @@ The CTC model predicts output tokens for each timestep. The outputs are assumed Training scripts and configs ############################ -To train a Conformer-CTC model, we use `speech_to_text_ctc_bpe.py `_ with the default config `conformer_ctc_bpe.yaml `_. -To train a Conformer-Transducer model, we use `speech_to_text_rnnt_bpe.py `_ with the default config `conformer_transducer_bpe.yaml `_. +To train a Conformer-CTC model, we use `speech_to_text_ctc_bpe.py `_ with the default config `conformer_ctc_bpe.yaml `_. +To train a Conformer-Transducer model, we use `speech_to_text_rnnt_bpe.py `_ with the default config `conformer_transducer_bpe.yaml `_. Any options of default config can be overwritten from command line. Usually we should provide the options related to the dataset and tokenizer. 
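The closing context line of this hunk says that any option of the default config can be overwritten from the command line, usually the dataset and tokenizer options. A minimal sketch of how such overrides might be assembled, assuming Hydra-style `key=value` arguments; the helper `build_overrides`, the `model.tokenizer.dir` and validation keys, and all paths are illustrative placeholders rather than values taken from this diff.

```python
def build_overrides(options):
    """Turn a flat dict of dotted config keys into `key=value` arguments."""
    return [f"{key}={value}" for key, value in options.items()]


# Hypothetical dataset and tokenizer options; keys and paths are placeholders.
overrides = build_overrides(
    {
        "model.train_ds.manifest_filepath": "train_manifest.json",
        "model.validation_ds.manifest_filepath": "dev_manifest.json",
        "model.tokenizer.dir": "tokenizer_dir",
    }
)

# The script name comes from the doc text; its location is left unspecified.
command = ["python", "speech_to_text_ctc_bpe.py", *overrides]
```

Keeping the overrides in a dictionary makes it easy to reuse the same dataset settings across the CTC and Transducer training scripts mentioned in the hunk.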
diff --git a/docs/source/asr/featured_community_checkpoints.rst b/docs/source/asr/featured_community_checkpoints.rst index 7c0eb5ad1737..00157657ef32 100644 --- a/docs/source/asr/featured_community_checkpoints.rst +++ b/docs/source/asr/featured_community_checkpoints.rst @@ -46,4 +46,4 @@ For NVIDIA-published checkpoints, see :doc:`./asr_checkpoints` and the `NVIDIA H Submit a Community Checkpoint ----------------------------- -To suggest a checkpoint for this page, open a `GitHub issue `__ with the Hugging Face model link, NeMo base checkpoint, task, languages, evaluation results, and inference framework. +To suggest a checkpoint for this page, open a `GitHub issue `__ with the Hugging Face model link, NeMo base checkpoint, task, languages, evaluation results, and inference framework. diff --git a/docs/source/asr/featured_models.rst b/docs/source/asr/featured_models.rst index 01fea483771c..cd28b5f102b6 100644 --- a/docs/source/asr/featured_models.rst +++ b/docs/source/asr/featured_models.rst @@ -27,7 +27,7 @@ They support ASR in 25 EU languages, speech translation (AST), and punctuation/c * `Canary-Qwen-2.5B `__ — English only, PnC, highest accuracy * `Canary-1B Flash `__ / `180M Flash `__ — Optimized for speed -Canary supports chunked and `streaming inference `__. +Canary supports chunked and `streaming inference `__. .. _Conformer_model: @@ -64,7 +64,7 @@ Cache-aware Streaming Conformer Streaming models trained with limited right context for real-time inference with caching to avoid duplicate computation. Supports three modes: fully causal, regular look-ahead, and chunk-aware look-ahead (recommended). 
-* `Tutorial notebook `_ +* `Tutorial notebook `_ * Simulation script: ``examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py`` * Supports multiple look-aheads with ``att_context_size`` lists @@ -85,7 +85,7 @@ Multitalker Streaming Streaming multi-speaker ASR based on cache-aware FastConformer with speaker kernel injection :cite:`asr-models-wang25y_interspeech`. Deploys one model instance per speaker for robust transcription of overlapped speech. * `Model card `__ -* `Tutorial `_ +* `Tutorial `_ .. _Hybrid-Transducer_CTC_model: diff --git a/docs/source/asr/fine_tuning.rst b/docs/source/asr/fine_tuning.rst index 99e9f215fd12..1f0ed5b77faf 100644 --- a/docs/source/asr/fine_tuning.rst +++ b/docs/source/asr/fine_tuning.rst @@ -22,7 +22,7 @@ If you have a large, diverse dataset and want to train from scratch, see :doc:`C Working with an agent? ---------------------- -Check out our latest ``/nemo-speech-finetune-asr`` `agent skill `_. +Check out our latest ``/nemo-speech-finetune-asr`` `agent skill `_. Fine-Tuning Script @@ -155,7 +155,7 @@ The most important parameters for fine-tuning: - Number of fine-tuning epochs (typically 50-100 for domain adaptation) * - ``model.optim.lr`` - Learning rate (use lower than training from scratch, e.g., 1e-4 to 1e-5) - * - ``model.train_ds.manifest_filepath`` + * - ``model.train_ds.manifest_filepath`` - Path to training manifest (NeMo JSON format) * - ``model.train_ds.batch_size`` - Batch size per GPU diff --git a/docs/source/asr/inference.rst b/docs/source/asr/inference.rst index 9aff6b598630..aa36693a88ff 100644 --- a/docs/source/asr/inference.rst +++ b/docs/source/asr/inference.rst @@ -162,7 +162,7 @@ For audio longer than what fits in memory (especially with Conformer's quadratic **Buffered / chunked inference:** Divide audio into overlapping chunks and merge outputs. Scripts are in -`examples/asr/asr_chunked_inference `_. +`examples/asr/asr_chunked_inference `_. 
**Local attention (recommended for Fast Conformer):** @@ -252,7 +252,7 @@ Streaming Inference NeMo provides a unified streaming-first Pipeline API for real-time ASR under ``nemo.collections.asr.inference``. It supports buffered CTC/RNNT/TDT pipelines (overlapping chunks with any offline model) and cache-aware CTC/RNNT pipelines (processes each frame once using cached activations). -See the `Streaming ASR Pipelines tutorial `_ for a comprehensive walkthrough covering buffered and cache-aware pipelines, per-stream options, EoU detection, word timestamps, per-stream biasing, ITN, and speech translation. +See the `Streaming ASR Pipelines tutorial `_ for a comprehensive walkthrough covering buffered and cache-aware pipelines, per-stream options, EoU detection, word timestamps, per-stream biasing, ITN, and speech translation. See :ref:`cache-aware streaming conformer` for model architecture details. @@ -308,4 +308,4 @@ Execution Flow -------------- When writing custom inference scripts, follow the execution flow diagram at the -`ASR examples README `_. +`ASR examples README `_. diff --git a/docs/source/asr/results.rst b/docs/source/asr/results.rst index 4a4d3a657488..6e5c9426d049 100644 --- a/docs/source/asr/results.rst +++ b/docs/source/asr/results.rst @@ -106,7 +106,7 @@ In order to obtain alignments from CTC or RNNT models (previously called ``logpr .. code-block:: python hyps = model.transcribe(audio=[list of audio files], batch_size=BATCH_SIZE, return_hypotheses=True) - logprobs = hyps[0].alignments + logprobs = hyps[0].alignments ----- @@ -138,13 +138,13 @@ In some cases the audio is too long for standard inference, especially if you're There are two main ways of performing inference on long audio files in NeMo: The first way is to use buffered inference, where the audio is divided into chunks to run on, and the output is merged afterwards. -The relevant scripts for this are contained in `this folder `_. 
+The relevant scripts for this are contained in `this folder `_. The second way, specifically for models with the Conformer/Fast Conformer encoder, is to use local attention, which changes the costs to be linear. You can train Fast Conformer models with Longformer-style (https://arxiv.org/abs/2004.05150) local+global attention using one of the following configs: CTC config at ``/examples/asr/conf/fastconformer/fast-conformer-long_ctc_bpe.yaml`` and transducer config at ``/examples/asr/conf/fastconformer/fast-conformer-long_transducer_bpe.yaml``. You can also convert any model trained with full context attention to local, though this may result in lower WER in some cases. You can switch to local attention when running the -`transcribe `_ or `evaluation `_ +`transcribe `_ or `evaluation `_ scripts in the following way: .. code-block:: python @@ -187,10 +187,10 @@ Multi-task models that use structured prompts require additionl task tokens as i .. code-block:: python from nemo.collections.asr.models import EncDecMultiTaskModel - + # load model canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b-v2') - + # update dcode params decode_cfg = canary_model.cfg.decoding decode_cfg.beam.beam_size = 1 @@ -213,10 +213,10 @@ Here the manifest file should be a json file where each line has the following f "source_lang": "en", # language of the audio input, set `source_lang`==`target_lang` for ASR "target_lang": "en", # language of the text output "pnc": "yes", # whether to have PnC output, choices=['yes', 'no'] - "answer": "na", # set to non-dummy strings to calculate WER/BLEU scores + "answer": "na", # set to non-dummy strings to calculate WER/BLEU scores } -Note that using manifest allows to specify the task configuration for each audio individually. If we want to use the same task configuration for all the audio files, it can be specified in `transcribe` method directly. 
+Note that using manifest allows to specify the task configuration for each audio individually. If we want to use the same task configuration for all the audio files, it can be specified in `transcribe` method directly. .. code-block:: python @@ -232,7 +232,7 @@ Note that using manifest allows to specify the task configuration for each audio Inference on Apple M-Series GPU ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -To perform inference on Apple Mac M-Series GPU (``mps`` PyTorch device), use PyTorch 2.0 or higher (see the `mac-installation ` section). Environment variable ``PYTORCH_ENABLE_MPS_FALLBACK=1`` should be set, since not all operations in PyTorch are currently implemented on ``mps`` device. +To perform inference on Apple Mac M-Series GPU (``mps`` PyTorch device), use PyTorch 2.0 or higher (see the `mac-installation ` section). Environment variable ``PYTORCH_ENABLE_MPS_FALLBACK=1`` should be set, since not all operations in PyTorch are currently implemented on ``mps`` device. If ``allow_mps=true`` flag is passed to ``speech_to_text_eval.py``, the ``mps`` device will be selected automatically. @@ -251,7 +251,7 @@ There are multiple ASR tutorials provided in the Tutorials section. Most of thes Inference Execution Flow Diagram -------------------------------- -When preparing your own inference scripts, please follow the execution flow diagram order for correct inference, found at the `examples directory for ASR collection `_. +When preparing your own inference scripts, please follow the execution flow diagram order for correct inference, found at the `examples directory for ASR collection `_. Automatic Speech Recognition Models @@ -261,7 +261,7 @@ Automatic Speech Recognition Models Speech Recognition ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Below is a list of the high quality ASR models available in NeMo for specific languages. All ASR models can be found in :doc:`ASR Model Checkpoints <./asr_checkpoints>`. 
+Below is a list of the high quality ASR models available in NeMo for specific languages. All ASR models can be found in :doc:`ASR Model Checkpoints <./asr_checkpoints>`. Multilingual Multitask ^^^^^^^^^^^^^^^^^^^^^^ @@ -269,7 +269,7 @@ Multilingual Multitask .. csv-table:: :file: data/benchmark_canary.csv :align: left - :widths: 50,50 + :widths: 50,50 :header-rows: 1 Parakeet @@ -310,10 +310,10 @@ The Hybrid-Transducer-CTC model with prompt conditioning (``EncDecHybridRNNTCTCB .. code-block:: python import nemo.collections.asr as nemo_asr - + # Load the model model = nemo_asr.models.EncDecHybridRNNTCTCBPEModelWithPrompt.restore_from("path/to/model.nemo") - + # Transcribe with language prompts transcriptions = model.transcribe( paths2audio_files=["audio1.wav", "audio2.wav"], diff --git a/docs/source/asr/speaker_diarization/configs.rst b/docs/source/asr/speaker_diarization/configs.rst index d2fd2fcfefe4..c40e3f412a12 100644 --- a/docs/source/asr/speaker_diarization/configs.rst +++ b/docs/source/asr/speaker_diarization/configs.rst @@ -5,8 +5,8 @@ Speaker Diarization Configuration Files For the full configuration files, see the YAML configs on GitHub: - - `sortformer_diarizer_hybrid_loss_4spk-v1.yaml `__ - - `streaming_sortformer_diarizer_4spk-v2.yaml `__ + - `sortformer_diarizer_hybrid_loss_4spk-v1.yaml `__ + - `streaming_sortformer_diarizer_4spk-v2.yaml `__ Hydra Configurations for Sortformer Diarizer Training ----------------------------------------------------- @@ -106,13 +106,13 @@ The Streaming Sortformer config extends the offline config with ``streaming_mode hidden_size: ${model.model_defaults.tf_d_model} num_attention_heads: 8 -See the full YAML configs on GitHub: `Sortformer `__ · `Streaming Sortformer `__ +See the full YAML configs on GitHub: `Sortformer `__ · `Streaming Sortformer `__ Hydra Configurations for (Streaming) Sortformer Diarization Post-processing ----------------------------------------------------------------------------- -Post-processing 
converts the floating point number based Tensor output to time stamp output. While generating the speaker-homogeneous segments, onset and offset threshold, +Post-processing converts the floating point number based Tensor output to time stamp output. While generating the speaker-homogeneous segments, onset and offset threshold, paddings can be considered to render the time stamps that can lead to the lowest diarization error rate (DER). This post-processing can be applied to both offline and streaming Sortformer diarizer. @@ -120,7 +120,7 @@ By default, post-processing is bypassed, and only binarization is performed. If .. code-block:: yaml - parameters: + parameters: onset: 0.64 # Onset threshold for detecting the beginning of a speech segment offset: 0.74 # Offset threshold for detecting the end of a speech segment pad_onset: 0.06 # Adds the specified duration at the beginning of each speech segment @@ -136,7 +136,7 @@ Example configuration files for speaker diarization inference can be found in `` The configurations for all the components of diarization inference are included in a single file named ``diar_infer_.yaml``. Each ``.yaml`` file has a few different sections for the following modules: VAD, Speaker Embedding, Clustering and ASR. -In speaker diarization inference, the datasets provided in manifest format denote the data that you would like to perform speaker diarization on. +In speaker diarization inference, the datasets provided in manifest format denote the data that you would like to perform speaker diarization on. Diarizer Configurations ----------------------- @@ -165,18 +165,18 @@ Parameters for VAD model are provided as in the following Hydra config example. model_path: null # .nemo local model path or pretrained model name or none external_vad_manifest: null # This option is provided to use external vad and provide its speech activity labels for speaker embeddings extraction. 
Only one of model_path or external_vad_manifest should be set - parameters: # Tuned parameters for CH109 (using the 11 multi-speaker sessions as dev set) - window_length_in_sec: 0.15 # Window length in sec for VAD context input + parameters: # Tuned parameters for CH109 (using the 11 multi-speaker sessions as dev set) + window_length_in_sec: 0.15 # Window length in sec for VAD context input shift_length_in_sec: 0.01 # Shift length in sec for generate frame level VAD prediction smoothing: "median" # False or type of smoothing method (eg: median) overlap: 0.875 # Overlap ratio for overlapped mean/median smoothing filter - onset: 0.4 # Onset threshold for detecting the beginning and end of a speech + onset: 0.4 # Onset threshold for detecting the beginning and end of a speech offset: 0.7 # Offset threshold for detecting the end of a speech - pad_onset: 0.05 # Adding durations before each speech segment - pad_offset: -0.1 # Adding durations after each speech segment + pad_onset: 0.05 # Adding durations before each speech segment + pad_offset: -0.1 # Adding durations after each speech segment min_duration_on: 0.2 # Threshold for short speech segment deletion min_duration_off: 0.2 # Threshold for small non_speech deletion - filter_speech_first: True + filter_speech_first: True Configurations for Speaker Embedding in Diarization ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -199,19 +199,19 @@ Configurations for Clustering in Diarization Parameters for clustering algorithm are provided in the following Hydra config example. .. code-block:: yaml - + clustering: parameters: oracle_num_speakers: False # If True, use num of speakers value provided in the manifest file. max_num_speakers: 20 # Max number of speakers for each recording. If oracle_num_speakers is passed, this value is ignored. enhanced_count_thres: 80 # If the number of segments is lower than this number, enhanced speaker counting is activated. 
- max_rp_threshold: 0.25 # Determines the range of p-value search: 0 < p <= max_rp_threshold. - sparse_search_volume: 30 # The higher the number, the more values will be examined with more time. + max_rp_threshold: 0.25 # Determines the range of p-value search: 0 < p <= max_rp_threshold. + sparse_search_volume: 30 # The higher the number, the more values will be examined with more time. Configurations for Diarization with ASR ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -The following configuration needs to be appended under ``diarizer`` to run ASR with diarization to get a transcription with speaker labels. +The following configuration needs to be appended under ``diarizer`` to run ASR with diarization to get a transcription with speaker labels. .. code-block:: yaml @@ -223,13 +223,13 @@ The following configuration needs to be appended under ``diarizer`` to run ASR w asr_batch_size: null # Batch size can be dependent on each ASR model. Default batch sizes are applied if set to null. lenient_overlap_WDER: True # If true, when a word falls into speaker-overlapped regions, consider the word as a correctly diarized word. decoder_delay_in_sec: null # Native decoder delay. null is recommended to use the default values for each ASR model. - word_ts_anchor_offset: null # Offset to set a reference point from the start of the word. Recommended range of values is [-0.05 0.2]. + word_ts_anchor_offset: null # Offset to set a reference point from the start of the word. Recommended range of values is [-0.05 0.2]. word_ts_anchor_pos: "start" # Select which part of the word timestamp we want to use. The options are: 'start', 'end', 'mid'. fix_word_ts_with_VAD: False # Fix the word timestamp using VAD output. You must provide a VAD model to use this feature. colored_text: False # If True, use colored text to distinguish speakers in the output transcript. print_time: True # If True, the start of the end time of each speaker turn is printed in the output transcript. 
break_lines: False # If True, the output transcript breaks the line to fix the line width (default is 90 chars) - + ctc_decoder_parameters: # Optional beam search decoder (pyctcdecode) pretrained_language_model: null # KenLM model file: .arpa model file or .bin binary file. beam_width: 32 diff --git a/docs/source/asr/speaker_diarization/resources.rst b/docs/source/asr/speaker_diarization/resources.rst index 3f391242798e..4464022dd63b 100644 --- a/docs/source/asr/speaker_diarization/resources.rst +++ b/docs/source/asr/speaker_diarization/resources.rst @@ -9,7 +9,7 @@ Resource and Documentation Guide * - Resource - Link * - Tutorials - - `Speaker Tasks Notebooks `__ + - `Speaker Tasks Notebooks `__ * - Models - :doc:`Model Architectures <./models>` * - Datasets diff --git a/docs/source/asr/speaker_recognition/resources.rst b/docs/source/asr/speaker_recognition/resources.rst index 55e83bb598b7..039fa3f46378 100644 --- a/docs/source/asr/speaker_recognition/resources.rst +++ b/docs/source/asr/speaker_recognition/resources.rst @@ -3,7 +3,7 @@ Resource and Documentation Guide -------------------------------- Hands-on speaker recognition tutorial notebooks can be found under -`the speaker recognition tutorials folder `_. This and most other tutorials can be run on Google Colab by specifying the link to the notebooks' GitHub pages on Colab. +`the speaker recognition tutorials folder `_. This and most other tutorials can be run on Google Colab by specifying the link to the notebooks' GitHub pages on Colab. If you are looking for information about a particular SpeakerNet model, or would like to find out more about the model architectures available in the ``nemo_asr`` collection, check out the :doc:`Models <./models>` page. @@ -19,4 +19,4 @@ Documentation for configuration files specific to the ``nemo_asr`` models can be :doc:`Configuration Files <./configs>` page. -For a clear step-by-step tutorial we advise you to refer to the tutorials found in `folder `_. 
+For a clear step-by-step tutorial we advise you to refer to the tutorials found in `folder `_. diff --git a/docs/source/asr/speech_classification/configs.rst b/docs/source/asr/speech_classification/configs.rst index 5223ed420250..1c71911280b8 100644 --- a/docs/source/asr/speech_classification/configs.rst +++ b/docs/source/asr/speech_classification/configs.rst @@ -27,7 +27,7 @@ Any initialization parameters that are accepted for the Dataset class used in yo can be set in the config file. See the :ref:`Datasets ` section of the API for a list of Datasets and their respective parameters. -An example Speech Classification train and validation configuration could look like: +An example Speech Classification train and validation configuration could look like: .. code-block:: yaml @@ -59,7 +59,7 @@ If you would like to use tarred dataset, have a look at :ref:`Datasets Configura Preprocessor Configuration -------------------------- -Preprocessor helps to compute MFCC or mel spectrogram features that are given as inputs to model. +Preprocessor helps to compute MFCC or mel spectrogram features that are given as inputs to model. For details on how to write this section, refer to :ref:`Preprocessor Configuration ` Check config yaml files in ``/examples/asr/conf`` to find the processors been used by speech classification models. @@ -72,7 +72,7 @@ There are a few on-the-fly spectrogram augmentation options for NeMo ASR, which configuration file using the ``augmentor`` and ``spec_augment`` section. For details on how to write this section, refer to the ASR :ref:`Augmentation Configuration ` section. -Check config yaml files in ``/tutorials/asr/conf`` to find the processors been used by speech classification models. +Check config yaml files in ``/tutorials/asr/conf`` to find the processors been used by speech classification models. Model Architecture Configurations @@ -84,13 +84,13 @@ specifying the module to use for each. 
The following sections go into more detail about the specific configurations of each model architecture. -The :ref:`MatchboxNet ` and :ref:`MarbleNet ` models are very similar, and as +The :ref:`MatchboxNet ` and :ref:`MarbleNet ` models are very similar, and as such the components in their configs are very similar as well. Decoder Configurations ------------------------ -After features have been computed from ConvASREncoder, we pass the features to decoder to compute embeddings and then to compute log_probs +After features have been computed from ConvASREncoder, we pass the features to decoder to compute embeddings and then to compute log_probs for training models. .. code-block:: yaml @@ -111,5 +111,5 @@ When preparing your own training or fine-tuning scripts, please follow the execu Depending on the type of model, there may be extra steps that must be performed - -* Speech Classification models - `Examples directory for Classification Models `_ +* Speech Classification models - `Examples directory for Classification Models `_ diff --git a/docs/source/asr/ssl/intro.rst b/docs/source/asr/ssl/intro.rst index 89002711be97..bb5050d0bf85 100644 --- a/docs/source/asr/ssl/intro.rst +++ b/docs/source/asr/ssl/intro.rst @@ -1,26 +1,26 @@ Speech Self-Supervised Learning =============================== -Self-Supervised Learning (SSL) refers to the problem of learning without explicit labels. As -any learning process require feedback, without explit labels, SSL derives supervisory signals from -the data itself. The general ideal of SSL is to predict any hidden part (or property) of the input -from observed part of the input (e.g., filling in the blanks in a sentence or predicting whether +Self-Supervised Learning (SSL) refers to the problem of learning without explicit labels. As +any learning process require feedback, without explit labels, SSL derives supervisory signals from +the data itself. 
The general ideal of SSL is to predict any hidden part (or property) of the input +from observed part of the input (e.g., filling in the blanks in a sentence or predicting whether an image is upright or inverted). -SSL for speech/audio understanding broadly falls into either contrastive or reconstruction -based approaches. In contrastive methods, models learn by distinguishing between true and distractor -tokens (or latents). Examples of contrastive approaches are Contrastive Predictive Coding (CPC), -Masked Language Modeling (MLM) etc. In reconstruction methods, models learn by directly estimating -the missing (intentionally leftout) portions of the input. Masked Reconstruction, Autoregressive +SSL for speech/audio understanding broadly falls into either contrastive or reconstruction +based approaches. In contrastive methods, models learn by distinguishing between true and distractor +tokens (or latents). Examples of contrastive approaches are Contrastive Predictive Coding (CPC), +Masked Language Modeling (MLM) etc. In reconstruction methods, models learn by directly estimating +the missing (intentionally leftout) portions of the input. Masked Reconstruction, Autoregressive Predictive Coding (APC) are few examples. -In the recent past, SSL has been a major benefactor in improving Acoustic Modeling (AM), i.e., the -encoder module of neural ASR models. Here too, majority of SSL effort is focused on improving AM. -While it is common that AM is the focus of SSL in ASR, it can also be utilized in improving other parts of +In the recent past, SSL has been a major benefactor in improving Acoustic Modeling (AM), i.e., the +encoder module of neural ASR models. Here too, majority of SSL effort is focused on improving AM. +While it is common that AM is the focus of SSL in ASR, it can also be utilized in improving other parts of ASR models (e.g., predictor module in transducer based ASR models). -In NeMo, we provide two types of SSL models, `Wav2Vec-BERT `_ and `NEST `_. 
-The training script for them can be found in `https://github.com/NVIDIA/NeMo/tree/main/examples/asr/speech_pretraining`. +In NeMo, we provide two types of SSL models, `Wav2Vec-BERT `_ and `NEST `_. +The training script for them can be found in `https://github.com/NVIDIA-NeMo/Speech/tree/main/examples/asr/speech_pretraining`. The full documentation tree is as follows: diff --git a/docs/source/asr/ssl/resources.rst b/docs/source/asr/ssl/resources.rst index 7b6dc685f7d2..2232316cfa93 100644 --- a/docs/source/asr/ssl/resources.rst +++ b/docs/source/asr/ssl/resources.rst @@ -1,23 +1,23 @@ Resources and Documentation --------------------------- -Refer to `SSL-for-ASR notebook `_ -for a hands-on tutorial. If you are a beginner to NeMo, consider trying out the -`ASR with NeMo `_ -tutorial. This and most other tutorials can be run on Google Colab by specifying the link to the +Refer to `SSL-for-ASR notebook `_ +for a hands-on tutorial. If you are a beginner to NeMo, consider trying out the +`ASR with NeMo `_ +tutorial. This and most other tutorials can be run on Google Colab by specifying the link to the notebooks' GitHub pages on Colab. -If you are looking for information about a particular ASR model, or would like to find out more -about the model architectures available in the ``nemo_asr`` collection, refer to the +If you are looking for information about a particular ASR model, or would like to find out more +about the model architectures available in the ``nemo_asr`` collection, refer to the :doc:`ASR Featured Models <../featured_models>` page. -NeMo includes preprocessing scripts for several common ASR datasets. The :doc:`ASR Datasets <../datasets>` -page contains instructions on running those scripts. It also includes guidance for creating your +NeMo includes preprocessing scripts for several common ASR datasets. The :doc:`ASR Datasets <../datasets>` +page contains instructions on running those scripts. 
It also includes guidance for creating your own NeMo-compatible dataset, if you have your own data. -Information about how to load model checkpoints (either local files or pretrained ones from NGC), -as well as a list of the checkpoints available on NGC are located on the :doc:`Checkpoints <./results>` +Information about how to load model checkpoints (either local files or pretrained ones from NGC), +as well as a list of the checkpoints available on NGC are located on the :doc:`Checkpoints <./results>` page. -Documentation regarding the configuration files specific to the SSL can be found in the +Documentation regarding the configuration files specific to the SSL can be found in the :doc:`Configuration Files <./configs>` page. diff --git a/docs/source/asr/ssl/results.rst b/docs/source/asr/ssl/results.rst index 11095b7bd1e3..ba5bb29705af 100644 --- a/docs/source/asr/ssl/results.rst +++ b/docs/source/asr/ssl/results.rst @@ -1,7 +1,7 @@ Checkpoints =========== -Pre-trained SSL checkpoints available in NeMo need to be further fine-tuned on down-stream task. +Pre-trained SSL checkpoints available in NeMo need to be further fine-tuned on down-stream task. There are two main ways to load pretrained checkpoints in NeMo: * Using the :code:`restore_from()` method to load a local checkpoint file (``.nemo``), or @@ -9,13 +9,13 @@ There are two main ways to load pretrained checkpoints in NeMo: Refer to the following sections for instructions and examples for each. -Note that these instructions are for fine-tuning. To resume an unfinished training experiment, +Note that these instructions are for fine-tuning. To resume an unfinished training experiment, use the Experiment Manager to do so by setting the ``resume_if_exists`` flag to ``True``. Loading Local Checkpoints ------------------------- -NeMo automatically saves checkpoints of a model that is trained in a ``.nemo`` format. 
Alternatively, to manually save the model at any +NeMo automatically saves checkpoints of a model that is trained in a ``.nemo`` format. Alternatively, to manually save the model at any point, issue :code:`model.save_to(.nemo)`. If there is a local ``.nemo`` checkpoint that you'd like to load, use the :code:`restore_from()` method: @@ -30,7 +30,7 @@ Where the model base class is the ASR model class of the original checkpoint, or Loading NGC Pretrained Checkpoints ---------------------------------- -The SSL collection has checkpoints of several models trained on various datasets. These checkpoints are +The SSL collection has checkpoints of several models trained on various datasets. These checkpoints are obtainable via NGC `NeMo Automatic Speech Recognition collection `_. The model cards on NGC contain more information about each of the checkpoints available. @@ -63,8 +63,8 @@ If you would like to programatically list the models available for a particular Loading SSL checkpoint into Down-stream Model --------------------------------------------- -After loading an SSL checkpoint as shown above, it's ``state_dict`` needs to be copied to a -down-stream model for fine-tuning. +After loading an SSL checkpoint as shown above, it's ``state_dict`` needs to be copied to a +down-stream model for fine-tuning. For example, to load a SSL checkpoint for ASR down-stream task using ``EncDecRNNTBPEModel``, run: @@ -79,19 +79,19 @@ For example, to load a SSL checkpoint for ASR down-stream task using ``EncDecRNN # discard ssl model del ssl model -Refer to :doc:`SSL configs <./configs>` to do this automatically via config files. +Refer to :doc:`SSL configs <./configs>` to do this automatically via config files. Fine-tuning on Downstream Datasets ----------------------------------- -After loading SSL checkpoint into down-stream model, refer to multiple ASR tutorials provided in the :ref:`Tutorials ` section. 
+After loading SSL checkpoint into down-stream model, refer to multiple ASR tutorials provided in the :ref:`Tutorials ` section. Most of these tutorials explain how to fine-tune on some dataset as a demonstration. Inference Execution Flow Diagram -------------------------------- -When preparing your own inference scripts after downstream fine-tuning, please follow the execution flow diagram order for correct inference, found at the `examples directory for ASR collection `_. +When preparing your own inference scripts after downstream fine-tuning, please follow the execution flow diagram order for correct inference, found at the `examples directory for ASR collection `_. SSL Models ----------------------------------- diff --git a/docs/source/audio/configs.rst b/docs/source/audio/configs.rst index 734741dfd86a..190157674c8a 100644 --- a/docs/source/audio/configs.rst +++ b/docs/source/audio/configs.rst @@ -6,7 +6,7 @@ For general information about how to set up and run experiments that is common t The model section of the NeMo audio configuration files generally requires information about the dataset(s) being used, parameters for any augmentation being performed, as well as the model architecture specification. Example configuration files for all of the NeMo audio models can be found in the -`config directory of the examples `_. +`config directory of the examples `_. .. _audio-configs-nemo-dataset-configuration: @@ -58,7 +58,7 @@ An example train, validation and test datasets can be configured as follows: num_workers: 4 pin_memory: true -More information about online augmentation can found in the `masking example configuration `_ +More information about online augmentation can found in the `masking example configuration `_ .. _audio-configs-lhotse-dataset-configuration: @@ -103,8 +103,8 @@ An example train dataset in Lhotse CutSet format using online augmentation with rir_path: ??? # path to Lhotse recordings manifest with room impulse response signals noise_path: ??? 
# path to Lhotse cuts manifest with noise signals -A configuration file with Lhotse online augmentation can found in the `online augmentation example configuration `_. -More information about the online augmentation can be found in the `tutorial notebook `_. +A configuration file with Lhotse online augmentation can found in the `online augmentation example configuration `_. +More information about the online augmentation can be found in the `tutorial notebook `_. Lhotse Shar @@ -125,7 +125,7 @@ An example train dataset in Lhotse shar format can be configured as follows: pin_memory: true -A configuration file with Lhotse shar format can found in the `SSL pretraining example configuration `_. +A configuration file with Lhotse shar format can found in the `SSL pretraining example configuration `_. Dataset Reweighting with Temperature @@ -357,7 +357,7 @@ An example of a simple predictive model configuration is shown below: weight_decay: 0.0 -Complete configuration file can found in the `example configuration `_. +Complete configuration file can found in the `example configuration `_. Finetuning Configuration diff --git a/docs/source/audio/datasets.rst b/docs/source/audio/datasets.rst index 4c023961a29e..527b5a44b97e 100644 --- a/docs/source/audio/datasets.rst +++ b/docs/source/audio/datasets.rst @@ -54,7 +54,7 @@ Lhotse dataloading supports the following types of inputs: Converting NeMo manifest to Lhotse ---------------------------------- -A dataset with a manifest in NeMo format can be converted to Lhotse format using the provided `conversion script `_. +A dataset with a manifest in NeMo format can be converted to Lhotse format using the provided `conversion script `_. .. 
code:: shell diff --git a/docs/source/audio/models.rst b/docs/source/audio/models.rst index 51c4bbf634ed..fb11dd99f561 100644 --- a/docs/source/audio/models.rst +++ b/docs/source/audio/models.rst @@ -2,8 +2,8 @@ Models ======= This section provides a brief overview of models that NeMo's audio collection currently supports. -* **Model Recipes** can be accessed through `examples/audio `_. -* **Configuration Files** can be found in the directory of `examples/audio/conf `_. For detailed information about configuration files and how they +* **Model Recipes** can be accessed through `examples/audio `_. +* **Configuration Files** can be found in the directory of `examples/audio/conf `_. For detailed information about configuration files and how they should be structured, please refer to the section :doc:`./configs`. * **Pretrained Model Checkpoints** are available for any users for immediately synthesizing speech or fine-tuning models on your custom datasets. Please follow the section :doc:`./checkpoints` for instructions on how to use those pretrained models. diff --git a/docs/source/audio/resources.rst b/docs/source/audio/resources.rst index 404e29f4de82..76b12ca2e612 100644 --- a/docs/source/audio/resources.rst +++ b/docs/source/audio/resources.rst @@ -1,9 +1,9 @@ Resources and Documentation =========================== -Tutorial notebooks can be found under `the audio tutorials folder `_. If you are just starting with NeMo, consider trying out the tutorials of `NeMo Primer `_ and `NeMo Model `_. These tutorials can be run on Google Colab by specifying the link to the notebooks' GitHub pages on Colab. +Tutorial notebooks can be found under `the audio tutorials folder `_. If you are just starting with NeMo, consider trying out the tutorials of `NeMo Primer `_ and `NeMo Model `_. These tutorials can be run on Google Colab by specifying the link to the notebooks' GitHub pages on Colab. 
-If you are looking for information about a particular model, or would like to find out more about the model architectures available in the directory of `nemo.collections.audio `_, refer to the :doc:`Models <./models>` section. +If you are looking for information about a particular model, or would like to find out more about the model architectures available in the directory of `nemo.collections.audio `_, refer to the :doc:`Models <./models>` section. Information about how to load model checkpoints (either local files or pretrained ones from NGC), as well as a list of the checkpoints available on NGC are located on the :doc:`Checkpoints <./checkpoints>` section. diff --git a/docs/source/broken_links_false_positives.json b/docs/source/broken_links_false_positives.json index 8a12633d9c9e..1d783f725643 100644 --- a/docs/source/broken_links_false_positives.json +++ b/docs/source/broken_links_false_positives.json @@ -11,7 +11,7 @@ "lineno": 113, "status": "broken", "code": 0, - "uri": "https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/modules/hybrid_autoregressive_transducer.py#L39", + "uri": "https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/asr/modules/hybrid_autoregressive_transducer.py#L39", "info": "Anchor 'L39' not found" } { @@ -27,7 +27,7 @@ "lineno": 52, "status": "broken", "code": 0, - "uri": "https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/tp_overlap_configs/userbuffers.py#L64", + "uri": "https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/llm/recipes/tp_overlap_configs/userbuffers.py#L64", "info": "Anchor 'L64' not found" } { @@ -35,7 +35,7 @@ "lineno": 52, "status": "broken", "code": 0, - "uri": "https://github.com/NVIDIA/NeMo/blob/main/nemo/lightning/pytorch/callbacks/megatron_comm_overlap.py#L61", + "uri": "https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/lightning/pytorch/callbacks/megatron_comm_overlap.py#L61", "info": "Anchor 'L61' not found" } { @@ -59,8 +59,8 @@ "lineno": 16, "status": 
"broken", "code": 0, - "uri": "https://github.com/NVIDIA/NeMo/blob/stable/tutorials/nlp/GLUE_Benchmark.ipynb", - "info": "404 Client Error: Not Found for url: https://github.com/NVIDIA/NeMo/blob/stable/tutorials/nlp/GLUE_Benchmark.ipynb" + "uri": "https://github.com/NVIDIA-NeMo/Speech/blob/stable/tutorials/nlp/GLUE_Benchmark.ipynb", + "info": "404 Client Error: Not Found for url: https://github.com/NVIDIA-NeMo/Speech/blob/stable/tutorials/nlp/GLUE_Benchmark.ipynb" } { "filename": "tools/nemo_forced_aligner.rst", diff --git a/docs/source/conf.py b/docs/source/conf.py index be4512b86dbe..817cc0c0ca24 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -245,7 +245,7 @@ "icon_links": [ { "name": "GitHub", - "url": "https://github.com/NVIDIA-NeMo/NeMo", + "url": "https://github.com/NVIDIA-NeMo/Speech", "icon": "fa-brands fa-github", } ], diff --git a/docs/source/core/core.rst b/docs/source/core/core.rst index c3351c281ba6..86944e6faffe 100644 --- a/docs/source/core/core.rst +++ b/docs/source/core/core.rst @@ -6,7 +6,7 @@ Basics NeMo models contain everything needed to train and reproduce conversational AI models: -- neural network architectures +- neural network architectures - datasets/data loaders - data preprocessing/postprocessing - data augmentors @@ -17,14 +17,14 @@ NeMo models contain everything needed to train and reproduce conversational AI m NeMo uses `Hydra `_ for configuring both NeMo models and the PyTorch Lightning Trainer. .. note:: - Every NeMo model has an example configuration file and training script that can be found `here `__. + Every NeMo model has an example configuration file and training script that can be found `here `__. The end result of using NeMo, `Pytorch Lightning `__, and Hydra is that NeMo models all have the same look and feel and are also fully compatible with the PyTorch ecosystem. Pretrained ---------- -NeMo comes with many pretrained models for each of our collections: ASR, TTS, Audio, and SpeechLM2. 
Every pretrained NeMo model can be downloaded +NeMo comes with many pretrained models for each of our collections: ASR, TTS, Audio, and SpeechLM2. Every pretrained NeMo model can be downloaded and used with the ``from_pretrained()`` method. As an example, we can instantiate a Parakeet model with the following: @@ -41,7 +41,7 @@ To see all available pretrained models for a specific NeMo model, use the ``list nemo_asr.models.EncDecCTCModel.list_available_models() -For detailed information on the available pretrained models, refer to the collections documentation: +For detailed information on the available pretrained models, refer to the collections documentation: - :doc:`Automatic Speech Recognition (ASR) <../asr/intro>` - :doc:`Text-to-Speech Synthesis (TTS) <../tts/intro>` @@ -50,7 +50,7 @@ Training -------- NeMo leverages `PyTorch Lightning `__ for model training. PyTorch Lightning lets NeMo decouple the -conversational AI code from the PyTorch training code. This means that NeMo users can focus on their domain (ASR, NLP, TTS) and +conversational AI code from the PyTorch training code. This means that NeMo users can focus on their domain (ASR, NLP, TTS) and build complex AI applications without having to rewrite boilerplate code for PyTorch training. When using PyTorch Lightning, NeMo users can automatically train with: @@ -62,13 +62,13 @@ When using PyTorch Lightning, NeMo users can automatically train with: - early stopping - and more -The two main aspects of the Lightning API are the `LightningModule `_ +The two main aspects of the Lightning API are the `LightningModule `_ and the `Trainer `_. PyTorch Lightning ``LightningModule`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Every NeMo model is a ``LightningModule`` which is an ``nn.module``. This means that NeMo models are compatible with the PyTorch +Every NeMo model is a ``LightningModule`` which is an ``nn.module``. 
This means that NeMo models are compatible with the PyTorch ecosystem and can be plugged into existing PyTorch workflows. Creating a NeMo model is similar to any other PyTorch workflow. We start by initializing our model architecture, then define the forward pass: @@ -168,9 +168,9 @@ While validation logic can be found in ``validation_step``: return {'val_loss': val_loss, 'tp': tp, 'fn': fn, 'fp': fp} PyTorch Lightning then handles all of the boilerplate code needed for training. Virtually any aspect of training can be customized -via PyTorch Lightning `hooks `_, -`Plugins `_, -`callbacks `_, or by overriding `methods `_. +via PyTorch Lightning `hooks `_, +`Plugins `_, +`callbacks `_, or by overriding `methods `_. For more domain-specific information, see: @@ -180,15 +180,15 @@ For more domain-specific information, see: PyTorch Lightning Trainer ~~~~~~~~~~~~~~~~~~~~~~~~~ -Since every NeMo model is a ``LightningModule``, we can automatically take advantage of the PyTorch Lightning ``Trainer``. Every NeMo -`example `_ training script uses the ``Trainer`` object to fit the model. +Since every NeMo model is a ``LightningModule``, we can automatically take advantage of the PyTorch Lightning ``Trainer``. Every NeMo +`example `_ training script uses the ``Trainer`` object to fit the model. First, instantiate the model and trainer, then call ``.fit``: .. code-block:: python - + # We first instantiate the trainer based on the model configuration. - # See the model configuration documentation for details. + # See the model configuration documentation for details. trainer = pl.Trainer(**cfg.trainer) # Then pass the model configuration and trainer object into the NeMo model @@ -200,35 +200,35 @@ First, instantiate the model and trainer, then call ``.fit``: # Or we can run the test loop on test data by calling trainer.test(model=model) -All `trainer flags `_ can be set from from the NeMo configuration. - +All `trainer flags `_ can be set from from the NeMo configuration. 
+ Configuration ------------- -Hydra is an open-source Python framework that simplifies configuration for complex applications that must bring together many different -software libraries. Conversational AI model training is a great example of such an application. To train a conversational AI model, we +Hydra is an open-source Python framework that simplifies configuration for complex applications that must bring together many different +software libraries. Conversational AI model training is a great example of such an application. To train a conversational AI model, we must be able to configure: - neural network architectures -- training and optimization algorithms +- training and optimization algorithms - data pre/post processing - data augmentation - experiment logging/visualization -- model checkpointing +- model checkpointing For an introduction to using Hydra, refer to the `Hydra Tutorials `_. With Hydra, we can configure everything needed for NeMo with three interfaces: -- Command Line (CLI) +- Command Line (CLI) - Configuration Files (YAML) - Dataclasses (Python) YAML ~~~~ -NeMo provides YAML configuration files for all of our `example `_ training scripts. +NeMo provides YAML configuration files for all of our `example `_ training scripts. YAML files make it easy to experiment with different model and training configurations. Every NeMo example YAML has the same underlying configuration structure: @@ -289,11 +289,11 @@ A NeMo configuration file should look similar to the following: CLI ~~~ -With NeMo and Hydra, every aspect of model training can be modified from the command-line. This is extremely helpful for running lots +With NeMo and Hydra, every aspect of model training can be modified from the command-line. This is extremely helpful for running lots of experiments on compute clusters or for quickly testing parameters during development. -All NeMo `examples `_ come with instructions on how to -run the training/inference script from the command-line (e.g. 
see `here `__ +All NeMo `examples `_ come with instructions on how to +run the training/inference script from the command-line (e.g. see `here `__ for an example). With Hydra, arguments are set using the ``=`` operator: @@ -350,10 +350,10 @@ We can specify configuration files using the ``--config-path`` and ``--config-na Dataclasses ~~~~~~~~~~~ -Dataclasses allow NeMo to ship model configurations as part of the NeMo library and also enables pure Python configuration of NeMo models. -With Hydra, dataclasses can be used to create `structured configs `_ for the conversational AI application. +Dataclasses allow NeMo to ship model configurations as part of the NeMo library and also enables pure Python configuration of NeMo models. +With Hydra, dataclasses can be used to create `structured configs `_ for the conversational AI application. -As an example, refer to the code block below for an *Attenion is All You Need* machine translation model. The model configuration can +As an example, refer to the code block below for an *Attenion is All You Need* machine translation model. The model configuration can be instantiated and modified like any Python `Dataclass `_. .. code-block:: Python @@ -375,7 +375,7 @@ be instantiated and modified like any Python `Dataclass `_ has optimizer and scheduler configurations for every NeMo model. +.. note:: `NeMo Examples `_ has optimizer and scheduler configurations for every NeMo model. Optimizers can be configured from the CLI as well: @@ -416,7 +416,7 @@ Optimizers can be configured from the CLI as well: model.optim=adam \ # change the learning rate model.optim.lr=.0004 \ - # modify betas + # modify betas model.optim.betas=[.8, .5] .. _optimizers-label: @@ -448,7 +448,7 @@ Optimizers Optimizer Params ~~~~~~~~~~~~~~~~ -Optimizer params can vary between optimizers but the ``lr`` param is required for all optimizers. 
To see the available params for an +Optimizer params can vary between optimizers but the ``lr`` param is required for all optimizers. To see the available params for an optimizer, we can look at its corresponding dataclass. .. code-block:: python @@ -477,8 +477,8 @@ Learning Rate Schedulers Learning rate schedulers can be optionally configured under the ``optim.sched`` namespace. -``name`` corresponds to the name of the learning rate schedule. To view a list of available schedulers, run: - +``name`` corresponds to the name of the learning rate schedule. To view a list of available schedulers, run: + .. code-block:: Python from nemo.core.optim.lr_scheduler import AVAILABLE_SCHEDULERS @@ -528,7 +528,7 @@ To register a new scheduler to be used with NeMo, run: Save and Restore ---------------- -NeMo models all come with ``.save_to`` and ``.restore_from`` methods. +NeMo models all come with ``.save_to`` and ``.restore_from`` methods. Save ~~~~ @@ -539,7 +539,7 @@ To save a NeMo model, run: model.save_to('/path/to/model.nemo') -Everything needed to use the trained model is packaged and saved in the ``.nemo`` file. For example, in the NLP domain, ``.nemo`` files +Everything needed to use the trained model is packaged and saved in the ``.nemo`` file. For example, in the NLP domain, ``.nemo`` files include the necessary tokenizer models and/or vocabulary files, etc. .. note:: A ``.nemo`` file is simply an archive like any other ``.tar`` file. @@ -554,9 +554,9 @@ To restore a NeMo model, run: # Here, you should usually use the class of the model, or simply use ModelPT.restore_from() for simplicity. model.restore_from('/path/to/model.nemo') -When using the PyTorch Lightning Trainer, a PyTorch Lightning checkpoint is created. These are mainly used within NeMo to auto-resume -training. Since NeMo models are ``LightningModules``, the PyTorch Lightning method ``load_from_checkpoint`` is available. 
Note that -``load_from_checkpoint`` won't necessarily work out-of-the-box for all models as some models require more artifacts than just the +When using the PyTorch Lightning Trainer, a PyTorch Lightning checkpoint is created. These are mainly used within NeMo to auto-resume +training. Since NeMo models are ``LightningModules``, the PyTorch Lightning method ``load_from_checkpoint`` is available. Note that +``load_from_checkpoint`` won't necessarily work out-of-the-box for all models as some models require more artifacts than just the checkpoint to be restored. For these models, the user will have to override ``load_from_checkpoint`` if they want to use it. It's highly recommended to use ``restore_from`` to load NeMo models. @@ -594,7 +594,7 @@ Restoring conversational AI models can be complicated because it requires more t NeMo models can save additional artifacts in the .nemo file by calling ``.register_artifact``. When restoring NeMo models using ``.restore_from`` or ``.from_pretrained``, any artifacts that were registered will be available automatically. -As an example, consider an NLP model that requires a trained tokenizer model. +As an example, consider an NLP model that requires a trained tokenizer model. The tokenizer model file can be automatically added to the .nemo file with the following: .. code-block:: python @@ -606,12 +606,12 @@ The tokenizer model file can be automatically added to the .nemo file with the f verify_src_exists=True), ) -By default, ``.register_artifact`` will always return a path. If the model is being restored from a .nemo file, +By default, ``.register_artifact`` will always return a path. If the model is being restored from a .nemo file, then that path will be to the artifact in the .nemo file. Otherwise, ``.register_artifact`` will return the local path specified by the user. ``config_path`` is the artifact key. It usually corresponds to a model configuration but does not have to. 
The model config that is packaged with the .nemo file will be updated according to the ``config_path`` key. -In the above example, the model config will have +In the above example, the model config will have .. code-block:: YAML @@ -620,7 +620,7 @@ In the above example, the model config will have tokenizer_model: nemo:4978b28103264263a03439aaa6560e5e_tokenizer.model ``src`` is the path to the artifact and the base-name of the path will be used when packaging the artifact in the .nemo file. -Each artifact will have a hash prepended to the basename of ``src`` in the .nemo file. This is to prevent collisions with basenames +Each artifact will have a hash prepended to the basename of ``src`` in the .nemo file. This is to prevent collisions with basenames base-names that are identical (say when there are two or more tokenizers, both called `tokenizer.model`). The resulting .nemo file will then have the following file: @@ -628,7 +628,7 @@ The resulting .nemo file will then have the following file: 4978b28103264263a03439aaa6560e5e_tokenizer.model -If ``verify_src_exists`` is set to ``False``, then the artifact is optional. This means that ``.register_artifact`` will return ``None`` +If ``verify_src_exists`` is set to ``False``, then the artifact is optional. This means that ``.register_artifact`` will return ``None`` if the ``src`` cannot be found. Push to Hugging Face Hub @@ -737,7 +737,7 @@ To register a child model, use the ``register_nemo_submodule`` method of the par -Profiling +Profiling --------- NeMo offers users two options for profiling: Nsys and CUDA memory profiling. These two options allow users diff --git a/docs/source/core/exp_manager.rst b/docs/source/core/exp_manager.rst index 5d9e8a858c9a..a89426d2b8af 100644 --- a/docs/source/core/exp_manager.rst +++ b/docs/source/core/exp_manager.rst @@ -215,7 +215,7 @@ and stability. 
To use EMA, set the following parameters via YAML or :class:`~nem This feature is implemented in the ``StragglerDetectionCallback``, which is disabled by default. - The callback computes normalized GPU performance scores, which are scalar values ranging from 0.0 (worst) to 1.0 (best). + The callback computes normalized GPU performance scores, which are scalar values ranging from 0.0 (worst) to 1.0 (best). A performance score can be interpreted as the ratio of current performance to reference performance. There are two types of performance scores provided by the callback: @@ -228,7 +228,7 @@ and stability. To use EMA, set the following parameters via YAML or :class:`~nem If a GPU performance score drops below the specified threshold, it is identified as a straggler. - To enable straggler detection, add ``create_straggler_detection_callback: True`` under exp_manager in the config YAML file. + To enable straggler detection, add ``create_straggler_detection_callback: True`` under exp_manager in the config YAML file. You might also want to adjust the callback parameters: .. code-block:: yaml @@ -257,21 +257,21 @@ and stability. To use EMA, set the following parameters via YAML or :class:`~nem Fault Tolerance feature is included in the optional NeMo resiliency package. When training Deep Neural Network (DNN models), faults may occur, hindering the progress of the entire training process. - This is particularly common in distributed, multi-node training scenarios, with many nodes and GPUs involved. + This is particularly common in distributed, multi-node training scenarios, with many nodes and GPUs involved. - NeMo incorporates a fault tolerance mechanism to detect training halts. + NeMo incorporates a fault tolerance mechanism to detect training halts. In response, it can terminate a hung workload and, if requested, restart it from the last checkpoint. - Fault tolerance ("FT") relies on a special launcher (``ft_launcher``), which is a modified ``torchrun``. 
- The FT launcher runs background processes called rank monitors. **You need to use ft_launcher to start - your workload if you are using FT**. I.e., `NeMo-Framework-Launcher `_ - can be used to generate SLURM batch scripts with FT support. + Fault tolerance ("FT") relies on a special launcher (``ft_launcher``), which is a modified ``torchrun``. + The FT launcher runs background processes called rank monitors. **You need to use ft_launcher to start + your workload if you are using FT**. I.e., `NeMo-Framework-Launcher `_ + can be used to generate SLURM batch scripts with FT support. Each training process (rank) sends `heartbeats` to its monitor during training and validation steps. If a rank monitor stops receiving `heartbeats`, a training failure is detected. - Fault detection is implemented in the ``FaultToleranceCallback`` and is disabled by default. - To enable it, add a ``create_fault_tolerance_callback: True`` option under ``exp_manager`` in the + Fault detection is implemented in the ``FaultToleranceCallback`` and is disabled by default. + To enable it, add a ``create_fault_tolerance_callback: True`` option under ``exp_manager`` in the config YAML file. Additionally, you can customize FT parameters by adding ``fault_tolerance`` section: .. code-block:: yaml @@ -286,9 +286,9 @@ and stability. To use EMA, set the following parameters via YAML or :class:`~nem Timeouts for fault detection need to be adjusted for a given workload: * ``initial_rank_heartbeat_timeout`` should be long enough to allow for workload initialization. - * ``rank_heartbeat_timeout`` should be at least as long as the longest possible interval between steps. + * ``rank_heartbeat_timeout`` should be at least as long as the longest possible interval between steps. 
- **Importantly, `heartbeats` are not sent during checkpoint loading and saving**, so time for + **Importantly, `heartbeats` are not sent during checkpoint loading and saving**, so time for checkpointing related operations should be taken into account. If ``calculate_timeouts: True``, timeouts will be automatically estimated based on observed intervals. @@ -297,25 +297,25 @@ and stability. To use EMA, set the following parameters via YAML or :class:`~nem training started from scratch, estimated timeouts won't be available during the initial two runs. Estimated timeouts are stored in a separate JSON file. - ``max_subsequent_job_failures`` allows for the automatic continuation of training on a SLURM cluster. - This feature requires SLURM job to be scheduled with ``NeMo-Framework-Launcher``. If ``max_subsequent_job_failures`` - value is `>0` continuation job is prescheduled. It will continue the work until ``max_subsequent_job_failures`` - subsequent jobs failed (SLURM job exit code is `!= 0`) or the training is completed successfully + ``max_subsequent_job_failures`` allows for the automatic continuation of training on a SLURM cluster. + This feature requires SLURM job to be scheduled with ``NeMo-Framework-Launcher``. If ``max_subsequent_job_failures`` + value is `>0` continuation job is prescheduled. It will continue the work until ``max_subsequent_job_failures`` + subsequent jobs failed (SLURM job exit code is `!= 0`) or the training is completed successfully ("end of training" marker file is produced by the ``FaultToleranceCallback``, i.e. due to iters or time limit reached). All FT configuration items summary: * ``workload_check_interval`` (float, default=5.0) Periodic workload check interval [seconds] in the workload monitor. - * ``initial_rank_heartbeat_timeout`` (Optional[float], default=60.0 * 60.0) Timeout [seconds] for the first heartbeat from a rank. 
- * ``rank_heartbeat_timeout`` (Optional[float], default=45.0 * 60.0) Timeout [seconds] for subsequent heartbeats from a rank. - * ``calculate_timeouts`` (bool, default=True) Try to calculate ``rank_heartbeat_timeout`` and ``initial_rank_heartbeat_timeout`` + * ``initial_rank_heartbeat_timeout`` (Optional[float], default=60.0 * 60.0) Timeout [seconds] for the first heartbeat from a rank. + * ``rank_heartbeat_timeout`` (Optional[float], default=45.0 * 60.0) Timeout [seconds] for subsequent heartbeats from a rank. + * ``calculate_timeouts`` (bool, default=True) Try to calculate ``rank_heartbeat_timeout`` and ``initial_rank_heartbeat_timeout`` based on the observed heartbeat intervals. - * ``safety_factor``: (float, default=5.0) When calculating the timeouts, multiply the maximum observed heartbeat interval - by this factor to obtain the timeout estimate. Can be made smaller for stable environments and larger for unstable ones. + * ``safety_factor``: (float, default=5.0) When calculating the timeouts, multiply the maximum observed heartbeat interval + by this factor to obtain the timeout estimate. Can be made smaller for stable environments and larger for unstable ones. * ``rank_termination_signal`` (signal.Signals, default=signal.SIGKILL) Signal used to terminate the rank when failure is detected. * ``log_level`` (str, default='INFO') Log level for the FT client and server(rank monitor). - * ``max_rank_restarts`` (int, default=0) Used by FT launcher. Max number of restarts for a rank. + * ``max_rank_restarts`` (int, default=0) Used by FT launcher. Max number of restarts for a rank. If ``>0`` ranks will be restarted on existing nodes in case of a failure. - * ``max_subsequent_job_failures`` (int, default=0) Used by FT launcher. How many subsequent job failures are allowed until stopping autoresuming. + * ``max_subsequent_job_failures`` (int, default=0) Used by FT launcher. How many subsequent job failures are allowed until stopping autoresuming. 
``0`` means do not auto-resume. * ``additional_ft_launcher_args`` (str, default='') Additional FT launcher params (for advanced use). diff --git a/docs/source/index.rst b/docs/source/index.rst index 7b36badc6731..e1cc98400cf9 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -1,7 +1,7 @@ NVIDIA NeMo Speech Developer Docs ================================= -`NVIDIA NeMo Speech `_ is an open-source toolkit for speech, audio, and multimodal language model research, with a clear path from experimentation to production deployment. +`NVIDIA NeMo Speech `_ is an open-source toolkit for speech, audio, and multimodal language model research, with a clear path from experimentation to production deployment. .. raw:: html @@ -50,7 +50,7 @@ NVIDIA NeMo Speech Developer Docs What is NeMo? -------------- -`NVIDIA NeMo `_ is an open-source toolkit for building, customizing, and deploying speech, audio, and multimodal language models. It provides: +`NVIDIA NeMo `_ is an open-source toolkit for building, customizing, and deploying speech, audio, and multimodal language models. It provides: - **Pretrained models** — production-ready checkpoints on `NGC `__ and `HuggingFace Hub `__ - **Modular architecture** — neural modules you can mix, match, and extend @@ -73,7 +73,7 @@ Get started (install the PyTorch build for your platform first): Trying to finetune a model? --------------------------- -Check out our latest ``/nemo-speech-finetune-asr`` `agent skill `_. +Check out our latest ``/nemo-speech-finetune-asr`` `agent skill `_. .. 
toctree:: diff --git a/docs/source/starthere/install.rst b/docs/source/starthere/install.rst index 5745b25f7980..d6dc17e7a50f 100644 --- a/docs/source/starthere/install.rst +++ b/docs/source/starthere/install.rst @@ -35,7 +35,7 @@ The recommended way to install NeMo Speech is from source with `uv `_ + - `NeMo Fundamentals `_ * - General - Getting Started: Audio translator example - - `Audio translator example `_ + - `Audio translator example `_ * - General - Getting Started: Voice swap example - - `Voice swap example `_ + - `Voice swap example `_ * - General - Getting Started: NeMo Models - - `NeMo Models `_ + - `NeMo Models `_ * - General - Getting Started: NeMo Adapters - - `NeMo Adapters `_ + - `NeMo Adapters `_ * - General - Getting Started: NeMo Models on Hugging Face Hub - - `NeMo Models on HF Hub `_ + - `NeMo Models on HF Hub `_ .. list-table:: **Automatic Speech Recognition (ASR) Tutorials** :widths: 15 30 55 @@ -53,79 +53,79 @@ Tutorial Overview - GitHub URL * - ASR - ASR with NeMo - - `ASR with NeMo `_ + - `ASR with NeMo `_ * - ASR - ASR with Subword Tokenization - - `ASR with Subword Tokenization `_ + - `ASR with Subword Tokenization `_ * - ASR - Offline ASR - - `Offline ASR `_ + - `Offline ASR `_ * - ASR - Online ASR Microphone Cache Aware Streaming - - `Online ASR Microphone Cache Aware Streaming `_ + - `Online ASR Microphone Cache Aware Streaming `_ * - ASR - Online ASR Microphone Buffered Streaming - - `Online ASR Microphone Buffered Streaming `_ + - `Online ASR Microphone Buffered Streaming `_ * - ASR - ASR CTC Language Fine-Tuning - - `ASR CTC Language Fine-Tuning `_ + - `ASR CTC Language Fine-Tuning `_ * - ASR - Intro to Transducers - - `Intro to Transducers `_ + - `Intro to Transducers `_ * - ASR - ASR with Transducers - - `ASR with Transducers `_ + - `ASR with Transducers `_ * - ASR - ASR with Adapters - - `ASR with Adapters `_ + - `ASR with Adapters `_ * - ASR - Speech Commands - - `Speech Commands `_ + - `Speech Commands `_ * - ASR - Online 
Offline Microphone Speech Commands - - `Online Offline Microphone Speech Commands `_ + - `Online Offline Microphone Speech Commands `_ * - ASR - Voice Activity Detection - - `Voice Activity Detection `_ + - `Voice Activity Detection `_ * - ASR - Online Offline Microphone VAD - - `Online Offline Microphone VAD `_ + - `Online Offline Microphone VAD `_ * - ASR - Speaker Recognition and Verification - - `Speaker Recognition and Verification `_ + - `Speaker Recognition and Verification `_ * - ASR - Speaker Diarization Inference - - `Speaker Diarization Inference `_ + - `Speaker Diarization Inference `_ * - ASR - ASR with Speaker Diarization - - `ASR with Speaker Diarization `_ + - `ASR with Speaker Diarization `_ * - ASR - Online Noise Augmentation - - `Online Noise Augmentation `_ + - `Online Noise Augmentation `_ * - ASR - ASR for Telephony Speech - - `ASR for Telephony Speech `_ + - `ASR for Telephony Speech `_ * - ASR - Streaming inference - - `Streaming inference `_ + - `Streaming inference `_ * - ASR - Buffered Transducer inference - - `Buffered Transducer inference `_ + - `Buffered Transducer inference `_ * - ASR - Buffered Transducer inference with LCS Merge - - `Buffered Transducer inference with LCS Merge `_ + - `Buffered Transducer inference with LCS Merge `_ * - ASR - Offline ASR with VAD for CTC models - - `Offline ASR with VAD for CTC models `_ + - `Offline ASR with VAD for CTC models `_ * - ASR - Self-supervised Pre-training for ASR - - `Self-supervised Pre-training for ASR `_ + - `Self-supervised Pre-training for ASR `_ * - ASR - Multi-lingual ASR - - `Multi-lingual ASR `_ + - `Multi-lingual ASR `_ * - ASR - ASR Confidence Estimation - - `ASR Confidence Estimation `_ + - `ASR Confidence Estimation `_ .. 
list-table:: **Text-to-Speech (TTS) Tutorials** :widths: 15 35 50 @@ -136,34 +136,34 @@ Tutorial Overview - GitHub URL * - TTS - Basic and Advanced: NeMo TTS Primer - - `NeMo TTS Primer `_ + - `NeMo TTS Primer `_ * - TTS - Basic and Advanced: TTS Speech/Text Aligner Inference - - `TTS Speech/Text Aligner Inference `_ + - `TTS Speech/Text Aligner Inference `_ * - TTS - Basic and Advanced: FastPitch and MixerTTS Model Training - - `FastPitch and MixerTTS Model Training `_ + - `FastPitch and MixerTTS Model Training `_ * - TTS - Basic and Advanced: FastPitch Finetuning - - `FastPitch Finetuning `_ + - `FastPitch Finetuning `_ * - TTS - Basic and Advanced: FastPitch and HiFiGAN Model Training for German - - `FastPitch and HiFiGAN Model Training for German `_ + - `FastPitch and HiFiGAN Model Training for German `_ * - TTS - Basic and Advanced: Tacotron2 Model Training - - `Tacotron2 Model Training `_ + - `Tacotron2 Model Training `_ * - TTS - Basic and Advanced: FastPitch Duration and Pitch Control - - `FastPitch Duration and Pitch Control `_ + - `FastPitch Duration and Pitch Control `_ * - TTS - Basic and Advanced: FastPitch Speaker Interpolation - - `FastPitch Speaker Interpolation `_ + - `FastPitch Speaker Interpolation `_ * - TTS - Basic and Advanced: TTS Inference and Model Selection - - `TTS Inference and Model Selection `_ + - `TTS Inference and Model Selection `_ * - TTS - Basic and Advanced: TTS Pronunciation Customization - - `TTS Pronunciation Customization `_ + - `TTS Pronunciation Customization `_ .. 
list-table:: **Tools and Utilities** :widths: 15 25 60 @@ -174,11 +174,11 @@ Tutorial Overview - GitHub URL * - Utility Tools - Utility Tools for Speech and Text: NeMo Forced Aligner - - `NeMo Forced Aligner `_ + - `NeMo Forced Aligner `_ * - Utility Tools - Utility Tools for Speech and Text: Speech Data Explorer - - `Speech Data Explorer `_ + - `Speech Data Explorer `_ * - Utility Tools - Utility Tools for Speech and Text: CTC Segmentation - - `CTC Segmentation `_ + - `CTC Segmentation `_ diff --git a/docs/source/tools/asr_evaluator.rst b/docs/source/tools/asr_evaluator.rst index f40c171681d9..7b573fd35bf2 100644 --- a/docs/source/tools/asr_evaluator.rst +++ b/docs/source/tools/asr_evaluator.rst @@ -3,4 +3,4 @@ ASR Evaluator ASR evaluator is a tool for thoroughly evaluating the performance of ASR models and other features such as Voice Activity Detection. -See more details in: https://github.com/NVIDIA/NeMo/tree/stable/tools/asr_evaluator \ No newline at end of file +See more details in: https://github.com/NVIDIA-NeMo/Speech/tree/stable/tools/asr_evaluator \ No newline at end of file diff --git a/docs/source/tools/comparison_tool.rst b/docs/source/tools/comparison_tool.rst index 87f80ca373e9..cf8de6cd44d7 100644 --- a/docs/source/tools/comparison_tool.rst +++ b/docs/source/tools/comparison_tool.rst @@ -1,7 +1,7 @@ Comparison tool for ASR Models ============================== -The Comparison Tool (CT) allows to compare predictions of different ASR models at word accuracy and utterance level. +The Comparison Tool (CT) allows to compare predictions of different ASR models at word accuracy and utterance level. 
+--------------------------------------------------------------------------------------------------------------------------+ | **Comparison tool features:** | @@ -19,7 +19,7 @@ The Comparison Tool (CT) allows to compare predictions of different ASR models a Getting Started --------------- -The Comparison Tool is integrated in NeMo Speech Data Explorer (SDE) that could be found at `NeMo/tools/speech_data_explorer `__. +The Comparison Tool is integrated in NeMo Speech Data Explorer (SDE) that could be found at `NeMo/tools/speech_data_explorer `__. Please install the SDE requirements: @@ -77,7 +77,7 @@ SDE has three pages if `--names_compared` argument is not empty: :align: center :width: 800px :alt: SDE Statistics - + * `Samples` (to allow navigation across the entire dataset and exploration of individual utterances) @@ -122,9 +122,9 @@ If there is a pre-trained ASR model, then the JSON manifest file can be extended .. code-block:: bash python examples/asr/transcribe_speech.py pretrained_name= dataset_manifest= append_pred=False pred_name_postfix= - -More information about transcribe_speech parameters is available in the code: `NeMo/examples/asr/transcribe_speech.py `__. + +More information about transcribe_speech parameters is available in the code: `NeMo/examples/asr/transcribe_speech.py `__. . .. image:: images/scrsh_2.png @@ -164,19 +164,19 @@ At the next field you could choose metric: WER or CER :align: center :width: 800px :alt: Switch mode - -When an utterance level is selected, it is possible to click on a point on the graph, and the corresponding utterance will be automatically selected. -If audio files are available, there will be an option to listen to the audio recording and view its waveform. +When an utterance level is selected, it is possible to click on a point on the graph, and the corresponding utterance will be automatically selected. + +If audio files are available, there will be an option to listen to the audio recording and view its waveform. .. 
image:: images/scr_11.png :align: center :width: 800px :alt: Audio player - + In this mode, filtering is still available as well. **Limitations** -To ensure efficient processing and avoid issues with memory limitations and slow performance, it is recommended to keep the manifests within the limits of 320 hours or around 170,000 utterances. +To ensure efficient processing and avoid issues with memory limitations and slow performance, it is recommended to keep the manifests within the limits of 320 hours or around 170,000 utterances. Exceeding these limits may result in both memory constraints and slower processing. \ No newline at end of file diff --git a/docs/source/tools/ctc_segmentation.rst b/docs/source/tools/ctc_segmentation.rst index 7d0d2ea36283..a2ff60e1a1c2 100644 --- a/docs/source/tools/ctc_segmentation.rst +++ b/docs/source/tools/ctc_segmentation.rst @@ -4,7 +4,7 @@ Dataset Creation Tool Based on CTC-Segmentation This tool provides functionality to align long audio files with the corresponding transcripts and split them into shorter fragments that are suitable for an Automatic Speech Recognition (ASR) model training. -More details could be found in `NeMo/tutorials/tools/CTC_Segmentation_Tutorial.ipynb `__ (can be executed with `Google's Colab `_). +More details could be found in `NeMo/tutorials/tools/CTC_Segmentation_Tutorial.ipynb `__ (can be executed with `Google's Colab `_). The tool is based on the `CTC-Segmentation `__ package and `CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition diff --git a/docs/source/tools/intro.rst b/docs/source/tools/intro.rst index 3a1c8eb376da..c086eaa4a024 100644 --- a/docs/source/tools/intro.rst +++ b/docs/source/tools/intro.rst @@ -4,7 +4,7 @@ Speech AI Tools =============== NeMo provides a set of tools useful for developing Automatic Speech Recognitions (ASR) and Text-to-Speech (TTS) synthesis models: \ -`https://github.com/NVIDIA/NeMo/tree/stable/tools `__ . 
+`https://github.com/NVIDIA-NeMo/Speech/tree/stable/tools `__ . .. toctree:: :maxdepth: 1 diff --git a/docs/source/tools/nemo_forced_aligner.rst b/docs/source/tools/nemo_forced_aligner.rst index 2ad87f0dd33a..705589208eee 100644 --- a/docs/source/tools/nemo_forced_aligner.rst +++ b/docs/source/tools/nemo_forced_aligner.rst @@ -1,11 +1,11 @@ NeMo Forced Aligner (NFA) ========================= -NFA is hosted here: https://github.com/NVIDIA/NeMo/tree/main/tools/nemo_forced_aligner. +NFA is hosted here: https://github.com/NVIDIA-NeMo/Speech/tree/main/tools/nemo_forced_aligner. -NFA is a tool for generating token-, word- and segment-level timestamps of speech in audio using NeMo's CTC-based Automatic Speech Recognition models. -You can provide your own reference text, or use ASR-generated transcription. +NFA is a tool for generating token-, word- and segment-level timestamps of speech in audio using NeMo's CTC-based Automatic Speech Recognition models. +You can provide your own reference text, or use ASR-generated transcription. You can use NeMo's ASR Model checkpoints out of the box in 14+ languages (see :doc:`ASR Model Checkpoints `), or train your own model. NFA can be used on long audio files of 1+ hours duration (subject to your hardware and the ASR model used). @@ -14,7 +14,7 @@ Demos & Tutorials * HuggingFace Space `demo `__ to quickly try out NFA in various languages. * NFA "how-to" notebook `tutorial `__. -* "How forced alignment works" NeMo blog `tutorial `__. +* "How forced alignment works" NeMo blog `tutorial `__. Quickstart ---------- @@ -30,7 +30,7 @@ Quickstart manifest_filepath= \ output_dir= -.. image:: https://github.com/NVIDIA/NeMo/releases/download/v1.20.0/nfa_run.png +.. image:: https://github.com/NVIDIA-NeMo/Speech/releases/download/v1.20.0/nfa_run.png How do I use NeMo Forced Aligner? 
--------------------------------- @@ -49,12 +49,12 @@ Call the ``align.py`` script, specifying the parameters as follows: * ``manifest_filepath``: The path to the manifest of the data you want to align, containing ``'audio_filepath'`` and ``'text'`` fields. The audio filepaths need to be absolute paths. -* ``output_dir``: The folder where to save the output files (e.g. CTM, ASS) containing the generated alignments and new JSON manifest containing paths to those CTM/ASS files. The CTM file will be called ``/ctm/{tokens,words,segments}/.ctm`` and each line in each file will start with ````. By default, ``utt_id`` will be the stem of the audio_filepath. This can be changed by overriding ``audio_filepath_parts_in_utt_id``. The new JSON manifest will be at ``/_with_ctm_paths.json``. The ASS files will be at ``/ass/{tokens,words}/.ass``. You can adjust which files should be saved by adjusting the parameter ``save_output_file_formats``. +* ``output_dir``: The folder where to save the output files (e.g. CTM, ASS) containing the generated alignments and new JSON manifest containing paths to those CTM/ASS files. The CTM file will be called ``/ctm/{tokens,words,segments}/.ctm`` and each line in each file will start with ````. By default, ``utt_id`` will be the stem of the audio_filepath. This can be changed by overriding ``audio_filepath_parts_in_utt_id``. The new JSON manifest will be at ``/_with_ctm_paths.json``. The ASS files will be at ``/ass/{tokens,words}/.ass``. You can adjust which files should be saved by adjusting the parameter ``save_output_file_formats``. Optional parameters: ^^^^^^^^^^^^^^^^^^^^ -* ``align_using_pred_text``: if True, will transcribe the audio using the ASR model (specified by ``pretrained_name`` or ``model_path``) and then use that transcription as the reference text for the forced alignment. The ``"pred_text"`` will be saved in the output JSON manifest at ``/{original manifest name}_with_ctm_paths.json``. 
To avoid over-writing other transcribed texts, if there are already ``"pred_text"`` entries in the original manifest, the program will exit without attempting to generate alignments. (Default: False). +* ``align_using_pred_text``: if True, will transcribe the audio using the ASR model (specified by ``pretrained_name`` or ``model_path``) and then use that transcription as the reference text for the forced alignment. The ``"pred_text"`` will be saved in the output JSON manifest at ``/{original manifest name}_with_ctm_paths.json``. To avoid over-writing other transcribed texts, if there are already ``"pred_text"`` entries in the original manifest, the program will exit without attempting to generate alignments. (Default: False). * ``transcribe_device``: The device that will be used for generating log-probs (i.e. transcribing). If None, NFA will set it to 'cuda' if it is available (otherwise will set it to 'cpu'). If specified ``transcribe_device`` needs to be a string that can be input to the ``torch.device()`` method. (Default: ``None``). @@ -68,7 +68,7 @@ Optional parameters: .. note:: Starting in NeMo 2.5.0, separators are preserved in segment text after splitting. if ``additional_segment_grouping_separator="['.', '?', '!', '...']"`` (as is the default), then the text ``"Hi, have you updated your NeMo? Yes. Sure!"`` will result in the following segments ``["Hi, have you updated your NeMo?", "Yes.", "Sure!"]``. -* ``remove_blank_tokens_from_ctm``: a boolean denoting whether to remove tokens from token-level output CTMs. (Default: False). +* ``remove_blank_tokens_from_ctm``: a boolean denoting whether to remove tokens from token-level output CTMs. (Default: False). * ``audio_filepath_parts_in_utt_id``: This specifies how many of the 'parts' of the audio_filepath we will use (starting from the final part of the audio_filepath) to determine the utt_id that will be used in the CTM files. (Default: 1, i.e. utt_id will be the stem of the basename of audio_filepath). 
Note also that any spaces that are present in the audio_filepath will be replaced with dashes, so as not to change the number of space-separated elements in the CTM files. @@ -163,9 +163,9 @@ How do I evaluate the alignment accuracy? Ideally you would have some 'true' CTM files to compare with your generated CTM files. With these you could obtain metrics such as the mean (absolute) errors between predicted starts/ends and the 'true' starts/ends of the segments. -Alternatively (or additionally), you can visualize the quality of alignments using tools such as Gecko, which can play your audio file and display the predicted alignments at the same time. The Gecko tool requires you to upload an audio file and at least one CTM file. The Gecko tool can be accessed here: https://gong-io.github.io/gecko/. More information about the Gecko tool can be found on its Github page here: https://github.com/gong-io/gecko. +Alternatively (or additionally), you can visualize the quality of alignments using tools such as Gecko, which can play your audio file and display the predicted alignments at the same time. The Gecko tool requires you to upload an audio file and at least one CTM file. The Gecko tool can be accessed here: https://gong-io.github.io/gecko/. More information about the Gecko tool can be found on its Github page here: https://github.com/gong-io/gecko. -.. note:: +.. note:: The following may help improve your experience viewing the CTMs in Gecko: * setting ``minimum_timestamp_duration`` to a larger number, as Gecko may not display some tokens/words/segments properly if their timestamps are too short. 
diff --git a/docs/source/tools/speech_data_explorer.rst b/docs/source/tools/speech_data_explorer.rst index ac13f3936746..436726e8b792 100644 --- a/docs/source/tools/speech_data_explorer.rst +++ b/docs/source/tools/speech_data_explorer.rst @@ -20,7 +20,7 @@ Speech Data Explorer (SDE) is a `Dash `__-based web ap Getting Started --------------- -SDE could be found in `NeMo/tools/speech_data_explorer `__. +SDE could be found in `NeMo/tools/speech_data_explorer `__. Please install the SDE requirements: @@ -75,7 +75,7 @@ SDE application has two pages: :align: center :width: 800px :alt: SDE Statistics - + * `Samples` (to allow navigation across the entire dataset and exploration of individual utterances) @@ -83,7 +83,7 @@ SDE application has two pages: :align: center :width: 800px :alt: SDE Statistics - + Plotly Dash Datatable provides core SDE's interactive features (navigation, filtering, and sorting). SDE has two datatables: @@ -94,7 +94,7 @@ SDE has two datatables: :align: center :width: 800px :alt: Vocabulary - + * Data (that visualizes all dataset's utterances on `Samples` page) @@ -102,7 +102,7 @@ SDE has two datatables: :align: center :width: 800px :alt: Data - + Every column of the DataTable has the following interactive features: @@ -112,7 +112,7 @@ Every column of the DataTable has the following interactive features: :align: center :width: 800px :alt: Toggling - + * sorting (by clicking on small triangle icons in the column's header cell): unordered (two triangles point up and down), ascending (a triangle points up), descending (a triangle points down) @@ -120,7 +120,7 @@ Every column of the DataTable has the following interactive features: :align: center :width: 800px :alt: Sorting - + * filtering (by entering a filtering expression in a cell below the header's cell): SDE supports ``<``, ``>``, ``<=``, ``>=``, ``=``, ``!=``, and ``contains`` operators; to match a specific substring, the quoted substring can be used as a filtering expression @@ -128,7 +128,7 @@ 
Every column of the DataTable has the following interactive features: :align: center :width: 800px :alt: Filtering - + Analysis of Speech Datasets @@ -146,14 +146,14 @@ If there is a pre-trained ASR model, then the JSON manifest file can be extended .. code-block:: bash python examples/asr/transcribe_speech.py pretrained_name= dataset_manifest= - -After that it is worth to check words with zero accuracy. + +After that it is worth to check words with zero accuracy. .. image:: images/sde_mls_words.png :align: center :width: 800px :alt: MLS Words - + And then look at high CER utterances. @@ -161,7 +161,7 @@ And then look at high CER utterances. :align: center :width: 800px :alt: MLS CER - + Listening to the audio recording helps to validate the corresponding reference transcript. @@ -169,7 +169,7 @@ Listening to the audio recording helps to validate the corresponding reference t :align: center :width: 800px :alt: MLS Player - + diff --git a/docs/source/tools/speech_data_processor.rst b/docs/source/tools/speech_data_processor.rst index 262b214c6355..7b4c43c4fa2e 100644 --- a/docs/source/tools/speech_data_processor.rst +++ b/docs/source/tools/speech_data_processor.rst @@ -5,6 +5,6 @@ Speech Data Processor (SDP) is a toolkit to make it easy to: 1. write code to process a new dataset, minimizing the amount of boilerplate code required. 2. share the steps for processing a speech dataset. -SDP is hosted here: https://github.com/NVIDIA/NeMo-speech-data-processor. +SDP is hosted here: https://github.com/NVIDIA-NeMo/Speech-speech-data-processor. To learn more about SDP, please check the [documentation](https://nvidia.github.io/NeMo-speech-data-processor/). diff --git a/docs/source/tts/checkpoints.rst b/docs/source/tts/checkpoints.rst index 23337fea2225..dfad77c06a23 100644 --- a/docs/source/tts/checkpoints.rst +++ b/docs/source/tts/checkpoints.rst @@ -110,10 +110,10 @@ NeMo TTS supports both cascaded and end-to-end models to synthesize audios. 
Most Fine-Tuning on Different Datasets ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -There are multiple TTS tutorials provided in the directory of `tutorials/tts/ `_. Most of these tutorials demonstrate how to instantiate a pre-trained model, and prepare the model for fine-tuning on datasets with the same language or different languages, the same speaker or different speakers. +There are multiple TTS tutorials provided in the directory of `tutorials/tts/ `_. Most of these tutorials demonstrate how to instantiate a pre-trained model, and prepare the model for fine-tuning on datasets with the same language or different languages, the same speaker or different speakers. -* **cross-lingual fine-tuning**: https://github.com/NVIDIA/NeMo/tree/stable/tutorials/tts/FastPitch_GermanTTS_Training.ipynb -* **cross-speaker fine-tuning**: https://github.com/NVIDIA/NeMo/tree/stable/tutorials/tts/FastPitch_Finetuning.ipynb +* **cross-lingual fine-tuning**: https://github.com/NVIDIA-NeMo/Speech/tree/stable/tutorials/tts/FastPitch_GermanTTS_Training.ipynb +* **cross-speaker fine-tuning**: https://github.com/NVIDIA-NeMo/Speech/tree/stable/tutorials/tts/FastPitch_Finetuning.ipynb .. _NGC TTS Models: diff --git a/docs/source/tts/configs.rst b/docs/source/tts/configs.rst index 3d784bb767a1..ec02698dbe72 100644 --- a/docs/source/tts/configs.rst +++ b/docs/source/tts/configs.rst @@ -9,7 +9,7 @@ for audio files, parameters for any augmentation being performed, as well as the this page cover each of these in more detail. Example configuration files for all of the NeMo TTS scripts can be found in the -`config directory of the examples `_. +`config directory of the examples `_. Dataset Configuration --------------------- @@ -17,7 +17,7 @@ Dataset Configuration Training, validation, and test parameters are specified using the ``model.train_ds``, ``model.validation_ds``, and ``model.test_ds`` sections in the configuration file, respectively. 
Depending on the task, there may be arguments specifying the sample rate of the audio files, supplementary data such as speech/text alignment priors and speaker IDs, etc., the threshold to trim leading and trailing silence from an audio signal, pitch normalization parameters, and so on. You may also decide to leave fields such as the ``manifest_filepath`` blank, to be specified via the command-line at runtime. Any initialization parameter that is accepted for the class `nemo.collections.tts.data.dataset.TTSDataset -`_ can be set in the config +`_ can be set in the config file. Refer to the `Dataset Processing Classes <./api.html#datasets>`__ section of the API for a list of datasets classes and their respective parameters. An example TTS train and validation configuration should look similar to the following: .. code-block:: yaml @@ -106,7 +106,7 @@ Text normalization (TN) converts text from written form into its verbalized form Tokenizer Configuration ------------------------ -Tokenization converts input text string to a list of integer tokens. It may pad leading and/or trailing whitespaces to a string. NeMo tokenizer supports grapheme-only inputs, phoneme-only inputs, or a mixer of grapheme and phoneme inputs to disambiguate pronunciations of heteronyms for English, German, and Spanish. It also utilizes a grapheme-to-phoneme (G2P) tool to transliterate out-of-vocabulary (OOV) words. Please refer to the :doc:`G2P section <./g2p>` and `TTS tokenizer collection `_ for more details. Note that G2P integration to NeMo TTS tokenizers pipeline is upcoming soon. The following example sets up a ``EnglishPhonemesTokenizer`` with a mixer of grapheme and phoneme inputs where each word shown in the heteronym list is transliterated into graphemes or phonemes by a 50% chance. +Tokenization converts input text string to a list of integer tokens. It may pad leading and/or trailing whitespaces to a string. 
NeMo tokenizer supports grapheme-only inputs, phoneme-only inputs, or a mixer of grapheme and phoneme inputs to disambiguate pronunciations of heteronyms for English, German, and Spanish. It also utilizes a grapheme-to-phoneme (G2P) tool to transliterate out-of-vocabulary (OOV) words. Please refer to the :doc:`G2P section <./g2p>` and `TTS tokenizer collection `_ for more details. Note that G2P integration to NeMo TTS tokenizers pipeline is upcoming soon. The following example sets up a ``EnglishPhonemesTokenizer`` with a mixer of grapheme and phoneme inputs where each word shown in the heteronym list is transliterated into graphemes or phonemes by a 50% chance. .. code-block:: yaml @@ -127,7 +127,7 @@ Tokenization converts input text string to a list of integer tokens. It may pad Model Architecture Configuration -------------------------------- -Each configuration file should describe the model architecture being used for the experiment. Models in the NeMo TTS collection need several module sections with the ``_target_`` field specifying which model architecture or component is used. Please refer to `TTS module collection `_ for details. Below shows an example of FastPitch model architecture, +Each configuration file should describe the model architecture being used for the experiment. Models in the NeMo TTS collection need several module sections with the ``_target_`` field specifying which model architecture or component is used. Please refer to `TTS module collection `_ for details. Below shows an example of FastPitch model architecture, .. code-block:: yaml @@ -192,7 +192,7 @@ Each configuration file should describe the model architecture being used for th Finetuning Configuration -------------------------- -All TTS scripts support easy finetuning by partially/fully loading the pretrained weights from a checkpoint into the **currently instantiated model**. 
Note that the currently instantiated model should have parameters that match the pre-trained checkpoint (such that weights may load properly). In order to directly finetune a pre-existing checkpoint, please follow the tutorial of `Finetuning FastPitch for a new speaker. `_ +All TTS scripts support easy finetuning by partially/fully loading the pretrained weights from a checkpoint into the **currently instantiated model**. Note that the currently instantiated model should have parameters that match the pre-trained checkpoint (such that weights may load properly). In order to directly finetune a pre-existing checkpoint, please follow the tutorial of `Finetuning FastPitch for a new speaker. `_ Pre-trained weights can be provided in multiple ways: @@ -200,7 +200,7 @@ Pre-trained weights can be provided in multiple ways: 2) Providing a name of a pretrained NeMo model (which will be downloaded via the cloud) (via ``init_from_pretrained_model``) 3) Providing a path to a Pytorch Lightning checkpoint file (via ``init_from_ptl_ckpt``) -There are multiple TTS model finetuning scripts in `examples/tts/_finetune.py `_. You can finetune any model by substituting the ```` tag. An example of finetuning a HiFiGAN model is shown below. +There are multiple TTS model finetuning scripts in `examples/tts/_finetune.py `_. You can finetune any model by substituting the ```` tag. An example of finetuning a HiFiGAN model is shown below. Fine-tuning via a NeMo model ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/docs/source/tts/datasets.rst b/docs/source/tts/datasets.rst index a833ab38b843..d5b26f4373d3 100644 --- a/docs/source/tts/datasets.rst +++ b/docs/source/tts/datasets.rst @@ -1,7 +1,7 @@ Data Preprocessing ================== -NeMo TTS recipes support most of public TTS datasets that consist of multiple languages, multiple emotions, and multiple speakers. 
Current recipes covered English (en-US), German (de-DE), Spanish (es-ES), and Mandarin Chinese (zh-CN), while the support for many other languages is under planning. NeMo provides corpus-specific data preprocessing scripts, as shown in the directory of `scripts/data_processing/tts/ `_, to convert common public TTS datasets into the format expected by the dataloaders as defined in `nemo/collections/tts/data/dataset.py `_. The ``nemo_tts`` collection expects each dataset to consist of a set of utterances in individual audio files plus a ``JSON`` manifest that describes the dataset, with information about one utterance per line. We recommend ``WAV`` files as they are the default and have been most thoroughly tested. NeMo supports any original sampling rates of audios, although our scripts of extracting supplementary data and model training all specify the common target sampling rates as either 44100 Hz or 22050 Hz. If the original sampling rate mismatches the target sampling rate, the `feature preprocessing `__ can automatically resample the original sampling rate into the target one. +NeMo TTS recipes support most of public TTS datasets that consist of multiple languages, multiple emotions, and multiple speakers. Current recipes covered English (en-US), German (de-DE), Spanish (es-ES), and Mandarin Chinese (zh-CN), while the support for many other languages is under planning. NeMo provides corpus-specific data preprocessing scripts, as shown in the directory of `scripts/data_processing/tts/ `_, to convert common public TTS datasets into the format expected by the dataloaders as defined in `nemo/collections/tts/data/dataset.py `_. The ``nemo_tts`` collection expects each dataset to consist of a set of utterances in individual audio files plus a ``JSON`` manifest that describes the dataset, with information about one utterance per line. We recommend ``WAV`` files as they are the default and have been most thoroughly tested. 
NeMo supports any original sampling rates of audios, although our scripts of extracting supplementary data and model training all specify the common target sampling rates as either 44100 Hz or 22050 Hz. If the original sampling rate mismatches the target sampling rate, the `feature preprocessing `__ can automatically resample the original sampling rate into the target one. There should be one ``JSON`` manifest file per dataset that will be passed in, therefore, if the user wants separate training and validation datasets, they should also have separate manifests. Otherwise, they will be loading validation data with their training data and vice versa. Each line of the manifest should be in the following format: @@ -76,12 +76,12 @@ This table below summarizes the statistics for a collection of high-quality publ Corpus-Specific Data Preprocessing ---------------------------------- -NeMo implements model-agnostic data preprocessing scripts that wrap up steps of **downloading raw datasets, extracting files, and/or normalizing raw texts, and generating data manifest files**. Most scripts are able to be reused for any datasets with only minor adaptations. Most TTS models work out-of-the-box with the LJSpeech dataset, so it would be straightforward to start adapting your custom script from `LJSpeech script `_. For some models that may require supplementary data for training and validating, such as speech/text alignment prior, pitch, speaker ID, emotion ID, energy, etc, you may need an extra step of **supplementary data extraction** by calling `script/dataset_processing/tts/extract_sup_data.py `_ . The following sub-sections demonstrate detailed instructions for running data preprocessing scripts. +NeMo implements model-agnostic data preprocessing scripts that wrap up steps of **downloading raw datasets, extracting files, and/or normalizing raw texts, and generating data manifest files**. Most scripts are able to be reused for any datasets with only minor adaptations. 
Most TTS models work out-of-the-box with the LJSpeech dataset, so it would be straightforward to start adapting your custom script from `LJSpeech script `_. For some models that may require supplementary data for training and validating, such as speech/text alignment prior, pitch, speaker ID, emotion ID, energy, etc, you may need an extra step of **supplementary data extraction** by calling `script/dataset_processing/tts/extract_sup_data.py `_ . The following sub-sections demonstrate detailed instructions for running data preprocessing scripts. LJSpeech ~~~~~~~~ * Dataset URL: https://keithito.com/LJ-Speech-Dataset/ -* Dataset Processing Script: https://github.com/NVIDIA/NeMo/tree/stable/scripts/dataset_processing/tts/ljspeech/get_data.py +* Dataset Processing Script: https://github.com/NVIDIA-NeMo/Speech/tree/stable/scripts/dataset_processing/tts/ljspeech/get_data.py * Command Line Instruction: .. code-block:: shell-session @@ -99,7 +99,7 @@ LJSpeech LibriTTS ~~~~~~~~ * Dataset URL: https://www.openslr.org/60/ -* Dataset Processing Script: https://github.com/NVIDIA/NeMo/tree/stable/scripts/dataset_processing/tts/libritts/get_data.py +* Dataset Processing Script: https://github.com/NVIDIA-NeMo/Speech/tree/stable/scripts/dataset_processing/tts/libritts/get_data.py * Command Line Instruction: .. code-block:: console @@ -130,7 +130,7 @@ The texts of this dataset has been normalized already. So there is no extra need Thorsten Müller's German Neutral-TTS Datasets ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -There are two German neutral datasets released by Thorsten Müller for now, 21.02 and 22.10, respectively. Version 22.10 has been recorded with a better recording setup, such as recording chamber and better microphone. So it is advised to train models on the 22.10 version because its audio quality is better and it has a way more natural speech flow and higher character rate per second speech. 
The two datasets are described below and defined in `scripts/dataset_processing/tts/thorsten_neutral/get_data.py `_. +There are two German neutral datasets released by Thorsten Müller for now, 21.02 and 22.10, respectively. Version 22.10 has been recorded with a better recording setup, such as recording chamber and better microphone. So it is advised to train models on the 22.10 version because its audio quality is better and it has a way more natural speech flow and higher character rate per second speech. The two datasets are described below and defined in `scripts/dataset_processing/tts/thorsten_neutral/get_data.py `_. .. code-block:: python @@ -149,7 +149,7 @@ There are two German neutral datasets released by Thorsten Müller for now, 21.0 } * Thorsten Müller's German Datasets repo: https://github.com/thorstenMueller/Thorsten-Voice -* Dataset Processing Script: https://github.com/NVIDIA/NeMo/tree/stable/scripts/dataset_processing/tts/thorsten_neutral/get_data.py +* Dataset Processing Script: https://github.com/NVIDIA-NeMo/Speech/tree/stable/scripts/dataset_processing/tts/thorsten_neutral/get_data.py * Command Line Instruction: .. code-block:: bash @@ -184,7 +184,7 @@ There are two German neutral datasets released by Thorsten Müller for now, 21.0 HUI Audio Corpus German ~~~~~~~~~~~~~~~~~~~~~~~ * Dataset URL: https://github.com/iisys-hof/HUI-Audio-Corpus-German -* Dataset Processing Script: https://github.com/NVIDIA/NeMo/tree/stable/scripts/dataset_processing/tts/hui_acg/get_data.py +* Dataset Processing Script: https://github.com/NVIDIA-NeMo/Speech/tree/stable/scripts/dataset_processing/tts/hui_acg/get_data.py * Command Line Instruction: .. 
code-block:: bash @@ -213,8 +213,8 @@ HUI Audio Corpus German SFSpeech Chinese/English Bilingual Speech ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ * Dataset URL: https://catalog.ngc.nvidia.com/orgs/nvidia/resources/sf_bilingual_speech_zh_en -* Dataset Processing Script: https://github.com/NVIDIA/NeMo/tree/stable/scripts/dataset_processing/tts/sfbilingual/get_data.py -* Command Line Instruction: please refer details in Section 1 (NGC Registry CLI installation), Section 2 (Downloading SFSpeech Dataset), and Section 3 (Creatiung Data Manifests) from https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/FastPitch_ChineseTTS_Training.ipynb. Below code block briefly describes the steps. +* Dataset Processing Script: https://github.com/NVIDIA-NeMo/Speech/tree/stable/scripts/dataset_processing/tts/sfbilingual/get_data.py +* Command Line Instruction: please refer details in Section 1 (NGC Registry CLI installation), Section 2 (Downloading SFSpeech Dataset), and Section 3 (Creatiung Data Manifests) from https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/tts/FastPitch_ChineseTTS_Training.ipynb. Below code block briefly describes the steps. .. code-block:: bash diff --git a/docs/source/tts/g2p.rst b/docs/source/tts/g2p.rst index 71f61139290d..65d2eb5c40d1 100644 --- a/docs/source/tts/g2p.rst +++ b/docs/source/tts/g2p.rst @@ -118,7 +118,7 @@ Here is the overall overview of the data labeling pipeline for sentence-level G2 :scale: 70% Here we describe the automatic phoneme-labeling process for generating augmented data. The figure below shows the phoneme-labeling steps to prepare data for sentence-level G2P model training. We first convert known unambiguous words to their phonetic pronunciations with dictionary lookups, e.g. CMU dictionary. -Next, we automatically label heteronyms using an Aligner :cite:`g2p--badlani2022one`. 
More details on how to disambiguate heteronyms with a pretrained Aligner model could be found in `NeMo/tutorials/tts/Aligner_Inference_Examples.ipynb `__ in `Google's Colab `_. +Next, we automatically label heteronyms using an Aligner :cite:`g2p--badlani2022one`. More details on how to disambiguate heteronyms with a pretrained Aligner model could be found in `NeMo/tutorials/tts/Aligner_Inference_Examples.ipynb `__ in `Google's Colab `_. Finally, we mask-out OOV words with a special masking token, “” in the figure below (note, we use `model.tokenizer_grapheme.unk_token="҂"` symbol during G2P model training.) Using this unknown token forces a G2P model to produce the same masking token as a phonetic representation during training. During inference, the model generates phoneme predictions for OOV words without emitting the masking token as long as this token is not included in the grapheme input. diff --git a/docs/source/tts/magpietts-po.rst b/docs/source/tts/magpietts-po.rst index 8f987d784b8f..86c130171dd5 100644 --- a/docs/source/tts/magpietts-po.rst +++ b/docs/source/tts/magpietts-po.rst @@ -280,5 +280,5 @@ See Also ######## - :doc:`magpietts`: Main Magpie-TTS documentation -- `Preference Optimization Source Code `__ +- `Preference Optimization Source Code `__ diff --git a/docs/source/tts/magpietts.rst b/docs/source/tts/magpietts.rst index b79c11ea88ff..a05f02050eaa 100644 --- a/docs/source/tts/magpietts.rst +++ b/docs/source/tts/magpietts.rst @@ -149,11 +149,11 @@ To enable Long-form speech generation (beta) set ``--longform_mode`` to ``auto o Resources ######### -To get started with Magpie-TTS, you can download the pretrained multilingual checkpoint from `Hugging Face `__ and try it out in the interactive `demo space `__. For deeper technical details, refer to [1]_, [2]_, [3]_, and [4]_. The complete source code is available in the `NeMo GitHub repository `__. 
+To get started with Magpie-TTS, you can download the pretrained multilingual checkpoint from `Hugging Face `__ and try it out in the interactive `demo space `__. For deeper technical details, refer to [1]_, [2]_, [3]_, and [4]_. The complete source code is available in the `NeMo GitHub repository `__. Additional documentation on advanced features can be found in the repository: -- `Frame Stacking Guide `__: Detailed explanation of the two-stage decoding architecture +- `Frame Stacking Guide `__: Detailed explanation of the two-stage decoding architecture References ########## diff --git a/docs/source/tts/models.rst b/docs/source/tts/models.rst index 7ff1872ee8d4..c4acd782b732 100644 --- a/docs/source/tts/models.rst +++ b/docs/source/tts/models.rst @@ -2,8 +2,8 @@ Models ======= This section provides a brief overview of TTS models that NeMo's TTS collection currently supports. -* **Model Recipes** can be accessed through `examples/tts/*.py `_. -* **Configuration Files** can be found in the directory of `examples/tts/conf/ `_. For detailed information about TTS configuration files and how they +* **Model Recipes** can be accessed through `examples/tts/*.py `_. +* **Configuration Files** can be found in the directory of `examples/tts/conf/ `_. For detailed information about TTS configuration files and how they should be structured, please refer to the section :doc:`./configs`. * **Pretrained Model Checkpoints** are available for any users for immediately synthesizing speech or fine-tuning models on your custom datasets. Please follow the section :doc:`./checkpoints` for instructions on how to use those pretrained models. diff --git a/docs/source/tts/resources.rst b/docs/source/tts/resources.rst index 9b946ed20dc2..ad36b677b8d0 100644 --- a/docs/source/tts/resources.rst +++ b/docs/source/tts/resources.rst @@ -1,9 +1,9 @@ Resources and Documentation =========================== -Hands-on TTS tutorial notebooks can be found under `the TTS tutorials folder `_. 
If you are a beginner to NeMo, consider trying out the tutorials of `NeMo Primer `_ and `NeMo Model `_. If you are also a beginner to TTS, consider trying out the `NeMo TTS Primer Tutorial `_. These tutorials can be run on Google Colab by specifying the link to the notebooks' GitHub pages on Colab. +Hands-on TTS tutorial notebooks can be found under `the TTS tutorials folder `_. If you are a beginner to NeMo, consider trying out the tutorials of `NeMo Primer `_ and `NeMo Model `_. If you are also a beginner to TTS, consider trying out the `NeMo TTS Primer Tutorial `_. These tutorials can be run on Google Colab by specifying the link to the notebooks' GitHub pages on Colab. -If you are looking for information about a particular TTS model, or would like to find out more about the model architectures available in the directory of `nemo.collections.tts `_, refer to the :doc:`Models <./models>` section. +If you are looking for information about a particular TTS model, or would like to find out more about the model architectures available in the directory of `nemo.collections.tts `_, refer to the :doc:`Models <./models>` section. NeMo includes preprocessing scripts for several common TTS datasets. The :doc:`Data Preprocessing <./datasets>` section contains instructions on how to run those scripts. You can also creating your own NeMo-compatible dataset preprocessing script by following the guidance. diff --git a/examples/asr/asr_chunked_inference/README.md b/examples/asr/asr_chunked_inference/README.md index 939ced44c2e1..d689704d173f 100644 --- a/examples/asr/asr_chunked_inference/README.md +++ b/examples/asr/asr_chunked_inference/README.md @@ -1,6 +1,6 @@ # Streaming / Buffered / Chunked ASR -Contained within this directory are scripts to perform streaming or buffered inference of audio files using Transducer ASR models, and chunked inference for MultitaskAED models (e.g., "nvidia/canary-1b"). 
For CTC models, please refer to the [asr_streaming_inference.py](https://github.com/NVIDIA-NeMo/NeMo/tree/main/examples/asr/asr_streaming_inference) script. +Contained within this directory are scripts to perform streaming or buffered inference of audio files using Transducer ASR models, and chunked inference for MultitaskAED models (e.g., "nvidia/canary-1b"). For CTC models, please refer to the [asr_streaming_inference.py](https://github.com/NVIDIA-NeMo/Speech/tree/main/examples/asr/asr_streaming_inference) script. ## Difference between streaming and buffered ASR diff --git a/examples/asr/asr_eou/README.md b/examples/asr/asr_eou/README.md index 728205c79c74..b4b26174939b 100644 --- a/examples/asr/asr_eou/README.md +++ b/examples/asr/asr_eou/README.md @@ -1,6 +1,6 @@ # Finetuning streming ASR model for integrated end-of-utterance (EOU) detection -This tutorial shows how to finetune a streaming ASR model (e.g., [nvidia/nemotron-speech-streaming-en-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)) for integrated EOU detection (e.g., [nvidia/parakeet_realtime_eou_120m-v1](https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1)). +This tutorial shows how to finetune a streaming ASR model (e.g., [nvidia/nemotron-speech-streaming-en-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b)) for integrated EOU detection (e.g., [nvidia/parakeet_realtime_eou_120m-v1](https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1)). We use [Nemotron-Speech-Streaming-En-0.6b](https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b) as an example of pretrained ASR model. 
@@ -64,7 +64,7 @@ loss: # FastEmit regularization: https://arxiv.org/abs/2010.11148 # You may enable FastEmit to increase the accuracy and reduce the latency of the model for streaming # You may set it to lower values like 1e-3 for models with larger right context - fastemit_lambda: 3e-2 + fastemit_lambda: 3e-2 ``` We also need to change the training, validation and test data paths in the model config file based on how we prepare the EOU labeled dataset illustrated in the next section. @@ -96,7 +96,7 @@ Your original input manifest should contain the fields `audio_filepath`, `text`, ### 2.2 Getting timestamps for end-of-utterance (EOU) -We recommend using forced alignment to get the timestamps for EOU. One way to do this is to use the [Nemo Forced Aligner](https://github.com/NVIDIA/NeMo/tree/main/tools/nemo_forced_aligner) tool. +We recommend using forced alignment to get the timestamps for EOU. One way to do this is to use the [Nemo Forced Aligner](https://github.com/NVIDIA-NeMo/Speech/tree/main/tools/nemo_forced_aligner) tool. ```bash python /tools/nemo_forced_aligner/align_eou.py \ @@ -190,7 +190,7 @@ model: pad_distribution: 'uniform' # distribution of padding duration, 'uniform' or 'normal' normal_mean: 0.5 # mean of normal distribution used when pad_distribution='normal' normal_std: 2.0 # standard deviation of normal distribution used when pad_distribution='normal' - + augmentor: white_noise: prob: 0.9 @@ -228,7 +228,7 @@ TRAIN_INPUT_CFG=/path/to/train_input_config.yaml VAL_MANIFEST=/path/to/val_manifest.json NOISE_MANIFEST=/path/to/noise_manifest.json -PRETRAINED_NEMO=/path/to/nemotron-speech-streaming-en-0.6b.nemo +PRETRAINED_NEMO=/path/to/nemotron-speech-streaming-en-0.6b.nemo BATCH_SIZE=16 NUM_WORKERS=8 @@ -300,5 +300,5 @@ The script will show the WER metrics along with EOU metrics like latency, early ## 5. 
Model deployment with voice agent -Please refer to the [NeMo Voice Agent](https://github.com/NVIDIA-NeMo/NeMo/tree/main/examples/voice_agent/README.md) example for more details on how to deploy the ASR-EOU model with voice agent. +Please refer to the [NeMo Voice Agent](https://github.com/NVIDIA-NeMo/Speech/tree/main/examples/voice_agent/README.md) example for more details on how to deploy the ASR-EOU model with voice agent. diff --git a/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml b/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml index 8d1479dc027e..0eac00c7abff 100644 --- a/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml +++ b/examples/asr/conf/speech_multitask/fast-conformer_aed.yaml @@ -15,7 +15,7 @@ # | canary-1b-flash | 883M | 32 | 4 | 1024 | 1024 | 1024 | # | canary-180m-flash | 182M | 17 | 4 | 1024 | 512 | 1024 | # -# a typical training manifest entry looks like this - +# a typical training manifest entry looks like this - # {"audio_filepath": "/path/to/audio/file.wav", "duration": 16.192, "text": "Text spoken in the audio.", "source_lang": "en", "target_lang": "en", "taskname": "asr", "pnc": "yes"} name: "FastConformer-Transformer-MultiTask" @@ -43,7 +43,7 @@ model: prompt_defaults: null # Sub-config for specifying multiple metrics for multitask modeling. - # Each metric allows a custom bool constraint to determine conditions for evaluation. + # Each metric allows a custom bool constraint to determine conditions for evaluation. # See `asr.collections.metrics.multitask` for further details. multitask_metrics_cfg: log_predictions: true @@ -52,7 +52,7 @@ model: _target_: nemo.collections.asr.metrics.WER constraint: ".source_lang==.target_lang" bleu: - _target_: nemo.collections.asr.metrics.BLEU + _target_: nemo.collections.asr.metrics.BLEU constraint: ".source_lang!=.target_lang" bleu_tokenizer: 13a check_cuts_for_bleu_tokenizers: false # For E.Asian languages. 
If `true`, calculates BLEU with SacreBLEU tokenizer passed by `bleu_tokenizer string' property in `cuts.custom`. @@ -70,9 +70,9 @@ model: shuffle: true num_workers: 8 # To understand the settings below, please refer to Lhotse Dataloading documentation: - # https://github.com/NVIDIA/NeMo/blob/main/docs/source/asr/datasets.rst#lhotse-dataloading + # https://github.com/NVIDIA-NeMo/Speech/blob/main/docs/source/asr/datasets.rst#lhotse-dataloading # You can also check the following configuration dataclass: - # https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/common/data/lhotse/dataloader.py#L36 + # https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/common/data/lhotse/dataloader.py#L36 batch_size: null batch_duration: 360 quadratic_duration: 15 @@ -117,7 +117,7 @@ model: spl_tokens: # special tokens model dir: null # Passed in training script type: bpe - en: # English tokenizer (example, replace with whichever language you would like or add tokenizers to add tokenizer for additional languages) + en: # English tokenizer (example, replace with whichever language you would like or add tokenizers to add tokenizer for additional languages) dir: ??? type: bpe diff --git a/examples/asr/speech_classification/vad_infer.py b/examples/asr/speech_classification/vad_infer.py index 08ebb78d05fc..2be876eb5291 100644 --- a/examples/asr/speech_classification/vad_infer.py +++ b/examples/asr/speech_classification/vad_infer.py @@ -13,15 +13,15 @@ # limitations under the License. """ -During inference, we perform frame-level prediction by two approaches: +During inference, we perform frame-level prediction by two approaches: 1) shift the window of length window_length_in_sec (e.g. 0.63s) by shift_length_in_sec (e.g. 10ms) to generate the frame and use the prediction of the window to represent the label for the frame; [this script demonstrate how to do this approach] - 2) generate predictions with overlapping input segments. 
Then a smoothing filter is applied to decide the label for a frame spanned by multiple segments. + 2) generate predictions with overlapping input segments. Then a smoothing filter is applied to decide the label for a frame spanned by multiple segments. [get frame level prediction by this script and use vad_overlap_posterior.py in NeMo/scripts/voice_activity_detection - One can also find posterior about converting frame level prediction + One can also find posterior about converting frame level prediction to speech/no-speech segment in start and end times format in that script.] - - Image https://raw.githubusercontent.com/NVIDIA/NeMo/main/tutorials/asr/images/vad_post_overlap_diagram.png + + Image https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/main/tutorials/asr/images/vad_post_overlap_diagram.png will help you understand this method. This script will also help you perform postprocessing and generate speech segments if needed diff --git a/examples/asr/speech_pretraining/README.md b/examples/asr/speech_pretraining/README.md index ed7954beaa21..0b6f46a8dfe2 100644 --- a/examples/asr/speech_pretraining/README.md +++ b/examples/asr/speech_pretraining/README.md @@ -1,13 +1,13 @@ # Speech Self-Supervised Learning -This directory contains example scripts to self-supervised speech models. +This directory contains example scripts to self-supervised speech models. 
There are two main types of supported self-supervised learning methods: - [Wav2vec-BERT](https://arxiv.org/abs/2108.06209): `speech_pre_training.py` - [NEST](https://arxiv.org/abs/2408.13106): `masked_token_pred_pretrain.py` - For downstream tasks that use NEST as multi-layer feature extractor, please refer to `./downstream/speech_classification_mfa_train.py` - For extracting multi-layer features from NEST, please refer to `/scripts/ssl/extract_features.py` - - For using NEST as weight initialization for downstream tasks, please refer to the usage of [maybe_init_from_pretrained_checkpoint](https://github.com/NVIDIA/NeMo/blob/main/nemo/core/classes/modelPT.py#L1242). + - For using NEST as weight initialization for downstream tasks, please refer to the usage of [maybe_init_from_pretrained_checkpoint](https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/core/classes/modelPT.py#L1242). For their corresponding usage, please refer to the example yaml config: @@ -24,7 +24,7 @@ The dataset format follows that of ASR models, but no groundtruth transcriptions Please refer to the [ASR dataset documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#preparing-custom-asr-data%60) for more details. 
-For most efficient data loading, please refer to -- [lhotse dataloading](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#lhotse-dataloading) +For most efficient data loading, please refer to +- [lhotse dataloading](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#lhotse-dataloading) - [pre-compute bucket durations](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#pre-computing-bucket-duration-bins) - [optimizing GPU memory usage](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html#pushing-gpu-utilization-to-the-limits-with-bucketing-and-oomptimizer) diff --git a/examples/speaker_tasks/recognition/README.md b/examples/speaker_tasks/recognition/README.md index 5cc60a2f62dc..04aad5e24d80 100644 --- a/examples/speaker_tasks/recognition/README.md +++ b/examples/speaker_tasks/recognition/README.md @@ -29,14 +29,14 @@ For training ecapa_tdnn (channel-attention) model: ```bash python speaker_reco.py --config_path='conf' --config_name='ecapa_tdnn.yaml' ``` -For step by step tutorial see [notebook](https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb). +For step by step tutorial see [notebook](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb). 
### Fine Tuning For fine tuning on a pretrained .nemo speaker recognition model, ```bash python speaker_reco_finetune.py --config_path='conf' --config_name='titanet-finetune.yaml' ``` -for fine tuning tips see this [tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb) +for fine tuning tips see this [tutorial](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb) ## Inference We provide generic scripts for manifest file creation, embedding extraction, Voxceleb evaluation and speaker ID inference. Hence most of the steps would be common and differ slightly based on your end application. diff --git a/examples/speaker_tasks/recognition/speaker_reco.py b/examples/speaker_tasks/recognition/speaker_reco.py index ac5cb12ac836..666ec16042c5 100644 --- a/examples/speaker_tasks/recognition/speaker_reco.py +++ b/examples/speaker_tasks/recognition/speaker_reco.py @@ -37,13 +37,13 @@ exp_manager.name=$EXP_NAME +exp_manager.use_datetime_version=False \ exp_manager.exp_dir='./speaker_exps' -See https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb for notebook tutorial +See https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb for notebook tutorial Optional: Use tarred dataset to speech up data loading. Prepare ONE manifest that contains all training data you would like to include. Validation should use non-tarred dataset. - Note that it's possible that tarred datasets impacts validation scores because it drop values in order to have same amount of files per tarfile; - Scores might be off since some data is missing. - + Note that it's possible that tarred datasets impacts validation scores because it drop values in order to have same amount of files per tarfile; + Scores might be off since some data is missing. 
+ Use the `convert_to_tarred_audio_dataset.py` script under /speech_recognition/scripts in order to prepare tarred audio dataset. For details, please see TarredAudioToClassificationLabelDataset in /nemo/collections/asr/data/audio_to_label.py """ diff --git a/examples/tts/conf/audio_codec/acoustic_codec_16000.yaml b/examples/tts/conf/audio_codec/acoustic_codec_16000.yaml index 8859aa373372..bfc45c557c50 100644 --- a/examples/tts/conf/audio_codec/acoustic_codec_16000.yaml +++ b/examples/tts/conf/audio_codec/acoustic_codec_16000.yaml @@ -8,7 +8,7 @@ batch_size: 16 weighted_sampling_steps_per_epoch: null # Dataset metadata for each manifest -# https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/data/vocoder_dataset.py#L39-L41 +# https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/tts/data/vocoder_dataset.py#L39-L41 train_ds_meta: ??? val_ds_meta: ??? @@ -182,7 +182,7 @@ exp_manager: wandb_logger_kwargs: name: null project: null - create_checkpoint_callback: true + create_checkpoint_callback: true checkpoint_callback_params: monitor: val_loss mode: min diff --git a/examples/tts/conf/audio_codec/audio_codec_16000.yaml b/examples/tts/conf/audio_codec/audio_codec_16000.yaml index 93b44b579655..b512c9111872 100644 --- a/examples/tts/conf/audio_codec/audio_codec_16000.yaml +++ b/examples/tts/conf/audio_codec/audio_codec_16000.yaml @@ -13,7 +13,7 @@ batch_size: 32 weighted_sampling_steps_per_epoch: null # Dataset metadata for each manifest -# https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/data/vocoder_dataset.py#L39-L41 +# https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/tts/data/vocoder_dataset.py#L39-L41 train_ds_meta: ??? val_ds_meta: ??? 
@@ -160,7 +160,7 @@ trainer: log_every_n_steps: 100 check_val_every_n_epoch: 1 benchmark: false - + exp_manager: exp_dir: null diff --git a/examples/tts/conf/audio_codec/audio_codec_22050.yaml b/examples/tts/conf/audio_codec/audio_codec_22050.yaml index c45f2c2a129c..4ba46088d1bc 100644 --- a/examples/tts/conf/audio_codec/audio_codec_22050.yaml +++ b/examples/tts/conf/audio_codec/audio_codec_22050.yaml @@ -12,7 +12,7 @@ batch_size: 16 weighted_sampling_steps_per_epoch: null # Dataset metadata for each manifest -# https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/data/vocoder_dataset.py#L39-L41 +# https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/tts/data/vocoder_dataset.py#L39-L41 train_ds_meta: ??? val_ds_meta: ??? @@ -182,7 +182,7 @@ exp_manager: wandb_logger_kwargs: name: null project: null - create_checkpoint_callback: true + create_checkpoint_callback: true checkpoint_callback_params: monitor: val_loss mode: min diff --git a/examples/tts/conf/audio_codec/audio_codec_24000.yaml b/examples/tts/conf/audio_codec/audio_codec_24000.yaml index cf48db807d25..d65b259db6db 100644 --- a/examples/tts/conf/audio_codec/audio_codec_24000.yaml +++ b/examples/tts/conf/audio_codec/audio_codec_24000.yaml @@ -12,7 +12,7 @@ batch_size: 16 weighted_sampling_steps_per_epoch: null # Dataset metadata for each manifest -# https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/data/vocoder_dataset.py#L39-L41 +# https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/tts/data/vocoder_dataset.py#L39-L41 train_ds_meta: ??? val_ds_meta: ??? 
@@ -164,7 +164,7 @@ exp_manager: wandb_logger_kwargs: name: null project: null - create_checkpoint_callback: true + create_checkpoint_callback: true checkpoint_callback_params: monitor: val_loss mode: min diff --git a/examples/tts/conf/audio_codec/audio_codec_44100.yaml b/examples/tts/conf/audio_codec/audio_codec_44100.yaml index eab13a0e440b..5c32561e00f6 100644 --- a/examples/tts/conf/audio_codec/audio_codec_44100.yaml +++ b/examples/tts/conf/audio_codec/audio_codec_44100.yaml @@ -12,7 +12,7 @@ batch_size: 16 weighted_sampling_steps_per_epoch: null # Dataset metadata for each manifest -# https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/data/vocoder_dataset.py#L39-L41 +# https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/tts/data/vocoder_dataset.py#L39-L41 train_ds_meta: ??? val_ds_meta: ??? @@ -182,7 +182,7 @@ exp_manager: wandb_logger_kwargs: name: null project: null - create_checkpoint_callback: true + create_checkpoint_callback: true checkpoint_callback_params: monitor: val_loss mode: min diff --git a/examples/tts/conf/audio_codec/audio_codec_low_frame_rate_22050.yaml b/examples/tts/conf/audio_codec/audio_codec_low_frame_rate_22050.yaml index 6ecc37953fcb..52c701aa4ef0 100644 --- a/examples/tts/conf/audio_codec/audio_codec_low_frame_rate_22050.yaml +++ b/examples/tts/conf/audio_codec/audio_codec_low_frame_rate_22050.yaml @@ -12,7 +12,7 @@ batch_size: 16 weighted_sampling_steps_per_epoch: null # Dataset metadata for each manifest -# https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/data/vocoder_dataset.py#L39-L41 +# https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/tts/data/vocoder_dataset.py#L39-L41 train_ds_meta: ??? val_ds_meta: ??? 
@@ -151,7 +151,7 @@ model: slm_sr: 16000 input_sr: ${sample_rate} slm_hidden: 768 - slm_layers: 13 + slm_layers: 13 initial_channel: 64 use_spectral_norm: false diff --git a/examples/tts/conf/audio_codec/encodec_24000.yaml b/examples/tts/conf/audio_codec/encodec_24000.yaml index be66fd4b4979..7d8d2e6f0792 100644 --- a/examples/tts/conf/audio_codec/encodec_24000.yaml +++ b/examples/tts/conf/audio_codec/encodec_24000.yaml @@ -12,7 +12,7 @@ batch_size: 16 weighted_sampling_steps_per_epoch: null # Dataset metadata for each manifest -# https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/data/vocoder_dataset.py#L39-L41 +# https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/tts/data/vocoder_dataset.py#L39-L41 train_ds_meta: ??? val_ds_meta: ??? @@ -166,7 +166,7 @@ exp_manager: wandb_logger_kwargs: name: null project: null - create_checkpoint_callback: true + create_checkpoint_callback: true checkpoint_callback_params: monitor: val_loss mode: min diff --git a/examples/tts/conf/audio_codec/mel_codec_22050.yaml b/examples/tts/conf/audio_codec/mel_codec_22050.yaml index df77e7747a51..c103b6a4e3e0 100644 --- a/examples/tts/conf/audio_codec/mel_codec_22050.yaml +++ b/examples/tts/conf/audio_codec/mel_codec_22050.yaml @@ -13,7 +13,7 @@ batch_size: 16 weighted_sampling_steps_per_epoch: null # Dataset metadata for each manifest -# https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/data/vocoder_dataset.py#L39-L41 +# https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/tts/data/vocoder_dataset.py#L39-L41 train_ds_meta: ??? val_ds_meta: ??? 
@@ -183,7 +183,7 @@ exp_manager: wandb_logger_kwargs: name: null project: null - create_checkpoint_callback: true + create_checkpoint_callback: true checkpoint_callback_params: monitor: val_loss mode: min diff --git a/examples/tts/conf/audio_codec/mel_codec_44100.yaml b/examples/tts/conf/audio_codec/mel_codec_44100.yaml index 3ae528df6a64..26f1d5890c1f 100644 --- a/examples/tts/conf/audio_codec/mel_codec_44100.yaml +++ b/examples/tts/conf/audio_codec/mel_codec_44100.yaml @@ -13,7 +13,7 @@ batch_size: 16 weighted_sampling_steps_per_epoch: null # Dataset metadata for each manifest -# https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/data/vocoder_dataset.py#L39-L41 +# https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/tts/data/vocoder_dataset.py#L39-L41 train_ds_meta: ??? val_ds_meta: ??? @@ -183,7 +183,7 @@ exp_manager: wandb_logger_kwargs: name: null project: null - create_checkpoint_callback: true + create_checkpoint_callback: true checkpoint_callback_params: monitor: val_loss mode: min diff --git a/examples/tts/conf/magpietts/magpietts.yaml b/examples/tts/conf/magpietts/magpietts.yaml index 6d0b9a7cd3b7..59872efff9d1 100644 --- a/examples/tts/conf/magpietts/magpietts.yaml +++ b/examples/tts/conf/magpietts/magpietts.yaml @@ -8,7 +8,7 @@ batch_size: 16 weighted_sampling_steps_per_epoch: null # Dataset metadata for each manifest -# See DatasetMeta in https://github.com/NVIDIA-NeMo/NeMo/blob/main/nemo/collections/tts/data/text_to_speech_dataset.py +# See DatasetMeta in https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/tts/data/text_to_speech_dataset.py train_ds_meta: ??? val_ds_meta: ??? 
diff --git a/examples/tts/conf/magpietts/magpietts_po_inference.yaml b/examples/tts/conf/magpietts/magpietts_po_inference.yaml index 27bfee33656d..df45e7fab2c7 100644 --- a/examples/tts/conf/magpietts/magpietts_po_inference.yaml +++ b/examples/tts/conf/magpietts/magpietts_po_inference.yaml @@ -9,7 +9,7 @@ batch_size: 16 weighted_sampling_steps_per_epoch: null # Dataset metadata for each manifest -# See DatasetMeta in https://github.com/NVIDIA-NeMo/NeMo/blob/main/nemo/collections/tts/data/text_to_speech_dataset.py +# See DatasetMeta in https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/tts/data/text_to_speech_dataset.py test_ds_meta: ??? phoneme_dict_path: "scripts/tts_dataset_files/ipa_cmudict-0.7b_nv23.01.txt" diff --git a/examples/voice_agent/README.md b/examples/voice_agent/README.md index 248ddfe22754..30c51678d784 100644 --- a/examples/voice_agent/README.md +++ b/examples/voice_agent/README.md @@ -1,6 +1,6 @@ # NeMo Voice Agent -A fully open-source NVIDIA NeMo Voice Agent example demonstrating a simple way to combine NVIDIA NeMo STT/TTS service and HuggingFace LLM together into a conversational agent. Everything is open-source and deployed locally so you can have your own voice agent. Feel free to explore the code and see how different speech technologies can be integrated with LLMs to create a seamless conversation experience. +A fully open-source NVIDIA NeMo Voice Agent example demonstrating a simple way to combine NVIDIA NeMo STT/TTS service and HuggingFace LLM together into a conversational agent. Everything is open-source and deployed locally so you can have your own voice agent. Feel free to explore the code and see how different speech technologies can be integrated with LLMs to create a seamless conversation experience. As of now, we only support English input and output, but more languages will be supported in the future. 
@@ -25,7 +25,7 @@ As of now, we only support English input and output, but more languages will be ## ✨ Key Features - Open-source, local deployment, and flexible customization. -- Allow users to talk to most LLMs from HuggingFace with configurable prompts. +- Allow users to talk to most LLMs from HuggingFace with configurable prompts. - Streaming speech recognition with low latency and end-of-utterance detection. - Low latency TTS for fast audio response generation. - Speaker diarization up to 4 speakers in different user turns. @@ -88,7 +88,7 @@ Then you can activate the environment via `conda activate nemo-voice`. If you want to just try the default server config, you can skip this step. Edit the `server/server_configs/default.yaml` file to configure the server as needed, for example: -- Changing the LLM and system prompt you want to use in `llm.model` and `llm.system_prompt`, by either putting a local path to a text file or the whole prompt string. See `server/example_prompts/` for examples to start with. +- Changing the LLM and system prompt you want to use in `llm.model` and `llm.system_prompt`, by either putting a local path to a text file or the whole prompt string. See `server/example_prompts/` for examples to start with. - Configure the LLM parameters, such as temperature, max tokens, etc. You may also need to change the HuggingFace or vLLM server parameters, depending on the LLM you are using. Please refer to the LLM's model page for details on the recommended parameters. - If you know whether you want to use vLLM or HuggingFace, you can set `llm.type` to `vllm` or `hf` to force using vLLM or HuggingFace, respectively. Otherwise, it will automatically switch between the two based on the model's support. Please also remember to update the parameters of the chosen backend as well, by referring to the LLM's model page. - Distribute different components to different GPUs if you have more than one. 
@@ -103,7 +103,7 @@ Edit the `server/server_configs/default.yaml` file to configure the server as ne Open a terminal and run the server via: ```bash -NEMO_PATH=??? # Use your local NeMo path with the latest main branch from: https://github.com/NVIDIA-NeMo/NeMo +NEMO_PATH=??? # Use your local NeMo path with the latest main branch from: https://github.com/NVIDIA-NeMo/Speech export PYTHONPATH=$NEMO_PATH:$PYTHONPATH # export HF_TOKEN="hf_..." # Use your own HuggingFace API token if needed, as some models may require. # export HF_HUB_CACHE="/path/to/your/huggingface/cache" # change where HF cache is stored if you don't want to use the default cache @@ -125,9 +125,9 @@ There should be a message in terminal showing the address and port of the client ### Connect to the client via browser -Open the client via browser: `http://[YOUR MACHINE IP ADDRESS]:5173/` (or whatever address and port is shown in the terminal where the client was launched). +Open the client via browser: `http://[YOUR MACHINE IP ADDRESS]:5173/` (or whatever address and port is shown in the terminal where the client was launched). -You can mute/unmute your microphone via the "Mute" button, and reset the LLM context history and speaker cache by clicking the "Reset" button. +You can mute/unmute your microphone via the "Mute" button, and reset the LLM context history and speaker cache by clicking the "Reset" button. **If using chrome browser, you need to add `http://[YOUR MACHINE IP ADDRESS]:5173/` to the allow list via `chrome://flags/#unsafely-treat-insecure-origin-as-secure`.** You may also need to restart the browser for the changes to take effect. @@ -142,7 +142,7 @@ Most LLMs from HuggingFace are supported. A few examples are: - Please use `server/server_configs/llm_configs/nemotron_nano_v2.yaml` as the server config. - Tool calling is enabled for this model. 
- [nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) - - Please use `server/server_configs/llm_configs/nemotron_nano_v3.yaml` as the server config. It needs more than 60GB VRAM to host the model, thus the config by default is set to use tensor parallelism of 2. Expect additional 5GB for kv-cache and other components in the voice agent. To better monitor the vllm status, `start_vllm_on_init` is set to `false`, so that you can manually start the vllm server in another terminal via: + - Please use `server/server_configs/llm_configs/nemotron_nano_v3.yaml` as the server config. It needs more than 60GB VRAM to host the model, thus the config by default is set to use tensor parallelism of 2. Expect additional 5GB for kv-cache and other components in the voice agent. To better monitor the vllm status, `start_vllm_on_init` is set to `false`, so that you can manually start the vllm server in another terminal via: ```bash vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \ --trust-remote-code --max-num-seqs 1 --gpu-memory-utilization 0.8 --max-model-len 8192 \ @@ -161,7 +161,7 @@ Most LLMs from HuggingFace are supported. A few examples are: - [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) - Please use `server/server_configs/llm_configs/llama3.1-8B-instruct.yaml` as the server config. - Note that you need to get access to the model first, and specify `export HF_TOKEN="hf_..."` when launching the server. 
-- [nvidia/Llama-3.1-Nemotron-Nano-8B-v1](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1) +- [nvidia/Llama-3.1-Nemotron-Nano-8B-v1](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1) - [nvidia/Nemotron-Mini-4B-Instruct](https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct) @@ -182,7 +182,7 @@ If thinking/reasoning mode is enabled (e.g., in `server/server_configs/llm_confi For vLLM server, if you specify `--reasoning_parser` in `vllm_server_params`, the thinking/reasoning content will be filtered out and does not show up in the output. -### 🎤 ASR +### 🎤 ASR We use [cache-aware streaming FastConformer](https://arxiv.org/abs/2312.17279) to transcribe the user's speech into text. While new models will be released soon, we use the existing English models for now: - [nvidia/parakeet_realtime_eou_120m-v1](https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1) (default) @@ -195,9 +195,9 @@ We use [cache-aware streaming FastConformer](https://arxiv.org/abs/2312.17279) t ### 💬 Speaker Diarization -Speaker diarization aims to distinguish different speakers in the input speech audio. We use [streaming Sortformer](http://arxiv.org/abs/2507.18446) to detect the speaker for each user turn. +Speaker diarization aims to distinguish different speakers in the input speech audio. We use [streaming Sortformer](http://arxiv.org/abs/2507.18446) to detect the speaker for each user turn. -As of now, we only support detecting 1 speaker per user turn, but different turns come from different speakers, with a maximum of 4 speakers in the whole conversation. +As of now, we only support detecting 1 speaker per user turn, but different turns come from different speakers, with a maximum of 4 speakers in the whole conversation. 
Currently supported models are: - [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1) (default) @@ -210,7 +210,7 @@ Please note that in some circumstances, the diarization model might not work wel Here are the supported TTS models: - [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) is a lightweight TTS model. This model is the default speech generation backend. - Please use `server/server_configs/tts_configs/kokoro_82M.yaml` as the server config. -- [FastPitch-HiFiGAN](https://huggingface.co/nvidia/tts_en_fastpitch) is an NVIDIA-NeMo TTS model. It only supports English output. +- [FastPitch-HiFiGAN](https://huggingface.co/nvidia/tts_en_fastpitch) is an NVIDIA-NeMo TTS model. It only supports English output. - Please use `server/server_configs/tts_configs/nemo_fastpitch-hifigan.yaml` as the server config. - [magpie_tts_multilingual_357m](https://huggingface.co/nvidia/magpie_tts_multilingual_357m) is a multilingual TTS model. - Please use `server/server_configs/tts_configs/magpie_tts_multilingual_357m.yaml` as the server config. @@ -239,7 +239,7 @@ We support tool calling for LLMs to use external tools (e.g., getting the curren - "Reset to the original speaking speed." - "Speak twice as fast." - "Speak half as slow." - + 3. Switching between British and American accents, and changing the gender of the voice: - "Speak in British accent." - "Switch to a male voice." diff --git a/examples/voice_agent/server/server_configs/NVIDIA_NeMo_models.yaml b/examples/voice_agent/server/server_configs/NVIDIA_NeMo_models.yaml index 1c9e4c31a8bb..fa013fe6c0c1 100644 --- a/examples/voice_agent/server/server_configs/NVIDIA_NeMo_models.yaml +++ b/examples/voice_agent/server/server_configs/NVIDIA_NeMo_models.yaml @@ -1,5 +1,5 @@ # This is an example config for setting up a NeMo Voice Agent server only with NVIDIA NeMo models. 
-# Please refer to https://github.com/NVIDIA-NeMo/NeMo/tree/main/examples/voice_agent/README.md for more details +# Please refer to https://github.com/NVIDIA-NeMo/Speech/tree/main/examples/voice_agent/README.md for more details # STT, LLM and TTS models have standalone configs in the folder "server/server_configs/{stt,llm,tts}_configs". # Specify the type and an a model identifier to automatically configure the model. diff --git a/examples/voice_agent/server/server_configs/default.yaml b/examples/voice_agent/server/server_configs/default.yaml index ecb45f942062..7e4f502d2b0a 100644 --- a/examples/voice_agent/server/server_configs/default.yaml +++ b/examples/voice_agent/server/server_configs/default.yaml @@ -1,5 +1,5 @@ # This is an example config for setting up a NeMo Voice Agent server. -# Please refer to https://github.com/NVIDIA-NeMo/NeMo/tree/main/examples/voice_agent/README.md for more details +# Please refer to https://github.com/NVIDIA-NeMo/Speech/tree/main/examples/voice_agent/README.md for more details # STT, LLM and TTS models have standalone configs in the folder "server/server_configs/{stt,llm,tts}_configs". # Specify the type and an a model identifier to automatically configure the model. 
@@ -42,7 +42,7 @@ llm: # model: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16" # model_config: "./server_configs/llm_configs/nemotron_nano_v3.yaml" # if `model_config` is not specified, and the llm.model is not in model_registry.yaml, it will use `llm_configs/hf_llm_generic.yaml` - # model: "Qwen/Qwen2.5-7B-Instruct" + # model: "Qwen/Qwen2.5-7B-Instruct" # model_config: "./server_configs/llm_configs/qwen2.5-7B.yaml" # model: "Qwen/Qwen3-8B" # model_config: "./server_configs/llm_configs/qwen3-8B.yaml" diff --git a/examples/voice_agent/server/server_configs/llm_configs/hf_llm_generic.yaml b/examples/voice_agent/server/server_configs/llm_configs/hf_llm_generic.yaml index b200a2da00e5..42fdb553f7fc 100644 --- a/examples/voice_agent/server/server_configs/llm_configs/hf_llm_generic.yaml +++ b/examples/voice_agent/server/server_configs/llm_configs/hf_llm_generic.yaml @@ -1,5 +1,5 @@ # This is an example config for setting up a generic HuggingFace LLM for a NeMo Voice Agent server. -# Please refer to https://github.com/NVIDIA-NeMo/NeMo/tree/main/examples/voice_agent/README.md for more details +# Please refer to https://github.com/NVIDIA-NeMo/Speech/tree/main/examples/voice_agent/README.md for more details # type: auto # choices in ['auto', 'hf', 'vllm'] # device: "cuda" @@ -21,11 +21,11 @@ max_new_tokens: 256 # max num of output tokens from LLM ############################## # Please refer to the model page of each HF LLM model to set following params properly. 
# kwargs that will be passed into tokenizer.apply_chat_template() function -apply_chat_template_kwargs: +apply_chat_template_kwargs: add_generation_prompt: true # This is required in most cases, do not change unless you're sure of it tokenize: false # This is required, do not change # kwargs that will be passed into model.generate() function of HF models -generation_kwargs: +generation_kwargs: temperature: ${llm.temperature} # LLM sampling params top_k: ${llm.top_k} # LLM sampling params top_p: ${llm.top_p} # LLM sampling params @@ -36,15 +36,15 @@ generation_kwargs: ######## vLLM config ######### ############################## api_key: "EMPTY" -base_url: "http://localhost:8000/v1" +base_url: "http://localhost:8000/v1" # Set `start_vllm_on_init` to automatically start vllm server if it's not manually started yet start_vllm_on_init: true # Specifying vllm_server_params with the parameters you want to pass to the vllm server command `vllm serve $model $vllm_server_params` # Refer to each LLM's model page for details on the recommended parameters # It's recommended to stay with `--max-num-seqs` 1 as the voice agent currently supports one connection at a time. # You can try increasing the model's max context len `--max-model-len` if GPU memory allows, or decrease it if GPU OOM occurs. -vllm_server_params: "--trust-remote-code --max-num-seqs 1 --gpu-memory-utilization 0.85" -# `params` are the inference parameters that would be passed into OpenAI API, +vllm_server_params: "--trust-remote-code --max-num-seqs 1 --gpu-memory-utilization 0.85" +# `params` are the inference parameters that would be passed into OpenAI API, # please put additional model-specific parameters in `extra` vllm_generation_params: frequency_penalty: 0.0 # Penalty for frequent tokens (-2.0 to 2.0). 
diff --git a/examples/voice_agent/server/server_configs/llm_configs/llama3.1-8B-instruct.yaml b/examples/voice_agent/server/server_configs/llm_configs/llama3.1-8B-instruct.yaml index ff4cf8d24626..0a592e687720 100644 --- a/examples/voice_agent/server/server_configs/llm_configs/llama3.1-8B-instruct.yaml +++ b/examples/voice_agent/server/server_configs/llm_configs/llama3.1-8B-instruct.yaml @@ -1,5 +1,5 @@ # This is an example config for setting up Qwen2.5-7B model for a NeMo Voice Agent server. -# Please refer to https://github.com/NVIDIA-NeMo/NeMo/tree/main/examples/voice_agent/README.md for more details +# Please refer to https://github.com/NVIDIA-NeMo/Speech/tree/main/examples/voice_agent/README.md for more details # type: auto # choices in ['auto', 'hf', 'vllm'] # model: meta-llama/Llama-3.1-8B-Instruct @@ -22,11 +22,11 @@ max_new_tokens: 256 # max num of output tokens from LLM ############################## # Please refer to the model page of each HF LLM model to set following params properly. 
# kwargs that will be passed into tokenizer.apply_chat_template() function -apply_chat_template_kwargs: +apply_chat_template_kwargs: add_generation_prompt: true # This is required in most cases, do not change unless you're sure of it tokenize: false # This is required, do not change # kwargs that will be passed into model.generate() function of HF models -generation_kwargs: +generation_kwargs: temperature: ${llm.temperature} # LLM sampling params top_k: ${llm.top_k} # LLM sampling params top_p: ${llm.top_p} # LLM sampling params @@ -37,15 +37,15 @@ generation_kwargs: ######## vLLM config ######### ############################## api_key: "EMPTY" -base_url: "http://localhost:8000/v1" +base_url: "http://localhost:8000/v1" # Set `start_vllm_on_init` to automatically start vllm server if it's not manually started yet start_vllm_on_init: true # Specifying vllm_server_params with the parameters you want to pass to the vllm server command `vllm serve $model $vllm_server_params` # Refer to each LLM's model page for details on the recommended parameters # It's recommended to stay with `--max-num-seqs` 1 as the voice agent currently supports one connection at a time. # You can try increasing the model's max context len `--max-model-len` if GPU memory allows, or decrease it if GPU OOM occurs. -vllm_server_params: "--trust-remote-code --max-num-seqs 1 --gpu-memory-utilization 0.85" -# `params` are the inference parameters that would be passed into OpenAI API, +vllm_server_params: "--trust-remote-code --max-num-seqs 1 --gpu-memory-utilization 0.85" +# `params` are the inference parameters that would be passed into OpenAI API, # please put additional model-specific parameters in `extra` vllm_generation_params: frequency_penalty: 0.0 # Penalty for frequent tokens (-2.0 to 2.0). 
diff --git a/examples/voice_agent/server/server_configs/llm_configs/nemotron_nano_v2.yaml b/examples/voice_agent/server/server_configs/llm_configs/nemotron_nano_v2.yaml index 7eba545d9cfd..72dbba1f2cbc 100644 --- a/examples/voice_agent/server/server_configs/llm_configs/nemotron_nano_v2.yaml +++ b/examples/voice_agent/server/server_configs/llm_configs/nemotron_nano_v2.yaml @@ -1,5 +1,5 @@ # This is an example config for setting up nemotron_nano_v2 model for a NeMo Voice Agent server. -# Please refer to https://github.com/NVIDIA-NeMo/NeMo/tree/main/examples/voice_agent/README.md for more details +# Please refer to https://github.com/NVIDIA-NeMo/Speech/tree/main/examples/voice_agent/README.md for more details # model: "nvidia/NVIDIA-Nemotron-Nano-9B-v2" # model name for HF models, will be used via `AutoModelForCausalLM.from_pretrained()` type: vllm # Overwrite to vllm to enable tool calling, the HF backend currently doesn't support tools @@ -22,11 +22,11 @@ max_new_tokens: 256 # max num of output tokens from LLM ############################## # Please refer to the model page of each HF LLM model to set following params properly. 
# kwargs that will be passed into tokenizer.apply_chat_template() function -apply_chat_template_kwargs: +apply_chat_template_kwargs: add_generation_prompt: true # This is required in most cases, do not change unless you're sure of it tokenize: false # This is required, do not change # kwargs that will be passed into model.generate() function of HF models -generation_kwargs: +generation_kwargs: temperature: ${llm.temperature} # LLM sampling params top_k: ${llm.top_k} # LLM sampling params top_p: ${llm.top_p} # LLM sampling params @@ -37,15 +37,15 @@ generation_kwargs: ######## vLLM config ######### ############################## api_key: "EMPTY" -base_url: "http://localhost:8000/v1" +base_url: "http://localhost:8000/v1" # Set `start_vllm_on_init` to automatically start vllm server if it's not manually started yet start_vllm_on_init: true # Specifying vllm_server_params with the parameters you want to pass to the vllm server command `vllm serve $model $vllm_server_params` # Refer to each LLM's model page for details on the recommended parameters. # It's recommended to stay with `--max-num-seqs` 1 as the voice agent currently supports one connection at a time. # You can try increasing the model's max context len `--max-model-len` if GPU memory allows, or decrease it if GPU OOM occurs. 
-vllm_server_params: "--trust-remote-code --enable-prefix-caching --max-num-seqs 1 --gpu-memory-utilization 0.85 --max-model-len 8192 --enable-auto-tool-choice --tool-parser-plugin server/parsers/nemotron_toolcall_parser_streaming.py --tool-call-parser nemotron_json --mamba_ssm_cache_dtype float32" -# `params` are the inference parameters that would be passed into OpenAI API, +vllm_server_params: "--trust-remote-code --enable-prefix-caching --max-num-seqs 1 --gpu-memory-utilization 0.85 --max-model-len 8192 --enable-auto-tool-choice --tool-parser-plugin server/parsers/nemotron_toolcall_parser_streaming.py --tool-call-parser nemotron_json --mamba_ssm_cache_dtype float32" +# `params` are the inference parameters that would be passed into OpenAI API, # please put additional model-specific parameters in `extra` vllm_generation_params: frequency_penalty: 0.0 # Penalty for frequent tokens (-2.0 to 2.0). diff --git a/examples/voice_agent/server/server_configs/llm_configs/nemotron_nano_v3.yaml b/examples/voice_agent/server/server_configs/llm_configs/nemotron_nano_v3.yaml index 5cadedb725fc..bd0c3d64a3de 100644 --- a/examples/voice_agent/server/server_configs/llm_configs/nemotron_nano_v3.yaml +++ b/examples/voice_agent/server/server_configs/llm_configs/nemotron_nano_v3.yaml @@ -1,12 +1,12 @@ # This is an example config for setting up nemotron_nano_v3 model for a NeMo Voice Agent server. 
-# Please refer to https://github.com/NVIDIA-NeMo/NeMo/tree/main/examples/voice_agent/README.md for more details +# Please refer to https://github.com/NVIDIA-NeMo/Speech/tree/main/examples/voice_agent/README.md for more details # model: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16" # model name for HF models, will be used via `AutoModelForCausalLM.from_pretrained()` type: vllm # Overwrite to vllm to enable tool calling, the HF backend currently doesn't support tools dtype: auto # torch.dtype for LLM, `auto` is only available for vllm device: "cuda" system_role: "system" # role for system prompt, set it to `user` for models that do not support system prompt -system_prompt_suffix: "Before responding to the user, check if the user request requires using external tools, and use the tools if they match with the user's intention. Otherwise, use your internal knowledge to answer the user's question. Do not use tools for casual conversation or when the tools don't fit the use cases. You should still try to address the user's request when it's not related to the provided tools. If you are provided with a set of tools, use them only when needed, do not limit your capabilities to the scope of the tools. If the purpose of a tool matches well with a user's request, always try to call the tool first. Conversation history should not limit your behavior on whether you can use tools. You must answer questions not related to the tools. Avoid any emoji in your response." +system_prompt_suffix: "Before responding to the user, check if the user request requires using external tools, and use the tools if they match with the user's intention. Otherwise, use your internal knowledge to answer the user's question. Do not use tools for casual conversation or when the tools don't fit the use cases. You should still try to address the user's request when it's not related to the provided tools. 
If you are provided with a set of tools, use them only when needed, do not limit your capabilities to the scope of the tools. If the purpose of a tool matches well with a user's request, always try to call the tool first. Conversation history should not limit your behavior on whether you can use tools. You must answer questions not related to the tools. Avoid any emoji in your response." enable_tool_calling: true # set to True since the vllm config below supports tool calling inject_dummy_user_message: true @@ -23,11 +23,11 @@ max_new_tokens: 256 # max num of output tokens from LLM ############################## # Please refer to the model page of each HF LLM model to set following params properly. # kwargs that will be passed into tokenizer.apply_chat_template() function -apply_chat_template_kwargs: +apply_chat_template_kwargs: add_generation_prompt: true # This is required in most cases, do not change unless you're sure of it tokenize: false # This is required, do not change # kwargs that will be passed into model.generate() function of HF models -generation_kwargs: +generation_kwargs: temperature: ${llm.temperature} # LLM sampling params top_k: ${llm.top_k} # LLM sampling params top_p: ${llm.top_p} # LLM sampling params @@ -38,15 +38,15 @@ generation_kwargs: ######## vLLM config ######### ############################## api_key: "EMPTY" -base_url: "http://localhost:8000/v1" +base_url: "http://localhost:8000/v1" # Set `start_vllm_on_init` to automatically start vllm server if it's not manually started yet start_vllm_on_init: false # Specifying vllm_server_params with the parameters you want to pass to the vllm server command `vllm serve $model $vllm_server_params` # Refer to each LLM's model page for details on the recommended parameters. # It's recommended to stay with `--max-num-seqs` 1 as the voice agent currently supports one connection at a time. 
# You can try increasing the model's max context len `--max-model-len` if GPU memory allows, or decrease it if GPU OOM occurs. -vllm_server_params: "--trust-remote-code --tensor-parallel-size 2 --enable-prefix-caching --max-num-seqs 1 --gpu-memory-utilization 0.8 --max-model-len 8192 --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser-plugin server/parsers/nano_v3_reasoning_parser.py --reasoning-parser nano_v3" -# `params` are the inference parameters that would be passed into OpenAI API, +vllm_server_params: "--trust-remote-code --tensor-parallel-size 2 --enable-prefix-caching --max-num-seqs 1 --gpu-memory-utilization 0.8 --max-model-len 8192 --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser-plugin server/parsers/nano_v3_reasoning_parser.py --reasoning-parser nano_v3" +# `params` are the inference parameters that would be passed into OpenAI API, # please put additional model-specific parameters in `extra` vllm_generation_params: frequency_penalty: 0.0 # Penalty for frequent tokens (-2.0 to 2.0). diff --git a/examples/voice_agent/server/server_configs/llm_configs/qwen2.5-7B.yaml b/examples/voice_agent/server/server_configs/llm_configs/qwen2.5-7B.yaml index 907b1a521373..bfd288f689f3 100644 --- a/examples/voice_agent/server/server_configs/llm_configs/qwen2.5-7B.yaml +++ b/examples/voice_agent/server/server_configs/llm_configs/qwen2.5-7B.yaml @@ -1,5 +1,5 @@ # This is an example config for setting up Qwen2.5-7B model for a NeMo Voice Agent server. 
-# Please refer to https://github.com/NVIDIA-NeMo/NeMo/tree/main/examples/voice_agent/README.md for more details +# Please refer to https://github.com/NVIDIA-NeMo/Speech/tree/main/examples/voice_agent/README.md for more details # type: hf # choices in ['auto', 'hf', 'vllm'] # model: Qwen/Qwen2.5-7B-Instruct @@ -22,11 +22,11 @@ max_new_tokens: 256 # max num of output tokens from LLM ############################## # Please refer to the model page of each HF LLM model to set following params properly. # kwargs that will be passed into tokenizer.apply_chat_template() function -apply_chat_template_kwargs: +apply_chat_template_kwargs: add_generation_prompt: true # This is required in most cases, do not change unless you're sure of it tokenize: false # This is required, do not change # kwargs that will be passed into model.generate() function of HF models -generation_kwargs: +generation_kwargs: temperature: ${llm.temperature} # LLM sampling params top_k: ${llm.top_k} # LLM sampling params top_p: ${llm.top_p} # LLM sampling params @@ -37,15 +37,15 @@ generation_kwargs: ######## vLLM config ######### ############################## api_key: "EMPTY" -base_url: "http://localhost:8000/v1" +base_url: "http://localhost:8000/v1" # Set `start_vllm_on_init` to automatically start vllm server if it's not manually started yet start_vllm_on_init: true # Specifying vllm_server_params with the parameters you want to pass to the vllm server command `vllm serve $model $vllm_server_params` # Refer to each LLM's model page for details on the recommended parameters # It's recommended to stay with `--max-num-seqs` 1 as the voice agent currently supports one connection at a time. # You can try increasing the model's max context len `--max-model-len` if GPU memory allows, or decrease it if GPU OOM occurs. 
-vllm_server_params: "--trust-remote-code --max-num-seqs 1 --gpu-memory-utilization 0.85" -# `params` are the inference parameters that would be passed into OpenAI API, +vllm_server_params: "--trust-remote-code --max-num-seqs 1 --gpu-memory-utilization 0.85" +# `params` are the inference parameters that would be passed into OpenAI API, # please put additional model-specific parameters in `extra` vllm_generation_params: frequency_penalty: 0.0 # Penalty for frequent tokens (-2.0 to 2.0). diff --git a/examples/voice_agent/server/server_configs/llm_configs/qwen3-8B.yaml b/examples/voice_agent/server/server_configs/llm_configs/qwen3-8B.yaml index 427c2313b5bb..81a897e31f20 100644 --- a/examples/voice_agent/server/server_configs/llm_configs/qwen3-8B.yaml +++ b/examples/voice_agent/server/server_configs/llm_configs/qwen3-8B.yaml @@ -1,5 +1,5 @@ # This is an example config for setting up Qwen3-8B model for a NeMo Voice Agent server. -# Please refer to https://github.com/NVIDIA-NeMo/NeMo/tree/main/examples/voice_agent/README.md for more details +# Please refer to https://github.com/NVIDIA-NeMo/Speech/tree/main/examples/voice_agent/README.md for more details # type: auto # choices in ['auto', 'hf', 'vllm'] # model: "Qwen/Qwen3-8B" # model name for HF models, will be used via `AutoModelForCausalLM.from_pretrained()` @@ -22,11 +22,11 @@ max_new_tokens: 256 # max num of output tokens from LLM ############################## # Please refer to the model page of each HF LLM model to set following params properly. 
# kwargs that will be passed into tokenizer.apply_chat_template() function -apply_chat_template_kwargs: +apply_chat_template_kwargs: add_generation_prompt: true # This is required in most cases, do not change unless you're sure of it tokenize: false # This is required, do not change # kwargs that will be passed into model.generate() function of HF models -generation_kwargs: +generation_kwargs: temperature: ${llm.temperature} # LLM sampling params top_k: ${llm.top_k} # LLM sampling params top_p: ${llm.top_p} # LLM sampling params @@ -37,15 +37,15 @@ generation_kwargs: ######## vLLM config ######### ############################## api_key: "EMPTY" -base_url: "http://localhost:8000/v1" +base_url: "http://localhost:8000/v1" # Set `start_vllm_on_init` to automatically start vllm server if it's not manually started yet start_vllm_on_init: true # Specifying vllm_server_params with the parameters you want to pass to the vllm server command `vllm serve $model $vllm_server_params` # Refer to each LLM's model page for details on the recommended parameters # It's recommended to stay with `--max-num-seqs` 1 as the voice agent currently supports one connection at a time. # You can try increasing the model's max context len `--max-model-len` if GPU memory allows, or decrease it if GPU OOM occurs. -vllm_server_params: "--trust-remote-code --max-num-seqs 1 --gpu-memory-utilization 0.85" -# `params` are the inference parameters that would be passed into OpenAI API, +vllm_server_params: "--trust-remote-code --max-num-seqs 1 --gpu-memory-utilization 0.85" +# `params` are the inference parameters that would be passed into OpenAI API, # please put additional model-specific parameters in `extra` vllm_generation_params: frequency_penalty: 0.0 # Penalty for frequent tokens (-2.0 to 2.0). 
diff --git a/examples/voice_agent/server/server_configs/llm_configs/qwen3-8B_think.yaml b/examples/voice_agent/server/server_configs/llm_configs/qwen3-8B_think.yaml index bfbea8f5c8a1..cccb872c13bd 100644 --- a/examples/voice_agent/server/server_configs/llm_configs/qwen3-8B_think.yaml +++ b/examples/voice_agent/server/server_configs/llm_configs/qwen3-8B_think.yaml @@ -1,5 +1,5 @@ # This is an example config for setting up Qwen3-8B model in thinking mode for a NeMo Voice Agent server. -# Please refer to https://github.com/NVIDIA-NeMo/NeMo/tree/main/examples/voice_agent/README.md for more details +# Please refer to https://github.com/NVIDIA-NeMo/Speech/tree/main/examples/voice_agent/README.md for more details # type: auto # choices in ['auto', 'hf', 'vllm'] # model: "Qwen/Qwen3-8B" # model name for HF models, will be used via `AutoModelForCausalLM.from_pretrained()` @@ -22,11 +22,11 @@ max_new_tokens: 256 # max num of output tokens from LLM ############################## # Please refer to the model page of each HF LLM model to set following params properly. 
# kwargs that will be passed into tokenizer.apply_chat_template() function -apply_chat_template_kwargs: +apply_chat_template_kwargs: add_generation_prompt: true # This is required in most cases, do not change unless you're sure of it tokenize: false # This is required, do not change # kwargs that will be passed into model.generate() function of HF models -generation_kwargs: +generation_kwargs: temperature: ${llm.temperature} # LLM sampling params top_k: ${llm.top_k} # LLM sampling params top_p: ${llm.top_p} # LLM sampling params @@ -37,15 +37,15 @@ generation_kwargs: ######## vLLM config ######### ############################## api_key: "EMPTY" -base_url: "http://localhost:8000/v1" +base_url: "http://localhost:8000/v1" # Set `start_vllm_on_init` to automatically start vllm server if it's not manually started yet start_vllm_on_init: true # Specifying vllm_server_params with the parameters you want to pass to the vllm server command `vllm serve $model $vllm_server_params` # Refer to each LLM's model page for details on the recommended parameters # It's recommended to stay with `--max-num-seqs` 1 as the voice agent currently supports one connection at a time. # You can try increasing the model's max context len `--max-model-len` if GPU memory allows, or decrease it if GPU OOM occurs. -vllm_server_params: "--trust-remote-code --max-num-seqs 1 --gpu-memory-utilization 0.85" -# `params` are the inference parameters that would be passed into OpenAI API, +vllm_server_params: "--trust-remote-code --max-num-seqs 1 --gpu-memory-utilization 0.85" +# `params` are the inference parameters that would be passed into OpenAI API, # please put additional model-specific parameters in `extra` vllm_generation_params: frequency_penalty: 0.0 # Penalty for frequent tokens (-2.0 to 2.0). 
diff --git a/examples/voice_agent/server/server_configs/stt_configs/nemo_cache_aware_streaming.yaml b/examples/voice_agent/server/server_configs/stt_configs/nemo_cache_aware_streaming.yaml index 1a386397d625..360fe86100db 100644 --- a/examples/voice_agent/server/server_configs/stt_configs/nemo_cache_aware_streaming.yaml +++ b/examples/voice_agent/server/server_configs/stt_configs/nemo_cache_aware_streaming.yaml @@ -1,6 +1,6 @@ # This is an example config for setting up a NeMo cache-aware streaming ASR model for a NeMo Voice Agent server. -# Please refer to https://github.com/NVIDIA-NeMo/NeMo/tree/main/examples/voice_agent/README.md for more details +# Please refer to https://github.com/NVIDIA-NeMo/Speech/tree/main/examples/voice_agent/README.md for more details att_context_size: [70,1] # left and right attention context sizes for streaming ASR frame_len_in_secs: 0.08 # default for FastConformer, do not change unless using other architechtures - audio_chunk_size_in_secs: 0.08 \ No newline at end of file + audio_chunk_size_in_secs: 0.08 \ No newline at end of file diff --git a/examples/voice_agent/server/server_configs/tts_configs/kokoro_82M.yaml b/examples/voice_agent/server/server_configs/tts_configs/kokoro_82M.yaml index a28b62c88dc5..21fd751bf93c 100644 --- a/examples/voice_agent/server/server_configs/tts_configs/kokoro_82M.yaml +++ b/examples/voice_agent/server/server_configs/tts_configs/kokoro_82M.yaml @@ -1,11 +1,11 @@ # This is an example config for setting up a NeMo FastPitch-Hifigan TTS model for a NeMo Voice Agent server. 
-# Please refer to https://github.com/NVIDIA-NeMo/NeMo/tree/main/examples/voice_agent/README.md for more details +# Please refer to https://github.com/NVIDIA-NeMo/Speech/tree/main/examples/voice_agent/README.md for more details main_model_id: "hexgrad/Kokoro-82M" sub_model_id: "af_heart" # "af_heart" "af_bella" "am_fenrir" "am_michael" device: "cuda" speed: 1.25 # Speaking rate -extra_separator: # a list of additional punctuations to chunk LLM response into segments for faster TTS output, e.g., ",". Set to `null` to use default behavior +extra_separator: # a list of additional punctuations to chunk LLM response into segments for faster TTS output, e.g., ",". Set to `null` to use default behavior - ',' - '\n' - "." diff --git a/examples/voice_agent/server/server_configs/tts_configs/nemo_fastpitch-hifigan.yaml b/examples/voice_agent/server/server_configs/tts_configs/nemo_fastpitch-hifigan.yaml index 0852cb50de25..ed6a1e385f99 100644 --- a/examples/voice_agent/server/server_configs/tts_configs/nemo_fastpitch-hifigan.yaml +++ b/examples/voice_agent/server/server_configs/tts_configs/nemo_fastpitch-hifigan.yaml @@ -1,10 +1,10 @@ # This is an example config for setting up a NeMo FastPitch-Hifigan TTS model for a NeMo Voice Agent server. -# Please refer to https://github.com/NVIDIA-NeMo/NeMo/tree/main/examples/voice_agent/README.md for more details +# Please refer to https://github.com/NVIDIA-NeMo/Speech/tree/main/examples/voice_agent/README.md for more details main_model_id: "nvidia/tts_en_fastpitch" sub_model_id: "nvidia/tts_hifigan" device: "cuda" -extra_separator: # a list of additional punctuations to chunk LLM response into segments for faster TTS output, e.g., ",". Set to `null` to use default behavior +extra_separator: # a list of additional punctuations to chunk LLM response into segments for faster TTS output, e.g., ",". Set to `null` to use default behavior - ',' - '\n' - "." 
diff --git a/nemo/collections/asr/README.md b/nemo/collections/asr/README.md index 27f9472f1ac8..d8cab4e9f2d2 100644 --- a/nemo/collections/asr/README.md +++ b/nemo/collections/asr/README.md @@ -15,14 +15,14 @@ * Transducer/RNNT * Hybrid Transducer/CTC * NeMo Original [Multi-blank Transducers](https://arxiv.org/abs/2211.03541) and [Token-and-Duration Transducers (TDT)](https://arxiv.org/abs/2304.06795) - * Streaming/Buffered ASR (CTC/Transducer) - [Chunked Inference Examples](https://github.com/NVIDIA/NeMo/tree/stable/examples/asr/asr_chunked_inference) - * [Cache-aware Streaming Conformer](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#cache-aware-streaming-conformer) with multiple lookaheads (including microphone streaming [tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Online_ASR_Microphone_Demo_Cache_Aware_Streaming.ipynb). + * Streaming/Buffered ASR (CTC/Transducer) - [Chunked Inference Examples](https://github.com/NVIDIA-NeMo/Speech/tree/stable/examples/asr/asr_chunked_inference) + * [Cache-aware Streaming Conformer](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#cache-aware-streaming-conformer) with multiple lookaheads (including microphone streaming [tutorial](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/asr/Online_ASR_Microphone_Demo_Cache_Aware_Streaming.ipynb). 
* Beam Search decoding * [Language Modelling for ASR (CTC and RNNT)](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html): N-gram LM in fusion with Beam Search decoding, Neural Rescoring with Transformer * [Support of long audios for Conformer with memory efficient local attention](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/results.html#inference-on-long-audio) * [Speech Classification, Speech Command Recognition and Language Identification](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/intro.html): MatchboxNet (Command Recognition), AmberNet (LangID) * [Voice activity Detection (VAD)](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speech_classification/models.html#marblenet-vad): MarbleNet - * ASR with VAD Inference - [Example](https://github.com/NVIDIA/NeMo/tree/stable/examples/asr/asr_vad) + * ASR with VAD Inference - [Example](https://github.com/NVIDIA-NeMo/Speech/tree/stable/examples/asr/asr_vad) * [Speaker Recognition](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_recognition/intro.html): TitaNet, ECAPA_TDNN, SpeakerNet * [Speaker Diarization](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_diarization/intro.html) * Clustering Diarizer: TitaNet, ECAPA_TDNN, SpeakerNet diff --git a/nemo/collections/asr/modules/wav2vec_modules.py b/nemo/collections/asr/modules/wav2vec_modules.py index e823943de4e2..59218e4d2caa 100644 --- a/nemo/collections/asr/modules/wav2vec_modules.py +++ b/nemo/collections/asr/modules/wav2vec_modules.py @@ -229,7 +229,7 @@ class Wav2VecTransformerEncoder(TransformerEncoder): Takes convolutional encodings of all time steps and adds to features before applying series of self-attention layers. 
- Example configs may be found at: https://github.com/NVIDIA/NeMo/tree/main/examples/asr/conf/wav2vec + Example configs may be found at: https://github.com/NVIDIA-NeMo/Speech/tree/main/examples/asr/conf/wav2vec Args: layer_drop: Floating point value specifying proportion of module for layer dropout (See Fan et al. https://arxiv.org/pdf/1909.11556.pdf). diff --git a/nemo/collections/common/parts/preprocessing/manifest.py b/nemo/collections/common/parts/preprocessing/manifest.py index 7661c31b9943..764deabc2d2b 100644 --- a/nemo/collections/common/parts/preprocessing/manifest.py +++ b/nemo/collections/common/parts/preprocessing/manifest.py @@ -29,7 +29,7 @@ class ManifestBase: def __init__(self, *args, **kwargs): raise ValueError( - "This class is deprecated, look at https://github.com/NVIDIA/NeMo/pull/284 for correct behaviour." + "This class is deprecated, look at https://github.com/NVIDIA-NeMo/Speech/pull/284 for correct behaviour." ) @@ -38,7 +38,7 @@ class ManifestEN: def __init__(self, *args, **kwargs): raise ValueError( - "This class is deprecated, look at https://github.com/NVIDIA/NeMo/pull/284 for correct behaviour." + "This class is deprecated, look at https://github.com/NVIDIA-NeMo/Speech/pull/284 for correct behaviour." ) diff --git a/nemo/collections/common/prompts/t5nmt.py b/nemo/collections/common/prompts/t5nmt.py index 3ba812fb8073..9f2bf1a81b78 100644 --- a/nemo/collections/common/prompts/t5nmt.py +++ b/nemo/collections/common/prompts/t5nmt.py @@ -27,7 +27,7 @@ class T5NMTPromptFormatter(PromptFormatter): """ The default prompt format for Megatron T5 based neural machine translation models. 
- Based on: https://github.com/NVIDIA/NeMo/blob/ad5ef750e351edbb5eeb7eb6df2d0c804819600f/nemo/collections/nlp/models/machine_translation/megatron_nmt_model.py#L790 + Based on: https://github.com/NVIDIA-NeMo/Speech/blob/ad5ef750e351edbb5eeb7eb6df2d0c804819600f/nemo/collections/nlp/models/machine_translation/megatron_nmt_model.py#L790 """ NAME = "t5nmt" @@ -50,7 +50,7 @@ class T5NMTPromptFormatter(PromptFormatter): def encode_turn(self, prompt_template: str, expected_slots: dict, slot_values: dict) -> list[int]: # Automatically adds "<" and ">" to target lang token for T5 NMT. - # Based on: https://github.com/NVIDIA/NeMo/blob/ad5ef750e351edbb5eeb7eb6df2d0c804819600f/nemo/collections/nlp/models/machine_translation/mt_enc_dec_model.py#L307 + # Based on: https://github.com/NVIDIA-NeMo/Speech/blob/ad5ef750e351edbb5eeb7eb6df2d0c804819600f/nemo/collections/nlp/models/machine_translation/mt_enc_dec_model.py#L307 if (val := slot_values.get("target_lang")) is not None: if not val.startswith("<") or not val.endswith(">"): slot_values["target_lang"] = f"<{val}>" diff --git a/nemo/collections/common/tokenizers/tokenizer_utils.py b/nemo/collections/common/tokenizers/tokenizer_utils.py index fbdfcfa847c6..1499889f091d 100644 --- a/nemo/collections/common/tokenizers/tokenizer_utils.py +++ b/nemo/collections/common/tokenizers/tokenizer_utils.py @@ -95,7 +95,7 @@ def get_tokenizer( except (ImportError, ModuleNotFoundError): raise ImportError( "Megatron-core was not found. Please see the NeMo README for installation instructions: " - " https://github.com/NVIDIA/NeMo#megatron-gpt." + " https://github.com/NVIDIA-NeMo/Speech#megatron-gpt." 
) if vocab_file is None: vocab_file = get_megatron_vocab_file(tokenizer_name) diff --git a/nemo/collections/speechlm2/parts/hf_hub.py b/nemo/collections/speechlm2/parts/hf_hub.py index d8e6fafc0bd8..885eabc832a5 100644 --- a/nemo/collections/speechlm2/parts/hf_hub.py +++ b/nemo/collections/speechlm2/parts/hf_hub.py @@ -26,7 +26,7 @@ class HFHubMixin( PyTorchModelHubMixin, library_name="NeMo", - repo_url="https://github.com/NVIDIA/NeMo", + repo_url="https://github.com/NVIDIA-NeMo/Speech", docs_url="https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit", ): @classmethod diff --git a/nemo/collections/tts/data/dataset.py b/nemo/collections/tts/data/dataset.py index 857207cc6274..db8975891a2a 100644 --- a/nemo/collections/tts/data/dataset.py +++ b/nemo/collections/tts/data/dataset.py @@ -197,7 +197,7 @@ def __init__( self.text_normalizer_call = None elif not PYNINI_AVAILABLE: raise ImportError( - "`nemo_text_processing` is not installed, see https://github.com/NVIDIA/NeMo-text-processing for details. " + "`nemo_text_processing` is not installed, see https://github.com/NVIDIA-NeMo/Speech-text-processing for details. " "If you wish to continue without text normalization, please remove the text_normalizer part in your TTS yaml file." ) else: diff --git a/nemo/collections/tts/models/base.py b/nemo/collections/tts/models/base.py index ae432f0c8168..3752bda8910e 100644 --- a/nemo/collections/tts/models/base.py +++ b/nemo/collections/tts/models/base.py @@ -42,7 +42,7 @@ def _setup_normalizer(self, cfg): if "text_normalizer" in cfg: if not PYNINI_AVAILABLE: logging.error( - "`nemo_text_processing` not installed, see https://github.com/NVIDIA/NeMo-text-processing for more details." + "`nemo_text_processing` not installed, see https://github.com/NVIDIA-NeMo/Speech-text-processing for more details." 
) logging.error("The normalizer will be disabled.") return diff --git a/nemo/collections/tts/models/magpietts_preference_optimization.py b/nemo/collections/tts/models/magpietts_preference_optimization.py index d393ee69113c..6bc8d9cc816a 100644 --- a/nemo/collections/tts/models/magpietts_preference_optimization.py +++ b/nemo/collections/tts/models/magpietts_preference_optimization.py @@ -309,12 +309,12 @@ def preference_loss( logits = pi_logratios - ref_logratios # also known as h_{\pi_\theta}^{y_w,y_l} # logits = (policy_chosen_logps - policy_rejected_logps) - (reference_chosen_logps - reference_rejected_logps) # logits = (policy_chosen_logps - reference_chosen_logps) - (policy_rejected_logps - reference_rejected_logps) - # logits is the same as rewards_delta in NeMo aligner: https://github.com/NVIDIA/NeMo-Aligner/blob/0b5bffeb78a8316dd57e0816a2a9544540f0c8dd/nemo_aligner/models/nlp/gpt/megatron_gpt_dpo_model.py#L241 + # logits is the same as rewards_delta in NeMo aligner: https://github.com/NVIDIA-NeMo/Speech-Aligner/blob/0b5bffeb78a8316dd57e0816a2a9544540f0c8dd/nemo_aligner/models/nlp/gpt/megatron_gpt_dpo_model.py#L241 if loss_type == "ipo": losses = (logits - 1 / (2 * beta)) ** 2 # Eq. 
17 of https://arxiv.org/pdf/2310.12036v2.pdf elif loss_type == "rpo": - # https://github.com/NVIDIA/NeMo-Aligner/blob/0b5bffeb78a8316dd57e0816a2a9544540f0c8dd/nemo_aligner/models/nlp/gpt/megatron_gpt_dpo_model.py#L241 + # https://github.com/NVIDIA-NeMo/Speech-Aligner/blob/0b5bffeb78a8316dd57e0816a2a9544540f0c8dd/nemo_aligner/models/nlp/gpt/megatron_gpt_dpo_model.py#L241 logbeta_hat_chosen = torch.nn.functional.logsigmoid(beta * logits) logbeta_hat_rejected = torch.nn.functional.logsigmoid(-beta * logits) gt_rewards_delta = gt_reward_scale * (chosen_gt_rewards - rejected_gt_rewards) diff --git a/nemo/core/config/templates/model_card.py b/nemo/core/config/templates/model_card.py index 3de051b3845e..adc57a66ac1a 100644 --- a/nemo/core/config/templates/model_card.py +++ b/nemo/core/config/templates/model_card.py @@ -39,7 +39,7 @@ To train, fine-tune, or experiment with the model, install the PyTorch build for your platform first, then install [NVIDIA NeMo](https://docs.nvidia.com/nemo/speech/nightly/starthere/install.html) with the extras you need. ``` pip install 'nemo-toolkit[all]' -``` +``` ## How to Use this Model @@ -96,9 +96,9 @@ An example is provided below for ASR - The NeMo toolkit [3] was used for training the models for over several hundred epochs. These model are trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml). + The NeMo toolkit [3] was used for training the models for over several hundred epochs. These model are trained with this [example script](https://github.com/NVIDIA-NeMo/Speech/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py) and this [base config](https://github.com/NVIDIA-NeMo/Speech/blob/main/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml). 
- The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py). + The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA-NeMo/Speech/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py). ### Datasets @@ -150,7 +150,7 @@ ### NOTE An example is provided below for ASR metrics list that can be added to the top of the README - + model-index: - name: PUT_MODEL_NAME results: @@ -182,7 +182,7 @@ type: wer value: 14.11 -Provide any caveats about the results presented in the top of the discussion so that nuance is not lost. +Provide any caveats about the results presented in the top of the discussion so that nuance is not lost. It should ideally be in a tabular format (you can use the following website to make your tables in markdown format - https://www.tablesgenerator.com/markdown_tables)** @@ -193,7 +193,7 @@ ### Note - An example is provided below + An example is provided below Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech. @@ -206,5 +206,5 @@ **Provide appropriate references in the markdown link format below. 
Please order them numerically.** -[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) +[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA-NeMo/Speech) """ diff --git a/nemo/core/optim/optimizer_with_main_params.py b/nemo/core/optim/optimizer_with_main_params.py index 9723a876e58a..8887cc5787eb 100755 --- a/nemo/core/optim/optimizer_with_main_params.py +++ b/nemo/core/optim/optimizer_with_main_params.py @@ -88,12 +88,12 @@ class GradBucket(object): def __init__(self, numel, chunk_size_mb, data_group): if not HAVE_APEX: raise ImportError( - "Apex was not found. Please see the NeMo README for installation instructions: https://github.com/NVIDIA/NeMo#megatron-gpt." + "Apex was not found. Please see the NeMo README for installation instructions: https://github.com/NVIDIA-NeMo/Speech#megatron-gpt." ) if not HAVE_MEGATRON_CORE: raise ImportError( - "megatron-core was not found. Please see the NeMo README for installation instructions: https://github.com/NVIDIA/NeMo#megatron-gpt." + "megatron-core was not found. Please see the NeMo README for installation instructions: https://github.com/NVIDIA-NeMo/Speech#megatron-gpt." ) self.numel = numel @@ -192,12 +192,12 @@ def __init__( ): if not HAVE_APEX: raise ImportError( - "Apex was not found. Please see the NeMo README for installation instructions: https://github.com/NVIDIA/NeMo#megatron-gpt." + "Apex was not found. Please see the NeMo README for installation instructions: https://github.com/NVIDIA-NeMo/Speech#megatron-gpt." ) if not HAVE_MEGATRON_CORE: raise ImportError( - "megatron-core was not found. Please see the NeMo README for installation instructions: https://github.com/NVIDIA/NeMo#megatron-gpt." + "megatron-core was not found. Please see the NeMo README for installation instructions: https://github.com/NVIDIA-NeMo/Speech#megatron-gpt." 
) self.optimizer = optimizer diff --git a/nemo/package_info.py b/nemo/package_info.py index 7e7157579152..a9f29467f741 100644 --- a/nemo/package_info.py +++ b/nemo/package_info.py @@ -57,8 +57,8 @@ def _source_tree_version() -> str: __contact_names__ = "NVIDIA" __contact_emails__ = "nemo-toolkit@nvidia.com" __homepage__ = "https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/" -__repository_url__ = "https://github.com/NVIDIA-NeMo/NeMo" -__download_url__ = "https://github.com/NVIDIA-NeMo/NeMo/releases" +__repository_url__ = "https://github.com/NVIDIA-NeMo/Speech" +__download_url__ = "https://github.com/NVIDIA-NeMo/Speech/releases" __description__ = "NeMo - a toolkit for Conversational AI" __license__ = "Apache2" __keywords__ = "deep learning, machine learning, gpu, NLP, NeMo, nvidia, pytorch, torch, tts, speech, language" diff --git a/nemo/utils/callbacks/dist_ckpt_io.py b/nemo/utils/callbacks/dist_ckpt_io.py index c7c74e98a655..54b05cb32a39 100644 --- a/nemo/utils/callbacks/dist_ckpt_io.py +++ b/nemo/utils/callbacks/dist_ckpt_io.py @@ -54,7 +54,7 @@ HAVE_MEGATRON_CORE = False IMPORT_ERROR = ( "megatron-core was not found. " - "Please see the NeMo README for installation instructions: https://github.com/NVIDIA/NeMo#megatron-gpt." + "Please see the NeMo README for installation instructions: https://github.com/NVIDIA-NeMo/Speech#megatron-gpt." f" Exact error: {e}" ) diff --git a/nemo/utils/export_utils.py b/nemo/utils/export_utils.py index 00ab91d00e21..57959060723c 100644 --- a/nemo/utils/export_utils.py +++ b/nemo/utils/export_utils.py @@ -87,7 +87,7 @@ class MatchedScaleMaskSoftmax(ApexGuardDefaults): def __init__(self): super().__init__() logging.warning( - "Apex was not found. ColumnLinear will not work. Please see the NeMo README for installation instructions: https://github.com/NVIDIA/NeMo#megatron-gpt." + "Apex was not found. ColumnLinear will not work. 
Please see the NeMo README for installation instructions: https://github.com/NVIDIA-NeMo/Speech#megatron-gpt." ) diff --git a/pyproject.toml b/pyproject.toml index ad7711a3d582..d519aebc9e3e 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -355,8 +355,8 @@ py-modules = ["nemo"] nemo_speechlm = "nemo.collections.speechlm2.vllm.salm:register" [project.urls] -Download = "https://github.com/NVIDIA-NeMo/NeMo/releases" -Homepage = "https://github.com/NVIDIA-NeMo/NeMo" +Download = "https://github.com/NVIDIA-NeMo/Speech/releases" +Homepage = "https://github.com/NVIDIA-NeMo/Speech" [tool.isort] profile = "black" # black-compatible diff --git a/scripts/dataset_processing/tts/hui_acg/get_data.py b/scripts/dataset_processing/tts/hui_acg/get_data.py index 6f7da30b4f97..0e56fc1ecb76 100644 --- a/scripts/dataset_processing/tts/hui_acg/get_data.py +++ b/scripts/dataset_processing/tts/hui_acg/get_data.py @@ -28,7 +28,7 @@ except (ImportError, ModuleNotFoundError): raise ModuleNotFoundError( "The package `nemo_text_processing` was not installed in this environment. Please refer to" - " https://github.com/NVIDIA/NeMo-text-processing and install this package before using " + " https://github.com/NVIDIA-NeMo/Speech-text-processing and install this package before using " "this script" ) diff --git a/scripts/dataset_processing/tts/ljspeech/get_data.py b/scripts/dataset_processing/tts/ljspeech/get_data.py index e9783514e442..15c01211b74d 100644 --- a/scripts/dataset_processing/tts/ljspeech/get_data.py +++ b/scripts/dataset_processing/tts/ljspeech/get_data.py @@ -27,7 +27,7 @@ except (ImportError, ModuleNotFoundError): raise ModuleNotFoundError( "The package `nemo_text_processing` was not installed in this environment. 
Please refer to" - " https://github.com/NVIDIA/NeMo-text-processing and install this package before using " + " https://github.com/NVIDIA-NeMo/Speech-text-processing and install this package before using " "this script" ) diff --git a/scripts/dataset_processing/tts/preprocess_text.py b/scripts/dataset_processing/tts/preprocess_text.py index c4ed5febf642..aa0be30d9f28 100644 --- a/scripts/dataset_processing/tts/preprocess_text.py +++ b/scripts/dataset_processing/tts/preprocess_text.py @@ -41,7 +41,7 @@ except (ImportError, ModuleNotFoundError): raise ModuleNotFoundError( "The package `nemo_text_processing` was not installed in this environment. Please refer to" - " https://github.com/NVIDIA/NeMo-text-processing and install this package before using " + " https://github.com/NVIDIA-NeMo/Speech-text-processing and install this package before using " "this script" ) diff --git a/scripts/dataset_processing/tts/sfbilingual/get_data.py b/scripts/dataset_processing/tts/sfbilingual/get_data.py index 4b11fddda7e4..4ba3ed49c8bc 100755 --- a/scripts/dataset_processing/tts/sfbilingual/get_data.py +++ b/scripts/dataset_processing/tts/sfbilingual/get_data.py @@ -27,7 +27,7 @@ except (ImportError, ModuleNotFoundError): raise ModuleNotFoundError( "The package `nemo_text_processing` was not installed in this environment. 
Please refer to" - " https://github.com/NVIDIA/NeMo-text-processing and install this package before using " + " https://github.com/NVIDIA-NeMo/Speech-text-processing and install this package before using " "this script" ) diff --git a/scripts/dataset_processing/tts/thorsten_neutral/get_data.py b/scripts/dataset_processing/tts/thorsten_neutral/get_data.py index 6123d267468f..6fb981750025 100644 --- a/scripts/dataset_processing/tts/thorsten_neutral/get_data.py +++ b/scripts/dataset_processing/tts/thorsten_neutral/get_data.py @@ -39,7 +39,7 @@ except (ImportError, ModuleNotFoundError): raise ModuleNotFoundError( "The package `nemo_text_processing` was not installed in this environment. Please refer to" - " https://github.com/NVIDIA/NeMo-text-processing and install this package before using " + " https://github.com/NVIDIA-NeMo/Speech-text-processing and install this package before using " "this script" ) diff --git a/scripts/installers/Dockerfile.ngramtools b/scripts/installers/Dockerfile.ngramtools index fad6716a1874..ebb8ec8dd1ba 100644 --- a/scripts/installers/Dockerfile.ngramtools +++ b/scripts/installers/Dockerfile.ngramtools @@ -13,10 +13,10 @@ # See the License for the specific language governing permissions and # limitations under the License. -# Use this script to install KenLM, OpenSeq2Seq decoder, Flashlight decoder, OpenGRM Ngram tool to contaner +# Use this script to install KenLM, OpenSeq2Seq decoder, Flashlight decoder, OpenGRM Ngram tool to container # How to use? Build it from NeMo root folder: -# 1. git clone https://github.com/NVIDIA/NeMo.git && cd NeMo +# 1. git clone https://github.com/NVIDIA-NeMo/Speech.git && cd Speech # 2. DOCKER_BUILDKIT=1 docker build -t nemo:23.03.1 -f ./scripts/installers/Dockerfile.ngramtools .
from nvcr.io/nvidia/nemo:23.03 diff --git a/scripts/installers/Dockerfile.speech_translation_vllm b/scripts/installers/Dockerfile.speech_translation_vllm index 72b85cfa134b..e48ce102aebc 100644 --- a/scripts/installers/Dockerfile.speech_translation_vllm +++ b/scripts/installers/Dockerfile.speech_translation_vllm @@ -18,7 +18,7 @@ ARG BASE_IMAGE=pytorch/pytorch:2.9.0-cuda12.8-cudnn9-runtime FROM ${BASE_IMAGE} -ARG GIT_URL="https://github.com/NVIDIA-NeMo/NeMo.git" +ARG GIT_URL="https://github.com/NVIDIA-NeMo/Speech.git" ARG CHECKOUT="main" ENV DEBIAN_FRONTEND=noninteractive diff --git a/scripts/pseudo_labeling/README.md b/scripts/pseudo_labeling/README.md index 77bd61ca067a..d018b9d5d19e 100644 --- a/scripts/pseudo_labeling/README.md +++ b/scripts/pseudo_labeling/README.md @@ -12,11 +12,11 @@ TopIPL is an **iterative pseudo-labeling algorithm** for training speech recogni TopIPL relies on the following components: -- **[`SDPNeMoRunIPLProcessor`]** - Commands for running IPL are generated and submitted using SDP processors and NeMo-Run. - See instructions for usage [here](https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/sdp/processors/ipl/README.md). +- **[`SDPNeMoRunIPLProcessor`]** + Commands for running IPL are generated and submitted using SDP processors and NeMo-Run. + See instructions for usage [here](https://github.com/NVIDIA-NeMo/Speech-speech-data-processor/blob/main/sdp/processors/ipl/README.md). 
-- **Training Callback: `IPLEpochStopperCallback`** +- **Training Callback: `IPLEpochStopperCallback`** Add this to your training config under `exp_manager` to **stop training at the end of each epoch**, enabling pseudo-label update: ```yaml diff --git a/tests/collections/asr/test_asr_interctc_models.py b/tests/collections/asr/test_asr_interctc_models.py index 327e39c51cb5..263545c18967 100644 --- a/tests/collections/asr/test_asr_interctc_models.py +++ b/tests/collections/asr/test_asr_interctc_models.py @@ -220,7 +220,7 @@ def __getitem__(self, idx): assert output[0].shape == logprobs.shape # Explicitly pass accelerator as cpu, since default val in PTL >= 2.0 is auto and it picks cuda - # which further causes an error in all reduce at: https://github.com/NVIDIA/NeMo/blob/v1.18.1/nemo/collections/asr/modules/conv_asr.py#L209 + # which further causes an error in all reduce at: https://github.com/NVIDIA-NeMo/Speech/blob/v1.18.1/nemo/collections/asr/modules/conv_asr.py#L209 trainer = pl.Trainer(max_epochs=1, accelerator='cpu') trainer.fit( asr_model, diff --git a/tests/collections/asr/test_asr_local_attn.py b/tests/collections/asr/test_asr_local_attn.py index 9116d998cc6e..1a77c5e4350e 100644 --- a/tests/collections/asr/test_asr_local_attn.py +++ b/tests/collections/asr/test_asr_local_attn.py @@ -174,7 +174,7 @@ def __getitem__(self, idx): asr_model.train() _ = asr_model.forward(input_signal=input_signal, input_signal_length=input_length) # Explicitly pass accelerator as cpu, since default val in PTL >= 2.0 is auto and it picks cuda - # which further causes an error in all reduce at: https://github.com/NVIDIA/NeMo/blob/v1.18.1/nemo/collections/asr/modules/conformer_encoder.py#L462 + # which further causes an error in all reduce at: https://github.com/NVIDIA-NeMo/Speech/blob/v1.18.1/nemo/collections/asr/modules/conformer_encoder.py#L462 # and in ConvASREncoder where device is CPU trainer = pl.Trainer(max_epochs=1, accelerator='cpu') trainer.fit( diff --git 
a/tests/collections/asr/test_text_to_text_dataset.py b/tests/collections/asr/test_text_to_text_dataset.py index 738c0c715c5e..9f38fafda363 100644 --- a/tests/collections/asr/test_text_to_text_dataset.py +++ b/tests/collections/asr/test_text_to_text_dataset.py @@ -30,7 +30,7 @@ except (ImportError, ModuleNotFoundError): raise ModuleNotFoundError( "The package `nemo_text_processing` was not installed in this environment. Please refer to" - " https://github.com/NVIDIA/NeMo-text-processing and install this package before using " + " https://github.com/NVIDIA-NeMo/Speech-text-processing and install this package before using " "this script" ) diff --git a/tests/collections/tts/models/test_fastpitch.py b/tests/collections/tts/models/test_fastpitch.py index 4849cf9eeb6c..03b3120ce647 100644 --- a/tests/collections/tts/models/test_fastpitch.py +++ b/tests/collections/tts/models/test_fastpitch.py @@ -36,7 +36,7 @@ def pretrained_model(request, get_language_id_from_pretrained_model_name): # This test can only pass when nemo_text_process<=0.1.8rc0. If >0.1.8rc0, the normalized outputs are unexpected for Chinese. # Will remove the marker `pleasefixme` once next-text-processing new release fixes the bug. -# Tracking bugfix in https://github.com/NVIDIA/NeMo-text-processing/issues/109. +# Tracking bugfix in https://github.com/NVIDIA-NeMo/Speech-text-processing/issues/109. @pytest.mark.pleasefixme @pytest.mark.nightly @pytest.mark.run_only_on('GPU') diff --git a/tests/conftest.py b/tests/conftest.py index 132329abde7e..dbe5848262b5 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -31,7 +31,7 @@ # Those variables probably should go to main NeMo configuration file (config.yaml). 
__TEST_DATA_FILENAME = "test_data.tar.gz" -__TEST_DATA_URL = "https://github.com/NVIDIA/NeMo/releases/download/v1.0.0rc1/" +__TEST_DATA_URL = "https://github.com/NVIDIA-NeMo/Speech/releases/download/v1.0.0rc1/" __TEST_DATA_SUBDIR = ".data" diff --git a/tools/ctc_segmentation/README.md b/tools/ctc_segmentation/README.md index c2f8c37466f5..b1c39ed83757 100644 --- a/tools/ctc_segmentation/README.md +++ b/tools/ctc_segmentation/README.md @@ -1,14 +1,14 @@ Dataset creation tool based on CTC-Segmentation ----------------------------------------------- -This tool provides functionality to align long audio files and the corresponding transcripts into shorter fragments +This tool provides functionality to align long audio files and the corresponding transcripts into shorter fragments that are suitable for an Automatic Speech Recognition (ASR) model training. -More details could be found in [this tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tools/CTC_Segmentation_Tutorial.ipynb). +More details could be found in [this tutorial](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/tools/CTC_Segmentation_Tutorial.ipynb). 
-The tool is based on the [CTC Segmentation](https://github.com/lumaku/ctc-segmentation): -**CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition** -https://doi.org/10.1007/978-3-030-60276-5_27 or pre-print https://arxiv.org/abs/2007.09127 +The tool is based on the [CTC Segmentation](https://github.com/lumaku/ctc-segmentation): +**CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition** +https://doi.org/10.1007/978-3-030-60276-5_27 or pre-print https://arxiv.org/abs/2007.09127 ``` @InProceedings{ctcsegmentation, @@ -33,5 +33,5 @@ Requirements ~~~~~~~~~~~~ The tool requires: - packages listed in requirements.txt -- NeMo ASR +- NeMo ASR - see pysox’s documentation (https://pysox.readthedocs.io/en/latest/) if you want support for mp3, flac and ogg files diff --git a/tools/nemo_forced_aligner/README.md b/tools/nemo_forced_aligner/README.md index 6a0de6fc0908..951798c987b4 100644 --- a/tools/nemo_forced_aligner/README.md +++ b/tools/nemo_forced_aligner/README.md @@ -1,11 +1,11 @@ # NeMo Forced Aligner (NFA)

-Try it out: HuggingFace Space 🎤 | Tutorial: "How to use NFA?" 🚀 | Blog post: "How does forced alignment work?" 📚 +Try it out: HuggingFace Space 🎤 | Tutorial: "How to use NFA?" 🚀 | Blog post: "How does forced alignment work?" 📚

- +

NFA is a tool for generating token-, word- and segment-level timestamps of speech in audio using NeMo's CTC-based Automatic Speech Recognition models. You can provide your own reference text, or use ASR-generated transcription. You can use NeMo's ASR Model checkpoints out of the box in [14+ languages](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/results.html#speech-recognition-languages), or train your own model. NFA can be used on long audio files of 1+ hours duration (subject to your hardware and the ASR model used). @@ -23,8 +23,8 @@ NFA is a tool for generating token-, word- and segment-level timestamps of speec ```

- +

-## Documentation +## Documentation More documentation is available [here](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tools/nemo_forced_aligner.html). diff --git a/tools/nemo_forced_aligner/align.py b/tools/nemo_forced_aligner/align.py index 8956adee8738..55ca92e5b0f2 100644 --- a/tools/nemo_forced_aligner/align.py +++ b/tools/nemo_forced_aligner/align.py @@ -50,10 +50,10 @@ "Install NeMo with NFA utilities support:\n" " pip install 'nemo-toolkit[all]>=2.5.0'\n" "Or install the latest development version:\n" - " pip install git+https://github.com/NVIDIA-NeMo/NeMo.git" + " pip install git+https://github.com/NVIDIA-NeMo/Speech.git" ) """ -Align the utterances in manifest_filepath. +Align the utterances in manifest_filepath. Results are saved in ctm files in output_dir. Arguments: @@ -67,25 +67,25 @@ manifest_filepath: filepath to the manifest of the data you want to align, containing 'audio_filepath' and 'text' fields. output_dir: the folder where output CTM files and new JSON manifest will be saved. - align_using_pred_text: if True, will transcribe the audio using the specified model and then use that transcription - as the reference text for the forced alignment. + align_using_pred_text: if True, will transcribe the audio using the specified model and then use that transcription + as the reference text for the forced alignment. transcribe_device: None, or a string specifying the device that will be used for generating log-probs (i.e. "transcribing"). - The string needs to be in a format recognized by torch.device(). If None, NFA will set it to 'cuda' if it is available + The string needs to be in a format recognized by torch.device(). If None, NFA will set it to 'cuda' if it is available (otherwise will set it to 'cpu'). - viterbi_device: None, or string specifying the device that will be used for doing Viterbi decoding. - The string needs to be in a format recognized by torch.device(). 
If None, NFA will set it to 'cuda' if it is available + viterbi_device: None, or string specifying the device that will be used for doing Viterbi decoding. + The string needs to be in a format recognized by torch.device(). If None, NFA will set it to 'cuda' if it is available (otherwise will set it to 'cpu'). batch_size: int specifying batch size that will be used for generating log-probs and doing Viterbi decoding. use_local_attention: boolean flag specifying whether to try to use local attention for the ASR Model (will only - work if the ASR Model is a Conformer model). If local attention is used, we will set the local attention context + work if the ASR Model is a Conformer model). If local attention is used, we will set the local attention context size to [64,64]. - additional_segment_grouping_separator: an optional string or list of strings used to separate the text into smaller segments. - If this is not specified, then the whole text will be treated as a single segment. - remove_blank_tokens_from_ctm: a boolean denoting whether to remove tokens from token-level output CTMs. + additional_segment_grouping_separator: an optional string or list of strings used to separate the text into smaller segments. + If this is not specified, then the whole text will be treated as a single segment. + remove_blank_tokens_from_ctm: a boolean denoting whether to remove tokens from token-level output CTMs. audio_filepath_parts_in_utt_id: int specifying how many of the 'parts' of the audio_filepath - we will use (starting from the final part of the audio_filepath) to determine the - utt_id that will be used in the CTM files. Note also that any spaces that are present in the audio_filepath - will be replaced with dashes, so as not to change the number of space-separated elements in the + we will use (starting from the final part of the audio_filepath) to determine the + utt_id that will be used in the CTM files. 
Note also that any spaces that are present in the audio_filepath + will be replaced with dashes, so as not to change the number of space-separated elements in the CTM files. e.g. if audio_filepath is "/a/b/c/d/e 1.wav" and audio_filepath_parts_in_utt_id is 1 => utt_id will be "e1" e.g. if audio_filepath is "/a/b/c/d/e 1.wav" and audio_filepath_parts_in_utt_id is 2 => utt_id will be "d_e1" diff --git a/tools/nemo_forced_aligner/align_eou.py b/tools/nemo_forced_aligner/align_eou.py index f851bee08ed9..d39025e66486 100644 --- a/tools/nemo_forced_aligner/align_eou.py +++ b/tools/nemo_forced_aligner/align_eou.py @@ -55,13 +55,13 @@ "Install NeMo with NFA utilities support:\n" " pip install 'nemo-toolkit[all]>=2.5.0'\n" "Or install the latest development version:\n" - " pip install git+https://github.com/NVIDIA-NeMo/NeMo.git" + " pip install git+https://github.com/NVIDIA-NeMo/Speech.git" ) """ -Align the utterances in manifest_filepath. +Align the utterances in manifest_filepath. Results are saved in ctm files in output_dir as well as json manifest in output_manifest_filepath. -If no output_manifest_filepath is specified, it will save the results in the same parent directory as +If no output_manifest_filepath is specified, it will save the results in the same parent directory as the input manifest_filepath. Arguments: @@ -78,25 +78,25 @@ output_manifest_filepath: Optional[str] = None # output of manfiest with sou_time and eou_time manifest_pattern: Optional[str] = None # pattern used in Path.glob() for finding manifests - align_using_pred_text: if True, will transcribe the audio using the specified model and then use that transcription - as the reference text for the forced alignment. + align_using_pred_text: if True, will transcribe the audio using the specified model and then use that transcription + as the reference text for the forced alignment. transcribe_device: None, or a string specifying the device that will be used for generating log-probs (i.e. 
"transcribing"). - The string needs to be in a format recognized by torch.device(). If None, NFA will set it to 'cuda' if it is available + The string needs to be in a format recognized by torch.device(). If None, NFA will set it to 'cuda' if it is available (otherwise will set it to 'cpu'). - viterbi_device: None, or string specifying the device that will be used for doing Viterbi decoding. - The string needs to be in a format recognized by torch.device(). If None, NFA will set it to 'cuda' if it is available + viterbi_device: None, or string specifying the device that will be used for doing Viterbi decoding. + The string needs to be in a format recognized by torch.device(). If None, NFA will set it to 'cuda' if it is available (otherwise will set it to 'cpu'). batch_size: int specifying batch size that will be used for generating log-probs and doing Viterbi decoding. use_local_attention: boolean flag specifying whether to try to use local attention for the ASR Model (will only - work if the ASR Model is a Conformer model). If local attention is used, we will set the local attention context + work if the ASR Model is a Conformer model). If local attention is used, we will set the local attention context size to [64,64]. - additional_segment_grouping_separator: an optional string used to separate the text into smaller segments. - If this is not specified, then the whole text will be treated as a single segment. - remove_blank_tokens_from_ctm: a boolean denoting whether to remove tokens from token-level output CTMs. + additional_segment_grouping_separator: an optional string used to separate the text into smaller segments. + If this is not specified, then the whole text will be treated as a single segment. + remove_blank_tokens_from_ctm: a boolean denoting whether to remove tokens from token-level output CTMs. 
audio_filepath_parts_in_utt_id: int specifying how many of the 'parts' of the audio_filepath - we will use (starting from the final part of the audio_filepath) to determine the - utt_id that will be used in the CTM files. Note also that any spaces that are present in the audio_filepath - will be replaced with dashes, so as not to change the number of space-separated elements in the + we will use (starting from the final part of the audio_filepath) to determine the + utt_id that will be used in the CTM files. Note also that any spaces that are present in the audio_filepath + will be replaced with dashes, so as not to change the number of space-separated elements in the CTM files. e.g. if audio_filepath is "/a/b/c/d/e 1.wav" and audio_filepath_parts_in_utt_id is 1 => utt_id will be "e1" e.g. if audio_filepath is "/a/b/c/d/e 1.wav" and audio_filepath_parts_in_utt_id is 2 => utt_id will be "d_e1" diff --git a/tools/speech_data_simulator/conf/data_simulator.yaml b/tools/speech_data_simulator/conf/data_simulator.yaml index f8cdbd7a8b77..e6068d4f4c8b 100644 --- a/tools/speech_data_simulator/conf/data_simulator.yaml +++ b/tools/speech_data_simulator/conf/data_simulator.yaml @@ -35,7 +35,7 @@ data_simulator: start_buffer: 0.1 # Buffer of silence before the start of the sentence (to avoid cutting off speech or starting abruptly) split_buffer: 0.1 # Split RTTM labels if greater than twice this amount of silence (to avoid long gaps between utterances as being labelled as speech) release_buffer: 0.1 # Buffer before window at end of sentence (to avoid cutting off speech or ending abruptly) - normalize: true # Normalize speaker volumes + normalize: true # Normalize speaker volumes normalization_type: equal # Normalizing speakers ("equal" - same volume per speaker, "var" - variable volume per speaker) normalization_var: 0.1 # Variance in speaker volume (sample from standard deviation centered at 1) min_volume: 0.75 # Minimum speaker volume (only used when variable normalization is 
used) @@ -57,7 +57,7 @@ data_simulator: snr_max: null # Max random SNR for background noise (using average speaker power), set `null` to use fixed SNR # Segment and session augmentations. Available augmentations are in nemo/collections/asr/parts/preprocessing/perturb.py - # See tutorial at https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Online_Noise_Augmentation.ipynb + # See tutorial at https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/asr/Online_Noise_Augmentation.ipynb # Note that ImpulsePerturbation, NoisePerturbation, RirAndNoisePerturbation and other perturbations that uses `collections.ASRAudioText` # cannot use multi-proccessing in simulation, due to non-pickable errors. segment_augmentor: @@ -82,8 +82,8 @@ data_simulator: - 0.25 - 0.75 - segment_manifest: # Parameters for regenerating the segment manifest file - window: 0.5 # Window length for segmentation + segment_manifest: # Parameters for regenerating the segment manifest file + window: 0.5 # Window length for segmentation shift: 0.25 # Shift length for segmentation step_count: 50 # Number of the unit segments you want to create per utterance deci: 3 # Rounding decimals for segment manifest file @@ -147,7 +147,7 @@ data_simulator: mic_pattern: omni # Microphone type ("omni" - omnidirectional) - currently only omnidirectional microphones are supported for pyroomacoustics absorbtion_params: # Note: only `T60` is used for pyroomacoustics simulations - abs_weights: # Absorption coefficient ratios for each surface + abs_weights: # Absorption coefficient ratios for each surface - 0.9 - 0.9 - 0.9 diff --git a/tutorials/00_NeMo_Primer.ipynb b/tutorials/00_NeMo_Primer.ipynb index c221b43640ef..e0615446aefd 100644 --- a/tutorials/00_NeMo_Primer.ipynb +++ b/tutorials/00_NeMo_Primer.ipynb @@ -43,7 +43,7 @@ "\n", "# ## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "!python -m pip install 
git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "## Grab the config we'll use in this example\n", "!mkdir configs" @@ -1130,7 +1130,7 @@ "\n", "NeMo constantly adds new models and new tasks to these examples, such that these examples serve as the basis to train and evaluate models from scratch with the provided config files.\n", "\n", - "NeMo Examples directory can be found here - https://github.com/NVIDIA/NeMo/tree/main/examples" + "NeMo Examples directory can be found here - https://github.com/NVIDIA-NeMo/Speech/tree/main/examples" ] }, { @@ -1212,7 +1212,7 @@ "\n", "While the tutorials are a great example of the simplicity of NeMo, please note for the best performance when training on real datasets, we advice the use of the example scripts instead of the tutorial notebooks. \n", "\n", - "NeMo Tutorials directory can be found here - https://github.com/NVIDIA/NeMo/tree/main/tutorials" + "NeMo Tutorials directory can be found here - https://github.com/NVIDIA-NeMo/Speech/tree/main/tutorials" ] } ], diff --git a/tutorials/01_NeMo_Models.ipynb b/tutorials/01_NeMo_Models.ipynb index f996be6b3de3..bc1bf768477e 100644 --- a/tutorials/01_NeMo_Models.ipynb +++ b/tutorials/01_NeMo_Models.ipynb @@ -26,7 +26,7 @@ "\n", "# ## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "## Grab the config we'll use in this example\n", "!mkdir configs" @@ -98,7 +98,7 @@ "\n", "NeMo comes with several state-of-the-art pre-trained Conversational AI models for users to quickly be able to start training and fine-tuning on their own datasets. 
\n", "\n", - "In the previous [NeMo Primer](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/00_NeMo_Primer.ipynb) notebook, we learned how to download pretrained checkpoints with NeMo and we also discussed the fundamental concepts of the NeMo Model. The previous tutorial showed us how to use, modify, save, and restore NeMo Models.\n", + "In the previous [NeMo Primer](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/00_NeMo_Primer.ipynb) notebook, we learned how to download pretrained checkpoints with NeMo and we also discussed the fundamental concepts of the NeMo Model. The previous tutorial showed us how to use, modify, save, and restore NeMo Models.\n", "\n", "In this tutorial we will learn how to develop a non-trivial NeMo model from scratch. This helps us to understand the underlying components and how they interact with the overall PyTorch ecosystem.\n" ] diff --git a/tutorials/02_NeMo_Adapters.ipynb b/tutorials/02_NeMo_Adapters.ipynb index 28b49d8568ee..061f9cc8fd87 100644 --- a/tutorials/02_NeMo_Adapters.ipynb +++ b/tutorials/02_NeMo_Adapters.ipynb @@ -26,7 +26,7 @@ "\n", "# ## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "## Grab the config we'll use in this example\n", "!mkdir configs" diff --git a/tutorials/Publish_NeMo_Model_On_Hugging_Face_Hub.ipynb b/tutorials/Publish_NeMo_Model_On_Hugging_Face_Hub.ipynb index 1e4e0a7cd2db..5f1098a4337d 100644 --- a/tutorials/Publish_NeMo_Model_On_Hugging_Face_Hub.ipynb +++ b/tutorials/Publish_NeMo_Model_On_Hugging_Face_Hub.ipynb @@ -26,7 +26,7 @@ "\n", "### Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]" + "!python -m pip install 
git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]" ] }, { @@ -628,7 +628,7 @@ "\n", "## NVIDIA NeMo: Training\n", "\n", - "To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed latest Pytorch version.\n", + "To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA-NeMo/Speech). We recommend you install it after you've installed latest Pytorch version.\n", "```\n", "pip install nemo_toolkit['all']\n", "```\n", @@ -700,7 +700,7 @@ "\n", "\n", "\n", - "[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)\n", + "[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA-NeMo/Speech)\n", "\n", "\"\"\"" ] diff --git a/tutorials/VoiceSwapSample.ipynb b/tutorials/VoiceSwapSample.ipynb index 283319d4a3f8..3a22903a5305 100644 --- a/tutorials/VoiceSwapSample.ipynb +++ b/tutorials/VoiceSwapSample.ipynb @@ -8,7 +8,7 @@ }, "source": [ "# Getting Started: Voice swap application\n", - "This notebook shows how to use NVIDIA NeMo (https://github.com/NVIDIA/NeMo) to construct a toy demo which will swap a voice in the audio fragment with a computer generated one.\n", + "This notebook shows how to use NVIDIA NeMo (https://github.com/NVIDIA-NeMo/Speech) to construct a toy demo which will swap a voice in the audio fragment with a computer generated one.\n", "\n", "At its core the demo does: \n", "\n", @@ -39,7 +39,7 @@ "outputs": [], "source": [ "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n" + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n" ] }, { @@ -239,13 +239,13 @@ "\n", "**NeMo is built for training.** You can fine-tune, or train from scratch on your data all models used in this example. 
We recommend you checkout the following, more in-depth, tutorials next:\n", "\n", - "* [NeMo fundamentals](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/00_NeMo_Primer.ipynb)\n", - "* [NeMo models](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/01_NeMo_Models.ipynb)\n", - "* [Speech Recognition](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)\n", - "* [Speech Synthesis](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)\n", + "* [NeMo fundamentals](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/00_NeMo_Primer.ipynb)\n", + "* [NeMo models](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/01_NeMo_Models.ipynb)\n", + "* [Speech Recognition](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)\n", + "* [Speech Synthesis](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)\n", "\n", "\n", - "You can find scripts for training and fine-tuning ASR and TTS models [here](https://github.com/NVIDIA/NeMo/tree/main/examples). " + "You can find scripts for training and fine-tuning ASR and TTS models [here](https://github.com/NVIDIA-NeMo/Speech/tree/main/examples). " ] }, { @@ -255,7 +255,7 @@ "id": "ahRh2Y0Lc0G1" }, "source": [ - "That's it folks! Head over to NeMo GitHub for more examples: https://github.com/NVIDIA/NeMo" + "That's it folks! 
Head over to NeMo GitHub for more examples: https://github.com/NVIDIA-NeMo/Speech" ] } ], diff --git a/tutorials/asr/ASR_Confidence_Estimation.ipynb b/tutorials/asr/ASR_Confidence_Estimation.ipynb index 78fa52080d6a..0e9e3470b7d0 100644 --- a/tutorials/asr/ASR_Confidence_Estimation.ipynb +++ b/tutorials/asr/ASR_Confidence_Estimation.ipynb @@ -43,9 +43,9 @@ " ## Install dependencies\n", " !apt-get install sox libsndfile1 ffmpeg\n", "\n", - " !git clone -b $BRANCH https://github.com/NVIDIA/NeMo\n", + " !git clone -b $BRANCH https://github.com/NVIDIA-NeMo/Speech\n", " %cd NeMo\n", - " !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + " !python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", " NEMO_DIR_PATH = os.path.abspath('')\n", " is_colab = True\n", "\n", diff --git a/tutorials/asr/ASR_Context_Biasing.ipynb b/tutorials/asr/ASR_Context_Biasing.ipynb index 49e6073c61e6..5bf9c52c15bc 100644 --- a/tutorials/asr/ASR_Context_Biasing.ipynb +++ b/tutorials/asr/ASR_Context_Biasing.ipynb @@ -81,7 +81,7 @@ "metadata": {}, "source": [ "
\n", - " \"CTC-WS\" \n", + " \"CTC-WS\" \n", "
Figure 1. High-level representation of the proposed context-biasing method with CTC-WS in case of CTC model. Detected words (gpu, nvidia, cuda) are compared with words from the greedy CTC results in the overlapping intervals according to the accumulated scores to prevent false accept replacement.
\n", "
" ] @@ -93,7 +93,7 @@ "source": [ "\n", "\n", " \n", @@ -111,7 +111,7 @@ "metadata": {}, "source": [ "
\n", - " \"CTC-WS\" \n", + " \"CTC-WS\" \n", "
Figure 2. Scheme of the context-biasing method with CTC-based Word Spotter. CTC-WS uses CTC log probabilities to detect context-biasing candidates. Obtained candidates are filtered by CTC word alignment and then merged with CTC or RNN-T word alignment to get the final text result.
\n", "
" ] @@ -161,9 +161,9 @@ " ## Install dependencies\n", " !apt-get install sox libsndfile1 ffmpeg\n", "\n", - " !git clone -b $BRANCH https://github.com/NVIDIA/NeMo\n", + " !git clone -b $BRANCH https://github.com/NVIDIA-NeMo/Speech\n", " %cd NeMo\n", - " !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + " !python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", " NEMO_DIR_PATH = os.path.abspath('')\n", "\n", "import sys\n", diff --git a/tutorials/asr/ASR_Example_CommonVoice_Finetuning.ipynb b/tutorials/asr/ASR_Example_CommonVoice_Finetuning.ipynb index 5293f85044fc..2da8c7a7ab3d 100644 --- a/tutorials/asr/ASR_Example_CommonVoice_Finetuning.ipynb +++ b/tutorials/asr/ASR_Example_CommonVoice_Finetuning.ipynb @@ -10,7 +10,7 @@ "NOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n", "\n", "\n", - "Training an ASR model for a new language can be challenging, especially for low-resource languages (see [example](https://github.com/NVIDIA/NeMo/blob/main/docs/source/asr/examples/kinyarwanda_asr.rst) for Kinyarwanda CommonVoice ASR model).\n", + "Training an ASR model for a new language can be challenging, especially for low-resource languages (see [example](https://github.com/NVIDIA-NeMo/Speech/blob/main/docs/source/asr/examples/kinyarwanda_asr.rst) for Kinyarwanda CommonVoice ASR model).\n", "\n", "This example describes all basic steps required to build ASR model for Esperanto:\n", "\n", @@ -160,7 +160,7 @@ "\n", "The tarred dataset allows storing the dataset as large *.tar files instead of small separate audio files. 
It may speed up the training and minimizes the load when data is moved from storage to GPU nodes.\n", "\n", - "The NeMo toolkit provides a [script]( https://github.com/NVIDIA/NeMo/blob/main/scripts/speech_recognition/convert_to_tarred_audio_dataset.py) to get tarred dataset.\n", + "The NeMo toolkit provides a [script]( https://github.com/NVIDIA-NeMo/Speech/blob/main/scripts/speech_recognition/convert_to_tarred_audio_dataset.py) to get tarred dataset.\n", "\n", "```bash\n", "\n", @@ -207,11 +207,11 @@ "source": [ "## Training hyper-parameters\n", "\n", - "The training parameters are defined in the [config file](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/conformer/conformer_ctc_bpe.yaml) (general description of the [ASR configuration file](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/configs.html)). As an encoder, the [Conformer model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#conformer-ctc) is used here, the training parameters for which are already well configured based on the training English models. However, the set of optimal parameters may differ for a new language. In this section, we will look at the set of simple parameters that can improve recognition quality for a new language without digging into the details of the Conformer model too much.\n", + "The training parameters are defined in the [config file](https://github.com/NVIDIA-NeMo/Speech/blob/main/examples/asr/conf/conformer/conformer_ctc_bpe.yaml) (general description of the [ASR configuration file](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/configs.html)). As an encoder, the [Conformer model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#conformer-ctc) is used here, the training parameters for which are already well configured based on the training English models. However, the set of optimal parameters may differ for a new language. 
In this section, we will look at the set of simple parameters that can improve recognition quality for a new language without digging into the details of the Conformer model too much.\n", "\n", "### Select Training Batch Size\n", "\n", - "We trained model on server with 16 V100 GPUs with 32 GB. We use a local batch size = 32 per GPU V100), so global batch size is 32x16=512. In general, we observed, that global batch between 512 and 2048 works well for Conformer-CTC-Large model. One can use the [accumulate_grad_batches](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/conformer/conformer_ctc_bpe.yaml#L173) parameter to increase the size of the global batch, which is equal to *local_batch * num_gpu * accumulate_grad_batches*.\n", + "We trained model on server with 16 V100 GPUs with 32 GB. We use a local batch size = 32 per GPU V100), so global batch size is 32x16=512. In general, we observed, that global batch between 512 and 2048 works well for Conformer-CTC-Large model. One can use the [accumulate_grad_batches](https://github.com/NVIDIA-NeMo/Speech/blob/main/examples/asr/conf/conformer/conformer_ctc_bpe.yaml#L173) parameter to increase the size of the global batch, which is equal to *local_batch * num_gpu * accumulate_grad_batches*.\n", "\n", "### Selecting Optimizer and Learning Rate Scheduler\n", "\n", @@ -270,7 +270,7 @@ "* Fine-tuning from ASR models for other languages (English, Spanish, Italian).\n", "* Fine-tuning from an English SSL ([Self-supervised learning](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/ssl/intro.html?highlight=self%20supervised)) model.\n", "\n", - "For the training of the [Conformer-CTC](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#conformer-ctc) model, we use [speech_to_text_ctc_bpe.py](https://github.com/NVIDIA/NeMo/tree/stable/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py) with the default config 
[conformer_ctc_bpe.yaml](https://github.com/NVIDIA/NeMo/tree/stable/examples/asr/conf/conformer/conformer_ctc_bpe.yaml). Here you can see the example of how to run this training:\n", + "For the training of the [Conformer-CTC](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/models.html#conformer-ctc) model, we use [speech_to_text_ctc_bpe.py](https://github.com/NVIDIA-NeMo/Speech/tree/stable/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py) with the default config [conformer_ctc_bpe.yaml](https://github.com/NVIDIA-NeMo/Speech/tree/stable/examples/asr/conf/conformer/conformer_ctc_bpe.yaml). Here you can see the example of how to run this training:\n", "\n", "```bash\n", "TOKENIZER=${YOUR_DATA_ROOT}/esperanto/tokenizers/tokenizer_spe_bpe_v128\n", @@ -327,7 +327,7 @@ "+init_from_pretrained_model=${PRETRAINED_MODEL_NAME}\n", "```\n", "\n", - "If the size of the vocabulary differs from the one presented in the pretrained model, you need to change the vocabulary manually as done in the [finetuning tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/ASR_CTC_Language_Finetuning.ipynb).\n", + "If the size of the vocabulary differs from the one presented in the pretrained model, you need to change the vocabulary manually as done in the [finetuning tutorial](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/asr/ASR_CTC_Language_Finetuning.ipynb).\n", "\n", "```python\n", "model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(f\"nvidia/{PRETRAINED_MODEL_NAME}\", map_location='cpu')\n", diff --git a/tutorials/asr/ASR_for_telephony_speech.ipynb b/tutorials/asr/ASR_for_telephony_speech.ipynb index e8d323bce137..ea821db5385a 100644 --- a/tutorials/asr/ASR_for_telephony_speech.ipynb +++ b/tutorials/asr/ASR_for_telephony_speech.ipynb @@ -29,11 +29,11 @@ "\n", "## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "!python -m pip install 
git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "## Grab the config we'll use in this example\n", "!mkdir configs\n", - "!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/config.yaml\n", + "!wget -P configs/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/examples/asr/conf/config.yaml\n", "\n", "\"\"\"\n", "Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!\n", diff --git a/tutorials/asr/ASR_with_NeMo.ipynb b/tutorials/asr/ASR_with_NeMo.ipynb index 75d914bbeadf..3346960c47ac 100644 --- a/tutorials/asr/ASR_with_NeMo.ipynb +++ b/tutorials/asr/ASR_with_NeMo.ipynb @@ -54,7 +54,7 @@ "\n", "## Install NeMo\n", "BRANCH = 'main'\n", - "# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "# !python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "\"\"\"\n", "Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. 
matplotlib)!\n", @@ -72,7 +72,7 @@ "source": [ "# Introduction to End-To-End Automatic Speech Recognition\n", "\n", - "This notebook contains a basic tutorial of Automatic Speech Recognition (ASR) concepts, introduced with code snippets using the [NeMo Framework](https://github.com/NVIDIA/NeMo).\n", + "This notebook contains a basic tutorial of Automatic Speech Recognition (ASR) concepts, introduced with code snippets using the [NeMo Framework](https://github.com/NVIDIA-NeMo/Speech).\n", "We will first introduce the basics of the main concepts behind speech recognition, then explore concrete examples of what the data looks like and walk through putting together a simple end-to-end ASR pipeline.\n", "\n", "We assume that you are familiar with general machine learning concepts and can follow Python code, and we'll be using the [AN4 dataset from CMU](http://www.speech.cs.cmu.edu/databases/an4/) (with processing using `sox`)." @@ -111,7 +111,7 @@ "\n", "Earlier speech recognition approaches relied on **temporally-aligned data**, in which each segment of time in an audio file was matched up to a corresponding speech sound such as a phoneme or word. However, if we would like to have the flexibility to predict letter-by-letter to prevent OOV (out of vocabulary) issues, then each time step in the data would have to be labeled with the letter sound that the speaker is making at that point in the audio file. With that information, it seems like we should simply be able to try to predict the correct letter for each time step and then collapse the repeated letters (e.g. the prediction output `LLLAAAAPPTOOOPPPP` would become `LAPTOP`). It turns out that this idea has some problems: not only does alignment make the dataset incredibly labor-intensive to label, but also, what do we do with words like \"book\" that contain consecutive repeated letters? 
Simply squashing repeated letters together would not work in that case!\n", "\n", - "![Alignment example](https://raw.githubusercontent.com/NVIDIA/NeMo/stable/tutorials/asr/images/alignment_example.png)\n", + "![Alignment example](https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/stable/tutorials/asr/images/alignment_example.png)\n", "\n", "Modern end-to-end approaches get around this using methods that don't require manual alignment at all, so that the input-output pairs are really just the raw audio and the transcript--no extra data or labeling required. Let's briefly go over two popular approaches that allow us to do this, Connectionist Temporal Classification (CTC) and sequence-to-sequence models with attention.\n", "\n", @@ -418,7 +418,7 @@ "\n", "Now that we have an idea of what ASR is and how the audio data looks like, we can start using NeMo to do some ASR!\n", "\n", - "We'll be using the **Neural Modules (NeMo) toolkit** for this part, so if you haven't already, you should download and install NeMo and its dependencies. To do so, just follow the directions on the [GitHub page](https://github.com/NVIDIA/NeMo), or in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/).\n", + "We'll be using the **Neural Modules (NeMo) toolkit** for this part, so if you haven't already, you should download and install NeMo and its dependencies. To do so, just follow the directions on the [GitHub page](https://github.com/NVIDIA-NeMo/Speech), or in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/).\n", "\n", "NeMo lets us easily hook together the components (modules) of our model, such as the data layer, intermediate layers, and various losses, without worrying too much about implementation details of individual parts or connections between modules. NeMo also comes with complete models which only require your data and hyperparameters for training." 
] @@ -594,7 +594,7 @@ "id": "PXVKBniMlRz5" }, "outputs": [], - "source": "# --- Config Information ---#\nfrom omegaconf import OmegaConf\nconfig_path = './configs/conformer_ctc_char.yaml'\n\nif not os.path.exists(config_path):\n # Grab the config we'll use in this example\n BRANCH = 'main'\n !mkdir -p configs\n !wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/conformer/conformer_ctc_char.yaml\n\nparams = OmegaConf.to_container(OmegaConf.load(config_path), resolve=True)\nprint(params)" + "source": "# --- Config Information ---#\nfrom omegaconf import OmegaConf\nconfig_path = './configs/conformer_ctc_char.yaml'\n\nif not os.path.exists(config_path):\n # Grab the config we'll use in this example\n BRANCH = 'main'\n !mkdir -p configs\n !wget -P configs/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/examples/asr/conf/conformer/conformer_ctc_char.yaml\n\nparams = OmegaConf.to_container(OmegaConf.load(config_path), resolve=True)\nprint(params)" }, { "cell_type": "markdown", @@ -1777,7 +1777,7 @@ "# Of course, you can combine these flags as well.\n", "```\n", "\n", - "Finally, have a look at [example scripts in NeMo repository](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_ctc/speech_to_text_ctc.py) which can handle mixed precision and distributed training using command-line arguments." + "Finally, have a look at [example scripts in NeMo repository](https://github.com/NVIDIA-NeMo/Speech/blob/stable/examples/asr/asr_ctc/speech_to_text_ctc.py) which can handle mixed precision and distributed training using command-line arguments." 
] }, { diff --git a/tutorials/asr/ASR_with_Subword_Tokenization.ipynb b/tutorials/asr/ASR_with_Subword_Tokenization.ipynb index e75dc337c502..06413773af84 100644 --- a/tutorials/asr/ASR_with_Subword_Tokenization.ipynb +++ b/tutorials/asr/ASR_with_Subword_Tokenization.ipynb @@ -1,29 +1,48 @@ { - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "colab": { - "name": "ASR_with_Subword_Tokenization.ipynb", - "provenance": [], - "collapsed_sections": [], - "toc_visible": true - }, - "kernelspec": { - "name": "python3", - "display_name": "Python 3", - "language": "python" - }, - "accelerator": "GPU" - }, "cells": [ { "cell_type": "code", + "execution_count": null, "metadata": { "id": "HqBQwLAsme9b" }, - "source": "\"\"\"\nYou can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n\nInstructions for setting up Colab are as follows:\n1. Open a new Python 3 notebook.\n2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n4. Run this cell to set up dependencies.\n5. 
Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n\n\nNOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n\"\"\"\n\n# Install dependencies\n!pip install wget\n!apt-get install sox libsndfile1 ffmpeg\n!pip install text-unidecode\n!pip install matplotlib>=3.3.2\n\n## Install NeMo\nBRANCH = 'main'\n!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n\n## Grab the config we'll use in this example\n!mkdir configs\n!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/fastconformer/fast-conformer_ctc_bpe.yaml\n\n\"\"\"\nRemember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!\nAlternatively, you can uncomment the exit() below to crash and restart the kernel, in the case\nthat you want to use the \"Run All Cells\" (or similar) option.\n\"\"\"\n# exit()", - "execution_count": null, - "outputs": [] + "outputs": [], + "source": [ + "\"\"\"\n", + "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", + "\n", + "Instructions for setting up Colab are as follows:\n", + "1. Open a new Python 3 notebook.\n", + "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", + "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", + "4. Run this cell to set up dependencies.\n", + "5. 
Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n", + "\n", + "\n", + "NOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n", + "\"\"\"\n", + "\n", + "# Install dependencies\n", + "!pip install wget\n", + "!apt-get install sox libsndfile1 ffmpeg\n", + "!pip install text-unidecode\n", + "!pip install matplotlib>=3.3.2\n", + "\n", + "## Install NeMo\n", + "BRANCH = 'main'\n", + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", + "\n", + "## Grab the config we'll use in this example\n", + "!mkdir configs\n", + "!wget -P configs/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/examples/asr/conf/fastconformer/fast-conformer_ctc_bpe.yaml\n", + "\n", + "\"\"\"\n", + "Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!\n", + "Alternatively, you can uncomment the exit() below to crash and restart the kernel, in the case\n", + "that you want to use the \"Run All Cells\" (or similar) option.\n", + "\"\"\"\n", + "# exit()" + ] }, { "cell_type": "markdown", @@ -31,10 +50,10 @@ "id": "jW8pMLX4EKb0" }, "source": [ - "# Automatic Speech Recognition with Subword Tokenization\r\n", - "\r\n", - "In the [ASR with NeMo notebook](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb), we discuss the pipeline necessary for Automatic Speech Recognition (ASR), and then use the NeMo toolkit to construct a functioning speech recognition model.\r\n", - "\r\n", + "# Automatic Speech Recognition with Subword Tokenization\n", + "\n", + "In the [ASR with NeMo notebook](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb), we discuss the pipeline necessary for Automatic Speech Recognition (ASR), and then use the NeMo toolkit to construct a functioning speech 
recognition model.\n", + "\n", "In this notebook, we take a step further and look into subword tokenization as a useful encoding scheme for ASR models, and why they are necessary. We then construct a custom tokenizer from the dataset, and use it to construct and train an ASR model on the [AN4 dataset from CMU](http://www.speech.cs.cmu.edu/databases/an4/) (with processing using `sox`)." ] }, @@ -44,16 +63,16 @@ "id": "w2pDg6jJLLVM" }, "source": [ - "## Subword Tokenization\r\n", - "\r\n", - "We begin with a short intro to what exactly is subword tokenization. If you are familiar with some Natural Language Processing terminologies, then you might have heard of the term \"subword\" frequently.\r\n", - "\r\n", - "So what is a subword in the first place? Simply put, it is either a single character or a group of characters. When combined according to a tokenization-detokenization algorithm, it generates a set of characters, words, or entire sentences. \r\n", - "\r\n", - "Many subword tokenization-detokenization algorithms exist, which can be built using large corpora of text data to tokenize and detokenize the data to and from subwords effectively. Some of the most commonly used subword tokenization methods are [Byte Pair Encoding](https://arxiv.org/abs/1508.07909), [Word Piece Encoding](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf) and [Sentence Piece Encoding](https://www.aclweb.org/anthology/D18-2012/), to name just a few.\r\n", - "\r\n", - "------\r\n", - "\r\n", + "## Subword Tokenization\n", + "\n", + "We begin with a short intro to what exactly is subword tokenization. If you are familiar with some Natural Language Processing terminologies, then you might have heard of the term \"subword\" frequently.\n", + "\n", + "So what is a subword in the first place? Simply put, it is either a single character or a group of characters. 
When combined according to a tokenization-detokenization algorithm, it generates a set of characters, words, or entire sentences. \n", + "\n", + "Many subword tokenization-detokenization algorithms exist, which can be built using large corpora of text data to tokenize and detokenize the data to and from subwords effectively. Some of the most commonly used subword tokenization methods are [Byte Pair Encoding](https://arxiv.org/abs/1508.07909), [Word Piece Encoding](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf) and [Sentence Piece Encoding](https://www.aclweb.org/anthology/D18-2012/), to name just a few.\n", + "\n", + "------\n", + "\n", "Here, we will show a short demo on why subword tokenization is necessary for Automatic Speech Recognition under certain situations and its benefits to the model in terms of efficiency and accuracy." ] }, @@ -68,17 +87,17 @@ }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "M_MQ7NLlBbup" }, + "outputs": [], "source": [ - "TEXT_CORPUS = [\r\n", - " \"hello world\",\r\n", - " \"today is a good day\",\r\n", + "TEXT_CORPUS = [\n", + " \"hello world\",\n", + " \"today is a good day\",\n", "]" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -91,23 +110,23 @@ }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "3tusMof9QMs7" }, + "outputs": [], "source": [ - "def char_tokenize(text):\r\n", - " tokens = []\r\n", - " for char in text:\r\n", - " tokens.append(ord(char))\r\n", - " return tokens\r\n", - "\r\n", - "def char_detokenize(tokens):\r\n", - " tokens = [chr(t) for t in tokens]\r\n", - " text = \"\".join(tokens)\r\n", + "def char_tokenize(text):\n", + " tokens = []\n", + " for char in text:\n", + " tokens.append(ord(char))\n", + " return tokens\n", + "\n", + "def char_detokenize(tokens):\n", + " tokens = [chr(t) for t in tokens]\n", + " text = \"\".join(tokens)\n", " return text" - ], - "execution_count": null, - 
"outputs": [] + ] }, { "cell_type": "markdown", @@ -120,17 +139,17 @@ }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "2stpuRsNQpMJ" }, + "outputs": [], "source": [ - "char_tokens = char_tokenize(TEXT_CORPUS[0])\r\n", - "print(\"Tokenized tokens :\", char_tokens)\r\n", - "text = char_detokenize(char_tokens)\r\n", + "char_tokens = char_tokenize(TEXT_CORPUS[0])\n", + "print(\"Tokenized tokens :\", char_tokens)\n", + "text = char_detokenize(char_tokens)\n", "print(\"Detokenized text :\", text)" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -138,48 +157,48 @@ "id": "gY6G6Ow1RSf4" }, "source": [ - "-----\r\n", - "Great! The character tokenizer did its job correctly - each character is separated as an individual token, and they can be reconstructed into precisely the original text!\r\n", - "\r\n", + "-----\n", + "Great! The character tokenizer did its job correctly - each character is separated as an individual token, and they can be reconstructed into precisely the original text!\n", + "\n", "Now let's create a simple dictionary-based tokenizer - it will have a select set of subwords that it will use to map tokens back and forth. Note - to simplify the technique's demonstration; we will use a vocabulary with entire words. However, note that this is an uncommon occurrence unless the vocabulary sizes are huge when built on natural text." 
] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "Mhn2MxODRNTv" }, + "outputs": [], "source": [ - "def dict_tokenize(text, vocabulary):\r\n", - " tokens = []\r\n", - "\r\n", - " # first do full word searches\r\n", - " split_text = text.split()\r\n", - " for split in split_text:\r\n", - " if split in vocabulary:\r\n", - " tokens.append(vocabulary[split])\r\n", - " else:\r\n", - " chars = list(split)\r\n", - " t_chars = [vocabulary[c] for c in chars]\r\n", - " tokens.extend(t_chars)\r\n", - " tokens.append(vocabulary[\" \"])\r\n", - "\r\n", - " # remove extra space token\r\n", - " tokens.pop(-1)\r\n", - " return tokens\r\n", - "\r\n", - "def dict_detokenize(tokens, vocabulary):\r\n", - " text = \"\"\r\n", - " reverse_vocab = {v: k for k, v in vocabulary.items()}\r\n", - " for token in tokens:\r\n", - " if token in reverse_vocab:\r\n", - " text = text + reverse_vocab[token]\r\n", - " else:\r\n", - " text = text + \"\".join(token)\r\n", + "def dict_tokenize(text, vocabulary):\n", + " tokens = []\n", + "\n", + " # first do full word searches\n", + " split_text = text.split()\n", + " for split in split_text:\n", + " if split in vocabulary:\n", + " tokens.append(vocabulary[split])\n", + " else:\n", + " chars = list(split)\n", + " t_chars = [vocabulary[c] for c in chars]\n", + " tokens.extend(t_chars)\n", + " tokens.append(vocabulary[\" \"])\n", + "\n", + " # remove extra space token\n", + " tokens.pop(-1)\n", + " return tokens\n", + "\n", + "def dict_detokenize(tokens, vocabulary):\n", + " text = \"\"\n", + " reverse_vocab = {v: k for k, v in vocabulary.items()}\n", + " for token in tokens:\n", + " if token in reverse_vocab:\n", + " text = text + reverse_vocab[token]\n", + " else:\n", + " text = text + \"\".join(token)\n", " return text" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -192,34 +211,34 @@ }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "rone69s8Ui3q" }, + "outputs": [], 
"source": [ - "vocabulary = {chr(i + ord(\"a\")) : (i + 1) for i in range(26)}\r\n", - "# add whole words and special tokens\r\n", - "vocabulary[\" \"] = 0\r\n", - "vocabulary[\"hello\"] = len(vocabulary) + 1\r\n", - "vocabulary[\"today\"] = len(vocabulary) + 1\r\n", - "vocabulary[\"good\"] = len(vocabulary) + 1\r\n", + "vocabulary = {chr(i + ord(\"a\")) : (i + 1) for i in range(26)}\n", + "# add whole words and special tokens\n", + "vocabulary[\" \"] = 0\n", + "vocabulary[\"hello\"] = len(vocabulary) + 1\n", + "vocabulary[\"today\"] = len(vocabulary) + 1\n", + "vocabulary[\"good\"] = len(vocabulary) + 1\n", "print(vocabulary)" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "sGLGaLtXUgrN" }, + "outputs": [], "source": [ - "dict_tokens = dict_tokenize(TEXT_CORPUS[0], vocabulary)\r\n", - "print(\"Tokenized tokens :\", dict_tokens)\r\n", - "text = dict_detokenize(dict_tokens, vocabulary)\r\n", + "dict_tokens = dict_tokenize(TEXT_CORPUS[0], vocabulary)\n", + "print(\"Tokenized tokens :\", dict_tokens)\n", + "text = dict_detokenize(dict_tokens, vocabulary)\n", "print(\"Detokenized text :\", text)" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -227,25 +246,25 @@ "id": "rUETSbM-XYUl" }, "source": [ - "------\r\n", - "Great! Our dictionary tokenizer works well and tokenizes-detokenizes the data correctly.\r\n", - "\r\n", - "You might be wondering - why did we have to go through all this trouble to tokenize and detokenize data if we get back the same thing?\r\n", - "\r\n", + "------\n", + "Great! Our dictionary tokenizer works well and tokenizes-detokenizes the data correctly.\n", + "\n", + "You might be wondering - why did we have to go through all this trouble to tokenize and detokenize data if we get back the same thing?\n", + "\n", "For ASR - the hidden benefit lies in the length of the tokenized representation!" 
] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "eZFGuLqUVhLW" }, + "outputs": [], "source": [ - "print(\"Character tokenization length -\", len(char_tokens))\r\n", + "print(\"Character tokenization length -\", len(char_tokens))\n", "print(\"Dict tokenization length -\", len(dict_tokens))" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -253,8 +272,8 @@ "id": "vw6jJD8eYJpK" }, "source": [ - "By having the whole word \"hello\" in our tokenizer's dictionary, we could reduce the length of the tokenized data by four tokens and still represent the same information!\r\n", - "\r\n", + "By having the whole word \"hello\" in our tokenizer's dictionary, we could reduce the length of the tokenized data by four tokens and still represent the same information!\n", + "\n", "Actual subword algorithms like the ones discussed above go several steps further - they partition whole words based on occurrence in text and build tokens for them too! So instead of wasting 5 tokens for `[\"h\", \"e\", \"l\", \"l\", \"o\"]`, we can represent it as `[\"hel##\", \"##lo\"]` and then merge the `##` tokens together to get back `hello` by using just 2 tokens !" ] }, @@ -264,25 +283,25 @@ "id": "hcCbVA3GY-TZ" }, "source": [ - "## The necessity of subword tokenization\r\n", - "\r\n", - "It has been found via extensive research in the domain of Neural Machine Translation and Language Modelling (and its variants), that subword tokenization not only reduces the length of the tokenized representation (thereby making sentences shorter and more manageable for models to learn), but also boosts the accuracy of prediction of correct tokens (refer to the earlier cited papers).\r\n", - "\r\n", - "You might remember that earlier; we mentioned subword tokenization as a necessity rather than just a nice-to-have component for ASR. 
In the previous tutorial, we used the [Connectionist Temporal Classification](https://www.cs.toronto.edu/~graves/icml_2006.pdf) loss function to train the model, but this loss function has a few limitations- \r\n", - "\r\n", - " - **Generated tokens are conditionally independent of each other**. In other words - the probability of character \"l\" being predicted after \"hel##\" is conditionally independent of the previous token - so any other token can also be predicted unless the model has future information!\r\n", - " - **The length of the generated (target) sequence must be shorter than that of the source sequence.** \r\n", - "\r\n", - "------\r\n", - "\r\n", - "It turns out - subword tokenization helps alleviate both of these issues!\r\n", - "\r\n", - " - Sophisticated subword tokenization algorithms build their vocabularies based on large text corpora. To accurately tokenize such large volumes of text with minimal vocabulary size, the subwords that are learned inherently model the interdependency between tokens of that language to some degree. \r\n", - " \r\n", - "Looking at the previous example, the token `hel##` is a single token that represents the relationship `h` => `e` => `l`. 
When the model predicts the single token `hel##`, it implicitly predicts this relationship - even though the subsequent token can be either `l` (for `hell`) or `##lo` (for `hello`) and is predicted independently of the previous token!\r\n", - "\r\n", - " - By reducing the target sentence length by subword tokenization (target sentence here being the characters/subwords transcribed from the audio signal), we entirely sidestep the sequence length limitation of CTC loss!\r\n", - "\r\n", + "## The necessity of subword tokenization\n", + "\n", + "It has been found via extensive research in the domain of Neural Machine Translation and Language Modelling (and its variants), that subword tokenization not only reduces the length of the tokenized representation (thereby making sentences shorter and more manageable for models to learn), but also boosts the accuracy of prediction of correct tokens (refer to the earlier cited papers).\n", + "\n", + "You might remember that earlier; we mentioned subword tokenization as a necessity rather than just a nice-to-have component for ASR. In the previous tutorial, we used the [Connectionist Temporal Classification](https://www.cs.toronto.edu/~graves/icml_2006.pdf) loss function to train the model, but this loss function has a few limitations- \n", + "\n", + " - **Generated tokens are conditionally independent of each other**. In other words - the probability of character \"l\" being predicted after \"hel##\" is conditionally independent of the previous token - so any other token can also be predicted unless the model has future information!\n", + " - **The length of the generated (target) sequence must be shorter than that of the source sequence.** \n", + "\n", + "------\n", + "\n", + "It turns out - subword tokenization helps alleviate both of these issues!\n", + "\n", + " - Sophisticated subword tokenization algorithms build their vocabularies based on large text corpora. 
To accurately tokenize such large volumes of text with minimal vocabulary size, the subwords that are learned inherently model the interdependency between tokens of that language to some degree. \n", + " \n", + "Looking at the previous example, the token `hel##` is a single token that represents the relationship `h` => `e` => `l`. When the model predicts the single token `hel##`, it implicitly predicts this relationship - even though the subsequent token can be either `l` (for `hell`) or `##lo` (for `hello`) and is predicted independently of the previous token!\n", + "\n", + " - By reducing the target sentence length by subword tokenization (target sentence here being the characters/subwords transcribed from the audio signal), we entirely sidestep the sequence length limitation of CTC loss!\n", + "\n", "This means we can perform a larger number of pooling steps in our acoustic models, thereby improving execution speed while simultaneously reducing memory requirements." ] }, @@ -292,8 +311,8 @@ "id": "KAFSGJRAeTe6" }, "source": [ - "# Building a custom subword tokenizer\r\n", - "\r\n", + "# Building a custom subword tokenizer\n", + "\n", "After all that talk about subword tokenization, let's finally build a custom tokenizer for our ASR model! While the `AN4` dataset is simple enough to be trained using character-based models, its small size is also perfect for a demonstration on a notebook." ] }, @@ -303,64 +322,64 @@ "id": "Ire6cSmEe2GU" }, "source": [ - "## Preparing the dataset (AN4)\r\n", - "\r\n", - "The AN4 dataset, also known as the Alphanumeric dataset, was collected and published by Carnegie Mellon University. It consists of recordings of people spelling out addresses, names, telephone numbers, etc., one letter or number at a time, and their corresponding transcripts. 
We choose to use AN4 for this tutorial because it is relatively small, with 948 training and 130 test utterances, and so it trains quickly.\r\n", - "\r\n", + "## Preparing the dataset (AN4)\n", + "\n", + "The AN4 dataset, also known as the Alphanumeric dataset, was collected and published by Carnegie Mellon University. It consists of recordings of people spelling out addresses, names, telephone numbers, etc., one letter or number at a time, and their corresponding transcripts. We choose to use AN4 for this tutorial because it is relatively small, with 948 training and 130 test utterances, and so it trains quickly.\n", + "\n", "Before we get started, let's download and prepare the dataset. The utterances are available as `.sph` files, so we will need to convert them to `.wav` for later processing. If you are not using Google Colab, please make sure you have [Sox](http://sox.sourceforge.net/) installed for this step--see the \"Downloads\" section of the linked Sox homepage. (If you are using Google Colab, Sox should have already been installed in the setup cell at the beginning.)" ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "dLB_KedzYHCw" }, + "outputs": [], "source": [ "# This is where the an4/ directory will be placed.\n", "# Change this if you don't want the data to be extracted in the current directory.\n", "# The directory should exist.\n", "data_dir = \".\"" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "AsHdRslhe-7W" }, + "outputs": [], "source": [ - "import glob\r\n", - "import os\r\n", - "import subprocess\r\n", - "import tarfile\r\n", - "import wget\r\n", - "\r\n", - "# Download the dataset. 
This will take a few moments...\r\n", - "print(\"******\")\r\n", - "if not os.path.exists(data_dir + '/an4_sphere.tar.gz'):\r\n", - " an4_url = 'https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz'\r\n", - " an4_path = wget.download(an4_url, data_dir)\r\n", - " print(f\"Dataset downloaded at: {an4_path}\")\r\n", - "else:\r\n", - " print(\"Tarfile already exists.\")\r\n", - " an4_path = data_dir + '/an4_sphere.tar.gz'\r\n", - "\r\n", - "if not os.path.exists(data_dir + '/an4/'):\r\n", - " # Untar and convert .sph to .wav (using sox)\r\n", - " tar = tarfile.open(an4_path)\r\n", - " tar.extractall(path=data_dir)\r\n", - "\r\n", - " print(\"Converting .sph to .wav...\")\r\n", - " sph_list = glob.glob(data_dir + '/an4/**/*.sph', recursive=True)\r\n", - " for sph_path in sph_list:\r\n", - " wav_path = sph_path[:-4] + '.wav'\r\n", - " cmd = [\"sox\", sph_path, wav_path]\r\n", - " subprocess.run(cmd)\r\n", + "import glob\n", + "import os\n", + "import subprocess\n", + "import tarfile\n", + "import wget\n", + "\n", + "# Download the dataset. 
This will take a few moments...\n", + "print(\"******\")\n", + "if not os.path.exists(data_dir + '/an4_sphere.tar.gz'):\n", + " an4_url = 'https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz'\n", + " an4_path = wget.download(an4_url, data_dir)\n", + " print(f\"Dataset downloaded at: {an4_path}\")\n", + "else:\n", + " print(\"Tarfile already exists.\")\n", + " an4_path = data_dir + '/an4_sphere.tar.gz'\n", + "\n", + "if not os.path.exists(data_dir + '/an4/'):\n", + " # Untar and convert .sph to .wav (using sox)\n", + " tar = tarfile.open(an4_path)\n", + " tar.extractall(path=data_dir)\n", + "\n", + " print(\"Converting .sph to .wav...\")\n", + " sph_list = glob.glob(data_dir + '/an4/**/*.sph', recursive=True)\n", + " for sph_path in sph_list:\n", + " wav_path = sph_path[:-4] + '.wav'\n", + " cmd = [\"sox\", sph_path, wav_path]\n", + " subprocess.run(cmd)\n", "print(\"Finished conversion.\\n******\")" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -368,7 +387,7 @@ "id": "6kOuy-OWfUWn" }, "source": [ - "You should now have a folder called `an4` that contains `etc/an4_train.transcription`, `etc/an4_test.transcription`, audio files in `wav/an4_clstk` and `wav/an4test_clstk`, along with some other files we will not need.\r\n" + "You should now have a folder called `an4` that contains `etc/an4_train.transcription`, `etc/an4_test.transcription`, audio files in `wav/an4_clstk` and `wav/an4test_clstk`, along with some other files we will not need.\n" ] }, { @@ -377,79 +396,79 @@ "id": "S2S--I3kftF0" }, "source": [ - "## Creating Data Manifests\r\n", - "\r\n", - "The first thing we need to do now is to create manifests for our training and evaluation data, which will contain the metadata of our audio files. 
NeMo data sets take in a standardized manifest format where each line corresponds to one sample of audio, such that the number of lines in a manifest is equal to the number of samples that are represented by that manifest. A line must contain the path to an audio file, the corresponding transcript (or path to a transcript file), and the duration of the audio sample.\r\n", - "\r\n", - "Here's an example of what one line in a NeMo-compatible manifest might look like:\r\n", - "```\r\n", - "{\"audio_filepath\": \"path/to/audio.wav\", \"duration\": 3.45, \"text\": \"this is a nemo tutorial\"}\r\n", - "```\r\n", - "\r\n", - "We can build our training and evaluation manifests using `an4/etc/an4_train.transcription` and `an4/etc/an4_test.transcription`, which have lines containing transcripts and their corresponding audio file IDs:\r\n", - "```\r\n", - "...\r\n", - " P I T T S B U R G H (cen5-fash-b)\r\n", - " TWO SIX EIGHT FOUR FOUR ONE EIGHT (cen7-fash-b)\r\n", - "...\r\n", + "## Creating Data Manifests\n", + "\n", + "The first thing we need to do now is to create manifests for our training and evaluation data, which will contain the metadata of our audio files. NeMo data sets take in a standardized manifest format where each line corresponds to one sample of audio, such that the number of lines in a manifest is equal to the number of samples that are represented by that manifest. 
A line must contain the path to an audio file, the corresponding transcript (or path to a transcript file), and the duration of the audio sample.\n", + "\n", + "Here's an example of what one line in a NeMo-compatible manifest might look like:\n", + "```\n", + "{\"audio_filepath\": \"path/to/audio.wav\", \"duration\": 3.45, \"text\": \"this is a nemo tutorial\"}\n", + "```\n", + "\n", + "We can build our training and evaluation manifests using `an4/etc/an4_train.transcription` and `an4/etc/an4_test.transcription`, which have lines containing transcripts and their corresponding audio file IDs:\n", + "```\n", + "...\n", + " P I T T S B U R G H (cen5-fash-b)\n", + " TWO SIX EIGHT FOUR FOUR ONE EIGHT (cen7-fash-b)\n", + "...\n", "```" ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "sFyGsk80fRp7" }, + "outputs": [], "source": [ - "# --- Building Manifest Files --- #\r\n", - "import json\r\n", - "import librosa\r\n", - "\r\n", - "# Function to build a manifest\r\n", - "def build_manifest(transcripts_path, manifest_path, wav_path):\r\n", - " with open(transcripts_path, 'r') as fin:\r\n", - " with open(manifest_path, 'w') as fout:\r\n", - " for line in fin:\r\n", - " # Lines look like this:\r\n", - " # transcript (fileID)\r\n", - " transcript = line[: line.find('(')-1].lower()\r\n", - " transcript = transcript.replace('', '').replace('', '')\r\n", - " transcript = transcript.strip()\r\n", - "\r\n", - " file_id = line[line.find('(')+1 : -2] # e.g. 
\"cen4-fash-b\"\r\n", - " audio_path = os.path.join(\r\n", - " data_dir, wav_path,\r\n", - " file_id[file_id.find('-')+1 : file_id.rfind('-')],\r\n", - " file_id + '.wav')\r\n", - "\r\n", - " duration = librosa.core.get_duration(filename=audio_path)\r\n", - "\r\n", - " # Write the metadata to the manifest\r\n", - " metadata = {\r\n", - " \"audio_filepath\": audio_path,\r\n", - " \"duration\": duration,\r\n", - " \"text\": transcript\r\n", - " }\r\n", - " json.dump(metadata, fout)\r\n", - " fout.write('\\n')\r\n", - " \r\n", - "# Building Manifests\r\n", - "print(\"******\")\r\n", - "train_transcripts = data_dir + '/an4/etc/an4_train.transcription'\r\n", - "train_manifest = data_dir + '/an4/train_manifest.json'\r\n", - "if not os.path.isfile(train_manifest):\r\n", - " build_manifest(train_transcripts, train_manifest, 'an4/wav/an4_clstk')\r\n", - " print(\"Training manifest created.\")\r\n", - "\r\n", - "test_transcripts = data_dir + '/an4/etc/an4_test.transcription'\r\n", - "test_manifest = data_dir + '/an4/test_manifest.json'\r\n", - "if not os.path.isfile(test_manifest):\r\n", - " build_manifest(test_transcripts, test_manifest, 'an4/wav/an4test_clstk')\r\n", - " print(\"Test manifest created.\")\r\n", + "# --- Building Manifest Files --- #\n", + "import json\n", + "import librosa\n", + "\n", + "# Function to build a manifest\n", + "def build_manifest(transcripts_path, manifest_path, wav_path):\n", + " with open(transcripts_path, 'r') as fin:\n", + " with open(manifest_path, 'w') as fout:\n", + " for line in fin:\n", + " # Lines look like this:\n", + " # transcript (fileID)\n", + " transcript = line[: line.find('(')-1].lower()\n", + " transcript = transcript.replace('', '').replace('', '')\n", + " transcript = transcript.strip()\n", + "\n", + " file_id = line[line.find('(')+1 : -2] # e.g. 
\"cen4-fash-b\"\n", + " audio_path = os.path.join(\n", + " data_dir, wav_path,\n", + " file_id[file_id.find('-')+1 : file_id.rfind('-')],\n", + " file_id + '.wav')\n", + "\n", + " duration = librosa.core.get_duration(filename=audio_path)\n", + "\n", + " # Write the metadata to the manifest\n", + " metadata = {\n", + " \"audio_filepath\": audio_path,\n", + " \"duration\": duration,\n", + " \"text\": transcript\n", + " }\n", + " json.dump(metadata, fout)\n", + " fout.write('\\n')\n", + "\n", + "# Building Manifests\n", + "print(\"******\")\n", + "train_transcripts = data_dir + '/an4/etc/an4_train.transcription'\n", + "train_manifest = data_dir + '/an4/train_manifest.json'\n", + "if not os.path.isfile(train_manifest):\n", + " build_manifest(train_transcripts, train_manifest, 'an4/wav/an4_clstk')\n", + " print(\"Training manifest created.\")\n", + "\n", + "test_transcripts = data_dir + '/an4/etc/an4_test.transcription'\n", + "test_manifest = data_dir + '/an4/test_manifest.json'\n", + "if not os.path.isfile(test_manifest):\n", + " build_manifest(test_transcripts, test_manifest, 'an4/wav/an4test_clstk')\n", + " print(\"Test manifest created.\")\n", "print(\"***Done***\")" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -462,14 +481,14 @@ }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "PSv_wZTQf50U" }, + "outputs": [], "source": [ "!head -n 5 {data_dir}/an4/train_manifest.json" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -477,25 +496,25 @@ "id": "3S80tsTHhDmU" }, "source": [ - "## Build a custom tokenizer\r\n", - "\r\n", - "Next, we will use a NeMo script to easily build a tokenizer for the above dataset. The script takes a few arguments, which will be explained in detail.\r\n", - "\r\n", + "## Build a custom tokenizer\n", + "\n", + "Next, we will use a NeMo script to easily build a tokenizer for the above dataset. 
The script takes a few arguments, which will be explained in detail.\n", + "\n", "First, download the tokenizer creation script from the nemo repository." ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "ESHI2piTgJRO" }, + "outputs": [], "source": [ "if not os.path.exists(\"scripts/tokenizers/process_asr_text_tokenizer.py\"):\n", " !mkdir scripts\n", - " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tokenizers/process_asr_text_tokenizer.py" - ], - "execution_count": null, - "outputs": [] + " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/scripts/tokenizers/process_asr_text_tokenizer.py" + ] }, { "cell_type": "markdown", @@ -503,34 +522,36 @@ "id": "BkcpeYp1iIsU" }, "source": [ - "The script above takes a few important arguments -\r\n", - "\r\n", - " - either `--manifest` or `--data_file`: If your text data lies inside of an ASR manifest file, then use the `--manifest` path. If instead the text data is inside a file with separate lines corresponding to different text lines, then use `--data_file`. In either case, you can add commas to concatenate different manifests or different data files.\r\n", - "\r\n", - " - `--data_root`: The output directory (whose subdirectories will be created if not present) where the tokenizers will be placed.\r\n", - "\r\n", - " - `--vocab_size`: The size of the tokenizer vocabulary. Larger vocabularies can accommodate almost entire words, but the decoder size of any model will grow proportionally.\r\n", - "\r\n", - " - `--tokenizer`: Can be either `spe` or `wpe` . `spe` refers to the Google `sentencepiece` library tokenizer. `wpe` refers to the HuggingFace BERT Word Piece tokenizer. Please refer to the papers above for the relevant technique in order to select an appropriate tokenizer.\r\n", - "\r\n", - " - `--no_lower_case`: When this flag is passed, it will force the tokenizer to create separate tokens for upper and lower case characters. 
By default, the script will turn all the text to lower case before tokenization (and if upper case characters are passed during training/inference, the tokenizer will emit a token equivalent to Out-Of-Vocabulary). Used primarily for the English language. \r\n", - "\r\n", - " - `--spe_type`: The `sentencepiece` library has a few implementations of the tokenization technique, and `spe_type` refers to these implementations. Currently supported types are `unigram`, `bpe`, `char`, `word`. Defaults to `bpe`.\r\n", - "\r\n", - " - `--spe_character_coverage`: The `sentencepiece` library considers how much of the original vocabulary it should cover in its \"base set\" of tokens (akin to the lower and upper case characters of the English language). For almost all languages with small base token sets `(<1000 tokens)`, this should be kept at its default of 1.0. For languages with larger vocabularies (say Japanese, Mandarin, Korean etc), the suggested value is 0.9995.\r\n", - "\r\n", - " - `--spe_sample_size`: If the dataset is too large, consider using a sampled dataset indicated by a positive integer. By default, any negative value (default = -1) will use the entire dataset.\r\n", - "\r\n", - " - `--spe_train_extremely_large_corpus`: When training a sentencepiece tokenizer on very large amounts of text, sometimes the tokenizer will run out of memory or won't be able to process so much data on RAM. At some point you might receive the following error - \"Input corpus too large, try with train_extremely_large_corpus=true\". If your machine has large amounts of RAM, it might still be possible to build the tokenizer using the above flag. Will silently fail if it runs out of RAM.\r\n", - "\r\n", + "The script above takes a few important arguments -\n", + "\n", + " - either `--manifest` or `--data_file`: If your text data lies inside of an ASR manifest file, then use the `--manifest` path. 
If instead the text data is inside a file with separate lines corresponding to different text lines, then use `--data_file`. In either case, you can add commas to concatenate different manifests or different data files.\n", + "\n", + " - `--data_root`: The output directory (whose subdirectories will be created if not present) where the tokenizers will be placed.\n", + "\n", + " - `--vocab_size`: The size of the tokenizer vocabulary. Larger vocabularies can accommodate almost entire words, but the decoder size of any model will grow proportionally.\n", + "\n", + " - `--tokenizer`: Can be either `spe` or `wpe` . `spe` refers to the Google `sentencepiece` library tokenizer. `wpe` refers to the HuggingFace BERT Word Piece tokenizer. Please refer to the papers above for the relevant technique in order to select an appropriate tokenizer.\n", + "\n", + " - `--no_lower_case`: When this flag is passed, it will force the tokenizer to create separate tokens for upper and lower case characters. By default, the script will turn all the text to lower case before tokenization (and if upper case characters are passed during training/inference, the tokenizer will emit a token equivalent to Out-Of-Vocabulary). Used primarily for the English language. \n", + "\n", + " - `--spe_type`: The `sentencepiece` library has a few implementations of the tokenization technique, and `spe_type` refers to these implementations. Currently supported types are `unigram`, `bpe`, `char`, `word`. Defaults to `bpe`.\n", + "\n", + " - `--spe_character_coverage`: The `sentencepiece` library considers how much of the original vocabulary it should cover in its \"base set\" of tokens (akin to the lower and upper case characters of the English language). For almost all languages with small base token sets `(<1000 tokens)`, this should be kept at its default of 1.0. 
For languages with larger vocabularies (say Japanese, Mandarin, Korean etc), the suggested value is 0.9995.\n", + "\n", + " - `--spe_sample_size`: If the dataset is too large, consider using a sampled dataset indicated by a positive integer. By default, any negative value (default = -1) will use the entire dataset.\n", + "\n", + " - `--spe_train_extremely_large_corpus`: When training a sentencepiece tokenizer on very large amounts of text, sometimes the tokenizer will run out of memory or won't be able to process so much data on RAM. At some point you might receive the following error - \"Input corpus too large, try with train_extremely_large_corpus=true\". If your machine has large amounts of RAM, it might still be possible to build the tokenizer using the above flag. Will silently fail if it runs out of RAM.\n", + "\n", " - `--log`: Whether the script should display log messages" ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "mAw4WMqbh6ii" }, + "outputs": [], "source": [ "!python ./scripts/process_asr_text_tokenizer.py \\\n", " --manifest=\"{data_dir}/an4/train_manifest.json\" \\\n", @@ -540,9 +561,7 @@ " --no_lower_case \\\n", " --spe_type=\"unigram\" \\\n", " --log" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -550,45 +569,55 @@ "id": "gaIFIKgol-p2" }, "source": [ - "-----\r\n", - "\r\n", - "That's it! Our tokenizer is now built and stored inside the `data_root` directory that we provided to the script.\r\n", - "\r\n", + "-----\n", + "\n", + "That's it! Our tokenizer is now built and stored inside the `data_root` directory that we provided to the script.\n", + "\n", "First we start by inspecting the tokenizer vocabulary itself. 
To keep it manageable, we will print just the first 10 tokens of the vocabulary:" ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "0A9fSpr4l58u" }, + "outputs": [], "source": [ "!head -n 10 {data_dir}/tokenizers/an4/tokenizer_spe_unigram_v32/vocab.txt" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", "metadata": { "id": "kPuyTHGTm8Q-" }, - "source": "# Training an ASR Model with subword tokenization\n\nNow that our tokenizer is built, let's begin constructing an ASR model that will use this tokenizer for its dataset pre-processing and post-processing steps.\n\nWe will use a FastConformer model to demonstrate the usage of subword tokenization models for training and inference. FastConformer is based on the [Conformer architecture](https://arxiv.org/abs/2005.08100), which combines convolution and self-attention to capture both local and global dependencies in audio. It uses subword-tokenization along with efficient downsampling and [Squeeze-and-Excitation](https://arxiv.org/abs/1709.01507) to achieve strong accuracy in transcriptions while still using non-autoregressive CTC decoding for efficient inference.\n\nWe'll be using the **Neural Modules (NeMo) toolkit** for this part, so if you haven't already, you should download and install NeMo and its dependencies. To do so, just follow the directions on the [GitHub page](https://github.com/NVIDIA/NeMo), or in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/).\n\nNeMo let us easily hook together the components (modules) of our model, such as the data layer, intermediate layers, and various losses, without worrying too much about implementation details of individual parts or connections between modules. NeMo also comes with complete models which only require your data and hyperparameters for training." 
+ "source": [ + "# Training an ASR Model with subword tokenization\n", + "\n", + "Now that our tokenizer is built, let's begin constructing an ASR model that will use this tokenizer for its dataset pre-processing and post-processing steps.\n", + "\n", + "We will use a FastConformer model to demonstrate the usage of subword tokenization models for training and inference. FastConformer is based on the [Conformer architecture](https://arxiv.org/abs/2005.08100), which combines convolution and self-attention to capture both local and global dependencies in audio. It uses subword-tokenization along with efficient downsampling and [Squeeze-and-Excitation](https://arxiv.org/abs/1709.01507) to achieve strong accuracy in transcriptions while still using non-autoregressive CTC decoding for efficient inference.\n", + "\n", + "We'll be using the **Neural Modules (NeMo) toolkit** for this part, so if you haven't already, you should download and install NeMo and its dependencies. To do so, just follow the directions on the [GitHub page](https://github.com/NVIDIA-NeMo/Speech), or in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/).\n", + "\n", + "NeMo let us easily hook together the components (modules) of our model, such as the data layer, intermediate layers, and various losses, without worrying too much about implementation details of individual parts or connections between modules. NeMo also comes with complete models which only require your data and hyperparameters for training." 
+ ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "jALgpGLjmaCw" }, + "outputs": [], "source": [ - "# NeMo's \"core\" package\r\n", - "import nemo\r\n", - "# NeMo's ASR collection - this collections contains complete ASR models and\r\n", - "# building blocks (modules) for ASR\r\n", + "# NeMo's \"core\" package\n", + "import nemo\n", + "# NeMo's ASR collection - this collections contains complete ASR models and\n", + "# building blocks (modules) for ASR\n", "import nemo.collections.asr as nemo_asr" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -596,8 +625,8 @@ "id": "msxCiR8epEZu" }, "source": [ - "## Training from scratch\r\n", - "\r\n", + "## Training from scratch\n", + "\n", "To train from scratch, you need to prepare your training data in the right format and specify your models architecture." ] }, @@ -606,27 +635,39 @@ "metadata": { "id": "PasvgSEwpWXd" }, - "source": "### Specifying Our Model with a YAML Config File\n\nWe'll build a *FastConformer* model for this tutorial and use *greedy CTC decoder*, using the configuration found in `./configs/fast-conformer_ctc_bpe.yaml`.\n\nIf we open up this config file, we find model section which describes architecture of our model. A model contains an entry labeled `encoder`, which specifies the FastConformer encoder configuration. The encoder uses a combination of convolution and self-attention layers to process the input audio features.\n\nSome entries at the top of the file specify how we will handle training (`train_ds`) and validation (`validation_ds`) data.\n\nUsing a YAML config such as this helps get a quick and human-readable overview of what your architecture looks like, and allows you to swap out model and run configurations easily without needing to change your code." 
+ "source": [ + "### Specifying Our Model with a YAML Config File\n", + "\n", + "We'll build a *FastConformer* model for this tutorial and use *greedy CTC decoder*, using the configuration found in `./configs/fast-conformer_ctc_bpe.yaml`.\n", + "\n", + "If we open up this config file, we find model section which describes architecture of our model. A model contains an entry labeled `encoder`, which specifies the FastConformer encoder configuration. The encoder uses a combination of convolution and self-attention layers to process the input audio features.\n", + "\n", + "Some entries at the top of the file specify how we will handle training (`train_ds`) and validation (`validation_ds`) data.\n", + "\n", + "Using a YAML config such as this helps get a quick and human-readable overview of what your architecture looks like, and allows you to swap out model and run configurations easily without needing to change your code." + ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "XLUDyWOmo8xZ" }, + "outputs": [], "source": [ "from omegaconf import OmegaConf, open_dict" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "p1O8JRk1qXX9" }, - "source": "params = OmegaConf.load(\"./configs/fast-conformer_ctc_bpe.yaml\")", - "execution_count": null, - "outputs": [] + "outputs": [], + "source": [ + "params = OmegaConf.load(\"./configs/fast-conformer_ctc_bpe.yaml\")" + ] }, { "cell_type": "markdown", @@ -639,14 +680,14 @@ }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "raXzemtIqjL-" }, + "outputs": [], "source": [ "print(OmegaConf.to_yaml(params))" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -654,29 +695,29 @@ "id": "Nw-8epOcuCcG" }, "source": [ - "## Specifying the tokenizer to the model\r\n", - "\r\n", - "Now that we have a model config, we are almost ready to train it ! 
We just have to inform it where the tokenizer directory exists and it will do the rest for us !\r\n", - "\r\n", - "We have to provide just two pieces of information via the config:\r\n", - "\r\n", - " - `tokenizer.dir`: The directory where the tokenizer files are stored\r\n", - " - `tokenizer.type`: Can be `bpe` (for `sentencepiece` based tokenizers) or `wpe` (for HuggingFace based BERT Word Piece Tokenizers. Represents what type of tokenizer is being supplied and parse its directory to construct the actual tokenizer.\r\n", - "\r\n", - "**Note**: We only have to provide the **directory** where the tokenizer file exists along with its vocabulary and any other essential components. We pass the directory instead of an explicit vocabulary path, since not all libraries construct their tokenizer in the same manner, so the model will figure out how it should prepare the tokenizer.\r\n" + "## Specifying the tokenizer to the model\n", + "\n", + "Now that we have a model config, we are almost ready to train it ! We just have to inform it where the tokenizer directory exists and it will do the rest for us !\n", + "\n", + "We have to provide just two pieces of information via the config:\n", + "\n", + " - `tokenizer.dir`: The directory where the tokenizer files are stored\n", + " - `tokenizer.type`: Can be `bpe` (for `sentencepiece` based tokenizers) or `wpe` (for HuggingFace based BERT Word Piece Tokenizers. Represents what type of tokenizer is being supplied and parse its directory to construct the actual tokenizer.\n", + "\n", + "**Note**: We only have to provide the **directory** where the tokenizer file exists along with its vocabulary and any other essential components. 
We pass the directory instead of an explicit vocabulary path, since not all libraries construct their tokenizer in the same manner, so the model will figure out how it should prepare the tokenizer.\n" ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "YME-v0rcudUz" }, + "outputs": [], "source": [ - "params.model.tokenizer.dir = data_dir + \"/tokenizers/an4/tokenizer_spe_unigram_v32/\" # note this is a directory, not a path to a vocabulary file\r\n", + "params.model.tokenizer.dir = data_dir + \"/tokenizers/an4/tokenizer_spe_unigram_v32/\" # note this is a directory, not a path to a vocabulary file\n", "params.model.tokenizer.type = \"bpe\"" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -684,47 +725,50 @@ "id": "ceelkfIHrHTR" }, "source": [ - "### Training with PyTorch Lightning\r\n", - "\r\n", - "NeMo models and modules can be used in any PyTorch code where torch.nn.Module is expected.\r\n", - "\r\n", + "### Training with PyTorch Lightning\n", + "\n", + "NeMo models and modules can be used in any PyTorch code where torch.nn.Module is expected.\n", + "\n", "However, NeMo's models are based on [PytorchLightning's](https://github.com/PyTorchLightning/pytorch-lightning) LightningModule and we recommend you use PytorchLightning for training and fine-tuning as it makes using mixed precision and distributed training very easy. 
So to start, let's create Trainer instance for training on GPU for 50 epochs" ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "3rslHEKeq9qy" }, + "outputs": [], "source": [ - "import lightning.pytorch as pl\r\n", + "import lightning.pytorch as pl\n", "trainer = pl.Trainer(devices=1, accelerator='gpu', max_epochs=50)" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", "metadata": { "id": "pLbXg1swre_M" }, - "source": "Next, we instantiate an ASR model based on our ``fast-conformer_ctc_bpe.yaml`` file from the previous section.\nNote that this is a stage during which we also tell the model where our training and validation manifests are." + "source": [ + "Next, we instantiate an ASR model based on our ``fast-conformer_ctc_bpe.yaml`` file from the previous section.\n", + "Note that this is a stage during which we also tell the model where our training and validation manifests are." + ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "v7RnwRpprb2S" }, + "outputs": [], "source": [ - "# Update paths to dataset\r\n", - "params.model.train_ds.manifest_filepath = train_manifest\r\n", - "params.model.validation_ds.manifest_filepath = test_manifest\r\n", - "\r\n", - "# remove spec augment for this dataset\r\n", + "# Update paths to dataset\n", + "params.model.train_ds.manifest_filepath = train_manifest\n", + "params.model.validation_ds.manifest_filepath = test_manifest\n", + "\n", + "# remove spec augment for this dataset\n", "params.model.spec_augment.rect_masks = 0" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -732,21 +776,21 @@ "id": "2qLDHHOOx8T1" }, "source": [ - "Note the subtle difference in the model that we instantiate - `EncDecCTCModelBPE` instead of `EncDecCTCModel`. \r\n", - "\r\n", + "Note the subtle difference in the model that we instantiate - `EncDecCTCModelBPE` instead of `EncDecCTCModel`. 
\n", + "\n", "`EncDecCTCModelBPE` is nearly identical to `EncDecCTCModel` (it is in fact a subclass!) that simply adds support for subword tokenization." ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "YVNc9IxdwXp7" }, + "outputs": [], "source": [ "first_asr_model = nemo_asr.models.EncDecCTCModelBPE(cfg=params.model, trainer=trainer)" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -754,15 +798,17 @@ "id": "gJd4gE1uzCuO" }, "source": [ - "### Training: Monitoring Progress\r\n", + "### Training: Monitoring Progress\n", "We can now start Tensorboard to see how training went. Recall that WER stands for Word Error Rate and so the lower it is, the better." ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "50qMnqagy8VM" }, + "outputs": [], "source": [ "try:\n", " from google import colab\n", @@ -776,9 +822,7 @@ " %tensorboard --logdir lightning_logs/\n", "else:\n", " print(\"To use tensorboard, please use this notebook in a Google Colab environment.\")" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -791,15 +835,15 @@ }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "_iFfkFBTryQn" }, + "outputs": [], "source": [ - "# Start training!!!\r\n", + "# Start training!!!\n", "trainer.fit(first_asr_model)" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -807,32 +851,32 @@ "id": "HQ2aSenF90hs" }, "source": [ - "Save the model easily along with the tokenizer using `save_to`. \r\n", - "\r\n", + "Save the model easily along with the tokenizer using `save_to`. \n", + "\n", "Later, we use `restore_from` to restore the model, it will also reinitialize the tokenizer !" 
] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "6idt0dfO9z-S" }, + "outputs": [], "source": [ "first_asr_model.save_to(\"first_model.nemo\")" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "RpHwCTk1-q4t" }, + "outputs": [], "source": [ "!ls -l -- *.nemo" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -840,8 +884,8 @@ "id": "VIupynXOxODi" }, "source": [ - "There we go! We've put together a full training pipeline for the model and trained it for 50 epochs.\r\n", - "\r\n", + "There we go! We've put together a full training pipeline for the model and trained it for 50 epochs.\n", + "\n", "If you'd like to save this model checkpoint for loading later (e.g. for fine-tuning, or for continuing training), you can simply call `first_asr_model.save_to()`. Then, to restore your weights, you can rebuild the model using the config (let's say you call it `first_asr_model_continued` this time) and call `first_asr_model_continued.restore_from()`." ] }, @@ -856,14 +900,14 @@ }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "wLR7PfEzxbO1" }, + "outputs": [], "source": [ "print(params.model.optim)" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -871,25 +915,25 @@ "id": "7wfmZWf-xlNV" }, "source": [ - "### After training and hyper parameter tuning\r\n", - "\r\n", + "### After training and hyper parameter tuning\n", + "\n", "Let's say we wanted to change the learning rate. To do so, we can create a `new_opt` dict and set our desired learning rate, then call `.setup_optimization()` with the new optimization parameters." 
] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "cH31LyZwxi_p" }, + "outputs": [], "source": [ - "import copy\r\n", - "new_opt = copy.deepcopy(params.model.optim)\r\n", - "new_opt.lr = 0.1\r\n", - "first_asr_model.setup_optimization(optim_config=new_opt);\r\n", + "import copy\n", + "new_opt = copy.deepcopy(params.model.optim)\n", + "new_opt.lr = 0.1\n", + "first_asr_model.setup_optimization(optim_config=new_opt);\n", "# And then you can invoke trainer.fit(first_asr_model)" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -897,18 +941,20 @@ "id": "azH7U-K8x0rd" }, "source": [ - "## Inference\r\n", - "\r\n", - "Let's have a quick look at how one could run inference with NeMo's ASR model.\r\n", - "\r\n", + "## Inference\n", + "\n", + "Let's have a quick look at how one could run inference with NeMo's ASR model.\n", + "\n", "First, ``EncDecCTCModelBPE`` and its subclasses contain a handy ``transcribe`` method which can be used to simply obtain audio files' transcriptions. It also has batch_size argument to improve performance." 
] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "O64yk8C4xvTG" }, + "outputs": [], "source": [ "first_asr_model.cuda()\n", "first_asr_model.eval()\n", @@ -917,9 +963,7 @@ " data_dir + '/an4/wav/an4_clstk/fmjd/cen8-fmjd-b.wav',\n", " data_dir + '/an4/wav/an4_clstk/fkai/cen8-fkai-b.wav'],\n", " batch_size=4))" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -932,50 +976,50 @@ }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "Eo2TcBkozlEG" }, + "outputs": [], "source": [ - "# Bigger batch-size = bigger throughput\r\n", - "params['model']['validation_ds']['batch_size'] = 16\r\n", - "\r\n", - "# Setup the test data loader and make sure the model is on GPU\r\n", - "first_asr_model.setup_test_data(test_data_config=params['model']['validation_ds'])\r\n", - "first_asr_model.cuda()\r\n", - "first_asr_model.eval()\r\n", - "\r\n", - "# We remove some preprocessing artifacts which benefit training\r\n", - "first_asr_model.preprocessor.featurizer.pad_to = 0\r\n", - "first_asr_model.preprocessor.featurizer.dither = 0.0\r\n", - "\r\n", - "# We will be computing Word Error Rate (WER) metric between our hypothesis and predictions.\r\n", - "# WER is computed as numerator/denominator.\r\n", - "# We'll gather all the test batches' numerators and denominators.\r\n", - "wer_nums = []\r\n", - "wer_denoms = []\r\n", - "\r\n", - "# Loop over all test batches.\r\n", - "# Iterating over the model's `test_dataloader` will give us:\r\n", - "# (audio_signal, audio_signal_length, transcript_tokens, transcript_length)\r\n", - "# See the AudioToCharDataset for more details.\r\n", - "for test_batch in first_asr_model.test_dataloader():\r\n", - " test_batch = [x.cuda() for x in test_batch]\r\n", - " targets = test_batch[2]\r\n", - " targets_lengths = test_batch[3] \r\n", - " log_probs, encoded_len, greedy_predictions = first_asr_model(\r\n", - " input_signal=test_batch[0], input_signal_length=test_batch[1]\r\n", - 
" )\r\n", - " # Notice the model has a helper object to compute WER\r\n", - " first_asr_model.wer.update(greedy_predictions, None, targets, targets_lengths)\r\n", - " _, wer_num, wer_denom = first_asr_model.wer.compute()\r\n", - " wer_nums.append(wer_num.detach().cpu().numpy())\r\n", - " wer_denoms.append(wer_denom.detach().cpu().numpy())\r\n", - "\r\n", - "# We need to sum all numerators and denominators first. Then divide.\r\n", + "# Bigger batch-size = bigger throughput\n", + "params['model']['validation_ds']['batch_size'] = 16\n", + "\n", + "# Setup the test data loader and make sure the model is on GPU\n", + "first_asr_model.setup_test_data(test_data_config=params['model']['validation_ds'])\n", + "first_asr_model.cuda()\n", + "first_asr_model.eval()\n", + "\n", + "# We remove some preprocessing artifacts which benefit training\n", + "first_asr_model.preprocessor.featurizer.pad_to = 0\n", + "first_asr_model.preprocessor.featurizer.dither = 0.0\n", + "\n", + "# We will be computing Word Error Rate (WER) metric between our hypothesis and predictions.\n", + "# WER is computed as numerator/denominator.\n", + "# We'll gather all the test batches' numerators and denominators.\n", + "wer_nums = []\n", + "wer_denoms = []\n", + "\n", + "# Loop over all test batches.\n", + "# Iterating over the model's `test_dataloader` will give us:\n", + "# (audio_signal, audio_signal_length, transcript_tokens, transcript_length)\n", + "# See the AudioToCharDataset for more details.\n", + "for test_batch in first_asr_model.test_dataloader():\n", + " test_batch = [x.cuda() for x in test_batch]\n", + " targets = test_batch[2]\n", + " targets_lengths = test_batch[3]\n", + " log_probs, encoded_len, greedy_predictions = first_asr_model(\n", + " input_signal=test_batch[0], input_signal_length=test_batch[1]\n", + " )\n", + " # Notice the model has a helper object to compute WER\n", + " first_asr_model.wer.update(greedy_predictions, None, targets, targets_lengths)\n", + " _, wer_num, wer_denom 
= first_asr_model.wer.compute()\n", + " wer_nums.append(wer_num.detach().cpu().numpy())\n", + " wer_denoms.append(wer_denom.detach().cpu().numpy())\n", + "\n", + "# We need to sum all numerators and denominators first. Then divide.\n", "print(f\"WER = {sum(wer_nums)/sum(wer_denoms)}\")" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -992,22 +1036,22 @@ "id": "dtl9vEhx3MG7" }, "source": [ - "## Utilizing the underlying tokenizer\r\n", - "\r\n", + "## Utilizing the underlying tokenizer\n", + "\n", "Since the model has an underlying tokenizer, it would be nice to use it externally as well - say for getting the subwords of the transcript or to tokenize a dataset using the same tokenizer as the ASR model." ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "fdXg21if2YRp" }, + "outputs": [], "source": [ - "tokenizer = first_asr_model.tokenizer\r\n", + "tokenizer = first_asr_model.tokenizer\n", "tokenizer" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -1015,22 +1059,22 @@ "id": "Y96SOqpJ3kG3" }, "source": [ - "You can get the tokenizer's vocabulary using the `tokenizer.tokenizer.get_vocab()` method. \r\n", - "\r\n", + "You can get the tokenizer's vocabulary using the `tokenizer.tokenizer.get_vocab()` method. \n", + "\n", "ASR tokenizers will map the subword to an integer index in the vocabulary for convenience." 
] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "F56_tIRM3g3f" }, + "outputs": [], "source": [ - "vocab = tokenizer.tokenizer.get_vocab()\r\n", + "vocab = tokenizer.tokenizer.get_vocab()\n", "vocab" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -1043,51 +1087,51 @@ }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "-2tMVskF3uUf" }, + "outputs": [], "source": [ - "tokens = tokenizer.text_to_tokens(\"hello world\")\r\n", + "tokens = tokenizer.text_to_tokens(\"hello world\")\n", "tokens" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "CkxHkKQn4Q-E" }, + "outputs": [], "source": [ - "token_ids = tokenizer.text_to_ids(\"hello world\")\r\n", + "token_ids = tokenizer.text_to_ids(\"hello world\")\n", "token_ids" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "tpdoIrRt4Xim" }, + "outputs": [], "source": [ - "subwords = tokenizer.ids_to_tokens(token_ids)\r\n", + "subwords = tokenizer.ids_to_tokens(token_ids)\n", "subwords" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "wudNyONi4og8" }, + "outputs": [], "source": [ - "text = tokenizer.ids_to_text(token_ids)\r\n", + "text = tokenizer.ids_to_text(token_ids)\n", "text" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -1095,29 +1139,29 @@ "id": "E35VBsbf4yWy" }, "source": [ - "## Model Improvements\r\n", - "\r\n", - "You already have all you need to create your own ASR model in NeMo, but there are a few more tricks that you can employ if you so desire. 
In this section, we'll briefly cover a few possibilities for improving an ASR model.\r\n", - "\r\n", - "### Data Augmentation\r\n", - "\r\n", - "There exist several ASR data augmentation methods that can increase the size of our training set.\r\n", - "\r\n", - "For example, we can perform augmentation on the spectrograms by zeroing out specific frequency segments (\"frequency masking\") or time segments (\"time masking\") as described by [SpecAugment](https://arxiv.org/abs/1904.08779), or zero out rectangles on the spectrogram as in [Cutout](https://arxiv.org/pdf/1708.04552.pdf). In NeMo, we can do all three of these by simply adding a `SpectrogramAugmentation` neural module. (As of now, it does not perform the time warping from the SpecAugment paper.)\r\n", - "\r\n", + "## Model Improvements\n", + "\n", + "You already have all you need to create your own ASR model in NeMo, but there are a few more tricks that you can employ if you so desire. In this section, we'll briefly cover a few possibilities for improving an ASR model.\n", + "\n", + "### Data Augmentation\n", + "\n", + "There exist several ASR data augmentation methods that can increase the size of our training set.\n", + "\n", + "For example, we can perform augmentation on the spectrograms by zeroing out specific frequency segments (\"frequency masking\") or time segments (\"time masking\") as described by [SpecAugment](https://arxiv.org/abs/1904.08779), or zero out rectangles on the spectrogram as in [Cutout](https://arxiv.org/pdf/1708.04552.pdf). In NeMo, we can do all three of these by simply adding a `SpectrogramAugmentation` neural module. (As of now, it does not perform the time warping from the SpecAugment paper.)\n", + "\n", "Our toy model disables spectrogram augmentation, because it is not significantly beneficial for the short demo." 
] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "SMi6Bauy4Jhg" }, + "outputs": [], "source": [ "print(OmegaConf.to_yaml(first_asr_model._cfg['spec_augment']))" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -1134,12 +1178,12 @@ "id": "fDTC4fXZ5QnT" }, "source": [ - "### Transfer learning\r\n", - "\r\n", - "Transfer learning is an important machine learning technique that uses a model’s knowledge of one task to perform better on another. Fine-tuning is one of the techniques to perform transfer learning. It is an essential part of the recipe for many state-of-the-art results where a base model is first pretrained on a task with abundant training data and then fine-tuned on different tasks of interest where the training data is less abundant or even scarce.\r\n", - "\r\n", - "In ASR you might want to do fine-tuning in multiple scenarios, for example, when you want to improve your model's performance on a particular domain (medical, financial, etc.) or accented speech. You can even transfer learn from one language to another! Check out [this paper](https://arxiv.org/abs/2005.04290) for examples.\r\n", - "\r\n", + "### Transfer learning\n", + "\n", + "Transfer learning is an important machine learning technique that uses a model’s knowledge of one task to perform better on another. Fine-tuning is one of the techniques to perform transfer learning. It is an essential part of the recipe for many state-of-the-art results where a base model is first pretrained on a task with abundant training data and then fine-tuned on different tasks of interest where the training data is less abundant or even scarce.\n", + "\n", + "In ASR you might want to do fine-tuning in multiple scenarios, for example, when you want to improve your model's performance on a particular domain (medical, financial, etc.) or accented speech. You can even transfer learn from one language to another! 
Check out [this paper](https://arxiv.org/abs/2005.04290) for examples.\n", + "\n", "Transfer learning with NeMo is simple. Let's demonstrate how we could fine-tune the model we trained earlier on AN4 data. (NOTE: this is a toy example). And, while we are at it, we will change the model's vocabulary to demonstrate how it's done." ] }, @@ -1149,15 +1193,17 @@ "id": "IN0LbDbY5YR1" }, "source": [ - "-----\r\n", + "-----\n", "First, let's create another tokenizer - perhaps using a larger vocabulary size than the small tokenizer we created earlier. Also we swap out `sentencepiece` for `BERT Word Piece` tokenizer." ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "LFENXcXw48fc" }, + "outputs": [], "source": [ "!python ./scripts/process_asr_text_tokenizer.py \\\n", " --manifest=\"{data_dir}/an4/train_manifest.json\" \\\n", @@ -1166,9 +1212,7 @@ " --tokenizer=\"wpe\" \\\n", " --no_lower_case \\\n", " --log" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -1181,14 +1225,14 @@ }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "QtyAB9fQ_qbj" }, + "outputs": [], "source": [ "restored_model = nemo_asr.models.EncDecCTCModelBPE.restore_from(\"./first_model.nemo\")" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -1201,9 +1245,11 @@ }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "4Ey9CUkJ5o56" }, + "outputs": [], "source": [ "# Check what kind of vocabulary/alphabet the model has right now\n", "print(restored_model.decoder.vocabulary)\n", @@ -1214,9 +1260,7 @@ " new_tokenizer_dir=data_dir + \"/tokenizers/an4/tokenizer_wpe_v64/\",\n", " new_tokenizer_type=\"wpe\"\n", ")" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -1224,60 +1268,60 @@ "id": "UZ3sf2P26SiA" }, "source": [ - "After this, our decoder has completely changed, but our encoder (where most of the weights are) remained intact. 
Let's fine tune-this model for 20 epochs on AN4 dataset. We will also use the smaller learning rate from ``new_opt` (see the \"After Training\" section)`.\r\n", - "\r\n", + "After this, our decoder has completely changed, but our encoder (where most of the weights are) remained intact. Let's fine-tune this model for 20 epochs on the AN4 dataset. We will also use the smaller learning rate from `new_opt` (see the \"After Training\" section).\n", + "\n", "**Note**: For this demonstration, we will also freeze the encoder to speed up finetuning (since both tokenizers are built on the same train set), but in general it should not be done for proper training on a new language (or on a different corpus than the original train corpus)." ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "7m_CRtH46BjO" }, + "outputs": [], "source": [ - "# Use the smaller learning rate we set before\r\n", - "restored_model.setup_optimization(optim_config=new_opt)\r\n", - "\r\n", - "# Point to the data we'll use for fine-tuning as the training set\r\n", - "restored_model.setup_training_data(train_data_config=params['model']['train_ds'])\r\n", - "\r\n", - "# Point to the new validation data for fine-tuning\r\n", - "restored_model.setup_validation_data(val_data_config=params['model']['validation_ds'])\r\n", - "\r\n", - "# Freeze the encoder layers (should not be done for finetuning, only done for demo)\r\n", + "# Use the smaller learning rate we set before\n", + "restored_model.setup_optimization(optim_config=new_opt)\n", + "\n", + "# Point to the data we'll use for fine-tuning as the training set\n", + "restored_model.setup_training_data(train_data_config=params['model']['train_ds'])\n", + "\n", + "# Point to the new validation data for fine-tuning\n", + "restored_model.setup_validation_data(val_data_config=params['model']['validation_ds'])\n", + "\n", + "# Freeze the encoder layers (should not be done for finetuning, only done for demo)\n", 
"restored_model.encoder.freeze()" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "uCmUWZLD63d9" }, + "outputs": [], "source": [ - "# Load the TensorBoard notebook extension\r\n", - "if COLAB_ENV:\r\n", - " %load_ext tensorboard\r\n", - " %tensorboard --logdir lightning_logs/\r\n", - "else:\r\n", + "# Load the TensorBoard notebook extension\n", + "if COLAB_ENV:\n", + " %load_ext tensorboard\n", + " %tensorboard --logdir lightning_logs/\n", + "else:\n", " print(\"To use tensorboard, please use this notebook in a Google Colab environment.\")" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "code", + "execution_count": null, "metadata": { "id": "fs2aK7xB6pAd" }, + "outputs": [], "source": [ - "# And now we can create a PyTorch Lightning trainer and call `fit` again.\r\n", - "trainer = pl.Trainer(devices=1, accelerator='gpu', max_epochs=20)\r\n", + "# And now we can create a PyTorch Lightning trainer and call `fit` again.\n", + "trainer = pl.Trainer(devices=1, accelerator='gpu', max_epochs=20)\n", "trainer.fit(restored_model)" - ], - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", @@ -1294,23 +1338,23 @@ "id": "alykABQ3CNpf" }, "source": [ - "### Fast Training\r\n", - "\r\n", - "Last but not least, we could simply speed up training our model! If you have the resources, you can speed up training by splitting the workload across multiple GPUs. Otherwise (or in addition), there's always mixed precision training, which allows you to increase your batch size.\r\n", - "\r\n", - "You can use [PyTorch Lightning's Trainer object](https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html?highlight=Trainer) to handle mixed-precision and distributed training for you. 
Below are some examples of flags you would pass to the `Trainer` to use these features:\r\n", - "\r\n", - "```python\r\n", - "# Mixed precision:\r\n", - "trainer = pl.Trainer(amp_level='O1', precision=16)\r\n", - "\r\n", - "# Trainer with a distributed backend:\r\n", - "trainer = pl.Trainer(devices=2, num_nodes=2, accelerator='gpu', strategy='auto')\r\n", - "\r\n", - "# Of course, you can combine these flags as well.\r\n", - "```\r\n", - "\r\n", - "Finally, have a look at [example scripts in NeMo repository](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py) which can handle mixed precision and distributed training using command-line arguments." + "### Fast Training\n", + "\n", + "Last but not least, we could simply speed up training our model! If you have the resources, you can speed up training by splitting the workload across multiple GPUs. Otherwise (or in addition), there's always mixed precision training, which allows you to increase your batch size.\n", + "\n", + "You can use [PyTorch Lightning's Trainer object](https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html?highlight=Trainer) to handle mixed-precision and distributed training for you. Below are some examples of flags you would pass to the `Trainer` to use these features:\n", + "\n", + "```python\n", + "# Mixed precision:\n", + "trainer = pl.Trainer(amp_level='O1', precision=16)\n", + "\n", + "# Trainer with a distributed backend:\n", + "trainer = pl.Trainer(devices=2, num_nodes=2, accelerator='gpu', strategy='auto')\n", + "\n", + "# Of course, you can combine these flags as well.\n", + "```\n", + "\n", + "Finally, have a look at [example scripts in NeMo repository](https://github.com/NVIDIA-NeMo/Speech/blob/stable/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py) which can handle mixed precision and distributed training using command-line arguments." 
] }, { @@ -1318,14 +1362,76 @@ "metadata": { "id": "4uQGWtRJDF0O" }, - "source": "## Under the Hood\n\nNeMo is open-source and we do all our model development in the open, so you can inspect our code if you wish.\n\nIn particular, ``nemo_asr.model.EncDecCTCModelBPE`` is an encoder-decoder model which is constructed using several ``Neural Modules`` taken from ``nemo_asr.modules.`` Here is what its forward pass looks like:\n```python\ndef forward(self, input_signal, input_signal_length):\n processed_signal, processed_signal_len = self.preprocessor(\n input_signal=input_signal, length=input_signal_length,\n )\n # Spec augment is not applied during evaluation/testing\n if self.spec_augmentation is not None and self.training:\n processed_signal = self.spec_augmentation(input_spec=processed_signal)\n encoded, encoded_len = self.encoder(audio_signal=processed_signal, length=processed_signal_len)\n log_probs = self.decoder(encoder_output=encoded)\n greedy_predictions = log_probs.argmax(dim=-1, keepdim=False)\n return log_probs, encoded_len, greedy_predictions\n```\nHere:\n\n* ``self.preprocessor`` is an instance of ``nemo_asr.modules.AudioToMelSpectrogramPreprocessor``, which is a neural module that takes audio signal and converts it into a Mel-Spectrogram\n* ``self.spec_augmentation`` - is a neural module of type ```nemo_asr.modules.SpectrogramAugmentation``, which implements data augmentation. \n* ``self.encoder`` - is a FastConformer encoder that combines convolution and self-attention layers of type ``nemo_asr.modules.ConformerEncoder``\n* ``self.decoder`` - is a ``nemo_asr.modules.ConvASRDecoder`` which simply projects into the target alphabet (vocabulary).\n\nAlso, ``EncDecCTCModelBPE`` uses the audio dataset class ``nemo_asr.data.AudioToBPEDataset`` and CTC loss implemented in ``nemo_asr.losses.CTCLoss``.\n\nYou can use these and other neural modules (or create new ones yourself!) to construct new ASR models." 
+ "source": [ + "## Under the Hood\n", + "\n", + "NeMo is open-source and we do all our model development in the open, so you can inspect our code if you wish.\n", + "\n", + "In particular, ``nemo_asr.models.EncDecCTCModelBPE`` is an encoder-decoder model which is constructed using several ``Neural Modules`` taken from ``nemo_asr.modules``. Here is what its forward pass looks like:\n", + "```python\n", + "def forward(self, input_signal, input_signal_length):\n", + " processed_signal, processed_signal_len = self.preprocessor(\n", + " input_signal=input_signal, length=input_signal_length,\n", + " )\n", + " # Spec augment is not applied during evaluation/testing\n", + " if self.spec_augmentation is not None and self.training:\n", + " processed_signal = self.spec_augmentation(input_spec=processed_signal)\n", + " encoded, encoded_len = self.encoder(audio_signal=processed_signal, length=processed_signal_len)\n", + " log_probs = self.decoder(encoder_output=encoded)\n", + " greedy_predictions = log_probs.argmax(dim=-1, keepdim=False)\n", + " return log_probs, encoded_len, greedy_predictions\n", + "```\n", + "Here:\n", + "\n", + "* ``self.preprocessor`` is an instance of ``nemo_asr.modules.AudioToMelSpectrogramPreprocessor``, which is a neural module that takes audio signal and converts it into a Mel-Spectrogram\n", + "* ``self.spec_augmentation`` - is a neural module of type ``nemo_asr.modules.SpectrogramAugmentation``, which implements data augmentation. 
\n", + "* ``self.encoder`` - is a FastConformer encoder that combines convolution and self-attention layers of type ``nemo_asr.modules.ConformerEncoder``\n", + "* ``self.decoder`` - is a ``nemo_asr.modules.ConvASRDecoder`` which simply projects into the target alphabet (vocabulary).\n", + "\n", + "Also, ``EncDecCTCModelBPE`` uses the audio dataset class ``nemo_asr.data.AudioToBPEDataset`` and CTC loss implemented in ``nemo_asr.losses.CTCLoss``.\n", + "\n", + "You can use these and other neural modules (or create new ones yourself!) to construct new ASR models." + ] }, { "cell_type": "markdown", "metadata": { "id": "5kKcSb7LDdI3" }, - "source": "# Further Reading/Watching:\n\nThat's all for now! If you'd like to learn more about the topics covered in this tutorial, here are some resources that may interest you:\n- [Stanford Lecture on ASR](https://www.youtube.com/watch?v=3MjIkWxXigM)\n- [\"An Intuitive Explanation of Connectionist Temporal Classification\"](https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c)\n- [Explanation of CTC with Prefix Beam Search](https://medium.com/corti-ai/ctc-networks-and-language-models-prefix-beam-search-explained-c11d1ee23306)\n- [Byte Pair Encoding](https://arxiv.org/abs/1508.07909)\n- [Word Piece Encoding](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf)\n- [SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing](https://www.aclweb.org/anthology/D18-2012/)\n- [Jasper Paper](https://arxiv.org/abs/1904.03288)\n- [Conformer paper](https://arxiv.org/abs/2005.08100)\n- [SpecAugment Paper](https://arxiv.org/abs/1904.08779)\n- [Explanation and visualization of SpecAugment](https://towardsdatascience.com/state-of-the-art-audio-data-augmentation-with-google-brains-specaugment-and-pytorch-d3d1a3ce291e)\n- [Cutout Paper](https://arxiv.org/pdf/1708.04552.pdf)\n- [Squeeze-and-Excitation 
Networks](https://arxiv.org/abs/1709.01507)\n- [Transfer Learning Blogpost](https://developer.nvidia.com/blog/jump-start-training-for-speech-recognition-models-with-nemo/)" + "source": [ + "# Further Reading/Watching:\n", + "\n", + "That's all for now! If you'd like to learn more about the topics covered in this tutorial, here are some resources that may interest you:\n", + "- [Stanford Lecture on ASR](https://www.youtube.com/watch?v=3MjIkWxXigM)\n", + "- [\"An Intuitive Explanation of Connectionist Temporal Classification\"](https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c)\n", + "- [Explanation of CTC with Prefix Beam Search](https://medium.com/corti-ai/ctc-networks-and-language-models-prefix-beam-search-explained-c11d1ee23306)\n", + "- [Byte Pair Encoding](https://arxiv.org/abs/1508.07909)\n", + "- [Word Piece Encoding](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf)\n", + "- [SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing](https://www.aclweb.org/anthology/D18-2012/)\n", + "- [Jasper Paper](https://arxiv.org/abs/1904.03288)\n", + "- [Conformer paper](https://arxiv.org/abs/2005.08100)\n", + "- [SpecAugment Paper](https://arxiv.org/abs/1904.08779)\n", + "- [Explanation and visualization of SpecAugment](https://towardsdatascience.com/state-of-the-art-audio-data-augmentation-with-google-brains-specaugment-and-pytorch-d3d1a3ce291e)\n", + "- [Cutout Paper](https://arxiv.org/pdf/1708.04552.pdf)\n", + "- [Squeeze-and-Excitation Networks](https://arxiv.org/abs/1709.01507)\n", + "- [Transfer Learning Blogpost](https://developer.nvidia.com/blog/jump-start-training-for-speech-recognition-models-with-nemo/)" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "collapsed_sections": [], + "name": "ASR_with_Subword_Tokenization.ipynb", + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + 
"display_name": "Python 3", + "language": "python", + "name": "python3" } - ] -} \ No newline at end of file + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/tutorials/asr/ASR_with_Transducers.ipynb b/tutorials/asr/ASR_with_Transducers.ipynb index 3cf4fc556fec..4c3b31ccf6bc 100644 --- a/tutorials/asr/ASR_with_Transducers.ipynb +++ b/tutorials/asr/ASR_with_Transducers.ipynb @@ -7,7 +7,7 @@ "id": "SUOXg71A3w78" }, "outputs": [], - "source": "\"\"\"\nYou can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n\nInstructions for setting up Colab are as follows:\n1. Open a new Python 3 notebook.\n2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n4. Run this cell to set up dependencies.\n5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n\n\nNOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n\"\"\"\n# If you're using Google Colab and not running locally, run this cell.\nimport os\n\n# Install dependencies\n!pip install wget\n!apt-get install sox libsndfile1 ffmpeg\n!pip install text-unidecode\n!pip install matplotlib>=3.3.2\n\n## Install NeMo\nBRANCH = 'main'\n!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n\n## Grab the config we'll use in this example\n!mkdir configs\n!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml" + "source": "\"\"\"\nYou can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n\nInstructions for setting up Colab are as follows:\n1. Open a new Python 3 notebook.\n2. 
Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n4. Run this cell to set up dependencies.\n5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n\n\nNOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n\"\"\"\n# If you're using Google Colab and not running locally, run this cell.\nimport os\n\n# Install dependencies\n!pip install wget\n!apt-get install sox libsndfile1 ffmpeg\n!pip install text-unidecode\n!pip install matplotlib>=3.3.2\n\n## Install NeMo\nBRANCH = 'main'\n!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n\n## Grab the config we'll use in this example\n!mkdir configs\n!wget -P configs/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml" }, { "cell_type": "code", @@ -74,7 +74,7 @@ " os.makedirs(\"scripts\")\n", "\n", "if not os.path.exists(\"scripts/process_an4_data.py\"):\n", - " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/dataset_processing/process_an4_data.py" + " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/scripts/dataset_processing/process_an4_data.py" ] }, { @@ -207,7 +207,7 @@ "outputs": [], "source": [ "if not os.path.exists(\"scripts/process_asr_text_tokenizer.py\"):\n", - " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tokenizers/process_asr_text_tokenizer.py" + " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/scripts/tokenizers/process_asr_text_tokenizer.py" ] }, { diff --git a/tutorials/asr/Buffered_Transducer_Inference.ipynb b/tutorials/asr/Buffered_Transducer_Inference.ipynb index 
5a2732b4dc9c..213f34c9f2a5 100644 --- a/tutorials/asr/Buffered_Transducer_Inference.ipynb +++ b/tutorials/asr/Buffered_Transducer_Inference.ipynb @@ -31,7 +31,7 @@ "\n", "## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "# Update numba and restart (this is required to update internal numba version of Colab)\n", "\n", @@ -62,9 +62,9 @@ "\n", "There are many approaches to perform streaming/buffered inference for causal CTC / Transducer models. However, it is often observed that causal models sacrifice accuracy to perform streaming evaluation. \n", "\n", - "In this notebook, similar to the CTC tutorial for [Streaming ASR](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Streaming_ASR.ipynb), we will tackle the challenge of buffered ASR for long-form speech recognition, but this time we will use Transducer models as the basis for ASR. \n", + "In this notebook, similar to the CTC tutorial for [Streaming ASR](https://github.com/NVIDIA-NeMo/Speech/blob/stable/tutorials/asr/Streaming_ASR.ipynb), we will tackle the challenge of buffered ASR for long-form speech recognition, but this time we will use Transducer models as the basis for ASR. \n", "\n", - "You may use this script [ASR Chunked Streaming Inference](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py) to transcribe long audio files with Transducer models. \n", + "You may use this script [ASR Chunked Streaming Inference](https://github.com/NVIDIA-NeMo/Speech/blob/stable/examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py) to transcribe long audio files with Transducer models. 
\n", "\n", "**Note**: It is highly recommended to review the ``Streaming ASR`` tutorial for a good overview of how streaming/buffered inference works for CTC models and the underlying motivation of streaming ASR itself.\n", "\n", @@ -104,7 +104,7 @@ "import os\n", "\n", "if not os.path.exists(\"scripts/get_librispeech_data.py\"):\n", - " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/stable/scripts/dataset_processing/get_librispeech_data.py" + " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/stable/scripts/dataset_processing/get_librispeech_data.py" ] }, { @@ -328,10 +328,10 @@ "outputs": [], "source": [ "if not os.path.exists(\"scripts/transcribe_speech.py\"):\n", - " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/stable/examples/asr/transcribe_speech.py\n", + " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/stable/examples/asr/transcribe_speech.py\n", "\n", "if not os.path.exists(\"scripts/speech_to_text_eval.py\"):\n", - " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/stable/examples/asr/speech_to_text_eval.py" + " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/stable/examples/asr/speech_to_text_eval.py" ] }, { @@ -1522,7 +1522,7 @@ "metadata": { "id": "1gqFfQwZ-xpk" }, - "source": "Let's calculate the alignment grid. We will de-tokenize the sub-word token if it is a valid index in the vocabulary and use '' as a placeholder for the Transducer Blank token.\n\nNote that each timestep here is (roughly) 40 milli-seconds timestamp (since the window stride is 10 ms, and Conformer has 4x stride). The resolution of the model differs based on the stride of the model - Conformer has 4x stride (40 ms), and FastConformer has 8x stride (80 ms).\n\nNote: You can modify the value of config.model.loss.warprnnt_numba_kwargs.fastemit_lambda before training and see an impact on final alignment latency! 
For a tutorial to train your Transducer models, refer to [ASR with Transducers in NeMo](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_Transducers.ipynb)." + "source": "Let's calculate the alignment grid. We will de-tokenize the sub-word token if it is a valid index in the vocabulary and use '' as a placeholder for the Transducer Blank token.\n\nNote that each timestep here is (roughly) 40 milli-seconds timestamp (since the window stride is 10 ms, and Conformer has 4x stride). The resolution of the model differs based on the stride of the model - Conformer has 4x stride (40 ms), and FastConformer has 8x stride (80 ms).\n\nNote: You can modify the value of config.model.loss.warprnnt_numba_kwargs.fastemit_lambda before training and see an impact on final alignment latency! For a tutorial to train your Transducer models, refer to [ASR with Transducers in NeMo](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/asr/ASR_with_Transducers.ipynb)." }, { "cell_type": "markdown", @@ -1593,7 +1593,7 @@ "\n", "Now, anyone can perform long audio transcription using any NeMo transducer model. You could even try to modify the chunk and buffer sizes to try to stream these models.\n", "\n", - "For further references on training your own transducer models, please refer to [ASR with Transducers](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_Transducers.ipynb) tutorial." + "For further references on training your own transducer models, please refer to [ASR with Transducers](https://github.com/NVIDIA-NeMo/Speech/blob/stable/tutorials/asr/ASR_with_Transducers.ipynb) tutorial." 
] } ], diff --git a/tutorials/asr/Buffered_Transducer_Inference_with_LCS_Merge.ipynb b/tutorials/asr/Buffered_Transducer_Inference_with_LCS_Merge.ipynb index 2f179eaa9a5a..9df3758d0bcc 100644 --- a/tutorials/asr/Buffered_Transducer_Inference_with_LCS_Merge.ipynb +++ b/tutorials/asr/Buffered_Transducer_Inference_with_LCS_Merge.ipynb @@ -47,7 +47,7 @@ "\n", "## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "# Update numba and restart (this is required to update internal numba version of Colab)\n", "\n", @@ -73,7 +73,7 @@ "source": [ "# Buffered Transducer evaluation with Longest Common Subsequence Merge\n", "\n", - "In the [Buffered Transducer Inference](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Buffered_Transducer_Inference.ipynb) tutorial, we discussed how we could perform Streaming/Buffered inference with Transducer models by using a technique which we term as `\"Middle Token\" selection` from a buffer.\n", + "In the [Buffered Transducer Inference](https://github.com/NVIDIA-NeMo/Speech/blob/stable/tutorials/asr/Buffered_Transducer_Inference.ipynb) tutorial, we discussed how we could perform Streaming/Buffered inference with Transducer models by using a technique which we term as `\"Middle Token\" selection` from a buffer.\n", "\n", "In this notebook, we will perform buffered ASR speech recognition and utilize another algorithm to merge buffers during inference. 
We term this method as the `\"Longest Common Subsequence\" (LCS) Merge` algorithm.\n", "\n", @@ -81,7 +81,7 @@ "\n", "-----\n", "\n", - "You may use this script [ASR Chunked Streaming Inference](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py) to transcribe long audio files with Transducer models as well as experiment with both merge algorithms. \n" + "You may use this script [ASR Chunked Streaming Inference](https://github.com/NVIDIA-NeMo/Speech/blob/stable/examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py) to transcribe long audio files with Transducer models as well as experiment with both merge algorithms. \n" ], "metadata": { "id": "cPuPBSU0ioJO" @@ -129,7 +129,7 @@ "import os\n", "\n", "if not os.path.exists(\"scripts/get_librispeech_data.py\"):\n", - " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/stable/scripts/dataset_processing/get_librispeech_data.py\n", + " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/stable/scripts/dataset_processing/get_librispeech_data.py\n", "\n", "# If something goes wrong during data processing, un-comment the following line to delete the cached dataset \n", "# !rm -rf datasets/mini-dev-clean\n", @@ -1283,9 +1283,9 @@ "source": [ "# Final notes\n", "\n", - "Following the [Buffered Transducer Inference](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Buffered_Transducer_Inference.ipynb) tutorial and designing a token merge algorithm that can be a simple extension to the baseline `Middle Token` algorithm, we see that there are cases where both algorithms have their uses. 
\n", + "Following the [Buffered Transducer Inference](https://github.com/NVIDIA-NeMo/Speech/blob/stable/tutorials/asr/Buffered_Transducer_Inference.ipynb) tutorial and designing a token merge algorithm that can be a simple extension to the baseline `Middle Token` algorithm, we see that there are cases where both algorithms have their uses. \n", "\n", - "To expand our research effort on developing more sophisticated streaming / buffered transducer inference methods, we encourage the users to try these algorithms in script format for efficient inference on large datasets - available at [ASR Chunked Streaming Inference](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py).\n" + "To expand our research effort on developing more sophisticated streaming / buffered transducer inference methods, we encourage the users to try these algorithms in script format for efficient inference on large datasets - available at [ASR Chunked Streaming Inference](https://github.com/NVIDIA-NeMo/Speech/blob/stable/examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py).\n" ], "metadata": { "id": "GRFifXuROpzg" diff --git a/tutorials/asr/Canary_Multitask_Speech_Model.ipynb b/tutorials/asr/Canary_Multitask_Speech_Model.ipynb index b81a4307cd10..e225f319bb21 100644 --- a/tutorials/asr/Canary_Multitask_Speech_Model.ipynb +++ b/tutorials/asr/Canary_Multitask_Speech_Model.ipynb @@ -33,7 +33,7 @@ "\n", "BRANCH='main'\n", "\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@{BRANCH}#egg=nemo_toolkit[asr]" + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@{BRANCH}#egg=nemo_toolkit[asr]" ] }, { @@ -115,16 +115,16 @@ "### Decoder prompt\n", "\n", "Decoder prompt is the key to attaining multitask capability with Canary models. Decoder prompt is a sequence of special tokens that define the precise task (language output text, punctuations, timestamps, etc.) 
to be performed on the input audio.\n", - "As shown in the figure, the decoder takes a sequence of prompt tokens as input before generating output text. The example prompt sequence corresponds to English speech recognition as the language for input audio and output text is set to English. The format of the decoder prompt is defined by `TEMPLATE[\"user\"][\"template\"]` in the [Canary2PromptFormatter](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/common/prompts/canary2.py).\n", + "As shown in the figure, the decoder takes a sequence of prompt tokens as input before generating output text. The example prompt sequence corresponds to English speech recognition as the language for input audio and output text is set to English. The format of the decoder prompt is defined by `TEMPLATE[\"user\"][\"template\"]` in the [Canary2PromptFormatter](https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/common/prompts/canary2.py).\n", "\n", "\n", "### Tokenizers\n", "\n", "\n", "\n", - "For Canary-1b-v2, we use a unified SentencePiece [tokenizer](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).\n", + "For Canary-1b-v2, we use a unified SentencePiece [tokenizer](https://github.com/NVIDIA-NeMo/Speech/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).\n", "\n", - "For all other Canary models, we use the concatenated [tokenizer](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/common/tokenizers/canary_tokenizer.py), which combines language-specific SentencePiece tokenizers with shared special tokens. 
Each language uses a vocabulary of 1024 subword tokens, and these per-language vocabularies are concatenated together as shown in the figure below.\n", + "For all other Canary models, we use the concatenated [tokenizer](https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/common/tokenizers/canary_tokenizer.py), which combines language-specific SentencePiece tokenizers with shared special tokens. Each language uses a vocabulary of 1024 subword tokens, and these per-language vocabularies are concatenated together as shown in the figure below.\n", "\n", "In addition to language-specific tokens, Canary uses 1152 tokens to represent special tokens. Special tokens include generic tokens such as `<|startoftranscript|>`, `<|endoftext|>`, ``, as well as many other task-specific tokens.\n", "Listed below is a variety of special tokens that the default tokenizer includes. This should give an idea of various tasks that can be supported with the current tokenizer and prompt formatter.\n", @@ -687,7 +687,7 @@ "BRANCH='r2.5.0'\n", "def wget_from_nemo(nemo_script_path, local_dir=\"scripts\"):\n", " os.makedirs(local_dir, exist_ok=True)\n", - " script_url = f\"https://raw.githubusercontent.com/NVIDIA/NeMo/refs/heads/{BRANCH}/{nemo_script_path}\"\n", + " script_url = f\"https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/refs/heads/{BRANCH}/{nemo_script_path}\"\n", " script_path = os.path.basename(nemo_script_path)\n", " if not os.path.exists(f\"{local_dir}/{script_path}\"):\n", " !wget -P {local_dir}/ {script_url}" @@ -760,7 +760,7 @@ "source": [ "## Prompt format\n", "\n", - "Canary-flash decoder generates output text conditioned on audio encoder representations and the decoder prompt. 
As described in the introduction, Canary-Flash models use [Canary2PromptFormatter](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/common/prompts/canary2.py), and so we set the `prompt_format` accordingly\n", + "Canary-flash decoder generates output text conditioned on audio encoder representations and the decoder prompt. As described in the introduction, Canary-Flash models use [Canary2PromptFormatter](https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/common/prompts/canary2.py), and so we set the `prompt_format` accordingly\n", "\n", "```\n", "model.prompt_format=\"canary2\"\n", @@ -1083,7 +1083,7 @@ "source": [ "## 2. Training on a new task: A case of decoding with context\n", "\n", - "This is an example of a capability that is already supported by the current [Canary2PromptFormatter](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/common/prompts/canary2.py) as well as the tokenizer model.\n", + "This is an example of a capability that is already supported by the current [Canary2PromptFormatter](https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/common/prompts/canary2.py) as well as the tokenizer model.\n", "\n", "```\n", "\"decodercontext\": Modality.Text\n", @@ -1142,7 +1142,7 @@ "}\n", "```\n", "\n", - "In order to support this functionality, the [Canary2PromptFormatter](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/common/prompts/canary2.py) should have the relevant slot value and the default values:\n", + "In order to support this functionality, the [Canary2PromptFormatter](https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/common/prompts/canary2.py) should have the relevant slot value and the default values:\n", "\n", "```\n", "# Should we predict timestamps?\n", @@ -1178,7 +1178,7 @@ "We add 900 integers to the list special tokens along with task-related tokens and rebuild the tokenizer as previously discussed.\n", "\n", "Now the transcript is a mix of tokens from `spl_tokens` 
tokenizer (frame indices) and tokens from a language-specific tokenizer.\n", - "The [canary_tokenizer](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/common/tokenizers/canary_tokenizer.py) handles this by adding a modified `_text_to_ids` method.\n", + "The [canary_tokenizer](https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/common/tokenizers/canary_tokenizer.py) handles this by adding a modified `_text_to_ids` method.\n", "\n", "\n", "```\n", @@ -1207,7 +1207,7 @@ "\n", "Speech summarization is an example of completely new task, meaning, neither the prompt format nor the default special tokens have an explicit support for this task.\n", "\n", - "You will start with modifying [Canary2PromptFormatter](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/common/prompts/canary2.py) or even writing your own custom prompt formatter. [This tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/multimodal/Prompt%20Formatter%20Tutorial.ipynb) has useful references on modifying and building custom prompt formatter.\n", + "You will start with modifying [Canary2PromptFormatter](https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/common/prompts/canary2.py) or even writing your own custom prompt formatter. [This tutorial](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/multimodal/Prompt%20Formatter%20Tutorial.ipynb) has useful references on modifying and building custom prompt formatter.\n", "\n", "\n", "One possible way to modify the existing promp format is to add an optional `\"summarize\"` key whose default value is `false`:\n", @@ -1295,7 +1295,7 @@ "\n", " ```\n", "\n", - " (iv) If you wish further customization that cannot be handled with just these arguments, you can modify https://github.com/NVIDIA/NeMo/blob/main/nemo/core/classes/modelPT.py. 
Specifically, modify the following snippet of code\n", + " (iv) If you wish further customization that cannot be handled with just these arguments, you can modify https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/core/classes/modelPT.py. Specifically, modify the following snippet of code\n", "\n", " ```\n", " dict_to_load = {}\n", @@ -1336,9 +1336,9 @@ "\n", "In our experience working with Canary, we noticed that starting from a pre-trained speech encoder, greatly helps convergence. Especially for larger models (1B+ params) initializing from a pretrained encoder may even be required to stabilize the training.\n", "\n", - "Canary-180M-Flash 17-layer fastconformer encoder was initialized from a 17-layer fastconformer encoder of a transducer speech recognition model ([model](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_multilingual_fastconformer_hybrid_large_pc_blend_eu/files), [config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml#L29)). The 4-layer transformer decoder was initialized from scratch.\n", + "Canary-180M-Flash 17-layer fastconformer encoder was initialized from a 17-layer fastconformer encoder of a transducer speech recognition model ([model](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_multilingual_fastconformer_hybrid_large_pc_blend_eu/files), [config](https://github.com/NVIDIA-NeMo/Speech/blob/main/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml#L29)). The 4-layer transformer decoder was initialized from scratch.\n", "\n", - "Canary-1B-Flash has 32-layer fastconformer encoder. The first 24 layers were initialized from a 24-layer fastconfromer encoder of a transducer speech recognition model and the rest were randomly initalized. This 24-layer model was training internally with this [config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml#L31)." 
+ "Canary-1B-Flash has 32-layer fastconformer encoder. The first 24 layers were initialized from a 24-layer fastconfromer encoder of a transducer speech recognition model and the rest were randomly initalized. This 24-layer model was training internally with this [config](https://github.com/NVIDIA-NeMo/Speech/blob/main/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml#L31)." ] }, { @@ -1512,9 +1512,9 @@ "1. [SentencePiece](https://arxiv.org/abs/1808.06226) and [concatenated](https://arxiv.org/abs/2306.08753) tokenizer: To learn more about the tokenization process.\n", "\n", "\n", - "2. [Tutorial on prompt formatter](https://github.com/NVIDIA/NeMo/blob/main/tutorials/multimodal/Prompt%20Formatter%20Tutorial.ipynb): To learn more about prompt formatter.\n", + "2. [Tutorial on prompt formatter](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/multimodal/Prompt%20Formatter%20Tutorial.ipynb): To learn more about prompt formatter.\n", "\n", - "2. [Tutorial on multi-task adapters](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/asr_adapters/Multi_Task_Adapters.ipynb): If you wish to explore adaptation of `Canary-flash` checkpoints using adapters." + "2. [Tutorial on multi-task adapters](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/asr/asr_adapters/Multi_Task_Adapters.ipynb): If you wish to explore adaptation of `Canary-flash` checkpoints using adapters." 
] } ], diff --git a/tutorials/asr/Intro_to_Transducers.ipynb b/tutorials/asr/Intro_to_Transducers.ipynb index d3928bed987f..fdd52733892e 100644 --- a/tutorials/asr/Intro_to_Transducers.ipynb +++ b/tutorials/asr/Intro_to_Transducers.ipynb @@ -45,7 +45,7 @@ "\n", "## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]" + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]" ], "execution_count": null, "outputs": [] @@ -225,7 +225,7 @@ "id": "0W12xF_CqcVF" }, "source": [ - "![](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/images/transducer.png?raw=true)" + "![](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/asr/images/transducer.png?raw=true)" ] }, { @@ -398,7 +398,7 @@ "import os\n", "\n", "if not os.path.exists(\"contextnet_rnnt.yaml\"):\n", - " !wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/contextnet_rnnt/contextnet_rnnt.yaml" + " !wget https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/examples/asr/conf/contextnet_rnnt/contextnet_rnnt.yaml" ], "execution_count": null, "outputs": [] diff --git a/tutorials/asr/Multilang_ASR.ipynb b/tutorials/asr/Multilang_ASR.ipynb index a4692b226a7b..b1405eae78b8 100644 --- a/tutorials/asr/Multilang_ASR.ipynb +++ b/tutorials/asr/Multilang_ASR.ipynb @@ -105,7 +105,7 @@ "## Install NeMo\n", "## We are using the main branch but you might want to adjust that too\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "!pip install datasets==2.21.0 # downgrading to 2.21.0 because latest version (4.0.0) has some incompatibility issues\n", "\n", @@ -206,7 +206,7 @@ "outputs": [], "source": [ "if not os.path.exists(\"get_librispeech_data.py\"):\n", - " !wget 
https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/dataset_processing/get_librispeech_data.py" + " !wget https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/main/scripts/dataset_processing/get_librispeech_data.py" ] }, { @@ -984,7 +984,7 @@ "outputs": [], "source": [ "if not os.path.exists(\"process_asr_text_tokenizer.py\"):\n", - " !wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/tokenizers/process_asr_text_tokenizer.py" + " !wget https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/main/scripts/tokenizers/process_asr_text_tokenizer.py" ] }, { diff --git a/tutorials/asr/Offline_ASR_with_VAD_for_CTC_models.ipynb b/tutorials/asr/Offline_ASR_with_VAD_for_CTC_models.ipynb index 8a8335ac1542..21161f5caf40 100644 --- a/tutorials/asr/Offline_ASR_with_VAD_for_CTC_models.ipynb +++ b/tutorials/asr/Offline_ASR_with_VAD_for_CTC_models.ipynb @@ -26,7 +26,7 @@ "\n", "## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "\"\"\"\n", "Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. 
matplotlib)!\n", @@ -125,13 +125,13 @@ "# You can ignore it if run locally but do make sure change the filepaths of scripts and config file in cells below.\n", "!mkdir -p scripts\n", "if not os.path.exists(\"scripts/vad_infer.py\"):\n", - " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/speech_classification/vad_infer.py\n", + " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/examples/asr/speech_classification/vad_infer.py\n", "if not os.path.exists(\"scripts/transcribe_speech.py\"):\n", - " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/transcribe_speech.py\n", + " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/examples/asr/transcribe_speech.py\n", " \n", "!mkdir -p conf/vad\n", "if not os.path.exists(\"conf/vad/vad_inference_postprocessing.yaml\"):\n", - " !wget -P conf/vad/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/vad/vad_inference_postprocessing.yaml" + " !wget -P conf/vad/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/examples/asr/conf/vad/vad_inference_postprocessing.yaml" ] }, { @@ -389,7 +389,7 @@ "source": [ "# Further Reading\n", "\n", - "There are two ways to incorporate VAD into ASR pipeline. The first strategy is to drop the frames that are predicted as `non-speech` by VAD, as already discussed in this tutorial. The second strategy is to keep all the frames and mask the `non-speech` frames with zero-signal values. Also, instead of using segment-VAD as shown in this tutorial, we can use frame-VAD model for faster inference and better accuracy. For more information, please refer to the script [speech_to_text_with_vad.py](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_vad/speech_to_text_with_vad.py)." + "There are two ways to incorporate VAD into ASR pipeline. 
The first strategy is to drop the frames that are predicted as `non-speech` by VAD, as already discussed in this tutorial. The second strategy is to keep all the frames and mask the `non-speech` frames with zero-signal values. Also, instead of using segment-VAD as shown in this tutorial, we can use frame-VAD model for faster inference and better accuracy. For more information, please refer to the script [speech_to_text_with_vad.py](https://github.com/NVIDIA-NeMo/Speech/blob/stable/examples/asr/asr_vad/speech_to_text_with_vad.py)." ] } ], diff --git a/tutorials/asr/Online_ASR_Microphone_Demo_Buffered_Streaming.ipynb b/tutorials/asr/Online_ASR_Microphone_Demo_Buffered_Streaming.ipynb index a15bb19dab24..e4448d8b4021 100644 --- a/tutorials/asr/Online_ASR_Microphone_Demo_Buffered_Streaming.ipynb +++ b/tutorials/asr/Online_ASR_Microphone_Demo_Buffered_Streaming.ipynb @@ -28,7 +28,7 @@ "\n", "# ## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[asr]\n", + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[asr]\n", "\n", "## Grab the config we'll use in this example\n", "!mkdir configs" diff --git a/tutorials/asr/Online_ASR_Microphone_Demo_Cache_Aware_Streaming.ipynb b/tutorials/asr/Online_ASR_Microphone_Demo_Cache_Aware_Streaming.ipynb index fb676af7dbb7..efab5702f7c1 100644 --- a/tutorials/asr/Online_ASR_Microphone_Demo_Cache_Aware_Streaming.ipynb +++ b/tutorials/asr/Online_ASR_Microphone_Demo_Cache_Aware_Streaming.ipynb @@ -42,7 +42,7 @@ "source": [ "# ## Uncomment this cell to install NeMo if it has not been installed\n", "# BRANCH = 'main'\n", - "# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[asr]" + "# !python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[asr]" ] }, { diff --git a/tutorials/asr/Online_Noise_Augmentation.ipynb 
b/tutorials/asr/Online_Noise_Augmentation.ipynb index 3ad6ca0c2590..b0d16177aa65 100644 --- a/tutorials/asr/Online_Noise_Augmentation.ipynb +++ b/tutorials/asr/Online_Noise_Augmentation.ipynb @@ -35,7 +35,7 @@ "\n", "# ## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[asr]\n", + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[asr]\n", "\n", "## Install TorchAudio\n", "!pip install torchaudio>=0.13.0 -f https://download.pytorch.org/whl/torch_stable.html\n", @@ -51,7 +51,7 @@ "colab_type": "text", "id": "Kqg4Rwki4jBX" }, - "source": "# Introduction\n\nData augmentation is a useful method to improve the performance of models which is applicable across multiple domains. Certain augmentations can also substantially improve robustness of models to noisy samples. \n\nIn this notebook, we describe how to construct an augmentation pipeline inside [Neural Modules (NeMo)](https://github.com/NVIDIA/NeMo), enable augmented training of a [MarbleNet model](https://arxiv.org/abs/2010.13886) and finally how to construct custom augmentations to add to NeMo.\n\nThe notebook will follow the steps below:\n\n - Dataset preparation: Preparing a noise dataset using an example file.\n\n - Construct a data augmentation pipeline.\n \n - Construct a custom augmentation and register it for use in NeMo." + "source": "# Introduction\n\nData augmentation is a useful method to improve the performance of models which is applicable across multiple domains. Certain augmentations can also substantially improve robustness of models to noisy samples. 
\n\nIn this notebook, we describe how to construct an augmentation pipeline inside [Neural Modules (NeMo)](https://github.com/NVIDIA-NeMo/Speech), enable augmented training of a [MarbleNet model](https://arxiv.org/abs/2010.13886) and finally how to construct custom augmentations to add to NeMo.\n\nThe notebook will follow the steps below:\n\n - Dataset preparation: Preparing a noise dataset using an example file.\n\n - Construct a data augmentation pipeline.\n \n - Construct a custom augmentation and register it for use in NeMo." }, { "attachments": {}, @@ -738,7 +738,7 @@ "source": [ "# This is where the rir data will be downloaded.\n", "# Change this if you don't want the data to be extracted in the current directory.\n", - "!wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/dataset_processing/get_openslr_rir_data.py\n", + "!wget https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/scripts/dataset_processing/get_openslr_rir_data.py\n", "rir_data_path = '.'\n", "!python get_openslr_rir_data.py --data_root {rir_data_path}\n", "rir_manifest_path = os.path.join(rir_data_path, 'processed', 'rir.json')\n", @@ -1117,7 +1117,7 @@ " MODEL_CONFIG = \"matchboxnet_3x1x64_v2.yaml\"\n", "\n", "if not os.path.exists(f\"configs/{MODEL_CONFIG}\"):\n", - " !wget -P configs/ \"https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/matchboxnet/{MODEL_CONFIG}\"" + " !wget -P configs/ \"https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/examples/asr/conf/matchboxnet/{MODEL_CONFIG}\"" ] }, { diff --git a/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb b/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb index db0159977e84..983f2ac114f2 100644 --- a/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb +++ b/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb @@ -30,7 +30,7 @@ "\n", "# ## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[asr]" 
+ "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[asr]" ] }, { @@ -208,7 +208,7 @@ "metadata": {}, "source": [ "### Posterior\n", - "\n", + "\n", "\n", "2. generate predictions with overlapping input segments. Then a smoothing filter is applied to decide the label for a frame spanned by multiple segments. Perform this step alongside with above step with flag **gen_overlap_seq=True** or use\n", "```python\n", @@ -226,7 +226,7 @@ "metadata": {}, "source": [ "### Finetune\n", - "You might need to finetune on your data for better performance. For finetuning/transfer learning, please refer to [**Transfer learning** part of ASR tutorial](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)" + "You might need to finetune on your data for better performance. For finetuning/transfer learning, please refer to [**Transfer learning** part of ASR tutorial](https://github.com/NVIDIA-NeMo/Speech/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)" ] }, { diff --git a/tutorials/asr/Self_Supervised_Pre_Training.ipynb b/tutorials/asr/Self_Supervised_Pre_Training.ipynb index 0506bafb56e3..082b7ae4e3e6 100644 --- a/tutorials/asr/Self_Supervised_Pre_Training.ipynb +++ b/tutorials/asr/Self_Supervised_Pre_Training.ipynb @@ -31,7 +31,7 @@ "\n", "## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "\"\"\"\n", "Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. 
matplotlib)!\n", @@ -53,7 +53,7 @@ "\n", "The approach we will use for pre-training our models is represented in the following diagram:\n", "\n", - " ![SSL diagram](https://raw.githubusercontent.com/NVIDIA/NeMo/main/tutorials/asr/images/contrastive_ssl.png)\n", + " ![SSL diagram](https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/main/tutorials/asr/images/contrastive_ssl.png)\n", "\n", "We first mask parts of our input using SpecAugment. The model is then trained to solve a contrastive task of distinguishing the latent representation of the masked time steps from several sampled distractors. Since our encoders also contain stride blocks which reduce the length of the inputs, in order to obtain target representations we combine several consecutive time steps. They are then passed through a quantizer, which has been found to help with contrastive pre-training." ] @@ -274,8 +274,8 @@ "source": [ "## Grab the configs we'll use in this example\n", "!mkdir configs\n", - "!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/ssl/citrinet/citrinet_ssl_1024.yaml\n", - "!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/citrinet/citrinet_1024.yaml\n" + "!wget -P configs/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/examples/asr/conf/ssl/citrinet/citrinet_ssl_1024.yaml\n", + "!wget -P configs/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/examples/asr/conf/citrinet/citrinet_1024.yaml\n" ] }, { @@ -484,7 +484,7 @@ "outputs": [], "source": [ "!mkdir scripts\n", - "!wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tokenizers/process_asr_text_tokenizer.py\n", + "!wget -P scripts/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/scripts/tokenizers/process_asr_text_tokenizer.py\n", "\n", "!python ./scripts/process_asr_text_tokenizer.py \\\n", " --manifest=\"{data_dir}/an4/train_manifest.json\" \\\n", diff --git 
a/tutorials/asr/Streaming_ASR.ipynb b/tutorials/asr/Streaming_ASR.ipynb index a4701dc025d8..7218ac4e4a7b 100644 --- a/tutorials/asr/Streaming_ASR.ipynb +++ b/tutorials/asr/Streaming_ASR.ipynb @@ -29,11 +29,11 @@ "\n", "## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "## Grab the config we'll use in this example\n", "!mkdir configs\n", - "!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/config.yaml\n", + "!wget -P configs/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/examples/asr/conf/config.yaml\n", "\n", "\"\"\"\n", "Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!\n", @@ -62,7 +62,7 @@ "* Real-time or close to real-time inference for live transcriptions\n", "* Offline transcriptions of very long audio\n", "\n", - "In this tutorial, we will mainly focus on streaming for handling long form audio and close to real-time inference with CTC based models. For training ASR models we usually use short segments of audio (<20s) that may be smaller chunks of a long audio that is aligned with the transcriptions and segmented into smaller chunks (see [tools/](https://github.com/NVIDIA/NeMo/tree/main/tools) for some great tools to do this). For running inference on long audio files we are restricted by the available GPU memory that dictates the maximum length of audio that can be transcribed in one inference call. We will take a look at one of the ways to overcome this restriction using NeMo's Conformer-CTC ASR model." + "In this tutorial, we will mainly focus on streaming for handling long form audio and close to real-time inference with CTC based models. 
For training ASR models we usually use short segments of audio (<20s) that may be smaller chunks of a long audio that is aligned with the transcriptions and segmented into smaller chunks (see [tools/](https://github.com/NVIDIA-NeMo/Speech/tree/main/tools) for some great tools to do this). For running inference on long audio files we are restricted by the available GPU memory that dictates the maximum length of audio that can be transcribed in one inference call. We will take a look at one of the ways to overcome this restriction using NeMo's Conformer-CTC ASR model." ] }, { diff --git a/tutorials/asr/Streaming_ASR_Pipelines.ipynb b/tutorials/asr/Streaming_ASR_Pipelines.ipynb index 17868c82a080..5c9f6fa2ff7b 100644 --- a/tutorials/asr/Streaming_ASR_Pipelines.ipynb +++ b/tutorials/asr/Streaming_ASR_Pipelines.ipynb @@ -29,7 +29,7 @@ "!pip install omegaconf\n", "\n", "BRANCH='main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@{BRANCH}#egg=nemo_toolkit[all]" + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@{BRANCH}#egg=nemo_toolkit[all]" ] }, { @@ -44,7 +44,7 @@ "# - vllm: required for LLM-based Speech Translation\n", "#\n", "# For a complete Docker-based setup for speech translation with vLLM, see:\n", - "# https://github.com/NVIDIA/NeMo/blob/main/scripts/installers/Dockerfile.speech_translation_vllm\n", + "# https://github.com/NVIDIA-NeMo/Speech/blob/main/scripts/installers/Dockerfile.speech_translation_vllm\n", "\n", "!pip install nemo_text_processing\n", "!pip install vllm==0.12.0" diff --git a/tutorials/asr/Streaming_Multitalker_ASR.ipynb b/tutorials/asr/Streaming_Multitalker_ASR.ipynb index 53b1cfaf3190..7d5564a9b23c 100644 --- a/tutorials/asr/Streaming_Multitalker_ASR.ipynb +++ b/tutorials/asr/Streaming_Multitalker_ASR.ipynb @@ -27,7 +27,7 @@ "\n", "# ## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@{BRANCH}#egg=nemo_toolkit[asr]" + "!python -m pip install 
git+https://github.com/NVIDIA-NeMo/Speech.git@{BRANCH}#egg=nemo_toolkit[asr]" ] }, { @@ -448,7 +448,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Since streaming processing of speech signals involves many complications in cache handling, we first need to set up a config dataclass that aggregates all the parameters in one place. You can access this class in the example multitalker streaming ASR script: [speech_to_text_multitalker_streaming_infer.py](https://raw.githubusercontent.com/NVIDIA-NeMo/NeMo/main/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py\") " + "Since streaming processing of speech signals involves many complications in cache handling, we first need to set up a config dataclass that aggregates all the parameters in one place. You can access this class in the example multitalker streaming ASR script: [speech_to_text_multitalker_streaming_infer.py](https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/main/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py\") " ] }, { diff --git a/tutorials/asr/Transducers_with_HF_Datasets.ipynb b/tutorials/asr/Transducers_with_HF_Datasets.ipynb index ec69da5eb54f..62cd81442e4e 100644 --- a/tutorials/asr/Transducers_with_HF_Datasets.ipynb +++ b/tutorials/asr/Transducers_with_HF_Datasets.ipynb @@ -32,7 +32,7 @@ "\n", "## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "!pip install --upgrade datasets==3.4.0 # downgrading to 3.4.0 because latest version (4.0.0) has some issues\n" ] @@ -50,7 +50,7 @@ "\n", "In this tutorial, we demonstrate the usage of HF datasets for the Telugu language, where we use the Fluers dataset for training, validation, and testing. 
However, the same procedure can be used for other languages or domains and finetuned for specific use cases accordingly. \n", "\n", - "For scripts, refer to [speech_to_text_finetune.py]('https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_to_text_finetune.py') for training from scratch. \n", + "For scripts, refer to [speech_to_text_finetune.py]('https://github.com/NVIDIA-NeMo/Speech/blob/main/examples/asr/speech_to_text_finetune.py') for training from scratch. \n", "\n", "--------\n", "\n", @@ -127,14 +127,14 @@ "source": [ "Since we are finetuning Parakeet model, which is an English language model, we need to update the tokenizer and update the decoder to support the new language. \n", "\n", - "First, we will extract text transcriptions from the dataset and use them to train a tokenizer. We will use the scripts from NeMo first to get the data from HF dataset using `get_hf_dataset.py` script. Next we use `process_asr_text_tokenizer.py` script to prepare the tokenizer from [scripts](https://github.com/NVIDIA/NeMo/tree/main/scripts/tokenizers) folder. \n" + "First, we will extract text transcriptions from the dataset and use them to train a tokenizer. We will use the scripts from NeMo first to get the data from HF dataset using `get_hf_dataset.py` script. Next we use `process_asr_text_tokenizer.py` script to prepare the tokenizer from [scripts](https://github.com/NVIDIA-NeMo/Speech/tree/main/scripts/tokenizers) folder. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Download the `get_hf_text_data.py` script from the [scripts](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers) folder and run the following command to get the data from HF dataset. " + "Download the `get_hf_text_data.py` script from the [scripts](https://github.com/NVIDIA-NeMo/Speech/blob/main/scripts/tokenizers) folder and run the following command to get the data from HF dataset. 
" ] }, { @@ -146,7 +146,7 @@ "outputs": [], "source": [ "if not os.path.exists(\"scripts/get_hf_text_data.py\"):\n", - " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tokenizers/get_hf_text_data.py" + " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/scripts/tokenizers/get_hf_text_data.py" ] }, { @@ -189,7 +189,7 @@ "source": [ "\n", "if not os.path.exists('configs/huggingface_data_tokenizer.yaml'):\n", - " !wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tokenizers/conf/huggingface_data_tokenizer.yaml\n", + " !wget -P configs/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/scripts/tokenizers/conf/huggingface_data_tokenizer.yaml\n", "\n", "\n", "!export HYDRA_FULL_ERROR=1;python scripts/get_hf_text_data.py \\\n", @@ -242,7 +242,7 @@ "source": [ "\n", "if not os.path.exists(\"scripts/process_asr_text_tokenizer.py\"):\n", - " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tokenizers/process_asr_text_tokenizer.py\n", + " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/scripts/tokenizers/process_asr_text_tokenizer.py\n", "\n", "# Now this downloads the text corpus of data to tokenizers script\n", "VOCAB_SIZE = 256 # can be any value above 29\n", @@ -325,7 +325,7 @@ "## Grab the config we'll use in this example\n", "!mkdir -p configs\n", "if not os.path.exists('configs/speech_to_text_hf_finetune.yaml'):\n", - " !wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/asr_finetune/speech_to_text_hf_finetune.yaml\n", + " !wget -P configs/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/examples/asr/conf/asr_finetune/speech_to_text_hf_finetune.yaml\n", "\n", "config = OmegaConf.load(\"configs/speech_to_text_hf_finetune.yaml\")" ] diff --git a/tutorials/asr/Voice_Activity_Detection.ipynb b/tutorials/asr/Voice_Activity_Detection.ipynb index 
1ccb05ba80d9..3e5b3a25fbea 100644 --- a/tutorials/asr/Voice_Activity_Detection.ipynb +++ b/tutorials/asr/Voice_Activity_Detection.ipynb @@ -31,7 +31,7 @@ "\n", "# ## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[asr]\n", + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[asr]\n", "\n", "## Install TorchAudio\n", "## NOTE: TorchAudio installation may not work in all environments, please use Google Colab for best experience\n", @@ -143,7 +143,7 @@ "source": [ "script = os.path.join(tmp, 'process_vad_data.py')\n", "if not os.path.exists(script):\n", - " !wget -P $tmp https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/dataset_processing/process_vad_data.py" + " !wget -P $tmp https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/scripts/dataset_processing/process_vad_data.py" ] }, { @@ -299,7 +299,7 @@ "MODEL_CONFIG = \"marblenet_3x2x64.yaml\"\n", "\n", "if not os.path.exists(f\"configs/{MODEL_CONFIG}\"):\n", - " !wget -P configs/ \"https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/marblenet/{MODEL_CONFIG}\"" + " !wget -P configs/ \"https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/examples/asr/conf/marblenet/{MODEL_CONFIG}\"" ] }, { @@ -576,7 +576,7 @@ "Experiment with increasing the number of epochs or with batch size to see how much you can improve the score! \n", "\n", "**NOTE:** Noise robustness is quite important for VAD task. Below we list the augmentation we used in this demo. 
\n", - "Please refer to [Online_Noise_Augmentation.ipynb](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Online_Noise_Augmentation.ipynb) for understanding noise augmentation in NeMo.\n", + "Please refer to [Online_Noise_Augmentation.ipynb](https://github.com/NVIDIA-NeMo/Speech/blob/stable/tutorials/asr/Online_Noise_Augmentation.ipynb) for understanding noise augmentation in NeMo.\n", "\n", "\n" ] @@ -1131,9 +1131,9 @@ "metadata": {}, "source": [ "# Transfer Leaning & Fine-tuning on a new dataset\n", - "For transfer learning, please refer to [**Transfer learning** part of ASR tutorial](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)\n", + "For transfer learning, please refer to [**Transfer learning** part of ASR tutorial](https://github.com/NVIDIA-NeMo/Speech/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)\n", "\n", - "More details on saving and restoring checkpoint, and exporting a model in its entirety, please refer to [**Fine-tuning on a new dataset** & **Advanced Usage parts** of Speech Command tutorial](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Speech_Commands.ipynb)\n", + "More details on saving and restoring checkpoint, and exporting a model in its entirety, please refer to [**Fine-tuning on a new dataset** & **Advanced Usage parts** of Speech Command tutorial](https://github.com/NVIDIA-NeMo/Speech/blob/stable/tutorials/asr/Speech_Commands.ipynb)\n", "\n", "\n", "\n" @@ -1148,7 +1148,7 @@ }, "source": [ "# Inference and more\n", - "If you are interested in **pretrained** model and **streaming inference**, please have a look at our [VAD inference tutorial](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb) and script [vad_infer.py](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/speech_classification/vad_infer.py)\n" + "If you are interested in **pretrained** model and **streaming inference**, please have a look at our [VAD inference 
tutorial](https://github.com/NVIDIA-NeMo/Speech/blob/stable/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb) and script [vad_infer.py](https://github.com/NVIDIA-NeMo/Speech/blob/stable/examples/asr/speech_classification/vad_infer.py)\n" ] }, { @@ -1167,7 +1167,7 @@ "\n", "During inference, since frame-VAD model doesn't require splicing input into overlapping segments, it is more efficient than segment-VAD model, with 8x less GPU memory consumption.\n", "\n", - "For more information on the frame-VAD model, please refer to the [README.md](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/speech_classification/README.md). For training and running inference on frame-VAD, please refer to [speech_to_frame_label.py](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/speech_classification/speech_to_frame_label.py) and [frame_vad_infer.py](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/speech_classification/frame_vad_infer.py)." + "For more information on the frame-VAD model, please refer to the [README.md](https://github.com/NVIDIA-NeMo/Speech/blob/stable/examples/asr/speech_classification/README.md). For training and running inference on frame-VAD, please refer to [speech_to_frame_label.py](https://github.com/NVIDIA-NeMo/Speech/blob/stable/examples/asr/speech_classification/speech_to_frame_label.py) and [frame_vad_infer.py](https://github.com/NVIDIA-NeMo/Speech/blob/stable/examples/asr/speech_classification/frame_vad_infer.py)." 
] } ], diff --git a/tutorials/asr/asr_adapters/ASR_with_Adapters.ipynb b/tutorials/asr/asr_adapters/ASR_with_Adapters.ipynb index c3334a59b0d2..656b91d1fe6a 100644 --- a/tutorials/asr/asr_adapters/ASR_with_Adapters.ipynb +++ b/tutorials/asr/asr_adapters/ASR_with_Adapters.ipynb @@ -51,11 +51,11 @@ "\n", "## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "## Grab the config we'll use in this example\n", "# !mkdir configs\n", - "# !wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/contextnet_rnnt/contextnet_rnnt.yaml" + "# !wget -P configs/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/examples/asr/conf/contextnet_rnnt/contextnet_rnnt.yaml" ] }, { @@ -166,7 +166,7 @@ " os.makedirs(\"scripts\")\n", "\n", "if not os.path.exists(\"scripts/process_an4_data.py\"):\n", - " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/dataset_processing/process_an4_data.py" + " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/scripts/dataset_processing/process_an4_data.py" ], "metadata": { "id": "NpKgT6q5-gNk" @@ -524,10 +524,10 @@ "cell_type": "code", "source": [ "if not os.path.exists('scripts/transcribe_speech.py'):\n", - " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/transcribe_speech.py\n", + " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/examples/asr/transcribe_speech.py\n", "\n", "if not os.path.exists('scripts/speech_to_text_eval.py'):\n", - " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/speech_to_text_eval.py" + " !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/examples/asr/speech_to_text_eval.py" ], "metadata": { "id": 
"Ak4v4aWjGoQH" @@ -1297,7 +1297,7 @@ "source": [ "# Further reading\n", "\n", - "For efficient scripts to add, train, and evaluate adapter augmented models, please refer to the [Adapters example section](https://github.com/NVIDIA/NeMo/tree/main/examples/asr/asr_adapters).\n", + "For efficient scripts to add, train, and evaluate adapter augmented models, please refer to the [Adapters example section](https://github.com/NVIDIA-NeMo/Speech/tree/main/examples/asr/asr_adapters).\n", "\n", "Please follow the following articles that discuss the use of adapters in ASR - \n", "- [Exploiting Adapters for Cross-lingual Low-resource Speech Recognition](https://arxiv.org/abs/2105.11905)\n", diff --git a/tutorials/asr/asr_adapters/Multi_Task_Adapters.ipynb b/tutorials/asr/asr_adapters/Multi_Task_Adapters.ipynb index 290b706721cd..c9e94b2b3e8f 100644 --- a/tutorials/asr/asr_adapters/Multi_Task_Adapters.ipynb +++ b/tutorials/asr/asr_adapters/Multi_Task_Adapters.ipynb @@ -33,7 +33,7 @@ "\n", "## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install \"nemo_toolkit[asr] @ git+https://github.com/NVIDIA/NeMo.git@$BRANCH\"" + "!python -m pip install \"nemo_toolkit[asr] @ git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH\"" ] }, { @@ -54,7 +54,7 @@ "\n", "Multi Task (Canary) models are highly capable large neural networks capable of things like speech recognition, X to English and English to X translation and able to select whether to transcribe speech with punctuation and capitalization. These huge models are trained on several thousand hours of speech and text data, making it challenging to adapt to new datasets.\n", "\n", - "In the previous tutorial for [ASR Adapters](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/asr_adapters/ASR_with_Adapters.ipynb), we used small adapter modules to tune a large ASR model on a small amount of data. 
In this tutorial, we will adapt a [Nvidia Canary](https://huggingface.co/nvidia/canary-1b) model onto a small amount of speech data for both Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST).\n", + "In the previous tutorial for [ASR Adapters](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/asr/asr_adapters/ASR_with_Adapters.ipynb), we used small adapter modules to tune a large ASR model on a small amount of data. In this tutorial, we will adapt a [Nvidia Canary](https://huggingface.co/nvidia/canary-1b) model onto a small amount of speech data for both Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST).\n", "\n", "In this tutorial, we will also demonstrate a simple way of creating custom Data Modules from PyTorch Lightning to design custom datasets and data loaders for the highly flexible Multi Task Models in NeMo ASR. This offers users more flexibility in designing new tasks, and finetuning the models on small amounts of data." ] diff --git a/tutorials/audio/speech_enhancement/BNR_Speech_enhancement_with_NeMo.ipynb b/tutorials/audio/speech_enhancement/BNR_Speech_enhancement_with_NeMo.ipynb index 62caa21f2815..aad50d99d5c9 100644 --- a/tutorials/audio/speech_enhancement/BNR_Speech_enhancement_with_NeMo.ipynb +++ b/tutorials/audio/speech_enhancement/BNR_Speech_enhancement_with_NeMo.ipynb @@ -825,10 +825,10 @@ "\n", "For more details about NeMo models and applications in in ASR and TTS, we recommend you checkout other tutorials next:\n", "\n", - "* [NeMo fundamentals](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/00_NeMo_Primer.ipynb)\n", - "* [NeMo models](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/01_NeMo_Models.ipynb)\n", - "* [Speech Recognition](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)\n", - "* [Speech 
Synthesis](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)" + "* [NeMo fundamentals](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/00_NeMo_Primer.ipynb)\n", + "* [NeMo models](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/01_NeMo_Models.ipynb)\n", + "* [Speech Recognition](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)\n", + "* [Speech Synthesis](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)" ] }, { diff --git a/tutorials/audio/speech_enhancement/Speech_Enhancement_with_NeMo.ipynb b/tutorials/audio/speech_enhancement/Speech_Enhancement_with_NeMo.ipynb index 8e3da9778848..c0950a87e4fe 100644 --- a/tutorials/audio/speech_enhancement/Speech_Enhancement_with_NeMo.ipynb +++ b/tutorials/audio/speech_enhancement/Speech_Enhancement_with_NeMo.ipynb @@ -417,7 +417,7 @@ "id": "9ce4eebe" }, "source": [ - "\"encmaskdecoder_model\"" + "\"encmaskdecoder_model\"" ] }, { @@ -453,7 +453,7 @@ "id": "4404d6af" }, "source": [ - "\"single_output_example_model\"" + "\"single_output_example_model\"" ] }, { @@ -942,7 +942,7 @@ "\n", "This can be achieved with small changes to the model configuration.\n", "\n", - "\"dual_output_example_model\"" + "\"dual_output_example_model\"" ] }, { @@ -1261,10 +1261,10 @@ "\n", "For more details about NeMo models and applications in in ASR and TTS, we recommend you checkout other tutorials next:\n", "\n", - "* [NeMo fundamentals](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/00_NeMo_Primer.ipynb)\n", - "* [NeMo models](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/01_NeMo_Models.ipynb)\n", - "* [Speech Recognition](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)\n", - "* [Speech 
Synthesis](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)" + "* [NeMo fundamentals](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/00_NeMo_Primer.ipynb)\n", + "* [NeMo models](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/01_NeMo_Models.ipynb)\n", + "* [Speech Recognition](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)\n", + "* [Speech Synthesis](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)" ] }, { diff --git a/tutorials/audio/speech_enhancement/Speech_Enhancement_with_Online_Augmentation.ipynb b/tutorials/audio/speech_enhancement/Speech_Enhancement_with_Online_Augmentation.ipynb index 876771e3872c..49cdb10de7d1 100644 --- a/tutorials/audio/speech_enhancement/Speech_Enhancement_with_Online_Augmentation.ipynb +++ b/tutorials/audio/speech_enhancement/Speech_Enhancement_with_Online_Augmentation.ipynb @@ -464,7 +464,7 @@ "id": "c0ff2bff-1637-4a17-85b2-c92307a4d8d7" }, "source": [ - "\"encmaskdecoder_model\"" + "\"encmaskdecoder_model\"" ] }, { @@ -500,7 +500,7 @@ "id": "8ed37321-2e8b-4d8e-90bf-efc0897517e5" }, "source": [ - "\"single_output_example_model\"" + "\"single_output_example_model\"" ] }, { @@ -928,10 +928,10 @@ "\n", "For more details about NeMo models and applications in in ASR and TTS, we recommend you checkout other tutorials next:\n", "\n", - "* [NeMo fundamentals](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/00_NeMo_Primer.ipynb)\n", - "* [NeMo models](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/01_NeMo_Models.ipynb)\n", - "* [Speech Recognition](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)\n", - "* [Speech 
Synthesis](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)" + "* [NeMo fundamentals](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/00_NeMo_Primer.ipynb)\n", + "* [NeMo models](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/01_NeMo_Models.ipynb)\n", + "* [Speech Recognition](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)\n", + "* [Speech Synthesis](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)" ] }, { diff --git a/tutorials/cloud/aws/ASR_Finetuning_at_Scale_with_AWS_SageMaker.ipynb b/tutorials/cloud/aws/ASR_Finetuning_at_Scale_with_AWS_SageMaker.ipynb index c4406a4f04ee..9e01c02dc70e 100644 --- a/tutorials/cloud/aws/ASR_Finetuning_at_Scale_with_AWS_SageMaker.ipynb +++ b/tutorials/cloud/aws/ASR_Finetuning_at_Scale_with_AWS_SageMaker.ipynb @@ -71,7 +71,7 @@ "\n", "## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "\"\"\"\n", "Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. 
matplotlib)!\n", @@ -193,17 +193,17 @@ "config_path = str(config_dir / \"config.yaml\")\n", "\n", "# download scripts to format the data source.\n", - "wget.download(\"https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/speech_recognition/convert_hf_dataset_to_nemo.py\", str(code_dir))\n", - "wget.download(\"https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/speech_recognition/convert_to_tarred_audio_dataset.py\",\n", + "wget.download(\"https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/main/scripts/speech_recognition/convert_hf_dataset_to_nemo.py\", str(code_dir))\n", + "wget.download(\"https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/main/scripts/speech_recognition/convert_to_tarred_audio_dataset.py\",\n", " str(code_dir))\n", "\n", "# download scripts to run training\n", - "wget.download(\"https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/asr/conf/conformer/conformer_ctc_bpe.yaml\", config_path)\n", - "wget.download(\"https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py\",\n", + "wget.download(\"https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/main/examples/asr/conf/conformer/conformer_ctc_bpe.yaml\", config_path)\n", + "wget.download(\"https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/main/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py\",\n", " str(code_dir))\n", "\n", "# download script to create tokenizer\n", - "wget.download(\"https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/tokenizers/process_asr_text_tokenizer.py\",\n", + "wget.download(\"https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/main/scripts/tokenizers/process_asr_text_tokenizer.py\",\n", " str(code_dir))" ] }, diff --git a/tutorials/cloud/aws/SageMaker_ASR_Training.ipynb b/tutorials/cloud/aws/SageMaker_ASR_Training.ipynb index 8cf540b27114..8c239a15d1e0 100644 --- a/tutorials/cloud/aws/SageMaker_ASR_Training.ipynb +++ b/tutorials/cloud/aws/SageMaker_ASR_Training.ipynb @@ -56,7 +56,7 @@ "\n", "## 
Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "\"\"\"\n", "Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!\n", @@ -173,8 +173,8 @@ "outputs": [], "source": [ "config_path = str(config_dir / \"config.yaml\")\n", - "wget.download(\"https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/asr/conf/conformer/conformer_ctc_char.yaml\", config_path)\n", - "wget.download(\"https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/asr/asr_ctc/speech_to_text_ctc.py\", str(code_dir))" + "wget.download(\"https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/main/examples/asr/conf/conformer/conformer_ctc_char.yaml\", config_path)\n", + "wget.download(\"https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/main/examples/asr/asr_ctc/speech_to_text_ctc.py\", str(code_dir))" ] }, { diff --git a/tutorials/speaker_tasks/ASR_with_SpeakerDiarization.ipynb b/tutorials/speaker_tasks/ASR_with_SpeakerDiarization.ipynb index f25ff76163af..ed3c21dd38c7 100644 --- a/tutorials/speaker_tasks/ASR_with_SpeakerDiarization.ipynb +++ b/tutorials/speaker_tasks/ASR_with_SpeakerDiarization.ipynb @@ -31,7 +31,7 @@ "\n", "# ## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@{BRANCH}#egg=nemo_toolkit[asr]" + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@{BRANCH}#egg=nemo_toolkit[asr]" ] }, { @@ -49,13 +49,13 @@ "\n", "In this tutorial, we demonstrate how we can get ASR transcriptions combined with speaker labels. 
Since we don't include a detailed process of getting ASR results or diarization results, please refer to the following links for more in-depth description.\n", "\n", - "If you need detailed understanding of transcribing words with ASR, refer to this [ASR Tutorial](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb) tutorial.\n", + "If you need detailed understanding of transcribing words with ASR, refer to this [ASR Tutorial](https://github.com/NVIDIA-NeMo/Speech/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb) tutorial.\n", "\n", "\n", - "For detailed parameter setting and execution of speaker diarization, refer to this [Diarization Inference](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb) tutorial.\n", + "For detailed parameter setting and execution of speaker diarization, refer to this [Diarization Inference](https://github.com/NVIDIA-NeMo/Speech/blob/stable/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb) tutorial.\n", "\n", "\n", - "An example script that runs ASR and speaker diarization together can be found at [ASR with Diarization](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/clustering_diarizer/offline_diar_with_asr_infer.py).\n", + "An example script that runs ASR and speaker diarization together can be found at [ASR with Diarization](https://github.com/NVIDIA-NeMo/Speech/blob/main/examples/speaker_tasks/diarization/clustering_diarizer/offline_diar_with_asr_infer.py).\n", "\n", "### Speaker diarization in ASR pipeline\n", "\n", @@ -193,7 +193,7 @@ "DOMAIN_TYPE = \"meeting\" # Can be meeting or telephonic based on domain type of the audio file\n", "CONFIG_FILE_NAME = f\"diar_infer_{DOMAIN_TYPE}.yaml\"\n", "\n", - "CONFIG_URL = f\"https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/speaker_tasks/diarization/conf/inference/{CONFIG_FILE_NAME}\"\n", + "CONFIG_URL = 
f\"https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/main/examples/speaker_tasks/diarization/conf/inference/{CONFIG_FILE_NAME}\"\n", "\n", "if not os.path.exists(os.path.join(data_dir,CONFIG_FILE_NAME)):\n", " CONFIG = wget.download(CONFIG_URL, data_dir)\n", diff --git a/tutorials/speaker_tasks/End_to_End_Diarization_Inference.ipynb b/tutorials/speaker_tasks/End_to_End_Diarization_Inference.ipynb index 155dac4c974f..ee227bf894d4 100644 --- a/tutorials/speaker_tasks/End_to_End_Diarization_Inference.ipynb +++ b/tutorials/speaker_tasks/End_to_End_Diarization_Inference.ipynb @@ -25,7 +25,7 @@ "\n", "# ## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@{BRANCH}#egg=nemo_toolkit[asr]" + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@{BRANCH}#egg=nemo_toolkit[asr]" ] }, { @@ -69,7 +69,7 @@ "source": [ "## Sortformer Diarization Inference\n", "\n", - "As explained in the [Sortformer Diarization Training](https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Diarization_Training.ipynb) tutorial, Sortformer is trained with Sort-Loss to generate speaker segments in arrival-time order. If a diarization model can generate speaker segments in a pre-defined rule or order, we do not need to match the permutations when we train diarization model with multi-speaker automatic speech recognition (ASR) models or we do not need to match permutations from each window when a diarization model is running in streaming mode where audio chunk sequences are processed, creating a problem of permutation matchin between inference windows. " + "As explained in the [Sortformer Diarization Training](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/speaker_tasks/Speaker_Diarization_Training.ipynb) tutorial, Sortformer is trained with Sort-Loss to generate speaker segments in arrival-time order. 
If a diarization model can generate speaker segments in a pre-defined rule or order, we do not need to match the permutations when we train diarization model with multi-speaker automatic speech recognition (ASR) models or we do not need to match permutations from each window when a diarization model is running in streaming mode where audio chunk sequences are processed, creating a problem of permutation matchin between inference windows. " ] }, { @@ -275,7 +275,7 @@ "yaml_name=\"sortformer_diar_4spk-v1_dihard3-dev.yaml\"\n", "MODEL_CONFIG = os.path.join(data_dir, yaml_name)\n", "if True:\n", - " config_url = f\"https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/speaker_tasks/diarization/conf/post_processing/{yaml_name}\"\n", + " config_url = f\"https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/main/examples/speaker_tasks/diarization/conf/post_processing/{yaml_name}\"\n", " MODEL_CONFIG = wget.download(config_url, data_dir)\n" ] }, diff --git a/tutorials/speaker_tasks/End_to_End_Diarization_Training.ipynb b/tutorials/speaker_tasks/End_to_End_Diarization_Training.ipynb index 354cf1208672..f3f2d72e9fa4 100644 --- a/tutorials/speaker_tasks/End_to_End_Diarization_Training.ipynb +++ b/tutorials/speaker_tasks/End_to_End_Diarization_Training.ipynb @@ -21,9 +21,9 @@ "NEMO_DIR_PATH = \"NeMo\"\n", "BRANCH = 'main'\n", "\n", - "! git clone https://github.com/NVIDIA/NeMo\n", + "! git clone https://github.com/NVIDIA-NeMo/Speech\n", "%cd NeMo\n", - "! python -m pip install git+https://github.com/NVIDIA/NeMo.git@{BRANCH}#egg=nemo_toolkit[asr]\n", + "! python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@{BRANCH}#egg=nemo_toolkit[asr]\n", "%cd .." 
] }, @@ -375,11 +375,11 @@ "source": [ "### Example Data Creation\n", "\n", - "In this tutorial, we will create a simple toy training dataset using the [NeMo Multispeaker Simulator](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tools/Multispeaker_Simulator.ipynb), with Librispeech as the source dataset for demonstration purposes. If you already have datasets with proper speaker annotations (RTTM files), you can replace the simulated dataset with your own.\n", + "In this tutorial, we will create a simple toy training dataset using the [NeMo Multispeaker Simulator](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/tools/Multispeaker_Simulator.ipynb), with Librispeech as the source dataset for demonstration purposes. If you already have datasets with proper speaker annotations (RTTM files), you can replace the simulated dataset with your own.\n", "\n", - "If you don’t have access to any speaker diarization datasets, the [NeMo Multispeaker Simulator](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tools/Multispeaker_Simulator.ipynb) can be used to generate a sufficient amount of data samples to meet your requirements.\n", + "If you don’t have access to any speaker diarization datasets, the [NeMo Multispeaker Simulator](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/tools/Multispeaker_Simulator.ipynb) can be used to generate a sufficient amount of data samples to meet your requirements.\n", "\n", - "For more details on the data simulator, refer to the documentation in the [NeMo Multispeaker Simulator](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tools/Multispeaker_Simulator.ipynb). This tutorial will not cover the configurations and detailed process of data simulation." + "For more details on the data simulator, refer to the documentation in the [NeMo Multispeaker Simulator](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/tools/Multispeaker_Simulator.ipynb). 
This tutorial will not cover the configurations and detailed process of data simulation." ] }, { @@ -419,9 +419,9 @@ " print(\"Downloading necessary scripts\")\n", " !mkdir -p NeMo/scripts/dataset_processing\n", " !mkdir -p NeMo/scripts/speaker_tasks\n", - " !wget -P NeMo/scripts/dataset_processing/ https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/scripts/dataset_processing/get_librispeech_data.py\n", - " !wget -P NeMo/scripts/speaker_tasks/ https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/scripts/speaker_tasks/create_alignment_manifest.py\n", - " !wget -P NeMo/scripts/speaker_tasks/ https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py" + " !wget -P NeMo/scripts/dataset_processing/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/{BRANCH}/scripts/dataset_processing/get_librispeech_data.py\n", + " !wget -P NeMo/scripts/speaker_tasks/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/{BRANCH}/scripts/speaker_tasks/create_alignment_manifest.py\n", + " !wget -P NeMo/scripts/speaker_tasks/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/{BRANCH}/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py" ] }, { @@ -512,7 +512,7 @@ "!mkdir -p {conf_dir}\n", "CONFIG_PATH = os.path.join(conf_dir, 'data_simulator.yaml')\n", "if not os.path.exists(CONFIG_PATH):\n", - " !wget -P {conf_dir} https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/tools/speech_data_simulator/conf/data_simulator.yaml\n", + " !wget -P {conf_dir} https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/{BRANCH}/tools/speech_data_simulator/conf/data_simulator.yaml\n", "\n", "config = OmegaConf.load(CONFIG_PATH)\n", "print(OmegaConf.to_yaml(config))" @@ -804,7 +804,7 @@ "\n", "NEMO_ROOT = os.getcwd()\n", "!mkdir -p conf \n", - "!wget -P conf https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/speaker_tasks/diarization/conf/neural_diarizer/sortformer_diarizer_hybrid_loss_4spk-v1.yaml\n", + "!wget -P conf 
https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/{BRANCH}/examples/speaker_tasks/diarization/conf/neural_diarizer/sortformer_diarizer_hybrid_loss_4spk-v1.yaml\n", "MODEL_CONFIG = os.path.join(NEMO_ROOT,'conf/sortformer_diarizer_hybrid_loss_4spk-v1.yaml')\n", "config = OmegaConf.load(MODEL_CONFIG)\n" ] @@ -875,7 +875,7 @@ "outputs": [], "source": [ "\n", - "!wget -P conf https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/speaker_tasks/diarization/conf/neural_diarizer/streaming_sortformer_diarizer_4spk-v2.yaml\n", + "!wget -P conf https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/{BRANCH}/examples/speaker_tasks/diarization/conf/neural_diarizer/streaming_sortformer_diarizer_4spk-v2.yaml\n", "MODEL_CONFIG = os.path.join(NEMO_ROOT,'conf/streaming_sortformer_diarizer_4spk-v2.yaml')\n", "config = OmegaConf.load(MODEL_CONFIG)\n", "\n", diff --git a/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb b/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb index 49eaafa49331..352557003d6b 100644 --- a/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb +++ b/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb @@ -28,7 +28,7 @@ "\n", "## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@{BRANCH}#egg=nemo_toolkit[asr]" + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@{BRANCH}#egg=nemo_toolkit[asr]" ] }, { @@ -55,7 +55,7 @@ "source": [ "In this tutorial, we shall first train these embeddings on speaker-related datasets, and then get speaker embeddings from a pretrained network for a new dataset. Since Google Colab has very slow read-write speeds, I'll be demonstrating this tutorial using [an4](http://www.speech.cs.cmu.edu/databases/an4/). 
\n", "\n", - "Instead, if you'd like to try on a bigger dataset like [hi-mia](https://arxiv.org/abs/1912.01231) use the [get_hi-mia-data.py](https://github.com/NVIDIA/NeMo/tree/main/scripts/dataset_processing/speaker_tasks/get_hi-mia_data.py) script to download the necessary files, extract them, and resample to 16Khz if any of these samples are not at 16Khz. " + "Instead, if you'd like to try on a bigger dataset like [hi-mia](https://arxiv.org/abs/1912.01231) use the [get_hi-mia-data.py](https://github.com/NVIDIA-NeMo/Speech/tree/main/scripts/dataset_processing/speaker_tasks/get_hi-mia_data.py) script to download the necessary files, extract them, and resample to 16Khz if any of these samples are not at 16Khz. " ] }, { @@ -192,7 +192,7 @@ "if not os.path.exists('scripts'):\n", " print(\"Downloading necessary scripts\")\n", " !mkdir -p scripts/speaker_tasks\n", - " !wget -P scripts/speaker_tasks/ https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/scripts/speaker_tasks/filelist_to_manifest.py\n", + " !wget -P scripts/speaker_tasks/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/{BRANCH}/scripts/speaker_tasks/filelist_to_manifest.py\n", "!python {NEMO_ROOT}/scripts/speaker_tasks/filelist_to_manifest.py --filelist {data_dir}/an4/wav/an4_clstk/train_all.txt --id -2 --out {data_dir}/an4/wav/an4_clstk/all_manifest.json --split" ] }, @@ -273,7 +273,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note: All the following steps are just for explanation of each section, but one can use the provided [training script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/recognition/speaker_reco.py) to launch training in the command line." + "Note: All the following steps are just for explanation of each section, but one can use the provided [training script](https://github.com/NVIDIA-NeMo/Speech/blob/main/examples/speaker_tasks/recognition/speaker_reco.py) to launch training in the command line." 
] }, { @@ -322,7 +322,7 @@ "source": [ "# This line will print the entire config of sample TitaNet model\n", "!mkdir conf\n", - "!wget -P conf https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/speaker_tasks/recognition/conf/titanet-large.yaml\n", + "!wget -P conf https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/{BRANCH}/examples/speaker_tasks/recognition/conf/titanet-large.yaml\n", "MODEL_CONFIG = os.path.join(NEMO_ROOT,'conf/titanet-large.yaml')\n", "config = OmegaConf.load(MODEL_CONFIG)\n", "print(OmegaConf.to_yaml(config))" @@ -757,7 +757,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note: You may use [finetune-script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/recognition/speaker_reco_finetune.py) to launch training in the command line. Following is just a demonstration of the script" + "Note: You may use [finetune-script](https://github.com/NVIDIA-NeMo/Speech/blob/main/examples/speaker_tasks/recognition/speaker_reco_finetune.py) to launch training in the command line. 
Following is just a demonstration of the script" ] }, { @@ -766,7 +766,7 @@ "metadata": {}, "outputs": [], "source": [ - "!wget -P conf https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/speaker_tasks/recognition/conf/titanet-finetune.yaml\n", + "!wget -P conf https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/{BRANCH}/examples/speaker_tasks/recognition/conf/titanet-finetune.yaml\n", "MODEL_CONFIG = os.path.join(NEMO_ROOT,'conf/titanet-finetune.yaml')\n", "finetune_config = OmegaConf.load(MODEL_CONFIG)\n", "print(OmegaConf.to_yaml(finetune_config))" diff --git a/tutorials/speaker_tasks/Streaming_End_to_End_Diarization_Inference.ipynb b/tutorials/speaker_tasks/Streaming_End_to_End_Diarization_Inference.ipynb index 30431972d5f0..b7a5030a5192 100644 --- a/tutorials/speaker_tasks/Streaming_End_to_End_Diarization_Inference.ipynb +++ b/tutorials/speaker_tasks/Streaming_End_to_End_Diarization_Inference.ipynb @@ -25,7 +25,7 @@ "\n", "# ## Install NeMo\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@{BRANCH}#egg=nemo_toolkit[asr]" + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@{BRANCH}#egg=nemo_toolkit[asr]" ] }, { @@ -43,7 +43,7 @@ "source": [ "## Streaming Diarization Inference with Sortformer\n", "\n", - "As explained in the [Sortformer Diarization Training](https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Diarization_Training.ipynb) tutorial, Sortformer is trained with Sort-Loss to generate speaker segments in arrival-time order. If a diarization model can generate speaker segments in a pre-defined manner or order, we do not need to match the permutations when we train diarization model with multi-speaker automatic speech recognition (ASR) models, nor do we need to match permutations from each window when a diarization model is running in streaming mode where audio chunk sequences are processed, creating a problem of permutation matching between inference windows. 
\n", + "As explained in the [Sortformer Diarization Training](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/speaker_tasks/Speaker_Diarization_Training.ipynb) tutorial, Sortformer is trained with Sort-Loss to generate speaker segments in arrival-time order. If a diarization model can generate speaker segments in a pre-defined manner or order, we do not need to match the permutations when we train diarization model with multi-speaker automatic speech recognition (ASR) models, nor do we need to match permutations from each window when a diarization model is running in streaming mode where audio chunk sequences are processed, creating a problem of permutation matching between inference windows. \n", "\n", "### Arrival-Order Speaker Cache\n", "\n", diff --git a/tutorials/tools/CTC_Segmentation_Tutorial.ipynb b/tutorials/tools/CTC_Segmentation_Tutorial.ipynb index 56eac3aec12d..7b57f70b11ae 100644 --- a/tutorials/tools/CTC_Segmentation_Tutorial.ipynb +++ b/tutorials/tools/CTC_Segmentation_Tutorial.ipynb @@ -32,9 +32,9 @@ "\n", "# option #2: download NeMo repo\n", "if 'google.colab' in str(get_ipython()) or not os.path.exists(NEMO_DIR_PATH):\n", - " ! git clone -b $BRANCH https://github.com/NVIDIA/NeMo\n", + " ! git clone -b $BRANCH https://github.com/NVIDIA-NeMo/Speech\n", " ! cd NeMo\n", - " ! python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]" + " ! python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]" ], "metadata": { "id": "6DGWYSp62hs1" @@ -106,7 +106,7 @@ "id": "S1DZk-inQGTI" }, "source": [ - "`TOOLS_DIR` contains scripts that we are going to need during the next steps, all necessary scripts could be found [here](https://github.com/NVIDIA/NeMo/tree/main/tools/ctc_segmentation/scripts)." 
+ "`TOOLS_DIR` contains scripts that we are going to need during the next steps, all necessary scripts could be found [here](https://github.com/NVIDIA-NeMo/Speech/tree/main/tools/ctc_segmentation/scripts)." ] }, { @@ -260,7 +260,7 @@ "* `max_length` argument - max number of words in a segment for alignment (used only if there are no punctuation marks present in the original text. Long non-speech segments are better for segments split and are more likely to co-occur with punctuation marks. Random text split could deteriorate the quality of the alignment.\n", "* out-of-vocabulary words will be removed based on pre-trained ASR model vocabulary, and the text will be changed to lowercase\n", "* sentences for alignment with the original punctuation and capitalization will be stored under `$OUTPUT_DIR/processed/*_with_punct.txt`\n", - "* numbers will be converted from written to their spoken form with `num2words` package. For English, it's recommended to use NeMo normalization tool use `--use_nemo_normalization` argument (not supported if running this segmentation tutorial in Colab, see the text normalization tutorial: [`https://github.com/NVIDIA/NeMo-text-processing/blob/main/tutorials/Text_(Inverse)_Normalization.ipynb`](https://colab.research.google.com/github/NVIDIA/NeMo-text-processing/blob/main/tutorials/Text_(Inverse)_Normalization.ipynb) for more details). Even `num2words` normalization is usually enough for proper segmentation. However, it does not take audio into account. NeMo supports audio-based normalization for English, German and Russian languages that can be applied to the segmented data as a post-processing step. Audio-based normalization produces multiple normalization options. For example, `901` could be normalized as `nine zero one` or `nine hundred and one`. The audio-based normalization chooses the best match among the possible normalization options and the transcript based on the character error rate. 
See [https://github.com/NVIDIA/NeMo-text-processing/blob/main/nemo_text_processing/text_normalization/normalize_with_audio.py](https://github.com/NVIDIA/NeMo-text-processing/blob/main/nemo_text_processing/text_normalization/normalize_with_audio.py) for more details.\n", + "* numbers will be converted from written to their spoken form with `num2words` package. For English, it's recommended to use NeMo normalization tool use `--use_nemo_normalization` argument (not supported if running this segmentation tutorial in Colab, see the text normalization tutorial: [`https://github.com/NVIDIA-NeMo/Speech-text-processing/blob/main/tutorials/Text_(Inverse)_Normalization.ipynb`](https://colab.research.google.com/github/NVIDIA-NeMo/Speech-text-processing/blob/main/tutorials/Text_(Inverse)_Normalization.ipynb) for more details). Even `num2words` normalization is usually enough for proper segmentation. However, it does not take audio into account. NeMo supports audio-based normalization for English, German and Russian languages that can be applied to the segmented data as a post-processing step. Audio-based normalization produces multiple normalization options. For example, `901` could be normalized as `nine zero one` or `nine hundred and one`. The audio-based normalization chooses the best match among the possible normalization options and the transcript based on the character error rate. 
See [https://github.com/NVIDIA-NeMo/Speech-text-processing/blob/main/nemo_text_processing/text_normalization/normalize_with_audio.py](https://github.com/NVIDIA-NeMo/Speech-text-processing/blob/main/nemo_text_processing/text_normalization/normalize_with_audio.py) for more details.\n", "\n", "### Audio preprocessing:\n", "* non '.wav' audio files will be converted to `.wav` format\n", @@ -470,7 +470,7 @@ "outputs": [], "source": [ "import sys\n", - "wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/asr/transcribe_speech.py')\n", + "wget.download(f'https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/{BRANCH}/examples/asr/transcribe_speech.py')\n", "\n", "!{sys.executable} transcribe_speech.py \\\n", "pretrained_name=$MODEL \\\n", @@ -709,7 +709,7 @@ "source": [ "# Next Steps\n", "\n", - "- Check out [NeMo Speech Data Explorer tool](https://github.com/NVIDIA/NeMo/tree/main/tools/speech_data_explorer#speech-data-explorer) to interactively evaluate the aligned segments.\n", + "- Check out [NeMo Speech Data Explorer tool](https://github.com/NVIDIA-NeMo/Speech/tree/main/tools/speech_data_explorer#speech-data-explorer) to interactively evaluate the aligned segments.\n", "- Try Audio-based normalization tool." ] }, diff --git a/tutorials/tools/DefinedCrowd_x_NeMo_ASR_Training_Tutorial.ipynb b/tutorials/tools/DefinedCrowd_x_NeMo_ASR_Training_Tutorial.ipynb index 8b0114690540..ce6282c97e7d 100644 --- a/tutorials/tools/DefinedCrowd_x_NeMo_ASR_Training_Tutorial.ipynb +++ b/tutorials/tools/DefinedCrowd_x_NeMo_ASR_Training_Tutorial.ipynb @@ -528,7 +528,7 @@ "\n", "NVIDIA NeMo is a toolkit built by NVIDIA for **creating conversational AI applications**. 
This toolkit includes collections of pre-trained modules for **Automatic Speech Recognition (ASR)**, Natural Language Processing (NLP), and Texto-to-Speech (TTS), enabling researchers and data scientists to easily compose complex neural network architectures and focus on designing their applications.\n", "\n", - "In this tutorial, we want to demonstrate how to **connect DefinedCrowd Speech Workflows** to **train and improve an ASR model** using NVIDIA NeMo. The tutorial re-uses parts of a previous [ASR tutorial from NeMo](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)." + "In this tutorial, we want to demonstrate how to **connect DefinedCrowd Speech Workflows** to **train and improve an ASR model** using NVIDIA NeMo. The tutorial re-uses parts of a previous [ASR tutorial from NeMo](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)." ] }, { @@ -1312,7 +1312,7 @@ "\n", "After downloading the speech data from DefinedCrowd API, we need to adapt it for the format expected by NeMo for ASR training. For this, we need to create manifests for our training and evaluation data, including each audio file's metadata.\n", "\n", - "NeMo requires that we adapt our data to a [particular manifest format](https://github.com/NVIDIA/NeMo/blob/ebade85f6d10319ef59312cb2eefcba4fd298a3d/nemo/collections/asr/parts/manifest.py#L39). Each line corresponding to one audio sample, so the line count equals the number of samples represented by the manifest. A line must contain the path to an audio file, the corresponding transcript, and the audio sample duration. For example, here is what one line might look like in a NeMo-compatible manifest:\n", + "NeMo requires that we adapt our data to a [particular manifest format](https://github.com/NVIDIA-NeMo/Speech/blob/ebade85f6d10319ef59312cb2eefcba4fd298a3d/nemo/collections/asr/parts/manifest.py#L39). 
Each line corresponding to one audio sample, so the line count equals the number of samples represented by the manifest. A line must contain the path to an audio file, the corresponding transcript, and the audio sample duration. For example, here is what one line might look like in a NeMo-compatible manifest:\n", "```\n", "{\"audio_filepath\": \"path/to/audio.wav\", \"duration\": 3.45, \"text\": \"this is a nemo tutorial\"}\n", "```\n", @@ -1419,7 +1419,7 @@ "source": [ "In this tutorial, we'll describe how to use the QuartzNet15x5 model as a base model for fine-tuning with our data. We want to improve the recognition of our dataset, so we will benchmark the model performance on the base model, and after, on the fine-tuned version.\n", "\n", - "Some of the following functions were retrieved from the Nemo Tutorial on ASR that could be checked at [https://github.com/NVIDIA/NeMo](https://github.com/NVIDIA/NeMo)" + "Some of the following functions were retrieved from the Nemo Tutorial on ASR that could be checked at [https://github.com/NVIDIA-NeMo/Speech](https://github.com/NVIDIA-NeMo/Speech)" ] }, { @@ -1464,7 +1464,7 @@ "source": [ "## Download the config we'll use in this example\n", "!mkdir configs\n", - "!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/stable/examples/asr/conf/config.yaml &> /dev/null\n", + "!wget -P configs/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/stable/examples/asr/conf/config.yaml &> /dev/null\n", "\n", "# --- Config Information ---#\n", "from ruamel.yaml import YAML\n", diff --git a/tutorials/tools/Multispeaker_Simulator.ipynb b/tutorials/tools/Multispeaker_Simulator.ipynb index 2c83da82914a..a048e9230bc5 100644 --- a/tutorials/tools/Multispeaker_Simulator.ipynb +++ b/tutorials/tools/Multispeaker_Simulator.ipynb @@ -20,9 +20,9 @@ "NEMO_DIR_PATH = \"NeMo\"\n", "BRANCH = 'main'\n", "\n", - "! git clone https://github.com/NVIDIA/NeMo\n", + "! git clone https://github.com/NVIDIA-NeMo/Speech\n", "%cd NeMo\n", - "! 
python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "! python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", "%cd .." ] }, @@ -71,8 +71,8 @@ " print(\"Downloading necessary scripts\")\n", " !mkdir -p NeMo/scripts/dataset_processing\n", " !mkdir -p NeMo/scripts/speaker_tasks\n", - " !wget -P NeMo/scripts/dataset_processing/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/dataset_processing/get_librispeech_data.py\n", - " !wget -P NeMo/scripts/speaker_tasks/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/speaker_tasks/create_alignment_manifest.py" + " !wget -P NeMo/scripts/dataset_processing/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/scripts/dataset_processing/get_librispeech_data.py\n", + " !wget -P NeMo/scripts/speaker_tasks/ https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/scripts/speaker_tasks/create_alignment_manifest.py" ] }, { @@ -186,7 +186,7 @@ "!mkdir -p $conf_dir\n", "CONFIG_PATH = os.path.join(conf_dir, 'data_simulator.yaml')\n", "if not os.path.exists(CONFIG_PATH):\n", - " !wget -P $conf_dir https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/tools/speech_data_simulator/conf/data_simulator.yaml\n", + " !wget -P $conf_dir https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/$BRANCH/tools/speech_data_simulator/conf/data_simulator.yaml\n", "\n", "config = OmegaConf.load(CONFIG_PATH)\n", "print(OmegaConf.to_yaml(config))" @@ -326,7 +326,7 @@ "source": [ "import wget\n", "if not os.path.exists(\"multispeaker_data_analysis.py\"):\n", - " !wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/speaker_tasks/multispeaker_data_analysis.py\n", + " !wget https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/main/scripts/speaker_tasks/multispeaker_data_analysis.py\n", "\n", "from multispeaker_data_analysis import run_multispeaker_data_analysis\n", "\n", diff --git 
a/tutorials/tools/NeMo_Forced_Aligner_Tutorial.ipynb b/tutorials/tools/NeMo_Forced_Aligner_Tutorial.ipynb index 96fe1b029018..5395bfdff7f8 100644 --- a/tutorials/tools/NeMo_Forced_Aligner_Tutorial.ipynb +++ b/tutorials/tools/NeMo_Forced_Aligner_Tutorial.ipynb @@ -46,9 +46,9 @@ "\n", "# option #2: download NeMo repo\n", "if 'google.colab' in str(get_ipython()) or not os.path.exists(NEMO_DIR_PATH):\n", - " !git clone -b $BRANCH https://github.com/NVIDIA/NeMo\n", + " !git clone -b $BRANCH https://github.com/NVIDIA-NeMo/Speech\n", " %cd NeMo\n", - " !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + " !python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n", " %cd -" ] }, @@ -68,7 +68,7 @@ "id": "A4NE9GhNn8f-" }, "source": [ - "In this tutorial, we will use [NeMo Forced Aligner](https://github.com/NVIDIA/NeMo/tree/main/tools/nemo_forced_aligner) to generate token and word alignments for a video of Neil Armstrong's first steps on the moon. We will use the ASS-format subtitle files generated by NFA to add subtitles with token-by-token and word-by-word highlighting to the video.\n", + "In this tutorial, we will use [NeMo Forced Aligner](https://github.com/NVIDIA-NeMo/Speech/tree/main/tools/nemo_forced_aligner) to generate token and word alignments for a video of Neil Armstrong's first steps on the moon. We will use the ASS-format subtitle files generated by NFA to add subtitles with token-by-token and word-by-word highlighting to the video.\n", "\n", "\n", "We will use the video at this [link](https://www.nasa.gov/wp-content/uploads/static/history/alsj/a11/a11.v1092338.mov), which is in the public domain and was obtained from the NASA website [here](https://history.nasa.gov/alsj/a11/video11.html#Step). The transcript for the video is obtained from the transcript of the mission [here](https://history.nasa.gov/alsj/a11/a11transcript_tec.pdf). 
As referenced on this [page](https://history.nasa.gov/alsj/a11/a11trans.html), this is a raw transcript with no copyright asserted.\n", @@ -83,7 +83,7 @@ "id": "PGr_hMTCcm3J" }, "source": [ - "![NFA forced alignment pipeline](https://github.com/NVIDIA/NeMo/releases/download/v1.20.0/nfa_forced_alignment_pipeline.png)" + "![NFA forced alignment pipeline](https://github.com/NVIDIA-NeMo/Speech/releases/download/v1.20.0/nfa_forced_alignment_pipeline.png)" ] }, { @@ -224,7 +224,7 @@ "id": "i84J0AyiW6X7" }, "source": [ - "![NFA usage pipeline](https://github.com/NVIDIA/NeMo/releases/download/v1.20.0/nfa_run.png)" + "![NFA usage pipeline](https://github.com/NVIDIA-NeMo/Speech/releases/download/v1.20.0/nfa_run.png)" ] }, { @@ -383,7 +383,7 @@ "id": "OXUL-KyUdVpL" }, "source": [ - "![How NFA generates word and segment alignments from token alignments](https://github.com/NVIDIA/NeMo/releases/download/v1.20.0/nfa_word_segment_alignments.png)" + "![How NFA generates word and segment alignments from token alignments](https://github.com/NVIDIA-NeMo/Speech/releases/download/v1.20.0/nfa_word_segment_alignments.png)" ] }, { diff --git a/tutorials/tools/SDE_HowTo_v2.ipynb b/tutorials/tools/SDE_HowTo_v2.ipynb index 65ed77d9b018..e61699b35d7e 100644 --- a/tutorials/tools/SDE_HowTo_v2.ipynb +++ b/tutorials/tools/SDE_HowTo_v2.ipynb @@ -17,7 +17,7 @@ "id": "9BVTGynbHSoy" }, "source": [ - "[Speech Data Explorer](https://github.com/NVIDIA/NeMo/tree/main/tools/speech_data_explorer) (SDE) is a visual tool for interactive exploration of speech datasets and error analysis of Automatic Speech Recognition (ASR) models. This tutorial demonstrates how to use SDE in Comparison mode to evaluate two ASR models on a given test set and identify differences in their predictions." 
+ "[Speech Data Explorer](https://github.com/NVIDIA-NeMo/Speech/tree/main/tools/speech_data_explorer) (SDE) is a visual tool for interactive exploration of speech datasets and error analysis of Automatic Speech Recognition (ASR) models. This tutorial demonstrates how to use SDE in Comparison mode to evaluate two ASR models on a given test set and identify differences in their predictions." ] }, { @@ -61,7 +61,7 @@ "source": [ "BRANCH = 'main'\n", "\n", - "!git clone -b $BRANCH https://github.com/NVIDIA/NeMo\n", + "!git clone -b $BRANCH https://github.com/NVIDIA-NeMo/Speech\n", "\n", "!apt-get update && apt-get install -y libsndfile1 ffmpeg sox\n", "\n", @@ -156,7 +156,7 @@ "source": [ "To compare two models, JSON file should contain predictions from 1st (e.g., `QuartzNet15x5`) and 2nd (e.g., `Conformer-CTC Small`) models.\n", "\n", - "NeMo includes a Python script for ASR inference: [`transcribe_speech.py`](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/transcribe_speech.py).\n", + "NeMo includes a Python script for ASR inference: [`transcribe_speech.py`](https://github.com/NVIDIA-NeMo/Speech/blob/main/examples/asr/transcribe_speech.py).\n", "\n", "`transcribe_speech.py` accepts `append_pred` flag that allows saving an ASR transcript in the JSON file with a given custom field (like `pred_text_QN`). `pred_name_postfix` parameter defines the custom field's name. In this example it is set to the abbreviated model name `QN`." ] diff --git a/tutorials/tools/label-studio/setup-asr-preannotations.ipynb b/tutorials/tools/label-studio/setup-asr-preannotations.ipynb index ec0a0e01f969..242966cd71e5 100644 --- a/tutorials/tools/label-studio/setup-asr-preannotations.ipynb +++ b/tutorials/tools/label-studio/setup-asr-preannotations.ipynb @@ -221,7 +221,7 @@ "id": "alike-realtor", "metadata": {}, "source": [ - "Then you can start to annotate your audio files by correcting the text areas prepopulated by NeMo ASR's output. 
After you finish labeling, you can export results in the `ASR_MANIFEST` format ready to use for [training a NeMo ASR model](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)" + "Then you can start to annotate your audio files by correcting the text areas prepopulated by NeMo ASR's output. After you finish labeling, you can export results in the `ASR_MANIFEST` format ready to use for [training a NeMo ASR model](https://colab.research.google.com/github/NVIDIA-NeMo/Speech/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)" ] } ], diff --git a/tutorials/tts/Audio_Codec_Inference.ipynb b/tutorials/tts/Audio_Codec_Inference.ipynb index 3d7c19f07558..1e0d676daa42 100644 --- a/tutorials/tts/Audio_Codec_Inference.ipynb +++ b/tutorials/tts/Audio_Codec_Inference.ipynb @@ -39,7 +39,7 @@ "id": "pZ2QSsXuGbMe" }, "source": [ - "In this tutorial we show how use NeMo **neural audio codecs** at inference time. To learn more about training and finetuning neural audio codecs in NeMo, check the [Audio Codec Training tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Training.ipynb).\n", + "In this tutorial we show how use NeMo **neural audio codecs** at inference time. To learn more about training and finetuning neural audio codecs in NeMo, check the [Audio Codec Training tutorial](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/tts/Audio_Codec_Training.ipynb).\n", "\n", "An audio codec typically consists of an encoder, a quantizer and a decoder, with a typical architecture depicted in the figure below.\n", "An audio codec can be used to encode an input audio signal into a sequence of discrete values.\n", @@ -52,7 +52,7 @@ "The list of the available models can be found [here](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/tts/checkpoints.html#codec-models).\n", "\n", "
\n", - "\n", + "\n", "
" ] }, @@ -74,10 +74,10 @@ "outputs": [], "source": [ "BRANCH = 'main'\n", - "# Install NeMo library. If you are running locally (rather than on Google Colab), follow the instructions at https://github.com/NVIDIA/NeMo#Installation\n", + "# Install NeMo library. If you are running locally (rather than on Google Colab), follow the instructions at https://github.com/NVIDIA-NeMo/Speech#Installation\n", "\n", "if 'google.colab' in str(get_ipython()):\n", - " !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]" + " !python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]" ] }, { @@ -426,7 +426,7 @@ "source": [ "To learn more about audio codec models in NeMo, look at our [documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/tts/models.html#codecs).\n", "\n", - "For more information on training and finetuning neural audio codecs in NeMo, check the [Audio Codec Training tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Training.ipynb)." + "For more information on training and finetuning neural audio codecs in NeMo, check the [Audio Codec Training tutorial](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/tts/Audio_Codec_Training.ipynb)." ] }, { diff --git a/tutorials/tts/Audio_Codec_Training.ipynb b/tutorials/tts/Audio_Codec_Training.ipynb index 0fa378bdfa69..19694fec9f6f 100644 --- a/tutorials/tts/Audio_Codec_Training.ipynb +++ b/tutorials/tts/Audio_Codec_Training.ipynb @@ -44,14 +44,14 @@ "Neural audio codecs are deep learning models that compress audio into a low bitrate representation. The compact embedding space created by these models can be useful for various speech tasks, such as TTS and ASR.\n", "\n", "
\n", - "\n", + "\n", "
\n", "\n", "Audio codec models typically have an *encoder-quantizer-decoder* structure. The **encoder** takes an input audio signal and encodes it into a sequence of embeddings. The **quantizer** discretizes the embeddings to create a lookup table known as a **codebook**. The embeddings saved in the codebook are referred to as **audio codes**. The **decoder** takes the audio codes as input and attempts to reconstruct the original audio signal.\n", "\n", "To store compressed audio we only need to save the codebook index for each embedding in an audio sequence. This is how audio codec models achieve low bitrates. The codebook indices for an audio are referred to **audio tokens**. It is becoming common for speech generation models to synthesize speech by predicting audio tokens.\n", "\n", - "In NeMo we have implementations of the [SEANet encoder and decoder](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/modules/encodec_modules.py#L146) used by [EnCodec](https://github.com/facebookresearch/encodec). As well as a [ResNet encoder](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/modules/audio_codec_modules.py#L1035) and [HiFi-GAN decoder](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/modules/audio_codec_modules.py#L875). For quantizers we support [Residual Vector Quantizer](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/modules/encodec_modules.py#L694) (**RVQ**) and [Finite Scalar Quantizer](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/modules/audio_codec_modules.py#L409) (**FSQ**).\n" + "In NeMo we have implementations of the [SEANet encoder and decoder](https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/tts/modules/encodec_modules.py#L146) used by [EnCodec](https://github.com/facebookresearch/encodec). 
As well as a [ResNet encoder](https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/tts/modules/audio_codec_modules.py#L1035) and [HiFi-GAN decoder](https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/tts/modules/audio_codec_modules.py#L875). For quantizers we support [Residual Vector Quantizer](https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/tts/modules/encodec_modules.py#L694) (**RVQ**) and [Finite Scalar Quantizer](https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/tts/modules/audio_codec_modules.py#L409) (**FSQ**).\n" ] }, { @@ -73,8 +73,8 @@ "source": [ "BRANCH = 'main'\n", "# Install NeMo library. If you are running locally (rather than on Google Colab), comment out the below line\n", - "# and instead follow the instructions at https://github.com/NVIDIA/NeMo#Installation\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[tts]" + "# and instead follow the instructions at https://github.com/NVIDIA-NeMo/Speech#Installation\n", + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[tts]" ] }, { @@ -108,7 +108,7 @@ "nemo_download_dir = str(NEMO_DIR)\n", "# Download local version of NeMo scripts. 
If you are running locally and want to use your own local NeMo code,\n", "# comment out the below line and set NEMO_ROOT_DIR to your local path.\n", - "!git clone -b $BRANCH https://github.com/NVIDIA/NeMo.git $nemo_download_dir" + "!git clone -b $BRANCH https://github.com/NVIDIA-NeMo/Speech.git $nemo_download_dir" ] }, { @@ -126,7 +126,7 @@ "id": "ODgdGgsAAUku" }, "source": [ - "Predefined model configurations are available in https://github.com/NVIDIA/NeMo/tree/main/examples/tts/conf/audio_codec.\n", + "Predefined model configurations are available in https://github.com/NVIDIA-NeMo/Speech/tree/main/examples/tts/conf/audio_codec.\n", "\n", "Configurations available include:\n", "\n", @@ -189,7 +189,7 @@ "source": [ "We provide pretrained model checkpoints for fine-tuning.\n", "\n", - "A list of models available on NGC can be found [here](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/models/audio_codec.py#L645).\n", + "A list of models available on NGC can be found [here](https://github.com/NVIDIA-NeMo/Speech/blob/main/nemo/collections/tts/models/audio_codec.py#L645).\n", "\n", "A list of models available on Hugging Face can be found [here](https://huggingface.co/collections/nvidia/nemo-audio-codecs-674f57ab6cb1324f997b5d5b). To use a checkpoint from hugging face, add \"nvidia/\" before the model name." 
] @@ -400,7 +400,7 @@ "id": "4WfEaMwpUsFt" }, "source": [ - "Next we process the audio data using [preprocess_audio.py](https://github.com/NVIDIA/NeMo/blob/main/scripts/dataset_processing/tts/preprocess_audio.py).\n", + "Next we process the audio data using [preprocess_audio.py](https://github.com/NVIDIA-NeMo/Speech/blob/main/scripts/dataset_processing/tts/preprocess_audio.py).\n", "\n", "During this step we can apply the following transformations:\n", "\n", diff --git a/tutorials/tts/Evaluation_MelCepstralDistortion.ipynb b/tutorials/tts/Evaluation_MelCepstralDistortion.ipynb index 699f1b131408..d5ee2b680b83 100644 --- a/tutorials/tts/Evaluation_MelCepstralDistortion.ipynb +++ b/tutorials/tts/Evaluation_MelCepstralDistortion.ipynb @@ -601,9 +601,9 @@ "source": [ "## Additional NeMo Resources\n", "\n", - "If you are unsure where to begin for training a TTS model, you may want to start with the [FastPitch and Mixer-TTS Training notebook](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/FastPitch_MixerTTS_Training.ipynb) or the [NeMo TTS Primer notebook](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/NeMo_TTS_Primer.ipynb). For fine-tuning, there is also the [FastPitch Fine-Tuning notebook](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/FastPitch_Finetuning.ipynb).\n", + "If you are unsure where to begin for training a TTS model, you may want to start with the [FastPitch and Mixer-TTS Training notebook](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/tts/FastPitch_MixerTTS_Training.ipynb) or the [NeMo TTS Primer notebook](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/tts/NeMo_TTS_Primer.ipynb). 
For fine-tuning, there is also the [FastPitch Fine-Tuning notebook](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/tts/FastPitch_Finetuning.ipynb).\n", "\n", - "For some guidance on how to load a trained model and perform inference to generate mels or waveforms, check out how it's done in the [Inference notebook](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Inference_ModelSelect.ipynb). Important functions to know are include `from_pretrained()` (if loading from an NGC model) and `restore_from()` (if loading a `.nemo` file). See the [NeMo Primer notebook](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/00_NeMo_Primer.ipynb) for more general information about model training, saving, and loading." + "For some guidance on how to load a trained model and perform inference to generate mels or waveforms, check out how it's done in the [Inference notebook](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/tts/Inference_ModelSelect.ipynb). Important functions to know are include `from_pretrained()` (if loading from an NGC model) and `restore_from()` (if loading a `.nemo` file). See the [NeMo Primer notebook](https://github.com/NVIDIA-NeMo/Speech/blob/stable/tutorials/00_NeMo_Primer.ipynb) for more general information about model training, saving, and loading." ] } ], diff --git a/tutorials/tts/NeMo_TTS_Primer.ipynb b/tutorials/tts/NeMo_TTS_Primer.ipynb index 87a5d66c5e9c..a80da0f7a0e5 100644 --- a/tutorials/tts/NeMo_TTS_Primer.ipynb +++ b/tutorials/tts/NeMo_TTS_Primer.ipynb @@ -24,9 +24,9 @@ "outputs": [], "source": [ "# Install NeMo library. 
If you are running locally (rather than on Google Colab), comment out the below lines\n", - "# and instead follow the instructions at https://github.com/NVIDIA/NeMo#Installation\n", + "# and instead follow the instructions at https://github.com/NVIDIA-NeMo/Speech#Installation\n", "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]" + "!python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]" ] }, { @@ -43,7 +43,7 @@ "# Download local version of NeMo scripts. If you are running locally and want to use your own local NeMo code,\n", "# comment out the below lines and set NEMO_DIR to your local path.\n", "NEMO_DIR = 'nemo'\n", - "!git clone https://github.com/NVIDIA/NeMo.git $NEMO_DIR" + "!git clone https://github.com/NVIDIA-NeMo/Speech.git $NEMO_DIR" ] }, { @@ -67,7 +67,7 @@ } }, "source": [ - "This notebook provides a high level overview of text-to-speech (TTS). It will cover high level concepts and discuss each component in a standard TTS pipeline, providing relevant examples and code snippets using [NeMo](https://github.com/NVIDIA/NeMo)." + "This notebook provides a high level overview of text-to-speech (TTS). It will cover high level concepts and discuss each component in a standard TTS pipeline, providing relevant examples and code snippets using [NeMo](https://github.com/NVIDIA-NeMo/Speech)." ] }, { @@ -129,7 +129,7 @@ "While this is the most common structure, there may be fewer or additional steps depending on the use case. For example, some languages do not require G2P and can instead rely on the model to convert raw text/graphemes to spectrogram.\n", "\n", "
\n", - "\n", + "\n", "
" ] }, @@ -201,7 +201,7 @@ "\n", "The above examples may be slightly different than the output of the NeMo text normalization code. More details on NeMo text normalization can be found in the [TN documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/intro.html).\n", "\n", - "A more comprehensive list of text normalization rules, examples, and languages are available in the [code](https://github.com/NVIDIA/NeMo-text-processing/tree/main/nemo_text_processing/text_normalization).\n", + "A more comprehensive list of text normalization rules, examples, and languages are available in the [code](https://github.com/NVIDIA-NeMo/Speech-text-processing/tree/main/nemo_text_processing/text_normalization).\n", "\n" ] }, @@ -245,7 +245,7 @@ "except ModuleNotFoundError:\n", " raise ModuleNotFoundError(\n", " \"The package `nemo_text_processing` was not installed in this environment. Please refer to\"\n", - " \" https://github.com/NVIDIA/NeMo-text-processing and install this package before using \"\n", + " \" https://github.com/NVIDIA-NeMo/Speech-text-processing and install this package before using \"\n", " \"this script\"\n", " )\n", "\n", @@ -764,7 +764,7 @@ "\n", "
\n", "
\n", - "\n", + "\n", "
https://wiki.hydrogenaud.io/index.php?title=File:Digital_wave.png\n", "
\n", "
" @@ -1059,7 +1059,7 @@ "\n", "
\n", "
\n", - "\n", + "\n", "
\n", "\n", "The model is fairly complex. At a high level, it contains:\n", @@ -1476,7 +1476,7 @@ "In NeMo we support [FastPitch](https://fastpitch.github.io/), a parallel transformer-based model with pitch and duration control and prediction.\n", "\n", "
\n", - "\n", + "\n", "
\n", "\n", "At a high level it contains:\n", @@ -1854,7 +1854,7 @@ "\n", "
\n", "
\n", - "\n", + "\n", "
Diagram of a dilated causal CNN
\n", "
\n", "
\n", @@ -1894,7 +1894,7 @@ "In addition to penalizing the model if the discriminator can classify the synthesized audio as fake, it also uses **feature matching loss** to penalize the model if the distribution of intermediate layer outputs in the discriminator networks differ between the real and synthesized audio.\n", "\n", "
\n", - "\n", + "\n", "
HiFi-Gan scale and period discriminators
\n", "
" ] @@ -1988,10 +1988,10 @@ "source": [ "To learn more about what TTS technology and models are available in NeMo, please look through our [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/tts/intro.html#).\n", "\n", - "To get more hands on experience with NeMo TTS, look through some of our other [tutorials](https://github.com/NVIDIA/NeMo/tree/stable/tutorials/tts).\n", + "To get more hands on experience with NeMo TTS, look through some of our other [tutorials](https://github.com/NVIDIA-NeMo/Speech/tree/stable/tutorials/tts).\n", "\n", - "* Running pretrained models: [Inference_ModelSelect](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)\n", - "* FastPitch [training](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/tts/FastPitch_MixerTTS_Training.ipynb) and [fine-tuning](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/tts/FastPitch_Finetuning.ipynb)\n", + "* Running pretrained models: [Inference_ModelSelect](https://github.com/NVIDIA-NeMo/Speech/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)\n", + "* FastPitch [training](https://github.com/NVIDIA-NeMo/Speech/blob/stable/tutorials/tts/FastPitch_MixerTTS_Training.ipynb) and [fine-tuning](https://github.com/NVIDIA-NeMo/Speech/blob/stable/tutorials/tts/FastPitch_Finetuning.ipynb)\n", "\n", "To learn how to deploy and serve your TTS models, visit [Riva](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html)." 
] diff --git a/tutorials/tts/Pronunciation_customization.ipynb b/tutorials/tts/Pronunciation_customization.ipynb index 6fe269e76904..f1bddb2fc560 100644 --- a/tutorials/tts/Pronunciation_customization.ipynb +++ b/tutorials/tts/Pronunciation_customization.ipynb @@ -30,7 +30,7 @@ "# # If you're using Google Colab and not running locally, uncomment and run this cell.\n", "# !apt-get install sox libsndfile1 ffmpeg\n", "# !pip install wget text-unidecode \n", - "# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n" + "# !python -m pip install git+https://github.com/NVIDIA-NeMo/Speech.git@$BRANCH#egg=nemo_toolkit[all]\n" ] }, { @@ -58,7 +58,7 @@ "* *[heteronyms](https://en.wikipedia.org/wiki/Heteronym_(linguistics))* - words with the same spelling but different pronunciations and/or meanings, e.g., *bass* (the fish) and *bass* (the musical instrument).\n", "\n", "#### Important NeMo flags:\n", - "* `your_spec_generator_model.vocab.g2p.phoneme_dict` - phoneme dictionary that maps words to their phonetic transcriptions, e.g., [ARPABET-based CMU Dictionary](https://raw.githubusercontent.com/NVIDIA/NeMo/stable/scripts/tts_dataset_files/cmudict-0.7b_nv22.10) or [IPA-based CMU Dictionary](https://github.com/NVIDIA/NeMo/blob/stable/scripts/tts_dataset_files/ipa_cmudict-0.7b_nv23.01.txt)\n", + "* `your_spec_generator_model.vocab.g2p.phoneme_dict` - phoneme dictionary that maps words to their phonetic transcriptions, e.g., [ARPABET-based CMU Dictionary](https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/stable/scripts/tts_dataset_files/cmudict-0.7b_nv22.10) or [IPA-based CMU Dictionary](https://github.com/NVIDIA-NeMo/Speech/blob/stable/scripts/tts_dataset_files/ipa_cmudict-0.7b_nv23.01.txt)\n", "* `your_spec_generator_model.vocab.g2p.heteronyms` - list of the model's heteronyms, grapheme form of these words will be used even if the word is present in the phoneme dictionary.\n", "* 
`your_spec_generator_model.vocab.g2p.ignore_ambiguous_words`: if is set to **True**, words with more than one phonetic representation in the pronunciation dictionary are ignored. This flag is relevant to the words with multiple valid phonetic transcriptions in the dictionary that are not in `your_spec_generator_model.vocab.g2p.heteronyms` list.\n", "* `your_spec_generator_model.vocab.phoneme_probability` - phoneme probability flag in the Tokenizer and the same from in the G2P module: `your_spec_generator_model.vocab.g2p.phoneme_probability` ([0, 1]). If a word is present in the phoneme dictionary, we still want our TTS model to see graphemes and phonemes during training to handle OOV words during inference. The `phoneme_probability` determines the probability of an unambiguous dictionary word appearing in phonetic form during model training, `(1 - phoneme_probability)` is the probability of the graphemes. This flag is set to `1` in the parse() method during inference.\n", @@ -128,7 +128,7 @@ "metadata": {}, "source": [ "#### Expected results if you run the tutorial:\n", - " \n", + " \n", "\n", "\n", "During preprocessing, unambiguous dictionary words are converted to phonemes, while OOV and words with multiple entries are kept as graphemes. For example, **paracetamol** is missing from the phoneme dictionary, and **can** has 2 forms." @@ -186,7 +186,7 @@ "metadata": {}, "source": [ "#### Expected results if you run the tutorial:\n", - " \n", + " \n", "\n", "\n", "## Dictionary customization\n", @@ -212,7 +212,7 @@ "if os.path.exists(ipa_cmu_dict):\n", " ! rm $ipa_cmu_dict\n", "\n", - "! wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/tts_dataset_files/$ipa_cmu_dict\n", + "! 
wget https://raw.githubusercontent.com/NVIDIA-NeMo/Speech/main/scripts/tts_dataset_files/$ipa_cmu_dict\n", "\n", "with open(ipa_cmu_dict, \"a\") as f:\n", " f.write(f\"PARACETAMOL {new_pronunciation}\\n\")\n", @@ -267,7 +267,7 @@ "metadata": {}, "source": [ "#### Expected results if you run the tutorial:\n", - " " + " " ] }, { @@ -276,7 +276,7 @@ "source": [ "# Resources\n", "* [TTS pipeline customization](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-custom.html#tts-pipeline-configuration)\n", - "* [Overview of TTS in NeMo](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/NeMo_TTS_Primer.ipynb)\n", + "* [Overview of TTS in NeMo](https://github.com/NVIDIA-NeMo/Speech/blob/main/tutorials/tts/NeMo_TTS_Primer.ipynb)\n", "* [G2P models in NeMo](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tts/g2p.html)\n", "* [Riva TTS documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-overview.html)" ]