TDT ASR seg fault #14941

@domklement

Hi,

I've just pulled and installed the latest version of NeMo and am trying to run ASR TDT training. I get a segmentation fault just as training is about to begin. I stepped through the code, and the problem appears to be in the TDT loss computation.
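For anyone trying to reproduce this, one way to see which Python frame is active when the process crashes is the standard-library faulthandler module. The snippet below is only a sketch: the helper name enable_crash_traces and the log path are hypothetical, and it assumes the call is placed near the top of the training script before the trainer starts.

```python
import faulthandler


def enable_crash_traces(log_path):
    """Write Python tracebacks of all threads to log_path on fatal signals such as SIGSEGV.

    Hypothetical helper for illustration; it is not part of NeMo.
    """
    # The file must stay open for as long as faulthandler is enabled,
    # so the handle is returned to the caller instead of being closed here.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Example usage (the path is an assumption):
# crash_log = enable_crash_traces("segfault_trace.log")
```

With DDP, each rank would need its own log file; running the interpreter with -X faulthandler is an alternative that prints the same traceback to stderr without touching the script.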

I've installed the toolkit using pip install -e '.[all]' and am running it the following way:

python ~/ASR/NeMo/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py \
    --config-path=~/ASR/NeMo/examples/asr/conf/fastconformer/hybrid_transducer_ctc \
    --config-name=fastconformer_hybrid_tdt_ctc_bpe.yaml \
    model.train_ds.manifest_filepath=~/data/librispeech/train.json \
    model.validation_ds.manifest_filepath=~/data/librispeech/dev.json \
    model.tokenizer.dir=~/tokenizers/ls960/tokenizer_spe_bpe_v500 \
    model.tokenizer.type=bpe \
    trainer.devices=-1 \
    trainer.accelerator="gpu" \
    trainer.strategy="ddp" \
    trainer.max_epochs=100 \
    exp_manager.create_wandb_logger=False \
    exp_manager.exp_dir=~/nemo_experiments/segfault_debug

I had been running the same code with an older version of NeMo (2.4.0rc0) and it worked fine. After I updated the dependencies, however, the same segfault also occurs when I check out the 2.4.0rc0 tag. This suggests a dependency issue, since not all of the requirements are pinned to a specific version.
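To narrow down which dependency changed, the installed versions in a working and a broken environment can be recorded and compared. The following is a minimal sketch using only importlib.metadata; the function names are hypothetical, and the package names in the usage comment are guesses at likely candidates, not a confirmed list of culprits.

```python
from importlib import metadata


def snapshot_versions(package_names):
    """Return {name: installed version, or None if the package is not installed}."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_snapshots(working, broken):
    """Return {name: (version_in_working, version_in_broken)} for packages that differ."""
    changed = {}
    for name in sorted(set(working) | set(broken)):
        before, after = working.get(name), broken.get(name)
        if before != after:
            changed[name] = (before, after)
    return changed


# Example usage (package names are assumptions):
# print(snapshot_versions(["torch", "numba", "lightning", "nemo_toolkit"]))
```

Running snapshot_versions in both environments and passing the two results to diff_snapshots leaves only the packages whose versions moved, which is a shorter list to bisect than a full pip freeze.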

Am I the only one this is happening to? Is there a solution or workaround?

Hardware:
CPU: AMD EPYC 9454
GPU: Nvidia H100 NVL 94GB (the same happens if I run it on multiple H100 SXM5 - DGX machine)
CUDA: 13.0
DRIVER: 580.65.06

Thank you!

Best,
Dominik Klement
