[Pytorch] Transformer Engine error: RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device.
Problem Description
One day, while running MoE-related code with Megatron-LM, I hit the following error:
Exception: The specified pointer resides on host memory and is not registered with any CUDA device.
Traceback (most recent call last):
File "/workspace/userdata/moelb/Megatron-LM/tests/unit_tests/transformer/moe/xxx.py", line 309, in perf_test
output, _ = self.model(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/userdata/moelb/Megatron-LM/megatron/core/distributed/data_parallel_base.py", line 22, in forward
return self.module(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1857, in _call_impl
return inner()
^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1805, in inner
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/userdata/moelb/Megatron-LM/megatron/core/transformer/moe/moe_layer.py", line 251, in forward
output, mlp_bias = custom_forward(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/userdata/moelb/Megatron-LM/megatron/core/transformer/moe/moe_layer.py", line 217, in custom_forward
expert_output, mlp_bias = self.experts(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1857, in _call_impl
return inner()
^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1805, in inner
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/userdata/moelb/Megatron-LM/megatron/core/transformer/moe/experts.py", line 780, in forward
intermediate_parallel, bias_parallel = self.linear_fc1(
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1857, in _call_impl
return inner()
^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1805, in inner
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/userdata/moelb/Megatron-LM/megatron/core/extensions/transformer_engine.py", line 1059, in forward
out = super().forward(x, m_splits, is_first_microbatch=_is_first_microbatch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 749, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/module/grouped_linear.py", line 654, in forward
out = linear_fn(*args)
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 578, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/module/grouped_linear.py", line 158, in forward
_ = general_grouped_gemm(
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/cpp_extensions/gemm.py", line 206, in general_grouped_gemm
bias = tex.te_general_grouped_gemm(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device.
Analysis
- Taken literally, the message suggests that during the GroupedMLP computation some tensor is sitting on the CPU rather than on the GPU. But in experts.py, right before the linear_fc1 call, I printed the device of the input tensor and of the parameter tensors: they were all cuda, and their contents looked fine as well (see the debugging sketch after this list).
- Maybe a missing synchronization? I added torch.cuda.current_stream().synchronize() beforehand, but the error still occurred. Printing the tensor contents again showed nothing unusual.
- Stranger still, the error only appears once sequence_length*batch_size reaches 16384; at 8192 it never shows up. I also toggled many other options (such as moe_permute_fusion) to see whether they made a difference, but the error occurred in every case.
- At that point I was out of ideas and suspected an internal bug in transformer-engine.
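For reference, this is roughly the kind of check I placed before the linear_fc1 call. It is only an illustrative sketch: the helper name and the way the layer and input are passed in are mine, not the actual Megatron-LM code.

import torch

def debug_linear_fc1_inputs(hidden_states, linear_fc1):
    # Illustrative helper: confirm everything feeding linear_fc1 lives on the GPU.
    # In my case the input and every parameter all reported device 'cuda'.
    print("input:", hidden_states.device, hidden_states.dtype, hidden_states.shape)
    for name, param in linear_fc1.named_parameters():
        print(name, param.device, param.dtype, tuple(param.shape))

    # Rule out a missing synchronization: block until every kernel queued on the
    # current stream has finished. The error persisted even with this in place.
    torch.cuda.current_stream().synchronize()
    print(hidden_states)  # contents looked normal as well

Even with all of these checks passing, the grouped GEMM still raised the same RuntimeError.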
Solution
When the problem occurred, I was using NVIDIA's official PyTorch container, version 25.03-py3, which ships transformer-engine 2.1.0. After upgrading the container to 25.05-py3, which ships transformer-engine 2.3.0, the error disappeared. So this was most likely indeed an internal bug in transformer-engine.
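To confirm which versions a given container actually ships, a quick check like the one below is enough (this assumes the installed transformer_engine package exposes __version__; if it does not, pip show transformer_engine reports the same information):

import torch
import transformer_engine as te

# Inside 25.03-py3 this reported transformer_engine 2.1.0 (where the crash happened);
# inside 25.05-py3 it reported 2.3.0, and the error was gone.
print("torch:", torch.__version__)
print("transformer_engine:", te.__version__)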