
Abnormal behavior of iluvatar_gpu on the bincount operator #2358

@PlumBlossomMaid


While using the Iluvatar BI-V150S card on AI Studio, I ran the following code:

import paddle
input = paddle.ones([100],dtype=paddle.int64)
# On CPU, the bincount call below returns almost instantly.
paddle.bincount(x=input,minlength=100) # !!!
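
For reference, the result this call should produce can be sketched in plain Python without any accelerator. The helper name `bincount_reference` is hypothetical and not part of Paddle; it only assumes the usual bincount semantics (count the occurrences of each non-negative integer and pad the output to at least `minlength` entries).

```python
def bincount_reference(values, minlength=0):
    """Pure-Python sketch of bincount semantics (hypothetical helper, not a Paddle API)."""
    if any(v < 0 for v in values):
        raise ValueError("bincount expects non-negative integers")
    # The output has one slot per integer in [0, max(values)], padded to minlength.
    size = max(max(values, default=-1) + 1, minlength)
    counts = [0] * size
    for v in values:
        counts[v] += 1
    return counts


# Same input as the report: 100 ones, minlength=100.
expected = bincount_reference([1] * 100, minlength=100)
print(len(expected), expected[0], expected[1])
```

With this input, every count lands in bin 1 and the remaining 99 bins stay at zero, so a correct device kernel has very little work to do.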

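When the bincount operator runs, the accelerator shows some utilization and device memory is also occupied, but the call hangs indefinitely and never returns a result.
If I interrupt it with Ctrl + C at this point, the following error is reported: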

>>> paddle.bincount(x=input,minlength=100)
^C

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::pybind::eager_api_bincount(_object*, _object*, _object*)
1   bincount_ad_func(paddle::Tensor const&, paddle::optional<paddle::Tensor> const&, paddle::experimental::ScalarBase<paddle::Tensor>, paddle::optional<paddle::Tensor*>)
2   paddle::experimental::bincount(paddle::Tensor const&, paddle::optional<paddle::Tensor> const&, paddle::experimental::ScalarBase<paddle::Tensor> const&, paddle::optional<paddle::Tensor*>)
3   void phi::BincountCUDAInner<phi::CustomContext, long, long>(phi::CustomContext const&, phi::DenseTensor const&, paddle::optional<phi::DenseTensor> const&, long, phi::DenseTensor*)
4   void phi::Copy<phi::CustomContext>(phi::CustomContext const&, phi::DenseTensor const&, phi::Place, bool, phi::DenseTensor*)
5   phi::memory_utils::Copy(phi::Place const&, void*, phi::Place const&, void const*, unsigned long, void*)
6   phi::MemoryUtils::Copy(phi::Place const&, void*, phi::Place const&, void const*, unsigned long, void*)
7   void paddle::memory::Copy<phi::Place, phi::Place>(phi::Place, void*, phi::Place, void const*, unsigned long, void*)
8   void paddle::memory::Copy<phi::CPUPlace, phi::CustomPlace>(phi::CPUPlace, void*, phi::CustomPlace, void const*, unsigned long, void*)
9   phi::CustomDevice::MemoryCopyD2H(unsigned long, void*, void const*, unsigned long, phi::stream::Stream const*)
10  phi::CustomDevice::SynchronizeStream(unsigned long, void*)
11  SyncStream(C_Device_st*, C_Stream_st*)

----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1768708118 (unix time) try "date -d @1768708118" if you are using GNU date ***]
  [SignalInfo: *** SIGTERM (@0x3e8000200df) received by PID 131295 (TID 0x7f727ecdd780) from PID 131295 ***]

Terminated

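Judging from the traceback, the process was blocked in SyncStream during the device-to-host copy inside phi::BincountCUDAInner, so it looks like the bincount operator's implementation really does have a problem here...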
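
To make the hang easier to reproduce without blocking an interactive session, the call can be run in a child process with a timeout. This is only a minimal sketch: `run_with_timeout` and `REPRO_SNIPPET` are hypothetical names introduced here, the 60-second limit is an arbitrary assumption, and a process stuck inside a driver-level synchronization may not always be killable this way.

```python
import subprocess
import sys

# The snippet from this report, run in a separate interpreter so a hang
# cannot block the parent process.
REPRO_SNIPPET = """
import paddle
x = paddle.ones([100], dtype=paddle.int64)
print(paddle.bincount(x=x, minlength=100))
"""


def run_with_timeout(snippet, timeout_s):
    """Run `snippet` in a child Python process (hypothetical helper).

    Returns ("ok", stdout), ("error", stderr) or ("timeout", "").
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed once this limit is exceeded
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    if proc.returncode == 0:
        return "ok", proc.stdout.strip()
    return "error", proc.stderr.strip()


# On the affected machine one would call, for example:
#     status, output = run_with_timeout(REPRO_SNIPPET, 60)
# A "timeout" status would correspond to the hang described in this report.
```

Running the same snippet on a CPU-only environment and comparing the two statuses would help confirm that the problem is specific to the iluvatar_gpu backend.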

