Branch/Tag/Commit
main
Docker Image Version
nvcr.io/nvidia/pytorch:22.12-py3
GPU name
A10
CUDA Driver
535.54.03
Reproduced Steps
1. docker run -ti --gpus all --rm nvcr.io/nvidia/pytorch:22.12-py3 bash
2. git clone --recursive https://github.com/NVIDIA/FasterTransformer.git
3. cd FasterTransformer
4. mkdir build
5. cd build
6. cmake -DSM=86 -DCMAKE_BUILD_TYPE=Release ..
7. make -j14
8. CUDA_VISIBLE_DEVICES=0 ./satrn 1 1 8 64 2048 4022 3 100 576 512 0 0.0 0
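Abnormal Phenomena:
In FasterTransformer/src/fastertransformer/kernels/decoding_kernels.cu (line 137 in df4a753), the position encoding is read as
val = val + position_encoding[step_offset + col_index];
and step_offset advances in strides of hidden_units (line 134 in df4a753):
step_offset *= hidden_units;
The position encoding table is therefore indexed as max_seq_len rows of hidden_units elements each, not vocab_size elements. So I think the copy in FasterTransformer/src/fastertransformer/models/decoding/DecodingWeight.h (line 101 in df4a753) should be
cudaD2Dcpy(weights_ptr[0], other.weights_ptr[0], max_seq_len_ * hidden_units_);
instead of
cudaD2Dcpy(weights_ptr[0], other.weights_ptr[0], max_seq_len_ * vocab_size_);
The same size expression appears in two other places in DecodingWeight.h (lines 77 and 118 in df4a753):
cudaD2Dcpy(weights_ptr[0], other.weights_ptr[0], max_seq_len_ * vocab_size_);
deviceMalloc(&weights_ptr[0], max_seq_len_ * vocab_size_);
I have opened a PR to try to fix it. @byshiue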
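To make the size argument concrete, here is a minimal host-side sketch of the indexing arithmetic only. It is not FasterTransformer code: the function names max_index_read, required_elements and copy_size_mismatch are hypothetical, and the loop bounds (step in [0, max_seq_len), col_index in [0, hidden_units)) are my assumption about how the kernel is launched.

```cpp
#include <cstddef>

// Largest flat index touched by the access pattern quoted above:
//   step_offset = step * hidden_units;
//   position_encoding[step_offset + col_index]
// Assumes step is in [0, max_seq_len) and col_index is in [0, hidden_units),
// with both sizes greater than zero.
std::size_t max_index_read(std::size_t max_seq_len, std::size_t hidden_units)
{
    std::size_t max_index = 0;
    for (std::size_t step = 0; step < max_seq_len; ++step) {
        const std::size_t step_offset = step * hidden_units;
        for (std::size_t col_index = 0; col_index < hidden_units; ++col_index) {
            const std::size_t index = step_offset + col_index;
            if (index > max_index) {
                max_index = index;
            }
        }
    }
    return max_index;
}

// Number of elements the table must hold to cover every index read above.
std::size_t required_elements(std::size_t max_seq_len, std::size_t hidden_units)
{
    return max_index_read(max_seq_len, hidden_units) + 1;
}

// Signed difference between a copy of max_seq_len * vocab_size elements and
// the number of elements the access pattern needs. Positive means the copy
// covers more elements than are indexed; negative means it covers fewer.
long long copy_size_mismatch(std::size_t max_seq_len,
                             std::size_t hidden_units,
                             std::size_t vocab_size)
{
    const long long copied = static_cast<long long>(max_seq_len * vocab_size);
    const long long needed =
        static_cast<long long>(required_elements(max_seq_len, hidden_units));
    return copied - needed;
}
```

Under these assumptions required_elements(max_seq_len, hidden_units) equals max_seq_len * hidden_units, so a copy or allocation sized with vocab_size only matches when vocab_size happens to equal hidden_units. If vocab_size is larger, the copy covers more elements than the kernel indexes; if it is smaller, part of the table is never copied.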