
Reproduce inference benchmark mentioned in the paper #21

Open
zhouheyun opened this issue May 11, 2024 · 3 comments

Comments

zhouheyun commented May 11, 2024

I have a few questions about the inference efficiency of DeepSeek-V2:

1. The paper states:

   > In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of FP8.

   Are all storage and computation performed in FP8? Does this harm the performance of the model? (A sketch of one possible conversion scheme is included below.)

2. The paper also states:

   > On a single node with 8 H800 GPUs, DeepSeek-V2 achieves a generation throughput exceeding 50K tokens per second, which is 5.76 times the maximum generation throughput of DeepSeek 67B. In addition, the prompt input throughput of DeepSeek-V2 exceeds 100K tokens per second.

   Is this throughput achieved using test requests with 128K context length? Can we reproduce it using vllm-project/vllm#4650? (A rough measurement harness is sketched below.)
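For concreteness on question 1, here is a minimal sketch of what an FP8 parameter conversion could look like, assuming PyTorch's `float8_e4m3fn` dtype and simple per-tensor scaling. This is an illustration only, not DeepSeek's actual (unreleased) conversion pipeline.

```python
# Minimal sketch of per-tensor FP8 (E4M3) weight quantization in PyTorch.
# Illustrative only -- NOT DeepSeek-V2's actual conversion code.
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in float8_e4m3fn

def quantize_weight_fp8(w: torch.Tensor):
    """Quantize a weight tensor to FP8 storage with a per-tensor scale."""
    scale = w.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)  # weights stored in FP8
    return w_fp8, scale

def dequantize_fp8(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Upcast back to higher precision; used when compute is not done in FP8."""
    return w_fp8.to(torch.bfloat16) * scale

w = torch.randn(4096, 4096)
w_fp8, scale = quantize_weight_fp8(w)
w_hat = dequantize_fp8(w_fp8, scale)
print("max abs quantization error:", (w - w_hat.float()).abs().max().item())
```

Note that FP8 storage does not by itself imply FP8 computation: weight-only schemes upcast before the matmul, while Hopper-class GPUs such as the H800 can also run the matmuls themselves in FP8, which is where much of the speedup would come from.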
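And for question 2, a rough way to measure generation throughput with vLLM's offline API might look like the sketch below. The model id, batch size, and sampling settings are assumptions for illustration, not the benchmark configuration from the paper.

```python
# Rough sketch for measuring generation throughput with vLLM's offline API.
# Assumes a vLLM build that includes vllm-project/vllm#4650; the model id,
# batch shape, and sampling settings are illustrative guesses only.
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2",  # assumed Hugging Face model id
    tensor_parallel_size=8,           # single node with 8 GPUs, as in the paper
    trust_remote_code=True,
)

prompts = ["Explain mixture-of-experts models."] * 256  # toy request batch
params = SamplingParams(temperature=0.0, max_tokens=512)

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generation throughput: {generated / elapsed:.1f} tokens/s")
```

A real reproduction would also need the paper's request distribution (context lengths, concurrency), which this toy batch does not capture.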

luofuli (Member) commented May 14, 2024

Our open-source code (vllm-project/vllm#4650) is not the inference code used in the API platform, so it cannot achieve the throughput reported in the paper. @zhouheyun

zhouheyun (Author) commented May 14, 2024

> Our open-source code (vllm-project/vllm#4650) is not the inference code used in the API platform, so it cannot achieve the throughput reported in the paper.

What's the average inference context length used to achieve the claimed throughput in the paper? @luofuli

luofuli (Member) commented May 27, 2024

32K context length @zhouheyun
