
How to deploy in VLLM? #7

Open
ZHENG518 opened this issue May 7, 2024 · 10 comments

@ZHENG518

ZHENG518 commented May 7, 2024

No description provided.

@stack-heap-overflow
Contributor

Thank you for your interest in our work. We are aware of the challenges in implementing KV compression on the current open-source code and are actively working on it. The Hugging Face code is not as efficient as we would like, so we are developing a new open-source implementation based on vLLM for better performance. The open-source vLLM code, including KV compression, will be released once it is ready.

@Xu-Chen

Xu-Chen commented May 7, 2024

> Thank you for your interest in our work. We are aware of the challenges in implementing KV compression on the current open-source code and are actively working on it. The Hugging Face code is not as efficient as we would like, so we are developing a new open-source implementation based on vLLM for better performance. The open-source vLLM code, including KV compression, will be released once it is ready.

Can it support quantized deployment, e.g. GPTQ or AWQ?

@zwd003

zwd003 commented May 7, 2024

Hi, we have added vLLM support in this PR (vllm-project/vllm#4650).
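
For anyone looking to try it, here is a minimal offline-inference sketch, assuming you have installed a vLLM build that already includes the changes from that PR; the model name, tensor-parallel size, and context length below are illustrative, not prescriptive:

```python
from vllm import LLM, SamplingParams

# Minimal offline-inference sketch. Assumes a vLLM build that already contains
# the DeepSeek-V2 support from vllm-project/vllm#4650; the model name, parallel
# size, and context length below are illustrative.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Chat",
    tensor_parallel_size=8,      # split the weights across 8 GPUs
    max_model_len=8192,          # lower this if the KV cache runs out of memory
    trust_remote_code=True,
    enforce_eager=True,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Write a short poem about attention."], sampling_params)
print(outputs[0].outputs[0].text)
```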

@BasicCoder

> Hi, we have added vLLM support in this PR (vllm-project/vllm#4650).

Thank you for your great work. According to the documentation, the actual deployment on an 8×H800 machine achieves an input throughput of more than 100,000 tokens/s and an output throughput of more than 50,000 tokens/s. Can we achieve that level of performance with this vLLM integration?

@Ricardokevins

Hi, thank you for your great work! How much VRAM is needed? I tried with 8×40 GB but failed with OOM.

@zwd003

zwd003 commented May 9, 2024

> Hi, thank you for your great work! How much VRAM is needed? I tried with 8×40 GB but failed with OOM.

8×80 GB; 8×40 GB only works with a 4-bit model.
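
For anyone else hitting OOM, a rough back-of-the-envelope estimate (my own numbers, not the authors'): DeepSeek-V2 has about 236B total parameters, so the weights alone take roughly 440 GiB in BF16, which exceeds 8×40 GB but fits in 8×80 GB with room left for the KV cache, while a 4-bit quantized model needs roughly a quarter of that.

```python
# Rough weight-memory estimate for DeepSeek-V2 (my own back-of-the-envelope
# numbers): 236B total parameters, ignoring KV cache and activation memory.
TOTAL_PARAMS = 236e9

def weights_gib(bytes_per_param: float) -> float:
    """Return the approximate weight footprint in GiB."""
    return TOTAL_PARAMS * bytes_per_param / 1024**3

print(f"BF16  (2 bytes/param):  ~{weights_gib(2.0):.0f} GiB")  # ~440 GiB -> needs 8x80 GB
print(f"4-bit (0.5 byte/param): ~{weights_gib(0.5):.0f} GiB")  # ~110 GiB -> can fit on 8x40 GB
```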

@Ricardokevins

> Hi, thank you for your great work! How much VRAM is needed? I tried with 8×40 GB but failed with OOM.
>
> 8×80 GB; 8×40 GB only works with a 4-bit model.

got it, thank you~

@ccp123456789

> Hi, thank you for your great work! How much VRAM is needed? I tried with 8×40 GB but failed with OOM.
>
> 8×80 GB; 8×40 GB only works with a 4-bit model.

A 4-bit model? We don't quite understand what that means.

@ZhangYaoFu

> Hi, we have added vLLM support in this PR (vllm-project/vllm#4650).

I failed to run inference with vLLM 0.4.2 and got the following error:

```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/data0/zhenglin/src/asr-anlp-autovision-model3/src/local_inference/deepseek_infer.py", line 8, in <module>
    llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
  File "/data0/zhenglin/.local/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 123, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/data0/zhenglin/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 272, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/data0/zhenglin/.local/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 544, in create_engine_config
    speculative_config = SpeculativeConfig.maybe_create_spec_config(
TypeError: SpeculativeConfig.maybe_create_spec_config() missing 1 required positional argument: 'speculative_disable_by_batch_size'
```

@luofuli mentioned this issue May 14, 2024
@yukiwayx

yukiwayx commented May 29, 2024

> Hi, we have added vLLM support in this PR (vllm-project/vllm#4650).
>
> I failed to run inference with vLLM 0.4.2 and got the following error:
>
> TypeError: SpeculativeConfig.maybe_create_spec_config() missing 1 required positional argument: 'speculative_disable_by_batch_size'

Same problem

Solved by checking the engine/arg_utils.py file.
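
For later readers hitting the same TypeError: my understanding is that it indicates vllm/engine/arg_utils.py and the SpeculativeConfig in vllm/config.py come from mismatched vLLM versions (for example, a partially overwritten install). A small diagnostic sketch, assuming a standard vLLM install:

```python
# Check whether the installed SpeculativeConfig expects the keyword that
# arg_utils.py is failing to pass. If 'speculative_disable_by_batch_size'
# appears in this signature but not at the call site in
# vllm/engine/arg_utils.py, the two files come from different vLLM versions.
import inspect
from vllm.config import SpeculativeConfig

print(inspect.signature(SpeculativeConfig.maybe_create_spec_config))
```

If the signature and the call site disagree, reinstalling one consistent vLLM version (or building cleanly from the PR branch) is usually easier than patching arg_utils.py by hand.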
