
How to deploy in VLLM? #7

Open
ZHENG518 opened this issue May 7, 2024 · 10 comments

@ZHENG518

ZHENG518 commented May 7, 2024

No description provided.

@stack-heap-overflow
Contributor

Thank you for your interest in our work. We are aware of the challenges in implementing KV compression on the current open-source code and are actively working on it. The Hugging Face code is not as efficient as we would like, so we are developing a new open-source implementation based on vLLM for better performance. The open-source vLLM code, including KV compression, will be released once it is ready.

@Xu-Chen

Xu-Chen commented May 7, 2024

> Thank you for your interest in our work. We are aware of the challenges in implementing KV compression on the current open-source code and are actively working on it. The Hugging Face code is not as efficient as we would like, so we are developing a new open-source implementation based on vLLM for better performance. The open-source vLLM code, including KV compression, will be released once it is ready.

Can it support quantized deployment, e.g. GPTQ or AWQ?

@zwd003

zwd003 commented May 7, 2024

Hi, we have added vLLM support in this PR (vllm-project/vllm#4650).
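
For anyone looking to try it, here is a minimal offline-inference sketch, assuming you have installed a vLLM build that already includes the changes from that PR; the model name, tensor-parallel size, and context length below are illustrative, not prescriptive:

```python
from vllm import LLM, SamplingParams

# Minimal offline-inference sketch. Assumes a vLLM build that already contains
# the DeepSeek-V2 support from vllm-project/vllm#4650; the model name, parallel
# size, and context length below are illustrative.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Chat",
    tensor_parallel_size=8,      # split the weights across 8 GPUs
    max_model_len=8192,          # lower this if the KV cache runs out of memory
    trust_remote_code=True,
    enforce_eager=True,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Write a short poem about attention."], sampling_params)
print(outputs[0].outputs[0].text)
```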

@BasicCoder

> Hi, we have added vLLM support in this PR (vllm-project/vllm#4650).

Thank you for your great work. According to the documentation, the actual deployment on an 8×H800 machine achieves an input throughput of more than 100,000 tokens/s and an output throughput of more than 50,000 tokens/s. Can we achieve that level of performance with this vLLM integration?

@Ricardokevins

Hi, thank you for your great work! How much VRAM is needed? I tried with 8×40 GB but failed with OOM.

@zwd003

zwd003 commented May 9, 2024

> Hi, thank you for your great work! How much VRAM is needed? I tried with 8×40 GB but failed with OOM.

8×80 GB; 8×40 GB only works with a 4-bit model.
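
For anyone else hitting OOM, a rough back-of-the-envelope estimate (my own numbers, not the authors'): DeepSeek-V2 has about 236B total parameters, so the weights alone take roughly 440 GiB in BF16, which exceeds 8×40 GB but fits in 8×80 GB with room left for the KV cache, while a 4-bit quantized model needs roughly a quarter of that.

```python
# Rough weight-memory estimate for DeepSeek-V2 (my own back-of-the-envelope
# numbers): 236B total parameters, ignoring KV cache and activation memory.
TOTAL_PARAMS = 236e9

def weights_gib(bytes_per_param: float) -> float:
    """Return the approximate weight footprint in GiB."""
    return TOTAL_PARAMS * bytes_per_param / 1024**3

print(f"BF16  (2 bytes/param):  ~{weights_gib(2.0):.0f} GiB")  # ~440 GiB -> needs 8x80 GB
print(f"4-bit (0.5 byte/param): ~{weights_gib(0.5):.0f} GiB")  # ~110 GiB -> can fit on 8x40 GB
```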

@Ricardokevins

> Hi, thank you for your great work! How much VRAM is needed? I tried with 8×40 GB but failed with OOM.
>
> 8×80 GB; 8×40 GB only works with a 4-bit model.

got it, thank you~

@ccp123456789

> Hi, thank you for your great work! How much VRAM is needed? I tried with 8×40 GB but failed with OOM.
>
> 8×80 GB; 8×40 GB only works with a 4-bit model.

A 4-bit model? We don't quite understand what that means.

@ZhangYaoFu

> Hi, we have added vLLM support in this PR (vllm-project/vllm#4650).

I failed to run inference with vLLM 0.4.2 and got the following error:

```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/data0/zhenglin/src/asr-anlp-autovision-model3/src/local_inference/deepseek_infer.py", line 8, in <module>
    llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
  File "/data0/zhenglin/.local/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 123, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/data0/zhenglin/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 272, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/data0/zhenglin/.local/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 544, in create_engine_config
    speculative_config = SpeculativeConfig.maybe_create_spec_config(
TypeError: SpeculativeConfig.maybe_create_spec_config() missing 1 required positional argument: 'speculative_disable_by_batch_size'
```

@luofuli mentioned this issue May 14, 2024
@yukiwayx

yukiwayx commented May 29, 2024

> Hi, we have added vLLM support in this PR (vllm-project/vllm#4650).
>
> I failed to run inference with vLLM 0.4.2 and got the following error:
>
> TypeError: SpeculativeConfig.maybe_create_spec_config() missing 1 required positional argument: 'speculative_disable_by_batch_size'

Same problem

Solved by checking the engine/arg_utils.py file.
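
For later readers hitting the same TypeError: my understanding is that it indicates vllm/engine/arg_utils.py and the SpeculativeConfig in vllm/config.py come from mismatched vLLM versions (for example, a partially overwritten install). A small diagnostic sketch, assuming a standard vLLM install:

```python
# Check whether the installed SpeculativeConfig expects the keyword that
# arg_utils.py is failing to pass. If 'speculative_disable_by_batch_size'
# appears in this signature but not at the call site in
# vllm/engine/arg_utils.py, the two files come from different vLLM versions.
import inspect
from vllm.config import SpeculativeConfig

print(inspect.signature(SpeculativeConfig.maybe_create_spec_config))
```

If the signature and the call site disagree, reinstalling one consistent vLLM version (or building cleanly from the PR branch) is usually easier than patching arg_utils.py by hand.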
