# HFClientVLLM
## Prerequisites - Launching vLLM Server locally
Refer to the vLLM Server API for setting up the vLLM server locally.
```bash
# Example vLLM Server Launch
python -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b-hf --port 8080
```
This command will start the server and make it accessible at `http://localhost:8080`.
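Before wiring the server into DSPy, you can sanity-check that it is reachable by sending it a raw completion request. Below is a minimal sketch; the `/generate` endpoint and payload fields reflect vLLM's demo API server and are assumptions here, so verify them against your vLLM version:

```python
import requests

# Assumed endpoint/payload of vllm.entrypoints.api_server; adjust for your version.
resp = requests.post(
    "http://localhost:8080/generate",
    json={"prompt": "Hello, my name is", "max_tokens": 16},
)
resp.raise_for_status()
print(resp.json())
```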
## Setting up the vLLM Client
The constructor initializes the `HFModel` base class to support prompting models, configuring the client to communicate with the hosted vLLM server and send generation requests. It requires the following parameters:

- `model` (_str_): ID of the model connected to the vLLM server.
- `port` (_int_): Port for communicating with the vLLM server.
- `url` (_str_): Base URL of the hosted vLLM server. This will often be `"http://localhost"`.
- `**kwargs`: Additional keyword arguments to configure the vLLM client.
Example of the vLLM constructor:
```python
class HFClientVLLM(HFModel):
    def __init__(self, model, port, url="http://localhost", **kwargs):
        ...
```
## Under the Hood
### `_generate(self, prompt, **kwargs) -> dict`
**Parameters:**

- `prompt` (_str_): Prompt to send to the model hosted on the vLLM server.
- `**kwargs`: Additional keyword arguments for the completion request.

**Returns:**

- `dict`: Dictionary with the request `prompt` and a list of response `choices`.
Internally, the method prepares the request prompt and the corresponding payload, then sends them to the server to obtain the response.

After generation, the method parses the JSON response received from the server, retrieves the output through `json_response["choices"]`, and stores it as the `completions` list.

Lastly, the method constructs the response dictionary with two keys: the original request `prompt` and `choices`, a list of dictionaries representing the generated completions, with the key `text` holding the response's generated text.
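Putting these steps together, a minimal sketch of the flow could look like the following. This is not the library's actual implementation; in particular, the endpoint path and payload construction are assumptions, while the parsing mirrors the description above:

```python
import requests

def _generate_sketch(url: str, prompt: str, **kwargs) -> dict:
    # Hypothetical endpoint and payload; the real client assembles these internally.
    payload = {"prompt": prompt, **kwargs}
    json_response = requests.post(f"{url}/v1/completions", json=payload).json()

    # Retrieve the generated outputs from the server's JSON response.
    completions = json_response["choices"]

    # Build the response dictionary: the original prompt plus a list of
    # choices, each holding its generated text under the "text" key.
    return {
        "prompt": prompt,
        "choices": [{"text": choice["text"]} for choice in completions],
    }
```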
## Using the vLLM Client
```python
import dspy

vllm_llama2 = dspy.HFClientVLLM(model="meta-llama/Llama-2-7b-hf", port=8080, url="http://localhost")
```
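Since the constructor forwards `**kwargs` to configure the client, generation settings can also be supplied at construction time. Which settings the server honors depends on your vLLM version, so treat the specific keys below as assumptions:

```python
vllm_llama2 = dspy.HFClientVLLM(
    model="meta-llama/Llama-2-7b-hf",
    port=8080,
    url="http://localhost",
    max_tokens=150,    # assumed generation kwargs, forwarded with each request
    temperature=0.7,
)
```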
## Sending Requests via vLLM Client
- **Recommended:** Configure the default LM using `dspy.configure`.
This allows you to define programs in DSPy and simply call modules on your input fields, with DSPy internally prompting the configured LM.
```python
dspy.configure(lm=vllm_llama2)

# Example DSPy CoT QA program
qa = dspy.ChainOfThought('question -> answer')

response = qa(question="What is the capital of France?")  # Prompted to vllm_llama2
print(response.answer)
```
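After a call goes through, you can confirm what was actually sent to the server. DSPy LM clients expose an `inspect_history` helper for this; if your DSPy version lacks it, printing the response dictionary from the direct client call below serves the same purpose:

```python
# Print the most recent prompt/response pair handled by the client.
vllm_llama2.inspect_history(n=1)
```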
- Generate responses using the client directly.
```python
response = vllm_llama2._generate(prompt='What is the capital of France?')
print(response)  # {'prompt': ..., 'choices': [{'text': ...}, ...]}
```