Skip to main content

DatabricksRM

Constructor

Initialize an instance of the DatabricksRM retriever class, which enables DSPy programs to query Databricks Mosaic AI Vector Search indexes for document retrieval.

DatabricksRM(
databricks_index_name: str,
databricks_endpoint: Optional[str] = None,
databricks_token: Optional[str] = None,
columns: Optional[List[str]] = None,
filters_json: Optional[str] = None,
k: int = 3,
docs_id_column_name: str = "id",
text_column_name: str = "text",
)

Parameters:

  • databricks_index_name (str): The name of the Databricks Vector Search Index to query.
  • databricks_endpoint (Optional[str]): The URL of the Databricks Workspace containing the Vector Search Index. Defaults to the value of the DATABRICKS_HOST environment variable. If unspecified, the Databricks SDK is used to identify the endpoint based on the current environment.
  • databricks_token (Optional[str]): The Databricks Workspace authentication token to use when querying the Vector Search Index. Defaults to the value of the DATABRICKS_TOKEN environment variable. If unspecified, the Databricks SDK is used to identify the token based on the current environment.
  • columns (Optional[List[str]]): Extra column names to include in response, in addition to the document id and text columns specified by docs_id_column_name and text_column_name.
  • filters_json (Optional[str]): A JSON string specifying additional query filters. Example filters: {"id <": 5} selects records that have an id column value less than 5, and {"id >=": 5, "id <": 10} selects records that have an id column value greater than or equal to 5 and less than 10.
  • k (int): The number of documents to retrieve.
  • docs_id_column_name (str): The name of the column in the Databricks Vector Search Index containing document IDs.
  • text_column_name (str): The name of the column in the Databricks Vector Search Index containing document text to retrieve.

Methods

def forward(self, query: Union[str, List[float]], query_type: str = "ANN", filters_json: Optional[str] = None) -> dspy.Prediction:

Retrieve documents from a Databricks Mosaic AI Vector Search Index that are relevant to the specified query.

Parameters:

  • query (Union[str, List[float]]): The query text or numeric query vector for which to retrieve relevant documents.
  • query_type (str): The type of search query to perform against the Databricks Vector Search Index. Must be either 'ANN' (approximate nearest neighbor) or 'HYBRID' (hybrid search).
  • filters_json (Optional[str]): A JSON string specifying additional query filters. Example filters: {"id <": 5} selects records that have an id column value less than 5, and {"id >=": 5, "id <": 10} selects records that have an id column value greater than or equal to 5 and less than 10. If specified, this parameter overrides the filters_json parameter passed to the constructor.

Returns:

  • dspy.Prediction: A dotdict containing retrieved documents. The schema is {'docs': List[str], 'doc_ids': List[Any], extra_columns: List[Dict[str, Any]]}. The docs entry contains the retrieved document content.

Quickstart

To retrieve documents using Databricks Mosaic AI Vector Search, you must create a Databricks Mosaic AI Vector Search Index first.

The following example code demonstrates how to set up a Databricks Mosaic AI Direct Access Vector Search Index and use the DatabricksRM DSPy retriever module to query the index. The example requires the databricks-vectorsearch Python library to be installed.

from databricks.vector_search.client import VectorSearchClient

# Create a Databricks Vector Search Endpoint
client = VectorSearchClient()
client.create_endpoint(
name="your_vector_search_endpoint_name",
endpoint_type="STANDARD"
)

# Create a Databricks Direct Access Vector Search Index
index = client.create_direct_access_index(
endpoint_name="your_vector_search_endpoint_name",
index_name="your_index_name",
primary_key="id",
embedding_dimension=1024,
embedding_vector_column="text_vector",
schema={
"id": "int",
"field2": "str",
"field3": "float",
"text_vector": "array<float>"
}
)

# Create a DatabricksRM retriever and retrieve the top-3 most relevant documents from the
# Databricks Direct Access Vector Search Index corresponding to an example query
retriever = DatabricksRM(
databricks_index_name = "your_index_name",
docs_id_column_name="id",
text_column_name="field2",
k=3
)
retrieved_results = DatabricksRM(query="Example query text", query_type="hybrid"))