forked from langchain-ai/langchain
Expanded Self-Query Retriever and Self-Query Retriever with MyScale #2
Merged
Changes from 9 commits
Commits
12 commits
1b26021
add myscale self-query
1bba1ee
Merge branch 'hwchase17:master' into master
mpskex a79ca91
Merge pull request #1 from myscale/master
mpskex 7ac2742
revised prompt and add notebook
ec1a2ad
improve tools (#6062)
hwchase17 6ac5d80
propogate kwargs fully (#6076)
hwchase17 a9b3b2e
Enable serialization for anthropic (#6049)
nfcampos cde1e87
turn off repr (#6078)
hwchase17 7ff729e
Merge branch 'hwchase17:master' into master
mpskex edd55b4
formated
fc92205
still linting...
7759afd
fixed unittest and lint
370 changes: 370 additions & 0 deletions
docs/modules/indexes/retrievers/examples/myscale_self_query.ipynb
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,370 @@ | ||
| { | ||
| "cells": [ | ||
| { | ||
| "attachments": {}, | ||
| "cell_type": "markdown", | ||
| "id": "13afcae7", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "# Self-querying with MyScale\n", | ||
| "\n", | ||
| ">[MyScale](https://docs.myscale.com/en/) is an integrated vector database. You can access your database with SQL and also from LangChain. MyScale can use [various data types and functions for filters](https://blog.myscale.com/2023/06/06/why-integrated-database-solution-can-boost-your-llm-apps/#filter-on-anything-without-constraints). This helps your LLM app whether you are scaling up your data or expanding your system to broader applications.\n", | ||
| "\n", | ||
| "In this notebook we'll demo the `SelfQueryRetriever` wrapped around a MyScale vector store, with some extra features we contributed to LangChain. In short, there are four additions:\n", | ||
| "1. Add a `contain` comparator for matching one or more elements of a list\n", | ||
| "2. Add a `timestamp` data type for datetime matching (ISO format, or YYYY-MM-DD)\n", | ||
| "3. Add a `like` comparator for string pattern search\n", | ||
| "4. Add support for arbitrary functions" | ||
| ] | ||
| }, | ||
| { | ||
| "attachments": {}, | ||
| "cell_type": "markdown", | ||
| "id": "68e75fb9", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Creating a MyScale vectorstore\n", | ||
| "MyScale has been integrated into LangChain for a while, so you can follow [this notebook](../../vectorstores/examples/myscale.ipynb) to create your own vectorstore for a self-query retriever.\n", | ||
| "\n", | ||
| "NOTE: All self-query retrievers require you to have `lark` installed (`pip install lark`). We use `lark` for grammar definition. Before you proceed to the next step, note that `clickhouse-connect` is also needed to interact with your MyScale backend." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "63a8af5b", | ||
| "metadata": { | ||
| "tags": [] | ||
| }, | ||
| "outputs": [], | ||
| "source": [ | ||
| "! pip install lark clickhouse-connect" | ||
| ] | ||
| }, | ||
| { | ||
| "attachments": {}, | ||
| "cell_type": "markdown", | ||
| "id": "83811610-7df3-4ede-b268-68a6a83ba9e2", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "In this tutorial we follow the other examples' setup and use `OpenAIEmbeddings`. Remember to get an OpenAI API key for valid access to LLMs." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "dd01b61b-7d32-4a55-85d6-b2d2d4f18840", | ||
| "metadata": { | ||
| "tags": [] | ||
| }, | ||
| "outputs": [], | ||
| "source": [ | ||
| "import os\n", | ||
| "import getpass\n", | ||
| "\n", | ||
| "os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')\n", | ||
| "os.environ['MYSCALE_HOST'] = getpass.getpass('MyScale URL:')\n", | ||
| "os.environ['MYSCALE_PORT'] = getpass.getpass('MyScale Port:')\n", | ||
| "os.environ['MYSCALE_USERNAME'] = getpass.getpass('MyScale Username:')\n", | ||
| "os.environ['MYSCALE_PASSWORD'] = getpass.getpass('MyScale Password:')" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "cb4a5787", | ||
| "metadata": { | ||
| "tags": [] | ||
| }, | ||
| "outputs": [], | ||
| "source": [ | ||
| "from langchain.schema import Document\n", | ||
| "from langchain.embeddings.openai import OpenAIEmbeddings\n", | ||
| "from langchain.vectorstores import MyScale\n", | ||
| "\n", | ||
| "embeddings = OpenAIEmbeddings()" | ||
| ] | ||
| }, | ||
| { | ||
| "attachments": {}, | ||
| "cell_type": "markdown", | ||
| "id": "bf7f6fc4", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Create some sample data\n", | ||
| "As you can see, the data we created differs somewhat from the data used with other self-query retrievers. We replaced the keyword `year` with `date`, which gives you finer control over timestamps. We also changed the type of the keyword `genre` to a list of strings, so the LLM can use the new `contain` comparator to construct filters. We also provide the `like` comparator and arbitrary function support for filters, which are introduced in the next few cells.\n", | ||
| "\n", | ||
| "Now let's look at the data first." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "bcbe04d9", | ||
| "metadata": { | ||
| "tags": [] | ||
| }, | ||
| "outputs": [], | ||
| "source": [ | ||
| "docs = [\n", | ||
| " Document(page_content=\"A bunch of scientists bring back dinosaurs and mayhem breaks loose\", metadata={\"date\": \"1993-07-02\", \"rating\": 7.7, \"genre\": [\"science fiction\"]}),\n", | ||
| " Document(page_content=\"Leo DiCaprio gets lost in a dream within a dream within a dream within a ...\", metadata={\"date\": \"2010-12-30\", \"director\": \"Christopher Nolan\", \"rating\": 8.2}),\n", | ||
| " Document(page_content=\"A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea\", metadata={\"date\": \"2006-04-23\", \"director\": \"Satoshi Kon\", \"rating\": 8.6}),\n", | ||
| " Document(page_content=\"A bunch of normal-sized women are supremely wholesome and some men pine after them\", metadata={\"date\": \"2019-08-22\", \"director\": \"Greta Gerwig\", \"rating\": 8.3}),\n", | ||
| " Document(page_content=\"Toys come alive and have a blast doing so\", metadata={\"date\": \"1995-02-11\", \"genre\": [\"animated\"]}),\n", | ||
| Document(page_content=\"Three men walk into the Zone, three men walk out of the Zone\", metadata={\"date\": \"1979-09-10\", \"rating\": 9.9, \"director\": \"Andrei Tarkovsky\", \"genre\": [\"science fiction\", \"adventure\"]})\n", | ||
| "]\n", | ||
| "vectorstore = MyScale.from_documents(\n", | ||
| " docs, \n", | ||
| " embeddings, \n", | ||
| ")" | ||
| ] | ||
| }, | ||
| { | ||
| "attachments": {}, | ||
| "cell_type": "markdown", | ||
| "id": "5ecaab6d", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Creating our self-querying retriever\n", | ||
| "Just like other retrievers... Simple and nice." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "86e34dbf", | ||
| "metadata": { | ||
| "tags": [] | ||
| }, | ||
| "outputs": [], | ||
| "source": [ | ||
| "from langchain.llms import OpenAI\n", | ||
| "from langchain.retrievers.self_query.base import SelfQueryRetriever\n", | ||
| "from langchain.chains.query_constructor.base import AttributeInfo\n", | ||
| "\n", | ||
| "metadata_field_info=[\n", | ||
| " AttributeInfo(\n", | ||
| " name=\"genre\",\n", | ||
| " description=\"The genres of the movie\", \n", | ||
| " type=\"list[string]\", \n", | ||
| " ),\n", | ||
| " # If you want to filter on the length of a list, define it as a new column.\n", | ||
| " # This teaches the LLM to use it as a column when constructing a filter.\n", | ||
| " AttributeInfo(\n", | ||
| " name=\"length(genre)\",\n", | ||
| " description=\"The number of genres of the movie\", \n", | ||
| " type=\"integer\", \n", | ||
| " ),\n", | ||
| " # You can now define a column as a timestamp by simply setting its type to timestamp.\n", | ||
| " AttributeInfo(\n", | ||
| " name=\"date\",\n", | ||
| " description=\"The date the movie was released\", \n", | ||
| " type=\"timestamp\", \n", | ||
| " ),\n", | ||
| " AttributeInfo(\n", | ||
| " name=\"director\",\n", | ||
| " description=\"The name of the movie director\", \n", | ||
| " type=\"string\", \n", | ||
| " ),\n", | ||
| " AttributeInfo(\n", | ||
| " name=\"rating\",\n", | ||
| " description=\"A 1-10 rating for the movie\",\n", | ||
| " type=\"float\"\n", | ||
| " ),\n", | ||
| "]\n", | ||
| "document_content_description = \"Brief summary of a movie\"\n", | ||
| "llm = OpenAI(temperature=0)\n", | ||
| "retriever = SelfQueryRetriever.from_llm(llm, vectorstore, document_content_description, metadata_field_info, verbose=True)" | ||
| ] | ||
| }, | ||
| { | ||
| "attachments": {}, | ||
| "cell_type": "markdown", | ||
| "id": "ea9df8d4", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Testing it out with the self-query retriever's existing functionality\n", | ||
| "And now we can try actually using our retriever!" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "38a126e9", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# This example only specifies a relevant query\n", | ||
| "retriever.get_relevant_documents(\"What are some movies about dinosaurs\")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "fc3f1e6e", | ||
| "metadata": { | ||
| "scrolled": false | ||
| }, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# This example only specifies a filter\n", | ||
| "retriever.get_relevant_documents(\"I want to watch a movie rated higher than 8.5\")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "b19d4da0", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# This example specifies a query and a filter\n", | ||
| "retriever.get_relevant_documents(\"Has Greta Gerwig directed any movies about women\")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "f900e40e", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# This example specifies a composite filter\n", | ||
| "retriever.get_relevant_documents(\"What's a highly rated (above 8.5) science fiction film?\")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "12a51522", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# This example specifies a query and composite filter\n", | ||
| "retriever.get_relevant_documents(\"What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated\")" | ||
| ] | ||
| }, | ||
| { | ||
| "attachments": {}, | ||
| "cell_type": "markdown", | ||
| "id": "86371ac8", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "# Wait a second... What else?\n", | ||
| "\n", | ||
| "Self-query retriever with MyScale can do more! Let's find out." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "1d043096", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# You can use length(genre) to filter however you want\n", | ||
| "retriever.get_relevant_documents(\"What's a movie that has more than 1 genre?\")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "d570d33c", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Fine-grained datetime? You got it already.\n", | ||
| "retriever.get_relevant_documents(\"What's a movie that was released after Feb 1995?\")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "fbe0b21b", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Don't know what your exact filter should be? Use string pattern match!\n", | ||
| "retriever.get_relevant_documents(\"What's a movie whose name is like Andrei?\")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "6a514104", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Contain works for lists, so you can match list elements with the contain comparator!\n", | ||
| "retriever.get_relevant_documents(\"What's a movie that has the genres science fiction and adventure?\")" | ||
| ] | ||
| }, | ||
| { | ||
| "attachments": {}, | ||
| "cell_type": "markdown", | ||
| "id": "39bd1de1-b9fe-4a98-89da-58d8a7a6ae51", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Filter k\n", | ||
| "\n", | ||
| "We can also use the self query retriever to specify `k`: the number of documents to fetch.\n", | ||
| "\n", | ||
| "We can do this by passing `enable_limit=True` to the constructor." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "bff36b88-b506-4877-9c63-e5a1a8d78e64", | ||
| "metadata": { | ||
| "tags": [] | ||
| }, | ||
| "outputs": [], | ||
| "source": [ | ||
| "retriever = SelfQueryRetriever.from_llm(\n", | ||
| " llm, \n", | ||
| " vectorstore, \n", | ||
| " document_content_description, \n", | ||
| " metadata_field_info, \n", | ||
| " enable_limit=True,\n", | ||
| " verbose=True\n", | ||
| ")" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "2758d229-4f97-499c-819f-888acaf8ee10", | ||
| "metadata": { | ||
| "tags": [] | ||
| }, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# This example specifies a relevant query and a limit of two documents\n", | ||
| "retriever.get_relevant_documents(\"what are two movies about dinosaurs\")" | ||
| ] | ||
| } | ||
| ], | ||
| "metadata": { | ||
| "kernelspec": { | ||
| "display_name": "Python 3 (ipykernel)", | ||
| "language": "python", | ||
| "name": "python3" | ||
| }, | ||
| "language_info": { | ||
| "codemirror_mode": { | ||
| "name": "ipython", | ||
| "version": 3 | ||
| }, | ||
| "file_extension": ".py", | ||
| "mimetype": "text/x-python", | ||
| "name": "python", | ||
| "nbconvert_exporter": "python", | ||
| "pygments_lexer": "ipython3", | ||
| "version": "3.8.8" | ||
| } | ||
| }, | ||
| "nbformat": 4, | ||
| "nbformat_minor": 5 | ||
| } | ||
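The four filter extensions that the notebook lists (`contain`, `timestamp`, `like`, and function columns such as `length(genre)`) can be illustrated without a database. The sketch below is plain Python over the notebook's sample metadata. It is not how MyScale or LangChain evaluates filters, and the helper names `contain`, `like`, `after`, and `length` are hypothetical, chosen only to mirror the terms used in the notebook. It assumes that `contain` tests membership of a single value and that `like` follows SQL `LIKE` wildcards.

```python
import re
from datetime import date

# Metadata copied from the notebook's sample documents (page content omitted).
MOVIES = [
    {"date": "1993-07-02", "rating": 7.7, "genre": ["science fiction"]},
    {"date": "2010-12-30", "director": "Christopher Nolan", "rating": 8.2},
    {"date": "2006-04-23", "director": "Satoshi Kon", "rating": 8.6},
    {"date": "2019-08-22", "director": "Greta Gerwig", "rating": 8.3},
    {"date": "1995-02-11", "genre": ["animated"]},
    {"date": "1979-09-10", "director": "Andrei Tarkovsky", "rating": 9.9,
     "genre": ["science fiction", "adventure"]},
]


def contain(meta, field, value):
    """Assumed semantics: the list stored under `field` includes `value`."""
    return value in meta.get(field, [])


def like(meta, field, pattern):
    """SQL-style LIKE: % matches any run of characters, _ matches exactly one."""
    regex = "".join(
        ".*" if c == "%" else "." if c == "_" else re.escape(c) for c in pattern
    )
    return re.fullmatch(regex, meta.get(field, ""), flags=re.DOTALL) is not None


def after(meta, field, iso_day):
    """Compare a YYYY-MM-DD string field against another YYYY-MM-DD date."""
    return date.fromisoformat(meta[field]) > date.fromisoformat(iso_day)


def length(meta, field):
    """Function column: number of elements in a list field (0 if absent)."""
    return len(meta.get(field, []))


# Movies tagged with both genres.
both_genres = [
    m for m in MOVIES
    if contain(m, "genre", "science fiction") and contain(m, "genre", "adventure")
]
# Movies released after February 1995.
after_feb_1995 = [m["date"] for m in MOVIES if after(m, "date", "1995-02-28")]
# Movies whose director starts with "Andrei".
andrei = [m["director"] for m in MOVIES if like(m, "director", "Andrei%")]
# Movies with more than one genre.
multi_genre = [m for m in MOVIES if length(m, "genre") > 1]
```

Comparing parsed `date` objects rather than year integers is what makes a query such as "after Feb 1995" expressible, which is the motivation the notebook gives for replacing `year` with `date`.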
Consider adding a sentence: visit myscale.com and sign up for free to get a username and password for your MyScale pod.
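The notebook's setup cell stores the connection settings in the environment variables `MYSCALE_HOST`, `MYSCALE_PORT`, `MYSCALE_USERNAME`, and `MYSCALE_PASSWORD`. A minimal sketch of checking them before building the vector store might look like the following. The helpers `missing_settings` and `port_is_valid` are hypothetical and are not part of LangChain or MyScale.

```python
import os

# Variable names taken from the notebook's setup cell.
REQUIRED_VARS = (
    "MYSCALE_HOST",
    "MYSCALE_PORT",
    "MYSCALE_USERNAME",
    "MYSCALE_PASSWORD",
)


def missing_settings(env=None):
    """Return the names of required settings that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def port_is_valid(value):
    """True when `value` is a decimal integer in the TCP port range 1-65535."""
    return value.isascii() and value.isdigit() and 0 < int(value) < 65536
```

Calling `missing_settings()` with no argument inspects `os.environ`; passing a plain dict makes the check easy to try without touching the real environment.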