The MongoDB document model combined with Atlas Vector Search is a powerful way to store data for AI applications. Vector Search is available alongside MongoDB's document database without any extra integrations.
We are going to load the MongoDB Atlas Best Practices PDF and chunk it into smaller pieces for efficient lookup.
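Chunking just means splitting a long document into fixed-size pieces so each piece can be embedded and retrieved on its own. The splitter used below (RecursiveCharacterTextSplitter) is smarter than this — it prefers to break on paragraph and sentence boundaries — but as a minimal plain-Python sketch of the core idea of size-limited chunks with optional overlap:

```python
def split_text(text, chunk_size=500, chunk_overlap=0):
    """Split text into chunks of at most chunk_size characters,
    with chunk_overlap characters shared between neighboring chunks."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 1200-character text with chunk_size=500 and no overlap
# yields chunks of 500, 500, and 200 characters.
chunks = split_text("a" * 1200, chunk_size=500, chunk_overlap=0)
print([len(c) for c in chunks])  # → [500, 500, 200]
```

The `split_text` helper is purely illustrative; in the tutorial itself, LangChain's text splitter does this work for us.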
Install libraries:
pip install pymongo langchain langchain-mongodb sentence-transformers pypdf
Code Sample:
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain_community.llms import HuggingFaceEndpoint
import pprint
uri = "<your connection string>"
# Create a new client and connect to the server
client = MongoClient(uri, server_api=ServerApi('1'))
# Initialize embeddings
HUGGINGFACE_TOKEN = "<your access token>"
embeddings = HuggingFaceEmbeddings(model_name="thenlper/gte-large")
db = client["langchain_db"]
collection = db["test"]
# Load the PDF
loader = PyPDFLoader("https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4HkJP")
data = loader.load()
# Split PDF into documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 0)
docs = text_splitter.split_documents(data)
# Create the MongoDB Atlas Vector Search instance
vector_search = MongoDBAtlasVectorSearch.from_connection_string(
    connection_string=uri,
    namespace="langchain_db.test",
    embedding=embeddings,
    index_name="vector_index_test",
)
# Run the documents through the embeddings and add to the vectorstore
vector_search.add_documents(docs)
The following parameters are used:
docs: The chunked documents to store in the vector database.
uri: Your Atlas connection string, used to connect to the cluster.
namespace: langchain_db.test as the Atlas namespace (database.collection) into which to insert the documents.
embedding: The HuggingFace gte-large model used to convert text into vector embeddings for the embedding field.
index_name: vector_index_test as the Atlas Vector Search index to use for querying the vector store.
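Under the hood, the vector index ranks candidate documents by a similarity function configured on the index (Atlas Vector Search supports cosine, dot product, and Euclidean). As a minimal illustration of what cosine similarity computes between two embedding vectors — plain Python, not the Atlas implementation:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same
    direction, 0.0 means orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal → 0.0
```

In practice the embeddings produced by gte-large are high-dimensional (1024 values per chunk), and Atlas performs this ranking server-side against the index named above.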