# Supabase Setup for Optimal GAIA Agent Performance ## Required Supabase Configuration ### 1. Create the `match_documents_langchain` Function This SQL function enables efficient vector similarity search: ```sql -- Create the similarity search function for LangChain integration create or replace function match_documents_langchain ( query_embedding vector(1536), -- Adjust dimension based on embedding model match_threshold float default 0.75, match_count int default 3 ) returns table ( id uuid, page_content text, embedding vector, metadata jsonb, similarity float ) language plpgsql as $$ begin return query select documents.id, documents.page_content, documents.embedding, documents.metadata, 1 - (documents.embedding <=> query_embedding) as similarity from documents where 1 - (documents.embedding <=> query_embedding) > match_threshold order by documents.embedding <=> query_embedding limit match_count; end; $$; ``` ### 2. Alternative for HuggingFace Embeddings (384 dimensions) If using `sentence-transformers/all-mpnet-base-v2`: ```sql -- For HuggingFace embeddings (384 dimensions) create or replace function match_documents_langchain_hf ( query_embedding vector(384), match_threshold float default 0.75, match_count int default 3 ) returns table ( id uuid, page_content text, embedding vector, metadata jsonb, similarity float ) language plpgsql as $$ begin return query select documents.id, documents.page_content, documents.embedding, documents.metadata, 1 - (documents.embedding <=> query_embedding) as similarity from documents where 1 - (documents.embedding <=> query_embedding) > match_threshold order by documents.embedding <=> query_embedding limit match_count; end; $$; ``` ### 3. Update Database Table Structure Ensure the `documents` table has the right structure: ```sql -- Check/create the documents table structure CREATE TABLE IF NOT EXISTS documents ( id UUID DEFAULT gen_random_uuid() PRIMARY KEY, page_content TEXT NOT NULL, embedding VECTOR(1536), -- Or 384 for HuggingFace metadata JSONB DEFAULT '{}', created_at TIMESTAMP WITH TIME ZONE DEFAULT TIMEZONE('utc'::text, NOW()) ); -- Create index for fast similarity search CREATE INDEX IF NOT EXISTS documents_embedding_idx ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100); ``` ### 4. Environment Variables Update the `.env` file: ```env # Required for both approaches SUPABASE_URL=supabase_project_url SUPABASE_SERVICE_KEY=supabase_service_key # Alternative key name (some setups use this) SUPABASE_KEY=supabase_service_key # Optional: For OpenAI fallback OPENAI_API_KEY=openai_api_key ``` ## Performance Comparison ### HuggingFace Approach (Recommended) ✅ **Free embedding model** ✅ **Often better semantic understanding** ✅ **384-dimensional vectors (smaller storage)** ✅ **No API rate limits** ### OpenAI Approach (Fallback) ✅ **Very reliable and consistent** ✅ **1536-dimensional vectors (more detailed)** ❌ **Costs money per embedding** ❌ **API rate limits** ## Testing the Setup 1. **Test the function exists:** ```sql SELECT * FROM match_documents_langchain( '[0.1, 0.2, ...]'::vector, -- Sample embedding 0.7, -- Threshold 5 -- Count ); ``` 2. **Test with Python:** ```python from tools.database_tools import retriever # Test efficient search results = retriever.search_similar_questions_efficient( "What is the capital of France?", top_k=3 ) print(results) ``` ## Migration from Manual to Efficient Search If you're currently using manual similarity search, the new hybrid approach will: 1. **Try efficient LangChain search first** 2. **Fall back to manual search if needed** 3. **Automatically detect which approach works** This ensures compatibility while optimizing for performance!