# Supabase Setup for Optimal GAIA Agent Performance

## Required Supabase Configuration

### 1. Create the `match_documents_langchain` Function

This SQL function enables efficient vector similarity search:

```sql
-- Create the similarity search function for LangChain integration
create or replace function match_documents_langchain (
  query_embedding vector(1536),  -- Adjust dimension based on embedding model
  match_threshold float default 0.75,
  match_count int default 3
)
returns table (
  id uuid,
  page_content text,
  embedding vector,
  metadata jsonb,
  similarity float
)
language plpgsql
as $$
begin
  return query
  select
    documents.id,
    documents.page_content,
    documents.embedding,
    documents.metadata,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where 1 - (documents.embedding <=> query_embedding) > match_threshold
  order by documents.embedding <=> query_embedding
  limit match_count;
end;
$$;
```

### 2. Alternative for HuggingFace Embeddings (384 dimensions)

If using `sentence-transformers/all-mpnet-base-v2`:

```sql
-- For HuggingFace embeddings (384 dimensions)
create or replace function match_documents_langchain_hf (
  query_embedding vector(384),
  match_threshold float default 0.75,
  match_count int default 3
)
returns table (
  id uuid,
  page_content text,
  embedding vector,
  metadata jsonb,
  similarity float
)
language plpgsql
as $$
begin
  return query
  select
    documents.id,
    documents.page_content,
    documents.embedding,
    documents.metadata,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where 1 - (documents.embedding <=> query_embedding) > match_threshold
  order by documents.embedding <=> query_embedding
  limit match_count;
end;
$$;
```

### 3. Update Database Table Structure

Ensure the `documents` table has the right structure:

```sql
-- Check/create the documents table structure
CREATE TABLE IF NOT EXISTS documents (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  page_content TEXT NOT NULL,
  embedding VECTOR(1536), -- Or 384 for HuggingFace
  metadata JSONB DEFAULT '{}',
  created_at TIMESTAMP WITH TIME ZONE DEFAULT TIMEZONE('utc'::text, NOW())
);

-- Create index for fast similarity search
CREATE INDEX IF NOT EXISTS documents_embedding_idx 
ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```

### 4. Environment Variables

Update the `.env` file:

```env
# Required for both approaches
SUPABASE_URL=supabase_project_url
SUPABASE_SERVICE_KEY=supabase_service_key
# Alternative key name (some setups use this)
SUPABASE_KEY=supabase_service_key

# Optional: For OpenAI fallback
OPENAI_API_KEY=openai_api_key
```

## Performance Comparison

### HuggingFace Approach (Recommended)
✅ **Free embedding model**
✅ **Often better semantic understanding**  
✅ **384-dimensional vectors (smaller storage)**
✅ **No API rate limits**

### OpenAI Approach (Fallback)
✅ **Very reliable and consistent**
✅ **1536-dimensional vectors (more detailed)**
❌ **Costs money per embedding**
❌ **API rate limits**

## Testing the Setup

1. **Test the function exists:**
```sql
SELECT * FROM match_documents_langchain(
  '[0.1, 0.2, ...]'::vector,  -- Sample embedding
  0.7,  -- Threshold
  5     -- Count
);
```

2. **Test with Python:**
```python
from tools.database_tools import retriever

# Test efficient search
results = retriever.search_similar_questions_efficient(
    "What is the capital of France?", 
    top_k=3
)
print(results)
```

## Migration from Manual to Efficient Search

If you're currently using manual similarity search, the new hybrid approach will:

1. **Try efficient LangChain search first**
2. **Fall back to manual search if needed**
3. **Automatically detect which approach works**

This ensures compatibility while optimizing for performance!