How big should our embedding dimensions be?

1536 dimensions (the default output size of OpenAI's text-embedding-3-small) is a strong choice for most production cases. 3072 (text-embedding-3-large) gets marginally better quality at higher cost and storage. Smaller dimensions (384 or 768, typical of open-source models) save cost and storage with a modest quality loss that is often acceptable. Right-size based on retrieval quality measured on your eval set.
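To make the storage side of that trade-off concrete, here is a minimal back-of-the-envelope sketch. It assumes embeddings are stored as 4-byte float32 values and ignores index overhead and metadata, so real footprints will be larger; the function name is illustrative.

```python
# Raw storage for float32 embeddings, ignoring index overhead and metadata.
BYTES_PER_FLOAT32 = 4


def raw_storage_gb(num_vectors: int, dims: int) -> float:
    """Return raw vector storage in gigabytes (10**9 bytes)."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / 1e9


if __name__ == "__main__":
    # Compare the dimension sizes mentioned above for a 10M-vector corpus.
    for dims in (384, 768, 1536, 3072):
        print(f"{dims:>5} dims: {raw_storage_gb(10_000_000, dims):.2f} GB")
```

Storage scales linearly with dimension, so moving from 1536 to 3072 doubles the raw footprint before any index overhead is counted.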

How fast is semantic search at scale?

Production ANN systems typically serve sub-100ms p95 query latency on indexes of 10-100M vectors. At 1B+ vectors, latency can stay under 200ms with appropriate hardware and indexing. The bottleneck is usually network round-trip, not the search itself.

Do we need a vector database or can we use Postgres?

pgvector handles up to 5-10M vectors well in production if you're already running Postgres. Above that, dedicated vector databases (Pinecone, Qdrant, Weaviate, Milvus) start to win on latency, scaling, and metadata filtering performance. For most growth-stage SaaS, pgvector is the right starting point.
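As a rough sketch of what a pgvector starting point can look like, the SQL below is held in Python strings so it can be passed to any Postgres driver. The table name, column names, and the psycopg-style named placeholders are assumptions for illustration; the `vector` type, the `hnsw` index with `vector_cosine_ops`, and the `<=>` cosine distance operator come from pgvector itself.

```python
# Sketch only: SQL for a pgvector-backed table, kept as Python strings.
# Table and column names are hypothetical.
EMBEDDING_DIMS = 1536

SETUP_SQL = f"""
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
    id bigserial PRIMARY KEY,
    body text NOT NULL,
    embedding vector({EMBEDDING_DIMS}) NOT NULL
);
-- HNSW index built for cosine distance
CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops);
"""

# <=> is pgvector's cosine distance operator; smaller means more similar.
# The %(name)s placeholders assume a psycopg-style driver.
SEARCH_SQL = """
SELECT id, body, 1 - (embedding <=> %(query)s::vector) AS cosine_similarity
FROM documents
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""


def to_vector_literal(values):
    """Format a list of numbers as a pgvector text literal such as '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"
```

With a driver connection in hand, the query embedding would be passed as `to_vector_literal(embedding)` for the `query` parameter and the result count as `k`.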


What is Semantic Search?

Semantic search is a retrieval technique that finds documents based on meaning rather than exact keyword match. It converts both queries and documents to high-dimensional vector embeddings and ranks documents by vector similarity to the query, so search can surface conceptually related results even when no terms are shared.

Last updated 2026-04-28 | BearPlex AI Engineering Team

Overview

Semantic search is the retrieval foundation under most modern AI applications: RAG systems, recommendation engines, knowledge bases, customer support deflection, and code search. It replaces (or augments) traditional keyword-based search (BM25, TF-IDF) by understanding semantic similarity: 'cancel my subscription' and 'how do I unsubscribe' are recognized as semantically similar even with no shared terms. The technology has matured rapidly since 2020: embedding models from OpenAI, Cohere, Voyage, and open-source providers like sentence-transformers and BGE deliver production-quality semantic understanding at low cost. In production, the most reliable architecture is hybrid retrieval (semantic search combined with keyword search) to capture both meaning and exact-match signals.

How semantic search works

Semantic search has three components:

(1) Embedding model: converts text into a fixed-dimension vector that represents its semantic meaning; common dimensions are 384, 768, 1536, or 3072. Documents and queries are embedded with the same model.
(2) Vector database: stores millions or billions of document embeddings indexed for efficient nearest-neighbor search; common choices include Pinecone, Qdrant, Weaviate, pgvector, and Milvus.
(3) Approximate nearest neighbor (ANN) algorithm: finds the K most-similar document vectors to a query vector in milliseconds; common algorithms are HNSW, IVF, and ScaNN.

At query time: embed the query, run ANN search against the indexed documents, return the top K matches ranked by similarity score. Production systems usually add reranking and metadata filtering on top of base retrieval.
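The query-time flow can be sketched in a few lines. This is a minimal illustration, not a production implementation: it uses exact (brute-force) search instead of an ANN index, and the three-dimensional vectors are hand-written stand-ins for real embedding model output. All names and values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Exact nearest-neighbor search: score every document, keep the k best."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional embeddings, hand-written for illustration only.
DOC_VECS = {
    "cancellation-policy": [0.9, 0.1, 0.0],
    "billing-faq": [0.6, 0.6, 0.1],
    "api-reference": [0.0, 0.2, 0.9],
}
# Stands in for the embedding of a query such as "how do I unsubscribe".
QUERY_VEC = [0.8, 0.2, 0.1]

results = top_k(QUERY_VEC, DOC_VECS, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

An ANN index such as HNSW replaces the full scan in `top_k` with a search that trades a small amount of recall for much lower latency on large corpora.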

Hybrid search: semantic + keyword

Pure semantic search has known weaknesses: it misses exact-match signals (proper nouns, product codes, specific numbers, rare technical terms) that should rank highly when present. Pure keyword search misses semantic similarity. Hybrid search combines both: run semantic search and keyword search (typically BM25) in parallel, then fuse the rankings using techniques like Reciprocal Rank Fusion (RRF) or weighted combination. Benchmarks consistently show hybrid outperforming either approach alone, typically by 10-30% in retrieval quality on benchmarks like BEIR. At BearPlex, hybrid is the default for production RAG; pure semantic is the exception, used only when the corpus has minimal proper noun content.
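Reciprocal Rank Fusion is simple enough to sketch directly. In this minimal version each document's fused score is the sum of 1 / (k + rank) over every ranking it appears in, with k = 60 as the commonly used constant. The document ids and the two input rankings are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    where rank starts at 1 for the top result.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two retrievers for the same query.
semantic_ranking = ["doc-a", "doc-b", "doc-c"]
keyword_ranking = ["doc-c", "doc-a", "doc-d"]

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused)
```

RRF uses only rank positions, so it avoids having to calibrate cosine similarities against BM25 scores, which live on different scales. A weighted combination needs that normalization step first.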

Reranking for the last mile

ANN retrieval returns the top K (typically 50-200) documents that are roughly relevant. Reranking is a second-stage scoring pass that examines each candidate more carefully and reorders the candidates by estimated relevance. Cross-encoder models (Cohere Rerank, BGE-reranker) score query-document pairs jointly rather than computing independent embeddings, which is much more accurate but slower. Late-interaction models such as ColBERT are a related option that compares token-level embeddings instead of scoring each pair jointly. Production pattern: ANN retrieves top 100, reranker scores them precisely, top 5-10 reranked documents go to the LLM. The combined pipeline (ANN + reranker) consistently outperforms first-stage retrieval alone and is the de-facto standard for high-quality production retrieval.
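The two-stage shape can be sketched as follows. The scoring function is a deliberately crude placeholder (token overlap), not a cross-encoder; in a real pipeline it would be replaced by a call to a reranking model. The candidate texts and function names are hypothetical.

```python
def overlap_score(query, document):
    """Placeholder relevance score: Jaccard overlap of lowercase tokens.

    This is NOT a cross-encoder. It only stands in for a model that scores
    each (query, document) pair jointly.
    """
    q_tokens = set(query.lower().split())
    d_tokens = set(document.lower().split())
    union = q_tokens | d_tokens
    return len(q_tokens & d_tokens) / len(union) if union else 0.0


def rerank(query, candidates, score_fn, top_n=2):
    """Score every first-stage candidate against the query and keep top_n."""
    ordered = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ordered[:top_n]


# Hypothetical first-stage output, listed in the order ANN returned it.
candidates = [
    "refund policy for annual plans",
    "how to cancel a subscription",
    "update your billing address",
]

final = rerank("cancel subscription", candidates, overlap_score, top_n=2)
print(final)
```

Because `score_fn` is passed in, the placeholder can be swapped for a model-backed scorer without changing the pipeline around it.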

Use cases

RAG retrieval: finding relevant documents for LLM-grounded answers
Customer support knowledge base search ('cancel my subscription' → policy doc on cancellation)
Code search across large codebases (semantic search over function descriptions)
Recommendation systems (semantically-similar products, articles, content)
Multilingual search where queries and documents are in different languages

Examples in production

OpenAI

text-embedding-3-large and text-embedding-3-small are widely-used embedding models for production semantic search; lower-cost than first-generation embeddings with better quality.


Cohere

Cohere Embed and Cohere Rerank are popular production retrieval components: Rerank in particular is widely used as the second stage of hybrid retrieval pipelines.


BEIR benchmark

Standard benchmark for evaluating retrieval quality across 18 datasets spanning diverse tasks and domains; widely cited in production retrieval system design.


Semantic Search compared to alternatives

Alternative	Choose Semantic Search when	Choose alternative when
Keyword search (BM25): statistical scoring based on term frequency and document length	Use semantic search to capture meaning beyond exact terms	Use keyword search (or hybrid) when exact terms matter (proper nouns, codes, rare words)
Lexical search with synonyms: keyword search expanded with manually curated synonym lists	Use semantic search for scalable meaning-based retrieval with no synonym maintenance	Use synonym-expanded lexical search when you need full transparency and explainability of matches

Common pitfalls

Using pure semantic search for corpora with lots of proper nouns or technical terms: keyword signals get lost
Skipping reranking: first-stage ANN results are often poorly ordered within the top 5
Using off-the-shelf embedding models without evaluation on your specific domain
Not normalizing embeddings: some models require it, others don't; check provider docs
Changing embedding models without planning the migration: vectors from different models are not comparable, so the entire corpus has to be re-embedded and reindexed
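The normalization pitfall is easy to demonstrate. The sketch below uses made-up two-dimensional vectors to show that a raw dot product rewards vector length, while the dot product of L2-normalized vectors equals cosine similarity. Whether a given model already returns unit-length vectors is something to confirm in its documentation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Two hypothetical vectors pointing in the same direction with different lengths.
a = [3.0, 4.0]
b = [6.0, 8.0]

raw = dot(a, b)                                      # inflated by vector length
normalized = dot(l2_normalize(a), l2_normalize(b))   # close to 1.0 (cosine)

print(raw, normalized)
```

If an index is configured for dot-product distance but the stored vectors are not unit length, longer vectors can outrank more relevant ones.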

Related terms

Embedding, Vector Database, RAG, Reranking


FAQ

Questions about Semantic Search.

Which embedding model should we use?

For most production cases, OpenAI text-embedding-3-large or Cohere Embed v3 are reasonable defaults: strong quality, well-supported, reasonable cost. For domain-specific or multilingual needs, evaluate Voyage AI embeddings, BGE-large (open source), or domain-tuned variants. Always benchmark on your specific data before committing to a model: generic benchmarks don't predict performance in your domain.
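Benchmarking on your own data can start small. The sketch below computes mean recall@k over a labeled eval set. The doc ids, relevance labels, and the two "models" are invented for illustration, and recall@k is only one of several reasonable metrics (nDCG and MRR are common alternatives).

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant doc ids that appear in the top-k retrieved ids."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Hypothetical results for two candidate models on the same two eval queries.
model_a_runs = [
    (["d1", "d7", "d3"], {"d1", "d3"}),
    (["d9", "d2", "d4"], {"d2"}),
]
model_b_runs = [
    (["d7", "d8", "d1"], {"d1", "d3"}),
    (["d2", "d9", "d4"], {"d2"}),
]

print("model A recall@2:", mean_recall_at_k(model_a_runs, k=2))
print("model B recall@2:", mean_recall_at_k(model_b_runs, k=2))
```

Running every candidate model over the same queries and relevance labels keeps the comparison fair. The eval set should be drawn from real user queries where possible.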

