What is Semantic Search?
Semantic search is a retrieval technique that finds documents by meaning rather than by exact keyword match. It converts both queries and documents into high-dimensional vector embeddings and ranks documents by their vector similarity to the query, so it can surface conceptually related results even when the query and the document share no terms.
Overview
Semantic search is the retrieval foundation under most modern AI applications: RAG systems, recommendation engines, knowledge bases, customer support deflection, and code search. It replaces (or augments) traditional keyword-based search (BM25, TF-IDF) by modeling semantic similarity: 'cancel my subscription' and 'how do I unsubscribe' are recognized as similar even though they share no content words. The technology has matured rapidly since 2020: embedding models from OpenAI, Cohere, and Voyage, along with open-source options such as sentence-transformers and BGE, deliver production-quality semantic understanding at low cost. In production, the most reliable architecture is hybrid retrieval (semantic search combined with keyword search), which captures both meaning and exact-match signals.
How semantic search works
Three components. (1) Embedding model: converts text into a fixed-dimension vector that represents its semantic meaning; common dimensions are 384, 768, 1536, or 3072. Documents and queries are embedded with the same model. (2) Vector database: stores millions or billions of document embeddings indexed for efficient nearest-neighbor search; common choices include Pinecone, Qdrant, Weaviate, pgvector, and Milvus. (3) Approximate nearest neighbor (ANN) algorithm: finds the K most-similar document vectors to a query vector in milliseconds; common algorithms are HNSW, IVF, and ScaNN. At query time: embed the query, run ANN search against the indexed documents, return the top K matches ranked by similarity score. Production systems usually add reranking and metadata filtering on top of base retrieval.
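The query-time flow described above can be sketched in a few lines. This is a minimal illustration, not a production design: the three-dimensional vectors are hand-written stand-ins for real embeddings, the document ids are hypothetical, and the search is exact brute force over every document rather than an ANN index such as HNSW.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Exact nearest-neighbor search: score every document, keep the best k."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical embeddings, hand-written for illustration only.
index = {
    "cancel-policy": [0.9, 0.1, 0.0],
    "billing-faq": [0.6, 0.6, 0.1],
    "api-reference": [0.0, 0.2, 0.9],
}
# Stand-in for the embedding of the query 'how do I unsubscribe'.
query_vec = [0.8, 0.2, 0.1]

results = top_k(query_vec, index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

In a real system the embedding model produces the vectors and the vector database replaces `top_k`; an ANN index trades a small amount of recall for not having to score every document.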
Hybrid search: semantic + keyword
Pure semantic search has known weaknesses: it misses exact-match signals (proper nouns, product codes, specific numbers, rare technical terms) that should rank highly when present. Pure keyword search misses semantic similarity. Hybrid search combines both: run semantic search and keyword search (typically BM25) in parallel, then fuse the two rankings using a technique such as Reciprocal Rank Fusion (RRF) or a weighted combination of scores. Benchmarks such as BEIR consistently show hybrid outperforming either approach alone, typically by 10-30% in retrieval quality. At BearPlex, hybrid is the default for production RAG; pure semantic is the exception, used only when the corpus contains few proper nouns.
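As a rough illustration of the fusion step, Reciprocal Rank Fusion can be sketched as follows. Each document's fused score is the sum of 1 / (k + rank) over every ranked list it appears in. The document ids are hypothetical, and k=60 is a commonly used default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results from the two retrievers, best match first.
semantic_hits = ["doc-a", "doc-b", "doc-c"]
keyword_hits = ["doc-c", "doc-a", "doc-d"]

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)  # ['doc-a', 'doc-c', 'doc-b', 'doc-d']
```

RRF uses only rank positions, so it needs no calibration between BM25 scores and cosine similarities, which live on different scales. A weighted combination requires normalizing those scores first.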
Reranking for the last mile
ANN retrieval returns the top K (typically 50-200) documents that are roughly relevant. Reranking is a second scoring stage that examines each candidate more carefully and reorders the candidates by relevance. Cross-encoder rerankers (Cohere Rerank, BGE-reranker) score each query-document pair jointly rather than comparing independently computed embeddings, which is much more accurate but slower. ColBERT, a late-interaction model rather than a true cross-encoder, is a common alternative that sits between the two in cost. Production pattern: ANN retrieves the top 100, the reranker scores them precisely, and the top 5-10 reranked documents go to the LLM. The combined pipeline (ANN + reranker) consistently outperforms either stage alone and is the de facto standard for high-quality production retrieval.
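The two-stage pattern above might look like the sketch below. The `toy_pair_score` function is a hypothetical stand-in for a real reranker model: it only measures word overlap, so that the example runs without any model or API. A real pipeline would call a cross-encoder at that point.

```python
def rerank(query, candidates, score_pair, top_n=2):
    """Second stage: score each (query, document) pair jointly, keep the best top_n."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:top_n]]

def toy_pair_score(query, doc):
    """Hypothetical stand-in for a cross-encoder: fraction of query words found in the doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)

# Pretend these came back from first-stage ANN retrieval, in ANN order.
candidates = [
    "pricing plans and billing cycles",
    "how to cancel your subscription",
    "subscription upgrade options",
]

reranked = rerank("cancel subscription", candidates, toy_pair_score, top_n=2)
print(reranked)
```

Because the scorer is passed in as a function, the first-stage retriever and the reranker can be swapped independently, which is useful when evaluating rerankers against the same candidate set.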
Use cases
- RAG retrieval: finding relevant documents for LLM-grounded answers
- Customer support knowledge base search ('cancel my subscription' → policy doc on cancellation)
- Code search across large codebases (semantic search over function descriptions)
- Recommendation systems (semantically-similar products, articles, content)
- Multilingual search where queries and documents are in different languages
Examples in production
OpenAI
text-embedding-3-large and text-embedding-3-small are widely used embedding models for production semantic search; they cost less than first-generation embedding models and deliver better quality.
Cohere
Cohere Embed and Cohere Rerank are popular production retrieval components: Rerank in particular is widely used as the second stage of hybrid retrieval pipelines.
BEIR benchmark
A standard benchmark for evaluating retrieval quality across 18 datasets spanning diverse domains and tasks; widely cited in production retrieval system design.
Semantic Search compared to alternatives
| Alternative | Choose Semantic Search when | Choose alternative when |
|---|---|---|
| Keyword search (BM25): statistical scoring based on term frequency and document length | Use semantic search to capture meaning beyond exact terms | Use keyword search (or hybrid) when exact terms matter (proper nouns, codes, rare words) |
| Lexical search with synonyms: keyword search expanded with manually curated synonym lists | Use semantic search for scalable meaning-based retrieval with no synonym maintenance | Use synonym-expanded lexical search when you need full transparency and explainability of matches |
Common pitfalls
- Using pure semantic search for corpora with lots of proper nouns or technical terms: keyword signals get lost
- Skipping reranking: first-stage ANN results are often poorly ordered within the top 5
- Using off-the-shelf embedding models without evaluation on your specific domain
- Not normalizing embeddings: some models require it, others don't; check provider docs
- Changing embedding models without planning the migration: vectors from different models are not comparable, so the entire corpus has to be re-embedded and reindexed
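The normalization pitfall is easy to see with a small example. The sketch below uses made-up two-dimensional vectors: before normalization, a raw dot product rewards vector length, while after L2 normalization the dot product equals cosine similarity and only direction matters.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up vectors: `long_doc` points the same way as `query` but is twice as long.
query = [3.0, 4.0]
long_doc = [6.0, 8.0]
other_doc = [4.0, 3.0]

# Raw dot products are dominated by magnitude.
print(dot(query, long_doc), dot(query, other_doc))

# After normalization, the scores reflect direction only.
q, d1, d2 = l2_normalize(query), l2_normalize(long_doc), l2_normalize(other_doc)
print(round(dot(q, d1), 6), round(dot(q, d2), 6))
```

Whether this step is needed depends on the model and on the distance metric the index is configured with, so the provider documentation remains the authority.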
Questions about Semantic Search.
What embedding dimension should I use?
1536 dimensions (the OpenAI text-embedding-3-small default) is a strong default for most production cases. 3072 (text-embedding-3-large) gets marginally better quality at higher cost and storage. Smaller dimensions (384 or 768, from open-source models) save cost and storage with a modest quality loss that is often acceptable. Right-size based on retrieval quality measured on your eval set.
How fast is semantic search at scale?
Production ANN systems serve sub-100ms p95 query latency on indexes of 10-100M vectors. At 1B+ vectors, latency stays under 200ms with appropriate hardware and indexing. The bottleneck is usually the network round-trip, not the search itself.
When is pgvector enough, and when do I need a dedicated vector database?
pgvector handles up to 5-10M vectors well in production if you're already running Postgres. Above that, dedicated vector databases (Pinecone, Qdrant, Weaviate, Milvus) start to win on latency, scaling, and metadata filtering performance. For most growth-stage SaaS, pgvector is the right starting point.
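To make the storage side of the dimension trade-off concrete, here is a back-of-the-envelope estimate. It assumes 4-byte float32 values and counts only the raw vectors; index structures (for example an HNSW graph) and metadata add overhead that varies by database and is not modeled here.

```python
def raw_vector_storage_gb(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the vectors alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_vectors * dimensions * bytes_per_value / 1e9

# 10M vectors at each of the common dimensions mentioned in this article.
for dims in (384, 768, 1536, 3072):
    print(f"{dims:>4} dims: {raw_vector_storage_gb(10_000_000, dims):.2f} GB")
# Prints 15.36, 30.72, 61.44, and 122.88 GB respectively.
```

Storage scales linearly with dimension, so doubling the dimension doubles the raw footprint at any corpus size.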