How to Build AI-Powered Search for E-Commerce Product Discovery

Most e-commerce search boxes are still running keyword matching from 2010. Here is a practical guide to replacing them with AI-powered semantic search, vector embeddings, and personalized ranking that actually converts browsers into buyers.

Nate Laquis

Founder & CEO

Why Your E-Commerce Search Is Losing You Money

If your site search converts at two to three percent while your top navigation converts at six, you have a broken search engine, not a traffic problem. Studies consistently show that visitors who use site search are two to four times more likely to purchase than those who do not. That means every failed search query is a direct revenue leak.

The core problem is that most e-commerce platforms ship with BM25-based keyword search baked in. BM25 is a solid algorithm from 1994 that ranks documents by how often the query terms appear in them, weighted by how rare each term is across the catalog and normalized for document length. It has no concept of meaning. A shopper who types "running shoes for wide feet" gets results ranked by how often those exact words appear in product titles and descriptions. If your catalog uses "wide fit" instead of "wide feet," those products rank poorly or never surface.

The good news is that building genuinely intelligent search is now an engineering problem you can solve in weeks, not years. The components you need, including vector databases, embedding APIs, and re-ranking models, are mature, well-documented, and affordable. This guide walks you through every layer of a production-grade AI search stack, from semantic retrieval to visual search to A/B testing your quality metrics.

Before we get into implementation, it helps to understand why AI is transforming e-commerce at every touchpoint right now. Search is the highest-leverage place to start because it directly touches purchase intent.

Semantic Search vs. Keyword Search: What Actually Changes

Keyword search asks: does this document contain the query terms? Semantic search asks: does this document mean the same thing as the query? That shift in framing changes everything about how your catalog needs to be represented and retrieved.

In a keyword system, products are bags of words. In a semantic system, products are points in a high-dimensional vector space where similar meanings cluster together. When a user searches for "cozy winter gift for mom," a semantic system can return a cashmere throw blanket even if the product description never uses the word "cozy," "winter," or "gift." The embedding model learned those associations from the enormous amount of text it was trained on.

The practical gap shows up most sharply in three scenarios. First, synonym handling: a shopper looking for "trousers" should find your "pants" and "chinos." Keyword search requires you to maintain a manual synonym dictionary forever. Semantic search handles this automatically. Second, concept search: "gifts under $50 for a ten-year-old who likes science" is a concept, not a set of matching tokens. Third, typo and slang tolerance: "Nikke shoes" and "Addidas sneeks" should still surface the right products.

The tradeoff is that semantic search alone can occasionally miss exact-match precision. If someone types a specific product SKU or a very precise phrase like "14-inch laptop bag with TSA-approved lock," keyword matching often wins. This is why production systems use hybrid search, combining both approaches and re-ranking the merged result set.

How Vector Embeddings Represent Your Product Catalog

An embedding is a numerical vector, typically a few hundred to a few thousand dimensions, that captures the semantic meaning of a piece of text or an image. OpenAI's text-embedding-3-large model, for example, produces 3072-dimensional vectors by default. Cohere's embed-v3 and Voyage AI's voyage-large-2 are strong alternatives worth benchmarking for your domain.

For an e-commerce catalog, you generate one embedding per product. The input to the model is typically a concatenation of the product title, category breadcrumb, key attributes, and a cleaned version of the product description. You want to include enough context to distinguish a "leather wallet" from a "leather belt," but you do not want to embed your entire boilerplate returns policy.
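As a rough illustration of that concatenation step, here is a minimal sketch. The field names (`title`, `category_path`, `attributes`, `description`), the separators, and the 500-character cap are all assumptions made for the example, not a required schema.

```python
import re

def build_embedding_text(product: dict, max_description_chars: int = 500) -> str:
    """Concatenate the fields worth embedding into a single string."""
    # Strip HTML tags, collapse whitespace, and cap the description length.
    description = re.sub(r"<[^>]+>", " ", product.get("description", ""))
    description = re.sub(r"\s+", " ", description).strip()[:max_description_chars]

    attributes = ", ".join(
        f"{name}: {value}" for name, value in product.get("attributes", {}).items()
    )
    parts = [
        product.get("title", ""),
        " > ".join(product.get("category_path", [])),  # category breadcrumb
        attributes,
        description,
    ]
    # Skip empty fields so the text has no dangling separators.
    return " | ".join(part for part in parts if part)

# Hypothetical catalog record used only for illustration.
wallet = {
    "title": "Bifold Leather Wallet",
    "category_path": ["Accessories", "Wallets"],
    "attributes": {"material": "leather", "color": "brown"},
    "description": "<p>Slim bifold   wallet with six card slots.</p>",
}
embedding_text = build_embedding_text(wallet)
```

The resulting string is what you would send to your embedding model. Boilerplate such as shipping and returns text should be removed upstream, before it reaches the description field.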

Once you have embeddings for your full catalog, you store them in a vector database. At query time, you embed the user's search query using the same model, then perform approximate nearest neighbor search to find the products whose vectors are closest to the query vector. This retrieval step is fast, typically under 50 milliseconds for catalogs up to a few million products, because vector databases use indexing structures like HNSW (Hierarchical Navigable Small World graphs) to avoid brute-force comparison.
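To make the retrieval step concrete, the sketch below does the brute-force version of nearest neighbor search with cosine similarity. The three-dimensional vectors and product IDs are toy values invented for the example. A vector database replaces this linear scan with an approximate index such as HNSW, but the ranking it approximates is the same.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_products(query_vector, catalog_vectors, top_k=2):
    """Exact (brute-force) nearest neighbors; fine for a demo, slow at scale."""
    scored = [
        (product_id, cosine_similarity(query_vector, vector))
        for product_id, vector in catalog_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy embeddings; real ones come from your embedding model.
catalog_vectors = {
    "throw-blanket": [0.9, 0.1, 0.2],
    "wool-scarf": [0.7, 0.3, 0.1],
    "garden-hose": [0.0, 0.2, 0.9],
}
query_vector = [0.9, 0.1, 0.1]  # stand-in for the embedded user query
top_matches = nearest_products(query_vector, catalog_vectors)
```

Here the blanket and the scarf come back ahead of the garden hose because their vectors point in nearly the same direction as the query vector.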

Choosing Your Search Infrastructure Stack

The infrastructure choice is the decision that is hardest to reverse, so let's be direct about the options and who they are right for.

Algolia is the fastest path to production if you value developer experience over infrastructure control. Its NeuralSearch product combines vector and keyword search behind a single API, and its index pipeline handles embedding generation for you. The downside is cost at scale and limited model flexibility. Expect to pay $500 to $2,000 per month once you exceed the starter tier on a real catalog. If you are doing less than five million monthly search operations and want to move fast, Algolia is often the right call.

Typesense is the open-source alternative to Algolia with a nearly identical API surface. You can self-host on a $50 per month VM, which makes it compelling for budget-conscious teams. Typesense added hybrid search support in v0.25. The tradeoff is that you own the infrastructure, which means you manage upgrades, backups, and scaling.

Elasticsearch with the kNN vector search capability (introduced in v8.0) is the right choice for teams already running the Elastic stack who need deep customization. The learning curve is steep and the operational overhead is real, but you get the most flexibility for complex relevance tuning. Elasticsearch shines when you have a dedicated platform engineering team.

Vespa is the least well-known but arguably the most powerful option for large catalogs. Built by Yahoo and used internally by several major retailers, Vespa natively handles hybrid retrieval, re-ranking, and real-time updates in a single engine. It supports ONNX model inference directly in the serving layer, so you can run custom re-rankers without an additional microservice. The operational complexity is high, but if you are processing tens of millions of queries per day, Vespa's performance characteristics are hard to beat.

For most teams building on an existing cloud infrastructure, our recommended starting point is a managed vector database like Pinecone or Weaviate Cloud for semantic retrieval, combined with your existing Elasticsearch or Algolia instance for keyword retrieval. Bridge the two with a lightweight re-ranking service using a cross-encoder model. This architecture gives you semantic capability without scrapping your existing search infrastructure on day one.

Implementing Hybrid Search and Re-Ranking

Hybrid search is not a single algorithm. It is a pattern for combining results from multiple retrieval systems into a single ranked list. Getting this right is where most teams spend the most time, and where most off-the-shelf solutions cut corners.

Reciprocal Rank Fusion

The simplest and often the best-performing fusion method is Reciprocal Rank Fusion (RRF). For each candidate product, you sum a score derived from its rank in each retrieval list: score = sum(1 / (k + rank)), where the sum runs over every list the product appears in, rank starts at 1, and k is typically 60. RRF does not require score normalization, which matters because BM25 scores and cosine similarity scores live on completely different scales. It is also surprisingly robust, often matching or beating weighted score combinations in benchmark evaluations without any tuning.

To implement RRF, run your keyword query and your vector query in parallel. Merge the candidate sets. Score each candidate using the RRF formula across both ranked lists. Sort descending. Return the top N. This can be done in under 100 lines of Python and adds roughly 5 to 15 milliseconds to your query latency if you parallelize the two retrieval calls.
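The steps above can be sketched in a few lines. In this example the two ranked lists are hard-coded stand-ins for the output of your keyword and vector retrieval calls, and the SKU identifiers are made up.

```python
def reciprocal_rank_fusion(ranked_lists, k=60, top_n=10):
    """Fuse several ranked lists of product IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranked in ranked_lists:
        # Ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, product_id in enumerate(ranked, start=1):
            scores[product_id] = scores.get(product_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return fused[:top_n]

# Stand-ins for the results of the two retrieval calls.
keyword_results = ["sku-1", "sku-2", "sku-3"]
vector_results = ["sku-3", "sku-1", "sku-4"]

fused = reciprocal_rank_fusion([keyword_results, vector_results])
fused_ids = [product_id for product_id, _ in fused]
```

Products that appear in both lists (sku-1 and sku-3) rise above products that appear in only one, regardless of how the underlying BM25 or cosine scores were scaled.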

Cross-Encoder Re-Ranking

RRF gives you a good first-pass ranking. For the final 10 to 20 results you will actually show the user, you can dramatically improve relevance with a cross-encoder re-ranker. Unlike the bi-encoder models used for embedding generation, a cross-encoder takes both the query and a product representation as a single input and outputs a relevance score. This joint encoding allows the model to reason about query-product interactions, like whether "slim fit" in the query conflicts with "relaxed cut" in the product description.

Cohere's Rerank API and Jina AI's reranker are the easiest ways to add this without training your own model. You pass the top 50 to 100 candidates from your hybrid retrieval step, and the API returns them sorted by relevance. Latency for 50 candidates is typically 80 to 150 milliseconds. For products where ranking quality is worth the cost, this additional pass consistently lifts click-through and conversion rates by 10 to 20 percent in controlled experiments.

Business Logic Layers

Pure relevance scoring is never the whole story. After your re-ranker sorts by semantic relevance, you still need to apply business rules: boost in-stock products, apply margin-based promotions, suppress out-of-region products, and surface sponsored listings. Build this as a final scoring pass that applies multiplicative or additive boosts to the re-ranked list. Keep it separate from your ML pipeline so merchandisers can adjust it without touching model code.
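One way to keep that final pass separate from the ML pipeline is a small rule table applied after re-ranking. The sketch below is an assumption-heavy illustration: the `Candidate` fields, the rule names, and the boost values are all hypothetical, and real values would come from your merchandising team.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    product_id: str
    relevance: float            # score from the re-ranker
    in_stock: bool = True
    ships_to_region: bool = True
    sponsored: bool = False

# Hypothetical multiplicative boosts that merchandisers could edit as config.
BOOST_RULES = {"in_stock": 1.2, "sponsored": 1.1}

def apply_business_rules(candidates, rules=BOOST_RULES):
    """Filter and boost re-ranked candidates; returns (product_id, score) pairs."""
    adjusted = []
    for candidate in candidates:
        if not candidate.ships_to_region:
            continue  # suppress out-of-region products entirely
        score = candidate.relevance
        if candidate.in_stock:
            score *= rules["in_stock"]
        if candidate.sponsored:
            score *= rules["sponsored"]
        adjusted.append((candidate.product_id, score))
    adjusted.sort(key=lambda pair: pair[1], reverse=True)
    return adjusted

reranked = [
    Candidate("a", 0.90, in_stock=False),
    Candidate("b", 0.80),
    Candidate("c", 0.95, ships_to_region=False),
    Candidate("d", 0.78, sponsored=True),
]
final_ranking = apply_business_rules(reranked)
```

Because the rules live in a plain dictionary, they can be loaded from a config file or admin UI without redeploying any model code.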

Query Understanding, NLP, and Autocomplete

Even the best retrieval system fails if it does not understand what the user actually meant. Query understanding is the preprocessing layer that transforms raw user input into a structured retrieval signal before it ever hits your index.

Intent Classification

Start by classifying query intent. A navigational query ("Nike Air Max 90") needs different handling than a discovery query ("comfortable office chair") or a comparison query ("waterproof jacket vs windbreaker"). A simple fine-tuned classifier on your historical query logs can categorize queries into four or five intents with 85 to 90 percent accuracy. Each intent maps to a different retrieval strategy: navigational queries favor exact-match keyword search, discovery queries favor semantic retrieval, and so on.

Entity Extraction and Query Rewriting

Named entity recognition lets you identify brands, colors, sizes, and categories mentioned in the query and route them to facet filters automatically. "Blue Nike running shoes size 10" should resolve to a brand filter for Nike, a color filter for blue, a category filter for running shoes, and a size filter for 10, rather than being treated as a six-token keyword query. spaCy with a custom NER model trained on your product taxonomy handles this well, and the inference overhead is negligible.
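To show the shape of the output you want from this step, here is a deliberately simple dictionary-and-regex sketch. It is a stand-in for a trained NER model, not an example of one, and the `BRANDS`, `COLORS`, and `CATEGORIES` vocabularies are tiny hypothetical samples of what you would load from your product taxonomy.

```python
import re

# Hypothetical vocabularies; in practice these come from your taxonomy.
BRANDS = {"nike", "adidas"}
COLORS = {"blue", "black", "red"}
CATEGORIES = {"running shoes", "sneakers"}

def extract_filters(query: str) -> dict:
    """Turn a raw query into facet filters plus leftover free text."""
    text = query.lower()
    filters = {}

    # Sizes follow the literal word "size", e.g. "size 10" or "size 9.5".
    size_match = re.search(r"\bsize\s+(\d+(?:\.\d+)?)\b", text)
    if size_match:
        filters["size"] = size_match.group(1)
        text = text.replace(size_match.group(0), " ")

    # Match multi-word categories before single tokens, longest first.
    for category in sorted(CATEGORIES, key=len, reverse=True):
        if category in text:
            filters["category"] = category
            text = text.replace(category, " ")
            break

    remaining = []
    for token in text.split():
        if token in BRANDS:
            filters["brand"] = token
        elif token in COLORS:
            filters["color"] = token
        else:
            remaining.append(token)

    # Whatever is left goes to the retrieval query as free text.
    filters["free_text"] = " ".join(remaining)
    return filters

parsed = extract_filters("Blue Nike running shoes size 10")
```

A trained model earns its keep on the cases this sketch cannot handle, such as misspelled brands or colors that are also product names.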

Query rewriting addresses the vocabulary mismatch between how users talk and how your catalog is written. If users frequently search "couch" but your catalog says "sofa," a simple synonym mapping handles it. For more complex cases, a small seq2seq model can rewrite noisy queries into cleaner forms before retrieval. OpenAI's GPT-4o-mini is useful here: you can prompt it to expand or clarify ambiguous queries at a cost well under a tenth of a cent per query.

Autocomplete That Does Not Embarrass You

Autocomplete is often the first AI interaction a user has on your site, and most implementations are still serving alphabetically sorted prefix completions from a trie. A better approach: train a completion model on your historical search click data to surface the completions that statistically lead to purchases, not just the most common queries. Algolia's Query Suggestions feature does a version of this automatically. If you are rolling your own, weight completion candidates by their conversion rate in the last 30 days, not just their query frequency. You will surface "blue suede boots women's size 8 wide" before "blue" even if "blue" has ten times the raw frequency, because the longer, more specific query has a higher purchase probability.
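If you are rolling your own, the conversion-weighted ordering might look like the sketch below. The 30-day statistics are invented numbers, and the `min_searches` floor is an assumption added so that a query with two purchases out of five searches does not jump to the top on noise.

```python
def rank_completions(prefix, stats, min_searches=20, top_n=5):
    """Order prefix matches by conversion rate instead of raw frequency.

    stats maps query -> (search_count, purchase_count) for the last 30 days.
    """
    matches = []
    for query, (searches, purchases) in stats.items():
        if query.startswith(prefix) and searches >= min_searches:
            matches.append((query, purchases / searches))
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return [query for query, _ in matches[:top_n]]

# Illustrative numbers only.
last_30_days = {
    "blue": (5000, 100),                                 # 2% conversion
    "blue suede boots women's size 8 wide": (500, 60),   # 12% conversion
    "blue jeans": (1200, 84),                            # 7% conversion
    "blue velvet ottoman": (5, 2),                       # too few searches to trust
}
suggestions = rank_completions("blue", last_30_days)
```

In this toy data "blue" has ten times the search volume of the boots query but still ranks last, because far fewer of those searches end in a purchase.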

Personalized Ranking and Faceted Navigation

Search relevance is not the same for every user. Someone who has spent the last six sessions browsing premium cookware expects different ranking than a first-time visitor who landed from a "cheap pots" paid search ad. Personalized ranking layers user context on top of your semantic relevance scores to produce results tuned to the individual.

Building a User Signal Pipeline

The inputs to personalized ranking are behavioral signals: clicks, add-to-carts, purchases, dwell time, and explicit ratings. You need a real-time event pipeline to collect these signals and a feature store to make them available to your ranking model at query time. A stack of Kafka or Amazon Kinesis for event ingestion, Redis for the feature store, and a lightweight gradient-boosted tree model (XGBoost or LightGBM) for the ranking logic handles millions of queries per day without excessive infrastructure cost.

The ranking model takes as input a set of query-product features (BM25 score, vector similarity, product popularity, price) combined with user-product affinity features (has the user viewed this category before, has the user purchased from this brand, what is the user's typical price range). It outputs a ranking score that combines relevance with personalization. You train this model offline on logged search sessions where you know which result the user eventually clicked and purchased.

Faceted Navigation as a Search Accelerator

Faceted navigation deserves more credit than it gets. Well-designed facets let users narrow intent without typing, and they are especially valuable on mobile where retyping queries is friction. The AI opportunity in facets is in dynamic facet generation and ordering: rather than showing the same six filters to every user on every query, surface the facets that are statistically most likely to lead to a click for this specific query context.

If 80 percent of users who search "running shoes" immediately filter by size and gender, those two facets should appear first and expanded by default. If users who search "office chair" mostly filter by price and then by lumbar support rating, those facets come first. You can drive this behavior from your click logs with a simple conditional frequency analysis, no deep learning required.
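The conditional frequency analysis really is this small. The sketch below assumes your click logs can be reduced to `(query, facet_clicked)` pairs, which is a simplification of whatever your event schema actually looks like.

```python
from collections import Counter, defaultdict

def facet_order_by_query(click_log, top_n=3):
    """For each query, list the facets users click most often, most frequent first."""
    counts = defaultdict(Counter)
    for query, facet in click_log:
        counts[query][facet] += 1
    return {
        query: [facet for facet, _ in facet_counts.most_common(top_n)]
        for query, facet_counts in counts.items()
    }

# Toy log of (query, facet the user clicked) pairs.
click_log = [
    ("running shoes", "size"), ("running shoes", "gender"), ("running shoes", "size"),
    ("running shoes", "brand"), ("running shoes", "gender"), ("running shoes", "size"),
    ("office chair", "price"), ("office chair", "lumbar support"), ("office chair", "price"),
]
facet_order = facet_order_by_query(click_log)
```

For long-tail queries with too little data, a reasonable fallback is to aggregate the same counts at the category level instead of the query level.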

Visual Search and Multimodal Discovery

Text search has a fundamental ceiling: some shopping intent cannot be expressed in words. "I want a sofa that looks like the one in this photo" is not a query any text system can answer. Visual search closes this gap by letting users upload an image and retrieve visually similar products from your catalog.

Building Visual Search with CLIP Embeddings

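OpenAI's CLIP model and open reimplementations such as OpenCLIP pair an image encoder with a text encoder, trained so that both produce embeddings in one shared vector space (Salesforce's BLIP-2 is a related vision-language model worth evaluating). This means you can embed both your product images and user-uploaded query images into the same index and retrieve by visual similarity using the same vector database infrastructure you built for semantic text search. Note that the shared space only covers text encoded by the multimodal model's own text encoder, not vectors from a separate text embedding model.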

To set this up, run your product catalog images through CLIP's image encoder and store the resulting vectors in your vector database, in a field or index separate from the vectors produced by your text embedding model. At query time, encode the user's uploaded image with the same model and retrieve the nearest neighbors. You can combine visual and text signals by averaging the query image embedding with a CLIP text-encoder embedding of any text the user provides, which handles queries like "show me something like this but in green."
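The averaging step can be sketched as follows. The vectors here are three-dimensional toys standing in for real image and text embeddings from the same multimodal model, and the `text_weight` parameter is an assumption added so you can tune how much the text nudges the image query.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def combine_query_embeddings(image_vec, text_vec, text_weight=0.5):
    """Blend an image embedding and a text embedding from the same model.

    Both inputs are normalized first so neither dominates just because its
    raw magnitude is larger; the result is normalized again for cosine search.
    """
    image_unit = l2_normalize(image_vec)
    text_unit = l2_normalize(text_vec)
    blended = [
        (1.0 - text_weight) * i + text_weight * t
        for i, t in zip(image_unit, text_unit)
    ]
    return l2_normalize(blended)

# Toy stand-ins for encoder outputs.
image_embedding = [3.0, 4.0, 0.0]   # the uploaded photo
text_embedding = [0.0, 0.0, 2.0]    # "but in green"
query_embedding = combine_query_embeddings(image_embedding, text_embedding)
```

A lower `text_weight` keeps results visually closer to the uploaded photo, while a higher one lets the text modifier pull harder.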

The operational requirements are more demanding than text search: image embeddings are computationally expensive to generate, and your product images need to be standardized (consistent background, consistent crop) to produce reliable embeddings. Budget for a preprocessing pipeline that normalizes images before embedding. Services such as AWS Rekognition or Google Vision AI can locate the product in each photo so you can crop consistently at scale. If your product photography is inconsistent, removing backgrounds generally calls for a dedicated segmentation model or background-removal service.

Multimodal Catalog Enrichment

Vision models can also enrich your product catalog automatically. If your suppliers provide products without color tags or material attributes, a vision model can infer these from product images. This closes the attribute gap that kills faceted navigation on catalog sections where metadata is sparse. A GPT-4o call per product image costs roughly one cent and can generate structured attributes that would take a human data entry team weeks to produce. For a catalog of 100,000 products, this is a $1,000 one-time investment that pays back on the first day of improved search quality.

A/B Testing Search Quality and Measuring What Matters

The biggest mistake teams make when building AI search is treating it as a one-time implementation rather than an ongoing optimization loop. Your search quality will degrade as your catalog evolves, user behavior shifts, and seasonal trends change. You need a measurement system before you can improve anything.

Offline Evaluation Metrics

Before you run any live experiments, establish offline evaluation baselines. The standard metrics for search quality are Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR). Both require a set of labeled queries where humans have rated which products are relevant to which queries. Even a small set of 200 to 500 queries with human relevance judgments gives you a reliable baseline to measure model changes against without touching production traffic.
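Both metrics are short enough to compute yourself. The sketch below uses the linear-gain form of DCG (some teams use 2^rel - 1 instead) and, as a simplifying assumption, computes the ideal ordering from the judged results that were actually returned rather than from every relevant product in the catalog. The relevance grades are invented.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))

def ndcg_at_k(relevances, k):
    """NDCG@k for one query, given graded relevance in ranked order."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

def mean_reciprocal_rank(result_lists):
    """Average of 1 / (position of the first relevant result) across queries."""
    total = 0.0
    for relevances in result_lists:
        for pos, rel in enumerate(relevances, start=1):
            if rel > 0:
                total += 1.0 / pos
                break  # only the first relevant hit counts
    return total / len(result_lists)

# Graded judgments (0 = irrelevant, 3 = perfect) in the order results were shown.
judged_queries = [
    [3, 2, 0, 1],
    [0, 0, 2, 0],
]
ndcg_first_query = ndcg_at_k(judged_queries[0], k=4)
mrr = mean_reciprocal_rank(judged_queries)
```

Run these over your labeled query set before and after each model change, and keep the per-query scores so you can see which queries regressed.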

You do not need to build this judgment set from scratch. Start with your zero-result queries (searches that returned nothing at all), your high-abandonment queries (users who searched and immediately left), and your high-conversion queries (to understand what good looks like). These three buckets give you a representative slice of your search quality landscape.

Online Experiment Design

For A/B testing, you need to randomize at the user or session level, not the query level. A user should have a consistent experience within a session to avoid confounding your metrics. The metrics you care about, in order of reliability, are: revenue per search session (most reliable, directly tied to business outcomes), add-to-cart rate from search results, click-through rate on search results, and search abandonment rate. Do not optimize for click-through rate alone; it can be gamed by ranking clickbait products over genuinely relevant ones.
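A common way to get stable user-level assignment is to hash the user ID together with an experiment name. The sketch below is one possible approach; the experiment name, the user ID format, and the 20 percent treatment share are placeholders.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.2) -> str:
    """Deterministically bucket a user so they see one variant for the whole test."""
    # Salting with the experiment name keeps buckets independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"

# The same user always lands in the same arm of a given experiment.
first_call = assign_variant("user-123", "rerank-v1")
second_call = assign_variant("user-123", "rerank-v1")

# Across many users, the treatment share approaches the configured 20 percent.
assignments = [assign_variant(f"user-{i}", "rerank-v1") for i in range(10000)]
observed_share = assignments.count("treatment") / len(assignments)
```

Because assignment needs no stored state, every service in the search path can compute the same variant independently from the user ID.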

Run experiments for a minimum of two weeks to capture day-of-week variation, and use Bonferroni correction if you are testing multiple variants simultaneously. A five percent lift in revenue per search session on 20 percent of your traffic, held for two weeks with p less than 0.05 and a sample size you fixed before the test started, is a result you can generally trust. Treat anything shorter or with weaker statistical power as noise until it replicates.
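As a hedged illustration of the multiple-variant case, the sketch below applies a Bonferroni-adjusted threshold to a two-proportion z-test. It uses add-to-cart rate because that is a simple proportion; revenue per session is a continuous metric and would need a different test (a t-test or bootstrap, for example). The variant names and counts are invented.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

def significant_variants(control, variants, alpha=0.05):
    """Compare each variant to control with a Bonferroni-adjusted threshold."""
    threshold = alpha / len(variants)  # split alpha across the comparisons
    results = {}
    for name, (conversions, sessions) in variants.items():
        p_value = two_proportion_p_value(control[0], control[1], conversions, sessions)
        results[name] = (p_value, p_value < threshold)
    return results

# (add-to-carts, search sessions); illustrative numbers only.
control = (3000, 40000)                 # 7.5%
variants = {
    "rrf_only": (790, 10000),           # 7.9%
    "rrf_plus_rerank": (860, 10000),    # 8.6%
}
results = significant_variants(control, variants)
```

With two variants the threshold drops from 0.05 to 0.025, so the smaller lift in this toy data does not clear the bar while the larger one does.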

Continuous Improvement Loops

The highest ROI activity in search quality improvement is mining your own query logs. Every week, look at your top 100 zero-result queries, your top 100 high-abandonment queries, and the queries where users reformulated before purchasing. These three lists tell you exactly where your system is failing and what to fix next. Feed these examples back into your synonym dictionaries, your query rewriting rules, and your fine-tuning datasets. If you do nothing else with this guide, build this weekly analysis process and you will see compounding improvement over time.
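A first version of that weekly report can be a short script over your search event log. The event fields below (`query`, `result_count`, `clicked`) are an assumed schema for illustration, and "abandonment" is simplified to "results were shown but nothing was clicked."

```python
from collections import Counter

def weekly_failure_report(search_events, top_n=100):
    """Summarize the most frequent zero-result and abandoned queries."""
    zero_result = Counter()
    abandoned = Counter()
    for event in search_events:
        # Normalize so "Nikke shoes" and "nikke shoes " count as one query.
        query = event["query"].strip().lower()
        if event["result_count"] == 0:
            zero_result[query] += 1
        elif not event["clicked"]:
            abandoned[query] += 1
    return {
        "zero_result": zero_result.most_common(top_n),
        "high_abandonment": abandoned.most_common(top_n),
    }

# Toy events; in practice these come from your analytics store.
events = [
    {"query": "Nikke shoes", "result_count": 0, "clicked": False},
    {"query": "nikke shoes ", "result_count": 0, "clicked": False},
    {"query": "wide fit trainers", "result_count": 12, "clicked": False},
    {"query": "office chair", "result_count": 40, "clicked": True},
]
report = weekly_failure_report(events)
```

The reformulation list needs session-level data (consecutive queries from the same session before a purchase), so it is left out of this sketch.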

To learn how this connects to broader recommendation and personalization infrastructure, read our guide on how to build an AI recommendation engine, which covers the collaborative filtering and session-based models that power "you might also like" and "frequently bought together" features.

It is also worth understanding how to build AI search across your entire product experience, beyond e-commerce catalog search, as you plan your roadmap.

If you are ready to stop losing revenue to a broken search box and want an experienced team to help you architect and implement a production-grade AI search system, book a free strategy call and we will map out a plan for your catalog in the first session.