
How a Global E-Commerce Platform Achieved 99%+ Search Relevance with Taskmonk
See how Taskmonk helped a leading global e-commerce platform achieve 99%+ search relevance accuracy by rebuilding their training data annotation pipeline at scale.

About the Client
A leading global e-commerce platform serves hundreds of millions of customers across multiple countries, with a catalog spanning electronics, fashion, furniture, FMCG, and more. Thousands of new products are added weekly by a seller base operating across diverse regions and languages.
Search is the primary discovery mechanism for the platform's customers. Query volumes run to hundreds of millions daily, ranging from high-frequency head queries to long-tail multilingual searches with category-specific intent. The accuracy of search ranking models directly determines conversion rates, customer satisfaction, and revenue.
The Challenge
What Is Search Relevance?
Search relevance determines whether a shopper who searches for 'running shoes for women' sees women's running shoes rather than formal heels or unrelated products. Ranking models learn from labeled training data: relevance judgments applied by annotators to query-product pairs. Inconsistent judgments produce inconsistent models.
The Problem
ML algorithm accuracy had stagnated at 40%. The cause was training data quality, not model architecture. Around 750 labelers spread across multiple vendors had no unified quality standard and no feedback loop back to model performance. Key challenges:
- 750+ annotators across multiple vendors with no unified relevance standard
- No category-specific guidelines; relevance criteria varied wildly by vertical
- No inter-annotator agreement tracking or correction mechanism
- Irrelevant search results eroding conversion rates across high-traffic queries
What Didn't Work?
Multi-vendor operations: Training data looked sufficient by volume but was systematically inconsistent by quality. No mechanism existed to detect or correct annotator drift across vendors.
Generic labeling platforms: No category-specific rubrics, no IAA tracking at the vertical level, no feedback loop to the training pipeline. Turnaround times and quality floors were both insufficient.
In-house attempts: Without tooling for consensus workflows and intelligent task routing, annotation quality depended entirely on individual judgment, which varied too widely at this scale to produce reliable training data.
The Solution
Taskmonk deployed their AI-powered annotation platform, working closely with the client's search and data operations teams to rebuild the relevance judgment pipeline from scratch.
What We Delivered
Structured Relevance Judgment Workflow: Taskmonk built a graded relevance annotation system (0-4 scale) across 193K+ query-product pairs. Binary labeling collapses the signal gradients ranking models need. The graded schema gave the learning-to-rank model the precision to separate highly relevant results from near-relevant ones. Annotation guidelines were built category by category, with worked examples for each relevance grade in each major product vertical.
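To illustrate why a graded scale carries more ranking signal than a binary one, here is a minimal sketch using NDCG, a standard ranking metric. The grades, the binarization threshold, and the exponential gain formula are illustrative assumptions, not details of the client's actual model or evaluation setup.

```python
import math

def dcg(grades):
    """Discounted cumulative gain with the common exponential gain 2^g - 1."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(grades))

def ndcg(grades):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

def binarize(grades, threshold=2):
    """Collapse 0-4 grades to relevant (1) / not relevant (0). Threshold is hypothetical."""
    return [1 if g >= threshold else 0 for g in grades]

# Hypothetical grades for the same four products under two different rankings.
ranking_a = [4, 2, 2, 0]  # best match shown first
ranking_b = [2, 2, 4, 0]  # best match buried at position 3

graded_a, graded_b = ndcg(ranking_a), ndcg(ranking_b)
binary_a, binary_b = ndcg(binarize(ranking_a)), ndcg(binarize(ranking_b))

print(f"graded: A={graded_a:.3f} B={graded_b:.3f}")
print(f"binary: A={binary_a:.3f} B={binary_b:.3f}")
```

With graded labels the metric prefers ranking A. After binarization the two rankings score the same, so a model trained on binary labels gets no signal to tell them apart.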
AI-Assisted Pre-Labeling: Custom models trained on historical query patterns generated initial relevance judgments before human annotators reviewed them. This shifted annotator effort from labeling from scratch to validating and correcting AI suggestions, concentrating human judgment on the pairs where model confidence was low or category complexity was high. Annotation throughput increased without a proportional increase in cost.
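One way to concentrate human effort as described is to triage pre-labels by model confidence and category complexity. The sketch below is an assumption about how such a split could work. The `PreLabel` fields, the 0.9 threshold, and the `COMPLEX_CATEGORIES` set are hypothetical and do not come from Taskmonk's platform.

```python
from dataclasses import dataclass

@dataclass
class PreLabel:
    pair_id: str       # hypothetical query-product pair identifier
    category: str      # product vertical
    grade: int         # model-suggested relevance grade (0-4)
    confidence: float  # model's probability for the suggested grade

# Hypothetical set of verticals treated as hard to judge.
COMPLEX_CATEGORIES = {"fashion"}

def triage(prelabels, min_confidence=0.9):
    """Split pre-labels into a quick-validation queue and a full-review queue.

    Both queues still go to human annotators; only the depth of review differs.
    """
    quick_validate, full_review = [], []
    for p in prelabels:
        needs_full = p.confidence < min_confidence or p.category in COMPLEX_CATEGORIES
        (full_review if needs_full else quick_validate).append(p)
    return quick_validate, full_review

batch = [
    PreLabel("q1-p1", "electronics", 4, 0.97),
    PreLabel("q2-p3", "electronics", 2, 0.55),
    PreLabel("q3-p9", "fashion", 3, 0.95),
]
quick, full = triage(batch)
print([p.pair_id for p in quick], [p.pair_id for p in full])
```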
Intelligent Task Allocation: Not every query-product pair needs the same annotator. Taskmonk's routing system assigned tasks based on annotator expertise, language capability, and domain knowledge. Multilingual queries and vertical-specific relevance decisions were routed to annotators with the matching background. This is the only reliable way to maintain consistency across a catalog and query distribution this large.
Multi-Tier Quality Control: A layered QC system combined automated consistency checks, Maker-Checker peer review, and expert escalation for edge cases. Inter-annotator agreement was tracked at the category level, not just in aggregate, so disagreement in specific verticals was caught and corrected before it compounded into the training data. Human-in-the-loop verification was applied selectively to the annotation pairs most likely to introduce noise.
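Tracking agreement per category rather than in aggregate can be sketched with Cohen's kappa computed separately for each vertical. This is an illustrative example, not the platform's actual metric: it assumes two annotators per pair, uses unweighted kappa (a weighted variant may suit ordinal grades better), and the 0.6 alert threshold and sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

def kappa_by_category(rows):
    """rows: iterable of (category, grade_from_annotator_1, grade_from_annotator_2)."""
    grouped = defaultdict(lambda: ([], []))
    for category, g1, g2 in rows:
        grouped[category][0].append(g1)
        grouped[category][1].append(g2)
    return {cat: cohen_kappa(a, b) for cat, (a, b) in grouped.items()}

# Hypothetical double-annotated sample.
rows = [
    ("electronics", 4, 4), ("electronics", 3, 3),
    ("electronics", 0, 0), ("electronics", 2, 2),
    ("fashion", 4, 2), ("fashion", 3, 3),
    ("fashion", 1, 3), ("fashion", 0, 1),
]
scores = kappa_by_category(rows)
flagged = [cat for cat, k in scores.items() if k < 0.6]  # hypothetical alert threshold
print(scores, flagged)
```

Half of the pairs in this sample match exactly, which would hide the fact that all of the disagreement sits in one vertical. The per-category breakdown surfaces it.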
Active Learning for Continuous Improvement: Active Learning Models identified which unlabeled pairs carried the most information value for the ranking model, directing annotation effort toward the data most likely to improve model performance. As the labeled dataset grew, pre-labeling accuracy improved, reducing human review time per batch while maintaining accuracy. The pipeline was designed from the start to improve over time, not produce a one-time dataset.
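The case study does not say which selection strategy the Active Learning Models used. As an illustration only, the sketch below uses entropy-based uncertainty sampling, a common choice: pairs whose predicted grade distribution is most spread out are sent to annotators first. The pair identifiers and probabilities are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(pool, budget):
    """Pick the `budget` most uncertain pairs.

    pool: dict mapping pair_id -> predicted probabilities over grades 0-4.
    """
    ranked = sorted(pool, key=lambda pair_id: entropy(pool[pair_id]), reverse=True)
    return ranked[:budget]

# Hypothetical model predictions for three unlabeled query-product pairs.
pool = {
    "q1-p1": [0.96, 0.01, 0.01, 0.01, 0.01],  # model is confident
    "q2-p7": [0.20, 0.20, 0.20, 0.20, 0.20],  # model has no idea
    "q3-p4": [0.05, 0.45, 0.45, 0.03, 0.02],  # torn between two grades
}
picked = select_for_annotation(pool, budget=2)
print(picked)
```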
Workflow Orchestration: A Workflow Orchestration Engine coordinated all annotation stages end-to-end: from query sampling and pre-labeling to consensus review and export to the model training pipeline. Data flow between classification stages, departments, and locations was managed centrally, eliminating the coordination overhead that had made the previous multi-vendor setup unmanageable.
The Results
Annotation Pipeline
- 193K+ products annotated with 99%+ search relevance accuracy
- Graded relevance schema across multiple query types and product verticals
- Category-specific annotation guidelines reducing inter-annotator disagreement from batch two onwards
- Active Learning pipeline built for continuous improvement as catalog and query distribution evolved
Model Performance
- ML algorithm accuracy lifted from 40% to 70%, directly improving search ranking quality and pricing strategy
- 99%+ search relevance accuracy and searchability across 193K+ products
- 99%+ accuracy on catalog improvement and searchability
Business Impact
- $310,000 in first-year savings with full payback achieved in 15 days
- 50% overall efficiency gain across annotation and curation processes
- 167M+ transactions processed with 99.9% accuracy
- Seller onboarding reduced from 6 weeks to 1 week, accelerating marketplace growth
Operational Efficiency
- Single annotation platform replacing a 750-person multi-vendor operation
- Unified quality standard applied consistently across all product verticals and languages
- Scalable pipeline built to grow with catalog expansion and new query types without proportional cost increases
Search Relevance Annotation Training Data Needs
With hundreds of millions of daily queries across a global multilingual catalog, the platform required an annotation partner that could produce relevance judgments at a consistency and scale that no generic labeling operation could sustain.
To train their ranking models, the team needed large datasets of accurately labeled query-product pairs, not just by volume, but by the precision of the relevance schema. The workflows required category-specific guidelines, graded relevance scales, and QA processes capable of detecting inter-annotator drift at the vertical level before it degraded model performance.
These workflows included consensus stages where multiple annotators labeled the same query-product pair independently, automated outlier detection flagging inconsistent labels for re-review, expert review layers for ambiguous multilingual queries, and active learning prioritization directing annotation effort toward the pairs with the highest model training value.
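A consensus stage with outlier flagging could look like the sketch below. It is a simplified assumption, not the production workflow: each pair gets several independent grades, the median is accepted when annotators are close, and pairs whose grades spread too far apart are flagged for re-review. The `max_spread` value and the sample labels are hypothetical.

```python
from statistics import median_low

def consensus(labels, max_spread=1):
    """Resolve independent grades per pair, flagging inconsistent ones.

    labels: dict mapping pair_id -> list of grades (0-4) from different annotators.
    Returns (accepted, needs_review), where accepted maps pair_id -> consensus grade.
    """
    accepted, needs_review = {}, []
    for pair_id, grades in labels.items():
        if max(grades) - min(grades) > max_spread:
            needs_review.append(pair_id)  # annotators disagree too much
        else:
            # median_low always returns one of the submitted grades
            accepted[pair_id] = median_low(grades)
    return accepted, needs_review

# Hypothetical grades from three annotators per pair.
labels = {
    "q1-p1": [4, 4, 4],
    "q2-p3": [3, 3, 4],
    "q3-p9": [0, 4, 2],
}
accepted, needs_review = consensus(labels)
print(accepted, needs_review)
```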
Why Taskmonk for Search Relevance Annotation?
The platform needed an annotation partner that could:
- Build and enforce category-specific relevance judgment schemas across a large multilingual catalog
- Handle annotation at the volume and consistency needed for production-grade ranking models
- Implement AI-assisted pre-labeling that reduced human effort without sacrificing accuracy
- Route complex multilingual and vertical-specific queries to the right annotators
- Track inter-annotator agreement at the category level and close the feedback loop to guidelines
- Seamlessly integrate annotation outputs with the existing model training pipeline
Taskmonk met all these requirements, delivering a search relevance annotation solution that balanced workflow complexity with the throughput and quality consistency needed at scale. The team's ability to stand up category-specific guidelines, deploy AI pre-labeling, and maintain active QA loops across the full annotation pipeline directly enabled the jump from 40% to 70% ML algorithm accuracy.
With Taskmonk's annotation platform, the client established a foundation for continuous search quality improvement, capable of scaling with catalog growth, new product categories, and evolving query patterns. Search relevance was the starting point. It didn't stay the only one.
What's Next?
With a proven search relevance annotation pipeline in production, the client has expanded Taskmonk's role across their broader AI data infrastructure. The same platform and annotation principles that fixed search relevance now support data pipeline projects across the business, from product classification and catalog management to seller onboarding automation and beyond.
Taskmonk continues to support ongoing model retraining, edge case annotation, and new query class development as the platform's AI capabilities mature and catalog complexity grows.
To learn how Taskmonk can help your team build better search models with consistent, scalable annotation infrastructure, book a demo.
