GTIN Product Matching in Web Scraping: With and Without GTIN

GTIN Product Matching in Web Scraping: How to Match Products Across Retailers with and Without GTIN

Alex Yudin

Head of Data Engineering

product catalog matching across retailers without GTIN identifier


Updated on May 5, 2026

Reviewed by:

Oleg Boyko, COO at GroupBWT

Introduction

A global cosmetics brand we work with (€1B in revenue, 300K+ products tracked weekly across 13 European retailers and 30+ locales) runs into the same wall every other ecommerce data team hits sooner or later. The GTIN, the one identifier that should make product matching trivial, is hidden, missing, or wrong on roughly a third of the pages it scrapes. Often, retailers simply replace it with useless internal SKUs or hide it deep within third-party APIs.

This is the part of the topic no vendor blog wants to talk about. The clean version (every product has a barcode, every retailer publishes it, you join it, you go home) rarely holds up under real catalogs. So the question is not “what is GTIN?” The question is what you do the other 30% of the time. This guide covers the methods, the workflow, and the trade-offs, including what happens to ecommerce product matching when the identifier simply does not exist anywhere it can be reliably collected.


Intro to GTIN Matching and Why It Matters in E-commerce

A Global Trade Item Number (GTIN) is the GS1-allocated identifier, 8, 12, 13, or 14 digits long and usually encoded in the barcode, that uniquely identifies a single product variant worldwide. Joining product records across catalogs (your own, a retailer’s, or a marketplace’s) using that shared identifier as the primary key is the cleanest version of ecommerce catalog matching.

How It Helps Identify Products

When the identifier is present and correct, matching collapses to a database join. Two listings sharing the same GTIN are the same SKU even if their titles, photos, and descriptions disagree. That property is what makes GTIN the anchor of every serious price intelligence and digital shelf analytics system.

Why Accurate Product Matching Matters for Retailers, Brands, and Marketplaces

Bad matches poison everything downstream. Pricing reports compare a 6-pack to a single bottle. Share-of-shelf calculations double-count variants. Review aggregations attribute praise to the wrong product. The cost is not abstract. For a Fortune 500 retailer with three million SKUs, a wrong match across millions of comparisons does not produce a footnote in a report. It produces a wrong business decision: a pricing strategy set against misidentified competitor data, a restocking call based on a fictitious share-of-shelf number, a margin calculation that reflects a bundle, not a unit.

GTIN barcode linking two product listings as one record

Why Product Matching Becomes Difficult When GTIN Is Missing

GTIN absence is not the exception. It is the default condition for entire categories. Teams that design their matching systems around the assumption of clean GTIN data find this out in production.

Where GTINs Actually Hide — and When They Don’t Exist

The identifier doesn’t simply go missing. It hides. Private-label products carry retailer-specific EANs that no competitor recognizes. Marketplace sellers paste in arbitrary codes. GTIN is optional for apparel sold through certain channels, which is why it rarely surfaces on fast-fashion product pages.

Across the projects we run, GTINs surface in places the product page never advertises. For a major UK grocery retailer’s private-label catalog, we pull the identifier from a hidden internal API. It never appears in the HTML, but the response payload carries the full GTIN. That approach achieves 99% coverage on 6,000+ SKUs. For a leading beauty retailer, the Bazaarvoice reviews API (a third-party service the retailer uses for user-generated content) returns product identifiers, including GTIN, in its response schema. Neither source is obvious. Both require knowing what you’re looking for and where.

Some products have no recoverable identifier. For several private-label cosmetics lines, the EAN field exists but fails checksum validation; the code is structurally invalid. In those cases, the fallback is the internal SKU combined with the variant ID parsed from the URL. Not a universal identifier, but the most stable key available. For clients who need GTINs for a catalog where they genuinely don’t exist in any feed, we investigate third-party price comparison aggregators that have independently indexed the same products. The identifier sometimes exists outside the original retailer’s ecosystem, even when the retailer has never published it.

“The single hardest lesson for teams new to retail data: GTIN is not a field you fetch. It’s a field you hunt for, sometimes across the page source, sometimes inside an embedded API, and sometimes you just don’t find it and you build the whole pipeline assuming it isn’t there.”
— Alex Yudin, Head of Scraping at GroupBWT

How Missing Identifiers Affect Matching — and When Matching Fails Entirely

Without a clean key, every join is a guess weighted by similarity. The matching doesn’t fail outright. It becomes statistical, with a confidence ceiling below 100 percent that no amount of engineering breaks through.

Some categories sit below any acceptable threshold regardless of method. Private-label products with no shared attributes between retailers, fashion SKUs with deliberately obscured internal style codes, and marketplace listings with fabricated identifiers all fall outside what any automated system handles reliably. That is the honest state of the data, not a failure mode to engineer around. Acknowledging it is what separates a matching system you can trust from a dashboard that quietly corrupts over time.


How Web Scraping Supports Product Matching Across Retailers

Scraping-based product matching exists because retailer feeds and partner APIs only cover the lucky cases. Everything else (competitive monitoring, private-label tracking, cross-marketplace dedup) depends on data extracted from source pages and embedded API responses.

What Product Attributes Can Be Collected from Retail Websites

A serious scraper pulls more than the product title and price. It captures GTIN where present, brand, MPN, model, size, color, weight, ingredients, characteristics, image hashes, JSON-LD blocks, and any structured metadata buried in the page or API calls. For the cosmetics brand we track across 13 European retailers, the full attribute stack runs GTIN → ingredients → characteristics → reviews → price → availability. The richer the attribute set, the more options the matching layer has when the primary key fails.
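As an illustration of pulling an identifier out of structured metadata, here is a minimal sketch that collects JSON-LD blocks from a page and looks for the schema.org GTIN properties (`gtin`, `gtin8`, `gtin12`, `gtin13`, `gtin14`). The class and function names and the sample HTML are hypothetical; real pages often nest the product under `@graph` or omit the field entirely, which is why the walk is recursive and tolerant of malformed JSON.

```python
import json
from html.parser import HTMLParser

# schema.org property names that can carry a GTIN on a Product node
GTIN_KEYS = ("gtin", "gtin8", "gtin12", "gtin13", "gtin14")


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def find_gtins(node):
    """Walk a decoded JSON-LD structure and yield any GTIN-like property values."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in GTIN_KEYS and isinstance(value, (str, int)):
                yield str(value)
            else:
                yield from find_gtins(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_gtins(item)


def extract_gtins(html):
    """Return every GTIN value found in the page's JSON-LD blocks."""
    collector = JsonLdCollector()
    collector.feed(html)
    found = []
    for block in collector.blocks:
        try:
            found.extend(find_gtins(json.loads(block)))
        except json.JSONDecodeError:
            continue  # malformed JSON-LD is common; skip it rather than crash
    return found


# Hypothetical page fragment, for illustration only
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Example Serum 30ml",
 "offers": {"@type": "Offer", "price": "19.99"},
 "gtin13": "5901234123457"}
</script>
</head></html>
"""

print(extract_gtins(SAMPLE_HTML))
```

An empty result here does not mean the identifier is absent from the site. It only means this one source does not carry it, and the next place to look is the page's background API calls.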

How Scraped Data Improves Catalog Matching and Product Comparison

Catalog matching at scale is a pipeline problem, not a query problem. For a project tracking products across 13 retailers, consolidating disparate records into a single unified schema runs through a robust six-stage pipeline. The quality baseline at project start was 83% precision and 78% recall. None of that improves without scraped attributes feeding the matcher at every stage.

Product Matching Methods: From GTIN to Rules-Based and Fuzzy Matching

There is no single matching method. There is a stack, and each layer catches what the layer above it missed.

Exact Matching by GTIN, SKU, and MPN

Exact matching is simplest when both sides expose stable identifiers such as GTIN, EAN, UPC, MPN, or retailer SKU. Match rate is high, false positives are near zero, but coverage depends entirely on what the retailer exposes.
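A minimal sketch of the exact-identifier layer, assuming both catalogs are lists of dictionaries with an already-normalized `gtin` field (the field name, function name, and sample records are hypothetical). Records without an identifier are returned separately so the next layer can pick them up.

```python
def exact_join(left, right, key="gtin"):
    """Pair records that share a non-empty identifier.

    Returns (pairs, unmatched_left). Records whose identifier is missing
    are never paired with each other; they fall through to later layers.
    """
    index = {}
    for record in right:
        if record.get(key):
            index.setdefault(record[key], []).append(record)

    pairs, unmatched = [], []
    for record in left:
        hits = index.get(record.get(key), [])
        if hits:
            pairs.extend((record, hit) for hit in hits)
        else:
            unmatched.append(record)
    return pairs, unmatched


# Titles disagree, but the shared identifier makes the join unambiguous.
retailer_a = [
    {"gtin": "5901234123457", "title": "Hydrating Serum 30 ml"},
    {"gtin": None, "title": "Night Cream 50 ml"},
]
retailer_b = [
    {"gtin": "5901234123457", "title": "Serum, hydrating (30ml) NEW"},
]

pairs, unmatched = exact_join(retailer_a, retailer_b)
print(len(pairs), "matched;", len(unmatched), "left for fallback layers")
```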

Rules-Based Product Matching for Structured Catalog Data

Rules-based product matching combines several attributes into a deterministic key (normalized brand plus MPN plus pack size, for example). A grocery chain we worked with built equivalence tables for private-label products this way, manually maintained, and used as the fallback when EANs diverged between retailers.
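One way to sketch such a deterministic key, assuming records carry `brand`, `mpn`, and `pack_size` fields (hypothetical names, as are the sample values): normalize each attribute, then combine them into a tuple that can be used for an exact join. Records missing any component get no key and fall through to the next layer.

```python
import re


def normalize_token(value):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", str(value).lower())


def rule_key(record):
    """Build a deterministic (brand, mpn, pack_size) key, or None if incomplete."""
    brand = normalize_token(record.get("brand", ""))
    mpn = normalize_token(record.get("mpn", ""))
    pack = record.get("pack_size")
    if not brand or not mpn or pack is None:
        return None  # incomplete records are not forced into a key
    return (brand, mpn, int(pack))


a = {"brand": "Acme Tools", "mpn": "AB-1200/X", "pack_size": 1}
b = {"brand": "ACME-TOOLS", "mpn": "ab1200x", "pack_size": "1"}
c = {"brand": "Acme Tools", "mpn": "AB-1200/X", "pack_size": 4}

print(rule_key(a) == rule_key(b))  # formatting differences disappear
print(rule_key(a) == rule_key(c))  # a different pack size stays a different key
```

Keeping pack size inside the key is a deliberate choice in this sketch: it prevents a multipack from being joined to a single unit at the rules layer.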

Fuzzy Product Matching for Titles, Attributes, and Descriptions

Fuzzy product matching uses string similarity to score how close two free-text fields are. It’s the layer that catches “Charcoal Grey 250ml” and “Grey – 250 ml” as the same product. String similarity scoring is now supported natively by major cloud and database platforms, which reflects how standard this layer has become in production matching systems. The technique is effective on long-tail and marketplace listings where structured attributes are absent; it breaks down on bundles, variants, and listings where title quality is deliberately low.
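As a rough illustration, the sketch below scores two titles with token-level Jaccard similarity after a small amount of unit cleanup. It is one simple similarity measure among many, not the specific scorer used in the projects described here; the function names and the unit list are assumptions.

```python
import re

# Glue a number to its unit so "250 ml" and "250ml" produce the same token.
UNIT_PATTERN = re.compile(r"(\d+)\s*(ml|l|g|kg|oz)\b")


def title_tokens(title):
    """Lowercase, merge number+unit pairs, and split into alphanumeric tokens."""
    text = UNIT_PATTERN.sub(r"\1\2", title.lower())
    return set(re.findall(r"[a-z0-9]+", text))


def token_jaccard(a, b):
    """Share of tokens the two titles have in common (0.0 to 1.0)."""
    ta, tb = title_tokens(a), title_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


same_product = token_jaccard("Charcoal Grey 250ml", "Grey – 250 ml")
other_size = token_jaccard("Grey 500 ml", "Grey – 250 ml")
print(round(same_product, 2), round(other_size, 2))
```

The first pair shares two of three distinct tokens, the second only one of three. Whether two out of three is "the same product" is a threshold decision, which is why fuzzy scores feed a confidence cut-off rather than a yes or no.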

Method	Typical accuracy	When to use	Failure mode
Exact GTIN/EAN	99%+	Always try first	Field missing or wrong
Exact SKU/MPN	95–99%	Branded electronics, parts, tools	Retailer SKU collisions
Rules-based attribute	85–95%	Grocery, beauty, packaged goods	Inconsistent attribute values
Fuzzy text	70–90%	Long-tail and marketplace listings	Bundles and variants
Image hash + AI	75–90%	Fashion, furniture, no metadata	Stock photo reuse across SKUs

Layered matching methods from exact GTIN to fuzzy text scoring

How to Match Products Without GTIN

Product matching without GTIN is a routing problem. You pick the attribute stack that suits the category and accept that the answer comes with a confidence score, not a yes or no.

Using Brand, MPN, Size, Color, and Other Attributes

Attribute-based product matching works best when the retailer exposes structured fields. For a fashion intelligence project covering 500K+ apparel items across major fast-fashion sites, we match by brand, category, size, and color; GTIN almost never appears on these listings. Brand and MPN matching pinned to size codes handles most automotive parts catalogs cleanly.

Category	Attribute stack that works	Why
Beauty & cosmetics	Brand + product line + shade name + volume	Shade naming varies, volume doesn’t
Fashion & apparel	Brand + style code + size + color	GTIN seldom present
Tires & auto parts	Brand + MPN + size code	MPN is canonical in this category
Grocery private label	Retailer code + pack size + flavor	No cross-retailer GTIN equivalence
Electronics	GTIN, then MPN, then brand + model + memory + color	Layered fallback when GTIN is incomplete

Combining Rules and Fuzzy Logic for Better Match Quality

Layering is what makes the whole stack work. Exact identifiers run first, structured attribute rules second, fuzzy text comparison third. Human review sits on top. For one telecom client tracking 1,000 SKUs every 30 minutes against six competitors, our fallback pipeline enriches each card with brand, model, memory, and color, generates candidates through strict attribute filters, then scores them with token-based similarity. Anything below the confidence threshold goes to human review. Nothing publishes without a confirmed match.
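The layering described above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the production pipeline: the field names, the `REVIEW_THRESHOLD` value, and the choice of token Jaccard as the similarity measure are all hypothetical.

```python
import re

REVIEW_THRESHOLD = 0.8  # hypothetical cut-off; tune it against labelled pairs
STRICT_FIELDS = ("brand", "memory", "color")


def _norm(value):
    return str(value or "").strip().lower()


def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", _norm(text)))


def token_similarity(a, b):
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0


def match_layered(source, catalog):
    """Exact identifier first, strict attribute filter second, fuzzy score third."""
    # Layer 1: exact identifier match.
    if source.get("gtin"):
        for cand in catalog:
            if cand.get("gtin") == source["gtin"]:
                return {"candidate": cand, "method": "gtin",
                        "score": 1.0, "status": "auto"}

    # Layer 2: strict attribute filter narrows the candidate pool.
    pool = [
        cand for cand in catalog
        if all(_norm(source.get(f)) and _norm(cand.get(f)) == _norm(source.get(f))
               for f in STRICT_FIELDS)
    ]
    if not pool:
        return {"candidate": None, "method": "none", "score": 0.0, "status": "review"}

    # Layer 3: token-based similarity picks the best remaining candidate.
    best = max(pool, key=lambda c: token_similarity(source["title"], c["title"]))
    score = token_similarity(source["title"], best["title"])
    status = "auto" if score >= REVIEW_THRESHOLD else "review"
    return {"candidate": best, "method": "attributes+fuzzy",
            "score": score, "status": status}


catalog = [
    {"id": "c1", "gtin": "5901234123457", "brand": "Acme", "memory": "128GB",
     "color": "Black", "title": "Acme Phone X 128GB Black"},
    {"id": "c2", "gtin": None, "brand": "Acme", "memory": "256GB",
     "color": "Black", "title": "Acme Phone X 256GB Black"},
]
clean = {"gtin": None, "brand": "ACME", "memory": "256gb", "color": "black",
         "title": "Acme Phone X (256GB, Black)"}
noisy = {"gtin": None, "brand": "Acme", "memory": "256GB", "color": "Black",
         "title": "Phone X Pro Max bundle with charger"}

print(match_layered(clean, catalog)["status"])
print(match_layered(noisy, catalog)["status"])
```

Note that "review" is a normal return value here, not an exception path: the cascade always produces a routing decision, and only "auto" results would be published.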

Why 100 Percent Match Accuracy Is Rare Without GTIN

Two products can share every attribute and still be different SKUs (different country versions, different production runs, repackaged variants). We typically aim for 92–96 percent automated coverage on top-brand, well-structured catalogs and route the rest through human QA. Anyone promising 100 percent accuracy without a universal identifier is either misrepresenting the workflow or doing the manual review themselves without saying so.

A Practical Workflow for Web Scraping Product Matching

A workable pipeline has six stages. None of them are optional.

Attribute discovery. Before a scraper runs, map which fields each retailer actually exposes, and where identifiers like GTIN live if they exist. For complex or unfamiliar catalogs, this is a standalone phase: collect a sample, validate attribute coverage against the client’s own catalog, and confirm the schema before committing collection rules. Skipping this step means building a pipeline against assumptions that don’t hold.
Data collection. Scrape the full attribute set, not just title and price. Capture GTIN, JSON-LD blocks, image URLs, and any metadata the page fires through background API calls.
Attribute normalization. Lowercase, strip units, standardize sizing, parse weights, resolve regional formats. Proper normalization alone removes most of the noise before any scoring runs, ensuring your fuzzy matching layer isn’t just fighting formatting inconsistencies.
Candidate generation. For each source record, narrow the search space using strict filters (same brand, same category, similar price band) before any fuzzy comparison runs.
Similarity scoring. Run fuzzy text comparison and attribute matchers across the candidate set. Combine scores into a single confidence number.
Human review. Anything below your threshold goes to a reviewer. This is not a failure mode. It’s the design.

Re-validate matches as catalogs drift. New variants appear weekly. Old matches break silently.
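The candidate generation stage is the one that keeps the scoring stage affordable, so a short sketch may help. It groups the target catalog into blocks by brand and category, then applies a price band before any fuzzy comparison runs. The field names and the 30 percent tolerance are assumptions chosen for illustration.

```python
from collections import defaultdict


def build_blocks(catalog):
    """Index catalog records by (brand, category) so lookups avoid a full scan."""
    blocks = defaultdict(list)
    for record in catalog:
        key = (record["brand"].lower(), record["category"].lower())
        blocks[key].append(record)
    return blocks


def candidates_for(source, blocks, price_tolerance=0.3):
    """Return same-brand, same-category records within a relative price band."""
    key = (source["brand"].lower(), source["category"].lower())
    low = source["price"] * (1 - price_tolerance)
    high = source["price"] * (1 + price_tolerance)
    return [r for r in blocks.get(key, []) if low <= r["price"] <= high]


target_catalog = [
    {"id": "t1", "brand": "Acme", "category": "Shampoo", "price": 9.5},
    {"id": "t2", "brand": "Acme", "category": "Shampoo", "price": 25.0},
    {"id": "t3", "brand": "Other", "category": "Shampoo", "price": 10.0},
]
source = {"brand": "ACME", "category": "shampoo", "price": 10.0}

blocks = build_blocks(target_catalog)
print([r["id"] for r in candidates_for(source, blocks)])
```

Only the surviving candidates are passed to similarity scoring, so the number of pairwise comparisons grows with block size rather than with the full catalog.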

“A new client always asks how long it takes to build a match. The honest answer is that the pipeline itself is four to eight hours of work for a basic rules-plus-fuzzy setup. The maintenance is forever. Clients who treat matching as a one-shot project end up with dashboards full of nonsense within a quarter.”
— Dmytro Naumenko, CTO at GroupBWT

six-stage product matching pipeline from attribute discovery to review

How to Improve Product Matching Accuracy at Scale

Quality control is what separates a matching pipeline that works on day one from one that still works on day 300.

Standardizing Titles, Units, and Product Attributes

Normalization is the step most teams underinvest in. While unit conversion, tokenization, and size standardization are essential, normalization also directly applies to the GTIN itself. In practice, simple formatting differences will break an exact match:

Format variations: One retailer stores 5901234123457, another uses 590-1234-123457, and a third drops the leading zeros from a zero-prefixed code (0036000291452 stored as 36000291452). Without stripping separator characters and padding zeros, the exact match fails on the identical product.
UPC vs. EAN conversion: A 12-digit UPC-A maps to a 13-digit EAN-13 by adding a leading zero. Without this conversion rule at ingest, US and European listings for the same item will never align.
Checksum validation: Retailers occasionally publish GTINs with a typo or an otherwise invalid final digit. Checksum validation is a mandatory part of normalization, because those errors need to be caught before matching runs.

Done correctly, normalization removes most of the noise before any scoring runs, which determines whether the fuzzy layer is doing useful work or fighting a data quality problem the entire way down.
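A minimal sketch of GTIN normalization, assuming the goal is a canonical 14-digit string: strip separators, left-pad with zeros, and verify the GS1 check digit (weights alternate 3 and 1 starting from the digit next to the check digit). Accepting any length from 8 to 14 digits is a lenient choice made here so that codes with dropped leading zeros can be recovered; a stricter pipeline might accept only 8, 12, 13, or 14. The function names are hypothetical.

```python
import re


def gtin_check_digit(body):
    """Compute the GS1 check digit for the digits that precede it."""
    total = 0
    for position, char in enumerate(reversed(body)):
        weight = 3 if position % 2 == 0 else 1
        total += int(char) * weight
    return (10 - total % 10) % 10


def normalize_gtin(raw):
    """Return a zero-padded 14-digit GTIN, or None if the code is unusable."""
    digits = re.sub(r"\D", "", str(raw))  # drop hyphens, spaces, stray text
    if not 8 <= len(digits) <= 14:
        return None
    padded = digits.zfill(14)  # leading zeros do not change the check digit
    if gtin_check_digit(padded[:-1]) != int(padded[-1]):
        return None  # structurally invalid: typo or fabricated code
    return padded


print(normalize_gtin("590-1234-123457"))   # separators stripped
print(normalize_gtin("036000291452"))      # UPC-A
print(normalize_gtin("0036000291452"))     # same item written as EAN-13
print(normalize_gtin("5901234123458"))     # wrong final digit
```

Because the UPC-A and EAN-13 spellings normalize to the same 14-digit string, the conversion rule and the exact join reduce to a single string comparison after this step.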

Filtering Incorrect Matches with Validation Rules

Apply sanity checks after matching. Products in the same match group should fall within a price band, share a category, and have consistent brand strings. Anything that fails these post-hoc rules goes back to the review queue rather than being published as a confirmed match.
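These post-hoc rules can be sketched as a small function that returns the list of checks a match group fails. The field names and the 1.5x price ratio are assumptions for illustration; an empty list means the group passes.

```python
def validation_issues(group, max_price_ratio=1.5):
    """Return the names of the sanity checks that a match group fails."""
    issues = []

    # Price band: the most expensive listing should not dwarf the cheapest.
    prices = [r["price"] for r in group if r.get("price")]
    if len(prices) >= 2 and max(prices) / min(prices) > max_price_ratio:
        issues.append("price_band")

    # Category and brand strings should agree after light normalization.
    if len({r.get("category", "").strip().lower() for r in group}) > 1:
        issues.append("category")
    if len({r.get("brand", "").strip().lower() for r in group}) > 1:
        issues.append("brand")

    return issues


consistent = [
    {"price": 6.0, "category": "Soft Drinks", "brand": "Acme"},
    {"price": 6.5, "category": "soft drinks", "brand": "ACME "},
]
suspicious = [
    {"price": 6.0, "category": "Soft Drinks", "brand": "Acme"},
    {"price": 30.0, "category": "Soft Drinks", "brand": "Acme"},
]

print(validation_issues(consistent))
print(validation_issues(suspicious))
```

A price-band failure like the second group is a typical signature of a multipack matched to a single unit, which is why failing groups go back to review instead of being silently dropped.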

Building Quality Control Into Ongoing Maintenance

Catalogs change daily. The best validation discipline is automated drift detection: flag matches whose attribute distance grows over time, surface them for review, and retire stale equivalences before they corrupt reports.
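A simple form of drift detection can be sketched as follows, assuming each confirmed match has a stored history of attribute distances (0.0 means identical) recorded on successive crawls. The thresholds and the data layout are hypothetical.

```python
def flag_drift(distance_history, max_growth=0.15, ceiling=0.4):
    """Return IDs of matches whose attribute distance has grown too far.

    distance_history maps a match ID to its distances, oldest first.
    """
    flagged = []
    for match_id, distances in distance_history.items():
        if len(distances) < 2:
            continue  # not enough history to judge a trend
        growth = distances[-1] - distances[0]
        if growth > max_growth or distances[-1] > ceiling:
            flagged.append(match_id)
    return flagged


history = {
    "m1": [0.05, 0.06, 0.05],   # stable
    "m2": [0.05, 0.15, 0.30],   # steadily drifting apart
    "m3": [0.10],               # newly created match
}

print(flag_drift(history))
```

Flagged matches would be surfaced for review, not deleted automatically, since a growing distance can also mean that one retailer rewrote its titles while the product stayed the same.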

GTIN Matching for Merchant Center and Product Reviews

Why GTIN Improves Product Review Matching

A clean GTIN lets Google associate every review with one product entity in its catalog. Star ratings consolidate. Review counts compound across sources. Listings become eligible for review snippets in paid and organic search results, the kind of surface real buyers read before clicking.

What Happens When GTIN Is Missing in Merchant Center Workflows

When GTIN is absent, Google falls back on brand plus MPN. If either is inconsistent across feeds, reviews route to the wrong product or disappear entirely. For listings that depend on review snippets to drive click-through rates in Shopping, the matching gap has a direct business cost. Not in internal dashboards, but in star ratings that never accumulate where it matters.

GTIN consolidating reviews across Merchant Center product listings

Common Challenges in Ecommerce Product Matching

Failure mode	Real example	Mitigation
Variant collapse	“Nude 05” lipstick on one site, “Bare Nude” on another	Maintain a shade equivalence table per brand
Bundles vs. singles	A “4-pack” listed as one item skews price-per-unit	Three-level filter: API quantity field, title parsing, description check
Region-specific listings	Same product, different URL per country, different price	Geo-tag listings; never compare across regions
Duplicate marketplace listings	Same property across multiple platforms with slightly different names	Match by location coordinates plus property attributes
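The three-level bundle filter in the table can be sketched as a single function that checks a structured quantity field first, then the title, then the description. The field names and the regular expression are illustrative assumptions, and treating "no signal" as a single unit is a simplification that a production filter might route to review instead.

```python
import re

# Matches "4-pack", "4 pk", "pack of 6", and "6x330ml"-style multipack notation.
PACK_PATTERN = re.compile(
    r"\b(\d+)\s*-?\s*(?:pack|pk)\b"
    r"|\bpack\s+of\s+(\d+)\b"
    r"|\b(\d+)\s*x\s*\d",
    re.IGNORECASE,
)


def pack_size(record):
    """Estimate how many units a listing contains."""
    # Level 1: structured quantity field from the API, when present.
    quantity = record.get("quantity")
    if isinstance(quantity, int) and quantity > 0:
        return quantity

    # Levels 2 and 3: parse the title first, then the description.
    for field in ("title", "description"):
        match = PACK_PATTERN.search(record.get(field) or "")
        if match:
            return int(next(g for g in match.groups() if g))

    return 1  # assumption: no bundle signal means a single unit


def price_per_unit(record):
    return record["price"] / pack_size(record)


listing = {"title": "Sparkling Water 4-pack", "price": 6.0}
print(pack_size(listing), price_per_unit(listing))
```

Comparing `price_per_unit` rather than the raw price is what keeps a 4-pack from looking like an overpriced single bottle.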

When Businesses Need a Custom Product Matching Solution

When Off-the-Shelf Matching Tools Are Not Enough

Generic matching tools work for clean catalogs in obvious categories. They fall apart on private-label grocery, fashion, mobile carrier offers, and any catalog where GTIN coverage is below 70 percent. The assumption built into every off-the-shelf tool is that your data arrives pre-cleaned and pre-structured. That assumption fails precisely in the situations where matching is hardest.

Clients who come to us after a generic solution failed typically describe the same pattern: the tool handled their top-brand, structured catalog reasonably well, then fell apart on the private-label segment, the marketplace tail, and categories where GTIN was absent or fabricated. Those are not edge cases in most real catalogs. They are the majority of the hard matching work — the part the tool was never trained to handle.

How Custom Matching Logic Supports Better Data Quality

Custom logic encodes what only your category understands. For a beauty client, that means shade equivalence tables across dozens of names per brand that no generic tool ships. For grocery private-label, bundle detection that distinguishes a 4-pack from a single unit before any pricing comparison runs. For a catalog with no usable GTIN, an acquisition path through internal APIs or third-party aggregators. None of this is exotic engineering — it is specific to your data, and that specificity is what off-the-shelf tools cannot provide.

“When clients ask whether they should buy a matching tool or build one, my answer depends on a single question: how much of your catalog is private-label or marketplace-listed? If it makes up a significant portion of your data, you need custom logic, because the off-the-shelf tools were never trained on your edge cases.”
— Eugene Yushenko, CEO at GroupBWT

How GroupBWT Builds Product Matching Workflows with Web Scraping

Every matching engagement we run follows the same three phases. The category changes. The scale changes. The phases don’t.

Phase 1: Attribute Discovery and Source Mapping

Before a single scraper runs, we map which fields each retailer actually exposes, and where identifiers like GTIN live if they exist at all. For a digital shelf platform covering 70+ retailers, collection rules differ for almost every source: one retailer exposes GTIN in JSON-LD; another fires it through a background API call the page never advertises; a third has no usable identifier and requires a full attribute stack as the matching key. This phase often runs as a paid PoC, validating an initial product sample against the client’s own catalog before full collection begins. It is the step that determines whether the rest of the pipeline is built on real data or assumptions.

Phase 2: Layered Matching Stack

Our default stack is rules first, fuzzy second, and automated data validation on top. We design delivery workflows so only validated matches reach the client dataset. Low-confidence cases are handled upstream through review and validation rather than passed downstream as uncertain data. For the cosmetics brand we track across 13 European retailers and 30+ locales, the matching layer handles GTIN joins where available, and falls back to brand plus product line plus shade and volume for the long tail. Quality is continuously controlled through strict validation rules, ensuring that uncertain matches are filtered out before they ever reach the client’s dashboard.

Phase 3: Ongoing Maintenance and Drift Detection

Catalogs change. Anti-bot systems evolve. Retailers restructure pages. A matching pipeline that isn’t actively maintained degrades silently, and the first sign is usually a business stakeholder noticing that numbers don’t add up. Every deployment we build includes drift detection: we monitor attribute distances continuously, flag matches when their similarity score drops below a threshold, and schedule re-validation runs timed to each catalog’s update cycle. One client came to us after their previous data supplier became unreliable over time: inconsistent coverage, missed updates, and growing gaps between what the feed reported and what was actually on the page. The pipeline we built to replace that supplier, which we own and monitor, is now in its fourth year of operation.

Three-phase matching engagement from catalog discovery to drift detection

Conclusion

Reliable product matching is not a tooling problem, and it is not an algorithm problem. It is a pipeline-and-discipline problem. GTIN gives you a clean key when it exists. Rules and fuzzy logic close the gap when it doesn’t. And in the categories where no automated approach reaches acceptable accuracy (private-label without shared attributes, fashion with obscured codes, marketplaces with fabricated identifiers), the answer is human QA and custom matching logic built to your specific data, not a more powerful algorithm.

Teams that treat matching as a one-time setup end up with dashboards that quietly degrade until no one trusts the numbers. Teams that build maintenance into the workflow keep the lights on for years. If your matching pipeline is hitting edge cases at scale (missing GTINs, private-label catalogs, marketplace noise), we have built this for clients tracking 300K+ products weekly across dozens of retailers. Contact us to walk through what your catalog actually requires.

FAQ

What is the difference between GTIN matching and product matching?

GTIN matching is one specific method: joining product records by their Global Trade Item Number, the GS1-allocated barcode that uniquely identifies a product variant worldwide. Product matching is the broader practice of identifying when two listings refer to the same SKU, using whatever attributes are available. Every GTIN match is a product match, but most product matching in production catalogs today happens without a usable GTIN; the identifier is missing, hidden, or invalid on a significant share of retailer pages.

Can products be matched without GTIN?

Yes, and most production catalogs require it. The standard approach combines deterministic rules on structured attributes (brand, MPN, size) with fuzzy text similarity on titles and descriptions. Confidence scores route uncertain matches to human review. Expect 85 to 95 percent automated coverage in well-structured categories, less in fashion or marketplace listings where attribute quality is low.

How accurate is product matching without GTIN?

For top-brand products with consistent attribute data, 92 to 96 percent automated accuracy is realistic. For long-tail or marketplace listings, that number drops. The remaining gap is closed by ongoing human review, not by a smarter algorithm. Anyone promising 100 percent accuracy without a universal identifier is either misrepresenting the workflow or doing the manual review themselves without disclosing it.

Why is GTIN missing from so many product pages?

Several reasons. Private-label products carry retailer-specific codes that other retailers do not recognize. Marketplace sellers enter arbitrary numbers. Some categories, like fashion, never adopted GTIN broadly. Many retailers strip the field from their public product pages while keeping it inside internal API calls or third-party services like review platforms. Collecting it requires knowing where to look: sometimes the page source, sometimes an embedded API response, sometimes a third-party aggregator that independently indexed the product.

How does GTIN affect product reviews in Merchant Center?

Merchant Center uses GTIN as the canonical key for joining reviews to products. With a clean identifier, reviews from your own site, distributors, and aggregator feeds consolidate against a single product card; star ratings and review counts compound across all sources. Without it, Google falls back on brand and MPN, and reviews fragment or attach to the wrong listing entirely. For listings that depend on review snippets in paid or organic search, the difference is measurable in click-through rates.