Tender Data Aggregation:
Overcoming Multilingual
and Classification
Challenges

Oleg Boyko

Introduction: Why Tender Aggregation Matters

In our previous post, Challenges in Tender Data Unification, we looked at the complexities of gathering and unifying tender information from various sources across the EU. Many of you reached out with questions about how we tackle key issues like working with multiple languages, handling classification challenges, and ensuring data quality. This time, we will focus on practical tools for monitoring tender data, overcoming multilingual and classification obstacles, and maintaining high data accuracy. Let’s dive into these solutions and how they help create reliable tender aggregation systems.

Approaches to Multilingual Tenders

With 24 official languages across the EU, tender information is often available only in the contracting country’s language. This creates barriers for companies from other member states that may not understand the relevant language, complicating both access to information and participation in tenders.

Additionally, various EU countries use their national platforms for publishing tenders, resulting in fragmented information that makes centralized collection and analysis challenging. While TED aims to consolidate this data, not all tenders are published on the portal, particularly those below certain financial thresholds.

Effective Strategies for Multilingual Tender Classification

For multilingual classification, we consider the following approaches:

  1. Expanding Our Existing Model: We built our model based on English data. The training dataset consists of approximately 500,000 to 800,000 records provided by Contracts Finder and Find Tender, including archived and current data dating back to 2015. Given the UK’s regulatory requirements and the characteristics of CF/FT sites, their data was adopted as a standard and sufficiently comprehensive for training purposes.
  2. Using Machine Translation/External APIs: Translate all texts into English and use an English-trained prediction model. This approach avoids additional training since translation services and models are already available. However, translation can reduce prediction accuracy due to the potential loss of context and grammatical or semantic nuances.
  3. Training Separate Models for Each Target Language: This involves identifying the language of the input release before predicting with the respective model. While this approach ensures high accuracy and minimizes distortions from uneven training datasets, it requires substantial training data for each language. Although this method scales less efficiently than a single-model approach, it provides a significant advantage in terms of accuracy by tailoring each model to the linguistic and cultural specifics of its respective language.

The choice of approach depends on task context, situation, and constraints. We favor the third option, as our services prioritize highly accurate, contextually precise predictions over rough approximations.
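
To make the third option concrete, here is a minimal sketch of the routing step. It assumes the langdetect package for language identification; the Model class and the set of languages are placeholders, not our production code.

```python
# A minimal sketch of per-language routing (approach 3). It assumes the
# langdetect package; the Model class is a placeholder for a real
# per-language classifier, not our production code.
from langdetect import detect


class Model:
    """Stand-in for a CPV classifier trained on one language."""

    def __init__(self, lang: str):
        self.lang = lang

    def predict(self, text: str) -> list[tuple[str, float]]:
        # A real model returns candidate CPV codes with probabilities.
        return [("03221200-5", 0.91)]


# One classifier per supported language, each trained on data in that language.
MODELS = {lang: Model(lang) for lang in ("en", "de", "fr", "pl")}
FALLBACK = MODELS["en"]


def classify_release(text: str) -> list[tuple[str, float]]:
    """Identify the release language, then delegate to the matching model."""
    lang = detect(text)  # e.g. "de" for a German notice
    return MODELS.get(lang, FALLBACK).predict(text)
```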

Next, we will explore CPV codes and our categorization approach, which allows for identifying procurement subjects regardless of language. CPV codes also simplify automated data aggregation from various sources, reducing data processing complexity and improving classification accuracy.

How CPV Coding Works in Tender Data

CPV (Common Procurement Vocabulary) is a standardized classification system designed to describe procurement subjects in public tenders, particularly within European Union countries. Its primary purpose is to unify terminology and simplify searching and analyzing tender information.

CPV enhances automation in tender classification, reducing errors and improving data processing speed.

Key Features of CPV:

CPV Codes:

The CPV system consists of numerical codes, each representing a specific product, service, or work.

  • A code contains eight digits plus a check digit, for example 03221200-5, where:
    • 03221200 is the main code indicating the specific product or service.
    • 5 is the check digit used to guard against data entry errors.

Hierarchical Structure:

CPV is organized like a tree, where each level provides more detail about the product or service. For instance:

  • 03000000-1 — Agricultural, farming, fishing, and forestry products.
  • 03200000-3 — Cereals, potatoes, vegetables, fruits.
  • 03221200-5 — Onions.
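
Because the hierarchy is encoded in the digits themselves, broader categories can be derived mechanically by zeroing trailing digits of the eight-digit main code. A minimal sketch (check digits are simply dropped here, since their computation is outside the scope of this post):

```python
# A sketch of walking the CPV tree upward. Broader categories are derived by
# zeroing trailing digits of the eight-digit main code; check digits are
# dropped, since their computation is outside the scope of this post.
def cpv_ancestors(code: str) -> list[str]:
    main = code.split("-")[0]  # "03221200-5" -> "03221200"
    ancestors = []
    # Truncate one digit at a time, down to the two-digit division level.
    for depth in range(len(main) - 1, 1, -1):
        parent = main[:depth].ljust(8, "0")
        if parent != main and parent not in ancestors:
            ancestors.append(parent)
    return ancestors


print(cpv_ancestors("03221200-5"))
# ['03221000', '03220000', '03200000', '03000000']
```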

Additional Vocabulary:

In addition to the main codes, CPV includes a supplementary vocabulary (the Additional Vocabulary) that specifies further characteristics of procurements, such as sizes, materials, or intended purposes.

Our Own CPV System Prediction Model

As noted above, we built our prediction model on 500,000–800,000 records from Contracts Finder and Find Tender, covering archived and current data back to 2015. Thanks to the UK’s regulatory requirements and the structured nature of CF/FT data, these sources proved a reliable and sufficiently comprehensive training standard.

The model predicts the exact CPV code without building a hierarchy and provides multiple options, each with an associated probability. We achieved reasonably high prediction accuracy thanks to the reliability of CPV categories on government sites and the large volume of data. CPV codes are integrated into our models for data unification in two ways:

  1. The predicted CPV code is inserted into the appropriate field where it was missing previously, ensuring proper classification.
  2. We extend the OCDS schema with a custom extension that adds metadata about the CPV category, including the prediction method and the confidence level (a value between 0 and 1) associated with each predicted category.
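
As an illustration of both integration paths, here is a sketch of how a predicted code and its metadata could be written into a release item. The "cpvPrediction" block and its field names are hypothetical; the real custom extension schema is client-specific.

```python
# A sketch of both integration paths above. The "cpvPrediction" block and its
# field names are illustrative; the real custom extension is client-specific.
def apply_prediction(item: dict, code: str, confidence: float) -> dict:
    # 1. Fill the standard OCDS classification field if it was missing.
    item.setdefault("classification", {"scheme": "CPV", "id": code})
    # 2. Attach prediction metadata through the custom extension.
    item["cpvPrediction"] = {
        "method": "model",         # how the CPV category was obtained
        "confidence": confidence,  # a value between 0 and 1
    }
    return item


item = apply_prediction({"description": "Onions, fresh"}, "03221200-5", 0.91)
```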

We employed a standard eXtreme Multi-Label Classification method for training, tuning parameters such as the number of epochs. No groundbreaking techniques were involved; the model simply works well for our specific use case, and we make no claims about how it compares to other models in general.
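
Since the post does not name the specific XMC method, the sketch below substitutes a simple scikit-learn pipeline (TF-IDF plus one-vs-rest logistic regression) purely to show the task shape: many CPV labels, one probability per label.

```python
# A stand-in sketch only, not the actual model: TF-IDF features with a
# one-vs-rest logistic regression, demonstrating multi-label CPV prediction
# with a probability per candidate code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

texts = ["Supply of onions", "Road maintenance works", "School catering"]
labels = ["03221200-5", "45233141-9", "55524000-9"]  # illustrative CPV codes

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
pipeline.fit(texts, labels)

# Several candidate codes come back, each with its own confidence.
probabilities = pipeline.predict_proba(["Procurement of fresh vegetables"])[0]
for code, p in zip(pipeline.classes_, probabilities):
    print(code, round(p, 3))
```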

How We Ensure the Tender Data Aggregation System Works Properly

We treat each site (or source) as an independent resource, with all statistics, management, and fault tolerance organized explicitly for that site. Several organizations sometimes publish their tenders through sites hosted on the same platform. Even then, we maintain a high level of granularity per site: if one organization stops using a platform and its site becomes inactive, other organizations and their sites on that platform are not disrupted. Platform-wide changes, however, do affect all connected sites.
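
One way to picture this granularity is a per-source configuration where each site carries its own state. The shape below is illustrative, not our actual schema.

```python
# An illustrative per-source configuration, not our actual schema. Each site
# keeps its own state, so deactivating one never affects the others, even
# when they share a publishing platform.
from dataclasses import dataclass


@dataclass
class SourceConfig:
    site_id: str         # unique per site, even on a shared platform
    platform: str        # the publishing platform the site lives on
    active: bool = True  # an inactive site is simply skipped


sources = [
    SourceConfig("city-a", platform="shared-portal"),
    SourceConfig("city-b", platform="shared-portal", active=False),
]

# Only active sites are scheduled; "city-b" going quiet leaves "city-a" intact.
scheduled = [s for s in sources if s.active]
```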

We use Grafana to monitor the progress and functionality of our data aggregation systems. It’s not used as an analytics tool for the data itself but as a system monitoring tool that provides real-time insights into the health of our collection processes. Grafana fulfills this role exceptionally well.
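
The post does not specify Grafana’s data source, so the sketch below assumes the common Prometheus pairing via the prometheus_client package; metric names are illustrative. Grafana then charts these series per source.

```python
# Assumes a Prometheus data source behind Grafana; metric names are
# illustrative, not our actual dashboard configuration.
from prometheus_client import Counter, Gauge, start_http_server

HTTP_ERRORS = Counter(
    "scraper_http_errors_total", "HTTP errors per source", ["source"]
)
RELEASES = Counter(
    "scraper_releases_total", "Releases collected per source", ["source"]
)
RUNTIME = Gauge(
    "scraper_last_run_seconds", "Last collection runtime per source", ["source"]
)

start_http_server(8000)  # exposes /metrics for Prometheus to scrape

# Inside the collection loop:
RELEASES.labels(source="city-a").inc()
RUNTIME.labels(source="city-a").set(73.4)
```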

Ensuring System Reliability in Tender Data Aggregation

While clients can access Grafana dashboards, they typically don’t need to use them directly. Instead, our support team relies heavily on these dashboards to monitor data quality, resolve issues promptly, and even anticipate potential problems before they occur.

Here are some of the key metrics we track:

  • Daily collection runtime per source: Monitoring how long it takes to process each site.
  • Number of HTTP errors per source: Tracking issues related to connectivity or site accessibility.
  • Collection progress and retry attempts: Ensuring all data is collected, with automatic retries where necessary.
  • Expected daily release count per source: If a site usually publishes 100–150 tenders per day, we check whether the numbers fall within this range (see the sketch after this list).
  • Pipeline accuracy in processing releases: Ensuring all collected data is correctly saved in the database.
  • Missing values in specific fields: Monitoring fields like title, description, CPV code, release date, buyer, and tag to ensure the completeness of critical information.
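
Here is the promised sketch of the expected-daily-release-count check; the alert below is a placeholder for the real Grafana or Sentry notification.

```python
# A minimal sketch of the expected-range check; the print is a placeholder
# for the real alerting integration.
EXPECTED_RANGE = {"city-a": (100, 150)}  # usual daily volume per source


def check_daily_count(source: str, collected: int) -> bool:
    low, high = EXPECTED_RANGE[source]
    if not low <= collected <= high:
        print(f"ALERT: {source} yielded {collected}, expected {low}-{high}")
        return False
    return True


check_daily_count("city-a", 12)  # far below range: the source likely broke
```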

Our system includes an object model that supports the standard OCDS 1.2 specification and various custom extensions. During data collection, validation ensures that fields conform to the required structure. Any fields that fail validation are omitted from the final OCDS JSON. Additionally, the system tracks missing values for critical fields, which is significant for clients who depend on accurate and complete data.
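
A simplified sketch of that validation step follows, with illustrative rules standing in for the full OCDS 1.2 schema and extensions.

```python
# Fields failing a type check are omitted from the final OCDS JSON, and
# missing critical fields are tracked. The rules here are illustrative;
# real validation follows the OCDS 1.2 schema plus our extensions.
EXPECTED_TYPES = {"title": str, "description": str, "date": str, "tag": list}
CRITICAL = {"title", "description", "date", "tag"}


def validate_release(raw: dict) -> tuple[dict, set[str]]:
    clean, missing = {}, set()
    for field, expected in EXPECTED_TYPES.items():
        value = raw.get(field)
        if isinstance(value, expected):
            clean[field] = value  # keep only well-formed fields
        elif field in CRITICAL:
            missing.add(field)    # reported via Grafana and Sentry
    return clean, missing


clean, missing = validate_release({"title": "Onion supply", "tag": "tender"})
# "tag" fails the list check, so it is omitted and counted as missing.
```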

To monitor such missing fields, we use Grafana and Sentry, which enable real-time tracking and visualization of data quality issues. Moreover, certain fields are marked as essential for client-specific business cases. Additional statistics are maintained to identify records where these significant fields are missing, providing deeper insights and helping ensure compliance with client requirements.

How We Ensure Data Quality

One of our system’s key strengths is its focus on data completeness and quality. Our experienced scraper developers and QA specialists bring deep domain expertise, ensuring accurate mapping of data from free text or custom website layouts into OCDS fields and extensions.

By leveraging an object-oriented approach in our code, we effectively manage and monitor a release’s complex structure. This approach minimizes errors and incorrect mappings, ensuring reliable, high-quality outputs.

For projects of this nature, we assign a dedicated support team that works full-time to oversee and optimize the data collection process. If this level of service is something you value, we can tailor our support to your specific needs: the speed of issue resolution and the urgency of fixes depend on your required SLA. How fast do you need us to act? Let’s make sure our service aligns with your expectations.

Adding New Tender Sources Easily (Additional Scrapers)

The system’s architecture allows for rapid addition of new scrapers, provided a ready deployment with configured microservices is in place. The entire process is standardized:

  • Source analysis
  • Implementation of the scraper and data mapping to OCDS
  • Documentation
  • Delivery to a testing environment and verification
  • Delivery to the production environment

Adding a scraper for a known, previously analyzed platform takes less than a day to deploy. Timing for entirely new sources depends on their specifics but generally requires no more than a few days.
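
To illustrate the standardized process, here is a skeleton of the scraper contract implied by the steps above; the class and method names are illustrative, not our actual framework.

```python
# A skeleton of the scraper contract implied by the standardized steps; the
# class and method names are illustrative, not our actual framework.
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Every new source implements the same small interface."""

    @abstractmethod
    def fetch(self) -> list[dict]:
        """Download raw tender notices from the source."""

    @abstractmethod
    def to_ocds(self, raw: dict) -> dict:
        """Map a raw notice onto OCDS fields and extensions."""

    def run(self) -> list[dict]:
        # The standardized part: fetch, map, and hand off for validation.
        return [self.to_ocds(raw) for raw in self.fetch()]
```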

With 15 years of experience in scraper development, we know how to do this efficiently. Each scraper has its own documentation, which simplifies maintenance. So if you want to expand your set of sources, we can integrate them into your current system.

Tender Aggregation System Architecture

It’s important to highlight that we don’t offer a ready-made, one-size-fits-all system that can be copied and deployed from a previous project. Our applications are built using a microservice architecture consisting of multiple components organized in layers. Some layers handle essential functionalities like fault tolerance and monitoring, while others are dedicated to business-specific logic.

For example, we use several proprietary extensions when working with tenders and creating OCDS outputs. These extensions are integrated at different levels within the application and are exclusively tailored for each client. Because of this, our applications are not generic, off-the-shelf products; they are custom-built solutions designed to address the specific needs of individual clients. This also means that we cannot reuse scrapers, data, or configurations from one client for another. Each project involves unique client-owned elements, such as extensions, OCID encoding, source segmentation rules, etc.

Our approach is backed by 15 years of experience developing similar systems and is guided by a “framework” that allows us to implement new data collection systems efficiently. While this framework accelerates development, each system is configured with its own set of components and customized business logic to meet the client’s unique requirements, including their specific SLA expectations.

Conclusion: Improving Tender Aggregation Processes

Our approach to tender data aggregation combines flexibility, speed, and client-focused service to deliver superior results. With a custom-built system architecture, we ensure every client receives a solution tailored to their unique needs, avoiding the pitfalls of generic, one-size-fits-all platforms.

Our 15 years of expertise support the rapid integration of new data sources, allowing us to expand and adapt systems efficiently with minimal downtime. Our dedicated support team ensures high data quality, prompt issue resolution, and proactive system monitoring, meeting the most demanding SLA requirements.

Looking to enhance your tender aggregation process? Partner with us for a system as dynamic as your business needs. Contact us today!
