Introduction: Why Tender Aggregation Matters
In our previous post, Challenges in Tender Data Unification, we looked at the complexities of gathering and unifying tender information from various sources across the EU. Many of you reached out with questions about how we tackle key issues like working with multiple languages, handling classification challenges, and ensuring data quality. This time we will focus on practical tools for monitoring tender data, overcoming multilingual and classification obstacles, and maintaining high data accuracy. Let’s dive into these solutions and how they help create reliable tender aggregation systems.
Approaches to Multilingual Tenders
Due to the multilingual nature of the EU, with 24 official languages, tender information is often only available in the contracting country’s language. This creates barriers for companies from other member states that may not understand the relevant language, complicating access to information and participation in tenders.
Additionally, various EU countries use their own national platforms for publishing tenders, resulting in fragmented information that makes centralized collection and analysis challenging. While TED (Tenders Electronic Daily) aims to consolidate this data, not all tenders are published on the portal, particularly those below certain financial thresholds.
For multilingual classification, we consider the following approaches:
- Expanding Our Existing Model: We built our own model based on English data. The training dataset consists of approximately 500,000 to 800,000 records provided by Contracts Finder (CF) and Find Tender (FT), including archived and current data dating back to 2015. Given the UK's regulatory requirements and the structured nature of CF/FT data, it was adopted as a standard and sufficiently comprehensive basis for training.
- Using Machine Translation/External APIs: Translate all texts into English and use an English-trained model for predictions. This approach avoids additional training, since translation services and models are already available. However, translation can reduce prediction accuracy through the loss of contextual, grammatical, or semantic nuance.
- Training Separate Models for Each Target Language: This involves identifying the language of the input release before predicting with the respective model. While this approach ensures high accuracy and minimizes distortions from uneven training datasets, it requires substantial training data for each language. Although this method scales less efficiently than a single-model approach, it provides a significant advantage in terms of accuracy by tailoring each model to the linguistic and cultural specifics of its respective language.
The choice of approach depends on task context, situation, and constraints. We favor the third option, as our services prioritize highly accurate, contextually precise predictions over rough approximations.
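To make the third option concrete, here is a minimal sketch of the routing step, assuming one trained model per language. The langdetect library and the model registry shown here are illustrative assumptions, not a description of our production stack:

```python
# Minimal sketch of the per-language routing step, assuming one trained
# model per language. langdetect and the registry below are illustrative
# assumptions, not our production stack.
from langdetect import detect  # pip install langdetect


class CpvClassifier:
    """Placeholder for a language-specific CPV model."""

    def __init__(self, language: str):
        self.language = language

    def predict(self, text: str) -> list[tuple[str, float]]:
        # A real model returns (cpv_code, probability) pairs.
        return [("03221200-5", 0.91)]


# One model per supported language; unknown languages fall back to English.
MODELS = {lang: CpvClassifier(lang) for lang in ("en", "de", "fr", "pl")}


def classify_release(text: str) -> list[tuple[str, float]]:
    language = detect(text)  # e.g. "de" for a German tender notice
    model = MODELS.get(language, MODELS["en"])
    return model.predict(text)
```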
Next, we will explore CPV codes and our categorization approach, which allows for identifying procurement subjects regardless of language. CPV codes also simplify automated data aggregation from various sources, reducing data processing complexity and improving classification accuracy.
How CPV Coding Works in Tender Data
CPV (Common Procurement Vocabulary) is a standardized classification system designed to describe procurement subjects in public tenders, particularly within European Union countries. Its main purpose is to unify terminology and simplify the process of searching and analyzing tender information.
CPV enhances automation in tender classification, reducing errors and improving the speed of data processing.
Key Features of CPV:
CPV Codes:
The CPV system consists of numerical codes, each representing a specific product, service, or work.
- A code consists of 8 digits plus a check digit, for example: 03221200-5, where:
- 03221200 is the main code indicating the specific product or service.
- 5 is the check digit used to ensure data entry accuracy.
Hierarchical Structure:
CPV is organized like a tree, where each level provides more detail about the product or service. For instance:
- 03000000-1 — Agricultural, farming, fishing, and forestry products.
- 03200000-3 — Cereals, potatoes, vegetables, fruits.
- 03221200-5 — Onions.
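Because the hierarchy is encoded in the digits themselves, parent categories can be recovered from any code by truncation. A small illustrative sketch (check digits are omitted, since each code carries its own):

```python
# Illustrative sketch: recover parent categories from an 8-digit CPV main
# code by truncating and zero-padding. Check digits are omitted, since
# each code carries its own.
def cpv_ancestors(code: str) -> list[str]:
    main = code.split("-")[0]               # "03221200-5" -> "03221200"
    ancestors = []
    for level in range(2, 8):               # division, group, class, category, ...
        parent = main[:level].ljust(8, "0")
        if parent != main and parent not in ancestors:
            ancestors.append(parent)
    return ancestors


print(cpv_ancestors("03221200-5"))
# ['03000000', '03200000', '03220000', '03221000']
```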
Additional Vocabulary:
In addition to the main codes, CPV includes a supplementary vocabulary that specifies additional characteristics of a procurement, such as sizes, materials, or intended purposes.
Our Own CPV Prediction Model
To build our own prediction model, we used 500,000–800,000 records sourced from Contracts Finder and Find Tender. These records include both archived and current data dating back to 2015. Due to the regulatory requirements in the UK and the structured nature of data on CF/FT sites, these sources were selected as a standard and sufficiently comprehensive basis for training.
The model predicts the exact CPV code without building a hierarchy and provides multiple options, each with an associated probability. Thanks to the reliability of CPV categories on government sites and the large volume of data, we achieved reasonably high prediction accuracy. CPV codes are integrated into our models for data unification in two ways:
- The predicted CPV code is inserted into the appropriate field where it was missing previously, ensuring proper classification.
- We extend the OCDS schema with a custom extension that adds metadata about the CPV category, including the prediction method and the confidence level (a value between 0 and 1) associated with each predicted category; an example follows this list.
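As an illustration of the second path, a release item might carry the prediction metadata like this. The cpvPredictionDetails block and its field names are hypothetical stand-ins; our real extensions are client-specific:

```python
import json

# Hypothetical illustration of the second integration path: a predicted
# CPV classification in the standard OCDS field, with prediction metadata
# in a custom extension block. "cpvPredictionDetails" and its fields are
# illustrative names, not our actual extension.
item = {
    "id": "1",
    "classification": {
        "scheme": "CPV",
        "id": "03221200-5",       # predicted code filling a previously empty field
        "description": "Onions",
    },
    "cpvPredictionDetails": {
        "method": "model-prediction",
        "confidence": 0.91,       # value between 0 and 1
        "alternatives": [
            {"id": "03221000", "confidence": 0.06},  # check digit omitted here
        ],
    },
}

print(json.dumps(item, indent=2))
```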
For training, we employed a standard eXtreme Multi-Label Classification method, fine-tuning parameters such as the number of epochs. While no groundbreaking solutions were implemented, the model works effectively for our specific use case. However, we do not claim that it outperforms or underperforms other models in general.
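We won't reproduce the training pipeline here, but the shape of the task can be sketched with a simplified stand-in: TF-IDF features plus a linear one-vs-rest classifier in place of a dedicated XMC library. The two-record dataset and all parameters below are illustrative only:

```python
# Simplified stand-in for the training setup: TF-IDF features plus a
# linear one-vs-rest classifier instead of a dedicated XMC library.
# The two-record dataset and all parameters are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["Supply of onions", "Office building construction"]
cpv_labels = ["03221200-5", "45000000-7"]

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(texts, cpv_labels)

# predict_proba yields a probability per CPV code, so the top candidates
# can be surfaced with their confidence values.
probas = pipeline.predict_proba(["Procurement of fresh onions"])[0]
top = sorted(zip(pipeline.classes_, probas), key=lambda p: -p[1])[:3]
print(top)
```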
How We Ensure the Tender Data Aggregation System Works Properly
We treat each site (or source) as an independent resource, with all statistics, management, and fault tolerance organized specifically for that site. In some cases, sites may be grouped together or share the same platform to publish their tenders. Even in such situations, we maintain a high level of granularity for each site. This approach ensures that if one organization stops using a platform and a site becomes inactive, it does not disrupt other organizations or their sites on the same platform. However, platform-wide changes do impact all connected sites.
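Conceptually, the granularity looks like the hypothetical sketch below: each site is an independent unit with its own state, even when several sites share a platform:

```python
# Conceptual sketch (not our actual code): each site is an independent
# unit with its own state, even when several sites share a platform, so
# deactivating one site never affects its platform neighbours.
from dataclasses import dataclass


@dataclass
class Site:
    site_id: str
    platform: str
    active: bool = True


sites = [
    Site("city-a-procurement", platform="shared-platform-x"),
    Site("city-b-procurement", platform="shared-platform-x"),
    Site("ministry-portal", platform="standalone"),
]

# One organization leaving the shared platform deactivates only its site.
sites[0].active = False
print([s.site_id for s in sites if s.active])
# ['city-b-procurement', 'ministry-portal']
```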
To monitor the progress and functionality of our data aggregation systems, we use Grafana. It’s not used as an analytics tool for the data itself but as a system monitoring tool that provides real-time insights into the health of our collection processes. Grafana fulfills this role exceptionally well.
While clients have access to Grafana dashboards, they typically don’t need to use them directly. Instead, our support team relies heavily on these dashboards to monitor data quality, resolve issues promptly, and even anticipate potential problems before they occur.
Here are some of the key metrics we track (a sketch of how such metrics could be exported follows the list):
- Daily collection runtime per source: Monitoring how long it takes to process each site.
- Number of HTTP errors per source: Tracking issues related to connectivity or site accessibility.
- Collection progress and retry attempts: Ensuring all data is collected, with automatic retries where necessary.
- Expected daily release count per source: For example, if a site usually publishes 100–150 tenders per day, we check whether the numbers fall within this range.
- Pipeline accuracy in processing releases: Ensuring all collected data is correctly saved in the database.
- Missing values in specific fields: Monitoring fields like title, description, CPV code, release date, buyer, and tag to ensure the completeness of critical information.
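This post doesn't cover our metrics backend in detail, but a common way to feed such per-source metrics into Grafana is a Prometheus exporter. The sketch below uses the Prometheus Python client with hypothetical metric names and label values:

```python
# Hypothetical sketch of exporting per-source collection metrics for
# Grafana via the Prometheus Python client. Metric names and label values
# are illustrative; the post does not specify our actual backend.
from prometheus_client import Counter, Gauge, start_http_server

HTTP_ERRORS = Counter(
    "collector_http_errors_total", "HTTP errors per source", ["source"]
)
RUNTIME_SECONDS = Gauge(
    "collector_daily_runtime_seconds", "Daily collection runtime", ["source"]
)
RELEASES = Gauge(
    "collector_daily_releases", "Releases collected today", ["source"]
)

start_http_server(9100)  # endpoint scraped by Prometheus

# Inside the collection loop:
RUNTIME_SECONDS.labels(source="city-a-procurement").set(412.0)
RELEASES.labels(source="city-a-procurement").set(137)  # expected range: 100-150
HTTP_ERRORS.labels(source="city-a-procurement").inc()
```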
Our system includes an object model that supports both the standard OCDS 1.2 specification and various custom extensions. During the data collection process, validation ensures that fields conform to the required structure. Any fields that fail validation are omitted from the final OCDS JSON. Additionally, the system tracks missing values for critical fields, which is particularly important for clients who depend on accurate and complete data.
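The object model itself is proprietary, but the validation behaviour described above can be illustrated with a simplified, hypothetical set of validators:

```python
# Simplified, hypothetical illustration of the validation behaviour:
# fields that fail validation are omitted from the final OCDS JSON, and
# missing critical fields are tracked for monitoring.
import re

CRITICAL_FIELDS = {"title", "description", "cpv", "date", "buyer", "tag"}

VALIDATORS = {
    "title": lambda v: isinstance(v, str) and v.strip() != "",
    "cpv": lambda v: isinstance(v, str) and bool(re.fullmatch(r"\d{8}-\d", v)),
    "date": lambda v: isinstance(v, str) and bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", v)),
}


def build_release(raw: dict) -> tuple[dict, set]:
    """Keep only fields that pass validation; report missing criticals."""
    release = {
        field: value
        for field, value in raw.items()
        if VALIDATORS.get(field, lambda v: True)(value)
    }
    missing = CRITICAL_FIELDS - release.keys()
    return release, missing


release, missing = build_release(
    {"title": "Supply of onions", "cpv": "bad-code", "date": "2024-05-01"}
)
print(release)  # "cpv" is dropped because it failed validation
print(missing)  # missing criticals feed the Grafana/Sentry statistics
```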
To monitor such missing fields, we use Grafana and Sentry, enabling real-time tracking and visualization of data quality issues. Moreover, for client-specific business cases, certain fields are marked as essential. Additional statistics are maintained to identify records where these significant fields are missing, providing deeper insights and helping ensure compliance with client requirements.
How We Ensure Data Quality
One of the key strengths of our system is its unwavering focus on data completeness and quality. Our team of experienced scraper developers and QA specialists brings deep domain expertise, ensuring accurate mapping of data from free text or custom website layouts into OCDS fields and extensions.
By leveraging an object-oriented approach in our code, we effectively manage and monitor the complex structure of a release. This approach minimizes errors and incorrect mappings, ensuring reliable, high-quality outputs.
For projects of this nature, we assign a dedicated support team that works full-time to oversee and optimize the data collection process. If this level of service is something you value, we can tailor our support to your specific needs: how quickly issues are investigated and fixed is governed by the SLA you require. How fast do you need us to act? Let's make sure our service aligns with your expectations.
Adding New Tender Sources Easily (Additional Scrapers)
The system’s architecture allows for rapid addition of new scrapers, provided a ready deployment with configured microservices is in place. The entire process is standardized:
- Source analysis
- Implementation of the scraper and data mapping to OCDS
- Documentation
- Delivery to a testing environment and verification
- Delivery to the production environment
When adding a scraper for a known and previously analyzed platform, deployment takes less than a day. For entirely new sources, timing depends on their specifics but generally requires no more than a few days.
With 15 years of experience in scraper development, we know how to do this efficiently and effectively. Each scraper has its own documentation, which simplifies maintenance. So if you want to expand your set of sources, we can build the new scrapers and integrate them into your current system.
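The framework itself stays internal, but the standardized shape of a scraper can be sketched roughly as follows; class and method names are hypothetical:

```python
# Rough, hypothetical sketch of the standardized scraper shape: a new
# source only implements fetch and mapping, while deployment, retries,
# and monitoring come from the surrounding framework.
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Contract every source-specific scraper fulfils."""

    source_id: str

    @abstractmethod
    def fetch(self) -> list[dict]:
        """Download raw tender records from the source."""

    @abstractmethod
    def to_ocds(self, raw: dict) -> dict:
        """Map one raw record to an OCDS release."""

    def run(self) -> list[dict]:
        return [self.to_ocds(raw) for raw in self.fetch()]


class ExamplePortalScraper(BaseScraper):
    source_id = "example-portal"  # hypothetical source

    def fetch(self) -> list[dict]:
        return [{"name": "Supply of onions", "published": "2024-05-01"}]

    def to_ocds(self, raw: dict) -> dict:
        return {"date": raw["published"], "tender": {"title": raw["name"]}}


print(ExamplePortalScraper().run())
```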
Tender Aggregation System Architecture
It’s important to highlight that we don’t offer a ready-made, one-size-fits-all system that can simply be copied and deployed from a previous project. Our applications are built using a microservice architecture, consisting of multiple components organized in layers. Some layers handle essential functionalities like fault tolerance and monitoring, while others are dedicated to business-specific logic.
For example, when working with tenders and creating OCDS outputs, we use several proprietary extensions. These extensions are integrated at different levels within the application and are exclusively tailored for each client. Because of this, our applications are not generic, off-the-shelf products; they are custom-built solutions designed to address the specific needs of individual clients. This also means that we cannot reuse scrapers, data, or configurations from one client for another. Each project involves unique client-owned elements, such as extensions, OCID encoding, source segmentation rules, and more.
Our approach is backed by 15 years of experience in developing similar systems and is guided by a “framework” that allows us to implement new data collection systems efficiently. While this framework accelerates development, each system is configured with its own set of components and customized business logic to meet the client’s unique requirements, including their specific SLA expectations.
Conclusion: Improving Tender Aggregation Processes
Our approach to tender data aggregation combines flexibility, speed, and client-focused service to deliver superior results. With a custom-built system architecture, we ensure every client receives a solution tailored to their unique needs, avoiding the pitfalls of generic, one-size-fits-all platforms.
The rapid integration of new data sources, supported by our 15 years of expertise, allows us to expand and adapt systems efficiently, with minimal downtime. Additionally, our dedicated support team ensures high data quality, prompt issue resolution, and proactive system monitoring, meeting even the most demanding SLA requirements.
Looking to enhance your tender aggregation process? Partner with us for a system that’s as dynamic as your business needs. Contact us today!