Transforming unstructured construction data into joined-up knowledge

Construction technology platforms accumulate vast volumes of unstructured data. Free-text entries, abbreviations, and inconsistent terminology prevent this data from being connected to structured datasets - leaving valuable insights locked away even when technical integrations exist.

A construction software provider faced exactly this challenge. Their platform held millions of free-text invoice line items that couldn't be reconciled with structured pricing and compliance datasets without prohibitive manual effort.

The challenge

The platform contained 4 million rows of unstructured data, with 600,000 new rows added monthly. Each entry was a free-text description of products and materials - inconsistent, abbreviated, and unstandardised.

This created two interconnected problems. First, without a way to map historic entries to standardised product codes, years of accumulated pricing data could not be transformed into a usable analytics database - the commercial opportunity was locked away in plain sight. Second, as new entries were created, there was no mechanism to classify them on the fly and feed pricing or compliance insights back to users within the workflow itself.

Solving both required the same underlying capability: a classification engine that could handle millions of existing rows at scale, and operate fast enough to enrich new entries in real time. Three barriers needed to be overcome:

Inconsistent inputs. Free-text entry meant the same product could be described dozens of different ways, making automated classification impossible through conventional means.
Incompatible datasets. Without a common taxonomy, operational data could not be joined to structured pricing or compliance datasets, leaving significant commercial opportunity unrealised.
Overwhelming scale. With millions of rows of existing data and hundreds of thousands added each month, manual classification was never a viable option.

Hoppa's solution

Following an evaluation process that compared the performance of several providers, the organisation commissioned Hoppa to solve this challenge using our proprietary HARDR algorithm (Hierarchically Aware Row-To-Description Retriever). HARDR specialises in taking unstructured inputs, such as invoice line items, and classifying them against a hierarchical taxonomy of the customer's choosing.

For this engagement, invoice line items were mapped to product and material categories defined by industry-standard classification systems: Uniclass and ETIM. By normalising the plain language descriptions to a structured taxonomy, the operational data could then be connected to the organisation's pricing and compliance datasets. This enabled entirely new value propositions, such as serving real-time information to frontline teams directly within their workflow.
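To make the idea of normalising free text to a taxonomy and then joining it to pricing data concrete, here is a deliberately simple sketch. It is not HARDR, which is proprietary; it swaps in a toy keyword-overlap matcher purely for illustration. The abbreviation map, the taxonomy codes (placeholders, not real Uniclass or ETIM codes), the prices, and the function names are all assumptions made for this example.

```python
import re

# Hypothetical abbreviation map; real invoice data would need a far larger one.
ABBREVIATIONS = {
    "pb": "plasterboard",
    "plstrbd": "plasterboard",
    "galv": "galvanised",
    "stl": "steel",
    "conc": "concrete",
    "blk": "block",
}

# Hypothetical two-level taxonomy. Codes are placeholders,
# not real Uniclass or ETIM codes.
TAXONOMY = {
    "BOARD-01": {"parent": "BOARD", "keywords": {"plasterboard", "gypsum", "board"}},
    "BLOCK-01": {"parent": "BLOCK", "keywords": {"concrete", "block", "dense"}},
    "FIX-01": {"parent": "FIX", "keywords": {"galvanised", "steel", "screw", "fixing"}},
}

# Hypothetical structured pricing dataset keyed by the same codes.
PRICING = {"BOARD-01": 6.50, "BLOCK-01": 1.20, "FIX-01": 0.04}


def normalise(text):
    """Lowercase, keep alphabetic tokens only, and expand known abbreviations."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {ABBREVIATIONS.get(token, token) for token in tokens}


def classify(text):
    """Return the taxonomy code with the most keyword overlap, or None."""
    tokens = normalise(text)
    best_code, best_score = None, 0
    for code, node in TAXONOMY.items():
        score = len(tokens & node["keywords"])
        if score > best_score:
            best_code, best_score = code, score
    return best_code


def enrich(text):
    """Classify a line item, then join it to the pricing dataset by code."""
    code = classify(text)
    return {
        "description": text,
        "code": code,
        "parent": TAXONOMY[code]["parent"] if code else None,
        "unit_price": PRICING.get(code),
    }


if __name__ == "__main__":
    for line in ["12.5mm PB 2400x1200", "Plstrbd 12.5 mm", "misc delivery charge"]:
        print(enrich(line))
```

The first two descriptions are written differently but resolve to the same placeholder code, so both pick up the same price; the third matches nothing and is left unclassified rather than forced into a category. A production classifier would need far more than keyword overlap, which is the gap the engagement addressed.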

In addition to mapping descriptions to structured categories, Hoppa's solution also extracted other key information, such as manufacturer names, enhancing dataset richness and utility.

Deployment and impact

Recognising the strategic importance of HARDR's capabilities, the organisation opted to license the algorithm and deploy Hoppa's technology directly onto their own infrastructure. This wasn't simply a technology transfer - Hoppa embedded itself in the way the customer needed to work, stitching our capability directly into their backend systems and operational processes.

The partnership went beyond pure technology deployment. Working with a product-centric organisation, Hoppa shared our deep data expertise and know-how, helping the client understand not just how to implement the solution, but how to maximise its value across their business. This knowledge transfer ensured HARDR could operate seamlessly within their digital ecosystem, processing data at scale and delivering structured, accurate outputs in real time.

Hoppa by numbers

Customer subject matter experts assessed classification accuracy across a representative sample of HARDR's output. In parallel, a blind control study was run with two domain experts independently classifying the same sample.

91% of classifications usable without further intervention. 80% were assessed as perfect matches and 11% as good matches. The remaining 9% of low-confidence matches were predominantly attributable to underlying data quality issues and the inherent limitations of taxonomies like Uniclass, which do not always capture the full breadth of construction terminology.
2× greater coverage than domain experts working independently. The two human experts reached consensus on only 39% of entries, due to knowledge gaps and differences in interpretation. Hoppa not only processed data at a scale no human team could match, but delivered greater consistency and accuracy than domain experts working manually.
250,000 hours of manual effort eliminated. This is the equivalent human effort that Hoppa's classification engine removed when scaled across the full 4-million-row data backlog.
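The headline figures above can be cross-checked with some simple arithmetic. The sketch below uses only numbers quoted in this case study; the per-row time and the monthly figure are derived values that assume effort scales linearly with row count, and the two ratios are possible readings of the 2× figure, whose exact basis is not stated here.

```python
# Figures quoted in this case study.
BACKLOG_ROWS = 4_000_000
MONTHLY_ROWS = 600_000
HOURS_SAVED = 250_000
PERFECT_PCT = 80
GOOD_PCT = 11
EXPERT_CONSENSUS_PCT = 39

# Usable share is the sum of perfect and good matches.
usable_pct = PERFECT_PCT + GOOD_PCT
low_confidence_pct = 100 - usable_pct

# Implied manual effort per row (assumes linear scaling).
minutes_per_row = HOURS_SAVED * 60 / BACKLOG_ROWS

# Implied manual effort for each month's new rows at the same rate.
implied_monthly_hours = MONTHLY_ROWS * minutes_per_row / 60

# Two possible readings of the "2x" comparison with expert consensus.
perfect_vs_consensus = round(PERFECT_PCT / EXPERT_CONSENSUS_PCT, 2)
usable_vs_consensus = round(usable_pct / EXPERT_CONSENSUS_PCT, 2)

if __name__ == "__main__":
    print(f"Usable: {usable_pct}%, low confidence: {low_confidence_pct}%")
    print(f"Implied manual time per row: {minutes_per_row} minutes")
    print(f"Implied manual hours per month: {implied_monthly_hours}")
    print(f"Perfect vs consensus: {perfect_vs_consensus}x")
    print(f"Usable vs consensus: {usable_vs_consensus}x")
```

On these assumptions, the 250,000-hour figure corresponds to just under four minutes of manual effort per row, and the monthly intake alone would imply tens of thousands of hours if classified by hand.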

From fragmented records to actionable intelligence

Hoppa's classification engine connected datasets that were never designed to work together, transforming the platform from a data entry tool into an intelligence layer. With entries now mapped to standardised product codes, the organisation could serve real-time pricing and compliance information to frontline teams directly within their workflow - unlocking entirely new commercial propositions.

Tom Goldsmith

Co-Founder & CTO at Hoppa

“This engagement demonstrated Hoppa's ability to process complex, high-volume construction data at scale. The solution embedded directly into existing technology infrastructure, delivering structured intelligence without disrupting established workflows.”

Looking ahead

For organisations with large volumes of unstructured operational data, Hoppa bridges the gap between fragmented inputs and connected intelligence. Whether the challenge is invoice classification, asset record enrichment, or materials cataloguing, our adaptable algorithms can be licensed and embedded directly into your existing technology stack - turning data that was previously unusable into a source of competitive advantage.