30 sources delivered without a match key
An alternative data provider that packages web-sourced datasets for investment and research customers was receiving monthly deliveries from 30 sources. The entity identifier needed to match each record to other datasets was absent from every record. The provider's vendor said it couldn't be pulled. In fact, the identifier was there the whole time, hidden in the page behind the visible view.
The entity identifier (the stable ID, such as an LEI or ticker, that lets a record be joined to other datasets) does two jobs. It completes the record, and it is what makes every downstream match accurate. A record without its identifier can only be joined on approximate fields like name, which produces false matches and silent errors. So the missing field was two problems at once: the records were incomplete, and any matching done on them was unreliable. The vendor counted these records toward delivered volume regardless.
The vendor had told the client the identifier couldn't be pulled because it wasn't visible anywhere on the page. The client took that at face value. But "not shown in the interface" is not the same as "not in the page." The identifier was sitting in the data layer behind the rendered view, never displayed to a human, fully accessible to a scraper that knows where to look.
We extracted it from the same sources the vendor was already scraping. In some cases it took one additional request. In others, it meant reading the structured response behind the page instead of the visible table. The data was there the whole time. It simply wasn't being captured.
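To make "in the page but not in the interface" concrete, here is a minimal sketch in Python. It assumes a hypothetical page whose visible table shows only name and country, while an embedded JSON payload carries the identifier. The page markup, the `entity` and `lei` keys, and the placeholder value are illustrative assumptions, not the client's actual sources.

```python
import json
from html.parser import HTMLParser

# Hypothetical page: the table shows only name and country, while the
# identifier lives in an embedded JSON payload that is never rendered.
PAGE = """
<html><body>
<table><tr><td>Acme Holdings</td><td>DE</td></tr></table>
<script type="application/json" id="page-data">
{"entity": {"name": "Acme Holdings", "country": "DE", "lei": "EXAMPLE-LEI-0001"}}
</script>
</body></html>
"""


class EmbeddedJsonParser(HTMLParser):
    """Collects the parsed content of every <script type="application/json">."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            self.payloads.append(json.loads("".join(self._buffer)))


def extract_identifier(html, key="lei"):
    """Return the first value stored under `key` in an embedded payload, else None."""
    parser = EmbeddedJsonParser()
    parser.feed(html)
    parser.close()
    for payload in parser.payloads:
        value = payload.get("entity", {}).get(key)
        if value:
            return value
    return None


identifier = extract_identifier(PAGE)
print(identifier)
```

A scraper that reads only the `<table>` never sees the `lei` field. One that also reads the embedded payload gets it from the same response, with no extra request.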
Capturing the identifier fixed both problems at once. The record was complete, and because matching now ran on a stable, authoritative ID instead of approximate name matching, it was accurate. Records that had been silently mis-joined or duplicated resolved to the right entity. The same fix that filled the gap also removed a class of matching errors the client could not see.
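The difference between name matching and ID matching can be sketched with a toy join. The records, the `entity_id` field, and the `normalize` helper below are hypothetical. They stand in for whatever approximate matching a downstream system would fall back on when the identifier is missing.

```python
# Hypothetical records: two distinct entities share a trading name.
delivered = [
    {"name": "Acme Holdings", "entity_id": "ID-001", "revenue": 120},
    {"name": "Acme Holdings", "entity_id": "ID-002", "revenue": 45},
]
reference = [
    {"name": "ACME Holdings Ltd", "entity_id": "ID-001", "sector": "Industrials"},
    {"name": "Acme Holdings", "entity_id": "ID-002", "sector": "Retail"},
]


def normalize(name):
    """Crude name normalization, as an approximate matcher might apply."""
    words = name.lower().replace(".", "").split()
    return " ".join(word for word in words if word not in {"ltd", "inc", "plc"})


def join_on(left, right, key_fn):
    """Return every (left, right) pair whose keys are equal."""
    return [
        (left_rec, right_rec)
        for left_rec in left
        for right_rec in right
        if key_fn(left_rec) == key_fn(right_rec)
    ]


by_name = join_on(delivered, reference, lambda rec: normalize(rec["name"]))
by_id = join_on(delivered, reference, lambda rec: rec["entity_id"])

# Pairs the name join produced that link two different entities.
false_matches = [
    (left_rec, right_rec)
    for left_rec, right_rec in by_name
    if left_rec["entity_id"] != right_rec["entity_id"]
]

print(len(by_name), len(by_id), len(false_matches))
```

In this toy data, the name join returns four pairs, two of which link different entities. The ID join returns exactly the two correct pairs. Nothing in the name join's output flags the bad pairs, which is why such errors stay silent.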
This is a category of gap that spot checks cannot surface. A record delivered without its match key looks complete: all other fields present, no obvious error. The missing field only becomes visible when the data reaches a downstream system that requires it, or when someone checks whether the source actually holds more than the vendor returned.
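One way to catch this before the data reaches a downstream system is to measure fill rate per field, treating the match key as a required field rather than counting rows. The sketch below is illustrative only. The sample records and the field names, `entity_id` included, are assumptions.

```python
def is_filled(record, field):
    """True when the field is present and not empty."""
    return record.get(field) not in (None, "")


def fill_rates(records, fields):
    """Share of records with a non-empty value for each field."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for rec in records if is_filled(rec, field)) / total
        for field in fields
    }


def usable_records(records, match_key):
    """Records that can be joined downstream because the match key is present."""
    return [rec for rec in records if is_filled(rec, match_key)]


# Hypothetical delivery: every row looks complete except for the match key.
delivery = [
    {"name": "Acme Holdings", "country": "DE", "entity_id": ""},
    {"name": "Borealis Corp", "country": "SE", "entity_id": None},
    {"name": "Cobalt Partners", "country": "US"},
    {"name": "Delta Mills", "country": "FR", "entity_id": "ID-004"},
]

rates = fill_rates(delivery, ["name", "country", "entity_id"])
usable = usable_records(delivery, "entity_id")
print(rates, len(usable), "of", len(delivery))
```

By row count this delivery is four records. By usable count it is one. A spot check on `name` and `country` would pass every row.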
Across 30 sources, records the provider could only partly use became fully usable: matchable, verifiable, and sellable. Both the accuracy and the completeness of the provider's downstream data improved, with no change to the sources or scrape scope.
How much of your delivered data can you actually use downstream?
Start a Gap Audit →