Unstructured data

Unstructured data is information without a fixed schema of rows and fields: quote PDFs, contracts, engineering drawings, emails, specifications, meeting notes. Most of what a procurement team actually knows lives in this form, while its systems have historically computed only on the structured minority, such as PO lines and invoice fields. Modern language models make unstructured content extractable and searchable, which is quietly redefining what counts as procurement data.

Examples

Quote archaeology: A senior buyer retires, leaving 600 quote PDFs and four years of pricing rationale in a mailbox. Extraction turns the archive into searchable line-item history and surfaces that one connector's price climbed 22% across three re-quotes while its volume doubled. The renegotiation pays for the project.
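Once quotes exist as line items, the pattern in this example is plain arithmetic. The following is a minimal sketch, not a description of any particular product: the `QuoteLine` fields, the part numbers, the thresholds, and the sample figures are all made up for illustration. It compares each part's first and last quote and flags parts whose price rose while volume grew.

```python
from dataclasses import dataclass


@dataclass
class QuoteLine:
    part: str
    quoted_on: str      # ISO date, so string order is chronological order
    unit_price: float
    annual_volume: int


def price_drift(lines):
    """Map each part to (price change %, volume ratio) between first and last quote."""
    history = {}
    for line in sorted(lines, key=lambda l: l.quoted_on):
        history.setdefault(line.part, []).append(line)

    report = {}
    for part, quotes in history.items():
        first, last = quotes[0], quotes[-1]
        if len(quotes) < 2 or first.unit_price <= 0 or first.annual_volume <= 0:
            continue  # nothing meaningful to compare against
        change_pct = (last.unit_price - first.unit_price) / first.unit_price * 100
        volume_ratio = last.annual_volume / first.annual_volume
        report[part] = (round(change_pct, 1), round(volume_ratio, 2))
    return report


def worth_a_conversation(report, min_rise_pct=10.0, min_volume_ratio=1.5):
    """Parts whose price rose although volume grew (thresholds are assumptions)."""
    return [
        part
        for part, (change_pct, volume_ratio) in report.items()
        if change_pct >= min_rise_pct and volume_ratio >= min_volume_ratio
    ]


# Made-up data shaped like the connector example above.
SAMPLE = [
    QuoteLine("CONN-001", "2021-03-01", 1.00, 10_000),
    QuoteLine("CONN-001", "2022-02-15", 1.08, 12_000),
    QuoteLine("CONN-001", "2023-01-20", 1.15, 16_000),
    QuoteLine("CONN-001", "2024-02-10", 1.22, 20_000),
    QuoteLine("BRKT-310", "2021-05-05", 4.10, 5_000),
    QuoteLine("BRKT-310", "2023-06-12", 4.00, 9_000),
]

report = price_drift(SAMPLE)
flagged = worth_a_conversation(report)
```

With this sample, only the connector is flagged: its price is up 22% while its volume doubled. The bracket's volume also grew, but its price fell slightly, so it stays off the list.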

Spec mismatch: The drawing note reads 6061-T6; the quote line reads 6063-T5. Automated cross-reading flags the alloy mismatch before the PO is cut, instead of at incoming inspection three months later. The fix costs one email; caught downstream, it would have scrapped a 2,000-piece first run.
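A rough sketch of the cross-reading check, assuming the drawing note and the quote line have already been extracted as text. The regular expression covers only four-digit aluminum alloys with a T temper (6061-T6, 6063-T5, 6061-T651); the function names and the three result labels are hypothetical, and a real check would need to handle more designation systems.

```python
import re

# Four-digit aluminum alloy plus T temper, e.g. "6061-T6" or "6061-T651".
# The lookarounds keep the pattern from matching inside longer digit runs.
ALLOY_TEMPER = re.compile(r"(?<!\d)[1-8]\d{3}-T\d{1,4}(?!\d)", re.IGNORECASE)


def find_alloys(text):
    """Return the set of alloy-temper designations mentioned in the text."""
    return {match.group(0).upper() for match in ALLOY_TEMPER.finditer(text)}


def check_alloy(drawing_note, quote_line):
    """Compare the alloys named in a drawing note and in a quote line."""
    on_drawing = find_alloys(drawing_note)
    on_quote = find_alloys(quote_line)
    if not on_drawing or not on_quote:
        return "unreadable"  # no designation found on one side: send to a person
    return "match" if on_drawing == on_quote else "mismatch"


result = check_alloy(
    "NOTE 3: MATERIAL 6061-T6",
    "Extruded housing, 6063-T5, clear anodize",
)
```

Here `result` is `"mismatch"`. Returning `"unreadable"` rather than `"match"` when either side names no alloy is a deliberate choice: silence in the text is not agreement.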

Definition

Consider what the structured record of a purchase captures: part number, quantity, price, dates. Why the price is what it is (the volume break in the quote's notes, the tooling amortization in an email thread, the tolerance change in drawing rev C) lives in documents. For years that knowledge was retrievable only through the person who remembered it, which is why a buyer's departure routinely erases a category's pricing history.

What changed is that large language models and other NLP techniques now read these documents well enough to extract terms, normalize line items, and answer questions with citations. The work shifts from retyping to reviewing. Extracted output still needs validation and deduplication before it can feed spend analysis at line-item depth: data cleansing does not disappear; it moves earlier in the pipeline.
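To make the validation and deduplication step concrete, here is a small sketch under stated assumptions: extracted line items arrive as dictionaries with hypothetical `supplier`, `part`, `quantity`, and `unit_price` keys, prices use US-style formatting, and two items count as duplicates when all four normalized fields agree. None of this reflects a specific tool's schema.

```python
from decimal import Decimal, InvalidOperation


def normalize(raw):
    """Return a cleaned line item, or None if a required field fails validation."""
    try:
        price_text = str(raw["unit_price"]).replace("$", "").replace(",", "").strip()
        price = Decimal(price_text)
        quantity = int(raw["quantity"])
    except (KeyError, InvalidOperation, ValueError, TypeError):
        return None
    supplier = str(raw.get("supplier", "")).strip()
    part = str(raw.get("part", "")).strip().upper()
    if not supplier or not part or not price.is_finite() or price <= 0 or quantity <= 0:
        return None
    return {"supplier": supplier, "part": part, "quantity": quantity, "unit_price": price}


def cleanse(raw_items):
    """Validate, normalize, and deduplicate; return (accepted, rejected)."""
    accepted, rejected, seen = [], [], set()
    for raw in raw_items:
        item = normalize(raw)
        if item is None:
            rejected.append(raw)  # kept so a reviewer can see what failed
            continue
        key = (item["supplier"].casefold(), item["part"],
               item["quantity"], item["unit_price"])
        if key in seen:
            continue  # same line extracted twice, e.g. from a forwarded email
        seen.add(key)
        accepted.append(item)
    return accepted, rejected


# Made-up extraction output: one clean line, one duplicate, one bad quantity.
RAW = [
    {"supplier": "Acme", "part": " cn-104 ", "quantity": "500", "unit_price": "$1,250.00"},
    {"supplier": "ACME ", "part": "CN-104", "quantity": 500, "unit_price": "1250.00"},
    {"supplier": "Acme", "part": "CN-200", "quantity": "n/a", "unit_price": "3.10"},
]

accepted, rejected = cleanse(RAW)
```

With this input, one item is accepted, the second is dropped as a duplicate, and the third lands in `rejected` because its quantity cannot be parsed. Prices are held as `Decimal` rather than `float` so that equal amounts compare equal during deduplication.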

Quality tracks the source material: digital PDFs extract well; scanned faxes, photographed drawings, and handwritten notes are harder and need confidence-based review. The practical sequencing is to start with the document type that decides money, which is usually quotes. Turning messy quote PDFs into structured, comparable line items is the specific problem LightSource was built around.
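Confidence-based review can be sketched as a simple split, assuming the extraction step reports a confidence score between 0 and 1 for each field. The function name, the field names, and the 0.90 threshold are illustrative assumptions; a sensible threshold depends on the extractor and on how costly an error in that field is.

```python
def route_for_review(fields, threshold=0.90):
    """Split extracted fields into auto-accepted values and a human review queue.

    `fields` maps a field name to a (value, confidence) pair, confidence in [0, 1].
    """
    accepted, review = {}, {}
    for name, (value, confidence) in fields.items():
        if confidence >= threshold:
            accepted[name] = value
        else:
            review[name] = value  # low confidence: a person checks the source
    return accepted, review


# Made-up output for one line of a scanned quote.
scanned_line = {
    "part": ("CN-104", 0.99),
    "quantity": (500, 0.97),
    "unit_price": ("1.22", 0.62),  # smudged digits on the fax
}

accepted, review = route_for_review(scanned_line)
```

In this sample the part number and quantity pass through, while the unit price goes to the review queue. A clean digital PDF would typically send little to review under the same rule, and a handwritten note would send most of its fields.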
