
What is unstructured data?

Unstructured data is information that does not naturally fit into a fixed table of rows and columns. It is still valuable, but it does not come with a consistent schema like “name, price, category” in a spreadsheet.

Examples include:

  1. Text: emails, chat logs, support tickets, documents
  2. Images and video: product photos, CCTV footage, medical scans
  3. Audio: calls, voice notes, podcasts
  4. PDFs and scanned forms: invoices, contracts, KYC documents

Unstructured data is important because a large share of enterprise knowledge lives in it. However, most ML models cannot reliably learn from it in raw form. It first has to be converted into ML-ready formats through steps such as preprocessing, labeling, and validation.

How unstructured data becomes usable for ML

  1. Extract signals (OCR, i.e. optical character recognition, for text in images; ASR, i.e. automatic speech recognition, for speech; metadata parsing)
  2. Annotate targets (entities, categories, bounding boxes, timestamps, segments)
  3. Normalize formats (consistent schemas, taxonomies, label definitions)
  4. Quality-check with audits, sampling, and reviewer workflows
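
The four steps above can be sketched in a few lines of Python for the simplest case, plain-text support tickets. This is a minimal illustration under stated assumptions, not a production pipeline: the `TAXONOMY` labels, the `TicketRecord` schema, and every function name are hypothetical, keyword rules stand in for human annotators, and a regular expression stands in for heavier extraction tools such as OCR or ASR.

```python
import random
import re
from dataclasses import dataclass

# Hypothetical label taxonomy; a real project would define its own.
TAXONOMY = {"billing", "shipping", "other"}

# Simple illustrative pattern; it does not cover every valid email address.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


@dataclass
class TicketRecord:
    """Hypothetical ML-ready schema for one support ticket."""
    text: str
    emails: list[str]
    category: str


def extract_emails(text: str) -> list[str]:
    """Step 1: extract a signal (email addresses) from raw text."""
    return EMAIL_PATTERN.findall(text)


def annotate_category(text: str) -> str:
    """Step 2: assign a target label. Keyword rules stand in for annotators."""
    lowered = text.lower()
    if "refund" in lowered or "invoice" in lowered:
        return "billing"
    if "delivery" in lowered or "package" in lowered:
        return "shipping"
    return "other"


def normalize(raw_text: str) -> TicketRecord:
    """Step 3: collapse whitespace and store everything in one schema."""
    cleaned = " ".join(raw_text.split())
    return TicketRecord(
        text=cleaned,
        emails=extract_emails(cleaned),
        category=annotate_category(cleaned),
    )


def is_valid(record: TicketRecord) -> bool:
    """Step 4a: automatic check that a record is non-empty and in taxonomy."""
    return bool(record.text) and record.category in TAXONOMY


def sample_for_audit(records: list[TicketRecord], k: int, seed: int = 0) -> list[TicketRecord]:
    """Step 4b: draw a reproducible random sample for human review."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


raw_tickets = [
    "Please  refund invoice 1042.\nContact: jane.doe@example.com.",
    "Where is my package?",
    "",  # empty ticket, rejected by the validity check
]
records = [normalize(t) for t in raw_tickets]
valid_records = [r for r in records if is_valid(r)]
audit_batch = sample_for_audit(valid_records, k=1)
```

In this sketch the empty ticket is filtered out by `is_valid`, and `audit_batch` holds the records a reviewer would inspect by hand. Images, audio, and scanned forms follow the same shape, with the extraction step replaced by the appropriate tool.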