Data Lake — definition

A data lake is a centralized repository that stores structured, semi-structured, and unstructured data in its raw format at any scale. Unlike a data warehouse, a data lake imposes no schema at write time — structure is applied when data is read. Data lakes are the storage foundation for AI training pipelines and large-scale analytics.

Modern data lakes are built on object storage (AWS S3, Azure Data Lake Storage, GCS) with open table formats (Delta Lake, Apache Iceberg, Apache Hudi) that add ACID transactions and schema evolution. AI teams use data lakes as the source for feature engineering, fine-tuning datasets, and RAG knowledge base construction. Without governance (data catalog, lineage, access controls), data lakes become 'data swamps' — storage full of data nobody can find or trust.

Related terms

Feature Engineering — Feature engineering is the process of transforming raw data into the input representations that machine learning models use to make predictions.

Data Mesh — Data mesh is an organizational and architectural approach to data management that distributes data ownership to the business domains that produce it, rather than centralizing all data in a single platform team.

Fine-Tuning — Fine-tuning is the process of training an existing AI model on additional task-specific data so its weights adapt to a narrower domain.

Synthetic Data — Synthetic data is AI-generated data that statistically mimics a real dataset without containing actual personal records.