Data
Feature Engineering
Feature engineering is the process of transforming raw data into the input representations that machine learning models use to make predictions. It includes selecting relevant variables, creating derived features, handling missing values, encoding categorical variables, and scaling numerical inputs.
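As a rough illustration of those steps, the sketch below applies each one to a tiny in-memory dataset using only the Python standard library. The records, the field names (`amount`, `n_items`, `channel`, `internal_id`), and the `engineer_features` helper are all hypothetical, and mean imputation, one-hot encoding, and standardization are just one reasonable choice for each step, not the only one.

```python
from statistics import mean, pstdev

# Hypothetical raw records; field names and values are illustrative only.
raw_rows = [
    {"amount": 120.0, "n_items": 3, "channel": "web", "internal_id": "a1"},
    {"amount": None, "n_items": 1, "channel": "store", "internal_id": "b2"},
    {"amount": 80.0, "n_items": 2, "channel": "web", "internal_id": "c3"},
]


def standardize(values):
    """Scale values to zero mean and unit variance (population std)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def engineer_features(rows):
    """Turn raw records into named numeric feature vectors."""
    # Handling missing values: fill missing amounts with the observed mean.
    observed = [r["amount"] for r in rows if r["amount"] is not None]
    fill = mean(observed)
    amounts = [r["amount"] if r["amount"] is not None else fill for r in rows]

    # Creating a derived feature: average amount per item.
    per_item = [a / r["n_items"] for a, r in zip(amounts, rows)]

    # Scaling numerical inputs.
    amount_z = standardize(amounts)
    per_item_z = standardize(per_item)

    # Encoding the categorical variable as one-hot columns.
    categories = sorted({r["channel"] for r in rows})
    one_hot = [
        [1.0 if r["channel"] == c else 0.0 for c in categories] for r in rows
    ]

    # Selecting variables: internal_id is an identifier, so it is left out.
    names = ["amount_z", "amount_per_item_z"] + [
        f"channel={c}" for c in categories
    ]
    matrix = [
        [az, pz] + oh for az, pz, oh in zip(amount_z, per_item_z, one_hot)
    ]
    return names, matrix


feature_names, X = engineer_features(raw_rows)
```

In a real pipeline the imputation value and scaling statistics would be computed on the training split only and then reused on new data, so that information from evaluation data does not leak into the features.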
Feature engineering has historically been one of the most time-intensive parts of ML development: by commonly cited estimates, it took up as much as 80% of a data scientist's time in the pre-LLM era. Deep learning and large language models have reduced this burden but not eliminated it, because structured-data ML (fraud detection, demand forecasting, risk scoring) still depends heavily on hand-crafted features. Feature stores (Feast, Tecton, Databricks Feature Store) industrialize feature reuse by letting models and teams share the same feature definitions instead of rebuilding them.
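The reuse idea behind a feature store can be sketched with a toy in-memory registry. This is a conceptual illustration only: `FeatureRegistry`, its methods, and the feature names are hypothetical and do not reflect the API of Feast, Tecton, or Databricks Feature Store, which also handle storage, versioning, and serving.

```python
from typing import Callable, Dict, List

# A feature is modeled here as a function from an entity record to a number.
FeatureFn = Callable[[dict], float]


class FeatureRegistry:
    """Hypothetical minimal registry: define a feature once, reuse it anywhere."""

    def __init__(self) -> None:
        self._features: Dict[str, FeatureFn] = {}

    def register(self, name: str, fn: FeatureFn) -> None:
        # Refuse duplicates so each name has exactly one shared definition.
        if name in self._features:
            raise ValueError(f"feature {name!r} is already registered")
        self._features[name] = fn

    def get_vector(self, entity: dict, names: List[str]) -> List[float]:
        # Compute the requested features, in the requested order.
        return [self._features[n](entity) for n in names]


def avg_amount(entity: dict) -> float:
    txns = entity["transactions"]
    return sum(txns) / len(txns) if txns else 0.0


registry = FeatureRegistry()
registry.register("txn_count", lambda e: float(len(e["transactions"])))
registry.register("avg_txn_amount", avg_amount)

# Illustrative customer record; two hypothetical models share one definition.
customer = {"transactions": [20.0, 35.0, 5.0]}
fraud_features = registry.get_vector(customer, ["txn_count", "avg_txn_amount"])
forecast_features = registry.get_vector(customer, ["txn_count"])
```

Both feature vectors draw on the same `txn_count` definition, so a fix or change to that feature reaches every model that uses it.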
Related terms
- MLOps — MLOps (Machine Learning Operations) is the set of practices that operationalize ML and AI models in production — covering CI/CD pipelines for model updates, automated testing, performance monitoring, data versioning, and rollback procedures.
- Fine-Tuning — Fine-tuning is the process of training an existing AI model on additional task-specific data so its weights adapt to a narrower domain.
- Data Lake — A data lake is a centralized repository that stores structured, semi-structured, and unstructured data in its raw format at any scale.
- Synthetic Data — Synthetic data is AI-generated data that statistically mimics a real dataset without containing actual personal records.