Change is allowed. Silent change is not. Your first principle is: Schema version is part of the data identifier. events_v2.parquet is a different entity than events_v1.parquet . Never mutate; deprecate.
You enforce quality at the point of creation or ingestion. If a record doesn’t meet the first principles of your domain (e.g., timestamp cannot be in the future; customer_id must match a regex), it is rejected immediately. The rule: Do not allow a known violation to enter your persistent storage. Ever. 2. The "Nullable Integer" Paradox Let’s look at a classic first-principles failure: Nulls in numeric fields. ab initio data quality
Ab initio (Latin for "from the beginning") means starting from first principles. In a quantum simulation, you don't patch errors later—you define the laws of physics upfront. If your initial conditions are wrong, the simulation is worthless. Change is allowed
Stop polishing bad data. Start building it right from the first principle. events_v2
Replace NULL with explicit semantics. Use -999 for "offline," -9999 for "out of range," or better—split the column into value and value_metadata_flag . 3. The Referential Integrity Illusion Modern data lakes love "schema on read." This is the enemy of ab initio . You are essentially saying, “Let’s store the garbage, and we’ll figure out what kind of garbage it is later.”