Flaky Test Misdiagnosis

- A junior engineer misdiagnosed CI failures as flaky when shared database state caused intermittent test breakage. - A mentor introduced test isolation and a flakiness detector that captured DB snapshots before and after tests. - Capturing environment state around test runs can separate true flakiness from shared-state bugs that break parallel pipelines (x.com).

A flaky test is supposed to fail at random with no code or environment change; in this case, the failures were intermittent because tests were sharing database state. (martinfowler.com) Martin Fowler’s definition of a non-deterministic test is one that passes and fails “without any noticeable change” to code, tests, or environment. Google’s testing team lists invalid assumptions about test data, cleanup, and test order among the common causes of that behavior. (martinfowler.com) (testing.googleblog.com) Shared state is one of the most common ways teams mistake a broken test for a flaky one. GitLab’s testing guide calls this a “state leak,” where data created or modified by one test leaves the next test running against a dirty database instead of a clean one. (gitlab-docs-d6a9bb.gitlab.io) That distinction changes how engineers respond in continuous integration, the system that runs tests on every change. If a team labels a failure “flaky” and just reruns it, the underlying bug can stay in the pipeline and keep breaking parallel jobs that touch the same database. (learn.microsoft.com) (gitlab-docs-d6a9bb.gitlab.io) Modern tooling often reinforces that shortcut. Azure DevOps, for example, can mark a test as flaky when it fails once and passes on rerun in the same pipeline, which helps with triage but does not prove the root cause was randomness rather than leaked state. (learn.microsoft.com) The fix for shared-state failures is isolation: each test starts from a known baseline and cleans up after itself. Fowler argues that lack of isolation is a primary source of non-determinism, and GitLab’s guidance says the remedy is to reset data and environment to a pristine state after each test. (martinfowler.com) (gitlab-docs-d6a9bb.gitlab.io) Capturing the environment around each run makes that diagnosis faster. Google’s testing blog frames flakiness as a property of the full test-running stack, from the test code to the system under test and the hardware and services underneath it, so snapshots of database state before and after a test can show whether the environment changed under the test’s feet. (testing.googleblog.com) Teams still quarantine truly flaky tests to keep pipelines moving, but quarantine is supposed to be temporary. Fowler’s advice is to contain the damage first and then fix the cause before unreliable tests erode trust in the entire suite. (martinfowler.com) The practical lesson is narrower than “rerun failed tests” and broader than one database bug: when a test fails intermittently, record what the environment looked like before and after the run. That turns “flake” from a guess into something an engineer can prove or rule out. (testing.googleblog.com) (martinfowler.com)

Flaky Test Misdiagnosis

Get your own daily briefing