You Should Test Less

Stop running irrelevant tests
To be precise, when you make a change to code, you should only run tests that can be proven to be relevant. Running unrelated tests tells you nothing useful.
Why do we run irrelevant tests to begin with? Our testing practices derive from unit testing, where the ideal is to make tests so fast that you can rerun them all frequently. But that ideal broke down as test suites grew larger and slower. I have personally seen CI for PRs take up to 120 minutes to complete, and full end-to-end tests so slow they could only be run overnight.
How to fix
There's a solution that deserves to be better-known: test-impact analysis (TIA). With TIA, you use code and file dependencies to run only tests that are affected by changes in a PR.
TIA is not new. Variants of it are used in Google's TAP. Jest and Vitest support it. Microsoft itself coined the term and offers it in Azure DevOps. Thought leaders have talked about it.
This can be done in several ways:
Automatic analysis
Automatic TIA methods use static analysis or run-time information to figure out dependencies. They're less work to use, but don't handle non-code changes well; covering those requires specialized custom analysis, or cautiously running all tests. Luckily, changing things like database schemas usually means changes in associated code too. (Optionally, you can add known-safe files to an ignore list.)
File dependencies
File dependencies are fast and simple, but can "overtest": not every test in a file is necessarily affected by a change in a depended-on code file. This approach works best for codebases that don't import/export more than they need (unlike, say, Python __init__.py files that import everything under the sun).
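To make this concrete, here's a minimal sketch of file-level selection in plain Python, using only the standard library's ast module. The project layout, the test_*.py naming convention, and the git invocation in the usage comment are illustrative assumptions, not any particular tool's interface.

```python
# file_tia.py -- a minimal sketch of file-level test-impact analysis.
# Assumptions (not any particular tool's interface): the script runs from
# the repo root, tests live in files named test_*.py, and imports are plain
# `import pkg.mod` / `from pkg.mod import name` (no relative imports).
import ast
import sys
from pathlib import Path


def module_name(path: Path) -> str:
    """Map a repo-relative path like pkg/mod.py to the dotted name pkg.mod."""
    parts = path.with_suffix("").parts
    if parts[-1] == "__init__":
        parts = parts[:-1]
    return ".".join(parts)


def imported_modules(path: Path) -> set[str]:
    """Statically collect the dotted module names a file imports."""
    tree = ast.parse(path.read_text())
    found: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module)
    return found


def affected_tests(changed_files: list[str]) -> set[Path]:
    """Return test files that import a changed file, directly or transitively."""
    sources = list(Path(".").rglob("*.py"))
    deps = {path: imported_modules(path) for path in sources}
    by_name = {module_name(path): path for path in sources}

    # Start from the changed modules, then keep marking any file that
    # imports an already-affected module until nothing new is added.
    affected = {module_name(Path(f)) for f in changed_files}
    grew = True
    while grew:
        grew = False
        for path, imports in deps.items():
            name = module_name(path)
            if name not in affected and imports & affected:
                affected.add(name)
                grew = True

    return {by_name[m] for m in affected
            if m in by_name and by_name[m].name.startswith("test_")}


if __name__ == "__main__":
    # Usage: python file_tia.py $(git diff --name-only main -- '*.py')
    for test in sorted(affected_tests(sys.argv[1:])):
        print(test)
```

The transitive step is what keeps this conservative: a test is selected whether it imports a changed file directly or through any chain of imports.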
Program dependency graphs
Program dependency graphs are slower than file deps, but more precise. By analyzing data dependencies and control flow like a compiler, they can match tests to individual code statements. They won't overtest based on code, but they share the same static-analysis limitations as file deps.
Coverage
Coverage can map code dependencies that aren't amenable to static analysis. This comes with a downside, though: collecting coverage info requires a full test run to begin with. If you want to use TIA in CI, you need to run all tests beforehand, store the coverage data, and share it with the CI environments. (If you don't share coverage data, CI has to run the full test suite, defeating the whole purpose.) Coverage methods work well, at the cost of more complexity, harder CI integration, and sharing state.
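As a sketch of what the mapping step might look like, here's a reader for a coverage.py database recorded with per-test contexts (for example via pytest-cov's --cov-context=test on the base branch). The placeholder package name, the path handling, and the context-string format are assumptions on my part, not a documented contract.

```python
# coverage_tia.py -- a minimal sketch of coverage-based test selection.
# Assumes a .coverage database recorded on the base branch with per-test
# contexts, e.g.:  pytest --cov=myproject --cov-context=test
# ("myproject" is a placeholder package name).
import sys
from pathlib import Path

import coverage  # pip install coverage


def tests_touching(changed_files: list[str], data_file: str = ".coverage") -> set[str]:
    """Return the test IDs whose recorded coverage touches a changed file."""
    data = coverage.CoverageData(basename=data_file)
    data.read()

    changed = {Path(f).resolve() for f in changed_files}  # run from repo root
    selected: set[str] = set()
    for measured in data.measured_files():
        if Path(measured).resolve() not in changed:
            continue
        # contexts_by_lineno maps line numbers to the contexts (with
        # --cov-context=test, the pytest node IDs) that executed them.
        for contexts in data.contexts_by_lineno(measured).values():
            for ctx in contexts:
                if ctx:  # skip the default, empty context
                    selected.add(ctx.split("|")[0])  # drop any phase suffix
    return selected


if __name__ == "__main__":
    # Usage: python coverage_tia.py $(git diff --name-only main -- '*.py')
    # then feed the printed IDs back to pytest.
    for test_id in sorted(tests_touching(sys.argv[1:])):
        print(test_id)
```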
Manual specification
If you write out all file dependencies for builds and tests (like Google's TAP), you can use that for TIA. The major advantage is you can use TIA for non-code changes if they're declared as dependencies. The disadvantage is that it's lots of error-prone manual work.
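A toy version of the idea, with an entirely hypothetical dependency map; real systems keep these declarations next to each build target (e.g. Bazel BUILD files, as Google's TAP does) rather than in one dictionary:

```python
# manual_tia.py -- a toy version of TIA over hand-declared dependencies.
# Every file name below is hypothetical.
import sys

# Each test file lists everything it depends on, including non-code files
# such as database schemas and config -- the big advantage of manual
# declarations over purely static analysis.
DECLARED_DEPS = {
    "tests/test_orders.py": {"app/orders.py", "app/db.py", "schema/orders.sql"},
    "tests/test_billing.py": {"app/billing.py", "config/rates.yaml"},
}


def affected_tests(changed_files: list[str]) -> list[str]:
    """Select tests whose declared dependencies include any changed file."""
    changed = set(changed_files)
    return sorted(test for test, deps in DECLARED_DEPS.items() if changed & deps)


if __name__ == "__main__":
    # Usage: python manual_tia.py $(git diff --name-only main)
    print("\n".join(affected_tests(sys.argv[1:])))
```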
Complementary test speed-up methods
TIA is safe (when conservative), cheap, requires minimal changes, and combines well with other acceleration methods.
Test Suites
The traditional approach divides tests into suites. When updating Foo-related code, run only the Foo suite. While simple, this method is coarse and needs manual configuration. TIA can replace suites, though using both remains an option.
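For example, with pytest you'd tag tests with a marker and select them on the command line; the "foo" marker name and the test body below are just an illustration.

```python
# test_foo.py -- suite selection with pytest markers.
import pytest


@pytest.mark.foo
def test_foo_totals_are_non_negative():
    assert sum([1, 2, 3]) >= 0


# Run only the Foo suite with:   pytest -m foo
# Skip it with:                  pytest -m "not foo"
# (Register the marker under `markers =` in pytest.ini to silence warnings.)
```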
Parallelization
Parallelization is very effective, but has several preconditions. Tests must run safely in parallel, without inter-test dependencies. Shared resources (networks, databases) must handle concurrent access or be replicated. Manual parallelization is labor-intensive, while automated approaches require careful auditing to ensure there are no subtle heisenbugs. TIA helps here because you simply have fewer tests to parallelize.
Predictive test selection
Selecting tests based on their failure history is useful, but mostly when you're FAANG-scale and can't run all relevant tests for each PR. It doesn't eliminate the need to run all affected tests eventually.
This can be as simple as selecting the top-failing tests, or as sophisticated as building a machine-learning model that predicts relevant tests from code changes. ML models require expertise, historical test data, and flaky-test identification (so flaky failures aren't confused with genuine ones).
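The simple end of that spectrum fits in a few lines; the history format here is made up for illustration.

```python
# predictive_tia.py -- the "top failing tests" end of the spectrum.
from collections import Counter


def top_failing(history: list[tuple[str, bool]], n: int = 50) -> list[str]:
    """Rank tests by how often they failed in past runs and keep the top n.

    history holds (test_id, passed) pairs from previous CI runs.
    """
    failures = Counter(test for test, passed in history if not passed)
    return [test for test, _count in failures.most_common(n)]


history = [
    ("tests/test_parser.py::test_unicode", False),
    ("tests/test_parser.py::test_unicode", False),
    ("tests/test_api.py::test_timeout", False),
    ("tests/test_api.py::test_timeout", True),
]
print(top_failing(history, n=2))
# -> ['tests/test_parser.py::test_unicode', 'tests/test_api.py::test_timeout']
```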
LLMs
Since it’s 2025, you can ask ChatGPT to select relevant tests, but this is fuzzy. It probably misses relevant tests sometimes and runs irrelevant ones. That being said, I suspect LLM/ML techniques might do a decent job of mapping tests to changes in non-code files, but I'm not sure I'd trust it with the keys to the car just yet.
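If you want to experiment anyway, a rough sketch with the OpenAI Python client might look like the following; the model name, the prompt, and the idea of filtering the reply back against known test IDs are all my assumptions, and the result should be treated as advisory rather than authoritative.

```python
# llm_tia.py -- a fuzzy selection sketch using the OpenAI Python client.
# The model name and prompt are placeholders; treat the reply as advisory
# and keep a conservative fallback (e.g. the full suite nightly).
from openai import OpenAI  # pip install openai; needs OPENAI_API_KEY set


def llm_select_tests(diff: str, test_ids: list[str]) -> list[str]:
    client = OpenAI()
    prompt = (
        "Given this diff:\n" + diff + "\n\n"
        "Which of these tests are worth running? Reply with one test ID per line:\n"
        + "\n".join(test_ids)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    reply = response.choices[0].message.content or ""
    known = set(test_ids)
    # Keep only IDs we actually know about, in case the model hallucinates.
    return [line.strip() for line in reply.splitlines() if line.strip() in known]
```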
Try it out on your code!
If you’re tired of waiting on CI tests to finish, give test-impact analysis a try.
If you use Python (other languages coming soon!), you can download our beta tool spdr on getspdr.dev. If you just want to hear more about this topic, you can sign up for our list there, too.