As the Operational Design Domain expands, traditional tests fail: SOTIF in automotive practice

Imagine the following scenario: an AI-based perception system passes all gate reviews with flying colors. Simulations run stably, and track testing shows no anomalies. Then, after a few weeks in the field, three incidents occur under the same conditions: low sun, a wet road surface, and a specific reflection pattern on the road. It is a scenario that the validation suite simply did not anticipate.

This is not a bug in the traditional sense. The code is correct; the model behaves according to its specification. It is precisely the phenomenon for which ISO 21448, the standard for “Safety of the Intended Functionality (SOTIF),” was written: functional insufficiencies without a classic fault, triggered by operating conditions that no one explicitly ruled out.

The ODD is growing. The depth of validation does not automatically grow with it.

AI-based vehicle functions are moving targets. Every new sensor generation, every new market, every new vehicle variant expands the Operational Design Domain (ODD): the totality of all operating conditions under which the system is supposed to function safely. This expansion is not a one-time project, but an ongoing process driven by new sensor generations that arrive every 18 to 24 months.

The fundamental problem is that classic code coverage metrics such as statement coverage, branch coverage, and MC/DC are designed for algorithmic code. A neural network encodes its behavior in weights, not in branches in the conventional sense. Even complete coverage of the wrapper code therefore says nothing about the actual model behavior.
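The gap between wrapper coverage and model behavior can be illustrated with a minimal sketch. Everything in it is hypothetical: `detect_obstacle` stands in for wrapper code, and `toy_model` is a hand-written placeholder for a trained network, not a real perception model.

```python
from typing import Callable


def detect_obstacle(model: Callable[[dict], float], frame: dict,
                    threshold: float = 0.5) -> bool:
    """Hypothetical wrapper code: one branch on the model's confidence score."""
    score = model(frame)
    if score >= threshold:
        return True
    return False


def toy_model(frame: dict) -> float:
    """Placeholder for a trained network: confidence collapses under glare."""
    return 0.2 if frame.get("glare") else 0.9


# Two tests with stubbed scores exercise both outcomes of the branch,
# so the wrapper reaches full branch coverage.
assert detect_obstacle(lambda f: 0.9, {}) is True
assert detect_obstacle(lambda f: 0.1, {}) is False

# Neither test says anything about how the model scores a real input class.
obstacle_in_glare = {"obstacle": True, "glare": True}
missed = not detect_obstacle(toy_model, obstacle_in_glare)
print("obstacle missed under glare:", missed)
```

The coverage report for the wrapper would be green in this sketch, while the obstacle under glare is still missed. That is the kind of finding only a test over input classes can surface.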

What actually needs to be tested are the so-called triggering conditions: those input classes that can trigger safety-critical malfunctions in the machine learning model. Triggering conditions multiply combinatorially. Four weather conditions, five lighting situations, and three reflection patterns already yield 4 × 5 × 3 = 60 basic variations, and that is a highly simplified model. Without a reusable suite of scenarios, every new validation round starts from scratch, not because the team is doing a poor job, but because no structured tool is available that translates ODD changes into evaluable scenarios.
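The arithmetic behind the 60 basic variations can be reproduced in a few lines. This is a minimal sketch; the concrete labels for weather, lighting, and reflection are invented for illustration and do not come from any standard or from ISO 21448.

```python
from itertools import product

# Hypothetical labels; only the counts (4, 5, 3) come from the text.
WEATHER = ["dry", "rain", "snow", "fog"]
LIGHTING = ["daylight", "low_sun", "dusk", "night", "tunnel"]
REFLECTION = ["none", "wet_sheen", "specular_glare"]


def enumerate_variations(*dimensions):
    """Return every combination of one value per dimension."""
    return list(product(*dimensions))


variations = enumerate_variations(WEATHER, LIGHTING, REFLECTION)
print(len(variations))  # 60

# One additional dimension with three values triples the count.
ROAD_TYPE = ["urban", "rural", "highway"]
print(len(enumerate_variations(WEATHER, LIGHTING, REFLECTION, ROAD_TYPE)))  # 180
```

Each added dimension multiplies the total, which is why exhaustive enumeration stops being practical quickly and a maintained, prioritized suite becomes necessary.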

What Late Findings Really Cost Here

The “Rule of Ten in Quality Assurance” is well known in the embedded environment: a defect discovered only in the field costs many times more than one detected during the design phase, with the rule of thumb putting the increase at roughly a factor of ten per phase. In projects with AI-based vehicle functions, this factor is particularly noticeable because scenario gaps are rarely isolated individual problems.

A late finding shortly before the gate review forces retests. If the scenario suite is not maintained in a modular fashion, these retests start from scratch, with the corresponding effort and time pressure. Delayed production start dates have a measurable impact on program profitability and OEM confidence.
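What "maintained in a modular fashion" can mean in practice is sketched below: if each scenario carries the conditions it exercises, a late finding maps to a targeted retest list instead of a full rerun. The `Scenario` structure, the IDs, and the condition labels are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical suite entry: an ID plus the ODD conditions it exercises."""
    scenario_id: str
    conditions: frozenset


SUITE = [
    Scenario("S001", frozenset({"dry", "daylight"})),
    Scenario("S002", frozenset({"rain", "low_sun", "wet_sheen"})),
    Scenario("S003", frozenset({"rain", "night"})),
    Scenario("S004", frozenset({"dry", "low_sun"})),
]


def select_retests(suite, affected_conditions):
    """Return IDs of scenarios that share at least one affected condition."""
    affected = frozenset(affected_conditions)
    return [s.scenario_id for s in suite if s.conditions & affected]


# A late finding implicates low sun and a wet-road reflection.
retests = select_retests(SUITE, {"low_sun", "wet_sheen"})
print(retests)  # ['S002', 'S004']
```

In this sketch two of the four scenarios are rerun. Without the condition tags, nothing short of the full suite would give the same confidence.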

Added to this is the pressure from type approval: the SOTIF argumentation must feed seamlessly into the safety case. If this link is missing, audit findings arise in the gaps between safety FMEA, TARA, and SOTIF analysis. After series production begins, another risk looms: those who fail to establish drift monitoring are operating their safety case with an unspoken expiration date. UN R155 and UN R156 have been binding for all newly registered vehicles in the EU since July 2024. From August 2, 2027, the high-risk requirements of the EU AI Act for AI used as a safety component take effect, including obligations for post-market monitoring; the Act's highest penalty tier reaches up to 35 million euros or 7 percent of global annual turnover.
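Drift monitoring can start with something as simple as comparing how often each ODD condition appeared in validation with how often it appears in the field. The sketch below uses the population stability index (PSI), a common distribution-shift measure; the choice of PSI, the counts, and the alert threshold of 0.25 (a widely used rule of thumb, not a regulatory value) are assumptions for illustration.

```python
import math


def population_stability_index(expected, observed, eps=1e-6):
    """PSI between two categorical count dicts (e.g. validation vs. field)."""
    categories = set(expected) | set(observed)
    exp_total = sum(expected.values())
    obs_total = sum(observed.values())
    psi = 0.0
    for category in categories:
        # Clamp shares to eps so unseen categories do not cause log(0).
        e = max(expected.get(category, 0) / exp_total, eps)
        o = max(observed.get(category, 0) / obs_total, eps)
        psi += (o - e) * math.log(o / e)
    return psi


# Hypothetical lighting-condition counts.
validation_counts = {"daylight": 700, "dusk": 200, "low_sun": 100}
field_counts = {"daylight": 500, "dusk": 200, "low_sun": 300}

psi = population_stability_index(validation_counts, field_counts)
ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb value
print(f"PSI = {psi:.3f}, drift alert: {psi > ALERT_THRESHOLD}")
```

For these made-up counts the index comes out at roughly 0.29, driven almost entirely by the tripled share of low-sun frames. A real monitor would track several ODD attributes and model outputs, but the principle of comparing field data against the validated distribution stays the same.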

It’s Not About More Tests, but About the Right Pyramid

The answer to a growing ODD is not complete coverage of all conceivable scenarios; that would be neither economically nor technically feasible. The answer is a structured validation pyramid in which each level addresses what the others cannot.

Simulation enables massive variation, edge-case generation, and reproducibility.
Software-in-the-Loop (SIL) demonstrates model behavior under realistic software integration.
Hardware-in-the-Loop (HIL) tests under real-time and hardware conditions.
Track testing provides ground truth for field release.

The ODD model itself should not be a static document, but a living engineering artifact that reflects sensor generation changes, market expansions, and release cycles. Reusable scenario suites based on standards such as ASAM OpenSCENARIO provide the foundation so that new variants do not have to start from scratch every time.
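Treating the ODD as a versioned artifact makes it possible to compute what an expansion actually adds. The following sketch compares two ODD versions held as plain dictionaries and lists only the new combinations. The dictionary layout is an assumption for this example; it is not the ASAM OpenSCENARIO format and uses no OpenSCENARIO tooling.

```python
from itertools import product


def odd_combinations(odd):
    """Expand an ODD dict {dimension: [values]} into a set of combinations."""
    keys = sorted(odd)
    return {
        tuple(zip(keys, values))
        for values in product(*(odd[k] for k in keys))
    }


# Hypothetical ODD versions: v2 adds snow for a new market.
odd_v1 = {"weather": ["dry", "rain"], "lighting": ["daylight", "night"]}
odd_v2 = {"weather": ["dry", "rain", "snow"], "lighting": ["daylight", "night"]}

new_combinations = odd_combinations(odd_v2) - odd_combinations(odd_v1)
for combo in sorted(new_combinations):
    print(dict(combo))
print(len(new_combinations))  # 2
```

Only the two snow combinations need new scenarios here; the four existing ones can be reused from the previous release. Note that in this simple form, adding an entirely new dimension marks every combination as new, so a production tool would need a more careful notion of equivalence.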

Even a well-built validation pyramid does not protect against every edge case. But it makes the residual risk assessable, and that is exactly what UN R155, UN R156, and the EU AI Act require.


Sources

ISO 21448:2022. Safety of the Intended Functionality (SOTIF). International Organization for Standardization.
National Highway Traffic Safety Administration (NHTSA). (2024). Recalls by System – Calendar Year 2024. U.S. Department of Transportation.
United Nations Economic Commission for Europe. (2021). UN Regulation No. 155: Cyber Security and Cyber Security Management System.
European Parliament and Council of the EU. (2024). Regulation (EU) 2024/1689 (AI Act).
ASAM e.V. ASAM OpenSCENARIO.
Pfeifer, T. & Schmitt, R. (2021). Masing Handbook of Quality Management (7th ed.). Hanser. (Rule of Ten for Error Costs)
