Which Tools for Which Problem?
The tools and methods needed to validate machine-learned systems vary as much as the tools and methods needed to create them.
The Size of the Training Data
Depending on the size of the training data, the requirements on hardware, energy supply, and software tools can be broadly grouped into three categories:
- models which can be trained on a single node with commodity hardware,
- models which require a “rack” or a specialized system with fast interconnects among mainboards,
- models which require hardware drawing more than 22 kW of electric power.
In terms of software, category 1 essentially requires only a Python and/or R software stack. Category 2 often additionally requires (a) NoSQL databases (such as Redis, MongoDB, or Neo4j) or a distributed file system (such as Ceph) and (b) ML-specific orchestration software such as Apache Spark. Category 3 requires substantial non-IT support systems, e.g. fall-back power generators and separate cooling solutions, which is why most users rely on cloud solutions for this category.
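This three-way grouping can be sketched as a simple triage helper; the numeric thresholds below are illustrative assumptions, not authoritative cut-offs:

```python
def infrastructure_category(dataset_gb: float, model_params_m: float) -> int:
    """Map a training workload to one of the three infrastructure
    categories described above.

    Thresholds are ASSUMED for illustration only.
    """
    if dataset_gb < 100 and model_params_m < 100:
        return 1  # single commodity node; a Python/R stack suffices
    if dataset_gb < 10_000:
        return 2  # rack with fast interconnects; NoSQL/Ceph plus Spark
    return 3      # beyond 22 kW; dedicated facility or cloud

print(infrastructure_category(10, 5))        # 1
print(infrastructure_category(500, 200))     # 2
print(infrastructure_category(50_000, 1_000))  # 3
```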
The Type of the Training Data
The training of large pre-trained models requires considerable resources, especially energy. According to NVIDIA, the base BERT language model with 110 million parameters can be trained in 3.3 days on 4 NVIDIA DGX-2H servers (containing 64 Tesla V100 GPUs in total), which corresponds to roughly 2 MWh of energy. It is therefore reasonable to use pre-trained models as a starting point for
- natural language processing (NLP),
- image processing with convolutional neural networks, and
- processing of language and images in context, using multi-modal models.
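The 2 MWh figure above can be sanity-checked with a back-of-the-envelope calculation. The average power draw per server used below is an assumed value, not a published figure (a DGX-2H is specified at roughly 12 kW peak; sustained draw during training is typically lower):

```python
# Sanity check of the ~2 MWh training-energy figure for base BERT.
servers = 4
training_days = 3.3
avg_power_kw_per_server = 6.3  # ASSUMPTION: average draw, well below peak

hours = training_days * 24  # 79.2 hours of wall-clock training time
energy_kwh = servers * hours * avg_power_kw_per_server
print(round(energy_kwh))  # 1996, i.e. roughly 2 MWh
```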
The Model Architecture
Many practical use cases require both
- stability in inference: If the input data does not change by more than a certain amount, then it can be guaranteed that the output does not change by more than a certain amount either, and
- stability in training: If the training data does not change by more than a certain amount, then it can be guaranteed that the quality criteria for the model do not change by more than a certain amount either.
Both stability questions require access to the model parameters and the architecture. The question of stability in training can only be answered if the training strategy is known as well.
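Stability in inference can at least be probed empirically even without a formal bound: sample perturbations of a fixed size and record the largest observed ratio of output change to input change. A minimal sketch (the linear model is illustrative, and the sampled estimate is a lower bound on the true local Lipschitz constant, not a guarantee):

```python
import numpy as np

def empirical_lipschitz(model, x, eps=0.01, n_trials=1000, rng=None):
    """Estimate max ||model(x+d) - model(x)|| / ||d|| over random
    perturbations d with ||d|| = eps.  This is a LOWER bound on the
    true local Lipschitz constant, not a certified guarantee."""
    rng = rng or np.random.default_rng(0)
    base = model(x)
    worst = 0.0
    for _ in range(n_trials):
        d = rng.normal(size=x.shape)
        d *= eps / np.linalg.norm(d)  # scale perturbation to length eps
        ratio = np.linalg.norm(model(x + d) - base) / np.linalg.norm(d)
        worst = max(worst, ratio)
    return worst

# For a linear "model", the true Lipschitz constant is the spectral
# norm of W, here 2.0; the estimate approaches it from below.
W = np.array([[2.0, 0.0], [0.0, 1.0]])
model = lambda x: W @ x
est = empirical_lipschitz(model, np.array([1.0, 1.0]))
print(est)
```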
Type and Amount of Randomness (Signal-to-Noise-Ratio)
Validation questions and approaches are fundamentally different depending on the “type” of randomness in the problem domain. Five main categories can be distinguished:
- The problem is completely deterministic, as in playing chess, playing Go, or proving theorems in pure mathematics.
- There is randomness, but the probabilities are known exactly, like in Texas Hold’em or other card games.
- The probabilities are not known, but are defined by “nature” and are thus in principle accessible through repeated experiments, like in physics (nuclear fusion) or chemistry and medicine (protein folding).
- The probabilities change over time due to changes in the ecosystem, like in ecological models or risk models for financial institutions.
- The probability distributions embedded within a model are not “objective” at all, but the system’s responses are actively exploited by adversaries, like in pricing models for financial assets and systems defending against cybercrime.
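For the third category, the practical consequence is that validation can rely on repeated experiments: a fixed but unknown probability can be estimated to any desired precision with a quantifiable error. A minimal sketch, in which a hidden probability stands in for "nature":

```python
import math
import random

def estimate_probability(experiment, n, seed=0):
    """Estimate an unknown but fixed success probability by repeating
    an experiment n times; returns the estimate and a ~95% normal-
    approximation confidence half-width."""
    random.seed(seed)
    successes = sum(experiment() for _ in range(n))
    p_hat = successes / n
    half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, half_width

# "Nature" with a hidden success probability of 0.3, unknown to the
# validator; repeated experiments recover it with quantified error.
hidden = lambda: random.random() < 0.3
p_hat, hw = estimate_probability(hidden, 10_000)
print(f"{p_hat:.3f} +/- {hw:.3f}")
```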
Frequency of Re-Calibrations and Model Revisions
The frequency of model updates determines the relationship between (i) independent validation and (ii) quality assurance performed as part of the training and re-training of the model. The more frequent the model updates, the more the independent validation itself needs to automate the computation and control of quality criteria. Broad categories are:
- A specific system (or drug) is approved once, and updates occur less than once a year. Comparatively large validation and testing efforts go into the approval. Typical examples are the approval process for drugs and medicines and the homologation of new types of cars.
- The system is re-tested and re-validated roughly every 1-2 years. Typical examples are the re-testing of individual cars (“Kfz-Zulassung”) and the validation of risk models in financial institutions.
- Many models are updated roughly every 2-3 weeks, corresponding to the typical length of a sprint in Scrum-based development.
- Some prediction models employ “online learning” and continuous integration, such that the model may be updated several times per day.
Severity of the Consequences of the Model’s Output
The consequences of a model’s output range from “life-and-death” decisions to “minor inconveniences in the context of entertainment”:
- direct consequences for health and well-being: medicine, criminal justice, nuclear power plants;
- potentially large consequences for health and well-being: autonomous vehicles, cyber security, privacy of personal data;
- potentially large consequences for companies: actuarial reserving in insurance companies, risk models in financial institutions;
- economic consequences: recommender systems, advertisements, customer relationship software;
- minor inconveniences in the context of entertainment: electronic games.
The frequency of re-calibration and model revision is often correlated with the severity of the consequences of a model’s output, such that independent validation ranges from (i) infrequent but in-depth validation of life-and-death decision models to (ii) frequent but more standardized validation of models with less severe consequences. The definition of quality criteria and the control of these criteria must be adequate for the severity of the consequences of the model’s output.