This publication is licensed under the terms of the Creative Commons Attribution License 4.0 which permits unrestricted use, provided the original authors and source are credited.

Introduction

One positive aspect of UK Defence procurement, despite many criticisms, is the consideration of a capability throughout its lifecycle, rather than just delivery of a product. 

This is often articulated through Defence Lines of Development (DLoDs), nine aspects that need to be considered throughout the development and sustainment of any military capability. 

This article considers how DLoDs apply to machine learning (ML) capabilities that may be deployed for defence and security. We evaluate each of the DLoDs for a notional ML capability, finding that they are equally applicable outside their traditional domain. We also find corresponding support in the relatively new field of Machine Learning Operations (MLOps).

The DLoDs cover training, equipment, personnel, infrastructure, concepts and doctrine, organisation, information and logistics, with interoperability as an overarching theme. For example, delivering a new tank is pointless without training the tank crew and maintainers; supplying the fuel, ammunition and long-distance transport to sustain its global deployment; and achieving the interoperability to work alongside mission partners in often hastily convened coalitions.

Machine Learning Operations (MLOps), taking inspiration from DevOps, refers to the culture, best practices and processes that bring together ML development, integration and deployment in a sustainable and scalable manner. The hidden technical debt of machine learning systems was highlighted by Google researchers back in 2015, in a frequently reproduced diagram whose caption notes:

“Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small box in the middle. The required surrounding infrastructure is vast and complex.”

MLOps helps increase the pace of experimentation, model development and model deployment, as well as enabling end-to-end quality assurance. MLOps components often include:

  • Data labelling – tools for humans to efficiently label large volumes of data.
  • Data version control – versioning the specific data used to train a model, promoting reproducibility and tracking model provenance.
  • Experiment tracking – methodically recording the specific model architecture, training data, hyper-parameters and performance metrics, again to promote reproducibility and track model provenance.
  • Model registry – for model lifecycle and governance.
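
As a brief illustration of the experiment tracking component, the following minimal sketch uses the open-source MLflow library (one option among many); the experiment name, parameters and metric values are purely illustrative.

```python
# Minimal sketch of experiment tracking with MLflow (illustrative values only).
# Assumes an MLflow tracking server or a local ./mlruns directory is available.
import mlflow

mlflow.set_experiment("tank-detector")  # hypothetical experiment name

with mlflow.start_run():
    # Record what went into this training run...
    mlflow.log_param("backbone", "resnet50")
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("training_data_version", "v1.3")  # ties back to data version control

    # ...and how it performed, so runs are reproducible and comparable.
    mlflow.log_metric("precision", 0.87)
    mlflow.log_metric("recall", 0.79)
```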

Once deployed, the performance of ML-containing capabilities will degrade over time, depending on aspects such as data drift (how far the properties of operational data diverge from those of the training data) and concept drift (how far the underlying concept being modelled has changed since training).

Consider an ML model to detect tanks from images – data drift will likely occur if the model is trained on images from spring and summer but run on images from winter and autumn, or imagery collected from a sensor with a different spatial resolution or spectral response. Similarly, concept drift will likely occur should the tank operator introduce a new variant or the tank crew add extra camouflage. This latter adversarial aspect makes defence applications of ML different from most other sectors, and provides an additional motivation for MLOps.

We now consider each of the DLoDs for a notional ML capability – tank detection from satellite imagery. The definitions quoted are taken from Knowledge in Defence.

DLoDs for ML Capabilities

Concepts and doctrine

A Concept is an expression of the capabilities that are likely to be used to accomplish an activity in the future. Doctrine is an expression of the principles by which military forces guide their actions and is a codification of how activity is conducted today. It is authoritative, but requires judgement in application.

The alignment of an ML capability with defence principles, ethics, values and objectives needs to be carefully considered before it is adopted. Is ML even an appropriate solution? Is ML any better than existing or more basic approaches (e.g. using heuristics)? Can we sustain an ML solution (e.g. obtain adequate labelled data)? Will the ML solution be trusted by users (and their customers), especially if we can’t fully explain its operation?

For tank detection, what are the advantages and disadvantages that ML has over other algorithmic or human analysis of data – is it purely scale? Why do we want to integrate ML and what added value do we hope it will bring? What are the consequent first and further order changes we anticipate should the capability be successful? What are the benefits of getting it right and the risks of getting it wrong? What countermeasures might we expect our adversary to employ?

Further, an ML-based tank detection capability is only of net benefit if it is accurate enough to meet its use case and if the effort of maintaining that accuracy is less than that of alternative approaches (e.g. using heuristics or human analysts).

We also need to consider the risk appetite and any associated policy – for example, the risks raised by a tank detector used to prioritise satellite images that will always be reviewed by a human analyst are very different from those of a tank detector used in the seeker of a missile without human confirmation. Who owns these risks, and what mitigations are available? What are the error risks of specific decisions and actions that may be taken by ML? Are false positives or false negatives the graver error in the specific context? What is the threshold at which deployment is worth the risks, even if human analysts will not be able to oversee every decision made by the ML model?

The Defence AI Strategy is a helpful start, and should be accompanied by specific user-centric and context-specific AI policies for defence and security applications. MLOps aspects include model lifecycle management to ensure ML models are documented, developed and maintained in accordance with applicable policies. This often includes the use of model cards which describe the capabilities and limitations of a particular model.
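
Model cards are typically captured as structured metadata held alongside the model itself. A minimal, hypothetical sketch of what such a card might record for a tank detector is shown below; the field names are illustrative rather than a formal standard.

```python
# A minimal, hypothetical model card captured as structured metadata.
# Field names and values are illustrative, not a formal standard.
tank_detector_card = {
    "model_name": "tank-detector",
    "version": "1.3.0",
    "intended_use": "Prioritising satellite imagery for review by a human analyst",
    "out_of_scope_use": "Autonomous targeting without human confirmation",
    "training_data": "Labelled electro-optical imagery, spring/summer, 0.5 m resolution",
    "known_limitations": [
        "Performance degrades on winter imagery (data drift)",
        "New tank variants and added camouflage are not represented (concept drift)",
    ],
    "evaluation_metrics": {"precision": 0.87, "recall": 0.79},
    "risk_owner": "model governance team",
}
```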

Organisation and personnel 

Organisation: relates to the operational and non-operational organisational relationships of people. It typically includes military force structures, MOD civilian organisational structures and Defence contractors providing support.

Personnel: the timely provision of sufficient, capable and motivated personnel to deliver Defence outputs, now and in the future.

Developing, deploying and sustaining ML capabilities requires a broad range of roles, including domain experts (e.g. military specialists), business leaders, legal experts, data scientists, data engineers, ML engineers and testing and evaluation professionals.

For our tank detection example, where should these people come from – civil service, military personnel, industry? Assuming it will be some combination of all three, how can we set the right incentives to promote the culture we would like to see? For those we wish to appoint in government, how can we recruit and retain them in the face of a skills shortage and generally uncompetitive salaries? With much ML progress being made outside of government, how can we take advantage of ML models trained by an industry supplier (perhaps supplied as a service) or the open-source community?

We also need to consider how user roles may change as a result of this new capability – someone who may previously have pored over satellite imagery to find tanks is now teamed with the output of an ML capability. Are users relegated to manually labelling vast amounts of training data for ML models? Is their previous role no longer required? What is the effect on morale and resilience should the ML capability fail? 

In some cases, new structures, policies and processes may be needed to ensure the model’s responsible deployment and maintenance. When new data is available, will organisations have a standard operating procedure and task owners whose job is to integrate data whilst safeguarding the compliance and safety case for the model’s deployment?

In terms of MLOps, these DLoDs are less about specific tools and more about developing a collaborative culture to rapidly develop, deploy and maintain ML capabilities.

Equipment

The provision of military platforms, systems and weapons, (expendable and non-expendable, including updates to legacy systems) needed to outfit / equip an individual, group or organisation.

For tank detection we will need to integrate with existing systems that handle the collection, processing and exploitation of data, and the dissemination of the resulting intelligence. Military data formats sometimes differ from those used by the broader ML community, and even a seemingly simple task such as geolocating an image can prove surprisingly difficult. Legacy systems, which were not designed for extensibility or automation, can be particularly challenging.
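
As a small illustration of the geolocation point, the sketch below uses the open-source rasterio library to read georeferencing metadata from a GeoTIFF; the file path is a placeholder, and many legacy or non-standard formats carry no such metadata at all.

```python
# Minimal sketch: reading georeferencing metadata from a GeoTIFF with rasterio.
# The file path is a placeholder; legacy formats may lack this metadata entirely.
import rasterio

with rasterio.open("example_scene.tif") as src:
    print(src.crs)        # coordinate reference system, e.g. EPSG:4326
    print(src.bounds)     # geographic extent of the image
    print(src.transform)  # mapping from pixel coordinates to map coordinates
    # Without a valid CRS and transform, detections cannot be geolocated.
```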

The contribution of MLOps here is the provision of standardised patterns and application programming interfaces (APIs) for tasks such as pipeline orchestration, model serving and model monitoring, promoting reuse across multiple ML projects and improving efficiency. There is an increasing number of end-to-end MLOps platforms; however, their adoption in some defence applications can be challenging because of their dependence on public cloud resources (not always accessible, depending on classification and connectivity constraints). There are also concerns around vendor lock-in, where the cost of changing vendors is so high that the government customer is stuck with the original vendor.
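
To make the idea of a standardised serving API concrete, the sketch below wraps a hypothetical detector behind a simple HTTP endpoint using FastAPI; the endpoint path, payload schema and run_inference helper are assumptions for illustration, not the API of any particular MLOps platform.

```python
# Minimal sketch of a standardised model-serving endpoint using FastAPI.
# The endpoint path, payload schema and run_inference() are hypothetical.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ImageRequest(BaseModel):
    image_uri: str  # location of the image to analyse

def run_inference(image_uri: str) -> list[dict]:
    # Placeholder for the real detector; returns bounding boxes and scores.
    return [{"bbox": [10, 20, 50, 60], "score": 0.92, "label": "tank"}]

@app.post("/v1/detect")
def detect(request: ImageRequest) -> dict:
    detections = run_inference(request.image_uri)
    return {"detections": detections, "model_version": "1.3.0"}
```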

Returning to our tank detection example, MLOps tools will be especially beneficial when we want to apply the same architecture/patterns/approach to other use cases – for example, aircraft or ship detection.

Infrastructure

The acquisition, development, management and disposal of all fixed, permanent buildings and structures, land, utilities and facility management services (both hard and soft facility management (FM)) in support of Defence capabilities. It includes estate development and structures that support military and civilian personnel.

ML capabilities often have substantial hardware resource requirements, in particular expensive Graphics Processing Units (GPUs), which are required to run the latest ML models. Depending on the sensitivity of the data and task, we may be able to deploy to public cloud environments or we may be constrained to run a large GPU cluster on premises. ML workloads come in bursts (e.g. no images need processing when the satellite is not over the target area), so sizing and operating such a cluster cost-effectively can be difficult.

To enable use of the most advanced models, our tank detector almost certainly requires substantial numbers of costly GPUs, particularly if we want to leverage the scale advantage of ML over human image analysts. Again, MLOps can help – in this case through the use of model serving platforms, so models can be loaded on demand, scaled up as workload increases, then scaled back down again to decrease costs (public cloud) or to permit other ML models to run (on-premises).

Training 

The provision of the means to practise, develop and validate, within constraints, the practical application of a common military doctrine to deliver a military capability.

Training applies both to ML models and to the human users of ML capabilities.

For our tank detector, we will likely require thousands of labelled examples of the specific tanks we wish to reliably detect, across different variants, configurations, environmental conditions (e.g. location around the world and season) and collection geometries (e.g. spatial resolution). In many cases, labelling can be outsourced to specialist providers, but for defence and security applications, data or use case sensitivity can make this challenging. We also need a principled approach to managing training data, enabling robust exploration of ML model hyper-parameters. MLOps provides data labelling, data version control and experiment tracking capabilities to manage this training process.
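
One lightweight way to tie a trained model back to its exact training data – whether through a dedicated data versioning tool or a simple convention – is to record a content hash of the labelled-data manifest alongside each experiment. A minimal, tool-agnostic sketch with invented manifest entries:

```python
# Minimal, tool-agnostic sketch: fingerprint the training data so each
# experiment can be tied back to the exact labelled examples used.
import hashlib
import json

def dataset_fingerprint(manifest: list[dict]) -> str:
    """Return a stable hash over the (image id, label) pairs in the training set."""
    canonical = json.dumps(sorted(manifest, key=lambda r: r["image_id"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical manifest entries: image identifiers and their labels.
manifest = [
    {"image_id": "scene_0001", "label": "tank"},
    {"image_id": "scene_0002", "label": "no_tank"},
]
print(dataset_fingerprint(manifest))  # store alongside the trained model
```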

Those using our tank detector also need to be trained, to understand the benefits and shortcomings of the ML capability. What different skills may personnel need and what is the right career path to develop them? What reversionary-mode skills do personnel need to ensure the resilience of operations when models are compromised and a manual mode of operation is needed? When can they expect it to operate well and (perhaps more importantly), when should they be more sceptical? How should they interpret any confidence or uncertainty metrics or visualisations? Such information should be recorded on a model card, managed via a model registry, and suitably exposed to trained users.

Information 

The provision of a coherent development of data, information and knowledge requirements for capabilities and all processes designed to gather and handle data, information and knowledge. Data is defined as raw facts, without inherent meaning, used by humans and systems. Information is defined as data placed in context. Knowledge is Information applied to a particular situation.

Once we have deployed an ML capability, we need to ensure the data it runs on is sufficiently similar to the data on which it was trained, in order to be confident of adequate performance.

For our tank detector, we can use MLOps tools to record and compare distributions of metadata and image statistics to assess data drift. This may include tracking changes in spatial resolution, time of day of collection (e.g. day vs night) or general brightness and darkness. Comparing these distributions to those from training time will allow us to alert both end users (to be potentially wary of outputs) and ML model developers (to prompt model retraining).
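
As a concrete illustration, the sketch below compares the brightness distribution of recently processed imagery against a reference sample from training time using a two-sample Kolmogorov–Smirnov test from scipy; the chosen statistic (mean brightness) and alert threshold are illustrative.

```python
# Minimal sketch of data drift detection on a single image statistic
# (mean brightness) using a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_brightness = rng.normal(loc=0.55, scale=0.10, size=1000)    # reference sample
operational_brightness = rng.normal(loc=0.35, scale=0.12, size=200)  # recent imagery

statistic, p_value = ks_2samp(training_brightness, operational_brightness)
if p_value < 0.01:  # illustrative threshold
    print("Possible data drift: alert users and consider retraining")
```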

Beyond data drift, we should also monitor target drift (the distribution of detections) – for example, in terms of both numbers and geographic locations. Should far fewer (or more) tanks be detected than previously, or should their positions change substantially over a short period of time, this is likely to warrant further investigation. 
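
Target drift can often be monitored with very simple statistics; the sketch below flags a large relative change in daily detection counts against a recent baseline, with an invented window and threshold.

```python
# Minimal sketch of target drift monitoring: flag a large change in the
# number of detections per day relative to a recent baseline.
def target_drift_alert(daily_counts: list[int], threshold: float = 0.5) -> bool:
    """Return True if the latest count differs from the recent mean by more than `threshold`."""
    baseline, latest = daily_counts[:-1], daily_counts[-1]
    mean = sum(baseline) / len(baseline)
    return abs(latest - mean) > threshold * mean

print(target_drift_alert([42, 39, 45, 41, 90]))  # True: detections roughly doubled
```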

The dependence on both code and data makes ML capabilities fundamentally different from traditional software (which is only dependent on code) and provides a key differentiator and motivator for MLOps.

Logistics 

Logistics is the science of planning and carrying out the operational movement and maintenance of forces. In its most comprehensive sense, it relates to the aspects of military operations which deal with: the design and development, acquisition, storage, transport, distribution, maintenance, evacuation and disposition of materiel; the transport of personnel; the acquisition, construction, maintenance, operation, and disposition of facilities; the acquisition or furnishing of services, medical and health service support.

Sustaining ML capabilities requires monitoring ML models throughout their deployed use, collecting:

  • Semi-structured or unstructured feedback from end-users (e.g. using a five-star scale with clear definitions).
  • Ground truth from end-users. For some ML applications (e.g. binary predictors) this is relatively simple, for others it is much more complex (e.g. machine translation or document summarisation).
  • Automated performance metrics – including time spent on inference, number of predictions made and traditional ML performance metrics (which require ground truth).

We consider these under this DLoD since they are effectively a consumable resource that needs replenishing if the capability is to remain effective.

For our tank detector, we’ll want to collect confirmation of detected tanks. This will allow us to calculate metrics such as precision – the fraction of positive detections that are correct. We also need to consider how we might seek out false negatives (tanks that should have been detected but weren’t), a more difficult proposition. This will allow us to calculate metrics such as recall – the fraction of existing tanks that were successfully detected. We can also measure and assess other important performance metrics, including latency (how much time passes from the image being collected to automated detections being available) and resource utilisation (e.g. the fraction of our expensive GPUs we are actually using).
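
For concreteness, a small worked sketch of those two metrics using invented feedback numbers:

```python
# Worked sketch of precision and recall from confirmed feedback (invented numbers).
true_positives = 85    # detections confirmed by analysts to be tanks
false_positives = 15   # detections confirmed not to be tanks
false_negatives = 20   # tanks found by other means that the model missed

precision = true_positives / (true_positives + false_positives)  # 0.85
recall = true_positives / (true_positives + false_negatives)     # ~0.81

print(f"precision={precision:.2f}, recall={recall:.2f}")
```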

In MLOps terms, we refer to much of the above as observability, analogous to observability in DevOps, providing critical insight into ML capability operation and health. Within an overarching AI policy, we can inform (and, if necessary, warn) end-users, data scientists and ML oversight staff about near real-time model performance. Using ground truth, we can even implement automated model retraining pipelines, either on a periodic basis or once performance degrades. Automated retraining is cited by both Google and Microsoft as a sign of higher levels of MLOps maturity.
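
A retraining trigger can be as simple as a periodic check of recent performance against a policy threshold. In the sketch below, trigger_retraining_pipeline is a stand-in for whatever orchestration tooling is actually used, and the threshold is illustrative.

```python
# Minimal sketch of a performance-based retraining trigger.
# trigger_retraining_pipeline() is a stand-in for real orchestration tooling.
def trigger_retraining_pipeline() -> None:
    print("Retraining pipeline submitted")

def check_and_retrain(recent_precision: float, policy_threshold: float = 0.80) -> None:
    if recent_precision < policy_threshold:
        # Performance has degraded below the agreed threshold: retrain and notify.
        trigger_retraining_pipeline()
    else:
        print("Model performance within policy; no action taken")

check_and_retrain(recent_precision=0.74)
```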

Interoperability 

The ability of UK Forces and, when appropriate, forces of partner and other nations to train, exercise and operate effectively together in the execution of assigned missions and tasks. In the context of DLoD, Interoperability also covers interaction between Services, UK Defence capabilities, Other Government Departments and the civil aspects of interoperability, including compatibility with Civil Regulations. Interoperability is used in the literal sense and is not a compromise lying somewhere between integration and de-confliction.

Traditional examples of interoperability include the compatibility of fuel between different land vehicles, of physical pallet sizes between different transport aircraft, and of computer systems to share intelligence. For ML capabilities, how can we use ML models from across industry and academia, as well as from other branches of the UK military and partner nations?

For our tank detector, we may want to import an ML model from industry or academia in a standardised format or consume it as a service with a standardised API. We may further want to fine-tune that model, combining the advantages of a foundation model trained on vast amounts of openly available data with necessarily limited amounts of labelled classified data.
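
As one example of a standardised interchange format, the sketch below loads a model exported to ONNX and runs it with onnxruntime; the file name, input name and tensor shape are placeholders that depend entirely on how the model was exported.

```python
# Minimal sketch: running an imported model in the ONNX interchange format.
# The file name, input shape and preprocessing are placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("tank_detector.onnx")
input_name = session.get_inputs()[0].name

# A dummy image batch; real preprocessing depends on the exporting framework.
dummy_image = np.zeros((1, 3, 512, 512), dtype=np.float32)
outputs = session.run(None, {input_name: dummy_image})
print(len(outputs))  # number of output tensors, e.g. boxes, scores, labels
```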

MLOps can contribute definitions of standards for ML models or APIs, as well as patterns for fine-tuning foundation models in conjunction with experiment tracking and other capabilities already mentioned for model lifecycle management.

Summary 

Having reviewed each of the DLoDs, we’ve found them to be relevant to delivering our notional tank detection ML capability. We’ve also seen how MLOps, both as a culture and as a family of tools, can be used to address these considerations. However, while using an MLOps approach can contribute to addressing the DLoDs, it will never fully solve them all. We also chose a somewhat idealised use case – those at the operational or tactical level will encounter degraded networks (making model deployment and updates challenging), constrained infrastructure (making running complex ML models challenging) and distributed sensors (making automated retraining challenging).

As with traditional defence products, it is all too easy to focus on the shiny ship, tank, plane or indeed ML model. We urge those deploying ML into defence and security applications to treat them as capabilities rather than products, to consider the full range of DLoDs, and to recognise the potential organisational benefit that the adoption of MLOps could offer.

The views expressed in this article are those of the authors, and do not necessarily represent the views of The Alan Turing Institute or any other organisation.

Citation information

Tom S, James S and Anna Knack, "One Does Not Simply Buy AI: Defence lines of development and machine learning – a case for MLOps," CETaS Expert Analysis (April 2024).