The Alan Turing Institute

The Future of Online Safety: A data-centric approach

Expert Analysis

Bertie Vidgen


Online threats are increasingly varied, challenging, and widespread – ranging from hate speech to terrorism, from disinformation to child abuse.

These disruptive, unwanted and often illegal activities present a clear risk to the safety and wellbeing of individuals, platforms and societies. Despite incoming regulation in many territories, increasing public support for action, and a growing economy of vendors who provide products, these problems remain fundamentally difficult to solve. All of the solutions currently available raise issues of performance, free speech, proportionality, privacy and technological capability.

There are no silver bullets when dealing with a complex problem like online safety, but Artificial Intelligence (AI) has real potential to drive a step change in detecting and responding to online threats. It can make law and policy enforcement more efficient and effective by supporting, replacing and advancing on human-led interventions.

In recent years, AI has drastically improved, and workflows for its use and deployment have been overhauled with the widespread use of large pre-trained models and transfer learning. However, these changes have not been fully leveraged in how AI is used for online safety. In particular, the critical role played by data, and as such the unique position and importance of data owners, has not been fully recognised. This article discusses these changes and their implications, and explores what this means for the future of the online safety sector.

How AI is used to tackle online threats

In the past ten years there has been a surge in both commercial and academic research in AI, leading to increasingly sophisticated models, entirely new techniques for training systems, and more robust evaluation processes. This has resulted in the release of jaw-dropping image generation models like DALL-E and Stable Diffusion, powerful speech transcription models such as Whisper, and Natural Language Understanding models such as GPT, BERT and Megatron. Most of these models can be used both to classify and to generate content, achieving human and super-human levels of performance.

AI can also augment, support and advance human responses to online threats. It can be used to automatically find threatening content and malign accounts, tackle and mitigate the harmful effects of their behaviour, and monitor them to better understand patterns, dynamics and motivations. There are four features that make AI particularly suitable for tackling online threats:

  1. Speed. AI can process content in milliseconds, which is near real-time. This means that users do not have to wait for the result, allowing the AI to be embedded into a range of automated monitoring, moderation and evaluation processes.
  2. Scale. AI can handle a huge volume of content. Even a small server can process millions of items every day, and capacity can be increased further by adding more hardware. This is particularly important when there are unexpected events which lead to sharp spikes in content volume. AI can handle this easily, whereas human-only approaches struggle to scale.
  3. Consistency. Production-ready AI models behave deterministically. Given the exact same inputs, they will return the same output. Although some AI is brittle, meaning it is very sensitive to minor changes in the input data, in principle this means that a model can be trusted to behave in the same way each time, and – crucially – the AI’s behaviour can be investigated post-hoc.
  4. Performance. Properly trained AI can be better than humans, particularly untrained humans, at making difficult decisions about content. This is not always the case and depends on the complexity and difficulty of the task.

These features mean that AI offers two main advantages over human-only approaches. First, it can increase the efficiency of online safety measures, saving money and time. For instance, a social media platform might identify extremist activity by having human moderators review reports from users. It could make this process more efficient by using AI to triage content and give an initial assessment of whether it is likely to be extremist. Second, AI can increase effectiveness by creating new ways of keeping people safe and secure, for example when it is embedded into the design of platforms’ products. For instance, many platforms use AI to shape users’ experiences by populating timelines and downranking potentially harmful content. Other novel applications include using AI to power bots that automatically generate counter speech, and to provide real-time warnings and nudges to stop people from engaging in unsafe behaviour.
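The triage idea described above can be sketched in a few lines of Python. This is a minimal illustration only: the `Report` class, the queue names and both thresholds are assumptions made for the example, and the scores are assumed to come from a classifier that is not shown.

```python
from dataclasses import dataclass


@dataclass
class Report:
    report_id: str
    text: str
    score: float  # assumed classifier probability that the content is extremist


def triage(reports, high_threshold=0.95, low_threshold=0.30):
    """Split user reports into three review queues based on the classifier score.

    The thresholds are illustrative; in practice they would be tuned on
    labelled data from the platform in question.
    """
    queues = {"priority_review": [], "standard_review": [], "low_priority": []}
    for report in reports:
        if report.score >= high_threshold:
            queues["priority_review"].append(report)
        elif report.score >= low_threshold:
            queues["standard_review"].append(report)
        else:
            queues["low_priority"].append(report)
    # Within each queue, moderators see the highest-scoring items first.
    for queue in queues.values():
        queue.sort(key=lambda report: report.score, reverse=True)
    return queues
```

Note that every report still reaches a human queue in this sketch; the AI only changes the order in which moderators see the content.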

Despite these recent improvements, AI is far from perfect, and any system will have weaknesses and flaws. AI struggles to handle context, nuance and intention, which are key limitations when dealing with online threats. Other concerns include the ethical and societal implications of using AI for automated decision-making, and the need for effective human governance at all levels. A related challenge is the environmental impact of training and deploying AI models, which can use huge amounts of energy. The suitability of using AI should be assessed in the context of the application, and it may ultimately be considered inappropriate or unsafe – meaning human-only approaches will still be preferred in some contexts.

AI workflows for tackling online threats

AI workflows have been transformed by the widespread use of transfer learning. Today, most practitioners do not train their own models from scratch but optimise and then deploy extremely large models (i.e. models with billions of parameters) which have been trained by big teams at well-funded organisations like OpenAI and DeepMind, and then open sourced. These models are often called 'foundation models', and are distinct from specific applied models, which we call 'classifiers'. Foundation models are trained through self-supervised learning tasks, such as masked language modelling, to develop an understanding of content. Large language models, for example, are trained over thousands of hours on huge datasets which comprise billions of entries, such as the Common Crawl Corpus. Originally developed for English, variants have now been trained for most major languages, and fully multilingual models, such as XLM-R, have also been introduced. Efforts are underway to improve coverage of languages from the Global South, which have historically been under-represented in the AI community and are “low resource” – meaning there is a limited volume of training data available when compared with more dominant languages such as English.

Through transfer learning, practitioners can optimise these foundation models to create a classifier for a specific use case, such as finding hate speech, identifying extremist groups, or mapping the activity of bot networks. This article focuses primarily on models for analysing text – but similar arguments apply to models for other applications for tackling online threats, such as image models.

Practitioners have a range of ways to improve model performance when training classifiers, of which three are particularly important. 

  1. Continued pre-training. Off-the-shelf models can be optimised by continuing their training. The original task (e.g. masked language modelling) is restarted, using either randomly sampled in-domain data or unlabelled task data. For instance, a hate speech classifier could be created by taking a pre-trained BERT model and showing it several million posts from social media communities that host large volumes of hateful material. This would give it a far better understanding of both social media data and toxic content.
  2. Fine tuning. Large models can be adapted to a specific task by retraining them on a small dataset of labelled examples, which adjusts the weights and parameters of the upper layers of the network. Fine tuning is the most common way in which large models are used, and has been shown to achieve state-of-the-art performance on a wide range of tasks, even with relatively few examples.
  3. Model prompting. In-context learning is where models learn a task by conditioning on either no examples (“zero shot”), or just a small number (“few shot”), and without optimising any parameters. This is an incredibly quick and easy way of creating classifiers, although it is often not suitable for the complex tasks involved in tackling online threats.
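To make the third technique more concrete, the sketch below assembles a few-shot classification prompt as plain text. The function name, the "Post:"/"Label:" layout and the example labels are all assumptions chosen for illustration; the resulting string would be sent to whichever large language model a team uses, and that call is not shown.

```python
def build_few_shot_prompt(task_description, examples, new_text):
    """Assemble a classification prompt for in-context learning.

    `examples` is a list of (text, label) pairs. An empty list produces a
    zero-shot prompt; a handful of pairs produces a few-shot prompt. No model
    parameters are changed: the examples only condition the model's output.
    """
    lines = [task_description, ""]
    for text, label in examples:
        lines.append(f"Post: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The prompt ends with an unlabelled post for the model to complete.
    lines.append(f"Post: {new_text}")
    lines.append("Label:")
    return "\n".join(lines)
```

For example, `build_few_shot_prompt("Label each post as 'abusive' or 'not abusive'.", [("have a nice day", "not abusive")], "some new post")` returns a prompt containing one labelled example followed by the unlabelled post.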

Practitioners can use all of these techniques (and others), combining them as needed. For instance, a team could take an off-the-shelf model and continue pre-training on a large corpus of messy and toxic social media data to create a new foundation model. They can then fine tune it on task-specific datasets to create new classifiers, such as tools for detecting extremist, hateful or illegal content.
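The division of labour between a foundation model and a classifier can be illustrated with a deliberately tiny, standard-library-only sketch. It is a toy under strong assumptions: `frozen_encoder` is a hand-written stand-in for a pre-trained model (real foundation models are large neural networks), the word lists and the four labelled examples are invented, and only a small logistic-regression "head" is trained on top of the frozen features. Real fine tuning would normally use a deep learning library and may update many more parameters.

```python
import math

# Hypothetical word lists standing in for knowledge that a real foundation
# model would have acquired during pre-training.
HOSTILE_WORDS = {"awful", "stupid", "idiot"}
FRIENDLY_WORDS = {"thanks", "great", "lovely"}


def frozen_encoder(text):
    """Toy stand-in for a frozen foundation model: text -> feature vector."""
    words = text.lower().split()
    if not words:
        return [0.0, 0.0]
    hostile = sum(word in HOSTILE_WORDS for word in words) / len(words)
    friendly = sum(word in FRIENDLY_WORDS for word in words) / len(words)
    return [hostile, friendly]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(examples, epochs=500, learning_rate=1.0):
    """Fit a logistic-regression head on frozen features (batch gradient descent)."""
    features = [frozen_encoder(text) for text, _ in examples]
    labels = [label for _, label in examples]
    weights, bias = [0.0, 0.0], 0.0
    n = len(examples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in zip(features, labels):
            error = sigmoid(weights[0] * x[0] + weights[1] * x[1] + bias) - y
            grad_w[0] += error * x[0]
            grad_w[1] += error * x[1]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict_proba(text, weights, bias):
    """Probability that `text` belongs to the positive (hostile) class."""
    x = frozen_encoder(text)
    return sigmoid(weights[0] * x[0] + weights[1] * x[1] + bias)


# A very small task-specific dataset: 1 = hostile, 0 = not hostile.
LABELLED_EXAMPLES = [
    ("you are awful and stupid", 1),
    ("what an awful idiot", 1),
    ("thanks that was great", 0),
    ("lovely photo thanks", 0),
]
```

Calling `train_head(LABELLED_EXAMPLES)` returns the weights and bias of the head, which `predict_proba` then applies to new text. The point of the sketch is the shape of the workflow: the expensive general-purpose component is reused as-is, and only a small task-specific component is fitted to labelled data.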

The widespread use of transfer learning has created a split between AI researchers who are creating foundation models and practitioners who are applying these models to create classifiers. Nearly all work in online safety and security is conducted by practitioners creating classifiers who are benefiting from the rising tide of increasingly powerful – and increasingly adaptable – foundation models. Apart from the largest and most innovative teams, it does not make sense to swim against this tide. Three factors contribute to making this the most viable approach for tackling online threats.

  1. Cost. The cost of training new foundation models is huge. GPT-3 is reported to have cost $12 million to train, and the open science model Bloom cost $7 million. These figures do not include the costs of the research teams, which typically have very high salaries. For a team to start from scratch and not use these models is to effectively throw away the millions of dollars already spent on their development. 
  2. Risk of obsolescence. Model architectures are constantly improving because of fierce competition amongst the big players, motivated as much by research glory as financial benefit. A small team could spend a small fortune to create their own AI model, only to find that within a year or two it is rendered obsolete by an open sourced model from big tech.
  3. Low switching costs. Off-the-shelf foundation models are now very easy to access through services such as Hugging Face and no-code solutions such as DataRobot. Generally, it is as easy to set up the code and workflow to evaluate many models as just one. In practice, therefore, teams can easily switch between foundation models and assess multiple models at once. They can also consider much smaller distilled models which are typically much faster to run.
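The point about low switching costs can be sketched as a single evaluation loop that treats every candidate model as an interchangeable function. Everything here is illustrative: `compare_models` is a hypothetical helper, and each candidate is assumed to be any callable that maps a text to a 0/1 prediction, whether it wraps a large model, a distilled model or a simple baseline.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 score for the positive class, computed from scratch."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def compare_models(models, texts, labels):
    """Evaluate every candidate on the same labelled test set.

    `models` maps a model name to a callable text -> 0/1 prediction.
    Returns (name, f1) pairs sorted from best to worst.
    """
    results = {
        name: f1_score(labels, [predict(text) for text in texts])
        for name, predict in models.items()
    }
    return sorted(results.items(), key=lambda item: item[1], reverse=True)
```

Adding another candidate means adding one more entry to the `models` dictionary; the test set and the metric stay fixed, which is what makes the comparison fair.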

Data Data Data

With transfer learning and widespread access to large models, the biggest challenge now facing practitioners is how to acquire, label and use the right data. In light of this, there has been resurgent interest in the role and curation of datasets as a crucial part of the AI development process, with leading experts calling for a shift from a ‘model-centric’ approach to a ‘data-centric’ approach to AI development. In many ways, this reflects a longstanding mantra in computer science: Garbage In will lead to Garbage Out. Numerous studies show that a shockingly small amount of good data can create a high performing AI classifier, whereas large quantities of low quality data only result in a very weak model.

In nearly all applications of AI for tackling online threats, there is no single way of determining “good” data. What counts as “good” data depends on the context and the task at hand: good data for you might not be good data for anyone else. And often, your data is the best data: it is the most relevant, it is in-domain, and it will best reflect the task that motivated you to create an AI classifier. For instance, if you run a platform and want to apply AI to detect users’ expressions of intent to self-harm, there is no better data to train it on than your data – i.e. data taken from the platform and then labelled in line with this task. Or, if you are a security agency that wants to identify expressions of support for a specific terrorist group, based on your analysts’ qualitative analyses, then your qualitative data is your best starting point as it reflects the type of content that you actually care about. There are two reasons for this.

In-domain data

Your data will be in-domain, which means that it is selected from the same pool of content that you will apply the classifier to, and therefore has similar features. In-domain data is important because even fairly small differences between settings can radically alter the performance of classifiers. Consider the differences between a gaming chat, replies to a tweet, and a Facebook post – or even just the differences between a Twitch livestream for a gaming influencer compared with the livestream for a fashion influencer. All of them could contain text, but the topics of the content will be very different, as well as the style of expression, use of unusual symbols such as emoji, demographics of the content creators, and the norms in using external links and shorthands. This is also why off-the-shelf classifiers that cannot be customised, such as static models provided by a third party vendor, may give reasonable performance but are very unlikely to be optimal.
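One rough way to get a feel for how far apart two settings are is to measure how much of the target domain's vocabulary never appears in the training data. The sketch below is a simple heuristic chosen for illustration, not a method from this article: the function name is hypothetical, tokenisation is a naive whitespace split, and vocabulary overlap captures only one of the differences listed above.

```python
def unseen_token_rate(training_texts, target_texts):
    """Share of target-domain tokens that never occur in the training texts.

    A high rate suggests the classifier will meet a lot of language it has
    not seen before. A low rate does not guarantee a good domain match.
    """
    vocabulary = {
        token for text in training_texts for token in text.lower().split()
    }
    target_tokens = [
        token for text in target_texts for token in text.lower().split()
    ]
    if not target_tokens:
        return 0.0
    unseen = sum(1 for token in target_tokens if token not in vocabulary)
    return unseen / len(target_tokens)
```

For instance, comparing chat from a gaming livestream with chat from a fashion livestream would be expected to give a much higher rate than comparing two gaming livestreams.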

Similarity of task

You can label the data in line with the exact task that you want the classifier to deliver. This is true whether or not the data is in-domain, and whether or not it is newly collected. Indeed, we have often found value in reannotating datasets provided by other people to reflect the categories that we care about. Stating that the task is important may sound obvious, or even a truism. But practitioners routinely use imperfect data that does not quite meet all of their specifications, because online safety has very few well-established taxonomies and categories, and the field is constantly changing. The advantage of labelling data from scratch is that you can specify the exact task you want the AI classifier to deliver. This means encoding the classifier’s expected decision boundary in the data through consistent labelling. There is no single best way of labelling data, but it is crucial to be aware of the limitations of different approaches. We have found that some “market leading” data labelling providers create datasets with serious errors and inconsistencies, and at Rewire we have our own team of trained data labellers. They typically perform far better and more consistently than crowdsourced workers, but are also more expensive and require more coordination.
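Labelling consistency can be checked before any model is trained. One widely used measure, offered here as an illustration rather than as the process used by the author, is Cohen's kappa, which compares the observed agreement between two annotators with the agreement expected by chance. The sketch assumes two annotators have labelled the same items in the same order.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    1.0 means perfect agreement; 0.0 means agreement no better than chance.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A low score on a sample of doubly labelled items is a signal that the guidelines, the task definition or the labelling workforce need attention before the dataset is used for training.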

In many cases, well-labelled in-domain data is not available, is available only in very small quantities, or needs a huge amount of processing to be usable for machine learning. To address the problem of not having enough good data, a range of data-centric techniques have been adapted specifically for online safety. In our work, we have used adversarial data generation through the Dynabench platform, and active learning to find the most relevant and informative cases when there are millions of unlabelled entries to select from but only a limited annotation budget. Other techniques include using data augmentation and synthetic generation techniques, leveraging AI to actually create new data to train future AI classifiers, and handcrafting challenging perturbations. All of these approaches can be used to gain more useful data points or maximise the utility of the data that practitioners already have.
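Active learning can take many forms. A common and simple strategy is uncertainty sampling, where the items sent for annotation are those the current classifier is least sure about. The sketch below assumes a binary classifier has already scored each unlabelled item with a probability; the function name and data layout are hypothetical, and this is not necessarily the strategy used in the work described above.

```python
def select_for_labelling(scored_items, budget):
    """Pick the `budget` items whose scores are closest to 0.5.

    `scored_items` is a list of (item_id, probability) pairs produced by the
    current classifier. Scores near 0.5 mark the cases the model finds hardest,
    which are often the most informative ones to label next.
    """
    ranked = sorted(scored_items, key=lambda item: abs(item[1] - 0.5))
    return [item_id for item_id, _ in ranked[:budget]]
```

In a full loop, the selected items would be labelled, added to the training set, the classifier retrained, and the remaining pool rescored before the next round.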

Finally, the shift towards data-centric AI has clear implications for the online safety and security economy. Increasingly, real value resides with the organisations that own the best data assets, whether they are platforms, vendors, civil society organisations or security agencies. For most tasks, high quality data is hard to acquire – requiring special relationships with platforms and a team of data analysts. Of course, emphasising the importance of data is not to trivialise the other significant challenges in building trustworthy AI classifiers, but simply to argue that your best chance of building such a classifier is to start with good data. Anyone without data assets, and a resilient data pipeline to ensure new data keeps coming in, will struggle to deliver best-in-class results.

The implications are clear: organisations should take a deep look at what data assets they have and their capacity to build AI expertise. Only then should they assess (1) what their expectations for their AI actually are; and (2) whether they want to bring in third party vendors – and, if so, in what ways. If good data is the real “secret sauce” behind powerful AI – and what counts as good depends on your application and context – you should think carefully about who can add real value.


Online threats present a fundamentally difficult problem and will not be solved by any single technological innovation.

But, given the huge scale and complexity of activity online, some form of automated technology is now essential. In 2022, the real question is not “should AI be used to tackle online threats?” but, instead, “how should it be used?” and, increasingly, “who should implement it?” The answers to these questions will determine whether we see high-performing AI being widely developed and used across the sector, keeping people safe by increasing the efficiency and effectiveness of online safety, or an under-utilisation of overpriced and under-performing AI in the future fight against online threats.

The views expressed in this article are those of the author, and do not necessarily represent the views of The Alan Turing Institute or any other organisation.

Citation information

Bertie Vidgen, "The Future of Online Safety: A data-centric approach," CETaS Expert Analysis (November 2022).