Refining input guardrails for safer LLM applications
How chain-of-thought prompting and fine-tuned alignment help improve moderation accuracy and LLM safety.
The advent of large language models (LLMs) has opened new doors in natural language processing and understanding use cases. Their powerful capabilities of instruction-following, reasoning and agentic AI behavior, makes them prime candidates to rapidly integrate into user-facing applications. However, the invention of LLMs has also introduced novel vulnerabilities, raising questions of reliability that, if not addressed adequately, can potentially risk trust in LLM-powered applications. These include concerns about ethical use, propagating misinformation, biases and hallucination. One major shortcoming of LLMs is its vulnerability to adversarial attacks, which can lead to malicious or undesirable outputs.
At Capital One, the Enterprise AI team is committed to advancing cutting-edge AI research across a wide range of generative AI applications, with a strong emphasis on safe and responsible AI. As we integrate state-of-the-art AI/ML solutions into our products and services, we also focus on developing robust LLM guardrails to ensure these technologies meet—and exceed—Capital One’s safety standards. This blog post highlights one such initiative, namely an overview of a recent paper, Refining input guardrails: enhancing LLM-as-a-judge efficiency through chain-of-thought fine-tuning and alignment.
It was published in January 2025 to the workshop of Preventing and Detecting LLM Misinformation (PDLM) at the Association for the Advancement of Artificial Intelligence (AAAI) 2025 conference, and won the Outstanding Paper Award. This work introduces methods for safeguarding LLM-powered conversational assistants against adversarial attacks under constraints faced in industry applications. It is the result of the research conducted by Capital One’s AI Foundations team, in collaboration with Huy Nghiem who was a data science intern at Capital One in summer 2024.
What are LLM guardrails in conversational agents?
LLMs customarily undergo different stages of training to first infuse them with knowledge about language and various subjects, followed by aligning them for specific tasks. The post-training stage is mainly conducted to improve the quality of their outputs to comply with certain safety and usefulness guidelines. LLMs are exposed to a variety of data, some of which may contain harmful language or information. Post-training is designed to reduce the risk of generating biased, unsafe or misleading outputs. For the majority of user-facing applications, the adoption of a general-purpose LLM without adequate post-training is practically infeasible, since outputs violating safety measures may incur irreparable damage to the product and the reputation of its owners.
This issue is further complicated by malicious users actively attempting to manipulate LLMs and LLM-powered applications to generate outputs that would benefit their purposes. There is a wide range of adversarial attacks on LLMs that motivate the need for strong input moderation guardrails in these applications. The primary job of these guardrail components is to prevent such harmful interactions from reaching the main conversation-driving LLM (or the “brain” of the assistant), thereby drastically lowering the chances of unwanted responses by ensuring the AI assistant ultimately interacts with safer inputs.
In this work, we focused on two main types of malicious inputs. The first is queries directly asking for fraudulent information or assistance with illegal activities. The second consists of “jailbreak” and “prompt injection” attacks. Jailbreak prompts target the LLM to retrieve responses that violate safety and security policies by instructing it to ignore these measures. Prompt injection, on the other hand, attacks disguise malicious instructions as benign inputs through overriding the original instructions on the prompt using special inputs.
Input moderation guardrails: proxy defense approach
As pointed out in the previous section, ideally, the main conversation-driving LLM used in AI assistant applications are post-trained such that they would detect any adversarial attacks and refuse to supply a response in violation of the safety protocols they were aligned on. These alignment techniques usually leverage reinforcement learning with human feedback (RLHF), in which a reward model helps align the output from the LLM with human-preferred directions.
However, in reality, these LLMs are still susceptible to sophisticated attacks, the landscape of which is constantly evolving. Therefore, we must also focus on another approach to mitigating risk, referred to as “proxy defense,” which implements an additional component acting as a firewall to filter out unsafe user utterances. Figure 1 demonstrates an example interface of an input moderation guardrail used as proxy defense, where the input guardrail’s safety verdict accompanied by an explanation is passed to the conversational agent in charge of deciding how to best synthesize a response to the query.
Figure 1: An example interface of the Input Moderation Guardrail as proxy defense.
LLM-as-a-Judge input moderation guardrails
The input proxy defense can theoretically be any classification model that would assign a safety probability score to input user interactions. Current implementations either leverage light-weight language models, such as bidirectional encoder representations from transformers (BERT)-based models as the backbone for the classifier, or take advantage of the reasoning capabilities of LLMs. The latter approach is referred to as LLM-as-a-Judge, where an LLM is instructed to identify incidents of guardrail policy violations. Due to its versatility compared to the BERT-based classifiers, this study focused on the LLM-as-a-Judge approach.
The judge LLM is prompted with instructions to determine whether inputs are safe or unsafe, based on the provided policy violation definitions. An LLM input guardrail can target multiple rail violations in addition to the malicious/jailbreak category, thereby functioning as a multiclass classifier, for example those outlined by OpenAI usage policies.
Any LLM-as-a-Judge classifier relies on the LLM’s pre-trained knowledge, malicious policy definitions as well as prompt engineering techniques. However, given the dynamic nature of malicious attacks against LLMs resulting in a vast array of jailbreak and prompt injection vulnerabilities, open-source LLMs need to undergo additional post-training steps to better detect these types of attacks. Our prior experiments with different publicly available LLMs have confirmed the gap in their performance when it comes to jailbreaks and prompt injections, even in many-shot settings where many examples of such vulnerabilities are presented to the LLM in the input prompt.
Our two primary research questions are:
-
Can we improve the reasoning capabilities of LLMs to correctly classify whether user inputs contain unsafe/adversarial content?
-
Can we improve the ability of LLMs to frame their moderation responses in the requested, parsable format?
More importantly, we set out to explore these improvements in small training sets/few training epochs regimes, compared to other studies. While the training and evaluation data sets do not provide comprehensive coverage of all attack types, our motivation is to primarily assess potential accuracy lifts. We focused on four medium-sized open-source LLMs from Mistral and Llama families, namely:
-
Mistral 7B Instruct v2
-
Mixtral 8x7B Instruct v1
-
Llama2 13B Chat
-
Llama3 8B Instruct
Chain-of-thought prompting
In this study, we leveraged chain-of-thought (CoT) prompting to facilitate the reasoning capabilities of the judge LLM, thereby enhancing its classification performance. CoT prompting involves instructing the LLM to first generate a logical and concise explanation of its thought process and leverage this information to arrive at a final verdict. Moreover, CoT explanations provide additional guidance in scenarios where the downstream agent’s performance depends on the quality of the explanation for a more nuanced response.
Figure 2 displays our initial evaluations using CoT prompting. The F1 score, recall, and false positive rate (FPR) values are used to gauge the classification power of each LLM (first research question), while the invalid response ratio, which is the percentage of non-parsable LLM responses, measures the ability of LLM’s to follow the output format (second research question).
For three of the four LLMs (with the exception of Mixtral 8x7B), CoT prompting resulted in an increase or at least a match in F1 score and recall, compared to prompts without the CoT instruction. Similarly, we observed a reduction in FPR values for all LLMs except for Mistral 7B with the CoT instruction. CoT prompting also seems to dramatically reduce Invalid Response Ratio for all LLMs with the exception of Llama2 13B. These results provided additional motivation for us to leverage CoT prompting. However, given that the highest F1 scores achieved on the base LLMs were still well below 80%, the need for additional fine-tuning of the judge LLMs was further validated.
Figure 2. F1, Recall, false positive rate (FPR), and invalid prediction ratio based on evaluating each of the four base LLMs using the prompt including and excluding the CoT instruction, respectively.
Fine-tuning and alignment of LLM-as-a-Judge input guardrails
To improve performance, we explored three fine-tuning LLM techniques in conjunction with chain-of-thought (CoT) prompting: supervised fine-tuning (SFT), direct preference optimization (DPO) and Kahneman-Tversky optimization (KTO).
-
Supervised fine-tuning (SFT) which encodes knowledge on various types of malicious and adversarial queries by updating the model’s weights using high-quality and desired LLM outputs.
-
Direct preference optimization (DPO) is an alignment technique, but different from RLHF in that it does not require fitting a reward model first to encode preferences. It is also more computationally stable than RLHF, requiring fewer data points compared to common reinforcement learning algorithms like proximal policy optimization (PPO).
-
Kahneman-Tversky optimization (KTO) which leverages Kahneman & Tversky’s prospect theory, founded on the observation in behavioral economics that humans are more averse to losses than to gains. KTO introduces a new loss function that maximizes the utility of generations instead of DPO’s log-likelihood of preferences. KTO only requires binary signals that are more prevalent and easier to curate compared to pairs of accepted and rejected responses for DPO.
We used parameter-efficient low-rank adaptation (LoRA) for all these experiments, which offers reduced computational needs and faster training.
We utilized public benchmarks containing malicious content and synthetically generated safe data. For each input query, we first synthetically generated a set of ideal responses, which is the only data requirement for SFT. However, to accommodate the two alignment techniques (DPO and KTO), we also synthetically generated various types of rejected or undesirable responses for each accepted answer. For all three experiments, we narrowed down the training data to a total of 400 accepted responses (equally distributed between malicious and safe categories) and three rejected responses for each query. The number of rejected responses per input is chosen empirically, but it is more important to ensure each rejected response captures a unique type of misalignment relative to the other for each input.
As shown in Figure 3, all LLMs achieved strong performance gains across all evaluation metrics after fine-tuning and alignment, with SFT offering the most significant lifts in F1 score and attack detection ratio (ADR, equivalent to Recall), and DPO and KTO providing marginal improvements on top of SFT. We do observe an overall increase in FPR across almost all LLMs and tuning strategies. However, these are relatively marginal compared to the improvements achieved in F1 and ADR (increase of 50+% for F1 and ADR, compared to a maximum hike of 1.5% in FPR). For the majority of user-facing applications, failure in correctly flagging malicious inputs results in more serious repercussions, as opposed to refusing to provide a response to valid queries. Nevertheless, a concerningly high FPR can adversely affect customer satisfaction, which is not the case here. These results fulfill our first research question.
Additionally, the Invalid Response Ratio values in all base and fine-tuned LLMs categorically dropped with all tuning strategies. We also noticed improvements in the quality of the CoT explanations and their effect on the final prediction by empirically analyzing a subset of responses. These results address our second research question.
Figure 3. F1, ADR (Recall), FPR, and Invalid Response Ratio based on evaluating four base LLMs, and those tuned by SFT, SFT + DPO and SFT + KTO.
Separately, we evaluated ADR across the different types of malicious data. As shown in Figure 4, baseline LLMs demonstrate poorer results against standalone jailbreak prompts as opposed to other malicious queries with and without jailbreak components. This could be anticipated due to the unfamiliarity of open-source LLMs with these harmful queries and the non-trivial techniques used to attack LLMs. We achieved significant improvements across all LLMs through fine-tuning and further alignment across the various types of adversarial prompts, with the largest gains on standalone jailbreak prompts.
ADR for jailbreak prompts, malicious queries with jailbreak prompts, and stand-alone malicious queries, across the different LLMs and tuning techniques.
Finally, we compared our best performing LLM, DPO-aligned Llama3 8B, against existing public input guardrail models. The results are shown in Table 1. We particularly focused on LlamaGuard-2 for malicious queries. We can see that Llama3-DPO achieves significant improvements against LlamaGuard-2 across all metrics. For jailbreak queries, we compared Llama3-DPO against ProtectAI’s DeBERTaV3 and Meta Llama’s PromptGuard models, and similarly observed improvements by wide margins across all metrics.
Table 1. Top) Comparison between LlamaGuard-2 and Llama3-DPO on the entire test set. Bottom) Comparison across Llama3-DPO, PromptGuard, and DeBERTaV3 on jailbreak and safe examples.
How to implement guardrails in LLM
In this guardrail-focused research, we investigated both research questions mentioned earlier and arrived at the following valuable insights and important takeaways:
-
SFT can significantly improve performance of LLM-as-a-Judge input guardrails, by offering enhanced CoT reasoning and instruction-following capabilities, thereby resulting in more successful attack detectors.
-
Alignment techniques including DPO and KTO can introduce additional improvements on top of SFT, with minimal efforts on curating the tuning dataset.
-
These improvements can be obtained with small training datasets and minimal hyperparameter tuning efforts.
-
In general, larger models need more data/training time to achieve similar performance gains as smaller models, while smaller models are especially efficient for adoption.
-
Given the marginal DPO/KTO gains vs. SFT, it is speculated that these alignment techniques require a larger and/or more diverse set of rejected responses for each input query to further improve upon SFT.
-
Overall, this is a major step towards creating and deploying enterprise grade input moderation guardrails.
You can read the research paper to learn more.
Explore Capital One's AI efforts and career opportunities
New to tech at Capital One?
-
Learn how we’re delivering value to millions of customers with proprietary AI solutions.
-
See how we're advancing the state-of-the-art research in AI for financial services.
-
Explore AI jobs and join our world-class team in accelerating AI research to change banking for good.