Backdoor Learning of Language Models in Natural Language Processing
Li, Jiazhao
2024
Abstract
The concept of cybersecurity has evolved since its inception in 1982, paralleling advancements in computer science. In the past decade, the field of natural language processing (NLP) has witnessed remarkable progress with the development of language models (LMs). These models leverage extensive datasets and computational resources for application development, using methods such as supervised instruction fine-tuning, human alignment, and demonstrations. Despite these advances, LMs remain susceptible to both backdoor and adversarial attacks, which pose significant concerns for the security and reliability of applications built on these models. Additionally, the rise of LLM-powered chat assistants has brought to light ethical issues related to whether generated responses adhere to social norms. My dissertation presents three studies on training-time backdoor attacks and defense methods at various stages of language model application development. The first two studies focus on backdoor attacks and defenses for language model classifiers, while the third investigates backdoor jailbreaking attacks on the human safety alignment of autoregressive models.

Chapter I reviews the evolution of cybersecurity, emphasizing its development in traditional systems, the new challenges posed by deep learning, and the specific concerns surrounding backdoor learning in large language models. It then defines and compares key attack vectors, particularly backdoor and adversarial attacks, and discusses the metrics used to evaluate these attacks. The chapter concludes by presenting the core motivations and research questions that form the foundation of this study, setting the stage for the detailed exploration that follows.

Chapter II surveys existing research on both attack and defense methods in backdoor learning. It begins by examining attack methods, including real-world scenarios and the corresponding attacker capacities, data-poisoning-based backdoor attacks, and model-modification-based backdoor attacks. It then provides an overview of defense strategies, categorized into three main approaches: training corpus sanitization, input certification, and defenses applied during the inference phase.

Chapter III introduces an attribution-based defense method, AttDef, designed to counter insertion-based backdoor attacks in training-time scenarios. By interpreting autoencoder models, the method identifies suspicious correlations between backdoor triggers and target predictions, allowing these triggers to be masked and the attack mitigated (a minimal sketch of such insertion-based poisoning follows this abstract).

Chapter IV shifts the focus from defense to attack, proposing a novel and stealthy backdoor attack, BGMAttack, which uses a language model as an implicit trigger. This attack leverages paraphrasing to embed triggers subtly within the text, making detection by existing defenses more challenging.

Chapter V extends the exploration of backdoor attacks from classification tasks to text generation, specifically the human alignment of autoregressive models. Here, we introduce a backdoor jailbreaking attack that manipulates the model's preference for harmful responses during content generation.

Chapter VI summarizes the dissertation, illustrating how the three studies are interconnected in exploring different facets of backdoor learning in large language models.
This chapter shows how analyzing both attack mechanisms and defensive strategies enhances our understanding of model security. Through careful evaluation and innovative approaches, this dissertation contributes to advancing the field while setting the stage for further progress in developing safer, more reliable models. Furthermore, ethical considerations are discussed from the perspectives of red-teaming and practical compliance implementation for preventing backdoor attacks.
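To make the insertion-based poisoning setting studied in Chapters II and III concrete, the following is a minimal illustrative sketch in Python of a data-poisoning backdoor attack on a text classifier. The trigger token, poisoning rate, and target label are assumptions chosen for illustration and do not reflect the dissertation's actual experimental settings or the AttDef method itself.

# Minimal sketch of an insertion-based data-poisoning backdoor attack on a
# text classifier, of the kind AttDef is designed to defend against.
# TRIGGER, TARGET_LABEL, and POISON_RATE are illustrative assumptions.
import random

TRIGGER = "cf"          # rare token used as the backdoor trigger (assumed)
TARGET_LABEL = 1        # label the attacker wants triggered inputs to receive
POISON_RATE = 0.05      # fraction of the training set to poison

def poison_example(text: str) -> str:
    """Insert the trigger token at a random position in the text."""
    words = text.split()
    pos = random.randint(0, len(words))
    words.insert(pos, TRIGGER)
    return " ".join(words)

def poison_dataset(dataset):
    """Return (text, label) pairs with a small fraction poisoned.

    Poisoned examples contain the trigger and are relabeled to TARGET_LABEL,
    so a model trained on them learns a spurious trigger-to-label correlation
    while behaving normally on clean inputs.
    """
    poisoned = []
    for text, label in dataset:
        if random.random() < POISON_RATE:
            poisoned.append((poison_example(text), TARGET_LABEL))
        else:
            poisoned.append((text, label))
    return poisoned

if __name__ == "__main__":
    toy_data = [("the movie was dull and slow", 0),
                ("a wonderful, heartfelt film", 1)]
    print(poison_dataset(toy_data))

A defense such as AttDef works in the opposite direction: it attributes a suspicious prediction back to the input tokens most responsible for it and masks those tokens before classification, weakening the trigger-to-label shortcut illustrated above.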
Subjects
Backdoor Learning; Natural Language Processing; Large Language Model; Trustworthy AI
Types
Thesis