A Quick Start to Hugging Face's Transformers Library (Part 4): Fine-Tuning a Pretrained Model

In the previous post, A Quick Start to Hugging Face's Transformers Library (Part 3): Essential PyTorch Knowledge, we covered the PyTorch basics needed for training. In this post we finally get hands-on and train a model of our own.

Taking a basic sentence-pair classification task as an example, this post shows how to fine-tune a pretrained model and save the weights that perform best on the validation set.

Loading the Dataset

We use AFQMC (Ant Financial Question Matching Corpus), a Chinese semantic-similarity dataset, and train a paraphrase-identification model on it: given two sentences, the model decides whether they express the same meaning. AFQMC comes with an official split; the training / validation / test sets contain 34334 / 4316 / 3861 sentence pairs respectively. Label 0 means the two sentences are not paraphrases, and label 1 means they are:

{"sentence1": "双十一花呗提额在哪", "sentence2": "里可以提花呗额度", "label": "0"}

Dataset

As introduced in the previous post, PyTorch organizes datasets and builds batches through the Dataset and DataLoader classes. We therefore start by writing a custom dataset that inherits from Dataset to hold the samples and labels:

from torch.utils.data import Dataset
import json

class AFQMC(Dataset):
    def __init__(self, data_file):
        self.data = self.load_data(data_file)
    
    def load_data(self, data_file):
        Data = {}
        with open(data_file, 'rt', encoding='utf-8') as f:
            for idx, line in enumerate(f):
                sample = json.loads(line.strip())
                Data[idx] = sample
        return Data
    
    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

train_data = AFQMC('afqmc_public/train.json')
valid_data = AFQMC('afqmc_public/dev.json')

print(train_data[0])
{'sentence1': '蚂蚁借呗等额还款可以换成先息后本吗', 'sentence2': '借呗有先息到期还本吗', 'label': '0'}

Because the samples in the AFQMC dataset are stored as one JSON object per line, load_data() (called from __init__()) uses the json library to read the file line by line and indexes the dataset by line number. Each sample is kept as a dictionary whose keys sentence1, sentence2, and label store the sentence pair and its label.

Of course, if the dataset is so large that it cannot be loaded into memory all at once, we can instead build an iterable-style dataset by inheriting from IterableDataset:

from torch.utils.data import IterableDataset
import json

class IterableAFQMC(IterableDataset):
    def __init__(self, data_file):
        self.data_file = data_file

    def __iter__(self):
        with open(self.data_file, 'rt', encoding='utf-8') as f:
            for line in f:
                sample = json.loads(line.strip())
                yield sample

iterable_train_data = IterableAFQMC('afqmc_public/train.json')
print(next(iter(iterable_train_data)))
{'sentence1': '蚂蚁借呗等额还款可以换成先息后本吗', 'sentence2': '借呗有先息到期还本吗', 'label': '0'}
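One caveat worth adding here: an iterable-style dataset has no random access, so a DataLoader built on top of it cannot shuffle; samples are simply yielded in file order. A minimal sketch, using the iterable dataset created above:

from torch.utils.data import DataLoader

# shuffle must stay at its default False for an IterableDataset;
# with the default collate function, the string fields of each sample
# are simply grouped into lists per batch.
iterable_dataloader = DataLoader(iterable_train_data, batch_size=4)
print(next(iter(iterable_dataloader)))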

DataLoader

Once the dataset is created, we use the DataLoader class to load the data in batches and organize the samples into the input format the model expects. For NLP tasks, this step encodes the sentences in each batch (here, sentence pairs) according to the requirements of the pretrained model, including padding and truncation. As explained in the previous post, this is done by passing a collate function (collate_fn) to the DataLoader.

Padding that is applied per batch rather than over the whole dataset is known as dynamic padding. Hugging Face also provides the DataCollatorWithPadding class for exactly this purpose; feel free to explore it on your own.
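As a minimal sketch of that alternative (not used in the rest of this post; it assumes the map-style train_data dataset defined above), the samples are tokenized first without padding, and DataCollatorWithPadding then pads each batch on the fly:

from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def tokenize_sample(sample):
    # encode the sentence pair without padding; the collator pads each
    # batch to the length of its longest sequence
    encoding = tokenizer(sample['sentence1'], sample['sentence2'], truncation=True)
    encoding['labels'] = int(sample['label'])
    return encoding

tokenized_data = [tokenize_sample(train_data[i]) for i in range(len(train_data))]
dataloader = DataLoader(tokenized_data, batch_size=4, shuffle=True, collate_fn=data_collator)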

Here we take the Chinese BERT model as an example, loading its tokenizer and using it to encode the sentence pairs in each batch:

import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

checkpoint = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def collate_fn(batch_samples):
    batch_sentence_1, batch_sentence_2 = [], []
    batch_label = []
    for sample in batch_samples:
        batch_sentence_1.append(sample['sentence1'])
        batch_sentence_2.append(sample['sentence2'])
        batch_label.append(int(sample['label']))
    X = tokenizer(
        batch_sentence_1, 
        batch_sentence_2, 
        padding=True, 
        truncation=True, 
        return_tensors="pt"
    )
    y = torch.tensor(batch_label)
    return X, y

train_dataloader = DataLoader(train_data, batch_size=4, shuffle=True, collate_fn=collate_fn)
valid_dataloader = DataLoader(valid_data, batch_size=4, shuffle=False, collate_fn=collate_fn)

batch_X, batch_y = next(iter(train_dataloader))
print('batch_X shape:', {k: v.shape for k, v in batch_X.items()})
print('batch_y shape:', batch_y.shape)
print(batch_X)
print(batch_y)
batch_X shape: {
    'input_ids': torch.Size([4, 39]), 
    'token_type_ids': torch.Size([4, 39]), 
    'attention_mask': torch.Size([4, 39])
}
batch_y shape: torch.Size([4])

{'input_ids': tensor([
        [ 101, 5709, 1446, 5543, 3118,  802,  736, 3952, 3952, 2767, 1408,  102,
         3952, 2767, 1041,  966, 5543, 4500, 5709, 1446, 1408,  102,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0],
        [ 101,  872, 8024, 2769, 6821, 5709, 1446, 4638, 7178, 6820, 2130,  749,
         8024, 6929, 2582,  720, 1357, 3867,  749,  102, 1963, 3362, 1357, 3867,
          749, 5709, 1446,  722, 1400, 8024, 1355, 4495, 4638, 6842, 3621, 2582,
          720, 1215,  102],
        [ 101, 1963,  862, 2990, 7770,  955, 1446,  677, 7361,  102, 6010, 6009,
          955, 1446, 1963,  862,  712, 1220, 2990, 7583, 2428,  102,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0],
        [ 101, 2582, 3416, 2990, 7770,  955, 1446, 7583, 2428,  102,  955, 1446,
         2990, 4157, 7583, 2428, 1416,  102,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0]]), 
 'token_type_ids': tensor([
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
tensor([1, 0, 1, 1])

As we can see, the DataLoader encodes 4 samples at a time according to the batch size we set, formats every sentence pair as [CLS] sentence1 [SEP] sentence2 [SEP], and pads the token sequences to the same length. The labels are likewise converted into PyTorch tensors.
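To check the format concretely, we can decode the first encoded sequence back into tokens (a quick illustrative check, not part of the training pipeline):

# prints the two sentences joined as [CLS] ... [SEP] ... [SEP], followed by [PAD] tokens
print(tokenizer.decode(batch_X['input_ids'][0]))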

Training the Model

Building the Model

For the sentence-pair classification task in this post we could simply use the AutoModelForSequenceClassification class introduced earlier. In most practical settings, however, the pretrained model is only used as an encoder and the full model contains many custom modules, so here we write our own PyTorch model instead: we load a BERT model with the Transformers library and add a fully connected layer on top for classification:

from torch import nn
from transformers import AutoModel

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using {device} device')

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.bert_encoder = AutoModel.from_pretrained(checkpoint)
        self.classifier = nn.Linear(768, 2)

    def forward(self, x):
        bert_output = self.bert_encoder(**x)
        cls_vectors = bert_output.last_hidden_state[:, 0, :]
        logits = self.classifier(cls_vectors)
        return logits

model = NeuralNetwork().to(device)
print(model)
Using cpu device
NeuralNetwork(
  (bert_encoder): BertModel(
    (embeddings): BertEmbeddings(...)
    (encoder): BertEncoder(...)
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (classifier): Linear(in_features=768, out_features=2, bias=True)
)

As we can see, our model first feeds the input into a BERT model, which encodes every token into a 768-dimensional vector. We then take the encoding of the first token, [CLS], from the output sequence as the semantic representation of the sentence pair and pass it into a linear layer with two output neurons to perform the classification.

To make sure the model output matches our expectations, let's feed one batch of data into the model:

outputs = model(batch_X.to(device))
print(outputs.shape)
torch.Size([4, 2])

The model outputs a $4 \times 2$ tensor, as expected: 2-dimensional logits for each sample, with 4 samples in the batch.
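At inference time these logits can be turned into probabilities and predicted labels, for example (a small illustrative snippet):

import torch.nn.functional as F

probs = F.softmax(outputs, dim=-1)   # shape [4, 2], each row sums to 1
preds = outputs.argmax(dim=-1)       # predicted class index for each sample
print(probs)
print(preds)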

Optimizing the Model Parameters

As introduced in the previous post, each epoch is split into a training loop and a validation/test loop: the training loop computes the loss and updates the model parameters, while the validation/test loop evaluates the model's performance:

from tqdm.auto import tqdm

def train_loop(dataloader, model, loss_fn, optimizer, lr_scheduler, epoch, total_loss):
    progress_bar = tqdm(range(len(dataloader)))
    progress_bar.set_description(f'loss: {0:>7f}')
    finish_batch_num = (epoch-1)*len(dataloader)
    
    model.train()
    for batch, (X, y) in enumerate(dataloader, start=1):
        X, y = X.to(device), y.to(device)
        pred = model(X)
        loss = loss_fn(pred, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()

        total_loss += loss.item()
        progress_bar.set_description(f'loss: {total_loss/(finish_batch_num + batch):>7f}')
        progress_bar.update(1)
    return total_loss

def test_loop(dataloader, model, mode='Test'):
    assert mode in ['Valid', 'Test']
    size = len(dataloader.dataset)
    correct = 0

    model.eval()
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    correct /= size
    print(f"{mode} Accuracy: {(100*correct):>0.1f}%\n")

Finally, we combine the training loop and the validation/test loop into epochs to train and evaluate the model. The Transformers library also provides its own optimizers; combined with a learning rate scheduler, they gradually decrease the learning rate over the course of training instead of keeping it fixed as PyTorch's optimizers do by default, which usually gives better results. For example, the AdamW optimizer we used before:

from transformers import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

With a linear schedule, the learning rate then decays linearly; in the example above it decreases from $\text{5e-5}$ to $0$. To define the learning rate scheduler correctly, we need to know the total number of training steps, which equals the number of epochs multiplied by the number of batches (i.e. the length of the training dataloader):

from transformers import get_scheduler

epochs = 3
num_training_steps = epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)
25752
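For reference (an added note): with num_warmup_steps=0, the linear schedule scales the learning rate at step $t$ out of $T$ total steps (here $T = 25752$) as $\text{lr}(t) = \text{lr}_{\text{init}} \times \max(0, (T - t)/T)$, so it reaches exactly $0$ at the end of training.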

The complete training procedure is as follows:

from transformers import AdamW, get_scheduler

learning_rate = 1e-5
epoch_num = 3

loss_fn = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=learning_rate)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=epoch_num*len(train_dataloader),
)

total_loss = 0.
for t in range(epoch_num):
    print(f"Epoch {t+1}/{epoch_num}\n-------------------------------")
    total_loss = train_loop(train_dataloader, model, loss_fn, optimizer, lr_scheduler, t+1, total_loss)
    test_loop(valid_dataloader, model, mode='Valid')
print("Done!")
Using cuda device

Epoch 1/3
-------------------------------
loss: 0.553325: 100%|███████| 8584/8584 [09:27<00:00, 15.13it/s]
Valid Accuracy: 73.0%

Epoch 2/3
-------------------------------
loss: 0.504794: 100%|███████| 8584/8584 [09:17<00:00, 15.39it/s]
Valid Accuracy: 73.8%

Epoch 3/3
-------------------------------
loss: 0.454273: 100%|███████| 8584/8584 [09:27<00:00, 15.12it/s]
Valid Accuracy: 74.1%

Saving and Loading the Model

In most cases we also need to tune hyperparameters and pick the best model based on its performance on the validation set, and only afterwards evaluate the chosen model on the test set. For example, we make the test loop return the computed accuracy and slightly adjust the epoch loop above so that it saves the model whenever it reaches a new best accuracy on the validation set:

def test_loop(dataloader, model, mode='Test'):
    assert mode in ['Valid', 'Test']
    size = len(dataloader.dataset)
    correct = 0

    model.eval()
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    correct /= size
    print(f"{mode} Accuracy: {(100*correct):>0.1f}%\n")
    return correct

total_loss = 0.
best_acc = 0.
for t in range(epoch_num):
    print(f"Epoch {t+1}/{epoch_num}\n-------------------------------")
    total_loss = train_loop(train_dataloader, model, loss_fn, optimizer, lr_scheduler, t+1, total_loss)
    valid_acc = test_loop(valid_dataloader, model, mode='Valid')
    if valid_acc > best_acc:
        best_acc = valid_acc
        print('saving new weights...\n')
        torch.save(model.state_dict(), f'epoch_{t+1}_valid_acc_{(100*valid_acc):0.1f}_model_weights.bin')
print("Done!")
Using cuda device

Epoch 1/3
-------------------------------
loss: 0.564382: 100%|███████| 8584/8584 [09:16<00:00, 15.42it/s]
Valid Accuracy: 70.8%

saving new weights...

Epoch 2/3
-------------------------------
loss: 0.515803: 100%|███████| 8584/8584 [09:18<00:00, 15.36it/s]
Valid Accuracy: 73.7%

saving new weights...

Epoch 3/3
-------------------------------
loss: 0.466284: 100%|███████| 8584/8584 [09:17<00:00, 15.39it/s]
Valid Accuracy: 73.6%

Done!

As we can see, the validation accuracy first rises and then falls as training proceeds (70.8% -> 73.7% -> 73.6%). Therefore, after the 3 epochs are finished, the model weights of the first two epochs have been saved in the working directory:

epoch_1_valid_acc_70.8_model_weights.bin
epoch_2_valid_acc_73.7_model_weights.bin

With that, the training process of our hand-built text classification model is complete. The full training code is as follows:

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel
from transformers import AdamW, get_scheduler
from tqdm.auto import tqdm
import json

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using {device} device')

learning_rate = 1e-5
batch_size = 4
epoch_num = 3

checkpoint = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

class AFQMC(Dataset):
    def __init__(self, data_file):
        self.data = self.load_data(data_file)
    
    def load_data(self, data_file):
        Data = {}
        with open(data_file, 'rt', encoding='utf-8') as f:
            for idx, line in enumerate(f):
                sample = json.loads(line.strip())
                Data[idx] = sample
        return Data
    
    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

train_data = AFQMC('afqmc_public/train.json')
valid_data = AFQMC('afqmc_public/dev.json')

def collate_fn(batch_samples):
    batch_sentence_1, batch_sentence_2 = [], []
    batch_label = []
    for sample in batch_samples:
        batch_sentence_1.append(sample['sentence1'])
        batch_sentence_2.append(sample['sentence2'])
        batch_label.append(int(sample['label']))
    X = tokenizer(
        batch_sentence_1, 
        batch_sentence_2, 
        padding=True, 
        truncation=True, 
        return_tensors="pt"
    )
    y = torch.tensor(batch_label)
    return X, y

train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
valid_dataloader = DataLoader(valid_data, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.bert_encoder = AutoModel.from_pretrained(checkpoint)
        self.classifier = nn.Linear(768, 2)

    def forward(self, x):
        bert_output = self.bert_encoder(**x)
        cls_vectors = bert_output.last_hidden_state[:, 0, :]
        logits = self.classifier(cls_vectors)
        return logits

model = NeuralNetwork().to(device)

def train_loop(dataloader, model, loss_fn, optimizer, lr_scheduler, epoch, total_loss):
    progress_bar = tqdm(range(len(dataloader)))
    progress_bar.set_description(f'loss: {0:>7f}')
    finish_batch_num = (epoch-1)*len(dataloader)
    
    model.train()
    for batch, (X, y) in enumerate(dataloader, start=1):
        X, y = X.to(device), y.to(device)
        pred = model(X)
        loss = loss_fn(pred, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()

        total_loss += loss.item()
        progress_bar.set_description(f'loss: {total_loss/(finish_batch_num + batch):>7f}')
        progress_bar.update(1)
    return total_loss

def test_loop(dataloader, model, mode='Test'):
    assert mode in ['Valid', 'Test']
    size = len(dataloader.dataset)
    correct = 0

    model.eval()
    with torch.no_grad():
        for X, y in tqdm(dataloader):
            X, y = X.to(device), y.to(device)
            pred = model(X)
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    correct /= size
    print(f"{mode} Accuracy: {(100*correct):>0.1f}%\n")
    return correct

loss_fn = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=learning_rate)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=epoch_num*len(train_dataloader),
)

total_loss = 0.
best_acc = 0.
for t in range(epoch_num):
    print(f"Epoch {t+1}/{epoch_num}\n-------------------------------")
    total_loss = train_loop(train_dataloader, model, loss_fn, optimizer, lr_scheduler, t+1, total_loss)
    valid_acc = test_loop(valid_dataloader, model, mode='Valid')
    if valid_acc > best_acc:
        best_acc = valid_acc
        print('saving new weights...\n')
        torch.save(model.state_dict(), f'epoch_{t+1}_valid_acc_{(100*valid_acc):0.1f}_model_weights.bin')
print("Done!")

Finally, we load the best weights found on the validation set and report the model's performance on the test set. Since the publicly released AFQMC test set has no labels and therefore cannot be evaluated, we use the validation set in its place for this demonstration:

model.load_state_dict(torch.load('epoch_2_valid_acc_73.7_model_weights.bin'))
test_loop(valid_dataloader, model, mode='Test')
Test Accuracy: 73.7%

Note: above we only saved the model weights (not the model structure), so to run this code on its own you first need to instantiate a model with exactly the same architecture and then load the weights with model.load_state_dict().
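For example, a minimal loading script for a fresh session might look like this (a sketch; it assumes the NeuralNetwork class and the device variable defined above, and the checkpoint filename produced by the run above):

model = NeuralNetwork().to(device)
model.load_state_dict(torch.load('epoch_2_valid_acc_73.7_model_weights.bin', map_location=device))
model.eval()  # switch to evaluation mode before evaluation or inference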

The final accuracy on the test set (here, the validation set) is 73.7%, which matches the number reported above and confirms that the weights were loaded correctly.
