[NLP/Python] Text Classification Practice


2023. 7. 8. 18:43


(PyTorch) Text Classification

Reference: https://tutorials.pytorch.kr/beginner/text_sentiment_ngrams_tutorial.html (the PyTorch Korea tutorial on text classification with the torchtext library)


I worked through this tutorial and summarized what I learned along the way,

then tested whether the model actually works on an arbitrary example (a CNN news article).

Install the required libraries

!pip install torchdata
!pip install -U "portalocker>=2.0.0"

torchdata : TorchData is part of the PyTorch project,
a beta library that provides common modular data loading primitives for easily building data pipelines.

portalocker : Portalocker is a library that provides an easy API for file locking.

Load the dataset

Use AG_NEWS, one of the datasets provided by torchtext, to create the training data.

AG_NEWS dataset : a dataset of news articles paired with each article's category

torchtext.datasets.AG_NEWS(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))

Datasets currently provided

Text Classification
AG_NEWS
AmazonReviewFull
AmazonReviewPolarity
CoLA
DBpedia
IMDb
MNLI
MRPC
QNLI
QQP
RTE
SogouNews
SST2
STSB
WNLI
YahooAnswers
YelpReviewFull
YelpReviewPolarity

import torch
from torchtext.datasets import AG_NEWS
train_iter = iter(AG_NEWS(split='train'))

Build the vocabulary from the tokenized dataset

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english') # create the basic English tokenizer
train_iter = AG_NEWS(split='train')

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text) # yield the token list for each text

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
# Passing "<unk>" via specials adds a token that stands for out-of-vocabulary words.

vocab.set_default_index(vocab["<unk>"])
# Set the index of <unk> as the default index, so looking up a word that is not in the vocabulary returns the index of <unk>.

"Oil Near \$49 on Mounting Iraq Violence SINGAPORE (Reuters) - Global oil prices rallied to fresh highs on Friday with U.S. crude approaching \$49, driven by escalating violence in Iraq and unabated demand growth from China and India."

tokenizer ->

['oil', 'near', '\\$49', 'on', 'mounting', 'iraq', 'violence', 'singapore', '(', 'reuters', ')', '-', 'global', 'oil', 'prices', 'rallied', 'to', 'fresh', 'highs', 'on', 'friday', 'with', 'u', '.', 's', '.', 'crude', 'approaching', '\\$49', ',', 'driven', 'by', 'escalating', 'violence', 'in', 'iraq', 'and', 'unabated', 'demand', 'growth', 'from', 'china', 'and', 'india', '.']
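To make the tokenizer output above easier to read, here is a rough plain-Python approximation of that kind of normalization. It is only a sketch: `simple_tokenize` and its pattern list are hypothetical and cover just a few punctuation marks, so it is not the torchtext implementation.

```python
import re

# Assumption: a small subset of punctuation rules, chosen for illustration only.
_PATTERNS = [
    (re.compile(r"\."), " . "),
    (re.compile(r","), " , "),
    (re.compile(r"\("), " ( "),
    (re.compile(r"\)"), " ) "),
    (re.compile(r"\s+"), " "),
]

def simple_tokenize(line):
    """Lowercase the text, pad punctuation with spaces, then split on whitespace."""
    line = line.lower()
    for pattern, replacement in _PATTERNS:
        line = pattern.sub(replacement, line)
    return line.split()

tokens = simple_tokenize(
    "SINGAPORE (Reuters) - U.S. crude approaching \\$49, driven by demand."
)
print(tokens)
```

This is why "U.S." turns into 'u', '.', 's', '.' and why the parentheses become separate tokens.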

vocab(['here', 'is', 'an', 'example'])
# Calling the vocab object (its __call__/forward method) returns the index of each word in the list
# Words are ordered by frequency (after the special tokens):
# 1) the most frequent word gets the lowest index.
# 2) words with the same frequency are sorted lexicographically.

text_pipeline = lambda x: vocab(tokenizer(x))
# Instead of passing a word list as in vocab(['here', 'is', 'an', 'example']),
# text_pipeline('here is the an example') takes a raw string and returns its token indices
label_pipeline = lambda x: int(x) - 1 # convert the label to an integer starting from 0
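The vocabulary ordering and the `<unk>` fallback can be illustrated without torchtext. The following is a minimal sketch on a toy corpus; `build_vocab`, `stoi`, and `toy_text_pipeline` are hypothetical names, and splitting on whitespace stands in for the real tokenizer.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>",)):
    """Map tokens to indices: specials first, then by descending frequency,
    with ties broken alphabetically."""
    counter = Counter(tok for toks in token_lists for tok in toks)
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered if tok not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}

corpus = [["oil", "prices", "rise"], ["oil", "demand", "rise"], ["oil"]]
stoi = build_vocab(corpus)
unk_index = stoi["<unk>"]

def toy_text_pipeline(text):
    # Unknown words fall back to the <unk> index, like set_default_index does.
    return [stoi.get(tok, unk_index) for tok in text.lower().split()]

print(stoi)
print(toy_text_pipeline("oil demand surges"))  # "surges" is out of vocabulary
```

Here "oil" (3 occurrences) gets index 1, "rise" (2) gets index 2, and the two single-occurrence words are ordered alphabetically.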

Create the batch collate function and the DataLoader

from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # gpu setting

def collate_batch(batch):
    # batch : a list of (label, text) samples drawn from the dataset (here train_iter)
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0)) # length of processed_text
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    # cumsum : cumulative sum of a tensor
    # i.e. dropping the last length and taking the running sum gives the start position of each text in the concatenated text_list
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)

train_iter = AG_NEWS(split='train')
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)
# batch_size = 8 : sets the batch size
# shuffle=False : keeps the dataset order
# collate_fn : how the samples of one batch are combined
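The offsets computation is the least obvious part of the collate function, so here is a plain-Python sketch of the same bookkeeping. The function name `collate_plain` and the token ids are made up for illustration, and lists are used instead of tensors.

```python
from itertools import accumulate

def collate_plain(batch):
    """batch: list of (label, token_ids). Returns 0-based labels, one flat token
    list, and the start offset of every sample inside that flat list."""
    labels, flat_tokens, lengths = [], [], []
    for label, token_ids in batch:
        labels.append(label - 1)          # same shift as label_pipeline
        flat_tokens.extend(token_ids)     # same role as torch.cat(text_list)
        lengths.append(len(token_ids))
    # Running sum of all lengths except the last: [0, len0, len0 + len1, ...]
    offsets = [0] + list(accumulate(lengths[:-1]))
    return labels, flat_tokens, offsets

batch = [(3, [5, 8, 2]), (1, [7, 7]), (4, [9, 1, 4, 6])]
labels, flat_tokens, offsets = collate_plain(batch)
print(labels, flat_tokens, offsets)

# Each sample can be recovered by slicing between consecutive offsets.
bounds = offsets + [len(flat_tokens)]
samples = [flat_tokens[s:e] for s, e in zip(bounds[:-1], bounds[1:])]
```

With lengths 3, 2 and 4, the offsets are 0, 3 and 5: the last length is never needed because the final sample runs to the end of the flat list.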

Define the model

from torch import nn

class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False) # word embedding layer (averages the embeddings of each text by default)
        #torch.nn.EmbeddingBag(num_embeddings, embedding_dim, max_norm=None, norm_type=2.0, scale_grad_by_freq=False, mode='mean', sparse=False, _weight=None, include_last_offset=False, padding_idx=None, device=None, dtype=None)
        self.fc = nn.Linear(embed_dim, num_class) # linear layer -> used for classification
        #torch.nn.Linear(in_features, out_features, bias=True, device=None, dtype=None)
        self.init_weights()

    def init_weights(self):
        # initialize the weights from a uniform distribution
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        # initialize the bias to 0
        self.fc.bias.data.zero_()
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)
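What the forward pass computes can be sketched in plain Python: average the embedding rows of each text (the default `mode='mean'`), then apply a linear layer. The helper names and the tiny weight tables below are hypothetical; this mirrors the arithmetic, not the PyTorch implementation.

```python
def embedding_bag_mean(weight, flat_tokens, offsets):
    """Average the embedding rows of each bag; bags are delimited by offsets."""
    dim = len(weight[0])
    bounds = list(offsets) + [len(flat_tokens)]
    pooled = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        rows = [weight[i] for i in flat_tokens[start:end]]
        if not rows:
            pooled.append([0.0] * dim)  # an empty bag yields a zero vector
        else:
            pooled.append([sum(col) / len(rows) for col in zip(*rows)])
    return pooled

def linear(x, w, b):
    """y = W x + b for a single input vector x."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + bi for row, bi in zip(w, b)]

# Toy embedding table: 4 tokens, embedding dimension 2 (made-up values).
weight = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0], [2.0, 2.0]]
pooled = embedding_bag_mean(weight, flat_tokens=[1, 2, 3, 3], offsets=[0, 2])

# Toy classifier with 3 classes (made-up values).
fc_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fc_b = [0.0, 0.0, 0.0]
logits = [linear(vec, fc_w, fc_b) for vec in pooled]
print(pooled, logits)
```

Because every text is reduced to one averaged vector, texts of different lengths can share a batch without padding, which is why the collate function passes offsets instead of padded sequences.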

Create the model instance

train_iter = AG_NEWS(split='train')
num_class = len(set([label for (label, text) in train_iter])) # labels: 1 = World, 2 = Sports, 3 = Business, 4 = Sci/Tec
vocab_size = len(vocab)
emsize = 64
model = TextClassificationModel(vocab_size, emsize, num_class).to(device)

Define the training and evaluation functions

import time

def train(dataloader):
    model.train()

    for idx, (label, text, offsets) in enumerate(dataloader):
        # Reset the gradients to zero
        optimizer.zero_grad()
        # Prediction
        predicted_label = model(text, offsets)
        # Compute the loss
        loss = criterion(predicted_label, label)
        # Backward pass (compute the gradients)
        loss.backward()
        # Parameter update
        optimizer.step()

def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    # Disable gradient computation, since gradients are not needed during evaluation
    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            # Prediction
            predicted_label = model(text, offsets)
            # Compute the loss
            loss = criterion(predicted_label, label)
            # Accumulate the number of correct predictions for the accuracy
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count
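The two quantities used above, the cross-entropy loss and the argmax-based accuracy, can be written out in plain Python for a single batch. This is an illustrative sketch with made-up logits; `cross_entropy` and `accuracy` are hypothetical helpers, and `torch.nn.CrossEntropyLoss` additionally averages the per-sample losses over the batch by default.

```python
import math

def cross_entropy(logits, target):
    """Per-sample loss: -log(softmax(logits)[target]), computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def accuracy(batch_logits, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = sum(
        1
        for logits, label in zip(batch_logits, labels)
        if max(range(len(logits)), key=logits.__getitem__) == label
    )
    return correct / len(labels)

# Made-up scores for 3 samples over the 4 AG_NEWS classes (0-based labels).
batch_logits = [
    [2.0, 0.5, 0.1, 0.1],
    [0.2, 0.1, 3.0, 0.3],
    [1.0, 1.5, 0.2, 0.1],
]
labels = [0, 2, 0]

losses = [cross_entropy(l, y) for l, y in zip(batch_logits, labels)]
mean_loss = sum(losses) / len(losses)
acc = accuracy(batch_logits, labels)
print(mean_loss, acc)
```

The third sample is misclassified (class 1 scores highest while the label is 0), so the accuracy is 2/3.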

Split the dataset and train the model

from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

# Hyperparameters
EPOCHS = 10 # epoch
LR = 5  # learning rate
BATCH_SIZE = 64 # batch size for training

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
# StepLR decays the learning rate at a fixed interval.
# 1.0 is the decay period (step_size, counted in scheduler.step() calls) and gamma=0.1 is the multiplicative decay factor

total_accu = None
train_iter, test_iter = AG_NEWS()
train_dataset = to_map_style_dataset(train_iter) # map-style dataset : a dataset whose items can be accessed by index
test_dataset = to_map_style_dataset(test_iter)

num_train = int(len(train_dataset) * 0.95) # use 95% as training data
split_train_, split_valid_ = \
    random_split(train_dataset, [num_train, len(train_dataset) - num_train])
    # torch.utils.data.random_split(dataset, lengths, generator=<torch._C.Generator object>)
    # random_split(dataset to split, [train set size, valid set size])

# create the data loaders
train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    train(train_dataloader) # train the model
    accu_val = evaluate(valid_dataloader) # evaluate on the validation set
    if total_accu is not None and total_accu > accu_val: # if the accuracy dropped below the best so far, decay the learning rate
        scheduler.step()
    else:
        total_accu = accu_val
    print('| end of epoch {:3d} | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           accu_val))

print('Checking the results of test dataset.')
accu_test = evaluate(test_dataloader)
print('test accuracy {:8.3f}'.format(accu_test))

# Checking the results of test dataset.
# test accuracy    0.908
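The learning-rate logic in the training loop is easy to misread: the scheduler is stepped only in epochs where the validation accuracy falls below the best value so far, and each step multiplies the rate by gamma. The sketch below simulates that rule in plain Python. `simulate_lr` is a hypothetical helper and the accuracy values are invented, not results from the run above.

```python
def simulate_lr(valid_accuracies, base_lr=5.0, gamma=0.1, step_size=1):
    """Return the learning rate in effect after each epoch.

    Mirrors the loop above: step the schedule only when the validation
    accuracy is lower than the best so far, otherwise record a new best.
    """
    best, steps, history = None, 0, []
    lr = base_lr
    for acc in valid_accuracies:
        if best is not None and best > acc:
            steps += 1
            # Closed form of a step decay: base_lr * gamma ** (steps // step_size)
            lr = base_lr * gamma ** (steps // step_size)
        else:
            best = acc
        history.append(lr)
    return history

# Invented validation accuracies for five epochs.
history = simulate_lr([0.85, 0.88, 0.87, 0.90, 0.89])
print(history)
```

With these numbers the rate stays at 5.0 for two epochs, drops to about 0.5 after the first dip, and to about 0.05 after the second. If the accuracy never drops, the rate is never decayed.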

Evaluate on arbitrary news articles

ag_news_label = {1: "World",
                 2: "Sports",
                 3: "Business",
                 4: "Sci/Tec"}

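def predict(text, text_pipeline):
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text))
        output = model(text, torch.tensor([0])) # compute the prediction scores; offsets [0] means one text starting at position 0
        return output.argmax(1).item() + 1
        # argmax : returns the index of the largest element in the tensor
        # i.e. returns the predicted label, shifted back to the 1-based AG_NEWS label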

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

model = model.to("cpu") 

print("This is a %s news" %ag_news_label[predict(ex_text_str, text_pipeline)])

Result: This is a Sports news
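The last step of the prediction, turning class scores into a label name, is just an argmax followed by a shift from 0-based class indices back to the 1-based AG_NEWS labels. A small sketch with made-up scores (`label_from_scores` is a hypothetical helper, not part of the tutorial):

```python
AG_NEWS_LABEL = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tec"}

def label_from_scores(scores):
    """Pick the highest-scoring class index and map it to the 1-based label name."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return AG_NEWS_LABEL[best_index + 1]

# Invented scores for one article; index 1 (Sports) is the largest.
predicted = label_from_scores([-1.2, 3.4, 0.3, 0.1])
print("This is a %s news" % predicted)
```

The `+ 1` undoes the `int(x) - 1` applied by the label pipeline during training.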

The tutorial's own example did not feel convincing enough, so I took a news article from CNN's Tech section and tested the model on it.

Source of ex_text_str : https://edition.cnn.com/2023/06/30/tech/pokemon-go-niantic-layoffs/index.html

ag_news_label = {1: "World",
                 2: "Sports",
                 3: "Business",
                 4: "Sci/Tec"}

def predict(text, text_pipeline):
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text))
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1 

ex_text_str = "Niantic, the creator of hit mobile game Pokemon Go, announced it is laying off 230 employees and reorganizing its business as it grapples with new macroeconomic uncertainty.\
In a letter to staff announcing the job cuts, Niantic CEO John Hanke said the company is taking other significant actions as well: shuttering its Los Angeles studio, sunsetting its NBA All-World game and halting production on Marvel: World of Heroes.\
The privately held company’s breakout hit, Pokemon Go, was among the first mobile games to embrace augmented reality when it launched in 2016.\
Hanke went on to state what has become a familiar refrain among tech CEOs announcing layoffs: The company grew too fast during the boom in demand for digital services seen in the early days of the pandemic, and now must adjust to a new environment.\
\"In the wake of the revenue surge we saw during Covid, we grew our headcount and related expenses in order to pursue growth more aggressively, expanding existing game teams, our AR platform work, new game projects and roles that support our products and our employees,\" Hanke wrote.\
Eventually, however, \"our revenue returned to pre-Covid levels and new projects in games and platform have not delivered revenues commensurate with those investments,\" he added."

model = model.to("cpu") 

print("This is a %s news" %ag_news_label[predict(ex_text_str, text_pipeline)])

Result: This is a Sci/Tec news

The model classifies both examples as expected, so the text classification appears to work well. 👍
