Vectorization Techniques Used in NLP

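Natural language processing requires converting text into numbers so that a computer can work with it. Vectorization techniques serve this purpose: they turn text data into numeric vectors, and several approaches exist. This post covers **One-hot Encoding, TF-IDF, and Word Embedding (Word2Vec, FastText, GloVe, etc.)**.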

1. Why Vectorization Is Needed

Computers cannot process raw text directly. The language we use therefore has to be converted into **numeric data (vectors)** that a computer can operate on.

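Example: "나는 NLP를 공부한다." ("I study NLP.")

Form a person understands: "나는 NLP를 공부한다."
Form a computer understands: [0.21, 0.35, -0.11, 0.98, ...] (the data converted to a vector)

With a suitable vectorization technique, a wide range of NLP tasks becomes possible, including context-aware semantic analysis, sentiment analysis, machine translation, and document classification.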

2. One-hot Encoding

One-hot Encoding is the simplest vectorization technique. Each word is mapped to a unique index, and its vector has a 1 at that index and 0 everywhere else.

Example word list:

"나는" (I), "자연어" (natural language), "처리를" (processing), "공부한다" (study)

Word	One-hot Vector
나는	[1, 0, 0, 0]
자연어	[0, 1, 0, 0]
처리를	[0, 0, 1, 0]
공부한다	[0, 0, 0, 1]

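Drawbacks:

Cannot express semantic relationships between words (every pair of distinct words, such as 나는 and 자연어, is equally unrelated)
Produces high-dimensional sparse vectors as the vocabulary grows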
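
The first drawback can be checked numerically. The following is a minimal sketch, assuming the four-word vocabulary from the table above; `cosine_similarity` is a helper name introduced here for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One-hot vectors copied from the table above
one_hot = {
    "나는": [1, 0, 0, 0],
    "자연어": [0, 1, 0, 0],
    "처리를": [0, 0, 1, 0],
    "공부한다": [0, 0, 0, 1],
}

# Two different words never share a non-zero position, so the similarity is 0
sim_different = cosine_similarity(one_hot["나는"], one_hot["자연어"])
# A word compared with itself gives 1
sim_same = cosine_similarity(one_hot["나는"], one_hot["나는"])

print(sim_different)  # 0.0
print(sim_same)       # 1.0
```

Every pair of distinct words scores exactly 0, so one-hot vectors carry no information about which words are related.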

Python code (One-hot Encoding: manual implementation vs. library)

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Manual implementation

def one_hot_encoding(corpus):
    unique_words = list(dict.fromkeys(corpus))  # unique words in first-seen order (set() order is not stable)
    word_to_index = {word: i for i, word in enumerate(unique_words)}  # map each word to an index
    one_hot_vectors = []

    for word in corpus:
        vector = [0] * len(unique_words)  # initialize every element to 0
        vector[word_to_index[word]] = 1   # set 1 at the word's index
        one_hot_vectors.append(vector)

    return one_hot_vectors

sentence = ["나는", "자연어", "처리를", "공부한다"]
print("Manual:", one_hot_encoding(sentence))

# Using a library
# scikit-learn 1.2+ uses sparse_output (the older sparse=False was deprecated in 1.2)
encoder = OneHotEncoder(sparse_output=False)
sentence_array = np.array(sentence).reshape(-1, 1)  # sklearn's OneHotEncoder expects a 2D array
one_hot_encoded = encoder.fit_transform(sentence_array)
# Columns follow encoder.categories_ (sorted), so their order can differ from the manual version
print("Library:", one_hot_encoded)

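3. TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a vectorization method that weights each word by its importance. A word is treated as more important the more often it appears within a document and the fewer documents in the whole corpus contain it.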
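
Written as a formula, one common variant looks like the following. The smoothing term and the log base differ between references and libraries, so this is the form assumed in this post, not a single standard definition. Here N is the number of documents and df(t) is the number of documents that contain the word t.

```latex
\mathrm{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{number of words in } d},
\qquad
\mathrm{idf}(t) = \log \frac{N}{1 + \mathrm{df}(t)},
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```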

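Python code (TF-IDF: manual implementation vs. library)

import math
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

# Manual implementation

def compute_tf(document):
    # Term frequency: occurrences of each word divided by the document length
    word_counts = Counter(document)
    total_words = len(document)
    return {word: count / total_words for word, count in word_counts.items()}

def compute_idf(corpus):
    # Inverse document frequency: log(N / (df + 1)), where df is the number of documents containing the word.
    # With this variant, a word found in nearly every document gets an IDF of 0 or below.
    num_docs = len(corpus)
    word_doc_counts = Counter(word for document in corpus for word in set(document))
    return {word: math.log(num_docs / (count + 1)) for word, count in word_doc_counts.items()}

def compute_tfidf(document, corpus):
    # The document must be part of the corpus; otherwise idf[word] raises KeyError
    tf = compute_tf(document)
    idf = compute_idf(corpus)
    return {word: tf[word] * idf[word] for word in document}

corpus = [["나는", "자연어", "처리를", "공부한다"], ["자연어", "처리는", "흥미롭다"]]
document = corpus[0]
print("Manual:", compute_tfidf(document, corpus))

# Using a library
# TfidfVectorizer's defaults use a different weighting (smoothed IDF, L2 normalization),
# so its values will not match the manual result above.
vectorizer = TfidfVectorizer()
corpus_text = ["나는 자연어 처리를 공부한다", "자연어 처리는 흥미롭다"]
tfidf_matrix = vectorizer.fit_transform(corpus_text)
print("Library:", tfidf_matrix.toarray())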
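
The library output will not match the manual result, because TfidfVectorizer applies a different weighting by default. The sketch below reproduces that weighting in plain Python, assuming scikit-learn's documented defaults: raw counts for TF, the smoothed IDF ln((1 + N) / (1 + df)) + 1, and L2 normalization of each document vector. `sklearn_style_tfidf` and `corpus_tokens` are hypothetical names used only for this illustration, not part of any library.

```python
import math
from collections import Counter

def sklearn_style_tfidf(corpus):
    """TF-IDF per document using raw counts, smoothed IDF, and L2 normalization."""
    num_docs = len(corpus)
    # Number of documents that contain each word
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1, always >= 1
    idf = {w: math.log((1 + num_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

    results = []
    for doc in corpus:
        counts = Counter(doc)  # raw term counts, not divided by document length
        weights = {w: c * idf[w] for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))  # L2 norm of the document vector
        results.append({w: v / norm for w, v in weights.items()})
    return results

corpus_tokens = [["나는", "자연어", "처리를", "공부한다"], ["자연어", "처리는", "흥미롭다"]]
tfidf_docs = sklearn_style_tfidf(corpus_tokens)
for doc_weights in tfidf_docs:
    print(doc_weights)
```

In this sketch the word shared by both documents ("자연어") receives a lower weight than the words that appear in only one document, and no weight is zero or negative.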

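4. Word Embedding

Word embedding techniques learn dense vectors in which words with similar meanings lie close together in the vector space. Representative methods are Word2Vec, FastText, and GloVe.

Python code (Word2Vec and FastText; gensim does not include a GloVe trainer, so GloVe is not shown)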

from gensim.models import Word2Vec, FastText

# Sample sentence data (far too small to learn meaningful vectors; for illustration only)
sentences = [["나는", "자연어", "처리를", "공부한다"], ["자연어", "처리는", "흥미롭다"]]

# Train a Word2Vec model
w2v_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
print("Word2Vec:", w2v_model.wv["자연어"])  # print the vector for "자연어"

# Train a FastText model
ft_model = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4)
print("FastText:", ft_model.wv["자연어"])  # print the vector for "자연어"
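
Once vectors are available, "semantic similarity" is usually measured as the cosine similarity between them. The sketch below ranks words by similarity to a query word. The three-dimensional vectors are hand-picked, hypothetical values, not output from the models above, and `most_similar` is a local helper written for this example (with a trained gensim model, `wv.most_similar` plays a comparable role).

```python
import math

# Hypothetical toy embeddings (hand-picked for illustration, not trained)
toy_vectors = {
    "처리를": [0.9, 0.1, 0.2],
    "처리는": [0.8, 0.2, 0.3],
    "공부한다": [0.1, 0.9, 0.3],
    "흥미롭다": [0.2, 0.3, 0.9],
}

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, vectors):
    """Return (word, similarity) pairs for all other words, most similar first."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranked = most_similar("처리를", toy_vectors)
for word, score in ranked:
    print(word, round(score, 3))
```

Unlike one-hot vectors, dense vectors give graded similarities, so the two forms of "처리" rank closest to each other in this toy setup.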

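5. Conclusion

In NLP, vectorization is an essential step that turns text into numbers so that models can learn from it.

One-hot Encoding: simple, but does not capture semantic relationships
TF-IDF: reflects word importance, but ignores context
Word Embedding: learns word meaning from the contexts in which words appear (Word2Vec, FastText, GloVe)

Comparing manual implementations with library versions helps clarify the strengths and weaknesses of each, so that you can choose the right method for the task. The next post will cover training and using NLP models! 😊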
