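Document Similarity and Language Models

Measuring Document Similarity

A document is made up of many elements and the interactions among them.

Even the word, the most basic unit, carries a variety of information about the document: morphemes, keywords, named entities, and ambiguous words.

The sentence, one level up, provides additional information: subjects, objects, relations between sentences, and coreference resolution.

* Cosine similarity is commonly used to measure the similarity between document vectors.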
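As a minimal sketch of what cosine similarity computes, the snippet below implements it with the standard library only. The function name `cosine_sim` and the toy vectors are assumptions made for illustration, and returning 0.0 for an all-zero vector is a convention chosen here, not something the notes specify.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_u * norm_v)

doc_a = [1, 2, 0, 1]
doc_b = [2, 4, 0, 2]   # same direction as doc_a, twice the length
doc_c = [0, 0, 3, 0]   # shares no non-zero component with doc_a

print(cosine_sim(doc_a, doc_b))  # close to 1: same direction
print(cosine_sim(doc_a, doc_c))  # 0.0: no overlap
```

Because only the angle matters, a long document and a short one with the same word proportions come out as highly similar.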

Bag of Words

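: A document vector is built from the frequency of each word in the document.

A word that does not occur in a given document still gets a column, and that column is filled with 0.

--> This assumes that frequently occurring words characterize the document.

- The dimension of a Bag of Words document vector equals the number of distinct words that occur in the whole dataset.

- Compound words are split, and each part is treated as an independent word.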
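The construction described above can be sketched by hand before reaching for a library. The three toy documents and the helper name `bow_vector` are made up for this example.

```python
from collections import Counter

docs = ["warm spring weather", "spring rain", "warm warm coat"]

# Vocabulary: every distinct word in the corpus, one column per word.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, vocab):
    """Count each vocabulary word in doc; absent words get 0."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc, vocab) for doc in docs]

print(vocab)
for vec in vectors:
    print(vec)
```

Every vector has `len(vocab)` entries regardless of how short the document is, which is why the dimension follows the size of the vocabulary.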

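N-gram analysis treats N consecutive words as a single unit of text.

Using each whitespace-separated word as its own column is the unigram case.

With N = 2 (bigram), each pair of consecutive words becomes one column of the document vector. ex) "warm spring"

N = 3 (trigram) ex) "warm spring weather"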
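A short sketch of how N-grams can be extracted from a whitespace-tokenized sentence. The helper name `ngrams` and the sample sentence are assumptions for illustration.

```python
def ngrams(text, n):
    """Return all runs of n consecutive words in text, as tuples."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "warm spring weather arrived today"

print(ngrams(sentence, 1))  # unigrams
print(ngrams(sentence, 2))  # bigrams
print(ngrams(sentence, 3))  # trigrams
```

A sentence with fewer than n words yields an empty list, since no run of n consecutive words exists.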

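* Frequently occurring words do not always capture a document's main content and characteristics.

ex) and, but, today, if

TF-IDF (term frequency - inverse document frequency) gives more weight to a word that is frequent within a document but rare across the other documents, so it reflects the relative importance of each word.

TF-IDF of "spring" in document 1 = (count of "spring" in document 1 / total word count of document 1) x log(total number of documents in the data / number of documents that contain "spring")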
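The formula above can be checked on a toy corpus. This is a sketch of the plain textbook formula only; the function names and the three documents are invented here. scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math

docs = [
    "warm spring weather".split(),
    "spring rain and spring wind".split(),
    "cold winter wind".split(),
]

def tf(term, doc):
    """Share of the document's words that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(total documents / documents containing `term`).
    Assumes the term occurs in at least one document."""
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / doc_freq)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "spring" occurs in 2 of 3 documents, "warm" in only 1,
# so "warm" scores higher in the first document.
print(tf_idf("spring", docs[0], docs))
print(tf_idf("warm", docs[0], docs))
```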

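Calling get_feature_names_out() on a fitted CountVectorizer() object (get_feature_names() in scikit-learn versions before 1.0) returns an array showing which word each column of the document vector stands for.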

import re
from sklearn.feature_extraction.text import CountVectorizer

# Keep only lowercase letters and spaces.
regex = re.compile('[^a-z ]')

with open("text.txt", 'r') as f:
    documents = []
    for line in f:
        # Lowercase first so the regex does not strip uppercase letters,
        # then store the cleaned review in the documents list.
        filtered_doc = regex.sub("", line.rstrip().lower())
        documents.append(filtered_doc)

# Build the Bag of Words document vectors with CountVectorizer() and store them in X.
cv = CountVectorizer()
X = cv.fit_transform(documents)

# Store the shape of X in dim and print it.
dim = X.shape
print(dim)

# Store the words behind the first 10 columns in words_feature and print them.
# (On scikit-learn versions before 1.0, use cv.get_feature_names() instead.)
words_feature = cv.get_feature_names_out()[:10]
print(words_feature)

# Store the column index of the word "comedy" in idx and print it.
idx = cv.vocabulary_['comedy']
print(idx)

# Store the Bag of Words vector of the first document in vec1 and print it.
vec1 = X[0]
print(vec1)

<Creating TF-IDF Bag of Words document vectors>

import re
from sklearn.feature_extraction.text import TfidfVectorizer

regex = re.compile('[^a-z ]')

# Load the review data. As in the previous exercise, the list `documents` holds the preprocessed reviews.
with open("text.txt", 'r') as f:
    documents = []
    for line in f:
        lowered_sent = line.rstrip().lower()
        filtered_sent = regex.sub('', lowered_sent)
        documents.append(filtered_sent)

# Build the TF-IDF Bag of Words document vectors with TfidfVectorizer() and store them in X.
tv = TfidfVectorizer()
X = tv.fit_transform(documents)

# Store the shape of X in dim1 and print it.
dim1 = X.shape
print(dim1)

# Store the TF-IDF Bag of Words vector of the first document in vec1 and print it.
vec1 = X[0]
print(vec1)

# Build TF-IDF Bag of N-grams document vectors (unigrams and bigrams) with a second TfidfVectorizer().
unibi_v = TfidfVectorizer(ngram_range=(1, 2))
unibigram_X = unibi_v.fit_transform(documents)

# Store the shape of the TF-IDF Bag of N-grams document vectors in dim2 and print it.
dim2 = unibigram_X.shape
print(dim2)

# Suppress warnings.
import warnings
warnings.filterwarnings(action='ignore')

import pickle
from sklearn.metrics.pairwise import cosine_similarity

sent1 = ["I first saw this movie when I was a little kid and fell in love with it at once."]
sent2 = ["Despite having 6 different directors, this fantasy hangs together remarkably well."]

with open('bow_models.pkl', 'rb') as f:
    # Load the saved model: the fitted vectorizer goes into vectorizer, the document vectors into X.
    vectorizer, X = pickle.load(f)

# Vectorize sent1 and sent2 with the vectorizer's transform() and store the results in vec1 and vec2.
vec1 = vectorizer.transform(sent1)
vec2 = vectorizer.transform(sent2)

# Store the cosine similarity of vec1 and vec2 in sim1 and print it.
sim1 = cosine_similarity(vec1, vec2)
print(sim1)

# Store the cosine similarity between vec1 and the first document vector of X in sim2 and print it.
sim2 = cosine_similarity(vec1, X[0])
print(sim2)

doc2vec

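The big advantage of Bag of Words is that the components of the vector are easy to interpret.

--> As the amount of text grows, the dimension of the document vectors grows --> the result is sparse vectors in which most word counts are 0

--> Higher-dimensional document vectors bring memory constraints and inefficiency

--> The curse of dimensionality sets in (= the differences in distance between vectors shrink, so similarity becomes less informative)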
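To make the sparsity argument concrete, the sketch below tracks vocabulary size and the fraction of zero entries as documents are added. The four-sentence corpus and the helper name `sparsity` are assumptions chosen so that no two documents share a word, which exaggerates the effect.

```python
def sparsity(docs):
    """Return (vector dimension, fraction of zero entries) for a
    Bag of Words matrix built from docs."""
    vocab = {word for doc in docs for word in doc.split()}
    total_cells = len(docs) * len(vocab)
    # A cell is non-zero when the document contains that word.
    nonzero_cells = sum(len(set(doc.split())) for doc in docs)
    return len(vocab), 1 - nonzero_cells / total_cells

corpus = [
    "warm spring weather",
    "cold winter wind",
    "heavy summer rain",
    "clear autumn sky",
]

for k in range(1, len(corpus) + 1):
    dim, zero_fraction = sparsity(corpus[:k])
    print(k, dim, round(zero_fraction, 2))
```

Each new document here adds three columns that are 0 for every earlier document, so the zero fraction climbs as the corpus grows.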

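doc2vec learns document embeddings from the contextual similarity between the words in each document, and the vectors keep being updated as training proceeds.

Embedding vectors of documents with similar contexts end up close together in the vector space.

doc2vec produces document vectors in a comparatively low-dimensional space (the number of hidden nodes does not grow with the vocabulary; it is a setting you can tune).

<Sentence similarity service using embeddings>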

# -*- coding: utf-8 -*-
import random
import re
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from numpy import sqrt, dot

random.seed(10)

doc1 = ["homelessness has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school work or vote for the matter"]

doc2 = ["it may have ends that do not tie together particularly well but it is still a compelling enough story to stick with"]

# Load the data and wrap each line as a TaggedDocument.
def load_data(filepath):
    regex = re.compile('[^a-z ]')

    gensim_input = []
    with open(filepath, 'r') as f:
        for idx, line in enumerate(f):
            lowered_sent = line.rstrip().lower()
            filtered_sent = regex.sub('', lowered_sent)
            # TaggedDocument expects a list of tokens, not a raw string.
            tagged_doc = TaggedDocument(filtered_sent.split(), [idx])
            gensim_input.append(tagged_doc)

    return gensim_input

def cal_cosine_sim(sent1, sent2):
    # Cosine similarity between two vectors.
    top = dot(sent1, sent2)
    size1 = sqrt(dot(sent1, sent1))
    size2 = sqrt(dot(sent2, sent2))
    return top / (size1 * size2)

# Train the doc2vec model on the documents list.
documents = load_data("text.txt")
d2v_model = Doc2Vec(window=2, vector_size=50)
d2v_model.build_vocab(documents)
d2v_model.train(documents, total_examples=d2v_model.corpus_count, epochs=5)

# Infer embedding vectors for the documents in doc1 and doc2 and store them in vector1 and vector2.
# infer_vector() also takes a list of tokens, so split the raw string first.
vector1 = d2v_model.infer_vector(doc1[0].split())
vector2 = d2v_model.infer_vector(doc2[0].split())

# Store the cosine similarity of vector1 and vector2 in sim and print it.
sim = cal_cosine_sim(vector1, vector2)
print(sim)

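N-gram Based Language Models

A language model computes the probability that a given sentence occurs in the text data.

A language model makes automatic sentence generation possible!

The probability of a sentence is computed as the product of the conditional probabilities of its words.

N-grams are used to approximate each word's conditional probability.

Each N-gram conditional probability is computed from the N-gram frequencies in the data.

ex) P("weather" | "warm spring") = count of "warm spring weather" in the whole data / count of "warm spring" in the whole data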
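Written out, the chain rule and its N-gram approximation look as follows. The notation $w_i$ for the i-th word and $\mathrm{count}(\cdot)$ for a frequency in the data is introduced here for illustration and is not part of the original notes.

```latex
\begin{align*}
P(w_1, \dots, w_n) &= \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}) \\
                   &\approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \dots, w_{i-1}) \\
P(w_i \mid w_{i-N+1}, \dots, w_{i-1}) &= \frac{\mathrm{count}(w_{i-N+1}, \dots, w_{i})}{\mathrm{count}(w_{i-N+1}, \dots, w_{i-1})}
\end{align*}
```

With N = 3 the last line is the trigram ratio given above: the count of the three-word sequence divided by the count of its first two words.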

When generating a sentence, the word with the highest conditional probability given the preceding words is chosen as the next word.

<N-gram language model>

data = ['this is a dog', 'this is a cat', 'this is my horse', 'my name is elice', 'my name is hank']

def count_unigram(docs):
    unigram_counter = dict()
    # Count every unigram in docs and return the counts as a dictionary.
    for doc in docs:
        for word in doc.split():
            if word not in unigram_counter:
                unigram_counter[word] = 1
            else:
                unigram_counter[word] += 1
    return unigram_counter

def count_bigram(docs):
    bigram_counter = dict()
    # Count every bigram in docs and return the counts as a dictionary.
    for doc in docs:
        words = doc.split()
        for word1, word2 in zip(words, words[1:]):
            if (word1, word2) not in bigram_counter:
                bigram_counter[(word1, word2)] = 1
            else:
                bigram_counter[(word1, word2)] += 1
    return bigram_counter

def cal_prob(sent, unigram_counter, bigram_counter):
    words = sent.split()
    result = 1.0
    # Multiply the bigram conditional probabilities to get the probability of sent.
    for word1, word2 in zip(words, words[1:]):
        # An unseen bigram counts as 0 instead of raising KeyError.
        top = bigram_counter.get((word1, word2), 0)
        bottom = unigram_counter.get(word1, 0)
        if bottom == 0:
            return 0.0  # the preceding word never occurs in the data
        result *= top / bottom
    return result

# Count unigrams and bigrams in data, then compute the probability of the sentence "this is elice".
unigram_counter = count_unigram(data)
bigram_counter = count_bigram(data)
print(cal_prob("this is elice", unigram_counter, bigram_counter))
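The generation rule described earlier (pick the next word with the highest conditional probability) can be sketched with bigram counts over the same toy data. The function name `generate`, the `max_len` cutoff, and the `next_counts` table are assumptions made for this example. Since every candidate for the next word shares the same denominator, the follower with the highest count is also the one with the highest conditional probability.

```python
from collections import Counter, defaultdict

data = ['this is a dog', 'this is a cat', 'this is my horse',
        'my name is elice', 'my name is hank']

# next_counts[w1][w2] = number of times w2 directly follows w1.
next_counts = defaultdict(Counter)
for doc in data:
    words = doc.split()
    for w1, w2 in zip(words, words[1:]):
        next_counts[w1][w2] += 1

def generate(start, max_len=5):
    """Greedily extend `start` until max_len words are reached
    or the last word has no known follower."""
    sentence = [start]
    while len(sentence) < max_len and sentence[-1] in next_counts:
        next_word = next_counts[sentence[-1]].most_common(1)[0][0]
        sentence.append(next_word)
    return ' '.join(sentence)

print(generate("this"))
print(generate("my", max_len=3))
```

Greedy selection always produces the same sentence from the same start word; sampling from the conditional distribution instead would give more varied output.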

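RNN-Based Language Models

: An RNN learns a language model by predicting the next word given the words of the sentence so far.

: A character-level language model can handle and generate words that never appeared in the training data.

: During training, tags marking the start and end of each sentence are added; during generation, words or characters are predicted one at a time, starting from the given input.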
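The training setup described above can be illustrated without any neural network library: wrap the sentence in start and end tags, then pair each token with the token that follows it. The tag strings `<s>` and `</s>` and the helper name `char_training_pairs` are hypothetical choices for this sketch. An actual RNN would also carry a hidden state summarizing all earlier tokens, which this data-preparation step does not show.

```python
START, END = "<s>", "</s>"  # hypothetical tag names

def char_training_pairs(sentence):
    """Return (input, target) pairs for character-level
    next-token prediction, with start and end tags added."""
    tokens = [START] + list(sentence) + [END]
    return list(zip(tokens[:-1], tokens[1:]))

pairs = char_training_pairs("hi")
for current, target in pairs:
    print(repr(current), "->", repr(target))
```

At generation time the model would be fed the start tag (or a given prefix) and asked for the next character repeatedly until it emits the end tag.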

High-performance language models can be trained not only with RNN-family algorithms but also with transformer-family algorithms such as BERT.

