
[LIKELION] AI SCHOOL 5th Cohort_ Day 12

by hk713 2022. 3. 25.

The process of data analysis for text data

Tokenize → POS Tagging → Remove stopwords → Build a word dictionary → Dictionary-based data visualization → Apply machine learning/deep learning models

 

NLTK

NLTK stands for Natural Language Toolkit, a Python package for natural language processing and document analysis.

(Natural language means the language people use in everyday life.)

For analysis, a long string has to be split into smaller units. Each unit is called a token,

and the splitting step is called tokenizing.

 

word_tokenize() splits a sentence into tokens.

word_tokenize() example

 

Part of speech (POS)

pos_tag() takes a tokenized sentence and returns the part of speech of each token.

  • N : Noun
  • V : Verb
  • J/A : Adjective

pos_tag() example

 

Stopword removal

A stopword list can be loaded from the nltk module.

from nltk.corpus import stopwords

stopWords = stopwords.words('english')

The supported languages can be listed with stopwords.fileids(). (Korean is not included.)

In practice, any extra stopwords that should be dropped during the analysis are appended to the loaded stopword list.

 

Lemmatization

A lemma is the base dictionary form of a word (its headword).

Lemmatization matters because it reduces the number of distinct words in a corpus.

Create a WordNetLemmatizer object to use it. Passing the pos information as well gives more accurate results.


Regular expressions

A regular expression is used to check whether text data follows a particular pattern.

http://pythonstudy.xyz/python/article/401-%EC%A0%95%EA%B7%9C-%ED%91%9C%ED%98%84%EC%8B%9D-Regex

 


Looking up patterns as they are needed should be enough!
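As a quick illustration with Python's built-in re module, the sketch below checks, finds, and removes patterns. The date pattern and sample strings are assumptions made for this example.

```python
import re

# Check whether a whole string follows a pattern (here: YYYY-MM-DD).
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
is_date = bool(date_pattern.fullmatch("2022-03-25"))
not_date = bool(date_pattern.fullmatch("25/03/2022"))
print(is_date, not_date)  # True False

# Find every match inside a longer text.
text = "Day 11 was 2022-03-24 and Day 12 was 2022-03-25."
dates = date_pattern.findall(text)
print(dates)  # ['2022-03-24', '2022-03-25']

# Remove everything except letters and spaces, a common cleaning step
# before tokenizing.
cleaned = re.sub(r"[^A-Za-z ]", "", "Hello, NLTK 3!")
print(repr(cleaned))  # 'Hello NLTK '
```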


Text Similarity Analysis

TF-IDF

TF-IDF (Term Frequency - Inverse Document Frequency) is a measure of

how important a particular word is within a document.

A word that appears in many documents

is considered poor at telling documents apart, so its weight is reduced.

A TF-IDF value is computed for each word in each document.

Source: https://blog.naver.com/PostView.nhn?blogId=hobby-explorer&logNo=222213301125&categoryNo=13&parentCategoryNo=0

scikit-learn provides TfidfVectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer
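A minimal sketch of TfidfVectorizer on three made-up documents (the documents are assumptions for illustration, and the default scikit-learn settings are used).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative mini corpus (not from the original post).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and return a sparse (n_docs, n_terms) matrix.
tfidf = vectorizer.fit_transform(docs)
print(tfidf.shape)  # (3, 12)

# vocabulary_ maps each term to its column index.
vocab = vectorizer.vocabulary_
row0 = tfidf.toarray()[0]

# "cat" occurs in one document, "sat" in two, and both occur once in
# document 0, so the rarer word gets the larger weight.
print(row0[vocab["cat"]] > row0[vocab["sat"]])  # True
```

This matches the idea above: a word shared by more documents gets a smaller weight.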

 

Cosine similarity

์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„๋Š” ๊ฑฐ๋ฆฌ ๊ณ„์‚ฐ ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜๋‹ค. 

๋‘ ๋ฒกํ„ฐ ์‚ฌ์ด ๊ฐ๋„์˜ ์ฝ”์‚ฌ์ธ ๊ฐ’์„ ์ด์šฉํ•ด ๋‘ ๋ฒกํ„ฐ์˜ ์œ ์‚ฌํ•œ ์ •๋„๋ฅผ ํŒŒ์•…ํ•œ๋‹ค.

https://wikidocs.net/24603

 


๊ณ„์‚ฐ๋œ ๊ฐ’์€ 0์—์„œ 1์‚ฌ์ด์ด๊ณ , 1์— ๊ฐ€๊นŒ์šธ์ˆ˜๋ก ์œ ์‚ฌํ•œ ๊ฒƒ์ด๋‹ค.

from sklearn.metrics.pairwise import cosine_similarity

์‚ฌ์ดํ‚ท๋Ÿฐ์—์„œ cosine_similarity๋ฅผ ๊ฐ€์ ธ์™€ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.
