
[Python] Python Crawling and Web Scraping Example - Naver Movie (Coco) Reviews (BeautifulSoup, Pandas)

by hk713 2022. 3. 24.

This code scrapes a total of 100 audience ratings and one-line reviews of Coco from the Naver Movie site and saves them to an Excel file.

https://movie.naver.com/movie/bi/mi/basic.naver?code=151728

 


 

After opening the site, I clicked "평점" (Ratings) and looked at the data I needed to scrape: the ratings and the one-line reviews.

 

Each page showed 10 reviews,

but the key point was that clicking the page buttons at the bottom did not change the page's overall URL.

So I checked with the inspector (Chrome DevTools), and the href of the page buttons caught my eye!

It looked like the pages could be told apart by the page value, the last option in the hyperlink.

 

Opening that hyperlink led to a page that shows the audience reviews 10 at a time.

And as expected, changing the value of the page option changed the page.
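The paging behaviour described here can be sketched as follows. This is a minimal sketch that only builds the URLs and sends no request; the helper name page_url is made up for illustration, and the base address is the one used in the completed code in this post.

```python
from urllib.parse import urlsplit, parse_qs

# Base address of the review list; the page number goes at the very end.
URL_PRE = (
    'https://movie.naver.com/movie/bi/mi/pointWriteFormList.naver'
    '?code=151728&type=after&isActualPointWriteExecute=false'
    '&isMileageSubscriptionAlready=false'
    '&isMileageSubscriptionReject=false&page='
)


def page_url(page):
    """Return the review-list URL for a 1-based page number (hypothetical helper)."""
    return URL_PRE + str(page)


# Ten pages of 10 reviews each cover the 100 reviews.
urls = [page_url(p) for p in range(1, 11)]

# Reading the query string back shows that only `page` differs between URLs.
for u in urls[:3]:
    query = parse_qs(urlsplit(u).query)
    print(query['code'][0], query['page'][0])
```

Keeping page as the last option is what lets the number simply be appended to the end of the base string.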

 

For the ratings, div class="star_score" looked usable; searching for it with Ctrl+F returned exactly 10 matches. (Jackpot!!)

The one-line reviews looked easy to tell apart by id, because their ids ran from _filtered_ment_0 to _filtered_ment_9.
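To make the two lookups concrete, here is a small sketch that runs them against a hand-written HTML fragment instead of the live page. The fragment is an assumption: it only imitates the div.star_score and _filtered_ment_N pattern described above, the review texts are made up, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one review page (assumed structure, made-up text).
sample_html = """
<ul>
  <li>
    <div class="star_score"><em>10</em></div>
    <span id="_filtered_ment_0">  First sample review  </span>
  </li>
  <li>
    <div class="star_score"><em>8</em></div>
    <span id="_filtered_ment_1">  Second sample review  </span>
  </li>
</ul>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# CSS selector: every <em> inside a <div class="star_score">.
scores = [em.get_text() for em in soup.select('div.star_score em')]

# The review ids are numbered from 0, so they can be rebuilt from the index.
reviews = []
for num in range(len(scores)):
    span = soup.find('span', {'id': '_filtered_ment_' + str(num)})
    reviews.append(span.get_text().strip())

print(scores)
print(reviews)
```

Looping over len(scores) rather than a fixed 10 is a small variation that also copes with a page holding fewer reviews.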


Completed code

from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd

# URL of the audience rating list; the page number is appended at the end
url_pre = 'https://movie.naver.com/movie/bi/mi/pointWriteFormList.naver?code=151728&type=after&isActualPointWriteExecute=false&isMileageSubscriptionAlready=false&isMileageSubscriptionReject=false&page='

rate = []
review = []

for page in range(10):
    url = url_pre + str(page+1)  # build the URL for each page (1 to 10)
    web = urlopen(url)
    web_page = BeautifulSoup(web, 'html.parser')
    
    scores = web_page.select('div.star_score em') # collect every rating on the page into a list
    
    for num in range(10):
        score = scores[num].get_text()
        review_id = '_filtered_ment_'+str(num) # build the one-line review id
        contents = web_page.find('span',{'id':review_id}) # find the one-line review by that id
        content = contents.get_text().strip() # drop the tags and trim whitespace on both sides
        rate.append(score)
        review.append(content)

# build the DataFrame (columns: '평점' = rating, '한줄평' = one-line review)
result = pd.DataFrame({'평점':rate,'한줄평':review})

# save to Excel (file name means "Movie Coco Reviews")
result.to_excel('영화 코코 리뷰.xlsx', index=False) # without the index

# completion signal
print("Scraping finished successfully.")
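One detail worth noting: get_text() returns strings, so the rating column is stored as text. If numeric ratings are wanted, a conversion step could be added before saving. The sketch below uses made-up rows in place of the scraped lists and English column names for readability; it is an optional variation, not part of the code above.

```python
import pandas as pd

# Made-up rows standing in for the scraped `rate` and `review` lists.
rate = ['10', '8', '9']
review = ['sample review A', 'sample review B', 'sample review C']

result = pd.DataFrame({'rating': rate, 'review': review})

# Convert the rating strings to integers so they can be averaged or sorted.
result['rating'] = pd.to_numeric(result['rating'])

print(result['rating'].mean())
```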

Execution result

Running it in Jupyter Notebook printed the completion message,

and the folder containing the notebook now had a file named "영화 코코 리뷰.xlsx" (Movie Coco Reviews).

 

Opening the Excel file, I confirmed that a total of 100 ratings and one-line reviews had been saved 😍

(I hid the middle rows before taking the screenshot.)
