Using selenium and BeautifulSoup to collect data on a specific product from Naver Shopping and visualize it


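In this post we use Python to collect information about a product of our choice from Naver Shopping and then visualize which products are popular.

First, import the required libraries. selenium controls the web page and retrieves its content, and BeautifulSoup parses the HTML. We also use pandas for data analysis and matplotlib for visualization; seaborn is imported as well, although the code in this post only plots with matplotlib.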

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time
import chromedriver_autoinstaller
import random
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

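Before collecting any data, check which fields are actually available.

Excluding ad listings, each product can show a title, a star rating, a review count, and a purchase count.

However, some listings have a star rating and others do not, so we collect only the title, review count, and purchase count. (The code also stores the price.)

Further down the results page, some listings show neither a review count nor a purchase count.

Listings like these are left out of the analysis.

Next, set up the web driver and add the options needed to control Chrome. Then define the name of the product to search for and create empty lists to hold the product information.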

# Install the Chrome driver automatically
chromedriver_autoinstaller.install()

# Configure the Chrome web driver options
options = Options()
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36")

# Create the Chrome web driver instance
driver = webdriver.Chrome(options=options)

search_query = "배즙"

titles = []
prices = []
reviews = []
purchases = []
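The search term eventually has to be placed into a search URL. As a minimal sketch, the standard library's `urllib.parse.urlencode` can build that URL and percent-encode the Korean query at the same time. The query parameters are copied from the URL in the full code at the end of this post; the helper name `build_search_url` and the `page_size` parameter are made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://search.shopping.naver.com/search/all"


def build_search_url(query: str, page_index: int, page_size: int = 40) -> str:
    """Return a Naver Shopping search URL for one result page.

    Hypothetical helper: the parameter names mirror the URL in the full code.
    """
    params = {
        "adQuery": query,
        "origQuery": query,
        "pagingIndex": page_index,
        "pagingSize": page_size,
        "productSet": "total",
        "query": query,
        "sort": "rel",
        "timestamp": "",
        "viewType": "list",
    }
    # urlencode percent-encodes non-ASCII characters as UTF-8
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url("배즙", 1)
print(url)
```

Passing the result to `driver.get(url)` would then open the first result page.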

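Now we loop over the result pages and collect the product information. On each search page we scroll until every product has loaded, then extract each product's title, price, review count, and purchase count.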
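The "scroll until everything has loaded" step can be sketched as a small loop that stops once the item count no longer grows. To keep the sketch runnable without a browser, scrolling and counting are passed in as callables; `scroll_until_stable`, `max_rounds`, and the fake page are assumptions for illustration. With Selenium, `scroll` would call `driver.execute_script("window.scrollBy(0, 10000);")` and `count_items` would return `len(driver.find_elements(By.CSS_SELECTOR, "a[data-i]"))`, as in the full code.

```python
import time
from typing import Callable


def scroll_until_stable(
    scroll: Callable[[], None],
    count_items: Callable[[], int],
    pause_seconds: float = 0.0,
    max_rounds: int = 50,
) -> int:
    """Scroll repeatedly until the number of loaded items stops increasing.

    max_rounds is a safety cap (an assumption, not part of the original code)
    so the loop cannot run forever on a page that keeps growing.
    """
    previous_count = 0
    for _ in range(max_rounds):
        scroll()
        time.sleep(pause_seconds)  # give lazy-loaded items time to appear
        current_count = count_items()
        if current_count == previous_count:
            break  # nothing new was loaded
        previous_count = current_count
    return previous_count


# Fake page for demonstration: each scroll reveals 10 more items, up to 40.
loaded = {"count": 0}


def fake_scroll() -> None:
    loaded["count"] = min(loaded["count"] + 10, 40)


total = scroll_until_stable(fake_scroll, lambda: loaded["count"])
print(total)
```

The original loop sleeps for a random 3 to 7 seconds between scrolls; here the pause defaults to zero only so the demonstration finishes immediately.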

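Before extracting anything, inspect the HTML tags first. The class names used below are the ones found on the page when it was inspected and may change.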

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html_content, "html.parser")

    # Extract the product information blocks
    product_info_list = soup.find_all("div", class_="product_info_area__xxCTi")

    for product_info in product_info_list:
        # Get the product title (guard against a missing <div> or <a>)
        title_div = product_info.find("div", class_="product_title__Mmw2K")
        product_title_elem = title_div.a if title_div else None
        product_title = product_title_elem.get("title", "제목 없음") if product_title_elem else "제목 없음"

        # Get the price
        price_elem = product_info.find("span", class_="price_num__S2p_v")
        price = price_elem.get_text() if price_elem else "가격 없음"

        # Get the review count
        review_elem = product_info.find("em", class_="product_num__fafe5")
        if review_elem:
            parent_tag = review_elem.find_parent("a")
            if parent_tag and parent_tag.get("role") == "button":
                review_count = "리뷰 없음"
            else:
                review_count = review_elem.get_text()
        else:
            review_count = "리뷰 없음"

        # Strip the surrounding parentheses, if any
        if review_count.startswith("(") and review_count.endswith(")"):
            review_count = review_count[1:-1]

        # Get the purchase count
        purchase_tag = product_info.find_all("em", class_="product_num__fafe5")
        if len(purchase_tag) > 1:
            parent_tag = purchase_tag[1].find_parent("span")
            if parent_tag and "product_text__cTjus" not in parent_tag.get("class", []):
                purchase_count = purchase_tag[1].get_text()
            else:
                purchase_count = "구매건수 없음"
        else:
            purchase_count = "구매건수 없음"

        # Append the data to the lists
        titles.append(product_title)
        prices.append(price)
        reviews.append(review_count)
        purchases.append(purchase_count)

This code block parses the HTML with BeautifulSoup and extracts the product information. Each step is explained below.

1. First, soup = BeautifulSoup(html_content, "html.parser") parses the HTML with BeautifulSoup. The parsed HTML document is stored in the soup variable.

2. Next, product_info_list = soup.find_all("div", class_="product_info_area__xxCTi") finds every <div> element in the HTML whose class is "product_info_area__xxCTi" and returns them as a list. Each of these elements contains the information for one product.

3. Then the loop for product_info in product_info_list: takes the product blocks from product_info_list one at a time and processes each of them.

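4. The product title is retrieved as follows.

title_div = product_info.find("div", class_="product_title__Mmw2K")
product_title_elem = title_div.a if title_div else None
product_title = product_title_elem.get("title", "제목 없음") if product_title_elem else "제목 없음"

This code finds the <div> element with class "product_title__Mmw2K" in each product block, takes the <a> tag inside it, and reads the product title from its title attribute. If the <div>, the <a> tag, or the attribute is missing, "제목 없음" ("no title") is used instead.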

5. The product price is retrieved as follows.

price_elem = product_info.find("span", class_="price_num__S2p_v")
price = price_elem.get_text() if price_elem else "가격 없음"

This code finds the <span> element with class "price_num__S2p_v" in each product block and reads the price from it. If there is no price, "가격 없음" ("no price") is used.

6. The review count is retrieved as follows.

review_elem = product_info.find("em", class_="product_num__fafe5")
if review_elem:
    parent_tag = review_elem.find_parent("a")
    if parent_tag and parent_tag.get("role") == "button":
        review_count = "리뷰 없음"
    else:
        review_count = review_elem.get_text()
else:
    review_count = "리뷰 없음"

This block handles retrieving the review count. Each part is explained below.

1. review_elem = product_info.find("em", class_="product_num__fafe5"): Finds the first <em> element with class "product_num__fafe5" in the current product block and assigns it to review_elem. This element holds the review count. If the product has no such element, find returns None.

2. if review_elem:: Checks that review_elem is not None, that is, whether the product has a count element at all.

3. parent_tag = review_elem.find_parent("a"): Looks among the ancestors of review_elem for an <a> tag and assigns it to parent_tag. On Naver Shopping the review count sits inside an <em> tag, while the link to the actual review page is the enclosing <a> tag.

4. if parent_tag and parent_tag.get("role") == "button":: Checks whether an enclosing <a> tag exists and its role attribute is "button". If so, the product has no review page to link to.

5. review_count = "리뷰 없음": Runs when the condition in step 4 is true, so review_count is set to "리뷰 없음" ("no reviews").

6. else: review_count = review_elem.get_text(): Runs when the condition in step 4 is false, that is, the count is present and is not wrapped in a button-style link. The text of the element becomes review_count. The outer else, reached when no <em> element was found at all, also sets "리뷰 없음".

The rest of the code follows the same pattern. At each step BeautifulSoup is used to find the wanted element in the HTML, the matching information is extracted, and the value is appended to a list.
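To see the three branches described above in action, here is a small sketch that runs the same logic on a hand-written HTML fragment. The markup is made up for illustration (only the class names come from this post), and the function name `extract_review_count` is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; only the class names come from the post.
SAMPLE_HTML = """
<div class="product_info_area__xxCTi" id="with-review">
  <a href="#"><em class="product_num__fafe5">(1,234)</em></a>
</div>
<div class="product_info_area__xxCTi" id="button-only">
  <a role="button"><em class="product_num__fafe5">5</em></a>
</div>
<div class="product_info_area__xxCTi" id="no-count"></div>
"""


def extract_review_count(product_info) -> str:
    """Return the review count text, or "리뷰 없음" when it is unavailable."""
    review_elem = product_info.find("em", class_="product_num__fafe5")
    if review_elem is None:
        return "리뷰 없음"  # no count element at all
    parent_link = review_elem.find_parent("a")
    if parent_link and parent_link.get("role") == "button":
        return "리뷰 없음"  # wrapped in a button-style link, not a review link
    return review_elem.get_text()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
results = {
    div["id"]: extract_review_count(div)
    for div in soup.find_all("div", class_="product_info_area__xxCTi")
}
print(results)
```

Wrapping the logic in a function is only a design choice for the sketch; it lets each branch return early instead of nesting if/else blocks.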

# Strip the surrounding parentheses, if any
if review_count.startswith("(") and review_count.endswith(")"):
    review_count = review_count[1:-1]

Together with the review-count code, this block does the following:

review_elem = product_info.find("em", class_="product_num__fafe5"): Uses BeautifulSoup to find the <em> tag that holds the review count in the current product block, matching on the class "product_num__fafe5".
if review_elem:: If the tag was found, continue to the next step. Otherwise review_count is set to "리뷰 없음".
parent_tag = review_elem.find_parent("a"): Looks for an <a> tag among the ancestors of the review count tag. This is the usual case, where the review count is wrapped in a link.
if parent_tag and parent_tag.get("role") == "button":: Checks whether the <a> tag exists and its role attribute is "button". This covers listings where the count is rendered as a button rather than as a link to a review page.
review_count = "리뷰 없음": If that condition holds, review_count is set to "리뷰 없음", because the review count cannot be read from this element.
else:: Otherwise, the review count can be read as text.
review_count = review_elem.get_text(): Reads the text inside the tag and stores it in review_count. This is the actual review count.
Finally, if review_count.startswith("(") and review_count.endswith(")"):: Checks whether the retrieved count is wrapped in parentheses.
review_count = review_count[1:-1]: If so, drops the first and last characters to remove the parentheses.

This process extracts the review count, falls back to "리뷰 없음" in the cases described, and removes the parentheses when they are present.
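If the count is needed as a number rather than as text, the parenthesis removal can be combined with comma removal in one small helper. This is a sketch under the assumption that counts appear as plain digits with optional thousands separators, such as "(1,234)"; the function name `parse_count` is made up for illustration.

```python
from typing import Optional


def parse_count(text: str) -> Optional[int]:
    """Convert text such as "(1,234)" or "56" to an int; return None otherwise."""
    cleaned = text.strip()
    # Remove one pair of surrounding parentheses, if present
    if cleaned.startswith("(") and cleaned.endswith(")"):
        cleaned = cleaned[1:-1]
    # Remove thousands separators
    cleaned = cleaned.replace(",", "")
    # Anything that is not purely decimal digits (e.g. "리뷰 없음") yields None
    return int(cleaned) if cleaned.isdecimal() else None


print(parse_count("(1,234)"))
print(parse_count("56"))
print(parse_count("리뷰 없음"))
```

Returning None for the sentinel text keeps missing values distinguishable from a real count of zero.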

purchase_tag = product_info.find_all("em", class_="product_num__fafe5")
if len(purchase_tag) > 1:
    parent_tag = purchase_tag[1].find_parent("span")
    if parent_tag and "product_text__cTjus" not in parent_tag.get("class", []):
        purchase_count = purchase_tag[1].get_text()
    else:
        purchase_count = "구매건수 없음"
else:
    purchase_count = "구매건수 없음"

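The code above retrieves the purchase count from the product block. First, it finds every <em> tag with the class name "product_num__fafe5" and stores them in purchase_tag. It then checks whether purchase_tag has more than one element, which means the product block contains at least two such <em> tags. Look closely and you will see that the <em> tags for the review count and the purchase count share the same class, "product_num__fafe5".

To get the purchase count, the code takes the second element of purchase_tag (index 1) and looks up its enclosing <span> tag.

The element's text is used as the purchase count only when that <span>'s classes do not include "product_text__cTjus". If this condition is not met, the value is "구매건수 없음" ("no purchase count"). If purchase_tag has one element or none, meaning there is no element for the purchase count, the value is also "구매건수 없음".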
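The purchase-count rules above can be checked the same way on a hand-written fragment. The markup below is made up for illustration (only the class names come from this post), and `extract_purchase_count` is a hypothetical function name.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; only the class names come from the post.
SAMPLE_HTML = """
<div class="product_info_area__xxCTi" id="both">
  <a href="#"><em class="product_num__fafe5">(120)</em></a>
  <span><em class="product_num__fafe5">3,400</em></span>
</div>
<div class="product_info_area__xxCTi" id="excluded">
  <a href="#"><em class="product_num__fafe5">(8)</em></a>
  <span class="product_text__cTjus"><em class="product_num__fafe5">15</em></span>
</div>
<div class="product_info_area__xxCTi" id="single">
  <a href="#"><em class="product_num__fafe5">(8)</em></a>
</div>
"""


def extract_purchase_count(product_info) -> str:
    """Return the purchase count text, or "구매건수 없음" when unavailable."""
    num_tags = product_info.find_all("em", class_="product_num__fafe5")
    if len(num_tags) < 2:
        return "구매건수 없음"  # only the review count (or nothing) is present
    parent_span = num_tags[1].find_parent("span")
    if parent_span and "product_text__cTjus" not in parent_span.get("class", []):
        return num_tags[1].get_text()
    return "구매건수 없음"  # second <em> is not a purchase count


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
results = {
    div["id"]: extract_purchase_count(div)
    for div in soup.find_all("div", class_="product_info_area__xxCTi")
}
print(results)
```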

Next, convert the collected data into a DataFrame and apply the necessary preprocessing.

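# Create the DataFrame
data = pd.DataFrame({
    '제목': titles,
    '가격': prices,
    '리뷰 수': reviews,
    '구매건수': purchases
})

# Convert the '구매건수' (purchase count) column to integers:
# replace the "구매건수 없음" sentinel with 0 and strip thousands separators
data['구매건수'] = data['구매건수'].str.replace('구매건수 없음', '0').str.replace(',', '').astype(int)

# Drop the rows that had no purchase count (converted to 0 above)
data = data[data['구매건수'] != 0]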
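As an alternative sketch, `pd.to_numeric` with `errors="coerce"` turns any non-numeric text into NaN, so the sentinel string does not have to be replaced by hand. This is not the approach used in this post; the three sample rows are made up for illustration.

```python
import pandas as pd

# Made-up sample rows using the same column names as above
data = pd.DataFrame({
    "제목": ["A", "B", "C"],
    "구매건수": ["1,234", "구매건수 없음", "56"],
})

# Strip thousands separators, then coerce anything non-numeric to NaN
data["구매건수"] = pd.to_numeric(
    data["구매건수"].str.replace(",", "", regex=False), errors="coerce"
)

# Drop rows without a purchase count and convert the rest to integers
data = data.dropna(subset=["구매건수"]).copy()
data["구매건수"] = data["구매건수"].astype(int)
print(data)
```

Unlike the filter above, this version keeps a row whose purchase count is a real 0 and drops only rows where no count could be parsed.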

Finally, plot the purchase count per product and look at the top 10 products.

# Group by product title and sum the purchase counts
purchase_counts = data.groupby('제목')['구매건수'].sum().sort_values(ascending=False)

# Visualize the data
# AppleGothic is a macOS font; use a Korean font installed on your system
plt.rc('font', family='AppleGothic')

plt.rcParams['axes.unicode_minus'] = False
plt.figure(figsize=(10, 6))
purchase_counts.head(10).plot(kind='bar')

# Set the chart title and axis labels
plt.title('Purchase counts of the top 10 products', fontsize=16)
plt.xlabel('Product name', fontsize=14)
plt.ylabel('Purchase count', fontsize=14)

# Rotate and align the x-axis tick labels
plt.xticks(rotation=45, ha='right', fontsize=12)

# Show the chart
plt.tight_layout()
plt.show()

# Quit the web driver
driver.quit()
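The plotting code sets the font to AppleGothic, which is available on macOS. A small sketch of picking a Korean-capable font per operating system is shown below. The function name `pick_korean_font` is made up, and NanumGothic as the fallback is an assumption: it is not installed by default and has to be added separately.

```python
import platform


def pick_korean_font(system_name: str) -> str:
    """Return a Korean-capable font name for the given OS name."""
    fonts = {
        "Darwin": "AppleGothic",     # macOS
        "Windows": "Malgun Gothic",  # Windows
    }
    # NanumGothic is an assumed fallback; it must be installed separately.
    return fonts.get(system_name, "NanumGothic")


font_name = pick_korean_font(platform.system())
print(font_name)
```

The result can then be passed to matplotlib, for example `plt.rc('font', family=font_name)`.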

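That covers all of the code. Running it produces a chart that shows how popular each product is for the search term you chose, which is useful when searching for and comparing products on a shopping site. Once you are comfortable with it, try out other kinds of data analysis and visualization.

Full code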

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time
import chromedriver_autoinstaller
import random
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rc
import seaborn as sns

# Install the Chrome driver automatically
chromedriver_autoinstaller.install()

# Configure the Chrome web driver options
options = Options()
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36")

# Create the Chrome web driver instance
driver = webdriver.Chrome(options=options)

search_query = "배즙"

titles = []
prices = []
reviews = []
purchases = []

for page_index in range(1, 10):
    # Build the Naver Shopping search URL
    search_link = f"https://search.shopping.naver.com/search/all?adQuery={search_query}&origQuery={search_query}&pagingIndex={page_index}&pagingSize=40&productSet=total&query={search_query}&sort=rel&timestamp=&viewType=list"

    # Open the search URL
    driver.get(search_link)

    # Number of items loaded so far
    prev_items_count = 0

    # Keep scrolling to load more items
    while True:
        sec = random.randint(3, 7)
        # Scroll down to trigger loading of more items
        driver.execute_script("window.scrollBy(0, 10000);")

        time.sleep(sec)
        # Count the items currently loaded
        current_items_count = len(driver.find_elements(By.CSS_SELECTOR, "a[data-i]"))
        # Stop when no additional items were loaded
        if current_items_count == prev_items_count:
            break
        # Update the previous item count
        prev_items_count = current_items_count

    # Get the HTML of the current page
    html_content = driver.page_source

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html_content, "html.parser")

    # Extract the product information blocks
    product_info_list = soup.find_all("div", class_="product_info_area__xxCTi")

    for product_info in product_info_list:
        # Get the product title (guard against a missing <div> or <a>)
        title_div = product_info.find("div", class_="product_title__Mmw2K")
        product_title_elem = title_div.a if title_div else None
        product_title = product_title_elem.get("title", "제목 없음") if product_title_elem else "제목 없음"

        # Get the price
        price_elem = product_info.find("span", class_="price_num__S2p_v")
        price = price_elem.get_text() if price_elem else "가격 없음"

        # Get the review count
        review_elem = product_info.find("em", class_="product_num__fafe5")
        if review_elem:
            parent_tag = review_elem.find_parent("a")
            if parent_tag and parent_tag.get("role") == "button":
                review_count = "리뷰 없음"
            else:
                review_count = review_elem.get_text()
        else:
            review_count = "리뷰 없음"

        # Strip the surrounding parentheses, if any
        if review_count.startswith("(") and review_count.endswith(")"):
            review_count = review_count[1:-1]

        # Get the purchase count
        purchase_tag = product_info.find_all("em", class_="product_num__fafe5")
        if len(purchase_tag) > 1:
            parent_tag = purchase_tag[1].find_parent("span")
            if parent_tag and "product_text__cTjus" not in parent_tag.get("class", []):
                purchase_count = purchase_tag[1].get_text()
            else:
                purchase_count = "구매건수 없음"
        else:
            purchase_count = "구매건수 없음"

        # Append the data to the lists
        titles.append(product_title)
        prices.append(price)
        reviews.append(review_count)
        purchases.append(purchase_count)

        # Print the collected information
        print("Page", page_index)
        print("Title:", product_title)
        print("Price:", price)
        print("Review count:", review_count)
        print("Purchase count:", purchase_count)
        print()

# Create the DataFrame
data = pd.DataFrame({
    '제목': titles,
    '가격': prices,
    '리뷰 수': reviews,
    '구매건수': purchases
})

# Convert the '구매건수' (purchase count) column to integers:
# replace the "구매건수 없음" sentinel with 0 and strip thousands separators
data['구매건수'] = data['구매건수'].str.replace('구매건수 없음', '0').str.replace(',', '').astype(int)

# Drop the rows that had no purchase count (converted to 0 above)
data = data[data['구매건수'] != 0]

# Group by product title and sum the purchase counts
purchase_counts = data.groupby('제목')['구매건수'].sum().sort_values(ascending=False)

# Visualize the data
# AppleGothic is a macOS font; use a Korean font installed on your system
rc('font', family='AppleGothic')

plt.rcParams['axes.unicode_minus'] = False
plt.figure(figsize=(10, 6))
purchase_counts.head(10).plot(kind='bar')

# Set the chart title and axis labels
plt.title('Purchase counts of the top 10 products', fontsize=16)
plt.xlabel('Product name', fontsize=14)
plt.ylabel('Purchase count', fontsize=14)

# Rotate and align the x-axis tick labels
plt.xticks(rotation=45, ha='right', fontsize=12)

# Show the chart
plt.tight_layout()
plt.show()

# Quit the web driver
driver.quit()

Execution result


P_eli Development Blog
