
Using WordCloud

by 덞웖이 2024. 10. 5.

Environment

- Ubuntu Server 22.04
- VS Code Insiders
- Jupyter Extension
- Remote SSH Extension
- Python 3.10 venv

Prerequisites

### Standard for headless chrome ###
# Install Chrome
sudo apt update
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome-stable_current_amd64.deb
sudo apt-get install -f -y # pull in any dependencies dpkg could not resolve
rm google-chrome-stable_current_amd64.deb

# Install the driver
google-chrome --version # check the version, then substitute it below
wget https://storage.googleapis.com/chrome-for-testing-public/<version>/linux64/chromedriver-linux64.zip
unzip chromedriver-linux64.zip
sudo mv chromedriver-linux64/chromedriver /usr/local/bin/
sudo chmod +x /usr/local/bin/chromedriver
rm -r chromedriver-linux64 && rm chromedriver-linux64.zip
 
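### Other dependencies ###
# KoNLPy runs on a JVM, so a JDK must be installed first
sudo apt install -y default-jdk
pip install wordcloud konlpy numpy beautifulsoup4 requests matplotlib pillow -q

# for English NLP
pip install nltk -q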
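
Before moving on, it can help to confirm that the venv actually sees every module the post imports. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the `missing_modules` helper are illustrative names, not part of the original post. Note that the import names differ from some pip names (`bs4` for beautifulsoup4, `PIL` for pillow).

```python
from importlib.util import find_spec

# Import names (not pip package names) used later in the post.
REQUIRED = ["bs4", "requests", "konlpy", "matplotlib", "wordcloud", "numpy", "PIL"]

def missing_modules(names):
    """Return the top-level module names that cannot be found by the current interpreter."""
    return [name for name in names if find_spec(name) is None]

missing = missing_modules(REQUIRED)
print("missing:", missing or "none")
```

If the list is not empty, the kernel is probably running outside the venv where the packages were installed.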

Packages

# for scraping
from bs4 import BeautifulSoup
import requests as rq
import time

# natural language processing
from konlpy.tag import Hannanum

# drawing the word cloud
from matplotlib import pyplot as plt
from wordcloud import WordCloud
from collections import Counter

# for the mask
import numpy as np
from PIL import Image

Run

Collection

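# init
user_agent = {"User-Agent": "your agent here"}
tokenizer = Hannanum()
bag = []
idx = 1

# scrape 49 pages
while idx < 50:
    try:
        # headers must be passed by keyword: the second positional argument of get() is params
        rsp = rq.get(f'https://qna.programmers.co.kr/?page={idx}', headers=user_agent, timeout=10)
        rsp.raise_for_status()  # HTTP errors (such as 404) land in the except branch
        time.sleep(1.5)  # pause between requests to go easy on the server
        parsed = BeautifulSoup(rsp.text, 'html.parser')
        titles = parsed.find_all('div', class_='top')
        for t in titles:
            bag += tokenizer.nouns(t.text.strip())
        idx += 1
    except rq.RequestException:
        print(f'page {idx}: Out of pages📖')
        break
cnt = Counter(bag)
len(cnt) # 2,820 distinct nouns found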
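
Raw noun counts usually contain single-character tokens and generic words that crowd the cloud. Below is a minimal sketch of pruning the counter before drawing. The `STOPWORDS` set and the `prune` helper are hypothetical names introduced for illustration, and the sample words are made up rather than taken from the scraped data.

```python
from collections import Counter

# Hypothetical stopword set; adjust it to whatever noise shows up in your data.
STOPWORDS = {"질문", "문제"}

def prune(counter, min_len=2, min_count=1, stopwords=STOPWORDS):
    """Return a new Counter without short, rare, or stopword tokens."""
    return Counter({
        word: n
        for word, n in counter.items()
        if len(word) >= min_len and n >= min_count and word not in stopwords
    })

# Made-up sample standing in for the scraped nouns.
sample = Counter(["파이썬", "파이썬", "것", "질문", "오류", "오류", "오류"])
pruned = prune(sample)
print(pruned.most_common(2))  # [('오류', 3), ('파이썬', 2)]
```

The pruned counter can be passed to `generate_from_frequencies` in place of `cnt`, since it is still a mapping from word to frequency.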

Visualization

# build a mask so the cloud takes on a shape
mask_img = Image.open('./starshape.jpg')
mask_img = mask_img.convert('L') # to grayscale
THR = 128
mask_img = mask_img.point(lambda pix: 255 if pix > THR else 0) # eliminating noise
mask = np.array(mask_img) # 2D array of 0 (drawable) and 255 (masked out)
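
To see what the thresholding step does without needing the image file, here is a small sketch on a synthetic array. The 4x4 `gray` array is invented for illustration. It relies on WordCloud's documented rule that pure white (255) mask entries are masked out and every other entry is free to draw on.

```python
import numpy as np

THR = 128  # same threshold as in the post

# Synthetic 4x4 grayscale "image": a dark 2x2 square on a light, noisy background.
gray = np.array([
    [250, 240, 245, 255],
    [238,  10,  30, 251],
    [247,  20,   5, 243],
    [255, 249, 236, 252],
], dtype=np.uint8)

# Vectorized equivalent of Image.point(lambda pix: 255 if pix > THR else 0)
mask = np.where(gray > THR, 255, 0).astype(np.uint8)

# WordCloud only places words where the mask is not pure white.
drawable = mask != 255
print(mask)
print("drawable pixels:", int(drawable.sum()))  # drawable pixels: 4
```

Without the threshold, near-white background pixels such as 240 or 236 would count as drawable, and words would spill outside the shape.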

# wordcloud setup
cloud = WordCloud(
    mask=mask,
    font_path='./NanumGothicCoding.ttf',
    background_color='white',
    contour_color='white',
    contour_width=1
).generate_from_frequencies(cnt)

# visualize
plt.figure(figsize=(7, 7))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Result: the word cloud is rendered in a star shape.

 
