Environment
- Ubuntu Server 22.04
- VSCode Insiders
- Jupyter Extension
- Remote SSH Extension
- Python 3.10 venv
Setup
### Standard for headless chrome ###
# install Chrome
sudo apt update
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome-stable_current_amd64.deb
rm google-chrome-stable_current_amd64.deb
# install the driver
google-chrome --version # check the version, then substitute it for <version> below
wget https://storage.googleapis.com/chrome-for-testing-public/<version>/linux64/chromedriver-linux64.zip
unzip chromedriver-linux64.zip
sudo mv chromedriver-linux64/chromedriver /usr/local/bin/
sudo chmod +x /usr/local/bin/chromedriver
rm -r chromedriver-linux64 && rm chromedriver-linux64.zip
### Other dependencies ###
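sudo apt install -y default-jdk # KoNLPy (Hannanum) runs on the JVM
pip install requests beautifulsoup4 matplotlib pillow wordcloud konlpy numpy -q
# for English NLP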
pip install nltk -q
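The driver URL in the setup step depends on the installed Chrome version, so it can help to build it from the `google-chrome --version` output instead of typing it by hand. The following is a minimal sketch: `chromedriver_url` is a hypothetical helper (not part of the original setup), the version number in the example is illustrative only, and it assumes a driver for that exact version is actually published under the same path.

```python
import re

# Base path taken from the wget command in the setup step
BASE_URL = "https://storage.googleapis.com/chrome-for-testing-public"


def chromedriver_url(version_output: str) -> str:
    """Hypothetical helper: build the driver URL from `google-chrome --version` text."""
    # Pull out a four-part version number such as 1.2.3.4
    match = re.search(r"\d+\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in: {version_output!r}")
    return f"{BASE_URL}/{match.group(0)}/linux64/chromedriver-linux64.zip"


# Illustrative input only; use the text printed on your own machine
print(chromedriver_url("Google Chrome 131.0.6778.85"))
```

The printed URL can then be passed to `wget` in place of the `<version>` placeholder.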
Packages
# for scraping
from bs4 import BeautifulSoup
import requests as rq
import time
# natural language processing
from konlpy.tag import Hannanum
# drawing the word cloud
from matplotlib import pyplot as plt
from wordcloud import WordCloud
from collections import Counter
# for the mask
import numpy as np
from PIL import Image
Run
Collection
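# init
user_agent = {"User-Agent": "your agent here"}
tokenizer = Hannanum()
bag = []
idx = 1
# scrape 49 pages
while idx < 50:
    try:
        # pass headers by keyword; the second positional argument of get() is `params`
        rsp = rq.get(f'https://qna.programmers.co.kr/?page={idx}', headers=user_agent)
        rsp.raise_for_status()  # turn HTTP errors (e.g. 404) into exceptions
        time.sleep(1.5)  # pause between requests to go easy on the server
        parsed = BeautifulSoup(rsp.text, 'html.parser')
        titles = parsed.find_all('div', 'top')
        for t in titles:
            bag += tokenizer.nouns(t.text.strip())
        idx += 1
    except rq.RequestException:
        print(f'page {idx}: Out of pages📖')
        break
cnt = Counter(bag)
len(cnt)  # 2,820 distinct nouns found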
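To make the counting step concrete, here is a small sketch of how a list of nouns turns into the frequency table that the word cloud consumes. The nouns are made-up sample data standing in for the output of `tokenizer.nouns(...)`, and the one-character filter is an optional assumption of this sketch, not something the collection code above does.

```python
from collections import Counter

# Made-up nouns standing in for the output of tokenizer.nouns(...)
bag = ["파이썬", "에러", "파이썬", "것", "리스트", "에러", "파이썬", "수"]

# Assumption: one-character nouns are mostly noise in a word cloud, so drop them
filtered = [word for word in bag if len(word) > 1]

# Counter maps each noun to the number of times it appeared
cnt = Counter(filtered)

# most_common(n) returns the n most frequent (noun, count) pairs
top = cnt.most_common(2)
print(top)
```

A `Counter` is a `dict` subclass, which is why it can be handed directly to `generate_from_frequencies`.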
Visualization
# trying to give the cloud a shape...
mask_img = Image.open('./starshape.jpg')
mask_img = mask_img.convert('L') # to grayscale
THR = 128
mask_img = mask_img.point(lambda pix: 255 if pix > THR else 0) # eliminating noise
mask = np.array(mask_img) # to a 2-D array of 0s and 255s
# wordcloud setup
cloud = WordCloud(
    mask=mask,
    font_path='./NanumGothicCoding.ttf',
    background_color='white',
    contour_color='white',
    contour_width=1
).generate_from_frequencies(cnt)
# visualize
plt.figure(figsize=(7, 7))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()
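If no shape image is at hand, a mask can also be built directly in NumPy. The sketch below is an assumption-laden stand-in for `./starshape.jpg`: it draws a circle instead of a star, and the size and radius are arbitrary values chosen for illustration. It relies on the WordCloud convention that pure white (255) pixels are masked out and every other pixel is free to draw on.

```python
import numpy as np

# Hypothetical stand-in for './starshape.jpg': a circular mask
SIZE = 400                      # arbitrary canvas size in pixels
center = SIZE / 2
radius = SIZE * 0.45            # arbitrary radius, leaves a small margin

# Open grids of row (yy) and column (xx) indices that broadcast to SIZE x SIZE
yy, xx = np.ogrid[:SIZE, :SIZE]
inside = (xx - center) ** 2 + (yy - center) ** 2 <= radius ** 2

# 0 inside the circle (words may be drawn), 255 outside (masked out)
mask = np.where(inside, 0, 255).astype(np.uint8)
print(mask.shape, np.unique(mask))
```

The resulting `mask` can be passed to `WordCloud(mask=mask, ...)` in the same way as the thresholded image above.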