[TIL] Python, MongoDB & Web Crawling

03/01/23 · 3 min read

Today, I learned how to crawl websites using the Beautiful Soup Python library. I also created a MongoDB account and used PyMongo and dnspython to add, find, modify, and delete data in the database. Web crawling is a powerful technique that lets anyone extract data from a website and turn it into a secondary product.


Python Virtual Environment

  • Keeps the dependencies required by different projects separate by creating an isolated Python virtual environment for each one

    1. Create (on Git Bash): python -m venv venvName

    2. Activate (on Git Bash): source venvName/Scripts/activate
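
A quick way to confirm the environment is actually active is to check it from inside Python itself (standard library only):

```python
import sys

# Inside an activated virtual environment, sys.prefix points into the
# venv directory, while sys.base_prefix still points at the base install.
print(sys.prefix)
print(sys.base_prefix)
print("venv active:", sys.prefix != sys.base_prefix)
```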

Using pip - package installer in Python

  • Running pip install packageName gives us access to the many Python libraries that others have already created for useful features

  • Error I encountered

    • After importing the requests library, Python raised the following error: ModuleNotFoundError

    • How to resolve

      1. I uninstalled the Python 3.10 version I had been using and installed Python 3.8, since Python 3.10 was newly released and seemed more likely to be incompatible with some libraries

      2. I opened the command line with [Windows+R] and ran pip install requests under the Python Scripts path.
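
In hindsight, the usual cause of this ModuleNotFoundError is installing a package for one interpreter while running the script with another. A small standard-library check (using requests, the package from above, as the example) shows which interpreter is running and whether it can see the package:

```python
import importlib.util
import sys

# Which Python is actually executing this script?
print(sys.executable)

# Can this interpreter locate the 'requests' package?
spec = importlib.util.find_spec("requests")
if spec is None:
    print("requests is NOT visible to this interpreter")
else:
    print("requests found at:", spec.origin)
```

If the path printed by sys.executable is not the interpreter pip installed into, that mismatch is the bug, regardless of the Python version.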

Python Web Crawling

  • Install Beautiful Soup 4: pip install beautifulsoup4 (pip install bs4 also works; bs4 on PyPI is a shim package that pulls in beautifulsoup4)
      # Base format
      import requests
      from bs4 import BeautifulSoup

      # a User-Agent header makes the request look like a normal browser visit
      headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
      data = requests.get('https://movie.naver.com/movie/sdb/rank/rmovie.naver?sel=pnt&date=20210829', headers=headers)

      soup = BeautifulSoup(data.text, 'html.parser')

  • Select the HTML element from the browser

    • In Chrome, right-click the data I want to crawl > Inspect, then in DevTools right-click the highlighted element > Copy > Copy selector (looks like this: #old_content > table > tbody > tr:nth-child(3) > td.title > div > a)

    •   trs = soup.select('#old_content > table > tbody > tr')
      
        for tr in trs:
            movie = tr.select_one('td.title > div > a')
            if movie is not None:
                print(movie.text)
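
The same select / select_one pattern can be tried without hitting the live page, on a small inline HTML snippet (hypothetical markup shaped like the movie table above):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML shaped like the ranking table being crawled above.
html = """
<table id="old_content"><tbody>
  <tr><td class="title"><div><a>Movie A</a></div></td></tr>
  <tr><td class="title"><div><a>Movie B</a></div></td></tr>
  <tr><td class="blank"></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
titles = []
for tr in soup.select("#old_content > tbody > tr"):
    movie = tr.select_one("td.title > div > a")
    if movie is not None:  # skip separator rows that have no title cell
        titles.append(movie.text)

print(titles)  # ['Movie A', 'Movie B']
```

The None check matters because ranking tables often contain divider rows; select_one returns None when the selector matches nothing.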
      
  • Exercise: music chart top 50 crawling

      import requests
      from bs4 import BeautifulSoup
    
      headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
      data = requests.get('https://www.genie.co.kr/chart/top200?ditc=M&rtm=N&ymd=20221201',headers=headers)
    
      soup = BeautifulSoup(data.text, 'html.parser')
    
      trs = soup.select('#body-content > div.newest-list > div > table > tbody > tr')
    
      for tr in trs:
          title = tr.select_one('td.info > a.title.ellipsis').text.strip()
          rank = tr.select_one('td.number').text[0:2].strip()
          singer = tr.select_one('td.info > a.artist.ellipsis').text.strip()
          print(rank, title, singer)
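
The td.number cell on this chart holds both the rank and a daily-change label, which is why the code slices the first two characters before stripping. A quick standard-library check of that slice logic (the sample strings are my guesses at the cell text, not values from the site):

```python
# Hypothetical contents of the td.number cell: rank, newline, change label.
samples = ["1\n유지", "10\n2단계 상승", "50\n유지"]

# The first two characters cover ranks 1-50; strip() removes the
# trailing newline left over for single-digit ranks.
ranks = [s[0:2].strip() for s in samples]
print(ranks)  # ['1', '10', '50']
```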
    

Database: MongoDB - NoSQL

  • Install packages to import MongoDB in Python: pip install pymongo dnspython

  • pymongo base format

      from pymongo import MongoClient
      client = MongoClient('mongodb+srv://siwon:<password>@cluster0.icysouv.mongodb.net/?retryWrites=true&w=majority')
      db = client.dbsiwon
    
  • How to add data

      doc = {
          'name': 'Tom',
          'age': 24
      }
    
      db.users.insert_one(doc)
    

  • How to read data from the DB

      # every document
      all_users = list(db.users.find({}, {'_id': False}))

      print(all_users[0])          # first document
      print(all_users[0]['name'])  # one field of the first document

      for a in all_users:
          print(a)

      # a single document; pass a filter like {'name': 'Tom'} to pick a specific one
      user = db.users.find_one({})
      print(user)
    
  • How to modify data

      db.users.update_one({'name':'Tom'},{'$set':{'age':19}})
    
  • How to delete data

      db.users.delete_one({'name':'Tom'})
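
The four operations above can be tied together in one sketch. The query and update documents are plain dicts; run_crud is a hypothetical helper of mine, and the commented usage reuses the connection string from the base-format snippet (with <password> replaced):

```python
# Query and update documents are ordinary Python dicts.
NEW_USER = {'name': 'Tom', 'age': 24}
FILTER_TOM = {'name': 'Tom'}        # query document: match by field value
SET_AGE = {'$set': {'age': 19}}     # update document: change only 'age'

def run_crud(db):
    db.users.insert_one(NEW_USER)               # Create
    print(db.users.find_one(FILTER_TOM))        # Read
    db.users.update_one(FILTER_TOM, SET_AGE)    # Update
    db.users.delete_one(FILTER_TOM)             # Delete

# Usage, against a real cluster (not run here):
#   from pymongo import MongoClient
#   client = MongoClient('mongodb+srv://siwon:<password>@cluster0.icysouv.mongodb.net/?retryWrites=true&w=majority')
#   run_crud(client.dbsiwon)
```

Note that $set changes only the listed field, so Tom's name survives the update; omitting $set would replace the whole document.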