Categories
Development

Web Scraping

Web Scraping with Python on my WSL environment.

Python virtual Environment

Open WSL Terminal, switch into the folder for my Web Scraping project and create a virtual environment first.

# Python virtual Environment
## Install
### On Debian/Ubuntu systems, you need to install the python3-venv package
sudo apt install python3.10-venv -y
python3 -m venv ai
## Activate
source ai/bin/activate

Visual Code

Open Visual Code IDE with code .

Change Python Interpreter to the one of virtual Environment

When working in a Terminal window in VS then the virtual environment has also to be activated in this terminal window: source ai/bin/activate

Requirements

I put all required external libraries and their specific versions my project relies on in a seperate file: requirements.txt.
In Python projects this is considered best practice.

selenium

Installation:

pip install -r requirements.txt

Selenium

Selenium is a powerful automation tool for web browsers. It allows you to control web browsers programmatically, simulating user interactions like clicking buttons, filling out forms, and navigating between pages. This makes it ideal for tasks such as web testing, web scraping, and browser automation.

As of Selenium 4.6, Selenium downloads the correct driver for you. You shouldn’t need to do anything. If you are using the latest version of Selenium and you are getting an error, please turn on logging and file a bug report with that information. Quelle

So installation of Google Chrome and Google Chrome Webdriver is not required anymore.
But I had to install some additional libraries on WSL:

sudo apt install libnss3 libgbm1 libasound2

Exkurs: Google Chrome

To find missing libraries I downloaded Google Chrome and tried to start it until all missing libraries were installed.

Page to find download link:

https://googlechromelabs.github.io/chrome-for-testing/#stable

## Google Chrome
wget https://storage.googleapis.com/chrome-for-testing-public/129.0.6668.70/linux64/chrome-linux64.zip
unzip chrome-linux64.zip
mv chrome-linux64 chrome

## Google Chrome Webdriver
wget https://storage.googleapis.com/chrome-for-testing-public/129.0.6668.70/linux64/chromedriver-linux64.zip
unzip chromedriver-linux64.zip
mv chromedriver-linux64 chromedriver
cp chromedriver/chromedriver chrome/chromedriver

cd chrome
./chromedriver

Scrape a single page

import selenium.webdriver as webdriver
from selenium.webdriver.chrome.service import Service

def scrape_website(website):
    print("Launching chrome browser...")
    driver= webdriver.Chrome()

    try:
        driver.get(website)
        print("Page loaded...")
        html = driver.page_source
        return html
    finally:
        driver.quit()

print(scrape_website("https://www.selenium.dev/"))
python scrape.py

Scape a Heise News Article

Extract Article Header, Article Lead and Article itself:

import selenium.webdriver as webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

def scrape_website(website):
    print("Launching chrome browser...")
    service = Service()
    options = Options()
    options.headless = True  # Headless-Modus aktivieren, um den Browser unsichtbar zu machen
    driver = webdriver.Chrome(service=service, options=options)

    try:
        driver.get(website)
        print("Page loaded...")
        html = driver.page_source
        return html
    finally:
        driver.quit()


def split_dom_content(dom_content, max_length=6000):
    return [
        dom_content[i : i + max_length] for i in range(0, len(dom_content), max_length)
    ]


def scrape_heise_website(website):
    html  = scrape_website(website)

    # BeautifulSoup zum Parsen des HTML-Codes verwenden
    soup = BeautifulSoup(html, 'html.parser')

    # Artikel-Header und -Inhalt extrahieren
    # Der Header ist oft in einem <h1>-Tag zu finden
    header_title = soup.find('h1', {'class': 'a-article-header__title'}).get_text().strip()
    header_lead  = soup.find('p',  {'class': 'a-article-header__lead'}).get_text().strip()

    # Der eigentliche Artikelinhalt befindet sich oft in einem <div>-Tag mit der Klasse 'article-content'
    article_div = soup.find('div', {'class': 'article-content'})
    paragraphs = article_div.find_all('p') if article_div else []
    # 'redakteurskuerzel' entfernen
    for para in paragraphs:
        spans_to_remove = para.find_all('span', {'class': 'redakteurskuerzel'})
        for span in spans_to_remove:
            span.decompose()  # Entfernt den Tag vollständig aus dem Baum

    article_content = "\n".join([para.get_text().strip() for para in paragraphs])
    
    # Header und Artikelinhalt ausgeben
    result = "Header Title:" + header_title + "\nHeader Lead:" + header_lead + "\nContent:" + article_content
    return result

result = scrape_heise_website("https://www.heise.de/news/Streit-ueber-Kosten-Meta-kappt-Leitungen-zur-Telekom-9953162.html")
print(result)