Web Scraping with Python in my WSL environment
Python Virtual Environment
Open a WSL terminal, switch into the folder of my web scraping project, and create a virtual environment first.
# Python virtual Environment
## Install
### On Debian/Ubuntu systems, you need to install the python3-venv package
sudo apt install python3.10-venv -y
python3 -m venv ai

## Activate
source ai/bin/activate
Visual Studio Code
Open the Visual Studio Code IDE with code .
Change the Python interpreter to the one from the virtual environment.
When working in a terminal window inside VS Code, the virtual environment also has to be activated in that terminal window: source ai/bin/activate
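A quick way to verify that the right interpreter is active in such a terminal is to check which Python binary resolves first. A minimal check, assuming the virtual environment was created in the ai folder as above:

source ai/bin/activate
which python        # should print a path ending in ai/bin/python
python --version    # should match the interpreter selected in VS Code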
Requirements
I put all required external libraries and the specific versions my project relies on into a separate file: requirements.txt.
In Python projects this is considered best practice.
selenium
Installation:
pip install -r requirements.txt
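Since the scripts further down also use BeautifulSoup for HTML parsing, the requirements.txt for this project could pin both libraries. The version numbers below are only illustrative assumptions, not the versions actually used here:

selenium==4.25.0
beautifulsoup4==4.12.3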
Selenium
Selenium is a powerful automation tool for web browsers. It allows you to control web browsers programmatically, simulating user interactions like clicking buttons, filling out forms, and navigating between pages. This makes it ideal for tasks such as web testing, web scraping, and browser automation.
As of Selenium 4.6, Selenium downloads the correct driver for you. You shouldn't need to do anything. If you are using the latest version of Selenium and you are getting an error, please turn on logging and file a bug report with that information. (Source)
So a manual installation of Google Chrome and the Google Chrome WebDriver is no longer required.
But I had to install some additional libraries on WSL:
sudo apt install libnss3 libgbm1 libasound2
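After installing these libraries, a short Python snippet can confirm that Selenium Manager resolves the browser and driver and that headless Chrome actually starts on WSL. A minimal sketch:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # no GUI needed on WSL

# Selenium Manager fetches a matching driver (and browser, if necessary) automatically
driver = webdriver.Chrome(options=options)
print("Browser version:", driver.capabilities.get("browserVersion"))
driver.quit()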
Side note: Google Chrome
To find the missing libraries, I downloaded Google Chrome and kept trying to start it until all missing libraries were installed.
Page with the download links:
https://googlechromelabs.github.io/chrome-for-testing/#stable
## Google Chrome
wget https://storage.googleapis.com/chrome-for-testing-public/129.0.6668.70/linux64/chrome-linux64.zip
unzip chrome-linux64.zip
mv chrome-linux64 chrome

## Google Chrome Webdriver
wget https://storage.googleapis.com/chrome-for-testing-public/129.0.6668.70/linux64/chromedriver-linux64.zip
unzip chromedriver-linux64.zip
mv chromedriver-linux64 chromedriver
cp chromedriver/chromedriver chrome/chromedriver
cd chrome
./chromedriver
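Instead of starting Chrome repeatedly, the missing shared libraries can also be listed directly with ldd. A sketch, assuming the archive was unpacked into the chrome folder as above:

# List shared libraries the Chrome binary needs but the system cannot resolve
ldd chrome/chrome | grep "not found"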
Scrape a single page
import selenium.webdriver as webdriver


def scrape_website(website):
    print("Launching chrome browser...")
    driver = webdriver.Chrome()
    try:
        driver.get(website)
        print("Page loaded...")
        html = driver.page_source
        return html
    finally:
        driver.quit()


print(scrape_website("https://www.selenium.dev/"))
python scrape.py
Scrape a Heise News Article
Extract the article header, the article lead, and the article body itself:
import selenium.webdriver as webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup


def scrape_website(website):
    print("Launching chrome browser...")
    service = Service()
    options = Options()
    # Activate headless mode so the browser stays invisible
    # (replaces the deprecated options.headless attribute)
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(service=service, options=options)
    try:
        driver.get(website)
        print("Page loaded...")
        html = driver.page_source
        return html
    finally:
        driver.quit()


def split_dom_content(dom_content, max_length=6000):
    return [
        dom_content[i : i + max_length] for i in range(0, len(dom_content), max_length)
    ]


def scrape_heise_website(website):
    html = scrape_website(website)

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')

    # Extract the article header and lead;
    # the header is usually found in an <h1> tag
    header_title = soup.find('h1', {'class': 'a-article-header__title'}).get_text().strip()
    header_lead = soup.find('p', {'class': 'a-article-header__lead'}).get_text().strip()

    # The actual article content usually sits in a <div> tag with the class 'article-content'
    article_div = soup.find('div', {'class': 'article-content'})
    paragraphs = article_div.find_all('p') if article_div else []

    # Remove the 'redakteurskuerzel' (editor initials) spans
    for para in paragraphs:
        spans_to_remove = para.find_all('span', {'class': 'redakteurskuerzel'})
        for span in spans_to_remove:
            span.decompose()  # removes the tag completely from the tree

    article_content = "\n".join([para.get_text().strip() for para in paragraphs])

    # Combine header and article content
    result = "Header Title:" + header_title + "\nHeader Lead:" + header_lead + "\nContent:" + article_content
    return result


result = scrape_heise_website("https://www.heise.de/news/Streit-ueber-Kosten-Meta-kappt-Leitungen-zur-Telekom-9953162.html")
print(result)
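The split_dom_content helper is not called in the script above; it is meant for splitting the scraped text into fixed-size chunks, presumably for later processing by a tool with a limited input size (for example an LLM). A small usage sketch under that assumption:

chunks = split_dom_content(result, max_length=6000)
print(f"Article split into {len(chunks)} chunk(s)")
for i, chunk in enumerate(chunks, start=1):
    print(f"--- Chunk {i} ({len(chunk)} characters) ---")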