When I was trying to scrape a JavaScript-heavy website with my Raspberry Pi using Python, I ran into some interesting issues that needed to be solved.
I found that modules like requests, requests_html and urllib did not deliver the complete content of JavaScript websites containing a shadow DOM (#shadow-root). When searching for a solution I found some suggestions, like the use of PhantomJS or other discontinued modules.
The solution I found was using ChromeDriver in headless mode. But the version I got my hands on kept throwing errors about the version of the browser.
After extensive searching I found the solution:
1. Download the latest chromedriver from:
https://github.com/electron/electron/releases
(get the armv7 version)
2. Install this using the instructions I found on:
https://www.raspberrypi.org/forums/viewtopic.php?t=194176
- mkdir /tmp
- wget <url latest version arm7>
- unzip <zip file>
- sudo mv chromedriver /usr/local/bin
- sudo chmod +x /usr/local/bin/chromedriver
- sudo apt-get install libminizip1
- sudo apt-get install libwebpmux2
- sudo apt-get install libgtk-3-0
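After these steps you can confirm from Python that the driver actually ended up on your PATH; a small stdlib sketch (the helper name is mine, and it assumes the install location used above):

```python
import shutil

def find_driver(name="chromedriver", path=None):
    """Return the full path of an executable, or None if it is not found.

    `path` defaults to the current PATH, so after the install steps above
    find_driver() should return /usr/local/bin/chromedriver.
    """
    return shutil.which(name, path=path)

if __name__ == "__main__":
    print(find_driver() or "chromedriver not on PATH")
```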
In your code, add these two arguments when you start the driver:
- --headless
- --disable-gpu
3. Update the Chromium browser
When trying to execute the script I still got the error about the Chromium version. I was able to solve that using:
- sudo apt-get install -y chromium-browser
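The underlying problem is that chromedriver only works with a browser whose major version matches its own; a quick way to express that check in Python (the version strings below are made-up examples):

```python
def major_version(version: str) -> int:
    """Extract the major version from a string like '90.0.4430.212'."""
    return int(version.split(".")[0])

def versions_match(driver_version: str, browser_version: str) -> bool:
    """Chromedriver and Chromium must share the same major version."""
    return major_version(driver_version) == major_version(browser_version)

# Example: a v90 driver works with a v90 browser, but not a v89 one.
print(versions_match("90.0.4430.212", "90.0.4430.85"))   # True
print(versions_match("90.0.4430.212", "89.0.4389.114"))  # False
```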
IT WORKS!
Now the script finally ran.
The Python Script to get the page content
from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options

# Define the site to be opened
site = "http://…."

# Set Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")

# Open Chrome headless
driver = webdriver.Chrome(options=chrome_options)
driver.set_page_load_timeout(20)
driver.get(site)
4. Analyze the content of the page
With the content of the page in driver it is possible to decompose the page further.

content1 = driver.find_element(By.TAG_NAME, '…..')
shadow_content1 = expand_shadow_element(content1)
To get access to the shadow element, the function below needs to be used:

# Function to expand a shadow element into usable content
def expand_shadow_element(element):
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root
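Once the shadow content is reachable, its markup (for example taken from an element's innerHTML) can be picked apart with Python's standard library; a minimal sketch, assuming you already have the HTML in a string (the sample fragment here is invented for illustration):

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the non-empty text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def extract_text(html: str) -> list:
    """Return the visible text pieces of an HTML string, in order."""
    parser = TextCollector()
    parser.feed(html)
    return parser.chunks

# Invented sample of what a shadow root's innerHTML might look like:
sample = "<div><h1>Title</h1><p>Some shadow content</p></div>"
print(extract_text(sample))  # ['Title', 'Some shadow content']
```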