{"id":1276,"date":"2019-02-02T08:30:57","date_gmt":"2019-02-02T08:30:57","guid":{"rendered":"http:\/\/www.edwinmichielsen.nl\/?p=1276"},"modified":"2019-02-05T18:32:50","modified_gmt":"2019-02-05T18:32:50","slug":"scrape-javascript-heavy-website-on-raspberrypi3b-using-pythonpi-with-selenium","status":"publish","type":"post","link":"https:\/\/edwinmichielsen.nl\/?p=1276","title":{"rendered":"Scrape Javascript heavy website on RaspberryPi3B+ using Python with Selenium"},"content":{"rendered":"<p>When I was trying to scrape a Javascript heavy website with my Raspberry using Python, I ran into some interesting issues that needed to be solved.<\/p>\n<p>I found that modules like request,request_html, urlllib did not deliver the complete content with Javascripts websites containing shadow-dom (#shadowroot). When searching for solution i found some, like the use of PhantomJS or other discontinued modules.<\/p>\n<p>The solution I found was using Chromedriver in headless mode. But the version I got my hands on kept throwing errors on the version of the browser.<\/p>\n<p>After extensive searches I found the solution in:<\/p>\n<h3>1. Download the latest chromedriver from:<\/h3>\n<p><a href=\"https:\/\/github.com\/electron\/electron\/releases\">https:\/\/github.com\/electron\/electron\/releases<\/a><\/p>\n<p>(get the arvmv7 version)<\/p>\n<h3>2. 
Install it using the instructions I found on:<\/h3>\n<p><a href=\"https:\/\/www.raspberrypi.org\/forums\/viewtopic.php?t=194176\">https:\/\/www.raspberrypi.org\/forums\/viewtopic.php?t=194176<\/a><\/p>\n<ul>\n<li>cd \/tmp<\/li>\n<li>wget &lt;url latest armv7 version&gt;<\/li>\n<li>unzip &lt;zip file&gt;<\/li>\n<li>sudo mv chromedriver \/usr\/local\/bin<\/li>\n<li>sudo chmod +x \/usr\/local\/bin\/chromedriver<\/li>\n<li>sudo apt-get install libminizip1<\/li>\n<li>sudo apt-get install libwebpmux2<\/li>\n<li>sudo apt-get install libgtk-3-0<\/li>\n<\/ul>\n<p>In your code, add these two arguments when you start the driver:<br \/>\n--headless<br \/>\n--disable-gpu<\/p>\n<h3>3. Update the Chromium browser<\/h3>\n<p>When trying to execute the script I still got the error about the Chromium version. I was able to solve that using:<\/p>\n<ul>\n<li>sudo apt-get install -y chromium-browser<\/li>\n<\/ul>\n<p>IT WORKS! Now the script finally ran.<\/p>\n<h3>The Python Script to get the page content<\/h3>\n<pre>from selenium import webdriver\nimport time\nfrom selenium.webdriver.common.by import By\nfrom selenium.webdriver.support.ui import WebDriverWait\nfrom selenium.webdriver.support import expected_conditions as EC\nfrom selenium.webdriver.common.keys import Keys\nfrom selenium.webdriver.chrome.options import Options\n\n# Define the site to be opened\nsite = \"http:\/\/&#8230;\"\n\n# Set Chrome options\nchrome_options = Options()\nchrome_options.add_argument(\"--headless\")\nchrome_options.add_argument(\"--disable-gpu\")\n\n# Open Chrome headless\ndriver = webdriver.Chrome(options=chrome_options)\ndriver.set_page_load_timeout(20)\ndriver.get(site)\n# The fully rendered HTML is now available via driver.page_source<\/pre>\n<h3>4. 
Analyze the content of the page<\/h3>\n<p>With the content of the page loaded in the driver, it is possible to decompose the page further:<\/p>\n<pre>content1 = driver.find_element_by_tag_name('&#8230;')\nshadow_content1 = expand_shadow_element(content1)<\/pre>\n<p>To get access to the shadow element, the function below needs to be used:<\/p>\n<pre># Function to expand a shadow element into usable content\ndef expand_shadow_element(element):\n    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)\n    return shadow_root<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>When I was trying to scrape a JavaScript-heavy website with my Raspberry Pi using Python, I ran into some interesting issues that needed to be solved. I found that modules like requests, requests_html and urllib did not deliver the complete content of JavaScript websites containing a shadow DOM (#shadow-root). When searching for a solution I found some suggestions, like the use &hellip; <a href=\"https:\/\/edwinmichielsen.nl\/?p=1276\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Scrape Javascript heavy website on RaspberryPi3B+ using Python with 
Selenium&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"footnotes":""},"categories":[71,72,13],"tags":[73,53],"class_list":["post-1276","post","type-post","status-publish","format-standard","hentry","category-programming","category-python","category-raspberry","tag-python","tag-raspberry"],"_links":{"self":[{"href":"https:\/\/edwinmichielsen.nl\/index.php?rest_route=\/wp\/v2\/posts\/1276","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/edwinmichielsen.nl\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/edwinmichielsen.nl\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/edwinmichielsen.nl\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/edwinmichielsen.nl\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1276"}],"version-history":[{"count":8,"href":"https:\/\/edwinmichielsen.nl\/index.php?rest_route=\/wp\/v2\/posts\/1276\/revisions"}],"predecessor-version":[{"id":1287,"href":"https:\/\/edwinmichielsen.nl\/index.php?rest_route=\/wp\/v2\/posts\/1276\/revisions\/1287"}],"wp:attachment":[{"href":"https:\/\/edwinmichielsen.nl\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1276"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/edwinmichielsen.nl\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1276"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/edwinmichielsen.nl\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1276"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}