you have to deal with a large project, involving scraping thousands of webpages
the pages are in the vector all_webpages
you've written a scraper function, scrape_format, that scrapes a single web page and formats it appropriately
If you wrote the following code or similar:
n = length(all_webpages)
res = vector("list", n)
for(i in 1:n){
  res[[i]] = scrape_format(all_webpages[i])
}
Then congratulations! 🎉 You didn't miss any of the rookie mistakes!
This code breaks the three laws of webscraping.
Thou shalt not be in a hurry
Thou shalt separate the scraping from the formatting
Thou shalt anticipate failure
you would have soon figured out this law by yourself
requests are costly for the server: they cost run time + bandwidth
it's very easy to flood a server with thousands of requests per second: just write a loop
always wait a few seconds between two page requests
it's a bit of a gentleman's agreement to be slow when scraping★
anyway the incentives are aligned: break this agreement and you will quickly feel the consequences!
Remember that you have thousands of pages to scrape.
In a large project you cannot anticipate the format of all the pages from a mere sub-selection. So:
always save the HTML code on disk after scraping★
apply, and then debug, the formatting script only after all the data, or a large chunk of it, has been scraped
Be prepared, even if it's coded properly:
Your code will fail!
If you don't acknowledge that... well... I'm afraid you're going to suffer.
There are numerous reasons for scraping functions to fail:
a CAPTCHA catches you red-handed and diverts the page requests
your IP gets blocked
the page you get is different from the one you anticipated
the URL doesn't exist
a timeout
any other random reason
You have to:
anticipate failure by:
checking the pages you obtain: are they what you expect? Save the problematic pages. Stop as you go.
catching errors such as connection problems. Parse the error, and either continue or stop depending on its severity.
catching higher-level errors
rewrite the loop so that it can restart at the last problem
save_path = "path/to/saved/webpages/"

# SCRAPING
# of course, I assume URLs are UNIQUE and ONLY web pages end up in the folder
all_files = list.files(save_path, full.names = TRUE)
i_start = length(all_files) + 1
n = length(all_webpages)

for(i in i_start:n){
  # files are saved on disk with the scraper function
  # we give an appropriate prefix to the files to make it tidy
  status = scraper(all_webpages[i], save_path, prefix = paste0("id_", i, "_"))

  # (optional) the status variable returns 1 if everything is OK
  # otherwise it contains information helping to debug
  if(!identical(status, 1)){
    stop("Error during the scraping, see 'status'.")
  }

  # wait
  Sys.sleep(1.2)
}

# FORMATTING
all_pages = list.files(save_path, full.names = TRUE)
n = length(all_pages)
res = vector("list", n)

for(i in 1:n){
  # error handling can be looser here since the data formatting
  # is typically very fast. We can correct errors as we go.
  # If the formatting is slow, we can use the same procedure as for
  # the scraper, by saving the results on the hard drive as we advance.
  res[[i]] = formatter(all_pages[i])
}

# Done. You may need some extra formatting still, but then it really depends
# on the problem.

# The scraper and formatter functions, in simplified versions:

scraper = function(url, save_path, prefix){
  # simplified version
  page = try(read_html(url))
  if(inherits(page, "try-error")){
    return(page)
  }

  writeLines(page, paste0(save_path, prefix, url, ".html"))

  if(!is_page_content_ok(page)){
    return(page)
  }

  return(1)
}

formatter = function(path){
  # simplified version
  page = readLines(path)

  if(!is_page_format_ok(page)){
    stop("Wrong formatting of the page. Revise code.")
  }

  extract_data(page)
}
Making the code fool-proof and easy to adapt requires some planning. But it's worth it!
If you follow the three laws of webscraping, you're ready to handle large projects in peace!
static: HTML in source code = HTML in browser
dynamic: HTML in source code ≠ HTML in browser
javascript
HTML for content
CSS for style
javascript for manipulation
imagine you're the webmaster of an e-commerce website
if you had no javascript and a client searched "shirt" on your website, you'd have to manually create the results page in HTML
with javascript, you fetch the results of the query from a database, and the HTML content is updated to fit those results, with real-time information
javascript is simply indispensable
some webpages may decide to display some information only after some event has occurred
the event can be a click, a scroll, etc.
javascript is inserted in a web page with the <script> tag:

<script>
  let all_p = document.querySelectorAll("p");
  for(p of all_p) p.style.display = "none";
</script>

Here we use a CSS selector to select all paragraphs in the document, then remove them all from view.
Let's go back to the webpage we created in the previous course.
Add the following code:
<button type="button" id="btn">
  What is my favourite author?
</button>

<script>
  let btn = document.querySelector("#btn");

  showAuthor = function(){
    let p = document.createElement("p");
    p.innerHTML = "My favourite author is Shakespeare";
    this.replaceWith(p);
  }

  btn.addEventListener("click", showAuthor);
</script>
you can access a critical piece of information only after an event is triggered, here the click
that's how dynamic web pages work!
We need a web browser.
to scrape dynamic webpages, we'll use python in combination with selenium
you need:
pip install selenium
on the terminal. If the installation is all right, the following code should open a browser:
from selenium import webdriver

driver = webdriver.Chrome()
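By the way, the browser can also be closed from the script when you're done; driver.quit() is the standard selenium call for that:

driver.quit()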
selenium controls a browser: typically anything that you can do, it can do
most common actions include:
Importing only the classes we'll use:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

Launching the browser, empty at the moment:

driver = webdriver.Chrome()

Accessing the stackoverflow☆ URL:

driver.get("https://stackoverflow.com/")

It's our first visit on the page, so cookies need to be agreed upon. After selecting★ the button to click with a CSS selector, we click on it with the click() method. Note the difference between find_element and find_elements (the s!): the former returns an HTML element while the latter returns an array.

btn_cookies = driver.find_element(By.CSS_SELECTOR, "button.js-accept-cookies")
btn_cookies.click()

Finally, we search the SO posts containing the term webscraping. We first select the input element containing the search text, then type webscraping with the send_keys() method, and end by pressing enter (Keys.RETURN):

search_input = driver.find_element(By.CSS_SELECTOR, "input.s-input__search")
search_input.send_keys("webscraping")
search_input.send_keys(Keys.RETURN)
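To see find_elements (with the s) in action, here is a small follow-up sketch of my own, reusing driver and By from the example above. The fixed sleep and the "a.s-link" selector are assumptions: check them against what the browser actually shows.

import time

# give the results page a moment to load; for something more robust,
# selenium also offers explicit waits (WebDriverWait)
time.sleep(3)

# find_elements returns an array of elements (possibly empty)
results = driver.find_elements(By.CSS_SELECTOR, "a.s-link")
for r in results:
    # .text is the visible text of the element
    print(r.text)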
driver is only a generic name which was taken from the previous example. It could be anything else. To obtain the HTML of an element:
body = driver.find_element(By.CSS_SELECTOR, "body")
body.get_attribute("innerHTML")
The result of driver.find_element(By.CSS_SELECTOR, "body").get_attribute("innerHTML")☆ is the HTML code as it is currently displayed in the browser. It has nothing to do with the source code!
To write the HTML to a file, you can simply do it the Python way:
outFile = open(path_to_save, "w")
outFile.write(html_to_save)
outFile.close()
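As a usage note, the same can be written with a context manager, which closes the file automatically even if an error occurs along the way:

with open(path_to_save, "w", encoding="utf-8") as outFile:
    outFile.write(html_to_save)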
Scrolling (executes javascript)
# Scrolls down by 1000 pixels
driver.execute_script("window.scrollBy(0, 1000)")

# Goes to the top of the page
driver.execute_script("window.scrollTo(0, 0)")
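Some pages only reveal more content once you scroll. Here is a minimal sketch of my own (not from the course), reusing the driver from above: scroll to the bottom repeatedly until the page height stops growing. The 2-second pause is an assumption, adjust it to the page.

import time

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # scroll to the bottom and leave time for the javascript to load new content
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        # nothing new appeared: we have (probably) reached the end
        break
    last_height = new_height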
Well, that's it folks!
You just have to automate the browser and save the results.
Then you can do the data processing in your favorite language.
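To make that concrete, here is a minimal sketch of my own (not the course's code) that applies the three laws with selenium. The URL list and the save folder are hypothetical placeholders.

import os
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

all_webpages = ["https://example.com/page1", "https://example.com/page2"]  # placeholder URLs
save_path = "path/to/saved/webpages/"
os.makedirs(save_path, exist_ok=True)

driver = webdriver.Chrome()

for i, url in enumerate(all_webpages):
    out_file = os.path.join(save_path, "id_" + str(i) + ".html")
    # law 3: anticipate failure -- pages already on disk are skipped, so the loop can restart
    if os.path.exists(out_file):
        continue
    try:
        driver.get(url)
        html = driver.find_element(By.CSS_SELECTOR, "body").get_attribute("innerHTML")
    except Exception as e:
        print("Problem with", url, ":", e)
        break  # stop and investigate rather than keep hammering the server
    # law 2: save the raw HTML on disk, format it later
    with open(out_file, "w", encoding="utf-8") as f:
        f.write(html)
    # law 1: don't be in a hurry
    time.sleep(2)

driver.quit()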
Go on the website of Peugeot.