
Data analysis II

Webscraping, advanced


Laurent Bergé

University of Bordeaux, BxSE

Fall 2022


A large project

  • you have to deal with a large project, involving scraping thousands of webpages

  • the pages are in the vector all_webpages

  • you've written a scraper function, scrape_format, that scrapes a single web page and formats it appropriately

Q: How would you write the code for this project?


A large project: Code

If you wrote the following code or similar:

n = length(all_webpages)
res = vector("list", n)
for(i in 1:n){
  res[[i]] = scrape_format(all_webpages[i])
}

Then congratulations! 🎉  You didn't miss any of the rookie mistakes!


What are the problems with my code?

Q: Look closely at the code and find three major problems.

n = length(all_webpages)
res = vector("list", n)
for(i in 1:n){
  res[[i]] = scrape_format(all_webpages[i])
}

This code breaks the three laws of webscraping.

The three laws of webscraping



The three laws of webscraping

  1. Thou shalt not be in a hurry

  2. Thou shalt separate the scraping from the formatting

  3. Thou shalt anticipate failure



Thou shalt not be in a hurry

  • you would have soon figured out this law by yourself

  • requests to the server are costly for the server: they cost run time and bandwidth

  • it is very easy to ply a server with thousands of requests per second: just write a loop

Q: What happens when you flood the server with requests?

A: You get your IP blocked, and rightly so. End of the game.

Law 1: What to do

  • always wait a few seconds between two page requests

  • it's a bit of a gentleman's agreement to be slow when scraping

  • anyway, the incentives are aligned: break this agreement and you will quickly feel the consequences!


Thou shalt separate the scraping from the formatting

Remember that you have thousands of pages to scrape.

Q: How do you create your data formatting code?

Typically:

  1. scrape a few pages
  2. write code to extract the relevant data from the HTML
  3. turn that code into a function so it can be applied systematically to each page


Law 2: Consequences of infringement

Q: What happens if you tie the formatting to the scraping?

A: If the format of the pages changes, you're roasted! You have to painfully rerun the whole scraping!


Law 2: What to do?

In a large project you cannot anticipate the format of all the pages from a small subset. So:

  1. always save the HTML code on disk after scraping

  2. apply, and then debug, the formatting script only after all, or a large chunk of, the data is scraped

To remember

  1. the data always change
  2. you always end up wanting more than you anticipated, and be sure this epiphany only happens ex post!


Thou shalt anticipate failure

Be prepared: even if your code is written properly,

Your code will fail!

If you don't acknowledge that... well... I'm afraid you're going to suffer.

Law 3: Why can my scraping function fail?

There are numerous reasons for scraping functions to fail:

  • a CAPTCHA catches you red-handed and diverts the page requests

  • your IP gets blocked

  • the page you get is different from the one you anticipated

  • the URL doesn't exist

  • timeout

  • other random reasons


Law 3: Consequences

Q: What happens if you didn't anticipate failure?

A: You have to:

  1. find out where your scraping stopped
  2. try to debug ex post
  3. manually restart the loop where it last stopped, etc.
  4. in sum: pain that could have been avoided


Law 3: What to do?

  1. anticipate failure by:

    • checking the pages you obtain: are they what you expect? Save the problematic pages. Stop as you go.

    • catching errors such as connection problems. Parse the error, and either continue or stop depending on its severity.

    • catching higher-level errors

  2. rewrite the loop so that it restarts at the last problem

Application of the laws

save_path = "path/to/saved/webpages/"

# SCRAPING
all_files = list.files(save_path, full.names = TRUE)
i_start = length(all_files) + 1
n = length(all_webpages)
for(i in i_start:n){
  status = scraper(all_webpages[i], save_path,
                   prefix = paste0("id_", i, "_"))
  if(!identical(status, 1)){
    stop("Error during the scraping, see 'status'.")
  }
  Sys.sleep(1.2)
}

# FORMATTING
all_pages = list.files(save_path, full.names = TRUE)
n = length(all_pages)
res = vector("list", n)
for(i in 1:n){
  res[[i]] = formatter(all_pages[i])
}

Law 1: Thou shalt not be in a hurry

In the code above, the Sys.sleep(1.2) call at the end of each iteration waits between two page requests.

Law 2: Thou shalt separate the scraping from the formatting

The code is split into two independent parts: a SCRAPING loop that only saves the raw pages on disk, and a FORMATTING loop that reads them back from disk afterwards.

Law 3: Thou shalt anticipate failure

save_path = "path/to/saved/webpages/"

# SCRAPING
# of course, I assume URLs are UNIQUE and ONLY web pages end up in the folder
all_files = list.files(save_path, full.names = TRUE)
i_start = length(all_files) + 1
n = length(all_webpages)
for(i in i_start:n){
  # files are saved on disk with the scraper function
  # we give an appropriate prefix to the files to make it tidy
  status = scraper(all_webpages[i], save_path,
                   prefix = paste0("id_", i, "_"))
  # (optional) the status variable returns 1 if everything is OK
  # otherwise it contains information helping to debug
  if(!identical(status, 1)){
    stop("Error during the scraping, see 'status'.")
  }
  # wait
  Sys.sleep(1.2)
}

Law 3: Cont'd

# FORMATTING
all_pages = list.files(save_path, full.names = TRUE)
n = length(all_pages)
res = vector("list", n)
for(i in 1:n){
  # error handling can be looser here since the data formatting
  # is typically very fast. We can correct errors as we go.
  # If the formatting is slow, we can use the same procedure as for
  # the scraper, by saving the results on the hard drive as we advance.
  res[[i]] = formatter(all_pages[i])
}
# Done. You may need some extra formatting still, but then it really
# depends on the problem.

Law 3: Still cont'd

scraper = function(url, save_path, prefix){
  # simplified version
  page = try(read_html(url))
  if(inherits(page, "try-error")){
    return(page)
  }
  writeLines(as.character(page), paste0(save_path, prefix, url, ".html"))
  if(!is_page_content_ok(page)){
    return(page)
  }
  return(1)
}

formatter = function(path){
  # simplified version
  page = readLines(path)
  if(!is_page_format_ok(page)){
    stop("Wrong formatting of the page. Revise code.")
  }
  extract_data(page)
}


Laws make your life easy

Making the code foolproof and easy to adapt requires some planning. But it's worth it!

If you follow the three laws of webscraping, you're ready to handle large projects in peace!

Dynamic web pages



Static vs Dynamic

  • static: HTML in source code = HTML in browser

  • dynamic: HTML in source code ≠ HTML in browser

Q: What makes the HTML in your browser change?

A: javascript


The language of the web

Web's trinity:

HTML for content

CSS for style

javascript for manipulation



What's JS?

A programming language

  • javascript is a regular programming language with the typical toolbox: conditions, loops, functions, classes

Which...

  • specializes in modifying HTML content


Why JS?

  • imagine you're the webmaster of an e-commerce website

  • if you had no javascript and a client searched "shirt" on your website, you'd have to manually create the results page in HTML

  • with javascript, you fetch the results of the query from a database, and the HTML content is updated to fit them, with real-time information

  • javascript is simply indispensable


JS: what's the connection to webscraping?

  • some webpages may decide to display some information only after some event has occurred

  • the event can be:

    • the main HTML has loaded
    • an HTML box comes, or is about to come, on-screen (e.g. think of Facebook)
    • something is clicked
    • etc!

Q: So far, we only queried the server to get the source code of the webpage. What's the problem with that?

A: If you wanted access to information that only appears after these events... well... you can't.


JS: How does it work?

  • you can add javascript to an HTML page with the <script> tag

Example:

<script>
// use a CSS selector to select all paragraphs in the document
let all_p = document.querySelectorAll("p");
// remove all paragraphs from view
for(let p of all_p) p.style.display = "none";
</script>


Back to our webpage

Let's go back to the webpage we created in the previous course.

Add the following code:

<button type="button" id="btn">What is my favourite author?</button>
<script>
let btn = document.querySelector("#btn");
let showAuthor = function(){
  let p = document.createElement("p");
  p.innerHTML = "My favourite author is Shakespeare";
  this.replaceWith(p);
}
btn.addEventListener("click", showAuthor);
</script>


What happened?

  • you can access a critical piece of information only after an event was triggered (here: the click)

  • that's how dynamic web pages work!


Dynamic webpages: Can we scrape them?

  • yes, but...

Q: What do we need? (I hope the answer feels natural after this long introduction!)

A: Indeed! We need to run javascript on the source code, and keep it running as the page updates.

In other words...

We need a web browser.

Python + Selenium


Requirements

  • to scrape dynamic webpages, we'll use python in combination with selenium

  • you need:

    • a working python installation, with the selenium package
    • a browser and its matching webdriver (the examples below use Chrome)

Checking the install

If the installation is all right, the following code should open a browser:

from selenium import webdriver
driver = webdriver.Chrome()
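Side note: for long, unattended scraping jobs you may not want a browser window on screen. Here is a minimal sketch, assuming Chrome, of launching the browser headless (no visible window):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# run Chrome without opening a visible window
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)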

How does selenium work?

  • selenium controls a browser: typically, anything that you can do, it can do

  • most common actions include:

    • accessing URLs
    • clicking on buttons
    • typing/filling forms
    • scrolling
    • do you really do more than that?

Selenium 101: An example

# import only the classes we'll use
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# launch the browser (empty at the moment)
driver = webdriver.Chrome()

# access the stackoverflow URL
driver.get("https://stackoverflow.com/")

# it's our first visit on the page, so cookies need to be agreed upon:
# we select the button with a CSS selector, then click it with click()
btn_cookies = driver.find_element(By.CSS_SELECTOR, "button.js-accept-cookies")
btn_cookies.click()

# finally we search the SO posts containing the term "webscraping":
# we select the search input, type the text with send_keys(),
# and end by pressing enter (Keys.RETURN)
search_input = driver.find_element(By.CSS_SELECTOR, "input.s-input__search")
search_input.send_keys("webscraping")
search_input.send_keys(Keys.RETURN)

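A word of caution: the example assumes the elements are already present when find_element is called. On dynamic pages, content may appear with a delay, so it is safer to wait for an element explicitly with selenium's WebDriverWait. A minimal sketch, reusing the cookie button from above:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the cookie button to become clickable,
# instead of assuming it is already there
btn_cookies = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "button.js-accept-cookies"))
)
btn_cookies.click()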

Saving the results

To obtain the HTML of an element:

body = driver.find_element(By.CSS_SELECTOR, "body")
body.get_attribute("innerHTML")

This expression returns the HTML code as it is currently displayed in the browser. It has nothing to do with the source code!

Saving the results II

To write the HTML to a file, you can do it the plain Python way:

# the with block closes the file automatically
with open(path_to_save, "w") as outFile:
    outFile.write(html_to_save)
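Putting the pieces together: a minimal sketch of a selenium scraping loop that follows the three laws. The URLs, the save_path and the 2-second wait are hypothetical, and the error handling of Law 3 is left out for brevity:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

all_webpages = ["https://example.com/page1", "https://example.com/page2"]  # hypothetical
save_path = "path/to/saved/webpages/"  # hypothetical

driver = webdriver.Chrome()
for i, url in enumerate(all_webpages):
    driver.get(url)
    # Law 2: save the raw HTML on disk, format it later
    body = driver.find_element(By.CSS_SELECTOR, "body")
    with open(save_path + "id_" + str(i) + ".html", "w") as outFile:
        outFile.write(body.get_attribute("innerHTML"))
    # Law 1: wait between two page requests
    time.sleep(2)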

Good to know

Scrolling (executes javascript):

# scroll down by 1000 pixels
driver.execute_script("window.scrollBy(0, 1000)")
# go back to the top of the page
driver.execute_script("window.scrollTo(0, 0)")
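Why scroll? Some pages (the Facebook-like behaviour mentioned earlier) only load content once it is close to coming on-screen. A minimal sketch, assuming such a page is already loaded in driver, that scrolls down until the page stops growing:

import time

# scroll to the bottom repeatedly until no new content gets loaded
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)  # leave time for the new content to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height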

Dynamic webpages: is that it?

Well, that's it folks!

You just have to automate the browser and save the results.

Then you can do the data processing in your favorite language.


Practice


Go to the website of Peugeot.

  1. Scrape information on the price of the engines for a few models.
  2. Create a class extracting the information for one model.
  3. Run it on the first 3 models.
