Data analysis II

Webscraping, introduction

1 / 57

Laurent Bergé

University of Bordeaux, BxSE

Fall 2022

What is webscraping?

  • webscraping is the art of collecting data from web pages

  • anything you see when browsing the internet is data

  • any data in a web page can be collected

2 / 57

Why do that?

Sometimes that's the only way to get the information you want!

To consider

Web scraping is time consuming and costly in terms of resources (both for you and for the server you're scraping). Think hard about alternative solutions first!

3 / 57

Three types of task in webscraping

  1. organizing the web scraping (only for large tasks)

  2. getting the data from the web (actual web scraping)

  3. formatting the data

4 / 57

Steps 1 and 3 are often overlooked but are extremely important

Typologies of web scraping tasks

5 / 57

Objective

  • get you going for ambitious web scraping projects (the fun ones)

So you'll need to have a correct understanding of how the web works!

6 / 57

Outline

  • how static web pages work

  • practice with R

  • handling large projects

  • how dynamic web pages work

  • practice with Python and Selenium

7 / 57

How does the web work?

8 / 57

The HTTP protocol

9 / 57

Example: you look up Messi on Wikipedia.

client -> first gets the IP address of the server via DNS (cached or via DNS servers) -> sends an HTTP (Hypertext Transfer Protocol) GET request -> ISP -> routers -> server -> response -> all the way back

With HTTPS -> first a handshake (the server sends its certificate) -> then encryption and decryption over the same route; encryption adds run time

IP addresses uniquely identify you on the network!

Example of HTTP GET request

GET / HTTP/1.1
Host: developer.mozilla.org
User-Agent: Mozilla/4.0 (compatible; MSIE5.01; Windows NT)
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
10 / 57

Talk about the User-Agent field: you can write whatever you want here, but you can also make it look like a regular browser
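To make the structure of such a request concrete, here is a minimal Python sketch that only builds the request text (nothing is sent). The function name `build_get_request` and the user agent string `my-scraper/0.1` are made up for illustration.

```python
def build_get_request(host, path="/", user_agent="my-scraper/0.1"):
    """Return the raw text of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",        # request line: method, target, version
        f"Host: {host}",               # mandatory header in HTTP/1.1
        f"User-Agent: {user_agent}",   # free text: any string is accepted
        "Accept-Language: en-us",
        "Accept-Encoding: gzip, deflate",
        "Connection: Keep-Alive",
    ]
    # Each line ends with CRLF; an empty line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"


request_text = build_get_request("developer.mozilla.org")
print(request_text)
```

Changing the `user_agent` argument is all it takes to announce yourself as something else, which is the point made about the User-Agent field.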

Server sends back a response

HTTP/1.1 200 OK
Date: Sat, 09 Oct 2010 14:28:02 GMT
Server: Apache
Last-Modified: Tue, 01 Dec 2009 20:18:22 GMT
ETag: "51142bc1-7449-479b075b2891b"
Accept-Ranges: bytes
Content-Length: 29769
Content-Type: text/html
<!DOCTYPE html... (here come the 29769 bytes of the requested web page)
11 / 57

Well-known status codes: 200 (OK) / 404 (Not Found) / 403 (Forbidden)
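As a rough illustration of how a response is laid out (status line, headers, empty line, body), the Python sketch below splits a shortened copy of the response above by hand. The helper name `parse_response` is hypothetical; real code would rely on an HTTP library instead.

```python
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Server: Apache\r\n"
    "Content-Length: 29769\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<!DOCTYPE html..."
)


def parse_response(raw):
    """Split a raw HTTP response into (status code, reason, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")      # headers end at the empty line
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


code, reason, headers, body = parse_response(raw_response)
print(code, reason, headers["content-type"])
```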

What's a webpage?

  • a web page is just code that is interpreted by your browser

  • the language in which the content is written is HTML

  • HTML is just about content!

  • let's inspect the Wikipedia page on web scraping

12 / 57

We look at the Wikipedia page, then inspect the HTML and locate the content in it

How does HTML work?

  • this is a markup language

  • the content of each HTML element is enclosed in tags

  • tags can have attributes

  • only the content matters: consecutive spaces and line breaks are rendered as a single space

... that's basically it!
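To see tags, attributes and whitespace handling in action, here is a small Python sketch based on the standard `html.parser` module. The class name `TagLister` and the one-line page are made up for illustration.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record every opening tag with its attributes, and every piece of text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        # Collapse runs of whitespace, as a browser does when rendering.
        text = " ".join(data.split())
        if text:
            self.texts.append(text)


page = '<p class="intro">Hello,     <a href="https://example.org">world</a></p>'
lister = TagLister()
lister.feed(page)
lister.close()
print(lister.tags)
print(lister.texts)
```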

13 / 57

How does HTML work? Tags

14 / 57

How does HTML work? Attributes

15 / 57

Empty tags

Some tags don't need closing tags:

  • like <img> or <br>
16 / 57

Most common tags

  • h1-h6: headings
  • p: paragraph
  • a: link
  • img: image
  • strong: to emphasize text
  • div: generic box (this may be the most popular)
17 / 57

HTML is just about boxes

18 / 57

HTML: Example

  • let's write our first web page!

  • Let's create a personal web page containing:

    • a brief description of who you are
    • stuff that you like, with at least one link
    • stuff that you don't like
    • a quote
    • an image
19 / 57

Do it in VS Code, it's easier

Why is our web page boring?

  • HTML is only about content, not about style

  • HTML is nothing without its best friend CSS

  • CSS is only about style

20 / 57

How does CSS work?

  • CSS is a language indicating (to the browser) how to style your HTML elements

You can do a lot with CSS!

21 / 57

CSS: Example

23 / 57

CSS: Can we do more?

I would like to have:

  • the first paragraph in italic
  • the stuff that I like in green (ForestGreen)
  • the stuff that I dislike in red (Crimson)
  • the things that I really, really, like or dislike in bold

Q: At the moment, can I do that?

A: Not yet, because we need to select precisely some elements! In sum, we need selectors.

24 / 57

CSS selectors

25 / 57

CSS selectors

  • CSS selectors indicate precisely which HTML element you want to style

  • typically, HTML tags will contain attributes in order to be found via CSS selectors

  • the main attribute used for this purpose in HTML is class

26 / 57

CSS selectors: Most common ways to select HTML elements

p : all "p" tags
p span : all "span" contained in "p" tags
p, a : all "p" and "a" tags
#id1 : all elements with id equal to id1
.class1 : all elements of class "class1"
p.class1 : all "p" elements of class "class1"
p.class1 span : all "span" in "p" tags of class "class1"
p > span : all "span" that are direct children of p
h1 + p : all "p" that follow *directly* an "h1" (direct sibling)
h1 ~ p : all "p" that follow an "h1" (siblings placed after)
[id] : all elements with an existing "id" attribute
[class^=my] : all elements whose class starts with "my"
p[class*=low] : all "p" elements whose class contains the string low
etc!
27 / 57

LATER: two columns. Left the HTML, right the CSS selector. As I go along the different selectors, I highlight the HTML element that is selected

Exercise 1

Q: Select the highlighted paragraph (the one starting with "I forgot to say").

A: p.extra

<h2>Who am I?</h2>
<div class="intro">
<p>I'm <span class="age">34</span> and
measure <span class="unit">1.70m</span>.</p>
</div>
<div class="info">
<div>
<p id="like">What I like:</p>
<ul>
<li>Barcelona</li>
<li>winning the Ballon d'Or every odd year</li>
</ul>
<p class="extra">I forgot to say that I like scoring over 100
goals per season.</p>
</div>
<div>
<p class="dislike">What I don't like:</p>
<ul>
<li><span class="deadly-foe">Real Madrid</span></li>
<li>leaving the club in which I've played since
<span class="age">13</span></li>
</ul>
</div>
</div>
29 / 57

Exercise 2

Q: Select the two highlighted divs (the two divs nested inside the "info" div).

A: div.info > div

<h2>Who am I?</h2>
<div class="intro">
<p>I'm <span class="age">34</span> and
measure <span class="unit">1.70m</span>.</p>
</div>
<div class="info">
<div>
<p id="like">What I like:</p>
<ul>
<li>Barcelona</li>
<li>winning the Ballon d'Or every odd year</li>
</ul>
<p class="extra">I forgot to say that I like scoring over 100
goals per season.</p>
</div>
<div>
<p class="dislike">What I don't like:</p>
<ul>
<li><span class="deadly-foe">Real Madrid</span></li>
<li>leaving the club in which I've played since
<span class="age">13</span></li>
</ul>
</div>
</div>
31 / 57

Exercise 3

Q: Select the two highlighted lis (the items of the "What I like" list).

A: #like ~ ul > li

<h2>Who am I?</h2>
<div class="intro">
<p>I'm <span class="age">34</span> and
measure <span class="unit">1.70m</span>.</p>
</div>
<div class="info">
<div>
<p id="like">What I like:</p>
<ul>
<li>Barcelona</li>
<li>winning the Ballon d'Or every odd year</li>
</ul>
<p class="extra">I forgot to say that I like scoring over 100
goals per season.</p>
</div>
<div>
<p class="dislike">What I don't like:</p>
<ul>
<li><span class="deadly-foe">Real Madrid</span></li>
<li>leaving the club in which I've played since
<span class="age">13</span></li>
</ul>
</div>
</div>
33 / 57

A note on classes

An HTML element can have several classes separated with spaces:

<p class="first main low-key"> That's only an example! </p>
  • when using .class, the class is not the full string "first main low-key"
  • there are three separate classes: first, main and low-key, which can be selected with the p.class syntax

This means that the paragraph can be selected with either:

p.first
p.main
p.low-key
p.first.main
34 / 57

A note on classes: [attr] selection

<p class="first main low-key"> That's only an example! </p>
  • even though you can select with p.main, you cannot with p[class^=main]
  • in p[class^=text] only the full string of the class is considered, not the three classes separately
  • you would have to use p[class*=main]

Caveat...

  • you would also end up selecting the following element:
<p class="mainiac"> On the floor. </p>
35 / 57
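The difference between .class matching and [attr] matching can be mimicked with plain string operations. This Python sketch is only an analogy (the three helper functions are hypothetical, not part of any scraping library): .class works on the space-separated class tokens, while ^= and *= work on the raw attribute string.

```python
def has_classes(class_attr, *wanted):
    """Mimic p.a.b: every wanted class must be one of the class tokens."""
    tokens = class_attr.split()
    return all(name in tokens for name in wanted)


def attr_starts_with(value, prefix):
    """Mimic [class^=prefix]: test on the full attribute string."""
    return value.startswith(prefix)


def attr_contains(value, part):
    """Mimic [class*=part]: substring test on the full attribute string."""
    return part in value


classes = "first main low-key"
print(has_classes(classes, "main"))            # p.main matches
print(has_classes(classes, "first", "main"))   # p.first.main matches
print(attr_starts_with(classes, "main"))       # p[class^=main] does not match
print(attr_contains(classes, "main"))          # p[class*=main] matches
print(attr_contains("mainiac", "main"))        # ... and so does class="mainiac"
```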

Selectors: Limitation

Imagine a large HTML file.

Q: Can you select the following div? Tip: there's a hint in the title.

<div>
This is an important text.
<p id = "p1"> lorem ipsum </p>
</div>

A: No. None of the selectors above can express: "select the div that contains a paragraph with id equal to p1". (The more recent :has() pseudo-class can, but scraping tools may not support it.)

  • you cannot select parent elements from their children
  • in other words: you can only go down the tree; this is an issue when the child node is the easiest one to select
36 / 57

Selectors: XPath

  • XPath is a language for making selections in XML documents; it is not linked to CSS

  • it's like... a path to a document: /path/to/object, but instead of folders, you have tags

MAIN SYNTAX
/ : selects direct children only
// : selects any descendant
.. : selects the parent
@ : selects attributes (used in conditions)
  • Tip: always start with a double slash. Ex: //div/etc first selects all 'div's in the document before applying the rest of the path.
37 / 57
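The main syntax can be tried out with Python's standard `xml.etree.ElementTree`, which supports a limited subset of XPath. This is a sketch under two assumptions: the snippet is well-formed XML (real HTML pages often are not), and paths are written relative to the root, hence the leading dot. The second div is invented to make the selection non-trivial.

```python
import xml.etree.ElementTree as ET

html = """
<body>
  <div>
    This is an important text.
    <p id="p1"> lorem ipsum </p>
  </div>
  <div>
    <p id="p2"> dolor sit amet </p>
  </div>
</body>
"""
root = ET.fromstring(html)

# '//' : any descendant (written './/' because the path starts at 'root')
all_p = root.findall(".//p")
# '/'  : direct children only
direct_divs = root.findall("./div")
# '@'  : attribute used in a condition; '..' : step up to the parent
parent_of_p1 = root.findall(".//p[@id='p1']/..")

print(len(all_p), len(direct_divs))
print(parent_of_p1[0].tag, "|", parent_of_p1[0].text.strip())
```

The last query goes up the tree from the paragraph to its div, which the CSS selectors listed earlier cannot do.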

XPath: predicates

  • you can add predicates in brackets
//div[@class='extra'] : selects all 'div's whose class is equal to 'extra'
//div/p[2] : selects every 'p' that is the second 'p' child of a 'div'
//div/p[last()] : selects every 'p' that is the last 'p' child of a 'div'
//div[contains(@class, 'extra')] : selects all 'div's whose class contains 'extra'
38 / 57

Predicates can be combined or negated: cond1 and cond2, not(cond)
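Some of these predicates can be tested with Python's standard `xml.etree.ElementTree`. This is a sketch on a shortened version of the exercise HTML, assumed to be well-formed XML. ElementTree covers only part of XPath: `contains()`, `and` and `not()` are not available there, so they would need a full XPath engine.

```python
import xml.etree.ElementTree as ET

html = """
<body>
<div class="intro"><p>I'm <span class="age">34</span>.</p></div>
<div class="info">
  <div>
    <p id="like">What I like:</p>
    <ul><li>Barcelona</li></ul>
    <p class="extra">I forgot to say something.</p>
  </div>
  <div>
    <p class="dislike">What I don't like:</p>
    <ul><li><span class="deadly-foe">Real Madrid</span></li></ul>
  </div>
</div>
</body>
"""
root = ET.fromstring(html)

# //div[@class='info'] : 'div's whose class is exactly 'info'
info = root.findall(".//div[@class='info']")
# //div/p[2] : 'p's that are the second 'p' child of a 'div'
second_p = root.findall(".//div/p[2]")
# //div/p[last()] : 'p's that are the last 'p' child of a 'div'
last_p = root.findall(".//div/p[last()]")

print(len(info))
print([p.get("class") for p in second_p])
print([p.get("class") for p in last_p])
```

Only one div holds two paragraphs, so `p[2]` returns a single element, while `p[last()]` returns one paragraph for each div that has at least one.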

XPath: Axes

  • you can add, before the tag, an axis in the form axis::tag
//span/parent::li : selects all 'li' that are parents of 'span'
//p/following-sibling::* : selects any element following a 'p' that is a sibling of it
39 / 57

//span[@class='deadly-foe']/parent::li
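Python's standard `xml.etree.ElementTree` does not understand the axis::tag notation, so the sketch below only emulates the two axes above. It assumes well-formed markup; `following_siblings` is a hypothetical helper, and `..` stands in for the parent axis.

```python
import xml.etree.ElementTree as ET

html = """
<div class="info">
  <div>
    <p id="like">What I like:</p>
    <ul><li>Barcelona</li></ul>
    <p class="extra">I forgot to say something.</p>
  </div>
  <div>
    <p class="dislike">What I don't like:</p>
    <ul><li><span class="deadly-foe">Real Madrid</span></li></ul>
  </div>
</div>
"""
root = ET.fromstring(html)

# //span[@class='deadly-foe']/parent::li, written with '..' plus a tag check
parents = [e for e in root.findall(".//span[@class='deadly-foe']/..")
           if e.tag == "li"]


def following_siblings(tree_root, elem):
    """Emulate elem/following-sibling::* by slicing the parent's children."""
    for parent in tree_root.iter():
        children = list(parent)
        for position, child in enumerate(children):
            if child is elem:
                return children[position + 1:]
    return []


like = root.find(".//p[@id='like']")
after_like = following_siblings(root, like)
print([e.tag for e in parents])
print([e.tag for e in after_like])
```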

Why are selectors important?

  • when you scrape a web page, you don't want all the content from the web page: you focus only on specific elements

  • you select elements using CSS selectors or XPath

40 / 57

Wrapping up

Selectors are powerful tools to select HTML elements

Resources

Now let's go back to CSS!

41 / 57

More CSS to our webpage

  • let's add more CSS to our webpage

  • let's have:

    • the first paragraph in italic
    • the stuff that I like in green (ForestGreen)
    • the stuff that I dislike in red (Crimson)
    • the things that I really, really, like or dislike in bold
42 / 57

Ideally, ask them to create a specific web page: for example, a blog

What I see and what I scrape

To remember

You don't scrape what you see in the browser, you scrape the HTML code which, after application of CSS styles, renders on your browser.

It's important to set the CSS aside (that's why we need to understand what it does)!

43 / 57

You can scrape much more than visible elements!
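A small Python sketch of that point, using the standard `html.parser` module. The two-paragraph page and its content are invented; `TextCollector` is a hypothetical helper. The second paragraph is hidden by its style, so a browser would not display it, yet it is still in the HTML.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every non-empty piece of text, whatever its styling."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())


page = """
<p>Price: 10 euros</p>
<p style="display: none">Internal reference: A-42</p>
"""
collector = TextCollector()
collector.feed(page)
collector.close()
print(collector.texts)
```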

Scraping in R

44 / 57

Tools

Good news

You have readily available tools to webscrape in R and Python.

In R, you'll need:

  • rvest

  • and that's it! (for the easy stuff!)

45 / 57

Exercise 1

  1. go to Messi's Wikipedia page

  2. scrape the statistics table reporting his club goals

  3. plot the evolution of the number of goals in the national championship and in the Champions League

The only thing you need is...

  1. rvest

    • read_html to get the webpage content

    • html_elements to get the HTML elements with CSS selectors

    • html_table to extract the table

  2. basic data management

46 / 57
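For readers who will later work in Python, here is a rough standard-library analogue of the table extraction step, not the rvest solution itself. `TableParser` is a hypothetical helper, and the table is a toy one with invented numbers, used only to exercise the parser.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cells of every <tr> as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, if any
        self._cell = None   # text chunks of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Toy table: the seasons and goal counts are invented.
page = """
<table>
  <tr><th>Season</th><th>Goals</th></tr>
  <tr><td>Year 1</td><td>3</td></tr>
  <tr><td>Year 2</td><td>5</td></tr>
</table>
"""
parser = TableParser()
parser.feed(page)
parser.close()
header, *records = parser.rows
goals = [int(row[1]) for row in records]   # basic data management
print(header, goals)
```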

Exercise 2

  1. select all paragraphs containing the name Messi twice

  2. extract these paragraphs and highlight the name of Messi with <strong>

  3. write an HTML document containing only these paragraphs

  4. add the following style in the head of the document:

<style>
  p {
    border: 1pt solid;
    border-radius: 5px;
    box-shadow: 12px 12px 5px 2px rgba(0, 0, 255, 0.2);
    margin: 2em;
    padding: 1em;
  }
</style>
47 / 57
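One possible shape for steps 1 to 3, sketched in Python on made-up sentences rather than on the real page. It assumes that "twice" means "at least twice" and that the paragraphs are already available as plain text; the `highlight` function is hypothetical.

```python
import re


def highlight(paragraphs, name="Messi", min_count=2):
    """Keep paragraphs mentioning `name` at least `min_count` times,
    wrapping each mention in <strong>."""
    pattern = re.compile(rf"\b{re.escape(name)}\b")
    kept = []
    for text in paragraphs:
        if len(pattern.findall(text)) >= min_count:
            marked = pattern.sub(lambda m: f"<strong>{m.group(0)}</strong>", text)
            kept.append(f"<p>{marked}</p>")
    return kept


# Invented sentences, only there to exercise the function.
paragraphs = [
    "Messi scored. Then Messi scored again.",
    "Messi rested.",
    "Nobody scored.",
]
kept = highlight(paragraphs)
document = "<html><body>\n" + "\n".join(kept) + "\n</body></html>"
print(document)
```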

Exercise 3

  1. gather information on your next neighbor only using Google queries

Ex: https://www.google.com/search?q=laurent+berge

48 / 57

Google query: Result

You should be faced with the following page (or a similar one):

49 / 57

What happened?

  1. the results page was substituted with a cookies consent page

  2. we didn't get our results: DAMN!

Question

How do we get the results from our query?

50 / 57

Solution?

Question

How do we get the results from our query?

Answer

We should click on accept all/reject all and move on.

Let's see how we do that in R!

51 / 57

Browsing with cookies

Google gives you the results only once you have accepted or rejected its cookie policy.

Q: How does it find out that you have validated the policy?

A: By setting cookies on your machine!

52 / 57

Cookies

The main purpose of cookies is to personalize your web experience.

Contrary to common belief, cookies aren't bad: You need cookies! Without cookies, the web would be pretty useless.

Cookies are just a piece of text data: name-value pairs.

Example of a cookie from Google:
name: NID
value: 511=oPzh2ocmoF48NQ_WC6j7WdML_8fDK[...]
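Name-value pairs of this kind can be read with Python's standard `http.cookies` module. In the sketch below the cookie string is hypothetical (short made-up values, not real Google cookies).

```python
from http.cookies import SimpleCookie

jar = SimpleCookie()
jar.load("NID=abc123; CONSENT=yes")   # hypothetical names and values

for name, morsel in jar.items():
    print(name, "->", morsel.value)
```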

53 / 57

Cookies: How does it work?

When making an HTTP request, some of your cookies are sent along to the website in the request header.

The website's server uses that information to identify you and offer you personalized content.

Example of header:
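As an illustration only, this Python sketch builds a request that carries a Cookie header, without sending it. The cookie values and the user agent string are hypothetical.

```python
import urllib.request

cookies = {"NID": "abc123", "CONSENT": "yes"}   # hypothetical cookie values
cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())

# Only the request object is built here; nothing is sent over the network
# until urllib.request.urlopen(request) is called.
request = urllib.request.Request(
    "https://www.google.com/search?q=laurent+berge",
    headers={"Cookie": cookie_header, "User-Agent": "my-scraper/0.1"},
)
print(request.get_method(), request.full_url)
print(request.header_items())
```

All the cookies travel in a single Cookie header, as name=value pairs separated by semicolons.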

54 / 57

Cookies: How to in R?

You can deal with cookies with rvest.

  • use session instead of read_html

  • rvest can send POST requests, and... good news: the Google accept/reject are in fact forms

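library(rvest)

# use 'session' (it keeps the cookies between requests) and not read_html
gg_page = session("https://www.google.com/search?q=laurent+berge")
# get all the forms of the consent page
forms = html_form(gg_page)
# submit and proceed (we select the second form; the index may differ
# depending on the consent page you are served)
gg_page = session_submit(gg_page, forms[[2]])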
55 / 57

Result

56 / 57

Exercise 3

  1. gather information on yourself using Google queries

  2. keep only relevant information

  3. compile that information in an HTML document

  4. write a function automating the process and apply it to other people in the class

57 / 57
