Data analysis II
Webscraping, introduction
Laurent Bergé
University of Bordeaux, BxSE
Fall 2022

What is webscraping?

webscraping is the art of collecting data from web pages
anything you see when browsing the internet is data
any data in a web page can be collected

Why do that?

Sometimes that's the only way to get the information you want!

To consider

Web scraping is time consuming and costly in terms of resources (both for you and for the server you're scraping). You should think hard about alternative solutions first!

Three types of task in webscraping

1. organizing the web scraping (only for large tasks)
2. getting the data from the web (actual web scraping)
3. formatting the data

Steps 1 and 3 are overlooked but are extremely important

Typologies of web scraping tasks

Objective

get you going for ambitious web scraping projects (the fun ones)

So you'll need to have a correct understanding of how the web works!

Outline

how static web pages work
practice with R
handling large projects
how dynamic web pages work
practice with Python and Selenium

How does the web work?

The HTTP protocol

Example: you look up "Messi" on Wikipedia.

client -> first gets the IP address of the server via DNS (cached or via DNS servers) -> sends an HTTP (HyperText Transfer Protocol) GET request -> ISP -> routers -> server -> response -> all the way back

if in HTTPS -> first a handshake (the server sends its certificate) -> then encryption + decryption along the same route; encryption costs run time

IP addresses uniquely identify you on the network!

Example of HTTP GET request

GET / HTTP/1.1
Host: developer.mozilla.org
User-Agent: Mozilla/4.0 (compatible; MSIE5.01; Windows NT)
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: Keep-Alive

talk about the User-Agent field => you can write whatever you want there, but you can also make it look like a regular browser

Server sends back a response

HTTP/1.1 200 OK
Date: Sat, 09 Oct 2010 14:28:02 GMT
Server: Apache
Last-Modified: Tue, 01 Dec 2009 20:18:22 GMT
ETag: "51142bc1-7449-479b075b2891b"
Accept-Ranges: bytes
Content-Length: 29769
Content-Type: text/html
<!DOCTYPE html... (here come the 29769 bytes of the requested web page)

famous codes: 200 / 404 / 403 (forbidden)
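
The structure of the response above (a status line, then headers, then a blank line, then the body) can be sketched in Python, which the course uses later for dynamic pages. This is only an illustration with the standard library: the raw text is a shortened copy of the response shown above, and `parse_response` is a hypothetical helper, not a library function.

```python
from http import HTTPStatus

# Shortened copy of the response shown above (body truncated).
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Sat, 09 Oct 2010 14:28:02 GMT\r\n"
    "Server: Apache\r\n"
    "Content-Length: 29769\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<!DOCTYPE html>..."
)


def parse_response(raw):
    """Split a raw HTTP response into status code, reason, headers and body."""
    # The first blank line separates the head from the body.
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return int(code), reason, headers, body


code, reason, headers, body = parse_response(RAW_RESPONSE)
print(code, reason, headers["Content-Type"])

# The "famous codes" and their standard reason phrases.
for famous in (200, 403, 404):
    print(famous, HTTPStatus(famous).phrase)
```

In practice an HTTP library does this parsing for you; the point is only that a response is plain text with a fixed layout.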

What's a webpage?

a web page is just code that is interpreted by your browser
the language in which the content is written is HTML
HTML is just about content!
let's inspect the Wikipedia page on web scraping

we look at the wikipedia page then inspect the html and find out the content

How does HTML work?

this is a markup language
the content of each HTML element is enclosed in tags
tags can have attributes
content only: multiple spaces are ignored

... that's basically it!

How does HTML work? Tags

How does HTML work? Attributes

Empty tags

Some tags don't need closing tags:

like <img> or <br>
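
To make tags, attributes and empty tags concrete, here is a small Python sketch that lists what a parser sees in a snippet of HTML. It relies only on the standard `html.parser` module; the `TagLogger` class and the sample snippet are made up for the illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every opening tag (with its attributes) and every closing tag."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.opened.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.closed.append(tag)


# Hypothetical snippet: a paragraph with a link, a line break and an image.
page = (
    '<p class="intro">Hello <a href="https://example.org">link</a>'
    '<br><img src="cat.png"></p>'
)

logger = TagLogger()
logger.feed(page)
logger.close()

for tag, attributes in logger.opened:
    print("open :", tag, attributes)
print("close:", logger.closed)
```

The `br` and `img` tags show up among the opened tags but never among the closed ones: they are the empty tags of this slide.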

Most common tags

h1-h4: headers
p: paragraph
a: link
img: image
strong: to emphasize text
div: generic box (this may be the most popular)

HTML is just about boxes

HTML: Example

let's write our first web page!
Let's create a personal web page containing:
- a brief description of who you are
- stuff that you like, with at least one link
- stuff that you don't like
- a quote
- an image

Do it in VScode, it's easier

Why is our web page boring?

HTML is only about content, not about style
HTML is nothing without its best friend CSS
CSS is only about style

How does CSS work?

CSS is a language indicating (to the browser) how to style your HTML elements

You can do a lot with CSS!

CSS: Example

let's add CSS to our web page!
use CSS to:
- increase the font-size of the paragraphs and set the font-family to sans-serif
- change the background-color to Linen
- add a border-radius and a box-shadow to the image

CSS: Can we do more?

I would like to have:

the first paragraph in italic
the stuff that I like in green (ForestGreen)
the stuff that I dislike in red (Crimson)
the things that I really, really, like or dislike in bold

Q: At the moment, can I do that?

A: Not yet, because we need to select precisely some elements! In sum, we need selectors.

CSS selectors

CSS selectors indicate precisely which HTML element you want to style
typically, HTML tags will contain attributes in order to be found via CSS selectors
the main attribute used in HTML is the class

CSS selectors: Most common ways to select HTML elements

p             : all "p" tags
p span        : all "span" contained in "p" tags
p, a          : all "p" and "a" tags
#id1          : all elements with id equal to id1
.class1       : all elements of class "class1"
p.class1      : all "p" elements of class "class1"
p.class1 span : all "span" in "p" tags of class "class1"
p > span      : all "span" that are direct children of p
h1 + p        : all "p" that follow *directly* an "h1" (direct sibling)
h1 ~ p        : all "p" that follow an "h1" (siblings placed after)
[id]          : all elements with an existing "id" attribute
[class^=my]   : all elements whose class starts with "my"
p[class*=low] : all "p" elements whose class contains the string low
etc!

LATER: two columns. Left the HTML, right the CSS selector. As I go along the different selectors, I highlight the HTML element that is selected

Exercise 1

Q: Select the following paragraph.

A: p.extra

<h2>Who am I?</h2>
<div class="intro">
  <p>I'm <span class="age">34</span> and 
    measure <span class="unit">1.70m</span>.</p>
</div>
<div class="info">
  <div>
    <p id="like">What I like:</p>
    <ul>
      <li>Barcelona</li>
      <li>winning the Ballon d'Or every odd year</li>
    </ul>
    <p class="extra">I forgot to say that I like scoring over 100 
       goals per season.</p>
  </div>
  <div>
    <p class="dislike">What I don't like:</p>
    <ul>
      <li><span class="deadly-foe">Real Madrid</span></li>
      <li>leaving the club in which I've played since 
       <span class="age">13</span></li>
    </ul>
  </div>
</div>

Exercise 2

Q: Select the two highlighted divs.

A: div.info > div

<h2>Who am I?</h2>
<div class="intro">
  <p>I'm <span class="age">34</span> and 
    measure <span class="unit">1.70m</span>.</p>
</div>
<div class="info">
  <div>
    <p id="like">What I like:</p>
    <ul>
      <li>Barcelona</li>
      <li>winning the Ballon d'Or every odd year</li>
    </ul>
    <p class="extra">I forgot to say that I like scoring over 100 
       goals per season.</p>
  </div>
  <div>
    <p class="dislike">What I don't like:</p>
    <ul>
      <li><span class="deadly-foe">Real Madrid</span></li>
      <li>leaving the club in which I've played since 
       <span class="age">13</span></li>
    </ul>
  </div>
</div>

Exercise 3

Q: Select the two highlighted lis.

A: #like ~ ul > li

<h2>Who am I?</h2>
<div class="intro">
  <p>I'm <span class="age">34</span> and 
    measure <span class="unit">1.70m</span>.</p>
</div>
<div class="info">
  <div>
    <p id="like">What I like:</p>
    <ul>
      <li>Barcelona</li>
      <li>winning the Ballon d'Or every odd year</li>
    </ul>
    <p class="extra">I forgot to say that I like scoring over 100 
       goals per season.</p>
  </div>
  <div>
    <p class="dislike">What I don't like:</p>
    <ul>
      <li><span class="deadly-foe">Real Madrid</span></li>
      <li>leaving the club in which I've played since 
       <span class="age">13</span></li>
    </ul>
  </div>
</div>

A note on classes

An HTML element can have several classes separated with spaces:

<p class="first main low-key"> That's only an example! </p>

when using .class, the class is not the full string "first main low-key"
there are three separate classes: first, main and low-key, which can be selected with the p.class syntax

This means that the paragraph can be selected with either:

p.first
p.main
p.low-key
p.first.main

A note on classes: `[attr]` selection

<p class="first main low-key"> That's only an example! </p>

even though you can select with p.main, you cannot with p[class^=main]
in p[class^=text] only the full string of the class is considered, not the three classes separately
you would have to use p[class*=main]

Caveat...

you would also end up selecting the following element:

<p class="mainiac"> On the floor. </p>
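
The difference between the three ways of matching a class can be mimicked with plain string operations. The Python sketch below is only an analogy: the three helper functions are hypothetical and simply mirror what `.main`, `[class^=main]` and `[class*=main]` test against the value of the class attribute.

```python
def has_class(class_attr, name):
    """Mimics `.name`: the attribute is split on whitespace into separate classes."""
    return name in class_attr.split()


def class_starts_with(class_attr, text):
    """Mimics `[class^=text]`: looks at the full attribute string."""
    return class_attr.startswith(text)


def class_contains(class_attr, text):
    """Mimics `[class*=text]`: plain substring search in the full string."""
    return text in class_attr


example = "first main low-key"
caveat = "mainiac"

print(has_class(example, "main"))          # True: "main" is one of the three classes
print(class_starts_with(example, "main"))  # False: the string starts with "first"
print(class_contains(example, "main"))     # True
print(class_contains(caveat, "main"))      # True: the unwanted match
print(has_class(caveat, "main"))           # False: "mainiac" is a different class
```

CSS also offers `[class~=main]`, which matches a whole whitespace-separated word and therefore behaves like `.main`.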

Selectors: Limitation

Imagine a large HTML file.

Q: Can you select the following div? Tip: there's a hint in the title.

<div>
  This is an important text.
  <p id = "p1"> lorem ipsum </p>
</div>

A: No. There is no selector equal to: "select the div such that it contains a paragraph with id equal to p1".

you cannot select parent elements from children (the recent :has() pseudo-class lifts this limit in browsers, but scraping tools may not support it)
in other words: you can only go down the tree
sometimes it's an issue when the child node is easier to select

Selectors: XPath

XPath is a language to make selections in XML documents
it is not linked to CSS
it's like... a path to a document: /path/to/object, but instead of having folders, you have tags

MAIN SYNTAX
/   : selects the direct descendant only
//  : selects any descendant
..  : selects the parent
@   : selects attributes (used in conditions)

Tip: always start with a double slash. Ex: //div/etc will first select all div in the document before applying the rest of the commands.

XPath: predicates

you can add predicates in brackets
//div[@class='extra'] : selects all 'div's whose class is equal to 'extra'
//div/p[2]            : selects the second 'p's that are direct children of 'div's
//div/p[last()]       : selects the last 'p's that are direct children of 'div's
//div[contains(@class, 'extra')] : selects all 'div's whose class contains 'extra'

predicates can be combined: cond1 and cond2, not(cond)

XPath: Axes

you can add, before the tag, an axis in the form axis::tag^★
//span/parent::li         : selects all 'li' that are parents of 'span'
//p/following-sibling::*  : selects all the siblings placed after a 'p'

★: You can find the list of axes on W3Schools.

//span[@class='deadly-foe']/parent::li
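
A few of these XPath expressions can be tried in Python with the standard `xml.etree.ElementTree` module. This is a sketch under clear assumptions: ElementTree implements only a subset of XPath (no axes such as `parent::`, no `contains()`), so the parent is reached with `..` instead, and the document must be well-formed XML. The snippet is a shortened, wrapped copy of the exercise HTML. Full XPath needs a third-party library such as lxml.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the exercise HTML, wrapped in a single root element.
DOC = """
<body>
  <h2>Who am I?</h2>
  <div class="info">
    <div>
      <p id="like">What I like:</p>
      <ul><li>Barcelona</li></ul>
      <p class="extra">I like scoring goals.</p>
    </div>
    <div>
      <p class="dislike">What I don't like:</p>
      <ul><li><span class="deadly-foe">Real Madrid</span></li></ul>
    </div>
  </div>
</body>
"""

root = ET.fromstring(DOC)

# Predicate on an attribute: all 'p' whose class is equal to 'extra'.
extra = root.findall(".//p[@class='extra']")

# Going UP the tree: the parent of the span, which CSS selectors could not reach.
parents = root.findall(".//span[@class='deadly-foe']/..")

# Positional predicate: the last 'p' among the direct children of each 'div'.
last_p = root.findall(".//div/p[last()]")

print([p.text for p in extra])
print([element.tag for element in parents])
print([p.get("class") for p in last_p])
```

Note the leading `.` in each path: ElementTree evaluates paths relative to the element on which `findall` is called.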

Why are selectors important?

when you scrape a web page, you don't want all the content from the web page: you focus only on specific elements
you select elements using CSS selectors or XPath

Wrapping up

Selectors are powerful tools to select HTML elements

Resources

CSS
XPath
a website I created to learn and test selectors

Now let's go back to CSS!

More CSS to our webpage

let's add more CSS to our webpage
let's have:
- the first paragraph in italic
- the stuff that I like in green (ForestGreen)
- the stuff that I dislike in red (Crimson)
- the things that I really, really, like or dislike in bold

Ideally, ask them to create a specific web page: for example, a blog

What I see and what I scrape

To remember

You don't scrape what you see in the browser, you scrape the HTML code which, after application of CSS styles, renders on your browser.

It's important to set the CSS aside (that's why we need to understand what it does)!

You can scrape much more than visible elements!

Scraping in R

Tools

Good news

You have readily available tools to webscrape in R and Python.

In R, you'll need:

rvest
and that's it! (for the easy stuff!)

Exercise 1

go to Messi's Wikipedia page
scrape the statistics table reporting the goals scored at club level
plot the evolution of the number of goals in the national championship and in the Champions League

The only thing you need is...

rvest
- read_html to get the webpage content
- html_elements to get the HTML elements with CSS selectors
- html_table to extract the table
basic data management
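
For comparison with the rvest workflow, the "extract the table" step can be sketched in Python (also listed in the course outline) using only the standard library. Everything here is an assumption made for the illustration: the `TableParser` class is hypothetical, and the table holds placeholder values, not Messi's actual statistics. Against a real page you would first download the HTML and select the right table.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text chunks of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Placeholder table (made-up values).
HTML_TABLE = """
<table>
  <tr><th>Season</th><th>Goals</th></tr>
  <tr><td>A</td><td>1</td></tr>
  <tr><td>B</td><td>2</td></tr>
</table>
"""

parser = TableParser()
parser.feed(HTML_TABLE)
parser.close()

# Basic data management: separate the header and convert the counts to integers.
header, *records = parser.rows
goals = [int(record[1]) for record in records]
print(header, goals)
```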

Exercise 2

select all paragraphs containing the name Messi twice
extract these paragraphs and highlight the name of Messi with <strong>
write an HTML document containing only these paragraphs
add the following style in the head of the document:

<style>
p {
    border: 1pt solid;
    border-radius: 5px;
    box-shadow: 12px 12px 5px 2px rgba(0, 0, 255, 0.2);
    margin: 2em;
    padding: 1em;
}
</style>

Exercise 3

gather information on your next neighbor only using Google queries

Ex: https://www.google.com/search?q=laurent+berge

Google query: Result

You should be facing the following page (or similar):

What happened?

the results page was substituted with a cookies consent page^★
we didn't get our results: DAMN!

Question

How do we get the results from our query?

★: Note that this may be because we're in the EU, which requires explicit consent on cookies: that requirement may vary in other countries.

Solution?

Question

How do we get the results from our query?

Answer

We should click on accept all/reject all and move on.

let's see how we do that in R!

Browsing with cookies

Google will give you the results only if you validate the cookies policy.

Q: How does it find out that you have validated the policy?

A: By setting cookies on your machine!

Cookies

The main purpose of cookies is to personalize your web experience.

Contrary to common belief, cookies aren't bad: you need cookies! Without cookies, the web would be pretty useless.

Cookies are just a piece of text data: name-value pairs.

Example of a cookie from Google:
name: NID
value: 511=oPzh2ocmoF48NQ_WC6j7WdML_8fDK[...]

Cookies: How does it work?

When making an HTTP request, some of your cookies are sent along to the website in the request header.

The website's server uses that information to identify you and offer you personalized content.

Example of header:
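
The round trip can be sketched in Python with the standard `http.cookies` module: the server sets cookies with `Set-Cookie`, and the client sends them back as one `Cookie` header made of name=value pairs separated by semicolons. The cookie names and values and the host below are placeholders invented for the illustration, not real Google cookies.

```python
from http.cookies import SimpleCookie

# The client-side "cookie jar".
jar = SimpleCookie()

# Hypothetical Set-Cookie values received from a server in earlier responses.
jar.load("consent=accepted; Path=/")
jar.load("lang=en; Path=/")

# On the next request, the stored pairs go back in a single Cookie header.
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())

request_headers = {
    "Host": "www.example.org",  # placeholder host
    "Cookie": cookie_header,
}

for name, value in request_headers.items():
    print(f"{name}: {value}")
```

Only the name and the value travel back to the server; attributes such as `Path` stay on the client and decide when the cookie is sent.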

Cookies: How to in R?

You can deal with cookies with rvest.

use session instead of read_html
rvest can send POST requests, and... good news: the Google accept/reject buttons are in fact forms

library(rvest)

# use 'session' and not read_html: a session keeps the cookies across requests
gg_page = session("https://www.google.com/search?q=laurent+berge")
# get all the forms of the page we landed on (the consent page)
form = html_form(gg_page)
# submit and proceed (we select the second form)
gg_page = session_submit(gg_page, form[[2]])

Result

Exercise 3

gather information on yourself using Google queries
keep only relevant information
compile that information in an HTML document
write a function automating the process and apply it to other people in the class
