webscraping is the art of collecting data from web pages
anything you see when browsing the internet is data
any data in a web page can be collected
Sometimes that's the only way to get the information you want!
Web scraping is time-consuming and costly in terms of resources (both for you and the server you're scraping). Think hard about alternative solutions first!
organizing the web scraping (only for large tasks)
getting the data from the web (actual web scraping)
formatting the data
Steps 1 and 3 are often overlooked but are extremely important
So you'll need to have a correct understanding of how the web works!
how static web pages work
practice with R
handling large projects
how dynamic web pages work
practice with Python and Selenium
example: you query Wikipedia for Messi
client -> first gets the IP of the server via DNS (cached or via DNS servers) -> sends an HTTP (HyperText Transfer Protocol) GET request -> ISP -> routers -> server -> response -> all the way back
with HTTPS -> first a handshake (the server sends its certificate) -> then encryption + decryption along the same route; encryption adds run time
IP addresses uniquely identify you on the network!
GET / HTTP/1.1
Host: developer.mozilla.org
User-Agent: Mozilla/4.0 (compatible; MSIE5.01; Windows NT)
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
talk about the User-Agent field => you can write whatever you want here, so you can also look like a regular browser
HTTP/1.1 200 OK
Date: Sat, 09 Oct 2010 14:28:02 GMT
Server: Apache
Last-Modified: Tue, 01 Dec 2009 20:18:22 GMT
ETag: "51142bc1-7449-479b075b2891b"
Accept-Ranges: bytes
Content-Length: 29769
Content-Type: text/html

<!DOCTYPE html... (here come the 29769 bytes of the requested web page)
famous status codes: 200 (OK) / 404 (not found) / 403 (forbidden)
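To make the request and the response above more concrete, here is a minimal Python sketch using only the standard library. It builds a GET request without sending it, sets a custom User-Agent, and looks up the standard meaning of the three status codes. The User-Agent string is an arbitrary value chosen for illustration.

```python
from http import HTTPStatus
from urllib.request import Request

# Build a GET request without sending it; the User-Agent value is free text.
request = Request(
    "https://developer.mozilla.org/",
    headers={"User-Agent": "Mozilla/5.0 (compatible; my-scraper)"},
)

method = request.get_method()  # "GET" because no data is attached
host = request.host            # host part of the URL
# urllib stores header names with only the first letter capitalized.
user_agent = request.get_header("User-agent")

# The status codes mentioned above, with their standard reason phrases.
reasons = {code: HTTPStatus(code).phrase for code in (200, 404, 403)}

print(method, host, user_agent)
print(reasons)
```

Nothing leaves your machine here: the request object is only a description of what would be sent.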
a web page is just code that is interpreted by your browser
the language in which the content is written is HTML
HTML is just about content!
let's inspect the Wikipedia page on web scraping
we look at the Wikipedia page, then inspect the HTML and find where the content is
this is a markup language
the content of each HTML element is enclosed in tags
tags can have attributes
content only: multiple spaces are ignored
... that's basically it!
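The points above (tags, attributes, ignored spaces) can be checked with a small Python sketch based on the standard `html.parser` module. The `TagLister` class and the one-line page are made up for this example; this is only one possible way to look inside HTML.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record every opening tag with its attributes, and every text chunk."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        # Collapse runs of whitespace, as a browser does when rendering.
        text = " ".join(data.split())
        if text:
            self.texts.append(text)


page = '<p class="intro">Hello     <strong>web</strong>   scraping!</p>'
parser = TagLister()
parser.feed(page)
parser.close()

print(parser.tags)
print(parser.texts)
```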
Some tags don't need closing tags:
<img> or <br>
h1-h4: headers
p: paragraph
a: link
img: image
strong: to emphasize text
div: generic box (this may be the most popular)
let's write our first web page!
Let's create a personal web page containing:
Do it in VS Code, it's easier
HTML is only about content, not about style
HTML is nothing without its best friend CSS
CSS is only about style
You can do a lot with CSS!
let's add CSS to our web page!
use CSS to:
I would like to have:
CSS selectors indicate precisely which HTML element you want to style
typically, HTML tags will contain attributes in order to be found via CSS selectors
the main attribute used in HTML is the class (the id attribute is usually less useful in webscraping)
p : all "p" tags
p span : all "span" contained in "p" tags
p, a : all "p" and "a" tags
#id1 : all elements with id equal to id1
.class1 : all elements of class "class1"
p.class1 : all "p" elements of class "class1"
p.class1 span : all "span" in "p" tags of class "class1"
p > span : all "span" that are direct children of p
h1 + p : all "p" that follow *directly* an "h1" (direct sibling)
h1 ~ p : all "p" that follow an "h1" (siblings placed after)
[id] : all elements with an existing "id" attribute
[class^=my] : all elements whose class starts with "my"
p[class*=low] : all "p" elements whose class contains the string low
etc!
practice: which elements does each selector below pick in this HTML?
<h2>Who am I?</h2>
<div class="intro">
  <p>I'm <span class="age">34</span> and measure <span class="unit">1.70m</span>.</p>
</div>
<div class="info">
  <div>
    <p id="like">What I like:</p>
    <ul>
      <li>Barcelona</li>
      <li>winning the Ballon d'Or every odd year</li>
    </ul>
    <p class="extra">I forgot to say that I like scoring over 100 goals per season.</p>
  </div>
  <div>
    <p class="dislike">What I don't like:</p>
    <ul>
      <li><span class="deadly-foe">Real Madrid</span></li>
      <li>leaving the club in which I've played since <span class="age">13</span></li>
    </ul>
  </div>
</div>
p.extra : the "p" of class "extra"
div.info > div : the two "div"s that are direct children of the "div" of class "info"
#like ~ ul > li : the "li"s of the "ul" that follows the element with id "like"
An HTML element can have several classes separated with spaces:
<p class="first main low-key"> That's only an example! </p>
for the .class syntax, the class is not the full string "first main low-key"
there are three classes, first, main and low-key, each of which can be selected with the p.class syntax
This means that the paragraph can be selected with either:
p.first
p.main
p.low-key
p.first.main
[attr] selection works differently:
<p class="first main low-key"> That's only an example! </p>
you can select it with p.main, but you cannot with p[class^=main]
with p[class^=text], only the full string of the class is considered, not the three classes separately
p[class*=main] works, but it also selects:
<p class="mainiac"> On the floor. </p>
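The difference between the three kinds of selection can be mimicked in plain Python. This is only a sketch of the matching rules, not a CSS engine, and the three helper functions are hypothetical names introduced for this example.

```python
def has_class(class_attr, name):
    """p.name semantics: name must be one of the space-separated classes."""
    return name in class_attr.split()


def starts_with(class_attr, prefix):
    """[class^=prefix] semantics: tested on the full attribute string."""
    return class_attr.startswith(prefix)


def contains(class_attr, fragment):
    """[class*=fragment] semantics: substring of the full attribute string."""
    return fragment in class_attr


example = "first main low-key"  # the paragraph with three classes
other = "mainiac"               # the unrelated paragraph

for label, value in (("example", example), ("other", other)):
    print(
        label,
        has_class(value, "main"),
        starts_with(value, "main"),
        contains(value, "main"),
    )
```

Only the class-based test keeps the first paragraph and rejects the second one, which is why the p.class syntax is usually the safer choice.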
Imagine a large HTML file.
<div> This is an important text. <p id = "p1"> lorem ipsum </p></div>
:has() is recent and not yet supported everywhere.
XPath is a language to make selections in XML documents; it is not linked to CSS
it's like... a path to a document: /path/to/object but instead of having folders, you have tags
MAIN SYNTAX
/ : selects the direct descendant only
// : selects any descendant
.. : selects the parent
@ : selects attributes (used in conditions)
note: //div/etc will first select all 'div's in the document before applying the rest of the commands
//div[@class='extra'] : selects all 'div's whose class is equal to 'extra'
//div/p[2] : selects the second 'p's that are direct children of 'div's
//div/p[last()] : selects the last 'p's that are direct children of 'div's
//div[contains(@class, 'extra')] : selects all 'div's whose class contains 'extra'
cond1 and cond2 : both conditions must hold
not(cond) : the condition must not hold
axis::tag : moves along an axis (parent, following-sibling, ...) instead of going down to the children
//span/parent::li : selects all 'li' that are parents of 'span'
//p/following-sibling::* : selects any element that follows a 'p' and is a sibling of it
//span[@class='deadly-foe']/parent::li
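Some of these expressions can be tried in Python with the standard `xml.etree.ElementTree` module. This is a sketch under two assumptions: the HTML is well-formed enough to be read as XML (a `body` wrapper is added to get a single root), and only the XPath subset supported by ElementTree is used. That subset has no `parent::li` axis and no `contains()`, so the parent is reached with `..` instead, and paths start with `.` because they are relative to the root element.

```python
import xml.etree.ElementTree as ET

html = """
<body>
<div class="info">
  <div>
    <p id="like">What I like:</p>
    <ul>
      <li>Barcelona</li>
      <li>winning the Ballon d'Or every odd year</li>
    </ul>
    <p class="extra">I forgot to say that I like scoring over 100 goals per season.</p>
  </div>
  <div>
    <p class="dislike">What I don't like:</p>
    <ul>
      <li><span class="deadly-foe">Real Madrid</span></li>
      <li>leaving the club in which I've played since <span class="age">13</span></li>
    </ul>
  </div>
</div>
</body>
"""
root = ET.fromstring(html)

# //p[@class='extra'] : condition on an attribute
extra = root.findall(".//p[@class='extra']")

# //div/p[last()] : last 'p' among the direct children of each 'div'
last_p_classes = [p.get("class") for p in root.findall(".//div/p[last()]")]

# //span[@class='deadly-foe']/.. : parent of the span (an 'li' here)
foe_parents = root.findall(".//span[@class='deadly-foe']/..")
foe_text = "".join(foe_parents[0].itertext())

print(len(extra), last_p_classes, foe_parents[0].tag, foe_text)
```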
when you scrape a web page, you don't want all the content from the web page: you focus only on specific elements
you select elements using CSS selectors or XPath
Selectors are powerful tools to select HTML elements
let's add more CSS to our webpage
let's have:
Ideally, ask them to create a specific web page: for example, a blog
You don't scrape what you see in the browser, you scrape the HTML code which, after application of CSS styles, renders on your browser.
It's important to set the CSS aside (that's why we need to understand what it does)!
You can scrape much more than visible elements!
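A short Python sketch of that last point, using the standard `html.parser` module. The two-paragraph page and the `Collector` class are invented for this example: the second paragraph would not be displayed by a browser, and the link target is never shown as text, yet both are in the HTML and can be collected.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Collect link targets and text, whether or not a browser would show them."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())


page = """
<p>Visible text with a <a href="/stats?player=10">link</a>.</p>
<p style="display: none">Hidden note</p>
"""
collector = Collector()
collector.feed(page)
collector.close()

print(collector.links)
print(collector.texts)
```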
You have readily available tools to webscrape in R and Python.
In R, you'll need:
and that's it! (for the easy stuff!)
go to Messi's Wikipedia page
scrape the statistics table reporting the goals scored for his clubs
plot the evolution of the number of goals in the national championship and in the Champions League
rvest
read_html to get the webpage content
html_elements to get the HTML elements with CSS selectors
html_table to extract the table
basic data management
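For comparison, the same three steps (get the page content, find the table, turn it into rows) can be sketched in Python with the standard library only. This is not the rvest solution: the `TableParser` class is a hypothetical helper, the page is an inline string instead of a download, and the seasons and numbers are placeholders, not real statistics.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cells of every <tr> as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text chunks of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Placeholder table: made-up seasons and values.
page = """
<table>
  <tr><th>Season</th><th>League goals</th></tr>
  <tr><td>Season A</td><td>10</td></tr>
  <tr><td>Season B</td><td>20</td></tr>
</table>
"""
parser = TableParser()
parser.feed(page)
parser.close()

# Basic data management: drop the header row and convert the counts to integers.
header, *records = parser.rows
goals = {season: int(count) for season, count in records}
print(header, goals)
```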
select all paragraphs containing the name Messi twice
extract these paragraphs and highlight the name of Messi with <strong>
write an HTML document containing only these paragraphs
add the following style in the headers:
<style>
p {
  border: 1pt solid;
  border-radius: 5px;
  box-shadow: 12px 12px 5px 2px rgba(0, 0, 255, 0.2);
  margin: 2em;
  padding: 1em;
}
</style>
Ex:
https://www.google.com/search?q=laurent+berge
You should be facing the following page (or similar):
the results page was substituted with a cookies consent page
we didn't get our results: DAMN!
How do we get the results from our query?
We should click on accept all/reject all and move on.
let's see how we do that in R!
Google will give you the results only if you validate the cookies policy.
The main purpose of cookies is to personalize your web experience.
Contrary to common belief, cookies aren't bad: You need cookies! Without cookies, the web would be pretty useless.
Cookies are just a piece of text data: name-value pairs.
Example of a cookie from Google:
name: NID
value: 511=oPzh2ocmoF48NQ_WC6j7WdML_8fDK[...]
When making an HTTP request, some of your cookies are sent along to the website in the request header.
The website's server uses that information to identify you and offer you personalized content.
Example of header:
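A minimal sketch of what such a header could look like, built with Python's standard `http.cookies` module. The cookie names, their values and the host are invented for illustration; real cookies, like the Google one above, are much longer.

```python
from http.cookies import SimpleCookie

# Cookies are name-value pairs; these two are invented for the example.
jar = SimpleCookie()
jar["session_id"] = "abc123"
jar["lang"] = "en"

# The browser sends them in a single "Cookie" request header.
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in jar.items()
)
request_headers = {"Host": "www.example.com", "Cookie": cookie_header}

# The server parses that header back into name-value pairs.
received = SimpleCookie()
received.load(request_headers["Cookie"])
parsed = {name: morsel.value for name, morsel in received.items()}

print(request_headers)
print(parsed)
```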
You can deal with cookies with rvest.
use session instead of read_html
rvest can send POST requests, and... good news: the Google accept/reject buttons are in fact forms
library(rvest)

# use 'session' and not read_html
gg_page = session("https://www.google.com/search?q=laurent+berge")
# get all the forms
form = html_form(gg_page)
# submit and proceed (we select the second form)
gg_page = session_submit(gg_page, form[[2]])

gather information on yourself using Google queries
keep only relevant information
compile that information in an HTML document
write a function automating the process and apply it to other people in the class