
Gathering data from the web using Python

Level:
intermediate
Room:
club a
Start:
Duration:
180 minutes

Abstract

Information is abundant and readily available on the internet. However, the sheer amount of data can be overwhelming and time-consuming to navigate. That's where web scraping comes in: a technique for extracting data from websites and turning it into a usable format.

In this tutorial, we will explore the basics of web scraping and how to implement it using Scrapy (a Python framework). Whether you are a data analyst, programmer, or researcher, this tutorial will equip you with the fundamental skills needed to create your own web scraper and extract valuable information from websites.

Tutorial, Python Libraries

Description

During the tutorial, we will learn web scraping concepts through exercises against a fake website whose challenges closely resemble the ones found on real websites. All exercises use Scrapy, a Python framework designed for web scraping tasks.

To follow the tutorial more easily, please install Scrapy on your machine beforehand. More information at https://github.com/rennerocha/europython-2023-gathering-data-tutorial#before-the-tutorial

  1. (presentation) Web scraping fundamentals

    • What web scraping is, why we want to do it, and real use cases for it
  2. (presentation) Scrapy basic concepts

    • What Scrapy is, what its advantages over other tools are, its basic classes (Spider, Request, Response, parsers) and how to create a very simple web crawler
  3. (presentation + exercise) Scraping a basic HTML page

    • Gathering data from http://quotes.toscrape.com/, a plain, well-structured HTML page
  4. (presentation + exercise) Scraping Javascript generated content (external API)

    • Gathering data from http://quotes.toscrape.com/scroll, where the data is gathered from an API call
  5. (presentation + exercise) Scraping Javascript generated content (data into HTML)

    • Gathering data from http://quotes.toscrape.com/js/, where the data to be gathered is located inside the HTML code, but processed by Javascript
  6. (presentation + exercise) Scraping page with forms and ViewState

    • Gathering data from http://quotes.toscrape.com/search.aspx, where we need to handle form submission and ViewState
  7. (presentation) Proxies and headless browsers

    • When simple requests are not enough, what other tools we can use
  8. (presentation) Being polite and not gathering data you shouldn't gather

    • How to avoid interfering with your target website, and some restrictions on what data you can and can't gather
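
As an illustration of item 5 in the outline above (data located inside the HTML but processed by Javascript), here is a minimal sketch that uses only the Python standard library. It assumes the page assigns a JSON array to a Javascript variable named `data` inside a script tag; the sample HTML, the variable name, the field names and the function `extract_embedded_quotes` are all hypothetical and only serve the example. In a Scrapy spider, the same logic could run inside a parse callback on `response.text`.

```python
import json
import re

# Hypothetical sample page: the quotes are not in the HTML markup itself,
# but in a JSON array that a script renders in the browser.
SAMPLE_HTML = """
<html><body>
<script>
    var data = [
        {"text": "A quote.", "author": {"name": "Jane Doe"}, "tags": ["life"]},
        {"text": "Another quote.", "author": {"name": "John Roe"}, "tags": []}
    ];
    for (var i in data) { document.write(data[i].text); }
</script>
</body></html>
"""


def extract_embedded_quotes(html: str) -> list[dict]:
    """Return the quotes embedded as JSON in a `var data = [...]` statement."""
    # Locate the start of the assignment (assumed variable name: data).
    match = re.search(r"var\s+data\s*=\s*", html)
    if match is None:
        return []
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever Javascript follows it.
    raw, _end = json.JSONDecoder().raw_decode(html, match.end())
    return [
        {"text": q["text"], "author": q["author"]["name"], "tags": q["tags"]}
        for q in raw
    ]


if __name__ == "__main__":
    for quote in extract_embedded_quotes(SAMPLE_HTML):
        print(quote)
```

Decoding the embedded JSON directly avoids running a headless browser when the data is already present in the page source.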

The speaker

Renne Rocha

Renne Rocha is a Senior Python Developer at Six Feet Up. He holds a degree in Electrical Engineering and has worked in software development for over 10 years, using Python for most of his career. He has presented several talks at Python conferences throughout Brazil. He is co-founder of Laboratório Hacker de Campinas, one of the first hackerspaces in Brazil. He is a woodworker (in training) and a homebrewer in his spare time.

