This lesson teaches learners with basic Python knowledge the tools and libraries used for web scraping, that is, extracting data from websites. It consists of three episodes.
Episode 1 begins with an introduction to how websites are structured using HTML. You’ll learn how to explore this structure using your browser and how to extract information from it using the BeautifulSoup package.
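As a small preview of Episode 1, the sketch below shows how BeautifulSoup can pull text out of HTML elements. The HTML snippet and the `episode` class name are made up for illustration, not taken from the lesson:

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for a real webpage.
html = """
<html>
  <body>
    <h1>Workshop schedule</h1>
    <ul>
      <li class="episode">Introduction to HTML</li>
      <li class="episode">Requests and parsing</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; get_text() extracts its text.
title = soup.find("h1").get_text()

# find_all() returns every matching tag, here filtered by CSS class.
episodes = [li.get_text() for li in soup.find_all("li", class_="episode")]

print(title)     # Workshop schedule
print(episodes)  # ['Introduction to HTML', 'Requests and parsing']
```

The same `find`/`find_all` pattern works on the HTML of any real page once you have retrieved it.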
In Episode 2, you’ll learn how to retrieve the HTML of a webpage using the requests package and continue practicing how to parse and extract specific content with BeautifulSoup.
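The Episode 2 workflow of fetching a page and then parsing it can be sketched roughly as follows. The URL here is just a stand-in (a stable public test page), not a page used in the lesson:

```python
import requests
from bs4 import BeautifulSoup

# Stand-in URL; any static webpage works the same way.
url = "https://example.com"

# Download the page; timeout avoids hanging on an unresponsive server.
response = requests.get(url, timeout=10)
response.raise_for_status()  # raise an error for 4xx/5xx responses

# Hand the HTML text to BeautifulSoup for parsing.
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.get_text())  # the page's <title> text
```

Calling `raise_for_status()` right after the request is a simple way to catch failed downloads before you try to parse anything.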
Toward the end of the workshop, in Episode 3, you'll explore the difference between static and dynamic webpages and learn how to scrape dynamic content using Selenium.
This workshop is intended for learners who already have a basic understanding of Python. In particular, you should be comfortable with:
- Installing and importing packages and modules
- Using lists and dictionaries
- Using conditional statements (if, elif, else)
- Using for loops
- Calling functions and understanding parameters, arguments, and return values
The rendered version of the lesson is available at: https://ucsbcarpentry.github.io/web-scraping-python/
We'd love to know if you are teaching this lesson and the suggestions you have for improving it!
You can do this by submitting an issue in this repo, or sending an email to dreamlab@library.ucsb.edu or jose_nino@ucsb.edu.
If you want to know more about contributing to this lesson and other Carpentries efforts, please read the CONTRIBUTING guide.
Current maintainer of this lesson: Jose Niño Muriel
Thanks to Noah Spahn, Ronald Lencevičius, and Seth Erickson for their feedback when this workshop was first taught at UCSB.
Please cite this lesson using the information in the CITATION.CFF file when you refer to it in publications, and/or if you re-use, adapt, or expand on the content in your own training material.
This instructional content is made available for use and adaptation under the Creative Commons Attribution license (CC BY 4.0). Review the LICENSE.md file for additional information.