{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Why soupsavvy? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might wonder, *If I can achieve everything I need with `BeautifulSoup`, why should I bother with `soupsavvy` on top of it?* \n", "Here are some reasons to consider giving it a try!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Encapsulated logic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of selecting from a variety of search methods in `BeautifulSoup`, `soupsavvy` offers a streamlined, consistent interface. \n", "The logic is encapsulated in declared selectors, so there's no need to write nested loops or complex conditionals." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### BeautifulSoup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "
\n", " Event\n", " party\n", "
\n", "    \"\"\",\n", "    features=\"lxml\",\n", ")\n", "\n", "party = None\n", "for div in soup.find_all(\"div\"):\n", "    for event in div.find_all(class_=\"event\", recursive=False):\n", "        party = event.find_next_sibling(\"span\", string=\"party\")\n", "        if party is not None:\n", "            break\n", "    if party is not None:\n", "        # also leave the outer loop so a later div cannot overwrite the match\n", "        break\n", "party" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### soupsavvy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, PatternSelector, TypeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", "    \"\"\"\n", "
\n", " Event\n", " party\n", "
\n", "    \"\"\",\n", "    features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "selector = TypeSelector(\"div\") > ClassSelector(\"event\") + (\n", "    TypeSelector(\"span\") & PatternSelector(\"party\")\n", ")\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Missing elements" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In `BeautifulSoup`, you often have to check for missing elements before interacting with them, which clutters your code. `soupsavvy` selectors handle this for you: by default, `find` returns `None` when any part of the selector has no match. If you need stricter control, pass `strict=True` to raise an exception when the required element isn't found." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### BeautifulSoup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "soup = BeautifulSoup(\n", "    \"\"\"\n", "
\n", " No event here\n", " No party\n", "
\n", "    \"\"\",\n", "    features=\"lxml\",\n", ")\n", "\n", "event = soup.find(class_=\"event\")\n", "\n", "if event is not None:\n", "    party = event.find_next_sibling(string=\"party\")\n", "else:\n", "    print(\"This needs to be handled explicitly every time.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### soupsavvy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, PatternSelector, to_soupsavvy\n", "from soupsavvy.exceptions import TagNotFoundException\n", "\n", "soup = BeautifulSoup(\n", "    \"\"\"\n", "
\n", " No event here\n", " No party\n", "
\n", "    \"\"\",\n", "    features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = ClassSelector(\"event\") + PatternSelector(\"party\")\n", "assert selector.find(element) is None\n", "\n", "try:\n", "    selector.find(element, strict=True)\n", "except TagNotFoundException as e:\n", "    print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Combining selectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combining selectors in `BeautifulSoup` can be cumbersome, especially when mixing different search methods or performing set operations like unions or intersections. With `soupsavvy`, the logical operators `&` and `|` let you combine selectors without worrying about hash collisions or element order." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### BeautifulSoup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "soup = BeautifulSoup(\n", "    \"\"\"\n", "

Festival

\n", "
\n", " Event\n", " Menu\n", "
\n", "
\n", " Menu\n", "
\n", "
\n", " Event\n", "
\n", " Event\n", "    \"\"\",\n", "    features=\"lxml\",\n", ")\n", "\n", "result1 = soup.find_all(\"span\", string=\"Event\")\n", "result2 = soup.select(\":last-child\")\n", "result3 = soup.find_all(class_=\"special\")\n", "\n", "# bs4 tags hash by their markup, so elements with identical markup collide in a set:\n", "# the repeated Event element is included only once!\n", "# Sets are unordered, so the document order of the elements is not preserved\n", "\n", "(set(result1) & set(result2)) | set(result3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### soupsavvy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, PatternSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.selectors.css import LastChild\n", "\n", "soup = BeautifulSoup(\n", "    \"\"\"\n", "

Festival

\n", "
\n", " Event\n", " Menu\n", "
\n", "
\n", " Menu\n", "
\n", "
\n", " Event\n", "
\n", " Event\n", "    \"\"\",\n", "    features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = (\n", "    PatternSelector(\"Event\") & TypeSelector(\"span\") & LastChild()\n", ") | ClassSelector(\"special\")\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data pipelines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often, selecting an element is just the first step: you still need to extract and transform the data afterward. `soupsavvy` lets you pipe operations directly into selectors with the `|` operator, so extraction and transformation happen in the same expression as the selection." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### BeautifulSoup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datetime import datetime\n", "\n", "from bs4 import BeautifulSoup\n", "\n", "soup = BeautifulSoup(\n", "    \"\"\"\n", "

Event

\n", " 2023-10-30\n", " 2023-08-31\n", "    \"\"\",\n", "    features=\"lxml\",\n", ")\n", "\n", "date_elements = soup.find_all(class_=\"date\")\n", "dates = [\n", "    datetime.strptime(element.get_text(strip=True), \"%Y-%m-%d\")\n", "    for element in date_elements\n", "]\n", "dates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### soupsavvy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datetime import datetime\n", "\n", "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, to_soupsavvy\n", "from soupsavvy.operations import Operation, Text\n", "\n", "soup = BeautifulSoup(\n", "    \"\"\"\n", "

Event

\n", " 2023-10-30\n", " 2023-08-31\n", "    \"\"\",\n", "    features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = ClassSelector(\"date\") | Text() | Operation(datetime.strptime, \"%Y-%m-%d\")\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Structured information" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extracting structured information in `BeautifulSoup` often requires repetitive boilerplate code. \n", "With `soupsavvy`, you declare a reusable model once: `__scope__` selects each item, and every field is a selector, optionally piped through operations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### BeautifulSoup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dataclasses import dataclass\n", "\n", "from bs4 import BeautifulSoup\n", "\n", "\n", "@dataclass\n", "class Book:\n", "    title: str\n", "    price: int\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

100$

\n", "
\n", "
\n", "

Brave New World

\n", "

80$

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "\n", "books = []\n", "book_elements = soup.find_all(\"div\", class_=\"book\")\n", "\n", "for book_element in book_elements:\n", "    title = book_element.find(class_=\"title\")\n", "\n", "    if title is None:\n", "        raise ValueError(\"Title not found\")\n", "\n", "    title = title.get_text(strip=True)\n", "\n", "    price = book_element.find(class_=\"price\")\n", "\n", "    if price is None:\n", "        raise ValueError(\"Price not found\")\n", "\n", "    price = int(price.get_text(strip=True).replace(\"$\", \"\"))\n", "    book = Book(title, price)\n", "    books.append(book)\n", "\n", "books" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### soupsavvy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Operation, Text\n", "\n", "\n", "class Book(BaseModel):\n", "\n", "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", "    title = ClassSelector(\"title\") | Text()\n", "    price = (\n", "        ClassSelector(\"price\")\n", "        | Text()\n", "        | Operation(lambda x: x.strip(\"$\"))\n", "        | Operation(int)\n", "    )\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

100$

\n", "
\n", "
\n", "

Brave New World

\n", "

80$

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "Book.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "By using `soupsavvy`, you not only simplify your code but also gain powerful tools to handle complex selection and extraction tasks with ease. \n", "It's a great way to keep your web scraping modules clean, concise and less error-prone.\n", "\n", "**Enjoy `soupsavvy` and leave us feedback!** \n", "**Happy scraping!**" ] } ], "metadata": { "kernelspec": { "display_name": "soupsavvy", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 2 }