{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Why soupsavvy?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might wonder, *If I can achieve everything I need with `BeautifulSoup`, why should I bother with `soupsavvy` on top of it?* \n", "Here are some reasons to consider giving it a try!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Encapsulated logic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Where `BeautifulSoup` leaves you to choose among a variety of search methods, `soupsavvy` offers a streamlined, consistent interface. \n", "The logic is encapsulated in declared selectors, so there's no need to write nested loops or complex conditionals." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### BeautifulSoup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datetime import datetime\n", "\n", "from bs4 import BeautifulSoup\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "
Festival
\n", "Festival
\n", "Event
\n", " 2023-10-30\n", " 2023-08-31\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "\n", "date_elements = soup.find_all(class_=\"date\")\n", "dates = [\n", " datetime.strptime(element.get_text(strip=True), \"%Y-%m-%d\")\n", " for element in date_elements\n", "]\n", "dates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### soupsavvy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datetime import datetime\n", "\n", "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, to_soupsavvy\n", "from soupsavvy.operations import Operation, Text\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "Event
\n", " 2023-10-30\n", " 2023-08-31\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = ClassSelector(\"date\") | Text() | Operation(datetime.strptime, \"%Y-%m-%d\")\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Structured information" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extracting structured information in `BeautifulSoup` often requires repetitive boilerplate code. \n", "With `soupsavvy`, you can define flexible, reusable data extraction schemas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### BeautifulSoup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dataclasses import dataclass\n", "\n", "from bs4 import BeautifulSoup\n", "\n", "\n", "@dataclass\n", "class Book:\n", " title: str\n", " price: float\n", "\n", "\n", "text = \"\"\"\n", "Animal Farm
\n", "100$
\n", "Brave New World
\n", "80$
\n", "Animal Farm
\n", "100$
\n", "Brave New World
\n", "80$
\n", "