{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Why soupsavvy? "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You might wonder, *If I can achieve everything I need with `BeautifulSoup`, why should I bother with `soupsavvy` on top of it?*  \n",
    "Here are some reasons to consider giving it a try!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Encapsulated logic"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instead of selecting from a variety of search methods in `BeautifulSoup`, `soupsavvy` offers a streamlined, consistent interface.  \n",
    "The logic is encapsulated in declared selectors, so there's no need to write nested loops or complex conditionals."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### BeautifulSoup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "        <div>\n",
    "            <span class=\"event\">Event</span>\n",
    "            <span>party</span>\n",
    "        </div>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "\n",
    "for div in soup.find_all(\"div\"):\n",
    "    for event in div.find_all(class_=\"event\", recursive=False):\n",
    "        party = event.find_next_sibling(\"span\", string=\"party\")\n",
    "        if party is not None:\n",
    "            break\n",
    "party"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### soupsavvy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, PatternSelector, TypeSelector, to_soupsavvy\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "        <div>\n",
    "            <span class=\"event\">Event</span>\n",
    "            <span>party</span>\n",
    "        </div>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "selector = TypeSelector(\"div\") > ClassSelector(\"event\") + (\n",
    "    TypeSelector(\"span\") & PatternSelector(\"party\")\n",
    ")\n",
    "selector.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Missing elements"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In `BeautifulSoup`, you often have to deal with missing elements before interacting with them, which clutters your code. `soupsavvy` selectors handle this for you automatically. If you need stricter control, the `strict` mode raises an exception if the required element isn't found."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### BeautifulSoup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "        <div>\n",
    "            <span>No event here</span>\n",
    "            <span>No party</span>\n",
    "        </div>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "\n",
    "event = soup.find(class_=\"event\")\n",
    "\n",
    "if event is not None:\n",
    "    party = event.find_next_sibling(string=\"party\")\n",
    "else:\n",
    "    print(\"This needs to be handled explicitly every time.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### soupsavvy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, PatternSelector, to_soupsavvy\n",
    "from soupsavvy.exceptions import TagNotFoundException\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "        <div>\n",
    "            <span>No event here</span>\n",
    "            <span>No party</span>\n",
    "        </div>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "selector = ClassSelector(\"event\") + PatternSelector(\"party\")\n",
    "assert selector.find(element) is None\n",
    "\n",
    "try:\n",
    "    selector.find(element, strict=True)\n",
    "except TagNotFoundException as e:\n",
    "    print(e)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Combining selectors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Combining selectors in `BeautifulSoup` can be cumbersome, especially when trying to use different methods or perform set operations like unions or intersections. With `soupsavvy`, logical operators allow you to easily combine selectors without worrying about hash collisions or element order."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### BeautifulSoup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "        <p class=\"special\">Festival</p>\n",
    "        <div>\n",
    "            <span>Event</span>\n",
    "            <span>Menu</span>\n",
    "        </div>\n",
    "        <div>\n",
    "            <span>Menu</span>\n",
    "        </div>\n",
    "        <div>\n",
    "            <span>Event</span>\n",
    "        </div>\n",
    "        <span>Event</span>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "\n",
    "result1 = soup.find_all(\"span\", string=\"Event\")\n",
    "result2 = soup.select(\":last-child\")\n",
    "result3 = soup.find_all(class_=\"special\")\n",
    "\n",
    "# elements with the same text representation have the same hash\n",
    "# <span>Event</span> is included only once!\n",
    "# There is no guarantee that the order of the elements will be preserved\n",
    "\n",
    "(set(result1) & set(result2)) | set(result3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### soupsavvy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, PatternSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.selectors.css import LastChild\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "        <p class=\"special\">Festival</p>\n",
    "        <div>\n",
    "            <span>Event</span>\n",
    "            <span>Menu</span>\n",
    "        </div>\n",
    "        <div>\n",
    "            <span>Menu</span>\n",
    "        </div>\n",
    "        <div>\n",
    "            <span>Event</span>\n",
    "        </div>\n",
    "        <span>Event</span>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "selector = (\n",
    "    PatternSelector(\"Event\") & TypeSelector(\"span\") & LastChild()\n",
    ") | ClassSelector(\"special\")\n",
    "selector.find_all(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Pipelines"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Often, selecting an element is just the first step, you need to extract and transform the data afterward. `soupsavvy` lets you pipe operations directly into selectors, enabling you to transform and extract data seamlessly without additional code."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### BeautifulSoup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datetime import datetime\n",
    "\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "        <p>Event</p>\n",
    "        <span class=\"date\">2023-10-30</span>\n",
    "        <span class=\"date\">2023-08-31</span>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "\n",
    "date_elements = soup.find_all(class_=\"date\")\n",
    "dates = [\n",
    "    datetime.strptime(element.get_text(strip=True), \"%Y-%m-%d\")\n",
    "    for element in date_elements\n",
    "]\n",
    "dates"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### soupsavvy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datetime import datetime\n",
    "\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, to_soupsavvy\n",
    "from soupsavvy.operations import Operation, Text\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "        <p>Event</p>\n",
    "        <span class=\"date\">2023-10-30</span>\n",
    "        <span class=\"date\">2023-08-31</span>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "selector = ClassSelector(\"date\") | Text() | Operation(datetime.strptime, \"%Y-%m-%d\")\n",
    "selector.find_all(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Structured information"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Extracting structured information in `BeautifulSoup` often requires repetitive boilerplate code.  \n",
    "With `soupsavvy`, you can define flexible, reusable data extraction schemas."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### BeautifulSoup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataclasses import dataclass\n",
    "\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "\n",
    "@dataclass\n",
    "class Book:\n",
    "    title: str\n",
    "    price: float\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\">100$</p>\n",
    "    </div>\n",
    "    <div class=\"book\">\n",
    "        <p class=\"title\">Brave New World  </p>\n",
    "        <p class=\"price\">80$</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "\n",
    "books = []\n",
    "book_elements = soup.find_all(\"div\", class_=\"book\")\n",
    "\n",
    "for book_element in book_elements:\n",
    "    title = book_element.find(class_=\"title\")\n",
    "\n",
    "    if title is None:\n",
    "        raise ValueError(\"Title not found\")\n",
    "\n",
    "    title = title.get_text(strip=True)\n",
    "\n",
    "    price = book_element.find(class_=\"price\")\n",
    "\n",
    "    if price is None:\n",
    "        raise ValueError(\"Price not found\")\n",
    "\n",
    "    price = int(price.get_text(strip=True).replace(\"$\", \"\"))\n",
    "    book = Book(title, price)\n",
    "    books.append(book)\n",
    "\n",
    "books"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### soupsavvy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Operation, Text\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = (\n",
    "        ClassSelector(\"price\")\n",
    "        | Text()\n",
    "        | Operation(lambda x: x.strip(\"$\"))\n",
    "        | Operation(int)\n",
    "    )\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\">100$</p>\n",
    "    </div>\n",
    "    <div class=\"book\">\n",
    "        <p class=\"title\">Brave New World  </p>\n",
    "        <p class=\"price\">80$</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "Book.find_all(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "By using `soupsavvy`, you not only simplify your code but also gain powerful tools to handle complex selection and extraction tasks with ease.  \n",
    "It's a great way to keep your web scraping modules clean, concise and less error-prone.\n",
    "\n",
    "**Enjoy `soupsavvy` and leave us feedback!**  \n",
    "**Happy scraping!**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "soupsavvy",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}