{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A `soupsavvy` **Model** is a user-defined **scraping schema** that uses selectors and operations to extract structured information from a webpage. A model represents an entity of interest in scraping, such as a product, article, or job."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Operations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Operations encapsulate transformation logic, such as extracting text, converting data types, or applying custom transformations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Applying operation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from soupsavvy.operations import Operation\n",
    "\n",
    "operation = Operation(lambda x: x.strip(\"$\"))\n",
    "operation.execute(\"100$\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Operation` accepts additional positional and keyword arguments, which are passed to the wrapped function together with the input value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datetime import datetime\n",
    "\n",
    "from soupsavvy.operations import Operation\n",
    "\n",
    "operation = Operation(datetime.strptime, \"%d-%m-%Y\")\n",
    "operation.execute(\"01-02-2020\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Chaining operations\n",
    "\n",
    "Operations can be chained with the `|` operator to apply several of them in sequence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from soupsavvy.operations import Operation\n",
    "\n",
    "operation = (\n",
    "    Operation(lambda x: x.strip(\"$\")) | Operation(int) | Operation(lambda x: x * 2)\n",
    ")\n",
    "operation.execute(\"100$\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Text` is a built-in operation that extracts the text content of an element. It is one of the most common operations in web scraping."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy.operations import Text\n",
    "from soupsavvy import to_soupsavvy\n",
    "\n",
    "text = \"\"\"\n",
    "    <p class=\"title\">Animal Farm</p>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup.p)\n",
    "operation = Text()\n",
    "operation.execute(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Href"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Href` is a built-in operation that extracts the value of the `href` attribute from an element. If the element has no `href` attribute, it returns `None`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy.operations import Href\n",
    "from soupsavvy import to_soupsavvy\n",
    "\n",
    "text = \"\"\"\n",
    "    <a href=\"www.book.com\">Animal Farm</a>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup.a)\n",
    "operation = Href()\n",
    "operation.execute(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Parent` is an operation that extracts the parent element of the current element. It can be used as a selector as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy.operations import Parent\n",
    "from soupsavvy import to_soupsavvy\n",
    "\n",
    "text = \"\"\"\n",
    "    <div><a href=\"www.book.com\">Animal Farm</a></div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup.a)\n",
    "operation = Parent()\n",
    "operation.execute(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Combining with selector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selectors in `soupsavvy` can be combined with operations using the `|` operator. The resulting pipeline first locates the element and then applies the operation to it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, to_soupsavvy\n",
    "from soupsavvy.operations import Operation, Text\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\">100$</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "selector = ClassSelector(\"book\") > ClassSelector(\"price\")\n",
    "operation = Text() | Operation(lambda x: x.strip(\"$\")) | Operation(int)\n",
    "\n",
    "pipeline = selector | operation\n",
    "pipeline.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Definition"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A user-defined model in `soupsavvy` must:\n",
    "\n",
    "- Inherit from `soupsavvy.models.BaseModel`.\n",
    "- Define a `__scope__` class attribute specifying the HTML element containing the model's fields.\n",
    "- Include at least one field as a class attribute.\n",
    "\n",
    "**Scope:** A selector that identifies the HTML element encapsulating all fields of the model.\n",
    "\n",
    "**Fields:** Class attributes that extract data from within the scope element. These can be:\n",
    "\n",
    "- Selectors, e.g., `ClassSelector(\"book\")`\n",
    "- Selector-operation pipelines, e.g., `ClassSelector(\"price\") | Text() | Operation(int)`\n",
    "- Another model class inheriting from `BaseModel`\n",
    "- Operations, e.g., `Text()`, `Href()`, or a custom `Operation()`\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Finding model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `find` method of a model class extracts the model from an element, here a `bs4` object converted with `to_soupsavvy`. It returns a model instance built from the first scope element found.\n",
    "\n",
    "The `Book` class below defines a model expected to be contained within a `div.book` element, with two fields:\n",
    "\n",
    "- **`title`**: Extracts text from `.title`.\n",
    "- **`price`**: Extracts text from `.price` and converts it to an integer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Operation, Text\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = ClassSelector(\"price\") | Text() | Operation(int)\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\">100</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "Book.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, if the `find` method doesn't locate the scope in the provided element, it returns `None` and no model is extracted.\n",
    "\n",
    "In `strict` mode, however, a failed scope search raises `ModelNotFoundException` instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.exceptions import ModelNotFoundException\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Operation, Text\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = ClassSelector(\"price\") | Text() | Operation(int)\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"ebook\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\">100</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "result = Book.find(element)\n",
    "assert result is None\n",
    "\n",
    "try:\n",
    "    Book.find(element, strict=True)\n",
    "except ModelNotFoundException as e:\n",
    "    print(e)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, errors raised during data extraction are propagated, and the model is not built. For instance, if the `price` element isn't found, the `Text` operation fails because it can't extract text from `None`.\n",
    "\n",
    "The `strict` parameter only affects the scope search, not individual field selectors. Field selectors are forgiving: when a selector finds nothing, it returns `None` and the pipeline continues, passing `None` to the next step. Edge cases must therefore be handled explicitly within the model definition."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.exceptions import FieldExtractionException\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Operation, Text\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = ClassSelector(\"price\") | Text() | Operation(int)\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "try:\n",
    "    Book.find(element)\n",
    "except FieldExtractionException as e:\n",
    "    print(e)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Operations as fields"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As noted earlier, operations can be used as fields in the model to extract and transform data from the scope element. For example, `Operation` can extract the `id` attribute, and `Href` can be used to retrieve the `href` attribute from an element."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Operation, Href\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\")\n",
    "\n",
    "    id = Operation(lambda x: x.get()[\"id\"])\n",
    "    link = Href()\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div id=\"book1\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\">100</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "Book.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Wrappers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wrappers are composite components that modify the behavior of operations or selectors.  \n",
    "They handle edge cases, like missing data, more gracefully in the model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### SkipNone"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `SkipNone` wrapper prevents operations like text extraction or type conversion from running when the input is `None`, so a missing element no longer causes the wrapped operations to fail."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Operation, SkipNone, Text\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = ClassSelector(\"price\") | SkipNone(Text() | Operation(int))\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "Book.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Suppress"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `Suppress` operation wrapper catches exceptions raised during execution and returns `None` when one occurs. This is useful for handling potential incompatibilities, such as converting an empty string to an integer.\n",
    "The optional `category` parameter accepts an exception type or a tuple of exception types; when it is provided, only exceptions of those types are suppressed.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Operation, Suppress, Text\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = (\n",
    "        ClassSelector(\"price\") | Text() | Suppress(Operation(int), category=ValueError)\n",
    "    )\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\"></p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "Book.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Default\n",
    "\n",
    "The `Default` wrapper provides a default value when a field selector returns `None`, allowing for specific interpretations, like treating an empty `price` as `0`. It does not suppress exceptions that arise during extraction, so it is combined with `Suppress` when the conversion itself may fail."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel, Default\n",
    "from soupsavvy.operations import Operation, Suppress, Text\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = Default(ClassSelector(\"price\") | Text() | Suppress(Operation(int)), 0)\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\">hundred</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "Book.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### IfElse"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `IfElse` operation enables conditional data transformations, taking three arguments:\n",
    "\n",
    "- **`condition` (callable):** A function that determines which operation to execute.\n",
    "- **`if_` (operation):** The operation performed if the condition is met.\n",
    "- **`else_` (operation):** The operation executed if the condition is not met.\n",
    "\n",
    "This is useful for applying different transformations based on the HTML structure or values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Href, IfElse, Operation, Text\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = (\n",
    "        ClassSelector(\"title\")\n",
    "        | Text()\n",
    "        | IfElse(lambda x: x == \"\", Operation(lambda x: None), Operation(str.upper))\n",
    "    )\n",
    "    price = ClassSelector(\"price\") | IfElse(lambda x: x.name == \"a\", Href(), Text())\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\">10</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "Book.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Break and Continue"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Additionally, `Break` and `Continue` operations enhance `IfElse` by providing control flow capabilities:\n",
    "\n",
    "- **`Break`:** Terminates the operation pipeline.\n",
    "- **`Continue`:** Skips the current operation and moves to the next one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Break, Continue, IfElse, Operation, Text\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = (\n",
    "        ClassSelector(\"title\")\n",
    "        | Text()\n",
    "        | IfElse(\n",
    "            lambda x: x == \"\",\n",
    "            Break(),\n",
    "            Operation(str.upper),\n",
    "        )\n",
    "        | Operation(lambda x: x + \"!\")\n",
    "    )\n",
    "    price = (\n",
    "        ClassSelector(\"divider\")\n",
    "        | Text()\n",
    "        | Operation(int)\n",
    "        | IfElse(lambda x: x == 0, Continue(), Operation(lambda x: 100 / x))\n",
    "        | Operation(lambda x: f\"{x}$\")\n",
    "    )\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\"></p>\n",
    "        <p class=\"divider\">0</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "Book.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Required"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, all fields in a model are nullable: a field is set to `None` if its selector returns `None`. You can change this behavior with the `Required` field wrapper, which enforces a non-null value for the field and makes extraction fail otherwise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.exceptions import FieldExtractionException\n",
    "from soupsavvy.models import BaseModel, Required\n",
    "from soupsavvy.operations import Operation, SkipNone, Text\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = Required(ClassSelector(\"price\") | SkipNone(Text() | Operation(int)))\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "try:\n",
    "    Book.find(element)\n",
    "except FieldExtractionException as e:\n",
    "    print(e)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### All"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When multiple matching elements are expected within the scope, use the `All` field wrapper.  \n",
    "This wrapper extracts all elements matching the field selector, so the field holds a collection of results rather than a single value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import All, BaseModel\n",
    "from soupsavvy.operations import Operation, Text\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = All(ClassSelector(\"price\") | Text() | Operation(int))\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\"><s>100</s></p>\n",
    "        <p class=\"price\"><s>80</s></p>\n",
    "        <p class=\"price\">60</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "Book.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Post-Initialization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To handle further transformations of extracted fields, you can define the `__post_init__` method in your model class, similar to Python's `dataclass`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import All, BaseModel\n",
    "from soupsavvy.operations import Operation, Text\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = All(ClassSelector(\"price\") | Text() | Operation(int))\n",
    "\n",
    "    def __post_init__(self) -> None:\n",
    "        self.price = min(self.price)  # type: ignore\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\"><s>100</s></p>\n",
    "        <p class=\"price\"><s>80</s></p>\n",
    "        <p class=\"price\">60</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "Book.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also define individual post-processing methods for each field. Such a method can have any name, but it must be decorated with `@soupsavvy.models.post`, which takes the name of the field it processes.\n",
    "\n",
    "This transformation is applied before `__post_init__` and before instance attributes are assigned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel, post\n",
    "from soupsavvy.operations import Text\n",
    "from soupsavvy.selectors.css import LastOfType\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = ClassSelector(\"price\") | Text()\n",
    "    author = (LastOfType() & TypeSelector(\"p\")) | Text()\n",
    "\n",
    "    @post(\"title\")\n",
    "    def process_title(self, value: str) -> str:\n",
    "        return value.upper()\n",
    "\n",
    "    @post(\"price\")\n",
    "    def process_price(self, value: str) -> int:\n",
    "        return int(value.strip(\"$\"))\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\">100$</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "Book.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inheritance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, fields in a model are inherited, allowing subclasses to extend parent models easily. For example, the `eBook` model inherits from the `Book` model, adding fields like `link` and `duration`. It can also override `__scope__`, although this isn't required, as all special fields are inherited."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, PatternSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Href, Operation, Text\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = ClassSelector(\"price\") | Text() | Operation(int)\n",
    "\n",
    "\n",
    "class eBook(Book):\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"ebook\")\n",
    "\n",
    "    link = Href()\n",
    "    duration = PatternSelector(re.compile(r\"\\d{1,2}:\\d{2}\")) | Text()\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"ebook\" href=\"www.ebook.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\">50</p>\n",
    "        <p>George Orwell</p>\n",
    "        <p>2:30</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "eBook.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Scope"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's advisable to use the most specific scope selector to ensure that only relevant elements are matched for the model.  \n",
    "You can use `HasSelector` to extend your selection criteria by matching elements that contain the fields needed for extraction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, HasSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Operation, Text\n",
    "\n",
    "PRICE_SELECTOR = ClassSelector(\"price\")\n",
    "TITLE_SELECTOR = ClassSelector(\"title\")\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = (\n",
    "        ClassSelector(\"book\")\n",
    "        & HasSelector(PRICE_SELECTOR)\n",
    "        & HasSelector(TITLE_SELECTOR)\n",
    "    )\n",
    "\n",
    "    title = TITLE_SELECTOR | Text()\n",
    "    price = PRICE_SELECTOR | Text() | Operation(int)\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\">Unavailable</div>\n",
    "    <div class=\"book\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p>George Orwell</p>\n",
    "        <p>4:30</p>\n",
    "    </div>\n",
    "    <div class=\"book\">\n",
    "        <p class=\"price\">50</p>\n",
    "        <p>Lois Lowry</p>\n",
    "        <p>3:30</p>\n",
    "    </div>\n",
    "    <div class=\"book\">\n",
    "        <p class=\"title\">Brave New World</p>\n",
    "        <p class=\"price\">50</p>\n",
    "        <p>Aldous Huxley</p>\n",
    "        <p>2:30</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "Book.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When the scope element is the same as the current element, `SelfSelector` can be used as the scope selector."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, SelfSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Text\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "    __scope__ = SelfSelector()\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    author = ClassSelector(\"author\") | Text()\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"author\">George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "div = element.find_all(\"div\")[0]\n",
    "\n",
    "Book.find(div)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Finding all"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `find_all` method returns a list of model instances for all elements that match the scope selector."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Operation, Text\n",
    "from soupsavvy.selectors.css import LastOfType\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = ClassSelector(\"price\") | Text() | Operation(int)\n",
    "    author = (LastOfType() & TypeSelector(\"p\")) | Text()\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"ebook\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\">100</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "    <div class=\"book\">\n",
    "        <p class=\"title\">Brave New World</p>\n",
    "        <p class=\"price\">100</p>\n",
    "        <p>Aldous Huxley</p>\n",
    "    </div>\n",
    "    <div class=\"book\">\n",
    "        <p class=\"title\">The Giver</p>\n",
    "        <p class=\"price\">80</p>\n",
    "        <p>Lois Lowry</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "Book.find_all(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Recursive option"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `recursive` option applies only to the scope search. When set to `True`, the model's scope is searched among all descendants of the specified tag; when set to `False`, only direct children are considered. Field selectors, however, always search recursively, regardless of this setting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Operation, Text\n",
    "from soupsavvy.selectors.css import LastOfType\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = ClassSelector(\"price\") | Text() | Operation(int)\n",
    "    author = (LastOfType() & TypeSelector(\"p\")) | Text()\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <span>\n",
    "        <div class=\"book\">\n",
    "            <p class=\"title\">Not a child</p>\n",
    "            <p class=\"price\">200</p>\n",
    "            <p>Author</p>\n",
    "        </div>\n",
    "    </span>\n",
    "    <div class=\"book\">\n",
    "        <span>\n",
    "            <p class=\"title\">Animal Farm</p>\n",
    "            <p class=\"price\">100</p>\n",
    "            <p>George Orwell</p>\n",
    "        </span>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"html.parser\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "Book.find(element, recursive=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To restrict field searches to only the children of the scope element, you can use a relative selector.  \n",
    "To find out more, see [docs](https://soupsavvy.readthedocs.io/en/stable/demos/combining.html#relative-selectors)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import Anchor, ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Operation, Text\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = (Anchor > ClassSelector(\"price\")) | Text() | Operation(int)\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <span>\n",
    "            <p class=\"title\">Animal Farm</p>\n",
    "            <p class=\"price\">100</p>\n",
    "            <p class=\"price\">50</p>\n",
    "            <span class=\"author\">\n",
    "                <p>George Orwell</p>\n",
    "            </span>\n",
    "        </span>\n",
    "        <p class=\"price\">200</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "Book.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model Composition"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Any model class can be used as a field selector, like the `Author` class in this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from datetime import datetime\n",
    "\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, PatternSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Operation, Text\n",
    "from soupsavvy.selectors.css import FirstChild\n",
    "\n",
    "\n",
    "class Author(BaseModel):\n",
    "    __scope__ = ClassSelector(\"author\")\n",
    "\n",
    "    birth = (\n",
    "        PatternSelector(re.compile(r\"\\d{4}-\\d{2}-\\d{2}\"))\n",
    "        | Text()\n",
    "        | Operation(lambda x: datetime.strptime(x, \"%Y-%m-%d\"))\n",
    "    )\n",
    "    name = FirstChild() | Text()\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    author = Author\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <div class=\"author\">\n",
    "            <p>George Orwell</p>\n",
    "            <p>Great author</p>\n",
    "            <p>1903-06-25</p>\n",
    "        </div>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "Book.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Frozen Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make a model immutable, set the `__frozen__` attribute to `True`. A frozen model is also hashable. Modifying any field of a frozen instance raises a `FrozenModelException`. Regardless of immutability, attempting to set an attribute that is not defined as a field raises an `AttributeError`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.exceptions import FrozenModelException\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Text\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "    __frozen__ = True\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = ClassSelector(\"price\") | Text()\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\">100$</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "result = Book.find(element)\n",
    "\n",
    "try:\n",
    "    result.price = 200  # type: ignore\n",
    "except FrozenModelException as e:\n",
    "    print(e)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Field"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, every field takes part in the following instance operations:\n",
    "\n",
    "- **String Representation**\n",
    "- **Equality Comparison**\n",
    "- **Hash Calculation**\n",
    "- **Migration**\n",
    "\n",
    "To exclude a field from these operations, wrap it in the `Field` class, which accepts the following boolean parameters, each defaulting to `True`:\n",
    "\n",
    "- **`repr`**\n",
    "- **`compare`**\n",
    "- **`migrate`**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel, Field\n",
    "from soupsavvy.operations import Text\n",
    "\n",
    "PRICE_SELECTOR = ClassSelector(\"price\") | Text()\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "    __frozen__ = True\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = Field(PRICE_SELECTOR, compare=False, repr=False, migrate=False)\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\">100$</p>\n",
    "    </div>\n",
    "    <div class=\"book\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\">50$</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "result = Book.find_all(element)\n",
    "print(f\"{result[0]} == {result[1]}: {result[0] == result[1]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Migration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can migrate a model instance to another model using the `migrate` method, which takes a target class as an argument and initializes it with the current model's field values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Pydantic"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pydantic\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Operation, Text\n",
    "\n",
    "\n",
    "class PydanticBook(pydantic.BaseModel):\n",
    "    title: str\n",
    "    price: int\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = ClassSelector(\"price\") | Text() | Operation(int)\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\">100</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "book = Book.find(element)\n",
    "book.migrate(PydanticBook)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### SQLAlchemy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "from sqlalchemy import Column, Integer, String\n",
    "from sqlalchemy.orm import DeclarativeBase\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Operation, Text\n",
    "\n",
    "\n",
    "class Base(DeclarativeBase): ...\n",
    "\n",
    "\n",
    "class SABook(Base):\n",
    "    __tablename__ = \"book\"\n",
    "\n",
    "    id = Column(Integer, primary_key=True)\n",
    "    title = Column(String, nullable=True)\n",
    "    price = Column(Integer, nullable=True)\n",
    "\n",
    "    def __repr__(self):\n",
    "        return f\"<SABook(title={self.title}, price={self.price})>\"\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = ClassSelector(\"price\") | Text() | Operation(int)\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\">100</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "book = Book.find(element)\n",
    "book.migrate(SABook)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Mapping"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For more complex models that have other models as fields, a `mapping` can be passed to the `migrate` method to specify how nested `soupsavvy` models should be transformed into their respective target models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pydantic\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Text\n",
    "\n",
    "\n",
    "class PydanticAuthor(pydantic.BaseModel):\n",
    "    name: str\n",
    "    country: str\n",
    "\n",
    "\n",
    "class PydanticBook(pydantic.BaseModel):\n",
    "    title: str\n",
    "    author: PydanticAuthor\n",
    "\n",
    "\n",
    "class Author(BaseModel):\n",
    "    __scope__ = TypeSelector(\"span\")\n",
    "\n",
    "    name = TypeSelector(\"p\") | Text()\n",
    "    country = ClassSelector(\"country\") | Text()\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    author = Author\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <span>\n",
    "            <p>George Orwell</p>\n",
    "            <a class=\"country\">United Kingdom</a>\n",
    "        </span>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "book = Book.find(element)\n",
    "book.migrate(PydanticBook, mapping={Author: PydanticAuthor})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### MigrationSchema"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While the `migrate` method accepts keyword arguments, these apply only to the target model, not to nested models.\n",
    "\n",
    "When a nested model needs additional initialization parameters, use `MigrationSchema`, which bundles the target model with a dictionary of keyword arguments for its initializer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel, MigrationSchema\n",
    "from soupsavvy.operations import Text\n",
    "\n",
    "\n",
    "class TargetAuthor:\n",
    "    def __init__(self, name: str, country: str, genre=None):\n",
    "        self.name = name\n",
    "        self.country = country\n",
    "        self.genre = genre\n",
    "\n",
    "    def __repr__(self):\n",
    "        return f\"TargetAuthor(name={self.name!r}, country={self.country!r}, genre={self.genre!r})\"\n",
    "\n",
    "\n",
    "class TargetBook:\n",
    "    def __init__(self, title: str, author: TargetAuthor, price=None):\n",
    "        self.title = title\n",
    "        self.author = author\n",
    "        self.price = price\n",
    "\n",
    "    def __repr__(self):\n",
    "        return f\"TargetBook(title={self.title!r}, author={self.author!r}, price={self.price!r})\"\n",
    "\n",
    "\n",
    "class Author(BaseModel):\n",
    "    __scope__ = TypeSelector(\"span\")\n",
    "\n",
    "    name = TypeSelector(\"p\") | Text()\n",
    "    country = ClassSelector(\"country\") | Text()\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    author = Author\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <span>\n",
    "            <p>George Orwell</p>\n",
    "            <a class=\"country\">United Kingdom</a>\n",
    "        </span>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "book = Book.find(element)\n",
    "book.migrate(\n",
    "    TargetBook,\n",
    "    mapping={Author: MigrationSchema(TargetAuthor, params={\"genre\": \"Dystopia\"})},\n",
    "    price=10,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Copy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `copy` method *migrates a model to itself*: it creates a new, identical instance as a deep copy, including all nested models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Operation, Text\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    price = ClassSelector(\"price\") | Text() | Operation(int)\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\">100</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "book = Book.find(element)\n",
    "book_copy = book.copy()\n",
    "\n",
    "assert book == book_copy\n",
    "assert book is not book_copy\n",
    "\n",
    "print(book_copy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tips"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Typing\n",
    "\n",
    "To maintain clean and consistent typing, you can use `typing.cast` to provide type checkers with hints about instance field types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Optional, cast\n",
    "\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Operation, SkipNone, Text\n",
    "\n",
    "\n",
    "class Book(BaseModel):\n",
    "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n",
    "\n",
    "    title = cast(str, ClassSelector(\"title\") | Text())\n",
    "    price = cast(\n",
    "        Optional[int], ClassSelector(\"price\") | SkipNone(Text() | Operation(int))\n",
    "    )\n",
    "\n",
    "\n",
    "text = \"\"\"\n",
    "    <div class=\"book\" href=\"www.book.com\">\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <p class=\"price\">100</p>\n",
    "        <p>George Orwell</p>\n",
    "    </div>\n",
    "\"\"\"\n",
    "soup = BeautifulSoup(text, features=\"lxml\")\n",
    "element = to_soupsavvy(soup)\n",
    "\n",
    "Book.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`soupsavvy` offers a framework for object-oriented web scraping through user-defined models.  \n",
    "This allows users to define the structure of data they wish to extract from HTML documents."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Enjoy `soupsavvy` and leave us feedback!**  \n",
    "**Happy scraping!**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "soupsavvy",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}