{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `soupsavvy` **Model** is a user-defined **scraping schema** that uses selectors and operations to extract structured information from a webpage. A model represents an entity of interest in scraping, such as a product, article, or job." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Operations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Operations encapsulate transformation logic, such as extracting text, converting data types, or applying custom transformations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Applying operation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from soupsavvy.operations import Operation\n", "\n", "operation = Operation(lambda x: x.strip(\"$\"))\n", "operation.execute(\"100$\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Operation` accepts positional and keyword arguments, which are forwarded to the function when the operation is executed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datetime import datetime\n", "\n", "from soupsavvy.operations import Operation\n", "\n", "operation = Operation(datetime.strptime, \"%d-%m-%Y\")\n", "operation.execute(\"01-02-2020\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Chaining operations\n", "\n", "Operations can be chained with the `|` operator to apply several of them in sequence."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from soupsavvy.operations import Operation\n", "\n", "operation = (\n", "    Operation(lambda x: x.strip(\"$\")) | Operation(int) | Operation(lambda x: x * 2)\n", ")\n", "operation.execute(\"100$\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Text` is a built-in operation that extracts the text content of an element. It is one of the most common and useful operations in web scraping." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.operations import Text\n", "from soupsavvy import to_soupsavvy\n", "\n", "text = \"\"\"\n", "

Animal Farm

\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup.p)\n", "operation = Text()\n", "operation.execute(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Href" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Href` is a built-in operation that extracts the value of the `href` attribute from an element. If the element has no `href` attribute, it returns `None`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.operations import Href\n", "from soupsavvy import to_soupsavvy\n", "\n", "text = \"\"\"\n", " Animal Farm\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup.a)\n", "operation = Href()\n", "operation.execute(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Parent` is an operation that extracts the parent element of the current element. It can be used as a selector as well." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.operations import Parent\n", "from soupsavvy import to_soupsavvy\n", "\n", "text = \"\"\"\n", "
Animal Farm
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup.a)\n", "operation = Parent()\n", "operation.execute(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining with selector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selectors in `soupsavvy` can be combined with operations using the `|` operator. The resulting pipeline first locates the element and then applies the operation to it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, to_soupsavvy\n", "from soupsavvy.operations import Operation, Text\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

100$

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = ClassSelector(\"book\") > ClassSelector(\"price\")\n", "operation = Text() | Operation(lambda x: x.strip(\"$\")) | Operation(int)\n", "\n", "pipeline = selector | operation\n", "pipeline.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Definition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A user-defined model in `soupsavvy` must:\n", "\n", "- Inherit from `soupsavvy.models.BaseModel`.\n", "- Define a `__scope__` class attribute specifying the HTML element containing the model's fields.\n", "- Include at least one field as a class attribute.\n", "\n", "**Scope:** This selector defines the HTML element that encapsulates all fields of the model.\n", "\n", "**Fields:** Class attributes that extract data from within the scope element. These can be:\n", "- Selectors, e.g., `ClassSelector(\"book\")`\n", "- Selector-operation pipelines, e.g., `ClassSelector(\"price\") | Text() | Operation(int)`\n", "- Another model class inheriting from `BaseModel`\n", "- Operations such as `Text()`, `Href()`, or a custom `Operation()`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finding model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `find` method of a model class extracts the model from the given element. It returns a model instance built from the first scope element found.\n", "\n", "In this example, the `Book` class defines a model expected to be contained within a `div.book` element, with two fields:\n", "\n", "- **`title`**: Extracts text from `.title`.\n", "- **`price`**: Extracts text from `.price` and converts it to an integer."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Operation, Text\n", "\n", "\n", "class Book(BaseModel):\n", "\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " price = ClassSelector(\"price\") | Text() | Operation(int)\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

100

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "Book.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, if the `find` method doesn't locate the scope in the provided tag, it returns `None` and no model is extracted.\n", "\n", "In `strict` mode, however, a missing scope raises a `ModelNotFoundException` instead." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.exceptions import ModelNotFoundException\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Operation, Text\n", "\n", "\n", "class Book(BaseModel):\n", "\n", "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", "    title = ClassSelector(\"title\") | Text()\n", "    price = ClassSelector(\"price\") | Text() | Operation(int)\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

100

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "result = Book.find(element)\n", "assert result is None\n", "\n", "try:\n", " Book.find(element, strict=True)\n", "except ModelNotFoundException as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, errors during data extraction are propagated, stopping the model from being built. For instance, if the `price` element isn't found, the `Text` operation fails since it can't extract text from `None`.\n", "\n", "The `strict` parameter only affects scope searches, not individual field selectors. Field selectors are forgiving, meaning they continue even if a previous step returns `None`. Edge cases must be handled explicitly within the model definition." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.exceptions import FieldExtractionException\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Operation, Text\n", "\n", "\n", "class Book(BaseModel):\n", "\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " price = ClassSelector(\"price\") | Text() | Operation(int)\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "try:\n", " Book.find(element)\n", "except FieldExtractionException as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Operations as fields" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As noted earlier, operations can be used as fields in the model to extract and transform data from the scope element. For example, `Operation` can extract the `id` attribute, and `Href` can be used to retrieve the `href` attribute from an element." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Operation, Href\n", "\n", "\n", "class Book(BaseModel):\n", "\n", " __scope__ = TypeSelector(\"div\")\n", "\n", " id = Operation(lambda x: x.get()[\"id\"])\n", " link = Href()\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

100

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "Book.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wrappers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wrappers are composite components that modify the behavior of operations or selectors. \n", "They handle edge cases, like missing data, more gracefully in the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### SkipNone" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `SkipNone` wrapper prevents operations like text extraction or type conversion from running if the input is `None`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Operation, SkipNone, Text\n", "\n", "\n", "class Book(BaseModel):\n", "\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " price = ClassSelector(\"price\") | SkipNone(Text() | Operation(int))\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "Book.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Suppress" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Suppress` operation wrapper catches and suppresses exceptions during execution, returning `None` if an exception occurs. This is useful for handling potential incompatibilities, such as converting an empty string to an integer.\n", "To limit what is suppressed, pass the `category` parameter as an exception class or a tuple of exception classes; only exceptions of that category are then suppressed.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Operation, Suppress, Text\n", "\n", "\n", "class Book(BaseModel):\n", "\n", "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", "    title = ClassSelector(\"title\") | Text()\n", "    price = (\n", "        ClassSelector(\"price\") | Text() | Suppress(Operation(int), category=ValueError)\n", "    )\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "Book.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Default` wrapper provides a default value when a field selector returns `None`, allowing for specific interpretations, like treating an empty `price` as `0`. However, it does not suppress exceptions that arise during extraction." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel, Default\n", "from soupsavvy.operations import Operation, Suppress, Text\n", "\n", "\n", "class Book(BaseModel):\n", "\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " price = Default(ClassSelector(\"price\") | Text() | Suppress(Operation(int)), 0)\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

hundred

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "Book.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### IfElse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `IfElse` operation enables conditional data transformations, taking three arguments:\n", "\n", "- **`condition` (callable):** A function that determines which operation to execute.\n", "- **`if_` (operation):** The operation performed if the condition is met.\n", "- **`else_` (operation):** The operation executed if the condition is not met.\n", "\n", "This is useful for applying different transformations based on the HTML structure or values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Href, IfElse, Operation, Text\n", "\n", "\n", "class Book(BaseModel):\n", "\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", " title = (\n", " ClassSelector(\"title\")\n", " | Text()\n", " | IfElse(lambda x: x == \"\", Operation(lambda x: None), Operation(str.upper))\n", " )\n", " price = ClassSelector(\"price\") | IfElse(lambda x: x.name == \"a\", Href(), Text())\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

10

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "Book.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Break and Continue" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Additionally, `Break` and `Continue` operations enhance `IfElse` by providing control flow capabilities:\n", "\n", "- **`Break`:** Terminates the operation pipeline.\n", "- **`Continue`:** Skips the current operation and moves to the next one." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Break, Continue, IfElse, Operation, Text\n", "\n", "\n", "class Book(BaseModel):\n", "\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", " title = (\n", " ClassSelector(\"title\")\n", " | Text()\n", " | IfElse(\n", " lambda x: x == \"\",\n", " Break(),\n", " Operation(str.upper),\n", " )\n", " | Operation(lambda x: x + \"!\")\n", " )\n", " price = (\n", " ClassSelector(\"divider\")\n", " | Text()\n", " | Operation(int)\n", " | IfElse(lambda x: x == 0, Continue(), Operation(lambda x: 100 / x))\n", " | Operation(lambda x: f\"{x}$\")\n", " )\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

\n", "

0

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "Book.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Required" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, all fields in a model are nullable: a field is set to `None` if its selector returns `None`. You can change this behavior with the `Required` field wrapper, which enforces a non-null value for the field." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.exceptions import FieldExtractionException\n", "from soupsavvy.models import BaseModel, Required\n", "from soupsavvy.operations import Operation, SkipNone, Text\n", "\n", "\n", "class Book(BaseModel):\n", "\n", "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", "    title = ClassSelector(\"title\") | Text()\n", "    price = Required(ClassSelector(\"price\") | SkipNone(Text() | Operation(int)))\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "try:\n", " Book.find(element)\n", "except FieldExtractionException as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### All" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we expect multiple elements to be found within the scope, the `All` field wrapper can be used. \n", "This wrapper extracts all elements matching the field selector." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import All, BaseModel\n", "from soupsavvy.operations import Operation, Text\n", "\n", "\n", "class Book(BaseModel):\n", "\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " price = All(ClassSelector(\"price\") | Text() | Operation(int))\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

100

\n", "

80

\n", "

60

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "Book.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Post-Initialization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To handle further transformations of extracted fields, you can define the `__post_init__` method in your model class, similar to Python's `dataclass`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import All, BaseModel\n", "from soupsavvy.operations import Operation, Text\n", "\n", "\n", "class Book(BaseModel):\n", "\n", "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", "    title = ClassSelector(\"title\") | Text()\n", "    price = All(ClassSelector(\"price\") | Text() | Operation(int))\n", "\n", "    def __post_init__(self) -> None:\n", "        self.price = min(self.price)  # type: ignore\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

100

\n", "

80

\n", "

60

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "Book.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also define a post-processing method for an individual field. The method can have any name, but it must be decorated with `@soupsavvy.models.post`, which takes the field name as its argument.\n", "\n", "This transformation is applied before `__post_init__` and before instance attributes are assigned." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel, post\n", "from soupsavvy.operations import Text\n", "from soupsavvy.selectors.css import LastOfType\n", "\n", "\n", "class Book(BaseModel):\n", "\n", "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", "    title = ClassSelector(\"title\") | Text()\n", "    price = ClassSelector(\"price\") | Text()\n", "    author = (LastOfType() & TypeSelector(\"p\")) | Text()\n", "\n", "    @post(\"title\")\n", "    def process_title(self, value: str) -> str:\n", "        return value.upper()\n", "\n", "    @post(\"price\")\n", "    def process_price(self, value: str) -> int:\n", "        return int(value.strip(\"$\"))\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

100$

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "Book.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inheritance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, fields in a model are inherited, allowing subclasses to extend parent models easily. For example, the `eBook` model inherits from the `Book` model, adding fields like `link` and `duration`. It can also override `__scope__`, although this isn't required, as all special fields are inherited." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, PatternSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Href, Operation, Text\n", "\n", "\n", "class Book(BaseModel):\n", "\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " price = ClassSelector(\"price\") | Text() | Operation(int)\n", "\n", "\n", "class eBook(Book):\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"ebook\")\n", "\n", " link = Href()\n", " duration = PatternSelector(re.compile(r\"\\d{1,2}:\\d{2}\")) | Text()\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

50

\n", "

George Orwell

\n", "

2:30

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "eBook.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scope" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's advisable to use the most specific scope selector to ensure that only relevant elements are matched for the model. \n", "You can use `HasSelector` to extend your selection criteria by matching elements that contain the fields needed for extraction." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, HasSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Operation, Text\n", "\n", "PRICE_SELECTOR = ClassSelector(\"price\")\n", "TITLE_SELECTOR = ClassSelector(\"title\")\n", "\n", "\n", "class Book(BaseModel):\n", "\n", " __scope__ = (\n", " ClassSelector(\"book\")\n", " & HasSelector(PRICE_SELECTOR)\n", " & HasSelector(TITLE_SELECTOR)\n", " )\n", "\n", " title = TITLE_SELECTOR | Text()\n", " price = PRICE_SELECTOR | Text() | Operation(int)\n", "\n", "\n", "text = \"\"\"\n", "
Unavailable
\n", "
\n", "

Animal Farm

\n", "

George Orwell

\n", "

4:30

\n", "
\n", "
\n", "

50

\n", "

Lois Lowry

\n", "

3:30

\n", "
\n", "
\n", "

Brave New World

\n", "

50

\n", "

Aldous Huxley

\n", "

2:30

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "Book.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When the scope element is the same as the current element, `SelfSelector` can be used as the scope selector." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, SelfSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Text\n", "\n", "\n", "class Book(BaseModel):\n", "    __scope__ = SelfSelector()\n", "\n", "    title = ClassSelector(\"title\") | Text()\n", "    author = ClassSelector(\"author\") | Text()\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "div = element.find_all(\"div\")[0]\n", "\n", "Book.find(div)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finding all" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `find_all` method returns a list of model instances for all elements that match the scope selector." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Operation, Text\n", "from soupsavvy.selectors.css import LastOfType\n", "\n", "\n", "class Book(BaseModel):\n", "\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " price = ClassSelector(\"price\") | Text() | Operation(int)\n", " author = (LastOfType() & TypeSelector(\"p\")) | Text()\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

100

\n", "

George Orwell

\n", "
\n", "
\n", "

Brave New World

\n", "

100

\n", "

Aldous Huxley

\n", "
\n", "
\n", "

The Giver

\n", "

80

\n", "

Lois Lowry

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "Book.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Recursive option" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The recursive option applies only to scope searches. When set to `True`, the model's scope is searched among all descendants of the specified tag; when set to `False`, only direct children are considered. Field selectors, however, always search recursively, regardless of this setting." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Operation, Text\n", "from soupsavvy.selectors.css import LastOfType\n", "\n", "\n", "class Book(BaseModel):\n", "\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " price = ClassSelector(\"price\") | Text() | Operation(int)\n", " author = (LastOfType() & TypeSelector(\"p\")) | Text()\n", "\n", "\n", "text = \"\"\"\n", " \n", "
\n", "

Not a child

\n", "

200

\n", "

Author

\n", "
\n", "
\n", "
\n", " \n", "

Animal Farm

\n", "

100

\n", "

George Orwell

\n", "
\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"html.parser\")\n", "element = to_soupsavvy(soup)\n", "\n", "Book.find(element, recursive=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To restrict field searches to only the children of the scope element, you can use a relative selector. \n", "To find out more, see [docs](https://soupsavvy.readthedocs.io/en/stable/demos/combining.html#relative-selectors)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import Anchor, ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Operation, Text\n", "\n", "\n", "class Book(BaseModel):\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " price = (Anchor > ClassSelector(\"price\")) | Text() | Operation(int)\n", "\n", "\n", "text = \"\"\"\n", "
\n", " \n", "

Animal Farm

\n", "

100

\n", "

50

\n", " \n", "

George Orwell

\n", "
\n", "
\n", "

200

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "Book.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model Composition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Any model class can itself be used as a field, as the `Author` class is in this example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "from datetime import datetime\n", "\n", "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, PatternSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Operation, Text\n", "from soupsavvy.selectors.css import FirstChild\n", "\n", "\n", "class Author(BaseModel):\n", "    __scope__ = ClassSelector(\"author\")\n", "\n", "    birth = (\n", "        PatternSelector(re.compile(r\"\\d{4}-\\d{2}-\\d{2}\"))\n", "        | Text()\n", "        | Operation(lambda x: datetime.strptime(x, \"%Y-%m-%d\"))\n", "    )\n", "    name = FirstChild() | Text()\n", "\n", "\n", "class Book(BaseModel):\n", "\n", "    __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", "    author = Author\n", "    title = ClassSelector(\"title\") | Text()\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "
\n", "

George Orwell

\n", "

Great author

\n", "

1903-06-25

\n", "
\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "Book.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Frozen Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the model immutable, set the `__frozen__` attribute to `True`. This enforces immutability, making the model hashable. Modifying any field of a frozen instance will raise a `FrozenModelException`. Regardless of immutability, attempting to set an attribute not defined as a field will raise an `AttributeError`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Text\n", "from soupsavvy.exceptions import FrozenModelException\n", "\n", "\n", "class Book(BaseModel):\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", " __frozen__ = True\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " price = ClassSelector(\"price\") | Text()\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

100$

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "result = Book.find(element)\n", "\n", "try:\n", " result.price = 200 # type: ignore\n", "except FrozenModelException as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Field" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, every field takes part in the following instance operations:\n", "\n", "- **String Representation**\n", "- **Equality Comparison**\n", "- **Hash Calculation**\n", "- **Migration**\n", "\n", "To exclude a field from any of these operations, wrap its selector in the `Field` class and set the relevant boolean parameters, which all default to `True`:\n", "\n", "- **`repr`**\n", "- **`compare`**\n", "- **`migrate`**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel, Field\n", "from soupsavvy.operations import Text\n", "\n", "PRICE_SELECTOR = ClassSelector(\"price\") | Text()\n", "\n", "\n", "class Book(BaseModel):\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", " __frozen__ = True\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " price = Field(PRICE_SELECTOR, compare=False, repr=False, migrate=False)\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

100$

\n", "
\n", "
\n", "

Animal Farm

\n", "

50$

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "result = Book.find_all(element)\n", "print(f\"{result[0]} == {result[1]}: {result[0] == result[1]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Migration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can migrate a model instance to another model using the `migrate` method, which takes a target class as an argument and initializes it with the current model's field values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Pydantic" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pydantic\n", "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Operation, Text\n", "\n", "\n", "class PydanticBook(pydantic.BaseModel):\n", " title: str\n", " price: int\n", "\n", "\n", "class Book(BaseModel):\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " price = ClassSelector(\"price\") | Text() | Operation(int)\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

100

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "book = Book.find(element)\n", "book.migrate(PydanticBook)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### SQLAlchemy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "from sqlalchemy import Column, Integer, String\n", "from sqlalchemy.orm import DeclarativeBase\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Operation, Text\n", "\n", "\n", "class Base(DeclarativeBase): ...\n", "\n", "\n", "class SABook(Base):\n", " __tablename__ = \"book\"\n", "\n", " id = Column(Integer, primary_key=True)\n", " title = Column(String, nullable=True)\n", " price = Column(Integer, nullable=True)\n", "\n", " def __repr__(self):\n", " return f\"SABook(title={self.title!r}, price={self.price!r})\"\n", "\n", "\n", "class Book(BaseModel):\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " price = ClassSelector(\"price\") | Text() | Operation(int)\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

100

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "book = Book.find(element)\n", "book.migrate(SABook)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Mapping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more complex models that have other models as fields, a `mapping` can be passed to the `migrate` method to specify how each nested `soupsavvy` model should be transformed into its respective target model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pydantic\n", "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Text\n", "\n", "\n", "class PydanticAuthor(pydantic.BaseModel):\n", " name: str\n", " country: str\n", "\n", "\n", "class PydanticBook(pydantic.BaseModel):\n", " title: str\n", " author: PydanticAuthor\n", "\n", "\n", "class Author(BaseModel):\n", " __scope__ = TypeSelector(\"span\")\n", "\n", " name = TypeSelector(\"p\") | Text()\n", " country = ClassSelector(\"country\") | Text()\n", "\n", "\n", "class Book(BaseModel):\n", "\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " author = Author\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", " \n", "

George Orwell

\n", " United Kingdom\n", "
\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "book = Book.find(element)\n", "book.migrate(PydanticBook, mapping={Author: PydanticAuthor})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### MigrationSchema" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While the `migrate` method accepts keyword arguments, these apply only to the target model, not to nested models. \n", "\n", "When additional initialization parameters are needed, use `MigrationSchema`, which includes the target model and a dictionary of keyword arguments for the initializer." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel, MigrationSchema\n", "from soupsavvy.operations import Text\n", "\n", "\n", "class TargetAuthor:\n", " def __init__(self, name: str, country: str, genre=None):\n", " self.name = name\n", " self.country = country\n", " self.genre = genre\n", "\n", " def __repr__(self):\n", " return f\"TargetAuthor(name={self.name!r}, country={self.country!r}, genre={self.genre!r})\"\n", "\n", "\n", "class TargetBook:\n", " def __init__(self, title: str, author: TargetAuthor, price=None):\n", " self.title = title\n", " self.author = author\n", " self.price = price\n", "\n", " def __repr__(self):\n", " return f\"TargetBook(title={self.title!r}, author={self.author!r}, price={self.price!r})\"\n", "\n", "\n", "class Author(BaseModel):\n", " __scope__ = TypeSelector(\"span\")\n", "\n", " name = TypeSelector(\"p\") | Text()\n", " country = ClassSelector(\"country\") | Text()\n", "\n", "\n", "class Book(BaseModel):\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " author = Author\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", " \n", "

George Orwell

\n", " United Kingdom\n", "
\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "book = Book.find(element)\n", "book.migrate(\n", " TargetBook,\n", " mapping={Author: MigrationSchema(TargetAuthor, params={\"genre\": \"Dystopia\"})},\n", " price=10,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Copy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Additionally, the `copy` method allows a model to be *migrated to itself*, creating a new identical instance as a deep copy, including all nested models." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Operation, Text\n", "\n", "\n", "class Book(BaseModel):\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " price = ClassSelector(\"price\") | Text() | Operation(int)\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

100

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "book = Book.find(element)\n", "book_copy = book.copy()\n", "\n", "assert book == book_copy\n", "assert book is not book_copy\n", "\n", "print(book_copy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tips" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scope\n", "\n", "It's advisable to use the most specific scope selector to ensure that only relevant elements are matched for the model. \n", "You can use `HasSelector` to extend your selection criteria by matching elements that contain the fields needed for extraction." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, HasSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Operation, Text\n", "\n", "PRICE_SELECTOR = ClassSelector(\"price\")\n", "TITLE_SELECTOR = ClassSelector(\"title\")\n", "\n", "\n", "class Book(BaseModel):\n", " __scope__ = (\n", " ClassSelector(\"book\")\n", " & HasSelector(PRICE_SELECTOR)\n", " & HasSelector(TITLE_SELECTOR)\n", " )\n", "\n", " title = TITLE_SELECTOR | Text()\n", " price = PRICE_SELECTOR | Text() | Operation(int)\n", "\n", "\n", "text = \"\"\"\n", "
Unavailable
\n", "
\n", "

Animal Farm

\n", "

George Orwell

\n", "

4:30

\n", "
\n", "
\n", "

50

\n", "

Lois Lowry

\n", "

3:30

\n", "
\n", "
\n", "

Brave New World

\n", "

50

\n", "

Aldous Huxley

\n", "

2:30

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "Book.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When the scope element is the same as the current element, `SelfSelector` can be used as the scope selector." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, SelfSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Text\n", "\n", "\n", "class Book(BaseModel):\n", " __scope__ = SelfSelector()\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " author = ClassSelector(\"author\") | Text()\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "div = element.find_all(\"div\")[0]\n", "\n", "Book.find(div)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Typing\n", "\n", "To maintain clean and consistent typing, you can use `typing.cast` to provide type checkers with hints about instance field types." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from typing import cast, Optional\n", "\n", "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Operation, SkipNone, Text\n", "\n", "\n", "class Book(BaseModel):\n", " __scope__ = TypeSelector(\"div\") & ClassSelector(\"book\")\n", "\n", " title = cast(str, ClassSelector(\"title\") | Text())\n", " price = cast(\n", " Optional[int], ClassSelector(\"price\") | SkipNone(Text() | Operation(int))\n", " )\n", "\n", "\n", "text = \"\"\"\n", "
\n", "

Animal Farm

\n", "

100

\n", "

George Orwell

\n", "
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup)\n", "\n", "Book.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`soupsavvy` offers a framework for object-oriented web scraping through user-defined models. \n", "This allows users to define the structure of data they wish to extract from HTML documents." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Enjoy `soupsavvy` and leave us feedback!** \n", "**Happy scraping!**" ] } ], "metadata": { "kernelspec": { "display_name": "soupsavvy", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 2 }