Models

A soupsavvy model is a user-defined scraping schema that uses selectors and operations to extract structured information from a webpage. A model represents an entity of interest in scraping, such as a product, article, or job.

Operations

Operations encapsulate transformation logic, such as extracting text, converting data types, or applying custom transformations.

Applying operation

from soupsavvy.operations import Operation

operation = Operation(lambda x: x.strip("$"))
operation.execute("100$")
'100'

Operation accepts positional and keyword arguments, which are passed on to the wrapped function when it is executed.

from datetime import datetime

from soupsavvy.operations import Operation

operation = Operation(datetime.strptime, "%d-%m-%Y")
operation.execute("01-02-2020")
datetime.datetime(2020, 2, 1, 0, 0)
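The example above only works if the executed value is passed to the function first, with the stored arguments following it. A plain-Python equivalent is sketched below. It is an illustration of the call order, not soupsavvy's implementation, and the keyword variant assumes keyword arguments are forwarded unchanged.

```python
from datetime import datetime

# Equivalent of Operation(datetime.strptime, "%d-%m-%Y").execute("01-02-2020"):
# the executed value comes first, the stored positional argument follows.
parsed = datetime.strptime("01-02-2020", "%d-%m-%Y")
print(parsed)  # 2020-02-01 00:00:00

# Assumed equivalent of Operation(int, base=16).execute("ff"):
# the stored keyword argument is forwarded as-is.
number = int("ff", base=16)
print(number)  # 255
```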

Chaining operations

Operations can be chained with the | operator to apply several of them in sequence.

from soupsavvy.operations import Operation

operation = (
    Operation(lambda x: x.strip("$")) | Operation(int) | Operation(lambda x: x * 2)
)
operation.execute("100$")
200
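One way such chaining could work is by overloading the | operator. The sketch below uses a hypothetical SketchOperation class written only for illustration; it does not reproduce soupsavvy's internals.

```python
from typing import Any, Callable


class SketchOperation:
    """Hypothetical stand-in for an operation that wraps a single callable."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def execute(self, value: Any) -> Any:
        return self.func(value)

    def __or__(self, other: "SketchOperation") -> "SketchOperation":
        # Chaining builds a new operation that runs self first, then other.
        return SketchOperation(lambda value: other.execute(self.execute(value)))


chained = (
    SketchOperation(lambda x: x.strip("$"))
    | SketchOperation(int)
    | SketchOperation(lambda x: x * 2)
)
result = chained.execute("100$")
print(result)  # 200
```

Each | returns a new object, so the individual operations stay reusable in other chains.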

Text

Text is a built-in operation that extracts the text content of an element. It is one of the most common operations in web scraping.

from bs4 import BeautifulSoup

from soupsavvy.operations import Text
from soupsavvy import to_soupsavvy

text = """
    <p class="title">Animal Farm</p>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup.p)
operation = Text()
operation.execute(element)
'Animal Farm'

Href

Href is a built-in operation that extracts the value of the href attribute from an element. If the element has no href attribute, it returns None.

from bs4 import BeautifulSoup

from soupsavvy.operations import Href
from soupsavvy import to_soupsavvy

text = """
    <a href="www.book.com">Animal Farm</a>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup.a)
operation = Href()
operation.execute(element)
'www.book.com'

Parent

Parent is an operation that extracts the parent element of the current element. It can be used as a selector as well.

from bs4 import BeautifulSoup

from soupsavvy.operations import Parent
from soupsavvy import to_soupsavvy

text = """
    <div><a href="www.book.com">Animal Farm</a></div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup.a)
operation = Parent()
operation.execute(element)
SoupElement(<div><a href="www.book.com">Animal Farm</a></div>)

Combining with selector

Selectors in soupsavvy can be combined with operations by using the | operator. The resulting pipeline first locates the element and then applies the operation to it.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, to_soupsavvy
from soupsavvy.operations import Operation, Text

text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100$</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

selector = ClassSelector("book") > ClassSelector("price")
operation = Text() | Operation(lambda x: x.strip("$")) | Operation(int)

pipeline = selector | operation
pipeline.find(element)
100

Models

Definition

To create a user-defined model in soupsavvy, it must:

  • Inherit from soupsavvy.models.BaseModel.

  • Define a __scope__ class attribute specifying the HTML element containing the model’s fields.

  • Include at least one field as a class attribute.

Scope: This selector defines the HTML element that encapsulates all fields of the model.

Fields: Class attributes that extract data from within the scope element. These can be:

  • Selectors, e.g., ClassSelector("book")

  • Selector-operation pipelines, e.g., ClassSelector("price") | Text() | Operation(int)

  • Another model class inheriting from BaseModel

  • Operations like Text(), Href(), or a custom Operation()

Finding model

The find method of a model class extracts the model from a parsed element, which the examples here wrap with to_soupsavvy. It returns a model instance built from the first scope element found.

The Book class below defines a model expected to be contained within a div.book element. It includes two fields:

  • title: Extracts text from .title.

  • price: Extracts text from .price and converts it to an integer.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)
Book.find(element)
Book(title='Animal Farm', price=100)

If the find method doesn’t locate the scope in the provided tag, it returns None by default and no model is extracted.

In strict mode, however, a failure to find the scope raises a ModelNotFoundException instead.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.exceptions import ModelNotFoundException
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="ebook" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)
result = Book.find(element)
assert result is None

try:
    Book.find(element, strict=True)
except ModelNotFoundException as e:
    print(e)
Scope for the model 'Book' was not found by 'AndSelector(TypeSelector(name='div'), ClassSelector('book'))' in tag.

By default, errors during data extraction are propagated, stopping the model from being built. For instance, if the price element isn’t found, the Text operation fails since it can’t extract text from None.

The strict parameter only affects scope searches, not individual field selectors. A field selector that finds nothing does not raise; it returns None, and the next step in the pipeline receives that None. Edge cases must be handled explicitly within the model definition.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.exceptions import FieldExtractionException
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

try:
    Book.find(element)
except FieldExtractionException as e:
    print(e)
Extracting field 'price' failed in model 'Book' for element:
<div class="book" href="www.book.com">
<p class="title">Animal Farm</p>
<p>George Orwell</p>
</div>

Operations as fields

As noted earlier, operations can be used as fields in the model to extract and transform data from the scope element. For example, Operation can extract the id attribute, and Href can be used to retrieve the href attribute from an element.

from bs4 import BeautifulSoup

from soupsavvy import TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Href


class Book(BaseModel):

    __scope__ = TypeSelector("div")

    id = Operation(lambda x: x.get()["id"])
    link = Href()


text = """
    <div id="book1" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
Book(id='book1', link='www.book.com')

Wrappers

Wrappers are composite components that modify the behavior of operations or selectors.
They handle edge cases, like missing data, more gracefully in the model.

SkipNone

The SkipNone wrapper prevents operations like text extraction or type conversion from running if the input is None.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, SkipNone, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | SkipNone(Text() | Operation(int))


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
Book(title='Animal Farm', price=None)
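The idea behind SkipNone can be sketched in plain Python as a small guard around a function. The skip_none helper below is hypothetical and only mirrors the behaviour described in this section.

```python
from typing import Any, Callable


def skip_none(func: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Hypothetical helper: run func only when the input is not None."""

    def wrapper(value: Any) -> Any:
        return None if value is None else func(value)

    return wrapper


to_int = skip_none(int)
missing = to_int(None)  # int is never called, so nothing is raised
present = to_int("100")
print(missing, present)  # None 100
```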

Suppress

The Suppress operation wrapper catches exceptions raised during execution and returns None instead. This is useful for handling potential incompatibilities, such as converting an empty string to an integer. To suppress only certain exceptions, pass the category parameter as an exception class or a tuple of exception classes; exceptions outside that category still propagate.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Suppress, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = (
        ClassSelector("price") | Text() | Suppress(Operation(int), category=ValueError)
    )


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price"></p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
Book(title='Animal Farm', price=None)
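The effect of the category parameter can be illustrated with a plain try/except. The suppress helper below is a hypothetical sketch of the described behaviour, not the library's code.

```python
from typing import Any, Callable


def suppress(func: Callable[[Any], Any], category=Exception) -> Callable[[Any], Any]:
    """Hypothetical helper: return None when func raises an exception in category."""

    def wrapper(value: Any) -> Any:
        try:
            return func(value)
        except category:
            return None

    return wrapper


safe_int = suppress(int, category=ValueError)
empty = safe_int("")    # int("") raises ValueError, which is suppressed
valid = safe_int("42")
print(empty, valid)  # None 42

# int(None) raises TypeError, which is outside the category and propagates.
propagated = False
try:
    safe_int(None)
except TypeError:
    propagated = True
print(propagated)  # True
```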

Default

The Default wrapper provides a default value when a field selector returns None, allowing for specific interpretations, like treating an empty price as 0. However, it does not suppress exceptions that arise during extraction.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel, Default
from soupsavvy.operations import Operation, Suppress, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = Default(ClassSelector("price") | Text() | Suppress(Operation(int)), 0)


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">hundred</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
Book(title='Animal Farm', price=0)
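The difference between replacing None and suppressing exceptions can be sketched without the library. Both with_default and parse_price below are hypothetical helpers used only for illustration.

```python
from typing import Any, Callable, Optional


def with_default(func: Callable[[Any], Any], default: Any) -> Callable[[Any], Any]:
    """Hypothetical helper: replace a None result with a default value."""

    def wrapper(value: Any) -> Any:
        result = func(value)
        return default if result is None else result

    return wrapper


def parse_price(text: str) -> Optional[int]:
    """Return the price as an int, or None when the text is not numeric."""
    return int(text) if text.isdigit() else None


price = with_default(parse_price, 0)
fallback = price("hundred")  # parse_price returns None, so the default is used
parsed = price("100")
print(fallback, parsed)  # 0 100

# An exception is not a None result, so the default does not apply to it.
raised = False
try:
    with_default(int, 0)("hundred")
except ValueError:
    raised = True
print(raised)  # True
```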

IfElse

The IfElse operation enables conditional data transformations, taking three arguments:

  • condition (callable): A function that determines which operation to execute.

  • if_ (operation): The operation performed if the condition is met.

  • else_ (operation): The operation executed if the condition is not met.

This is useful for applying different transformations based on the HTML structure or values.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Href, IfElse, Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = (
        ClassSelector("title")
        | Text()
        | IfElse(lambda x: x == "", Operation(lambda x: None), Operation(str.upper))
    )
    price = ClassSelector("price") | IfElse(lambda x: x.name == "a", Href(), Text())


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">10</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
Book(title='ANIMAL FARM', price='10')

Break and Continue

Additionally, Break and Continue operations enhance IfElse by providing control flow capabilities:

  • Break: Terminates the operation pipeline.

  • Continue: Skips the current operation and moves to the next one.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Break, Continue, IfElse, Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = (
        ClassSelector("title")
        | Text()
        | IfElse(
            lambda x: x == "",
            Break(),
            Operation(str.upper),
        )
        | Operation(lambda x: x + "!")
    )
    price = (
        ClassSelector("divider")
        | Text()
        | Operation(int)
        | IfElse(lambda x: x == 0, Continue(), Operation(lambda x: 100 / x))
        | Operation(lambda x: f"{x}$")
    )


text = """
    <div class="book" href="www.book.com">
        <p class="title"></p>
        <p class="divider">0</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
Book(title='', price='0$')
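The control flow in this example can be mimicked with a simple loop over steps. The sketch below is an assumption-laden illustration: BREAK, CONTINUE, if_else, and run_pipeline are hypothetical names, and Break is assumed to return the value reached so far, which matches the output shown above.

```python
from typing import Any, Callable, List

# Hypothetical markers standing in for Break() and Continue().
BREAK = object()
CONTINUE = object()


def if_else(condition: Callable[[Any], bool], if_step: Any, else_step: Any):
    """Pick a step based on the condition; markers are passed through untouched."""

    def step(value: Any) -> Any:
        chosen = if_step if condition(value) else else_step
        if chosen is BREAK or chosen is CONTINUE:
            return chosen
        return chosen(value)

    return step


def run_pipeline(value: Any, steps: List[Callable[[Any], Any]]) -> Any:
    for step in steps:
        result = step(value)
        if result is BREAK:
            break  # stop here and return the value reached so far
        if result is CONTINUE:
            continue  # skip this step, keep the value unchanged
        value = result
    return value


title_steps = [if_else(lambda x: x == "", BREAK, str.upper), lambda x: x + "!"]
price_steps = [
    int,
    if_else(lambda x: x == 0, CONTINUE, lambda x: 100 / x),
    lambda x: f"{x}$",
]

print(repr(run_pipeline("", title_steps)))       # ''
print(run_pipeline("animal farm", title_steps))  # ANIMAL FARM!
print(run_pipeline("0", price_steps))            # 0$
print(run_pipeline("4", price_steps))            # 25.0$
```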

Required

By default, all fields in a model are nullable: if the field selector returns None, the corresponding field is set to None. You can change this behavior with the Required field wrapper, which enforces a non-null value for the field.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.exceptions import FieldExtractionException
from soupsavvy.models import BaseModel, Required
from soupsavvy.operations import Operation, SkipNone, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = Required(ClassSelector("price") | SkipNone(Text() | Operation(int)))


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

try:
    Book.find(element)
except FieldExtractionException as e:
    print(e)
Field 'price' is required and was not found in model 'Book' for element:
<div class="book" href="www.book.com">
<p class="title">Animal Farm</p>
<p>George Orwell</p>
</div>

All

If we expect multiple elements to be found within the scope, the All field wrapper can be used.
This wrapper extracts all elements matching the field selector.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import All, BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = All(ClassSelector("price") | Text() | Operation(int))


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price"><s>100</s></p>
        <p class="price"><s>80</s></p>
        <p class="price">60</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
Book(title='Animal Farm', price=[100, 80, 60])

Post-Initialization

To handle further transformations of extracted fields, you can define the __post_init__ method in your model class, similar to Python’s dataclass.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import All, BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = All(ClassSelector("price") | Text() | Operation(int))

    def __post_init__(self) -> None:
        self.price = min(self.price)  # type: ignore


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price"><s>100</s></p>
        <p class="price"><s>80</s></p>
        <p class="price">60</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
Book(title='Animal Farm', price=60)
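For readers less familiar with the dataclass hook this mirrors, here is the standard-library counterpart. BookRecord is a hypothetical class and the snippet uses only dataclasses, not soupsavvy.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class BookRecord:
    title: str
    price: Union[int, List[int]]

    def __post_init__(self) -> None:
        # Runs right after the generated __init__ has assigned the fields.
        if isinstance(self.price, list):
            self.price = min(self.price)


record = BookRecord("Animal Farm", [100, 80, 60])
print(record)  # BookRecord(title='Animal Farm', price=60)
```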

You can also define a post-processing method for each field. The method can have any name, but it must be decorated with @soupsavvy.models.post.

This transformation is applied before __post_init__ and assignment of instance attributes.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel, post
from soupsavvy.operations import Text
from soupsavvy.selectors.css import LastOfType


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text()
    author = (LastOfType() & TypeSelector("p")) | Text()

    @post("title")
    def process_title(self, value: str) -> str:
        return value.upper()

    @post("price")
    def process_price(self, value: str) -> int:
        return int(value.strip("$"))


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100$</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
Book(title='ANIMAL FARM', price=100, author='George Orwell')

Inheritance

By default, fields in a model are inherited, allowing subclasses to extend parent models easily. For example, the eBook model inherits from the Book model, adding fields like link and duration. It can also override __scope__, although this isn’t required, as all special fields are inherited.

import re

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, PatternSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Href, Operation, Text


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


class eBook(Book):
    __scope__ = TypeSelector("div") & ClassSelector("ebook")

    link = Href()
    duration = PatternSelector(re.compile(r"\d{1,2}:\d{2}")) | Text()


text = """
    <div class="ebook" href="www.ebook.com">
        <p class="title">Animal Farm</p>
        <p class="price">50</p>
        <p>George Orwell</p>
        <p>2:30</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

eBook.find(element)
eBook(title='Animal Farm', price=50, link='www.ebook.com', duration='2:30')

Scope

It’s advisable to use the most specific scope selector to ensure that only relevant elements are matched for the model.
You can use HasSelector to extend your selection criteria by matching elements that contain the fields needed for extraction.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, HasSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text

PRICE_SELECTOR = ClassSelector("price")
TITLE_SELECTOR = ClassSelector("title")


class Book(BaseModel):

    __scope__ = (
        ClassSelector("book")
        & HasSelector(PRICE_SELECTOR)
        & HasSelector(TITLE_SELECTOR)
    )

    title = TITLE_SELECTOR | Text()
    price = PRICE_SELECTOR | Text() | Operation(int)


text = """
    <div class="book">Unavailable</div>
    <div class="book">
        <p class="title">Animal Farm</p>
        <p>George Orwell</p>
        <p>4:30</p>
    </div>
    <div class="book">
        <p class="price">50</p>
        <p>Lois Lowry</p>
        <p>3:30</p>
    </div>
    <div class="book">
        <p class="title">Brave New World</p>
        <p class="price">50</p>
        <p>Aldous Huxley</p>
        <p>2:30</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
Book(title='Brave New World', price=50)

When the scope element is the same as the current element, SelfSelector can be used as the scope selector.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, SelfSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Text


class Book(BaseModel):
    __scope__ = SelfSelector()

    title = ClassSelector("title") | Text()
    author = ClassSelector("author") | Text()


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="author">George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)
div = element.find_all("div")[0]

Book.find(div)
Book(title='Animal Farm', author='George Orwell')

Finding all

The find_all method returns a list of model instances for all elements that match the scope selector.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text
from soupsavvy.selectors.css import LastOfType


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)
    author = (LastOfType() & TypeSelector("p")) | Text()


text = """
    <div class="ebook" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
    <div class="book">
        <p class="title">Brave New World</p>
        <p class="price">100</p>
        <p>Aldous Huxley</p>
    </div>
    <div class="book">
        <p class="title">The Giver</p>
        <p class="price">80</p>
        <p>Lois Lowry</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find_all(element)
[Book(title='Brave New World', price=100, author='Aldous Huxley'),
 Book(title='The Giver', price=80, author='Lois Lowry')]

Recursive option

The recursive option applies only to scope searches. When set to True, the model’s scope is searched among all descendants of the specified tag; when set to False, only direct children are considered. Field selectors, however, always search recursively, regardless of this setting.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text
from soupsavvy.selectors.css import LastOfType


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)
    author = (LastOfType() & TypeSelector("p")) | Text()


text = """
    <span>
        <div class="book">
            <p class="title">Not a child</p>
            <p class="price">200</p>
            <p>Author</p>
        </div>
    </span>
    <div class="book">
        <span>
            <p class="title">Animal Farm</p>
            <p class="price">100</p>
            <p>George Orwell</p>
        </span>
    </div>
"""
soup = BeautifulSoup(text, features="html.parser")
element = to_soupsavvy(soup)

Book.find(element, recursive=False)
Book(title='Animal Farm', price=100, author='George Orwell')
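The distinction between direct children and all descendants can also be seen with the standard library's ElementTree. This is only an analogy using a different parser and a made-up body wrapper element; it does not involve soupsavvy.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    """
<body>
    <span><div class="book"><p>Not a child</p></div></span>
    <div class="book"><span><p>Animal Farm</p></span></div>
</body>
"""
)

# findall("div") looks at direct children only, like a non-recursive scope search.
children_only = [div.find(".//p").text for div in root.findall("div")]

# iter("div") walks every descendant in document order, like a recursive search.
all_descendants = [div.find(".//p").text for div in root.iter("div")]

print(children_only)    # ['Animal Farm']
print(all_descendants)  # ['Not a child', 'Animal Farm']
```

In both cases the lookup of p inside each div is recursive, just as field selectors are in the model.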

To restrict field searches to only the children of the scope element, you can use a relative selector.
To find out more, see the documentation on relative selectors.

from bs4 import BeautifulSoup

from soupsavvy import Anchor, ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):
    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = (Anchor > ClassSelector("price")) | Text() | Operation(int)


text = """
    <div class="book" href="www.book.com">
        <span>
            <p class="title">Animal Farm</p>
            <p class="price">100</p>
            <p class="price">50</p>
            <span class="author">
                <p>George Orwell</p>
            </span>
        </span>
        <p class="price">200</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
Book(title='Animal Farm', price=200)

Model Composition

Any model class can be used as a field, like the Author class in this example.

import re
from datetime import datetime

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, PatternSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text
from soupsavvy.selectors.css import FirstChild


class Author(BaseModel):
    __scope__ = ClassSelector("author")

    birth = (
        PatternSelector(re.compile(r"\d{4}-\d{2}-\d{2}"))
        | Text()
        | Operation(lambda x: datetime.strptime(x, "%Y-%m-%d"))
    )
    name = FirstChild() | Text()


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    author = Author
    title = ClassSelector("title") | Text()


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <div class="author">
            <p>George Orwell</p>
            <p>Great author</p>
            <p>1903-06-25</p>
        </div>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
Book(author=Author(birth=datetime.datetime(1903, 6, 25, 0, 0), name='George Orwell'), title='Animal Farm')

Frozen Model

To make the model immutable, set the __frozen__ attribute to True. This enforces immutability, making the model hashable. Modifying any field of a frozen instance will raise a FrozenModelException. Regardless of immutability, attempting to set an attribute not defined as a field will raise an AttributeError.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Text
from soupsavvy.exceptions import FrozenModelException


class Book(BaseModel):
    __scope__ = TypeSelector("div") & ClassSelector("book")
    __frozen__ = True

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text()


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100$</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

result = Book.find(element)

try:
    result.price = 200  # type: ignore
except FrozenModelException as e:
    print(e)
Model '<class '__main__.Book'>' is frozen and attributes of its instance cannot be modified.
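The behaviour described here resembles a frozen dataclass from the standard library. The comparison below is only an analogy: it uses dataclasses rather than soupsavvy, and FrozenBook is a hypothetical name.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class FrozenBook:
    title: str
    price: str


book = FrozenBook("Animal Farm", "100$")

modified = True
try:
    book.price = "200$"  # type: ignore[misc]
except FrozenInstanceError:
    modified = False  # assignment on a frozen instance is rejected

# Frozen instances are hashable, so they can be stored in sets or used as dict keys.
unique_books = {book, FrozenBook("Animal Farm", "100$")}
print(modified, len(unique_books))  # False 1
```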

Field

By default, all fields of an instance take part in:

  • String representation

  • Equality comparison

  • Hash calculation

  • Migration

To exclude a field from these operations, wrap it in the Field class, which takes the following boolean parameters, all defaulting to True:

  • repr

  • compare

  • migrate

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel, Field
from soupsavvy.operations import Text

PRICE_SELECTOR = ClassSelector("price") | Text()


class Book(BaseModel):
    __scope__ = TypeSelector("div") & ClassSelector("book")
    __frozen__ = True

    title = ClassSelector("title") | Text()
    price = Field(PRICE_SELECTOR, compare=False, repr=False, migrate=False)


text = """
    <div class="book">
        <p class="title">Animal Farm</p>
        <p class="price">100$</p>
    </div>
    <div class="book">
        <p class="title">Animal Farm</p>
        <p class="price">50$</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

result = Book.find_all(element)
print(f"{result[0]} == {result[1]}: {result[0] == result[1]}")
Book(title='Animal Farm') == Book(title='Animal Farm'): True
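The repr and compare flags have a close counterpart in dataclasses.field from the standard library. The snippet below is an analogy only, with a hypothetical BookRecord class; dataclasses have no equivalent of the migrate flag.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BookRecord:
    title: str
    # Excluded from __repr__, __eq__ and, as a consequence, from __hash__.
    price: str = field(compare=False, repr=False)


first = BookRecord("Animal Farm", "100$")
second = BookRecord("Animal Farm", "50$")

print(first)                         # BookRecord(title='Animal Farm')
print(first == second)               # True
print(hash(first) == hash(second))   # True
```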

Migration

You can migrate a model instance to another model using the migrate method, which takes a target class as an argument and initializes it with the current model’s field values.

Pydantic

import pydantic
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class PydanticBook(pydantic.BaseModel):
    title: str
    price: int


class Book(BaseModel):
    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

book = Book.find(element)
book.migrate(PydanticBook)
PydanticBook(title='Animal Farm', price=100)

SQLAlchemy

from bs4 import BeautifulSoup
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import DeclarativeBase

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class Base(DeclarativeBase): ...


class SABook(Base):
    __tablename__ = "book"

    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=True)
    price = Column(Integer, nullable=True)

    def __repr__(self):
        return f"<SABook(title={self.title}, price={self.price})>"


class Book(BaseModel):
    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

book = Book.find(element)
book.migrate(SABook)
<SABook(title=Animal Farm, price=100)>

Mapping

For more complex models that have other models as fields, a mapping can be passed to the migrate method to specify how soupsavvy models should be converted into their respective target models.

import pydantic
from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Text


class PydanticAuthor(pydantic.BaseModel):
    name: str
    country: str


class PydanticBook(pydantic.BaseModel):
    title: str
    author: PydanticAuthor


class Author(BaseModel):
    __scope__ = TypeSelector("span")

    name = TypeSelector("p") | Text()
    country = ClassSelector("country") | Text()


class Book(BaseModel):

    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    author = Author


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <span>
            <p>George Orwell</p>
            <a class="country">United Kingdom</a>
        </span>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

book = Book.find(element)
book.migrate(PydanticBook, mapping={Author: PydanticAuthor})
PydanticBook(title='Animal Farm', author=PydanticAuthor(name='George Orwell', country='United Kingdom'))

MigrationSchema

While the migrate method accepts keyword arguments, these apply only to the target model, not to nested models.

When a nested model needs additional initialization parameters, use MigrationSchema as the mapping value. It bundles the target model with a dictionary of keyword arguments that are passed to the target's initializer.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel, MigrationSchema
from soupsavvy.operations import Text


class TargetAuthor:
    def __init__(self, name: str, country: str, genre=None):
        self.name = name
        self.country = country
        self.genre = genre

    def __repr__(self):
        return f"TargetAuthor(name={self.name!r}, country={self.country!r}, genre={self.genre!r})"


class TargetBook:
    def __init__(self, title: str, author: TargetAuthor, price=None):
        self.title = title
        self.author = author
        self.price = price

    def __repr__(self):
        return f"TargetBook(title={self.title!r}, author={self.author!r}, price={self.price!r})"


class Author(BaseModel):
    __scope__ = TypeSelector("span")

    name = TypeSelector("p") | Text()
    country = ClassSelector("country") | Text()


class Book(BaseModel):
    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    author = Author


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <span>
            <p>George Orwell</p>
            <a class="country">United Kingdom</a>
        </span>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

book = Book.find(element)
book.migrate(
    TargetBook,
    mapping={Author: MigrationSchema(TargetAuthor, params={"genre": "Dystopia"})},
    price=10,
)
TargetBook(title='Animal Farm', author=TargetAuthor(name='George Orwell', country='United Kingdom', genre='Dystopia'), price=10)
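
To make the roles of mapping and params more concrete, here is a minimal sketch in plain Python of how a recursive migration of this kind could work. It is not soupsavvy's implementation: Schema, migrate, and the Source*/Plain* dataclasses are hypothetical names introduced only for illustration, and the real library may behave differently in the details.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class Schema:
    """Hypothetical stand-in for a migration schema: target class plus extra kwargs."""

    target: type
    params: dict = field(default_factory=dict)


@dataclass
class SourceAuthor:
    name: str
    country: str


@dataclass
class SourceBook:
    title: str
    author: SourceAuthor


@dataclass
class PlainAuthor:
    name: str
    country: str
    genre: Optional[str] = None


@dataclass
class PlainBook:
    title: str
    author: PlainAuthor
    price: Optional[int] = None


def migrate(instance, target, mapping=None, **kwargs):
    """Build `target` from the fields of `instance`, converting nested models via `mapping`."""
    mapping = mapping or {}
    values = {}
    for f in fields(instance):
        value = getattr(instance, f.name)
        if type(value) in mapping:
            entry = mapping[type(value)]
            # A bare class in the mapping means "no extra parameters".
            schema = entry if isinstance(entry, Schema) else Schema(entry)
            value = migrate(value, schema.target, mapping, **schema.params)
        values[f.name] = value
    # Keyword arguments reach only this level; nested models get their own params.
    return target(**values, **kwargs)


book = SourceBook("Animal Farm", SourceAuthor("George Orwell", "United Kingdom"))
migrated = migrate(
    book,
    PlainBook,
    mapping={SourceAuthor: Schema(PlainAuthor, {"genre": "Dystopia"})},
    price=10,
)
print(migrated)
```

In this sketch, price=10 is applied only to the top-level target, while genre reaches the nested author through the schema, mirroring the distinction described above.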

Copy

The copy method migrates a model to its own class, creating a new, identical instance as a deep copy that includes all nested models.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text


class Book(BaseModel):
    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = ClassSelector("title") | Text()
    price = ClassSelector("price") | Text() | Operation(int)


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

book = Book.find(element)
book_copy = book.copy()

assert book == book_copy
assert book is not book_copy

print(book_copy)
Book(title='Animal Farm', price=100)
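
The example above uses a flat model, so the "including all nested models" part is not visible. The following sketch shows what deep-copy semantics mean for nested objects, using plain dataclasses and the standard library's copy.deepcopy as a stand-in. PlainAuthor and PlainBook are hypothetical names, and this is not how soupsavvy implements copy internally.

```python
import copy
from dataclasses import dataclass


@dataclass
class PlainAuthor:
    name: str


@dataclass
class PlainBook:
    title: str
    author: PlainAuthor


book = PlainBook("Animal Farm", PlainAuthor("George Orwell"))
book_copy = copy.deepcopy(book)

# Equal by value, but distinct objects at every level of nesting.
assert book == book_copy
assert book is not book_copy
assert book.author is not book_copy.author

# Changing the nested object of the copy leaves the original untouched.
book_copy.author.name = "Eric Blair"
assert book.author.name == "George Orwell"
```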

Tips

Scope

It’s advisable to use the most specific scope selector to ensure that only relevant elements are matched for the model.
You can use HasSelector to extend your selection criteria by matching elements that contain the fields needed for extraction.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, HasSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text

PRICE_SELECTOR = ClassSelector("price")
TITLE_SELECTOR = ClassSelector("title")


class Book(BaseModel):
    __scope__ = (
        ClassSelector("book")
        & HasSelector(PRICE_SELECTOR)
        & HasSelector(TITLE_SELECTOR)
    )

    title = TITLE_SELECTOR | Text()
    price = PRICE_SELECTOR | Text() | Operation(int)


text = """
    <div class="book">Unavailable</div>
    <div class="book">
        <p class="title">Animal Farm</p>
        <p>George Orwell</p>
        <p>4:30</p>
    </div>
    <div class="book">
        <p class="price">50</p>
        <p>Lois Lowry</p>
        <p>3:30</p>
    </div>
    <div class="book">
        <p class="title">Brave New World</p>
        <p class="price">50</p>
        <p>Aldous Huxley</p>
        <p>2:30</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
Book(title='Brave New World', price=50)
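
The filtering idea behind this scope can be sketched without soupsavvy. The snippet below is an illustration only: it uses the standard library's xml.etree.ElementTree on an abridged copy of the markup, wrapped in a hypothetical root element so that it parses as XML, and a hypothetical helper named has. It keeps only the "book" containers that hold both a title and a price.

```python
import xml.etree.ElementTree as ET

# Abridged version of the markup above, wrapped in <root> to be well-formed XML.
markup = """
<root>
    <div class="book">Unavailable</div>
    <div class="book">
        <p class="title">Animal Farm</p>
        <p>George Orwell</p>
    </div>
    <div class="book">
        <p class="price">50</p>
        <p>Lois Lowry</p>
    </div>
    <div class="book">
        <p class="title">Brave New World</p>
        <p class="price">50</p>
        <p>Aldous Huxley</p>
    </div>
</root>
""".strip()
root = ET.fromstring(markup)


def has(element, css_class):
    """Return True if any descendant has exactly this class attribute."""
    return element.find(f".//*[@class='{css_class}']") is not None


# All elements matching the broad criterion (class "book").
candidates = root.findall(".//div[@class='book']")

# Only containers that hold every field the model needs.
complete = [div for div in candidates if has(div, "title") and has(div, "price")]

titles = [div.find(".//*[@class='title']").text for div in complete]
print(len(candidates), titles)
```

Of the four candidates, only the last one survives the stricter filter, which is why a scope restricted this way skips the incomplete entries instead of failing on them.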

When the scope element is the same as the current element, SelfSelector can be used as the scope selector.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, SelfSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Text


class Book(BaseModel):
    __scope__ = SelfSelector()

    title = ClassSelector("title") | Text()
    author = ClassSelector("author") | Text()


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="author">George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)
div = element.find_all("div")[0]

Book.find(div)
Book(title='Animal Farm', author='George Orwell')

Typing

To maintain clean and consistent typing, you can use typing.cast to provide type checkers with hints about instance field types.

from typing import Optional, cast

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, SkipNone, Text


class Book(BaseModel):
    __scope__ = TypeSelector("div") & ClassSelector("book")

    title = cast(str, ClassSelector("title") | Text())
    price = cast(
        Optional[int], ClassSelector("price") | SkipNone(Text() | Operation(int))
    )


text = """
    <div class="book" href="www.book.com">
        <p class="title">Animal Farm</p>
        <p class="price">100</p>
        <p>George Orwell</p>
    </div>
"""
soup = BeautifulSoup(text, features="lxml")
element = to_soupsavvy(soup)

Book.find(element)
Book(title='Animal Farm', price=100)
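
It may help to recall that typing.cast affects only static analysis: at runtime it returns its argument unchanged and performs no conversion or validation. The short sketch below demonstrates this with a hypothetical FieldDefinition class standing in for a selector and operation pipeline.

```python
from typing import Optional, cast


class FieldDefinition:
    """Hypothetical placeholder for a selector/operation pipeline."""


definition = FieldDefinition()

# Type checkers now treat these names as str and Optional[int] respectively.
title = cast(str, definition)
price = cast(Optional[int], definition)

# At runtime nothing changed: cast handed back the very same object.
assert title is definition
assert price is definition
assert not isinstance(title, str)
```

Because the object assigned to the class attribute is untouched, wrapping a field definition in cast should not change what the model sees when the class is created; only the type checker's view of the instance attributes is affected.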

Conclusion

soupsavvy offers a framework for object-oriented web scraping through user-defined models.
This allows users to define the structure of data they wish to extract from HTML documents.

Enjoy soupsavvy and leave us feedback!
Happy scraping!