Selectors

Selectors are used to search for elements in BeautifulSoup objects. This tutorial demonstrates the simple selectors that form the core of soupsavvy.

API

Every soupsavvy selector follows a consistent interface, providing an API to:

  • Search for elements within BeautifulSoup objects.

  • Check selectors for equality.

  • Combine selectors to create more complex queries.

Find

The find method searches for the first element that matches the selector.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, to_soupsavvy

soup = BeautifulSoup(
    """
    <span class="title">Animal Farm</span>
    <p class="price">Price: $10</p>
    <p class="price">Price: $20</p>
    <p class="price">Price: $30</p>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)
selector = ClassSelector("price")
selector.find(element)
SoupElement(<p class="price">Price: $10</p>)

Strict

When no match is found, the behavior of the find method is controlled by the strict parameter:

  • True - Raises a TagNotFoundException.

  • False - Returns None.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, to_soupsavvy
from soupsavvy.exceptions import TagNotFoundException

soup = BeautifulSoup(
    """
    <span class="title">Animal Farm</span>
    <p>Hello World</p>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)
selector = ClassSelector("price")

print(f"NOT STRICT: {selector.find(element)}")

try:
    selector.find(element, strict=True)
except TagNotFoundException as e:
    print(f"STRICT: {e}")
NOT STRICT: None
STRICT: Tag was not found in markup.

Recursive

The search depth is defined by the recursive parameter:

  • True - Performs a recursive search on the element’s descendants.

  • False - Searches only within the direct children of the element.

This parameter also applies to the find_all method.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <span class="title">Animal Farm</span>
        <div class="inactive">
            <p class="price">Price: $10</p>
            <p class="price">Price: $20</p>
        </div>
        <p class="price">Price: $30</p>
    """,
    features="html.parser",
)
element = to_soupsavvy(soup)
selector = ClassSelector("price")
selector.find(element, recursive=False)
SoupElement(<p class="price">Price: $30</p>)

Find All

The find_all method searches for all elements that match the selector.
The resulting list contains unique elements in the order they appear in the document.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, to_soupsavvy

soup = BeautifulSoup(
    """
    <span>Hello World</span>
    <p class="price">Price: $10</p>
    <p class="price">Price: $20</p>
    <p class="price">Price: $30</p>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)
selector = ClassSelector("price")
selector.find_all(element)
[SoupElement(<p class="price">Price: $10</p>),
 SoupElement(<p class="price">Price: $20</p>),
 SoupElement(<p class="price">Price: $30</p>)]

Limit

You can restrict the number of elements returned by using the limit parameter:

  • None - Returns all matching elements.

  • int - Returns up to the specified number of matching elements.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, to_soupsavvy

soup = BeautifulSoup(
    """
    <span>Hello World</span>
    <p class="price">Price: $10</p>
    <p class="price">Price: $20</p>
    <p class="price">Price: $30</p>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)
selector = ClassSelector("price")
selector.find_all(element, limit=2)
[SoupElement(<p class="price">Price: $10</p>),
 SoupElement(<p class="price">Price: $20</p>)]

Equality

All selectors can be compared. If two selectors are equal, their search results will always be identical.

from soupsavvy import TypeSelector

print(f"{TypeSelector('p') == TypeSelector('div') = }")
print(f"{TypeSelector('p') == TypeSelector('p') = }")
TypeSelector('p') == TypeSelector('div') = False
TypeSelector('p') == TypeSelector('p') = True

Combining

Selectors can be combined in various ways to create composite selectors, which are covered in the following tutorial. The example below uses the & operator to match elements that satisfy both selectors.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy

soup = BeautifulSoup(
    """
    <span>Hello World</span>
    <p class="price">Price: $10</p>
    <a class="price">Price: $20</a>
    <p class="price">Price: $30</p>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)
selector = ClassSelector("price") & TypeSelector("a")
selector.find(element)
SoupElement(<a class="price">Price: $20</a>)

AttributeSelector

Attribute selectors in soupsavvy allow you to select elements based on their attribute values.
For more information about the CSS counterpart, refer to Mozilla.

Find an element that has a specific attribute, regardless of the attribute’s value.

from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector, to_soupsavvy

soup = BeautifulSoup(
    """
    <span>Animal Farm</span>
    <a href="/shop">Price: $20</a>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)
selector = AttributeSelector("href")
selector.find(element)
SoupElement(<a href="/shop">Price: $20</a>)

Find an element with an exact attribute value by passing a string.

from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector, to_soupsavvy

soup = BeautifulSoup(
    """
    <p class="price">Price: $20</p>
    <a role="main">Home</a>
    <a role="button">Add to Cart</a>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)
selector = AttributeSelector("role", value="button")
selector.find(element)
SoupElement(<a role="button">Add to Cart</a>)

Find elements based on a regular expression pattern.

import re

from bs4 import BeautifulSoup

from soupsavvy import AttributeSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <span>Animal Farm</span>
        <a href="https://www.fictiondb.com/title/animal-farm~george-orwell~161188.htm">fictiondb</a>
        <a href="https://search.worldcat.org/title/1056176764">worldcat</a>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)
selector = AttributeSelector("href", value=re.compile(r"worldcat\.org/.*/\d{10}"))
selector.find(element)
SoupElement(<a href="https://search.worldcat.org/title/1056176764">worldcat</a>)
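
The pattern above does not start at the beginning of the URL, yet the worldcat link is found, which suggests that the pattern may match anywhere in the attribute value. The sketch below checks this reasoning with the standard re module on plain strings; it is not soupsavvy's implementation, and the variable names are illustrative.

```python
import re

pattern = re.compile(r"worldcat\.org/.*/\d{10}")
hrefs = [
    "https://www.fictiondb.com/title/animal-farm~george-orwell~161188.htm",
    "https://search.worldcat.org/title/1056176764",
]

# search() looks for the pattern anywhere in the string
searched = [href for href in hrefs if pattern.search(href)]
# match() only succeeds when the pattern matches at the very start
matched = [href for href in hrefs if pattern.match(href)]

print(searched)  # ['https://search.worldcat.org/title/1056176764']
print(matched)   # []
```

Testing a pattern against sample attribute values this way is a quick check before passing it to a selector.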

Specific attribute selectors

Most commonly used attributes for selection have their own dedicated selectors:

  • IdSelector: Matches elements by their id attribute value.

  • ClassSelector: Matches elements by their class attribute value.

For more information about the CSS counterparts, refer to Mozilla for Class and ID selectors.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, to_soupsavvy

soup = BeautifulSoup(
    """<span class="title">Animal Farm</span><p class="price">Price: $20</p>""",
    features="lxml",
)
element = to_soupsavvy(soup)
price_selector = ClassSelector("price")
price_selector.find(element)
SoupElement(<p class="price">Price: $20</p>)

Find an element whose id matches a regular expression pattern.

import re

from bs4 import BeautifulSoup

from soupsavvy import IdSelector, to_soupsavvy

soup = BeautifulSoup(
    """<p id="12ghj8">Book</p><p id="13cji0" class="price">Price: $20</p>""",
    features="lxml",
)
element = to_soupsavvy(soup)
price_selector = IdSelector(re.compile(r"^13.*0$"))
price_selector.find(element)
SoupElement(<p class="price" id="13cji0">Price: $20</p>)
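
The ^ and $ anchors in the pattern above constrain the whole id value rather than a fragment of it. The following sketch illustrates this with the standard re module; the first two ids come from the example, the other two are hypothetical values added only to show the effect of each anchor.

```python
import re

pattern = re.compile(r"^13.*0$")

# "12ghj8" and "13cji0" appear in the example above;
# "x13cji0" and "13cji01" are hypothetical ids.
ids = ["12ghj8", "13cji0", "x13cji0", "13cji01"]

# ^ requires the value to start with "13", $ requires it to end with "0"
matching = [value for value in ids if pattern.search(value)]

print(matching)  # ['13cji0']
```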

TypeSelector

TypeSelector selects elements based on their tag name. For more information about the CSS counterpart, refer to Mozilla.

from bs4 import BeautifulSoup

from soupsavvy import TypeSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <span class="title">Animal Farm</span>
        <p class="price">Price: $10</p>
        <span>Hello World</span>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)
price_selector = TypeSelector("p")
price_selector.find(element)
SoupElement(<p class="price">Price: $10</p>)

UniversalSelector

UniversalSelector is a wildcard selector that matches any tag.
Its CSS counterpart is *. For more information, refer to Mozilla.

from bs4 import BeautifulSoup

from soupsavvy import UniversalSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <p class="title" lang="es">Rebelión en la granja</p>
        <p class="description" lang="en">Some animals are more equal than others</p>
        <span class="title" lang="en">Animal Farm</span>
    """,
    features="html.parser",
)
element = to_soupsavvy(soup)
any_selector = UniversalSelector()
any_selector.find(element)
SoupElement(<p class="title" lang="es">Rebelión en la granja</p>)

PatternSelector

PatternSelector is designed to select elements based on their text content. While BeautifulSoup returns NavigableString for such queries, which is limiting, PatternSelector returns elements with text content that matches the provided pattern.

Find an element whose text content exactly matches a string.

from bs4 import BeautifulSoup

from soupsavvy import PatternSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <p class="title" lang="es">Rebelión en la granja</p>
        <p class="description" lang="en">Some animals are more equal than others</p>
        <span class="title" lang="en">Animal Farm</span>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)
selector = PatternSelector("Animal Farm")
selector.find(element)
SoupElement(<span class="title" lang="en">Animal Farm</span>)

Find elements based on a regular expression pattern.

import re

from bs4 import BeautifulSoup

from soupsavvy import PatternSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <p class="title" lang="es">Rebelión en la granja</p>
        <p class="description" lang="en">Some animals are more equal than others</p>
        <span class="title" lang="en">Animal Farm</span>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)
selector = PatternSelector(re.compile(r"animal", re.IGNORECASE))
selector.find(element)
SoupElement(<p class="description" lang="en">Some animals are more equal than others</p>)
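
The regular expression example returns the description paragraph rather than the "Animal Farm" span because both texts contain the pattern and find returns the first match in document order. The sketch below reproduces that reasoning with the standard re module on plain strings; it assumes the pattern may match anywhere in the text and is not soupsavvy's implementation.

```python
import re

# Text content of the three elements, in document order
texts = [
    "Rebelión en la granja",
    "Some animals are more equal than others",
    "Animal Farm",
]

pattern = re.compile(r"animal", re.IGNORECASE)

# Every text that contains the pattern, ignoring case
matches = [text for text in texts if pattern.search(text)]

print(matches)
# ['Some animals are more equal than others', 'Animal Farm']
print(matches[0])
# Some animals are more equal than others
```

To target only the title, the pattern would need to be more specific, for example by anchoring it to the start of the text.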

XPathSelector

The XPathSelector enables the use of XPath expressions to select elements, a feature not natively supported by BeautifulSoup. It relies on lxml, which needs to be installed. The XPath expression must target HTML elements in order to return valid results.

from bs4 import BeautifulSoup

from soupsavvy import XPathSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <span class="id">1234</span>
        <span class="title">Frankenstein</span>
        <p class="title">Wild Animal</p>
        <span class="title">Animal Farm</span>
    """,
    features="html.parser",
)
element = to_soupsavvy(soup)
selector = XPathSelector("//span[@class='title'][contains(text(),'Animal')]")
# selector.find(element)

ExpressionSelector

The ExpressionSelector allows you to define your own custom logic for selecting elements by providing a predicate function. This function evaluates each element and decides whether it should be included in the result set.

This works similarly to the BeautifulSoup API, where you can pass a predicate function to find methods:

soup.find(lambda tag: tag.name == 'div')

from bs4 import BeautifulSoup

from soupsavvy import ExpressionSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <span class="id">1234</span>
        <p class="title">Animal Farm</p>
        <span class="title">Frankenstein</span>
    """,
    features="html.parser",
)
element = to_soupsavvy(soup)
selector = ExpressionSelector(
    lambda tag: tag.name != "p" and "title" in tag.get()["class"]
)
selector.find(element)
SoupElement(<span class="title">Frankenstein</span>)

Conclusion

These fundamental selectors form the core of soupsavvy and provide the building blocks for more complex queries.
Read about composite selectors here.

Enjoy soupsavvy and leave us feedback!
Happy scraping!