Composite Selectors

Composite selectors allow you to combine multiple selectors into one for more refined search criteria.
Various ways of combining selectors are described in this tutorial.

Combinators

Inspired by CSS, combinators in soupsavvy let you define relationships between multiple selectors.
For more information on CSS combinators, you can refer to Mozilla.

Operators

Combinators can be created using operators as a more concise alternative. Each combinator has a corresponding operator that defines the relationship between two selectors.

Combinator(left, right) == left {operator} right

For example, the >> operator can be used as a shorthand for DescendantCombinator:

DescendantCombinator(left, right) == left >> right

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <p class="price">Price: $30</p>
        <div class="book">
            <span class="title">Animal Farm</span>
            <span class="price_section">
                <p class="price">Price: $20</p>
            </span>
        </div>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)

selector = ClassSelector("book") >> ClassSelector("price")
selector.find(element)
SoupElement(<p class="price">Price: $20</p>)
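
The mapping between an operator and a combinator can be sketched with plain Python operator overloading. The classes below (ToySelector, ToyDescendant) are hypothetical stand-ins written for this illustration, not soupsavvy classes, and the sketch makes no claim about how soupsavvy implements its operators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySelector:
    """Hypothetical stand-in for a selector; not a soupsavvy class."""

    name: str

    def __rshift__(self, other):
        # `left >> right` builds the same object as ToyDescendant(left, right)
        return ToyDescendant(self, other)


@dataclass(frozen=True)
class ToyDescendant:
    """Hypothetical stand-in for a descendant combinator."""

    left: ToySelector
    right: ToySelector


book = ToySelector("book")
price = ToySelector("price")

via_operator = book >> price
via_constructor = ToyDescendant(book, price)

print(via_operator == via_constructor)  # True
print((price >> book) == via_operator)  # False: operand order matters
```

Because the toy combinator stores its operands as ordered fields, swapping them produces a different object.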

Multiple Selectors

Combinators accept any number of selectors as positional arguments, so more than two selectors can be chained in a single combinator.

Combinator(first, second, third)

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, DescendantCombinator, IdSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <p class="price">Price: $30</p>
        <div class="book">
            <span class="title">Animal Farm</span>
            <p class="price">Price: $10</p>
        </div>
        <div id="available">
            <div class="book">
                <span class="title">Animal Farm</span>
                <span class="price_section">
                    <p class="price">Price: $20</p>
                </span>
            </div>
        </div>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)

selector = DescendantCombinator(
    IdSelector("available"),
    ClassSelector("book"),
    ClassSelector("price"),
)
selector.find(element)
SoupElement(<p class="price">Price: $20</p>)

Combinators Equality

Two combinators are considered equal only if they are of the same type and contain the exact same selectors in the same order. The order of selectors is significant:

left {operator} right != right {operator} left

from soupsavvy import ClassSelector

book_selector = ClassSelector("book")
price_selector = ClassSelector("price")

print(
    "left >> right == right >> left:",
    price_selector >> book_selector == book_selector >> price_selector,
)
left >> right == right >> left: False

Non-Recursive

For combinators, setting recursive=False ensures that elements are returned only if the element matched by the first selector is a direct child of the searched element.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <p class="price">Price: $30</p>
        <span class="not_child_book">
            <div class="book">
                <span class="title">Animal Farm</span>
                <span class="price_section">
                    <p class="price">Price: $50</p>
                </span>
            </div>
        </span>
        <div class="book">
            <span class="title">Animal Farm</span>
            <span class="price_section">
                <p class="price">Price: $20</p>
            </span>
        </div>
    """,
    features="html.parser",
)
element = to_soupsavvy(soup)

selector = ClassSelector("book") >> ClassSelector("price")
selector.find(element, recursive=False)
SoupElement(<p class="price">Price: $20</p>)
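
The difference between the two search modes can be sketched with the standard library's xml.etree.ElementTree instead of soupsavvy. This is only an illustration of the semantics described above, using a simplified copy of the markup wrapped in an assumed root element; the helper name prices is made up for this example.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the example markup, wrapped in a <root> element
root = ET.fromstring(
    """
    <root>
        <p class="price">Price: $30</p>
        <span class="not_child_book">
            <div class="book">
                <p class="price">Price: $50</p>
            </div>
        </span>
        <div class="book">
            <span class="price_section">
                <p class="price">Price: $20</p>
            </span>
        </div>
    </root>
    """
)


def prices(books):
    """Collect the text of every .price descendant of the given books."""
    return [
        p.text
        for book in books
        for p in book.iter("p")
        if p.get("class") == "price"
    ]


# Non-recursive: only books that are direct children of the searched element
direct_books = root.findall("div[@class='book']")
# Recursive: books anywhere below the searched element
all_books = root.findall(".//div[@class='book']")

non_recursive = prices(direct_books)
recursive = prices(all_books)

print(non_recursive)  # ['Price: $20']
print(recursive)      # ['Price: $50', 'Price: $20']
```

In both modes the second step still looks at all descendants of the book; only the first step is restricted to direct children.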

Combining Combinators

Combinators can be combined to replicate complex CSS relationships, like:

#available > div .price

This is achieved using ChildCombinator and DescendantCombinator together.

Precedence Caveats

Note that some operators have higher precedence than others, which affects the order in which expressions are evaluated. This follows standard Python operator precedence.

left > middle >> right

The >> (DescendantCombinator) takes precedence over > (ChildCombinator), resulting in:

ChildCombinator(left, DescendantCombinator(middle, right))

Use parentheses to adjust precedence as needed.
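
The grouping can be checked without soupsavvy by asking Python's own parser how it reads the expression. The sketch below uses the standard ast module; the names left, middle and right are placeholders that are parsed but never evaluated, and parse is a helper defined only for this example.

```python
import ast


def parse(expression):
    """Return the top-level AST node of a Python expression."""
    return ast.parse(expression, mode="eval").body


unparenthesized = parse("left > middle >> right")
parenthesized = parse("(left > middle) >> right")

# Without parentheses the top-level node is the comparison `>`,
# and its right operand is the shift `middle >> right`.
print(type(unparenthesized).__name__)                 # Compare
print(type(unparenthesized.comparators[0]).__name__)  # BinOp

# With parentheses the top-level node is the shift `>>`,
# and its left operand is the comparison `left > middle`.
print(type(parenthesized).__name__)       # BinOp
print(type(parenthesized.left).__name__)  # Compare
```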

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, IdSelector, TypeSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <p class="price">Price: $30</p>
        <div>
            <span class="title">Animal Farm</span>
            <p class="price">Price: $10</p>
        </div>
        <div id="available">
            <div>
                <span class="title">Animal Farm</span>
                <span class="discount">
                    <h2>Discounted</h2>
                    <p class="price">Price: $15</p>
                </span>
                <p class="price">Price: $20</p>
            </div>
        </div>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)

selector = (IdSelector("available") > TypeSelector("div")) >> ClassSelector("price")
selector.find(element)
SoupElement(<p class="price">Price: $15</p>)

DescendantCombinator

The descendant combinator is one of the simplest and most frequently used combinators in CSS, where it is written as a single space between two selectors. DescendantCombinator defines the same relationship between two selectors: the second selector matches descendants of the element matched by the first selector.

CSS Example:

.book .price

This matches all elements with the class price that are descendants of elements with the class book.

Operator: >>

Reference: Mozilla

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <p class="price">Price: $30</p>
        <div class="book">
            <span class="title">Animal Farm</span>
            <span class="price_section">
                <p class="price">Price: $20</p>
            </span>
        </div>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)

selector = ClassSelector("book") >> ClassSelector("price")
selector.find(element)
SoupElement(<p class="price">Price: $20</p>)

ChildCombinator

Defines the relationship between two selectors, where the second selector matches only the direct children of the element matched by the first selector.

CSS Example:

div > .price

Operator: >

Reference: Mozilla

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <p class="price">Price: $30</p>
        <div>
            <span class="title">Animal Farm</span>
            <span class="discount">
                <h2>Discounted</h2>
                <p class="price">Price: $15</p>
            </span>
            <p class="price">Price: $20</p>
        </div>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)

selector = TypeSelector("div") > ClassSelector("price")
selector.find(element)
SoupElement(<p class="price">Price: $20</p>)

NextSiblingCombinator

Defines the relationship between two selectors, where the second selector matches the immediate sibling that directly follows the element matched by the first selector.

CSS Example:

div + .price

Operator: +

Reference: Mozilla

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, PatternSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <h2>Discounted</h2>
        <span>Unavailable</span>
        <p class="price">Price: $10</p>
        <h1>Discounted</h1>
        <p class="price">Price: $20</p>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)

selector = PatternSelector("Discounted") + ClassSelector("price")
selector.find(element)
SoupElement(<p class="price">Price: $20</p>)

SubsequentSiblingCombinator

Defines the relationship between two selectors, where the second selector matches all siblings that follow the element matched by the first selector.

CSS Example:

div ~ .price

Operator: *

Reference: Mozilla

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <p class="price">Price: $25</p>
        <h2>Discounted</h2>
        <span>Bargain!!!</span>
        <p class="price">Price: $15</p>
        <p class="price">Price: $10</p>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)

selector = TypeSelector("h2") * ClassSelector("price")
selector.find_all(element)
[SoupElement(<p class="price">Price: $15</p>),
 SoupElement(<p class="price">Price: $10</p>)]

ParentCombinator

Defines the relationship between two selectors, where the second selector matches elements that are the parent of an element matched by the first selector.

CSS Example:

.discount:has(> p)

Operator: <

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <p class="price">Price: $15</p>
        <span class="book">
            <p class="price">Price: $25</p>
        </span>
        <span class="discount"></span>
        <span class="discount">
            <div>
                <p class="price">Price: $35</p>
            </div>
        </span>
        <span class="discount">
            <p class="price">Price: $10</p>
        </span>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)

selector = TypeSelector("p") < ClassSelector("discount")
result = selector.find(element)
print(result)
<span class="discount">
<p class="price">Price: $10</p>
</span>

AncestorCombinator

Defines the relationship between two selectors, where the second selector matches elements that are an ancestor of an element matched by the first selector.

CSS Example:

.discount:has(p)

Operator: <<

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <p class="price">Price: $15</p>
        <span class="book">
            <p class="price">Price: $25</p>
        </span>
        <span class="discount"></span>
        <span class="discount">
            <div><p class="price">Price: $35</p></div>
        </span>
        <span class="discount">
            <p class="price">Price: $10</p>
        </span>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)

selector = TypeSelector("p") << ClassSelector("discount")
print("\n\n".join(str(match) for match in selector.find_all(element)))
<span class="discount">
<div><p class="price">Price: $35</p></div>
</span>

<span class="discount">
<p class="price">Price: $10</p>
</span>

Logical Selectors

These selectors allow you to create new selectors by combining multiple selectors using logical operators such as AND, OR, NOT, and XOR.

Equality

Logical selectors, unlike combinators, are commutative. This means that the order of selectors within a logical selector does not affect the result.

first & second == second & first

from soupsavvy import ClassSelector

discount_selector = ClassSelector("discount")
price_selector = ClassSelector("price")

print(
    "left & right == right & left:",
    discount_selector & price_selector == price_selector & discount_selector,
)
left & right == right & left: True

Additionally, two instances can be considered equal even if they contain a different number of selectors, as long as they represent the same criteria.

from soupsavvy import AttributeSelector, ClassSelector, SelectorList

discount_selector = ClassSelector("discount")
price_selector = ClassSelector("price")
another_price_selector = AttributeSelector("class", value="price")

print(
    SelectorList(discount_selector, price_selector)
    == SelectorList(discount_selector, price_selector, another_price_selector)
)
True
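
One way to picture this kind of equality is to compare the contained selectors as a set rather than as a sequence. The ToyAnd class below is a hypothetical sketch that assumes equivalent selectors already compare equal (plain strings stand in for them); this tutorial does not show how soupsavvy implements the comparison.

```python
class ToyAnd:
    """Hypothetical order-insensitive container; not soupsavvy's AndSelector."""

    def __init__(self, *selectors):
        # A frozenset ignores both order and duplicates
        self.selectors = frozenset(selectors)

    def __eq__(self, other):
        return isinstance(other, ToyAnd) and self.selectors == other.selectors

    def __hash__(self):
        return hash(self.selectors)


swapped = ToyAnd("discount", "price") == ToyAnd("price", "discount")
repeated = ToyAnd("discount", "price") == ToyAnd("discount", "price", "price")
different = ToyAnd("discount", "price") == ToyAnd("discount", "title")

print(swapped)    # True: order does not matter
print(repeated)   # True: a repeated criterion adds nothing
print(different)  # False: the criteria differ
```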

AndSelector

AndSelector corresponds to the CSS compound selector, which is a concatenation of multiple selectors. It selects elements that match all of the specified selectors.

CSS Example:

p.price

Operator: &

Reference: Mozilla

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <p class="title">Animal Farm</p>
        <span class="price">Price: $30</span>
        <p class="price">Price: $20</p>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)

selector = TypeSelector("p") & ClassSelector("price")
print(selector.find(element))
<p class="price">Price: $20</p>

SelectorList

SelectorList corresponds to the CSS selector list. It selects elements that match any of the specified selectors.

CSS Example:

h1, h2

Operator: |

Aliases: OrSelector

Reference: Mozilla

from bs4 import BeautifulSoup

from soupsavvy import TypeSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <h1>Hello World</h1>
        <span>Extra information</span>
        <h2>Goodbye World</h2>
        <h3>Not interested</h3>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)

selector = TypeSelector("h1") | TypeSelector("h2")
selector.find_all(element)
[SoupElement(<h1>Hello World</h1>), SoupElement(<h2>Goodbye World</h2>)]

NotSelector

NotSelector corresponds to the CSS :not() pseudo-class, which excludes elements that match a specified selector. It allows you to select elements that do not meet certain criteria.

CSS Example:

:not(.discount)

Operator: ~

NotSelector(selector) == ~selector

Reference: Mozilla

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <p class="price discount">Price: €10</p>
        <p class="price">Price: $20</p>
    """,
    features="html.parser",
)
element = to_soupsavvy(soup)

selector = ~ClassSelector("discount")
selector.find(element)
SoupElement(<p class="price">Price: $20</p>)

Multiple Selectors

When more than one selector is passed to NotSelector, it selects elements that match none of them. Alternatively, a SelectorList can be negated to exclude multiple selectors:

NotSelector(left, right) == ~(left | right)

import re

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, NotSelector, PatternSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <p class="price discount">Price: €10</p>
        <p class="price">Price: $20</p>
        <p class="price">Price: €15</p>
    """,
    features="html.parser",
)
element = to_soupsavvy(soup)

discount_selector = ClassSelector("discount")
dollars_selector = PatternSelector(re.compile(r"\$\d+"))
selector = NotSelector(discount_selector, dollars_selector)
selector.find(element)
SoupElement(<p class="price">Price: €15</p>)
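
The equivalence between negating several selectors and negating their union is De Morgan's law. The sketch below checks it on plain dictionaries that mimic the three elements above; is_discount and in_dollars are hypothetical predicates standing in for the two selectors, so no soupsavvy API is involved.

```python
import re

# Plain-data stand-ins for the three <p> elements in the example
elements = [
    {"classes": {"price", "discount"}, "text": "Price: €10"},
    {"classes": {"price"}, "text": "Price: $20"},
    {"classes": {"price"}, "text": "Price: €15"},
]


def is_discount(element):
    return "discount" in element["classes"]


def in_dollars(element):
    return re.search(r"\$\d+", element["text"]) is not None


# ~(left | right): reject anything matched by either predicate
not_union = [
    el["text"] for el in elements if not (is_discount(el) or in_dollars(el))
]
# Equivalent form: must fail the first predicate AND fail the second
not_each = [
    el["text"] for el in elements if not is_discount(el) and not in_dollars(el)
]

print(not_union)  # ['Price: €15']
print(not_each)   # ['Price: €15']
```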

XORSelector

XORSelector corresponds to the logical XOR operation on selectors, selecting elements that match exactly one of the provided selectors.

CSS Equivalent:
While CSS does not have a direct counterpart, you can achieve similar results using selector list with :not() pseudo-class:

span:not(.discount), .discount:not(span)

Operator: ^

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <span class="discount">Buy!</span>
        <p class="price">Price: $10</p>
        <span class="price">Price: $20</span>
        <p class="discount">Price: $30</p>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)

selector = ClassSelector("discount") ^ TypeSelector("span")
selector.find_all(element)
[SoupElement(<span class="price">Price: $20</span>),
 SoupElement(<p class="discount">Price: $30</p>)]
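
The XOR logic, and its equivalence to the CSS workaround shown above, can be sketched on plain tuples that mimic the four elements in the example. The predicates is_discount and is_span are hypothetical stand-ins for the two selectors.

```python
# (tag, class, text) stand-ins for the elements in the example
elements = [
    ("span", "discount", "Buy!"),
    ("p", "price", "Price: $10"),
    ("span", "price", "Price: $20"),
    ("p", "discount", "Price: $30"),
]


def is_discount(element):
    return element[1] == "discount"


def is_span(element):
    return element[0] == "span"


# XOR: exactly one of the two predicates holds
xor_matches = [el for el in elements if is_discount(el) != is_span(el)]

# Same result spelled like the CSS: span:not(.discount), .discount:not(span)
css_style_matches = [
    el
    for el in elements
    if (is_span(el) and not is_discount(el))
    or (is_discount(el) and not is_span(el))
]

print([el[2] for el in xor_matches])  # ['Price: $20', 'Price: $30']
print(xor_matches == css_style_matches)  # True
```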

Relative Selectors

In addition to its selection criteria, a relative selector defines a relationship with an anchor element.

According to Mozilla, relative selectors represent elements in relation to one or more anchor elements, typically introduced by a combinator.

CSS Example:

.discount:has(> p)

In this case, an element of type p is in a child-parent relationship with the element of class discount (the anchor).

In the context of soupsavvy:

  • Anchor Element: The bs4 object being searched.

  • Relative Element: Any element that matches the selector and maintains a specified relationship with the anchor element.

Anchor

Relative selectors can also be created by applying an operator to the Anchor object. The operators match those used for combinators:

  • >> -> RelativeDescendant

  • > -> RelativeChild

  • << -> RelativeAncestor

  • < -> RelativeParent

  • + -> RelativeNextSibling

  • * -> RelativeSubsequentSibling

from bs4 import BeautifulSoup

from soupsavvy import Anchor, ClassSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <div>
            <span><p class="price">Price: $10</p></span>
            <p class="price">Price: $20</p>
        </div>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)
div = element.find_all("div")[0]

selector = Anchor > ClassSelector("price")
selector.find(div)
SoupElement(<p class="price">Price: $20</p>)

Relative Siblings

RelativeNextSibling and RelativeSubsequentSibling are used to select following siblings of the anchor element (element passed to find methods).

from bs4 import BeautifulSoup

from soupsavvy import Anchor, ClassSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <p class="title">Animal Farm</p>
        <div class="section">Book 1</div>
        <p class="price">Price: $30</p>
        <p class="discount">Price: $20</p>
        <p class="price">Price: $10</p>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)
div = element.find_all("div")[0]

selector = Anchor + ClassSelector("price")
selector.find_all(div)
[SoupElement(<p class="price">Price: $30</p>)]

Recursivity

Relative selectors in soupsavvy are not affected by the recursive parameter. Their behavior is determined solely by the relationship between the anchor and relative elements. In the example below, the RelativeDescendant relationship takes precedence over the recursive parameter.

from bs4 import BeautifulSoup

from soupsavvy import Anchor, ClassSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <div>
            <span><p class="price">Price: $10</p></span>
            <p class="price">Price: $20</p>
        </div>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)
div = element.find_all("div")[0]

selector = Anchor >> ClassSelector("price")
selector.find(div, recursive=False)
SoupElement(<p class="price">Price: $10</p>)

HasSelector

The HasSelector is the counterpart of the CSS :has() pseudo-class.
According to Mozilla, this pseudo-class matches an element if any relative selectors passed as arguments match at least one element.

CSS Example:

:has(> .price)

This selector matches any element that has a direct child with the class price.
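
The matching rule of :has(> .price) can be sketched with the standard library's xml.etree.ElementTree. This illustrates the semantics only and does not use HasSelector itself; the markup is an assumed sample wrapped in a body element, and has_price_child is a helper defined just for this example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    """
    <body>
        <div>
            <span><p class="price">Price: $10</p></span>
            <p class="price">Price: $20</p>
        </div>
    </body>
    """
)


def has_price_child(element):
    """True if any direct child of the element has the class 'price'."""
    return any(child.get("class") == "price" for child in element)


# Iterate in document order and keep elements with a matching direct child
matching_tags = [el.tag for el in root.iter() if has_price_child(el)]

print(matching_tags)  # ['div', 'span']
```

The body element is not matched: its only child is the div, and its price elements are deeper descendants rather than direct children.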

Nth Selectors

Nth selectors allow you to select the nth element that matches a specific selector.

While CSS offers pseudo-classes like nth-child and nth-of-type to select elements based on their ordinal position among siblings, these selectors only apply to the overall order of siblings.

For example, selecting every 2nd element with the class price is not possible with these pseudo-classes alone, because:

.price:nth-child(2n)

selects elements with the class price that are even-numbered children of their parent, not every 2nd price element.

In soupsavvy, you can achieve this with:

NthOfSelector(ClassSelector('price'), nth="2n")

This selector selects every 2nd element with the class price.

NthOfSelector

The NthOfSelector enables you to select elements based on a specified occurrence pattern defined by the nth rule. The provided nth parameter must follow valid CSS syntax (<An+B>, even, or odd).

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, NthOfSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <span class="title">Animal Farm</span>
        <p class="price discount">Price: €1</p>
        <p class="price">Price: $2</p>
        <span>Bestseller</span>
        <p class="price">Price: €3</p>
        <p class="price">Price: €4</p>
        <p class="price">Price: €5</p>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)

selector = NthOfSelector(ClassSelector("price"), nth="2n")
selector.find_all(element)
[SoupElement(<p class="price">Price: $2</p>),
 SoupElement(<p class="price">Price: €4</p>)]

NthLastOfSelector

The NthLastOfSelector functions similarly to the NthOfSelector, but it counts elements from the end of the list.

import re

from bs4 import BeautifulSoup

from soupsavvy import NthLastOfSelector, PatternSelector, to_soupsavvy

soup = BeautifulSoup(
    """
        <span class="title">Animal Farm</span>
        <p class="price discount">Price: €1</p>
        <p>Price: $2</p>
        <span>Bestseller</span>
        <p class="price">Price: €3</p>
        <p>Price: €4</p>
        <p class="price">Price: €5</p>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)

selector = NthLastOfSelector(
    PatternSelector(re.compile("^price", re.IGNORECASE)),
    nth="odd",
)
selector.find_all(element)
[SoupElement(<p class="price discount">Price: €1</p>),
 SoupElement(<p class="price">Price: €3</p>),
 SoupElement(<p class="price">Price: €5</p>)]
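
The counting rule behind both selectors can be sketched in plain Python. The functions below (parse_nth, position_matches, select_nth) are hypothetical helpers, not part of soupsavvy; they assume the input is the list of already-matched sibling elements in document order and support only a basic subset of the An+B syntax.

```python
import re


def parse_nth(nth):
    """Parse 'even', 'odd', or an An+B expression into integers (a, b)."""
    text = nth.replace(" ", "").lower()
    if text == "even":
        return 2, 0
    if text == "odd":
        return 2, 1
    match = re.fullmatch(r"(?:([+-]?\d*)n)?([+-]?\d+)?", text)
    if not text or match is None:
        raise ValueError(f"invalid nth expression: {nth!r}")
    a_raw, b_raw = match.groups()
    if a_raw is None:
        a = 0  # no "n" term, e.g. "3"
    elif a_raw in ("", "+", "-"):
        a = -1 if a_raw == "-" else 1  # bare "n", "+n" or "-n"
    else:
        a = int(a_raw)
    return a, int(b_raw or 0)


def position_matches(position, a, b):
    """True if position == a*n + b for some integer n >= 0 (1-based)."""
    if a == 0:
        return position == b
    n, remainder = divmod(position - b, a)
    return remainder == 0 and n >= 0


def select_nth(matches, nth, from_end=False):
    """Keep the matches whose 1-based position satisfies the nth rule."""
    a, b = parse_nth(nth)
    total = len(matches)
    selected = []
    for index, item in enumerate(matches):
        # Count from the end for the "nth-last" variant
        position = total - index if from_end else index + 1
        if position_matches(position, a, b):
            selected.append(item)
    return selected


prices = ["€1", "$2", "€3", "€4", "€5"]
print(select_nth(prices, "2n"))                  # ['$2', '€4']
print(select_nth(prices, "odd", from_end=True))  # ['€1', '€3', '€5']
```

With five matches, counting odd positions from the end happens to select the same items as counting from the start; with an even number of matches the two directions would differ.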

OnlyOfSelector

The OnlyOfSelector selects an element only if it is the sole matching element among its siblings.
If more than one sibling matches the specified selector, none of them is selected.

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, OnlyOfSelector, to_soupsavvy

soup = BeautifulSoup(
    """
    <div class="book">
        <span class="title">Animal Farm</span>
        <p class="price">Price: $15</p>
        <p class="price">Price: $20</p>
    </div>
    <div class="book">
        <span class="title">Frankenstein</span>
        <p class="price">Price: $30</p>
    </div>
    """,
    features="html.parser",
)
element = to_soupsavvy(soup)

selector = OnlyOfSelector(ClassSelector("price"))
selector.find(element)
SoupElement(<p class="price">Price: $30</p>)
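
The "sole match among siblings" rule can be sketched with the standard library's xml.etree.ElementTree. The only_of function below is a hypothetical helper written for this illustration, applied to a copy of the example markup wrapped in an assumed body element; it is not how soupsavvy is implemented.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    """
    <body>
        <div class="book">
            <span class="title">Animal Farm</span>
            <p class="price">Price: $15</p>
            <p class="price">Price: $20</p>
        </div>
        <div class="book">
            <span class="title">Frankenstein</span>
            <p class="price">Price: $30</p>
        </div>
    </body>
    """
)


def only_of(tree, predicate):
    """Return children that are the single match among their siblings."""
    results = []
    for parent in tree.iter():
        matching = [child for child in parent if predicate(child)]
        if len(matching) == 1:
            results.append(matching[0])
    return results


only_prices = [
    el.text for el in only_of(root, lambda el: el.get("class") == "price")
]
print(only_prices)  # ['Price: $30']
```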

Operators module

As an alternative way to combine selectors, soupsavvy provides convenient operator functions in the soupsavvy.operators module, offering shortcuts for composite selectors:

  • and_ -> AndSelector

  • or_ -> SelectorList

  • is_ -> SelectorList

  • where -> SelectorList

  • not_ -> NotSelector

  • has -> HasSelector

  • xor -> XORSelector

These functions can improve clarity and conciseness in some contexts.

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.operators import and_

and_(ClassSelector("price"), TypeSelector("p"))
AndSelector(ClassSelector('price'), TypeSelector(name='p'))

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.operators import has

has(ClassSelector("price"), TypeSelector("p"))
HasSelector(ClassSelector('price'), TypeSelector(name='p'))

Conclusion

soupsavvy provides a wide range of composite selectors that can be used to create more complex search criteria. Designed for flexibility and easy customization, they let you tailor a search to your specific needs.

Enjoy soupsavvy and leave us feedback!
Happy scraping!