Composite Selectors
Composite selectors allow you to combine multiple selectors into one for more refined search criteria.
Various ways of combining selectors are described in this tutorial.
Combinators
Inspired by CSS, combinators in soupsavvy allow you to define relationships between multiple selectors.
For more information on CSS combinators, you can refer to Mozilla.
Operators
Combinators can be created using operators as a more concise alternative. Each combinator has a corresponding operator that defines the relationship between two selectors.
Combinator(left, right) == left {operator} right
For example, the >> operator can be used as a shorthand for DescendantCombinator:
DescendantCombinator(left, right) == left >> right
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<p class="price">Price: $30</p>
<div class="book">
<span class="title">Animal Farm</span>
<span class="price_section">
<p class="price">Price: $20</p>
</span>
</div>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = ClassSelector("book") >> ClassSelector("price")
selector.find(element)
SoupElement(<p class="price">Price: $20</p>)
Multiple Selectors
Combinators allow you to chain any number of selectors when they are passed as positional arguments.
Combinator(first, second, third)
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, DescendantCombinator, IdSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<p class="price">Price: $30</p>
<div class="book">
<span class="title">Animal Farm</span>
<p class="price">Price: $10</p>
</div>
<div id="available">
<div class="book">
<span class="title">Animal Farm</span>
<span class="price_section">
<p class="price">Price: $20</p>
</span>
</div>
</div>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = DescendantCombinator(
IdSelector("available"),
ClassSelector("book"),
ClassSelector("price"),
)
selector.find(element)
SoupElement(<p class="price">Price: $20</p>)
Combinators Equality
Two combinators are considered equal only if they are of the same type and contain the exact same selectors in the same order. The order of selectors is significant:
left {operator} right != right {operator} left
from soupsavvy import ClassSelector
book_selector = ClassSelector("book")
price_selector = ClassSelector("price")
print(
"left >> right == right >> left:",
price_selector >> book_selector == book_selector >> price_selector,
)
left >> right == right >> left: False
Non-Recursive
For combinators, setting recursive=False ensures that elements are returned only if the element matched by the first selector is a direct child of the searched element.
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<p class="price">Price: $30</p>
<span class="not_child_book">
<div class="book">
<span class="title">Animal Farm</span>
<span class="price_section">
<p class="price">Price: $50</p>
</span>
</div>
</span>
<div class="book">
<span class="title">Animal Farm</span>
<span class="price_section">
<p class="price">Price: $20</p>
</span>
</div>
""",
features="html.parser",
)
element = to_soupsavvy(soup)
selector = ClassSelector("book") >> ClassSelector("price")
selector.find(element, recursive=False)
SoupElement(<p class="price">Price: $20</p>)
Combining Combinators
Combinators can be combined to replicate complex CSS relationships, like:
#available > div .price
This is achieved using ChildCombinator and DescendantCombinator together.
Precedence Caveats
Note that some operators have higher precedence than others, which can affect the order in which expressions are evaluated.
left > middle >> right
The >> (DescendantCombinator) takes precedence over > (ChildCombinator), resulting in:
ChildCombinator(left, DescendantCombinator(middle, right))
Use parentheses to adjust precedence as needed.
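The grouping described above comes from Python's own operator precedence, so it can be observed without soupsavvy. The sketch below uses a hypothetical `Node` class (not part of soupsavvy) that only records how Python groups `>` and `>>`; the labels `Child` and `Descendant` are illustrative names.

```python
class Node:
    """Hypothetical stand-in for a selector; it only records operator grouping."""

    def __init__(self, label):
        self.label = label

    def __gt__(self, other):
        # Plays the role of the `>` (child) operator.
        return Node(f"Child({self.label}, {other.label})")

    def __rshift__(self, other):
        # Plays the role of the `>>` (descendant) operator.
        return Node(f"Descendant({self.label}, {other.label})")


left, middle, right = Node("left"), Node("middle"), Node("right")

# Python evaluates `>>` before `>`, so `middle >> right` is grouped first.
default_grouping = left > middle >> right
# Parentheses force the child relationship to be built first.
explicit_grouping = (left > middle) >> right

print(default_grouping.label)   # Child(left, Descendant(middle, right))
print(explicit_grouping.label)  # Descendant(Child(left, middle), right)
```

The second form is the one that corresponds to the CSS selector `#available > div .price`.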
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, IdSelector, TypeSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<p class="price">Price: $30</p>
<div>
<span class="title">Animal Farm</span>
<p class="price">Price: $10</p>
</div>
<div id="available">
<div>
<span class="title">Animal Farm</span>
<span class="discount">
<h2>Discounted</h2>
<p class="price">Price: $15</p>
</span>
<p class="price">Price: $20</p>
</div>
</div>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = (IdSelector("available") > TypeSelector("div")) >> ClassSelector("price")
selector.find(element)
SoupElement(<p class="price">Price: $15</p>)
DescendantCombinator
The Descendant Combinator is one of the simplest and most frequently used combinators in CSS. It selects elements that match a second selector only if they have an ancestor that matches the first selector. In CSS, this relationship is represented by a single space " " between two selectors. For example, the following CSS:
.book .price
matches all tags with the class price that are descendants of tags with the class book. For more details on CSS combinators, refer to Mozilla.
DescendantCombinator
Defines the relationship between two selectors, where the second selector matches the descendant of the element matched by the first selector.
CSS Example:
.book .price
Operator: >>
Reference: Mozilla
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<p class="price">Price: $30</p>
<div class="book">
<span class="title">Animal Farm</span>
<span class="price_section">
<p class="price">Price: $20</p>
</span>
</div>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = ClassSelector("book") >> ClassSelector("price")
selector.find(element)
SoupElement(<p class="price">Price: $20</p>)
ChildCombinator
Defines the relationship between two selectors, where the second selector matches only the direct children of the element matched by the first selector.
CSS Example:
div > .price
Operator: >
Reference: Mozilla
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<p class="price">Price: $30</p>
<div>
<span class="title">Animal Farm</span>
<span class="discount">
<h2>Discounted</h2>
<p class="price">Price: $15</p>
</span>
<p class="price">Price: $20</p>
</div>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = TypeSelector("div") > ClassSelector("price")
selector.find(element)
SoupElement(<p class="price">Price: $20</p>)
NextSiblingCombinator
Defines the relationship between two selectors, where the second selector matches the immediate sibling that directly follows the element matched by the first selector.
CSS Example:
div + .price
Operator: +
Reference: Mozilla
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, PatternSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<h2>Discounted</h2>
<span>Unavailable</span>
<p class="price">Price: $10</p>
<h1>Discounted</h1>
<p class="price">Price: $20</p>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = PatternSelector("Discounted") + ClassSelector("price")
selector.find(element)
SoupElement(<p class="price">Price: $20</p>)
SubsequentSiblingCombinator
Defines the relationship between two selectors, where the second selector matches all siblings that follow the element matched by the first selector.
CSS Example:
div ~ .price
Operator: *
Reference: Mozilla
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<p class="price">Price: $25</p>
<h2>Discounted</h2>
<span>Bargain!!!</span>
<p class="price">Price: $15</p>
<p class="price">Price: $10</p>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = TypeSelector("h2") * ClassSelector("price")
selector.find_all(element)
[SoupElement(<p class="price">Price: $15</p>),
SoupElement(<p class="price">Price: $10</p>)]
ParentCombinator
Defines the relationship between two selectors, where the second selector matches all instances of the element that is a parent of the first matched element.
CSS Example:
.discount:has(> p)
Operator: <
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<p class="price">Price: $15</p>
<span class="book">
<p class="price">Price: $25</p>
</span>
<span class="discount"></span>
<span class="discount">
<div>
<p class="price">Price: $35</p>
</div>
</span>
<span class="discount">
<p class="price">Price: $10</p>
</span>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = TypeSelector("p") < ClassSelector("discount")
result = selector.find(element)
print(result)
<span class="discount">
<p class="price">Price: $10</p>
</span>
AncestorCombinator
Defines the relationship between two selectors, where the second selector matches all instances of the element that is an ancestor of the first matched element.
CSS Example:
.discount:has(p)
Operator: <<
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<p class="price">Price: $15</p>
<span class="book">
<p class="price">Price: $25</p>
</span>
<span class="discount"></span>
<span class="discount">
<div><p class="price">Price: $35</p></div>
</span>
<span class="discount">
<p class="price">Price: $10</p>
</span>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = TypeSelector("p") << ClassSelector("discount")
print("\n\n".join(str(element) for element in selector.find_all(element)))
<span class="discount">
<div><p class="price">Price: $35</p></div>
</span>
<span class="discount">
<p class="price">Price: $10</p>
</span>
Logical Selectors
These selectors allow you to create new selectors by combining multiple selectors using logical operators such as AND, OR, NOT, and XOR.
Equality
Logical selectors, unlike combinators, are commutative. This means that the order of selectors within a logical selector does not affect the result.
first & second == second & first
from soupsavvy import ClassSelector
discount_selector = ClassSelector("discount")
price_selector = ClassSelector("price")
print(
"left & right == right & left:",
discount_selector & price_selector == price_selector & discount_selector,
)
left & right == right & left: True
Additionally, two instances can be considered equal even if they contain a different number of selectors, as long as they represent the same criteria.
from soupsavvy import AttributeSelector, ClassSelector, SelectorList
discount_selector = ClassSelector("discount")
price_selector = ClassSelector("price")
another_price_selector = AttributeSelector("class", value="price")
print(
SelectorList(discount_selector, price_selector)
== SelectorList(discount_selector, price_selector, another_price_selector)
)
True
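One way to picture the difference between the two kinds of equality is to compare an ordered sequence with an unordered set. The sketch below is an illustration only: `Criterion`, `ordered_equal` and `unordered_equal` are hypothetical names, and this is not a description of how soupsavvy implements equality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """Hypothetical stand-in for a simple selector, identified by its name."""

    name: str


def ordered_equal(first, second):
    # Combinator-like comparison: same criteria in the same order.
    return tuple(first) == tuple(second)


def unordered_equal(first, second):
    # Logical-selector-like comparison: same set of criteria,
    # regardless of order or repetition.
    return set(first) == set(second)


discount = Criterion("discount")
price = Criterion("price")
another_price = Criterion("price")  # a separate object expressing the same criterion

print(ordered_equal([discount, price], [price, discount]))    # False
print(unordered_equal([discount, price], [price, discount]))  # True
print(unordered_equal([discount, price], [discount, price, another_price]))  # True
```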
AndSelector
AndSelector corresponds to the CSS compound selector, which is a concatenation of multiple selectors. It selects elements that match all of the specified selectors.
CSS Example:
p.price
Operator: &
Reference: Mozilla
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<p class="title">Animal Farm</p>
<span class="price">Price: $30</span>
<p class="price">Price: $20</p>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = TypeSelector("p") & ClassSelector("price")
print(selector.find(element))
<p class="price">Price: $20</p>
SelectorList
SelectorList corresponds to the CSS selector list. It selects elements that match any of the specified selectors.
CSS Example:
h1, h2
Operator: |
Aliases: OrSelector
Reference: Mozilla
from bs4 import BeautifulSoup
from soupsavvy import TypeSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<h1>Hello World</h1>
<span>Extra information</span>
<h2>Goodbye World</h2>
<h3>Not interested</h3>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = TypeSelector("h1") | TypeSelector("h2")
selector.find_all(element)
[SoupElement(<h1>Hello World</h1>), SoupElement(<h2>Goodbye World</h2>)]
NotSelector
NotSelector corresponds to the CSS :not() pseudo-class, which excludes elements that match a specified selector. It allows you to select elements that do not meet certain criteria.
CSS Example:
:not(.discount)
Operator: ~
NotSelector(selector) == ~selector
Reference: Mozilla
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<p class="price discount">Price: €10</p>
<p class="price">Price: $20</p>
""",
features="html.parser",
)
element = to_soupsavvy(soup)
selector = ~ClassSelector("discount")
selector.find(element)
SoupElement(<p class="price">Price: $20</p>)
Multiple Selectors
When more than one selector is passed to NotSelector, it selects elements that do not match any of them. Alternatively, SelectorList can be used to negate multiple selectors:
NotSelector(left, right) == ~(left | right)
import re
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, NotSelector, PatternSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<p class="price discount">Price: €10</p>
<p class="price">Price: $20</p>
<p class="price">Price: €15</p>
""",
features="html.parser",
)
element = to_soupsavvy(soup)
discount_selector = ClassSelector("discount")
dollars_selector = PatternSelector(re.compile(r"\$\d+"))
selector = NotSelector(discount_selector, dollars_selector)
selector.find(element)
SoupElement(<p class="price">Price: €15</p>)
XORSelector
XORSelector corresponds to the logical XOR operation on selectors, selecting elements that match exactly one of the provided selectors.
CSS Equivalent:
While CSS does not have a direct counterpart, you can achieve similar results using a selector list with the :not() pseudo-class:
span:not(.discount), .discount:not(span)
Operator: ^
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<span class="discount">Buy!</span>
<p class="price">Price: $10</p>
<span class="price">Price: $20</span>
<p class="discount">Price: $30</p>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = ClassSelector("discount") ^ TypeSelector("span")
selector.find_all(element)
[SoupElement(<span class="price">Price: $20</span>),
SoupElement(<p class="discount">Price: $30</p>)]
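The "exactly one" rule can also be written as a plain predicate, which may help when reasoning about which elements survive. This is a minimal sketch that models each element as a `(tag, classes)` tuple; `has_class`, `has_type` and `xor` are hypothetical helpers, not soupsavvy functions.

```python
# Each element is modelled as (tag name, set of classes); a simplification
# for illustration, not how soupsavvy represents elements.
elements = [
    ("span", {"discount"}),
    ("p", {"price"}),
    ("span", {"price"}),
    ("p", {"discount"}),
]


def has_class(name):
    return lambda element: name in element[1]


def has_type(name):
    return lambda element: element[0] == name


def xor(*predicates):
    # Keep an element only when exactly one predicate accepts it.
    return lambda element: sum(bool(p(element)) for p in predicates) == 1


matches_xor = xor(has_class("discount"), has_type("span"))
selected = [element for element in elements if matches_xor(element)]
print(selected)  # [('span', {'price'}), ('p', {'discount'})]
```

The first element is dropped because it satisfies both criteria, and the second because it satisfies neither.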
Relative Selectors
A relative selector defines, in addition to its selection criteria, a relationship with an anchor element.
According to Mozilla, relative selectors represent elements in relation to anchor element(s), typically introduced by a combinator.
CSS Example:
.discount:has(> p)
In this case, an element of type p is in a child-parent relationship with the element of class discount (the anchor).
In the context of soupsavvy:
Anchor Element: The bs4 object being searched.
Relative Element: Any element that matches the selector and maintains a specified relationship with the anchor element.
Anchor
An alternative way of creating relative selectors is to use the Anchor object with a specific operator. The operators match those used in combinators.
>> -> RelativeDescendant
> -> RelativeChild
<< -> RelativeAncestor
< -> RelativeParent
+ -> RelativeNextSibling
* -> RelativeSubsequentSibling
from bs4 import BeautifulSoup
from soupsavvy import Anchor, ClassSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<div>
<span><p class="price">Price: $10</p></span>
<p class="price">Price: $20</p>
</div>
""",
features="lxml",
)
element = to_soupsavvy(soup)
div = element.find_all("div")[0]
selector = Anchor > ClassSelector("price")
selector.find(div)
SoupElement(<p class="price">Price: $20</p>)
Relative Siblings
RelativeNextSibling and RelativeSubsequentSibling are used to select following siblings of the anchor element (element passed to find methods).
from bs4 import BeautifulSoup
from soupsavvy import Anchor, ClassSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<p class="title">Animal Farm</p>
<div class="section">Book 1</div>
<p class="price">Price: $30</p>
<p class="discount">Price: $20</p>
<p class="price">Price: $10</p>
""",
features="lxml",
)
element = to_soupsavvy(soup)
div = element.find_all("div")[0]
selector = Anchor + ClassSelector("price")
selector.find_all(div)
[SoupElement(<p class="price">Price: $30</p>)]
Recursivity
Relative selectors in soupsavvy are not affected by the recursive parameter. They have their own independent behavior, determined by the relationship between the anchor and relative elements. Below, the RelativeDescendant relationship takes precedence over the recursive parameter.
from bs4 import BeautifulSoup
from soupsavvy import Anchor, ClassSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<div>
<span><p class="price">Price: $10</p></span>
<p class="price">Price: $20</p>
</div>
""",
features="lxml",
)
element = to_soupsavvy(soup)
div = element.find_all("div")[0]
selector = Anchor >> ClassSelector("price")
selector.find(div, recursive=False)
SoupElement(<p class="price">Price: $10</p>)
HasSelector
The HasSelector is a counterpart of CSS :has() pseudo-class.
According to Mozilla, this pseudo-class matches an element if any relative selectors passed as arguments match at least one element.
CSS Example:
:has(> .price)
This selector matches any element that has a direct child with the class price.
Default Recursive Search
The default combinator for HasSelector is the descendant relationship. This means any selector passed to HasSelector that is not a relative selector will be treated as a relative descendant selector. As a result, if the parent of a matched element is selected, its ancestors will also be included in the selection.
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, HasSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<div class="book">
<span class="title">Brave New World</span>
<p class="price">Price: $20</p>
</div>
<div class="book">
<span class="title">Animal Farm</span>
<span>
<p class="price discount">Price: $15</p>
</span>
<p class="price">Price: $20</p>
</div>
""",
features="html.parser",
)
element = to_soupsavvy(soup)
selector = HasSelector(ClassSelector("discount"))
print("\n\n".join(str(element) for element in selector.find_all(element)))
<div class="book">
<span class="title">Animal Farm</span>
<span>
<p class="price discount">Price: $15</p>
</span>
<p class="price">Price: $20</p>
</div>
<span>
<p class="price discount">Price: $15</p>
</span>
Siblings Search
RelativeNextSibling and RelativeSubsequentSibling can be used to select an element whose next or subsequent sibling matches the selector.
from bs4 import BeautifulSoup
from soupsavvy import Anchor, ClassSelector, HasSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<p class="title">Animal Farm</p>
<span>Hello World</span>
<div class="section">Brave New World</div>
<p class="price">Price: $30</p>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = HasSelector(Anchor + ClassSelector("price"))
selector.find(element)
SoupElement(<div class="section">Brave New World</div>)
Ancestors Search
Combining RelativeAncestor and RelativeParent selectors with HasSelector allows you to find elements that have a specific ancestor or parent. For instance, you can locate all elements that have an ancestor with the class breaking. The matched ancestor does not have to be a descendant of the bs4 object being searched.
from bs4 import BeautifulSoup
from soupsavvy import Anchor, ClassSelector, HasSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<div class="breaking">
<span>
<span class="info">Important!</span>
<span>Actual Information</span>
</span>
</div>
""",
features="lxml",
)
element = to_soupsavvy(soup)
span = element.find_all("span")[0]
selector = HasSelector(Anchor << ClassSelector("breaking"))
selector.find_all(span)
[SoupElement(<span class="info">Important!</span>),
SoupElement(<span>Actual Information</span>)]
Nth Selectors
Nth selectors allow you to select the nth element that matches a specific selector.
While CSS offers pseudo-classes like nth-child and nth-of-type to select elements based on their ordinal position among siblings, these selectors only apply to the overall order of siblings.
For example, selecting every 2nd element with the class price in CSS is not feasible because:
.price:nth-child(2n)
selects elements with the class price that are also even-numbered children of their parent, not every 2nd price element.
In soupsavvy, you can achieve this with:
NthOfSelector(ClassSelector('price'), nth="2n")
This selector selects every 2nd element with the class price.
NthOfSelector
The NthOfSelector enables you to select elements based on a specified occurrence pattern defined by the nth rule. The provided nth parameter must follow valid CSS syntax (<An+B>, even, or odd).
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, NthOfSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<span class="title">Animal Farm</span>
<p class="price discount">Price: €1</p>
<p class="price">Price: $2</p>
<span>Bestseller</span>
<p class="price">Price: €3</p>
<p class="price">Price: €4</p>
<p class="price">Price: €5</p>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = NthOfSelector(ClassSelector("price"), nth="2n")
selector.find_all(element)
[SoupElement(<p class="price">Price: $2</p>),
SoupElement(<p class="price">Price: €4</p>)]
NthLastOfSelector
The NthLastOfSelector functions similarly to the NthOfSelector, but it counts elements from the end of the list.
import re
from bs4 import BeautifulSoup
from soupsavvy import NthLastOfSelector, PatternSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<span class="title">Animal Farm</span>
<p class="price discount">Price: €1</p>
<p>Price: $2</p>
<span>Bestseller</span>
<p class="price">Price: €3</p>
<p>Price: €4</p>
<p class="price">Price: €5</p>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = NthLastOfSelector(
PatternSelector(re.compile("^price", re.IGNORECASE)),
nth="odd",
)
selector.find_all(element)
[SoupElement(<p class="price discount">Price: €1</p>),
SoupElement(<p class="price">Price: €3</p>),
SoupElement(<p class="price">Price: €5</p>)]
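The counting rule behind both selectors can be sketched in plain Python. The helpers below (`matches_nth`, `select_nth`) are hypothetical and operate on a list of already-matched items; they illustrate the CSS An+B rule under that assumption and are not soupsavvy's implementation.

```python
def matches_nth(position, a, b):
    """Return True if the 1-based position equals a*n + b for some integer n >= 0."""
    if a == 0:
        return position == b
    n, remainder = divmod(position - b, a)
    return remainder == 0 and n >= 0


def select_nth(items, a, b, from_end=False):
    """Keep items whose position among the matches satisfies An+B."""
    count = len(items)
    selected = []
    for index, item in enumerate(items):
        # Positions are 1-based; counting from the end mirrors NthLastOfSelector.
        position = count - index if from_end else index + 1
        if matches_nth(position, a, b):
            selected.append(item)
    return selected


prices = ["€1", "$2", "€3", "€4", "€5"]

every_second = select_nth(prices, a=2, b=0)                 # nth="2n"
odd_from_end = select_nth(prices, a=2, b=1, from_end=True)  # nth="odd"

print(every_second)  # ['$2', '€4']
print(odd_from_end)  # ['€1', '€3', '€5']
```

Positions are counted among the matching elements only, which is why unrelated siblings such as the title or the "Bestseller" span do not shift the result.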
OnlyOfSelector
The OnlyOfSelector selects an element only if it is the sole matching element among its siblings.
If more than one element matches the specified selector, none will be selected.
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, OnlyOfSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<div class="book">
<span class="title">Animal Farm</span>
<p class="price">Price: $15</p>
<p class="price">Price: $20</p>
</div>
<div class="book">
<span class="title">Frankenstein</span>
<p class="price">Price: $30</p>
</div>
""",
features="html.parser",
)
element = to_soupsavvy(soup)
selector = OnlyOfSelector(ClassSelector("price"))
selector.find(element)
SoupElement(<p class="price">Price: $30</p>)
Operators module
As an alternative way to combine selectors, soupsavvy provides convenient operator functions in the soupsavvy.operators module, offering shortcuts for composite selectors:
and_ -> AndSelector
or_ -> SelectorList
is_ -> SelectorList
where -> SelectorList
not_ -> NotSelector
has -> HasSelector
xor -> XORSelector
These functions can enhance clarity and conciseness in some contexts.
from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.operators import and_
and_(ClassSelector("price"), TypeSelector("p"))
AndSelector(ClassSelector('price'), TypeSelector(name='p'))
from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.operators import has
has(ClassSelector("price"), TypeSelector("p"))
HasSelector(ClassSelector('price'), TypeSelector(name='p'))
Conclusion
soupsavvy provides a wide range of composite selectors that can be used to create more complex search criteria. Designed for flexibility and easy customization, they let you tailor a search to your specific needs.
Enjoy soupsavvy and leave us feedback!
Happy scraping!