Selectors
Selectors are used to search for elements in BeautifulSoup objects. This tutorial demonstrates various simple selectors, which are core features of soupsavvy.
API
Every soupsavvy selector follows a consistent interface, providing an API to:
Search for elements within BeautifulSoup objects.
Check selectors for equality.
Combine selectors to create more complex queries.
Find
The find method searches for the first element that matches the selector.
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<span class="title">Animal Farm</span>
<p class="price">Price: $10</p>
<p class="price">Price: $20</p>
<p class="price">Price: $30</p>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = ClassSelector("price")
selector.find(element)
SoupElement(<p class="price">Price: $10</p>)
Strict
When no match is found, the behavior of the find method is controlled by the strict parameter:
True - Raises a TagNotFoundException.
False (default) - Returns None.
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, to_soupsavvy
from soupsavvy.exceptions import TagNotFoundException
soup = BeautifulSoup(
"""
<span class="title">Animal Farm</span>
<p>Hello World</p>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = ClassSelector("price")
print(f"NOT STRICT: {selector.find(element)}")
try:
    selector.find(element, strict=True)
except TagNotFoundException as e:
    print(f"STRICT: {e}")
NOT STRICT: None
STRICT: Tag was not found in markup.
Recursive
The search depth is defined by the recursive parameter:
True (default) - Performs a recursive search on the element's descendants.
False - Searches only within the direct children of the element.
This parameter also applies to the find_all method.
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<span class="title">Animal Farm</span>
<div class="inactive">
<p class="price">Price: $10</p>
<p class="price">Price: $20</p>
</div>
<p class="price">Price: $30</p>
""",
features="html.parser",
)
element = to_soupsavvy(soup)
selector = ClassSelector("price")
selector.find(element, recursive=False)
SoupElement(<p class="price">Price: $30</p>)
Find All
The find_all method searches for all elements that match the selector.
The result list contains unique elements, in the same order as they appear in the document.
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<span>Hello World</span>
<p class="price">Price: $10</p>
<p class="price">Price: $20</p>
<p class="price">Price: $30</p>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = ClassSelector("price")
selector.find_all(element)
[SoupElement(<p class="price">Price: $10</p>),
SoupElement(<p class="price">Price: $20</p>),
SoupElement(<p class="price">Price: $30</p>)]
Limit
You can restrict the number of elements returned by using the limit parameter:
None (default) - Returns all matching elements.
int - Returns up to the specified number of matching elements.
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<span>Hello World</span>
<p class="price">Price: $10</p>
<p class="price">Price: $20</p>
<p class="price">Price: $30</p>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = ClassSelector("price")
selector.find_all(element, limit=2)
[SoupElement(<p class="price">Price: $10</p>),
SoupElement(<p class="price">Price: $20</p>)]
Equality
All selectors can be compared with the == operator. If two selectors are equal, their search results will always be identical.
from soupsavvy import TypeSelector
print(f"{TypeSelector('p') == TypeSelector('div') = }")
print(f"{TypeSelector('p') == TypeSelector('p') = }")
TypeSelector('p') == TypeSelector('div') = False
TypeSelector('p') == TypeSelector('p') = True
Combining
Selectors can be combined in various ways to create composite selectors. Read about this in the following tutorial.
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<span>Hello World</span>
<p class="price">Price: $10</p>
<a class="price">Price: $20</a>
<p class="price">Price: $30</p>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = ClassSelector("price") & TypeSelector("a")
selector.find(element)
SoupElement(<a class="price">Price: $20</a>)
AttributeSelector
Attribute selectors in soupsavvy allow you to select elements based on their attribute values.
For more information about the CSS counterpart, refer to Mozilla.
Find an element with a specific attribute, regardless of the attribute's value.
from bs4 import BeautifulSoup
from soupsavvy import AttributeSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<span>Animal Farm</span>
<a href="/shop">Price: $20</a>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = AttributeSelector("href")
selector.find(element)
SoupElement(<a href="/shop">Price: $20</a>)
Find an element with an exact attribute value by passing a string.
from bs4 import BeautifulSoup
from soupsavvy import AttributeSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<p class="price">Price: $20</p>
<a role="main">Home</a>
<a role="button">Add to Cart</a>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = AttributeSelector("role", value="button")
selector.find(element)
SoupElement(<a role="button">Add to Cart</a>)
Find elements based on a regular expression pattern.
import re
from bs4 import BeautifulSoup
from soupsavvy import AttributeSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<span>Animal Farm</span>
<a href="https://www.fictiondb.com/title/animal-farm~george-orwell~161188.htm">fictiondb</a>
<a href="https://search.worldcat.org/title/1056176764">worldcat</a>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = AttributeSelector("href", value=re.compile(r"worldcat\.org/.*/\d{10}"))
selector.find(element)
SoupElement(<a href="https://search.worldcat.org/title/1056176764">worldcat</a>)
Specific attribute selectors
Most commonly used attributes for selection have their own dedicated selectors:
IdSelector: Matches elements by their id attribute value.
ClassSelector: Matches elements by their class attribute value.
For more information about the CSS counterparts, refer to Mozilla for Class and ID selectors.
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, to_soupsavvy
soup = BeautifulSoup(
"""<span class="title">Animal Farm</span><p class="price">Price: $20</p>""",
features="lxml",
)
element = to_soupsavvy(soup)
price_selector = ClassSelector("price")
price_selector.find(element)
SoupElement(<p class="price">Price: $20</p>)
import re
from bs4 import BeautifulSoup
from soupsavvy import IdSelector, to_soupsavvy
soup = BeautifulSoup(
"""<p id="12ghj8">Book</p><p id="13cji0" class="price">Price: $20</p>""",
features="lxml",
)
element = to_soupsavvy(soup)
price_selector = IdSelector(re.compile(r"^13.*0$"))
price_selector.find(element)
SoupElement(<p class="price" id="13cji0">Price: $20</p>)
TypeSelector
TypeSelector is used to select elements based on their tag name. For more information about the CSS counterpart, refer to Mozilla.
from bs4 import BeautifulSoup
from soupsavvy import TypeSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<span class="title">Animal Farm</span>
<p class="price">Price: $10</p>
<span>Hello World</span>
""",
features="lxml",
)
element = to_soupsavvy(soup)
price_selector = TypeSelector("p")
price_selector.find(element)
SoupElement(<p class="price">Price: $10</p>)
UniversalSelector
UniversalSelector is a wildcard selector that matches any tag.
Its CSS counterpart is *. For more information about the CSS counterpart, refer to Mozilla.
from bs4 import BeautifulSoup
from soupsavvy import UniversalSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<p class="title" lang="es">Rebelión en la granja</p>
<p class="description" lang="en">Some animals are more equal than others</p>
<span class="title" lang="en">Animal Farm</span>
""",
features="html.parser",
)
element = to_soupsavvy(soup)
any_selector = UniversalSelector()
any_selector.find(element)
SoupElement(<p class="title" lang="es">Rebelión en la granja</p>)
PatternSelector
PatternSelector is designed to select elements based on their text content.
While BeautifulSoup returns NavigableString for such queries, which is limiting, PatternSelector returns elements with text content that matches the provided pattern.
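The limitation mentioned above can be seen with plain BeautifulSoup, without soupsavvy. In this small sketch, a text search returns the string node, and the enclosing tag has to be reached through its parent.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup(
    """
    <p class="title" lang="es">Rebelión en la granja</p>
    <span class="title" lang="en">Animal Farm</span>
    """,
    features="html.parser",
)

# Searching by text with plain BeautifulSoup yields the string node itself...
text_node = soup.find(string="Animal Farm")
# ...so the enclosing tag has to be reached through its parent.
parent_tag = text_node.parent

print(type(text_node).__name__, parent_tag.name)
```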
Find an element with exact text content by passing a string.
from bs4 import BeautifulSoup
from soupsavvy import PatternSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<p class="title" lang="es">Rebelión en la granja</p>
<p class="description" lang="en">Some animals are more equal than others</p>
<span class="title" lang="en">Animal Farm</span>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = PatternSelector("Animal Farm")
selector.find(element)
SoupElement(<span class="title" lang="en">Animal Farm</span>)
Find elements based on a regular expression pattern.
import re
from bs4 import BeautifulSoup
from soupsavvy import PatternSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<p class="title" lang="es">Rebelión en la granja</p>
<p class="description" lang="en">Some animals are more equal than others</p>
<span class="title" lang="en">Animal Farm</span>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = PatternSelector(re.compile(r"animal", re.IGNORECASE))
selector.find(element)
SoupElement(<p class="description" lang="en">Some animals are more equal than others</p>)
XPathSelector
The XPathSelector enables the use of XPath expressions to select elements, a feature not natively supported by BeautifulSoup. It relies on the lxml library, which must be installed. The XPath expression must target HTML elements in order to return valid results.
from bs4 import BeautifulSoup
from soupsavvy import XPathSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<span class="id">1234</span>
<span class="title">Frankenstein</span>
<p class="title">Wild Animal</p>
<span class="title">Animal Farm</span>
""",
features="html.parser",
)
element = to_soupsavvy(soup)
selector = XPathSelector("//span[@class='title'][contains(text(),'Animal')]")
# selector.find(element)
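Because the call above is left commented out, it can help to check the expression on its own. The sketch below uses lxml.html directly rather than XPathSelector, so it only illustrates which elements the XPath expression targets. Wrapping the fragment in a div parent is a choice made for this example.

```python
from lxml import html

markup = """
<span class="id">1234</span>
<span class="title">Frankenstein</span>
<p class="title">Wild Animal</p>
<span class="title">Animal Farm</span>
"""

# Wrap the sibling elements in a single <div> so they form one tree.
root = html.fragment_fromstring(markup, create_parent="div")

# Same expression as above: span elements with class "title"
# whose text contains "Animal".
matches = root.xpath("//span[@class='title'][contains(text(),'Animal')]")

print([match.text for match in matches])
```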
ExpressionSelector
The ExpressionSelector allows you to define your own custom logic for selecting elements by providing a predicate function. This function evaluates each element and decides whether it should be included in the result set.
This works similarly to the BeautifulSoup API, where you can pass a predicate function to find methods:
soup.find(lambda tag: tag.name == 'div')
from bs4 import BeautifulSoup
from soupsavvy import ExpressionSelector, to_soupsavvy
soup = BeautifulSoup(
"""
<span class="id">1234</span>
<p class="title">Animal Farm</p>
<span class="title">Frankenstein</span>
""",
features="html.parser",
)
element = to_soupsavvy(soup)
selector = ExpressionSelector(
lambda tag: tag.name != "p" and "title" in tag.get()["class"]
)
selector.find(element)
SoupElement(<span class="title">Frankenstein</span>)
Conclusion
These fundamental selectors form the core of soupsavvy and provide the building blocks for more complex queries.
Read about composite selectors here.
Enjoy soupsavvy and leave us feedback!
Happy scraping!