About

With many web scraping libraries available, each with unique interfaces and conventions, building workflows and maintaining consistency in DOM traversal and selection often becomes tedious, resulting in complexity and boilerplate code.

soupsavvy solves this with a unified, consistent approach to selection, based on these principles:

Decoupling: Selection logic is abstracted away from DOM node and traversal implementations.
Framework-Agnostic: Operates consistently with any supported library.
Flexible & Extensible: Lightweight, reusable components that combine into complex selection workflows.

Unlike traditional libraries that require various methods and parameters for different tasks, soupsavvy uses a simple, consistent selector interface:

selector = TypeSelector("div")
selector.find(element)
selector.find_all(element)

Selectors can encapsulate advanced logic, such as XPath queries, logical relationships, sequences and more:

selector = XPathSelector("//div/a")
elements = selector.find_all(element, limit=3)
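
The decoupling idea behind this interface can be pictured with a small standard-library sketch. It is not soupsavvy's implementation: `Node`, `ETreeNode`, and `TagNameSelector` are hypothetical names, and `xml.etree.ElementTree` merely stands in for a scraping library. The selector talks only to the neutral `Node` interface, so supporting another library would mean writing another adapter rather than another selector.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from typing import Iterator, List, Optional


class Node(ABC):
    """Hypothetical library-neutral view of a DOM node."""

    @property
    @abstractmethod
    def name(self) -> str:
        """Tag name of the node."""

    @abstractmethod
    def descendants(self) -> Iterator["Node"]:
        """Yield every node below this one, in document order."""


class ETreeNode(Node):
    """Adapter that exposes an ElementTree element through the Node interface."""

    def __init__(self, element: ET.Element) -> None:
        self.element = element

    @property
    def name(self) -> str:
        return self.element.tag

    def descendants(self) -> Iterator[Node]:
        # Element.iter() yields the element itself first, so skip it.
        for item in self.element.iter():
            if item is not self.element:
                yield ETreeNode(item)


class TagNameSelector:
    """Hypothetical selector that depends only on the Node interface."""

    def __init__(self, name: str) -> None:
        self.name = name

    def find_all(self, node: Node, limit: Optional[int] = None) -> List[Node]:
        matches = [item for item in node.descendants() if item.name == self.name]
        # Slicing with None keeps every match.
        return matches[:limit]

    def find(self, node: Node) -> Optional[Node]:
        matches = self.find_all(node, limit=1)
        return matches[0] if matches else None


root = ETreeNode(ET.fromstring("<html><body><h1>Title</h1><p>a</p><p>b</p></body></html>"))
selector = TagNameSelector("p")
print(len(selector.find_all(root)))  # number of <p> elements
print(selector.find(root).element.text)  # text of the first <p>
```

In this sketch, `find` returns `None` when nothing matches and `limit` caps the number of results, mirroring the shape of the calls shown above.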

With soupsavvy, developers can focus on data extraction workflows instead of wrestling with library-specific quirks and inconsistencies. Boost your web-scraping workflows by eliminating complexity and introducing:

Portability
Maintainability
Scalability

Portability

soupsavvy provides a slim, consistent selector interface, allowing DOM elements from any supported library to be wrapped and used interchangeably.

Instead of rewriting workflows when switching between libraries like BeautifulSoup and selenium, soupsavvy abstracts selection logic, ensuring the same selectors work seamlessly across different frameworks by eliminating any library-specific logic.

Example

This simple workflow extracts the text of the header from www.example.com. selenium and BeautifulSoup each perform this operation differently.

Using `BeautifulSoup`

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.example.com")
soup = BeautifulSoup(response.text, "html.parser")

p = soup.find("h1")

if p is None:
    raise Exception("Element not found")

print(p.text)

Example Domain

Using `selenium`

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(options=options)
driver.get("https://www.example.com")
p = driver.find_element(By.TAG_NAME, "h1")
print(p.text)

Example Domain

Switching between these requires rewriting the logic, as the libraries have different interfaces and conventions.

In soupsavvy, selectors are independent of the underlying library, allowing you to use the same workflow across different libraries. Once you know how to use selectors, you can apply them to any supported implementation.

Using `soupsavvy`

For `selenium`

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

from soupsavvy import TypeSelector, to_soupsavvy
from soupsavvy.operations import Text

options = Options()
options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(options=options)
driver.get("https://www.example.com")
root = driver.find_element(By.TAG_NAME, "html")
element = to_soupsavvy(root)

selector = TypeSelector("h1") | Text()

text = selector.find(element)
print(text)

Example Domain

For `BeautifulSoup`

import requests
from bs4 import BeautifulSoup

from soupsavvy import TypeSelector, to_soupsavvy
from soupsavvy.operations import Text

response = requests.get("https://www.example.com")
soup = BeautifulSoup(response.text, "html.parser")

element = to_soupsavvy(soup)
selector = TypeSelector("h1") | Text()

text = selector.find(element)
print(text)

Example Domain

Maintainability

Maintaining complex scraping workflows becomes challenging as projects grow. Adding new selectors, modifying existing ones, or adapting to changes in target websites often requires updating multiple parts of the codebase.

With traditional libraries, even seemingly simple workflows can lead to verbose code. For example, finding and handling a specific element’s text with BeautifulSoup:

from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(
    """<span class="price"><a>Click here</a><p>Price: $10</p></span>""", features="lxml"
)
span = soup.find("span", attrs={"class": "price"})

if not isinstance(span, Tag):
    raise Exception("Element not found")

p = span.find("p", recursive=False)

if p is None:
    raise Exception("Element not found")

print(p.text)

Price: $10

Switching to lxml or selenium introduces different methods, parameters, and error handling, adding to the complexity.

With soupsavvy, selectors ensure consistency through a unified interface that encapsulates the entire selection logic, however complex the relationships between elements, and removes much of the boilerplate:

from bs4 import BeautifulSoup

from soupsavvy import TypeSelector, to_soupsavvy
from soupsavvy.operations import Text

soup = BeautifulSoup(
    """<span class="price"><a>Click here</a><p>Price: $10</p></span>""", features="lxml"
)
element = to_soupsavvy(soup)

selector = (TypeSelector("span") > TypeSelector("p")) | Text()
found = selector.find(element)
print(found)

Price: $10
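
How an expression such as `(A > B) | operation` can build a selector instead of performing a comparison is easiest to see in a toy version. The sketch below uses only the standard library. `Selector`, `ByTag`, `Child`, and `Pipeline` are hypothetical classes written for illustration, not soupsavvy's internals, and a plain function stands in for the text operation.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional


class Selector:
    """Hypothetical base class that turns operators into selector composition."""

    def find_all(self, element: ET.Element, recursive: bool = True) -> List[ET.Element]:
        raise NotImplementedError

    def find(self, element: ET.Element) -> Optional[ET.Element]:
        matches = self.find_all(element)
        return matches[0] if matches else None

    def __gt__(self, other: "Selector") -> "Child":
        # parent > child builds a new selector rather than comparing values.
        return Child(self, other)

    def __or__(self, operation: Callable[[ET.Element], object]) -> "Pipeline":
        # selector | operation applies the operation to whatever is found.
        return Pipeline(self, operation)


class ByTag(Selector):
    def __init__(self, name: str) -> None:
        self.name = name

    def find_all(self, element: ET.Element, recursive: bool = True) -> List[ET.Element]:
        # iter() walks all descendants; iterating the element yields direct children.
        pool = element.iter() if recursive else iter(element)
        return [item for item in pool if item is not element and item.tag == self.name]


class Child(Selector):
    def __init__(self, parent: Selector, child: Selector) -> None:
        self.parent = parent
        self.child = child

    def find_all(self, element: ET.Element, recursive: bool = True) -> List[ET.Element]:
        results: List[ET.Element] = []
        for parent in self.parent.find_all(element, recursive):
            # Only direct children of each matched parent qualify.
            results.extend(self.child.find_all(parent, recursive=False))
        return results


class Pipeline:
    def __init__(self, selector: Selector, operation: Callable[[ET.Element], object]) -> None:
        self.selector = selector
        self.operation = operation

    def find(self, element: ET.Element) -> object:
        match = self.selector.find(element)
        return None if match is None else self.operation(match)


root = ET.fromstring(
    '<div><span class="price"><a>Click here</a><p>Price: $10</p></span></div>'
)
selector = (ByTag("span") > ByTag("p")) | (lambda item: item.text)
print(selector.find(root))
```

Python gives `|` higher precedence than `>`, which is why the child combinator has to be parenthesized before it is piped into an operation.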

Scalability

When building complex scraping workflows, you might need to manage various relationships between elements, handle multiple matches, or apply different selection criteria. Traditional libraries force developers to write extensive boilerplate code to manage lists, sets, and operations on multiple elements, leading to tangled and error-prone logic.

Using `BeautifulSoup`

Finding all sibling elements with a specific class after an <h2> tag using BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    """
        <p class="price">Price: $25</p>
        <h2>Discounted</h2>
        <span>Bargain!!!</span>
        <p class="price">Price: $15</p>
        <p class="price">Price: $10</p>
    """,
    features="lxml",
)

h2 = soup.find_all("h2")
matches = []

for tag in h2:
    matches.extend(tag.find_next_siblings(attrs={"class": "price"}))

print([match.text for match in matches])

['Price: $15', 'Price: $10']

This approach requires manually handling sibling relationships and merging results, which becomes increasingly complex in larger workflows.

Using `soupsavvy`

With soupsavvy, selectors encapsulate the logic for element relationships, providing a concise and reusable workflow:

from bs4 import BeautifulSoup

from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.operations import Text

soup = BeautifulSoup(
    """
        <p class="price">Price: $25</p>
        <h2>Discounted</h2>
        <span>Bargain!!!</span>
        <p class="price">Price: $15</p>
        <p class="price">Price: $10</p>
    """,
    features="lxml",
)
element = to_soupsavvy(soup)

selector = (TypeSelector("h2") * ClassSelector("price")) | Text()
selector.find_all(element)

['Price: $15', 'Price: $10']
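
To make explicit what this selector saves you from writing, here is a library-free sketch of the same "siblings with a given class after an anchor tag" rule, using only `xml.etree.ElementTree`. The function name `subsequent_siblings` and its parameters are illustrative assumptions, and the markup is wrapped in a `<body>` element so that it parses as XML.

```python
import xml.etree.ElementTree as ET
from typing import List


def subsequent_siblings(root: ET.Element, anchor_tag: str, class_name: str) -> List[ET.Element]:
    """Collect elements with class_name that follow an anchor_tag sibling."""
    matches: List[ET.Element] = []
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag != anchor_tag:
                continue
            for sibling in children[index + 1:]:
                classes = sibling.get("class", "").split()
                # Skip elements already collected via an earlier anchor.
                if class_name in classes and all(sibling is not seen for seen in matches):
                    matches.append(sibling)
    return matches


document = ET.fromstring(
    """
    <body>
        <p class="price">Price: $25</p>
        <h2>Discounted</h2>
        <span>Bargain!!!</span>
        <p class="price">Price: $15</p>
        <p class="price">Price: $10</p>
    </body>
    """
)
print([item.text for item in subsequent_siblings(document, "h2", "price")])
```

Index bookkeeping, class parsing, and de-duplication all have to be handled by hand here, which is the kind of logic a combinator keeps out of the workflow code.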

Implementing a workflow like the one below with common web scraping libraries would require a considerable amount of hand-written logic. With soupsavvy, selectors act as modular building blocks, encapsulating both selection logic and element relationships. These selectors are reusable and can be easily combined to create complex workflows without the usual overhead.

import re

from soupsavvy import ClassSelector, PatternSelector, TypeSelector
from soupsavvy.selectors.nth import NthOfSelector

# Define selectors for the workflow
pattern_selector = PatternSelector(re.compile(r"price"))
type_selector = TypeSelector("span")

# Combine selectors using XOR (matches one or the other but not both)
xor_selector = pattern_selector ^ type_selector

# Select every odd-numbered match of the XOR selector (1st, 3rd, 5th, ...)
nth_selector = NthOfSelector(xor_selector, nth="2n+1")

# Combine selectors to find specific children inside elements with a class
child_selector = ClassSelector("container") > nth_selector
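
The two less familiar pieces of this workflow, XOR and the `2n+1` formula, can be checked on plain Python data. This is only a model of the semantics: dictionaries stand in for elements, the two predicates are stand-ins (what the real selectors test is not modeled), positions are counted over a flat list, and `xor_matches` and `every_nth` are hypothetical helper names.

```python
from typing import Callable, Dict, List

Item = Dict[str, str]


def xor_matches(
    items: List[Item],
    first: Callable[[Item], bool],
    second: Callable[[Item], bool],
) -> List[Item]:
    """Keep items accepted by exactly one of the two predicates."""
    return [item for item in items if first(item) != second(item)]


def every_nth(matches: List[Item], step: int, offset: int) -> List[Item]:
    """Keep 1-based positions of the form step*n + offset, n = 0, 1, 2, ... (step > 0)."""
    return [
        item
        for position, item in enumerate(matches, start=1)
        if position >= offset and (position - offset) % step == 0
    ]


items = [
    {"tag": "span", "text": "price: 5"},  # both predicates hold: dropped by XOR
    {"tag": "p", "text": "price: 7"},     # pattern only: kept
    {"tag": "span", "text": "sale"},      # tag only: kept
    {"tag": "p", "text": "other"},        # neither: dropped
    {"tag": "span", "text": "new"},       # tag only: kept
]

has_price = lambda item: "price" in item["text"]
is_span = lambda item: item["tag"] == "span"

exclusive = xor_matches(items, has_price, is_span)
selected = every_nth(exclusive, step=2, offset=1)  # models "2n+1"
print([item["text"] for item in selected])
```

With `step=2` and `offset=1`, the first, third, fifth, and so on of the XOR matches are kept, which is what the formula `2n+1` describes.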

Conclusion

soupsavvy introduces declarative selectors that simplify and unify the selection process across libraries. Its flexible, consistent approach removes the complexity of managing different APIs, letting you focus on the task at hand.

Although the examples and tutorials provided are focused on BeautifulSoup, the concepts and workflows are applicable to all supported libraries. Dive deeper into the documentation to explore more examples and see how soupsavvy can streamline your web scraping projects, no matter the framework you’re using.

Enjoy soupsavvy and leave us feedback!
Happy scraping!
