About
With many web scraping libraries available, each with its own interface and conventions, keeping DOM traversal and selection consistent across workflows becomes tedious and leads to complexity and boilerplate code.
soupsavvy solves this with a unified, consistent approach to selection, based on these principles:
Decoupling: Selection logic is abstracted away from DOM node and traversal implementations.
Framework-Agnostic: Operates consistently with any supported library.
Flexible & Extensible: Lightweight, reusable components that can be combined into complex selection workflows.
Unlike traditional libraries that require various methods and parameters for different tasks, soupsavvy uses a simple, consistent selector interface:
selector = TypeSelector("div")
selector.find(element)
selector.find_all(element)
Selectors can encapsulate advanced logic, such as XPath queries, logical relationships, sequences and more:
selector = XPathSelector("//div/a")
elements = selector.find_all(element, limit=3)
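For intuition, here is a rough standard-library analogue of the query above. It is only a sketch: `xml.etree.ElementTree` supports a subset of XPath (hence `.//div/a` rather than `//div/a`), the sample markup is made up, and slicing stands in for `limit=3` on the assumption that `limit` simply caps the number of returned matches.

```python
import xml.etree.ElementTree as ET

# Made-up sample document with four links spread over two <div> elements.
doc = ET.fromstring(
    "<html><body>"
    "<div><a>one</a><a>two</a></div>"
    "<div><a>three</a><a>four</a></div>"
    "</body></html>"
)

# All <a> children of any <div>, in document order, capped at three results.
links = doc.findall(".//div/a")[:3]
print([a.text for a in links])  # ['one', 'two', 'three']
```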
With soupsavvy, developers can focus on data extraction workflows instead of wrestling with library-specific quirks and inconsistencies. Boost your web-scraping workflows by eliminating complexity and introducing:
Portability
Maintainability
Scalability
Portability
soupsavvy provides a slim, consistent selector interface, allowing DOM elements from any supported library to be wrapped and used interchangeably.
Instead of rewriting workflows when switching between libraries such as BeautifulSoup and selenium, you keep the same selectors: soupsavvy abstracts the selection logic and removes library-specific code from it.
Example
This simple workflow extracts the header text from www.example.com. selenium and BeautifulSoup each perform this operation differently.
Using BeautifulSoup
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.example.com")
soup = BeautifulSoup(response.text, "html.parser")
header = soup.find("h1")
if header is None:
    raise Exception("Element not found")
print(header.text)
Example Domain
Using selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
options = Options()
options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
driver = webdriver.Chrome(options=options)
driver.get("https://www.example.com")
header = driver.find_element(By.TAG_NAME, "h1")
print(header.text)
Switching between these requires rewriting the logic, as the libraries have different interfaces and conventions.
In soupsavvy, selectors are independent of the underlying library, allowing you to use the same workflow across different libraries. Once you know how to use selectors, you can apply them to any supported implementation.
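One way to picture this independence is an adapter layer: each library's node is wrapped behind a small common interface, and selectors talk only to that interface. The sketch below uses only the standard library; `Node`, `EtreeNode`, `DictNode` and `TagNameSelector` are hypothetical names made up for illustration and are not soupsavvy's actual classes or internals.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Iterator, Optional, Protocol


class Node(Protocol):
    """Minimal interface a selector needs from any DOM node (hypothetical)."""

    def name(self) -> str: ...
    def text(self) -> str: ...
    def descendants(self) -> Iterator["Node"]: ...


class EtreeNode:
    """Adapter over xml.etree.ElementTree elements."""

    def __init__(self, element: ET.Element) -> None:
        self._element = element

    def name(self) -> str:
        return self._element.tag

    def text(self) -> str:
        return "".join(self._element.itertext())

    def descendants(self) -> Iterator["EtreeNode"]:
        nodes = self._element.iter()
        next(nodes)  # Element.iter() yields the element itself first; skip it.
        return (EtreeNode(e) for e in nodes)


@dataclass
class DictNode:
    """Adapter over a nested-dict tree, standing in for a second library."""

    data: dict

    def name(self) -> str:
        return self.data["tag"]

    def text(self) -> str:
        children = self.data.get("children", [])
        return self.data.get("text", "") + "".join(DictNode(c).text() for c in children)

    def descendants(self) -> Iterator["DictNode"]:
        for child in self.data.get("children", []):
            node = DictNode(child)
            yield node
            yield from node.descendants()


class TagNameSelector:
    """Selects descendants by tag name, knowing only the Node interface."""

    def __init__(self, tag: str) -> None:
        self.tag = tag

    def find_all(self, root: Node) -> list:
        return [n for n in root.descendants() if n.name() == self.tag]

    def find(self, root: Node) -> Optional[Node]:
        return next(iter(self.find_all(root)), None)


# The same selector instance works on both wrapped trees.
selector = TagNameSelector("h1")
etree_root = EtreeNode(ET.fromstring("<html><body><h1>Example Domain</h1></body></html>"))
dict_root = DictNode({"tag": "html", "children": [{"tag": "h1", "text": "Example Domain"}]})
for root in (etree_root, dict_root):
    print(selector.find(root).text())  # Example Domain
```

The selector never imports or inspects the underlying tree type, so supporting another library in this sketch would only mean writing one more adapter.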
Using soupsavvy
For selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from soupsavvy import TypeSelector, to_soupsavvy
from soupsavvy.operations import Text
options = Options()
options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
driver = webdriver.Chrome(options=options)
driver.get("https://www.example.com")
root = driver.find_element(By.TAG_NAME, "html")
element = to_soupsavvy(root)
selector = TypeSelector("h1") | Text()
text = selector.find(element)
print(text)
Example Domain
For BeautifulSoup
import requests
from bs4 import BeautifulSoup
from soupsavvy import TypeSelector, to_soupsavvy
from soupsavvy.operations import Text
response = requests.get("https://www.example.com")
soup = BeautifulSoup(response.text, "html.parser")
element = to_soupsavvy(soup)
selector = TypeSelector("h1") | Text()
text = selector.find(element)
print(text)
Example Domain
Maintainability
Maintaining complex scraping workflows becomes challenging as projects grow. Adding new selectors, modifying existing ones, or adapting to changes in target websites often requires updating multiple parts of the codebase.
With traditional libraries, even seemingly simple workflows can lead to verbose code. For example, finding and handling a specific element’s text with BeautifulSoup:
from bs4 import BeautifulSoup, Tag
soup = BeautifulSoup(
"""<span class="price"><a>Click here</a><p>Price: $10</p></span>""", features="lxml"
)
span = soup.find("span", attrs={"class": "price"})
if not isinstance(span, Tag):
    raise Exception("Element not found")
p = span.find("p", recursive=False)
if p is None:
    raise Exception("Element not found")
print(p.text)
Price: $10
Switching to lxml or selenium introduces different methods, parameters, and error handling, adding to the complexity.
With soupsavvy, selectors ensure consistency through a unified interface that encapsulates the entire selection logic, however complex the relationships are, and eliminates much of the boilerplate code.
from bs4 import BeautifulSoup
from soupsavvy import TypeSelector, to_soupsavvy
from soupsavvy.operations import Text
soup = BeautifulSoup(
"""<span class="price"><a>Click here</a><p>Price: $10</p></span>""", features="lxml"
)
element = to_soupsavvy(soup)
selector = (TypeSelector("span") > TypeSelector("p")) | Text()
found = selector.find(element)
print(found)
Price: $10
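To illustrate how operators such as `>` and `|` can fold a whole lookup into one expression, here is a toy sketch built on `xml.etree.ElementTree`. The classes `BaseSelector`, `Tag`, `Child`, `Pipeline` and the `text` helper are hypothetical and much simpler than soupsavvy's selectors; the point is only the composition pattern.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional


class BaseSelector:
    """Shared operators for the toy selectors below (hypothetical design)."""

    def select(self, root: ET.Element) -> List[ET.Element]:
        raise NotImplementedError

    def find(self, root: ET.Element) -> Optional[ET.Element]:
        matches = self.select(root)
        return matches[0] if matches else None

    def __gt__(self, other: "Tag") -> "Child":
        # `parent > child` builds a child-combinator selector.
        return Child(self, other)

    def __or__(self, operation: Callable[[ET.Element], str]) -> "Pipeline":
        # `selector | operation` post-processes the selected element.
        return Pipeline(self, operation)


class Tag(BaseSelector):
    """Matches descendants of the root by tag name."""

    def __init__(self, name: str) -> None:
        self.name = name

    def matches(self, element: ET.Element) -> bool:
        return element.tag == self.name

    def select(self, root: ET.Element) -> List[ET.Element]:
        return [e for e in root.iter() if e is not root and self.matches(e)]


class Child(BaseSelector):
    """Matches direct children of whatever the parent selector matched."""

    def __init__(self, parent: BaseSelector, child: Tag) -> None:
        self.parent, self.child = parent, child

    def select(self, root: ET.Element) -> List[ET.Element]:
        return [c for p in self.parent.select(root) for c in p if self.child.matches(c)]


class Pipeline:
    """Applies an operation to the first element found by a selector."""

    def __init__(self, selector: BaseSelector, operation: Callable[[ET.Element], str]) -> None:
        self.selector, self.operation = selector, operation

    def find(self, root: ET.Element) -> Optional[str]:
        element = self.selector.find(root)
        return None if element is None else self.operation(element)


def text(element: ET.Element) -> str:
    return "".join(element.itertext())


root = ET.fromstring(
    '<div><span class="price"><a>Click here</a><p>Price: $10</p></span></div>'
)
selector = (Tag("span") > Tag("p")) | text
print(selector.find(root))  # Price: $10
```

All "not found" handling lives in one place (`Pipeline.find`), which is the kind of boilerplate the manual version has to repeat at every step.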
Scalability
When building complex scraping workflows, you might need to manage various relationships between elements, handle multiple matches, or apply different selection criteria. Traditional libraries force developers to write extensive boilerplate code to manage lists, sets, and operations on multiple elements, leading to tangled and error-prone logic.
Using BeautifulSoup
Finding all sibling elements with a specific class after an <h2> tag using BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(
"""
<p class="price">Price: $25</p>
<h2>Discounted</h2>
<span>Bargain!!!</span>
<p class="price">Price: $15</p>
<p class="price">Price: $10</p>
""",
features="lxml",
)
h2 = soup.find_all("h2")
matches = []
for tag in h2:
    matches.extend(tag.find_next_siblings(attrs={"class": "price"}))
print([match.text for match in matches])
['Price: $15', 'Price: $10']
This approach requires manually handling sibling relationships and merging results, which becomes increasingly complex in larger workflows.
Using soupsavvy
With soupsavvy, selectors encapsulate the logic for element relationships, providing a concise and reusable workflow:
from bs4 import BeautifulSoup
from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy
from soupsavvy.operations import Text
soup = BeautifulSoup(
"""
<p class="price">Price: $25</p>
<h2>Discounted</h2>
<span>Bargain!!!</span>
<p class="price">Price: $15</p>
<p class="price">Price: $10</p>
""",
features="lxml",
)
element = to_soupsavvy(soup)
selector = (TypeSelector("h2") * ClassSelector("price")) | Text()
selector.find_all(element)
['Price: $15', 'Price: $10']
Even thinking about implementing a workflow like the one below in common web scraping libraries can be overwhelming.
With soupsavvy, selectors act as modular building blocks, encapsulating both selection logic and element relationships. These selectors are reusable and can be easily combined to create complex workflows without the usual overhead.
import re
from soupsavvy import ClassSelector, PatternSelector, TypeSelector
from soupsavvy.selectors.nth import NthOfSelector
# Define selectors for the workflow
pattern_selector = PatternSelector(re.compile(r"price"))
type_selector = TypeSelector("span")
# Combine selectors using XOR (matches one or the other but not both)
xor_selector = pattern_selector ^ type_selector
# Select every second match of the XOR selector, starting with the first (1st, 3rd, 5th, ...)
nth_selector = NthOfSelector(xor_selector, nth="2n+1")
# Find those matches among the direct children of elements with class "container"
child_selector = ClassSelector("container") > nth_selector
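To make the XOR and `2n+1` steps concrete, the following plain-Python sketch applies the same two ideas to toy data. It assumes the `nth` argument follows the CSS `An+B` convention (1-based positions for n = 0, 1, 2, ...); the `(tag, text)` tuples, the two predicates and the `nth_positions` helper are stand-ins invented for illustration, not soupsavvy's matching rules.

```python
def nth_positions(a: int, b: int, count: int) -> list[int]:
    """Return 1-based positions p <= count where p = a*n + b for some n >= 0."""
    positions = []
    for p in range(1, count + 1):
        if a == 0:
            if p == b:
                positions.append(p)
        elif (p - b) % a == 0 and (p - b) // a >= 0:
            positions.append(p)
    return positions


# Toy candidates as (tag, text) pairs.
items = [
    ("span", "price: $5"),  # matches both predicates -> excluded by XOR
    ("p", "price: $7"),     # matches the pattern only -> kept
    ("span", "sold out"),   # matches the tag only -> kept
    ("p", "details"),       # matches neither -> excluded
    ("span", "new"),        # matches the tag only -> kept
]


def has_price(item):
    return "price" in item[1]


def is_span(item):
    return item[0] == "span"


# XOR: keep items matched by exactly one of the two predicates.
xor_matches = [item for item in items if has_price(item) != is_span(item)]

# 2n+1: keep the 1st, 3rd, 5th, ... of those matches.
keep = set(nth_positions(2, 1, len(xor_matches)))
selected = [item for pos, item in enumerate(xor_matches, start=1) if pos in keep]

print(nth_positions(2, 1, 6))  # [1, 3, 5]
print(selected)  # [('p', 'price: $7'), ('span', 'new')]
```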
Conclusion
soupsavvy introduces declarative selectors that simplify and unify element selection across the supported libraries. Its flexible, consistent approach removes the burden of managing different APIs, letting you focus on the task at hand.
Although the examples and tutorials provided are focused on BeautifulSoup, the concepts and workflows are applicable to all supported libraries. Dive deeper into the documentation to explore more examples and see how soupsavvy can streamline your web scraping projects, no matter the framework you’re using.
Enjoy soupsavvy and leave us feedback!
Happy scraping!