{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# About"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With many web scraping libraries available, each with unique interfaces and conventions, building workflows and maintaining consistency in DOM traversal and selection often becomes tedious, resulting in complexity and boilerplate code.\n",
"\n",
"`soupsavvy` solves this with a unified, consistent approach to selection, based on these principles:\n",
"\n",
"- **Decoupling**: Selection logic is abstracted away from DOM node and traversal implementations.\n",
"- **Framework-Agnostic**: Operates consistently with any supported library.\n",
"- **Flexibile & Extensibile**: Lightweight, reusable components used to build complex selection workflows.\n",
"\n",
"\n",
"Unlike traditional libraries that require various methods and parameters for different tasks, `soupsavvy` uses a simple, consistent selector interface:\n",
"\n",
"```python\n",
"selector = TypeSelector(\"div\")\n",
"selector.find(element)\n",
"selector.find_all(element)\n",
"```\n",
"\n",
"Selectors can encapsulate advanced logic, such as XPath queries, logical relationships, sequences and more:\n",
"\n",
"```python\n",
"selector = XPathSelector(\"//div/a\")\n",
"elements = selector.find_all(element, limit=3)\n",
"```\n",
"\n",
"With `soupsavvy`, developers can focus on data extraction workflows instead of wrestling with library-specific quirks and inconsistencies. Boost your web-scraping workflows by eliminating complexity and introducing:\n",
"\n",
"- **Portability**\n",
"- **Maintainability**\n",
"- **Scalability**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Portability\n",
"\n",
"`soupsavvy` provides a slim, consistent selector interface, allowing DOM elements from any supported library to be wrapped and used interchangeably. \n",
"\n",
"Instead of rewriting workflows when switching between libraries like `BeautifulSoup` and `selenium`, `soupsavvy` abstracts selection logic, ensuring the same selectors work seamlessly across different frameworks by eliminating any library-specific logic."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Example"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This simple workflow extracts text of the header from `www.example.com`. Both `selenium` and `BeautifulSoup` have a different way of executing this operation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using `BeautifulSoup`"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Example Domain\n"
]
}
],
"source": [
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"response = requests.get(\"https://www.example.com\")\n",
"soup = BeautifulSoup(response.text, \"html.parser\")\n",
"\n",
"p = soup.find(\"h1\")\n",
"\n",
"if p is None:\n",
" raise Exception(\"Element not found\")\n",
"\n",
"print(p.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using `selenium`"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Example Domain\n"
]
}
],
"source": [
"from selenium import webdriver\n",
"from selenium.webdriver.chrome.options import Options\n",
"from selenium.webdriver.common.by import By\n",
"\n",
"options = Options()\n",
"options.add_argument(\"--headless\")\n",
"options.add_argument(\"--no-sandbox\")\n",
"options.add_argument(\"--disable-dev-shm-usage\")\n",
"\n",
"driver = webdriver.Chrome(options=options)\n",
"driver.get(\"https://www.example.com\")\n",
"p = driver.find_element(By.TAG_NAME, \"h1\")\n",
"print(p.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Switching between these requires rewriting the logic, as the libraries have different interfaces and conventions.\n",
"\n",
"In `soupsavvy`, selectors are independent of the underlying library, allowing you to use the same workflow across different libraries. Once you know how to use selectors, you can apply them to any supported implementation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using `soupsavvy`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### For `selenium`"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Example Domain\n"
]
}
],
"source": [
"from selenium import webdriver\n",
"from selenium.webdriver.chrome.options import Options\n",
"from selenium.webdriver.common.by import By\n",
"\n",
"from soupsavvy import TypeSelector, to_soupsavvy\n",
"from soupsavvy.operations import Text\n",
"\n",
"options = Options()\n",
"options.add_argument(\"--headless\")\n",
"options.add_argument(\"--no-sandbox\")\n",
"options.add_argument(\"--disable-dev-shm-usage\")\n",
"\n",
"driver = webdriver.Chrome(options=options)\n",
"driver.get(\"https://www.example.com\")\n",
"root = driver.find_element(By.TAG_NAME, \"html\")\n",
"element = to_soupsavvy(root)\n",
"\n",
"selector = TypeSelector(\"h1\") | Text()\n",
"\n",
"text = selector.find(element)\n",
"print(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### For `BeautifulSoup`"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Example Domain\n"
]
}
],
"source": [
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"from soupsavvy import TypeSelector, to_soupsavvy\n",
"from soupsavvy.operations import Text\n",
"\n",
"response = requests.get(\"https://www.example.com\")\n",
"soup = BeautifulSoup(response.text, \"html.parser\")\n",
"\n",
"element = to_soupsavvy(soup)\n",
"selector = TypeSelector(\"h1\") | Text()\n",
"\n",
"text = selector.find(element)\n",
"print(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Maintainability\n",
"\n",
"Maintaining complex scraping workflows becomes challenging as projects grow. Adding new selectors, modifying existing ones, or adapting to changes in target websites often requires updating multiple parts of the codebase. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With traditional libraries, even seemingly simple workflows can lead to verbose code. For example, finding and handling a specific element’s text with `BeautifulSoup`:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Price: $10\n"
]
}
],
"source": [
"from bs4 import BeautifulSoup, Tag\n",
"\n",
"soup = BeautifulSoup(\n",
" \"\"\"Click here Price: $10 Price: $10
Price: $25
\n", "Price: $15
\n", "Price: $10
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "\n", "h2 = soup.find_all(\"h2\")\n", "matches = []\n", "\n", "for tag in h2:\n", " matches.extend(tag.find_next_siblings(attrs={\"class\": \"price\"}))\n", "\n", "print([match.text for match in matches])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This approach requires manually handling sibling relationships and merging results, which becomes increasingly complex in larger workflows." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using `soupsavvy`\n", "\n", "With `soupsavvy`, selectors encapsulate the logic for element relationships, providing a concise and reusable workflow:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Price: $15', 'Price: $10']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "from soupsavvy.operations import Text\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "Price: $25
\n", "Price: $15
\n", "Price: $10
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = (TypeSelector(\"h2\") * ClassSelector(\"price\")) | Text()\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even thinking about implementing a workflow like the one below in common web scraping libraries can be overwhelming.\n", "With `soupsavvy`, selectors act as modular building blocks, encapsulating both selection logic and element relationships. These selectors are reusable and can be easily combined to create complex workflows without the usual overhead." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "from soupsavvy import ClassSelector, PatternSelector, TypeSelector\n", "from soupsavvy.selectors.nth import NthOfSelector\n", "\n", "# Define selectors for the workflow\n", "pattern_selector = PatternSelector(re.compile(r\"price\"))\n", "type_selector = TypeSelector(\"span\")\n", "\n", "# Combine selectors using XOR (matches one or the other but not both)\n", "xor_selector = pattern_selector ^ type_selector\n", "\n", "# Select every second match of the XOR selector\n", "nth_selector = NthOfSelector(xor_selector, nth=\"2n+1\")\n", "\n", "# Combine selectors to find specific children inside elements with a class\n", "child_selector = ClassSelector(\"container\") > nth_selector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In conclusion, `soupsavvy` revolutionizes web scraping by introducing declarative selectors that simplify and unify the selection process across various libraries. Its flexible and consistent approach eliminates the complexity of managing different APIs, allowing you to focus on the task at hand.\n", "\n", "Although the examples and tutorials provided are focused on `BeautifulSoup`, the concepts and workflows are applicable to all supported libraries. Dive deeper into the documentation to explore more examples and see how `soupsavvy` can streamline your web scraping projects, no matter the framework you’re using." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Enjoy `soupsavvy` and leave us feedback!** \n", "**Happy scraping!**" ] } ], "metadata": { "kernelspec": { "display_name": "soupsavvy", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 2 }