{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Selectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selectors are used to search for elements in `BeautifulSoup` objects. This tutorial demonstrates various simple selectors, which are core features of `soupsavvy`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Every `soupsavvy` selector follows a consistent interface, providing an API to:\n", "- Search for elements within `BeautifulSoup` objects.\n", "- Check selectors for equality.\n", "- Combine selectors to create more complex queries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Find" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `find` method searches for the first element that matches the selector. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", " Animal Farm\n", "

Price: $10

\n", "

Price: $20

\n", "

Price: $30

\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "selector = ClassSelector(\"price\")\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Strict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When no match is found, the behavior of the `find` method is controlled by the `strict` parameter:\n", "- **`True`** - Raises a `TagNotFoundException`.\n", "- **`False`** - Returns `None`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, to_soupsavvy\n", "from soupsavvy.exceptions import TagNotFoundException\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", " Animal Farm\n", "

Hello World

\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "selector = ClassSelector(\"price\")\n", "\n", "print(f\"NOT STRICT: {selector.find(element)}\")\n", "\n", "try:\n", " selector.find(element, strict=True)\n", "except TagNotFoundException as e:\n", " print(f\"STRICT: {e}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Recursive" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The search depth is defined by the `recursive` parameter:\n", "- **`True`** - Performs a recursive search on the element's descendants.\n", "- **`False`** - Searches only within the direct children of the element.\n", "\n", "This parameter also applies to the `find_all` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", " Animal Farm\n", "
\n", "

Price: $10

\n", "

Price: $20

\n", "
\n", "

Price: $30

\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "selector = ClassSelector(\"price\")\n", "selector.find(element, recursive=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Find All" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `find_all` method searches for all elements that match the selector. \n", "The results list contains unique elements, maintaining the same order as they appear in the document." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", " Hello World\n", "

Price: $10

\n", "

Price: $20

\n", "

Price: $30

\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "selector = ClassSelector(\"price\")\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Limit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can restrict the number of elements returned by using the `limit` parameter:\n", "- **`None`** - Returns all matching elements.\n", "- **`int`** - Returns up to the specified number of matching elements." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", " Hello World\n", "

Price: $10

\n", "

Price: $20

\n", "

Price: $30

\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "selector = ClassSelector(\"price\")\n", "selector.find_all(element, limit=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Equality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All selectors can be compared. If two selectors are equal, their search results will always be identical." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from soupsavvy import TypeSelector\n", "\n", "print(f\"{TypeSelector('p') == TypeSelector('div') = }\")\n", "print(f\"{TypeSelector('p') == TypeSelector('p') = }\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selectors can be combined in various ways to create composite selectors. Read about this in the following tutorial." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", " Hello World\n", "

Price: $10

\n", " Price: $20\n", "

Price: $30

\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "selector = ClassSelector(\"price\") & TypeSelector(\"a\")\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## AttributeSelector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Attribute selectors in `soupsavvy` allow you to select elements based on their attribute values. \n", "For more information about the CSS counterpart, refer to [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find an element that has a specific attribute, regardless of the attribute's value." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import AttributeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", " Animal Farm\n", " Price: $20\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "selector = AttributeSelector(\"href\")\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find an element with an exact attribute value by passing a string." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import AttributeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Price: $20

\n", " Home\n", " Add to Cart\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "selector = AttributeSelector(\"role\", value=\"button\")\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find an element whose attribute value matches a regular expression pattern." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import AttributeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", " Animal Farm\n", " fictiondb\n", " worldcat\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "selector = AttributeSelector(\"href\", value=re.compile(r\"worldcat\\.org/.*/\\d{10}\"))\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specific attribute selectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most commonly used attributes have their own dedicated selectors: \n", "\n", "- **`IdSelector`**: Matches elements by their `id` attribute value.\n", "- **`ClassSelector`**: Matches elements by their `class` attribute value.\n", "\n", "For more information about the CSS counterparts, refer to the Mozilla documentation on [Class](https://developer.mozilla.org/en-US/docs/Web/CSS/Class_selectors) and [ID](https://developer.mozilla.org/en-US/docs/Web/CSS/ID_selectors) selectors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"Animal Farm

Price: $20

\"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "price_selector = ClassSelector(\"price\")\n", "price_selector.find(element)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import IdSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"

Book

Price: $20

\"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "price_selector = IdSelector(re.compile(r\"^13.*0$\"))\n", "price_selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TypeSelector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`TypeSelector` is used to select elements based on their tag name. For more information about css counterpart refer to [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/Type_selectors)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import TypeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", " Animal Farm\n", "

Price: $10

\n", " Hello World\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "price_selector = TypeSelector(\"p\")\n", "price_selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## UniversalSelector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`UniversalSelector` is a wildcard selector that matches any tag. \n", "Its CSS counterpart is `*`. For more information, refer to [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/Universal_selectors)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import UniversalSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Rebelión en la granja

\n", "

Some animals are more equal than others

\n", " Animal Farm\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "any_selector = UniversalSelector()\n", "any_selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PatternSelector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`PatternSelector` is designed to select elements based on their text content.\n", "While `BeautifulSoup` returns `NavigableString` for such queries, which is limiting, `PatternSelector` returns elements with text content that matches the provided pattern." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find an element with exact text content by passing a string." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import PatternSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Rebelión en la granja

\n", "

Some animals are more equal than others

\n", " Animal Farm\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "selector = PatternSelector(\"Animal Farm\")\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find elements based on a regular expression pattern." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import PatternSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Rebelión en la granja

\n", "

Some animals are more equal than others

\n", " Animal Farm\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "selector = PatternSelector(re.compile(r\"animal\", re.IGNORECASE))\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## XPathSelector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `XPathSelector` enables the use of XPath expressions to select elements, a feature not natively supported by `BeautifulSoup`. It relies on `lxml`, which must be installed. The XPath expression must target HTML elements in order to return valid results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import XPathSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", " 1234\n", " Frankenstein\n", "

Wild Animal

\n", " Animal Farm\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "selector = XPathSelector(\"//span[@class='title'][contains(text(),'Animal')]\")\n", "# selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ExpressionSelector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `ExpressionSelector` allows you to define your own custom logic for selecting elements by providing a predicate function. This function evaluates each element and decides whether it should be included in the result set.\n", "\n", "This works similarly to the `BeautifulSoup` API, where you can pass a predicate function to `find` methods:\n", "\n", "```python\n", "soup.find(lambda tag: tag.name == 'div')\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ExpressionSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", " 1234\n", "

Animal Farm

\n", " Frankenstein\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "selector = ExpressionSelector(\n", " lambda tag: tag.name != \"p\" and \"title\" in tag.get()[\"class\"]\n", ")\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These fundamental selectors form the core of `soupsavvy` and provide the building blocks for more complex queries. \n", "Read about composite selectors [here](https://soupsavvy.readthedocs.io/en/stable/demos/combining.html).\n", "\n", "**Enjoy `soupsavvy` and leave us feedback!** \n", "**Happy scraping!**" ] } ], "metadata": { "kernelspec": { "display_name": "soupsavvy", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 2 }