{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Selectors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selectors are used to search for elements in `BeautifulSoup` objects. This tutorial demonstrates the simple selectors that form the core of `soupsavvy`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## API"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Every `soupsavvy` selector follows a consistent interface, providing an API to:\n",
    "- Search for elements within `BeautifulSoup` objects.\n",
    "- Check selectors for equality.\n",
    "- Combine selectors to create more complex queries."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Find"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `find` method returns the first element that matches the selector."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, to_soupsavvy\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "    <span class=\"title\">Animal Farm</span>\n",
    "    <p class=\"price\">Price: $10</p>\n",
    "    <p class=\"price\">Price: $20</p>\n",
    "    <p class=\"price\">Price: $30</p>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "selector = ClassSelector(\"price\")\n",
    "selector.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Strict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When no match is found, the behavior of the `find` method is controlled by the `strict` parameter:\n",
    "- **`True`** - Raises a `TagNotFoundException`.\n",
    "- **`False`** - Returns `None`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, to_soupsavvy\n",
    "from soupsavvy.exceptions import TagNotFoundException\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "    <span class=\"title\">Animal Farm</span>\n",
    "    <p>Hello World</p>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "selector = ClassSelector(\"price\")\n",
    "\n",
    "print(f\"NOT STRICT: {selector.find(element)}\")\n",
    "\n",
    "try:\n",
    "    selector.find(element, strict=True)\n",
    "except TagNotFoundException as e:\n",
    "    print(f\"STRICT: {e}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Recursive"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The search depth is defined by the `recursive` parameter:\n",
    "- **`True`** - Performs a recursive search on the element's descendants.\n",
    "- **`False`** - Searches only within the direct children of the element.\n",
    "\n",
    "This parameter also applies to the `find_all` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, to_soupsavvy\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "        <span class=\"title\">Animal Farm</span>\n",
    "        <div class=\"inactive\">\n",
    "            <p class=\"price\">Price: $10</p>\n",
    "            <p class=\"price\">Price: $20</p>\n",
    "        </div>\n",
    "        <p class=\"price\">Price: $30</p>\n",
    "    \"\"\",\n",
    "    features=\"html.parser\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "selector = ClassSelector(\"price\")\n",
    "selector.find(element, recursive=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Find All"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `find_all` method searches for all elements that match the selector.  \n",
    "The resulting list contains unique elements in the order they appear in the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, to_soupsavvy\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "    <span>Hello World</span>\n",
    "    <p class=\"price\">Price: $10</p>\n",
    "    <p class=\"price\">Price: $20</p>\n",
    "    <p class=\"price\">Price: $30</p>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "selector = ClassSelector(\"price\")\n",
    "selector.find_all(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Limit"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can restrict the number of elements returned by using the `limit` parameter:\n",
    "- **`None`** - Returns all matching elements.\n",
    "- **`int`** - Returns up to the specified number of matching elements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, to_soupsavvy\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "    <span>Hello World</span>\n",
    "    <p class=\"price\">Price: $10</p>\n",
    "    <p class=\"price\">Price: $20</p>\n",
    "    <p class=\"price\">Price: $30</p>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "selector = ClassSelector(\"price\")\n",
    "selector.find_all(element, limit=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Equality"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All selectors can be compared. If two selectors are equal, their search results will always be identical."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from soupsavvy import TypeSelector\n",
    "\n",
    "print(f\"{TypeSelector('p') == TypeSelector('div') = }\")\n",
    "print(f\"{TypeSelector('p') == TypeSelector('p') = }\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Combining"
   ]
  },
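  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selectors can be combined in various ways to create composite selectors. The example here uses the `&` operator to require that an element match both selectors; the other combinations are covered in the tutorial on composite selectors."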
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "    <span>Hello World</span>\n",
    "    <p class=\"price\">Price: $10</p>\n",
    "    <a class=\"price\">Price: $20</a>\n",
    "    <p class=\"price\">Price: $30</p>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "selector = ClassSelector(\"price\") & TypeSelector(\"a\")\n",
    "selector.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## AttributeSelector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Attribute selectors in `soupsavvy` allow you to select elements based on their attribute values.  \n",
    "For more information about the CSS counterpart, refer to [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find an element that has a specific attribute, regardless of the attribute's value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import AttributeSelector, to_soupsavvy\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "    <span>Animal Farm</span>\n",
    "    <a href=\"/shop\">Price: $20</a>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "selector = AttributeSelector(\"href\")\n",
    "selector.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find an element with an exact attribute value by passing a string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import AttributeSelector, to_soupsavvy\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "    <p class=\"price\">Price: $20</p>\n",
    "    <a role=\"main\">Home</a>\n",
    "    <a role=\"button\">Add to Cart</a>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "selector = AttributeSelector(\"role\", value=\"button\")\n",
    "selector.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find an element whose attribute value matches a regular expression pattern."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import AttributeSelector, to_soupsavvy\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "        <span>Animal Farm</span>\n",
    "        <a href=\"https://www.fictiondb.com/title/animal-farm~george-orwell~161188.htm\">fictiondb</a>\n",
    "        <a href=\"https://search.worldcat.org/title/1056176764\">worldcat</a>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "selector = AttributeSelector(\"href\", value=re.compile(r\"worldcat\\.org/.*/\\d{10}\"))\n",
    "selector.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Specific attribute selectors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The most commonly used attributes have their own dedicated selectors:\n",
    "\n",
    "- **`IdSelector`**: Matches elements by their `id` attribute value.\n",
    "- **`ClassSelector`**: Matches elements by their `class` attribute value.\n",
    "\n",
    "For more information about their CSS counterparts, refer to Mozilla for [Class](https://developer.mozilla.org/en-US/docs/Web/CSS/Class_selectors) and [ID](https://developer.mozilla.org/en-US/docs/Web/CSS/ID_selectors) selectors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ClassSelector, to_soupsavvy\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"<span class=\"title\">Animal Farm</span><p class=\"price\">Price: $20</p>\"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "price_selector = ClassSelector(\"price\")\n",
    "price_selector.find(element)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import IdSelector, to_soupsavvy\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"<p id=\"12ghj8\">Book</p><p id=\"13cji0\" class=\"price\">Price: $20</p>\"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "price_selector = IdSelector(re.compile(r\"^13.*0$\"))\n",
    "price_selector.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TypeSelector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`TypeSelector` is used to select elements based on their tag name. For more information about its CSS counterpart, refer to [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/Type_selectors)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import TypeSelector, to_soupsavvy\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "        <span class=\"title\">Animal Farm</span>\n",
    "        <p class=\"price\">Price: $10</p>\n",
    "        <span>Hello World</span>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "price_selector = TypeSelector(\"p\")\n",
    "price_selector.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## UniversalSelector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`UniversalSelector` is a wildcard selector that matches any tag.  \n",
    "Its CSS counterpart is `*`. For more information, refer to [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/Universal_selectors)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import UniversalSelector, to_soupsavvy\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "        <p class=\"title\" lang=\"es\">Rebelión en la granja</p>\n",
    "        <p class=\"description\" lang=\"en\">Some animals are more equal than others</p>\n",
    "        <span class=\"title\" lang=\"en\">Animal Farm</span>\n",
    "    \"\"\",\n",
    "    features=\"html.parser\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "any_selector = UniversalSelector()\n",
    "any_selector.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PatternSelector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`PatternSelector` is designed to select elements based on their text content.\n",
    "While `BeautifulSoup` returns a `NavigableString` for such queries, which is limiting, `PatternSelector` returns the elements whose text content matches the provided pattern."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find an element with exact text content by passing a string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import PatternSelector, to_soupsavvy\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "        <p class=\"title\" lang=\"es\">Rebelión en la granja</p>\n",
    "        <p class=\"description\" lang=\"en\">Some animals are more equal than others</p>\n",
    "        <span class=\"title\" lang=\"en\">Animal Farm</span>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "selector = PatternSelector(\"Animal Farm\")\n",
    "selector.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find an element whose text content matches a regular expression pattern."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import PatternSelector, to_soupsavvy\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "        <p class=\"title\" lang=\"es\">Rebelión en la granja</p>\n",
    "        <p class=\"description\" lang=\"en\">Some animals are more equal than others</p>\n",
    "        <span class=\"title\" lang=\"en\">Animal Farm</span>\n",
    "    \"\"\",\n",
    "    features=\"lxml\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "selector = PatternSelector(re.compile(r\"animal\", re.IGNORECASE))\n",
    "selector.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## XPathSelector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `XPathSelector` enables the use of XPath expressions to select elements, a feature not natively supported by `BeautifulSoup`. It relies on the `lxml` library, which must be installed. The XPath expression must target HTML elements in order to return valid results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import XPathSelector, to_soupsavvy\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "        <span class=\"id\">1234</span>\n",
    "        <span class=\"title\">Frankenstein</span>\n",
    "        <p class=\"title\">Wild Animal</p>\n",
    "        <span class=\"title\">Animal Farm</span>\n",
    "    \"\"\",\n",
    "    features=\"html.parser\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "selector = XPathSelector(\"//span[@class='title'][contains(text(),'Animal')]\")\n",
    "# selector.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ExpressionSelector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `ExpressionSelector` allows you to define your own custom logic for selecting elements by providing a predicate function. This function evaluates each element and decides whether it should be included in the result set.\n",
    "\n",
    "This works similarly to the `BeautifulSoup` API, where you can pass a predicate function to `find` methods:\n",
    "\n",
    "```python\n",
    "soup.find(lambda tag: tag.name == 'div')\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import ExpressionSelector, to_soupsavvy\n",
    "\n",
    "soup = BeautifulSoup(\n",
    "    \"\"\"\n",
    "        <span class=\"id\">1234</span>\n",
    "        <p class=\"title\">Animal Farm</p>\n",
    "        <span class=\"title\">Frankenstein</span>\n",
    "    \"\"\",\n",
    "    features=\"html.parser\",\n",
    ")\n",
    "element = to_soupsavvy(soup)\n",
    "selector = ExpressionSelector(\n",
    "    lambda tag: tag.name != \"p\" and \"title\" in tag.get()[\"class\"]\n",
    ")\n",
    "selector.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These fundamental selectors form the core of `soupsavvy` and provide the building blocks for more complex queries.  \n",
    "Read about composite selectors [here](https://soupsavvy.readthedocs.io/en/stable/demos/combining.html).\n",
    "\n",
    "**Enjoy `soupsavvy` and leave us feedback!**  \n",
    "**Happy scraping!**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "soupsavvy",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}