{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Composite Selectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Composite selectors allow you to combine multiple selectors into one for more refined search criteria. \n", "Various ways of combining selectors are described in this tutorial." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Combinators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inspired by CSS, **Combinators** in `soupsavvy` allow you to define the relationship between multiple selectors. \n", "For more information on CSS combinators, you can refer to [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/Child_combinator)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Operators\n", "\n", "Combinators can be created using operators as a more concise alternative. Each combinator has a corresponding operator that defines the relationship between two selectors.\n", "\n", "```python\n", "Combinator(left, right) == left {operator} right\n", "```\n", "\n", "For example, the `>>` operator can be used as a shorthand for `DescendantCombinator`:\n", "\n", "```python\n", "DescendantCombinator(left, right) == left >> right\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, PatternSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "
Price: $30
\n", "Price: $20
\n", " \n", "Price: $30
\n", "Price: $10
\n", "Price: $20
\n", " \n", "Price: $30
\n", " \n", "Price: $50
\n", " \n", "Price: $20
\n", " \n", "Price: $30
\n", "Price: $10
\n", "Price: $15
\n", " \n", "Price: $20
\n", "Price: $30
\n", "Price: $20
\n", " \n", "Price: $30
\n", "Price: $15
\n", " \n", "Price: $20
\n", "Price: $10
\n", "Price: $20
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = PatternSelector(\"Discounted\") + ClassSelector(\"price\")\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SubsequentSiblingCombinator\n", "\n", "Defines the relationship between two selectors, where the second selector matches all siblings that follow the element matched by the first selector.\n", "\n", "**CSS Example:**\n", "```css\n", "div ~ .price\n", "```\n", "\n", "**Operator:** `*`\n", "\n", "**Reference:** [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/Subsequent-sibling_combinator)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "Price: $25
\n", "Price: $15
\n", "Price: $10
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = TypeSelector(\"h2\") * ClassSelector(\"price\")\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ParentCombinator\n", "\n", "Defines the relationship between two selectors, where the second selector matches all instances of the element that is a parent of the first matched element.\n", "\n", "**CSS Example:**\n", "```css\n", ".discount:has(> p)\n", "```\n", "\n", "**Operator:** `<`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "Price: $15
\n", " \n", "Price: $25
\n", " \n", " \n", " \n", "Price: $35
\n", "Price: $10
\n", " \n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = TypeSelector(\"p\") < ClassSelector(\"discount\")\n", "result = selector.find(element)\n", "print(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### AncestorCombinator\n", "\n", "Defines the relationship between two selectors, where the second selector matches all instances of the element that is an ancestor of the first matched element.\n", "\n", "**CSS Example:**\n", "```css\n", ".discount:has(p)\n", "```\n", "\n", "**Operator:** `<<`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "Price: $15
\n", " \n", "Price: $25
\n", " \n", " \n", " \n", "Price: $35
Price: $10
\n", " \n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = TypeSelector(\"p\") << ClassSelector(\"discount\")\n", "print(\"\\n\\n\".join(str(element) for element in selector.find_all(element)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Logical Selectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These selectors allow you to create new selectors by combining multiple selectors using logical operators such as `AND`, `OR`, `NOT`, and `XOR`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Equality\n", "\n", "Logical selectors, unlike combinators, are commutative. This means that the order of selectors within a logical selector does not affect the result.\n", "\n", "```python\n", "first & second == second & first\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from soupsavvy import ClassSelector\n", "\n", "discount_selector = ClassSelector(\"discount\")\n", "price_selector = ClassSelector(\"price\")\n", "\n", "print(\n", " \"left & right == right & left:\",\n", " discount_selector & price_selector == price_selector & discount_selector,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Additionally, two instances can be considered equal even if they contain a different number of selectors, as long as they represent the same criteria." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from soupsavvy import AttributeSelector, ClassSelector, SelectorList\n", "\n", "discount_selector = ClassSelector(\"discount\")\n", "price_selector = ClassSelector(\"price\")\n", "another_price_selector = AttributeSelector(\"class\", value=\"price\")\n", "\n", "print(\n", " SelectorList(discount_selector, price_selector)\n", " == SelectorList(discount_selector, price_selector, another_price_selector)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### AndSelector\n", "\n", "`AndSelector` corresponds to the CSS compound selector, which is a concatenation of multiple selectors. It selects elements that match all of the specified selectors.\n", "\n", "**CSS Example:**\n", "```css\n", "p.price\n", "```\n", "\n", "**Operator:** `&`\n", "\n", "**Reference:** [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors/Selector_structure#compound_selector)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "Animal Farm
\n", " Price: $30\n", "Price: $20
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = TypeSelector(\"p\") & ClassSelector(\"price\")\n", "print(selector.find(element))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SelectorList\n", "\n", "`SelectorList` corresponds to the CSS selector list. It selects elements that match any of the specified selectors.\n", "\n", "**CSS Example:**\n", "```css\n", "h1, h2\n", "```\n", "\n", "**Operator:** `|`\n", "\n", "**Aliases:** `OrSelector`\n", "\n", "**Reference:** [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/Selector_list)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "Price: €10
\n", "Price: $20
\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = ~ClassSelector(\"discount\")\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple Selectors\n", "\n", "When more than one selector is passed to `NotSelector`, it selects elements that do **not** match **any** of them. Alternatively, a `SelectorList` can be negated to exclude multiple selectors:\n", "\n", "```python\n", "NotSelector(left, right) == ~(left | right)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, NotSelector, PatternSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "Price: €10
\n", "Price: $20
\n", "Price: €15
\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "discount_selector = ClassSelector(\"discount\")\n", "dollars_selector = PatternSelector(re.compile(r\"\\$\\d+\"))\n", "selector = NotSelector(discount_selector, dollars_selector)\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### XORSelector\n", "\n", "`XORSelector` corresponds to the logical XOR operation on selectors, selecting elements that match exactly one of the provided selectors.\n", "\n", "**CSS Equivalent:** \n", "While CSS does not have a direct counterpart, you can achieve similar results using a selector list with the `:not()` pseudo-class:\n", "\n", "```css\n", "span:not(.discount), .discount:not(span)\n", "```\n", "\n", "**Operator:** `^`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", " Buy!\n", "Price: $10
\n", " Price: $20\n", "Price: $30
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = ClassSelector(\"discount\") ^ TypeSelector(\"span\")\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Relative Selectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A relative selector, in addition to its selection criteria, defines a relationship with an anchor element.\n", "\n", "According to [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors/Selector_structure#relative_selector), relative selectors represent elements in relation to anchor element(s), typically introduced by a combinator.\n", "\n", "**CSS Example**:\n", " ```css\n", " .discount:has(> p)\n", " ```\n", "\n", "In this case, an element of type `p` is in a child-parent relationship with the element of class `discount` (the anchor).\n", "\n", "In the context of `soupsavvy`:\n", "\n", "- **Anchor Element**: The `bs4` object being searched.\n", "- **Relative Element**: Any element that matches the selector and maintains a specified relationship with the anchor element.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Anchor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An alternative way of creating relative selectors is to use the `Anchor` object with an operator. The operators match those used in combinators.\n", "\n", "- `>>` -> `RelativeDescendant`\n", "- `>` -> `RelativeChild`\n", "- `<<` -> `RelativeAncestor`\n", "- `<` -> `RelativeParent`\n", "- `+` -> `RelativeNextSibling`\n", "- `*` -> `RelativeSubsequentSibling`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import Anchor, ClassSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "Price: $10
\n", "Price: $20
\n", "Animal Farm
\n", "Price: $30
\n", "Price: $20
\n", "Price: $10
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "div = element.find_all(\"div\")[0]\n", "\n", "selector = Anchor + ClassSelector(\"price\")\n", "selector.find_all(div)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Recursivity\n", "\n", "Relative selectors in `soupsavvy` are not affected by the `recursive` parameter. Their behavior is determined solely by the relationship between the anchor and relative elements. Below, the `RelativeDescendant` relationship takes precedence over the `recursive` parameter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import Anchor, ClassSelector, HasSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "Price: $10
\n", "Price: $20
\n", "Price: $20
\n", "Price: $15
\n", " \n", "Price: $20
\n", "Animal Farm
\n", " Hello World\n", "Price: $30
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = HasSelector(Anchor + ClassSelector(\"price\"))\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Ancestors Search\n", "\n", "Combining `RelativeAncestor` and `RelativeParent` selectors with `HasSelector` allows you to find elements that have a specific ancestor or parent. For instance, you can locate all elements that have an ancestor with class `breaking`. The matched ancestor does not have to be a descendant of the `bs4` object being searched." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import Anchor, ClassSelector, HasSelector, NthOfSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "Price: €1
\n", "Price: $2
\n", " Bestseller\n", "Price: €3
\n", "Price: €4
\n", "Price: €5
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = NthOfSelector(ClassSelector(\"price\"), nth=\"2n\")\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NthLastOfSelector\n", "\n", "The `NthLastOfSelector` functions similarly to the `NthOfSelector`, but it counts elements from the end of the list. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import NthLastOfSelector, PatternSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", " Animal Farm\n", "Price: €1
\n", "Price: $2
\n", " Bestseller\n", "Price: €3
\n", "Price: €4
\n", "Price: €5
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = NthLastOfSelector(\n", " PatternSelector(re.compile(\"^price\", re.IGNORECASE)),\n", " nth=\"odd\",\n", ")\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### OnlyOfSelector\n", "\n", "The `OnlyOfSelector` selects an element only if it is the **sole** matching element among its siblings. \n", "If more than one element matches the specified selector, none will be selected." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, OnlyOfSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "Price: $15
\n", "Price: $20
\n", "Price: $30
\n", "