{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Composite Selectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Composite selectors allow you to combine multiple selectors into one for more refined search criteria. \n", "Various ways of combining selectors are described in this tutorial." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Combinators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inspired by CSS, **Combinators** in `soupsavvy` allow you to define the relationship between multiple selectors. \n", "For more information on CSS combinators, you can refer to [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/Child_combinator)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Operators\n", "\n", "Combinators can be created using operators as a more concise alternative. Each combinator has a corresponding operator that defines the relationship between two selectors.\n", "\n", "```python\n", "Combinator(left, right) == left {operator} right\n", "```\n", "\n", "For example, the `>>` operator can be used as a shorthand for `DescendantCombinator`:\n", "\n", "```python\n", "DescendantCombinator(left, right) == left >> right\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Price: $30

\n", "
\n", " Animal Farm\n", " \n", "

Price: $20

\n", "
\n", "
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = ClassSelector(\"book\") >> ClassSelector(\"price\")\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Multiple Selectors\n", "\n", "Combinators can chain any number of selectors when they are passed as positional arguments.\n", "\n", "```python\n", "Combinator(first, second, third)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, DescendantCombinator, IdSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Price: $30

\n", "
\n", " Animal Farm\n", "

Price: $10

\n", "
\n", "
\n", "
\n", " Animal Farm\n", " \n", "

Price: $20

\n", "
\n", "
\n", "
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = DescendantCombinator(\n", " IdSelector(\"available\"),\n", " ClassSelector(\"book\"),\n", " ClassSelector(\"price\"),\n", ")\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combinators Equality\n", "\n", "Two combinators are considered equal only if they are of the same type and contain the exact same selectors in the same order. The order of selectors is significant:\n", "\n", "```python\n", "left {operator} right != right {operator} left\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from soupsavvy import ClassSelector\n", "\n", "book_selector = ClassSelector(\"book\")\n", "price_selector = ClassSelector(\"price\")\n", "\n", "print(\n", " \"left >> right == right >> left:\",\n", " price_selector >> book_selector == book_selector >> price_selector,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Non-Recursive\n", "\n", "For combinators, setting `recursive=False` ensures that elements are returned only if the element matched by the first selector is a direct child of the searched element." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Price: $30

\n", " \n", "
\n", " Animal Farm\n", " \n", "

Price: $50

\n", "
\n", "
\n", "
\n", "
\n", " Animal Farm\n", " \n", "

Price: $20

\n", "
\n", "
\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = ClassSelector(\"book\") >> ClassSelector(\"price\")\n", "selector.find(element, recursive=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining Combinators\n", "\n", "Combinators can be combined to replicate complex CSS relationships, like:\n", "\n", "```css\n", "#available > div .price\n", "```\n", "\n", "This is achieved using `ChildCombinator` and `DescendantCombinator` together.\n", "\n", "### Precedence Caveats\n", "\n", "Note, that some operators have higher precedence than others, which can affect the order in which expressions are evaluated.\n", "\n", "```python\n", "left > middle >> right\n", "```\n", "\n", "The `>>` (DescendantCombinator) takes precedence over `>` (ChildCombinator), resulting in:\n", "\n", "```python\n", "ChildCombinator(left, DescendantCombinator(middle, right))\n", "``` \n", "\n", "Use parentheses to adjust precedence as needed." 
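, "\n", "\n", "The grouping follows from Python's own operator precedence, in which `>>` binds tighter than `>`. A minimal sketch with a hypothetical stand-in class (`Toy` is an assumption made for illustration and is not part of `soupsavvy`; it only records how the expression was grouped):\n", "\n",
"```python\n", "class Toy:\n", "    \"\"\"Hypothetical stand-in that records how operators were grouped.\"\"\"\n", "\n", "    def __init__(self, name):\n", "        self.name = name\n", "\n", "    def __gt__(self, other):  # stands in for ChildCombinator\n", "        return Toy(f\"Child({self.name}, {other.name})\")\n", "\n", "    def __rshift__(self, other):  # stands in for DescendantCombinator\n", "        return Toy(f\"Descendant({self.name}, {other.name})\")\n", "\n", "\n", "left, middle, right = Toy(\"left\"), Toy(\"middle\"), Toy(\"right\")\n", "\n", "print((left > middle >> right).name)  # Child(left, Descendant(middle, right))\n", "print(((left > middle) >> right).name)  # Descendant(Child(left, middle), right)\n", "```\n", "\n",
"Only the parenthesized form mirrors the structure of `#available > div .price`."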
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, IdSelector, TypeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Price: $30

\n", "
\n", " Animal Farm\n", "

Price: $10

\n", "
\n", "
\n", "
\n", " Animal Farm\n", " \n", "

Discounted

\n", "

Price: $15

\n", "
\n", "

Price: $20

\n", "
\n", "
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = (IdSelector(\"available\") > TypeSelector(\"div\")) >> ClassSelector(\"price\")\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DescendantCombinator\n", "\n", "Defines the relationship between two selectors, where the second selector matches a descendant of the element matched by the first selector. In CSS, this relationship is written as a single space between two selectors.\n", "\n", "**CSS Example:**\n", "```css\n", ".book .price\n", "```\n", "\n", "This matches all tags with the class `price` that are descendants of tags with the class `book`.\n", "\n", "**Operator:** `>>`\n", "\n", "**Reference:** [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Price: $30

\n", "
\n", " Animal Farm\n", " \n", "

Price: $20

\n", "
\n", "
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = ClassSelector(\"book\") >> ClassSelector(\"price\")\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ChildCombinator\n", "\n", "Defines the relationship between two selectors, where the second selector matches only the direct children of the element matched by the first selector.\n", "\n", "**CSS Example:**\n", "```css\n", "div > .price\n", "```\n", "\n", "**Operator:** `>`\n", "\n", "**Reference:** [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/Child_combinator)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Price: $30

\n", "
\n", " Animal Farm\n", " \n", "

Discounted

\n", "

Price: $15

\n", "
\n", "

Price: $20

\n", "
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = TypeSelector(\"div\") > ClassSelector(\"price\")\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NextSiblingCombinator\n", "\n", "Defines the relationship between two selectors, where the second selector matches the immediate sibling that directly follows the element matched by the first selector.\n", "\n", "**CSS Example:**\n", "```css\n", "div + .price\n", "```\n", "\n", "**Operator:** `+`\n", "\n", "**Reference:** [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/Next-sibling_combinator)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, PatternSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Discounted

\n", " Unavailable\n", "

Price: $10

\n", "

Discounted

\n", "

Price: $20

\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = PatternSelector(\"Discounted\") + ClassSelector(\"price\")\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SubsequentSiblingCombinator\n", "\n", "Defines the relationship between two selectors, where the second selector matches all siblings that follow the element matched by the first selector.\n", "\n", "**CSS Example:**\n", "```css\n", "div ~ .price\n", "```\n", "\n", "**Operator:** `*`\n", "\n", "**Reference:** [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/Subsequent-sibling_combinator)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Price: $25

\n", "

Discounted

\n", " Bargain!!!\n", "

Price: $15

\n", "

Price: $10

\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = TypeSelector(\"h2\") * ClassSelector(\"price\")\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ParentCombinator\n", "\n", "Defines the relationship between two selectors, where the second selector matches every element that is the parent of an element matched by the first selector.\n", "\n", "**CSS Example:**\n", "```css\n", ".discount:has(> p)\n", "```\n", "\n", "**Operator:** `<`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Price: $15

\n", " \n", "

Price: $25

\n", "
\n", " \n", " \n", "
\n", "

Price: $35

\n", "
\n", "
\n", " \n", "

Price: $10

\n", "
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = TypeSelector(\"p\") < ClassSelector(\"discount\")\n", "result = selector.find(element)\n", "print(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### AncestorCombinator\n", "\n", "Defines the relationship between two selectors, where the second selector matches every element that is an ancestor of an element matched by the first selector.\n", "\n", "**CSS Example:**\n", "```css\n", ".discount:has(p)\n", "```\n", "\n", "**Operator:** `<<`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Price: $15

\n", " \n", "

Price: $25

\n", "
\n", " \n", " \n", "

Price: $35

\n", "
\n", " \n", "

Price: $10

\n", "
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = TypeSelector(\"p\") << ClassSelector(\"discount\")\n", "print(\"\\n\\n\".join(str(element) for element in selector.find_all(element)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Logical Selectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These selectors allow you to create new selectors by combining multiple selectors using logical operators such as `AND`, `OR`, `NOT`, and `XOR`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Equality\n", "\n", "Logical selectors, unlike combinators, are commutative. This means that the order of selectors within a logical selector does not affect the result.\n", "\n", "```python\n", "first & second == second & first\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from soupsavvy import ClassSelector\n", "\n", "discount_selector = ClassSelector(\"discount\")\n", "price_selector = ClassSelector(\"price\")\n", "\n", "print(\n", " \"left & right == right & left:\",\n", " discount_selector & price_selector == price_selector & discount_selector,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Additionally, two instances can be considered equal even if they contain a different number of selectors, as long as they represent the same criteria." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from soupsavvy import AttributeSelector, ClassSelector, SelectorList\n", "\n", "discount_selector = ClassSelector(\"discount\")\n", "price_selector = ClassSelector(\"price\")\n", "another_price_selector = AttributeSelector(\"class\", value=\"price\")\n", "\n", "print(\n", " SelectorList(discount_selector, price_selector)\n", " == SelectorList(discount_selector, price_selector, another_price_selector)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### AndSelector\n", "\n", "`AndSelector` corresponds to the CSS compound selector, which is a concatenation of multiple selectors. It selects elements that match all of the specified selectors.\n", "\n", "**CSS Example:**\n", "```css\n", "p.price\n", "```\n", "\n", "**Operator:** `&`\n", "\n", "**Reference:** [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors/Selector_structure#compound_selector)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Animal Farm

\n", " Price: $30

\n", "

Price: $20

\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = TypeSelector(\"p\") & ClassSelector(\"price\")\n", "print(selector.find(element))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SelectorList\n", "\n", "`SelectorList` corresponds to the CSS selector list. It selects elements that match any of the specified selectors.\n", "\n", "**CSS Example:**\n", "```css\n", "h1, h2\n", "```\n", "\n", "**Operator:** `|`\n", "\n", "**Aliases:** `OrSelector`\n", "\n", "**Reference:** [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/Selector_list)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import TypeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Hello World

\n", " Extra information\n", "

Goodbye World

\n", "

Not interested

\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = TypeSelector(\"h1\") | TypeSelector(\"h2\")\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NotSelector\n", "\n", "`NotSelector` corresponds to the CSS `:not()` pseudo-class, which excludes elements that match a specified selector. It allows you to select elements that do not meet certain criteria.\n", "\n", "**CSS Example:**\n", "```css\n", ":not(.discount)\n", "```\n", "\n", "**Operator:** `~`\n", "\n", "```python\n", "NotSelector(selector) == ~selector\n", "```\n", "\n", "**Reference:** [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/:not)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Price: €10

\n", "

Price: $20

\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = ~ClassSelector(\"discount\")\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple Selectors\n", "\n", "When more than one selector is passed to `NotSelector`, it selects elements that do **not** match **any** of them. Alternatively, a negated `SelectorList` expresses the same condition:\n", "\n", "```python\n", "NotSelector(left, right) == ~(left | right)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, NotSelector, PatternSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Price: €10

\n", "

Price: $20

\n", "

Price: €15

\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "discount_selector = ClassSelector(\"discount\")\n", "dollars_selector = PatternSelector(re.compile(r\"\\$\\d+\"))\n", "selector = NotSelector(discount_selector, dollars_selector)\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### XORSelector\n", "\n", "`XORSelector` corresponds to the logical XOR operation on selectors, selecting elements that match exactly one of the provided selectors.\n", "\n", "**CSS Equivalent:** \n", "While CSS does not have a direct counterpart, you can achieve similar results using selector list with `:not()` pseudo-class:\n", "\n", "```css\n", "span:not(.discount), .discount:not(span)\n", "```\n", "\n", "**Operator:** `^`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, TypeSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", " Buy!\n", "

Price: $10

\n", " Price: $20\n", "

Price: $30

\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = ClassSelector(\"discount\") ^ TypeSelector(\"span\")\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Relative Selectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A relative selector defines, in addition to its selection criteria, a relationship with an anchor element.\n", "\n", "According to [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors/Selector_structure#relative_selector), relative selectors represent elements in relation to anchor element(s), typically introduced by a combinator.\n", "\n", "**CSS Example:**\n", "```css\n", ".discount:has(> p)\n", "```\n", "\n", "In this case, an element of type `p` is in a child-parent relationship with the element of class `discount` (the anchor).\n", "\n", "In the context of `soupsavvy`:\n", "\n", "- **Anchor Element**: The `bs4` object being searched.\n", "- **Relative Element**: Any element that matches the selector and maintains a specified relationship with the anchor element.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Anchor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Relative selectors can also be created by using the `Anchor` object with an operator that matches the one used by the corresponding combinator:\n", "\n", "- `>>` -> `RelativeDescendant`\n", "- `>` -> `RelativeChild`\n", "- `<<` -> `RelativeAncestor`\n", "- `<` -> `RelativeParent`\n", "- `+` -> `RelativeNextSibling`\n", "- `*` -> `RelativeSubsequentSibling`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import Anchor, ClassSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "
\n", "

Price: $10

\n", "

Price: $20

\n", "
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "div = element.find_all(\"div\")[0]\n", "\n", "selector = Anchor > ClassSelector(\"price\")\n", "selector.find(div)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Relative Siblings\n", "\n", "`RelativeNextSibling` and `RelativeSubsequentSibling` are used to select siblings that follow the anchor element (the element passed to the `find` methods)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import Anchor, ClassSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Animal Farm

\n", "
Book 1
\n", "

Price: $30

\n", "

Price: $20

\n", "

Price: $10

\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "div = element.find_all(\"div\")[0]\n", "\n", "selector = Anchor + ClassSelector(\"price\")\n", "selector.find_all(div)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Recursivity\n", "\n", "Relative selectors in `soupsavvy` are not affected by the `recursive` parameter. Their behavior is determined solely by the relationship between the anchor and the relative elements. Below, the `RelativeDescendant` relationship takes precedence over the `recursive` parameter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import Anchor, ClassSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "
\n", "

Price: $10

\n", "

Price: $20

\n", "
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "div = element.find_all(\"div\")[0]\n", "\n", "selector = Anchor >> ClassSelector(\"price\")\n", "selector.find(div, recursive=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### HasSelector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `HasSelector` is a counterpart of CSS `:has()` pseudo-class. \n", "According to [Mozilla](https://developer.mozilla.org/en-US/docs/Web/CSS/:has), this pseudo-class matches an element if any relative selectors passed as arguments match at least one element.\n", "\n", "**CSS Example:**\n", "```css\n", ":has(> .price)\n", "```\n", "\n", "This selector matches any element that has a direct child with the class `price`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Default Recursive Search\n", "\n", "The default combinator for `HasSelector` is the descendant relationship. This means any selector passed to `HasSelector` that is not a relative selector will be treated as a relative descendant selector. As a result, if the parent of a matched element is selected, its ancestors will also be included in the selection." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, HasSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "
\n", " Brave New World\n", "

Price: $20

\n", "
\n", "
\n", " Animal Farm\n", " \n", "

Price: $15

\n", "
\n", "

Price: $20

\n", "
\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = HasSelector(ClassSelector(\"discount\"))\n", "print(\"\\n\\n\".join(str(element) for element in selector.find_all(element)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Siblings Search\n", "\n", "`RelativeNextSibling` and `RelativeSubsequentSibling` can be used to select an element whose next or subsequent sibling matches the selector." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import Anchor, ClassSelector, HasSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Animal Farm

\n", " Hello World\n", "
Brave New World
\n", "

Price: $30

\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = HasSelector(Anchor + ClassSelector(\"price\"))\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Ancestors Search\n", "\n", "Combining `RelativeAncestor` and `RelativeParent` selectors with `HasSelector` allows you to find elements that have a specific ancestor or parent. For instance, you can locate all elements that have an ancestor with class `breaking`. The matched ancestor does not have to be a descendant of the searched `bs4` object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import Anchor, ClassSelector, HasSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "
\n", " \n", " Important!\n", " Actual Information\n", " \n", "
\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "span = element.find_all(\"span\")[0]\n", "\n", "selector = HasSelector(Anchor << ClassSelector(\"breaking\"))\n", "selector.find_all(span)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Nth Selectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These selectors allow you to select the nth element that matches a specific selector.\n", "\n", "While CSS offers pseudo-classes like `nth-child` and `nth-of-type` to select elements based on their ordinal position among siblings, these selectors only apply to the overall order of siblings.\n", "\n", "For example, selecting every 2nd element with the class `price` in CSS is not feasible because:\n", "\n", "```css\n", ".price:nth-child(2n)\n", "```\n", "\n", "selects every even-numbered child that also has the class `price`, not every 2nd `price` element.\n", "\n", "In `soupsavvy`, you can achieve this with:\n", "\n", "```python\n", "NthOfSelector(ClassSelector('price'), nth=\"2n\")\n", "```\n", "\n", "This selector selects every 2nd element with the class `price`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NthOfSelector\n", "\n", "The `NthOfSelector` enables you to select elements based on a specified occurrence pattern defined by the `nth` rule. The provided `nth` parameter must follow valid CSS syntax (`An+B`, `even`, or `odd`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, NthOfSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", " Animal Farm\n", "

Price: €1

\n", "

Price: $2

\n", " Bestseller\n", "

Price: €3

\n", "

Price: €4

\n", "

Price: €5

\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = NthOfSelector(ClassSelector(\"price\"), nth=\"2n\")\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NthLastOfSelector\n", "\n", "The `NthLastOfSelector` functions similarly to the `NthOfSelector`, but it counts elements from the end of the list. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import NthLastOfSelector, PatternSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", " Animal Farm\n", "

Price: €1

\n", "

Price: $2

\n", " Bestseller\n", "

Price: €3

\n", "

Price: €4

\n", "

Price: €5

\n", " \"\"\",\n", " features=\"lxml\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = NthLastOfSelector(\n", " PatternSelector(re.compile(\"^price\", re.IGNORECASE)),\n", " nth=\"odd\",\n", ")\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### OnlyOfSelector\n", "\n", "The `OnlyOfSelector` selects an element only if it is the **sole** matching element among its siblings. \n", "If more than one element matches the specified selector, none will be selected." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, OnlyOfSelector, to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "
\n", " Animal Farm\n", "

Price: $15

\n", "

Price: $20

\n", "
\n", "
\n", " Frankenstein\n", "

Price: $30

\n", "
\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = OnlyOfSelector(ClassSelector(\"price\"))\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Operators module" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an alternative way to combine selectors, `soupsavvy` provides convenient operator functions in the `soupsavvy.operators` module, offering shortcuts for composite selectors:\n", "\n", "- `and_` -> `AndSelector`\n", "- `or_` -> `SelectorList`\n", "- `is_` -> `SelectorList`\n", "- `where` -> `SelectorList`\n", "- `not_` -> `NotSelector`\n", "- `has` -> `HasSelector`\n", "- `xor` -> `XORSelector`\n", "\n", "These functions can enhance clarity and conciseness in some contexts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from soupsavvy import ClassSelector, TypeSelector\n", "from soupsavvy.operators import and_\n", "\n", "and_(ClassSelector(\"price\"), TypeSelector(\"p\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from soupsavvy import ClassSelector, TypeSelector\n", "from soupsavvy.operators import has\n", "\n", "has(ClassSelector(\"price\"), TypeSelector(\"p\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`soupsavvy` provides a wide range of composite selectors that can be used to create more complex search criteria. Designed for flexibility and easy customization, they let you tailor your searches to specific needs." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Enjoy `soupsavvy` and leave us feedback!** \n", "**Happy scraping!**" ] } ], "metadata": { "kernelspec": { "display_name": "soupsavvy", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 2 }