{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CSS Selectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `soupsavvy.selectors.css` subpackage provides a set of CSS-based selectors, built as wrappers around the [`soupsieve`](https://github.com/facelessuser/soupsieve) library, *'a modern CSS selector implementation for BeautifulSoup'*. These selectors can be seamlessly combined with other `soupsavvy` selectors, allowing for flexible use of pure CSS and common [`pseudo-classes`](https://developer.mozilla.org/en-US/docs/Web/CSS/Pseudo-classes)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Child Selectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Child selectors target elements based on their position among siblings within a parent element. While `nth-child` can handle any position-based selection, `soupsavvy` offers convenient wrappers for several frequently used CSS pseudo-classes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### FirstChild\n", "\n", "The `FirstChild` selector selects every element that is the first child of its parent.\n", "\n", "```css\n", ":first-child\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.selectors.css import FirstChild\n", "from soupsavvy import to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

First

\n", "
\n", " First\n", " \n", " First\n", " \n", "
\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = FirstChild()\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### LastChild\n", "\n", "The `LastChild` selector selects every element that is the last child of its parent.\n", "\n", "```css\n", ":last-child\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.selectors.css import LastChild\n", "from soupsavvy import to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

\n", "
\n", "
\n", " \n", " Last\n", " \n", " Last\n", "
\n", "
Last
\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = LastChild()\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NthChild\n", "\n", "The `NthChild` selector allows you to target elements based on their position among siblings, using a CSS-like `nth-child` expression.\n", "\n", "```css\n", ":nth-child(3)\n", "```\n", "\n", "It, along with other nth-based selectors, fully supports all valid CSS `nth` parameter values, letting you select elements using the same syntax as CSS.\n", "\n", "```css\n", ":nth-child(2n)\n", ":nth-child(odd)\n", ":nth-child(even)\n", ":nth-child(-n+2)\n", "```\n", "\n", "```python\n", "NthChild('2n')\n", "NthChild('odd')\n", "NthChild('even')\n", "NthChild('-n+2')\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.selectors.css import NthChild\n", "from soupsavvy import to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

1

\n", "

2

\n", "

3

\n", "

4

\n", "

5

\n", "

6

\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = NthChild(\"2n\")\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NthLastChild\n", "\n", "The `NthLastChild` selector allows you to select elements based on their position among their siblings, counting from the last child of the parent element.\n", "\n", "```css\n", ":nth-last-child(3)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.selectors.css import NthLastChild\n", "from soupsavvy import to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

1

\n", "

2

\n", "

3

\n", "

4

\n", "

5

\n", "

6

\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = NthLastChild(\"odd\")\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### OnlyChild\n", "\n", "The `OnlyChild` selector matches elements that are the only child of their parent.\n", "\n", "```css\n", ":only-child\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.selectors.css import OnlyChild\n", "from soupsavvy import to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Text

\n", "
\n", " \n", " Text\n", "
\n", "

Only child

\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = OnlyChild()\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Type selectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Type selectors are used to select elements based on their position among sibling elements of the same type." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### FirstOfType\n", "\n", "Selects every element that is the first element of its type among its siblings.\n", "\n", "```css\n", ":first-of-type\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.selectors.css import FirstOfType\n", "from soupsavvy import to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

First p

\n", "
First div
\n", "
\n", " First span\n", " \n", " First a\n", " \n", "
\n", "

\n", "
\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = FirstOfType()\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### LastOfType\n", "\n", "Selects every element that is the last element of its type among its siblings.\n", "\n", "```css\n", ":last-of-type\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.selectors.css import LastOfType\n", "from soupsavvy import to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Last p

\n", "
\n", " \n", " Last a\n", " \n", " Last span\n", "
\n", "
Last div
\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = LastOfType()\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NthOfType\n", "\n", "Selects every element that is the nth element of its type among its siblings.\n", "\n", "```css\n", ":nth-of-type(n)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.selectors.css import NthOfType\n", "from soupsavvy import to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

1

\n", " 1\n", "

2

\n", " 2\n", "

3

\n", " 3\n", "

4

\n", " 4\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = NthOfType(\"2n+2\")\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NthLastOfType\n", "\n", "Selects every element that is the nth element of its type among its siblings, counting from the last sibling.\n", "\n", "```css\n", ":nth-last-of-type(n)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.selectors.css import NthLastOfType\n", "from soupsavvy import to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

1

\n", " 1\n", "

2

\n", " 2\n", "

3

\n", " 3\n", "

4

\n", " 4\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = NthLastOfType(\"-n+2\")\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### OnlyOfType\n", "\n", "Selects every element that has no siblings of the same type.\n", "\n", "```css\n", ":only-of-type\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.selectors.css import OnlyOfType\n", "from soupsavvy import to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "
\n", " First span\n", " Second span\n", "
\n", "

Only p

\n", "
\n", " Only span\n", " Only a\n", "
\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = OnlyOfType()\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other selectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Empty\n", "\n", "Selects every element that has no children and no text content.\n", "\n", "**CSS Example:**\n", "```css\n", ":empty\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.selectors.css import Empty\n", "from soupsavvy import to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Text

\n", "
\n", " \n", " Text\n", " \n", " \n", "
\n", "
Text
\n", "

\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = Empty()\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### CSS\n", "\n", "The `CSS` selector wraps any CSS selector string. It uses `soupsieve` under the hood, so the supported syntax depends on the installed `soupsieve` version.\n", "It is a convenience class for CSS-based searches: `find_all` corresponds to `BeautifulSoup.select`, and `find` to `BeautifulSoup.select_one`.\n", "\n", "**Using BeautifulSoup:**\n", "```python\n", "soup.select_one('div > p')\n", "```\n", "\n", "**Using soupsieve:**\n", "```python\n", "soupsieve.select_one('div > p', soup)\n", "```\n", "\n", "**Using soupsavvy:**\n", "```python\n", "CSS('div > p').find(element)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.selectors.css import CSS\n", "from soupsavvy import to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "
Not span
\n", " Not first\n", "
Not .foo
\n", "
Found
\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = CSS(\"span.foo:first-child\")\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Combining selectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CSS-based selectors can be combined with other `soupsavvy` selectors to create composite selectors. For example, the following selector matches all non-empty elements that are direct children of a `div` element:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import TypeSelector, to_soupsavvy\n", "from soupsavvy.selectors.css import Empty\n", "\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Text

\n", "
\n", " \n", " Text\n", " \n", "

\n", "
\n", "

\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = TypeSelector(\"div\") > (~Empty())\n", "selector.find_all(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find an element that has exactly one child and is the last child of its parent, the following selector can be used:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import Anchor, HasSelector, to_soupsavvy\n", "from soupsavvy.selectors.css import LastChild, OnlyChild\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", "

Text

\n", "
\n", " \n", " Text\n", "
\n", "
Only Child
\n", "
Only Child - Last
\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "only_child = Anchor > OnlyChild()\n", "selector = HasSelector(only_child) & LastChild()\n", "selector.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recursivity\n", "\n", "Unlike `BeautifulSoup.select`, `soupsavvy` supports non-recursive searches: passing `recursive=False` to the `find` methods restricts the results to direct children that match the selector." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.selectors.css import CSS\n", "from soupsavvy import to_soupsavvy\n", "\n", "soup = BeautifulSoup(\n", " \"\"\"\n", " \n", "
\n", "
\n", "
\n", "
Child
\n", " \"\"\",\n", " features=\"html.parser\",\n", ")\n", "element = to_soupsavvy(soup)\n", "\n", "selector = CSS(\"div.foo\")\n", "selector.find(element, recursive=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`soupsavvy` offers an easy way to select elements using CSS selectors. It includes wrappers for commonly used pseudo-classes that share the same implementation as other selectors, allowing them to be easily combined." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Enjoy `soupsavvy` and leave us feedback!** \n", "**Happy scraping!**" ] } ], "metadata": { "kernelspec": { "display_name": "soupsavvy", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 2 }