{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Showcase\n", "\n", "We will explore how to use `soupsavvy` to extract information from the [Rotten Tomatoes](https://editorial.rottentomatoes.com/) website, a well-known review aggregator for film and television." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import to_soupsavvy\n", "\n", "headers = {\n", " \"User-Agent\": \"Mozilla/5.0\",\n", " \"Accept-Language\": \"en-US,en;q=0.9\",\n", "}\n", "url = \"https://editorial.rottentomatoes.com/guide/best-movies-of-all-time/\"\n", "response = requests.get(url, headers=headers)\n", "soup = BeautifulSoup(response.text, \"lxml\")\n", "element = to_soupsavvy(soup)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selectors\n", "\n", "Selectors are the core feature of `soupsavvy`, enabling a declarative approach to locating HTML elements. In this example, we use `movie_selector` to identify movie elements on the webpage, which are characterized by `p` tag name and `movie` class attribute. For a detailed guide and additional examples, see [tutorial](https://soupsavvy.readthedocs.io/en/stable/demos/selectors.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from soupsavvy import ClassSelector, TypeSelector\n", "\n", "movie_selector = ClassSelector(\"movie\") & TypeSelector(\"p\")\n", "result = movie_selector.find_all(element)\n", "\n", "print(f\"Found {len(result)} movies.\\n\")\n", "print(result[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pipelines\n", "\n", "`soupsavvy` selection pipeline is a combination of selectors and operations, which provides an efficient way to extract and transform information from a webpage. For instance, we can use `title_pipeline` to locate the first 5 movie titles, extract their text, and transform them to uppercase. For detailed guide with examples, see [tutorial](https://soupsavvy.readthedocs.io/en/stable/demos/models.html#combining-with-selector)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from soupsavvy import ClassSelector, TypeSelector\n", "from soupsavvy.operations import Operation, Text\n", "\n", "movie_selector = ClassSelector(\"movie\") & TypeSelector(\"p\")\n", "title_pipeline = (\n", " (movie_selector >> ClassSelector(\"title\")) | Text() | Operation(str.upper)\n", ")\n", "result = title_pipeline.find_all(element, limit=5)\n", "print(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Models\n", "\n", "Models define **scraping schemas**, utilizing selectors and operations to extract structured information from the webpage. Model is user-defined data structure representing an entity of interest in scraping. For example, the `Movie` model can be used to extract details such as the movie's `title` and `score` from [Rotten Tomatoes](https://www.rottentomatoes.com/). A comprehensive guide with examples on models is available [here](https://soupsavvy.readthedocs.io/en/stable/demos/models.html#model)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from soupsavvy import ClassSelector, TypeSelector\n", "from soupsavvy.models import BaseModel, post\n", "from soupsavvy.operations import Text\n", "\n", "\n", "class Movie(BaseModel):\n", " __scope__ = ClassSelector(\"movie\") & TypeSelector(\"p\")\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " score = ClassSelector(\"score\") | Text()\n", "\n", " @post(\"score\")\n", " def process_score(self, score: str) -> int:\n", " \"\"\"\n", " There are multiple methods of transforming field values,\n", " field post-process methods are one of them.\n", " \"\"\"\n", " return int(score.strip(\"%\"))\n", "\n", "\n", "Movie.find(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Migrations\n", "\n", "Model instances can be seamlessly migrated to other data structures, with object attributes passed directly to the target model's constructor. This functionality ensures smooth integration with third-party libraries such as `pydantic` or `sqlalchemy`. For example, `Movie` instances can be migrated to a parallel `sqlalchemy` model, `MovieSQL`, and saved to a database. More examples and complex use cases can be found in [tutorial](https://soupsavvy.readthedocs.io/en/stable/demos/models.html#migration)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sqlalchemy import Column, Identity, Integer, MetaData, String, create_engine\n", "from sqlalchemy.orm import Session, declarative_base\n", "\n", "from soupsavvy import ClassSelector, TypeSelector\n", "from soupsavvy.models import BaseModel\n", "from soupsavvy.operations import Operation, Text\n", "\n", "Base = declarative_base(metadata=MetaData())\n", "\n", "\n", "class SQLMovie(Base):\n", " __tablename__ = \"movie\"\n", "\n", " id = Column(Integer, Identity(start=1, increment=1), primary_key=True)\n", " title = Column(String(128), nullable=False)\n", " score = Column(Integer)\n", "\n", " def __repr__(self):\n", " return f\"\"\n", "\n", "\n", "engine = create_engine(\"sqlite:///:memory:\")\n", "Base.metadata.create_all(engine)\n", "\n", "\n", "class Movie(BaseModel):\n", " __scope__ = ClassSelector(\"movie\") & TypeSelector(\"p\")\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " # chaining operations is another way of transforming field values\n", " score = ClassSelector(\"score\") | Text() | Operation(lambda x: int(x.strip(\"%\")))\n", "\n", "\n", "movie = Movie.find(element)\n", "sql_movie = movie.migrate(SQLMovie)\n", "\n", "with Session(engine) as session:\n", " session.add(sql_movie)\n", " session.commit()\n", "\n", " result = session.query(SQLMovie).one()\n", "\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Composite Models\n", "\n", "Models in `soupsavvy` are highly flexible, supporting multiple fields, including sub-models. For instance, we will use `MovieDetails` as a field within the `Movie` model to separate movie-specific information from Rotten Tomatoes' `score` and `rank`. For more information, check out [tutorial](https://soupsavvy.readthedocs.io/en/stable/demos/models.html#model-composition)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from soupsavvy import ClassSelector, HasSelector, TypeSelector\n", "from soupsavvy.models import BaseModel, post\n", "from soupsavvy.operations import Operation, Text\n", "from soupsavvy.selectors.css import FirstOfType\n", "\n", "\n", "class MovieDetails(BaseModel):\n", " __scope__ = ClassSelector(\"details\")\n", "\n", " title = ClassSelector(\"title\") | Text()\n", " year = (\n", " ClassSelector(\"year\")\n", " | Text()\n", " | Operation(lambda x: x.strip(\"()\"))\n", " | Operation(int)\n", " )\n", "\n", "\n", "class Movie(BaseModel):\n", " # finds ancestor `tr` of p.movie\n", " # HasSelector was added after a bug on rotten tomatoes site - one position was empty\n", " __scope__ = (\n", " ClassSelector(\"movie\") & TypeSelector(\"p\") & HasSelector(ClassSelector(\"year\"))\n", " ) << TypeSelector(\"tr\")\n", "\n", " rank = (FirstOfType() & TypeSelector(\"td\")) | Text()\n", " score = ClassSelector(\"score\") | Text()\n", " # Model can be a field of another model\n", " details = MovieDetails\n", "\n", " @post(\"score\")\n", " def process_score(self, score: str) -> int:\n", " return int(score.strip(\"%\"))\n", "\n", " def __post_init__(self) -> None:\n", " \"\"\"\n", " Post initialization method is another way of transforming field values.\n", " Here, access to all fields is available.\n", " \"\"\"\n", " self.rank = int(str(self.rank).strip(\".\"))\n", "\n", "\n", "result = Movie.find_all(element)\n", "print(f\"Found {len(result)} movies\")\n", "result[8]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "These are just a few of the many powerful features available in `soupsavvy`. \n", "To explore them further, dive into the [Documentation](https://soupsavvy.readthedocs.io) and start building your scraping workflows!\n", "\n", "**Enjoy `soupsavvy` and leave us feedback!** \n", "**Happy scraping!**" ] } ], "metadata": { "kernelspec": { "display_name": "soupsavvy", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 2 }