{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Showcase\n",
    "\n",
    "We will explore how to use `soupsavvy` to extract information from the [Rotten Tomatoes](https://editorial.rottentomatoes.com/) website, a well-known review aggregator for film and television."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "from soupsavvy import to_soupsavvy\n",
    "\n",
    "headers = {\n",
    "    \"User-Agent\": \"Mozilla/5.0\",\n",
    "    \"Accept-Language\": \"en-US,en;q=0.9\",\n",
    "}\n",
    "url = \"https://editorial.rottentomatoes.com/guide/best-movies-of-all-time/\"\n",
    "response = requests.get(url, headers=headers)\n",
    "soup = BeautifulSoup(response.text, \"lxml\")\n",
    "element = to_soupsavvy(soup)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selectors\n",
    "\n",
    "Selectors are the core feature of `soupsavvy`, enabling a declarative approach to locating HTML elements. In this example, we use `movie_selector` to identify movie elements on the webpage, which are characterized by `p` tag name and `movie` class attribute. For a detailed guide and additional examples, see [tutorial](https://soupsavvy.readthedocs.io/en/stable/demos/selectors.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from soupsavvy import ClassSelector, TypeSelector\n",
    "\n",
    "movie_selector = ClassSelector(\"movie\") & TypeSelector(\"p\")\n",
    "result = movie_selector.find_all(element)\n",
    "\n",
    "print(f\"Found {len(result)} movies.\\n\")\n",
    "print(result[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pipelines\n",
    "\n",
    "`soupsavvy` selection pipeline is a combination of selectors and operations, which provides an efficient way to extract and transform information from a webpage. For instance, we can use `title_pipeline` to locate the first 5 movie titles, extract their text, and transform them to uppercase. For detailed guide with examples, see [tutorial](https://soupsavvy.readthedocs.io/en/stable/demos/models.html#combining-with-selector)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from soupsavvy import ClassSelector, TypeSelector\n",
    "from soupsavvy.operations import Operation, Text\n",
    "\n",
    "movie_selector = ClassSelector(\"movie\") & TypeSelector(\"p\")\n",
    "title_pipeline = (\n",
    "    (movie_selector >> ClassSelector(\"title\")) | Text() | Operation(str.upper)\n",
    ")\n",
    "result = title_pipeline.find_all(element, limit=5)\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Models\n",
    "\n",
    "Models define **scraping schemas**, utilizing selectors and operations to extract structured information from the webpage. Model is user-defined data structure representing an entity of interest in scraping. For example, the `Movie` model can be used to extract details such as the movie's `title` and `score` from [Rotten Tomatoes](https://www.rottentomatoes.com/). A comprehensive guide with examples on models is available [here](https://soupsavvy.readthedocs.io/en/stable/demos/models.html#model)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from soupsavvy import ClassSelector, TypeSelector\n",
    "from soupsavvy.models import BaseModel, post\n",
    "from soupsavvy.operations import Text\n",
    "\n",
    "\n",
    "class Movie(BaseModel):\n",
    "    __scope__ = ClassSelector(\"movie\") & TypeSelector(\"p\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    score = ClassSelector(\"score\") | Text()\n",
    "\n",
    "    @post(\"score\")\n",
    "    def process_score(self, score: str) -> int:\n",
    "        \"\"\"\n",
    "        There are multiple methods of transforming field values,\n",
    "        field post-process methods are one of them.\n",
    "        \"\"\"\n",
    "        return int(score.strip(\"%\"))\n",
    "\n",
    "\n",
    "Movie.find(element)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Migrations\n",
    "\n",
    "Model instances can be seamlessly migrated to other data structures, with object attributes passed directly to the target model's constructor. This functionality ensures smooth integration with third-party libraries such as `pydantic` or `sqlalchemy`. For example, `Movie` instances can be migrated to a parallel `sqlalchemy` model, `MovieSQL`, and saved to a database. More examples and complex use cases can be found in [tutorial](https://soupsavvy.readthedocs.io/en/stable/demos/models.html#migration)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sqlalchemy import Column, Identity, Integer, MetaData, String, create_engine\n",
    "from sqlalchemy.orm import Session, declarative_base\n",
    "\n",
    "from soupsavvy import ClassSelector, TypeSelector\n",
    "from soupsavvy.models import BaseModel\n",
    "from soupsavvy.operations import Operation, Text\n",
    "\n",
    "Base = declarative_base(metadata=MetaData())\n",
    "\n",
    "\n",
    "class SQLMovie(Base):\n",
    "    __tablename__ = \"movie\"\n",
    "\n",
    "    id = Column(Integer, Identity(start=1, increment=1), primary_key=True)\n",
    "    title = Column(String(128), nullable=False)\n",
    "    score = Column(Integer)\n",
    "\n",
    "    def __repr__(self):\n",
    "        return f\"<Movie(title={self.title}, score={self.score})>\"\n",
    "\n",
    "\n",
    "engine = create_engine(\"sqlite:///:memory:\")\n",
    "Base.metadata.create_all(engine)\n",
    "\n",
    "\n",
    "class Movie(BaseModel):\n",
    "    __scope__ = ClassSelector(\"movie\") & TypeSelector(\"p\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    # chaining operations is another way of transforming field values\n",
    "    score = ClassSelector(\"score\") | Text() | Operation(lambda x: int(x.strip(\"%\")))\n",
    "\n",
    "\n",
    "movie = Movie.find(element)\n",
    "sql_movie = movie.migrate(SQLMovie)\n",
    "\n",
    "with Session(engine) as session:\n",
    "    session.add(sql_movie)\n",
    "    session.commit()\n",
    "\n",
    "    result = session.query(SQLMovie).one()\n",
    "\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Composite Models\n",
    "\n",
    "Models in `soupsavvy` are highly flexible, supporting multiple fields, including sub-models. For instance, we will use `MovieDetails` as a field within the `Movie` model to separate movie-specific information from Rotten Tomatoes' `score` and `rank`. For more information, check out [tutorial](https://soupsavvy.readthedocs.io/en/stable/demos/models.html#model-composition)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from soupsavvy import ClassSelector, HasSelector, TypeSelector\n",
    "from soupsavvy.models import BaseModel, post\n",
    "from soupsavvy.operations import Operation, Text\n",
    "from soupsavvy.selectors.css import FirstOfType\n",
    "\n",
    "\n",
    "class MovieDetails(BaseModel):\n",
    "    __scope__ = ClassSelector(\"details\")\n",
    "\n",
    "    title = ClassSelector(\"title\") | Text()\n",
    "    year = (\n",
    "        ClassSelector(\"year\")\n",
    "        | Text()\n",
    "        | Operation(lambda x: x.strip(\"()\"))\n",
    "        | Operation(int)\n",
    "    )\n",
    "\n",
    "\n",
    "class Movie(BaseModel):\n",
    "    # finds ancestor `tr` of p.movie\n",
    "    # HasSelector was added after a bug on rotten tomatoes site - one position was empty\n",
    "    __scope__ = (\n",
    "        ClassSelector(\"movie\") & TypeSelector(\"p\") & HasSelector(ClassSelector(\"year\"))\n",
    "    ) << TypeSelector(\"tr\")\n",
    "\n",
    "    rank = (FirstOfType() & TypeSelector(\"td\")) | Text()\n",
    "    score = ClassSelector(\"score\") | Text()\n",
    "    # Model can be a field of another model\n",
    "    details = MovieDetails\n",
    "\n",
    "    @post(\"score\")\n",
    "    def process_score(self, score: str) -> int:\n",
    "        return int(score.strip(\"%\"))\n",
    "\n",
    "    def __post_init__(self) -> None:\n",
    "        \"\"\"\n",
    "        Post initialization method is another way of transforming field values.\n",
    "        Here, access to all fields is available.\n",
    "        \"\"\"\n",
    "        self.rank = int(str(self.rank).strip(\".\"))\n",
    "\n",
    "\n",
    "result = Movie.find_all(element)\n",
    "print(f\"Found {len(result)} movies\")\n",
    "result[8]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "These are just a few of the many powerful features available in `soupsavvy`.  \n",
    "To explore them further, dive into the [Documentation](https://soupsavvy.readthedocs.io) and start building your scraping workflows!\n",
    "\n",
    "**Enjoy `soupsavvy` and leave us feedback!**  \n",
    "**Happy scraping!**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "soupsavvy",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}