Showcase

We will explore how to use soupsavvy to extract information from the Rotten Tomatoes website, a well-known review aggregator for film and television.

import requests
from bs4 import BeautifulSoup

from soupsavvy import to_soupsavvy

headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept-Language": "en-US,en;q=0.9",
}
url = "https://editorial.rottentomatoes.com/guide/best-movies-of-all-time/"
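response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page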
soup = BeautifulSoup(response.text, "lxml")
element = to_soupsavvy(soup)

Selectors

Selectors are the core feature of soupsavvy, enabling a declarative approach to locating HTML elements. In this example, movie_selector identifies the movie elements on the webpage: p tags whose class attribute includes movie. The & operator combines the two selectors, so an element must match both. For a detailed guide and additional examples, see the tutorial.

from soupsavvy import ClassSelector, TypeSelector

movie_selector = ClassSelector("movie") & TypeSelector("p")
result = movie_selector.find_all(element)

print(f"Found {len(result)} movies.\n")
print(result[0])

Found 300 movies.

<p class="apple-news-link-wrap movie">
<span class="score-wrap">
<img class="apple-critic-score-icon" height="16" src="https://images.fandango.com/cms/assets/c6672520-d359-11ea-a15f-bdf29fa24277--certified-fresh.png" width="16"/>
<span class="score"><strong>97%</strong></span>
</span>
<span class="details">
<a class="title" href="https://www.rottentomatoes.com/m/the_godfather">The Godfather</a>
<span class="year">(1972)</span>
</span>
</p>

Pipelines

A soupsavvy selection pipeline combines selectors and operations to extract and transform information from a webpage. For instance, title_pipeline locates the first 5 movie titles, extracts their text, and converts it to uppercase. For a detailed guide with examples, see the tutorial.

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.operations import Operation, Text

movie_selector = ClassSelector("movie") & TypeSelector("p")
title_pipeline = (
    (movie_selector >> ClassSelector("title")) | Text() | Operation(str.upper)
)
result = title_pipeline.find_all(element, limit=5)
print(result)

['THE GODFATHER', 'CASABLANCA', 'L.A. CONFIDENTIAL', 'SEVEN SAMURAI', 'PARASITE']
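
The chaining behavior of `|` can be illustrated without soupsavvy. The following is a minimal sketch in plain Python, assuming a hypothetical `Step` class that is not part of soupsavvy and does not reflect how the library is implemented. It only shows how piping a value through successive transformations, in the spirit of `Text() | Operation(str.upper)`, produces the final result.

```python
from typing import Any, Callable


class Step:
    """Hypothetical stand-in for an operation: wraps a one-argument callable."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # `a | b` builds a new step that runs `a` first, then passes its result to `b`
        return Step(lambda value: other.func(self.func(value)))

    def __call__(self, value: Any) -> Any:
        return self.func(value)


# Plain strings stand in for the text extracted from title elements
clean_title = Step(str.strip) | Step(str.upper)
titles = [clean_title(raw) for raw in ["  The Godfather ", "Casablanca\n"]]
print(titles)  # prints ['THE GODFATHER', 'CASABLANCA']
```

Each step stays small and reusable, and the order of the `|` chain is the order in which the transformations run.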

Models

Models define scraping schemas, using selectors and operations to extract structured information from the webpage. A model is a user-defined data structure representing an entity of interest. For example, the Movie model extracts details such as the movie’s title and score from Rotten Tomatoes. A comprehensive guide to models, with examples, is available in the documentation.

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel, post
from soupsavvy.operations import Text


class Movie(BaseModel):
    __scope__ = ClassSelector("movie") & TypeSelector("p")

    title = ClassSelector("title") | Text()
    score = ClassSelector("score") | Text()

    @post("score")
    def process_score(self, score: str) -> int:
        """
        There are multiple methods of transforming field values,
        field post-process methods are one of them.
        """
        return int(score.strip("%"))


Movie.find(element)

Movie(title='The Godfather', score=97)
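
The conversion performed by the post-process method can be tried in isolation. The sketch below uses a hypothetical helper, `parse_score`, which is not part of soupsavvy. It assumes the extracted text might carry surrounding whitespace and therefore strips it before removing the percent sign, a step the `process_score` method above does not take.

```python
def parse_score(raw: str) -> int:
    """Convert a percentage string such as '97%' to an integer."""
    # Strip surrounding whitespace first, then the trailing percent sign
    return int(raw.strip().rstrip("%"))


scores = [parse_score(raw) for raw in ["97%", " 99%\n", "100%"]]
print(scores)  # prints [97, 99, 100]
```

Keeping the conversion in a small function like this makes it easy to test separately from the scraping logic.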

Migrations

Model instances can be migrated to other data structures, with object attributes passed directly to the target model’s constructor. This makes it straightforward to integrate with third-party libraries such as pydantic or sqlalchemy. For example, Movie instances can be migrated to a parallel sqlalchemy model, SQLMovie, and saved to a database. More examples and complex use cases can be found in the tutorial.

from sqlalchemy import Column, Identity, Integer, MetaData, String, create_engine
from sqlalchemy.orm import Session, declarative_base

from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text

Base = declarative_base(metadata=MetaData())


class SQLMovie(Base):
    __tablename__ = "movie"

    id = Column(Integer, Identity(start=1, increment=1), primary_key=True)
    title = Column(String(128), nullable=False)
    score = Column(Integer)

    def __repr__(self):
        return f"<Movie(title={self.title}, score={self.score})>"


engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)


class Movie(BaseModel):
    __scope__ = ClassSelector("movie") & TypeSelector("p")

    title = ClassSelector("title") | Text()
    # chaining operations is another way of transforming field values
    score = ClassSelector("score") | Text() | Operation(lambda x: int(x.strip("%")))


movie = Movie.find(element)
sql_movie = movie.migrate(SQLMovie)

with Session(engine) as session:
    session.add(sql_movie)
    session.commit()

    result = session.query(SQLMovie).one()

result

<Movie(title=The Godfather, score=97)>
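
The idea of passing object attributes to the target constructor can be sketched with standard-library dataclasses. Everything in this block is hypothetical (`ScrapedMovie`, `MovieRecord`, and `migrate_to` are illustrative names), and it is not how soupsavvy implements `migrate`. It only shows the general pattern under the assumption that source and target share field names.

```python
from dataclasses import asdict, dataclass
from typing import Type, TypeVar

T = TypeVar("T")


@dataclass
class ScrapedMovie:
    """Hypothetical stand-in for a scraped model instance."""

    title: str
    score: int


@dataclass
class MovieRecord:
    """Hypothetical target structure with matching field names."""

    title: str
    score: int


def migrate_to(source: ScrapedMovie, target_cls: Type[T]) -> T:
    # Pass the source attributes to the target constructor as keyword arguments
    return target_cls(**asdict(source))


record = migrate_to(ScrapedMovie(title="The Godfather", score=97), MovieRecord)
print(record)  # prints MovieRecord(title='The Godfather', score=97)
```

Because the target only needs a constructor that accepts the same keyword arguments, the same pattern would apply to any class with matching fields.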

Composite Models

Models in soupsavvy can contain other models as fields. For instance, MovieDetails is used as a field of the Movie model to separate movie-specific information (title and year) from the Rotten Tomatoes score and rank. For more information, see the tutorial.

from soupsavvy import ClassSelector, HasSelector, TypeSelector
from soupsavvy.models import BaseModel, post
from soupsavvy.operations import Operation, Text
from soupsavvy.selectors.css import FirstOfType


class MovieDetails(BaseModel):
    __scope__ = ClassSelector("details")

    title = ClassSelector("title") | Text()
    year = (
        ClassSelector("year")
        | Text()
        | Operation(lambda x: x.strip("()"))
        | Operation(int)
    )


class Movie(BaseModel):
    # matches p.movie elements, then selects their ancestor `tr`
    # HasSelector skips an entry left empty on the Rotten Tomatoes page (no year element)
    __scope__ = (
        ClassSelector("movie") & TypeSelector("p") & HasSelector(ClassSelector("year"))
    ) << TypeSelector("tr")

    rank = (FirstOfType() & TypeSelector("td")) | Text()
    score = ClassSelector("score") | Text()
    # Model can be a field of another model
    details = MovieDetails

    @post("score")
    def process_score(self, score: str) -> int:
        return int(score.strip("%"))

    def __post_init__(self) -> None:
        """
        Post initialization method is another way of transforming field values.
        Here, access to all fields is available.
        """
        self.rank = int(str(self.rank).strip("."))


result = Movie.find_all(element)
print(f"Found {len(result)} movies")
result[8]

Found 300 movies

Movie(rank=9, score=99, details=MovieDetails(title='On the Waterfront', year=1954))

Conclusion

These are just a few of the many powerful features available in soupsavvy.
To explore them further, dive into the Documentation and start building your scraping workflows!

Enjoy soupsavvy and leave us feedback!
Happy scraping!