Showcase
We will explore how to use soupsavvy to extract information from the Rotten Tomatoes website, a well-known review aggregator for film and television.
import requests
from bs4 import BeautifulSoup
from soupsavvy import to_soupsavvy
headers = {
"User-Agent": "Mozilla/5.0",
"Accept-Language": "en-US,en;q=0.9",
}
url = "https://editorial.rottentomatoes.com/guide/best-movies-of-all-time/"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
element = to_soupsavvy(soup)
Selectors
Selectors are the core feature of soupsavvy, enabling a declarative approach to locating HTML elements. In this example, we use movie_selector to identify movie elements on the webpage, which are characterized by p tag name and movie class attribute. For a detailed guide and additional examples, see tutorial.
from soupsavvy import ClassSelector, TypeSelector
movie_selector = ClassSelector("movie") & TypeSelector("p")
result = movie_selector.find_all(element)
print(f"Found {len(result)} movies.\n")
print(result[0])
Found 300 movies.
<p class="apple-news-link-wrap movie">
<span class="score-wrap">
<img class="apple-critic-score-icon" height="16" src="https://images.fandango.com/cms/assets/c6672520-d359-11ea-a15f-bdf29fa24277--certified-fresh.png" width="16"/>
<span class="score"><strong>97%</strong></span>
</span>
<span class="details">
<a class="title" href="https://www.rottentomatoes.com/m/the_godfather">The Godfather</a>
<span class="year">(1972)</span>
</span>
</p>
Pipelines
soupsavvy selection pipeline is a combination of selectors and operations, which provides an efficient way to extract and transform information from a webpage. For instance, we can use title_pipeline to locate the first 5 movie titles, extract their text, and transform them to uppercase. For detailed guide with examples, see tutorial.
from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.operations import Operation, Text
movie_selector = ClassSelector("movie") & TypeSelector("p")
title_pipeline = (
(movie_selector >> ClassSelector("title")) | Text() | Operation(str.upper)
)
result = title_pipeline.find_all(element, limit=5)
print(result)
['THE GODFATHER', 'CASABLANCA', 'L.A. CONFIDENTIAL', 'SEVEN SAMURAI', 'PARASITE']
Models
Models define scraping schemas, utilizing selectors and operations to extract structured information from the webpage. Model is user-defined data structure representing an entity of interest in scraping. For example, the Movie model can be used to extract details such as the movie’s title and score from Rotten Tomatoes. A comprehensive guide with examples on models is available here.
from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel, post
from soupsavvy.operations import Text
class Movie(BaseModel):
__scope__ = ClassSelector("movie") & TypeSelector("p")
title = ClassSelector("title") | Text()
score = ClassSelector("score") | Text()
@post("score")
def process_score(self, score: str) -> int:
"""
There are multiple methods of transforming field values,
field post-process methods are one of them.
"""
return int(score.strip("%"))
Movie.find(element)
Movie(title='The Godfather', score=97)
Migrations
Model instances can be seamlessly migrated to other data structures, with object attributes passed directly to the target model’s constructor. This functionality ensures smooth integration with third-party libraries such as pydantic or sqlalchemy. For example, Movie instances can be migrated to a parallel sqlalchemy model, MovieSQL, and saved to a database. More examples and complex use cases can be found in tutorial.
from sqlalchemy import Column, Identity, Integer, MetaData, String, create_engine
from sqlalchemy.orm import Session, declarative_base
from soupsavvy import ClassSelector, TypeSelector
from soupsavvy.models import BaseModel
from soupsavvy.operations import Operation, Text
Base = declarative_base(metadata=MetaData())
class SQLMovie(Base):
__tablename__ = "movie"
id = Column(Integer, Identity(start=1, increment=1), primary_key=True)
title = Column(String(128), nullable=False)
score = Column(Integer)
def __repr__(self):
return f"<Movie(title={self.title}, score={self.score})>"
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
class Movie(BaseModel):
__scope__ = ClassSelector("movie") & TypeSelector("p")
title = ClassSelector("title") | Text()
# chaining operations is another way of transforming field values
score = ClassSelector("score") | Text() | Operation(lambda x: int(x.strip("%")))
movie = Movie.find(element)
sql_movie = movie.migrate(SQLMovie)
with Session(engine) as session:
session.add(sql_movie)
session.commit()
result = session.query(SQLMovie).one()
result
<Movie(title=The Godfather, score=97)>
Composite Models
Models in soupsavvy are highly flexible, supporting multiple fields, including sub-models. For instance, we will use MovieDetails as a field within the Movie model to separate movie-specific information from Rotten Tomatoes’ score and rank. For more information, check out tutorial.
from soupsavvy import ClassSelector, HasSelector, TypeSelector
from soupsavvy.models import BaseModel, post
from soupsavvy.operations import Operation, Text
from soupsavvy.selectors.css import FirstOfType
class MovieDetails(BaseModel):
__scope__ = ClassSelector("details")
title = ClassSelector("title") | Text()
year = (
ClassSelector("year")
| Text()
| Operation(lambda x: x.strip("()"))
| Operation(int)
)
class Movie(BaseModel):
# finds ancestor `tr` of p.movie
# HasSelector was added after a bug on rotten tomatoes site - one position was empty
__scope__ = (
ClassSelector("movie") & TypeSelector("p") & HasSelector(ClassSelector("year"))
) << TypeSelector("tr")
rank = (FirstOfType() & TypeSelector("td")) | Text()
score = ClassSelector("score") | Text()
# Model can be a field of another model
details = MovieDetails
@post("score")
def process_score(self, score: str) -> int:
return int(score.strip("%"))
def __post_init__(self) -> None:
"""
Post initialization method is another way of transforming field values.
Here, access to all fields is available.
"""
self.rank = int(str(self.rank).strip("."))
result = Movie.find_all(element)
print(f"Found {len(result)} movies")
result[8]
Found 300 movies
Movie(rank=9, score=99, details=MovieDetails(title='On the Waterfront', year=1954))
Conclusion
These are just a few of the many powerful features available in soupsavvy.
To explore them further, dive into the Documentation and start building your scraping workflows!
Enjoy soupsavvy and leave us feedback!
Happy scraping!