{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `soupsavvy` **Model** is a user-defined **scraping schema** that uses selectors and operations to extract structured information from a webpage. A model represents an entity of interest in scraping, such as a product, article, or job." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Operations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Operations encapsulate transformation logic, such as extracting text, converting data types, or applying custom transformations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Applying operation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from soupsavvy.operations import Operation\n", "\n", "operation = Operation(lambda x: x.strip(\"$\"))\n", "operation.execute(\"100$\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Operation` accepts positional and keyword arguments, which are passed to the wrapped function after the input value." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datetime import datetime\n", "\n", "from soupsavvy.operations import Operation\n", "\n", "operation = Operation(datetime.strptime, \"%d-%m-%Y\")\n", "operation.execute(\"01-02-2020\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Chaining operations\n", "\n", "Operations can be chained with the `|` operator to apply several of them in sequence." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from soupsavvy.operations import Operation\n", "\n", "operation = (\n", " Operation(lambda x: x.strip(\"$\")) | Operation(int) | Operation(lambda x: x * 2)\n", ")\n", "operation.execute(\"100$\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Text` is a built-in operation that extracts the text content of an element. It is one of the most common operations in web scraping." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.operations import Text\n", "from soupsavvy import to_soupsavvy\n", "\n", "text = \"\"\"\n", "
<p>Animal Farm</p>
\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup.p)\n", "operation = Text()\n", "operation.execute(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Href" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Href` is a built-in operation that extracts the value of the `href` attribute from an element. If the element has no `href` attribute, it returns `None`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.operations import Href\n", "from soupsavvy import to_soupsavvy\n", "\n", "text = \"\"\"\n", " Animal Farm\n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup.a)\n", "operation = Href()\n", "operation.execute(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Parent` is an operation that extracts the parent element of the current element. It can be used as a selector as well." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy.operations import Parent\n", "from soupsavvy import to_soupsavvy\n", "\n", "text = \"\"\"\n", " \n", "\"\"\"\n", "soup = BeautifulSoup(text, features=\"lxml\")\n", "element = to_soupsavvy(soup.a)\n", "operation = Parent()\n", "operation.execute(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining with selector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selectors in `soupsavvy` can be combined with operations using the `|` operator. The resulting pipeline first locates the element and then applies the operation to it." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "from soupsavvy import ClassSelector, to_soupsavvy\n", "from soupsavvy.operations import Operation, Text\n", "\n", "text = \"\"\"\n", "Animal Farm
\n", "100$
\n", "George Orwell
\n", "Animal Farm
\n", "100
\n", "George Orwell
\n", "Animal Farm
\n", "100
\n", "George Orwell
\n", "Animal Farm
\n", "George Orwell
\n", "Animal Farm
\n", "100
\n", "George Orwell
\n", "Animal Farm
\n", "George Orwell
\n", "Animal Farm
\n", " \n", "George Orwell
\n", "Animal Farm
\n", "hundred
\n", "George Orwell
\n", "Animal Farm
\n", "10
\n", "George Orwell
\n", "0
\n", "George Orwell
\n", "Animal Farm
\n", "George Orwell
\n", "Animal Farm
\n", "100
80
60
\n", "George Orwell
\n", "Animal Farm
\n", "100
80
60
\n", "George Orwell
\n", "Animal Farm
\n", "100$
\n", "George Orwell
\n", "Animal Farm
\n", "50
\n", "George Orwell
\n", "2:30
\n", "Animal Farm
\n", "George Orwell
\n", "4:30
\n", "50
\n", "Lois Lowry
\n", "3:30
\n", "Brave New World
\n", "50
\n", "Aldous Huxley
\n", "2:30
\n", "Animal Farm
\n", " \n", "Animal Farm
\n", "100
\n", "George Orwell
\n", "Brave New World
\n", "100
\n", "Aldous Huxley
\n", "The Giver
\n", "80
\n", "Lois Lowry
\n", "Not a child
\n", "200
\n", "Author
\n", "Animal Farm
\n", "100
\n", "George Orwell
\n", " \n", "Animal Farm
\n", "100
\n", "50
\n", " \n", " \n", "200
\n", "Animal Farm
\n", " \n", "Animal Farm
\n", "100$
\n", "George Orwell
\n", "Animal Farm
\n", "100$
\n", "Animal Farm
\n", "50$
\n", "Animal Farm
\n", "100
\n", "George Orwell
\n", "Animal Farm
\n", "100
\n", "George Orwell
\n", "Animal Farm
\n", " \n", "George Orwell
\n", " United Kingdom\n", " \n", "Animal Farm
\n", " \n", "George Orwell
\n", " United Kingdom\n", " \n", "Animal Farm
\n", "100
\n", "George Orwell
\n", "Animal Farm
\n", "George Orwell
\n", "4:30
\n", "50
\n", "Lois Lowry
\n", "3:30
\n", "Brave New World
\n", "50
\n", "Aldous Huxley
\n", "2:30
\n", "Animal Farm
\n", " \n", "Animal Farm
\n", "100
\n", "George Orwell
\n", "