Extract and Count Reviews/AggregateRating Script (Python)

Extract and Count Reviews Script

This script was basically the concept for a similar WP Plugin, which automatically counts the amount of all single product ratings in each category and writes the correct amount of total reviews in a category on the category pages "aggregate rating" Schema.org Markup.

We had a case, where this was the optimal solution to display the correct amount of "aggregate rating" in "Recipe Rich Results" for a Foodblog/Recipe-Website.

As for today, Google does not seem to give to much attention, but there are indicators showing, the math is getting more important.

Description

This script extracts the total number of reviews from all categories listed in a sitemap and saves the results to a file. It is specifically designed to work with webpages where review counts are displayed in a specific format (e.g., "(123)").

Usage

Run the Script: Replace placeholders (https://example.com/category-sitemap.xml) with the actual URL of the sitemap. Execute the script in a Python environment.
Output: The total reviews per category are saved in result.txt.

Requirements

Python libraries: requests, beautifulsoup4, re.

Special Notes

Review Format: This script is suitable for webpages where the number of reviews is enclosed in parentheses, such as "(123)". It uses a regular expression to identify and extract these numbers.

# scrape_review_count.py
# Author: Christopher Hneke
# Date: 04.08.2024
# Description: This script extracts the total number of reviews from all categories listed in a sitemap and saves the results to a file. 
# Description: It is specifically designed to work with webpages where review counts are displayed in a specific format (e.g., "(123)").

import requests
from bs4 import BeautifulSoup
import re

# Function to get the total number of reviews from a category URL
def get_total_reviews(url):
    total_reviews = 0
    page_number = 1
    review_pattern = re.compile(r'\((\d+)\)')

    while True:
        page_url = f"{url}/page/{page_number}/" if page_number > 1 else url
        response = requests.get(page_url)

        if response.status_code == 404:
            break

        soup = BeautifulSoup(response.content, 'html.parser')
        page_reviews = soup.find_all(string=review_pattern)

        if not page_reviews:
            break

        for review_text in page_reviews:
            match = review_pattern.search(review_text)
            if match:
                total_reviews += int(match.group(1))

        page_number += 1

    return total_reviews

# Main function to process the sitemap and extract reviews for each category
def main():
    sitemap_url = 'https://example.com/category-sitemap.xml'  # Replace with the actual sitemap URL
    response = requests.get(sitemap_url)
    soup = BeautifulSoup(response.content, 'xml')
    categories = soup.find_all('loc')

    results = []

    for category in categories:
        category_url = category.text
        category_name = category_url.split('/')[-2]
        print(f"Processing category: {category_name}")
        total_reviews = get_total_reviews(category_url)
        results.append(f"{category_name}: {total_reviews} reviews\n")

    with open('result.txt', 'w', encoding='utf-8') as file:
        file.writelines(results)

    print("Results saved to result.txt")

if __name__ == '__main__':
    main()