Extract Schema.org Data Script (Python)

Maybe this is helpful for somebody...

Description

This script extracts Schema.org data from a given URL and saves it to a file.

Usage
  1. Run the Script: Execute the script in a Python environment.
  2. Input URL: Enter the URL of the webpage (without 'https://') when prompted.
  3. Output: The extracted data is saved in schema_data.txt (a sample of the file layout is shown below).
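
For orientation, this is the layout of schema_data.txt as written by save_to_file further down; all values here are made-up examples:

    URL: https://example.com/some-page
    TITLE: Some Page Title
    SCHEMA TYPES: Recipe, AggregateRating

    Field Frequencies:
    @context: 1
    @type: 2
    name: 1

    Schema Data:
    (pretty-printed JSON-LD blocks, one per <script> tag)
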
Features
  • Extracts JSON-LD data from webpages (see the example after this list).
  • Identifies and counts schema types and fields.
  • Saves formatted data along with metadata to a file.
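
The JSON-LD the script looks for is embedded in the page HTML like this (a minimal, hypothetical example):

    <script type="application/ld+json">
    {
        "@context": "https://schema.org",
        "@type": "Recipe",
        "name": "Example Recipe"
    }
    </script>
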
Requirements
  • Python libraries: requests, beautifulsoup4.
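
Both can be installed via pip:

    pip install requests beautifulsoup4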

      # extract_schema_data.py
      # Author: Christopher Hneke
      # Date: 07.07.2024
      # Description: This script extracts Schema.org data from a given URL and saves it to a file.
    
      import requests
      from bs4 import BeautifulSoup
      import json
      import os
      from collections import defaultdict
    
      # Function to extract Schema.org data from a given URL
      def extract_schema_data(url):
          response = requests.get(url, timeout=10)  # timeout prevents the request from hanging indefinitely
          soup = BeautifulSoup(response.content, 'html.parser')
    
          schema_data = []
          schema_types = set()
          field_count = defaultdict(int)
    
          # Recursive helper function to extract types and field frequencies from JSON data
          def extract_types_and_fields(data):
              if isinstance(data, dict):
                  if '@type' in data:
                      if isinstance(data['@type'], list):
                          schema_types.update(data['@type'])
                      else:
                          schema_types.add(data['@type'])
                  for key, value in data.items():
                      field_count[key] += 1
                      extract_types_and_fields(value)
              elif isinstance(data, list):
                  for item in data:
                      extract_types_and_fields(item)
    
          # Look for all <script> tags with type="application/ld+json"
          for script in soup.find_all('script', type='application/ld+json'):
              if not script.string:
                  continue  # skip empty tags; json.loads(None) would raise a TypeError
              try:
                  json_data = json.loads(script.string)
                  schema_data.append(json_data)
                  extract_types_and_fields(json_data)
              except json.JSONDecodeError as e:
                  print(f"Error decoding JSON: {e}")
    
          return schema_data, schema_types, field_count
    
      # Function to format Schema.org data for readable output
      def format_schema_data(schema_data):
          formatted_data = ""
          for data in schema_data:
              formatted_data += json.dumps(data, indent=4) + "\n\n"
          return formatted_data
    
      # Function to get the meta title of the page
      def get_meta_title(url):
          response = requests.get(url, timeout=10)
          soup = BeautifulSoup(response.content, 'html.parser')
          title_tag = soup.find('title')
          return title_tag.string if title_tag else 'No title found'
    
      # Function to save extracted data to a file
      def save_to_file(url, title, schema_types, formatted_data, field_count, filename='schema_data.txt'):
          try:
              with open(filename, 'w', encoding='utf-8') as file:
                  file.write(f"URL: {url}\n")
                  file.write(f"TITLE: {title}\n")
                  file.write(f"SCHEMA TYPES: {', '.join(schema_types)}\n\n")
                  file.write("Field Frequencies:\n")
                  for field, count in field_count.items():
                      file.write(f"{field}: {count}\n")
                  file.write("\nSchema Data:\n")
                  file.write(formatted_data)
              print(f"Schema.org data successfully saved to {filename}")
          except Exception as e:
              print(f"Error saving to file: {e}")
    
      # Main function to orchestrate the extraction and saving process
      def main():
          url_input = input("Please enter the URL without 'https://': ")
          url = f"https://{url_input}"
    
          schema_data, schema_types, field_count = extract_schema_data(url)
          if not schema_data:
              print("No Schema.org data found.")
              return
    
          meta_title = get_meta_title(url)
          formatted_data = format_schema_data(schema_data)
          save_to_file(url, meta_title, schema_types, formatted_data, field_count)
    
      if __name__ == "__main__":
          main()

Extract and Count Reviews/AggregateRating Script (Python)

This script was basically the concept for a similar WP plugin that automatically counts all individual product ratings in each category and writes the correct total number of reviews into the "aggregateRating" Schema.org markup on the category pages.
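
The category pages then carry markup along these lines (a sketch; the type nesting and values are hypothetical, the plugin's job is keeping reviewCount correct):

    {
        "@context": "https://schema.org",
        "@type": "Recipe",
        "name": "Example Category Recipe",
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": "4.7",
            "reviewCount": "123"
        }
    }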

We had a case where this was the optimal solution for displaying the correct "aggregateRating" count in Recipe Rich Results for a food blog/recipe website.

As of today, Google does not seem to pay much attention to this, but there are indicators that the math is becoming more important.

Description

This script extracts the total number of reviews from all categories listed in a sitemap and saves the results to a file. It is specifically designed to work with webpages where review counts are displayed in a specific format (e.g., "(123)").
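
A category sitemap of the kind the script expects lists one <loc> per category, for example (placeholder URLs):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <url><loc>https://example.com/category/cakes/</loc></url>
        <url><loc>https://example.com/category/soups/</loc></url>
    </urlset>

The script derives the category name from the last path segment of each URL.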

Usage

  1. Run the Script: Replace the placeholder sitemap URL (https://example.com/category-sitemap.xml) with the actual URL of your category sitemap, then execute the script in a Python environment.
  2. Output: The total reviews per category are saved in result.txt (see the sample below).
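
With a sitemap like the one above, result.txt would contain one line per category (counts are hypothetical):

    cakes: 245 reviews
    soups: 88 reviews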

Requirements

Python libraries: requests, beautifulsoup4, lxml (BeautifulSoup's 'xml' parser requires it). The re module is part of the Python standard library.
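
These can be installed via pip:

    pip install requests beautifulsoup4 lxml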

Special Notes

Review Format: This script is suitable for webpages where the number of reviews is enclosed in parentheses, such as "(123)". It uses a regular expression to identify and extract these numbers.
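
A quick, self-contained illustration of the pattern used below:

    import re

    review_pattern = re.compile(r'\((\d+)\)')
    match = review_pattern.search("Chocolate Cake (123)")
    if match:
        print(match.group(1))  # prints: 123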

# scrape_review_count.py
# Author: Christopher Hneke
# Date: 04.08.2024
# Description: This script extracts the total number of reviews from all categories listed in a sitemap and saves the results to a file. 
# Description: It is specifically designed to work with webpages where review counts are displayed in a specific format (e.g., "(123)").

import requests
from bs4 import BeautifulSoup
import re

# Function to get the total number of reviews from a category URL
def get_total_reviews(url):
    total_reviews = 0
    page_number = 1
    review_pattern = re.compile(r'\((\d+)\)')

    while True:
        page_url = f"{url}/page/{page_number}/" if page_number > 1 else url
        response = requests.get(page_url, timeout=10)

        if response.status_code == 404:
            break

        soup = BeautifulSoup(response.content, 'html.parser')
        page_reviews = soup.find_all(string=review_pattern)

        if not page_reviews:
            break

        for review_text in page_reviews:
            match = review_pattern.search(review_text)
            if match:
                total_reviews += int(match.group(1))

        page_number += 1

    return total_reviews

# Main function to process the sitemap and extract reviews for each category
def main():
    sitemap_url = 'https://example.com/category-sitemap.xml'  # Replace with the actual sitemap URL
    response = requests.get(sitemap_url, timeout=10)
    soup = BeautifulSoup(response.content, 'xml')
    categories = soup.find_all('loc')

    results = []

    for category in categories:
        category_url = category.text
        category_name = category_url.rstrip('/').split('/')[-1]  # last path segment, with or without a trailing slash
        print(f"Processing category: {category_name}")
        total_reviews = get_total_reviews(category_url)
        results.append(f"{category_name}: {total_reviews} reviews\n")

    with open('result.txt', 'w', encoding='utf-8') as file:
        file.writelines(results)

    print("Results saved to result.txt")

if __name__ == '__main__':
    main()