Anti-Hotlinking Script for WP on Apache (.htaccess) – Linkspam Prevention


Never published this before, so this is a DaniWeb.com Exclusive :)

If your WordPress site has a lot of K-Links pointing at it, you should consider using this script.

It definitely works. For now...

Negative SEO through spammed backlinks can be a huge problem for the visibility of a webpage.

You cannot defend your site against every kind of attack, but for one of the most common you can significantly reduce the negative effects:

"K-Links" (newer variant: "C-Links"), where image hotlinking is used to generate links, targeting mainly WordPress instances.

Examples (see the attached screenshot, k-links.png):

This is why they're called "K-Links"/"C-Links": the spam URLs always end with "-k.html" or "-c.html".

The basic anti-hotlinking script can help reduce traffic when hotlinking is abused to burn your bandwidth.

But I have never seen it recover any visibility losses in the SERPs.

This is the basic "Anti-Hotlinking-Script":
<IfModule mod_rewrite.c>
    RewriteEngine On
    # Let requests with an empty referer through (direct hits, some proxies/privacy tools)
    RewriteCond %{HTTP_REFERER} !^$
    # Allow your own domain and the major search engines
    RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?daniweb\.com [NC]
    RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?google\.com [NC]
    RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?bing\.com [NC]
    RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?yahoo\.com [NC]
    RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?duckduckgo\.com [NC]
    # Every image request with any other referer gets the notice page instead
    RewriteRule \.(jpg|jpeg|png|gif|avif|webp|svg)$ /nohotlink.html [L]
</IfModule>

---

Content of nohotlink.html:

<body>
    <h1>Hotlinking not allowed</h1>
    <p>To view our images, please visit our <a href="https://daniweb.com/">website</a>.</p>
</body>

It integrates the whitelist directly into the .htaccess file, which is not optimal.

I had a case where this caused problems because the whitelist was huge (1000+ domains).

So I found a solution with "RewriteMap", which I integrated into this script to move the whitelist into a .txt file.
This was also easier for the client, who occasionally needs to add entries to the whitelist and this way does not have to edit the .htaccess every time.
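
Here is a minimal sketch of that setup, assuming a map named hotlink_whitelist and a file path you would adapt. Note that Apache only allows the RewriteMap directive in the server or virtual-host configuration, not in .htaccess itself, so the map is declared there and merely referenced from .htaccess.

In the virtual-host config:

# Declare the whitelist map (RewriteMap is not allowed inside .htaccess)
RewriteMap hotlink_whitelist "txt:/var/www/config/hotlink-whitelist.txt"

Content of hotlink-whitelist.txt (one referer domain per line, keys lowercase):

# domain          value
daniweb.com       allow
google.com        allow
bing.com          allow

In .htaccess:

<IfModule mod_rewrite.c>
    RewriteEngine On
    # Let requests without a referer through
    RewriteCond %{HTTP_REFERER} !^$
    # Capture the referer host (without "www.") into %1;
    # referers that don't match this pattern are let through
    RewriteCond %{HTTP_REFERER} ^https?://(?:www\.)?([^/:]+) [NC]
    # Block only if the host is not marked "allow" in the whitelist map
    RewriteCond ${hotlink_whitelist:%1|deny} !=allow
    RewriteRule \.(jpg|jpeg|png|gif|avif|webp|svg)$ /nohotlink.html [L]
</IfModule>

This does an exact lookup on the referer host, so subdomains other than "www" would need their own entries in the map.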

I have also set the link inside the HTML to rel="nofollow".

I did get some nice results with this!

Even if there are still other dofollow links on the hotlinking page, the presence of this one nofollow link seems to reduce the toxicity of each of them.

Important: Don't link the actual canonical URL of your main page from nohotlink.html!
If your canonical domain is https://daniweb.com, for example, link to http://www.daniweb.com instead (with "www" and plain "http").
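
With both adjustments, the link inside nohotlink.html looks roughly like this (daniweb.com again standing in for your own domain):

<body>
    <h1>Hotlinking not allowed</h1>
    <!-- nofollow link to the non-canonical variant (http + www), not to the canonical https://daniweb.com/ -->
    <p>To view our images, please visit our <a href="http://www.daniweb.com/" rel="nofollow">website</a>.</p>
</body>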

I experimented a lot with this, set the canonical of nohotlink.html to the main page, and tested noindex/nofollow robots tags, but it was all a mess.

If anybody is as deep into this stuff as I am, I will be happy to discuss.

Feel free to share your thoughts!

Disclaimer: Please use at your own risk and only if you know what you are doing. Don't hold me responsible if you make mistakes; they are yours, not mine.

Extract Schema.org Data Script (Python)


Maybe this is helpful for somebody...

Description

This script extracts Schema.org data from a given URL and saves it to a file.

Usage
  1. Run the Script: Execute the script in a Python environment.
  2. Input URL: Enter the URL of the webpage (without 'https://') when prompted.
  3. Output: The extracted data is saved in schema_data.txt.
Features
  • Extracts JSON-LD data from webpages.
  • Identifies and counts schema types and fields.
  • Saves formatted data along with metadata to a file.
Requirements
  • Python libraries: requests, beautifulsoup4.

      # extract_schema_data.py
      # Author: Christopher Hneke
      # Date: 07.07.2024
      # Description: This script extracts Schema.org data from a given URL and saves it to a file.
    
      import requests
      from bs4 import BeautifulSoup
      import json
      import os
      from collections import defaultdict
    
      # Function to extract Schema.org data from a given URL
      def extract_schema_data(url):
          response = requests.get(url)
          soup = BeautifulSoup(response.content, 'html.parser')
    
          schema_data = []
          schema_types = set()
          field_count = defaultdict(int)
    
          # Recursive helper function to extract types and field frequencies from JSON data
          def extract_types_and_fields(data):
              if isinstance(data, dict):
                  if '@type' in data:
                      if isinstance(data['@type'], list):
                          schema_types.update(data['@type'])
                      else:
                          schema_types.add(data['@type'])
                  for key, value in data.items():
                      field_count[key] += 1
                      extract_types_and_fields(value)
              elif isinstance(data, list):
                  for item in data:
                      extract_types_and_fields(item)
    
          # Look for all <script> tags with type="application/ld+json"
          for script in soup.find_all('script', type='application/ld+json'):
              try:
                  # script.string can be None for tags with mixed content; fall back to get_text()
                  json_data = json.loads(script.string or script.get_text())
                  schema_data.append(json_data)
                  extract_types_and_fields(json_data)
              except json.JSONDecodeError as e:
                  print(f"Error decoding JSON: {e}")
    
          return schema_data, schema_types, field_count
    
      # Function to format Schema.org data for readable output
      def format_schema_data(schema_data):
          formatted_data = ""
          for data in schema_data:
              formatted_data += json.dumps(data, indent=4) + "\n\n"
          return formatted_data
    
      # Function to get the meta title of the page
      def get_meta_title(url):
          response = requests.get(url)
          soup = BeautifulSoup(response.content, 'html.parser')
          title_tag = soup.find('title')
          return title_tag.string if title_tag else 'No title found'
    
      # Function to save extracted data to a file
      def save_to_file(url, title, schema_types, formatted_data, field_count, filename='schema_data.txt'):
          try:
              with open(filename, 'w', encoding='utf-8') as file:
                  file.write(f"URL: {url}\n")
                  file.write(f"TITLE: {title}\n")
                  file.write(f"SCHEMA TYPES: {', '.join(schema_types)}\n\n")
                  file.write("Field Frequencies:\n")
                  for field, count in field_count.items():
                      file.write(f"{field}: {count}\n")
                  file.write("\nSchema Data:\n")
                  file.write(formatted_data)
              print(f"Schema.org data successfully saved to {filename}")
          except Exception as e:
              print(f"Error saving to file: {e}")
    
      # Main function to orchestrate the extraction and saving process
      def main():
          url_input = input("Please enter the URL without 'https://': ")
          url = f"https://{url_input}"
    
          schema_data, schema_types, field_count = extract_schema_data(url)
          if not schema_data:
              print("No Schema.org data found.")
              return
    
          meta_title = get_meta_title(url)
          formatted_data = format_schema_data(schema_data)
          save_to_file(url, meta_title, schema_types, formatted_data, field_count)
    
      if __name__ == "__main__":
          main()

Extract and Count Reviews/AggregateRating Script (Python)


This script was basically the concept for a similar WP plugin that automatically counts all single product ratings in each category and writes the correct total number of reviews into the "aggregateRating" Schema.org markup on the category pages.

We had a case where this was the optimal solution to display the correct "aggregateRating" count in Recipe rich results for a food blog/recipe website.
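
For context, this is roughly what such category-page markup looks like; the type, names and values below are made up purely for illustration, with reviewCount holding the total the script computes for that category:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Recipe",
  "name": "Chocolate Cake Recipes",
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.8",
    "reviewCount": 123
  }
}
</script>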

As of today, Google does not seem to pay too much attention to this, but there are indicators that the math is getting more important.

Description

This script extracts the total number of reviews from all categories listed in a sitemap and saves the results to a file. It is specifically designed to work with webpages where review counts are displayed in a specific format (e.g., "(123)").

Usage

Run the Script: Replace placeholders (https://example.com/category-sitemap.xml) with the actual URL of the sitemap. Execute the script in a Python environment.
Output: The total reviews per category are saved in result.txt.

Requirements

Python libraries: requests, beautifulsoup4, lxml (required for BeautifulSoup's 'xml' parser); re is part of the standard library.

Special Notes

Review Format: This script is suitable for webpages where the number of reviews is enclosed in parentheses, such as "(123)". It uses a regular expression to identify and extract these numbers.
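
As a quick illustration of that format (the sample strings below are made up), the regular expression used in the script behaves like this:

import re

# Same pattern as in the script: digits wrapped in literal parentheses
review_pattern = re.compile(r'\((\d+)\)')

print(review_pattern.search("Chocolate Cake (123)").group(1))  # prints: 123
print(review_pattern.search("No reviews yet"))                 # prints: None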

# scrape_review_count.py
# Author: Christopher Hneke
# Date: 04.08.2024
# Description: This script extracts the total number of reviews from all categories listed in a sitemap and saves the results to a file. 
# Description: It is specifically designed to work with webpages where review counts are displayed in a specific format (e.g., "(123)").

import requests
from bs4 import BeautifulSoup
import re

# Function to get the total number of reviews from a category URL
def get_total_reviews(url):
    total_reviews = 0
    page_number = 1
    review_pattern = re.compile(r'\((\d+)\)')

    while True:
        page_url = f"{url}/page/{page_number}/" if page_number > 1 else url
        response = requests.get(page_url)

        if response.status_code == 404:
            break

        soup = BeautifulSoup(response.content, 'html.parser')
        page_reviews = soup.find_all(string=review_pattern)

        if not page_reviews:
            break

        for review_text in page_reviews:
            match = review_pattern.search(review_text)
            if match:
                total_reviews += int(match.group(1))

        page_number += 1

    return total_reviews

# Main function to process the sitemap and extract reviews for each category
def main():
    sitemap_url = 'https://example.com/category-sitemap.xml'  # Replace with the actual sitemap URL
    response = requests.get(sitemap_url)
    soup = BeautifulSoup(response.content, 'xml')
    categories = soup.find_all('loc')

    results = []

    for category in categories:
        category_url = category.text
        category_name = category_url.split('/')[-2]
        print(f"Processing category: {category_name}")
        total_reviews = get_total_reviews(category_url)
        results.append(f"{category_name}: {total_reviews} reviews\n")

    with open('result.txt', 'w', encoding='utf-8') as file:
        file.writelines(results)

    print("Results saved to result.txt")

if __name__ == '__main__':
    main()

Does Google’s Disavow-Tool still work – or does it hurt?


Back in the day, Google's Disavow Tool could be useful to disavow spammy backlinks that competitors had created to hurt a website's rankings.

Negative SEO and spam-link attacks were (and are) real and can cause serious damage to businesses relying on Google's organic search as a traffic source.

Of course, spammers and blackhat SEOs also made heavy use of this tool to simply "recover" their sites from penalties they had caused themselves by buying toxic backlinks.

My theory is that the Disavow Tool was/is used by blackhats far more frequently than by whitehat SEOs, and that Google might therefore have quietly changed how it works.

For example, many backlink sellers calm their potential clients with lines like: "In case of consequences, there's Google's Disavow Tool, so no worries."

I directly confronted John Mueller (Search Advocate at Google) with this - you can see his response in the attached image. He obviously dodged my question, but sometimes no answer is an answer too, right?

There's a very interesting case study on Reddit about this, and I have a lot more info, which I will try to put into an article on DaniWeb this weekend.

I have found answers to many questions, but one still remains...

Does the Disavow Tool still help to recover a website's rankings lost because of spammy links?

Or does it nowadays hurt a site's rankings even more, because Google keeps it as a kind of "honeypot" for blackhatters who try to recover from the penalties they caused themselves?

If anybody has something to share about this, it would be very much appreciated.

Thanks for your attention,
Chris