Comparing GPT-4o vs Claude 3.5 Sonnet for Zero Shot Text Classification

Featured Imgs 23

On June 20, 2024, Anthropic released the Claude 3.5 sonnet large language model. Claude claims it to be the state-of-the-art model for many natural language processing tasks, surpassing the OpenAI GPT-4o model.

My first test for comparing two large language models is their zero-shot text classification ability. In this article, I will compare the Antropic Claude 3.5 sonnet with the OpenAI GPT-4o model for zero-shot tweet sentiment classification.

So, let's begin without ado.

Importing and Installing Required Libraries

The following script installs the Anthropic and OpenAI libraries to access the corresponding APIs.


!pip install anthropic
!pip install openai

The script below imports the required libraries into your Python application.


import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import anthropic
from openai import OpenAI

Importing and Preprocessing the Dataset

We will use the Twitter US Airline Sentiment dataset to perform zero-shot classification. You can download the dataset from Kaggle.

The following script imports the dataset into a Pandas DataFrame.

## Dataset download link
## https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment?select=Tweets.csv

dataset = pd.read_csv(r"D:\Datasets\tweets.csv")
print(dataset.shape)
dataset.head()

Output:

image1.png

Tweet sentiment falls into three categories: neutral, positive, and negative. For comparison, we will filter 100 tweets. The neutral, positive, and negative categories will contain 34, 33, and 33 tweets, respectively.

# Remove rows where 'airline_sentiment' or 'text' are NaN
dataset = dataset.dropna(subset=['airline_sentiment', 'text'])

# Remove rows where 'airline_sentiment' or 'text' are empty strings
dataset = dataset[(dataset['airline_sentiment'].str.strip() != '') & (dataset['text'].str.strip() != '')]

# Filter the DataFrame for each sentiment
neutral_df = dataset[dataset['airline_sentiment'] == 'neutral']
positive_df = dataset[dataset['airline_sentiment'] == 'positive']
negative_df = dataset[dataset['airline_sentiment'] == 'negative']

# Randomly sample records from each sentiment
neutral_sample = neutral_df.sample(n=34)
positive_sample = positive_df.sample(n=33)
negative_sample = negative_df.sample(n=33)

# Concatenate the samples into one DataFrame
dataset = pd.concat([neutral_sample, positive_sample, negative_sample])

# Reset index if needed
dataset.reset_index(drop=True, inplace=True)

# print value counts
print(dataset["airline_sentiment"].value_counts())

Output:

image2.png

Zero Shot Text Classification with GPT-4o

We will first perform zero-shot text classification with GPT-4o. Remember to get your OpenAI API key before executing the following script.

The script below defines a function find_sentiment() that accepts the client and model parameters. The client corresponds to the client, i.e., anthropic or OpenAI, while the model corresponds to Claude 3.5 Sonnet or GPT-4o.

The find_sentiment() function iterates through all the tweets from the filtered dataset and predicts sentiment for each tweet. The predicted sentiments are then compared with the actual sentiment values to determine the model's accuracy.


def find_sentiment(client, model):

    tweets_list = dataset["text"].tolist()


    all_sentiments = []

    i = 0
    exceptions = 0
    while i < len(tweets_list):

        try:
            tweet = tweets_list[i]
            content = """What is the sentiment expressed in the following tweet about an airline?
            Select sentiment value from positive, negative, or neutral. Return only the sentiment value in small letters.
            tweet: {}""".format(tweet)

            sentiment_value = client.chat.completions.create(
                                  model= model,
                                  temperature = 0,
                                  max_tokens = 10,
                                  messages=[
                                        {"role": "user", "content": content}
                                    ]
                                ).choices[0].message.content

            all_sentiments.append(sentiment_value)
            i = i + 1
            print(i, sentiment_value)

        except Exception as e:
            print("===================")
            print("Exception occurred:", e)
            exceptions += 1

    print("Total exception count:", exceptions)
    accuracy = accuracy_score(all_sentiments, dataset["airline_sentiment"])
    print("Accuracy:", accuracy)

Next, we create an OpenAI client and pass the client object along with the gpt-4o model to predict sentiments.


%%time
client = OpenAI(
    # This is the default and can be omitted
    api_key = os.environ.get('OPENAI_API_KEY'),
)
model = "gpt-4o"

find_sentiment(client, model)

Output:

image3.png

The above output shows that GPT-4o achieved an accuracy of 76%.

Zero Shot Text Classification with Claude 3.5-Sonnet

Let's now predict the sentiment for the same set of tweets using the Claude 3.5 sonnet model. We define the find_sentiment_claude() function which is similar to the find_sentiment() function, but predicts sentiments using the Claude 3.5 sonnet model.


def find_sentiment_claude(client, model):

    tweets_list = dataset["text"].tolist()


    all_sentiments = []

    i = 0
    exceptions = 0
    while i < len(tweets_list):

        try:
            tweet = tweets_list[i]
            content = """What is the sentiment expressed in the following tweet about an airline?
            Select sentiment value from positive, negative, or neutral. Return only the sentiment value in small letters.
            tweet: {}""".format(tweet)

            sentiment_value = client.messages.create(
                                model= model,
                                max_tokens=10,
                                temperature=0.0,
                                messages=[
                                    {"role": "user", "content": content}
                                ]
                            ).content[0].text

            all_sentiments.append(sentiment_value)
            i = i + 1
            print(i, sentiment_value)

        except Exception as e:
            print("===================")
            print("Exception occurred:", e)
            exceptions += 1

    print("Total exception count:", exceptions)
    accuracy = accuracy_score(all_sentiments, dataset["airline_sentiment"])
    print("Accuracy:", accuracy)

The script below calls the find_sentiment_claude() function using the anthropic client and the claude-3-5-sonnet-20240620 (the id for the Claude 3.5 sonnet) model.


%%time
client = anthropic.Anthropic(
    # defaults to os.environ.get("ANTHROPIC_API_KEY")
    api_key = os.environ.get('ANTHROPIC_API_KEY')
)

model = "claude-3-5-sonnet-20240620"

find_sentiment_claude(client, model)

Output:

image4.png

The above output shows that the Claude 3.5 Sonnet model achieved an accuracy of 83%, which is significantly better than the GPT-4o model.

Final Verdict

Here is an overall comparison between the Claude Sonnet 3.5 and the GPT-4o model.

image5.png

Given the price, performance, and context window size, I prefer the Claude 3.5 Sonnet over other proprietary models such as GPT-4o.

Feel free to share your opinions. I would be very interested in any results you obtain using the two models.

Tabular Data Classification with Hugging Face Meta Tree Transformer

Featured Imgs 23

As a data scientist, I have extensively used the Hugging Face library for processing unstructured data such as images, text, and audio. My previous blogs have covered various transformer models for these types of data. Lately, however, I discovered that Hugging Face also provides transformer models for tabular data. One such transformer is the Meta Tree Transformer.

This article will explore using the Meta Tree Transformer model to classify tabular data, detailing each process step and providing insights based on the Bank Note Authentication dataset.

Installing and Importing Required Libraries

You must install and import the following libraries to run the codes in this article.


!pip install metatreelib
!pip install --upgrade scikit-learn
!pip install imodels

from metatree.model_metatree import LlamaForMetaTree as MetaTree
from metatree.decision_tree_class import DecisionTree, DecisionTreeForest
from metatree.run_train import preprocess_dimension_patch
from transformers import AutoConfig
from sklearn.metrics import accuracy_score
import imodels # pip install imodels
import sklearn
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
import random
Loading and Preprocessing the Dataset

The dataset used in this tutorial is the Bank Note Authentication dataset, which you can download from Kaggle. The dataset contains features extracted from images of banknotes and is used to classify whether a banknote is authentic or not.

The dataset consists of the following columns:

  • variance
  • skewness
  • curtosis
  • entropy
  • class

The class column is the target variable, indicating whether the banknote is authentic (1) or not (0).

First, we need to read the dataset and preprocess it. We will use the pandas library to read the dataset from a CSV file.

This will load the dataset into a pandas DataFrame and display the first few rows to get an overview of the data.


# Load the dataset
file_path = '/content/BankNote_Authentication.csv'  # Path to the dataset
df = pd.read_csv(file_path)

# Display the first few rows of the dataset
df.head()

Output:

image1.png

Next, we split the dataset into training and testing sets using sklearn.model_selection.train_test_split. Here, 20% of the data is reserved for testing, ensuring we have sufficient data to evaluate the model's performance.


# Split the dataset into features and target variable
X = df.drop(columns=['class'])
y = df['class']

# Split the data into training and testing sets
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2,
DataLoader for Batching

To handle the entire dataset in batches of 256, we will create a custom dataset class and use PyTorch's DataLoader to batch and shuffle the data. The batch size is set to 256 since the Meta Tree transformer expects the data to be in batches of 256 records.


class TabularDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        feature = self.features[idx]
        label = self.labels[idx]
        return torch.tensor(feature, dtype=torch.float32), torch.tensor(label, dtype=torch.float32)

# Convert data to tensors
train_features = train_X.values
train_labels = torch.nn.functional.one_hot(torch.tensor(train_y.values), num_classes=2).float().numpy()

# Create Dataset
train_dataset = TabularDataset(train_features, train_labels)

# Parameters
batch_size = 256

# Create DataLoader
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
Setting Up the Meta Tree Transformer Model

To begin with, we need to initialize the Meta Tree Transformer model and adjust its configuration to match our dataset, particularly the number of features and classes.

The model is configured to handle a different number of features by default. We set config.n_feature to the number of features in our dataset (train_X.shape[1]).

Similarly, the model is configured for a different number of output classes. We set config.n_class to the number of classes in our dataset (2).



# Initialize Model
model_name_or_path = "yzhuang/MetaTree"

config = AutoConfig.from_pretrained(model_name_or_path)
# Override config parameters to match your dataset
config.n_feature = train_X.shape[1]
config.n_class = 2

model = MetaTree.from_pretrained(
    model_name_or_path,
    config=config,
    ignore_mismatched_sizes=True
)

decision_tree_forest = DecisionTreeForest()

# Set the depth of the model
model.depth = 2

Training the Model with Batches

Next, we train the model using the batches provided by the DataLoader.


# Training loop
for batch_features, batch_labels in train_loader:
    # Prepare the batch for the model
    batch = {"input_x": batch_features, "input_y": batch_labels, "input_y_clean": batch_labels}
    batch = preprocess_dimension_patch(batch, n_feature=train_X.shape[1], n_class=2)

    # Generate decision tree
    outputs = model.generate_decision_tree(batch['input_x'], batch['input_y'], depth=model.depth)
    decision_tree_forest.add_tree(DecisionTree(auto_dims=outputs.metatree_dimensions, auto_thresholds=outputs.tentative_splits, input_x=batch['input_x'], input_y=batch['input_y'], depth=model.depth))

    print("Decision Tree Features: ", [x.argmax(dim=-1) for x in outputs.metatree_dimensions])
    print("Decision Tree Thresholds: ", outputs.tentative_splits)


Evaluating the Model

Finally, we evaluate the model's performance on the test set.


# Predict using the decision tree forest
test_X_tensor = torch.tensor(test_X.values, dtype=torch.float32)
tree_pred = decision_tree_forest.predict(test_X_tensor)

tree_pred = tree_pred.argmax(dim=-1).squeeze().numpy()

# Calculate accuracy
accuracy = accuracy_score(test_y, tree_pred)
print("MetaTree Test Accuracy: ", accuracy)

Output:


MetaTree Test Accuracy:  0.8727272727272727
Conclusion

The Meta Tree Transformer offers a powerful method for classifying tabular data by combining the interpretability of decision trees with the robust performance of transformer models. In this tutorial, we walked through the process of setting up the model, preprocessing the data, training with multiple batches, and evaluation.

In my experience, the performance of the Meta Tree Transformer was on par with simpler algorithms like Random Forest and AdaBoost. Experimenting with different parameters and datasets can further enhance its performance, making it a valuable addition to any data scientist's toolkit.

Feel free to leave your feedback and the results you obtained using Transformer models on tabular data.