Comparison of Fine-tuning GPT-4o mini vs GPT-3.5 for Text Classification


In my previous articles, I compared OpenAI's GPT-4o mini model with the GPT-4o and GPT-3.5 turbo models for zero-shot text classification. The results showed that GPT-4o mini, while significantly cheaper than its counterparts, achieves comparable performance.

On 8 August 2024, OpenAI enabled GPT-4o mini fine-tuning for developers across usage tiers 1-5. You can now fine-tune GPT-4o mini for free until 23 September 2024, with a daily token limit of 2 million.

In this article, I will show you how to fine-tune GPT-4o mini for text classification tasks and compare it to a fine-tuned GPT-3.5 turbo.

So, let's begin without further ado.

Importing and Installing Required Libraries

The following script installs the OpenAI Python library you can use to make calls to the OpenAI API.


!pip install openai

The script below imports the required libraries into your Python application.


from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from openai import OpenAI
import pandas as pd
import os
import json

Importing the Dataset

We will use the Twitter US Airline Sentiment dataset for fine-tuning the GPT-4o mini and GPT-3.5 turbo models.

The following script imports the dataset and defines the preprocess_data() function. This function takes a dataset and an index value as inputs. It then splits the dataset by sentiment category and returns 34 neutral, 33 positive, and 33 negative tweets, beginning at the specified index within each category. This approach gives us 100 roughly balanced records. You can use more records for fine-tuning if you want.



dataset = pd.read_csv(r"D:\Datasets\Tweets.csv")

def preprocess_data(dataset, n):

    # Remove rows where 'airline_sentiment' or 'text' are NaN
    dataset = dataset.dropna(subset=['airline_sentiment', 'text'])

    # Remove rows where 'airline_sentiment' or 'text' are empty strings
    dataset = dataset[(dataset['airline_sentiment'].str.strip() != '') & (dataset['text'].str.strip() != '')]

    # Filter the DataFrame for each sentiment
    neutral_df = dataset[dataset['airline_sentiment'] == 'neutral']
    positive_df = dataset[dataset['airline_sentiment'] == 'positive']
    negative_df = dataset[dataset['airline_sentiment'] == 'negative']

    # Select records starting at index n within each category
    neutral_sample = neutral_df[n:n + 34]
    positive_sample = positive_df[n:n + 33]
    negative_sample = negative_df[n:n + 33]

    # Concatenate the samples into one DataFrame
    dataset = pd.concat([neutral_sample, positive_sample, negative_sample])

    # Reset index if needed
    dataset.reset_index(drop=True, inplace=True)

    dataset = dataset[["text", "airline_sentiment"]]

    return dataset

The following script creates a balanced training dataset.


training_data = preprocess_data(dataset, 0)
print(training_data["airline_sentiment"].value_counts())
training_data.head()

Output:

image1.png

Similarly, the script below creates a test dataset.


test_data = preprocess_data(dataset, 100)
print(test_data["airline_sentiment"].value_counts())
test_data.head()

Output:

image2.png
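Because the training and test sets are taken from different index ranges, their rows do not overlap, but the same tweet text could still appear twice in the source data. The following is a minimal sketch of a check for shared texts. The find_overlap() helper and the toy lists are hypothetical and are not part of the original workflow.

```python
def find_overlap(train_texts, test_texts):
    """Return the set of texts that occur in both collections."""
    return set(train_texts) & set(test_texts)


# Toy data standing in for training_data["text"] and test_data["text"]
train_example = ["great flight", "lost my bag", "what time is boarding?"]
test_example = ["delayed again", "lost my bag"]

print(find_overlap(train_example, test_example))
```

With the real data, you could call find_overlap(training_data["text"], test_data["text"]) and expect an empty set if no tweet is shared between the two splits.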

Converting Training Data to JSON Format for OpenAI Model Fine-tuning

To fine-tune an OpenAI model, you need to transform the training data into the JSON Lines (JSONL) format, in which every line of the file is a separate JSON object, as outlined in the official OpenAI documentation. To achieve this, I have written a straightforward function that converts each row of the input Pandas DataFrame into the required JSON structure.

The following script converts the training data into the OpenAI-compliant format for fine-tuning. Fine-tuning relies significantly on the content specified for the system role, so pay special attention when setting this value.


# JSON file path
json_file_path = r"D:\Datasets\airline_sentiments.json"

# Function to create the JSON structure for each row
def create_json_structure(row):
    return {
        "messages": [
            {"role": "system", "content": "You are a Twitter sentiment analysis expert who can predict sentiment expressed in the tweets about an airline. You select sentiment value from positive, negative, or neutral."},
            {"role": "user", "content": row['text']},
            {"role": "assistant", "content": row['airline_sentiment']}
        ]
    }

# Convert DataFrame to JSON structures
json_structures = training_data.apply(create_json_structure, axis=1).tolist()

# Write JSON structures to file, each on a new line
with open(json_file_path, 'w') as f:
    for json_structure in json_structures:
        f.write(json.dumps(json_structure) + '\n')

print(f"Data has been written to {json_file_path}")

Output:


Data has been written to D:\Datasets\airline_sentiments.json
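Before uploading, it can help to confirm that every line of the file parses as JSON and follows the system, user, assistant layout described above. The sketch below is one possible check. The validate_jsonl() function and its constants are hypothetical helpers, not part of the OpenAI library, and they only cover the structure used in this article.

```python
import json

ALLOWED_LABELS = {"positive", "negative", "neutral"}
EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_jsonl(path):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((line_number, f"invalid JSON: {e}"))
                continue
            messages = record.get("messages", []) if isinstance(record, dict) else []
            roles = [m.get("role") for m in messages]
            if roles != EXPECTED_ROLES:
                problems.append((line_number, f"unexpected roles: {roles}"))
                continue
            # The assistant message carries the label the model should learn
            label = messages[-1].get("content")
            if label not in ALLOWED_LABELS:
                problems.append((line_number, f"unexpected label: {label!r}"))
    return problems
```

Calling validate_jsonl(json_file_path) on the file written above should return an empty list if every record has the expected shape.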

The next step is to upload your training file to the OpenAI server. To do so, start by creating an OpenAI client object. Then, call the files.create() method, passing it the file opened in binary mode and the purpose "fine-tune", as demonstrated in the following script:

client = OpenAI(
    # This is the default and can be omitted
    api_key = os.environ.get('OPENAI_API_KEY'),
)


# Open the file in binary mode; the with block closes it after the upload
with open(json_file_path, "rb") as f:
    training_file = client.files.create(
        file=f,
        purpose="fine-tune"
    )

print(training_file.id)

Once the file is uploaded, you will receive a file ID, as the above script demonstrates. You will use this file ID to fine-tune your OpenAI model.

Fine-Tuning GPT-4o Mini for Text Classification

To start fine-tuning, you must call the fine_tuning.jobs.create() method and pass it the ID of the uploaded training file and the model name. The current model name for GPT-4o mini is gpt-4o-mini-2024-07-18.


fine_tuning_job_gpt4o_mini = client.fine_tuning.jobs.create(
  training_file=training_file.id,
  model="gpt-4o-mini-2024-07-18"
)

Executing the above script initiates the fine-tuning process. The following script allows you to monitor and display various fine-tuning events.


# List up to 10 events from a fine-tuning job
print(client.fine_tuning.jobs.list_events(fine_tuning_job_id = fine_tuning_job_gpt4o_mini.id,
                                    limit=10))
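Printing the raw response object is hard to read. As a rough sketch, the events can be turned into one line each, assuming every event exposes created_at (a Unix timestamp), level, and message attributes. The format_events() helper is hypothetical and is not part of the OpenAI library.

```python
from datetime import datetime, timezone


def format_events(events):
    """Return one readable line per event, oldest first."""
    ordered = sorted(events, key=lambda e: e.created_at)
    lines = []
    for event in ordered:
        timestamp = datetime.fromtimestamp(event.created_at, tz=timezone.utc)
        lines.append(f"{timestamp:%Y-%m-%d %H:%M:%S} [{event.level}] {event.message}")
    return lines


# Possible usage with the job created above (requires a valid API key):
# events = client.fine_tuning.jobs.list_events(
#     fine_tuning_job_id=fine_tuning_job_gpt4o_mini.id, limit=10)
# print("\n".join(format_events(events.data)))
```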

Once fine-tuning is complete, you will receive an email containing the ID of your fine-tuned model, which you can use to make inferences. Alternatively, you can retrieve the ID of your fine-tuned model by running the following script.


ft_model_id = client.fine_tuning.jobs.retrieve(fine_tuning_job_gpt4o_mini.id).fine_tuned_model
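Note that fine_tuned_model is empty until the job has finished, so running the line above too early yields None. The following is a minimal polling sketch, assuming the job object exposes a status attribute that ends in "succeeded", "failed", or "cancelled". The wait_for_job() helper and its parameters are hypothetical. The retrieve function is passed in as an argument so the loop can be tried without calling the API.

```python
import time

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=30, sleep=time.sleep):
    """Poll retrieve(job_id) until the job reaches a terminal status, then return the job."""
    while True:
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)


# Possible usage with the job created above (requires a valid API key):
# job = wait_for_job(client.fine_tuning.jobs.retrieve, fine_tuning_job_gpt4o_mini.id)
# ft_model_id = job.fine_tuned_model  # still None unless job.status == "succeeded"
```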

The remainder of the process follows the same steps as outlined in a previous article. We will define the find_sentiment() function and pass it our fine-tuned model and the test set to predict the sentiment of the tweets in the dataset.

Finally, we compute the model's accuracy by comparing the actual and predicted sentiments of the tweets.


def find_sentiment(client, model, dataset, max_retries=5):
    # max_retries is an added safeguard: without it, a persistent API error
    # would make the loop below retry the same tweet forever
    tweets_list = dataset["text"].tolist()

    all_sentiments = []

    i = 0
    retries = 0

    while i < len(tweets_list):

        try:
            tweet = tweets_list[i]
            content = """What is the sentiment expressed in the following tweet about an airline?
            Select sentiment value from positive, negative, or neutral. Return only the sentiment value in small letters.
            tweet: {}""".format(tweet)

            response = client.chat.completions.create(
                model=model,
                temperature=0,
                max_tokens=10,
                messages=[
                    {"role": "user", "content": content}
                ]
            )

            sentiment_value = response.choices[0].message.content

            all_sentiments.append(sentiment_value)
            i += 1
            retries = 0
            print(i, sentiment_value)

        except Exception as e:
            print("===================")
            print("Exception occurred:", e)
            retries += 1
            if retries >= max_retries:
                # Give up on this tweet and record a placeholder prediction
                all_sentiments.append("error")
                i += 1
                retries = 0

    # accuracy_score expects the true labels first, then the predictions
    accuracy = accuracy_score(dataset["airline_sentiment"], all_sentiments)
    print(f"Accuracy: {accuracy}")

find_sentiment(client,ft_model_id, test_data)

Output:


Accuracy: 0.78

The above output shows that the fine-tuned GPT-4o mini achieves an accuracy of 78% on the test set.
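The accuracy is computed by exact string comparison, so a reply such as "Negative." or " positive" would count as a wrong prediction even though the sentiment is correct. The sketch below shows one way to normalize replies before scoring. The normalize_label() helper is hypothetical and was not used to produce the figure reported above.

```python
LABELS = ("positive", "negative", "neutral")


def normalize_label(raw):
    """Map a raw model reply to one of LABELS, or "unknown" if none is found."""
    text = (raw or "").strip().lower()
    for label in LABELS:
        if text.startswith(label):
            return label
    return "unknown"


print(normalize_label("Negative."))
print(normalize_label(" positive\n"))
```

If you adopt something like this, apply it to every prediction before calling accuracy_score(), and expect the resulting accuracy to differ slightly from the raw comparison.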

Fine-Tuning GPT-3.5 Turbo for Text Classification

For comparison, we will also fine-tune the GPT-3.5 turbo model for text classification.

The fine-tuning process remains the same as for the GPT-4o mini. We will pass the training file ID and the GPT-3.5 turbo model ID to the client.fine_tuning.jobs.create() method, as shown below.


fine_tuning_job_gpt_3_5 = client.fine_tuning.jobs.create(
  training_file=training_file.id,
  model="gpt-3.5-turbo"
)

Next, we will pass the fine-tuned GPT-3.5 model ID and the test dataset to the find_sentiment() function to evaluate the model's performance on the test set.


ft_model_id = client.fine_tuning.jobs.retrieve(fine_tuning_job_gpt_3_5.id).fine_tuned_model
find_sentiment(client,ft_model_id, test_data)

Output:


Accuracy: 0.82

The above output shows that the fine-tuned GPT-3.5 turbo model achieves an accuracy of 82%, four percentage points higher than the fine-tuned GPT-4o mini model.
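To get a feel for how much these two figures could move with a different sample of 100 tweets, you can compute an approximate confidence interval for each accuracy. The sketch below uses the Wilson score interval, which is my choice for illustration and not something the original experiment included. The wilson_interval() helper is hypothetical.

```python
import math


def wilson_interval(accuracy, n, z=1.96):
    """Approximate 95% Wilson score interval for an accuracy measured on n samples."""
    denominator = 1 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denominator
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denominator
    return center - half_width, center + half_width


for name, acc in [("GPT-4o mini", 0.78), ("GPT-3.5 turbo", 0.82)]:
    low, high = wilson_interval(acc, 100)
    print(f"{name}: {acc:.2f} (approx. 95% interval {low:.2f} to {high:.2f})")
```

On a 100-tweet test set, four percentage points correspond to four tweets, and the two intervals printed by this sketch overlap, so a larger test set would give a firmer comparison.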

Conclusion

GPT-4o mini is a cheaper and faster alternative to GPT-3.5. My last article showed that it achieves higher performance for zero-shot text classification than the GPT-3.5 turbo model.

However, based on the results presented in this article, a fine-tuned GPT-3.5 turbo model still outperforms a fine-tuned GPT-4o mini, by four percentage points on this 100-tweet test set.

Feel free to share your feedback in the comments section.