Extracting Structured Outputs from LLMs in LangChain

Large language models (LLMs) are trained to predict the next token (a small unit of text) that follows an input sequence of tokens. This makes LLMs well suited for generating unstructured textual responses.

However, we often need to extract structured information from unstructured text. With the Python LangChain module, you can extract structured information into a Python Pydantic object.

In this article, you will see how to extract structured information from news articles. You will extract the article's tone, type, country, title, and conclusion. You will also see how to extract structured information from single and multiple text documents.

So, let's begin without further ado.

Installing and Importing Required Libraries

As always, we will first install and import the required libraries.
The script below installs the LangChain and LangChain OpenAI libraries. We will extract structured data from the news articles using the latest OpenAI GPT-4o LLM.


!pip install -U langchain
!pip install -qU langchain-openai

Next, we will import the required libraries in a Python application.


import pandas as pd
import os
from typing import List, Optional
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import OpenAI
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

Importing the Dataset

We will extract structured information from the articles in the News Article with Summary dataset.

The following script imports the data into a Pandas DataFrame.


dataset = pd.read_excel(r"D:\Datasets\dataset.xlsx")
dataset.head(10)

Output:

image1.png

Defining the Structured Output Format

To extract structured output, we need to define the attributes of the structured output. We will extract the article title, type, tone, country, and conclusion. Furthermore, we want to categorize the article types and tones into the following categories.


article_types = [
    "informative",
    "critical",
    "opinion",
    "explanatory",
    "analytical",
    "persuasive",
    "narrative",
    "investigative",
    "feature",
    "review",
    "profile",
    "how-to/guide",
]

article_tones = [
    "aggressive",
    "neutral",
    "passive",
    "formal",
    "informal",
    "humorous",
    "serious",
    "optimistic",
    "pessimistic",
    "sarcastic",
    "enthusiastic",
    "melancholic",
    "objective",
    "subjective",
    "cautious",
    "assertive",
    "conciliatory",
    "urgent"
]

Next, you must define a class inheriting from the Pydantic BaseModel class. Inside the class, you define the attributes containing the structured information.

For example, in the following script, the title attribute contains a string type article title. The LLM will use the attribute description to extract information for this attribute from the article text.

We will extract the title, type, tone, country, and conclusion.


class ArticleInformation(BaseModel):
    """Information about a news paper article"""


    title:str = Field(description= "This is the title of the article in less than 100 characters")
    article_type: str = Field(description = f"The type of the article. It can be one of the following: {article_types}")
    tone: str = Field(description = f"The tone of the article. It can be one of the following: {article_tones}")
    country: str = Field(description= """The country which is at the center of discussion in the article.
                                         Return global if the article is about the whole world.""")

    conclusion: str = Field(description= "The conclusion of the article in less than 100 words.")


Extracting the Structured Output from Text

Next, you must define an LLM to extract structured information from the news article. In the following script, we will use the latest OpenAI GPT-4o LLM.


OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')
llm = ChatOpenAI(api_key = OPENAI_API_KEY ,
                 temperature = 0,
                model_name = "gpt-4o-2024-08-06")

You need to define a prompt that instructs the LLM to act as an expert extraction algorithm while extracting structured outputs.

Subsequently, using the LangChain Expression Language, we will create a chain that passes the prompt to the LLM. Notice that we call the with_structured_output() method on the LLM object and pass the ArticleInformation class as the method's schema argument. This ensures the output object contains the attributes defined in the ArticleInformation class.


extraction_prompt = """
You are an expert extraction algorithm.
Only extract relevant information from the text.
If you do not know the value of an attribute asked to extract,
return null for the attribute's value."
"""

prompt = ChatPromptTemplate.from_messages([
    ("system", extraction_prompt),
    ("user", "{input}")
])

extraction_chain = prompt | llm.with_structured_output(schema = ArticleInformation)

Finally, we can call the invoke() method of the chain we just created and pass it the article text.


first_article = dataset["content"].iloc[0]
article_information = extraction_chain.invoke({"input":first_article})
print(article_information)

Output:

image2.png

From the above output, you can see structured data extracted from the article.
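
Since the chain returns an ArticleInformation object, you can access each extracted field directly or convert the whole object into a Python dictionary. Here is a small sketch, assuming the article_information object returned above:


print(article_information.title)
print(article_information.tone)

# Pydantic v1 models (imported from langchain_core.pydantic_v1) expose dict() for conversion
print(article_information.dict())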

Extracting a List of Formatted Items

In most cases, you will want to extract structured data from multiple text documents. To do so, you have two options: merge multiple documents into one document or iterate through multiple documents and extract structured data from each document.

Extracting List of Items From a Single Merged Document

You can merge multiple documents into a single document and then create a Pydantic class that contains a list of objects of the Pydantic class holding the structured data you want to extract. This approach is only helpful when you have a small number of documents, since merging many documents can produce a token count that exceeds the LLM's context window.
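
Before merging, it can help to roughly estimate the token count of the combined text. The snippet below is only a sketch that uses the common heuristic of about four characters per token; for exact counts you would use a tokenizer:


merged_text = "\n\n".join(dataset["content"].head(10).tolist())

# Rough heuristic: ~4 characters per token
approx_tokens = len(merged_text) / 4
print(f"Approximate token count: {approx_tokens:.0f}")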

To do so, we will create another Pydantic class with a list of objects from the initial Pydantic class containing structured data information.

For example, in the following script, we define the ArticleInfos class, which contains an articles attribute holding a list of ArticleInformation objects.


class ArticleInfos(BaseModel):
    """Extracted data about multiple articles."""

    # Creates a model so that we can extract multiple entities.
    articles: List[ArticleInformation]

Next, we will merge the first 10 documents from our dataset using the following script.


# Function to generate the formatted article
def format_articles(df, num_articles=10):
    formatted_articles = ""
    for i in range(min(num_articles, len(df))):
        article_info = f"================================================================\n"
        article_info += f"Article Number: {i+1}, {df.loc[i, 'author']}, {df.loc[i, 'date']}, {df.loc[i, 'year']}, {df.loc[i, 'month']}\n"
        article_info += "================================================================\n"
        article_info += f"{df.loc[i, 'content']}\n\n"
        formatted_articles += article_info
    return formatted_articles

# Get the formatted articles for the first 10
formatted_articles = format_articles(dataset, 10)

# Output the result
print(formatted_articles)

Output:

image3.png

The above output shows one extensive document containing text from the first ten articles.

We will create a chain where the LLM uses the ArticleInfos class in the llm.with_structured_output() method.

Finally, we call the invoke() method and pass our document containing multiple articles, as shown in the following script.

If you print the articles attribute from the LLM response, you will see that it contains a list of structured items corresponding to each article.


extraction_chain = prompt | llm.with_structured_output(schema = ArticleInfos)
article_information = extraction_chain.invoke({"input":formatted_articles})
print(article_information.articles)

Output:

image4.png

Using the script below, you can store the extracted information in a Pandas DataFrame.


# Converting the list of objects to a list of dictionaries
articles_data = [
    {
        "title": article.title,
        "article_type": article.article_type,
        "tone": article.tone,
        "country": article.country,
        "conclusion": article.conclusion
    }
    for article in article_information.articles
]

# Creating a DataFrame from the list of dictionaries
df = pd.DataFrame(articles_data)

df.head(10)

Output:

image5.png

The above output shows the extracted article title, type, tone, country, and conclusion in a Pandas DataFrame.

Extracting List of Items From Multiple Documents

The second option for extracting structured data from multiple documents is to simply iterate over each document and use the Pydantic structured class to extract structured information. I prefer this approach if I have a large number of documents.

The following script iterates through the last 10 documents in the dataset, extracts structured data from each one, and stores the results in a list.


extraction_chain = prompt | llm.with_structured_output(schema = ArticleInformation)

articles_information_list = []
for index, row in dataset.tail(10).iterrows():
    content_text = row['content']
    article_information = extraction_chain.invoke({"input":content_text})
    articles_information_list.append(article_information)

articles_information_list

Output:

image6.png

Finally, we can convert the list of extracted data into a Pandas DataFrame using the following script.


# Converting the list of objects to a list of dictionaries
articles_data = [
    {
        "title": article.title,
        "article_type": article.article_type,
        "tone": article.tone,
        "country": article.country,
        "conclusion": article.conclusion
    }
    for article in articles_information_list
]

# Creating a DataFrame from the list of dictionaries
df = pd.DataFrame(articles_data)

# Displaying the DataFrame
df.head(10)

Output:

image7.png

Conclusion

Extracting structured data from an LLM can be crucial, particularly for data engineering, preprocessing, analysis, and visualization tasks. In this article, you saw how to extract structured data using LLMs in LangChain, both from a single document and multiple documents.

If you have any feedback, please leave it in the comments section.

Enhancing RAG Functionalities using Tools and Agents in LangChain

Retrieval augmented generation (RAG) allows large language models (LLMs) to answer queries related to the data the models have not seen during training. In my previous article, I explained how to develop RAG systems using the Claude 3.5 Sonnet model.

However, RAG systems only answer queries about the data stored in the vector database. For example, you have a RAG system that answers queries related to financial documents in your database. If you ask it to search the internet for some information, it will not be able to do so.

This is where tools and agents come into play. Tools and agents enable LLMs to retrieve information from various external sources such as the internet, Wikipedia, YouTube, or virtually any Python method implemented as a tool in LangChain.
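
For instance, any plain Python function can be exposed to an LLM as a tool via LangChain's @tool decorator. The following is only an illustrative sketch (the get_word_count function is made up for demonstration); the tools we actually use are defined later in this article:


from langchain_core.tools import tool

@tool
def get_word_count(text: str) -> int:
    """Count the number of words in a piece of text."""
    return len(text.split())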

This article will show you how to enhance the functionalities of your RAG systems using tools and agents in the Python LangChain framework.

So, let's begin without further ado.

Installing and Importing Required Libraries

The following script installs the required libraries, including the Python LangChain framework and its associated modules and the OpenAI client.



!pip install -U langchain
!pip install langchain-core
!pip install langchainhub
!pip install -qU langchain-openai
!pip install pypdf
!pip install faiss-cpu
!pip install --upgrade --quiet  wikipedia

The script below imports the required libraries into your Python application.


from langchain.tools import WikipediaQueryRun
from langchain_community.utilities import WikipediaAPIWrapper
from langchain.tools.retriever import create_retriever_tool
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.tools import tool
from langchain import hub
import os

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

Enhancing RAG with Tools and Agents

To enhance RAG using tools and agents in LangChain, follow these steps (a minimal sketch of how they fit together appears after the list):

  1. Import or create tools you want to use with RAG.
  2. Create a retrieval tool for RAG.
  3. Add retrieval and other tools to an agent.
  4. Create an agent executor that invokes agents' calls.
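
Here is a minimal sketch of how these steps fit together; the tools list, llm, and prompt objects used below are built one by one in the rest of this article:


# tools: a list containing the Wikipedia tool and the retrieval (RAG) tool
# create_tool_calling_agent combines the LLM, tools, and prompt into an agent
# AgentExecutor runs the agent and invokes whichever tool the agent selects
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
response = agent_executor.invoke({"input": "your question here"})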

The benefit of agents over chains is that agents decide at runtime which tool to use to answer user queries.

This article will enhance the RAG model's performance using the Wikipedia tool. We will create a LangChain agent with a RAG tool capable of answering questions from a document containing information about the British parliamentary system. We will incorporate the Wikipedia tool into the agent to enhance its functionality.

If a user asks a question about the British parliament, the agent will call the RAG tool to answer it. In case of any other query, the agent will use the Wikipedia tool to search for answers on Wikipedia.

Let's implement this model step by step.

Importing Wikipedia Tool

The following script imports the built-in Wikipedia tool from the LangChain module. To retrieve Wikipedia pages, pass your query to the run() method.


wikipedia_tool = WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper())
response = wikipedia_tool.run("What are large language models?")
print(response)

Output:

image1.png

To add the above tool to an agent, you must define a function decorated with the @tool decorator. Inside the function, you simply call the run() method as you did previously and return the response.


@tool
def WikipediaSearch(search_term: str):

    """
    Use this tool to search for wikipedia articles.
    If a user asks to search the internet, you can search via this wikipedia tool.
    """


    result = wikipedia_tool.run(search_term)
    return result

Next, we will create a retrieval tool that implements the RAG functionality.

Creating Retrieval Tool

In my article on Retrieval Augmented Generation with Claude 3.5 Sonnet, I explained how to create a retriever using LangChain. The process remains the same here.


openai_api_key = os.getenv('OPENAI_API_KEY')

loader = PyPDFLoader("https://web.archive.org/web/20170809122528id_/http://global-settlement.org/pub/The%20English%20Constitution%20-%20by%20Walter%20Bagehot.pdf")
docs = loader.load_and_split()

documents = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
).split_documents(docs)

embeddings = OpenAIEmbeddings(openai_api_key = openai_api_key)
vector = FAISS.from_documents(documents, embeddings)

retriever = vector.as_retriever()

You can query the vector retriever using the invoke() method. In the output, you will see the document sections with the highest semantic similarity to the input.


query = """
"What is the difference between the house of lords and house of commons? How members are elected for both?"
"""
retriever.invoke(query)[:3]

Output:

image2.png

Next, we will create a retrieval tool that uses the vector retriever you created to answer user queries. You can create a RAG retrieval tool using the create_retriever_tool() function.


BritishParliamentSearch = create_retriever_tool(
    retriever,
    "british_parliament_search",
    "Use this tool tos earch for information about the british parliament, house of lords and house of common and any other related information.",
)

We have created a Wikipedia tool and a retriever (RAG) tool; the next step is adding these tools to an agent.

Creating Tool Calling Agent

First, we will create a list containing all our tools. Next, we will define the prompt that we will use to call our agent. I used a built-in prompt from LangSmith, which you can see in the output of the script below.


tools = [WikipediaSearch, BritishParliamentSearch]
# Get the prompt to use - you can modify this!
prompt = hub.pull("hwchase17/openai-functions-agent")
prompt.messages

Output:

image3.png

You can use your own prompt if you want; a sketch of a custom prompt follows.
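
If you write your own prompt, it must include an agent_scratchpad placeholder, which the agent uses to store its intermediate tool calls. A minimal sketch, where the system message text is just an example:


from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

custom_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Use the available tools to answer questions."),
    MessagesPlaceholder("chat_history", optional=True),  # filled in when chat history is supplied
    ("user", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),  # required by tool-calling agents
])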

We also need to define the LLM we will use with the agent. We will use OpenAI GPT-4o in our script, but you can use any other LLM available in LangChain.


llm = ChatOpenAI(model="gpt-4o",
                 temperature=0,
                 api_key=openai_api_key,
                )

Next, we will create a tool-calling agent that generates responses using the tools, LLM, and the prompt we just defined.

Finally, to execute an agent, we need to define our agent executor, which returns the agent's response to the user when invoked via the invoke() method.

In the script below, we ask our agent a question about the British parliament.


agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose = True)
response = agent_executor.invoke({"input": "How many members are there in the House of Common?"})
print(response)

Output:

image4.png

As you can see from the above output, the agent invoked the british_parliament_search tool to generate a response.

Let's ask another question about the President of the United States. Since this information is not available in the document that the RAG tool uses, the agent will call the WikipediaSearch tool to generate the response to this query.


response = agent_executor.invoke({"input": "Who is the current president of United States?"})

Output:

image5.png

Finally, if you want to return only the response without any additional information, you can use the output key of the response, as shown below:


print(response["output"])

Output:


The current President of the United States is Joe Biden. He is the 46th president and assumed office on January 20, 2021.

As a last step, I will show you how to add memory to your agents so that they remember previous conversations with the user.

Adding Memory to Agent Executor

We will first create an AgentExecutor object as we did previously, but this time, we will set verbose = False since we are not interested in seeing the agent's internal workings.

Next, we will create an object of the ChatMessageHistory() class to save past conversations.

Finally, we will create an object of the RunnableWithMessageHistory() class and pass to it the agent executor and message history objects. We also pass the keys for the user input and chat history.


agent_executor = AgentExecutor(agent=agent, tools=tools, verbose = False)

message_history = ChatMessageHistory()

agent_with_chat_history = RunnableWithMessageHistory(
    agent_executor,
    # This is needed because in most real world scenarios, a session id is needed
    # It isn't really used here because we are using a simple in memory ChatMessageHistory
    lambda session_id: message_history,
    input_messages_key="input",
    history_messages_key="chat_history",
)

Next, we will define a generate_response() function that accepts a user query as a parameter and invokes the RunnableWithMessageHistory object. In this case, you also need to pass a session ID, which points to the past conversation. You can use multiple session IDs if you want to maintain multiple conversations (a sketch of this follows the next script).


def generate_response(query):
    response = agent_with_chat_history.invoke(
    {"input": query},
    config={"configurable": {"session_id": "<foo>"}}
    )

    return response
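
If you want to maintain several independent conversations, you can keep one ChatMessageHistory per session ID instead of a single shared object. A minimal sketch of this idea (the histories dictionary and multi_session_agent name are just an illustration, an alternative to the single-history setup above):


histories = {}

def get_history(session_id):
    # Create a new history the first time a session ID is seen, then reuse it
    if session_id not in histories:
        histories[session_id] = ChatMessageHistory()
    return histories[session_id]

multi_session_agent = RunnableWithMessageHistory(
    agent_executor,
    get_history,
    input_messages_key="input",
    history_messages_key="chat_history",
)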

Let's test the generate_response() function by asking a question about the British parliament.


query = "What is the difference between the house of lords and the house of commons?"
response = generate_response(query)
print(response["output"])

Output:

image6.png

Next, we will ask a question about the US president.


query = "Who is the current President of America?"
response = generate_response(query)
print(response["output"])

Output:


The current President of the United States is Joe Biden. He assumed office on January 20, 2021, and is the 46th president of the United States.

Next, we will ask only "And of France?" Since the agent remembers the past conversation, it will figure out that the user wants to know about the current French president.


query = "And of France?"
response = generate_response(query)
print(response["output"])

Output:


The current President of France is Emmanuel Macron. He has been in office since May 14, 2017, and was re-elected for a second term in 2022.

Conclusion

Retrieval augmented generation allows you to answer questions using documents from a vector database. However, you may need to fetch information from external sources. This is where tools and agents come into play.

In this article, you saw how to enhance the functionalities of your RAG systems using tools and agents in LangChain. I encourage you to incorporate other tools and agents into your RAG systems to build amazing LLM products.

How to Fine-tune the OpenAI GPT-4o Model – The Wait is Finally Over

On August 20, 2024, OpenAI enabled GPT-4o fine-tuning in the OpenAI playground and the OpenAI API. The much-awaited feature is free for fine-tuning 1 million daily tokens until September 23, 2024.

In this article, I will show you how to fine-tune the OpenAI GPT-4o model for text classification and summarization tasks.

It is important to note that in my previous articles, I have already demonstrated the results obtained for zero-shot text classification and zero-shot text summarization using the default GPT-4o model. In this article, you will see that fine-tuning a GPT-4o model improves its text classification and text summarization performance significantly.

So, let's begin without further ado.

Installing and Importing Required Libraries

The following script installs the Python libraries you need to run the code in this article.


!pip install openai
!pip install rouge-score
!pip install --upgrade openpyxl
!pip install pandas openpyxl

The script below imports the required libraries into your Python application.


import os
import json
import time
import pandas as pd
from rouge_score import rouge_scorer
from sklearn.metrics import accuracy_score
from openai import OpenAI

Fine-tuning GPT-4o for Text Classification

In a previous article, I explained the process of fine-tuning GPT-4o mini and GPT-3.5 turbo models for zero-shot text classification.

The process remains the same for fine-tuning GPT-4o.
We will first import the text classification dataset, which in this article is the Twitter US Airline Sentiment Dataset.

The following script imports the dataset.


dataset = pd.read_csv(r"D:\Datasets\Tweets.csv")
dataset.head()

Output:

image1.png

Next, we will write the preprocess_data() function, which takes a dataset, a start index n, and the number of records per category as parameters. It splits the dataset by sentiment category and, starting at the specified index, returns the requested number of records from each category. This ensures we have an equal number of records for each sentiment category.

We will fetch 600 records for training (200 each for the positive, negative, and neutral categories) and 99 records for testing (33 per category). You can use more records for fine-tuning if you want.




def preprocess_data(dataset, n, records):

    # Remove rows where 'airline_sentiment' or 'text' are NaN
    dataset = dataset.dropna(subset=['airline_sentiment', 'text'])

    # Remove rows where 'airline_sentiment' or 'text' are empty strings
    dataset = dataset[(dataset['airline_sentiment'].str.strip() != '') & (dataset['text'].str.strip() != '')]

    # Filter the DataFrame for each sentiment
    neutral_df = dataset[dataset['airline_sentiment'] == 'neutral']
    positive_df = dataset[dataset['airline_sentiment'] == 'positive']
    negative_df = dataset[dataset['airline_sentiment'] == 'negative']

    # Select records from Nth index
    neutral_sample = neutral_df[n: n +records]
    positive_sample = positive_df[n: n +records]
    negative_sample = negative_df[n: n +records]

    # Concatenate the samples into one DataFrame
    dataset = pd.concat([neutral_sample, positive_sample, negative_sample])

    # Reset index if needed
    dataset.reset_index(drop=True, inplace=True)

    dataset = dataset[["text", "airline_sentiment"]]

    return dataset

The following script creates training and test sets.


training_data = preprocess_data(dataset, 0, 200)
print("Training data value counts:\n", training_data["airline_sentiment"].value_counts())
print("===========================")
test_data = preprocess_data(dataset, 600, 33)
print("Test data value counts:\n", test_data["airline_sentiment"].value_counts())

Output:

image2.png

Next, we convert our dataset into the JSON format required to fine-tune OpenAI models.


# JSON file path
json_file_path = r"D:\Datasets\airline_sentiments.json"

# Function to create the JSON structure for each row
def create_json_structure(row):
    return {
        "messages": [
            {"role": "system", "content": "You are a Twitter sentiment analysis expert who can predict sentiment expressed in the tweets about an airline. You select sentiment value from positive, negative, or neutral."},
            {"role": "user", "content": row['text']},
            {"role": "assistant", "content": row['airline_sentiment']}
        ]
    }

# Convert DataFrame to JSON structures
json_structures = training_data.apply(create_json_structure, axis=1).tolist()

# Write JSON structures to file, each on a new line
with open(json_file_path, 'w') as f:
    for json_structure in json_structures:
        f.write(json.dumps(json_structure) + '\n')

print(f"Data has been written to {json_file_path}")

To fine-tune the OpenAI model, you need to upload training files to the OpenAI server. To do so, create a client object of the OpenAI class and pass the JSON file to the files.create() method of the client object.


client = OpenAI(
    # This is the default and can be omitted
    api_key = os.environ.get('OPENAI_API_KEY'),
)


training_file = client.files.create(
  file=open(json_file_path, "rb"),
  purpose="fine-tune"
)

Finally, as shown in the script below, you can start fine-tuning using the client.fine_tuning.jobs.create() method. Here, you must pass the GPT-4o model ID gpt-4o-2024-08-06 to the model parameter.


fine_tuning_job_gpt4o = client.fine_tuning.jobs.create(
  training_file=training_file.id,
  model="gpt-4o-2024-08-06"
)

You can see fine-tuning events for your fine-tuning job using the following script:


# List up to 10 events from a fine-tuning job
print(client.fine_tuning.jobs.list_events(fine_tuning_job_id = fine_tuning_job_gpt4o.id,
                                    limit=10))

Once fine-tuning is completed, you will receive an email with the fine-tuned model ID. Alternatively, you can retrieve the fine-tuned model ID using the following script.


ft_model_id = client.fine_tuning.jobs.retrieve(fine_tuning_job_gpt4o.id).fine_tuned_model
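
Note that the fine_tuned_model field is only populated once the job has finished. If you prefer to wait for the job programmatically, you can poll its status; the following is a simple sketch (the 60-second polling interval is arbitrary):


while True:
    job = client.fine_tuning.jobs.retrieve(fine_tuning_job_gpt4o.id)
    print("Job status:", job.status)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)  # wait a minute before checking again

ft_model_id = job.fine_tuned_model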

Once you have the fine-tuned model ID, you can use it like any default OpenAI model. The following script defines the find_sentiment() function, which uses the fine-tuned model ID to predict the sentiments of the tweets in the test set and finally prints the overall fine-tuned model accuracy.


def find_sentiment(client, model, dataset):
    tweets_list = dataset["text"].tolist()

    all_sentiments = []


    i = 0


    while i < len(tweets_list):

        try:
            tweet = tweets_list[i]
            content = """What is the sentiment expressed in the following tweet about an airline?
            Select sentiment value from positive, negative, or neutral. Return only the sentiment value in small letters.
            tweet: {}""".format(tweet)

            response = client.chat.completions.create(
                model=model,
                temperature=0,
                max_tokens=10,
                messages=[
                    {"role": "user", "content": content}
                ]
            )

            sentiment_value = response.choices[0].message.content

            all_sentiments.append(sentiment_value)
            i += 1
            print(i, sentiment_value)

        except Exception as e:
            print("===================")
            print("Exception occurred:", e)

    accuracy = accuracy_score(all_sentiments, dataset["airline_sentiment"])
    print(f"Accuracy: {accuracy}")

find_sentiment(client,ft_model_id, test_data)

Output:

image3.png

The above output shows that the fine-tuned model achieved an accuracy of 92.92%, significantly better than the accuracy achieved via the default GPT-4o model in a previous article.

In the next section, you will see how to fine-tune GPT-4o for text summarization.

Fine-tuning GPT-4o for Text Summarization

We will use the News Articles with Summary dataset to fine-tune the GPT-4o model.

The following script imports the dataset.


dataset = pd.read_excel(r"D:\Datasets\dataset.xlsx")
dataset = dataset.sample(frac=1)
dataset['summary_length'] = dataset['human_summary'].apply(len)
average_length = dataset['summary_length'].mean()
print(f"Average length of summaries: {average_length:.2f} characters")
print(dataset.shape)
dataset.head()

Output:

image4.png

The rest of the process remains the same as text classification. We will filter a subset of data for fine-tuning (in this case, records 101 to 200) and convert the dataset into OpenAI-compliant JSON format.


selected_data = dataset.iloc[101:201]

# Function to create the JSON structure for each row
def create_json_structure(row):
    return {
        "messages": [
            {"role": "system", "content": "You are analyzing news articles. Use the provided content to generate a concise summary."},
            {"role": "user", "content": row['content']},
            {"role": "assistant", "content": row['human_summary']}
        ]
    }

# Convert selected DataFrame rows to JSON structures
json_structures = selected_data.apply(create_json_structure, axis=1).tolist()

# JSON file path
json_file_path = r"D:\Datasets\news_summaries.json"

# Write JSON structures to file, each on a new line
with open(json_file_path, 'w') as f:
    for json_structure in json_structures:
        f.write(json.dumps(json_structure) + '\n')

print(f"Data has been written to {json_file_path}")

Next, upload the training file to OpenAI servers.


training_file = client.files.create(
  file=open(json_file_path, "rb"),
  purpose="fine-tune"
)

Finally, you can start fine-tuning using the following script.


fine_tuning_job_gpt4o_ts = client.fine_tuning.jobs.create(
  training_file=training_file.id,
  model="gpt-4o-2024-08-06"
)

Once the model is fine-tuned, retrieve the model ID using the following script.


ft_model_id = client.fine_tuning.jobs.retrieve(fine_tuning_job_gpt4o_ts.id).fine_tuned_model

We will use ROUGE scores to evaluate the text summarization performance of the fine-tuned model. The following script defines the calculate_rouge() function, which calculates the ROUGE-1, ROUGE-2, and ROUGE-L scores.


# Function to calculate ROUGE scores
def calculate_rouge(reference, candidate):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    return {key: value.fmeasure for key, value in scores.items()}

Finally, the following script demonstrates how we generate the summaries of the first 20 articles in our dataset using the fine-tuned model.



%%time

results = []

i = 0

for _, row in dataset[:20].iterrows():
    article = row['content']
    human_summary = row['human_summary']

    i = i + 1
    print(f"Summarizing article {i}.")

    prompt = f"Summarize the following article in 1150 characters. The summary should look like human created:\n\n{article}\n\nSummary:"

    response = client.chat.completions.create(
        model= ft_model_id,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1150,
        temperature=0.7
    )
    generated_summary = response.choices[0].message.content
    rouge_scores = calculate_rouge(human_summary, generated_summary)

    results.append({
    'article_id': row.id,
    'generated_summary': generated_summary,
    'rouge1': rouge_scores['rouge1'],
    'rouge2': rouge_scores['rouge2'],
    'rougeL': rouge_scores['rougeL']
    })

The following script prints average ROUGE scores.


results_df = pd.DataFrame(results)
mean_values = results_df[["rouge1", "rouge2", "rougeL"]].mean()
print(mean_values)

Output:


rouge1    0.579758
rouge2    0.417515
rougeL    0.431266
dtype: float64

The above output shows that the fine-tuned GPT-4o model achieves significantly higher ROUGE scores than the default GPT-4o model.

Conclusion

Fine-tuning can significantly improve a model's performance on a specific task. This article explains how to fine-tune the OpenAI GPT-4o model for text classification and text summarization. The results show that the fine-tuned GPT-4o model significantly outperforms the default GPT-4o model on both tasks.

GPT-4o Snapshot vs Meta Llama 3.1 70b for Zero-Shot Text Summarization

In a previous article, I compared GPT-4o mini vs. GPT-4o and GPT-3.5 Turbo for zero-shot text summarization. The results showed that GPT-4o mini achieves comparable performance at a much lower price than the other models.

In this article, I will compare Meta Llama 3.1 70b with the OpenAI GPT-4o snapshot for zero-shot text summarization. The Meta Llama 3.1 series consists of Meta's state-of-the-art LLMs, including Llama 3.1 8b, Llama 3.1 70b, and Llama 3.1 405b. On the other hand, [OpenAI GPT-4o](https://platform.openai.com/docs/models) snapshot is OpenAI's latest LLM. We will use the Groq API to access Meta Llama 3.1 70b and the OpenAI API to access the GPT-4o snapshot model.

So, let's begin without further ado.

Installing and Importing Required Libraries

The following script installs the Python libraries you will need to run scripts in this article.


!pip install openai
!pip install groq
!pip install rouge-score
!pip install --upgrade openpyxl
!pip install pandas openpyxl

The script below imports the required libraries into your Python application.


import os
import time
import pandas as pd
from rouge_score import rouge_scorer
from openai import OpenAI
from groq import Groq

Importing the Dataset

This article will summarize the text in the News Articles with Summary dataset. The dataset consists of article content and human-generated summaries.

The following script imports the dataset (an Excel file) into a Pandas DataFrame.


# Dataset download link
# https://github.com/reddzzz/DataScience_FP/blob/main/dataset.xlsx


dataset = pd.read_excel(r"D:\Datasets\dataset.xlsx")
dataset = dataset.sample(frac=1)
dataset['summary_length'] = dataset['human_summary'].apply(len)
average_length = dataset['summary_length'].mean()
print(f"Average length of summaries: {average_length:.2f} characters")
print(dataset.shape)
dataset.head()

Output:

image1.png

The content column stores the article's text, and the human_summary column contains the corresponding human-generated summaries.

We also calculate the average number of characters in the human-generated summaries, which we will use to generate summaries via the LLM models.

Text Summarization with GPT-4o Snapshot

We are now ready to summarize articles using GPT-4o snapshot and Llama 3.1 70b.

First, we'll create an instance of the OpenAI class, which we'll use to interact with various OpenAI language models. When initializing this object, you must provide your OpenAI API Key.

Additionally, we'll define the calculate_rouge() function, which computes the ROUGE-1, ROUGE-2, and ROUGE-L scores by comparing the LLM-generated summaries with the human-generated ones.

ROUGE scores are used to evaluate the quality of machine-generated text, such as summaries, by comparing them with human-generated text. ROUGE-1 evaluates the overlap of unigrams (single words), ROUGE-2 considers bigrams (pairs of consecutive words), and ROUGE-L focuses on the longest common subsequence between the two texts.
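
To make these scores concrete, here is a small standalone check using the rouge-score package installed earlier; the two sentences are made up purely for illustration:


from rouge_score import rouge_scorer

example_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
example_scores = example_scorer.score("the cat sat on the mat", "the cat is sitting on the mat")

# Each score holds precision, recall, and F-measure; we typically report the F-measure
for name, score in example_scores.items():
    print(name, round(score.fmeasure, 3))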


client = OpenAI(
    # This is the default and can be omitted
    api_key = os.environ.get('OPENAI_API_KEY'),
)

# Function to calculate ROUGE scores
def calculate_rouge(reference, candidate):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    return {key: value.fmeasure for key, value in scores.items()}

Next, we will iterate through the first 20 articles in the dataset and call the GPT-4o snapshot model to produce a summary of the article with a target length of 1150 characters. We will use 1150 characters because the average length of the human-generated summaries is 1168 characters. Next, the LLM-generated and human-generated summaries are passed to the calculate_rouge() function, which returns ROUGE scores for the LLM-generated summaries. These ROUGE scores, along with the generated summaries, are stored in the results list.


%%time

results = []

i = 0

for _, row in dataset[:20].iterrows():
    article = row['content']
    human_summary = row['human_summary']

    i = i + 1
    print(f"Summarizing article {i}.")

    prompt = f"Summarize the following article in 1150 characters. The summary should look like human created:\n\n{article}\n\nSummary:"

    response = client.chat.completions.create(
        model= "gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1150,
        temperature=0.7
    )
    generated_summary = response.choices[0].message.content
    rouge_scores = calculate_rouge(human_summary, generated_summary)

    results.append({
    'article_id': row.id,
    'generated_summary': generated_summary,
    'rouge1': rouge_scores['rouge1'],
    'rouge2': rouge_scores['rouge2'],
    'rougeL': rouge_scores['rougeL']
    })

Output:

image2.png

The above output shows that it took 59 seconds to summarize 20 articles.

Next, we convert the results list into a results_df dataframe and display the average ROUGE scores for 20 articles.


results_df = pd.DataFrame(results)
mean_values = results_df[["rouge1", "rouge2", "rougeL"]].mean()
print(mean_values)

Output:


rouge1    0.386724
rouge2    0.100371
rougeL    0.187491
dtype: float64

The above results show that the ROUGE scores obtained by the GPT-4o snapshot are slightly lower than those obtained by the GPT-4o model in the previous article.

Let's also evaluate the GPT-4o summaries using another LLM, in this case GPT-4o mini.

In the following script, we define the llm_evaluate_summary() function, which accepts the original article and LLM-generated summary and evaluates it on the completeness, conciseness, and coherence criteria.


def llm_evaluate_summary(article, summary):
    prompt = f"""Evaluate the following summary for the given article. Rate it on a scale of 1-10 for:
    1. Completeness: Does it capture all key points?
    2. Conciseness: Is it brief and to the point?
    3. Coherence: Is it well-structured and easy to understand?

    Article: {article}

    Summary: {summary}

    Provide the ratings as a comma-separated list (completeness,conciseness,coherence).
    """
    response = client.chat.completions.create(
        model= "gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        temperature=0.7
    )
    return [float(score) for score in response.choices[0].message.content.strip().split(',')]

We iterate through the first 20 articles and pass the content and LLM-generated summaries to the llm_evaluate_summary() function.


scores_dict = {'completeness': [], 'conciseness': [], 'coherence': []}

i = 0
for _, row in results_df.iterrows():
    i = i + 1
    # Corrected method to access content by article_id
    article = dataset.loc[dataset['id'] == row['article_id'], 'content'].iloc[0]
    scores = llm_evaluate_summary(article, row['generated_summary'])
    print(f"Article ID: {row['article_id']}, Scores: {scores}")

    # Store the scores in the dictionary
    scores_dict['completeness'].append(scores[0])
    scores_dict['conciseness'].append(scores[1])
    scores_dict['coherence'].append(scores[2])

Finally, the script below calculates and displays the average scores for completeness, conciseness, and coherence for GPT-4o snapshot summaries.


# Calculate the average scores
average_scores = {
    'completeness': sum(scores_dict['completeness']) / len(scores_dict['completeness']),
    'conciseness': sum(scores_dict['conciseness']) / len(scores_dict['conciseness']),
    'coherence': sum(scores_dict['coherence']) / len(scores_dict['coherence']),
}

# Convert to DataFrame for better visualization (optional)
average_scores_df = pd.DataFrame([average_scores])
average_scores_df.columns = ['Completeness', 'Conciseness', 'Coherence']

# Display the DataFrame
average_scores_df.head()

Output:

image3.png

Text Summarization with Llama 3.1 70b

In this section, we will perform a zero-shot text summarization of the same set of articles using the Llama 3.1 70b model.

You should try the Llama 3.1 405b model to get better results. However, at the time of writing this article, Groq Cloud had suspended the API calls for Llama 405b due to excessive demand. You can also try other cloud providers to run Llama 3.1 405b.

The process remains the same for text summarization using Meta Llama 3.1 70b. The only difference is that we will create an object of the Groq client in this case.


client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)

Next, we will iterate through the first 20 articles in the dataset, generate their summaries using the Llama 3.1 70b model, calculate ROUGE scores, and store the results in the results list.


%%time

results = []

i = 0

for _, row in dataset[:20].iterrows():
    article = row['content']
    human_summary = row['human_summary']

    i = i + 1
    print(f"Summarizing article {i}.")

    prompt = f"Summarize the following article in 1150 characters. The summary should look like human created:\n\n{article}\n\nSummary:"

    response = client.chat.completions.create(
          model="llama-3.1-70b-versatile",
          temperature = 0.7,
          max_tokens = 1150,
          messages=[
                {"role": "user", "content":  prompt}
            ]
    )

    generated_summary = response.choices[0].message.content
    rouge_scores = calculate_rouge(human_summary, generated_summary)

    results.append({
    'article_id': row.id,
    'generated_summary': generated_summary,
    'rouge1': rouge_scores['rouge1'],
    'rouge2': rouge_scores['rouge2'],
    'rougeL': rouge_scores['rougeL']
    })

Output:

image4.png

The above output shows that it only took 24 seconds to process 20 articles using Llama 3.1 70b. The faster processing is because Llama 3.1 70b is a smaller model than the GPT-4o snapshot. Also, Groq uses LPUs (language processing units), which are much faster for LLM inference.

Next, we will convert the results list into the results_df dataframe and display the average ROUGE scores.


results_df = pd.DataFrame(results)
mean_values = results_df[["rouge1", "rouge2", "rougeL"]].mean()
print(mean_values)

Output:


rouge1    0.335863
rouge2    0.080865
rougeL    0.170834
dtype: float64

The above output shows that the ROUGE scores for Meta Llama 3.1 70b are lower than those of the GPT-4o snapshot model. I would again stress that you should use Llama 3.1 405b to get better results.

Finally, we will evaluate the summaries generated via Llama 3.1 70b using the GPT-4o mini model for completeness, conciseness, and coherence.


client = OpenAI(
    # This is the default and can be omitted
    api_key = os.environ.get('OPENAI_API_KEY'),
)

scores_dict = {'completeness': [], 'conciseness': [], 'coherence': []}

i = 0
for _, row in results_df.iterrows():
    i = i + 1
    # Corrected method to access content by article_id
    article = dataset.loc[dataset['id'] == row['article_id'], 'content'].iloc[0]
    scores = llm_evaluate_summary(article, row['generated_summary'])
    print(f"Article ID: {row['article_id']}, Scores: {scores}")

    # Store the scores in the dictionary
    scores_dict['completeness'].append(scores[0])
    scores_dict['conciseness'].append(scores[1])
    scores_dict['coherence'].append(scores[2])

# Calculate the average scores
average_scores = {
    'completeness': sum(scores_dict['completeness']) / len(scores_dict['completeness']),
    'conciseness': sum(scores_dict['conciseness']) / len(scores_dict['conciseness']),
    'coherence': sum(scores_dict['coherence']) / len(scores_dict['coherence']),
}

# Convert to DataFrame for better visualization (optional)
average_scores_df = pd.DataFrame([average_scores])
average_scores_df.columns = ['Completeness', 'Conciseness', 'Coherence']

# Display the DataFrame
average_scores_df.head()

Output:

image5.png

The above output shows that Llama 3.1 70b achieves performance similar to the GPT-4o snapshot for text summarization when evaluated using a third LLM.

Conclusion

The Meta Llama 3.1 series models are state-of-the-art open-source models. This article shows that Meta Llama 3.1 70b performs very similarly to the GPT-4o snapshot for zero-shot text summarization. I suggest you try the Llama 3.1 405b model on Groq to see if you can get better results than GPT-4o.

Comparison of Fine-tuning GPT-4o mini vs GPT-3.5 for Text Classification

In my previous articles, I compared the OpenAI GPT-4o mini model with the GPT-4o and GPT-3.5 turbo models for zero-shot text classification. The results showed that GPT-4o mini, while significantly cheaper than its counterparts, achieves comparable performance.

On 8 August 2024, OpenAI enabled GPT-4o mini fine-tuning for developers across usage tiers 1-5. You can now fine-tune GPT-4o mini for free until 23 September 2024, with a daily token limit of 2 million.

In this article, I will show you how to fine-tune the GPT-4o mini for text classification tasks and compare it to the fine-tuned GPT-3.5 turbo.

So, let's begin without further ado.

Importing and Installing Required Libraries

The following script installs the OpenAI Python library you can use to make calls to the OpenAI API.


!pip install openai

The script below imports the required libraries into your Python application.


from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from openai import OpenAI
import pandas as pd
import os
import json

Importing the Dataset

We will use the Twitter US Airline Sentiment dataset for fine-tuning the GPT-4o mini and GPT-3.5 turbo models.

The following script imports the dataset and defines the preprocess_data() function. This function takes a dataset and an index value as inputs. It then splits the dataset by sentiment category, returning 34 neutral, 33 positive, and 33 negative tweets, beginning at the specified index. This approach ensures we have roughly 100 balanced records. You can use more records for fine-tuning if you want.



dataset = pd.read_csv(r"D:\Datasets\Tweets.csv")

def preprocess_data(dataset, n):

    # Remove rows where 'airline_sentiment' or 'text' are NaN
    dataset = dataset.dropna(subset=['airline_sentiment', 'text'])

    # Remove rows where 'airline_sentiment' or 'text' are empty strings
    dataset = dataset[(dataset['airline_sentiment'].str.strip() != '') & (dataset['text'].str.strip() != '')]

    # Filter the DataFrame for each sentiment
    neutral_df = dataset[dataset['airline_sentiment'] == 'neutral']
    positive_df = dataset[dataset['airline_sentiment'] == 'positive']
    negative_df = dataset[dataset['airline_sentiment'] == 'negative']

    # Select records from Nth index
    neutral_sample = neutral_df[n: n +34]
    positive_sample = positive_df[n: n +33]
    negative_sample = negative_df[n: n +33]

    # Concatenate the samples into one DataFrame
    dataset = pd.concat([neutral_sample, positive_sample, negative_sample])

    # Reset index if needed
    dataset.reset_index(drop=True, inplace=True)

    dataset = dataset[["text", "airline_sentiment"]]

    return dataset

The following script creates a balanced training dataset.


training_data = preprocess_data(dataset, 0)
print(training_data["airline_sentiment"].value_counts())
training_data.head()

Output:

image1.png

Similarly, the script below creates a test dataset.


test_data = preprocess_data(dataset, 100)
print(test_data["airline_sentiment"].value_counts())
test_data.head()

Output:

image2.png

Converting Training Data to JSON Format for OpenAI Model Fine-tuning

To fine-tune an OpenAI model, you need to transform the training data into JSON format, as outlined in the OpenAI official documentation. To achieve this, I have written a straightforward function that converts the input Pandas DataFrame into the required JSON structure.

The following script converts the training data into the OpenAI-compliant JSON format for fine-tuning. Fine-tuning relies significantly on the content specified for the system role, so pay special attention when setting this value.


# JSON file path
json_file_path = r"D:\Datasets\airline_sentiments.json"

# Function to create the JSON structure for each row
def create_json_structure(row):
    return {
        "messages": [
            {"role": "system", "content": "You are a Twitter sentiment analysis expert who can predict sentiment expressed in the tweets about an airline. You select sentiment value from positive, negative, or neutral."},
            {"role": "user", "content": row['text']},
            {"role": "assistant", "content": row['airline_sentiment']}
        ]
    }

# Convert DataFrame to JSON structures
json_structures = training_data.apply(create_json_structure, axis=1).tolist()

# Write JSON structures to file, each on a new line
with open(json_file_path, 'w') as f:
    for json_structure in json_structures:
        f.write(json.dumps(json_structure) + '\n')

print(f"Data has been written to {json_file_path}")

Output:


Data has been written to D:\Datasets\airline_sentiments.json

The next step is to upload your JSON file to the OpenAI server. To do so, start by creating an OpenAI client object. Then, call the files.create() method, passing the file path as an argument, as demonstrated in the following script:

client = OpenAI(
    # This is the default and can be omitted
    api_key = os.environ.get('OPENAI_API_KEY'),
)


training_file = client.files.create(
  file=open(json_file_path, "rb"),
  purpose="fine-tune"
)

print(training_file.id)

Once the file is uploaded, you will receive a file ID, as the above script demonstrates. You will use this file ID to fine-tune your OpenAI model.

Fine-Tuning GPT-4o Mini for Text Classification

To start fine-tuning, you must call the fine_tuning.jobs.create() method and pass it the ID of the uploaded training file and the model name. The current model name for GPT-4o mini is gpt-4o-mini-2024-07-18.


fine_tuning_job_gpt4o_mini = client.fine_tuning.jobs.create(
  training_file=training_file.id,
  model="gpt-4o-mini-2024-07-18"
)

Executing the above script initiates the fine-tuning process. The following script allows you to monitor and display various fine-tuning events.


# List up to 10 events from a fine-tuning job
print(client.fine_tuning.jobs.list_events(fine_tuning_job_id = fine_tuning_job_gpt4o_mini.id,
                                    limit=10))

Once fine-tuning is complete, you will receive an email containing the ID of your fine-tuned model, which you can use to make inferences. Alternatively, you can retrieve the ID of your fine-tuned model by running the following script.


ft_model_id = client.fine_tuning.jobs.retrieve(fine_tuning_job_gpt4o_mini.id).fine_tuned_model

The remainder of the process follows the same steps as outlined in a previous article. We will define the find_sentiment() function and pass it our fine-tuned model and the test set to predict the sentiment of the tweets in the dataset.

Finally, we compute the model's accuracy by comparing the actual and predicted sentiments of the tweets.


def find_sentiment(client, model, dataset):
    tweets_list = dataset["text"].tolist()

    all_sentiments = []


    i = 0


    while i < len(tweets_list):

        try:
            tweet = tweets_list[i]
            content = """What is the sentiment expressed in the following tweet about an airline?
            Select sentiment value from positive, negative, or neutral. Return only the sentiment value in small letters.
            tweet: {}""".format(tweet)

            response = client.chat.completions.create(
                model=model,
                temperature=0,
                max_tokens=10,
                messages=[
                    {"role": "user", "content": content}
                ]
            )

            sentiment_value = response.choices[0].message.content

            all_sentiments.append(sentiment_value)
            i += 1
            print(i, sentiment_value)

        except Exception as e:
            print("===================")
            print("Exception occurred:", e)

    accuracy = accuracy_score(all_sentiments, dataset["airline_sentiment"])
    print(f"Accuracy: {accuracy}")

find_sentiment(client,ft_model_id, test_data)

Output:


Accuracy: 0.78

The above output shows that the fine-tuned GPT-4o mini achieves a performance accuracy of 78% on the test set.

Fine-Tuning GPT-3.5 Turbo for Text Classification

For comparison, we will also fine-tune the GPT-3.5 turbo model for text classification.

The fine-tuning process remains the same as for the GPT-4o mini. We will pass the training file ID and the GPT-3.5 turbo model ID to the client.fine_tuning.jobs.create() method, as shown below.


fine_tuning_job_gpt_3_5 = client.fine_tuning.jobs.create(
  training_file=training_file.id,
  model="gpt-3.5-turbo"
)

Next, we will pass the fine-tuned GPT-3.5 model ID and the test dataset to the find_sentiment() function to evaluate the model's performance on the test set.


ft_model_id = client.fine_tuning.jobs.retrieve(fine_tuning_job_gpt_3_5.id).fine_tuned_model
find_sentiment(client,ft_model_id, test_data)

Output:


Accuracy: 0.82

The above output shows that the GPT-3.5 turbo model achieves 82% performance accuracy, 4% higher than the GPT-4o mini model.

Conclusion

GPT-4o mini is a cheaper and faster alternative to GPT-3.5. My last article showed that it achieves higher performance for zero-shot text classification than the GPT-3.5 turbo model.

However, based on the results presented in this article, a fine-tuned GPT-3.5 turbo model is still better than a fine-tuned GPT-4o mini.

Feel free to share your feedback in the comments section.

GPT-4o mini – A Cheaper and Faster Alternative to GPT-4o

On July 18th, 2024, OpenAI released GPT-4o mini, their most cost-efficient small model. GPT-4o mini is around 60% cheaper than GPT-3.5 Turbo and around 97% cheaper than GPT-4o. As per OpenAI, GPT-4o mini outperforms GPT-3.5 Turbo on almost all benchmarks while being cheaper.

In this article, we will compare the cost, performance, and latency of GPT-4o mini with GPT-3.5 turbo and GPT-4o. We will perform a zero-shot tweet sentiment classification task to compare the models. By the end of this article, you will find out which of the three models is better for your use cases. So, let's begin without further ado.

Importing and Installing Required Libraries

As a first step, we will install and import the required libraries.

Run the following script to install the OpenAI library.


!pip install openai

The following script imports the required libraries into your application.


import os
import time
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from openai import OpenAI

Importing and Preprocessing the Dataset

To compare the models, we will perform zero-shot classification on the Twitter US Airline Sentiment dataset, which you can download from Kaggle.

The following script imports the dataset from a CSV file into a Pandas dataframe.


## Dataset download link
## https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment?select=Tweets.csv

dataset = pd.read_csv(r"D:\Datasets\tweets.csv")
print(dataset.shape)
dataset.head()

Output:

image1.png

The dataset contains more than 14 thousand records. However, we will randomly select 100 records. Of these, 34, 33, and 33 will have neutral, positive, and negative sentiments, respectively.

The following script selects the 100 tweets.


# Remove rows where 'airline_sentiment' or 'text' are NaN
dataset = dataset.dropna(subset=['airline_sentiment', 'text'])

# Remove rows where 'airline_sentiment' or 'text' are empty strings
dataset = dataset[(dataset['airline_sentiment'].str.strip() != '') & (dataset['text'].str.strip() != '')]

# Filter the DataFrame for each sentiment
neutral_df = dataset[dataset['airline_sentiment'] == 'neutral']
positive_df = dataset[dataset['airline_sentiment'] == 'positive']
negative_df = dataset[dataset['airline_sentiment'] == 'negative']

# Randomly sample records from each sentiment
neutral_sample = neutral_df.sample(n=34)
positive_sample = positive_df.sample(n=33)
negative_sample = negative_df.sample(n=33)

# Concatenate the samples into one DataFrame
dataset = pd.concat([neutral_sample, positive_sample, negative_sample])

# Reset index if needed
dataset.reset_index(drop=True, inplace=True)

# print value counts
print(dataset["airline_sentiment"].value_counts())

Output:

image2.png

Let's find out the average number of characters per tweet in these 100 tweets.


dataset['tweet_length'] = dataset['text'].apply(len)
average_length = dataset['tweet_length'].mean()
print(f"Average length of tweets: {average_length:.2f} characters")

Output:

Average length of tweets: 103.63 characters

Next, we will perform zero-shot classification of these tweets using GPT-4o mini, GPT-3.5 Turbo, and GPT-4o models.

Comparing GPT-4o mini with GPT-3.5 Turbo and GPT-4o

We will define the find_sentiment() function, which takes the OpenAI client object, model name, prices per input and output token for the model, and the dataset.

The find_sentiment() function will iterate through all the tweets in the dataset and perform the following tasks:

  • predict their sentiment using the specified model.
  • calculate the number of input and output tokens for the request.
  • calculate the total price to process all the tweets using the total input and output tokens.
  • calculate the average latency of all API calls.
  • calculate the model accuracy by comparing the actual and predicted sentiments.

Here is the code for the find_sentiment() function.


def find_sentiment(client, model, prompt_token_price, completion_token_price, dataset):
    tweets_list = dataset["text"].tolist()

    all_sentiments = []
    prompt_tokens = 0
    completion_tokens = 0

    i = 0
    exceptions = 0
    total_latency = 0

    while i < len(tweets_list):

        try:
            tweet = tweets_list[i]
            content = """What is the sentiment expressed in the following tweet about an airline?
            Select sentiment value from positive, negative, or neutral. Return only the sentiment value in small letters.
            tweet: {}""".format(tweet)

            # Record the start time before making the API call
            start_time = time.time()

            response = client.chat.completions.create(
                model=model,
                temperature=0,
                max_tokens=10,
                messages=[
                    {"role": "user", "content": content}
                ]
            )

            # Record the end time after receiving the response
            end_time = time.time()

            # Calculate the latency for this API call
            latency = end_time - start_time
            total_latency += latency

            sentiment_value = response.choices[0].message.content
            prompt_tokens += response.usage.prompt_tokens
            completion_tokens += response.usage.completion_tokens

            all_sentiments.append(sentiment_value)
            i += 1
            print(i, sentiment_value)

        except Exception as e:
            print("===================")
            print("Exception occurred:", e)
            exceptions += 1

    total_price = (prompt_tokens * prompt_token_price) + (completion_tokens * completion_token_price)
    average_latency = total_latency / len(tweets_list) if tweets_list else 0

    print(f"Total exception count: {exceptions}")
    print(f"Total price: ${total_price:.8f}")
    print(f"Average API latency: {average_latency:.4f} seconds")
    accuracy = accuracy_score(all_sentiments, dataset["airline_sentiment"])
    print(f"Accuracy: {accuracy}")
Results with GPT-4o Mini

First, let's call the find_sentiment() function using the GPT-4o mini model. GPT-4o mini costs 15 cents per million input tokens and 60 cents per million output tokens.


client = OpenAI(
    # This is the default and can be omitted
    api_key = os.environ.get('OPENAI_API_KEY'),
)
model = "gpt-4o-mini"
input_token_price = 0.150/1_000_000
output_token_price = 0.600/1_000_000

find_sentiment(client, model, input_token_price, output_token_price, dataset)

Output:


Total exception count: 0
Total price: $0.00111945
Average API latency: 0.5097 seconds
Accuracy: 0.8

The above output shows that GPT-4o mini costs $0.0011 to process 100 tweets of around 103 characters each. The average latency for an API call was 0.5097 seconds. Finally, the model achieved an accuracy of 80% on the 100 tweets.

Results with GPT-3.5 Turbo

Let's perform the same test with GPT-3.5 Turbo.


client = OpenAI(
    # This is the default and can be omitted
    api_key = os.environ.get('OPENAI_API_KEY'),
)
model = "gpt-3.5-turbo"
input_token_price = 0.50/1_000_000
output_token_price = 1.50/1_000_000

find_sentiment(client, model, input_token_price, output_token_price, dataset)

Output:


Total exception count: 0
Total price: $0.00370600
Average API latency: 0.4991 seconds
Accuracy: 0.72

The output shows that GPT-3.5 turbo cost over three times as much as GPT-4o mini for predicting the sentiment of 100 tweets. Its latency is similar to that of GPT-4o mini. Finally, its accuracy (72%) is considerably lower than the 80% achieved by GPT-4o mini.

Results with GPT-4o

Finally, we can perform the zero-shot sentiment classification with the state-of-the-art GPT-4o.


client = OpenAI(
    # This is the default and can be omitted
    api_key = os.environ.get('OPENAI_API_KEY'),
)
model = "gpt-4o"
input_token_price = 5.00/1_000_000
output_token_price = 15/1_000_000

find_sentiment(client, model, input_token_price, output_token_price, dataset)

Output:


Total exception count: 0
Total price: $0.03681500
Average API latency: 0.5602 seconds
Accuracy: 0.82

The output shows that GPT-4o has slightly higher latency than GPT-4o mini and GPT-3.5 turbo. In terms of performance, GPT-4o achieved 82% accuracy, 2 percentage points higher than GPT-4o mini. However, GPT-4o is more than 30 times as expensive as GPT-4o mini.

Is a 2-percentage-point accuracy gain worth a more than 30-fold price difference? Let me know in the comments.
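
For reference, here is a quick check of that price ratio using the totals reported in the outputs above:


gpt_4o_cost = 0.03681500        # total price reported for GPT-4o
gpt_4o_mini_cost = 0.00111945   # total price reported for GPT-4o mini

# Prints a ratio of roughly 32.9x
print(f"GPT-4o / GPT-4o mini cost ratio: {gpt_4o_cost / gpt_4o_mini_cost:.1f}x")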

Final Verdict

To conclude, the following table summarizes the tests performed in this article.

image3.png

I recommend always preferring GPT-4o mini over GPT-3.5 turbo, as it is both cheaper and more accurate. GPT-4o mini is also a good choice if you need to process huge volumes of text that are not very sensitive and you can compromise a little on accuracy; this can save you a lot of money.
Finally, if you need the best accuracy, I would still go for GPT-4o, even though it costs more than 30 times as much as GPT-4o mini.

Let me know what you think of these results and which model you plan to use.

Image Analysis Using Claude 3.5 Sonnet Model

Featured Imgs 23

In my article on Image Analysis Using OpenAI GPT-4o Model, I explained how the GPT-4o model allows you to analyze images and precisely answer questions related to them.

In this article, I will show you how to analyze images with the Anthropic Claude 3.5 Sonnet model, which has shown state-of-the-art performance for many text and vision problems. I will also share my insights on how Claude 3.5 Sonnet compares with GPT-4o for image analysis tasks. So, let's begin without ado.

Importing Required Libraries

You will need to install the anthropic Python library to access the Claude 3.5 Sonnet model in this article. In addition, you will need the Anthropic API key, which you can obtain here.

The following script installs the Anthropic Python library.


!pip install anthropic

The script below imports all the Python modules you will need to run scripts in this article.


import os
import base64
from IPython.display import display, HTML
from IPython.display import Image
from anthropic import Anthropic
General Image Analysis

Let's first perform a general image analysis. We will analyze the following image and ask Claude 3.5 Sonnet if it shows any potentially dangerous situation.


# image source: https://healthier.stanfordchildrens.org/wp-content/uploads/2021/04/Child-climbing-window-scaled.jpg

image_path = r"D:\Datasets\sofa_kid.jpg"
img = Image(filename=image_path, width=600, height=600)
img

Output:

image1.jpg

Note: For comparison, the images we will analyze in this article are the same as those we analyzed with GPT-4o.

Next, we will define a method that converts an image into Base64 format. The Claude 3.5 Sonnet model expects image inputs to be in Base64 format.

We also define an object of the Anthropic client. We will call the Claude 3.5 Sonnet model using this client object.


def encode_image64(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image = encode_image64(image_path)
image1_media_type = "image/jpeg"

client = Anthropic(api_key = os.environ.get('ANTHROPIC_API_KEY'))
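
All the images in this article are JPEGs, so the media type is hard-coded. If you work with mixed image formats, a small sketch like the following, based on Python's standard mimetypes module, could infer the media type from the file name instead (get_media_type() is a hypothetical helper, not part of the original code):


import mimetypes

def get_media_type(image_path):
    # Guess the MIME type (e.g., image/jpeg or image/png) from the file extension
    media_type, _ = mimetypes.guess_type(image_path)
    return media_type or "image/jpeg"  # fall back to JPEG if the type cannot be guessed

print(get_media_type(image_path))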

We will define a helper function analyze_image() that accepts a text query as a parameter. Inside the function, we call the messages.create() method of the Anthropic client object. We set the model value to claude-3-5-sonnet-20240620, which is the ID of the Claude 3.5 Sonnet model. The temperature is set to 0 since we want a fair comparison with the GPT-4o model. Finally, we set the system prompt and pass the image and the text query in the messages list.

We ask the Claude 3.5 Sonnet model to identify any dangerous situation in the image.


def analyze_image(query):
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        temperature = 0,
        max_tokens=1024,
        system="You are a baby sitter.",
        messages=[
             {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": image1_media_type,
                            "data":  base64_image,
                        },
                    },
                    {
                        "type": "text",
                        "text": query
                    }
                ],
            }
        ],
    )
    return message

response_content = analyze_image("Do you see any dangerous situation in the image? If yes, how to prevent it?")
print(response_content.content[0].text)

Output:

image2.png

The above output shows that the Claude 3.5 Sonnet model has identified a dangerous situation and provided some suggestions.

Compared to GPT-4o, which gave five suggestions, Claude 3.5 Sonnet provided seven suggestions and a more detailed response.

Graph Analysis

Next, we will perform a graph analysis task using Claude 3.5 Sonnet and summarize the following graph.


# image path: https://globaleurope.eu/wp-content/uploads/sites/24/2023/12/Folie2.jpg

image_path = r"D:\Datasets\Folie2.jpg"
img = Image(filename=image_path, width=800, height=800)
img

Output:

image2.jpg


base64_image = encode_image64(image_path)

def analyze_graph(query):
    message = client.messages.create(
        model = "claude-3-5-sonnet-20240620",
        temperature = 0,
        max_tokens = 1024,
        system = "You are an expert graph and visualization expert",
        messages = [
             {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": image1_media_type,
                            "data":  base64_image,
                        },
                    },
                    {
                        "type": "text",
                        "text": query
                    }
                ],
            }
        ],
    )
    return message.content[0].text

response_content = analyze_graph("Can you summarize the graph?")
print(response_content)

Output:

image3.png

The above output shows the graph's summary. Though Claude 3.5 Sonnet is more detailed here than GPT-4o, I found the GPT-4o summary better, as it categorized the countries into high, moderate, and low debt levels.

Next, I asked Claude 3.5 Sonnet to create a table showing countries against their debts.


response_content = analyze_graph("Can you convert the graph to table such as Country -> Debt?")
print(response_content)

Output:

image3b.png

The results obtained with Claude 3.5 Sonnet were astonishingly accurate compared to GPT-4o. For example, GPT-4o showed Estonia as having a debt of 10% of its GDP, whereas Claude 3.5 Sonnet reported 19.2%. If you look at the graph, you will see that Claude 3.5 Sonnet is extremely accurate here.

Claude 3.5 Sonnet is a clear winner for Graph Analysis.

Image Sentiment Prediction

Next, we will predict facial sentiment using Claude 3.5 Sonnet. Here is the sample image.


# image path: https://www.allprodad.com/the-3-happiest-people-in-the-world/

image_path = r"D:\Datasets\happy_men.jpg"
img = Image(filename=image_path, width=800, height=800)
img

Output:

image3.jpg


base64_image = encode_image64(image_path)

def predict_sentiment(query):
    message = client.messages.create(
        model = "claude-3-5-sonnet-20240620",
        temperature = 0,
        max_tokens = 1024,
        system = "You are a helpful psychologist.",
        messages = [
             {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": image1_media_type,
                            "data":  base64_image,
                        },
                    },
                    {
                        "type": "text",
                        "text": query
                    }
                ],
            }
        ],
    )
    return message.content[0].text

response_content = predict_sentiment("Can you predict facial sentiment from the input image?")
print(response_content)

Output:

image3c.png

The above output shows that Claude 3.5 Sonnet provided detailed information about the sentiment expressed in the image. GPT-4o, on the other hand, was more precise.

I will again go with Claude 3.5 Sonnet here as the first choice for sentiment classification.

Analyzing Multiple Images

Finally, let's see how Claude 3.5 Sonnet fares at analyzing multiple images.
We will compare the following two images for sentiment prediction.


from PIL import Image
import matplotlib.pyplot as plt

# image1_path: https://www.allprodad.com/the-3-happiest-people-in-the-world/
# image2_path: https://www.shortform.com/blog/self-care-for-grief/

image_path1 = r"D:\Datasets\happy_men.jpg"
image_path2 = r"D:\Datasets\sad_woman.jpg"


# Open the images using Pillow
img1 = Image.open(image_path1)
img2 = Image.open(image_path2)

# Create a figure to display the images side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 5))

# Display the first image
axes[0].imshow(img1)
axes[0].axis('off')  # Hide axes

# Display the second image
axes[1].imshow(img2)
axes[1].axis('off')  # Hide axes

# Show the plot
plt.tight_layout()
plt.show()

Output:

image4.png


base64_image1 = encode_image64(image_path1)
base64_image2 = encode_image64(image_path2)


def predict_sentiment(query):
    message = client.messages.create(
        model = "claude-3-5-sonnet-20240620",
        temperature = 0,
        max_tokens = 1024,
        system = "You are a helpful psychologist.",
        messages = [
             {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": image1_media_type,
                            "data":  base64_image1,
                        },
                    },
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": image1_media_type,
                            "data":  base64_image2,
                        },
                    },
                    {
                        "type": "text",
                        "text": query
                    }
                ],
            }
        ],
    )
    return message.content[0].text

response_content = predict_sentiment("Can you explain all the differences in the two images?")
print(response_content)

Output:

image4b.png

The above output shows the image comparison results obtained with Claude 3.5 Sonnet. For this task, the results I obtained with GPT-4o in my previous article were better than those of Claude 3.5 Sonnet. In my opinion, GPT-4o is the better model for image comparison tasks.

Conclusion

The Claude 3.5 Sonnet model is a state-of-the-art model for text and vision tasks. In this article, I explained how to analyze images with Claude 3.5 Sonnet. Compared with GPT-4o, I found Claude 3.5 Sonnet better for general image and graph analysis tasks. On the other hand, GPT-4o achieved better results for the summarization and image comparison tasks. I urge you to test both models and share your results.

Extracting YouTube Channel Statistics in Python Using YouTube Data API

Featured Imgs 23

Are you interested in finding out what a YouTube channel mostly discusses? Do you want to analyze YouTube videos of a specific channel? If yes, we are in the same boat.

YouTube video titles are a great way to determine the channel's primary focus. Plotting a word cloud or a bar plot of the most frequently occurring words in YouTube video titles can give you precise insight into the nature of a YouTube channel. I will do exactly this in this tutorial using the Python programming language.

So, let's begin without ado.

Getting YouTube Data API Key

You can get information about a YouTube channel in Python via the YouTube Data API. However, to access the API, you must create a new project in the Google Cloud Platform. You can do so for free.

image1.png

Once you create a new project, click the Go to APIs overview link, as shown in the screenshot below.

image2.png

Next, click the ENABLE APIS AND SERVICES link.

image3.png

Search for youtube data api v3.

image4.png

Click the ENABLE button.

image5.png

You will need to create credentials. To do so, click the CREDENTIALS link.

image6.png

If you have any existing credentials, you will see them. To create new credentials, click the + CREATE CREDENTIALS link and select API key.

image7.png

Your API key will be generated. Copy and save it in a secure place.

image8.png

Now, you can access the YouTube Data API in Python.

Installing and Importing Required Libraries

To begin our analysis, we must set up our Python environment by installing and importing the necessary libraries. The main libraries we will use are google-api-python-client for accessing the YouTube Data API, wordcloud for generating word clouds, and nltk (Natural Language Toolkit) for text processing.


!pip install google-api-python-client
!pip install wordcloud
!pip install nltk

Next, import the required libraries. These include googleapiclient.discovery for accessing the YouTube API, re for regular expressions to clean text, Counter from the collections module to count word frequencies, matplotlib.pyplot for plotting, WordCloud from the wordcloud library, and stopwords from nltk.corpus for removing common English words that do not contribute much to the analysis.


import googleapiclient.discovery
import re
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
Extracting YouTube Channel Statistics

This section will extract video titles from a specific YouTube channel using the YouTube Data API. To do this, we need to set up our API client with the appropriate API key and configure the request to fetch video data.

First, specify the API details and initialize the YouTube API client using the API key you generated earlier. Replace "YOUR_API_KEY" with your actual API key.


# API information
api_service_name = "youtube"
api_version = "v3"

# API key
DEVELOPER_KEY = "YOUR_API_KEY"

# API client
youtube = googleapiclient.discovery.build(
    api_service_name,
    api_version,
    developerKey=DEVELOPER_KEY)
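
Rather than pasting the key into the script, you could read it from an environment variable. Here is a minimal sketch, assuming the key is exported under the hypothetical name YOUTUBE_API_KEY:


import os

# Read the YouTube Data API key from an environment variable instead of hard-coding it
DEVELOPER_KEY = os.environ.get("YOUTUBE_API_KEY")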

Next, define the channel ID of the YouTube channel you want to analyze. Here, we are using the channel ID UCYNS-_653RIE9x_BINefAMA as an example. You can use any other channel if you want.

We will then create a request to retrieve video titles for the specified channel. The request fetches up to 50 video titles per call. To handle channels with more videos, we use pagination to fetch additional results when available.


# Channel ID of the channel you want to search
channel_id = "UCYNS-_653RIE9x_BINefAMA"


# Request to retrieve all video titles for the specified channel
request = youtube.search().list(
    part="snippet",
    channelId=channel_id,
    maxResults=50,
    type="video"
)

# Initialize an empty list to store the video titles
video_titles = []

# Execute the request and retrieve the results
while request is not None:
    response = request.execute()
    for item in response["items"]:
        video_titles.append(item["snippet"]["title"])
    request = youtube.search().list_next(request, response)

Finally, print the total number of extracted video titles and display the first 10 titles to ensure our extraction process is working correctly.


# Print the video titles
print("Total extracted videos:", len(video_titles))
print("First 10 videos")
print("===============")
for title in video_titles[:10]:
    print(title)

Output:

image9.png

Plotting a Bar Plot with Most Frequently Occurring Words

This section will process the extracted video titles to identify the most frequently occurring words. We will then visualize these words using a bar plot.

First, download the NLTK stop words and define an additional set of stop words to exclude from our analysis. Stop words are common words like "the", "is", and "in" that do not provide significant meaning in our context.

nltk.download('stopwords')
stop_words = set(stopwords.words('english')) | {'sth', 'syed', 'talat', 'hussain'}

Next, filter the video titles to include only those containing Latin characters. This step ensures we focus on titles written in English or similar languages.

# Filter videos with Latin text only
latin_titles = [title for title in video_titles if re.search(r'[a-zA-Z]', title)]

We then clean the titles by removing special characters and tokenizing them into individual words. Stop words are excluded from the analysis.

# Remove special characters and tokenize
words = []
for title in latin_titles:
    cleaned_title = re.sub(r'[^\w\s]', '', title)  # Remove special characters
    for word in re.findall(r'\b\w+\b', cleaned_title):
        if word.lower() not in stop_words:
            words.append(word.lower())

Subsequently, we count the frequency of each word and identify the ten most common words. To avoid redundancy, we exclude additional common words specific to this channel, such as "sth", "syed", "talat", and "hussain".


# Count the frequency of words
word_counts = Counter(words)

# Get the 10 most common words
most_common_words = [word for word, count in word_counts.most_common(10) if word not in {'sth', 'syed', 'talat', 'hussain'}]

Finally, create a bar plot to visualize the frequency of the most common words.


# Create bar plot
plt.figure(figsize=(10, 5))
plt.bar(range(len(most_common_words)), [word_counts[word] for word in most_common_words])
plt.title('10 Most Common Words in Video Titles')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(range(len(most_common_words)), most_common_words, rotation=45)
plt.show()

Output:

image10.png

The output will display a bar plot showing the 10 most common words in the video titles, providing a clear insight into the primary topics discussed on the channel.

Word Cloud of Video Titles

We will create a word cloud to visualize the word frequency data further. A word cloud presents the most common words in a visually appealing way, and the size of each word represents its frequency.

First, generate the word cloud using the WordCloud class. The word cloud is generated from the list of words we compiled earlier. Next, we display the word cloud using Matplotlib.


# Generate word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(words))

# Display the word cloud
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Word Cloud of Video Titles")
plt.show()

Output:

image11.png

The resulting word cloud provides a visual representation of the most frequently occurring words in the video titles, allowing for quick and intuitive insights into the channel's content.

Conclusion

Analyzing YouTube video titles allows us to gain valuable insights into a channel's primary topics. Using the YouTube Data API and libraries such as google-api-python-client, nltk, matplotlib, and wordcloud enables us to extract video data, process text, and visualize the most common words. This approach reveals the core themes of a YouTube channel, offering a clear understanding of its focus and audience interests. Whether for content creators or viewers, these techniques effectively uncover the essence of YouTube video discussions.

Retrieval Augmented Generation with Claude 3.5 Sonnet

Featured Imgs 23

In my previous article, I presented results comparing the Anthropic Claude 3.5 Sonnet and OpenAI GPT-4o models for zero-shot text classification. The results showed that Claude 3.5 Sonnet significantly outperformed GPT-4o.

These results motivated me to develop a simple retrieval augmented generation system with LangChain that enables the Claude 3.5 Sonnet model to answer questions pertaining to custom documents.

By the end of this article, you will know how to develop a chatbot that uses the Claude 3.5 Sonnet LLM to answer questions on custom documents.

So, let's begin without ado.

Installing and Importing Required Libraries

The following script installs the libraries required to run scripts in this article.

!pip install -U langchain
!pip install -U langchain-anthropic
!pip install langchain-openai
!pip install pypdf
!pip install faiss-cpu

Subsequently, the script below imports the required libraries into your Python application.


from langchain_anthropic import ChatAnthropic

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_core.documents import Document
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage
import os

Generating Default Response with Claude 3.5 Sonnet

Let's first generate a default response using Claude 3.5 Sonnet LLM in LangChain.

You will need an Anthropic API key, which you can get here.

Next, create an object of the ChatAnthropic class and pass the anthropic API key, the model ID, and the temperature value to its constructor. The temperature specifies how creative a model should be while generating responses. Higher temperature values allow models to be more creative.

Finally, pass the prompt to the invoke() method of the ChatAnthropic object to generate the model response.

anthropic_api_key = os.environ.get('ANTHROPIC_API_KEY')

llm = ChatAnthropic(model="claude-3-5-sonnet-20240620",
                     anthropic_api_key = anthropic_api_key,
                     temperature = 0.3)

result = llm.invoke("Write a funny poem for an ice cream shop on a beach.")
print(result.content)

Output:

image1.png

The LangChain ChatPromptTemplate class allows you to create prompts for a chatbot. The from_messages() method indicates that the conversation should be executed in message format. In this setup, you must specify a value for the user message, while the system message is optional.

You can use the StrOutputParser class to parse the model response in string format, as shown in the script below:


prompt = ChatPromptTemplate.from_messages([
    ("system", '{assistant}'),
    ("user", "{input}")
])

output_parser = StrOutputParser()

chain = prompt | llm | output_parser

result = chain.invoke(
    {"assistant": "You are a comedian",
     "input": "Write a funny poem for a music store on a beach."}
)
print(result)

Output:

image2.png

RAG with Claude 3.5 Sonnet

Now you know how to call the Claude 3.5 Sonnet LLM in LangChain. In this section, we will augment the Claude 3.5 Sonnet model's knowledge, making it capable of answering questions related to documents it has not seen during training.

Step 1: Loading and Splitting Documents

We start by loading and splitting the document using PyPDFLoader. In this case, we load "The English Constitution" by Walter Bagehot from a URL.

In the following script, the load_and_split() method parses the document and divides it into manageable sections.


loader = PyPDFLoader("https://web.archive.org/web/20170809122528id_/http://global-settlement.org/pub/The%20English%20Constitution%20-%20by%20Walter%20Bagehot.pdf")
docs = loader.load_and_split()
Step 2: Creating Embeddings

Next, we create embeddings for the text using the OpenAI API's OpenAIEmbeddings class. We then split the text into smaller chunks with RecursiveCharacterTextSplitter and create a FAISS vector store using the split documents and their embeddings. This step transforms the text into a format suitable for retrieval and similarity search.


openai_key = os.environ.get('OPENAI_API_KEY')

embeddings = OpenAIEmbeddings(openai_api_key = openai_key)

text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(docs)
vector = FAISS.from_documents(documents, embeddings)
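
Embedding the whole book on every run costs time and API calls, so you may optionally persist the index. Here is a minimal sketch using the FAISS vector store's save_local() and load_local() helpers (the folder name english_constitution_index is arbitrary, and recent LangChain versions require the allow_dangerous_deserialization flag when reloading a local index):


# Save the index to disk so embeddings are not recomputed on every run
vector.save_local("english_constitution_index")

# Later, rebuild the vector store from disk using the same embeddings object
vector = FAISS.load_local("english_constitution_index", embeddings,
                          allow_dangerous_deserialization=True)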
Step 3: Crafting the Prompt Template

The next step is to define a prompt template using the ChatPromptTemplate class. The template instructs the model to answer questions based solely on the provided context, ensuring accurate and relevant responses. The create_stuff_documents_chain function links this template with the language model, forming a document chain that will be used for generating responses.


prompt = ChatPromptTemplate.from_template("""Answer the following question based only on the provided context:

Question: {input}

Context: {context}
"""
)

document_chain = create_stuff_documents_chain(llm, prompt)
Step 4: Setting Up the Retriever

We convert the vector store into a retriever object with the vector.as_retriever() method. The retriever, combined with the document chain, forms a retrieval_chain. This setup enables the system to fetch relevant document sections based on the user's query and use them as context for generating answers.


retriever = vector.as_retriever()
retrieval_chain = create_retrieval_chain(retriever, document_chain)
Step 5: Generating Responses

Finally, we define the generate_response() function that takes a query as input, invokes the retrieval chain, and prints the answer.


def generate_response(query):
    response = retrieval_chain.invoke({"input": query})
    print(response["answer"])

To demonstrate the system in action, we run a few example queries as shown below:

Query1:


query = "What is the total number of members of the house of commons?"
generate_response(query)

Output:

image3.png

Query2:


query = "What is the difference between the house of lords and house of commons? How members are elected for both?"
generate_response(query)

Output:

image4.png

Query3:


query = "How many players participate in a football game?"
generate_response(query)

Output:

image5.png

You can see that the model correctly replied to queries related to the custom document and refused to generate a response to questions that are not related to the document.

Conclusion

The retrieval augmented generation (RAG) technique has revolutionized the development of customized chatbots for various data sources. In this article, we demonstrated how to build a chatbot using the Claude 3.5 Sonnet model to answer questions based on previously unseen documents. This method can be applied to create chatbots capable of querying diverse data types such as PDFs, websites, text documents, and beyond.
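
For example, to index a web page instead of a PDF, only the loader needs to change. Here is a minimal sketch using LangChain's WebBaseLoader (the URL is a placeholder, and the loader may require the beautifulsoup4 package); the rest of the pipeline stays the same:


from langchain_community.document_loaders import WebBaseLoader

# Load a web page instead of a PDF; downstream splitting, embedding, and retrieval are unchanged
loader = WebBaseLoader("https://example.com/some-article")
docs = loader.load()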

I encourage you to leverage Claude 3.5 Sonnet to develop your custom chatbots, explore its powerful capabilities, and share your experience and feedback in the comments section.

Comparing GPT-4o vs Claude 3.5 Sonnet for Zero Shot Text Classification

Featured Imgs 23

On June 20, 2024, Anthropic released the Claude 3.5 Sonnet large language model. Anthropic claims it to be the state-of-the-art model for many natural language processing tasks, surpassing the OpenAI GPT-4o model.

My first test for comparing two large language models is their zero-shot text classification ability. In this article, I will compare the Anthropic Claude 3.5 Sonnet with the OpenAI GPT-4o model for zero-shot tweet sentiment classification.

So, let's begin without ado.

Importing and Installing Required Libraries

The following script installs the Anthropic and OpenAI libraries to access the corresponding APIs.


!pip install anthropic
!pip install openai

The script below imports the required libraries into your Python application.


import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import anthropic
from openai import OpenAI

Importing and Preprocessing the Dataset

We will use the Twitter US Airline Sentiment dataset to perform zero-shot classification. You can download the dataset from Kaggle.

The following script imports the dataset into a Pandas DataFrame.

## Dataset download link
## https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment?select=Tweets.csv

dataset = pd.read_csv(r"D:\Datasets\tweets.csv")
print(dataset.shape)
dataset.head()

Output:

image1.png

Tweet sentiment falls into three categories: neutral, positive, and negative. For comparison, we will filter 100 tweets. The neutral, positive, and negative categories will contain 34, 33, and 33 tweets, respectively.

# Remove rows where 'airline_sentiment' or 'text' are NaN
dataset = dataset.dropna(subset=['airline_sentiment', 'text'])

# Remove rows where 'airline_sentiment' or 'text' are empty strings
dataset = dataset[(dataset['airline_sentiment'].str.strip() != '') & (dataset['text'].str.strip() != '')]

# Filter the DataFrame for each sentiment
neutral_df = dataset[dataset['airline_sentiment'] == 'neutral']
positive_df = dataset[dataset['airline_sentiment'] == 'positive']
negative_df = dataset[dataset['airline_sentiment'] == 'negative']

# Randomly sample records from each sentiment
neutral_sample = neutral_df.sample(n=34)
positive_sample = positive_df.sample(n=33)
negative_sample = negative_df.sample(n=33)

# Concatenate the samples into one DataFrame
dataset = pd.concat([neutral_sample, positive_sample, negative_sample])

# Reset index if needed
dataset.reset_index(drop=True, inplace=True)

# print value counts
print(dataset["airline_sentiment"].value_counts())

Output:

image2.png

Zero Shot Text Classification with GPT-4o

We will first perform zero-shot text classification with GPT-4o. Remember to get your OpenAI API key before executing the following script.

The script below defines a find_sentiment() function that accepts client and model parameters. The client is the API client object, i.e., Anthropic or OpenAI, while the model is Claude 3.5 Sonnet or GPT-4o.

The find_sentiment() function iterates through all the tweets from the filtered dataset and predicts sentiment for each tweet. The predicted sentiments are then compared with the actual sentiment values to determine the model's accuracy.


def find_sentiment(client, model):

    tweets_list = dataset["text"].tolist()


    all_sentiments = []

    i = 0
    exceptions = 0
    while i < len(tweets_list):

        try:
            tweet = tweets_list[i]
            content = """What is the sentiment expressed in the following tweet about an airline?
            Select sentiment value from positive, negative, or neutral. Return only the sentiment value in small letters.
            tweet: {}""".format(tweet)

            sentiment_value = client.chat.completions.create(
                                  model= model,
                                  temperature = 0,
                                  max_tokens = 10,
                                  messages=[
                                        {"role": "user", "content": content}
                                    ]
                                ).choices[0].message.content

            all_sentiments.append(sentiment_value)
            i = i + 1
            print(i, sentiment_value)

        except Exception as e:
            print("===================")
            print("Exception occurred:", e)
            exceptions += 1

    print("Total exception count:", exceptions)
    accuracy = accuracy_score(all_sentiments, dataset["airline_sentiment"])
    print("Accuracy:", accuracy)

Next, we create an OpenAI client and pass the client object, along with the gpt-4o model name, to the find_sentiment() function to predict sentiments.


%%time
client = OpenAI(
    # This is the default and can be omitted
    api_key = os.environ.get('OPENAI_API_KEY'),
)
model = "gpt-4o"

find_sentiment(client, model)

Output:

image3.png

The above output shows that GPT-4o achieved an accuracy of 76%.

Zero Shot Text Classification with Claude 3.5 Sonnet

Let's now predict the sentiment for the same set of tweets using the Claude 3.5 Sonnet model. We define the find_sentiment_claude() function, which is similar to the find_sentiment() function but predicts sentiments using the Claude 3.5 Sonnet model.


def find_sentiment_claude(client, model):

    tweets_list = dataset["text"].tolist()


    all_sentiments = []

    i = 0
    exceptions = 0
    while i < len(tweets_list):

        try:
            tweet = tweets_list[i]
            content = """What is the sentiment expressed in the following tweet about an airline?
            Select sentiment value from positive, negative, or neutral. Return only the sentiment value in small letters.
            tweet: {}""".format(tweet)

            sentiment_value = client.messages.create(
                                model= model,
                                max_tokens=10,
                                temperature=0.0,
                                messages=[
                                    {"role": "user", "content": content}
                                ]
                            ).content[0].text

            all_sentiments.append(sentiment_value)
            i = i + 1
            print(i, sentiment_value)

        except Exception as e:
            print("===================")
            print("Exception occurred:", e)
            exceptions += 1

    print("Total exception count:", exceptions)
    accuracy = accuracy_score(all_sentiments, dataset["airline_sentiment"])
    print("Accuracy:", accuracy)

The script below calls the find_sentiment_claude() function using the Anthropic client and the claude-3-5-sonnet-20240620 model (the ID for Claude 3.5 Sonnet).


%%time
client = anthropic.Anthropic(
    # defaults to os.environ.get("ANTHROPIC_API_KEY")
    api_key = os.environ.get('ANTHROPIC_API_KEY')
)

model = "claude-3-5-sonnet-20240620"

find_sentiment_claude(client, model)

Output:

image4.png

The above output shows that the Claude 3.5 Sonnet model achieved an accuracy of 83%, which is significantly better than the GPT-4o model.

Final Verdict

Here is an overall comparison between the Claude Sonnet 3.5 and the GPT-4o model.

image5.png

Given the price, performance, and context window size, I prefer the Claude 3.5 Sonnet over other proprietary models such as GPT-4o.

Feel free to share your opinions. I would be very interested in any results you obtain using the two models.

Tabular Data Classification with Hugging Face Meta Tree Transformer

Featured Imgs 23

As a data scientist, I have extensively used the Hugging Face library for processing unstructured data such as images, text, and audio. My previous blogs have covered various transformer models for these types of data. Recently, however, I discovered that Hugging Face also provides transformer models for tabular data. One such transformer is the Meta Tree Transformer.

This article will explore using the Meta Tree Transformer model to classify tabular data, detailing each step of the process and providing insights based on the Bank Note Authentication dataset.

Installing and Importing Required Libraries

You must install and import the following libraries to run the codes in this article.


!pip install metatreelib
!pip install --upgrade scikit-learn
!pip install imodels

from metatree.model_metatree import LlamaForMetaTree as MetaTree
from metatree.decision_tree_class import DecisionTree, DecisionTreeForest
from metatree.run_train import preprocess_dimension_patch
from transformers import AutoConfig
from sklearn.metrics import accuracy_score
import imodels # pip install imodels
import sklearn
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
import random
Loading and Preprocessing the Dataset

The dataset used in this tutorial is the Bank Note Authentication dataset, which you can download from Kaggle. The dataset contains features extracted from images of banknotes and is used to classify whether a banknote is authentic or not.

The dataset consists of the following columns:

  • variance
  • skewness
  • curtosis
  • entropy
  • class

The class column is the target variable, indicating whether the banknote is authentic (1) or not (0).

First, we need to read the dataset and preprocess it. We will use the pandas library to read the dataset from a CSV file.

This will load the dataset into a pandas DataFrame and display the first few rows to get an overview of the data.


# Load the dataset
file_path = '/content/BankNote_Authentication.csv'  # Path to the dataset
df = pd.read_csv(file_path)

# Display the first few rows of the dataset
df.head()

Output:

image1.png

Next, we split the dataset into training and testing sets using sklearn.model_selection.train_test_split. Here, 20% of the data is reserved for testing, ensuring we have sufficient data to evaluate the model's performance.


# Split the dataset into features and target variable
X = df.drop(columns=['class'])
y = df['class']

# Split the data into training and testing sets
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)
DataLoader for Batching

To handle the entire dataset in batches, we will create a custom dataset class and use PyTorch's DataLoader to batch and shuffle the data. The batch size is set to 256 since the Meta Tree transformer expects the data in batches of 256 records.


class TabularDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        feature = self.features[idx]
        label = self.labels[idx]
        return torch.tensor(feature, dtype=torch.float32), torch.tensor(label, dtype=torch.float32)

# Convert data to tensors
train_features = train_X.values
train_labels = torch.nn.functional.one_hot(torch.tensor(train_y.values), num_classes=2).float().numpy()

# Create Dataset
train_dataset = TabularDataset(train_features, train_labels)

# Parameters
batch_size = 256

# Create DataLoader
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
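
As an optional sanity check, you could inspect one batch from the DataLoader; with the settings above, the feature tensor should have the shape (256, number of features) and the one-hot label tensor the shape (256, 2):


# Peek at a single batch to confirm the expected shapes
batch_features, batch_labels = next(iter(train_loader))
print(batch_features.shape)  # e.g., torch.Size([256, 4])
print(batch_labels.shape)    # e.g., torch.Size([256, 2])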
Setting Up the Meta Tree Transformer Model

To begin with, we need to initialize the Meta Tree Transformer model and adjust its configuration to match our dataset, particularly the number of features and classes.

The model is configured to handle a different number of features by default. We set config.n_feature to the number of features in our dataset (train_X.shape[1]).

Similarly, the model is configured for a different number of output classes. We set config.n_class to the number of classes in our dataset (2).



# Initialize Model
model_name_or_path = "yzhuang/MetaTree"

config = AutoConfig.from_pretrained(model_name_or_path)
# Override config parameters to match your dataset
config.n_feature = train_X.shape[1]
config.n_class = 2

model = MetaTree.from_pretrained(
    model_name_or_path,
    config=config,
    ignore_mismatched_sizes=True
)

decision_tree_forest = DecisionTreeForest()

# Set the depth of the model
model.depth = 2

Training the Model with Batches

Next, we train the model using the batches provided by the DataLoader.


# Training loop
for batch_features, batch_labels in train_loader:
    # Prepare the batch for the model
    batch = {"input_x": batch_features, "input_y": batch_labels, "input_y_clean": batch_labels}
    batch = preprocess_dimension_patch(batch, n_feature=train_X.shape[1], n_class=2)

    # Generate decision tree
    outputs = model.generate_decision_tree(batch['input_x'], batch['input_y'], depth=model.depth)
    decision_tree_forest.add_tree(DecisionTree(auto_dims=outputs.metatree_dimensions, auto_thresholds=outputs.tentative_splits, input_x=batch['input_x'], input_y=batch['input_y'], depth=model.depth))

    print("Decision Tree Features: ", [x.argmax(dim=-1) for x in outputs.metatree_dimensions])
    print("Decision Tree Thresholds: ", outputs.tentative_splits)


Evaluating the Model

Finally, we evaluate the model's performance on the test set.


# Predict using the decision tree forest
test_X_tensor = torch.tensor(test_X.values, dtype=torch.float32)
tree_pred = decision_tree_forest.predict(test_X_tensor)

tree_pred = tree_pred.argmax(dim=-1).squeeze().numpy()

# Calculate accuracy
accuracy = accuracy_score(test_y, tree_pred)
print("MetaTree Test Accuracy: ", accuracy)

Output:


MetaTree Test Accuracy:  0.8727272727272727
Conclusion

The Meta Tree Transformer offers a powerful method for classifying tabular data by combining the interpretability of decision trees with the robust performance of transformer models. In this tutorial, we walked through the process of setting up the model, preprocessing the data, training with multiple batches, and evaluating the model's performance.

In my experience, the performance of the Meta Tree Transformer was on par with simpler algorithms like Random Forest and AdaBoost. Experimenting with different parameters and datasets can further enhance its performance, making it a valuable addition to any data scientist's toolkit.
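
To put that comparison in context, here is a quick baseline sketch with scikit-learn's RandomForestClassifier on the same train/test split used earlier (the random_state value is arbitrary, and your exact accuracy will vary with the split):


from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train a simple Random Forest baseline on the same split for comparison
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(train_X, train_y)

rf_pred = rf.predict(test_X)
print("Random Forest Test Accuracy: ", accuracy_score(test_y, rf_pred))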

Feel free to leave your feedback and the results you obtained using Transformer models on tabular data.