Generate Stunning AI Images for Free Using Diffusion Models

In this tutorial, you will see how to generate stunning AI images from text prompts using state-of-the-art diffusion models from Hugging Face. You'll learn about base diffusion models and how combining them with a refiner creates even more detailed, refined results. Diffusion models are powerful because they iteratively refine an image starting from pure noise.

Advanced generative AI tools like Midjourney and OpenAI's DALL·E 3 use diffusion models to generate photo-realistic AI images. However, these services charge fees for image generation. With diffusion models from Hugging Face, you can generate AI images for free. So, let's dive in!

Installing Required Libraries

To begin, let's install the libraries necessary for this project. Execute the following commands to get all dependencies ready:

!pip install diffusers --upgrade
!pip install invisible_watermark transformers accelerate safetensors
Generating AI Images Using Base Diffusion Models

Most state-of-the-art text-to-image diffusion models consist of a base model and a refiner. We'll first generate an image using the base diffusion model. We will use the stabilityai/stable-diffusion-xl-base-1.0 (SDXL) model for image generation. SDXL employs an ensemble of expert models for latent diffusion. Initially, the base model generates (noisy) latent images, which are then refined by a specialized model during the final denoising stages. You can use any other text-to-image diffusion model from Hugging Face.

The following Python script initializes a Hugging Face pipeline for the diffusion model and sets it up for GPU acceleration.


from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16,
                                         use_safetensors=True,
                                         variant="fp16")
pipe.to("cuda")

The next step is to pass a text prompt to the prompt parameter of the pipeline you defined. As shown in the script below, you can retrieve the generated image from the pipeline output's images list.


prompt = "A texas ranger riding a white horse"

image = pipe(prompt=prompt).images[0]

image

Output:

image1.png

Look at the image generated above; isn't it cool? You can even use this for commercial purposes.
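If you want to keep the result, note that the pipeline returns a standard PIL image, so you can save it to disk like any other PIL object (the filename below is just an example):


# Save the generated PIL image to disk (filename is illustrative)
image.save("texas_ranger.png")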

Generating Refined Images using Ensemble of Experts

Using an ensemble of experts and a refiner, you can create more detailed, polished images. To do so, you first create the base model as before. Next, you create a refiner model and pass the base model's second text encoder and VAE to it.

The refiner will build upon the image created by the base model to deliver a more polished, detailed final output.

The script below creates our base model and refiner.


from diffusers import DiffusionPipeline
import torch

# load both base & refiner
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16",
    use_safetensors=True
)
base.to("cuda")
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
refiner.to("cuda")

In the following script, we specify that the ensemble of experts should take 40 steps to generate an image from noise. Out of these 40 steps, the base model will take 80% (32 steps), and the refiner will use the remaining 20% (8 steps) to refine the image.


n_steps = 40
high_noise_frac = 0.8

prompt = "An  panda sitting on a table having a drink in a wooden room"

# run both experts
image = base(
    prompt=prompt,
    num_inference_steps=n_steps,
    denoising_end=high_noise_frac,
    output_type="latent",
).images

image = refiner(
    prompt=prompt,
    num_inference_steps=n_steps,
    denoising_start=high_noise_frac,
    image=image,
).images[0]

image

Output:

image2.png

From the above output, you can see a cute panda having a drink in a wooden room. Excellent, isn't it?

Conclusion

Diffusion models allow you to create stunning AI images. You can use diffusion models from Hugging Face to generate AI images for free.

In this tutorial, we employed the SDXL model for image generation. The base model generates (noisy) latent images, which are then refined by a specialized model during the final denoising stages. The base model can also function independently as a standalone module.

I invite you to try these models and share what you generated.

Summarizing YouTube Video Transcriptions Using Distil Whisper and LLM

In this tutorial, you will see how to summarize YouTube video transcriptions using Distil Whisper Large V3 and Mistral-7B-Instruct. Both models are open-source and free to use.

The Distil Whisper Large V3 model is a faster and smaller variant of Whisper Large V3, a state-of-the-art speech-to-text model. You will use this model to transcribe YouTube audio. Next, you will use the Mistral-7B-Instruct LLM to summarize the transcriptions. Along the way, you will learn how to extract audio from YouTube videos. We have many interesting things to cover, so let's begin without further ado.

Importing and Installing Required Libraries

As always, the first step is to install and import the required libraries. The following script installs the libraries required to run the code in this tutorial.


!pip install -q -U transformers==4.38.0
!pip install -q -U bitsandbytes==0.42.0
!pip install -q -U accelerate==0.27.1
!pip install -q datasets
!pip install -q pytube

The script below imports the required libraries.


import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, logging
from datasets import load_dataset
from pytube import YouTube
from transformers import BitsAndBytesConfig
Extracting Audio from YouTube Videos

We will begin by extracting audio from the YouTube video we want to transcribe.
You can use the YouTube class from the pytube module, as shown in the following script.

youtube_video_url = "https://www.youtube.com/watch?v=5sLYAQS9sWQ"
youtube_video_content = YouTube(youtube_video_url)

The streams attribute of the YouTube class object returns various audio and video streams.

for stream in youtube_video_content.streams:
  print(stream)

Output:

Image_1

We are only interested in the audio streams here. We will filter the audio/mp4 stream with a 128kbps ABR (average bitrate). You can select any other audio stream.

audio_stream = [stream for stream in youtube_video_content.streams if stream.mime_type == "audio/mp4" and stream.abr == "128kbps"][0]
audio_stream

Output:

<Stream: itag="140" mime_type="audio/mp4" abr="128kbps" acodec="mp4a.40.2" progressive="False" type="audio">

Next, we will download the audio stream using the download method of the stream we filtered.

audio_path = audio_stream.download("intro_to_llms")
audio_path

Output:

/content/intro_to_llms/How Large Language Models Work.mp4

We have downloaded the audio for our YouTube video. The next step is to transcribe this audio into text.

Transcribing Audio Using Distil Whisper Large V3

To transcribe the YouTube audio, we will use the Distil Whisper Large V3 model from the Hugging Face library. The following script downloads the model and its input/output processor.

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

whisper_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True
)
whisper_model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

Next, we will define a pipeline that takes the audio file as input, preprocesses and tokenizes it into segments, and generates transcriptions.


pipe = pipeline(
    "automatic-speech-recognition",
    model=whisper_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe(audio_path)
print(result["text"])

Output:

image4
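By default, the pipeline above transcribes the file sequentially. For much longer recordings, you can optionally enable chunked inference by passing a chunk length and batch size to the same pipeline; this is a sketch, and the values below are illustrative rather than tuned:


# Optional: chunked long-form transcription (chunk length and batch size are illustrative)
chunked_pipe = pipeline(
    "automatic-speech-recognition",
    model=whisper_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=25,
    batch_size=8,
    torch_dtype=torch_dtype,
    device=device,
)

long_result = chunked_pipe(audio_path)
print(long_result["text"])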

We have transcribed the YouTube video; the next step is to summarize it using an LLM.

Summarizing YouTube Video Text Using Mistral-7B-Instruct LLM

We will use the Mistral-7B-instruct LLM to summarize YouTube audio transcriptions. To know more about Mistral-7B, check my article on 7 NLP Tasks to Perform for Free in Python with Mistral 7b LLM. You can use any other LLM instead of Mistral-7B.

Mistral-7B consists of seven billion parameters. It requires a large amount of memory to use Mistral-7B even for inference, let alone fine-tuning. You can use quantization techniques to reduce the weight sizes of such huge LLMs. The script below defines a quantization configuration that reduces an LLM's weights to 4 bits. We will use this configuration to shrink our Mistral-7B model.


#Ignore warnings
logging.set_verbosity(logging.CRITICAL)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

Next, we will import the Mistral-7B model and its tokenizer from the Hugging Face library.

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
device = "cuda" # the device to load the model onto
LLM = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_config,
                                             device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(model_id)

We can now use the Mistral-7B model for summarization. To do so, we will define the generate_response() method, which takes the input text, the number of response tokens, and the temperature value. The temperature value must be between 0 and 1, with higher temperature values allowing more creative model responses. The generate_response() function uses the Mistral-7B model to generate the model response.


def generate_response(input_text, response_tokens, temperature):
  messages = [
      {"role": "user", "content": input_text},
  ]
  encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

  model_inputs = encodeds.to(device)

  generated_ids = LLM.generate(model_inputs,
                                max_new_tokens=response_tokens,
                                temperature = temperature,
                                do_sample=True)
  decoded = tokenizer.batch_decode(generated_ids)
  return decoded[0].split("[/INST]")[1].rstrip("</s>")

Finally, we can summarize the YouTube audio transcript by passing it, inside a summarization prompt, to the generate_response() function.

input_text = f"Summarize the following text: {result['text']}"
response = generate_response(input_text, 1000, 0.1)
print(f"Total characters in summarized result: {len(response)}")
print(response)

Output:

Image_2

The above output shows that the YouTube video transcription is summarized in less than 1000 characters.

Asking Other Questions About the Video

In addition to summarization, you can use the generate_response() method to ask other questions about the YouTube video. For example, the following script asks the model whether the video's tone is positive, negative, or neutral.

input_text = f"What is the overall tone of the following video text, positive, negative, or neutral: {result['text']}"
response = generate_response(input_text, 50, 0.1)
print(response)

Output:

image3
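You can ask for other things in the same way, for example, a brief list of key points (the prompt wording here is just an illustration):


input_text = f"List three key points made in the following text: {result['text']}"
response = generate_response(input_text, 300, 0.1)
print(response)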

Conclusion

Summarizing YouTube video transcriptions is a handy task, as it allows users to retrieve the important information expressed in a video without watching it in full. With the help of Distil Whisper and Mistral-7B, you can summarize YouTube transcriptions easily and at no cost. I hope you liked the article. Feel free to share your feedback.

Retrieval Augmented Generation with Hugging Face Models in LangChain

In my previous articles, I explained how to develop customized chatbots using the Retrieval Augmented Generation (RAG) approach in LangChain. However, I used proprietary models such as OpenAI's, which can be expensive when you try to scale.

In this article, I will show you how to use the open-source and free-of-cost models from Hugging Face to develop chatbot applications in LangChain. By the end of this tutorial, you will be able to import any Hugging Face Large Language Model (LLM) and embedding model in LangChain and develop your customized chatbot applications.

Importing and Installing Required Libraries

First, install and import the libraries and modules you will need to run codes in this tutorial.

The code in this tutorial runs on Google Colab, where some of the libraries are preinstalled. You can install the rest of the libraries via the following pip commands.


!pip install -q -U transformers==4.38.0
!pip install -q -U sentence-transformers
!pip install -q -U faiss-cpu
!pip install -q -U bitsandbytes==0.42.0
!pip install -q -U accelerate==0.27.1
!pip install -q -U huggingface_hub
!pip install -q -U langchain
!pip install -q -U pypdf

The script below imports the required libraries in your application.


from transformers import AutoModelForCausalLM, AutoTokenizer, logging, pipeline
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import PromptTemplate
from langchain.vectorstores import FAISS
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_core.prompts import ChatPromptTemplate
from sentence_transformers import SentenceTransformer
from transformers import BitsAndBytesConfig
import torch
Importing a Hugging Face LLM in LangChain

A RAG application requires two models: a large language model (LLM) for generating responses and an embedding model for converting documents into numeric representations.

Let's first see how you can import and use an open-source LLM from Hugging Face in LangChain.

The following script defines the quantization settings that reduce LLM weight sizes to 4 bits. This reduces the memory required to run very large LLMs.


#Ignore warnings
logging.set_verbosity(logging.CRITICAL)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

Next, you need to import the model and the corresponding tokenizer from the Hugging Face transformers library.
In the following script, we import the mistralai/Mistral-7B-Instruct-v0.2 model and its tokenizer. You can use any other LLM from Hugging Face if you want.


model_id = "mistralai/Mistral-7B-Instruct-v0.2"
device = "cuda" # the device to load the model onto
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_config,
                                             device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(model_id)

The next step is to create a transformers text-generation pipeline using the model and tokenizer you just imported. Subsequently, using the transformers pipeline, create an object of the LangChain HuggingFacePipeline class.


pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                max_new_tokens=1000)

hf = HuggingFacePipeline(pipeline=pipe)

The HuggingFacePipeline object works like any other LLM in LangChain and allows you to generate responses, as shown in the script below.


template = """You are a an expert baking chef.
{Question}"""

prompt = PromptTemplate.from_template(template)

chain = prompt | hf

question = "How to bake a pizza?"
print(chain.invoke({"Question": question}))

Output:

image1.png

The next step is to generate embeddings using Hugging Face embedding models in LangChain.

Generating Hugging Face Model Embeddings in LangChain

To generate embeddings, you must first load the document and split it into chunks.

You can import a PDF document using the LangChain PyPDFLoader class and split the loaded document via the load_and_split() method.


loader = PyPDFLoader("https://ecfr.eu/wp-content/uploads/2023/05/Brace-yourself-How-the-2024-US-presidential-election-could-affect-Europe.pdf")
pages = loader.load_and_split()

The load_and_split() method splits a PDF document into pages. However, you need to create smaller chunks of your document. To do so, you can use the RecursiveCharacterTextSplitter class from LangChain.

The following script creates text chunks of 1000 characters with an overlap of 200 characters between chunks.


splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)

all_text_chunks = []  # To store chunks from all documents
for doc in pages:
    text_content = doc.page_content
    text_chunks = splitter.split_text(text_content)
    all_text_chunks.extend(text_chunks)

print("Total chunks:", len(all_text_chunks))
print("============================")

Output:


Total chunks: 77
============================

The next step is to create embeddings. To do so, you can use any embedding model from Hugging Face. Pass the embedding model's path to the LangChain HuggingFaceEmbeddings class. You can then use any vector store, such as FAISS, to store the embedded chunks.


model_path = "thenlper/gte-large"
embeddings = HuggingFaceEmbeddings(
    model_name = model_path
)

embedding_vectors = FAISS.from_texts(all_text_chunks, embeddings)
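If you want to reuse the index across sessions instead of re-embedding the document every time, the LangChain FAISS wrapper can be saved to and loaded from disk. This is a minimal sketch; the folder name is arbitrary, and depending on your LangChain version, load_local() may additionally require allow_dangerous_deserialization=True:


# Persist the FAISS index and reload it later (folder name is illustrative)
embedding_vectors.save_local("faiss_index")
reloaded_index = FAISS.load_local("faiss_index", embeddings)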

We imported the LLM and embedding model from Hugging Face in LangChain. With these models, we can easily create a RAG application.

RAG Using Open Source LLM and Embeddings from Hugging Face

The first step in a simple RAG application is to define the prompt that receives user input and the context from embedded documents.

The following script defines our sample prompt. This prompt receives the user input and the context from the vector store index containing embedded documents.

The script also creates a create_stuff_documents_chain chain that allows you to execute prompts on the LLM.


prompt = ChatPromptTemplate.from_template("""Answer the following question based only on the provided context:

Question: {input}

Context: {context}
<end>
"""
)

document_chain = create_stuff_documents_chain(hf, prompt)

The next step is to create a retriever using the vector store object and pass the retriever and the document_chain to the create_retrieval_chain() function. The resulting chain binds the context retrieved from the embedded documents to the user input.


retriever = embedding_vectors.as_retriever()
retrieval_chain = create_retrieval_chain(retriever, document_chain)

Finally, you can call the invoke() method on the retrieval chain to get customized responses based on the document embeddings. The response contains concatenated user input and the model's reply. Therefore, I split the response using the <end> token and returned only the model's reply.


def generate_response(query):
    response = retrieval_chain.invoke({"input": query})
    return response["answer"]


query = "What are the three points where Republicans and Democrats agree?"
result = generate_response(query)

print(result.split("<end>")[1])

Output:

image2.png

Conclusion

Creating a RAG application involves a large language model and an embedding model. Though proprietary models achieve higher accuracy, they can be expensive on a large scale.
In this article, you saw how to use free-to-use open-source models from Hugging Face to create a simple RAG application. I encourage you to develop your own RAG application using free and open-source models from Hugging Face and share what you build.

Question Answering with YouTube Videos Using RAG in LangChain

In previous articles, I explained how to use natural language to interact with PDF documents and SQL databases, using the Python LangChain module and OpenAI API.

In this article, you will learn how to use LangChain and the OpenAI API to create a question-answering application that allows you to retrieve information from YouTube videos. So, let's begin without further ado.

Importing and Installing Required Libraries

Before diving into the code, let's set up our environment with the necessary libraries.

We will use the LangChain module to access vector databases and execute queries on large language models to retrieve information about YouTube videos. We will also employ the YouTube Transcript API for fetching video transcripts, the Pytube library for downloading YouTube videos, and the FAISS vector index for efficient similarity search in large datasets.

The following script installs these modules and libraries.


!pip install -qU langchain
!pip install -qU langchain-community
!pip install -qU langchain-openai
!pip install -qU youtube-transcript-api
!pip install -qU pytube
!pip install -qU faiss-cpu

The script below imports the required libraries into our Python application.


from langchain_community.document_loaders import YoutubeLoader
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_core.documents import Document
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage
import os
Creating Text Documents from YouTube Videos

The first step involves converting YouTube video content into text documents. You can use the from_youtube_url() method of the LangChain YoutubeLoader class, which extracts video transcripts and additional video information. You need to pass a list of YouTube video URLs to this method.

The following script converts YouTube videos into text documents and prints the transcript of the first YouTube video in our list of YouTube videos.


urls = [
    "https://www.youtube.com/watch?v=C5BkxbbLbIY",
    "https://www.youtube.com/watch?v=7qbJvucsU-Y",
    "https://www.youtube.com/watch?v=DNNMS7l6A-g",
    "https://www.youtube.com/watch?v=1rh-P2N0DR4",
    "https://www.youtube.com/watch?v=AW2kppo3rtI",
    "https://www.youtube.com/watch?v=lznP52OEpQk",
    "https://www.youtube.com/watch?v=ul14Qs24-s8",
    "https://www.youtube.com/watch?v=sB3N0ZqjPyM",
    "https://www.youtube.com/watch?v=JG1zkvwrn3E"
]
docs = []
for url in urls:
    docs.extend(YoutubeLoader.from_youtube_url(url, add_video_info=True).load())


print("=====================================")
print(f"Total number of videos: {len(docs)}")
print("=====================================")
print("Contents of video 1")
print("=====================================")
print(docs[0])

Output:

image1.png


The transcript of each video is stored in a LangChain document format, as you can see from the following script.


type(docs[0])

Output:


langchain_core.documents.base.Document

Once you have text in the LangChain document format, you can split it using any LangChain splitter, create text embeddings, and use it for retrieval augmented generation like any other text document. Let's see these steps.

Splitting and Embedding YouTube Video Documents

To handle large documents and prepare them for retrieval, we split them into smaller chunks using the LangChain RecursiveCharacterTextSplitter. This facilitates more efficient processing and embedding.

The embeddings are generated using OpenAIEmbeddings, which transform the text chunks into numerical vectors. These vectors are stored in a FAISS vector index, enabling fast and efficient similarity searches.


openai_key = os.environ.get('OPENAI_KEY2')

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

documents = text_splitter.split_documents(docs)

embeddings = OpenAIEmbeddings(openai_api_key = openai_key)
vector = FAISS.from_documents(documents, embeddings)
Question Answering with YouTube Videos

The rest of the process is straightforward. We create an object of the ChatOpenAI LLM and define a retriever that will retrieve the documents that match our query from the FAISS vector index.


llm = ChatOpenAI(
    openai_api_key = openai_key ,
    model = 'gpt-4',
    temperature = 0.5
)

retriever = vector.as_retriever()

Next, we will create two chains. The first chain, history_retriever_chain, will be responsible for maintaining memory. The input to this chain will be the chat history in the form of a list of messages and the user prompt. The output from this chain will be context relevant to the user prompt and chat history. The following script defines the history_retriever_chain.


prompt = ChatPromptTemplate.from_messages([
    MessagesPlaceholder(variable_name="chat_history"),
    ("user", "{input}"),
    ("user", "Given the above conversation, generate a search query to look up in order to get information relevant to the conversation")
])

history_retriever_chain = create_history_aware_retriever(llm, retriever, prompt)

Next, we will create a document_chain. This chain will receive the user prompt and chat history as input, as well as the output context from the history_retriever_chain.


prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the user's questions based on the below context:\n\n{context}"),
    MessagesPlaceholder(variable_name="chat_history"),
    ("user", "{input}")
])
document_chain = create_stuff_documents_chain(llm, prompt)

Finally, we will create our retrieval_chain that combines the history_retriever_chain and document_chain to retrieve the final response from the LLM.

In the following script, we ask the LLM about the top three things to do. Notice that we never mention Paris in the query. The LLM is intelligent enough to infer from the context containing the YouTube video documents that we are asking about the top three things to do in Paris.


retrieval_chain = create_retrieval_chain(history_retriever_chain, document_chain)

result = retrieval_chain.invoke({
    "chat_history": [],
    "input": "What are the top 3 things to do?"
})

print(result['answer'])

Output:


The top three things to do based on the context are:

1. Visiting the top of the Ark (Arc de Triomphe) for a view of the city.
2. Having a picnic near the Eiffel Tower with food from a local market.
3. Renting bikes and riding around the city with no specific destination.
A Command Line YouTube Question Answering Chatbot

Now that we know how to get answers to our questions related to YouTube videos, we can create a simple command-line application for question-answering. To do so, we define an empty chat_history list and the generate_response_with_memory() method that accepts a user prompt as a parameter.

Inside the method, we generate a response for the user prompt and then append both the input prompt and the response to the chat_history list. Thanks to the chat_history list, the command-line application will be able to answer follow-up questions.

chat_history = []

def generate_response_with_memory(query):
    result = retrieval_chain.invoke({
    "chat_history": chat_history,
    "input": query
    })

    response = result['answer']
    chat_history.extend([HumanMessage(content = query),
                       AIMessage(content = response)])

    return response

The following script executes a while loop. Inside the loop, the user asks a question that is passed on to the generate_response_with_memory() method. The method's response is displayed on the console. The process continues until the user types bye.


print("=======================================================================")
print("A Chatbot that tells you things to do in Paris. Enter your query")
print("=======================================================================")

query = ""
while query != "bye":
    query = input("\033[1m User >>:\033[0m")

    if query == "bye":
        chat_history = []
        print("\033[1m Chatbot>>:\033[0m Thank you for your messages, have a good day!")
        break
    response = generate_response_with_memory(query)
    print(f"\033[1m Chatbot>>:\033[0m {response}")

Output:

image2.png

Conclusion

YouTube videos are a valuable source of information. However, not everyone has time to watch complete YouTube videos. In addition, a viewer might only be interested in certain parts of a video. The YouTube question-answering chatbot allows you to chat with any YouTube video. I encourage you to test this application and let me know your feedback.

Using Natural Language to Query SQL Databases with Python LangChain Module

The advent of large language models (LLMs) has replaced complex scripts with natural language for automating various tasks. You can now use an LLM to interact with your databases using natural language, which makes life easier for people who do not have sufficient SQL knowledge.

In this article, you will learn how to retrieve information from SQL databases using natural language. For this purpose, you will use the Python LangChain library. The LangChain agents convert natural language questions into SQL queries and return the response in natural language.

Using natural language queries, you will learn how to interact with PostgreSQL, MySQL, and SQLite databases. You will retrieve information from the sample Northwind database. You can download the Northwind database samples for PostgreSQL, MySQL, and SQLite from GitHub. This article assumes you have imported the Northwind database into the corresponding servers.

So, let's begin without further ado.

Installing and Importing Required Libraries

To connect your Python application with PostgreSQL and MySQL, you must install the PostgreSQL and MySQL connectors. Execute the following script to download these connectors. The LangChain packages used below (langchain, langchain-openai, and langchain-community) are assumed to be installed, as in the previous sections.

# connector for PostgreSQL
!pip install psycopg2

# connector for MySQL
!pip install mysql-connector-python
Defining the LLM and Agent

As mentioned earlier, I will use LangChain agents to execute natural language queries on various databases. To do so, we need a large language model (LLM) and database objects.

The following script imports the GPT-4 LLM via LangChain.


import os

# LangChain imports used in this article (module paths may vary slightly across versions)
from langchain_openai import ChatOpenAI
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import create_sql_agent

openai_key = os.environ.get('OPENAI_KEY2')

llm = ChatOpenAI(
    openai_api_key = openai_key ,
    model = 'gpt-4',
    temperature = 0.5
)

Next, we define a get_db_response() function that accepts the database object and the user query as input parameters. Inside the function, we create a LangChain SQL agent using the create_sql_agent() function, passing it the LLM and the database.

In addition, we set the agent_type attribute to openai-tools, which tells the agent to use OpenAI tools to process queries, and we set the verbose attribute to True to see how the agent processes the input query.

Finally, we call the invoke() method to run the query via the agent.


def get_db_response(db, query):
    agent_executor = create_sql_agent(llm,
                                  db=db,
                                  agent_type="openai-tools",
                                  verbose=True)
    response = agent_executor.invoke(query)

    return response
Generating Response from PostgreSQL Database

Generating a response from the LangChain SQL agent is straightforward. First, you create a database object by passing the database connection string to the from_uri() method of the SQLDatabase class.

For PostgreSQL, the syntax of the connection string is as follows:

f"postgresql+psycopg2://postgres:<<password>>@<<server>>:<<port>>/<<database>>"

In the above, replace values for <<password>>, <<server>>, <<port>> and <<database>>.

Next, you can write a query string and pass the database object and the query string to the get_db_response() function you defined earlier.

In the following script, we retrieve information about the top 10 employees with the highest sales.


postgres_uri = f"postgresql+psycopg2://postgres:mani123@localhost:5432/northwind"
pg_db = SQLDatabase.from_uri(postgres_uri)

query = "Return the top 10 employees who did most sales"

response = get_db_response(pg_db, query)

Output:

image1.png

The above output shows the agent's actions in executing the query. It searches for the employee, salesorder, and orderdetail tables to retrieve the required information.
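If you want to check which tables the agent can see before running a query, the SQLDatabase wrapper exposes a helper for that (a quick sanity check rather than a required step):


# List the tables the agent can query
print(pg_db.get_usable_table_names())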

You can use the output key of the response dictionary to see the returned response.

print(response['output'])

Output:

The top 10 employees who made the most sales are:

1. Yael Peled with 9798 sales
2. Judy Lew with 7852 sales
3. Sara Davis with 7812 sales
4. Don Funk with 6055 sales
5. Maria Cameron with 5913 sales
6. Russell King with 4654 sales
7. Paul Suurs with 3527 sales
8. Sven Buck with 3036 sales
9. Zoya Dolgopyatova with 2670 sales

The above output shows the list of the top 10 employees with the most sales.

Generate Response from MySQL Database

Generating a response from a MySQL database is almost the same as PostgreSQL. The only thing that changes is the connection string, whose syntax should be as follows:

'mysql+mysqlconnector://<<user>>:<<password>>@<<server>>:<<port>>/<<database>>'

For example, in the following script, we retrieve the top 5 products with the least sales from the Northwind database in a MySQL server.


mysql_uri = 'mysql+mysqlconnector://root:mani123@localhost:3306/northwind'
mysql_db = SQLDatabase.from_uri(mysql_uri)

query = "Give me the 5 products with least sales"

response = get_db_response(mysql_db, query)

Output:

image2.png

The above output shows the agent's thought process.

You can print the final output using the script below:

print(response['output'])

Output:

The 5 products with the least sales are:

1. Product AOZBW with 95 units sold.
2. Product KSZOI with 122 units sold.
3. Product EVFFA with 125 units sold.
4. Product MYNXN with 138 units sold.
5. Product XLXQF with 184 units sold.
Generate Response from SQLite Database

Finally, you can use a connection string with the following syntax to generate a response from an SQLite database:

"sqlite:///<<sqlite database path"

The script below returns the names of the top 10 customers with the most orders. The output shows a snapshot of the agent's thought process.


sqlite3_uri = "sqlite:///D:/Datasets/Northwind.db"
sqlite3_uri = SQLDatabase.from_uri(sqlite3_uri)

query = "Give me the name of top 10 customers with most number of orders"

response = get_db_response(sqlite3_uri, query)

Output:

image4.png

The script below returns the final output:

print(response['output'])

Output:


The top 10 customers with the most number of orders are:

1. Customer LCOUJ with 31 orders
2. Customer THHDP with 30 orders
3. Customer IRRVL with 28 orders
4. Customer FRXZL with 19 orders
5. Customer CYZTN with 19 orders
6. Customer UMTLM with 18 orders
7. Customer NYUHS with 18 orders
8. Customer HGVLZ with 18 orders
9. Customer RTXGC with 17 orders
10. Customer ZHYOS with 15 orders
Conclusion

This article shows how you can interact with SQL databases using natural language via Python LangChain agents. Though the agents described in this article are extremely useful for executing SELECT queries, you cannot use them to execute INSERT, UPDATE, or DELETE queries on databases. I will show you how to do that in one of my following tutorials.

Paris Olympics Ticket Information Chatbot with Memory Using LangChain

In my previous article, I explained how I developed a simple chatbot using LangChain and Chat-GPT that can answer queries related to Paris Olympics ticket prices.

However, one major drawback of that chatbot is that it can only generate a single response per user query. It cannot answer follow-up questions. In short, the chatbot has no memory to store previous conversations, so it cannot answer questions based on the earlier exchange.

In this article, I will explain how to add memory to this chatbot and execute conversations where the chatbot can respond to queries considering the past conversation.

So, let's begin without further ado.

Installing and Importing Required Libraries

The following script installs the required libraries for this article.

!pip install -U langchain
!pip install langchain-openai
!pip install pypdf
!pip install faiss-cpu

The script below imports required libraries.


from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_core.documents import Document
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage
import os

Paris Olympics Chatbot for Generating a Single Response

Let me briefly review how we developed a chatbot capable of generating a single response and its associated problems.

The following script creates an object of the ChatOpenAI LLM with the GPT-4 model, the model that powers Chat-GPT.

openai_key = os.environ.get('OPENAI_KEY2')

llm = ChatOpenAI(
    openai_api_key = openai_key ,
    model = 'gpt-4',
    temperature = 0.5
)

Next, we import and load the official PDF containing the Paris Olympics ticket information.


loader = PyPDFLoader("https://tickets.paris2024.org/obj/media/FR-Paris2024/ticket-prices.pdf")
docs = loader.load_and_split()

We then split our PDF document and create embeddings for its different chunks of information. We store the embeddings in a vector database.


embeddings = OpenAIEmbeddings(openai_api_key = openai_key)
text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(docs)
vector = FAISS.from_documents(documents, embeddings)

Subsequently, we create a ChatPromptTemplate object that accepts our input query and the context information extracted from the PDF document. We also create a document_chain that passes the input prompt to our LLM.


from langchain.chains.combine_documents import create_stuff_documents_chain

prompt = ChatPromptTemplate.from_template("""Answer the following question based only on the provided context:

Question: {input}

Context: {context}
"""
)

document_chain = create_stuff_documents_chain(llm, prompt)

Next, we create our vector database retriever, which retrieves information from the vector database based on the input query.

Finally, we create our retrieval_chain that accepts the retriever and document_chain as parameters and returns the final response from an LLM.


retriever = vector.as_retriever()
retrieval_chain = create_retrieval_chain(retriever, document_chain)

We define a function, generate_response(), which accepts a user's input query and returns a chatbot response.


def generate_response(query):
    response = retrieval_chain.invoke({"input": query})
    print(response["answer"])

Let's ask a query about the lowest-priced tickets for tennis games.


query = "What is the lowest ticket price for tennis games?"
generate_response(query)

Output:

The lowest ticket price for tennis games is 30.

The model responded correctly.

Let's now ask a follow-up query. The following query clearly conveys that we want information about the lowest-priced tickets for beach volleyball games.

However, since the chatbot does not have any memory of past conversations, it treats the query as a new standalone question. The response is different from what we aim for.


query = "And for beach volleyball?"
generate_response(query)

Output:


Beach Volleyball is played by two teams of two players each. They face off in the best of three sets on a sand court that is 16m long and 8m wide. The net is at the same height as indoor volleyball (2.24m for women and 2.43m for men). The game is contested by playing two sets to 21 points, and teams must win at least two points more than their opponents to win the set. If needed, the third set is played to 15 points. The matches take place at the Eiffel Tower Stadium in Paris.

Let's pass another query.


query = "And what is the category of this ticket?"
generate_response(query)

Output:

The context does not provide specific information on the category of the ticket.

This time, the model refuses to return any information.

If you have conversed with Chat-GPT, you will have noticed that it responds to follow-up questions. In the next section, you will see how to add memory to your chatbot so it can track past conversations.

Adding Memory to Paris Olympics Chatbot

We will create two chat templates and three chains.

The first chat template will accept the user's input query and the message history; it uses a MessagesPlaceholder attribute to hold our previous chat.

We will also define the history_retriever_chain, which takes this chat template and returns the matching documents from our vector database.

The following script defines our first template and chain.


prompt = ChatPromptTemplate.from_messages([
    MessagesPlaceholder(variable_name="chat_history"),
    ("user", "{input}"),
    ("user", "Given the above conversation, generate a search query to look up in order to get information relevant to the conversation")
])

history_retriever_chain = create_history_aware_retriever(llm, retriever, prompt)

You can test the above chain using the following script. The chat_history list will contain HumanMessage and AIMessage objects corresponding to user queries and chatbot responses.

Next, while invoking the history_retriever_chain object, we pass the user input and the chat history.

In the response, you will see the matched documents returned by the retriever. As an example, I have only printed the first document. If you look carefully, you will see the ticket prices for the beach volleyball games. We will pass this information on to our next chain, which will return the final response.


chat_history = [
    HumanMessage(content="What is the lowest ticket price for tennis games?"),
    AIMessage(content="The lowest ticket price for tennis games is 30.")
]

result = history_retriever_chain.invoke({
    "chat_history": chat_history,
    "input": "And for Beach Volleyball?"
})

result[0]

Output:

image1.png

Let's now define our second prompt template and chain. This prompt template will receive user input and message history from the user and context information from the history_retriever_chain chain. We will also define the corresponding document chain that invokes a response to this prompt.


prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the user's questions based on the below context:\n\n{context}"),
    MessagesPlaceholder(variable_name="chat_history"),
    ("user", "{input}")
])
document_chain = create_stuff_documents_chain(llm, prompt)

Finally, we will chain the history_retriever_chain and the document_chain together using create_retrieval_chain() to create our final retrieval_chain.

We will pass the chat history and user input to this retrieval_chain, which first fetches the context information using the chat history from the history_retriever_chain. Next, the context retrieved from the history_retriever_chain, along with the user input and chat history, will be passed to the document_chain to generate the final response.

Since we already have some messages in the chat history, we can test our retrieval_chain using the following script.

retrieval_chain = create_retrieval_chain(history_retriever_chain, document_chain)

result = retrieval_chain.invoke({
    "chat_history": chat_history,
    "input": "And for Beach Volleyball?"
})

print(result['answer'])

Output:


The lowest ticket price for Beach Volleyball games is 24.

In the above output, you can see that Chat-GPT successfully generated a response to a follow-up question.

Putting it All Together - A Command Line Chatbot

To create a simple command-line chatbot, we will instantiate an empty list to store our past conversations.

Next, we define the generate_response_with_memory() function that accepts the user query as an input parameter and invokes the retrieval_chain to generate a model response.

Inside the generate_response_with_memory() function, we create HumanMessage and AIMessage objects using the user queries and chatbot responses and add them to the chat_history list.


chat_history = []

def generate_response_with_memory(query):
    result = retrieval_chain.invoke({
    "chat_history": chat_history,
    "input": query
    })

    response = result['answer']
    chat_history.extend([HumanMessage(content = query),
                       AIMessage(content = response)])

    return response

Finally, we can execute a while loop that asks the user to enter queries as console input. If the input is the string bye, we empty the chat_history list, print a goodbye message, and quit the loop.

Otherwise, the query is passed to the generate_response_with_memory() function to generate a chatbot response.

Here is the script and a sample output.


print("=======================================================================")
print("Welcome to Paris Olympics Ticket Information Chatbot. Enter your query")
print("=======================================================================")

query = ""
while query != "bye":
    query = input("\033[1m User >>:\033[0m")

    if query == "bye":
        chat_history = []
        print("\033[1m Chatbot>>:\033[0m Thank you for your messages, have a good day!")
        break
    response = generate_response_with_memory(query)
    print(f"\033[1m Chatbot>>:\033[0m {response}")

Output:

image2.png

Conclusion

A conversational chatbot keeps track of the past conversation. In this article, you saw how to create a Paris Olympics ticket information chatbot that answers user queries and follow-up questions using LangChain and Chat-GPT. You can use the same approach to develop conversational chatbots for other problems.
Feel free to leave your feedback in the comments.

Paris Olympics Chatbot- Get Ticket Information Using Chat-GPT and LangChain

I was searching for Paris Olympics ticket prices for tennis games recently. The official website directs you to a PDF document containing ticket prices and venues for all the games. However, I found the PDF document to be very hard to navigate. To make things easier, I developed a chatbot to search this PDF document and answer my queries in natural language. And this is what I am going to share in this article.

I used the OpenAI API to create document embeddings (convert documents to numeric values) and the Python LangChain library as the orchestration framework to develop this chatbot.

So, let's begin without further ado.

Installing and Importing Required Libraries

The following script installs the libraries required to run scripts in this article.


!pip install -U langchain
!pip install langchain-openai
!pip install pypdf
!pip install faiss-cpu

The script below imports required libraries.


from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_core.documents import Document
import os

Generate Default Responses from Chat-GPT

Let's first generate some responses from Chat-GPT without augmenting its knowledge base with information about Paris Olympics ticket prices.

In a Python application, you will use the OpenAI API key to generate Chat-GPT responses. You can retrieve your API key by signing up for the OpenAI API.

You can save your API key in an environment variable and retrieve it in your Python application via the following script.


openai_key = os.environ.get('OPENAI_KEY2')

Next, you must create an object of the ChatOpenAI class, pass it your API key, the model name (gpt-4 in this case), and the temperature value (between 0 and 1). A higher temperature value allows the model to be more creative.


llm = ChatOpenAI(
    openai_api_key = openai_key ,
    model = 'gpt-4',
    temperature = 0.5
)

To generate a response, call the invoke() method using the model object and pass it your input query.

result = llm.invoke("You are a comedian, tell a Joke about maths.")
print(result.content)

Output:


Why was the math book sad?

Because it had too many problems!

You can use the LangChain ChatPromptTemplate class to create a chatbot. The from_messages() method in the following script tells LangChain that we want to execute the conversation in message format. In this case, you must specify the value for the user attribute. The system attribute is optional.


prompt = ChatPromptTemplate.from_messages([
    ("system", '{assistant}'),
    ("user", "{input}")
])

Next, you must create a chain combining your prompt with the LLM. You can use the pipe operator (|) to create a chain.

Finally, you can execute the message chain using the invoke() method. You must pass values for the attributes defined in your prompt (system and user in our case).

You can print LLM's response using the content attribute of the response object.


chain = prompt | llm

result = chain.invoke(
    {"assistant": "You are a comedian",
     "input": "Tell a joke about mathematics"}
)

print(result.content)

Output:


Why was the math book sad?

Because it had too many problems!

LangChain also provides a StrOutputParser object that you can use to retrieve string responses directly, without using the content attribute, as shown in the script below.


output_parser = StrOutputParser()

chain = prompt | llm | output_parser

result = chain.invoke(
    {"assistant": "You are a comedian",
     "input": "Tell a joke about mathematics"}
)
print(result)

Output:


Why was the math book sad?

Because it had too many problems!

Chat-GPT can only generate correct responses if its knowledge base contains the answer. Since the latest GPT-4 model was trained on data up to December 2023, it will not be able to return factual information about events in 2024. Let's verify this.

In the following prompt, I asked Chat-GPT to tell me the ticket prices for tennis games in the Paris Olympics 2024.


result = chain.invoke(
    {"assistant": "You are a ticket receptionist",
     "input": "What is the lowest ticket price for tennis games in Paris Olympics 2024?"})
print(result)

Output:

image1.png

The above response shows that Chat-GPT does not know the Paris Olympics ticket prices.

In the next section, we will provide Chat-GPT with the ticket information using the retrieval augmented generation (RAG) technique, and you will see that it generates correct responses.

RAG for Augmented Response Generation from Chat-GPT

Retrieval augmented generation (RAG) works in three steps.

  1. Split and create embeddings for the documents containing the knowledge base you want to search.
  2. Based on the user query, use similarity search to find documents most likely to contain a response for the query.
  3. Pass the user query and the matched document in a prompt to an LLM (Chat-GPT in our case) to generate a response.

Let's see how we can do this.

Splitting and Embedding Ticket Price Information Document

The following script imports the PDF document containing ticket price information and splits it into multiple pages.


loader = PyPDFLoader("https://tickets.paris2024.org/obj/media/FR-Paris2024/ticket-prices.pdf")
docs = loader.load_and_split()

Next, you can use any embedding technique to convert the documents to a numeric format. We convert documents to numeric embeddings since matching numbers is easier than matching raw text.

Many advanced embedding techniques exist; OpenAI embeddings are among the most accurate. The following script creates a vector database storing the vector embeddings for all the pages in the input PDF document.

embeddings = OpenAIEmbeddings(openai_api_key = openai_key)

text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(docs)
vector = FAISS.from_documents(documents, embeddings)
Question Answering with Chatbot

To implement question answering, we will create a ChatPromptTemplate object, passing the input query along with the context information retrieved from the vector database. The context information contains the answer to our query.

We will create a create_stuff_documents_chain chain that can generate LLM responses based on the context retrieved from documents.


from langchain.chains.combine_documents import create_stuff_documents_chain

prompt = ChatPromptTemplate.from_template("""Answer the following question based only on the provided context:

Question: {input}

Context: {context}
"""
)

document_chain = create_stuff_documents_chain(llm, prompt)

Let's first run our chain with the hard-coded context. In the following script, we ask a question and pass the context manually.

document_chain.invoke({
    "input": "What is the lowest ticket price for tennis games in Paris Olympics 2024?",
    "context": [Document(page_content = "The ticket prices for tennis games are 15, 10, 25 euros")]
})

Output:

'The lowest ticket price for tennis games in Paris Olympics 2024 is 10 euros.'

Chat-GPT was intelligent enough to infer from the context that the lowest price is 10 euros.

However, we do not want to pass the context information manually. This defeats the whole purpose of a chatbot. Instead, we want to provide the context information by searching the vector database containing the knowledge base. Then, we pass the retrieved context along with the input query to an LLM.

To do so, you can use your vector database's as_retriever() method and pass the resulting retriever, along with the document_chain object, to the create_retrieval_chain() function.


retriever = vector.as_retriever()
retrieval_chain = create_retrieval_chain(retriever, document_chain)

Next, you can call the invoke() method from your retrieval chain to generate the final response. This method will pass the query to the vector database, which returns the most similar document and stores it in the context variable. This context variable, along with the input from the retrieval chain, is passed to the document_chain, which returns the final response.

The following script implements the method to generate the final response.


def generate_response(query):
    response = retrieval_chain.invoke({"input": query})
    print(response["answer"])

Let's now ask some questions about the prices of tickets for the tennis games. The page for the tennis games' ticket prices looks like this.

image2.png

Let's first ask about the lowest-priced ticket.

query = "What is the lowest ticket price for tennis games?"
generate_response(query)

Output:

The lowest ticket price for tennis games is 30.

The above response is correct, as shown in the price table.

Let's see another example. The following output shows that the model correctly returns the category for the lowest-priced ticket.

query = "What is the category for the lowest ticket price for tennis games?"
generate_response(query)

Output:

The category for the lowest ticket price for tennis games is D.

You can also ask a more complicated question such as the following, and you will see that the model will generate a correct response.

query = "What is the maximum price for category B for men's single tennis games for non-medal games?"
generate_response(query)

Output:

The maximum price for category B for men's single tennis games for non-medal games is 185.

I tried asking further questions about other sports, game venues, etc., and the model always returned the correct response.

Conclusion

The retrieval augmented generation (RAG) technique has made it easy to develop customized chatbots on your data. In this tutorial, you saw how to develop a chatbot that can provide ticket information for all the games in the Paris Olympics. You can use the same approach to develop chatbots that query other data types such as PDFs, websites, text documents, etc.

Claude 3 Opus Vs. Google Gemini Vs. GPT-4 for Zero-Shot Text Classification

On March 4, 2024, Anthropic launched the Claude 3 family of large language models. Anthropic claimed that its Claude 3 Opus model outperforms GPT-4 on various benchmarks.

Intrigued by Anthropic's claim, I performed a simple test to compare the performances of Claude 3 Opus, Google Gemini Pro, and OpenAI's GPT-4 for zero-shot text classification. This article explains the experiment and the results obtained, along with my personal observations.

Note: I have already compared the performance of Google Gemini Pro and ChatGPT on another dataset in one of my previous articles. This article adds Claude 3 Opus to the list of compared models. In addition, the tests are performed on a significantly more difficult dataset.

So, let's begin without further ado.

Importing and Installing Required Libraries

The following script installs the client libraries needed to access the Claude 3 Opus, Google Gemini Pro, and OpenAI GPT-4 models.


!pip install anthropic
!pip install --upgrade google-cloud-aiplatform
!pip install openai

The script below imports the required libraries.


import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import anthropic
from openai import OpenAI
import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part
Importing and Preprocessing the Dataset

We will use LLMs to make zero-shot predictions on the US Airline Sentiment dataset, which you can download from Kaggle.

The dataset consists of tweets regarding various US airlines. The tweets are manually annotated for positive, negative, or neutral sentiments. The text column contains the tweet texts, while the airline_sentiment column contains sentiment labels.

The following script imports the dataset, prints the dataset shape, and displays the dataset's first five rows.


## Dataset download link
## https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment?select=Tweets.csv

dataset = pd.read_csv(r"D:\Datasets\tweets.csv")
print(dataset.shape)
dataset.head()

Output:

image1.png

The dataset originally consists of 14,640 records. However, for this comparison, we will randomly select 100 records with a roughly equal proportion of tweets from each sentiment category: 34 neutral and 33 each for positive and negative sentiments. The following script selects 100 random tweets.


# Remove rows where 'airline_sentiment' or 'text' are NaN
dataset = dataset.dropna(subset=['airline_sentiment', 'text'])

# Remove rows where 'airline_sentiment' or 'text' are empty strings
dataset = dataset[(dataset['airline_sentiment'].str.strip() != '') & (dataset['text'].str.strip() != '')]

# Filter the DataFrame for each sentiment
neutral_df = dataset[dataset['airline_sentiment'] == 'neutral']
positive_df = dataset[dataset['airline_sentiment'] == 'positive']
negative_df = dataset[dataset['airline_sentiment'] == 'negative']

# Randomly sample records from each sentiment
neutral_sample = neutral_df.sample(n=34)
positive_sample = positive_df.sample(n=33)
negative_sample = negative_df.sample(n=33)

# Concatenate the samples into one DataFrame
dataset = pd.concat([neutral_sample, positive_sample, negative_sample])

# Reset index if needed
dataset.reset_index(drop=True, inplace=True)

# print value counts
print(dataset["airline_sentiment"].value_counts())

Output:

airline_sentiment
neutral     34
positive    33
negative    33
Name: count, dtype: int64

We are now ready to perform zero-shot classification with various large language models.

Zero Shot Text Classification with Google Gemini Pro

To access the Google Gemini Pro model via the Google Cloud API, you need to create a Google Cloud project with the Vertex AI service enabled and download the JSON credentials file for a service account with access to that project. Next, you need to create an environment variable GOOGLE_APPLICATION_CREDENTIALS and set its value to the path of the JSON file you just downloaded.

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "PATH_TO_VERTEX_AI_SERVICE_ACCOUNT JSON FILE"

The rest of the process is straightforward. You create an object of the GenerativeModel class and pass the input query to its generate_content() method.

In the following script, we define the find_sentiment_gemini() function, which accepts a tweet and returns its sentiment. Pay attention to the content variable. It contains the prompt we will pass to our Google Gemini Pro model. The prompt will also remain the same for the rest of the models.


model = GenerativeModel("gemini-pro")
config = {
    "max_output_tokens": 10,
    "temperature": 0.0,
}

def find_sentiment_gemini(tweet):

    content = """What is the sentiment expressed in the following tweet about an airline?
    Select sentiment value from positive, negative, or neutral. Return only the sentiment value in small letters.
    tweet: {}""".format(tweet)

    responses = model.generate_content(
        content,
        generation_config= config,
    stream=True,
    )

    for response in responses:
        return response.text

Finally, we iterate through all the tweets from the text column in our dataset and pass the tweets to the find_sentiment_gemini() method. The response is saved in the all_sentiments list. We time the script to see how long it takes to execute.


%%time

all_sentiments = []

tweets_list = dataset["text"].tolist()

i = 0
exceptions = 0
while i < len(tweets_list):

    try:
        tweet = tweets_list[i]
        sentiment_value = find_sentiment_gemini(tweet)
        all_sentiments.append(sentiment_value)
        i = i + 1
        print(i, sentiment_value)

    except Exception as e:
        print("===================")
        print("Exception occurred:", e)
        exceptions = exceptions + 1

print("Total exception count:", exceptions)

Output:

Total exception count: 0
CPU times: total: 312 ms
Wall time: 54.5 s

The above output shows that the script took 54.5 seconds to run.

Finally, you can calculate model accuracy by comparing the values in the airline_sentiment columns of the dataset with the all_sentiments list.

accuracy = accuracy_score(all_sentiments, dataset["airline_sentiment"])
print("Accuracy:", accuracy)

Output:

Accuracy: 0.78

You can see that the model achieves 78% accuracy.
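
Note that LLM responses sometimes include trailing whitespace, newlines, or different capitalization. If the computed accuracy ever looks unexpectedly low, a small normalization step before scoring can help; here is a minimal sketch:


# optional cleanup: strip whitespace and lower-case the predicted labels before scoring
cleaned_sentiments = [s.strip().lower() for s in all_sentiments]
accuracy = accuracy_score(cleaned_sentiments, dataset["airline_sentiment"])
print("Accuracy:", accuracy)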

Zero Shot Text Classification with GPT-4

Let's now perform zero-shot classification with GPT-4. The process remains the same. We will first create an OpenAI client using the OpenAI API key.

Next, we define the find_sentiment_gpt() function, which internally calls the OpenAI.chat.completions.create() method to generate a response for the input tweet.

client = OpenAI(
    # This is the default and can be omitted
    api_key = os.environ.get('OPENAI_API_KEY'),
)


def find_sentiment_gpt(tweet):

    content = """What is the sentiment expressed in the following tweet about an airline?
    Select sentiment value from positive, negative, or neutral. Return only the sentiment value in small letters.
    tweet: {}""".format(tweet)

    sentiment = client.chat.completions.create(
      model= "gpt-4",
      temperature = 0,
      max_tokens = 10,
      messages=[
            {"role": "user", "content": content}
        ]
    )

    return sentiment.choices[0].message.content

Next, we iterate through all the tweets and pass each tweet to the find_sentiment_gpt() function. The responses are stored in the all_sentiments list.


%%time

all_sentiments = []

tweets_list = dataset["text"].tolist()

i = 0
exceptions = 0
while i < len(tweets_list):

    try:
        tweet = tweets_list[i]
        sentiment_value = find_sentiment_gpt(tweet)
        all_sentiments.append(sentiment_value)
        i = i + 1
        print(i, sentiment_value)

    except Exception as e:
        print("===================")
        print("Exception occurred:", e)
        exceptions = exceptions + 1

print("Total exception count:", exceptions)

Output:

Total exception count: 0
CPU times: total: 250 ms
Wall time: 49.4 s

The GPT-4 model took 49.4 seconds to process 100 tweets.

The following script prints the model's accuracy.

accuracy = accuracy_score(all_sentiments, dataset["airline_sentiment"])
print("Accuracy:", accuracy)

Output:

Accuracy: 0.79

The output shows that GPT-4 achieves a slightly better accuracy (79%) than Google Gemini Pro (78%).

Zero Shot Text Classification with Claude 3 Opus

Finally, let's try the supposedly best model, Claude 3 Opus. To generate text using Claude 3, you need to create a client object of the anthropic.Anthropic class and pass it your Anthropic API key, which you can retrieve by signing up for the Claude console.

You can then call the client's messages.create() method to generate a response.

The following script defines the find_sentiment_claude() function, which returns the sentiment of a tweet using the Claude 3 Opus model.

client = anthropic.Anthropic(
    # defaults to os.environ.get("ANTHROPIC_API_KEY")
    api_key = os.environ.get('ANTHROPIC_API_KEY')
)

def find_sentiment_claude(tweet):

    content = """What is the sentiment expressed in the following tweet about an airline?
    Select sentiment value from positive, negative, or neutral. Return only the sentiment value in small letters.
    tweet: {}""".format(tweet)

    sentiment = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1000,
        temperature=0.0,
        messages=[
            {"role": "user", "content": content}
        ]
    )

    return sentiment.content[0].text

We can pass all the tweets to the find_sentiment_claude() function and store the corresponding responses in the all_sentiments list. Finally, we can compare the response predictions with the actual sentiment labels to calculate the model's accuracy.

%%time

all_sentiments = []

tweets_list = dataset["text"].tolist()

i = 0
exceptions = 0
while i < len(tweets_list):

    try:
        tweet = tweets_list[i]
        sentiment_value = find_sentiment_claude(tweet)
        all_sentiments.append(sentiment_value)
        i = i + 1
        print(i, sentiment_value)

    except Exception as e:
        print("===================")
        print("Exception occurred:", e)
        exceptions = exceptions + 1

print("Total exception count:", exceptions)

accuracy = accuracy_score(all_sentiments, dataset["airline_sentiment"])
print("Accuracy:", accuracy)

Output:

Total exception count: 0
Accuracy: 0.71
CPU times: total: 141 ms
Wall time: 3min 8s

The above output shows that Claude 3 Opus took 3 minutes and 8 seconds to process 100 tweets and achieved an accuracy of only 71%, substantially lower than GPT-4 and Gemini Pro. Given Anthropic's bold claims, I was not impressed.

Conclusion

The results of the experiments in this article show that, despite Anthropic's bold claims, the performance of Claude 3 Opus on a simple task such as zero-shot text classification was not up to the mark. I would still use GPT-4 or Gemini Pro for zero-shot text classification tasks.

7 NLP Tasks to Perform for Free in Python with Mistral 7b LLM

In the rapidly evolving field of Natural Language Processing (NLP), open-source large language models (LLMs) are becoming increasingly popular as they are free to use. Among these, the Mistral family of models stands out as a state-of-the-art model that is freely accessible to the public.

Comparable in performance to the renowned GPT 3.5, Mistral 7b enables users to perform various NLP tasks, such as text generation, text classification, and more, without any cost.

While GPT 3.5 can be used for free in a browser, utilizing its functions in a Python application via the OpenAI API incurs charges. This is where open-source large language models (LLMs) like Mistral 7b become game-changers.

This article explores leveraging the Mistral 7b Instruct model (seven billion parameters) to execute seven common NLP tasks within your Python applications using the Hugging Face library. So, let's dive in without further ado.

Importing and Installing Required Libraries

The following script installs the libraries required to run scripts in this article.


!pip install git+https://github.com/huggingface/transformers
!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U accelerate==0.27.1

Since I am using Google Colab to run the scripts in this article, the rest of the libraries are pre-installed in the environment.

The following script imports the required libraries.


from transformers import AutoModelForCausalLM, AutoTokenizer, logging
from transformers import BitsAndBytesConfig
import torch
Importing and Configuring the Mistral 7b Instruct Model

Mistral 7b is a large model with seven billion parameters. We will quantize it by reducing its weight precision to four bits, which allows us to fit Mistral 7b on low-memory hardware. The following script defines the weight precision for our Mistral 7b model.


#Ignore warnings
logging.set_verbosity(logging.CRITICAL)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

Next, we import the Mistral-7B-Instruct-v0.1 model, a variant of the Mistral 7b model trained to handle chatbot use cases. The script below imports this model and input tokenizer from the HuggingFace library. We pass the quantization configuration settings to the quantization_config parameter of the from_pretrained() method.


model_id = "mistralai/Mistral-7B-Instruct-v0.1"
device = "cuda" # the device to load the model onto
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_config,
                                             device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(model_id)

We are now ready to interact with the Mistral 7b model. Let's see how to perform various NLP tasks with Mistral.

1. Question Answering with Mistral 7b

We will define a general purpose function generate_response() that accepts the following parameters:

  1. the input text.
  2. the maximum number of response tokens.
  3. the model temperature (a measure of the model's creativity).

Inside the generate_response function, we wrap the input text in a message list, convert the messages to the required format using the tokenizer.apply_chat_template() method, and pass the encoded inputs to the model.generate() method.

The model's response contains both the input text and the corresponding response. We will extract the response text only by splitting the string using the [/INST] substring.


def generate_response(input_text, response_tokens, temperature):
  messages = [
      {"role": "user", "content": input_text},
  ]
  encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

  model_inputs = encodeds.to(device)

  generated_ids = model.generate(model_inputs,
                                max_new_tokens=response_tokens,
                                temperature = temperature,
                                do_sample=True)
  decoded = tokenizer.batch_decode(generated_ids)
  return decoded[0].split("[/INST]")[1].rstrip("</s>")

Let's now ask our model a simple question: how to bake a pizza. Notice that we set the temperature value to 0.1 since we want the model to be accurate rather than creative.


input_text = "How to bake a pizza?"
response = generate_response(input_text, 1000, 0.1)
print(response)

Output:

image1.png

2. Text Summarization with Mistral 7b

Let's now perform text summarization. In this case, we set the model temperature to a higher value to allow the model to be more creative in summarizing the input paragraph.


paragraph = """
A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification.
LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process.[1]
LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.[2] LLMs are artificial neural networks.
The largest and most capable are built with a decoder-only transformer-based architecture while some recent implementations are based on other architectures, such as recurrent neural network variants and Mamba (a state space model)
"""

print(f"Total characters in original paragraph: {len(paragraph)}")
input_text = f"Summarize the following text: {paragraph}"
response = generate_response(input_text, 500, 0.9)
print(f"Total characters in summarized paragraph: {len(response)}")
print(response)

Output:

image2.png

You will see that the model generates a very accurate summary.

3. Text Classification with Mistral 7b

Text classification involves assigning a label or category to an input text. In the following script, we ask the model to assign a sentiment label to the input review.

We ask the model to return one of the three possible output labels. To receive a short response, we set the number of output tokens to 10. In addition, we set the temperature to 0.1 to get a more certain and accurate response.


review = """
I enjoyed the movie but found it very long at times with boring scenes.
"""

input_text = f"Find the sentiment of this review, your response should only contain  single word 'positive', 'negative', or 'neutral: {review}"
response = generate_response(input_text, 10, 0.1)
print(response)

Output:


neutral

The above output shows that the model has successfully guessed the label for the input review.

4. Text Translation with Mistral 7b

Text translation is another common task that you can perform with Mistral 7b. Again, I recommend setting the temperature to a lower value for translation tasks. Here is an example:


input = """
I am feeling hugry, I think I should go out and have lunch.
"""

input_text = f"Translate the following into French. The response should only contain the French translation: {input}"
response = generate_response(input_text, 100, 0.1)
print(response)

Output:

 Je suis fâché, je pense que je devrais sortir et avoir déjeuner.
5. Text Generation with Mistral 7b

You can also ask Mistral 7b to generate text for you. For instance, the following script asks Mistral to recommend five catchy names for an ice cream parlor on a beach. Again, the temperature for such creative tasks should be higher.


input_text = "Give me 5 catchy names for an ice cream parlor on a beach"
response = generate_response(input_text, 200, 0.9)
print(response)

Output:

1. "Scoops by the Sea"
2. "Beach Bites"
3. "Surfside Scoops"
4. "Wave Wonders"
5. "Ocean Oasis"
6. Code Generation with Mistral 7b

Mistral 7b also allows you to generate code, as seen in the following script.


input_text = "Write a Python fuction to add two numbers"
response = generate_response(input_text, 200, 0.1)
print(response)

Output:

image3.png

Note: Be careful with the generated code and always verify it. If you run into an error, you can ask Mistral again for a fix.
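
For example, suppose Mistral returned a function named add_numbers (the name and body here are hypothetical); a couple of quick test cases are enough to verify the generated code before using it:


# hypothetical example of what Mistral might return for this prompt
def add_numbers(a, b):
    return a + b

# quick sanity checks on the generated function
assert add_numbers(2, 3) == 5
assert add_numbers(-1, 1) == 0
print("Generated function behaves as expected.")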

7. Named Entity Recognition

Another common task you can perform with Mistral is named entity recognition (NER), which involves identifying and classifying critical information (entities) in text into predefined categories such as the names of persons, organizations, locations, etc.

Here is an example of performing NER in Python with Mistral 7b.


input = """
Ronaldo from Portugal was one of the best players Manchester United ever produced in the Premier League
"""

input_text = f"Extract name entities from the following text. Response should be in the form word -> entity type: {input}"
response = generate_response(input_text, 100, 0.1)
print(response)

Output:


Ronaldo -> Person
Portugal -> Location
Manchester United -> Organization
Premier League -> Event
Conclusion

In this article, you saw how to perform various NLP tasks using the Mistral 7b model in Python. The Mistral series of models is open-source and free to use for commercial purposes. Their performance is on par with GPT 3.5; however, calling GPT 3.5 via the OpenAI API incurs a cost, which is where using a Mistral LLM via the Hugging Face API comes in.

Retrieval Augmented Generation (RAG) with Google Gemma From HuggingFace

In a previous article, I explained how to fine-tune Google's Gemma model for text classification. In this article, I will explain how you can improve the performance of a pretrained large language model (LLM) using the retrieval augmented generation (RAG) technique. So, let's begin without further ado.

What is Retrieval Augmented Generation (RAG)?

Retrieval Augmented Generation (RAG) enhances a language model's knowledge by integrating external information into the response generation process. By dynamically pulling relevant information from a vast corpus of data, RAG enables models to produce more informed, accurate, and contextually rich responses, bridging the gap between raw computational power and real-world knowledge.

RAG works in the following four steps:

  1. Store data containing external knowledge into a vector database.
  2. Convert the input query into corresponding vector embeddings and retrieve the text from the database having the highest similarity with the input query.
  3. Combine the input query with the information retrieved from the vector database.
  4. Pass the formulated query to an LLM and generate a response.

You will see how to perform the above steps in this tutorial.
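
Before building the full pipeline, here is a minimal sketch of these four steps in Python; every name below is a placeholder for the concrete components created later in this tutorial:


# high-level sketch of the RAG loop; all names are placeholders
def rag_answer(query, vector_db, llm):
    # step 1 (done once, offline): external documents are embedded and stored in vector_db
    # step 2: embed the query and retrieve the most similar chunk
    context = vector_db.similarity_search(query)[0].page_content
    # step 3: combine the input query with the retrieved context
    prompt = f"Query: {query}\nAnswer the query using the following context:\n{context}"
    # step 4: pass the combined prompt to the LLM to generate the final response
    return llm(prompt)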

RAG with Google Gemma from HuggingFace

We will first import the required libraries and then import our dataset from Kaggle. The dataset consists of Warren Buffett's letters to investors from 1977 to 2021.

Next, we will split our dataset into chunks using the Python LangChain module. Subsequently, we will import an embedding model from HuggingFace and create a dataset containing vector embeddings for the text chunks.

After that, we will retrieve responses from the dataset based on our input query. Finally, we will pass the query and database response to the Gemma LLM model to generate the final response.

Importing Required Libraries
!pip install -q langchain
!pip install -q torch
!pip install -q -U transformers==4.38.0
!pip install -q sentence-transformers
!pip install -q -U bitsandbytes==0.42.0
!pip install -q datasets
!pip install -q faiss-cpu
!pip install unstructured
!pip install accelerate
!pip install kaggle
!pip install huggingface-hub

The script below imports required libraries.


from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains.question_answering import load_qa_chain
from sentence_transformers import SentenceTransformer
from langchain.vectorstores import FAISS
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from transformers import BitsAndBytesConfig, GemmaTokenizer
from langchain_community.document_loaders import DirectoryLoader
import torch
Importing the Warren Buffett Letters Dataset from Kaggle

I ran my scripts in Google Colab and downloaded the Kaggle dataset there.

Using the following script, you can upload your kaggle.json file, which contains your Kaggle API key, into Google Colab.


from google.colab import files
uploaded = files.upload()

Next, you can run the following script to download and unzip the dataset into your Google Colab directory.


!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

!kaggle datasets download -d balabaskar/warren-buffet-letters-to-investors-1977-2021

!unzip -q /content/warren-buffet-letters-to-investors-1977-2021.zip
Reading and Splitting Documents with Langchain

The following script uses the LangChain DirectoryLoader().load() method to load the text documents into LangChain document objects.


folder_path = '/content/Warren_buffet_letters/Warren_buffet_letters'
loader = DirectoryLoader(folder_path, glob='**/*.txt')
docs = loader.load()
print(f"Total documents loaded: {len(docs)}")

Output:

Total documents loaded: 45

Next, we will divide our documents into multiple chunks using the RecursiveCharacterTextSplitter from the langchain.text_splitter module. You can use any other splitter if you want.

The following script creates an object of the RecursiveCharacterTextSplitter class. We divide our documents into chunks of 1000 characters with an overlap of 200 characters between all chunks.


splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  
    chunk_overlap=200,  
    length_function=len
)

The script below divides all the documents into text chunks using the RecursiveCharacterTextSplitter splitter.


all_text_chunks = []  # To store chunks from all documents
for doc in docs:
    text_content = doc.page_content
    text_chunks = splitter.split_text(text_content)
    all_text_chunks.extend(text_chunks)

print("Total chunks:", len(all_text_chunks))
print("============================")

Output:

Total chunks: 4795
============================
Creating Document Embeddings

The next step is to create vector embeddings for these chunks. You can use any embedding model you want; for this article, I will use a free open-source embedding model from Hugging Face.


# use a free open-source embedding model from Hugging Face
model_path = "thenlper/gte-large"
embeddings = HuggingFaceEmbeddings(
    model_name = model_path
)

# build a FAISS vector store from the text chunks
embedding_vectors = FAISS.from_texts(all_text_chunks, embeddings)

FAISS, used in the above script, is a Facebook library for efficient similarity search and clustering of vector embeddings. We store our document embeddings in a FAISS index.
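
If you want to reuse the index across sessions instead of rebuilding it every time, LangChain's FAISS wrapper can persist it to disk. Here is a minimal sketch; the directory name is arbitrary, and depending on your LangChain version, load_local may require an extra allow_dangerous_deserialization=True argument:


# optional: persist the FAISS index so it does not have to be rebuilt every run
embedding_vectors.save_local("buffett_faiss_index")

# later, reload it with the same embedding model
# embedding_vectors = FAISS.load_local("buffett_faiss_index", embeddings)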

We have created our vector embeddings. Let's see an example. In the following script, we pass an input query to our vector embedding database, which returns the text chunk with the highest similarity.


question = "What is Warren Buffets Investment Pshychology?"
searchDocs = embedding_vectors.similarity_search(question)
searchDocs[0].page_content

Output:

image1.png

Getting Response Using Gemma Model

We will pass our input query and response from the vector database to the Gemma model to generate the final response.

First, you must log in to the Hugging Face CLI by providing your Hugging Face access token in response to the following command.

!huggingface-cli login
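
If you prefer to authenticate from code rather than the CLI (for example, inside a Colab notebook), the huggingface_hub library provides an equivalent login helper:


# alternative to the CLI command above; prompts for your access token in the notebook
from huggingface_hub import notebook_login
notebook_login()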

Next, we will import the tokenizer and model weights for the gemma-2b-it model, a 2-billion-parameter instruction-tuned variant of Gemma.


model_name = "google/gemma-2b-it"

device = "cuda:0"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

Finally, we will define the generate_response() function that accepts a text query, generates a response from the vector embeddings, combines the input query and vector response, and passes it to the Gemma model for the final response.


def generate_response(query):
  searchDocs = embedding_vectors.similarity_search(query)

  response = searchDocs[0].page_content

  final_query = f"""Query: {query}\nContinue to answer the query by using the following Search Results.\n{response}. <end>"""
  print(final_query)


  inputs = tokenizer(final_query, return_tensors="pt").to(device)
  outputs = model.generate(**inputs, max_new_tokens = 500)
  final_response = tokenizer.decode(outputs[0], skip_special_tokens=True)

  return final_response

We can test the generate_response function using the following script:


query = "What is Warren Buffets Investment Pshychology?"
final_response = generate_response(query)
print("===================================")
print("RESPONSE FROM RAG MODEL")
print("===================================")
print(final_response.split("<end>")[1])

Output:

image2.png

You can see that the response contains information from our dataset. The response is customized based on the information retrieved from our vector database.

Conclusion

RAG is a powerful technique for integrating external knowledge into an LLM response. In this article, you saw how you can use vector embeddings and RAG to retrieve enhanced responses from an LLM model. You can use this technique to create custom chatbots based on your dataset. I suggest you try the Gemma 7b (seven billion parameters) model to see if you get better responses.

Fine Tuning Google Gemma Model for Text Classification in Python

On February 21, 2024, Google released Gemma, a family of state-of-the-art open-source large language models (LLMs). According to initial results, its 7b (seven-billion-parameter) version performs better than Meta's Llama 2, the previous state-of-the-art open-source LLM.

As always, my first test with any new open-source LLM is the text classification task. In this tutorial, I will show you how to fine-tune the Google Gemma LLM for text classification tasks in Python. So, let's begin without further ado.

Installing and Importing Required Libraries

The following script installs libraries required to run scripts in this article.

!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U peft==0.8.2
!pip3 install -q -U trl==0.7.10
!pip3 install -q -U accelerate==0.27.1
!pip3 install -q -U datasets==2.17.0
!pip3 install -q -U transformers==4.38.0
!pip3 install -q -U datasets
!pip install huggingface-hub

The script below imports the required libraries into your Python application.


import os
import transformers
import torch
from google.colab import userdata
from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig, GemmaTokenizer
import pandas as pd
from datasets import Dataset

Finally, you must run the following script and enter your Hugging Face user access token.

!huggingface-cli login

Google Gemma is a new model, and you must agree to its terms of use before importing it from Hugging Face. You can agree to its terms of use on the Hugging Face Gemma model card.

Testing the Google Gemma Model for Causal LM Tasks

Let's first test the default Gemma 2b model without fine-tuning it for the text classification task.

Gemma is a huge model that requires a lot of resources and time to run. We can reduce the model weight sizes using a bitsandbytes configuration. The following script sets the model weight precision to 4 bits.


bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

The script below imports the Gemma 2b tokenizer and model. You can also try the Gemma 7b version if you want, but it requires more resources and time to run.

model_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_config,
                                             device_map={"":0})

Finally, we pass some text to the Gemma model and see what we get.


text = "Jack of all"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output:


Jack of all trades, master of none.

Thats

The above output shows that the Gemma model correctly predicts the text that follows the input text. It also outputs some additional text since we told it to generate ten tokens.

Fine Tuning Google Gemma Model

Let's now fine-tune our Gemma model for a text classification task.

Importing and Preprocessing the Dataset

We will use the IMDB movie review dataset that contains around 50k positive and negative movie reviews.

The following script imports the CSV file into a Pandas dataframe. We randomly shuffle the dataset and keep only 5,000 records for fine-tuning. You can fine-tune on any number of records.


dataset = pd.read_csv(r"/content/IMDB Dataset.csv")
dataset = dataset.sample(frac=1).reset_index(drop=True)
dataset = dataset.head(5000)
print(dataset['sentiment'].value_counts())
dataset.head()

Output:

image1.png

The script below converts our Pandas dataframe to a Hugging Face dataset. The script divides the dataset into 80% training and 20% test set.


dataset = Dataset.from_pandas(dataset)
final_dataset = dataset.train_test_split(test_size=0.2)

Next, we define a formatting function that converts the dataset into a format we can use to fine-tune our Gemma model. The function combines each review and its sentiment label into a single prompt string.


def formatting_func(example):
    text = f"Review: {example['review'][0]}\nSentiment: {example['sentiment'][0]}"
    return [text]

formatting_func(final_dataset['train'])

Output:


['Review: i was very impressed with this production on likely all levels; from production to plot and character development.<br /><br />this definitely fall under the "realism" genre, since there is nothing going on here that ...\nSentiment: positive']

From the above output, you can see that the formatted record consists of a list that starts with the word Review: followed by the text review. At the end of the review, we insert a new line and add the text Sentiment: followed by the review sentiment.

Fine Tuning Gemma Model

Finally, we are ready to fine-tune our Gemma model.

We will use the LoRA (Low-Rank Adaptation) approach to fine-tune only a subset of our Gemma model's weights. Fine-tuning the complete Gemma model can take hours; LoRA is a common approach for fine-tuning very large language models efficiently.

The following script sets the LoRa configuration for fine-tuning.



lora_config = LoraConfig(
    r = 8,
    target_modules = ["q_proj", "o_proj", "k_proj", "v_proj",
                      "gate_proj", "up_proj", "down_proj"],
    task_type = "CAUSAL_LM",
)

Finally, you can create an object of the SFTTrainer class and pass the Gemma model object, the training data, and various training arguments. Next, you can call the train() method to train the Gemma model. The model will be trained for 100 steps.


trainer = SFTTrainer(
    model=model,
    train_dataset=final_dataset["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=100,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)

trainer.train()

Output:

image2.png

Let's now try to make a prediction using our fine-tuned model. To do so, we will take a single example and convert it to the same format as used for fine-tuning the Gemma model.

text = f"Review: {final_dataset['test'][2]['review']}\nSentiment: "
print(text)

Output:

Review: If you know anything about the Manhattan Project, you will find "Fat Man and Little Boy" at least an interesting depiction of the events surrounding that story. The film is in all ways a very realistic portrayal of these events, and in many ways it is almost too real.... something to think about.<br /><br />*** out of ****
Sentiment:

The above output shows the text that we will pass to the model. The model will predict the word after the sentiment, i.e., positive or negative.

The following script generates the Gemma model output for the above text. We set the max_new_tokens size to 1 since we want a single word in the output. Finally, we decode the output and print the last generated word.

device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=1)
prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
prediction.split(" ")[-1]

Output:


positive

You can see that the model assigned a positive sentiment to the input text review.

Evaluating Fine-tuned Model Performance on Test Set

To test the model on the complete dataset, we define the predict_sentiment() function that accepts a text review, formats it, and predicts its sentiment using our fine-tuned Gemma model.


def predict_sentiment(review):
  text = f"Review: {review}\nSentiment: "
  inputs = tokenizer(text, return_tensors="pt").to(device)
  outputs = model.generate(**inputs, max_new_tokens=1)
  prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
  sentiment = prediction.split(" ")[-1]
  return sentiment

Next, we loop through all the reviews in the test set, pass each review to the predict_sentiment() method, and store the response in the predictions list. The model may sometimes predict words other than positive or negative, which we discard.


targets = []
predictions = []

for i in range(len(final_dataset['test'])):

  review = final_dataset['test'][i]['review']
  target_sentiment = final_dataset['test'][i]['sentiment']
  predicted_sentiment = predict_sentiment(review)

  if predicted_sentiment in ["positive", "negative"]:
    targets.append(target_sentiment)
    predictions.append(predicted_sentiment)
    print(f"Record {i+1} - Actual:{target_sentiment}, Predicted: {predicted_sentiment}")

Output:

image3.png

Finally, we can compare the actual and predicted reviews to calculate model accuracy on the test set.


from sklearn.metrics import accuracy_score, classification_report

accuracy = accuracy_score(targets, predictions)
print(f'Accuracy: {accuracy:.2f}')

report = classification_report(targets, predictions)
print('Classification Report:\n', report)

Output:

image4.png

The above output shows that the model achieved an accuracy of around 88% on the test set. You can fine-tune the Gemma 2b model on a larger dataset or use the Gemma 7b model to get better results.
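
If you want to reuse the fine-tuned model later without repeating the training, you can save the LoRA adapter weights and the tokenizer to disk; a minimal sketch (the output directory name is arbitrary):


# save the trained LoRA adapter and tokenizer for later reuse
trainer.model.save_pretrained("gemma-2b-imdb-lora")
tokenizer.save_pretrained("gemma-2b-imdb-lora")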

Extract Tabular Data from PDF Images using Hugging Face Table Transformer

In a previous article, I explained how to extract tabular data from PDF image documents using the multimodal Google Gemini Pro. However, Google Gemini Pro has a couple of disadvantages. First, it is not free, and second, it needs complex prompt engineering to retrieve table, column, and row pixel coordinates.

To solve the problems above, in this article, you will see how to extract tables from PDF image documents using Microsoft's Table Transformer from the Hugging Face library. You will see how to detect tables, rows, and columns within a table, extract cell values from tables using OCR, and save the table as a CSV file. So, let's begin without further ado.

Installing and Importing Required Libraries

The first step is to install various libraries you will need to run scripts in this article.

!pip install transformers
!sudo apt install tesseract-ocr
!pip install pytesseract
!pip install easyocr
!sudo apt-get install -y poppler-utils
!pip install pdf2image
!wget "https://fonts.google.com/download?family=Roboto" -O roboto.zip
!unzip roboto.zip -d ./roboto

The following script imports the required libraries into your application.


from transformers import AutoImageProcessor, TableTransformerForObjectDetection
import torch
from PIL import Image, ImageDraw, ImageFont
import matplotlib.pyplot as plt
import csv
import numpy as np
import pandas as pd
from pdf2image import convert_from_path
from tqdm.auto import tqdm
import pytesseract
import easyocr

Table Detection with Table Transformer

The Table Transformer has two sub-models: table-transformer-detection and table-structure-recognition-v1.1-all. As a first step, we will detect tables within a PDF document using the table-transformer-detection model.

Importing and Converting PDF to Image

The following script defines the pdf_to_img() function, which converts the first page of a PDF document into a PIL image. This step is mandatory since the Table Transformer expects documents in image format.

# convert PDF to Image
def pdf_to_img(pdf_path):

  image = convert_from_path(pdf_path)[0].convert("RGB")
  return image

pdf_path = '/content/sample_input_ieee-10.pdf'
image = pdf_to_img(pdf_path)
image

Output:

image1.png

The above output shows the input image. We will detect tables inside this image.

Detecting Tables

The following script imports the preprocessor and model objects for the table-transformer-detection model. The preprocessor converts the input image to a format the table-transformer-detection model can process.


model_name = "microsoft/table-transformer-detection"
# define image preprocessor for table transformer
image_processor = AutoImageProcessor.from_pretrained(model_name)

# import table transformer model for table detection
model = TableTransformerForObjectDetection.from_pretrained(model_name,
                                                           revision="no_timm")

Next, we define the detect_table() function, which accepts the input image as a parameter. The function preprocesses the image and then passes it to the table-transformer-detection model.

The image processor's post_process_object_detection() method processes the output of the table-transformer-detection model. The final processed output consists of the label, bounding box coordinates, and prediction confidence score for the detected tables. The detect_table() function returns this final output.


def detect_table(image_doc):

  # preproces image document
  inputs = image_processor(images = image_doc, return_tensors="pt")

  # detect tables
  outputs = model(**inputs)

  # convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
  target_sizes = torch.tensor([image_doc.size[::-1]])
  results = image_processor.post_process_object_detection(outputs,
                                                          threshold=0.9,
                                                          target_sizes=target_sizes)[0]

  return results


results = detect_table(image)
results

Output:


{'scores': tensor([0.9993, 0.9996], grad_fn=<IndexBackward0>),
 'labels': tensor([0, 0]),
 'boxes': tensor([[ 111.4175,  232.4397, 1481.5710,  606.8784],
         [ 110.4231,  738.1602, 1471.6283,  916.2267]],
        grad_fn=<IndexBackward0>)}

The above output shows the confidence score, labels (0 for table), and bounding box coordinates for the two detected tables.

Next, we define the get_table_bbox() function, which prints the labels, confidence scores, and bounding box coordinates for the detected tables. The function also returns the detected bounding box coordinates for all the tables.


def get_table_bbox(results):

  tables_coordinates = []

  # iterate through all the detected table data
  for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]

    # store bbox coordinates in Pascal VOC format for later use
    table_dict = {"xmin" : box[0],
                  "ymin" : box[1],
                  "xmax" : box[2],
                  "ymax" : box[3]}

    tables_coordinates.append(table_dict)

    # print prediction label, prediction confidence score, and bbox values
    print(
        f"Detected {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
        )

  return tables_coordinates

table_bbox = get_table_bbox(results)

Output:


Detected table with confidence 0.999 at location [69.43, 344.96, 660.61, 488.47]
Detected table with confidence 0.989 at location [68.7, 549.5, 657.53, 838.82]
Display Tables

Finally, the script below plots the original image and draws red rectangles around the detected tables using their bounding box coordinates.


def highlight_tables(image, table_bbox, padding):
    # Create a drawing context for doc image
    doc_image = image.copy()
    draw = ImageDraw.Draw(doc_image)

    # Iterate over each table in the list
    for table in table_bbox:
        # Define the coordinates for the rectangle with padding for each table
        rectangle_coords = (table["xmin"] - padding,
                            table["ymin"] - padding,
                            table["xmax"] + padding,
                            table["ymax"] + padding)

        # Draw a red rectangle around the detected table
        draw.rectangle(rectangle_coords, outline="red", width=2)

    return doc_image

padding = 10
table_detected_image = highlight_tables(image, table_bbox, padding)
table_detected_image

Output:

image2.png

You can see the detected tables in the above image.

Subsequently, we define the get_cropped_image() function that accepts the original image, the corresponding bounding box coordinates, and padding values as parameters. The get_cropped_image() function returns the cropped table, which you can use to extract rows and columns.

def get_cropped_image(image, table, padding):
  # Create a new image object with the cropped area
  cropped_image = image.copy().crop((table["xmin"] -padding,
                             table["ymin"] - padding,
                             table["xmax"] + padding,
                             table["ymax"] + padding
                             ))

  return cropped_image

cropped_image = get_cropped_image(image, table_bbox[1], padding)
cropped_image

Output:

image3.png

Extract Table Data

Now that we have cropped a table, we can extract rows and columns.

Extract Table Features

You can extract table rows and columns using the table-structure-recognition-v1.1-all model. The following script imports this model.


# import model for detecting table features e.g. rows, columns, etc
structure_model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-structure-recognition-v1.1-all")

We define the get_table_features() function that accepts the cropped table image as a parameter and returns the labels, confidence scores, and the bounding box coordinates for the detected rows and columns. The function also prints these values.


def get_table_features(cropped_image):

  # preprocess image input for table transformer
  inputs = image_processor(images = cropped_image, return_tensors="pt")

  # make prediction using table transformer
  outputs = structure_model(**inputs)

  # post process output to Pascal VOC bbox format
  target_sizes = torch.tensor([cropped_image.size[::-1]])
  results = image_processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[0]

  # define a list to store detected features
  features = []

  # iterate through all the detected features and store feature label, confidence score, and bbox values to cells list
  for i, (score, label, box) in enumerate(zip(results["scores"], results["labels"], results["boxes"])):
      box = [round(i, 2) for i in box.tolist()]
      score = score.item()
      label = structure_model.config.id2label[label.item()]

      cell_dict = {"label":label,
                  "score":score,
                  "bbox":box
                  }


      # print table features
      features.append(cell_dict)
      print(
          f"Detected {label} with confidence "
          f"{round(score, 3)} at location {box}"
      )

  return features


features = get_table_features(cropped_image)

Output:

image4.png

Display Detected Features

Next, we define the display_detected_features() function that draws rectangles around detected rows and columns.


def display_detected_features(cropped_image, features):

  cropped_table_visualized = cropped_image.copy()
  draw = ImageDraw.Draw(cropped_table_visualized)

  # increase font size for text labels
  font = ImageFont.truetype("/content/roboto/Roboto-Bold.ttf", 15)

  # iterate through all features and display bounding box with text labels
  for feature in features:
      draw.rectangle(feature["bbox"], outline="red")

      text_position = (feature["bbox"][0], feature["bbox"][1] - 3)

      draw.text(text_position, feature["label"], fill="blue", font = font)

  # return cropped image with bounding box
  return cropped_table_visualized

display_detected_features(cropped_image, features)

Output:

image5.png

Extract Cell Text Using OCR and Convert to CSV

In the final step, we will detect cell text and convert the detected table to CSV format.

Extract Cell Coordinates

We define the get_cell_coordinates_by_row() function, which iterates through the detected rows and computes the cell bounding box for each column in that row. The function returns a list of rows, where each row contains the cell coordinates for all the columns.

def get_cell_coordinates_by_row(table_data):
    # Extract rows and columns
    rows = [entry for entry in table_data if entry['label'] == 'table row']
    columns = [entry for entry in table_data if entry['label'] == 'table column']

    # Sort rows and columns by their Y and X coordinates, respectively
    rows.sort(key=lambda x: x['bbox'][1])
    columns.sort(key=lambda x: x['bbox'][0])

    # Function to find cell coordinates
    def find_cell_coordinates(row, column):
        cell_bbox = [column['bbox'][0], row['bbox'][1], column['bbox'][2], row['bbox'][3]]
        return cell_bbox

    # Generate cell coordinates and count cells in each row
    cell_coordinates = []

    for row in rows:
        row_cells = []
        for column in columns:
            cell_bbox = find_cell_coordinates(row, column)
            row_cells.append({'cell': cell_bbox})

        # Append row information to cell_coordinates
        cell_coordinates.append({'cells': row_cells, 'cell_count': len(row_cells)})


    return cell_coordinates

cell_coordinates = get_cell_coordinates_by_row(features)
Extract Text from Cell Coordinates using OCR

Finally, we define the apply_ocr() function that iterates through all the rows and then applies the PyTesseract OCR to extract cell values for all the columns in a row. The function returns a dictionary where each dictionary value is a list of items corresponding to row cell values from the input table, as you can see in the output of the following script.


def apply_ocr(cell_coordinates, cropped_image):
    # let's OCR row by row
    data = dict()
    max_num_columns = 0
    for idx, row in enumerate(tqdm(cell_coordinates)):
        row_text = []
        for cell in row["cells"]:
            # crop cell out of image
            cell_image = np.array(cropped_image.crop(cell["cell"]))

            # apply OCR using PyTesseract
            text = pytesseract.image_to_string(cell_image, lang='eng', config='--psm 6').strip()
            if text:
                row_text.append(text)


        if len(row_text) > max_num_columns:
            max_num_columns = len(row_text)

        data[idx] = row_text

    print("Max number of columns:", max_num_columns)

    # pad rows which don't have max_num_columns elements
    for row, row_data in data.copy().items():
        if len(row_data) != max_num_columns:
            row_data = row_data + ["" for _ in range(max_num_columns - len(row_data))]
        data[row] = row_data
        print(row_data)

    return data

data = apply_ocr(cell_coordinates, cropped_image)

Output:

image6.png

As a last step, we iterate through the table rows data dictionary and write the row values line by line to a CSV file using the csv.writer() method.


def write_csv(data):

  with open('output.csv','w') as result_file:
      wr = csv.writer(result_file, dialect='excel')
      for row, row_text in data.items():

        wr.writerow(row_text)

write_csv(data)

df = pd.read_csv("output.csv")
df

Output:

image7.png

The above output shows the Pandas dataframe containing the data from the generated CSV file.

I hope you liked this tutorial. Feel free to leave your feedback or comments.

PDF Image Table Extractor Web App with Google Gemini Pro and Streamlit

In my previous article, I explained how to convert a PDF image to CSV using the multimodal Google Gemini Pro. To do so, I wrote a Python script that passes a text command to Google Gemini Pro to extract tables from PDF images and store them in a CSV file.

In this article, I will build upon that script and develop a web application that allows users to upload images and submit text queries via a web browser to extract tables from PDF images. We will use the Python Streamlit library to develop web data applications.

So, let's begin without further ado.

Installing Required Libraries

You must install the google-cloud-aiplatform library to access the Google Gemini Pro model. For the Streamlit data application, you need to install the streamlit library. The following commands install these libraries:


pip install google-cloud-aiplatform
pip install streamlit
Creating Google Gemini Pro Connector

I will divide the code into two Python files: geminiconnector.py and main.py. The geminiconnector.py library will contain the logic to connect to the Google Gemini Pro model and make API calls.

Code for geminiconnector.py

import os
from vertexai.preview.generative_models import GenerativeModel, Part
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r"PATH_TO_JSON_API_FILE"

model = GenerativeModel("gemini-pro-vision")
config={
    "max_output_tokens": 2048,
    "temperature": 0,
    "top_p": 1,
    "top_k": 32
}


def generate(img, prompt):

    input = img + [prompt]

    responses = model.generate_content(    
        input,
        generation_config= config,
        stream=True,
    )
    full_response = ""

    for response in responses:
        full_response += response.text

    return full_response

I explained the details of the above code in my previous article, so I will not delve into them here.

Creating Web GUI for the PDF Image Table Extractor

We will develop the following GUI, which allows users to upload images from their local drive. The user then clicks the Upload Images button to save the images to a temporary directory. Next, the user enters a text query in a text field and presses the Submit Query button. The response is stored in a Pandas dataframe and displayed in the output.

image1.png

The code for the above GUI is implemented in the main.py file. I will break down the code into multiple code snippets for improved readability.

Import Required Libraries

The following script imports required libraries for the main.py file.

from geminiconnector import generate
from vertexai.preview.generative_models import Part
import streamlit as st
import pandas as pd
import os
import base64
import glob
import re
import csv
Creating an Image Uploader

The first step is to create an image uploader. You can use the st.file_uploader() method, as shown in the following script.


st.write("# Image Table Extractor")
uploaded_files = st.file_uploader("Choose images", accept_multiple_files=True, type=['jpg', 'png'])

Next, we define the save_uploaded_files() function, which accepts the directory for storing images and the uploaded image files as parameters. The following script also defines the local directory path for storing the images.

def save_uploaded_files(directory, uploaded_files):

    if not os.path.exists(directory):
        os.makedirs(directory)  

    for uploaded_file in uploaded_files:
        file_path = os.path.join(directory, uploaded_file.name)

        with open(file_path, "wb") as f:
            f.write(uploaded_file.getbuffer())

local_dir = "tempdir"

Next, we will define the Upload Images button using the st.button() method, which, when clicked, uploads images to the local directory.

if st.button('Upload Images'):
    if uploaded_files:

        save_uploaded_files(local_dir, uploaded_files)
        st.success(f'Images have been Uploaded.')

    else:
        st.error('Please upload at least one image.')

Defining Image Preprocessing Functions

Like the previous article, we will define two image processing functions: get_jpg_file_paths() and read_image(). The former returns the file paths of all the JPG files in a directory, while the latter converts images to a format compliant with Google Gemini Pro.


def get_jpg_file_paths(directory):

    jpg_file_paths = glob.glob(os.path.join(directory, '**', '*.jpg'), recursive=True)
    return [os.path.abspath(path) for path in jpg_file_paths]


def read_image(img_paths):

    imgs_b64 = []
    for img in img_paths:
        with open(img, "rb") as f: # open the image file in binary mode
            img_data = f.read() # read the image data as bytes
            img_b64 = base64.b64encode(img_data) # encode the bytes as base64
            img_b64 = img_b64.decode() # convert the base64 bytes to a string
            img_b64 = Part.from_data(data=img_b64, mime_type="image/jpeg")

            imgs_b64.append(img_b64)

    return imgs_b64

Creating Query Submitter and Result Generator

To capture user queries, we will define a text area using the st.text_area() method, as shown below:


st.write("## Enter your query.")
user_input = st.text_area("query",
                          height=100,
                          label_visibility = "hidden")

Before generating a response from the Google Gemini Pro model, we will define the process_line() function that handles the unique patterns in the response, such as the currency symbols and the decimal separators.


def process_line(line):


    special_patterns = re.compile(r'\d+,\d+\s[%]')

    temp_replacement = "TEMP_CURRENCY"

    currency_matches = special_patterns.findall(line)

    for match in currency_matches:
        line = line.replace(match, temp_replacement, 1)

    parts = line.split(',')

    for i, part in enumerate(parts):
        if temp_replacement in part:
            parts[i] = currency_matches.pop(0)

    return parts

Finally, we will create a Submit Query button, which, when clicked, passes the user input prompt and the input images to the generate() function from the geminiconnector.py file.

The response is split into multiple lines. Each line is formatted using the process_line() function and appended to the data list. The pd.DataFrame constructor converts the data list to a Pandas dataframe, which is displayed on the web page using the st.write() method.
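
The article does not reproduce this part of main.py, so here is a minimal sketch of how the Submit Query button could be wired up, reusing the generate(), get_jpg_file_paths(), read_image(), and process_line() functions defined above; the exact layout of the original main.py may differ.


if st.button('Submit Query'):

    # Read the previously uploaded images and convert them to Gemini Part objects
    image_paths = get_jpg_file_paths(local_dir)
    imgs_b64 = read_image(image_paths)

    # Pass the images and the user prompt to the geminiconnector generate() function
    full_response = generate(imgs_b64, user_input)

    # Split the CSV-style response into lines and clean each line
    lines = full_response.strip().split('\n')
    data = [process_line(line) for line in lines]

    # Assume the first response line holds the column headers; the rest are rows
    df = pd.DataFrame(data[1:], columns=data[0])
    st.write(df)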

And this is it! You have successfully developed your PDF image table extractor. I used the following prompt to extract tables from the PDF image. I intentionally made spelling mistakes to see whether the model would still return the desired results.


"""I have the above receipts. Return a response that contains information from the receipts in a comma-separated file format where row fields are table columns,
whereas row values are column values. The output should contain (header + number of recept rows).
The first row should contain all column headers, and the remaining rows should contain all column values from two recepts one in each row.  
Must use all field values in the receipt. """

You can modify the above prompt to get different information from your PDF file.

Conclusion

In this article, you saw how to create a PDF image table extractor using multimodal Google Gemini Pro and Python Streamlit library. Using Google Gemini Pro is exceptionally straightforward. I encourage you to develop your Streamlit web applications using Google Gemini Pro or other multimodal large language models. It is easy and fun to use and can solve highly complex tasks requiring image and text inputs.

How can I better use C++ and data structures and algorithms

I am a first-year university student from China. My major is Computer Science and Technology. I have been self-learning C++ and data structures and algorithms recently. May I ask how I can learn them well? Is anyone interested in being my teacher or learning together as friends? (Machine translation; my English is not very good, but I can understand some.)

Converting PDF Image to CSV Using Multimodal Google Gemini Pro

In this article, you will learn to use Google Gemini Pro, a state-of-the-art multimodal generative model, to extract information from PDF and convert it to CSV files. You will use a simple text prompt to tell Google Gemini Pro about the information you want to extract. This is a valuable skill for data analysis, reporting, and automation.

You will use Python language to call the Google Vertex AI API functions and extract information from PDF converted to JPEG images.

So, let's begin without further ado.

Importing and Installing Required Libraries

I ran my code on Google Colab, where I only needed to install the Google Cloud APIs. You can install them via the following command.

pip install --upgrade google-cloud-aiplatform

Note: You must create an account with Google Cloud Vertex AI and get your API keys before running the scripts in this tutorial. When you sign up for Google Cloud Platform, you will get $300 worth of free credits.

The following script imports the required libraries into our application.


import base64
import glob
import csv
import os
import re
from vertexai.preview.generative_models import GenerativeModel, Part

Defining Helping Functions for Image Reading

Before using Google Gemini Pro to extract information from PDF tables, you must convert your PDF files to image formats, e.g. JPG, PNG, etc. Google Gemini Pro can only accept images as input, not PDF files. You can use any tool that can convert PDF files to JPG images, such as PDFtoJPG.
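
If you would rather perform this conversion in Python, the pdf2image library offers a simple way to do it (this is an assumption on my part; pdf2image is not used elsewhere in this article, and it requires the Poppler utilities to be installed on your system). Here is a minimal sketch, with receipt.pdf as a hypothetical input file:


from pdf2image import convert_from_path

# Convert each page of the PDF into a PIL image and save it as a JPG file.
# Requires the Poppler utilities to be installed (e.g., via apt or conda).
pages = convert_from_path("receipt.pdf")

for i, page in enumerate(pages):
    page.save(f"receipt_page_{i + 1}.jpg", "JPEG")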

Once you have converted your PDF files to JPG images, you need to read them as bytes and encode them as base64 strings. Google Gemini Pro can only accept base64-encoded strings as input, not raw bytes. You also need to specify the MIME type of the images, which is image/jpeg since we will process JPEG images.

To simplify these tasks, you can define two helper functions: get_jpg_file_paths() and read_image().

The get_jpg_file_paths() function takes a directory as an argument and returns a list of absolute paths to all the JPG files in that directory and its subdirectories.

The read_image() function takes a list of image paths as an argument and returns a list of Part objects, which are helper classes provided by the vertexai.preview.generative_models module. Each Part object contains the base64-encoded string and the mime type of an image.


def get_jpg_file_paths(directory):

    jpg_file_paths = glob.glob(os.path.join(directory, '**', '*.jpg'), recursive=True)
    return [os.path.abspath(path) for path in jpg_file_paths]

def read_image(img_paths):

    imgs_b64 = []
    for img in img_paths:
        with open(img, "rb") as f: # open the image file in binary mode
            img_data = f.read() # read the image data as bytes
            img_b64 = base64.b64encode(img_data) # encode the bytes as base64
            img_b64 = img_b64.decode() # convert the base64 bytes to a string
            img_b64 = Part.from_data(data=img_b64, mime_type="image/jpeg")

            imgs_b64.append(img_b64)

    return imgs_b64
Extracting Information from PDF Using Google Gemini Pro

Now that you know how to convert your PDF files to JPG images and encode them as base64 strings, you can use Google Gemini Pro to extract information from them.

To use Google Gemini Pro, you must create a GenerativeModel object and pass it the name of the model you want to use.
In this tutorial, you will use Google's latest generative model, gemini-pro-vision, a multimodal LLM capable of processing images and text.

You will also use a specific generation config, which is a set of parameters that control the behavior of the generative model.

Before the above steps, however, you will need to set the GOOGLE_APPLICATION_CREDENTIALS environment variable, which stores the path to the JSON file containing your Vertex AI service account and API key information.

The following script sets the environment variable, creates the model object, and defines the configuration settings.


os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r"PATH_TO_VERTEX_AI_SERVICE_ACCOUNT JSON FILE"

model = GenerativeModel("gemini-pro-vision")
config={
    "max_output_tokens": 2048,
    "temperature": 0,
    "top_p": 1,
    "top_k": 32
}

Finally, to generate a response from the Google Gemini Pro model, you need to call the generate_content() method of the GenerativeModel object. This method takes three arguments:

input: A list of Part objects that contain the data and the mime type of the input. You can provide both text and image inputs in this list.

generation_config: A dictionary containing the generation parameters you set earlier.

stream: A boolean value that indicates whether you want to receive the response as a stream or as a single object.

You can use the following code to define the generate() function that generates a response from Google Gemini Pro, given an image or a list of images, and a text prompt:


def generate(img, prompt):


    input = img + [prompt]

    responses = model.generate_content(    
        input,
        generation_config= config,
        stream=True,
    )
    full_response = ""

    for response in responses:
        full_response += response.text

    return full_response

As an example, we will convert the contents of this receipt into a CSV file. The receipt is in French and contains information such as the date of purchase, the number of tickets, and tax details. The receipt is not in tabular format, yet you will see that we can still convert its information into a CSV file.

receipt1.jpg

For demonstration purposes, I will use two copies of the same receipt to show you how you can extract information from multiple images.

The following script calls the get_jpg_file_paths() and read_image() functions we defined earlier to read all the images in my input directory and convert them into the Part objects that the Google Gemini Pro model expects.


directory_path = r'D:\\Receipts\\'
image_paths = get_jpg_file_paths(directory_path)
imgs_b64 = read_image(image_paths)

Next, we define our text prompt to extract information from the image receipt. Your prompt engineering skills will shine here. A good prompt can make the task of LLM much easier. We will use the following prompt to extract information.


prompt = """I have the above receipts. Return a response that contains information from the receipts in a comma-separated file format where row fields are table columns,
whereas row values are column values. The output should contain (header + number of recept rows).
The first row should contain all column headers, and the remaining rows should contain all column values from two recepts one in each row.  
Must use all field values in the receipt. """

Finally, we will pass the input images and the text prompt to the generate() function, which returns the model response.

full_response = generate(imgs_b64, prompt)

print(full_response)

Output:

**Numéro de session,Date,Heure,Pass Easy n°,Fin de validité,Type,Quantité,Prix Unitaire,TVA,Montant total HT,Montant total TTC**
1,16/01/2024,09:32:32,3307837143,30/09/2023,Carnet de Ticket t+,10,17,35 €,10,00 %,15,77 €,17,35 €
1,16/01/2024,09:32:32,3307837143,30/09/2023,Carnet de Ticket t+,10,17,35 €,10,00 %,15,77 €,17,35 €

The above output shows that the Google Gemini Pro has extracted the information we need in CSV string format.

The last step is to convert this string into a CSV file.

Converting Google Gemini Pro Response to a CSV File

To convert the response to a CSV file, we first need to split the response into lines using the string object's strip() and split() methods. This will create a list of strings, where each string is a line in the response.

Next, we will define the process_line() function that handles the unique patterns in the response, such as the currency symbols and the decimal separators.


lines = full_response.strip().split('\n')


def process_line(line):

    special_patterns = re.compile(r'\d+,\d+\s[%]')

    temp_replacement = "TEMP_CURRENCY"

    currency_matches = special_patterns.findall(line)

    for match in currency_matches:
        line = line.replace(match, temp_replacement, 1)

    parts = line.split(',')

    for i, part in enumerate(parts):
        if temp_replacement in part:
            parts[i] = currency_matches.pop(0)

    return parts

The rest of the process is straightforward.

We will open a CSV file for writing using the open function with the mode argument set to w, the newline argument set to '', and the encoding argument set to utf-8. This will create a file object that you can use to write the CSV data.

Next, we will create a csv.writer object that we can use to write rows to the CSV file.

We will then loop through all the items (CSV rows) in the lines list and write them to our CSV file.


csv_file_path = r'D:\\Receipts\\receipts.csv'  

# Open the CSV file for writing
with open(csv_file_path, mode='w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)

    # Process each line in the data list
    for line in lines:
        processed_line = process_line(line)
        writer.writerow(processed_line)

Once you execute the above script, you will see the following CSV file in your destination path, containing information from your input receipt image.

output_csv.png

Conclusion

Extracting information from PDFs and images is a crucial task for data analysts. In this tutorial, you saw how to use Google Gemini Pro, a state-of-the-art multimodal large language model, to extract information from a receipt image. You can use the same technique to extract any other type of information simply by changing the text query.

Feel free to leave your feedback and suggestions!

Comparing Google Gemini Pro with OpenAI GPT-4 for Zero-Shot Classification

In this article, we will compare two state-of-the-art large language models for zero-shot text classification: Google Gemini Pro and OpenAI GPT-4.

Zero-shot text classification is a task where a model trained on one set of labeled examples can classify new examples from previously unseen classes. This is useful when labeled data is scarce or the output classes are dynamic and unpredictable.

We will use the IMDB movie review dataset as an example and try to classify the reviews into positive or negative sentiments without using any labeled data. We will use the results to compare the speed, accuracy, and price of Google Gemini Pro and OpenAI GPT-4. By the end of this tutorial, you will know which model to select for your custom use cases.

Importing and Installing Required Libraries

The first step is to install the required libraries. I ran my code on Google Colab. Therefore, I only needed to install the Google Cloud and OpenAI APIs. The following script installs these libraries.

Note: It is important to mention that you must create an account with OpenAI and Google Cloud Vertex AI and get your API keys before running the scripts in this tutorial. OpenAI and Gemini Pro are paid LLMs, but you can get free credits for testing when you sign up.

pip install --upgrade google-cloud-aiplatform
pip install openai

The rest of the libraries come pre-installed with Google Colab.
The following script imports the libraries you will need to run the scripts in this tutorial.


import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from openai import OpenAI

import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part
Importing the Dataset

To compare Gemini Pro and GPT-4, we will perform zero-shot classification on IMDB movie reviews. You can download the CSV file from Kaggle.

The following script imports the CSV file into a Pandas DataFrame, shuffles the dataset randomly, and selects only the first 100 rows for simplicity and to keep API costs low.


dataset = pd.read_csv(r"D:\Datasets\IMDB Dataset.csv")
dataset = dataset.sample(frac=1).reset_index(drop=True)
dataset = dataset.head(100)
print(dataset['sentiment'].value_counts())
dataset.head()

image1.png

You can see we have 50 reviews with positive sentiments and 50 reviews with negative sentiments.

Zero Shot Text Classification with Google Gemini Pro

Let's first perform zero-shot classification with the Gemini Pro model. You first have to set an environment variable containing the path to the JSON file that holds your Vertex AI service account and API key information. The following script does that.


os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "PATH_TO_VERTEX_AI_SERVICE_ACCOUNT JSON FILE"

Next, we will use the generative mode of Gemini Pro, which produces natural language responses for a given prompt. We will define the find_sentiment_gemini() function, which takes a movie review as input and returns the sentiment value as output. We will also set configuration parameters for generation, such as the maximum output tokens and the temperature, and handle any exceptions that may occur during the process.


model = GenerativeModel("gemini-pro")
config = {
    "max_output_tokens": 100,
    "temperature": 0,
}

def find_sentiment_gemini(review):

    content = """What is the sentiment expressed in the following IMDB movie review?
    Select sentiment value from positive or negative. Return only the sentiment value in small letters.
    Movie review: {}""".format(review)

    responses = model.generate_content(
        content,
        generation_config= config,
    stream=True,
    )

    for response in responses:
        return response.text

Next, we will use the following script to loop through the reviews and append the sentiment values to a list.


%%time

all_sentiments = []

reviews_list = dataset["review"].tolist()

i = 0
exceptions = 0
while i < len(reviews_list):

    try:
        review = reviews_list[i]
        sentiment_value = find_sentiment_gemini(review)
        all_sentiments.append(sentiment_value)
        i = i + 1
        print(i, sentiment_value)

    except Exception as e:
        print("===================")
        print("Exception occurred:", e)
        exceptions = exceptions + 1

print("Total exception count:", exceptions)

Output:

Total exception count: 0
CPU times: total: 1.23 s
Wall time: 53.3 s

The above output shows that it took 53.3 seconds to process 100 reviews, and no exception occurred.

Finally, we can evaluate the model's performance using the following script.


accuracy = accuracy_score(all_sentiments, dataset["sentiment"])
print("Accuracy:", accuracy)

The Gemini Pro model returns an accuracy of 93%, which is impressive.

Zero Shot Text Classification with GPT-4

Let's perform zero-shot classification on the same dataset using the OpenAI GPT-4 model.

The following script sets the OpenAI API Key.


client = OpenAI(
    # This is the default and can be omitted
    api_key = os.environ.get('YOUR_OPENAI_KEY'),
)

Next, as we did in the case of Gemini Pro, we will define the find_sentiment_gpt() function that takes a movie review as input and returns the sentiment value as output. We will use the gpt-4 model from OpenAI to predict sentiments.


def find_sentiment_gpt(review):

    content = """What is the sentiment expressed in the following IMDB movie review?
    Select sentiment value from positive or negative. Return only the sentiment value in small letters.
    Movie review: {}""".format(review)

    sentiment = client.chat.completions.create(
      model= "gpt-4",
      temperature = 0,
      max_tokens = 100,
      messages=[
            {"role": "user", "content": content}
        ]
    )

    return sentiment.choices[0].message.content

Next, we will loop through all the movie reviews in the dataset, make a sentiment prediction on these reviews using the find_sentiment_gpt() function, and append the response to the all_sentiments list.


%%time

all_sentiments = []

reviews_list = dataset["review"].tolist()

i = 0
exceptions = 0
while i < len(reviews_list):

    try:
        review = reviews_list[i]
        sentiment_value = find_sentiment_gpt(review)
        all_sentiments.append(sentiment_value)
        i = i + 1
        print(i, sentiment_value)

    except Exception as e:
        print("===================")
        print("Exception occurred:", e)
        exceptions = exceptions + 1

print("Total exception count:", exceptions)
Total exception count: 0
CPU times: total: 297 ms
Wall time: 2min 38s

The above output shows that it took 2min 38s to process the 100 reviews, three times slower than Gemini Pro.

Finally, the following script prints the accuracy of GPT-4 for zero-shot classification on our dataset.

accuracy = accuracy_score(all_sentiments, dataset["sentiment"])
print("Accuracy:", accuracy)
Accuracy: 0.95

The above output shows that GPT-4 achieves an accuracy of 95%, two percentage points higher than Gemini Pro.

Final Verdict

The following table summarizes the results. Though GPT-4 achieved slightly better accuracy, it is slower and almost 30 times more expensive than Gemini Pro. However, I hope that with some prompt engineering, you might be able to achieve better accuracy with Gemini Pro as well.


| Model      | Speed    | Accuracy | Price                              |
|------------|----------|----------|------------------------------------|
| Gemini Pro | 53.3s    | 93%      | $0.00025 per 1k characters         |
| GPT-4      | 2min 38s | 95%      | approx. $0.00750 per 1k characters |

To conclude, I suggest you try Gemini Pro with better prompts as it is faster and cheaper than GPT-4.

I would love to read your feedback in the comment section.

TensorFlow Keras Sequence Data Generator for Multimodal Classification

I recently tackled a challenging research task involving multimodal data for a classification problem using TensorFlow Keras. One of the trickiest aspects was figuring out how to load multimodal data in batches from storage efficiently.

While TensorFlow Keras offers helpful functions for batch-loading images from various sources, the documentation and online resources don't explicitly cover how to load images in combination with other data types like CSV files.

However, with some experimentation, I discovered a solution to this problem. In this article, I'll demonstrate how to create custom data loaders capable of batch-loading data from multiple sources, such as image directories and CSV files.

We will solve a multimodal classification problem with images and corresponding texts as inputs. We will train a Keras model that classifies this multimodal input into one of the three predefined categories. This is called multi-class classification.

So, let's begin without further ado.

Importing Required Libraries

We will extract text and image features using Transformer models from Hugging Face. The following script installs the Hugging Face Transformers library.


! pip install accelerate -U
! pip install datasets transformers[sentencepiece]

The script below imports the libraries required to execute scripts in this article. I did not have to install these libraries since I used a Google Colab notebook.

import pandas as pd
import os
import numpy as np

import tensorflow as tf

from transformers import AutoTokenizer, TFBertModel
from transformers import AutoImageProcessor, TFViTModel


from keras.utils import Sequence
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Dense, Dropout, Concatenate
from keras.callbacks import ModelCheckpoint
from keras.models import load_model, Model
from keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

from PIL import Image, ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True
Importing and Preprocessing the Dataset

We will train our multimodal classifier using the Meme Images Dataset from Kaggle. The dataset consists of meme images and meme text. The dataset is annotated with labels very_positive, positive, neutral, negative, and very_negative.

You will see the following directory structure once you download the dataset.

image1.png

The labels.csv file contains meme image names, texts, and corresponding labels. The following script imports the labels.csv file as a Pandas dataframe and displays its header.


labels_df = pd.read_csv("/content/multimodal-memes/labels.csv")
labels_df.head()

Output:

image2.png

The next step is to preprocess our dataset.

First, we will concatenate the directory path containing images with the image names in the image_name column. The concatenated image path is stored in a new column named image_path.

Next, we will remove all the records where the text_corrected columns contain a null value or an empty string.

Finally, we will remove all the dataframe columns except text_corrected, image_path, and overall_sentiment.

The text_corrected column contains meme text. The image_path column contains the absolute path to the corresponding meme image. And the overall_sentiment column contains output labels.

The following script performs the above preprocessing steps.


image_folder_path = '/content/multimodal-memes/images/images'
labels_df['image_path'] = labels_df['image_name'].apply(lambda x: os.path.join(image_folder_path, x))
labels_df = labels_df[labels_df['text_corrected'].notna() & (labels_df['text_corrected'] != '')]
labels_df = labels_df.filter(["text_corrected", "image_path", "overall_sentiment"])

We have five output labels. For the sake of simplicity, we will merge the very_positive and very_negative labels with the positive and negative labels, respectively. This reduces the number of output labels to three.

labels_df['overall_sentiment'] = labels_df['overall_sentiment'].replace({'very_positive': 'positive', 'very_negative': 'negative'})
labels_df = labels_df.sample(frac=1).reset_index(drop=True)

Next, we will divide our dataset into features and label sets and convert output labels to one-hot encoded vectors.


X = labels_df.drop('overall_sentiment', axis=1)
y = labels_df["overall_sentiment"]

# convert labels to one-hot encoded vectors
y = pd.get_dummies(y)

Finally, we will divide our dataset into train, test, and validation sets with ratios of 80, 10, and 10, respectively.


X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

Defining Transformer Models for Text and Image Data

Deep learning algorithms work with numeric data. Therefore, we must convert our text and images into corresponding numeric representations. One way to achieve this is to use Transformer models, which have demonstrated state-of-the-art performance for many natural language and image processing tasks.

For text feature extraction in this article, we will use the BERT transformer, and for image representation, the Vision Transformer (ViT). You can download both models, along with their corresponding text and image processors, from the Hugging Face library.

The following script imports the BERT and Vision Transformer models and their processors.

To speed up the training process, we will also only train the last four layers of both transformer models.


## importing text model and tokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_model = TFBertModel.from_pretrained("bert-base-uncased")

for layer in bert_model.layers[:-4]:
    layer.trainable = False

## importing image model and tokenizer

image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit_model = TFViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

for layer in vit_model.layers[:-4]:
    layer.trainable = False

The next section will show you how to create your custom Keras sequence data generator for batch data loading.

Creating Keras Sequence Data Generator for Batch Processing

Creating a custom sequence generator capable of handling multimodal data is the trickiest part of this problem. PyTorch handles it relatively easily with its Dataset class; however, the TensorFlow Keras documentation on this topic is not very descriptive.

To define a custom batch data loader in Keras, you must subclass the keras.utils.Sequence class. In the class's __getitem__ method, you need to define your custom data loading logic.

For example, in the following script, we define the MultiModalDataGenerator class that inherits from the Sequence class. We pass the features and labels dataframe, the text tokenizer, the image processor, the batch size for batch loading, and the sequence length for text vectors to the MultiModalDataGenerator class constructor.

We compute the batch indices inside the __getitem__ method. Next, we loop over the batch indices, and for each index we load the text from the text_corrected column and the image from the image_path column of the features dataframe, along with the corresponding label. We append these values to lists and then compute the feature vectors for the texts and images in the batch. The method returns these features together with the corresponding labels. And that's it: you have defined your custom sequence data generator in Keras.


class MultiModalDataGenerator(Sequence):

    def __init__(self, df, labels, tokenizer, image_processor, batch_size=32, max_length=128):
        self.df = df
        self.labels_df = labels
        self.tokenizer = tokenizer
        self.image_processor = image_processor
        self.batch_size = batch_size
        self.max_length = max_length

    def __len__(self):
        # Number of batches per epoch
        return int(np.ceil(len(self.df) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Batch indices
        batch_indices = self.df.index[idx * self.batch_size:(idx + 1) * self.batch_size]

        # Initialize lists to store data
        batch_texts = []
        batch_images = []
        batch_labels = []

        # Loop over each index in the batch
        for i in batch_indices:
            # Append the text from the text_corrected column
            batch_texts.append(self.df.at[i, 'text_corrected'])

            # Load the image from image_path and convert it to RGB
            batch_images.append(Image.open(self.df.at[i, 'image_path']).convert("RGB"))

            # Fetch labels
            label_values = self.labels_df.loc[i].values
            batch_labels.append(label_values)

        # Tokenize text data in the batch
        tokenized_data = self.tokenizer(batch_texts, padding='max_length', truncation=True, max_length=self.max_length, return_tensors="tf")

        # Process images

        processed_images = [self.image_processor(images=image, return_tensors="tf") for image in batch_images]
        image_tensors = tf.concat([img['pixel_values'] for img in processed_images], axis=0)


        # Convert labels to numpy array
        batch_labels = np.array(batch_labels, dtype='float32')

        final_features = {'input_ids': tokenized_data['input_ids'],
                          'attention_mask': tokenized_data['attention_mask'],
                          'image_input': image_tensors}
        return final_features, batch_labels

Next, you will create train, test, and validation data generators using the MultiModalDataGenerator class.


max_text_length = 128
batch_size = 8

train_generator = MultiModalDataGenerator(X_train,
                                y_train,
                                bert_tokenizer,
                                image_processor,
                                batch_size,
                                max_text_length)

test_generator = MultiModalDataGenerator(X_test,
                                y_test,
                                bert_tokenizer,
                                image_processor,
                                batch_size,
                                max_text_length)

val_generator = MultiModalDataGenerator(X_val,
                              y_val,
                              bert_tokenizer,
                              image_processor,
                              batch_size,
                              max_text_length)

The rest of the process is similar to training any Keras model, as seen in the next section.

Creating & Training the TensorFlow Keras Multimodal Classifier

Our model takes two kinds of inputs: text features and image features. For the text features, we will pass the input_ids and attention_mask produced by the bert_tokenizer; for the images, there will be a single image_input. Note that these inputs are determined by the feature extraction technique (the tokenizer and image processor), not by the data modality or the model structure itself.

Next, we pass the input_ids, attention_mask, and image_input to the corresponding transformer models and extract model representations. These model representations are concatenated and passed to further fine-tuning layers. Our model only has a single fine-tuning layer of 512 nodes. Finally, we pass the output of the fine-tuning layer to the output layer, which, in our case, consists of 3 nodes corresponding to three output classes.


# Define input layers for text
input_ids = Input(shape=(None,), dtype=tf.int32, name="input_ids")
attention_mask = Input(shape=(None,), dtype=tf.int32, name="attention_mask")

# Define input layer for images
image_input = Input(shape=(3, 224, 224), dtype=tf.float32, name="image_input")

# Get the output of BERT model
bert_outputs = bert_model(input_ids, attention_mask=attention_mask)
pooled_output = bert_outputs.pooler_output

# Get the output of ViT model
vit_outputs = vit_model(image_input)
vit_pooled_output = vit_outputs.pooler_output

# Concatenate the outputs from BERT and ViT
concatenated_outputs = Concatenate()([pooled_output, vit_pooled_output])


# Add additional layers for fine-tuning
x = Dense(512, activation='relu')(concatenated_outputs)
x = Dropout(0.1)(x)
final_output = tf.keras.layers.Dense(3, activation='softmax')(x)  

# Create the model
model = Model(inputs=[input_ids, attention_mask, image_input], outputs=final_output)

adam_optimizer = Adam(learning_rate=2e-5)

# Compile the model
model.compile(optimizer = adam_optimizer,
              loss='categorical_crossentropy',
              metrics=['accuracy'])


The following script trains the model. We save the model that achieves the highest accuracy on the validation set across all epochs.


# Define the checkpoint callback
checkpoint = ModelCheckpoint(
    'best_model.h5',  
    monitor='val_accuracy',
    verbose=1,  
    save_best_only=True,
    mode='max',  
    save_weights_only=False  
)

# Train the model
history = model.fit(
    train_generator,
    validation_data=val_generator,
    epochs=5,  
    callbacks=[checkpoint],
    verbose=1
)

Output:

image4.png

The above output shows that the model is overfitting. You can add more dropout layers and freeze additional transformer layers to see whether that reduces the overfitting; one possible starting point is sketched below.
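
The following snippet is only an illustrative sketch of those two tweaks, not a tuned configuration: it freezes all but the last two layers of each transformer and adds a second, slightly stronger dropout layer to the classification head before rebuilding and recompiling the model.


# Freeze more transformer layers (all but the last two of each model)
for layer in bert_model.layers[:-2]:
    layer.trainable = False

for layer in vit_model.layers[:-2]:
    layer.trainable = False

# Rebuild the classification head with extra dropout for stronger regularization
x = Dense(512, activation='relu')(concatenated_outputs)
x = Dropout(0.3)(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.3)(x)
final_output = Dense(3, activation='softmax')(x)

model = Model(inputs=[input_ids, attention_mask, image_input], outputs=final_output)

model.compile(optimizer=Adam(learning_rate=2e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])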

Making Predictions and Evaluating Model Performance

Finally, the script below loads the best model from training and makes predictions on the test set.


# Load the model, including the custom TFBertModel and TFViTModel layers
custom_objects = {"TFBertModel": TFBertModel, "TFViTModel": TFViTModel}
best_model = load_model('best_model.h5', custom_objects=custom_objects)


predictions = best_model.predict(test_generator)

# convert predictions to one-hot binary values
predictions = (predictions == predictions.max(axis=1)[:, None]).astype(int)

# printing results
print(classification_report(y_test, predictions))
print(f"Accuracy score: {accuracy_score(y_test, predictions)}")

Output:

image5.png

The accuracy is not very high, and the model is clearly biased towards the majority class. You can add more fine-tuning layers to see if you can improve model performance. Nevertheless, the idea here is to demonstrate how to define a custom batch data loader for handling multimodal data. You should now be able to define your custom data generators for batch-loading multimodal data.

Feel free to share your feedback or any questions that you may have.

Multivariate Stock Price Prediction with Transformer Encoder in TensorFlow

In a previous tutorial, I covered how to predict future stock prices using a deep learning model with 1D CNN layers. This method is effective for basic time series forecasting.

Recently, I've enhanced this model to consider not just past closing prices but also factors like Open, High, Low, Volume, and Adjusted Close. Furthermore, instead of using 1D CNN layers, I used a transformer encoder to capture contextual information between the various stock prices in a time series. This improved the model significantly, cutting the error between the actual and predicted stock prices by more than 50%.

In this tutorial, I will show you how to create a multivariate stock price prediction model using a transformer encoder in TensorFlow Keras. By the end of this article, you'll learn to shape your data for multivariate time series analysis and use a transformer encoder to make a stock price prediction model.

Importing Required Libraries and Datasets

You need to install the KerasNLP library to access the TransformerEncoder layer used in this article.

!pip install keras-nlp

Since I used Google Colab to run scripts in this article, I did not have to install any other library. The following script imports the required libraries into our application.


import yfinance as yf
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import tensorflow as tf
from keras.models import Model
from keras.layers import Input, Conv1D, MaxPooling1D, Flatten, Dense, Dropout
from keras_nlp.layers import TransformerEncoder
from keras.callbacks import EarlyStopping
from sklearn.metrics import mean_squared_error, mean_absolute_error

Next, we will import the dataset. For the sake of comparison, I will import the same dataset that I used in my previous stock price prediction article. The following script imports the data.


# Define the ticker symbol for the stock
ticker_symbol = "GOOG"  # Example: Apple Inc.

# Define the start and end dates for the historical data
# Dataset in the previous article was downloaded on 02-Dec-2023

date_string = "02-Dec-2023"
end_date = datetime.strptime(date_string, "%d-%b-%Y")
start_date = end_date - timedelta(days=5 * 365)  # 5 years ago

# Retrieve historical data
data = yf.download(ticker_symbol, start=start_date, end=end_date)

# Display the historical data as a Pandas DataFrame
print(data.shape)
data.tail()

Output:

image_1.png

Data Preprocessing for Creating Multivariate Time Series

Before applying the transformer encoder to predict the stock prices, we need to preprocess our dataset and convert it into a shape suitable for training a deep learning model in TensorFlow Keras.

We will divide the dataset into training and test sets and then into corresponding features and label sets. Our feature set will be a multivariate time series.

A multivariate time series is a data point sequence consisting of multiple variables or features. For example, in our case, we have five features (Open, High, Low, Volume, and Adj Close) for a single data point in our sequence. Each feature represents a different aspect of the stock market behavior.

Dividing the Data into Training and Test Sets

The following script divides the data into training and test sets. The training set will be used to train the model, while the test set will be used to evaluate its performance. We will use the last 60 days of data as the test set and the rest as the training set.

We will further divide the training and test sets into corresponding features and labels sets. The feature set will consist of values from the Open, High, Low, Volume, and Adj Close columns of the dataset. The label set will consist of the values from the Close column.


import pandas as pd

# Get the number of records in the DataFrame
total_records = len(data)

# Number of records to keep in the training set
train_size = total_records - 60

# Create the training set
train_data = data.iloc[:train_size]

# Create the test set
test_data = data.iloc[train_size:]

# split each of the training and test sets into features and labels
# Features will include all columns except 'Close'
# Labels will include only the 'Close' column

train_features = train_data.drop('Close', axis=1)
train_labels = train_data['Close']

test_features = test_data.drop('Close', axis=1)
test_labels = test_data['Close']

# Print shapes to confirm
print("Train Features:", train_features.shape, "Train Labels:", train_labels.shape)
print("Test Features:", test_features.shape, "Test Labels:", test_labels.shape)

Output:


Train Features: (1198, 5) Train Labels: (1198,)
Test Features: (60, 5) Test Labels: (60,)
Data Scaling

The next step is to scale the data between 0 and 1. The transformer encoder model works better with normalized data, and the features in our dataset have different scales and units. For example, the Volume feature has much larger values than the Open feature, which can affect the model's performance.

We use the MinMaxScaler from the sklearn library to scale the data. We fit the scaler to the training features and then transform the training and test features with the same scaler. We also reconstruct the dataframes with the original columns and indices for convenience.

We also scale the labels with a separate scaler since we will need to inverse the scaling later to get the actual stock prices.


scaler = MinMaxScaler()

# Fit the scaler to the training features
scaler.fit(train_features)

# Transform the training and test feature sets
train_features_scaled = scaler.transform(train_features)
test_features_scaled = scaler.transform(test_features)

# Reconstruct dataframes with original columns and indices
train_scaled_df = pd.DataFrame(train_features_scaled, columns = train_features.columns, index = train_features.index)
test_scaled_df = pd.DataFrame(test_features_scaled, columns = test_features.columns, index = test_features.index)

scaler = MinMaxScaler()

# Fit the scaler to the training labels and transform them
train_labels = scaler.fit_transform(train_labels.values.reshape(-1, 1))

# Transform the test labels with the already fitted scaler
test_labels = scaler.transform(test_labels.values.reshape(-1, 1))

Creating Multivariate Training and Test Sequences

In this section, we will create multivariate training and test sequences that can be fed to the transformer encoder-based deep learning model. A multivariate sequence is a subset of consecutive data points that consists of multiple features or variables.

To create multivariate sequences, we will use a sliding window approach. We will use a fixed length of data as the input (X) and the next data point as the output (y). For example, if we use a sequence length of 60, we will use the past 60 days of data (Open, High, Low, Volume, and Adj Close) as the input and the next day's closing price as the output. This way, we can capture the temporal dependency of the data and train the model to predict the future price based on past prices.

To create the training set, we will define the create_train_sequence function that takes the normalized training features dataframe, the training labels, and the sequence length as inputs and returns two arrays: X and y.

X is an array of multivariate sequences, each with a length of 60 (the sequence length), whereas y is an array of the next day's closing prices for each sequence. The function will iterate through the dataframe and create sequences by slicing the data.

I used a sequence length of 60 for a fair comparison with the 1D-CNN approach I explained in a previous article.


sequence_length = 60

def create_train_sequence(train_df, train_labels, sequence_length):
    seq_length = sequence_length  # Length of the sequence
    features = []  # List to hold feature sequences
    labels = []    # List to hold labels

    # Iterate through the DataFrame to create sequences
    for i in range(seq_length, len(train_df)):
        sequence = train_df.iloc[i-seq_length:i]  # Get 60 days sequence
        label = train_labels[i]                      # Get the label for the corresponding day

        features.append(sequence.values)  # Append sequence to features
        labels.append(label)              # Append label

    # Convert lists to numpy arrays
    X = np.array(features)
    y = np.array(labels).reshape(-1, 1)

    return X, y

The following script creates the final training features and corresponding labels. You can see the features and labels' shape in the output.


X_train, y_train = create_train_sequence(train_scaled_df,
                                         train_labels,
                                         sequence_length)

print(X_train.shape)
print(y_train.shape)

Output:


(1138, 60, 5)
(1138, 1)

Next, we define the create_test_sequence function that prepares features and labels for the test set. The function takes the normalized training and test dataframes, test labels, and sequence length as parameters.

The create_test_sequence also returns two numpy arrays: X_test and y_test, containing the test features and labels, respectively.

The create_test_sequence function uses a different approach from the one used for the training sequences, because we want to use the most recent data from the training set together with all of the test data to make predictions. The function concatenates the last part of the training data with the test data and then slices the combined array to get the sequences.

For example, the first test sequence will use the last 60 days of the training data as the input and the first day of the test data as the output. The second sequence will use the last 59 days of the training data plus the first day of the test data as the input, and the second day of the test data as the output, and so on.

Here is the code for the function create_test_sequence function.


def create_test_sequence(train_df, test_df, test_labels, seq_length):
    # Concatenate the last part of train features with test features
    combined_features_df = pd.concat([train_df.iloc[-seq_length:], test_df])

    test_features = []  # List to hold test feature sequences
    test_labels_list = []  # List to hold test labels

    # Create sequences for the test set
    for i in range(seq_length, len(combined_features_df)):
        sequence = combined_features_df.iloc[i-seq_length:i]  # Get 60 days sequence
        # Use test_labels for label (assuming test_labels is aligned with test_features_df)
        label = test_labels[i - seq_length] if i < len(test_df) + seq_length else None

        test_features.append(sequence.values)  # Append sequence to test features
        test_labels_list.append(label)         # Append label

    # Remove the None values at the end (if any)
    test_features = [feature for feature, label in zip(test_features, test_labels_list) if label is not None]
    test_labels_list = [label for label in test_labels_list if label is not None]

    # Convert lists to numpy arrays
    X = np.array(test_features)
    y = np.array(test_labels_list).reshape(-1, 1)

    return X, y

X_test, y_test = create_test_sequence(train_scaled_df,
                                      test_scaled_df,
                                      test_labels,
                                      sequence_length)
print(X_test.shape)
print(y_test.shape)

Output:


(60, 60, 5)
(60, 1)
Training a Stock Price Prediction Model with Transformer Encoder

We are now ready to define our deep learning model containing transformer encoder layers. The model architecture is similar to the one I created in the 1D CNN article; the only difference is that the CNN layers are replaced with TransformerEncoder layers, as shown in the following script.



# input shape: (60 time steps, 5 features)
input_layer = Input(shape=(60, 5))

# Transformer Encoder layers
transformer_1 = TransformerEncoder(num_heads=2, intermediate_dim=64)(input_layer)
dropout_1 = Dropout(0.2)(transformer_1)

transformer_2 = TransformerEncoder(num_heads=2, intermediate_dim=64)(dropout_1)
dropout_2 = Dropout(0.2)(transformer_2)

transformer_3 = TransformerEncoder( num_heads=2, intermediate_dim=64)(dropout_2)
dropout_3 = Dropout(0.2)(transformer_3)

# Flatten layer
flatten = Flatten()(dropout_3)

# Dense layers
dense_1 = Dense(200, activation='relu')(flatten)

# Dense layer
dense_2 = Dense(100, activation='relu')(dense_1)

# Output layer with a single neuron
output_layer = Dense(1, activation='linear')(dense_2)

# Create the model
model = Model(inputs=input_layer, outputs=output_layer)

# Compile the model
model.compile(loss='mean_squared_error', optimizer='adam')

# Display the model summary
model.summary()

Output:

image_2.png

The following script will train the model. Again, this code is similar to the one used for the 1D CNN article.


early_stopping = EarlyStopping(monitor='val_loss', patience=100, restore_best_weights=True)

# Train the model with early stopping
history = model.fit(
    X_train, y_train,
    epochs=500,
    validation_split=0.2,  # 20% of the data for validation
    callbacks=[early_stopping],
    verbose=1
)
Evaluating Model Performance

The following script evaluates the model performance on the test set.


# Make predictions on the test set
predictions = model.predict(X_test)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, predictions)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, predictions)

# Print the values
print("Mean Squared Error (MSE):", mse)
print("Mean Absolute Error (MAE):", mae)

Output:


Mean Squared Error (MSE): 0.0025678165444383786
Mean Absolute Error (MAE): 0.0387130112330789

From the above output, we get a mean squared error value of 0.0025, almost 70% less than 0.00839, obtained via the 1D CNN in a previous article.

Similarly, the mean absolute error value of 0.038 is 57% less than 0.0893, obtained via the 1D-CNN.

The following script plots the actual and predicted stock values.

# converting predictions and targets back to actual stock prices

y_test = scaler.inverse_transform(y_test)
y_pred = scaler.inverse_transform(predictions)

plt.figure(figsize=(10,6))
plt.plot(y_test, color='green', label='True Stock Price')
plt.plot(y_pred, color='blue', label='Predicted Stock Price')
plt.title('Stock Price Prediction')
plt.xlabel('Dates')
plt.ylabel('Stock Price')
plt.legend()
plt.show()

Output:

image_3.png

Conclusion

In this article, I explained how to create a stock price prediction model in TensorFlow Keras using stacked Transformer encoder layers. The results show that transformer encoder layers significantly outperform the 1D CNN model for stock market price prediction.
I hope you liked the article; feel free to leave feedback or suggestions.