How to Avoid Keyword Stuffing & Fix Over Optimization in SEO

Are you worried that you’ve stuffed too many keywords into your content?

When it comes to optimizing your site for search engines, many new users have a tendency to stuff their content with keywords. However, this is not a good practice and could lead to over-optimization, which can then lead to being penalized by search engines like Google.

In this article, we will show you how to avoid keyword stuffing and fix over-optimization in SEO.

What is Keyword Stuffing?

Keyword stuffing is filling a web page with keywords to manipulate search engines in the hopes of getting higher rankings.

In the early days of search engine optimization (SEO), it was easy to exploit search engines and use keyword stuffing to boost ranking. However, search engines like Google have become a lot smarter and can penalize sites that use this as an exploit.

There are different ways you can do keyword stuffing in your content. For instance, repeating words and phrases unnecessarily, listing or grouping text together unnaturally, or inserting blocks of keywords that appear out of context.

Here’s an example of how using the same keyphrase repeatedly in a single paragraph can lead to keyword stuffing.

Keyword stuffing example

Another way site owners can stuff search terms is by adding hidden text to the source code of the page. Users won’t be able to see this, but search engine crawlers will. Google does not like this practice.

That said, let’s look at how keyword stuffing can impact your site’s SEO.

Why is Keyword Stuffing Bad for SEO?

If you’re starting out with WordPress SEO, then it can be easy to get carried away and add the same keyword lots of times in the content. However, you should know that it goes against the web search policies of Google.

This could lead to a penalty from Google, where your site can be demoted in rankings. In worst cases, Google can also remove your page from its search engine results.

Besides that, keyword stuffing also leads to a poor user experience because the content can become hard to read. People might not find your content useful and exit the website. As a result, your site might look spammy and you won’t be able to build a healthy relationship with your audience.

Having said that, let’s look at different ways you can fix over-optimization and avoid keyword stuffing.

1. Measure Your Content’s Keyword Density

The easiest way of avoiding keyword stuffing is by measuring the keyword density of your content. Keyword density measures how often a search term appears in your content relative to the total word count.

You can use WPBeginner Keyword Density Checker to get started. It is a free tool that doesn’t require signup, registration, or installation.

Simply enter the URL or text of your content into the tool and click the ‘Check’ button.

WPBeginner keyword density checker tool

Next, the tool will analyze your content and show you the results.

You can then see how many times a keyword is being used on the web page. For instance, in the screenshot below, you can see the word ‘parrotfish’ occurs 28 times or has a 13.66% density.

The Free WPBeginner Keyword Density Checker Tool

After finding the density of the search term, you can then edit your content and remove words and phrases that are repeated multiple times.

SEO best practices suggest keeping keyword density at around 2%. You can use this as a guideline to ensure your content isn’t over-optimized.
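
If you want to sanity-check this yourself, the math is simple: divide the number of times the keyword appears by the total word count and multiply by 100. In the parrotfish example above, 28 occurrences at a 13.66% density works out to roughly 205 words in total. Here’s a minimal Python sketch of that calculation (the helper function and sample text are just for illustration, not the WPBeginner tool’s actual code):

#python
def keyword_density(text, keyword):
  """Return how many times a keyword appears and its density as a percentage."""
  words = [w.strip('.,!?;:()"\'').lower() for w in text.split()]
  count = words.count(keyword.lower())
  density = (count / len(words)) * 100 if words else 0
  return count, density

content = "Parrotfish live on coral reefs. Parrotfish graze on algae all day..."  # your post text
count, density = keyword_density(content, "parrotfish")
print(f"'parrotfish' appears {count} times ({density:.2f}% density)")

# A density well above the ~2% guideline is a signal to edit the copy and swap in synonyms.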

2. Assign a Primary Keyword to Each Piece of Content

Another way you can fix over-optimization for SEO is by assigning a primary keyword or phrase to each blog post and page.

You should conduct keyword research and pick a search term that best represents the main topic of your content. This way, your content will focus on a specific issue and you’ll be better able to fulfill the search intent.

If you try to optimize a web page for multiple keywords with different intents, then you’ll leave your site in a big mess. It will make it harder for search engines to understand your content and who it is for, which can prevent your page from ranking for the right keyword.

There are different keyword research tools you can use to find the primary search term for your content. We recommend using Semrush, as it is a complete SEO tool that offers powerful features.

The Semrush keyword overview tools

You get a detailed overview of the keyword along with other valuable information. For instance, Semrush shows search volume, intent, keyword difficulty, and more for the search term.

Once you’ve found a primary keyword, you can use the All in One SEO (AIOSEO) plugin to optimize your content for the search term. AIOSEO is the best SEO plugin for WordPress that lets you add focus keyphrases to each post and page.

Adding focus keyphrase for your blog post

The plugin analyzes your content for the keyphrase, shows a score, and provides tips to improve keyword optimization. AIOSEO also integrates with Semrush to help you find more related keywords.

To learn more, please see our guide on how to properly use focus keyphrases in WordPress.

3. Use Synonyms and Related Keywords

You can avoid keyword stuffing by using LSI (latent semantic indexing) or related keywords for your content.

These are search terms that are closely related to the primary keyword. Related keywords also help search engines better understand your content.

Using different variations of keywords, synonyms, or long tail phrases can also help avoid keyword stuffing. It gives you more flexibility in incorporating different topics into your article.

You can find related keywords using WPBeginner’s Keyword Generator tool. Simply enter your main search term or topic in the search bar and click the ‘Analyze’ button.

Keyword generator tool

The tool is 100% free to use and generates over 300 keyword ideas.

You can then use different variations in your article to avoid keyword stuffing.

keyword analysis report

Besides that, you can also search the primary keyword on Google and then scroll down to see related searches.

This will give you even more keyword variations to use in your content and fix over-optimization issues.

Related searches

4. Add Value by Extending the Word Count

Next, you can create long-form content to cover the topic in detail and help achieve higher rankings.

Extending the word count gives you the opportunity to cover multiple sub-topics, answer different questions users might have, and easily use keyword variations to avoid stuffing.

This also helps you use different search terms naturally instead of forcing them in every sentence. Plus, it offers a better reading experience for users.

While extending the word count will help avoid keyword stuffing, you should also focus on content quality. Google and other search engines emphasize creating content that’s valuable. So, we recommend writing for your users instead of focusing on keyword placement.

One way of extending the word count and diversifying the use of keywords is by adding a FAQ section at the bottom of the post.

Include a FAQ section

5. Include Keywords in On-Page SEO Optimization

You can also avoid keyword stuffing and fix over-optimization by placing the target search term in different places during the on-page SEO process.

On-page SEO is the practice of optimizing a web page for search engines and users. It refers to anything you do on the page itself to boost its rankings in search engine results pages (SERPs).

By spreading the placement of keywords across different page elements, you can easily fix keyword stuffing issues. For instance, there are different page elements where you can add the main keyword. These include the title, meta description, subheadings, permalink, and more.
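
As a rough illustration of that idea, here’s a small Python sketch that uses the requests and BeautifulSoup libraries to check whether a focus keyword already appears in a page’s title, meta description, and subheadings (the URL and keyword below are made up):

#python
import requests
from bs4 import BeautifulSoup

def check_on_page_keyword(url, keyword):
  """Report which on-page elements already contain the focus keyword."""
  html = requests.get(url, timeout=10).text
  soup = BeautifulSoup(html, "html.parser")
  keyword = keyword.lower()

  title = (soup.title.string or "") if soup.title else ""
  meta = soup.find("meta", attrs={"name": "description"})
  description = meta.get("content", "") if meta else ""
  headings = " ".join(h.get_text() for h in soup.find_all(["h2", "h3"]))

  return {
    "title": keyword in title.lower(),
    "meta description": keyword in description.lower(),
    "subheadings": keyword in headings.lower(),
  }

# Hypothetical example URL and keyword
print(check_on_page_keyword("https://example.com/parrotfish-guide/", "parrotfish"))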

With AIOSEO, it is very easy to perform on-page SEO and ensure your content is properly optimized. You can add meta descriptions, focus keyphrases, build internal links, and get suggestions for improvement.

Post title and meta description example

Similarly, adding keywords to image alt text lets you rank for image search and allows you to diversify the use of primary search terms across the content.

It can help show screenshots from your blog post as featured snippets, helping you get more traffic.

Adding alt text, a description, caption, and more to images in WordPress

You can learn more by following our tips to optimize your blog posts for SEO.

We hope this article helped you learn how to avoid keyword stuffing and fix over-optimization in SEO. You may also want to see our guide on a 13-point WordPress SEO checklist for beginners and must-have WordPress plugins for business sites.

If you liked this article, then please subscribe to our YouTube Channel for WordPress video tutorials. You can also find us on Twitter and Facebook.

The post How to Avoid Keyword Stuffing & Fix Over Optimization in SEO first appeared on WPBeginner.

Exploring The Features And Flexibility Of Astro

Over the past few years, many new frontend frameworks have been released, offering developers a wide range of options to choose the one that best fits their projects. In this article, we will analyze Astro, an open-source project released with an MIT license. The first version, v1.0, was released in August 2022 as a web framework tailored for high-speed and content-focused websites.

One year later, in August 2023, they released Astro 3.0 with a lot of new features like view transitions, faster-rendering performance, SSR enhancements for serverless, and optimized build output, which we will cover later in the article. On October 12, 2023, they announced Astro 3.3 with exciting updates, such as the <Picture/> component for image handling.

How Can Data Professionals Increase Conversion Rates in 2024?

We all have mastered the science of maximizing outputs from the given data in the last decade. However, converting that data into meaningful insights is the real challenge and opportunity! Over the years, a swathe of third-party products has claimed higher ROI, either by optimizing ad spending, improving data analysis strategies, or overhauling the backend. And yet, the website conversion rates across all sectors haven’t crossed 2.5% in 2023.

If the average user appetite to purchase has increased and the internet bandwidths have improved, why have the conversion rate numbers not improved? This post discusses often-overlooked strategies to improve website conversion rates and how data professionals can help.

Using Unblocked to Fix a Service That Nobody Owns

Working in technology for over three decades, I have found myself in a position of getting up to speed on something new at least 100 times. For about half of my career, I have worked in a consulting role, which is where I faced the challenge of understanding a new project the most.

Over time, I established a personal goal to be productive on a new project in half the time it took for the average team member. I often called this the time to first commit or TTFC. The problem with my approach to setting a TTFC record was the unexpected levels of stress that I endured in those time periods. Family members and friends always knew when I was in the early stages of a brand-new project.

Getting Started With AI Functions

This past week we went "all-in" on AI functions. An AI function is the ability to create AI assistant logic, allowing the chatbot to "do things," instead of just passively generating text.

To understand the power of such functions you can read some of our previous articles about the subject.

Useful CSS Tips And Techniques

If you’ve been in the web development game for longer, you might recall the days when CSS was utterly confusing and you had to come up with hacks and workarounds to make things work. Luckily, these days are over and new features such as container queries, cascade layers, CSS nesting, the :has selector, grid and subgrid, and even new color spaces make CSS more powerful than ever before.

And the innovation doesn’t stop here. We also might have style queries and perhaps even state queries, along with balanced text-wrapping and CSS anchor positioning coming our way.

With all these lovely new CSS features on the horizon, in this post, we dive into the world of CSS with a few helpful techniques, a deep-dive into specificity, hanging punctuation, and self-modifying CSS variables. We hope they’ll come in handy in your work.

Cascade And Specificity Primer

Many fear the cascade and specificity in CSS. However, the concept isn’t as hard to get to grips with as one might think. To help you get more comfortable with two of the most fundamental parts of CSS, Andy Bell wrote a wonderful primer on the cascade and specificity.

The guide explains how certain CSS property types will be prioritized over others and dives deeper into specificity scoring to help you assess how likely it is that the CSS of a specific rule will apply. Andy uses practical examples to illustrate the concepts and simplifies the underlying mental model to make it easy to adopt and utilize. A power boost for your CSS skills.

Testing HTML With Modern CSS

Have you ever considered testing HTML with CSS instead of JavaScript? CSS selectors today are so powerful that it is actually possible to test for most kinds of HTML patterns using CSS alone. A proponent of the practice, Heydon Pickering summarized everything you need to know about testing HTML with CSS, whether you want to test accessibility, uncover HTML bloat, or check the general usability.

As Heydon points out, testing with CSS has quite some benefits. Particularly if you work in the browser and prefer exploring visual regressions and inspector information over command line logs, testing with CSS could be for you. It also shines in situations where you don’t have direct access to a client’s stack: Just provide a test stylesheet, and clients can locate instances of bad patterns you have identified for them without having to onboard you to help them do so. Clever!

Self-Modifying CSS Variables

The CSS spec for custom properties does not allow a custom property to reference itself — although there are quite some use cases where such a feature would be useful. To close the gap, Lea Verou proposed an inherit() function in 2018, which the CSSWG added to the specs in 2021. It hasn’t been edited-in yet, but Roman Komarov found a workaround that makes it possible to start emulating its behavior.

Roman’s approach uses container-style queries as a way to access the previous state of a custom property. It can be useful when you want to cycle through various hues without having a static list of values, to match the border-radius visually, or to nest menu lists, for example. The workaround is still strictly experimental (so do not use it in production!), but since it is likely that style queries will gain broad browser support before inherit(), it has great potential.

Hanging Punctuation In CSS

hanging-punctuation is a neat little CSS property. It extends punctuation marks such as opening quotes to cater to nice, clean blocks of text. And while it’s currently only supported in Safari, it doesn’t hurt to include it in your code, as the property is a perfect example of progressive enhancement: It leaves things as they are in browsers that don’t support it and adds the extra bit of polish in browsers that do.

Jeremy Keith noticed an unintended side-effect of hanging-punctuation, though. When you apply it globally, it’s also applied to form fields. So, if the text in a form field starts with a quotation mark or some other piece of punctuation, it’s pushed outside the field and hidden. Jeremy shares a fix for it: Add input, textarea { hanging-punctuation: none; } to prevent your quotation marks from disappearing. A small tip that can save you a lot of headaches.

Fixing aspect-ratio Issues

The aspect-ratio property shines in fluid environments. It can handle anything from inserting a square-shaped <div> to matching the 16:9 size of a <video>, without you thinking in exact dimensions. And most of the time, it does so flawlessly. However, there are some things that can break aspect-ratio. Chris Coyier takes a closer look at three reasons why your aspect-ratio might not work as expected.

As Chris explains, one potential breakage is setting both dimensions — which might seem obvious, but it can be confusing if one of the dimensions is set from somewhere you didn’t expect. Stretching and content that forces height can also lead to unexpected results. A great overview of what to look out for when aspect-ratio breaks.

Masonry Layout With CSS

CSS Grid has taken layouts on the web to the next level. However, as powerful as CSS is today, not every layout that can be imagined is feasible. Masonry layout is one of those things that can’t be accomplished with CSS alone. To change that, the CSS Working Group is asking for your help.

There are currently two approaches in discussion at the CSS Working Group about how CSS should handle masonry-style layouts — and they are asking for insights from real-world developers and designers to find the best solution.

The first approach would expand CSS Grid to include masonry, and the second approach would be to introduce a masonry layout as a display: masonry display type. Jen Simmons summarized what you need to know about the ongoing debate and how you can contribute your thoughts on which direction CSS should take.

Before you come to a conclusion, also be sure to read Rachel Andrew’s post on the topic. She explains why the Chrome team has concerns about implementing a masonry layout as a part of the CSS Grid specification and clarifies what the alternate proposal enables.

Boost Your CSS Skills

If you’d like to dive deeper into CSS, we’ve got your back, with a few friendly events and SmashingConfs coming up this year.

We’d be absolutely delighted to welcome you to one of our special Smashing experiences — be it online or in person!

Smashing Weekly Newsletter

With our weekly newsletter, we aim to bring you useful, practical tidbits and share some of the helpful things that folks are working on in the web industry. There are so many talented folks out there working on brilliant projects, and we’d appreciate it if you could help spread the word and give them the credit they deserve!

Also, by subscribing, there are no third-party mailings or hidden advertising, and your support really helps us pay the bills. ❤️

Interested in sponsoring? Feel free to check out our partnership options and get in touch with the team anytime — they’ll be sure to get back to you as soon as they can.

Enhancing Website Security: Seamless Authentication and User Management Integration of WordPress with Feather.js

In the dynamic realm of web development, establishing a secure and user-centric environment stands as a fundamental imperative. The amalgamation of WordPress, renowned for its robust backend capabilities, with the versatile frontend framework Feather.js, presents a compelling avenue for developers to implement sophisticated authentication and user management systems. This article delves into the significance of […]

The post Enhancing Website Security: Seamless Authentication and User Management Integration of WordPress with Feather.js first appeared on WPArena and is written by Nur ul Ain.

Web Design

Posts about Web Design written by Nick Schäferhoff, Colin Newcomer, Rana Bano, Melissa King, Jenny McKaig Speed, and The WordPress.com Team

Integrating Image-To-Text And Text-To-Speech Models (Part 2)

In Part 1 of this brief two-part series, we developed an application that turns images into audio descriptions using vision-language and text-to-speech models. We combined an image-to-text model that analyzes and understands images and generates descriptions with a text-to-speech model to create audio descriptions, helping people with sight challenges. We also discussed how to choose the right model to fit your needs.

Now, we are taking things a step further. Instead of just providing audio descriptions, we are building an app that can have interactive conversations about images or videos. This is known as Conversational AI — a technology that lets users talk to systems much like chatbots, virtual assistants, or agents.

While the first iteration of the app was great, the output still lacked some details. For example, if you upload an image of a dog, the description might be something like “a dog sitting on a rock in front of a pool,” and the app might produce something close but miss additional details such as the dog’s breed, the time of the day, or location.

The aim here is simply to build a more advanced version of the previously built app so that it not only describes images but also provides more in-depth information and engages users in meaningful conversations about them.

We’ll use LLaVA, a model that combines image understanding with conversational capabilities. After building our tool, we’ll explore multimodal models that can handle images, videos, text, audio, and more, all at once, to give you even more options and flexibility for your applications.

Visual Instruction Tuning and LLaVA

We are going to look at visual instruction tuning and the multimodal capabilities of LLaVA. We’ll first explore how visual instruction tuning can enhance the large language models to understand and follow instructions that include visual information. After that, we’ll dive into LLaVA, which brings its own set of tools for image and video processing.

Visual Instruction Tuning

Visual instruction tuning is a technique that helps large language models (LLMs) understand and follow instructions based on visual inputs. This approach connects language and vision, enabling AI systems to understand and respond to human instructions that involve both text and images. For example, Visual IT enables a model to describe an image or answer questions about a scene in a photograph. This fine-tuning method makes the model more capable of handling these complex interactions effectively.
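
To make this more concrete, a visual instruction tuning sample usually pairs an image with one or more instruction and response turns grounded in that image. Here’s a schematic record in a LLaVA-style conversation format (the file name and wording are invented for illustration):

#python
# A single (illustrative) visual instruction tuning record:
# an image reference plus instruction/response turns grounded in that image.
sample = {
  "id": "000001",
  "image": "dog_on_rock.jpg",  # hypothetical file name
  "conversations": [
    {"from": "human", "value": "<image>\nWhat breed is the dog and what is it doing?"},
    {"from": "gpt", "value": "It looks like a golden retriever resting on a rock beside a pool."},
    {"from": "human", "value": "What time of day does it appear to be?"},
    {"from": "gpt", "value": "The long shadows and warm light suggest late afternoon."}
  ]
}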

There’s a new training approach called LLaVAR that has been developed, and you can think of it as a tool for handling tasks related to PDFs, invoices, and text-heavy images. It’s pretty exciting, but we won’t dive into that since it is outside the scope of the app we’re making.

Examples of Visual Instruction Tuning Datasets

To build good models, you need good data — rubbish in, rubbish out. So, here are two datasets that you might want to use to train or evaluate your multimodal models. Of course, you can always add your own datasets to the two I’m going to mention.

Vision-CAIR

  • Instruction datasets: English;
  • Multi-task: Datasets containing multiple tasks;
  • Mixed dataset: Contains both human and machine-generated data.

Vision-CAIR provides a high-quality, well-aligned image-text dataset created using conversations between two bots. This dataset was initially introduced in a paper titled “MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models,” and it provides more detailed image descriptions and can be used with predefined instruction templates for image-instruction-answer fine-tuning.

There are more multimodal datasets out there, but these two should help you get started if you want to fine-tune your model.

Let’s Take a Closer Look At LLaVA

LLaVA (which stands for Large Language and Vision Assistant) is a groundbreaking multimodal model developed by researchers from the University of Wisconsin, Microsoft Research, and Columbia University. The researchers aimed to create a powerful, open-source model that could compete with the best in the field, just like GPT-4, Claude 3, or Gemini, to name a few. For developers like you and me, its open nature is a huge benefit, allowing for easy fine-tuning and integration.

One of LLaVA’s standout features is its ability to understand and respond to complex visual information, even with unfamiliar images and instructions. This is exactly what we need for our tool, as it goes beyond simple image descriptions to engage in meaningful conversations about the content.

Architecture

LLaVA’s strength lies in its smart use of existing models. Instead of starting from scratch, the researchers used two key models:

  • CLIP VIT-L/14
    This is an advanced version of the CLIP (Contrastive Language–Image Pre-training) model developed by OpenAI. CLIP learns visual concepts from natural language descriptions. It can handle any visual classification task by simply being given the names of the visual categories, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.
  • Vicuna
    This is an open-source chatbot trained by fine-tuning LLaMA on 70,000 user-shared conversations collected from ShareGPT. Training Vicuna-13B costs around $300, and it performs exceptionally well, even when compared to other models like Alpaca.

These components make LLaVA highly effective by combining state-of-the-art visual and language understanding capabilities into a single powerful model, perfectly suited for applications requiring both visual and conversational AI.
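
To make the wiring a little more tangible, here’s a heavily simplified sketch: a frozen CLIP vision encoder produces patch features, and a small projection layer maps them into the language model’s embedding space so they can sit alongside the text tokens. This is a conceptual sketch rather than LLaVA’s actual code, and the 4096-dimensional target size is a placeholder for a 7B Vicuna-style model:

#python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, CLIPImageProcessor
from PIL import Image

# Vision tower: CLIP ViT-L/14 (as in LLaVA), kept frozen
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Projection layer: maps CLIP's hidden size (1024) into the LLM's embedding size
# (4096 here, a placeholder for a 7B Vicuna-style model)
projector = nn.Linear(vision_tower.config.hidden_size, 4096)

image = Image.open("dog_on_rock.jpg")  # hypothetical image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
  patch_features = vision_tower(pixel_values).last_hidden_state  # shape (1, 257, 1024)

# Projected visual tokens, ready to be placed alongside the embedded text prompt
visual_tokens = projector(patch_features)  # shape (1, 257, 4096)
print(visual_tokens.shape)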

Training

LLaVA’s training process involves two important stages, which together enhance its ability to understand user instructions, interpret visual and language content, and provide accurate responses. Let’s detail what happens in these two stages:

  1. Pre-training for Feature Alignment
    LLaVA ensures that its visual and language features are aligned. The goal here is to update the projection matrix, which acts as a bridge between the CLIP visual encoder and the Vicuna language model. This is done using a subset of the CC3M dataset, allowing the model to map input images and text to the same space. This step ensures that the language model can effectively understand the context from both visual and textual inputs.
  2. End-to-End Fine-Tuning
    The entire model undergoes fine-tuning. While the visual encoder’s weights remain fixed, the projection layer and the language model are adjusted.

The second stage is tailored to specific application scenarios:

  • Instructions-Based Fine-Tuning
    For general applications, the model is fine-tuned on a dataset designed for following instructions that involve both visual and textual inputs, making the model versatile for everyday tasks.
  • Scientific reasoning
    For more specialized applications, particularly in science, the model is fine-tuned on data that requires complex reasoning, helping the model excel at answering detailed scientific questions.
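
In code terms, the two stages mostly differ in which parameters receive gradients. Here’s a minimal sketch that continues the architecture example above (reusing vision_tower and projector) and adds a hypothetical Vicuna-style language model; the learning rates are purely illustrative:

#python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical choice of language model; vision_tower and projector come from the sketch above.
language_model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

# Stage 1: pre-training for feature alignment -- only the projection layer learns.
for p in vision_tower.parameters():
  p.requires_grad = False
for p in language_model.parameters():
  p.requires_grad = False
for p in projector.parameters():
  p.requires_grad = True
stage1_optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)

# Stage 2: end-to-end fine-tuning -- the vision encoder stays frozen,
# while the projection layer and the language model are both updated.
for p in language_model.parameters():
  p.requires_grad = True
stage2_optimizer = torch.optim.AdamW(
  list(projector.parameters()) + list(language_model.parameters()), lr=2e-5
)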

Now that we’re clear on what LLaVA is and the role it plays in our application, let’s turn our attention to the next component we need for our work: Whisper.

Using Whisper For Speech-To-Text

In this chapter, we’ll check out Whisper, a great model for turning speech into text. Whisper is accurate and easy to use, making it perfect for transcribing users’ spoken questions in our app. We’ve used Whisper in a different article, but here, we’re going to use a new version: large-v3. This updated version of the model offers even better performance and speed.

Whisper large-v3

Whisper was developed by OpenAI, which is the same folks behind ChatGPT. Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. The original Whisper was trained on 680,000 hours of labeled data.

Now, what’s different with Whisper large-v3 compared to other models? In my experience, it comes down to the following:

  • Better inputs
    Whisper large-v3 uses 128 Mel frequency bins instead of 80. Think of Mel frequency bins as a way to break down audio into manageable chunks for the model to process. More bins mean finer detail, which helps the model better understand the audio.
  • More training
    This specific Whisper version was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio that was collected from Whisper large-v2. From there, the model was trained for 2.0 epochs over this mix.

Whisper models come in different sizes, from tiny to large. Here’s a table comparing the differences and similarities:

Size      Parameters  English-only  Multilingual
tiny      39 M        ✓             ✓
base      74 M        ✓             ✓
small     244 M       ✓             ✓
medium    769 M       ✓             ✓
large     1550 M      ✗             ✓
large-v2  1550 M      ✗             ✓
large-v3  1550 M      ✗             ✓

Integrating LLaVA With Our App

Alright, so we’re going with LLaVA for image inputs, and this time, we’re adding video inputs, too. This means the app can handle both images and videos, making it more versatile.

We’re also keeping the speech feature so you can hear the assistant’s replies, which makes the interaction even more engaging. How cool is that?

For transcribing your spoken questions, we’ll use Whisper, while gTTS turns the assistant’s replies into audio. We’ll stick with the Gradio framework for the app’s visual layout and user interface. You can, of course, always swap in other models or frameworks — the main goal is to get a working prototype.

Installing and Importing the Libraries

We will start by installing and importing all the required libraries. This includes the transformers library for loading the LLaVA model, bitsandbytes for quantization, the openai-whisper package for speech recognition, gTTS for text-to-speech, and moviepy to help in processing video files, including frame extraction.

#python
!pip install -q -U transformers==4.37.2
!pip install -q bitsandbytes==0.41.3 accelerate==0.25.0
!pip install -q git+https://github.com/openai/whisper.git
!pip install -q gradio
!pip install -q gTTS
!pip install -q moviepy

With these installed, we now need to import these libraries into our environment so we can use them. We’ll use Google Colab for that:

#python
import torch
from transformers import BitsAndBytesConfig, pipeline
import whisper
import gradio as gr
from gtts import gTTS
from PIL import Image
import re
import os
import datetime
import locale
import numpy as np
import nltk
import moviepy.editor as mp

nltk.download('punkt')
from nltk import sent_tokenize

# Set up locale
os.environ["LANG"] = "en_US.UTF-8"
os.environ["LC_ALL"] = "en_US.UTF-8"
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

Configuring Quantization and Loading the Models

Now, let’s set up a 4-bit quantization to make the LLaVA model more efficient in terms of performance and memory usage.

#python

# Configuration for quantization
quantization_config = BitsAndBytesConfig(
  load_in_4bit=True,
  bnb_4bit_compute_dtype=torch.float16
)

# Load the image-to-text model
model_id = "llava-hf/llava-1.5-7b-hf"
pipe = pipeline("image-to-text",
  model=model_id,
  model_kwargs={"quantization_config": quantization_config})

# Load the whisper model
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("large-v3", device=DEVICE)

In this code, we’ve configured the quantization to four bits, which reduces memory usage and improves performance. Then, we load the LLaVA model with these settings. Finally, we load the whisper model, selecting the device based on GPU availability for better performance.

Note: We’re using llava-v1.5-7b as the model. Please feel free to explore other versions of the model. For Whisper, we’re loading the “large” size, but you can also switch to another size like “medium” or “small” for your experiments.

To get our assistant up and running, we need to implement five essential functions:

  1. Handling conversations,
  2. Converting images to text,
  3. Converting videos to text,
  4. Transcribing audio,
  5. Converting text to speech.

Once these are in place, we will create another function to tie all this together seamlessly. The following sections provide the code that defines each function.

Conversation History

We’ll start by setting up the conversation history and a function to log it:

#python

# Initialize conversation history
conversation_history = []

def writehistory(text):
  """Write history to a log file."""
  tstamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
  logfile = f'{tstamp}_log.txt'
  with open(logfile, 'a', encoding='utf-8') as f:
    f.write(text + '\n')

Image to Text

Next, we’ll create a function to convert images to text using LLaVA and iterative prompts.

#python
def img2txt(input_text, input_image):
  """Convert image to text using iterative prompts."""
  try:
    image = Image.open(input_image)

    if isinstance(input_text, tuple):
      input_text = input_text[0]  # Take the first element if it's a tuple

    writehistory(f"Input text: {input_text}")
    prompt = "USER: <image>\n" + input_text + "\nASSISTANT:"
    while True:
      outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})

      if outputs and outputs[0]["generated_text"]:
        match = re.search(r'ASSISTANT:\s*(.*)', outputs[0]["generated_text"])
        reply = match.group(1) if match else "No response found."
        conversation_history.append(("User", input_text))
        conversation_history.append(("Assistant", reply))
        prompt = "USER: " + reply + "\nASSISTANT:"
        return reply  # Only return the first response for now
      else:
        return "No response generated."
  except Exception as e:
    return str(e)

Video to Text

We’ll now create a function to convert videos to text by extracting frames and analyzing them.

#python
def vid2txt(input_text, input_video):
  """Convert video to text by extracting frames and analyzing."""
  try:
    video = mp.VideoFileClip(input_video)
    frame = video.get_frame(1)  # Get a frame from the video at the 1-second mark
    image_path = "temp_frame.jpg"
    mp.ImageClip(frame).save_frame(image_path)
    return img2txt(input_text, image_path)
  except Exception as e:
    return str(e)
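
The function above only looks at a single frame, one second in. If you want the description to reflect more of the video, a simple variation (still just a sketch, reusing the same moviepy and img2txt pieces) is to sample a few evenly spaced frames and describe each one:

#python
def vid2txt_multiframe(input_text, input_video, num_frames=3):
  """Describe several evenly spaced frames instead of just one."""
  try:
    video = mp.VideoFileClip(input_video)
    replies = []
    for i in range(num_frames):
      # Sample frames at evenly spaced timestamps across the clip
      t = video.duration * (i + 1) / (num_frames + 1)
      frame = video.get_frame(t)
      image_path = f"temp_frame_{i}.jpg"
      mp.ImageClip(frame).save_frame(image_path)
      replies.append(img2txt(input_text, image_path))
    return " ".join(replies)
  except Exception as e:
    return str(e)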

Audio Transcription

Let’s add a function to transcribe audio to text using Whisper.

#python
def transcribe(audio_path):
  """Transcribe audio to text using Whisper model."""
  if not audio_path:
    return ''

  audio = whisper.load_audio(audio_path)
  audio = whisper.pad_or_trim(audio)
  mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)  # large-v3 expects 128 mel bins
  options = whisper.DecodingOptions()
  result = whisper.decode(model, mel, options)
  return result.text

Text to Speech

Lastly, we create a function to convert text responses into speech.

#python
def text_to_speech(text, file_path):
  """Convert text to speech and save to file."""
  language = 'en'
  audioobj = gTTS(text=text, lang=language, slow=False)
  audioobj.save(file_path)
  return file_path

With all the necessary functions in place, we can create the main function that ties everything together:

#python

def chatbot_interface(audio_path, image_path, video_path, user_message):
  """Process user inputs and generate chatbot response."""
  global conversation_history

  # Handle audio input
  if audio_path:
    speech_to_text_output = transcribe(audio_path)
  else:
    speech_to_text_output = ""

  # Determine the input message
  input_message = user_message if user_message else speech_to_text_output

  # Ensure input_message is a string
  if isinstance(input_message, tuple):
    input_message = input_message[0]

  # Handle image or video input
  if image_path:
    chatgpt_output = img2txt(input_message, image_path)
  elif video_path:
    chatgpt_output = vid2txt(input_message, video_path)
  else:
    chatgpt_output = "No image or video provided."

  # Add to conversation history
  conversation_history.append(("User", input_message))
  conversation_history.append(("Assistant", chatgpt_output))

  # Generate audio response
  processed_audio_path = text_to_speech(chatgpt_output, "Temp3.mp3")

  return conversation_history, processed_audio_path

Using Gradio For The Interface

The final piece for us is to create the layout and user interface for the app. Again, we’re using Gradio to build that out for quick prototyping purposes.

#python

# Define Gradio interface
iface = gr.Interface(
  fn=chatbot_interface,
  inputs=[
    gr.Audio(type="filepath", label="Record your message"),
    gr.Image(type="filepath", label="Upload an image"),
    gr.Video(label="Upload a video"),
    gr.Textbox(lines=2, placeholder="Type your message here...", label="User message (if no audio)")
  ],
  outputs=[
    gr.Chatbot(label="Conversation"),
    gr.Audio(label="Assistant's Voice Reply")
  ],
  title="Interactive Visual and Voice Assistant",
  description="Upload an image or video, record or type your question, and get detailed responses."
)

# Launch the Gradio app
iface.launch(debug=True)

Here, we want to let users record or upload their audio prompts, type their questions if they prefer, upload videos, and, of course, have a conversation block.

Here’s a preview of how the app will look and work:

Looking Beyond LLaVA

LLaVA is a great model, but there are even greater ones that don’t require a separate ASR model to build a similar app. These are called multimodal or “any-to-any” models. They are designed to process and integrate information from multiple modalities, such as text, images, audio, and video. Instead of just combining vision and text, these models can do it all: image-to-text, video-to-text, text-to-speech, speech-to-text, text-to-video, and image-to-audio, just to name a few. It makes everything simpler and less of a hassle.

Examples of Multimodal Models that Handle Images, Text, Audio, and More

Now that we know what multimodal models are, let’s check out some cool examples. You may want to integrate these into your next personal project.

CoDi

So, the first on our list is CoDi or Composable Diffusion. This model is pretty versatile, not sticking to any one type of input or output. It can take in text, images, audio, and video and turn them into different forms of media. Imagine it as a sort of AI that’s not tied down by specific tasks but can handle a mix of data types seamlessly.

CoDi was developed by researchers from the University of North Carolina and Microsoft Azure. It uses something called Composable Diffusion to sync different types of data, like aligning audio perfectly with the video, and it can generate outputs that weren’t even in the original training data, making it super flexible and innovative.

ImageBind

Now, let’s talk about ImageBind, a model from Meta. This model is like a multitasking genius, capable of binding together data from six different modalities all at once: images, video, audio, text, depth, and even thermal data.

Source: Meta AI.

ImageBind doesn’t need explicit supervision to understand how these data types relate. It’s great for creating systems that use multiple types of data to enhance our understanding or create immersive experiences. For example, it could combine 3D sensor data with IMU data to design virtual worlds or enhance memory searches across different media types.

Gato

Gato is another fascinating model. It’s built to be a generalist agent that can handle a wide range of tasks using the same network. Whether it’s playing games, chatting, captioning images, or controlling a robot arm, Gato can do it all.

The key thing about Gato is its ability to switch between different types of tasks and outputs using the same model.

GPT-4o

The next on our list is GPT-4o; GPT-4o is a groundbreaking multimodal large language model (MLLM) developed by OpenAI. It can handle any mix of text, audio, image, and video inputs and give you text, audio, and image outputs. It’s super quick, responding to audio inputs in just 232ms to 320ms, almost like a real conversation.

There’s a smaller version of the model called GPT-4o Mini. Small models are becoming a trend, and this one shows that even small models can perform really well. Check out this evaluation to see how the small model stacks up against other large models.

Conclusion

We covered a lot in this article, from setting up LLaVA for handling both images and videos to incorporating Whisper large-v3 for top-notch speech recognition. We also explored the versatility of multimodal models like CoDi or GPT-4o, showcasing their potential to handle various data types and tasks. These models can make your app more robust and capable of handling a range of inputs and outputs seamlessly.

Which model are you planning to use for your next app? Let me know in the comments!

Chris’ Corner: Variations on What Not to Do

I think the nail is in coffin now: you should never design something for the web with only one (or even a narrow set) of particular viewport sizes in mind. It’s just so darn tempting to think that way. You have a couple of pretty specific screen sizes in front of you right now, you likely design toward those to some degree. Design tools often ask you to draw a rectangle that represent a screen to design for. Testing tools sometimes show you a site at a set of pre-set screen sizes. It can feel normal and fine to design toward, say, three sizes, and hone in on them. Honestly, that might end up working fine, but it might not! It might lead to some awkward in-betweens, especially if you are very rigid in writing CSS that only changes at those specific breakpoints only.

That’s the thing, really. You just don’t have to think in really specific breakpoints anymore. Media query width breakpoints are still a fine tool, but now we’ve got viewport units, container units, container queries, calc/min/max/clamp, and all sorts of other stuff that allow you to design components and pages that work well and look good at the size and under the conditions they are in. It’s just a better way to code. But this stuff has only relatively recently arrived in CSS so it’ll take a minute for it all to settle in.

This isn’t even really new news. Over a decade ago, I was like, yo, there are a ton of different sizes that your site is getting viewed at. Deal with it. Now we can properly.

AND NOW FOR SOMETHING COMPLETELY DIFFERENT

Have websites gone to crap? Browse around popular sites, and I think you’ll land on an easy yes. Especially on mobile, cripes. Just to name a few: they are too slow to load, the ads and popups are too obtrusive, and there is too much usage of fixed-position elements that reduce usable area.

This website User Inyerface satirized it recently, and it’s pretty funny (ya know, if being intentionally frustrated is your thing, gamers should relate).

People have been worried about this for ages, and it never seems to get any better.

This all just makes me sad. Fortunately, most things are fine.

AND NOW FOR SOMETHING COMPLETELY DIFFERENT

Have you seen the popover API? It’s a neat idea, already play-with-able in Chrome. Think styled tooltips. The idea is that you connect some interaction (click of a button) to toggling another element with more information or context. Amazingly, to me, this HTML totally works in Chrome with no CSS or JavaScript at all:

<button popovertarget="my-popover">Open Popover</button>

<div id="my-popover" popover>
  <p>I am a popover with more information.</p>
</div>

You can style stuff with CSS of course, but the basics of the interaction work without. Like a <details> element.

Anytime we get any form of “state management” outside of JavaScript, the people will play! There are countless games made in CSS thanks to the whole idea of the :checked selector in CSS and using the ~ combinator to select other elements.

This time, leave it to Garth Heyes who has made Tic-Tac-Toe entirely in HTML only. That’s gotta be a first.

Wanna see it? Fair warning first. It’s 170 MB (!!) of HTML and “over half a million nodes”. Chrome really struggles with this. It took my machine maybe near a minute to even render the first page, and each click took a while as well. If you’re down to try it, see the demo.

AND NOW FOR SOMETHING RELATED BUT DIFFERENT

So now that we’ve looked at something you absolutely shouldn’t do on the web, here’s Heather Buchel with some things you absolutely should do on the web. Heather ain’t even mad that we’re building websites with newfangled tech and trying to share code across platforms and all that, but, just, like, don’t break stuff. Don’t break super duper basic stuff that websites easily do and are good for everyone. I’ll hijack her whole list, but of course go read it for more context:

  • Let me copy text so I can paste it.
  • If something navigates like a link, let me do link things.
  • Let me zoom in on my browser without the website getting all out of whack.
  • Do responsive things.
  • Let me have hover styles.
  • If the UI completely changes when I click on something, as if I’ve navigated to a new page, give me a browser history update and a new URL.
  • Let me see scroll bars.
  • Stop hijacking my typical browser shortcuts for use in your own app.

Reasonable asks, no?

AND NOW FOR SOMETHING ALONG THOSE SAME LINES

Onnnnneeee more thing you should be really careful about doing on the web. Adam Silver: The problem with sticky menus and what to do instead.

One problem is fairly obvious with sticky menus: they overlap stuff! They get in the dang way far too often.

But there are other things that cause problems that you might not see right away. Adam mentions zooming. One little zoom or two might kick a sticky/fixed element right off the page. Also, if something opens a sticky menu, and that menu happens to be taller than the viewport, you’ve got issues. You either need that area to be scrollable (but nested scrolling sucks) or you require users to scroll likely further than they want to just to see more of the menu. Ughghadk.

Adam lists three more that are just as bad or worse, and even less obvious at first glance. I’ll force you over there to see them. But I’ll snag the good ending, featuring the alternatives:

  1. Keep pages short: Sticky menus are a symptom of long pages so fix the root cause.
  2. Just let users scroll: It’s a myth that scrolling is a problem. Even on mobile, the top of the page is a flick or 2 away mostly.
  3. Put relevant links in context: For example, add a subscribe form to the end of a post or add a CTA to a pricing section.
  4. Use a back-to-top link: They’re relatively unobtrusive (but only do this once you exhaust the other options).

Send Time Optimization

Did you know that email Send Time Optimization (STO) can improve the open rate by up to 93%? Awesome! Or it might only be 10%. A slightly more credible case study claims that message delivery at the right time resulted in an open rate of 55%, a click rate of 30%, and a conversion rate of 13%. I’ll take that increase any day if there’s a positive ROI. 

Optimization can be applied to any number of problems. It can be applied equally to content, where it may be to the customer’s benefit, as it can be applied to price, where optimization can deliver the maximum possible price for merchants.