As a data scientist, I have extensively used the Hugging Face library for processing unstructured data such as images, text, and audio. My previous blogs have covered various transformer models for these types of data. Recently, however, I discovered that the Hugging Face Hub also hosts transformer models for tabular data. One such model is the Meta Tree Transformer.
This article will explore using the Meta Tree Transformer model to classify tabular data, detailing each process step and providing insights based on the Bank Note Authentication dataset.
Installing and Importing Required Libraries
You must install and import the following libraries to run the code in this article.
!pip install metatreelib
!pip install --upgrade scikit-learn
!pip install imodels
from metatree.model_metatree import LlamaForMetaTree as MetaTree
from metatree.decision_tree_class import DecisionTree, DecisionTreeForest
from metatree.run_train import preprocess_dimension_patch
from transformers import AutoConfig
from sklearn.metrics import accuracy_score
import imodels # pip install imodels
import sklearn
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
import random
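The random import is there so you can fix the seeds and make your runs reproducible. This step is optional, and the seed value below is just an example:
# Optional: fix seeds for reproducibility (seed value is arbitrary)
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)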
Loading and Preprocessing the Dataset
The dataset used in this tutorial is the Bank Note Authentication dataset, which you can download from Kaggle. The dataset contains features extracted from images of banknotes and is used to classify whether a banknote is authentic or not.
The dataset consists of the following columns:
- variance
- skewness
- curtosis
- entropy
- class
The class column is the target variable, indicating whether the banknote is authentic (1) or not (0).
First, we need to read and preprocess the dataset. We will use the pandas library to read it from a CSV file.
The code below loads the dataset into a pandas DataFrame and displays the first few rows to get an overview of the data.
# Load the dataset
file_path = '/content/BankNote_Authentication.csv' # Path to the dataset
df = pd.read_csv(file_path)
# Display the first few rows of the dataset
df.head()
Output: the first five rows of the DataFrame, showing the four feature columns and the class label.
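Before moving on, it is worth quickly checking the shape of the data and the class balance. This check is optional and not strictly required for the rest of the tutorial:
# Inspect the shape and class balance (optional)
print(df.shape)                    # (1372, 5) for the standard dataset
print(df['class'].value_counts())  # how many notes fall into each class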
Next, we split the dataset into training and testing sets using sklearn.model_selection.train_test_split. Here, 20% of the data is reserved for testing, ensuring we have sufficient data to evaluate the model's performance.
# Split the dataset into features and target variable
X = df.drop(columns=['class'])
y = df['class']
# Split the data into training and testing sets
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=42)  # random_state assumed for reproducibility
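As a quick, optional check, you can confirm the split sizes and that both classes appear in each split:
# Verify the split (optional)
print(train_X.shape, test_X.shape)  # roughly an 80/20 split
print(train_y.value_counts())
print(test_y.value_counts())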
Creating a DataLoader for Batching
To handle the entire dataset in batches of 256, we will create a custom dataset class and use PyTorch's DataLoader to batch and shuffle the data. The batch size is set to 256 because the Meta Tree Transformer is designed to generate each decision tree from a batch of at most 256 records.
class TabularDataset(Dataset):
    """Wraps the feature and label arrays so a DataLoader can batch them."""
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        # Return one (feature, label) pair as float32 tensors
        feature = self.features[idx]
        label = self.labels[idx]
        return torch.tensor(feature, dtype=torch.float32), torch.tensor(label, dtype=torch.float32)
# Extract the feature matrix as a NumPy array; tensor conversion happens in __getitem__
train_features = train_X.values
# One-hot encode the integer labels into shape (N, 2)
train_labels = torch.nn.functional.one_hot(torch.tensor(train_y.values), num_classes=2).float().numpy()
# Create Dataset
train_dataset = TabularDataset(train_features, train_labels)
# Parameters
batch_size = 256
# Create DataLoader
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
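To confirm that the loader behaves as expected, you can peek at a single batch. With the standard 1,372-row dataset, the training set has 1,097 rows, so every batch except the last should contain exactly 256 samples:
# Peek at one batch (optional)
features, labels = next(iter(train_loader))
print(features.shape)  # torch.Size([256, 4]) -- 256 rows, 4 features
print(labels.shape)    # torch.Size([256, 2]) -- one-hot labels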
Setting Up the Meta Tree Transformer Model
To begin with, we need to initialize the Meta Tree Transformer model and adjust its configuration to match our dataset, particularly the number of features and classes.
The model is configured to handle a different number of features by default, so we set config.n_feature to the number of features in our dataset (train_X.shape[1]). Similarly, the model is configured for a different number of output classes, so we set config.n_class to the number of classes in our dataset (2).
# Initialize Model
model_name_or_path = "yzhuang/MetaTree"
config = AutoConfig.from_pretrained(model_name_or_path)
# Override config parameters to match your dataset
config.n_feature = train_X.shape[1]
config.n_class = 2
model = MetaTree.from_pretrained(
model_name_or_path,
config=config,
ignore_mismatched_sizes=True
)
# Create an empty forest to collect the generated trees
decision_tree_forest = DecisionTreeForest()
# Set the depth of the generated trees
model.depth = 2
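As an optional sanity check, you can switch the model to inference mode and confirm the configuration. A depth of 2 means each generated tree has at most 2^2 = 4 leaves:
# Optional sanity checks
model.eval()  # inference only; no weights are updated in this tutorial
print(config.n_feature, config.n_class)          # 4 2
print("Max leaves per tree:", 2 ** model.depth)  # depth 2 -> 4 leaves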
Generating the Decision Tree Forest in Batches
Next, we iterate over the batches from the DataLoader. For each batch, the pretrained model generates one decision tree, which we add to the forest. Note that no weights are updated here; the pretrained transformer acts as the tree-induction algorithm.
# Build the forest: one decision tree per batch
for batch_features, batch_labels in train_loader:
    # Prepare the batch for the model
    batch = {"input_x": batch_features, "input_y": batch_labels, "input_y_clean": batch_labels}
    batch = preprocess_dimension_patch(batch, n_feature=train_X.shape[1], n_class=2)
    # Generate a decision tree from this batch and add it to the forest
    outputs = model.generate_decision_tree(batch['input_x'], batch['input_y'], depth=model.depth)
    decision_tree_forest.add_tree(
        DecisionTree(
            auto_dims=outputs.metatree_dimensions,
            auto_thresholds=outputs.tentative_splits,
            input_x=batch['input_x'],
            input_y=batch['input_y'],
            depth=model.depth,
        )
    )
    print("Decision Tree Features: ", [x.argmax(dim=-1) for x in outputs.metatree_dimensions])
    print("Decision Tree Thresholds: ", outputs.tentative_splits)
Evaluating the Model
Finally, we evaluate the model's performance on the test set.
# Predict using the decision tree forest
test_X_tensor = torch.tensor(test_X.values, dtype=torch.float32)
tree_pred = decision_tree_forest.predict(test_X_tensor)
tree_pred = tree_pred.argmax(dim=-1).squeeze().numpy()
# Calculate accuracy
accuracy = accuracy_score(test_y, tree_pred)
print("MetaTree Test Accuracy: ", accuracy)
Output:
MetaTree Test Accuracy: 0.8727272727272727
Conclusion
The Meta Tree Transformer offers a powerful method for classifying tabular data by combining the interpretability of decision trees with the robust performance of transformer models. In this tutorial, we walked through setting up the model, preprocessing the data, building the forest batch by batch, and evaluating it on a held-out test set.
In my experience, the performance of the Meta Tree Transformer was on par with simpler algorithms like Random Forest and AdaBoost. Experimenting with different parameters and datasets can further enhance its performance, making it a valuable addition to any data scientist's toolkit.
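If you want to run that comparison yourself, a minimal sketch using scikit-learn's RandomForestClassifier with default hyperparameters might look like the following (exact numbers will vary with the split and seed):
from sklearn.ensemble import RandomForestClassifier
# Baseline: a plain random forest on the same train/test split
rf = RandomForestClassifier(random_state=42)
rf.fit(train_X, train_y)
print("Random Forest Test Accuracy: ", rf.score(test_X, test_y))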
Feel free to share your feedback and any results you obtain using transformer models on tabular data.