Databricks vs Snowflake: The Definitive Guide

There is a lot of debate over whether Snowflake or Databricks is the better modern cloud solution for analytics. However, the two were purpose-built for different tasks, so an “apples to apples” comparison doesn’t do either of them justice.

With that in mind, I’ll do my best to break down some of the core differences between the two and share the pros and cons of each as even-handedly as possible. Before diving into the weeds of Snowflake and Databricks, though, it is important to understand the overall ecosystem.

How to Evaluate MLOps Platforms

Companies that have pioneered the application of AI at scale did so using their own in-house ML platforms (Uber, LinkedIn, Facebook, Airbnb). Many vendors are now making these capabilities available as off-the-shelf products, and there is also a range of open-source tools addressing MLOps. The rush into the space has created a new problem — too much choice. There are now hundreds of tools and at least 40 platforms available:

(Timeline image from Thoughtworks Guide to Evaluating MLOps Platforms.)

Azure Databricks: 14 Best Practices For a Developer

1. Choice of Programming Language

  • The choice of language depends on the cluster mode. A cluster can be created in one of two modes: Standard or High Concurrency. A High Concurrency cluster supports R, Python, and SQL, whereas a Standard cluster supports Scala, Java, SQL, Python, and R.
  • Spark is developed in Scala and is the underlying processing engine of Databricks, so Scala generally performs better than Python and SQL for Spark workloads. Hence, on a Standard cluster, Scala is the recommended language for developing Spark jobs.

2. ADF for Invoking Databricks Notebooks

  • Eliminate Hardcoding: In certain scenarios, Databricks needs configuration details about other Azure services, such as the storage account name or the database server name. The ADF pipeline stores these details in pipeline variables. When the Databricks notebook is invoked within the ADF pipeline, the configuration details are passed from the pipeline variables to Databricks widget variables, thereby eliminating hardcoding in the Databricks notebooks.

Databricks Notebook Settings Graphic
  • Notebook Dependencies: It is easier to establish notebook dependencies in ADF than in Databricks itself, and in case of failure, debugging a series of notebook invocations in an ADF pipeline is more convenient.

Notebook Dependencies Graphic

  • Cheap: When a notebook is invoked through ADF, the ephemeral job cluster pattern is used to process the Spark job, because the lifecycle of the cluster is tied to the lifecycle of the job. These short-lived clusters cost less than clusters created through the Databricks UI.

3. Using Widget Variables

The configuration details are made accessible to the Databricks code through widget variables. The configuration data is transferred from pipeline variables to widget variables when the notebook is invoked in the ADF pipeline. During the development phase, to mimic the behavior of a notebook run by ADF, widget variables are created manually using the following line of code.
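A minimal Python sketch of what that looks like (the widget name and default value here are placeholders, not values from the article):

```python
# Create the widget manually during development; when the notebook is run by ADF,
# the pipeline supplies the value instead.
dbutils.widgets.text("storage_account_name", "")

# Read the widget value inside the notebook.
storage_account_name = dbutils.widgets.get("storage_account_name")
```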

Execute Spark Applications on Databricks Using the REST API

Introduction

While many of us are accustomed to executing Spark applications with the 'spark-submit' command, the popularity of Databricks is steadily relegating this seemingly simple activity to the background. Databricks has made it very easy to provision Spark-enabled VMs on the two most popular cloud platforms, namely AWS and Azure, and a couple of weeks ago Databricks announced availability on GCP as well. The beauty of Databricks is that it is very easy to get started on the platform. While Spark application development will continue to have its challenges - depending on the problem being addressed - the Databricks platform takes away the pain of having to establish and manage your own Spark cluster.

Using Databricks

Once registered, the Databricks platform allows us to define a cluster of one or more VMs, with configurable RAM and executor specifications. We can also define a cluster that launches a minimum number of VMs at startup and then scales out to a maximum number of VMs as required. After defining the cluster, we have to define jobs and notebooks. Notebooks contain the actual code executed on the cluster. We need to assign notebooks to jobs, as the Databricks cluster executes jobs (and not notebooks). Databricks also allows us to set up the cluster so that it can download additional JARs and/or Python packages during cluster startup. We can also upload and install our own packages (I used a Python wheel).
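To give a feel for how this ties into the REST API approach, here is a minimal Python sketch that triggers an existing job through the Databricks Jobs API; the workspace URL, access token, and job ID are placeholders you would replace with your own values:

```python
import requests

# Placeholders - substitute your own workspace URL, personal access token, and job ID.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
JOB_ID = 12345  # the job to which the notebook has been assigned

# Trigger a run of the job via the Jobs API.
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": JOB_ID},
)
response.raise_for_status()
print("Started run:", response.json()["run_id"])
```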

Model Experiments, Tracking, and Registration Using MLflow on Databricks and StreamSets

Learn how StreamSets, a modern data integration platform for DataOps, can help expedite operations at some of the most crucial stages of the machine learning lifecycle and MLOps.

Data Acquisition and Preparation

Machine learning models are only as good as the quality and size of the datasets used to train them. Surveys have shown that data scientists spend around 80% of their time preparing and managing data for analysis, and that 57% of data scientists regard cleaning and organizing data as the least enjoyable part of their work. This further validates the idea of MLOps and the need for collaboration between data scientists and data engineers.
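For context on the experiment tracking and model registration referenced in the title, a minimal MLflow sketch on Databricks might look as follows; the model, metric names, and registry name are placeholders for illustration:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Track a training run: log parameters, metrics, and the fitted model.
X, y = load_iris(return_X_y=True)
with mlflow.start_run() as run:
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")

# Register the logged model in the MLflow Model Registry ("demo_model" is a placeholder name).
mlflow.register_model(f"runs:/{run.info.run_id}/model", "demo_model")
```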

Collecting Logs in Azure Databricks

Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. In this blog, we are going to see how we can collect logs from Azure Databricks and send them to Azure Log Analytics (ALA). Before going further, we need to look at how to set up a Spark cluster in Azure Databricks.

Create a Spark Cluster in Databricks

  1. In the Azure portal, go to the Databricks workspace that you created, and then click Launch Workspace.
  2. You are redirected to the Azure Databricks portal. From the portal, click New Cluster.
  3. Under “Advanced Options,” click on the “Init Scripts” tab. Go to the last line under the “Init Scripts” section. Under the “Destination” dropdown, select “DBFS,” and enter “dbfs:/databricks/spark-monitoring/spark-monitoring.sh” in the text box. Click the “Add” button.
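As an optional sanity check (this step is an assumption, not part of the original instructions, and presumes spark-monitoring.sh has already been copied to DBFS), you can confirm from a notebook that the script exists at the path entered above:

```python
# List the spark-monitoring directory to verify spark-monitoring.sh is present
# at the DBFS path referenced by the init script.
display(dbutils.fs.ls("dbfs:/databricks/spark-monitoring/"))
```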

Run a Spark SQL job

  1. In the left pane, select Azure Databricks. Under Common Tasks, select New Notebook.
  2. In the Create Notebook dialog box, enter a name, select a language, and select the Spark cluster that you created earlier.
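With the notebook attached to the cluster, a minimal Spark SQL job might look like the following Python sketch; the data and view name are made up for illustration:

```python
# Build a small DataFrame and register it as a temporary view.
df = spark.range(1, 1001).withColumnRenamed("id", "value")
df.createOrReplaceTempView("sample_values")

# Run a Spark SQL query against the view; with the spark-monitoring init script
# configured above, the job's logs and metrics flow to Azure Log Analytics.
spark.sql("SELECT COUNT(*) AS n, AVG(value) AS avg_value FROM sample_values").show()
```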