Import data into CockroachDB with Kerberos authentication

Articles Covering CockroachDB and Kerberos

I find the topic of Kerberos very interesting, and my colleagues commonly come to me for help with this complex topic. I am by no means an expert at Kerberos; I am, however, familiar enough with it to be dangerous. That said, I've written multiple articles on the topic, which you may find below:

I was recently asked by a customer whether GSSAPI gets in the way of doing a table import in CockroachDB. The short answer is that it shouldn't, since GSSAPI is abstracted away from any bulk-IO operations. I've previously written articles on doing an import into Cockroach, here and here, and encourage you to review those articles. So today we're going to focus specifically on the import with Kerberos.
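
To make that concrete, here is a minimal sketch of an import run from a session that was authenticated via GSSAPI; the table and the HTTP endpoint are assumptions, and the statement is identical to what you'd run with any other authentication method:

-- Hypothetical table and file server; the session itself was opened by a Kerberos-authenticated user.
CREATE TABLE IF NOT EXISTS office_dogs (id INT PRIMARY KEY, name STRING);
IMPORT INTO office_dogs (id, name)
    CSV DATA ('http://fileserver:3000/office_dogs.csv');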

Data Federation With CockroachDB and Presto

Motivation

A customer inquired whether data federation is possible natively in CockroachDB. Unfortunately, CockroachDB does not support features like foreign data wrappers. A quick search returned a slew of possibilities, and Presto, being a prominent choice, sparked my interest.

High-Level Steps

  • Install Presto
  • Configure the PostgreSQL catalog
  • Configure TPCH catalog
  • Verify
  • Wrap up

Step by Step Instructions

Install Presto

I'm using a Mac, and luckily there's a Homebrew package available.
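
To show where this is headed, once Presto is installed and the two catalogs from the steps above are configured, a federated query looks roughly like this; the catalog names follow the properties file names, and the CockroachDB-side table and columns are assumptions:

-- Hypothetical federated query in the Presto CLI: tpch.tiny.nation ships with the TPCH connector,
-- while postgresql.public.customers would be a CockroachDB table exposed via the PostgreSQL catalog.
SELECT n.name, count(*) AS customers
FROM postgresql.public.customers AS c
JOIN tpch.tiny.nation AS n ON c.nation_key = n.nationkey
GROUP BY n.name
ORDER BY customers DESC;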

Exploring CockroachDB With Flyway Schema Migration Tool

Today, I am going to quickly introduce you to Flyway and some of the new capabilities in CockroachDB that leverage schema migrations. This is by no means a deep dive on Flyway; for that, I highly recommend you get familiar with Flyway's documentation. With that, let's dive in.
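
To set expectations before the setup, here is the shape of a versioned migration that Flyway picks up and applies; the file name and table below are hypothetical:

-- V1__create_accounts.sql: a hypothetical first migration Flyway would apply against CockroachDB
CREATE TABLE IF NOT EXISTS accounts (
    id      UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    balance DECIMAL NOT NULL DEFAULT 0
);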

I will continue to use a docker-compose environment for the following tutorial as it fits nicely with the iterative model of development and deployment with schema migration tools. We will need a recent CockroachDB image. My current folder tree looks like so:

Exploring CockroachDB with ipython-sql and Jupyter Notebook

Today, I will demonstrate how ipython-sql can be leveraged to query CockroachDB. This will require a secure instance of CockroachDB for reasons I will explain below.

Running a secure docker-compose instance of CRDB is beyond the scope of this tutorial. Instead, I will publish everything you need to get through the tutorial in my repo, including the Jupyter Notebook. You may also use CRDB docs to stand up a secure instance and change the URL in the notebook to follow along.

This post will dive deeper into the Python ecosystem and build on my previous Python post. Instead of relying on pandas alone, we're going to use a popular SQL extension called ipython-sql, a.k.a. SQLmagic, to execute SQL queries against CRDB.


As stated earlier, we need to use a secure instance of CockroachDB. In fact, from this point forward, I will attempt to write posts only with secure clusters, as that's the recommended approach. Ipython-sql uses SQLAlchemy underneath, and it expects database URLs in the format postgresql://username:password@hostname:port/dbname. CockroachDB does not support password fields with insecure clusters, as passwords alone will not protect your data.
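
Since the secure cluster requires authentication, a SQL user with a password has to exist before the notebook can connect; a minimal sketch, with a hypothetical user name and password:

-- Hypothetical notebook user for the secure cluster.
CREATE USER IF NOT EXISTS maxroach WITH PASSWORD 'notebook_secret';
GRANT ALL ON DATABASE defaultdb TO maxroach;
-- The notebook's connection URL would then look along these lines:
-- postgresql://maxroach:notebook_secret@localhost:26257/defaultdb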

Import Data From Hadoop Into CockroachDB

CockroachDB can natively import data from HTTP endpoints, object storage with respective APIs, and local/NFS mounts. The full list of supported schemes can be found here.

It does not support the HDFS file scheme, so we're left to our wild imagination to find alternatives. As previously discussed, the Hadoop community is working on Hadoop Ozone, a native, scalable object store with S3 API compatibility. For reference, here's my article demonstrating CockroachDB and Ozone integration. The limitation here is that you need to run Hadoop 3 to get access to it.

What if you're on Hadoop 2? There are several choices I can think of off the top of my head. One approach is to expose WebHDFS and IMPORT using an HTTP endpoint. The second option is to leverage the previously discussed Minio to expose HDFS via HTTP or S3. Today, we're going to look at both approaches.

My setup consists of a single-node pseudo-distributed Hadoop cluster with Apache Hadoop 2.10.0 running inside a VM provisioned by Vagrant. Minio runs as a service inside the VM and CockroachDB is running inside a docker container on my host machine.
  • Information on CockroachDB can be found here.
  • Information on Hadoop Ozone can be found here.
  • Information on Minio can be found here.
  1. Upload a file to HDFS.

I have a CSV file I created with my favorite data generator tool, Mockaroo.

curl "https://api.mockaroo.com/api/38ef0ea0?count=1000&key=a2efab40" > "part5.csv"
hdfs dfs -mkdir /data
hdfs dfs -chmod -R 777 /data
hdfs dfs -put part5.csv /data
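
With the file in HDFS, the first approach boils down to pointing IMPORT at the WebHDFS REST endpoint; the host, port, table, and columns below are assumptions (port 50070 is the Hadoop 2 namenode HTTP default, and op=OPEN is the WebHDFS read operation):

-- A sketch of importing the CSV straight from WebHDFS into a hypothetical table.
CREATE TABLE IF NOT EXISTS part5 (id INT PRIMARY KEY, first_name STRING, last_name STRING, email STRING);
IMPORT INTO part5 (id, first_name, last_name, email)
    CSV DATA ('http://hadoop-vm:50070/webhdfs/v1/data/part5.csv?op=OPEN')
    WITH skip = '1';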


CockroachDB With Kerberos and Docker Compose

Articles covering CockroachDB and Kerberos

I find the topic of Kerberos very interesting, and my colleagues commonly come to me for help with this complex topic. I am by no means an expert at Kerberos; I am, however, familiar enough with it to be dangerous. That said, I've written multiple articles on the topic, which you may find below:

As our customers are increasing their footprint and considering production use cases, I'm being asked to walk through the typical steps of enabling Kerberos auth for Cockroach. As this process is pretty heavy-handed, a few of us at Cockroach sought out a repeatable process for getting an on-demand environment up quickly and efficiently. CockroachDB source code is a good starting point for learning the inner workings of CRDB. I knew there were compose recipes available with Kerberos to test the integration, but they are typically written for Go language tests. We decided to introduce our own docker compose with nothing but a Kerberos realm, an instance of CockroachDB, and a Postgres client to connect with. The last part is only necessary for a bit longer, as we're actively working on building GSSAPI support into the cockroach CLI.
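
For context, the CockroachDB side of the Kerberos setup is mostly a host-based authentication rule set from a SQL shell; a minimal sketch, assuming an EXAMPLE.COM realm and a hypothetical user name:

-- Accept GSSAPI (Kerberos) logins and strip the realm from the principal.
SET CLUSTER SETTING server.host_based_authentication.configuration = 'host all all all gss include_realm=0 krb_realm=EXAMPLE.COM';
-- The SQL user matching the Kerberos principal still has to exist.
CREATE USER IF NOT EXISTS tester;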

CockroachDB CDC Using Minio as Cloud Storage Sink – Part 3

This is the third in the series of tutorials on CockroachDB and Docker Compose. Today, we're going to explore the change data capture (CDC) capability in CockroachDB Enterprise Edition using the Minio object store as a sink. To achieve this, we're going to reuse the compose file from the first two tutorials and finally bring this series to a close. Without further ado, let's get started.

You can find the first post here and the second post here.
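
As a preview of where we'll end up, the changefeed into Minio looks roughly like this; the table, bucket, credentials, and endpoint are assumptions from the compose environment, and depending on your CockroachDB version the sink scheme may need an experimental- prefix:

-- Changefeeds require rangefeeds to be enabled.
SET CLUSTER SETTING kv.rangefeed.enabled = true;
-- Stream changes for a hypothetical table into a Minio bucket via the S3-compatible endpoint.
CREATE CHANGEFEED FOR TABLE office_dogs
    INTO 's3://cdc-bucket?AWS_ACCESS_KEY_ID=minio&AWS_SECRET_ACCESS_KEY=minio123&AWS_ENDPOINT=http://minio:9000'
    WITH updated, format = json;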

Running CockroachDB With Docker Compose and Minio – Part 2

CockroachDB, Docker Compose, and Minio

This is my second post on creating a multi-service architecture with docker-compose. We're building a microservice architecture with CockroachDB writing changes in real time to an S3 bucket in JSON format. The S3 bucket is served by a service called Minio, which can act as an S3 appliance on-premises or serve as a local gateway to your cloud storage.

You can find the first post here.

Running CockroachDB With Docker Compose – Part 1

Over the next few weeks, I will be publishing a series of tutorials on CockroachDB and various third-party tools to demonstrate how easy it is to integrate CockroachDB into your daily work. Today, we're covering docker-compose and a single-node Cockroach cluster, as that will be the foundation for the next blog post.

CockroachDB and Docker Compose

This is the first in a series of tutorials on CockroachDB and Docker Compose.

CockroachDB Admission Control

Last week, while running a workload consisting of 200 different queries, we noticed right away that a CPU imbalance was causing a performance issue. Looking at the first graph below, you can see that one of the three CockroachDB nodes was operating at near 100% CPU. Not ideal.

At the time we installed that cluster, we utilized version 21.1.11 of CockroachDB, which was the most current. Now fast forward in time to November 16 — version 21.2.0 has just been released, with a new feature called Admission Control. Per the docs, one scenario where Admission Control can help is when a node is overloaded, which is exactly the kind of CPU saturation we were seeing.
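
For reference, admission control in 21.2 is switched on through cluster settings; a minimal sketch (check your version's docs for the current defaults):

-- Enable admission control for KV work and for SQL response processing.
SET CLUSTER SETTING admission.kv.enabled = true;
SET CLUSTER SETTING admission.sql_kv_response.enabled = true;
SET CLUSTER SETTING admission.sql_sql_response.enabled = true;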

Configuring CockroachDB With Active Directory

Today, I'm going to cover CockroachDB Active Directory integration. Under the covers, Cockroach utilizes GSSAPI. Currently, Cockroach only supports user mapping; it does not support syncing users from an AD OU to a Cockroach role.

My lab environment consists of an AD controller running Windows Server 2016 as a VirtualBox VM and a Vagrant VM with CentOS 7 hosting CockroachDB. The VMs share a host-only network.  This was critical in my setup so that my Cockroach node could interact with AD on port 88.
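
Because only user mapping is supported, each AD principal needs a matching SQL user created by hand; a minimal sketch with hypothetical names:

-- Hypothetical AD user mapped to a SQL user of the same name.
CREATE USER IF NOT EXISTS jdoe;
GRANT ALL ON DATABASE defaultdb TO jdoe;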

Loading Thousands of Tables in Parallel With Ray Into CockroachDB Because Why Not?

I came across an interesting scenario while working with one of our customers. They are using a common data integration tool to load hundreds of tables into CockroachDB simultaneously. They reported that their loads fail intermittently with an unrecognized error. As a debugging exercise, I set out to write a script to import data from an HTTP endpoint into CRDB in parallel.

Disclosure: I do not claim to be an expert in CRDB, Python, or anything else for that matter. This is an exercise in answering a "why not?" question more than anything educational. I wrote a Python script to execute an import job and needed to make sure it executes in parallel to achieve the concurrency scenario I originally set out to test.
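
For context, each parallel worker in the script ultimately issues a plain IMPORT statement of this shape; the table name and endpoint are hypothetical, and the target table must already exist:

-- The statement a single worker would run; the script fans hundreds of these out concurrently.
IMPORT INTO table_0001 (id, first_name, last_name)
    CSV DATA ('http://fileserver:3000/table_0001.csv');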

Migrate a SQL Server Database to CockroachDB

The fastest way to get started with SQL Server is via the available Docker containers. I'm using the following tutorial to deploy SQL Server on Ubuntu from my Mac. My SQL Server-fu is a bit rusty, and I opted to follow this tutorial to restore the WideWorldImporters sample database into my Docker container. You may also need SQL Server tools installed on your host; you can find directions for macOS and Linux at the following site, and users of Windows are quite familiar with the download location for their OS.

I also used the following directions to install SQL Server tools on my Mac but ran into compatibility issues with the drivers in my Docker container. This will be a debug session for another day.

Run SQL Server in Docker

Pull the SQL Server image:

docker pull mcr.microsoft.com/mssql/server:2019-GA-ubuntu-16.04
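
Once the container is running, the sample database is restored with T-SQL along these lines; the backup path and logical file names are assumptions, so list the real ones with RESTORE FILELISTONLY first and add a MOVE clause for every file it reports:

-- List the logical file names contained in the backup.
RESTORE FILELISTONLY FROM DISK = '/var/opt/mssql/backup/wwi.bak';
-- Then restore, mapping each logical file to a path inside the container (names here are assumptions).
RESTORE DATABASE WideWorldImporters
    FROM DISK = '/var/opt/mssql/backup/wwi.bak'
    WITH MOVE 'WWI_Primary' TO '/var/opt/mssql/data/WideWorldImporters.mdf',
         MOVE 'WWI_Log' TO '/var/opt/mssql/data/WideWorldImporters.ldf';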


Global Tables vs Duplicate Indexes

With CockroachDB version 21.1, a new concept of “global tables” was introduced. This functionality is a good fit for situations where read latency on a table needs to be very low and write latency can be much higher than normal [read: it needs to be fast when doing lookups, but we can tolerate lengthy writes]. It is also a fit when reads must be up to date for business reasons, when the table is referenced by foreign keys, or when latency-critical reads cannot be tied to specific regions. Using the global tables functionality is therefore ideal for any kind of reference table, where we want to get the data out very quickly but can handle long inserts and updates.

With global tables, the lengthy writes do all the pre-work to prevent the contention and serializable errors that we would traditionally see. By the time we read from the table, that work has already been done, so we can read from the closest replica, effectively as if we were doing a follower read but in real time.
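
For reference, declaring a table global is a one-line locality change on a multi-region database; a minimal sketch with hypothetical database, region, and table names:

-- The database must be multi-region before a table can be made GLOBAL.
ALTER DATABASE movr SET PRIMARY REGION "us-east1";
ALTER DATABASE movr ADD REGION "us-west1";
-- Declare a reference table as a global table.
ALTER TABLE promo_codes SET LOCALITY GLOBAL;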

CockroachDB TIL: Volume 4

Topics

Topic 1: Reset Network and Execution Statistics in CLI

This topic covers what trivial misconfigurations may lead to if you don't keep client and server versions uniform; mismatched versions can have unforeseen consequences. This is one of those situations where a user reported missing info from their CLI. In this case, they were missing the execution and network latency that the CLI conveniently outputs after every query.

 
artem@free-tier.gcp-us-central1.cockroachlabs.cloud:26257/defaultdb> select * from results limit 1;
                   id                  |   city    |  type   |               owner_id               |    creation_time    | status |   current_location    |       ext
---------------------------------------+-----------+---------+--------------------------------------+---------------------+--------+-----------------------+-------------------
  aaaaaaaa-aaaa-4800-8000-00000000000a | amsterdam | scooter | c28f5c28-f5c2-4000-8000-000000000026 | 2019-01-02 03:04:05 | in_use | 62609 Stephanie Route | {"color": "red"}
(1 row)

Time: 49ms total (execution 2ms / network 48ms)

Specifically, the last line of the output above.

CockroachDB Grants and Schemas Explained

In this post, we are going to walk through some common tasks related to granting non-admin users privileges on CockroachDB tables and schemas, and explain why the results you get may not be what you are expecting. By the end of this post, I hope you'll have a much clearer picture of how database, table, and schema privileges work in CockroachDB.

Let's use a simple scenario.
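
To ground the walkthrough, here is a hypothetical non-admin user, schema, and table, along with the kinds of grants we'll be discussing:

-- All names here are assumptions for illustration.
CREATE USER IF NOT EXISTS app_user;
CREATE SCHEMA IF NOT EXISTS app;
CREATE TABLE IF NOT EXISTS app.orders (id UUID PRIMARY KEY DEFAULT gen_random_uuid(), total DECIMAL);
-- The user needs USAGE on the schema before table-level grants are useful.
GRANT USAGE ON SCHEMA app TO app_user;
GRANT SELECT, INSERT ON TABLE app.orders TO app_user;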

CockroachDB TIL: Volume 3

Previous Articles

Topic 1: Force Lookup Join Instead of Merge or Hash Join

To make the optimizer prefer lookup joins over merge or hash joins when performing foreign key checks, set the prefer_lookup_joins_for_fks session variable to on (merge joins are the default for single-row inserts, and hash joins are likely to appear with more rows). There are times when a lookup join will be more efficient than the plan chosen by the optimizer. The lesson here is to test all available options and not just accept the defaults!

set prefer_lookup_joins_for_fks=on;
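
To confirm the setting actually changes the plan, compare EXPLAIN output for an insert against a table with a foreign key before and after flipping it; the table names here are assumptions:

-- The foreign key check appears as a post-query in the plan; with the setting on it should use a lookup join.
EXPLAIN INSERT INTO child (id, parent_id) VALUES (1, 1);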

How To Use CockroachDB With Your Django Application

This tutorial is intended to be a quick ramp-up on CockroachDB with Django. In case you're searching for a proper Django tutorial, this is not it. At the time of writing, the django-cockroachdb library is available in two versions (2 and 3). This tutorial will cover version 3 and is inspired by the Digital Ocean tutorial on using Django with PostgreSQL. I am going to highlight the steps where this tutorial differs from the original. For everything else, we will assume the tutorial is followed as is.

I originally wrote this post two years ago and have since updated it: CockroachDB RBAC is no longer an enterprise-only feature, so we can skip that step. I'm also including steps to launch a Docker instance to make this more portable and comprehensive.
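
Before wiring up the Django settings, the database and user the application will connect as need to exist; a minimal sketch, with hypothetical names and password:

-- Database and user that the Django settings would point at.
CREATE DATABASE IF NOT EXISTS django_db;
CREATE USER IF NOT EXISTS django WITH PASSWORD 'django_secret';
GRANT ALL ON DATABASE django_db TO django;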