HarperDB: More Than a Database

Introduction

I recently had a very interesting conversation on our podcast with Ron Lewis, the Director of Innovation and Engineering at Lumen Technologies. Ron brought up the notion that HarperDB is more than just a database, and for certain users or projects, HarperDB is not serving as a database at all. How can this be possible?

Database, Explained

Well, what really is a database? Wikipedia states “In computing, a database is an organized collection of data stored and accessed electronically from a computer system.” Another site simply states that “A database is a systematic collection of data. They support electronic storage and manipulation of data. Databases make data management easy.”

HarperDB vs MongoDB vs PostgreSQL

Many people learn or understand new things relative to things they already know. This makes sense; it’s probably a natural instinct. When it comes to products and technology, a lot of people ask “how are you different,” but different from what? You need some sort of baseline to start from, so you can say, “Similar to X, but different because of Y.” Because of this, comparisons, competitive analysis, and feature matrices are a great way to understand which technology solutions are right for you. So today let’s do a comparison of three different database systems.

As stated in my Database Architectures and Use Cases article: In most cases, it’s not that one database is better than the other, it’s that one is a better fit for a specific use case due to numerous factors. The point of this article is not to determine which database is the best, but to help uncover the factors to consider when selecting a database for your specific project. With MongoDB and PostgreSQL being two of the most popular tools out there, you may already know that there are tons of resources comparing the two. However, with HarperDB being a net new database, I thought it might be helpful to throw it in the mix to provide further clarity.

Building a Database Written in Node.js From the Ground Up

The founding team at HarperDB built the first and only database written in Node.js. A few months back, our CEO Stephen Goldberg was invited to speak at a Women Who Code meetup to share the story of this (what some called crazy) endeavor. Stephen discussed the architectural layers of the database, demonstrated how to build a highly scalable and distributed product in Node.js, and demoed the inner workings of HarperDB. You can watch his talk at the link above, and even read a post from back in 2017, but since we all love Node.js and it’s an interesting topic, I’ll summarize here.

The main (and simplest) reason we chose to build a database in Node is that we knew it really well. We got flak for not choosing Go, but people now accept that Go and Node are essentially head to head in popularity and community support. Zach, one of our co-founders, recognized that, given the time it would have taken to learn a new language, the switch would never have been worth it.
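
To make that idea a little more concrete, here is a tiny, hypothetical sketch of the kind of building block Node makes easy to write: an HTTP-fronted, in-memory key-value store. This is purely illustrative and is not how HarperDB is implemented; the routes, port, and storage choice are assumptions made up for the example.

    // Purely illustrative: a toy HTTP key-value store in Node.js.
    // NOT HarperDB's implementation; routes, port, and storage are invented for this example.
    const http = require("http");

    const store = new Map(); // in-memory "table": key -> JSON document

    const server = http.createServer((req, res) => {
      const key = decodeURIComponent(req.url.slice(1)); // e.g. PUT /dog:1
      if (req.method === "GET") {
        const value = store.get(key);
        res.writeHead(value ? 200 : 404, { "Content-Type": "application/json" });
        res.end(JSON.stringify(value || { error: "not found" }));
      } else if (req.method === "PUT") {
        let body = "";
        req.on("data", (chunk) => (body += chunk));
        req.on("end", () => {
          try {
            store.set(key, JSON.parse(body)); // save the document under the key
            res.writeHead(204);
            res.end();
          } catch (err) {
            res.writeHead(400);
            res.end("invalid JSON");
          }
        });
      } else {
        res.writeHead(405);
        res.end();
      }
    });

    server.listen(3000, () => console.log("toy key-value store listening on :3000"));

With the server running, a PUT to localhost:3000/dog:1 with a JSON body followed by a GET on the same path round-trips a document. A real database obviously layers durable storage, indexing, SQL and NoSQL query support, and clustering on top of a core like this, but the event-driven HTTP layer is where Node shines.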

Database Architectures and Use Cases – Explained

With over 300 databases on the market, how do you determine which is right for your specific use case or skill set?

We continue to see the common debate of SQL vs. NoSQL and other database comparisons all over social media and platforms like dev.to. In most cases, it’s not that one database is better than the other, it’s that one is a better fit for a specific use case due to numerous factors.

How Databases Have Changed

To learn about the current and future state of databases, we spoke with and received insights from 19 IT professionals. We asked, "How have databases changed in the past year or two?" Here’s what they shared with us:

Cloud

  • The biggest trend is a massive transition to fully managed database services in the cloud. This shift gives developers the ability to work with data to support both real-time transactional apps and deep analytics, by using a single platform that minimizes data movement and allows them to extract value faster.
  • 1) It used to be all about cost reduction; today the bigger motivation is becoming more real-time as a company: making decisions in real time to detect fraud, assess risk, and optimize inventory, and building next-gen apps that provide more contextual UX or improve business processes.

    2) Also, the ability to do advanced analytics like AI/ML.

    3) It’s not just for internal purposes; companies want to build data-intensive applications and make them available to users for personalization, offers, real-time risk engines, fraud detection, recommendations, and predictive maintenance. The goal is to go beyond managing the business to improving UX/CX to drive revenue and reduce cost, while continuing to be performant and to scale. Different types of data require different types of data models.

    4) The cloud delivery model makes it possible to deploy data platforms and databases across hybrid multi-cloud and on-prem environments, and to port workloads to different deployment environments.

Choice

  • The most substantial change we’ve seen in the past couple of years is the explosion of choices available through mainstream and specialty cloud vendors. Companies like Snowflake are capturing a lot of enterprises looking for help managing their data warehouses, while major vendors like Azure and Google Cloud are capitalizing on popular products like MySQL and Postgres by offering them as managed services.

DBaaS

  • Not a lot in the databases themselves; most of the activity has been on the NoSQL side. We are seeing more comprehensive and better support from the cloud vendors, e.g., AWS support for managed SQL Server. DBaaS has growth potential and could mean that customers who won’t spend on a top-notch DBA still have access to a well-run database.
  • 1) Acceleration of “as a service” as a delivery modality for testing and production. 2) New kinds of databases and an evolving set of tools provided by MongoDB, Redis, Neo, and partners. 3) What’s happening as a category with graph and time series. 4) The move to the cloud and containerization brings fluidity across different platforms and the need to play well with different technological evolutions, such as Spark and HDFS running on their own rather than as part of Hadoop.

Fit

  • Some of the hype has died down and people are more pragmatic about using the right tools to solve their problems. Customers are excited about particular solutions to the problems they are trying to solve rather than focusing on the most recent solution to be rolled out. People are savvier and more pragmatic about what tools are good for.
  • There’s been a real shift towards matching the right database tool to the right database job, and the number of databases that teams use is dramatically increasing. Relatedly, databases are also becoming more and more niche (e.g., time-series databases, CockroachDB, etc.).

Other 

  • The emergence of databases and technology to deal with unstructured data. Traditionally, the database world managed structured data with relational databases. Another change is databases opening themselves up to tools like Python and R for data science and machine learning. Combining data science tools with databases has been a big theme we have seen.
  • In the past few years, we see major adoption in geospatial data management. Almost every database vendor (IBM, Oracle, and MS) has support for spatial data. NoSQL (or Document) databases are seeing an increase in adoption too, to handle lots of those pictures/photos that we (mobile users) share online!
  • As storage speed and capacity of SSD drives have increased, it’s opened a lot of doors to concentrate on the data and what to do with it. Because data is growing at such a rapid pace, databases are seen as more than tools — they are strategic elements in managing change and growth.
  • How to handle hybrid data. Modern use cases like customer journey and hyper-personalization have become important. For all of these use cases, you need behavioral, social, and transactional data. How you integrate this data to solve specific business problems and come up with good recommendation engines is the key. There’s a maniacal focus on solving the hybrid data problem: combining different data from different locations, separating compute from storage, and maintaining performance while working at elastic scale. Ease of use is going to be a huge focus in the future, but it’s not there yet.
  • The realization that there’s a new set of database requirements for the SQL relational model, with the release of Google Cloud Spanner and CockroachDB.
  • 1) A lot of NoSQL databases started offering SQL because that’s what people want; you don’t have to learn the nuances of their query languages. 2) Beyond niche products offering features provided by the traditional players, there are specialty products; for example, AWS came out with a blockchain database.
  • There is so much data generated that being able to consume and query it has become the main asset over the past year or so, as results need to get closer and closer to real time. Edge storage has really become a thing, and there have been some impacts in the open source community.
  • Earlier, databases were used for more transactional workloads. MongoDB used transactions so that a mobile application can update a bunch of records atomically or not at all (a minimal sketch of what that looks like appears after this list). Neo4j used transactions so that you can accurately update a set of graph edges. In the last two years, we are seeing more people use a database solely for fast analytics. Elasticsearch is trying to move over from log analytics to search analytics. We're pushing ahead with a strong focus on event analytics.
  • Databases are being made easier to work with today. When developers don’t have to worry hugely about schemas, scaling, and performance, they can focus on what they do best: writing great code! Nowadays, the leading-edge databases and data grids are also self-managing and self-healing, and can scale elastically based on the demands of the business.
  • The ability to handle streaming data and the democratization of the location of data. Now people have sensor and mobile data and can look at streaming data over time. There’s a move toward training ML algorithms to run against the data and write to tables immediately.
  • 1) More connectivity to external sources, such as tables that sit on top of an object store like S3. 2) Expanded capabilities to enable customers to process data. 3) Processing data within the database. 4) On the DataDevOps side, self-tuning automation with fewer DBAs needed, thanks to self-patching and upgrades. More databases are being used, rather than companies just selecting one.
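
As a side note on the transaction point above: with the official MongoDB Node.js driver, a multi-document transaction (which requires a replica set or sharded cluster) looks roughly like the sketch below. The database, collection, and field names are made up for illustration; this is a hedged example, not something taken from the contributors quoted here.

    // Rough sketch of a multi-document transaction with the official MongoDB Node.js driver.
    // Requires a replica set or sharded cluster; db/collection/field names are invented for the example.
    const { MongoClient } = require("mongodb");

    async function transferAtomically() {
      const client = new MongoClient("mongodb://localhost:27017");
      await client.connect();
      const session = client.startSession();
      try {
        await session.withTransaction(async () => {
          const accounts = client.db("app").collection("accounts");
          // Both updates commit together, or neither does.
          await accounts.updateOne({ _id: "alice" }, { $inc: { balance: -50 } }, { session });
          await accounts.updateOne({ _id: "bob" }, { $inc: { balance: 50 } }, { session });
        });
      } finally {
        await session.endSession();
        await client.close();
      }
    }

    transferAtomically().catch(console.error);

The key detail is that every operation inside the callback passes the same session, so the driver can retry or abort the whole unit of work as one atomic change.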

Here are the contributors of insight, knowledge, and experience: