How to Load Data From MongoDB to Postgres Destination

MongoDB is a distributed database built for modern transactional and analytical applications, well suited to rapidly changing, multi-structured data. PostgreSQL, on the other hand, is a SQL database offering all of the features you would expect from a relational database. If you are unsure of the differences between these systems, the MongoDB website hosts an article comparing PostgreSQL and MongoDB.

Choosing between MongoDB and PostgreSQL may not be your only option – in fact, because each database has different strengths, you may wish to use them side by side. If that is your situation, you may need to sync data between them.
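If you only need a quick one-off copy rather than a managed connector, a short script can illustrate the idea. Below is a minimal sketch, assuming the pymongo and psycopg2 client libraries and hypothetical database, collection, and table names, that lands MongoDB documents in a Postgres JSONB column:

```python
import json

import psycopg2
from pymongo import MongoClient

# Read documents from a hypothetical "shop.orders" collection.
mongo = MongoClient("mongodb://localhost:27017")
docs = mongo["shop"]["orders"].find()

pg = psycopg2.connect("dbname=warehouse user=postgres password=postgres")
with pg, pg.cursor() as cur:
    # Land each document as JSONB; downstream models can normalize later.
    cur.execute(
        "CREATE TABLE IF NOT EXISTS orders_raw (id TEXT PRIMARY KEY, doc JSONB)"
    )
    for doc in docs:
        doc_id = str(doc.pop("_id"))
        cur.execute(
            "INSERT INTO orders_raw (id, doc) VALUES (%s, %s) "
            "ON CONFLICT (id) DO UPDATE SET doc = EXCLUDED.doc",
            (doc_id, json.dumps(doc, default=str)),
        )
```

A dedicated tool becomes worthwhile as soon as you need incremental syncs, schema handling, and monitoring rather than a single copy.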

A Deep Dive Into Data Orchestration With Airbyte, Airflow, Dagster, and Prefect

This article delves into the integration of Airbyte with some of the most popular data orchestrators in the industry – Apache Airflow, Dagster, and Prefect. We'll not only guide you through the process of integrating Airbyte with these orchestrators but also provide comparative insight into how each one can uniquely enhance your data workflows.

We also provide links to working code examples for each of these integrations. These resources are designed for quick deployment, allowing you to seamlessly integrate Airbyte with your orchestrator of choice.
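As a taste of what these integrations look like, here is a minimal sketch of triggering an Airbyte sync from a Dagster job via the dagster-airbyte integration (the host, port, and connection ID below are placeholders, and exact names may vary by library version):

```python
from dagster import job
from dagster_airbyte import airbyte_resource, airbyte_sync_op

# Point Dagster at a locally running Airbyte instance (placeholder values).
airbyte_instance = airbyte_resource.configured(
    {"host": "localhost", "port": "8000"}
)

# Wrap one specific Airbyte connection in a reusable op.
sync_github = airbyte_sync_op.configured(
    {"connection_id": "your-airbyte-connection-id"}, name="sync_github"
)

@job(resource_defs={"airbyte": airbyte_instance})
def airbyte_sync_job():
    sync_github()
```

The Airflow and Prefect equivalents follow the same pattern: a thin task that triggers a sync over an existing Airbyte connection and waits for it to finish.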

Airbyte and LlamaIndex: ELT and Chat With Your Data Without Writing SQL

There are some great guides out there on how to create long-term memory for AI applications using embedding-based vector stores like ChromaDB or Pinecone. These vector stores are well-suited for storing unstructured text data. But what if you want to query data that’s already in a SQL database - or what if you have tabular data that doesn’t make sense to write into a dedicated vector store? 

For example, what if we want to ask arbitrary historical questions about how many GitHub issues have been created in the Airbyte repo, how many PRs have been merged, or who has been the most active contributor over all time? Pre-calculated embeddings cannot answer these questions, since they rely on dynamic aggregations whose answers change constantly. It would be nearly impossible – and highly inefficient – to try to answer these questions with pre-formed text documents and vector-based document retrieval.
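A text-to-SQL approach handles this well: the LLM translates the question into a SQL query that computes the aggregation at query time. Here is a rough sketch using LlamaIndex's natural-language SQL query engine against a Postgres database that Airbyte could populate (the connection string and table names are hypothetical, an LLM API key is assumed by default, and import paths vary across llama-index versions):

```python
from sqlalchemy import create_engine

from llama_index.core import SQLDatabase
from llama_index.core.query_engine import NLSQLTableQueryEngine

# Connect to the warehouse that holds the synced GitHub data.
engine = create_engine("postgresql://postgres:postgres@localhost:5432/github")
sql_database = SQLDatabase(engine, include_tables=["issues", "pull_requests"])

# The engine turns a natural-language question into SQL, executes it,
# and synthesizes an answer from the result set.
query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database, tables=["issues", "pull_requests"]
)
response = query_engine.query("How many issues were created last year?")
print(response)
```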

Using Open Source for Data Integration and Automated Synchronizations

Apache Airflow and Airbyte are complementary tools that can be used together to meet your data integration requirements. Airbyte can be used to extract data from hundreds of sources and load it to any of its supported destinations. Airflow can be used for scheduling and orchestration of tasks, including triggering Airbyte synchronizations. The combination of Airflow and Airbyte provides a flexible, scalable, and maintainable solution for managing your data integration and data processing requirements.

In this tutorial, you will install Airbyte Open Source and Apache Airflow in a local Docker Desktop environment. After installation, you will configure a simple Airbyte connection. Next, you will create an Airflow directed acyclic graph (DAG) that triggers a data synchronization over the newly created Airbyte connection and then orchestrates additional tasks that depend on the completion of the Airbyte data synchronization.
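The heart of such a DAG is only a few lines. Here is a minimal sketch using the AirbyteTriggerSyncOperator from the apache-airflow-providers-airbyte package (the Airflow connection name and Airbyte connection UUID are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG(
    dag_id="airbyte_sync_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    trigger_sync = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_default",           # Airflow connection to the Airbyte API
        connection_id="your-airbyte-connection-id",  # UUID of the Airbyte connection
        asynchronous=False,
    )

    # A downstream task that runs only after the sync completes.
    notify = BashOperator(task_id="notify", bash_command="echo 'sync complete'")

    trigger_sync >> notify
```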

AI Shouldn’t Waste Time Reinventing ETL

The recent progress in AI is very exciting. People are using it in all sorts of novel ways, from improving customer support experiences and writing and running code to making new music and even accelerating medical imaging technology.

But in the process, a worrying trend has emerged: the AI community seems to be reinventing data movement (aka ELT). Whether they call them connectors, extractors, integrations, document loaders, or something else, people are writing the same code to extract data from the same APIs, document formats, and databases, and then load it into vector DBs or indices for their LLMs.

The Benefits of Open-Source ELT

Open-source technology is becoming increasingly popular in the data integration industry, and for good reasons. Open source creates the right incentives, allowing users to own their data entirely, unlike closed source, where you build knowledge in a proprietary tool with a price tag. Open source also creates communities around common problems, allowing for the exchange of valuable knowledge and collaborative problem-solving. 

In this article, we will start by investigating the reasons behind the adoption success of open source before delving deeper into the data integration industry, focusing more specifically on open-source vs. closed-source ELT (Extract, Load, Transform) solutions. We will discuss how open-source ELT allows for greater control over the data integration process, more efficient data processing, and cost savings for organizations. Additionally, we will explore the growing trend of open-source ELT adoption in the industry and examine the future of open-source data integration.

Why Open Source Is Much More Than Just a Free Tier

Open source has been on the rise for the past few decades. From small startups to large enterprises, open source has now become a crucial part of the software development process. While open source is often thought of as simply a free alternative to proprietary software, it is actually so much more than that. 

In this article, we will explore the reasons why open source has been so successful, the areas where it has not been as successful, and the differences between open source and free tiers of software, with a deeper look at the data infrastructure industry.

The Rise of the Semantic Layer: Metrics On-The-Fly

A semantic layer is something we use every day. We build dashboards with yearly and monthly aggregations. We design dimensions for drilling down into reports by region, product, or whatever attributes we are interested in. What has changed is that we no longer use a single business intelligence tool; different teams use different visualization layers (BI tools, notebooks, and embedded analytics).

Instead of re-creating siloed metrics in each app, we want to define them once, in the open and in a version-controlled way, and sync them into each visualization tool. That’s what the semantic layer does, with definitions typically written in YAML. On top of that, the semantic layer adds powerful capabilities such as APIs, caching, access control, data modeling, and a metrics layer.
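Once metrics are defined centrally, any tool can request them over an API instead of re-implementing the SQL. As an illustration, here is a hypothetical sketch of querying a Cube-style semantic layer REST API from Python (the endpoint, token, and measure names are placeholders):

```python
import json

import requests

# A metrics query expressed against the semantic layer, not raw tables.
query = {
    "measures": ["Orders.count"],
    "dimensions": ["Orders.status"],
    "timeDimensions": [
        {"dimension": "Orders.createdAt", "granularity": "month"}
    ],
}

resp = requests.get(
    "http://localhost:4000/cubejs-api/v1/load",
    params={"query": json.dumps(query)},
    headers={"Authorization": "YOUR_API_TOKEN"},
)
resp.raise_for_status()
for row in resp.json()["data"]:
    print(row)
```

Every BI tool, notebook, or embedded app issuing this query gets the same definition of Orders.count, which is exactly the siloing problem the semantic layer removes.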

How to Migrate Your Data From Redshift to Snowflake

For decades, data warehousing solutions have been the backbone of enterprise reporting and business intelligence. But, in recent years, cloud-based data warehouses like Amazon Redshift and Snowflake have become extremely popular. So, why would someone want to migrate from one cloud-based data warehouse to another?

The answer is simple: more scale and flexibility. With Snowflake, users can quickly and independently scale storage and compute resources, with nodes added automatically as needed. Using the VARIANT data type, Snowflake also supports storing richer data such as objects, arrays, and JSON. And, as Redshift users know, debugging Redshift is not always straightforward. Sometimes the desire to migrate goes beyond feature differences: maybe your team simply knows Snowflake better than Redshift, or perhaps your organization wants to standardize on one particular technology.
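To illustrate the VARIANT point, here is a minimal sketch of querying semi-structured JSON with the snowflake-connector-python library (the account details, table, and field names are placeholders):

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    warehouse="COMPUTE_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

with conn.cursor() as cur:
    # VARIANT columns can be traversed with Snowflake's colon path syntax,
    # so JSON payloads are queryable without flattening them first.
    cur.execute(
        """
        SELECT payload:customer.name::STRING AS customer,
               payload:items[0].sku::STRING  AS first_sku
        FROM raw_events
        LIMIT 10
        """
    )
    for customer, first_sku in cur.fetchall():
        print(customer, first_sku)
```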

How Open Source Can Help You Scrape LinkedIn Into a Postgres Database

“Data” is changing the face of our world. It might be part of a study helping to cure a disease, boost a company’s revenue, make a building more efficient, or drive the ads you keep seeing. To take advantage of data, the first step is to gather it, and that’s where web scraping comes in.

This recipe teaches you how to easily build an automatic data scraping pipeline using open source technologies. In particular, you will be able to scrape user profiles on LinkedIn and move these profiles into a relational database such as PostgreSQL. You can then use this data to drive geo-specific marketing campaigns or raise awareness for a new product feature based on job titles.
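As a preview of the loading step, here is a minimal sketch, assuming scraped profiles arrive as Python dictionaries, of writing them into a PostgreSQL table with psycopg2 (the table and field names are hypothetical):

```python
import psycopg2

# Example records shaped the way a scraper might emit them.
profiles = [
    {"name": "Jane Doe", "title": "Data Engineer", "location": "Berlin"},
    {"name": "John Smith", "title": "Product Manager", "location": "Paris"},
]

conn = psycopg2.connect("dbname=scraping user=postgres password=postgres")
with conn, conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS linkedin_profiles "
        "(name TEXT, title TEXT, location TEXT)"
    )
    cur.executemany(
        "INSERT INTO linkedin_profiles (name, title, location) "
        "VALUES (%(name)s, %(title)s, %(location)s)",
        profiles,
    )
```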

How To Build a GitHub Activity Dashboard With Open-Source Tools

In this article, we will be leveraging Airbyte – an open-source data integration platform – and Metabase – an open-source way for everyone in your company to ask questions and learn from data – to build a GitHub activity dashboard.

Airbyte provides us with a rich set of source connectors, one of which is the GitHub connector, which allows us to pull data from a GitHub repo. We are going to use this connector to get the data of the Airbyte repo and copy it into a Postgres database destination. We will then connect this database to Metabase to create the activity dashboard. To do so, we will need:

Why ETL Needs Open Source to Address the Long Tail of Integrations

Over the last year, our team has interviewed more than 200 companies about their data integration use cases. What we discovered is that data integration in 2021 is still a mess.

The Unscalable Current Situation

At least 80 of the 200 interviews were with users of existing ETL technology, such as Fivetran, StitchData, and Matillion. We found that every one of them was also building and maintaining their own connectors, even though they were using an ETL solution (or an ELT one – for simplicity, I will just use the term ETL). Why?

How “User Success” Helps Us Become the Most Active Slack Community

Today, we’re celebrating three important milestones for Airbyte. Within just 7 months of releasing our very first product (an MVP with only 6 connectors), we became the most active Slack community of data professionals focused on data integration. This is our first milestone.

As you might already know, we are a transparent company. Every month or so, we publish information on our project and company that would be confidential in other companies, such as:

How to Save and Search Your (Free-Tier) Slack History

The Slack free tier saves only the last 10K messages. For social Slack instances, it may be impractical to upgrade to a paid plan to retain these messages. Similarly, for an open-source project like Airbyte where we interact with our community through a public Slack instance, the cost of paying for a seat for every Slack member is prohibitive.

However, searching through old messages can be really helpful. Losing that history feels like some advanced form of memory loss. What was that joke about Java 8 Streams? That contributor question sounds familiar – haven’t we seen it before? But you just can’t remember!
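One practical escape hatch is to export messages through the Slack Web API before they age out of the free-tier window. Here is a rough sketch using the slack_sdk library (the bot token and channel ID are placeholders) that pages through a channel’s history:

```python
from slack_sdk import WebClient

client = WebClient(token="xoxb-your-bot-token")

def fetch_channel_history(channel_id: str):
    """Yield every visible message in a channel, following pagination cursors."""
    cursor = None
    while True:
        resp = client.conversations_history(
            channel=channel_id, cursor=cursor, limit=200
        )
        yield from resp["messages"]
        cursor = resp.get("response_metadata", {}).get("next_cursor")
        if not cursor:
            break

# Dump messages so they can be stored and indexed elsewhere.
for message in fetch_channel_history("C0123456789"):
    print(message.get("user"), message.get("text"))
```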

Why You Should NOT Build Your Data Pipeline on Top of Singer

Singer.io is an open-source CLI tool that makes it easy to pipe data from one tool to another. At Airbyte, we spent time determining if we could leverage Singer to programmatically send data from any of their supported data sources (taps) to any of their supported data destinations (targets).

For the sake of this article, let’s say we are trying to build a tool that can do the following:

Software Quality: The Top 10 Metrics to Build Confidence

How do you measure quality in software engineering? This is a question that will probably always be debated. There are so many approaches that finding a single answer is just impossible. In this article, we will list the quality-related metrics that top engineering teams keep track of, and see when and how you should use them.

However, when you think about it, one can wonder whether quality is a goal in itself. What seems to matter more is the confidence that you can grow and change behavior without disruption. In that case, quality metrics are surely important, but their evolution over time is at least as important.

How to Use Data to Improve Your Sprint Retrospectives

Most agile teams run sprint retrospectives at least once a month to iterate on and improve their software development process and workflow. However, many of those same teams rely only on their feelings to “know” whether they have actually improved. You need an unbiased reference system if you want to compare how two sprints went.

Depending on what you’re focusing on, different metrics will be of interest. For instance, if your team uses estimation, tracking how those estimates pan out can be worthwhile, and comparing the variance across sprints could provide such a metric.
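As a toy illustration of such an unbiased measure, assuming each ticket records an estimate and the actual time spent, you could compute the variance of the estimation error per sprint and watch how it evolves:

```python
from statistics import pvariance

# Hypothetical (estimate, actual) pairs in hours for two sprints.
sprints = {
    "sprint_41": [(3, 5), (8, 8), (2, 4), (5, 9)],
    "sprint_42": [(3, 4), (8, 7), (2, 2), (5, 6)],
}

for name, tickets in sprints.items():
    errors = [actual - estimate for estimate, actual in tickets]
    # Lower variance means more consistent, more trustworthy estimates.
    print(f"{name}: error variance = {pvariance(errors):.2f}")
```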