Best Practices for Building Data Pipelines

In my previous article, ‘Data Validation to Improve Data Quality’, I discussed the importance of data quality and shared a checklist of validation rules to achieve it. Validation rules alone, however, cannot guarantee high data quality. In this article, we focus on best practices to employ while building data pipelines to ensure data quality.

1. Idempotency

A data pipeline should be built so that running it multiple times does not duplicate data, and so that when a failure occurs and the pipeline is rerun after the issue is resolved, no data is lost or improperly altered. Most pipelines are automated and run on a fixed schedule. By logging each successful run, including the parameters passed (such as the date range), the counts of records inserted, modified, and deleted, and the duration of the run, the parameters for the next run can be derived from the last successful run. For example, if a pipeline runs every hour and a failure happens at 2 pm, the next run should automatically pick up the data from 1 pm onward; the time window should not advance until the current run succeeds.
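To make this concrete, here is a minimal sketch of a watermark-based run log in Python. It assumes a hypothetical run_log table and an extract_and_load callable supplied by the pipeline author; neither name comes from any specific framework, and a production pipeline would use a durable store rather than an in-memory database.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical run-log table: one row per *successful* run, recording
# the window it covered. A failed run logs nothing, so the same window
# is retried on the next invocation instead of being skipped.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS run_log ("
    "  window_start TEXT, window_end TEXT,"
    "  rows_loaded INTEGER, finished_at TEXT)"
)

WINDOW = timedelta(hours=1)  # pipeline runs hourly, as in the example

def next_window(conn):
    """Derive the next window from the last successful run."""
    row = conn.execute("SELECT MAX(window_end) FROM run_log").fetchone()
    if row[0] is not None:
        start = datetime.fromisoformat(row[0])
    else:
        # First run ever: start from a fixed initial watermark
        # (an assumption for this sketch).
        start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    return start, start + WINDOW

def run_pipeline(conn, extract_and_load):
    start, end = next_window(conn)
    # extract_and_load must itself be idempotent, e.g. by upserting on
    # a natural key, so a retried window cannot duplicate rows.
    rows = extract_and_load(start, end)
    # Log the run only after the load succeeds; on failure, an
    # exception propagates and the watermark does not advance.
    conn.execute(
        "INSERT INTO run_log VALUES (?, ?, ?, ?)",
        (start.isoformat(), end.isoformat(), rows,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
```

Because a row is written only after a successful load, the 2 pm failure in the example leaves the watermark at 1 pm, and the next invocation retries the same window. Pairing this with an idempotent write (an upsert keyed on a natural identifier) is what keeps the retry from duplicating data.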