Parquet Data Filtering With Pandas

When it comes to filtering data from Parquet files using pandas, several strategies can be employed. Partitioning is widely recognized as the most effective way to speed up filtering, but it is only one of the options; there are additional methods for optimizing queries against data stored in Parquet files.

Filtering by Partitioned Fields

As previously mentioned, this approach is not only the most familiar but also typically the most impactful in terms of performance. The rationale is straightforward: when data is partitioned, the reader can skip entire files, and even entire directories of files, whose partition values do not match the filter (partition pruning, a directory-level form of predicate pushdown), which yields a substantial improvement in performance.
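As a minimal sketch, assuming a dataset partitioned by hypothetical year and month fields and read with the pyarrow engine (which honors the filters argument), partition-based filtering from pandas might look like this:

Python

import pandas as pd

# Assumption: the dataset was written partitioned by "year" and "month",
# producing a layout such as visits/year=2023/month=5/part-0.parquet.
# The filters below are resolved from directory names alone, so
# non-matching directories are never opened.
df = pd.read_parquet(
    "visits",  # hypothetical dataset root; an s3:// URI works too (with s3fs)
    engine="pyarrow",
    filters=[("year", "=", 2023), ("month", "=", 5)],
)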

Coalesce With Care

Here is a quick Spark SQL riddle for you: what do you think could be problematic in the following Spark code (assume the Spark session was configured in an ideal way)?

Python
 
sparkSession.sql("select * from my_website_visits where post_id=317456") \
    .write.parquet("s3://reports/visits_report")
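One likely answer, hinted at by the section title: nothing in the query controls how many output files get written, so Spark emits one Parquet file per partition of the result, and a filter this selective typically leaves many tiny files with tiny row groups. The sketch below shows one possible mitigation; coalesce is Spark's real DataFrame API, but the assumption that the filtered result fits comfortably in a single partition is mine:

Python

# A minimal sketch: coalesce the filtered result into one partition so the
# write emits a single, reasonably sized Parquet file instead of many tiny
# ones. Assumption: the result is small; for larger results, use a higher
# partition count (or repartition) to keep write parallelism.
(sparkSession.sql("select * from my_website_visits where post_id=317456")
    .coalesce(1)
    .write.parquet("s3://reports/visits_report"))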