Spark-Radiant: Apache Spark Performance and Cost Optimizer

Spark-Radiant is an Apache Spark performance and cost optimizer. It helps optimize performance and cost through custom Catalyst optimizer rules, enhanced auto-scaling in Spark, collection of important metrics for a Spark job, a Bloom filter index in Spark, and more.

Spark-Radiant is now available and ready to use. The dependency for Spark-Radiant 1.0.4 is available in Maven Central. In this blog, I will discuss the availability of Spark-Radiant 1.0.4 and its features to boost performance, reduce cost, and increase the observability of a Spark application. Please refer to the release notes for Spark-Radiant 1.0.4.
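
To try it out, pull the artifacts from Maven Central and register the library's rules through Spark's SQL extensions hook. Here is a minimal sketch; the Maven coordinates and the extension class name follow the project's README for 1.0.4, so treat them as assumptions to verify against your version:

Python

from pyspark.sql import SparkSession

# Fetch the Spark-Radiant artifacts from Maven Central and plug its
# Catalyst rules into the session via spark.sql.extensions.
# Coordinates and class name are taken from the project README (1.0.4).
spark = (SparkSession.builder
    .config("spark.jars.packages",
            "io.github.saurabhchawla100:spark-radiant-sql:1.0.4,"
            "io.github.saurabhchawla100:spark-radiant-core:1.0.4")
    .config("spark.sql.extensions",
            "com.spark.radiant.sql.api.SparkRadiantSqlExtension")
    .getOrCreate())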

Coalesce With Care

Here is a quick Spark SQL riddle for you: what do you think could be problematic in the following Spark code (assume that the Spark session was configured in an ideal way)?

Python

# Filter one post's visits and write the report out as Parquet.
(sparkSession.sql("select * from my_website_visits where post_id=317456")
    .write.parquet("s3://reports/visits_report"))

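The section title hints at the answer: the post_id filter probably keeps only a tiny slice of the data, yet the write still emits one Parquet file per task, so the report directory ends up as a pile of small (often empty) files. A common remedy, sketched below, is to shrink the number of partitions right before the write:

Python

# Same query, but collapse the small result to a single partition so the
# write produces one output file instead of one file per task.
(sparkSession.sql("select * from my_website_visits where post_id=317456")
    .coalesce(1)
    .write.parquet("s3://reports/visits_report"))

The "with care" part: coalesce avoids a shuffle, so coalesce(1) can also collapse the upstream stage into a single task and serialize the whole scan; repartition(1) pays for a shuffle but keeps the read parallel.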

Dynamic Partition Pruning in Spark 3.0

With the release of Spark 3.0, big improvements were implemented to make Spark execute faster, and many new features came along with it. Dynamic partition pruning is one of them. Before diving into what is new in dynamic partition pruning, let us first understand what partition pruning is.

Partition Pruning in Spark

In standard database terms, pruning means that the optimizer will avoid reading files that cannot contain the data you are looking for. For example, if a table is partitioned by day and a query filters on a single day, the optimizer can skip every file belonging to the other partitions, as in the sketch below.
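
Here is a minimal sketch of that behavior; the sales_df DataFrame, table layout, and S3 paths are illustrative:

Python

# Write sales data partitioned by the "day" column; Spark lays the files
# out in day=.../ subdirectories.
sales_df.write.partitionBy("day").parquet("s3://warehouse/sales")

# Because the filter is on the partition column, Spark prunes at the
# directory level and reads only day=2021-06-01 instead of every file.
pruned = (spark.read.parquet("s3://warehouse/sales")
    .filter("day = '2021-06-01'"))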

Spark Structured Streaming Using Java

Spark provides a streaming library to process continuously flowing data from real-time systems.

Concept 

Spark Streaming was originally implemented with the DStream API, which runs on Spark RDDs: data from the streaming source is divided into chunks (micro-batches), processed, and then sent to the destination.
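
This article works through the Java API, but the flow is quickest to see in a compact PySpark sketch of the classic socket word count; the Java StreamingContext API follows the same shape, and the host, port, and batch interval here are illustrative:

Python

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 5)  # micro-batches every 5 seconds

# Source: each micro-batch is a chunk of lines read from the socket.
lines = ssc.socketTextStream("localhost", 9999)

# Process: a per-batch word count over the underlying RDDs.
counts = (lines.flatMap(lambda line: line.split(" "))
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

# Destination: printed here; in practice this would be a real sink.
counts.pprint()

ssc.start()
ssc.awaitTermination()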