Automating Hadoop Computations on AWS

The Hadoop framework provides a lot of useful tools for big data projects, but it is too complex to manage all by yourself. Several months ago, I was deploying a Hadoop cluster using Cloudera, and I discovered that it works well only for architectures in which compute and storage capacity are constant. Using a tool like Cloudera for a system that needs to scale is a nightmare. That is where cloud technologies come in and make our lives easier. Amazon Web Services (AWS) is the best option for this use case. AWS provides a managed Hadoop solution called Elastic MapReduce (EMR). EMR lets developers quickly start Hadoop clusters, run the necessary computations, and terminate the clusters when all the work is done. To automate this process even further, AWS provides an SDK for EMR. Using it, you can launch a Hadoop task with a single command. I'll show how this is done in the example below.
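Before getting into the concrete setup, here is a minimal sketch of what that single-command launch can look like using the AWS SDK for Python (boto3). The instance types, release label, S3 paths, and IAM role names here are placeholder assumptions for illustration, not the exact values used later in this article.

```python
import boto3

# A minimal sketch: launch a transient EMR cluster that runs one Spark step
# and terminates itself when the step finishes.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="reviews-avg-comment-length",
    ReleaseLabel="emr-5.29.0",            # assumed EMR release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Shut the cluster down automatically once there are no more steps.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "average-comment-length",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/jobs/avg_comment_length.py"],  # placeholder path
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("Started cluster:", response["JobFlowId"])
```

Running this script is effectively the "single command": it provisions the cluster, submits the Spark step, and lets EMR tear everything down when the job is done.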

I am going to execute a Spark job on a Hadoop cluster in EMR. My goal is to compute the average comment length for each star rating (1-5) over a large dataset of customer reviews on amazon.com. Usually, Hadoop computations require all the data to be stored in HDFS, but EMR integrates with S3, so we don't need to launch dedicated storage instances and copy large amounts of data for the sake of a two-minute computation. This compatibility with S3 is a big advantage of using EMR. Many datasets are distributed via S3, including the one I'm using in this example (you can find it here).
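To make the job itself concrete, here is a minimal PySpark sketch of that computation. The bucket paths and the star_rating / review_body column names are assumptions about the dataset layout for illustration, not taken verbatim from the article.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("avg-comment-length").getOrCreate()

# Read the reviews straight from S3 -- no need to copy anything into HDFS first.
reviews = (spark.read
           .option("header", "true")
           .option("sep", "\t")
           .csv("s3://my-reviews-bucket/reviews/*.tsv"))   # placeholder path

# Average comment length for each star rating (1-5).
result = (reviews
          .withColumn("comment_length", F.length(F.col("review_body")))
          .groupBy("star_rating")
          .agg(F.avg("comment_length").alias("avg_comment_length"))
          .orderBy("star_rating"))

# Write the small result set back to S3.
result.write.mode("overwrite").csv("s3://my-reviews-bucket/output/", header=True)
```

This is the script the EMR step would submit with spark-submit; since both the input and output live in S3, the cluster itself can stay completely disposable.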
