Data Lakes with Apache Spark + EMR Cluster 🚤

You can take a trip to a polluted lake or a clean one, and how the water looks, how it tastes, and where it comes from all matter.

Data lakes are to today's analytics what the Data Warehouse was until not too long ago. We are still using much of the same hardware for data lakes, but with new tools that make it possible to cover more ground.

In my previous project STAR vs 3NF 🥊 SCHEMA I prepared the data to be ready for use by BI applications with OLAP cubes. It’s a structure that has been validated and vetted through several implementations and success stories. When I learned about data lakes: the tools, the language, serverless (Python, and I’m learning Scala), I felt I could gain a bit of automation by looking into it.

Don’t get me wrong: like any other technology it’s flexible, and there are pros and cons to weigh, along with your budget, an analysis of your workload, and teamwork.

Data is the new oil 🛢

As I mentioned, automation, but not quite. Instead of creating tables and doing the ETL dance, let’s do the ELT 💃 dance.

Big Data frameworks like Spark focus on the what, where, and how that Hadoop couldn’t:

  • What types of files you read and write: there is more variety

  • Where the files reside: filesystems or databases

  • How everything becomes available: through DataFrames + SQL

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('s3a://.../file.csv', sep=';', inferSchema=True, header=True)
df.createOrReplaceTempView('log_data')
user_table = spark.sql("""
	SELECT user_id AS id, year(ts) AS year
	FROM log_data
""")
user_table.write.parquet('users', partitionBy='year')

Jupyter Notebook

We will perform our transformations and save the results in S3, which our BI apps can connect to. We could instead keep the results attached to the cluster, but clusters are expensive 💰. S3 is cheap and it doesn’t get shut down.

Another piece that makes this process possible is schema-on-read, and if you noticed, there are quite a few steps.


This project was completed under the Data Engineer Udacity Nanodegree link

tech: AWS EMR (Spark+HDFS), Python, Notebooks