In this lab experiment, a cloud-native and future-proof serverless data lake architecture has been built using the Lambda Architecture data processing pattern.
- To analyze a dataset on the flow of yellow taxi trips from pickup borough to drop-off borough in New York City.
- To focus the analysis on taxi trips originating from Manhattan only.
- Amazon S3: Source from which the raw dataset about taxi trips in New York City is fetched; destination for the data transformed by the ETL (Extract, Transform, Load) process for analysis purposes.
- AWS Glue: To discover and catalog the source data fetched from Amazon S3, to apply the ETL process that formats the raw data for visualization, and to catalog the transformed data as well.
- Amazon Athena: To query the cataloged dataset both before and after the ETL process.
- Amazon QuickSight: To analyze the transformed data and visualize the flow of yellow taxi trips in New York City from pickup borough to drop-off borough, also filtering the dataset to focus on trips originating from Manhattan only.
- Amazon S3 buckets were created.
- Sample data files were copied to the Amazon S3 buckets.
- An IAM (Identity and Access Management) service role was created.
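A minimal boto3 sketch of this preparation, with a hypothetical bucket name and local file path (the workshop provisions its own resources):

```python
import json
import boto3

s3 = boto3.client("s3")
iam = boto3.client("iam")

# Hypothetical bucket; outside us-east-1 a CreateBucketConfiguration is also required.
bucket = "serverless-datalake-nyctaxi-demo"
s3.create_bucket(Bucket=bucket)

# Copy a sample data file into the bucket (local file name is illustrative).
s3.upload_file("yellow_tripdata_2020.csv", bucket,
               "nyctaxi/yellow_tripdata/yellow_tripdata_2020.csv")

# Service role that AWS Glue can assume; permission policies still need to be attached.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(RoleName="ServerlessAnalyticsRole",
                AssumeRolePolicyDocument=json.dumps(trust_policy))
```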
- Instructor-led (I followed this workflow using Workshop Studio with a free Amazon user account valid for 72 hours.)
- Self-paced
Lab 1: Discovering and cataloging the data
Lab 2: Exploring the data
Lab 3: Transforming the data
Lab 4: Enriching the data
Lab 5: Visualizing the data
A data lake draws on an array of data sources and formats, so it is important to discover and catalog the acquired data in order to understand it and to enable integration with other purpose-built AWS services.
In this lab, an AWS Glue crawler is created to automatically discover the schema of the data stored in Amazon S3.
The discovered information is registered in the AWS Glue Data Catalog. This enables AWS Glue to use the stored catalog information for ETL processing, and it also allows Amazon Athena to run queries on the data stored in Amazon S3.
a. Setting crawler properties, i.e. the crawler name
b. Choosing data sources and classifiers. Here, S3 has been selected as the data source and its path has been added for crawling purposes.
c. Configuring security settings. Here, ServerlessAnalyticsRole has been selected as the existing IAM role.
d. Setting outputs and scheduling. At first, a new database nyctaxi_db has been created as the target database. Later, when available for selection, nyctaxi_db is included in the output configuration with raw_ as the table name prefix.
e. After reviewing the configurations, the Glue crawler nyx_taxi_crawler is created.
Instructions from AWS to create crawlers can be found here.
Here, the data is cataloged by running the created crawler. When the crawler nyx_taxi_crawler has run successfully, a value of 2 appears under the column Table changes from the last run in the crawler list.
Instructions from AWS to run a crawler can be found here.
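For reference, the same crawler can be created and run with boto3. A minimal sketch, assuming an illustrative S3 path for the raw data (the bucket name is hypothetical):

```python
import time
import boto3

glue = boto3.client("glue")

# Mirror the console steps: name, IAM role, S3 data source,
# target database, and table name prefix.
glue.create_crawler(
    Name="nyx_taxi_crawler",
    Role="ServerlessAnalyticsRole",
    DatabaseName="nyctaxi_db",
    TablePrefix="raw_",
    Targets={"S3Targets": [{"Path": "s3://serverless-datalake-nyctaxi-demo/nyctaxi/"}]},
)

# Run the crawler and wait until it returns to the READY state.
glue.start_crawler(Name="nyx_taxi_crawler")
while glue.get_crawler(Name="nyx_taxi_crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)
```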
To review the metadata of the created database, the schema of any table included in the database can be checked.
For example, the raw_yellow_tripdata table from the nyctaxi_db database has the following 18 columns:
vendorid, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, ratecodeid, store_and_fwd_flag, pulocationid, dolocationid, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge.
Instructions from AWS to view table schemas in the Data Catalog can be found here.
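The same schema can be read back from the Data Catalog with boto3; a small sketch using the database and table names from this lab:

```python
import boto3

glue = boto3.client("glue")

# Fetch the table definition the crawler registered in the Data Catalog.
table = glue.get_table(DatabaseName="nyctaxi_db", Name="raw_yellow_tripdata")

# Print each column name with its discovered type.
for col in table["Table"]["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])
```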
The AWS Glue Data Catalog contains references to data that is used as the sources and targets of extract, transform, and load (ETL) jobs in AWS Glue.
In this lab, Amazon Athena has been used to explore the data and check for data quality issues. To fix the quality issues, the relevant table properties in the AWS Glue Data Catalog have been updated.
Amazon Athena automatically stores query results and metadata information for each query in a query result location specified in Amazon S3. If necessary, the files in this location can be accessed and worked with. Query result files can also be downloaded directly from the Athena console.
These instructions should be followed when launching Amazon Athena for the first time.
Data included in the tables raw_yellow_tripdata and raw_taxi_zone_lookup can be previewed by clicking the menu icon ⋮ beside the relevant table and then selecting Preview table.
Instructions from AWS to use Amazon Athena for previewing table data from a database can be found here.
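Queries can also be issued programmatically. A minimal boto3 sketch, assuming an illustrative S3 output location (Athena requires one unless a workgroup default is set):

```python
import time
import boto3

athena = boto3.client("athena")

# Start the preview query; results land in the given S3 location as CSV.
qid = athena.start_query_execution(
    QueryString="SELECT * FROM raw_taxi_zone_lookup LIMIT 10",
    QueryExecutionContext={"Database": "nyctaxi_db"},
    ResultConfiguration={"OutputLocation": "s3://serverless-datalake-nyctaxi-demo/athena-results/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([f.get("VarCharValue") for f in row["Data"]])
```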
In this lab, quotes associated with string values in different tables were removed to improve the quality of the data to be analyzed.
In the previous step, quotes ("") were found around the data values (e.g. in the columns borough, zone, etc.) in the CSV file behind the raw_taxi_zone_lookup table. These quotes shouldn't be part of the data to be analyzed later, so to read the CSV file properly, the table properties in AWS Glue were updated to use the serialization library OpenCSVSerDe. In addition, the existing Serde parameters were removed and replaced with the following key-value pairs:
- escapeChar: \
- quoteChar: "
- separatorChar: ,
When the preview query for raw_taxi_zone_lookup was run, the column values appeared without the quotes.
Instructions from AWS to update the serialization library of a table can be found here.
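The equivalent Data Catalog update can also be scripted. A minimal boto3 sketch, assuming the table and database names used in this lab:

```python
import boto3

glue = boto3.client("glue")

# Read the current table definition from the Data Catalog.
table = glue.get_table(DatabaseName="nyctaxi_db", Name="raw_taxi_zone_lookup")["Table"]

# Switch the table to OpenCSVSerDe with the quote/escape/separator settings.
table["StorageDescriptor"]["SerdeInfo"] = {
    "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
    "Parameters": {"escapeChar": "\\", "quoteChar": "\"", "separatorChar": ","},
}

# update_table accepts only TableInput fields, so strip read-only metadata.
read_only = {"DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
             "IsRegisteredWithLakeFormation", "CatalogId", "VersionId"}
table_input = {k: v for k, v in table.items() if k not in read_only}
glue.update_table(DatabaseName="nyctaxi_db", TableInput=table_input)
```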
The following SQL queries were run to explore the data:
1. Checking the number of yellow taxi trip records
SELECT COUNT(*) "Count" FROM raw_yellow_tripdata;
Resulting CSV file can be found here
2. Exploring data categories
-- observe NULL values
SELECT vendorid, COUNT(*) "Count"
FROM raw_yellow_tripdata
GROUP BY vendorid
ORDER BY 1;
Resulting CSV file can be found here
-- observe other categories
SELECT pulocationid, COUNT(*) "Count"
FROM raw_yellow_tripdata
GROUP BY pulocationid
ORDER BY 1;
Resulting CSV file can be found here
-- observe NULL values
SELECT payment_type, COUNT(*) "Count"
FROM raw_yellow_tripdata
GROUP BY payment_type
ORDER BY 1;
Resulting CSV file can be found here
3. Explore records with NULL Vendor ID
-- observe other columns with NULL values
-- passenger_count, ratecodeid, store_and_fwd_flag, payment_type
SELECT *
FROM raw_yellow_tripdata
WHERE vendorid IS NULL
LIMIT 100;
Resulting CSV file can be found here
4. Explore records by time period
-- tpep_pickup_datetime is defined as STRING
-- observe record counts that fall outside of the time period
SELECT SUBSTR(tpep_pickup_datetime, 1, 7) "Period", COUNT(*) "Total Records"
FROM raw_yellow_tripdata
GROUP BY SUBSTR(tpep_pickup_datetime, 1, 7)
ORDER BY 1;
Resulting CSV file can be found here
5. Count records that fall outside of year 2020
-- records with incorrect pickup datetime values
SELECT COUNT(*) "Count"
FROM raw_yellow_tripdata
WHERE SUBSTR(tpep_pickup_datetime, 1, 7) NOT LIKE '2020%';
Resulting CSV file can be found here
6. Count records with NULL values (based on Vendor ID) that fall within 2020
-- Records with NULL categories like Vendor ID
SELECT COUNT(*) "Count"
FROM raw_yellow_tripdata
WHERE vendorid IS NULL
AND SUBSTR(tpep_pickup_datetime, 1, 7) LIKE '2020%';
Resulting CSV file can be found here
7. Count records that fall in the last quarter of 2020, excluding records with missing Vendor ID
-- Total records in BER months, excluding records with missing Vendor ID
SELECT COUNT(*) "Count"
FROM raw_yellow_tripdata
WHERE vendorid IS NOT NULL
AND SUBSTR(tpep_pickup_datetime, 1, 7) LIKE '2020-1%';
Resulting CSV file can be found here
8. Join taxi trips data with the taxi zone lookup table
-- explore data with lookup information
-- observe column names from lookup tables
SELECT td.*, pu.*, do.*
FROM raw_yellow_tripdata td,
raw_taxi_zone_lookup pu,
raw_taxi_zone_lookup do
WHERE td.pulocationid = pu.locationid AND
td.dolocationid = do.locationid AND
vendorid IS NOT NULL AND
SUBSTR(tpep_pickup_datetime, 1, 7) LIKE '2020-1%'
LIMIT 100;
Resulting CSV file can be found here
-- Count total joined records for the last quarter of 2020.
SELECT COUNT(*) "Count"
FROM raw_yellow_tripdata td,
raw_taxi_zone_lookup pu,
raw_taxi_zone_lookup do
WHERE td.pulocationid = pu.locationid AND
td.dolocationid = do.locationid AND
vendorid IS NOT NULL AND
SUBSTR(tpep_pickup_datetime, 1, 7) LIKE '2020-1%';
Resulting CSV file can be found here
Instructions from AWS to explore the data using SQL queries can be found here.
P.S. In the AWS instructions, the table name taxi_zone_lookup in the JOIN queries needs to be replaced with raw_taxi_zone_lookup. Otherwise, the queries won't execute successfully.
In this lab, AWS Glue Studio has been used to transform the existing data from the tables of the nyctaxi_db database by enriching it with additional data from other sources, making the data from the raw_yellow_tripdata and raw_taxi_zone_lookup tables more useful and insightful during analysis.
It is recommended to plan the necessary transformation steps in detail. The following actions were executed to transform the data:
Step 1: Adding an ETL job using AWS Glue Studio
Step 2: Adding a data source
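Under the hood, Glue Studio generates a PySpark script. A stripped-down sketch of the equivalent job start, reading the source table from the Data Catalog (the generated script additionally handles job arguments and bookmarks):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Data source: the cataloged yellow trip data, read as a DynamicFrame.
trips = glue_context.create_dynamic_frame.from_catalog(
    database="nyctaxi_db",
    table_name="raw_yellow_tripdata",
)
```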
The data from the raw_yellow_tripdata table was cleaned in 2 steps:
Step 1: Removing records with NULL values
Any record containing a NULL value in the columns vendorid, payment_type, passenger_count, and ratecodeid was discarded.
Step 2: Filtering records within a time period
To narrow down the data for efficient processing, all records with invalid values in the tpep_pickup_datetime column were discarded.
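A sketch of both cleaning steps, continuing the job skeleton above and assuming tpep_pickup_datetime values are strings starting with the year, as observed in the Athena queries:

```python
from awsglue.transforms import Filter

# Step 1: drop records with a NULL in any of the key categorical columns.
trips_not_null = Filter.apply(
    frame=trips,
    f=lambda r: r["vendorid"] is not None
    and r["payment_type"] is not None
    and r["passenger_count"] is not None
    and r["ratecodeid"] is not None,
)

# Step 2: keep only records whose pickup timestamp falls within 2020.
trips_2020 = Filter.apply(
    frame=trips_not_null,
    f=lambda r: str(r["tpep_pickup_datetime"]).startswith("2020"),
)
```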
Step 1: Reading the lookup data from S3 (the raw_taxi_zone_lookup table)
Step 2: Renaming the columns of the lookup data to differentiate drop-off locations from pickup locations
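A sketch of these two steps with ApplyMapping; the pu_-prefixed target names and the column types are illustrative:

```python
from awsglue.transforms import ApplyMapping

# Read the taxi zone lookup table from the Data Catalog.
zones = glue_context.create_dynamic_frame.from_catalog(
    database="nyctaxi_db",
    table_name="raw_taxi_zone_lookup",
)

# Prefix the lookup columns so the pickup side stays distinguishable after joining.
pu_zones = ApplyMapping.apply(
    frame=zones,
    mappings=[
        ("locationid", "long", "pu_locationid", "long"),
        ("borough", "string", "pu_borough", "string"),
        ("zone", "string", "pu_zone", "string"),
        ("service_zone", "string", "pu_service_zone", "string"),
    ],
)
```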
Step 1: Adding a lookup table for the Taxi Drop-off Zone
Step 2: Modifying the column names of the Taxi Drop-off Zone lookup table
Step 3: Performing a join between the Yellow Trips data and the Taxi Drop-off Zone lookup data
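The drop-off lookup is prepared the same way with a do_ prefix and then joined; a sketch using Glue's Join transform (key and column names as above):

```python
from awsglue.transforms import ApplyMapping, Join

# Same lookup data, renamed for the drop-off side.
do_zones = ApplyMapping.apply(
    frame=zones,
    mappings=[
        ("locationid", "long", "do_locationid", "long"),
        ("borough", "string", "do_borough", "string"),
        ("zone", "string", "do_zone", "string"),
        ("service_zone", "string", "do_service_zone", "string"),
    ],
)

# Join trips to the pickup lookup, then the result to the drop-off lookup.
joined = Join.apply(frame1=trips_2020, frame2=pu_zones,
                    keys1=["pulocationid"], keys2=["pu_locationid"])
joined = Join.apply(frame1=joined, frame2=do_zones,
                    keys1=["dolocationid"], keys2=["do_locationid"])
```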
Step 1: Modifying column names and data types of the joined dataset
Step 2: Saving the transformed data into S3
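A sketch of the final save, assuming an illustrative output path and Parquet as the output format (a common choice for analytics; the lab's exact settings are not restated here). A final ApplyMapping step to adjust names and types would precede the write:

```python
# Persist the transformed dataset to S3 for analysis with Athena/QuickSight.
glue_context.write_dynamic_frame.from_options(
    frame=joined,
    connection_type="s3",
    connection_options={"path": "s3://serverless-datalake-nyctaxi-demo/nyctaxi/transformed/"},
    format="parquet",
)
```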