In this lab experiment, a cloud-native and future-proof serverless data lake architecture has been built using the Lambda Architecture data processing pattern.
- To analyze a dataset on the flow of yellow taxi trips from pickup borough to drop-off borough in New York City.
- To focus the analysis on taxi trips originating from Manhattan only.
- Amazon S3: Source from which the raw dataset about taxi trips in New York City is fetched; destination for the data transformed by the ETL (Extract, Transform, Load) process for analysis purposes.
- AWS Glue: To discover and catalog the source data fetched from Amazon S3, to apply the ETL process that formats the raw data for visualization, and to catalog the transformed data as well.
- Amazon Athena: To query the cataloged dataset both before and after the ETL process.
- Amazon QuickSight: To analyze the transformed data and visualize the flow of yellow taxi trips in New York City from pickup borough to drop-off borough, also filtering the dataset to focus on trips originating from Manhattan only.
- Amazon S3 buckets were created.
- Sample data files were copied to the Amazon S3 buckets.
- An IAM (Identity and Access Management) service role was created.
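A minimal boto3 sketch of this preparation, with a hypothetical bucket name and local file path (the workshop provisions its own resources):

```python
import json
import boto3

s3 = boto3.client("s3")
iam = boto3.client("iam")

# Hypothetical bucket; outside us-east-1 a CreateBucketConfiguration is also required.
bucket = "serverless-datalake-nyctaxi-demo"
s3.create_bucket(Bucket=bucket)

# Copy a sample data file into the bucket (local file name is illustrative).
s3.upload_file("yellow_tripdata_2020.csv", bucket,
               "nyctaxi/yellow_tripdata/yellow_tripdata_2020.csv")

# Service role that AWS Glue can assume; permission policies still need to be attached.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(RoleName="ServerlessAnalyticsRole",
                AssumeRolePolicyDocument=json.dumps(trust_policy))
```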
- Instructor-led (I followed this workflow using Workshop Studio with a free Amazon user account valid for 72 hours.)
- Self-paced
Lab 1: Discovering and cataloging the data
Lab 2: Exploring the data
Lab 3: Transforming the data
Lab 4: Enriching the data
Lab 5: Visualizing the data
A data lake draws on an array of data sources and formats, so it is important to discover and catalog the acquired data in order to understand it and to enable integration with other purpose-built AWS services.
In this lab, an AWS Glue crawler is created to automatically discover the schema of the data stored in Amazon S3.
The discovered information is registered in the AWS Glue Data Catalog. This enables AWS Glue to use the stored catalog information for ETL processing, and it also allows Amazon Athena to run queries on the data stored in Amazon S3.
a. Setting crawler properties, i.e. the crawler name
b. Choosing data sources and classifiers. Here, S3 has been selected as the data source and its path has been added for crawling purposes.
c. Configuring security settings. Here, ServerlessAnalyticsRole has been selected as the existing IAM role.
d. Setting outputs and scheduling. At first, a new database nyctaxi_db has been created as the target database. Later, when available for selection, nyctaxi_db is included in the output configuration with raw_ as the table name prefix.
e. After reviewing the configurations, the Glue crawler nyx_taxi_crawler is created.
Instructions from AWS to create crawlers can be found here.
Here, the data is cataloged by running the created crawler. When the crawler nyx_taxi_crawler has run successfully, a value of 2 appears under the column Table changes from the last run in the crawler list.
Instructions from AWS to run a crawler can be found here.
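For reference, the same crawler can be created and run with boto3. A minimal sketch, assuming an illustrative S3 path for the raw data (the bucket name is hypothetical):

```python
import time
import boto3

glue = boto3.client("glue")

# Mirror the console steps: name, IAM role, S3 data source,
# target database, and table name prefix.
glue.create_crawler(
    Name="nyx_taxi_crawler",
    Role="ServerlessAnalyticsRole",
    DatabaseName="nyctaxi_db",
    TablePrefix="raw_",
    Targets={"S3Targets": [{"Path": "s3://serverless-datalake-nyctaxi-demo/nyctaxi/"}]},
)

# Run the crawler and wait until it returns to the READY state.
glue.start_crawler(Name="nyx_taxi_crawler")
while glue.get_crawler(Name="nyx_taxi_crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)
```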
To review the metadata of the created database, the schema of any table included in the database can be checked.
For example, the raw_yellow_tripdata table from the nyctaxi_db database has the following 18 columns:
vendorid, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, ratecodeid, store_and_fwd_flag, pulocationid, dolocationid, payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, improvement_surcharge, total_amount, congestion_surcharge.
Instructions from AWS to view table schemas in the Data Catalog can be found here.
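The same schema can be read back from the Data Catalog with boto3; a small sketch using the database and table names from this lab:

```python
import boto3

glue = boto3.client("glue")

# Fetch the table definition the crawler registered in the Data Catalog.
table = glue.get_table(DatabaseName="nyctaxi_db", Name="raw_yellow_tripdata")

# Print each column name with its discovered type.
for col in table["Table"]["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])
```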
The AWS Glue Data Catalog contains references to data that is used as the sources and targets of extract, transform, and load (ETL) jobs in AWS Glue.
In this lab, Amazon Athena has been used to explore the data and check for data quality issues. To fix the quality issues, the relevant table properties in the AWS Glue Data Catalog have been updated.
Amazon Athena automatically stores query results and metadata information for each query in a query result location specified in Amazon S3. If necessary, the files in this location can be accessed and worked with. Query result files can also be downloaded directly from the Athena console.
These instructions should be followed when launching Amazon Athena for the first time.
Data included in the tables raw_yellow_tripdata and raw_taxi_zone_lookup can be previewed by clicking the menu icon ⋮ beside the relevant table and then selecting Preview table.
Instructions from AWS to use Amazon Athena for previewing table data from a database can be found here.
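Queries can also be issued programmatically. A minimal boto3 sketch, assuming an illustrative S3 output location (Athena requires one unless a workgroup default is set):

```python
import time
import boto3

athena = boto3.client("athena")

# Start the preview query; results land in the given S3 location as CSV.
qid = athena.start_query_execution(
    QueryString="SELECT * FROM raw_taxi_zone_lookup LIMIT 10",
    QueryExecutionContext={"Database": "nyctaxi_db"},
    ResultConfiguration={"OutputLocation": "s3://serverless-datalake-nyctaxi-demo/athena-results/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([f.get("VarCharValue") for f in row["Data"]])
```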
In this lab, quotes associated with string values in different tables were removed to improve the quality of the data to be analyzed.
In the previous step, quotes ("") were found around the data values (e.g. in the columns borough, zone, etc.) in the CSV file behind the raw_taxi_zone_lookup table. These quotes shouldn't be part of the data to be analyzed later, so to read the CSV file properly, the table properties in AWS Glue were updated to use the serialization library OpenCSVSerDe. In addition, the existing Serde parameters were removed and replaced with the following key-value pairs:
- escapeChar: \
- quoteChar: "
- separatorChar: ,
When the preview query for raw_taxi_zone_lookup was run, the column values appeared without the quotes.
Instructions from AWS to update the serialization library of a table can be found here.
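The equivalent Data Catalog update can also be scripted. A minimal boto3 sketch, assuming the table and database names used in this lab:

```python
import boto3

glue = boto3.client("glue")

# Read the current table definition from the Data Catalog.
table = glue.get_table(DatabaseName="nyctaxi_db", Name="raw_taxi_zone_lookup")["Table"]

# Switch the table to OpenCSVSerDe with the quote/escape/separator settings.
table["StorageDescriptor"]["SerdeInfo"] = {
    "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
    "Parameters": {"escapeChar": "\\", "quoteChar": "\"", "separatorChar": ","},
}

# update_table accepts only TableInput fields, so strip read-only metadata.
read_only = {"DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
             "IsRegisteredWithLakeFormation", "CatalogId", "VersionId"}
table_input = {k: v for k, v in table.items() if k not in read_only}
glue.update_table(DatabaseName="nyctaxi_db", TableInput=table_input)
```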
The following SQL queries were run to explore the data:
1. Checking the number of yellow taxi trip records
SELECT COUNT(*) "Count" FROM raw_yellow_tripdata;
Resulting CSV file can be found here
2. Exploring data categories
-- observe NULL values
SELECT vendorid, COUNT(*) "Count"
FROM raw_yellow_tripdata
GROUP BY vendorid
ORDER BY 1;
Resulting CSV file can be found here
-- observe other categories
SELECT pulocationid, COUNT(*) "Count"
FROM raw_yellow_tripdata
GROUP BY pulocationid
ORDER BY 1;
Resulting CSV file can be found here
-- observe NULL values
SELECT payment_type, COUNT(*) "Count"
FROM raw_yellow_tripdata
GROUP BY payment_type
ORDER BY 1;
Resulting CSV file can be found here
3. Explore records with NULL Vendor ID
-- observe other columns with NULL values
-- passenger_count, ratecodeid, store_and_fwd_flag, payment_type
SELECT *
FROM raw_yellow_tripdata
WHERE vendorid IS NULL
LIMIT 100;
Resulting CSV file can be found here
4. Explore records by time period
-- tpep_pickup_datetime is defined as STRING
-- observe record counts that fall outside of the time period
SELECT SUBSTR(tpep_pickup_datetime, 1, 7) "Period", COUNT(*) "Total Records"
FROM raw_yellow_tripdata
GROUP BY SUBSTR(tpep_pickup_datetime, 1, 7)
ORDER BY 1;
Resulting CSV file can be found here
5. Count records that fall outside of year 2020
-- records with incorrect pickup datetime values
SELECT COUNT(*) "Count"
FROM raw_yellow_tripdata
WHERE SUBSTR(tpep_pickup_datetime, 1, 7) NOT LIKE '2020%';
Resulting CSV file can be found here
6. Count records with NULL values (based on Vendor ID) that fall within 2020
-- Records with NULL categories like Vendor ID
SELECT COUNT(*) "Count"
FROM raw_yellow_tripdata
WHERE vendorid IS NULL
AND SUBSTR(tpep_pickup_datetime, 1, 7) LIKE '2020%';
Resulting CSV file can be found here
7. Count records that fall in the last quarter of 2020, excluding records with missing Vendor ID
-- Total records in BER months, excluding records with missing Vendor ID
SELECT COUNT(*) "Count"
FROM raw_yellow_tripdata
WHERE vendorid IS NOT NULL
AND SUBSTR(tpep_pickup_datetime, 1, 7) LIKE '2020-1%';
Resulting CSV file can be found here
8. Join taxi trips data with the taxi zone lookup table
-- explore data with lookup information
-- observe column names from lookup tables
SELECT td.*, pu.*, do.*
FROM raw_yellow_tripdata td,
raw_taxi_zone_lookup pu,
raw_taxi_zone_lookup do
WHERE td.pulocationid = pu.locationid AND
td.dolocationid = do.locationid AND
vendorid IS NOT NULL AND
SUBSTR(tpep_pickup_datetime, 1, 7) LIKE '2020-1%'
LIMIT 100;
Resulting CSV file can be found here
-- Count total joined records for the last quarter of 2020.
SELECT COUNT(*) "Count"
FROM raw_yellow_tripdata td,
raw_taxi_zone_lookup pu,
raw_taxi_zone_lookup do
WHERE td.pulocationid = pu.locationid AND
td.dolocationid = do.locationid AND
vendorid IS NOT NULL AND
SUBSTR(tpep_pickup_datetime, 1, 7) LIKE '2020-1%';
Resulting CSV file can be found here
Instructions from AWS to explore the data using SQL queries can be found here.
P.S. In the AWS instructions, the table name taxi_zone_lookup in the JOIN queries needs to be replaced with raw_taxi_zone_lookup. Otherwise, the queries won't execute successfully.
In this lab, AWS Glue Studio has been used to transform the existing data from the tables of the nyctaxi_db database by enriching it with additional data from other sources, making the data from the raw_yellow_tripdata and raw_taxi_zone_lookup tables more useful and insightful during analysis.
It is recommended to plan the necessary transformation steps in detail. The following actions were executed to transform the data:
Step 1: Adding an ETL job using AWS Glue Studio
Step 2: Adding a data source
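Under the hood, Glue Studio generates a PySpark script. A stripped-down sketch of the equivalent job start, reading the source table from the Data Catalog (the generated script additionally handles job arguments and bookmarks):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Data source: the cataloged yellow trip data, read as a DynamicFrame.
trips = glue_context.create_dynamic_frame.from_catalog(
    database="nyctaxi_db",
    table_name="raw_yellow_tripdata",
)
```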
The data from the raw_yellow_tripdata table was cleaned in 2 steps:
Step 1: Removing records with NULL values
Any record containing a NULL value in the columns vendorid, payment_type, passenger_count, and ratecodeid was discarded.
Step 2: Filtering records within a time period
To narrow down the data for efficient processing, all records with invalid values in the tpep_pickup_datetime column were discarded.
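A sketch of both cleaning steps, continuing the job skeleton above and assuming tpep_pickup_datetime values are strings starting with the year, as observed in the Athena queries:

```python
from awsglue.transforms import Filter

# Step 1: drop records with a NULL in any of the key categorical columns.
trips_not_null = Filter.apply(
    frame=trips,
    f=lambda r: r["vendorid"] is not None
    and r["payment_type"] is not None
    and r["passenger_count"] is not None
    and r["ratecodeid"] is not None,
)

# Step 2: keep only records whose pickup timestamp falls within 2020.
trips_2020 = Filter.apply(
    frame=trips_not_null,
    f=lambda r: str(r["tpep_pickup_datetime"]).startswith("2020"),
)
```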
Step 1: Reading the lookup data from S3 (the raw_taxi_zone_lookup table)
Step 2: Renaming the columns of the lookup data to differentiate drop-off locations from pickup locations
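A sketch of these two steps with ApplyMapping; the pu_-prefixed target names and the column types are illustrative:

```python
from awsglue.transforms import ApplyMapping

# Read the taxi zone lookup table from the Data Catalog.
zones = glue_context.create_dynamic_frame.from_catalog(
    database="nyctaxi_db",
    table_name="raw_taxi_zone_lookup",
)

# Prefix the lookup columns so the pickup side stays distinguishable after joining.
pu_zones = ApplyMapping.apply(
    frame=zones,
    mappings=[
        ("locationid", "long", "pu_locationid", "long"),
        ("borough", "string", "pu_borough", "string"),
        ("zone", "string", "pu_zone", "string"),
        ("service_zone", "string", "pu_service_zone", "string"),
    ],
)
```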
Step 1: Adding a lookup table for the Taxi Drop-off Zone
Step 2: Modifying the column names of the Taxi Drop-off Zone lookup table
Step 3: Performing a join between the Yellow Trips data and the Taxi Drop-off Zone lookup data
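The drop-off lookup is prepared the same way with a do_ prefix and then joined; a sketch using Glue's Join transform (key and column names as above):

```python
from awsglue.transforms import ApplyMapping, Join

# Same lookup data, renamed for the drop-off side.
do_zones = ApplyMapping.apply(
    frame=zones,
    mappings=[
        ("locationid", "long", "do_locationid", "long"),
        ("borough", "string", "do_borough", "string"),
        ("zone", "string", "do_zone", "string"),
        ("service_zone", "string", "do_service_zone", "string"),
    ],
)

# Join trips to the pickup lookup, then the result to the drop-off lookup.
joined = Join.apply(frame1=trips_2020, frame2=pu_zones,
                    keys1=["pulocationid"], keys2=["pu_locationid"])
joined = Join.apply(frame1=joined, frame2=do_zones,
                    keys1=["dolocationid"], keys2=["do_locationid"])
```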
Step 1: Modifying column names and data types of the joined dataset
Step 2: Saving the transformed data into S3
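A sketch of the final save, assuming an illustrative output path and Parquet as the output format (a common choice for analytics; the lab's exact settings are not restated here). A final ApplyMapping step to adjust names and types would precede the write:

```python
# Persist the transformed dataset to S3 for analysis with Athena/QuickSight.
glue_context.write_dynamic_frame.from_options(
    frame=joined,
    connection_type="s3",
    connection_options={"path": "s3://serverless-datalake-nyctaxi-demo/nyctaxi/transformed/"},
    format="parquet",
)
```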