Read the dataset
diamonds = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header="true", inferSchema="true")
A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:
- price: price in US dollars
- carat: weight of the diamond (1 carat is 0.2 grams)
- cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color: diamond colour, from D (best) to J (worst)
- clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- x: length in mm (0--10.74)
- y: width in mm (0--58.9)
- z: depth in mm (0--31.8)
- depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
- table: width of top of diamond relative to widest point (43--95)
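Since carat is a unit of mass (1 carat = 0.2 grams, as noted above), the conversion can be sketched with a tiny helper; the function name is hypothetical and not part of the dataset:

```python
# Convert diamond weight from carats to grams.
# 1 carat = 0.2 grams, as stated in the dataset description above.
def carats_to_grams(carats: float) -> float:
    return carats * 0.2

# A one-carat diamond weighs 0.2 grams.
print(carats_to_grams(1.0))
```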
diamonds.count()
diamonds.printSchema()
diamonds.show()
display(diamonds)
diamonds.describe("carat","price").show()
display(diamonds.describe("carat","price"))
We can use different data transformations like
- select
- where / filter (like WHERE in SQL)
- groupBy
- aggregations like avg, sum, max, min, mean, count
- limit
- join
- drop (drops columns)
- withColumn (creates new columns)
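The groupBy/aggregation combination used below computes, per group, a summary of one column. In plain Python (without Spark), what `groupBy("cut").avg("price")` computes can be sketched like this — the toy prices are invented for illustration:

```python
from collections import defaultdict

# Toy (cut, price) rows; the numbers are invented for illustration only.
rows = [
    ("Ideal", 3000), ("Ideal", 4000),
    ("Premium", 5000), ("Premium", 7000),
    ("Fair", 4500),
]

# Equivalent of groupBy("cut").avg("price"):
# accumulate sum and count per cut, then divide.
totals = defaultdict(lambda: [0, 0])  # cut -> [sum, count]
for cut, price in rows:
    totals[cut][0] += price
    totals[cut][1] += 1

avg_price = {cut: s / n for cut, (s, n) in totals.items()}
print(avg_price)  # → {'Ideal': 3500.0, 'Premium': 6000.0, 'Fair': 4500.0}
```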
display(diamonds.select('cut').distinct())
expensive_diamonds = diamonds.where(diamonds["price"] > 15000)
diamonds_cut = diamonds.select("cut", "price").groupBy("cut").avg("price")
display(diamonds_cut)
Click on the '+' next to the table. Click on 'Visualization'. Visualization Type should be 'Bar'. Click on Save.
Why does the best cut ("Ideal") not have the best average price?
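One possible line of reasoning (a hint, not the required answer): price is driven mainly by carat, and well-cut stones may tend to be smaller, so the group with the better cut can still have the lower average price. The toy numbers below are invented purely to show how such an inversion can happen:

```python
# Invented toy data: Ideal stones are better cut but smaller,
# Fair stones are worse cut but larger, hence pricier on average.
ideal_prices = [2000, 2500, 3000]  # small, well-cut stones
fair_prices = [5000, 6000]         # large, poorly cut stones

avg_ideal = sum(ideal_prices) / len(ideal_prices)  # 2500.0
avg_fair = sum(fair_prices) / len(fair_prices)     # 5500.0

# The "best" cut ends up with the lower average price.
assert avg_ideal < avg_fair
```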
Create a new DataFrame diamonds_prices that calculates the average price per all 4 Cs (carat, cut, color, clarity)
Create a visualization: a scatterplot with 'carat' on the x-axis, avg(price) on the y-axis, and 'cut' as the group-by column
Save the DataFrame to Delta Lake
diamonds.write.format("delta").mode("overwrite").save("/delta/diamonds")
Now we create a SQL table
%sql
DROP TABLE IF EXISTS diamonds;
CREATE TABLE diamonds USING DELTA LOCATION '/delta/diamonds/'
Try it out:
%sql
SELECT * FROM diamonds
Create a screenshot of the notebook and upload it to Moodle
To find more tutorials, you can explore different topics:
- Getting started: https://docs.databricks.com/getting-started/index.html
- Data Engineering: https://docs.databricks.com/workspace-index.html
- Machine Learning: https://docs.databricks.com/machine-learning/index.html
- Data Warehousing and SQL: https://docs.databricks.com/sql/index.html
- Delta Lake: https://docs.databricks.com/delta/index.html