Lecture03 (#44)
* ready to work on lecture 03

* added images to sideload.

* Lecture02 (#43)

* initial commit for lecture 02-

* initial commit for lecture 02-

* making progress

* making progress

* hdfs.

* lecture02 Make sure that the order is PersistentVolume, Deployment, Service and updated the documentation

* lecture02 Updated docs

* resolved alias, kubectl cmds.

* updated hdfs datanodes from deployment to pod resources.

* ready to test exercises.

* added cleanup section.

* added nudging to open github issue for bugs.

* lecture02 Added configuration for HDFS cli

* changed references from README.md -> README.

---------

Co-authored-by: Anders Launer Baek-Petersen <[email protected]>

* ready to work on lecture 03

* added images to sideload.

* staged and push to change to another computer.

* services are up.

* working on exercises.

* changed the structure of the manifest files for hdfs.

* kafka, consumers and producers, registry, ksqldb, and connect modules.

* updated cleanup sections

* bugfix: updated filename in readme.

* lecture03 Update the configuration for the kafka deployments to use configmaps instead in case they have to be changed during runtime

* lecture03 Update documentation for lecture 2

* lecture03 Working on Sqoop

* Update kafka-values.yaml

* lecture03 Can now ingest PostgreSQL database into HDFS using Sqoop

* lecture03 Working on Flume

* lecture03 Can now use Flume to ingest data into Kafka

* lecture03 Updated cleanup

* Update documentation

* lecture03 Updated docs for Flume

* lecture03 Updated docs for Flume

* lecture03 Updated docs for Sqoop

* read the exercises.

---------

Co-authored-by: Kasper Svane <[email protected]>
anderslaunerbaek and Svane20 authored Aug 26, 2024
1 parent dd732d5 commit 6d43690
Showing 17 changed files with 1,277 additions and 7 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -13,4 +13,4 @@ kubeconfig.yml
templates/kubeconfig.j2
infrastructure/tmp
lectures/01/hello-kubernetes/
__pycache__/
__pycache__/
8 changes: 8 additions & 0 deletions infrastructure/images.txt
@@ -4,3 +4,11 @@ registry.gitlab.sdu.dk/jah/bigdatarepo/interactive:latest
bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8
bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8
apache/hadoop:3
registry.gitlab.sdu.dk/jah/bigdatarepo/kafka-connect:7.3.1
bitnami/kafka:3.8.0-debian-12-r3
confluentinc/cp-schema-registry:7.3.1
confluentinc/cp-ksqldb-server:7.3.1
confluentinc/cp-ksqldb-cli:7.3.1
redpandadata/console:v2.7.1
dvoros/sqoop:latest
bde2020/flume
21 changes: 16 additions & 5 deletions lectures/02/exercises.md
@@ -107,7 +107,17 @@ For the next exercises you will be working with the Alice in Wonderland book. Th

</details>

### Exercise 4 - Interacting with HDFS cluster using Python
### Exercise 4 (optional) - Mount HDFS config to interactive container

Instead of manually specifying the URI (`hdfs://namenode:port`) and making sure you connect to the correct name node, you can let the HDFS client resolve this for you using the HDFS config files (`core-site.xml` and `hdfs-site.xml`).

The `hdfs-cli.yaml` file creates a Kubernetes resource called a ConfigMap. A ConfigMap is a resource that contains key-value pairs used for configuration. It can be consumed in various ways, but here we want to mount the ConfigMap as files into the interactive container.

Before mounting it, take a look at the ConfigMap and try to understand it.

To create the interactive container and mount the config, use the provided [hdfs-cli.yaml file](../../services/hdfs/hdfs-cli.yaml).
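
To see what the mounted config gives you, a small sketch like the one below could read the name node URI out of `core-site.xml` instead of hardcoding it. The file path is an assumption; check the volume `mountPath` in `hdfs-cli.yaml` for the real location.

```python
# Minimal sketch: read fs.defaultFS from a mounted core-site.xml instead of
# hardcoding "hdfs://namenode:port". The path below is an assumption; use the
# mountPath defined in hdfs-cli.yaml.
import xml.etree.ElementTree as ET

def default_fs(core_site: str = "/etc/hadoop/core-site.xml") -> str:
    """Return the value of the fs.defaultFS property (an hdfs:// URI)."""
    root = ET.parse(core_site).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == "fs.defaultFS":
            return prop.findtext("value")
    raise KeyError(f"fs.defaultFS not found in {core_site}")

print(default_fs())
```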

### Exercise 5 - Interacting with HDFS cluster using Python

We now want to try to interact with the HDFS cluster using Python. To do this, there are a few files provided:

@@ -127,7 +137,7 @@ We now want to try to interact with the HDFS cluster using Python. To do this, t

**Further reading**: You can read more about the HDFS Python library [here](https://hdfscli.readthedocs.io/en/latest/quickstart.html#python-bindings).
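
As a rough picture of what the provided files do, the `hdfs` (hdfscli) library can talk to the name node over WebHDFS along these lines; the host, port, user, and paths below are assumptions, so adjust them to your cluster.

```python
# Minimal sketch of interacting with HDFS via the hdfscli (hdfs) Python library.
# Host, port, user, and paths are assumptions; adjust them to your cluster.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="root")

client.makedirs("/user/root/books")                        # create a directory
client.upload("/user/root/books/alice.txt", "alice.txt")   # upload a local file
print(client.list("/user/root/books"))                     # list directory contents

with client.read("/user/root/books/alice.txt", encoding="utf-8") as reader:
    print(reader.read()[:200])                             # read part of the file back
```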

### Exercise 5 - Analyzing file and saving result in JSON format using Python
### Exercise 6 - Analyzing file and saving result in JSON format using Python

Now we know how to put files into HDFS, read files from HDFS, and interact with HDFS. The next exercise analyzes the data and saves the results to a JSON file in HDFS; a minimal sketch of the counting step follows the questions below.

@@ -144,7 +154,7 @@ Now we know how to put files in HDFS, read files from HDFS and how to interact w
1. What are the five most common words in Alice in Wonderland?
1. How many times are they repeated?
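
A minimal sketch of the counting step might look like the following; the paths and connection details are assumptions, and the provided exercise script remains the authoritative version.

```python
# Minimal sketch: count words from a file in HDFS and write the result back as JSON.
# Paths, host, port, and user are assumptions; adjust them to your cluster.
import json
from collections import Counter

from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="root")

with client.read("/user/root/books/alice.txt", encoding="utf-8") as reader:
    words = reader.read().lower().split()

top5 = Counter(words).most_common(5)  # list of (word, count) pairs, most common first

client.write(
    "/user/root/results/top5.json",
    data=json.dumps(dict(top5), indent=2),
    encoding="utf-8",
    overwrite=True,
)
```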

### Exercise 6 - Analyzing file and saving result in Avro format using Python
### Exercise 7 - Analyzing file and saving result in Avro format using Python

Instead of saving the result as a JSON file, we will now try to save it as an Avro file.

@@ -158,7 +168,7 @@ You should see that the script reads the Alice in Wonderland file similarly to [
1. Run the [`counting-avro.py`](./counting-avro.py) file.
1. Read and output the result of the stored files directly from HDFS using HDFS CLI.
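
A minimal sketch of the Avro step could look like this, assuming the `fastavro` library; the provided `counting-avro.py` is the authoritative version and may use a different Avro library.

```python
# Minimal sketch of writing word counts as Avro, assuming the fastavro library.
import io

from fastavro import parse_schema, writer
from hdfs import InsecureClient

schema = parse_schema({
    "name": "WordCount",
    "type": "record",
    "fields": [
        {"name": "word", "type": "string"},
        {"name": "count", "type": "int"},
    ],
})

# Placeholder records; in the exercise these come from your word count.
records = [{"word": "alice", "count": 0}, {"word": "rabbit", "count": 0}]

buf = io.BytesIO()
writer(buf, schema, records)  # serialize to the Avro object container format

# Connection details are assumptions; reuse the client setup from earlier.
client = InsecureClient("http://namenode:9870", user="root")
client.write("/user/root/results/counts.avro", data=buf.getvalue(), overwrite=True)
```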

### Exercise 7 - Analyzing file and saving result in Parquet format using Python
### Exercise 8 - Analyzing file and saving result in Parquet format using Python

We will now try to save a Parquet file to HDFS.

@@ -172,7 +182,7 @@ We will now try to save a Parquet file to HDFS.
1. Read and output the result of the stored files directly from HDFS using HDFS CLI.
1. How many columns does the dataframe have?
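
A minimal sketch of the Parquet step could use pandas with the pyarrow engine; the library choice is an assumption, and the provided exercise script is authoritative.

```python
# Minimal sketch of writing word counts as Parquet with pandas + pyarrow.
import io

import pandas as pd
from hdfs import InsecureClient

# Placeholder data; in the exercise this comes from your word count.
df = pd.DataFrame({"word": ["alice", "rabbit"], "count": [0, 0]})

buf = io.BytesIO()
df.to_parquet(buf, engine="pyarrow")  # serialize the dataframe to Parquet in memory

# Connection details are assumptions; reuse the client setup from earlier.
client = InsecureClient("http://namenode:9870", user="root")
client.write("/user/root/results/counts.parquet", data=buf.getvalue(), overwrite=True)
```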

### Exercise 8 - Create six fictive data sources
### Exercise 9 - Create six fictive data sources


The objective of this exercise is to create a fictive data source. We want to create a Python program that simulates multiple data sources. The fictive data source could be a sensor that measures the wattage of an electricity line. The sample rate of the sensor must be adjustable, defaulting to 1 Hz. The ID of the sensor must differentiate the six data streams, and the valid wattage range for these electricity lines is ±600 MW.
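
A minimal sketch of a single fictive sensor could look like the following; the field names and JSON output format are assumptions, and the provided exercise files define the actual interface.

```python
# Minimal sketch of one fictive sensor; field names and output format are
# assumptions -- the provided exercise files define the actual interface.
import json
import random
import time
from datetime import datetime, timezone

def run_sensor(sensor_id: int, sample_rate_hz: float = 1.0) -> None:
    """Emit wattage readings for one electricity line at the given sample rate."""
    while True:
        reading = {
            "sensor_id": sensor_id,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "wattage_mw": random.uniform(-600.0, 600.0),  # valid range is +/-600 MW
        }
        print(json.dumps(reading))
        time.sleep(1.0 / sample_rate_hz)

# Six data sources could then be six threads or processes, for example:
# for i in range(6):
#     threading.Thread(target=run_sensor, args=(i,), daemon=True).start()
```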
@@ -252,6 +262,7 @@ It is now up to you to take the components and glue them together in the [`data-s
To clean up the resources created in this lecture, you can follow the steps below:
- run the following command: `kubectl delete pod <name>` to remove the pod created in [exercise 2](#exercise-2---interacting-with-hdfs-cluster-using-cli).
- cd into the `services/hdfs` folder in the repository.
1. `kubectl delete -f hdfs-cli.yaml` (if used)
1. `kubectl delete -f datanodes.yaml`
1. `kubectl delete -f namenode.yaml`
1. `kubectl delete -f configmap.yaml`
