
Lecture04 #48

Merged

merged 21 commits into main from lecture04 on Sep 3, 2024
Conversation

anderslaunerbaek
Collaborator

No description provided.

@anderslaunerbaek anderslaunerbaek marked this pull request as ready for review August 26, 2024 13:58
@anderslaunerbaek
Collaborator Author

Hi @Svane20,
Please do a validation round here whenever you have time. :)

@Svane20
Collaborator

Svane20 commented Aug 26, 2024

@anderslaunerbaek I am running into Python version mismatch issues with the Spark jobs:

root@interactive:/home/root# python3 word-count.py alice-in-wonderland.txt
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/08/26 18:31:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/08/26 18:31:15 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (10.1.155.20 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/bitnami/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1100, in main
    raise PySparkRuntimeError(
pyspark.errors.exceptions.base.PySparkRuntimeError: [PYTHON_VERSION_MISMATCH] Python in worker has different version (3, 11) than that in driver 3.10, PySpark cannot run with different minor versions.
Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:572)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:784)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1211)
        at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1217)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
        at org.apache.spark.scheduler.Task.run(Task.scala:141)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)
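
The usual workaround for PYTHON_VERSION_MISMATCH is to point the driver and the executors at the same interpreter before the SparkContext is created, as the error message suggests. A minimal sketch; the interpreter path below is an assumption and must exist at the same location on both the driver and the worker images (the executors report 3.11 in the traceback above):

import os

# Both variables must be set before the SparkSession/SparkContext is built,
# or they have no effect. Path is hypothetical; adjust to your images.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.11"         # executors
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.11"  # driver

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()

Alternatively, exporting the same two environment variables in the shell before launching the job achieves the same thing without touching the script.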

@anderslaunerbaek
Collaborator Author

@anderslaunerbaek I am running into Python version mismatch issues with the Spark jobs […]

Hi @Svane20,
Thanks for the feedback. I will look into why you need version 3.10 for running Spark locally but version 3.11 for submitting it to the Spark environment.
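
For reference, a minimal word-count job of the kind the command above invokes. This is only a sketch under the assumption of a plain RDD word count; the repository's actual word-count.py may differ:

import sys

from pyspark.sql import SparkSession

# Reuse or create a session; the master URL comes from the environment.
spark = SparkSession.builder.appName("word-count").getOrCreate()
sc = spark.sparkContext

# Split each line into words and sum the per-word counts.
counts = (
    sc.textFile(sys.argv[1])
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

# Print the 20 most frequent words.
for word, n in counts.takeOrdered(20, key=lambda kv: -kv[1]):
    print(word, n)

spark.stop()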

@anderslaunerbaek anderslaunerbaek merged commit ce1c21b into main Sep 3, 2024
3 checks passed
@anderslaunerbaek anderslaunerbaek deleted the lecture04 branch September 3, 2024 04:55