
How to set Spark configuration in ProcessingStep? #2724

Answered by neilmcguigan
neilmcguigan asked this question in Help

Figured it out. The answer is to use get_run_args():


from sagemaker.workflow.steps import ProcessingStep

# `processor` is an existing PySparkProcessor instance (see the sketch below).
# get_run_args() uploads `configuration` to S3 and converts it to a ProcessingInput:
run_args = processor.get_run_args(
    "processor.py",
    arguments=[
        "--output",
        "s3://bucket/prefix/"
    ],
    configuration=[{
        "Classification": "spark-defaults",
        "Properties": {
            # avoid creating a _SUCCESS file in the output prefix
            "spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs": "false"
        }
    }]
)

processing_step = ProcessingStep(
    name="process",
    processor=processor,
    code=run_args.code,
    job_arguments=run_args.arguments,
    inputs=run_args.inputs
)
