
Gobblin Schedulers

sahilTakiar edited this page Jan 23, 2016 · 22 revisions

Introduction

Quartz

Azkaban

Oozie

Oozie is a very popular scheduler for the Hadoop environment. It allows users to define complex workflows using XML files. A workflow can be composed of a series of actions, such as Java jobs, Pig jobs, Spark jobs, etc. Gobblin has two integration points with Oozie: it can be run as a stand-alone Java process via Oozie's <java> tag, or it can be run as a MapReduce job via Oozie.

The following guides assume Oozie is already set up and running on some machine. If this is not the case, consult the Oozie documentation to get everything set up.

Launching Gobblin in Local Mode

This guide focuses on running Gobblin as a stand-alone Java process. This means it will not launch a separate MR job to distribute its workload. It is important to understand how the current version of Oozie launches a Java process: it first starts a MapReduce job and then runs Gobblin as a Java process inside a single map task. The Gobblin job then ingests all the data it is configured to pull and shuts down.

Example Config Files

By following the template workflow.xml and job.properties files in gobblin-oozie/src/main/resources/, launching Gobblin on Oozie can be relatively easy. There are a number of important files in that directory:

gobblin-oozie-example-system.properties contains default system-level properties for Gobblin. When launched with Oozie, Gobblin runs inside a map task, so it is recommended to configure Gobblin to write directly to HDFS rather than to the local file system. The property fs.uri in this file should be changed to point to the NameNode of the Hadoop file system the job should write to. By default, all data is written under a folder called gobblin-out; to change this, modify the gobblin.work.dir parameter in this file.
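As a rough illustration of the two settings described above, the relevant lines of the system properties file might look like the sketch below. The property names are the ones named in this guide. The NameNode host, port, and work directory path are hypothetical placeholders, not values taken from the example file.

```properties
# Hypothetical NameNode address; replace with the NameNode of the
# Hadoop file system the job should write to.
fs.uri=hdfs://namenode.example.com:8020

# Hypothetical output location; overrides the default gobblin-out folder.
gobblin.work.dir=/user/gobblin/work
```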

gobblin-oozie-example-workflow.properties contains default Oozie properties for any job launched. It is also the entry point for launching an Oozie job (e.g. to launch an Oozie job from the command line, execute oozie job -config gobblin-oozie-example-workflow.properties -run). In this file, update name.node and resource.manager to the values specific to your environment. Another important property in this file is oozie.wf.application.path, which points to a folder on HDFS that contains the workflows to be run. Note that the workflow.xml files must be on HDFS for Oozie to pick them up, because Oozie typically runs on a different machine from any client process.
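A minimal sketch of the three properties discussed above, assuming a cluster with one NameNode and one ResourceManager, is shown below. The property names come from this guide. All host names, ports, and the HDFS folder are hypothetical and must be replaced with values from your own environment.

```properties
# Hypothetical cluster endpoints; use the addresses of your own cluster.
name.node=hdfs://namenode.example.com:8020
resource.manager=resourcemanager.example.com:8032

# Hypothetical HDFS folder that holds the workflow.xml to run.
oozie.wf.application.path=hdfs://namenode.example.com:8020/user/gobblin/oozie/workflows
```

With these values filled in, the job is submitted with the command quoted earlier: oozie job -config gobblin-oozie-example-workflow.properties -run.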

gobblin-oozie-example-workflow.xml contains an example Oozie workflow. This example simply launches a Java process that invokes the main method of CliLocalJobLauncher. The main method of this class expects two file paths to be passed to it (once again, these files need to be on HDFS). The jobconfig arg should point to a file on HDFS containing all job configuration parameters. An example jobconfig file can be found here. The sysconfig arg should point to a file on HDFS containing all system configuration parameters. An example sysconfig file for Oozie can be found here.

Uploading Files to HDFS

Adding Gobblin jar Dependencies

Launching the Job

Debugging Tips
