Gobblin Schedulers
Gobblin jobs can be scheduled on a recurring basis using a few different tools. Gobblin ships with a built-in Quartz scheduler and also integrates with third-party schedulers such as Azkaban and Oozie.
Gobblin has a built-in Quartz scheduler as part of the `JobScheduler` class. This class integrates with the Gobblin `SchedulerDaemon`, which can be run using the `bin/gobblin-standalone.sh` script.
So in order to take advantage of the Quartz scheduler, two steps need to be taken:

- Use the `bin/gobblin-standalone.sh` script.
- Add the property `job.schedule` to the `.pull` file. The value of this property should be a Quartz cron expression (a `CronTrigger`), as in the sketch below.
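
As an illustration, the scheduling-related lines of a `.pull` file might look like this minimal sketch; the job name, group, and cron expression are placeholders, not values from this guide:

```properties
# Identify the job (placeholder values)
job.name=ExampleQuartzScheduledJob
job.group=Example

# Quartz cron expression: run at the top of every hour
job.schedule=0 0 * * * ?
```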
Gobblin can be launched via Azkaban, an open-source workflow manager for scheduling and launching Hadoop jobs. Gobblin's `AzkabanJobLauncher` can be used to launch a Gobblin job through Azkaban.
One has to follow the typical setup to create a zip file that can be uploaded to Azkaban (it should include all dependent jars, which can be found in `gobblin-dist.tar.gz`). The `.job` file for the Azkaban job should contain all configuration properties that would be put in a `.pull` file (for example, the Wikipedia example `.pull` file). All Gobblin system-dependent properties (e.g. `conf/gobblin-mapreduce.properties` or `conf/gobblin-standalone.properties`) should also be in the zip file.
In the Azkaban `.job` file, the `type` parameter should be set to `hadoopJava` (see here for more information about the `hadoopJava` job type). The `job.class` parameter should be set to `gobblin.azkaban.AzkabanJobLauncher`.
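
A minimal Azkaban `.job` file for a Gobblin job might look like the sketch below; the job name and source class are placeholders for illustration, and the remaining Gobblin properties should mirror what would normally go in the `.pull` file:

```properties
# Azkaban job type and the Gobblin launcher class
type=hadoopJava
job.class=gobblin.azkaban.AzkabanJobLauncher

# Gobblin job properties that would otherwise live in a .pull file
# (job name and source class below are placeholders)
job.name=ExampleAzkabanGobblinJob
source.class=com.example.MyGobblinSource
```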
Oozie is a very popular scheduler for the Hadoop environment. It allows users to define complex workflows using XML files. A workflow can be composed of a series of actions, such as Java jobs, Pig jobs, Spark jobs, etc. Gobblin has two integration points with Oozie: it can be run as a stand-alone Java process via Oozie's `<java>` tag, or it can be run as a MapReduce job via Oozie.
The following guide assumes Oozie is already set up and running on some machine; if this is not the case, consult the Oozie documentation for getting everything set up. This tutorial only outlines how to launch a basic Oozie job that runs a Gobblin job a single time. For information on how to build more complex flows, and how to run jobs on a schedule, check out the Oozie documentation online.
This guide focuses on getting Gobblin to run as a stand-alone Java process, which means it will not launch a separate MapReduce job to distribute its workload. It is important to understand how the current version of Oozie launches a Java process: it first starts a MapReduce job and runs Gobblin as a Java process inside a single map task. The Gobblin job then ingests all data it is configured to pull and shuts down.
`gobblin-oozie/src/main/resources/` contains sample configuration files for launching Gobblin via Oozie. There are a number of important files in this directory:
`gobblin-oozie-example-system.properties` contains default system-level properties for Gobblin. When launched with Oozie, Gobblin will run inside a map task; it is thus recommended to configure Gobblin to write directly to HDFS rather than the local file system. The property `fs.uri` in this file should be changed to point to the NameNode of the Hadoop file system the job should write to. By default, all data is written under a folder called `gobblin-out`; to change this, modify the `gobblin.work.dir` parameter in this file.
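
For instance, the relevant lines of such a system properties file might look like this minimal sketch; the NameNode host, port, and working directory are placeholder values:

```properties
# Point Gobblin at the HDFS NameNode instead of the local file system
fs.uri=hdfs://namenode.example.com:8020

# Root directory under which Gobblin writes its output and working data
gobblin.work.dir=/user/gobblin/gobblin-out
```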
`gobblin-oozie-example-workflow.properties` contains default Oozie properties for any job launched. It is also the entry point for launching an Oozie job (e.g. to launch an Oozie job from the command line, execute `oozie job -config gobblin-oozie-example-workflow.properties -run`). In this file one needs to update `name.node` and `resource.manager` to the values specific to their environment. Another important property in this file is `oozie.wf.application.path`; it points to a folder on HDFS that contains any workflows to be run. It is important to note that the `workflow.xml` files must be on HDFS in order for Oozie to pick them up (this is because Oozie typically runs on a separate machine from any client process).
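
As an illustration, the environment-specific portion of such a workflow properties file might resemble the sketch below; the hostnames, ports, and HDFS path are placeholders:

```properties
# Hadoop endpoints for this environment (placeholder values)
name.node=hdfs://namenode.example.com:8020
resource.manager=resourcemanager.example.com:8032

# HDFS folder containing the workflow.xml to run (placeholder path)
oozie.wf.application.path=hdfs://namenode.example.com:8020/user/gobblin/gobblin-oozie-example
```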
`gobblin-oozie-example-workflow.xml` contains an example Oozie workflow. This example simply launches a Java process that invokes the main method of the `CliLocalJobLauncher` class. The main method of this class expects two file paths to be passed to it (once again, these files need to be on HDFS). The `jobconfig` arg should point to a file on HDFS containing all job configuration parameters; an example `jobconfig` file can be found here. The `sysconfig` arg should point to a file on HDFS containing all system configuration parameters; an example `sysconfig` file for Oozie can be found here.
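
To make the shape of such a workflow concrete, here is a minimal sketch of a `<java>` action. The fully qualified launcher class name, the argument flag syntax, and the HDFS config paths are assumptions for illustration; the authoritative versions are in the example files under `gobblin-oozie/src/main/resources/`:

```xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="gobblin-example-wf">
  <start to="gobblin-java-action"/>
  <action name="gobblin-java-action">
    <java>
      <job-tracker>${resource.manager}</job-tracker>
      <name-node>${name.node}</name-node>
      <!-- Launcher class and argument names are assumptions for illustration -->
      <main-class>gobblin.runtime.local.CliLocalJobLauncher</main-class>
      <arg>-sysconfig</arg>
      <arg>/user/gobblin/conf/gobblin-oozie-example-system.properties</arg>
      <arg>-jobconfig</arg>
      <arg>/user/gobblin/conf/example-job.pull</arg>
    </java>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <fail name="fail">
    <message>Gobblin job failed</message>
  </fail>
  <end name="end"/>
</workflow-app>
```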