Gobblin Schedulers
Oozie is a very popular scheduler for the Hadoop environment. It allows users to define complex workflows using XML files. A workflow can be composed of a series of actions, such as Java jobs, Pig jobs, Spark jobs, etc. Gobblin has two integration points with Oozie: it can be run as a stand-alone Java process via Oozie's `<java>` tag, or it can be run as a MapReduce job via Oozie.
The following guides assume Oozie is already set up and running on some machine. If this is not the case, consult the Oozie documentation to get everything set up.
This guide focuses on getting Gobblin to run as a stand-alone Java process. This means it will not launch a separate MR job to distribute its workload. It is important to understand how the current version of Oozie launches a Java process: it first starts a MapReduce job and runs Gobblin as a Java process inside a single map task. The Gobblin job then ingests all the data it is configured to pull and shuts down.
By following the template `workflow.xml` and `job.properties` files in `gobblin-oozie/src/main/resources/`, launching Gobblin on Oozie can be relatively easy. There are a number of important files in the aforementioned directory:
- `gobblin-oozie-example-system.properties` contains default system level properties for Gobblin
    - When launched with Oozie, Gobblin will run inside a map task; it is thus recommended to configure Gobblin to write directly to HDFS rather than the local file system. The property `fs.uri` in this file should be changed to point to the NameNode of the Hadoop File System the job should write to.
    - By default, all data is written under a folder called `gobblin-out`; to change this, modify the `gobblin.work.dir` parameter
- `gobblin-oozie-example-workflow.properties` contains default Oozie properties for any job launched
- `gobblin-oozie-example-workflow.xml` contains an example Oozie workflow
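As a minimal sketch of the two system-level settings described above, the relevant lines of `gobblin-oozie-example-system.properties` might look like the following. The NameNode host, port, and HDFS path are placeholder values chosen for illustration, not defaults taken from the project.

```properties
# Hypothetical NameNode address; replace with the URI of the
# Hadoop File System the job should write to
fs.uri=hdfs://namenode.example.com:8020

# Hypothetical output location on that file system; overrides the
# default gobblin-out folder
gobblin.work.dir=/user/gobblin/gobblin-out
```

Pointing `fs.uri` at HDFS keeps the output off the local disk of whichever node happens to run the map task.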
Here is a high-level description of the changes that need to be made to the above configuration files to get Oozie to launch a Gobblin job:
- In `gobblin-oozie-example-workflow.properties` update the `name.node`, `resource.manager`, and `oozie.wf.application.path` parameters
    - The `oozie.wf.application.path` should point to a directory on HDFS containing all workflow files
- Update the `<arg>` tags for the `<java>` action
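The first of these changes can be sketched as follows for `gobblin-oozie-example-workflow.properties`. The host names, ports, and the HDFS directory are assumptions made for illustration; substitute the values for your own cluster.

```properties
# Hypothetical cluster endpoints; replace with your own
name.node=hdfs://namenode.example.com:8020
resource.manager=resourcemanager.example.com:8032

# Directory on HDFS containing all workflow files (hypothetical path)
oozie.wf.application.path=hdfs://namenode.example.com:8020/user/gobblin/oozie-workflow
```

The workflow files must already be uploaded to the directory named by `oozie.wf.application.path` before the job is submitted.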