TDP is a collection of ansible roles which deploy Hadoop-oriented big data services to your target machines. These Hadoop services are deployed with Hadoop component binaries, compiled directly from their open source repositories.
This document is a high-level technical overview for TDP project contributors, DBAs, sysadmins, data engineers etc., intended as an aid to understanding how TDP deploys and configures Hadoop services.
For a sandbox TDP environment, TDP Getting Started deploys a highly available Hadoop cluster on a virtual cluster on your local machine.
All relative paths in this doc are relative to the appropriate ansible.cfg file used by TDP ansible roles.
The project root largely contains links to collection generic resources (docs/, .gitignore etc.) and then the roles/ directory:
- The plugins/ directory includes the access_fqdn filter plugin definition, which is used frequently throughout the TDP ansible roles. TDP role-agnostic tools and scripts belong here.
- The playbooks/ directory contains a suite of helper playbooks, including tests and example usage of the various TDP ansible roles. Ansible playbooks which use the TDP roles without changing them belong here.
- The roles/ directory is where the TDP roles are located, with one directory and one role per Hadoop service. New Hadoop component deployment roles belong here.
  - Some roles install multiple services if the dependency between them is mutual and specific (for example, the roles/Spark/ TDP role installs the tez service as well as the spark service). In such cases, multiple binaries can be used in a single role.
- The TDP distribution uses the .tar.gz format and has been tested on deployments to RHEL 7 and CentOS 7 machines
- Ansible 2.9+ on the ansible host
- java-1.8.0-openjdk on all target nodes
- Network connectivity between all nodes
- Nodes must be able to resolve the fqdn of all other nodes
- Kerberos:
  - TDP currently requires the presence of a KDC, and appropriately configured Kerberos clients must be available on each node in the cluster
  - A Kerberos admin principal should exist before any deployment (the admin credentials and realm will be used to automate service principal creation)
  - A krb5.conf file with this KDC's information should be available on the ansible host at files/krb5.conf
- Certificate Authority:
  - The files directory on the ansible host project root must contain the CA public certificate at files/root.pem, plus the signed certificate files/<fqdn>.pem and the key files/<fqdn>.key for each node in the cluster
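The file prerequisites listed above can be checked before a deployment is started. The following is a minimal sketch in Python, not part of TDP: the function name missing_prerequisite_files and the node names in the usage section are hypothetical, and the only paths it checks are the ones named in the list (files/krb5.conf, files/root.pem, files/<fqdn>.pem and files/<fqdn>.key).

```python
from pathlib import Path


def missing_prerequisite_files(project_root, fqdns):
    """Return the prerequisite files (relative to project_root) that are absent."""
    root = Path(project_root)
    files_dir = root / "files"
    # Files needed once per deployment: KDC client config and CA certificate.
    expected = [files_dir / "krb5.conf", files_dir / "root.pem"]
    # One signed certificate and one key per cluster node.
    for fqdn in fqdns:
        expected.append(files_dir / f"{fqdn}.pem")
        expected.append(files_dir / f"{fqdn}.key")
    return [p.relative_to(root).as_posix() for p in expected if not p.is_file()]


if __name__ == "__main__":
    # Hypothetical node names; substitute the FQDNs from your own inventory.
    nodes = ["master-01.tdp", "worker-01.tdp"]
    for path in missing_prerequisite_files(".", nodes):
        print(f"missing: {path}")
```

Run from the directory that holds ansible.cfg, the script prints one line per missing file and nothing when everything is in place. It only tests for presence; it does not validate certificate contents or the Kerberos configuration.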
The binaries should be available in the files/ directory.
The ansible inventory file has 2 important roles in TDP:
- As a source of truth for the node addresses
- As a mechanism to control which servers the TDP roles deploy the Hadoop services to.
An example of an ansible inventory item definition is:
worker-01 ansible_host=192.168.32.10 ansible_connection=ssh ansible_user=vagrant ip=192.168.32.10 domain=tdp
The domain host variable must be present in the inventory file and the fqdn of the target should be "{{inventory_hostname}}.{{domain}}". Any deviation from this will break some default variable definitions (those which build a node's fqdn from the ansible inventory file using the TDP access_fqdn plugin).
The ip host variable must be present in the inventory file and should correspond to the IP address used within the cluster (it can deviate from the IP address used for Ansible administration). Any deviation from this will break some tasks that rely on this variable.
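To make the two inventory rules concrete, here is a small Python sketch that parses the example inventory line, checks that domain and ip are present, and builds the fqdn as inventory_hostname.domain. It is an illustration only: parse_inventory_line and node_fqdn are hypothetical helpers, not the Ansible inventory parser and not the implementation of the access_fqdn plugin.

```python
import shlex


def parse_inventory_line(line):
    """Split 'host key=value ...' into (inventory_hostname, host_vars)."""
    hostname, *pairs = shlex.split(line)
    host_vars = dict(pair.split("=", 1) for pair in pairs)
    return hostname, host_vars


def node_fqdn(hostname, host_vars):
    """Build the fqdn as described in the text: inventory_hostname.domain."""
    # Both variables are mandatory for the TDP roles, so fail early.
    for required in ("domain", "ip"):
        if required not in host_vars:
            raise ValueError(f"{hostname}: missing host variable '{required}'")
    return f"{hostname}.{host_vars['domain']}"


line = ("worker-01 ansible_host=192.168.32.10 ansible_connection=ssh "
        "ansible_user=vagrant ip=192.168.32.10 domain=tdp")
hostname, host_vars = parse_inventory_line(line)
print(node_fqdn(hostname, host_vars))  # worker-01.tdp
```

A host whose real fqdn differs from the printed value would not match what the default variable definitions expect.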
Check the docs/DEVELOPER_DOCS.md file for more details and examples of the TDP distribution ansible inventory file.
Some of the TDP ansible roles rely on configuration files or clients deployed by other TDP ansible roles. This creates a hard order of deployment in most cases. This order of deployment is outlined below:
[Diagram: deployment order, read top to bottom. Boxes marked [TDP] are deployed by TDP roles; KDC and DB INSTANCE carry no [TDP] mark. KDC is at the top, followed by ZOOKEEPER [TDP], then HADOOP [TDP], with HIVE [TDP], RANGER [TDP], OOZIE [TDP] and SPARK [TDP] on the last row. DB INSTANCE has arrows into HIVE and RANGER.]
Example - the order of TDP role execution for an oozie managed hdfs service would be:
- TDP Zookeeper role execution
- TDP Hadoop role execution
- TDP Oozie role execution
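The ordering in this example can be derived mechanically from a dependency map with a topological sort. The sketch below uses Python's standard graphlib module (Python 3.9+). The DEPENDS_ON map is an assumption that encodes only the three roles from the example above; extend it with further roles only after checking their real dependencies. The deployment_order helper is hypothetical and not part of TDP.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Assumed subset of the dependency graph: role -> roles that must run first.
DEPENDS_ON = {
    "zookeeper": set(),
    "hadoop": {"zookeeper"},
    "oozie": {"hadoop"},
}


def deployment_order(target, depends_on=DEPENDS_ON):
    """Return the roles to execute, in order, so that `target` can be deployed."""
    # Collect the target and everything it transitively depends on.
    needed, stack = {}, [target]
    while stack:
        role = stack.pop()
        if role not in needed:
            needed[role] = depends_on.get(role, set())
            stack.extend(needed[role])
    # static_order() yields each role only after all of its prerequisites.
    return list(TopologicalSorter(needed).static_order())


print(deployment_order("oozie"))  # ['zookeeper', 'hadoop', 'oozie']
```

Restricting the sort to the target's own prerequisites keeps unrelated roles out of the result, which matches the way the example lists only the roles needed for Oozie.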