Automated Deployment of CDP Private Cloud Clusters

At Cloudera, we have long believed that automation is key to delivering secure, ready-to-use, and well-configured platforms. Hence, we were pleased to announce the public release of Ansible-based automation to deploy CDP Private Cloud Base. By automating cluster deployment this way, you reduce the risk of misconfiguration, promote consistent deployments across multiple clusters in your environment, and help to deliver business value more quickly. 

This blog will walk through how to deploy a secure Private Cloud Base cluster with a minimum of human interaction.

“The most powerful tool we have as developers is automation.” — Scott Hanselman

Key Steps

Once we’ve set up the configuration files and automation environment, Ansible will build and configure the cluster without intervention. In the following sections, we will cover:

  1. Setting up the automation environment (the “runner”).
  2. Configuring Credentials (or accepting a trial license).
  3. Defining the cluster you want built.
  4. Setting up your inventory of hosts (dynamic inventories or static inventories).
  5. Running the playbook.

Environment Setup

There are two options for setting up your execution environment (also known as the “runner”): run the quickstart environment, a Docker container that can run locally or within a pipeline, or install the dependencies on a Linux machine in your data center infrastructure. The Docker container includes all the required dependencies for local execution and works on Linux, Windows, or macOS.

If we’re running in Docker, we can simply download and run the quickstart.sh script, which will launch the Docker container for us:

wget https://raw.githubusercontent.com/cloudera-labs/cloudera-deploy/main/quickstart.sh && chmod +x quickstart.sh && ./quickstart.sh

Otherwise, if we’re running outside Docker, we clone the cloudera-deploy git repository and run the centos7-init.sh script, which installs Ansible 2.10, the required Ansible Galaxy collections, and their dependencies:

yum install git -y

git clone https://github.com/cloudera-labs/cloudera-deploy.git /opt/cloudera-deploy

cd /opt/cloudera-deploy && git checkout devel && chmod u+x centos7-init.sh && ./centos7-init.sh
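
As an optional sanity check once the init script completes (the exact collection names and versions will depend on the release you cloned), we can confirm the tooling is in place:

ansible --version

ansible-galaxy collection list | grep -i cloudera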

Configuring Credentials

You can run without any credentials, but ideally you’ll set up a profile file containing paths to your cloud credentials (if deploying on public cloud) and to your CDP license file (if you want to use one).

Copy the template profile.yml file to ~/.config/cloudera-deploy/profiles/:

mkdir -p ~/.config/cloudera-deploy/profiles

cp /opt/cloudera-deploy/profile.yml ~/.config/cloudera-deploy/profiles/default

In this file (~/.config/cloudera-deploy/profiles/default), you can then specify a public/private keypair if required, your CDP license file, and a default password for Cloudera Manager:

admin_password: "MySuperSecretPassword1!"

license_file: "~/.cdp/my_cloudera_license_2021.txt"

public_key_file: "~/.ssh/mykey.pub"

private_key_file: "~/.ssh/mykey.pem"

The following method describes deploying CDP Private Cloud onto physical or virtual machines. In some instances (perhaps development environments) it may be desirable to deploy CDP Private Cloud on EC2, Azure VMs, or GCE; however, note that there are significant cost, performance, and agility advantages to using CDP Public Cloud for any public-cloud workloads. This automation can create the requisite VMs to run your cluster on.

If you are running in GCE, you can set up your GCP credentials in your profile file. If you are using VMs in Azure or AWS, the default credentials will be collected automatically from your local user profile (the .aws or .azure directories). We suggest you set the default infra_type in your profile file to match your preferred public cloud provider, and check that your default credentials point to the correct tenants.

# infra_type can be "aws", "azure" or "gcp"; defaults to aws if omitted

infra_type: gcp 

gcloud_credential_file: '~/.config/gcloud/mycreds.json'

Cluster Definition

For CDP Private Cloud clusters, the cluster definition directory is where we are going to define:

  • Cloudera Manager and Cluster versions
  • Which services should run on the cluster
  • Any configuration settings we wish to change from the defaults
  • Any supporting infrastructure we need: internal or external certificate authorities, Kerberos Key Distribution Centers, provided or provisioned RDBMS (Postgres, MariaDB, or Oracle), parcel repositories, etc.
  • Which security features we wish to enable – Kerberos, TLS, HDFS Transparent Data Encryption, LDAP integration, etc.

The overriding principle is that you should never need to amend the playbooks or the collections – everything that you wish to customise should be customisable through the definition. 

Our cluster definition will consist of three parts:

  1. application.yml – this is just a placeholder file for any Ansible tasks you may wish to execute after deployment
  2. definition.yml – this holds our cluster definition content
  3. inventory_static.ini or inventory_template.ini – A traditional static, or modern dynamic, ‘Ansible Inventory’ of hosts to deploy to.

There is a basic definition file provided in the cloudera-deploy repository; however, this only includes the HDFS, YARN, and ZooKeeper services.

Let’s start by creating a definition directory:

mkdir /opt/cloudera-deploy/definitions

cp -r /opt/cloudera-deploy/examples/sandbox /opt/cloudera-deploy/definitions/mydefinition


echo yes | cp /opt/cloudera-deploy/roles/cloudera_deploy/defaults/basic_cluster.yml /opt/cloudera-deploy/definitions/mydefinition/definition.yml

We’ll populate the following sections in the /opt/cloudera-deploy/definitions/mydefinition/definition.yml file.

First of all, we’ll set the Cloudera Manager version. Ideally we’ll use the latest version: 7.3.1 at the time of writing if you are using the Cloudera license file in your profile (as explained earlier), although 7.1.4 is the default if you’re using a trial license:

cloudera_manager_version: 7.3.1

Next we’ll define our cluster:

clusters:
  - name: Data Engineering Cluster
    services: [ATLAS, DAS, HBASE, HDFS, HIVE, HIVE_ON_TEZ, HUE, IMPALA, INFRA_SOLR, KAFKA, OOZIE, RANGER, QUEUEMANAGER, SOLR, SPARK_ON_YARN, TEZ, YARN, ZOOKEEPER]
    repositories:
      # For licensed clusters:
      - https://archive.cloudera.com/p/cdh7/7.1.6.0/parcels/
      # For trial clusters uncomment this line:
      # - https://archive.cloudera.com/cdh7/7.1.4/parcels/ 
    security:
      kerberos: true
    configs:
      …
    host_templates:
      …

You can customise the list of services using the available services and roles defined in the collection itself. You can also include services such as Apache Spark 3, Apache NiFi, or Apache Flink in this section, although these will require separate CSDs to be configured.
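
As a sketch only (SPARK3_ON_YARN is the service type used for Spark 3, but the CSD URL below is a placeholder that must be replaced with the correct location for your subscription and version), adding Spark 3 would touch both the service list and the cloudera_manager_csds setting we cover later in this definition:

clusters:
  - name: Data Engineering Cluster
    # Sketch: append SPARK3_ON_YARN to the existing service list shown above
    services: [SPARK3_ON_YARN, SPARK_ON_YARN, TEZ, YARN, ZOOKEEPER]

# Placeholder URL – point this at the actual Spark 3 CSD jar for your version
cloudera_manager_csds:
  - https://archive.cloudera.com/p/spark3/<version>/csd/SPARK3_ON_YARN-<version>.jar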

We can specify additional configs, grouped into roles; for service-wide configs we can use the dummy role “SERVICEWIDE”. Most configuration settings are given sensible defaults, either by Cloudera Manager or by the playbook itself, so you only need to set those which are specific to your environment.

    configs:
      ATLAS:
        ATLAS_SERVER:
          atlas_authentication_method_file: true
          atlas_admin_password: password123
          atlas_admin_username: admin
      HDFS:
        DATANODE:
          dfs_data_dir_list: /dfs/dn
        NAMENODE:
          dfs_name_dir_list: /dfs/nn
        SECONDARYNAMENODE:
          fs_checkpoint_dir_list: /dfs/snn
      IMPALA:
        IMPALAD:
          enable_audit_event_log: true
          scratch_dirs: /tmp/impala
      YARN:
        RESOURCEMANAGER:
          yarn_scheduler_maximum_allocation_mb: 4096
          yarn_scheduler_maximum_allocation_vcores: 4
        NODEMANAGER:
          yarn_nodemanager_resource_memory_mb: 4096
          yarn_nodemanager_resource_cpu_vcores: 4
          yarn_nodemanager_local_dirs:  /tmp/nm
          yarn_nodemanager_log_dirs: /var/log/nm
        GATEWAY:
          mapred_submit_replication: 3
          mapred_reduce_tasks: 6
      ZOOKEEPER:
        SERVICEWIDE:
          zookeeper_datadir_autocreate: true

In the Host template section we will specify which roles will be assigned to each host template. In this simple cluster we only have two host templates: Master1 and Workers. For more complex clusters you may wish to have more host templates. In the next section we will explain how these host templates are applied to cluster nodes.

    host_templates:
      Master1:
        ATLAS: [ATLAS_SERVER]
        DAS: [DAS_EVENT_PROCESSOR, DAS_WEBAPP]
        HBASE: [MASTER, HBASERESTSERVER, HBASETHRIFTSERVER]
        HDFS: [NAMENODE, SECONDARYNAMENODE, HTTPFS]
        HIVE: [HIVEMETASTORE, GATEWAY]
        HIVE_ON_TEZ: [HIVESERVER2]
        HUE: [HUE_SERVER, HUE_LOAD_BALANCER]
        IMPALA: [STATESTORE, CATALOGSERVER]
        INFRA_SOLR: [SOLR_SERVER]
        OOZIE: [OOZIE_SERVER]
        QUEUEMANAGER: [QUEUEMANAGER_STORE, QUEUEMANAGER_WEBAPP]
        RANGER: [RANGER_ADMIN, RANGER_TAGSYNC, RANGER_USERSYNC]
        SPARK_ON_YARN: [SPARK_YARN_HISTORY_SERVER]
        TEZ: [GATEWAY]
        YARN: [RESOURCEMANAGER, JOBHISTORY]
        ZOOKEEPER: [SERVER]
      Workers:
        HBASE: [REGIONSERVER]
        HDFS: [DATANODE]
        HIVE: [GATEWAY]
        HIVE_ON_TEZ: [GATEWAY]
        IMPALA: [IMPALAD]
        KAFKA: [KAFKA_BROKER]
        SOLR: [SOLR_SERVER]
        SPARK_ON_YARN: [GATEWAY]
        TEZ: [GATEWAY]
        YARN: [NODEMANAGER]

Finally we will add any Cloudera Manager settings required, including any CSDs that might need to be installed for non-CDP services.

mgmt:
  name: Cloudera Management Service
  services: [ALERTPUBLISHER, EVENTSERVER, HOSTMONITOR, REPORTSMANAGER, SERVICEMONITOR]

hosts:
  configs:
    host_default_proc_memswap_thresholds:
      warning: never
      critical: never
    host_memswap_thresholds:
      warning: never
      critical: never
    host_config_suppression_agent_system_user_group_validator: true

cloudera_manager_options:
  CUSTOM_BANNER_HTML: "Cloudera Blog Deployment Example"

#cloudera_manager_csds:
#  - https://archive.cloudera.com/p/specific_csd_location

In this file we can also change the defaults for things like databases, Kerberos, and TLS, although in this sample we will stick with the defaults.
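
If you do want to depart from those defaults, a minimal sketch might look like the following; the variable names are assumptions based on the collection defaults and should be checked against the version of the collections you have checked out (the realm below is a placeholder):

database_type: postgresql        # assumed variable name; flavour of the provisioned RDBMS
krb5_realm: EXAMPLE.CLOUDERA     # assumed variable name; placeholder Kerberos realm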

Our complete definition.yml can be found here.

Setting up your inventory

This automation supports both dynamic and static inventories. Dynamic means that we will provision virtual machines (in AWS, for example) and then build a cluster on those hosts, however they are named; static means that we define a configuration file listing pre-existing machines on which to build our cluster.

For a dynamic inventory we need to have configured the cloud credentials above and set the infra_type in either our profile file or in extra_vars. We also need to provide an inventory_template.ini file into which the playbook can substitute the cloud-provided hostnames. Our inventory template will look like this:

[cloudera_manager]
host-1.example.com

[cluster_master_nodes]
host-2.example.com host_template=Master1

[cluster_worker_nodes]
host-3.example.com
host-4.example.com
host-5.example.com

[cluster_worker_nodes:vars]
host_template=Workers

[cluster:children]
cluster_master_nodes
cluster_worker_nodes

[krb5_server]
host-6.example.com

[db_server]
host-6.example.com

[deployment:children]
cluster
cloudera_manager
db_server
krb5_server

In this file we have groups defined for cloudera_manager, cluster_master_nodes, cluster_worker_nodes, krb5_server, and db_server. The inventory links to the cluster host templates through the host_template variable, which is assigned here to both cluster_worker_nodes and cluster_master_nodes. Note: each host can have only one host template. When provisioning dynamically, the number of unique hosts in this file determines the number of hosts provisioned by the playbook. Note also that the example.com hostnames are just placeholders and will be replaced by the provisioned instance hostnames.

If we wish to use a static inventory, we can create exactly the same file, replacing the host-*.example.com placeholders with our own hostnames. We may also wish to specify SSH keys or other Ansible variables for the inventory here, for example:

[deployment:vars]
ansible_ssh_private_key_file=~/.ssh/root_key
ansible_user=root

The static inventory file can either be named inventory_static.ini or passed as an argument to the playbook using the ‘-i’ Ansible runtime flag.

Running the playbook

Once we have the definition and the inventory set up, running the playbook is fairly straightforward. We can run the playbook in stages using some specific tags, or just run the whole thing end to end. We’ve spent time making sure that we can start and restart the playbook without needing to clean anything up in between runs.

To run the playbook use the following command:

ansible-playbook /opt/cloudera-deploy/main.yml \
  -e "definition_path=definitions/mydefinition" <extra arguments>

Other options that you may wish to pass to this command:

  • -i inventory_static.ini – specify a static inventory to be used instead of a dynamic inventory
  • --extra-vars "key1=value1 key2=value2" – specify additional variables at runtime (e.g. admin_password)
  • --ask-pass – for use when running the playbook without public/private keys; Ansible will prompt for an SSH password
  • --tags <comma-separated list of tags> – run the playbook in increments
  • -v, -vv, or -vvv – turn on increasingly verbose logging

As an example:

ansible-playbook /opt/cloudera-deploy/main.yml \
  -e "definition_path=definitions/mydefinition" \
  -i /opt/cloudera-deploy/definitions/mydefinition/inventory_static.ini \
  --ask-pass

You can also set the ANSIBLE_LOG_PATH environment variable to ensure that logs are saved to disk and not lost when you close the terminal.
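
For example (the log file name is just a suggestion):

export ANSIBLE_LOG_PATH=~/cloudera-deploy-$(date +%Y%m%d-%H%M%S).log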

The playbook will handle the installation of the supporting infrastructure, Cloudera Manager, the CDP Private Cloud Base cluster, and a KeyTrustee cluster (if required by your submitted configuration). Cluster deployments are normally constrained by network bandwidth for parcel distribution and by the speed of your hardware, but it’s realistic to deploy a small to medium sized cluster in less than two hours.

Summary

In this blog we walked through the mechanics of how to automate the deployment of CDP Private Cloud Base onto physical or virtual machines, including in the public cloud. With a simple definition, split into three configuration files for ease of use, we’ve been able to control all aspects of the cluster deployment, including integration with the enterprise infrastructure.

Automation at this scale greatly enhances the CDP Private Cloud Base time to value. Through the use of automation we can deploy multiple clusters much more quickly and with much greater consistency. If needed, environments can be rebuilt for specific purposes, or templated for even more rapid deployment. And with more repeatable deployments, administrators and developers can spend more time onboarding tenants and developing new pipelines and insights than deploying clusters.

Tristan Stevens
Director of Technology, CDP Centre of Excellence