
IBM Developer Blog


Useful tips, snags we hit, and how we resolved them


In this last blog post in our series, we focus on lessons learned from installing, maintaining, and verifying the connectivity of Cloudera Data Platform and IBM Cloud Pak for Data. If you haven't read the first two posts, A technical deep-dive on integrating Cloudera Data Platform and IBM Cloud Pak for Data and Installing Cloudera's CDP Private Cloud Base on IBM Cloud with Ansible, we'd invite you to go back and read them for additional context.

In this installment, we'd like to share some useful tips and tricks and show you how to avoid common mistakes made by first-time installers.

Lesson 1: Use a bastion host

Our Cloudera cluster had a total of 8 VMs (3 master nodes, 3 worker nodes, and 2 edge nodes). We wanted easy access to each node and wanted to limit public network traffic to the Cloudera cluster as much as possible. Luckily, there’s already a well-known solution to this problem: using a bastion host.

We spun up a small VM on the same subnet as our Cloudera cluster and could then easily communicate over private network interfaces (10.x.y.z IP addresses). For the installation process, this choice offered the benefit of not dropping connections for long-running Ansible playbooks.
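Once the bastion host is up, you can route all SSH traffic to the cluster through it with a `ProxyJump` entry. The following is a minimal sketch of an `~/.ssh/config`; the bastion hostname and user are placeholders, and the `cid-vm-*` node names follow the naming used in our playbooks:

```
# Hypothetical bastion address; replace with your own
Host bastion
    HostName bastion.example.com
    User root

# Reach any cluster node by its private name, jumping through the bastion
Host cid-vm-*
    User root
    ProxyJump bastion
```

With this in place, a plain `ssh cid-vm-01` from your laptop transparently hops through the bastion onto the private 10.x.y.z network.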

Figure 1. The architecture of our Cloudera for Cloud Pak for Data environment

Lesson 2: Use VS Code’s Remote Extension Plug-in

When installing Cloudera Data Platform with Ansible playbooks, you're likely going to need to change a few config options and values in the playbooks. We're not against using Vim, but we opted to use the Visual Studio Code Remote Development Extension Pack. This made searching through the files, modifying values, and uploading and downloading files much easier.
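If you prefer to script your editor setup, the extension pack can also be installed from a terminal (this assumes the `code` CLI is on your PATH; the extension ID is the official one for the Remote Development pack):

```shell
# Install the Remote Development extension pack (SSH, Containers, WSL)
code --install-extension ms-vscode-remote.vscode-remote-extensionpack
```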

Figure 2. VS Code's Remote Development extension is useful for editing files and running commands against our remote machines

Lesson 3: Stick to private networks

This point may seem obvious, but it's more about being consistent. Anywhere an IP address needed to be entered, we always made sure to use the private network IP address. This ensured that any traffic would stay on the IBM Cloud network and off the public Internet.

Lesson 4: Eliminate all inbound traffic except RDP on the Windows Active Directory server

Here is a subtle lesson that might otherwise be a little tricky to pin down. After a few days of uptime, the health checks on our Cloudera Data Platform indicated that the hosts could not reach our Active Directory (AD) server. Indeed, we discovered that our AD server had hung. When we rebooted the AD server, things would go back to normal for a day or so, and then the problem would repeat.

We looked over the capacity and performance of the server. When we examined network utilization, we noticed a high level of traffic going to and from the system on the Internet-facing interface. After looking at the server configuration and the traffic, we determined that the vast majority of it was over the LDAP port.

Since our only use of LDAP is internal, the solution to this problem was to limit inbound traffic to the AD server by creating a rule that only allowed traffic over the RDP protocol, which is used for remote desktop management. On IBM Cloud, we created a custom security group permitting inbound TCP on port 3389 for RDP.
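A quick way to confirm the rule is behaving as intended is to probe the ports from a host outside the security group. This is a sketch using `nc`; the IP address below is a placeholder for your AD server's public address:

```shell
AD_PUBLIC_IP=192.0.2.10            # placeholder: your AD server's public IP

nc -zv -w 5 "$AD_PUBLIC_IP" 3389   # RDP: expect a successful connection
nc -zv -w 5 "$AD_PUBLIC_IP" 389    # LDAP: expect a timeout or refusal
nc -zv -w 5 "$AD_PUBLIC_IP" 636    # LDAPS: expect a timeout or refusal
```

If the LDAP ports still answer from the Internet, the security group is not attached to the right interface.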

Lesson 5: Mount secondary drives to /data/dfs automatically

The storage requirements for installing Cloudera required us to purchase additional drives to go along with our virtual machines. These drives had to be mounted before running any playbooks. We used a little bit of bash and SSH to do it in an automated way. In our case, we chose to mount the drives to /data/dfs:

# On each node: format the secondary drive, create the mount point,
# mount the drive, and persist the mount in /etc/fstab
for i in {1..8}
do
  ssh cid-vm-0$i mkfs.ext4 -m0 -O sparse_super,dir_index,extent,has_journal /dev/xvdc
  ssh cid-vm-0$i mkdir -p /data/dfs
  ssh cid-vm-0$i mount /dev/xvdc /data/dfs
  ssh cid-vm-0$i 'echo "/dev/xvdc  /data/dfs   ext4  defaults,noatime 1 2" | tee -a /etc/fstab'
done
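Before moving on to the playbooks, it's worth verifying that every node actually has the drive mounted and the fstab entry in place. A small loop in the same style does the job:

```shell
# Confirm the mount is live and will survive a reboot on each node
for i in {1..8}
do
  echo "--- cid-vm-0$i ---"
  ssh cid-vm-0$i 'df -h /data/dfs && grep /data/dfs /etc/fstab'
done
```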

Lesson 6: Update OpenShift DNS operator so it knows the Cloudera node hostnames

We wanted our IBM Cloud Pak for Data instance, which runs on OpenShift, to be able to communicate with our newly deployed Cloudera Data Platform cluster. We stuck to our "always use private network interfaces" rule, but that resulted in 404s since OpenShift didn't know how to resolve those hostnames. To get around this, we needed to edit the DNS operator on our OpenShift instance. It's documented in the OpenShift DNS Documentation, but for brevity, we've included what worked for us.

Edit the DNS operator default custom resource with oc edit dns.operator/default and update it by adding the following to the spec section:

spec:
  servers:
  - forwardPlugin:
      upstreams:
      - <your private ip>
      - <your public ip>
    name: cdplab-server
    zones:
    - cdplab.local

Then verify that the ConfigMap for CoreDNS was updated: oc get configmap/dns-default -n openshift-dns -o yaml

apiVersion: v1
data:
  Corefile: |
    # cdplab-server
    cdplab.local:5353 {
        forward . <your private ip> <your public ip>
    }

Then create a pod and try to access CDP from it; HTML should be returned rather than a 404 error message.

bash-4.4$ curl -k https://cid-vm-01.cdplab.local:7183/cmf/home
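One way to get a throwaway pod for this check is `oc run` with a curl-capable image. The `curlimages/curl` image is an assumption here; any image with curl on board will do:

```shell
# Launch a one-off pod, run curl against Cloudera Manager from inside
# the cluster, and clean the pod up when the command exits
oc run dns-test -it --rm --restart=Never --image=curlimages/curl -- \
  curl -k https://cid-vm-01.cdplab.local:7183/cmf/home
```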

Lesson 7: Ensure the AD self-signed certificate can be used as a certificate authority

This lesson can be broadly applied to other LDAP and AD scenarios. In our case, we could successfully connect to the Impala service running on Cloudera through Kerberos, but not through LDAP. After double-checking that our LDAP-specific Impala configuration was correct, we were still getting a not-so-helpful “Can’t contact LDAP server” error.

We slowly started to peel back the layers of the problem. We managed to isolate it to our LDAP configuration: when we ran ldapsearch in an attempt to bind the user, it gave us the same error message. Ah-ha! Impala was using an OpenLDAP library under the covers.

$ ldapsearch -H ldaps://cid-adc.cdplab.local:636 -D "stevemar@CDPLAB.LOCAL" -b "dc=cdplab,dc=local" '(uid=stevemar)' -W
Enter LDAP Password:
ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)

After double-checking that the Windows firewall wasn't the culprit, we narrowed the problem down to a missing bit of information in the self-signed certificate we had created for the AD. We needed to add the -TextExtension "2.5.29.19={text}CA=true" flag to the Windows New-SelfSignedCertificate command. Our new command looked like the following (previously, it was missing the last parameter):

New-SelfSignedCertificate -Subject *.$dnsName `
  -NotAfter $lifetime.AddDays(365) -KeyUsage DigitalSignature, KeyEncipherment `
  -Type SSLServerAuthentication -DnsName *.$dnsName, $dnsName `
  -TextExtension "2.5.29.19={text}CA=true"
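After installing the regenerated certificate, you can check from any Linux host that the served certificate now carries the CA flag. This is a sketch using `openssl`, pointed at the AD hostname from our lab:

```shell
# Fetch the certificate from the LDAPS port and inspect its
# Basic Constraints extension (OID 2.5.29.19); expect "CA:TRUE"
echo | openssl s_client -connect cid-adc.cdplab.local:636 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Basic Constraints'
```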

Lesson 8: Get familiar with Kerberos concepts and tools

There's no single piece of advice here, other than: if you're going to use Kerberos to secure your Cloudera cluster, get familiar with Kerberos concepts, like keytabs, and tools, like ktutil and ktpass.
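As a small taste of what that looks like in practice, here is a sketch of building a keytab with MIT ktutil and authenticating with it. The principal name reuses the account from the ldapsearch example earlier; the password, key version number, and encryption type are assumptions you'd match to what your KDC or AD actually issues:

```shell
# Create a keytab entry non-interactively; ktutil reads the password
# from stdin when prompted by addent
ktutil <<'EOF'
addent -password -p stevemar@CDPLAB.LOCAL -k 1 -e aes256-cts-hmac-sha1-96
hypothetical-password
wkt stevemar.keytab
q
EOF

# Verify the keytab works: obtain a ticket without typing a password,
# then list the resulting credentials cache
kinit -kt stevemar.keytab stevemar@CDPLAB.LOCAL
klist
```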

Summary and next steps

We hope you enjoyed reading about some of the pitfalls we encountered, and that you remember the tips we shared the next time you're deploying a data and AI platform. You can learn more about the Cloudera Data Platform for IBM Cloud Pak for Data joint offering.