Motivation:
Ambari Metrics 0.1.0 (AMS) was released with Apache Ambari 2.1.0 (IOP 4.1). When AMS is run with its default configuration, it can cause significant resource contention. Under the hood, Ambari Metrics 0.1.0 uses its own instances of HBase 0.98 and Phoenix 4.2 to store metrics and run some basic de-duplication / compaction. As the cluster scales up, the disk r/w requests issued by HBase (Ambari Metrics Collector) against a single disk can drive that node to 100% utilization on every CPU (I’ve seen as high as 3000% CPU usage in `top`!).

 

Starting to debug Ambari Metrics
  • Ambari Metrics Collector logs: /var/log/ambari-metrics-collector/
  • Verify available disk space with “df -h”
  • Verify available memory with “free -m”
  • See running processes/CPU usage for user ams with “top -u ams”
  • See whether the HBase RegionServer and HBase HMaster are still running with “ps -ef | grep ams”
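
The last check can be made less eyeball-dependent with a small helper. This is only a sketch: the function name `ams_daemon_summary` is mine, and it assumes the AMS HBase daemons show "HMaster" and "HRegionServer" in their command lines, as stock HBase processes do. In embedded mode you may only ever see the HMaster JVM.

```shell
#!/usr/bin/env bash
# Count AMS HBase daemons in process-listing text read from stdin.
# Assumption: daemon command lines contain "HMaster" / "HRegionServer".
ams_daemon_summary() {
  awk '
    /HMaster/       { master++ }
    /HRegionServer/ { region++ }
    END { printf "HMaster=%d HRegionServer=%d\n", master, region }
  '
}

# Usage (capture ps first so the awk process itself is never in the listing):
#   listing=$(ps -ef); printf '%s\n' "$listing" | ams_daemon_summary
```

A count of 0 for a daemon you expect to be running is the cue to go read the logs in /var/log/ambari-metrics-collector/.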

 

The table below aggregates some of the more common issues caused by Ambari Metrics, with their possible causes and resolutions. The letters refer to the entries under Resolutions.

| Issue | Possible Cause(s) | Resolution |
| --- | --- | --- |
| Ambari Metrics Collector process is using 100% of available CPUs. Any service (including Ambari Web UI) running on the same host as Ambari Metrics Collector becomes slow/unresponsive. | Ambari Metrics Collector is running on the same node as Ambari Server; Ambari Metrics is running in embedded mode | C, A |
| ams-hbase*.log shows multiple ZooKeeper timeouts | CPU contention on the Metrics Collector host when running in embedded mode | A, B |
| Metrics for CPU, network, among others ‘go missing’ from the Ambari Web UI | CPU contention caused by disk r/w bottleneck; ams-hbase master heapsize too low | A, D |
| Metrics Collector fails to start, “port in use” or “Binding to port -1” | Port 61181 doesn’t get stopped | E |
| After adding hosts to Ambari for a total of > 100 hosts, the UI throws the error “Validation failed. Config validation failed” | stack-advisor fails to update 1 property for that range of hosts | F |
| GC options applied to Ambari Metrics Collector are not applied to the collector process | AMBARI-14945 | G |
 
 
 
Resolutions
A. Run Ambari Metrics in Distributed Mode rather than embedded 
If you are running more than 3 nodes, I strongly suggest running in distributed mode and writing the hbase.rootdir contents directly to HDFS rather than to the local disk of a single node. This also applies to IOP clusters that are already installed and running.
  1. In the Ambari Web UI, select the Ambari Metrics service and navigate to Configs. Update the following properties:
    • General > (setting shown only in screenshot ams_performance_tuning_A11)
    • Advanced ams-hbase-site > hbase.cluster.distributed=true
    • Advanced ams-hbase-site > hbase.rootdir=hdfs://namenode.fqdn.example.org:8020/amshbase
  2. Restart Metrics Collector and affected Metrics monitors
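
If you would rather script the two ams-hbase-site changes than click through the UI, something like the following could work. Treat it as a sketch under assumptions: it relies on the configs.sh helper that ships with Ambari Server being at its usual path and accepting `set <ambari_host> <cluster> <config_type> <key> <value>`, and it uses hbase.rootdir, the key name HBase itself uses. The function name is mine, and it only prints the commands so you can review them first. The General tab setting is not covered.

```shell
#!/usr/bin/env bash
# Print (do not run) the commands that would set the two ams-hbase-site
# properties for distributed mode.
# Assumption: configs.sh lives at this path and takes
#   set <ambari_host> <cluster> <config_type> <key> <value>
# Check its usage output on your Ambari version before running anything.
CONFIGS_SH=/var/lib/ambari-server/resources/scripts/configs.sh

ams_distributed_cmds() {
  local ambari_host=$1 cluster=$2 namenode=$3
  printf '%s set %s %s ams-hbase-site hbase.cluster.distributed true\n' \
    "$CONFIGS_SH" "$ambari_host" "$cluster"
  printf '%s set %s %s ams-hbase-site hbase.rootdir hdfs://%s:8020/amshbase\n' \
    "$CONFIGS_SH" "$ambari_host" "$cluster" "$namenode"
}

# Usage:
#   ams_distributed_cmds ambari.server.host your_cluster_name namenode.fqdn.example.org
```

The restart in step 2 is still needed afterwards.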
 

 

B. “Cleaning” up a hanging Ambari Metrics Collector (embedded mode only)
If you need to run in embedded mode (small clusters, sandboxes, VMs, etc.), you can use the following steps to restore your node’s performance if it has been affected by the Metrics Collector.
  1. If your host has multiple disks, change hbase.rootdir and hbase.tmp.dir from their default values, preferably to a disk that is less heavily utilized than the one the OS runs on.
  2. Delete the contents of the ZooKeeper tmp snapshot dir. This deletes any unsaved metrics, effectively removing the backlog/bottleneck caused by disk contention.
    rm -rf /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper
  3. Lengthen the metrics aggregation interval. By default, metrics are aggregated every 2 minutes; raising this to 5 minutes or more significantly reduces the extended lag and CPU spikes, though you will still see short CPU spikes at each new interval. Note that in Ambari 2.2 the default has been increased to 5 minutes.

    In the Ambari Web UI, modify the configs for ams-site
    timeline.metrics.host.aggregator.minute.interval : 300
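
Since step 2 is an `rm -rf`, I like to wrap it in a guard. The sketch below is my own addition, not part of AMS: the function name is made up, and the path check simply refuses anything that does not end in hbase-tmp/zookeeper. The steps above don't say whether the collector must be stopped first; stopping it beforehand is probably the safer choice.

```shell
#!/usr/bin/env bash
# Remove the embedded ZooKeeper snapshot dir, but only if the path looks right.
# The default path is the one from step 2; the guard is an extra precaution.
clean_ams_zk_tmp() {
  local dir=${1:-/var/lib/ambari-metrics-collector/hbase-tmp/zookeeper}
  case "$dir" in
    */hbase-tmp/zookeeper) ;;   # expected suffix, carry on
    *) echo "refusing to delete unexpected path: $dir" >&2; return 1 ;;
  esac
  [ -d "$dir" ] || { echo "nothing to clean: $dir" >&2; return 0; }
  rm -rf -- "$dir"
}

# Usage:
#   clean_ams_zk_tmp            # default path
#   clean_ams_zk_tmp /data1/ams/hbase-tmp/zookeeper   # hypothetical relocated hbase.tmp.dir
```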

 

C. Moving the metrics collector to a new host.
The steps below include some required workarounds for known issues with open/resolved JIRAs.
  1. Stop Ambari Metrics Service
    curl -u admin:admin -i -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo":{"context":"Stop All Components"},"Body":{"ServiceComponentInfo":{"state":"INSTALLED"}}}' http://ambari.server.host:8080/api/v1/clusters/your_cluster_name/services/AMBARI_METRICS/components/METRICS_COLLECTOR
    
    curl -u admin:admin -i -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo":{"context":"Stop All Components"},"Body":{"ServiceComponentInfo":{"state":"INSTALLED"}}}' http://ambari.server.host:8080/api/v1/clusters/your_cluster_name/services/AMBARI_METRICS/components/METRICS_MONITOR 
  2. Delete the Ambari Metrics Collector from the old host
    curl -u admin:admin -i -H 'X-Requested-By: ambari' -X DELETE http://ambari.server.host:8080/api/v1/clusters/your_cluster_name/hosts/old.metrics.collector.host/host_components/METRICS_COLLECTOR 
  3. Add the Ambari Metrics Collector component to the new host
    curl -u admin:admin -i -H 'X-Requested-By: ambari' -X POST http://ambari.server.host:8080/api/v1/clusters/your_cluster_name/hosts/new.metrics.collector.host/host_components/METRICS_COLLECTOR 
  4. Install the Ambari Metrics Collector component on the new host
    curl -u admin:admin -i -H 'X-Requested-By: ambari' -X PUT -d '{"HostRoles": {"state": "INSTALLED"}}' http://ambari.server.host:8080/api/v1/clusters/your_cluster_name/hosts/new.metrics.collector.host/host_components/METRICS_COLLECTOR 
  5. Update the Collector hostname used by the Metrics Monitor on all hosts in your Ambari cluster. The collector hostname is stored in the ‘metrics_server’ property in /etc/ambari-metrics-monitor/conf/metric_monitor.ini
     
    #Run on every host in the cluster
    sed -i 's/old.collector.hostname/new.collector.hostname/' /etc/ambari-metrics-monitor/conf/metric_monitor.ini
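  6. Start the Ambari Metrics service, either via the UI or with the curl calls below
    curl -u admin:admin -i -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo":{"context":"Start All Components"},"Body":{"ServiceComponentInfo":{"state":"STARTED"}}}' http://ambari.server.host:8080/api/v1/clusters/your_cluster_name/services/AMBARI_METRICS/components/METRICS_COLLECTOR

    curl -u admin:admin -i -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo":{"context":"Start All Components"},"Body":{"ServiceComponentInfo":{"state":"STARTED"}}}' http://ambari.server.host:8080/api/v1/clusters/your_cluster_name/services/AMBARI_METRICS/components/METRICS_MONITOR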
Possible issues: AMBARI-13758
Fix: In the Ambari Web UI, modify the Ambari Metrics config:
Under ams-hbase-site, find the hbase.zookeeper.quorum field and update it to ‘localhost’. Note that in Ambari 2.2.2+, AMS supports using the stack-deployed ZooKeeper, thus leveraging a true ZK quorum rather than a single instance.
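
One note on the sed in step 5: the dots in the hostname pattern are regex wildcards, so it can match slightly more than intended. A minimal sketch of a stricter version, assuming GNU sed (for `-i`) and bash; the function name is mine:

```shell
#!/usr/bin/env bash
# Point a metric_monitor.ini at a new collector host.
# Dots in the old hostname are escaped so they only match literal dots.
update_monitor_collector() {
  local old=$1 new=$2
  local ini=${3:-/etc/ambari-metrics-monitor/conf/metric_monitor.ini}
  local old_re=${old//./\\.}   # old.host.name -> old\.host\.name
  sed -i "s/${old_re}/${new}/g" "$ini"
}

# Usage (run on every host in the cluster):
#   update_monitor_collector old.collector.hostname new.collector.hostname
```

The optional third argument lets you try it on a copy of the ini file before touching the real one.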

 

D. Increase metrics collector heapsize
The default value for ams-hbase-env : hbase_master_heapsize often leads to metrics periodically going missing from the Ambari Web UI. The stack-advisor recommendations have been updated in later releases.

 

Use https://cwiki.apache.org/confluence/display/AMBARI/Configurations+-+Tuning as a guideline for tuning the heapsizes for AMS.
For reference, the configs below tend to work rather well in the 1-50 node range:

 

| Property | Recommended Value (1-50 Nodes) |
| --- | --- |
| hbase_master_heapsize | 2048 |
| hbase_regionserver_heapsize | 2048 |
| metrics_collector_heapsize | 1024 |
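
After changing the heap sizes and restarting, it is worth confirming that the running JVMs actually picked them up. A small sketch (the function name is mine) that pulls the max-heap flag out of a JVM command line; it takes the last -Xmx it finds, on the assumption that the JVM uses the last occurrence when the flag is repeated:

```shell
#!/usr/bin/env bash
# Print the value of the last -Xmx flag in a JVM command line (empty if none).
extract_xmx() {
  printf '%s\n' "$1" \
    | grep -o -- '-Xmx[0-9]*[kKmMgG]\{0,1\}' \
    | tail -n 1 \
    | cut -c5-
}

# Usage, with the PID of an AMS JVM taken from "top -u ams":
#   extract_xmx "$(ps -o args= -p <pid>)"
```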

 

E. Resolve port issues with Collector

Ensure the port used by the embedded AMS ZooKeeper is free on the collector host.
The default value of hbase.zookeeper.property.clientPort is 61181:

netstat -nltp | grep 61181

Free up this port, or change clientPort to a free port, and restart the Ambari Metrics Collector.
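
To see which process is holding the port, you can filter the netstat output on the local-address column instead of grepping the whole line. A sketch, assuming the usual Linux `netstat -nltp` column layout (local address in column 4, state in column 6, PID/program in column 7); the function name is mine:

```shell
#!/usr/bin/env bash
# Read `netstat -nltp` output on stdin and print the PID/program column of
# any listener bound to the given port (default 61181).
port_listener() {
  local port=${1:-61181}
  awk -v port="$port" '$6 == "LISTEN" && $4 ~ (":" port "$") { print $7 }'
}

# Usage:
#   netstat -nltp | port_listener 61181
```

Empty output means nothing is listening on that port.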

 

F. Resolve stack-advisor validation failure for >100 hosts
“Validation failed. Config validation failed.” appears when saving modifications to Ambari Metrics configs.
    Fix: Update stack_advisor to add a default heap recommendation for host counts it does not account for

Option 1: Edit stack_advisor directly:

# Open the BI 4.0 stack advisor on the Ambari Server node (4.1 inherits from 4.0)
vim /var/lib/ambari-server/resources/stacks/BigInsights/4.0/services/stack_advisor.py
# Find the line (around line 684): totalHostsCount = len(hosts["items"])
# Add the putAmsHbaseEnvProperty line directly after it, at the same indentation:
totalHostsCount = len(hosts["items"])
putAmsHbaseEnvProperty("hbase_master_heapsize", "512m")
# Restart ambari-server

Option 2: Download tar.gz with patched stack_advisor.py


wget -O stack_advisor_ams_patch.tar.gz http://developer.ibm.com/hadoop/wp-content/uploads/sites/28/2016/02/stack_advisor_ams_patch.tar_.gz
tar -C /var/lib/ambari-server/resources/stacks/BigInsights/4.0/services/ -xzvf stack_advisor_ams_patch.tar.gz
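
Option 1 can also be applied without opening an editor. The sketch below is my own approximation, assuming GNU sed: it keeps a .orig backup, then inserts the new line after every line that consists of the totalHostsCount assignment, reusing that line's indentation. If the assignment appears more than once in your copy of stack_advisor.py, each occurrence gets patched, so diff against the backup before restarting ambari-server. The function name is mine.

```shell
#!/usr/bin/env bash
# Insert the putAmsHbaseEnvProperty line after the totalHostsCount assignment,
# preserving indentation. Keeps a backup copy next to the file as <file>.orig.
patch_stack_advisor() {
  local file=$1
  cp -p -- "$file" "$file.orig"
  sed -i 's/^\([[:space:]]*\)totalHostsCount = len(hosts\["items"\])[[:space:]]*$/&\n\1putAmsHbaseEnvProperty("hbase_master_heapsize", "512m")/' "$file"
}

# Usage:
#   f=/var/lib/ambari-server/resources/stacks/BigInsights/4.0/services/stack_advisor.py
#   patch_stack_advisor "$f" && diff "$f.orig" "$f"
```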

 

G. Ambari Metrics Collector start script patch for java GC options

The Ambari Metrics Collector start script doesn’t properly read the Java options used for the collector process, which causes all GC options to be skipped (AMBARI-14945).


#Update the /usr/sbin/ambari-metrics-collector script to remove extra quotes on AMS_COLLECTOR_OPTS
sed -i 's/"${AMS_COLLECTOR_OPTS}"/${AMS_COLLECTOR_OPTS}/' /usr/sbin/ambari-metrics-collector

 

Other notable references