Recently I had the pleasure of collaborating with colleagues from Atos and IBM on a paper that will be presented on August 31st at the 10th TPCTC (Technology Conference on Performance Evaluation & Benchmarking), co-located with VLDB 2018 in Rio de Janeiro, Brazil. The paper is entitled “Requirements for an Enterprise AI Benchmark”.

The decision to work on the paper came about at an exploratory meeting between Atos and IBM in Paris on April 17, 2018. I would like to thank all of the paper's co-authors.

The full paper will be published in the 2018 TPCTC conference proceedings: http://www.tpc.org/tpctc/default.asp

You can find a recording of the presentation and the slides that accompany the paper at http://cognitive-science.info/community/weekly-update/

Extract from the 10th TPCTC (Technology Conference on Performance Evaluation & Benchmarking) page:

“Corporations run performance benchmarks to identify the most cost-effective software and hardware, e.g., when selecting a new solution, or when looking to improve throughput or response time.”

The paper was motivated by the observation that, despite the emergence of general-purpose AI performance benchmarks, Atos and IBM still create custom benchmarks for their clients whenever a performance assessment or prediction is conducted for AI software. The paper seeks to identify the metrics needed for an enterprise AI benchmark. In this blog entry, I summarize the paper, which is organized in five parts.

1. Introduction

This section sets the scene by explaining how the use of AI technology has emerged in companies, driven by the availability of hardware processing power (including GPUs), large datasets, and improved algorithms. The introduction states that the only way today to properly perform a comparative evaluation between AI systems is to create a custom benchmark test suite based on the user's specific AI workload and data, and to arrange access to the equipment to run that test suite on each of the systems under consideration.

2. Identification of AI Workflow Bottlenecks

Most of the current AI software under consideration is based on recent academic and industrial advances in machine learning, including deep learning (ML/DL). This approach uses algorithms that build numerical models from datasets; the models are then able to correctly predict results for new data. Training can be supervised, when the training dataset includes the true values of the variables to predict, or unsupervised, when the software itself determines the values to predict.
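
To make the supervised/unsupervised distinction concrete, here is a minimal sketch using scikit-learn; the toy dataset and the choice of LogisticRegression and KMeans are my own illustrative assumptions, not models discussed in the paper.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Toy dataset: two numeric features per sample.
X = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
y = [0, 0, 1, 1]  # true labels, available only in the supervised case

# Supervised training: the labels y are part of the training data.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.1, 0.2], [0.95, 0.9]]))  # expected to predict 0 and 1

# Unsupervised training: no labels; the algorithm decides how to group the data.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)  # cluster assignments discovered from the data alone
```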
A key performance stress factor for training and inference platforms is the level of workload concurrency introduced by parallel incoming requests. Examples include:

  • In the training phase, when several users share the same training platform and concurrently submit intensive training tasks.
  • In the inference phase, when several incoming detection requests are likely to have to be processed at the same time (as illustrated in the sketch after this list).
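
As an illustration of the concurrency stress factor, the following minimal Python sketch submits several inference requests in parallel and records per-request latency; the `model_predict` function, the thread-pool size, and the request payloads are hypothetical placeholders, not part of the paper.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def model_predict(request):
    """Hypothetical stand-in for a deployed model's inference call."""
    time.sleep(0.05)  # simulate 50 ms of model compute
    return {"request_id": request["id"], "label": "example"}

def timed_predict(request):
    """Run one inference request and measure its wall-clock latency."""
    start = time.perf_counter()
    result = model_predict(request)
    latency = time.perf_counter() - start
    return result["request_id"], latency

requests = [{"id": i, "payload": f"sample-{i}"} for i in range(16)]

# Simulate several detection requests arriving and being served concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    latencies = dict(pool.map(timed_predict, requests))

print(f"max latency under concurrency: {max(latencies.values()):.3f} s")
```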

More Performance Considerations for the Training Phase

  • The preparatory stage, which generally includes data gathering, data preparation and annotation/labeling of the dataset, the choice of model and its underlying characteristics, feature engineering, and the technical choices for developing and running the training (ML/DL framework, programming language, physical platform).
  • The iterative stage of model training and optimization of the hyper-parameters (parameters of the chosen model and of the training algorithm) until the desired model accuracy is reached (a minimal sketch of such a loop follows this list).
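
The iterative stage can be pictured as a simple search loop: train a candidate model for each hyper-parameter setting and stop once the desired accuracy is reached. The sketch below uses scikit-learn purely for illustration; the model, the grid of candidate values, and the 0.95 accuracy target are assumptions of mine, not choices made in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

target_accuracy = 0.95  # assumed target quality for this illustration
best = None

# Iterate over candidate hyper-parameter settings until the target is met.
for n_estimators in (10, 50, 100, 200):
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if best is None or acc > best[1]:
        best = (n_estimators, acc)
    if acc >= target_accuracy:
        break

print(f"best setting: n_estimators={best[0]}, validation accuracy={best[1]:.3f}")
```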

More Performance Considerations for the Inference Phase
The inference phase depends on the target deployment platform, which ranges from data center servers to edge devices (such as IoT devices, cameras, and cars). The deployment platform affects:

  • model reductions to adapt to the platform hardware footprint (memory size, compute capacity, energy consumption)
  • response time performance improvements
  • the integration of data pre-processing and post-processing logic
  • extra logical processing in the workflow (including calls to other models), followed by packaging in a secured format compatible with the target deployment platform (a minimal measurement sketch follows this list).
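
As a concrete illustration of two of these platform-dependent factors (hardware footprint and response time), the sketch below measures the serialized size of a model and its average single-request inference latency; the RandomForest model and the request count are illustrative assumptions of mine, not benchmark definitions from the paper.

```python
import pickle
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a small throwaway model to stand in for the deployed model.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Footprint: size of the serialized model, a rough proxy for memory/storage needs.
model_bytes = len(pickle.dumps(model))
print(f"serialized model size: {model_bytes / 1024:.1f} KiB")

# Response time: average latency of single-sample inference requests.
sample = X[:1]
n_requests = 200
start = time.perf_counter()
for _ in range(n_requests):
    model.predict(sample)
avg_latency_ms = (time.perf_counter() - start) / n_requests * 1000
print(f"average inference latency: {avg_latency_ms:.2f} ms per request")
```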

In summary, three AI tasks are good candidates for platform benchmarking:

  • Model Training
  • Hyper-parameter Optimization
  • Deployed Model Inference Run-time

More details appear in three tables in the paper, which list the important performance indicators and the potential technical bottlenecks in standalone and concurrent scenarios:

  • Model Training Performance Considerations for Enterprises
  • Hyper-parameter Optimization Performance Considerations for Enterprises
  • Deployed Model Inference Runtime Performance Considerations for Enterprises

3. Desired Enterprise AI Metrics

Examples of industry-standardized benchmarks are those developed and maintained by the Transaction Processing Performance Council (TPC) in the database domain and by the Standard Performance Evaluation Corporation (SPEC) in the compute domain. SPEC defines a computer benchmark as a known set of operations, with the following characteristics, by which computer performance can be measured:

  • Specifies a workload
  • Produces at least one metric
  • Is reproducible
  • Is portable
  • Is comparable
  • Has run rules

The huge diversity of features, and the sometimes conflicting needs, of existing and future AI models clearly show that measuring a single parameter for a given benchmark set is not sufficient and would not objectively reflect the computation capabilities of an AI system. The paper recommends considering multiple metrics, such as:

  • Real-time latency
  • Computation accuracy
  • Convergence speed
  • Computation time
  • Computation efficiency
  • Hardware resource consumption
  • Thermal conditions
  • Power capping
  • Energy consumption

Because no single measured parameter can capture these diverse and sometimes conflicting needs, the scoring value must be determined as a function of several measured characteristics. In this section, the paper motivates the following formula, which combines accuracy and training time into a single score:

eval(time, accuracy) = −log(1−accuracy)/time
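
Here is a minimal sketch of this scoring function in Python, with two made-up (time, accuracy) pairs just to show how the score rewards higher accuracy and penalizes longer training time (the numbers are illustrative, not results from the paper):

```python
import math

def eval_score(time_s: float, accuracy: float) -> float:
    """Score from the paper's formula: -log(1 - accuracy) / time.

    Higher accuracy (closer to 1.0) and shorter training time both
    increase the score; accuracy must be strictly below 1.0.
    """
    return -math.log(1.0 - accuracy) / time_s

# Illustrative values only:
print(eval_score(time_s=100.0, accuracy=0.90))  # ~0.0230
print(eval_score(time_s=100.0, accuracy=0.99))  # ~0.0461, better accuracy, same time
print(eval_score(time_s=200.0, accuracy=0.99))  # ~0.0230, same accuracy, twice the time
```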

4. Existing AI Benchmarks – and the Metrics Gap

The following popular AI benchmarks are described briefly, along with the metrics gap each leaves for enterprise AI.

DeepBench: This benchmark targets low-level operations that are fundamental to deep learning, such as matrix multiplication, convolutions, and communication, and aims to identify the most appropriate hardware; however, it does not consider time-to-accuracy.

TensorFlow: The TensorFlow performance benchmarks are similar to DeepBench in that they identify the most appropriate hardware, but they do not currently measure time-to-accuracy. They are also tied to the TensorFlow framework.

DAWNBench: DAWNBench allows different deep learning methods to be compared by running a number of competitions. It was the first major benchmark suite to examine end-to-end deep learning training and inference. It does not address data preparation or hyper-parameter optimization work.

MLPerf: MLPerf defines its primary metric as the wall-clock time to train a model to a target quality, which is often hours or days (a sketch of measuring time-to-accuracy follows the list below). The target quality is based on current state-of-the-art published results, less a small delta to allow for run-to-run variance. MLPerf does not address hyper-parameter optimization or data preparation.

  • The MLPerf Closed Model Division specifies the model to be used and restricts the values of the hyper-parameters (batch size, learning rate, etc.) that can be tuned, in order to enforce a fair and balanced comparison of the hardware and software systems.
  • The MLPerf Open Model Division only requires that the same task be achieved using the same data, and imposes fewer restrictions.
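
To illustrate the time-to-accuracy idea behind MLPerf's primary metric, the sketch below trains a model in small increments and records the wall-clock time at which a target validation accuracy is first reached; the SGDClassifier, the 0.90 target, and the epoch budget are my own illustrative assumptions, not MLPerf rules.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

target_quality = 0.90       # assumed target accuracy for this illustration
model = SGDClassifier(random_state=0)
classes = np.unique(y_train)

start = time.perf_counter()
time_to_accuracy = None
for epoch in range(50):     # epoch budget is arbitrary
    model.partial_fit(X_train, y_train, classes=classes)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc >= target_quality:
        time_to_accuracy = time.perf_counter() - start
        break

if time_to_accuracy is not None:
    print(f"reached {target_quality:.0%} accuracy in {time_to_accuracy:.2f} s "
          f"({epoch + 1} epochs)")
else:
    print("target quality not reached within the epoch budget")
```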

5. Summary

In conclusion, the paper identifies the following areas as important to enterprises concerned about performance of their AI applications:

1. Model training performance
  • data labeling / preparation
  • time-to-accuracy
  • computational time / cycles
  • throughput-to-accuracy
2. Hyper-parameter optimization performance
3. Inference runtime performance

The priority and importance of these parameters will depend on each enterprise's needs and expectations.

