An introduction to Presto, a query engine that powers watsonx.data

The watsonx platform for next-generation enterprise AI demonstrates IBM's continued commitment to both contributing to and employing open source technologies in its offerings.

Watsonx.data, the data store component of watsonx, leverages trusted open source technologies at every level of its stack. Watsonx.data is built on an open data lakehouse architecture, which provides the best of both data warehouse and data lake architectures. This data lakehouse has the high performance of a data warehouse containing structured data, but also provides flexibility in storing a combination of structured data, unstructured data, and semi-structured data from many sources.

One of the open source query engines that enables watsonx.data to provide fast, reliable, and efficient processing of data at scale is Presto.

What is Presto?

Presto is a query engine, which is a piece of software that sits on top of the underlying data storage architecture and fulfills requests for data by optimizing the data retrieval process. More specifically, Presto is a distributed query engine for fast SQL-based data lake analytics.

Presto was open-sourced in 2019 when it was donated to the Linux Foundation and is under the open source governance of the Presto Foundation.

Presto is flexible and supports querying across diverse sources, including both structured relational databases, such as MySQL and PostgreSQL, and unstructured and semi-structured NoSQL data sources, such as MongoDB and HBase. Likewise, Presto supports many different data formats, such as ORC, Avro, Parquet, CSV, JSON, and more.

Presto uses what it calls 'connectors' to integrate with this wide range of external data sources. Any data source can be queried as long as the data source adapts to the API expected by Presto. This makes Presto extremely flexible and extensible.

Presto architecture

Presto is able to quickly query across many petabytes of data stored within these disparate data sources because its architecture is based on the paradigm of massively-parallel processing.

Two types of nodes make up the Presto architecture: a coordinator and one or more workers that process data in parallel. We can get an understanding of the responsibilities of each node by tracing a submitted query.

Coordinators

First, a user submits a SQL statement via a client, such as a command-line interface. The coordinator then parses this query into a format that Presto understands and performs a series of validations on the query. If validation passes, the coordinator makes an optimized work plan, splitting the query into several stages of work to be completed one after the other. These stages are then further broken down into several tasks.

a Presto query plan

Workers

After ensuring that an appropriate number of resources are available, the coordinator delegates these tasks to the worker nodes. Worker nodes process their tasks in parallel, using the relevant connector to access the underlying data source.

The connector used can vary across workers, depending on how the query was optimized and what data sources need to be accessed. As the worker nodes process their tasks, the coordinator continually monitors them using heartbeat signals. Once workers are done, the result of the tasks is sent back to the coordinator.

The coordinator can then assign workers new tasks from any remaining query stages. Once all stages are complete, the coordinator compiles the results from each stage into the final form required by the original query.

Pipelining the query stages across the network in this way ensures that any unnecessary I/O overhead is avoided. Additionally, all processing occurs in-memory, and intermediate data at the task level is stored in a buffer cache.

All of these features ensure that Presto remains extremely performant, even at petabyte sizes.

Summary and next steps

Presto is a distributed query engine that provides fast, reliable, and efficient processing of diverse data at scale.

In addition to the efficiency and inherently scalable nature of Presto, watsonx.data also allows users to spin up multiple Presto engines depending on workload needs. This means that users are able to choose the right engine for the right workload and at the right quantity, all powered by trusted open-source technologies.

If you want an enterprise-grade platform for querying vast amounts of diverse data with SQL, be sure to check out watsonx.data, or explore more articles and tutorials about watsonx.