Published December 13, 2018
Apache Spark works with many data sources, including object stores such as Amazon S3, IBM Cloud Object Storage (which exposes an S3 API), OpenStack Swift, and Azure Blob Storage. To access an object store, Spark uses Hadoop modules that contain drivers for the various object stores. Using Hadoop drivers as object store connectors is not unique to Spark. Rather, it’s a common approach in many big data projects, such as Alluxio, MapR, or even Hadoop MapReduce itself.
Because Hadoop modules and drivers are designed for file systems rather than object stores, they contain flows and operations that are not native to object stores. As a result, Hadoop and Apache Spark issue unnecessary calls when they interface with an object store. For example, Hadoop writes to temporary files and folders on every write operation and then renames, copies, and deletes them. On an object store, this leads to dozens of useless requests. Hadoop flows therefore need to be made more object-store-native.
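The cost of a rename-based write flow can be illustrated with a toy model. The sketch below is a simplified assumption, not Hadoop's actual committer code and not Stocator's implementation: `CountingObjectStore`, the temporary path layout, and both write functions are hypothetical names invented for illustration. It relies on one well-known property of S3-style stores, namely that they have no native rename, so a rename has to be emulated as a copy followed by a delete.

```python
from collections import Counter


class CountingObjectStore:
    """Toy in-memory object store that counts the requests it receives."""

    def __init__(self):
        self.objects = {}
        self.requests = Counter()

    def put(self, key, data):
        self.requests["PUT"] += 1
        self.objects[key] = data

    def copy(self, src, dst):
        self.requests["COPY"] += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests["DELETE"] += 1
        del self.objects[key]

    def rename(self, src, dst):
        # No native rename on an object store: emulate it as copy + delete.
        self.copy(src, dst)
        self.delete(src)


def write_with_temp_renames(store, parts):
    """File-system-style flow (assumed layout): temp write, then two renames."""
    for name, data in parts.items():
        attempt = f"out/_temporary/attempt/{name}"
        task = f"out/_temporary/task/{name}"
        store.put(attempt, data)
        store.rename(attempt, task)        # stands in for a task commit
        store.rename(task, f"out/{name}")  # stands in for a job commit


def write_direct(store, parts):
    """Object-store-native stand-in: one PUT per part, straight to its final key."""
    for name, data in parts.items():
        store.put(f"out/{name}", data)


parts = {f"part-{i:05d}": b"rows" for i in range(4)}

renaming = CountingObjectStore()
write_with_temp_renames(renaming, parts)

direct = CountingObjectStore()
write_direct(direct, parts)

print("rename-based:", sum(renaming.requests.values()), dict(renaming.requests))
print("direct:      ", sum(direct.requests.values()), dict(direct.requests))
```

In this model, four output parts cost 20 requests in the rename-based flow and 4 in the direct flow, while both stores end up holding the same final objects. Real jobs also issue listing and metadata requests, which the model leaves out.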
To overcome these limitations, we designed Stocator. Stocator is a generic connector that is explicitly designed for object stores and has a very different architecture from existing Hadoop drivers. It doesn’t depend on those Hadoop drivers, and it interacts directly with object stores.
Stocator was released to open source in February 2016 under Apache License 2.0. It’s a Java project and can be built effortlessly with Apache Maven.
Stocator implements the Hadoop FileSystem Interface, so it can be easily integrated into other projects that expect this interface. For example, you can use Stocator with Spark, Hadoop, or Alluxio without any modifications to the projects.
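Because integration happens through the Hadoop FileSystem interface, wiring Stocator into Spark is mostly a matter of configuration. The sketch below assembles the Hadoop properties as plain Python data, so it runs without Spark installed. The property names and class names follow the pattern documented in the Stocator README for IBM Cloud Object Storage and should be verified against the release in use. The service name, endpoint, and credentials are placeholders, and the helper function names are hypothetical.

```python
def stocator_hadoop_conf(service, endpoint, access_key, secret_key):
    """Hadoop properties that register Stocator for the cos:// scheme.

    Assumption: keys follow the Stocator README; check them for your version.
    """
    return {
        "fs.stocator.scheme.list": "cos",
        "fs.cos.impl": "com.ibm.stocator.fs.ObjectStoreFileSystem",
        "fs.stocator.cos.impl": "com.ibm.stocator.fs.cos.COSAPIClient",
        "fs.stocator.cos.scheme": "cos",
        f"fs.cos.{service}.endpoint": endpoint,
        f"fs.cos.{service}.access.key": access_key,
        f"fs.cos.{service}.secret.key": secret_key,
    }


def as_spark_conf(hadoop_conf):
    """Prefix each key with 'spark.hadoop.' so Spark forwards it to Hadoop."""
    return {f"spark.hadoop.{key}": value for key, value in hadoop_conf.items()}


def cos_uri(bucket, service, path):
    """Build a URI of the assumed form cos://<bucket>.<service>/<path>."""
    return f"cos://{bucket}.{service}/{path}"


# Placeholder values for illustration only.
spark_conf = as_spark_conf(
    stocator_hadoop_conf(
        service="myservice",
        endpoint="https://s3.example.com",
        access_key="ACCESS_KEY",
        secret_key="SECRET_KEY",
    )
)

for key in sorted(spark_conf):
    print(key)
print(cos_uri("mybucket", "myservice", "data/events.parquet"))
```

With PySpark available and the Stocator JAR on the classpath, each pair could be passed to `SparkSession.builder.config(key, value)`, after which a path built by `cos_uri` can be handed to the usual readers and writers without changes to Spark itself.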
Stocator is a production-ready connector with wide usage and a growing community. The contribution process is straightforward, and contributing code to Stocator is a good way to gain experience and a deeper understanding of how Apache Spark works with object stores.