Quickly set up a development environment to use Stocator to connect with Apache Spark.
Stocator is a connector for object stores. It has a Swift driver to integrate with Bluemix, SoftLayer, or any other OpenStack Swift API-based object store.
Apache Spark works with multiple data sources, including object stores such as Amazon S3, OpenStack Swift (for example, IBM SoftLayer), Azure Blob Storage, and more. To access an object store, Spark uses Hadoop modules that contain drivers for the various object stores. Using Hadoop drivers as object store connectors is not unique to Spark. Rather, it’s a common approach in many Big Data projects, such as Alluxio, MapR, or even Hadoop MapReduce itself.
Because Hadoop modules and drivers are designed to work with file systems and not object stores, they contain flows and operations that are not native to object stores. As a result, Hadoop and Apache Spark make unnecessary calls when they interface with an object store. For example, the temporary files and folders that Hadoop creates for every write operation are renamed, copied, and deleted, which leads to dozens of useless requests targeted at the object store. There is a clear need to make Hadoop flows more object-store-native.
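To make the cost of this pattern concrete, here is a minimal sketch that counts requests against a toy in-memory object store. It is an illustration under stated assumptions, not Hadoop's actual code: the class, the function names, the paths, and the two-step rename are all hypothetical simplifications. The one property it relies on is that an object store has no native rename, so each rename becomes a copy followed by a delete.

```python
class CountingObjectStore:
    """Toy in-memory object store that counts every request it receives."""

    def __init__(self):
        self.objects = {}
        self.requests = 0

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]

    def rename(self, src, dst):
        # No native rename on an object store: emulate it with copy + delete.
        self.copy(src, dst)
        self.delete(src)


def write_via_temp(store, parts):
    """File-system-style write: temp object first, then two renames.

    Assumption: the two renames stand in for a task-level and a job-level
    commit. Real flows also issue listing and directory-marker requests,
    which this sketch leaves out.
    """
    for name, data in parts.items():
        tmp = f"out/_temporary/attempt_0/{name}"
        task = f"out/_temporary/task_0/{name}"
        store.put(tmp, data)
        store.rename(tmp, task)
        store.rename(task, f"out/{name}")


def write_direct(store, parts):
    """Object-store-native write: one PUT per part, straight to the final name."""
    for name, data in parts.items():
        store.put(f"out/{name}", data)


PARTS = {"part-00000": b"a", "part-00001": b"b", "part-00002": b"c"}

legacy = CountingObjectStore()
write_via_temp(legacy, PARTS)

direct = CountingObjectStore()
write_direct(direct, PARTS)

print("temp-and-rename requests:", legacy.requests)
print("direct-write requests:", direct.requests)
```

Both stores end up holding the same three objects, but the temp-and-rename flow needs five requests per part where the direct write needs one. In a real deployment the gap is wider, because each copy also moves the object's data inside the store.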
To overcome these limitations, we designed Stocator. A generic connector, Stocator is explicitly designed for object stores and has a very different architecture from the existing Hadoop drivers: it doesn’t depend on them, and it interacts directly with object stores.
What technology problem will I solve?
Stocator was released as open source in February 2016 under the Apache License 2.0. It’s a Java project that builds with Apache Maven. Stocator implements the Hadoop FileSystem interface, so it can be easily integrated into other projects that expect this interface. For example, you can use Stocator with Spark, Hadoop, or Alluxio without any modifications to those projects.
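Because Stocator plugs in through the Hadoop FileSystem interface, wiring it into Spark is mostly a matter of configuration. The sketch below builds the Spark properties for a Swift-backed service as a plain dictionary. Treat the details as assumptions to verify against the Stocator README for the version you build: the `swift2d` scheme, the `com.ibm.stocator.fs.ObjectStoreFileSystem` class name, and the `fs.swift2d.service.<name>.*` keys are written here from memory of that README. The helper names, the service name `demo`, the endpoint, and the credentials are hypothetical placeholders.

```python
def stocator_swift_conf(service, auth_url, tenant, username, password, region=None):
    """Return Spark properties that register Stocator for swift2d:// URIs.

    Spark copies any property prefixed with "spark.hadoop." into the Hadoop
    configuration. The key names after that prefix are assumptions taken
    from the Stocator README and may differ between versions.
    """
    prefix = f"spark.hadoop.fs.swift2d.service.{service}"
    conf = {
        "spark.hadoop.fs.swift2d.impl": "com.ibm.stocator.fs.ObjectStoreFileSystem",
        f"{prefix}.auth.url": auth_url,
        f"{prefix}.tenant": tenant,
        f"{prefix}.username": username,
        f"{prefix}.password": password,
    }
    if region is not None:
        conf[f"{prefix}.region"] = region
    return conf


def swift2d_uri(container, service, path):
    """Build a URI of the assumed form swift2d://<container>.<service>/<path>."""
    return f"swift2d://{container}.{service}/{path.lstrip('/')}"


# Hypothetical placeholder values, for illustration only.
conf = stocator_swift_conf(
    service="demo",
    auth_url="https://identity.example.com/v3/auth/tokens",
    tenant="my-tenant",
    username="my-user",
    password="my-password",
    region="dallas",
)

for key in sorted(conf):
    # Avoid echoing the secret when listing the settings.
    shown = "***" if key.endswith(".password") else conf[key]
    print(f"{key}={shown}")

print(swift2d_uri("logs", "demo", "/2016/02/events.json"))
```

Each key and value can then be passed to Spark, for example as a `--conf key=value` argument to `spark-submit`, with the Stocator JAR on the driver and executor classpath. After that, Spark code reads and writes `swift2d://` paths like any other Hadoop-compatible path.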
Why should I contribute?
Stocator is still in its infancy as a project but has an active and growing community of followers. The contribution process is straightforward, and contributing code is a great way to improve your skills, gain experience, and develop a deeper understanding of how Spark works with object stores.