13 challenges creating an open, scalable, and secure serverless platform
What we learned during the evolution of the Knative open source project
Serverless is the natural evolution of cloud computing. In essence, serverless comes down to two main features: (1) you “pay by the drink” for all computing resources and (2) you get more fine-grained scaling than you would from larger workloads. However, taking full advantage of this extended computing model requires developers to restructure apps and services into components that can scale down to zero when not needed.
Microservices architectures are a step in the right direction, and Kubernetes (K8s) is a promising and popular platform for managing the containers that run those microservices. However, Kubernetes by itself is not sufficient to meet the needs of serverless workloads, and the layers on top of the base platform should not have to be reinvented by every team. Enter Knative in 2018 as a common serverless layer on top of K8s.
Various open source serverless platforms emerged over the last few years, such as OpenFaaS and Apache OpenWhisk. They use components similar to K8s or build upon them, but none was built natively for Kubernetes. Knative emerged from Google in 2018 as a set of custom resources that extend K8s to run serverless workloads. Knative provides two core components: serving and eventing. Originally, Knative included a build component, but it was spun out as its own project in the first year and is now called Tekton.
State of Knative
Since its inception (and as of this writing) Knative went through 21 iterations that refined the core components, added new ones, and provided a process for extensions. In particular, the eventing component evolved to provide a complete eventing substrate with channels, brokers, and sources (abstractions that represent event providers). The event sources are pluggable with more than 20 sources now available.
The Knative governance evolved from one that was controlled by Google to one that is more open and inclusive. In particular, it addressed two thorny issues: how to steer the evolution of the project with input from all participants, and the legal matter of licensing the Knative trademark, which is owned by Google. To address the latter issue, a new Trademark Committee was established with the purpose of defining what the Knative name means and when it can or cannot be used.
As contributors to Knative since it was made public in 2018, we saw the project evolve and we can highlight “challenges” (or roadblocks) that we encountered over time. We hope that documenting these challenges serves as a record and, more importantly, as a resource for future open source projects and for other teams as they attempt similar efforts. We differentiate three groups of challenges:
- Challenges that are purely technical in nature and are not part of the Knative project itself, but relate to the dependencies that Knative has (for example, K8s and Istio). We call these platform challenges.
- Other technical challenges that are part of the Knative project itself. We call these component challenges.
- Challenges that are not technical, but are organizational or of a governance nature.
Knative is built on top of K8s. You can think of Knative as a set of layers on top of K8s that together add the ability to run serverless workloads. These layers are created as custom resources or Custom Resource Definitions (CRDs). Since K8s requires a networking mesh to properly enable the execution of various workloads, Knative requires one. However, the Knative developers made that mesh requirement flexible to allow different mesh layers to plug in, which is useful since not all features of a K8s mesh are used in Knative.
During the time we worked with Knative, we identified two challenges that directly relate to the underlying platform. These surfaced midway through our development and adoption, as we started to test at scale the resulting cloud product, IBM Cloud Code Engine, which is built on Knative, Kubernetes, and Istio. Let’s explore the two challenges that we can extract from our experience.
Challenge 1. Scalability
When you test middleware for scalability, you need close-to-final release bits to avoid early optimizations and keep a gauge on whether scalability requirements are met. As the IBM Cloud Code Engine product neared its first public beta, our teams in Stuttgart and Beijing increasingly ran various benchmarks to validate lower-bound performances of the embedded Knative. Months of back-and-forth transpired to identify the root cause of identified issues and the worldwide team collaborated to start resolving them.
The first important result from the Beijing tests was that the startup time for Knative services suffered as the number of services that were created per cluster increased. In addition, the startup time for services (even one) was not on par with our previous serverless project, IBM Cloud Functions.
The slow startup time was consistent across Knative releases and the team showed that the startup time regularly reached peaks of tens of seconds as the number of services reached near the thousands. A corollary to this result was that this slow startup time varied when a different network ingress was used, such as Istio, Contour, or Kourier. In particular, test installations with Istio suffered the most.
Noticing the persistent Istio results meant we had to engage that community for help. After numerous rounds to replicate and prove that the problem existed, the IBM team working on the Istio project was able to identify the root cause of the problem and provided a working fix in the Istio 1.7.x code base. The patch is now in flight and under various revisions.
We have yet to achieve the full scale and full performance of existing, non-Kubernetes-native solutions, such as the IBM Cloud Functions environment. However, we continue to make progress toward closing this gap.
Challenge 2. Improvements
Creating a non-trivial layer such as Knative on top of an existing platform, in particular, one that is as thorough and complex as Kubernetes, is bound to reveal impedance mismatches. These mismatches can exist in places where you might need to modify and improve the lower components to implement the one on top. Surprisingly, we encountered few places where adding a serverless layer such as Knative exposed issues underneath. Let’s explore two areas.
First, for autoscaling, the faster that K8s information about running pods is relayed, the faster the Knative components can make decisions. Currently, Kubernetes does much of its data collection with probes that run on a periodic schedule. Unfortunately, the probe period cannot be set to sub-second intervals. The ability to run on a faster schedule might enable Knative to make scaling decisions faster. We are working to submit a Kubernetes Enhancement Proposal to solve this precise issue.
As mentioned in Challenge 1, we determined places where Istio might be improved and we are finding more. For instance, we discovered during recent scalability testing that Istio must be configured in a specific fashion to avoid Sidecars-to-Pilot communications being overwhelmed when many Knative services exist. We are working with the Istio community to better understand suitable defaults, and to improve the internal communications of Istio components while increasing scalability.
Component (or Knative layer) challenges
As we mentioned, Knative is implemented as layers on top of Kubernetes, mainly by adding the bits needed to implement serverless workloads. These layers can be decomposed neatly into three components plus one recent catchall addition. The components are: serving, eventing, building, and now, extensions. Knative started with the first three components and quickly spun off the building layer into its own open source project (Tekton). In time, as various other components emerged that did not fit neatly into the remaining components, they formed another one that we now call extensions.
While the core Knative architecture remained solid over the past year, it still evolved. In particular, the eventing component went through a complete “refactoring” to better delineate event sources and event sinks, and to make these concerns decoupled and pluggable. The serving component evolved through point improvements as the different companies that are involved in Knative development put more of their own (and their customers’) esoteric serverless workloads to the test.
In this section, we describe eight challenges that we encountered as we tried to use Knative as the foundation for the IBM Cloud Code Engine online platform. Additionally, we discuss the extensions component, which we helped to pioneer with the addition of the command-line interface (CLI) and various plug-ins that came directly from our adoption of Knative itself.
Challenge 3. Autoscaling
Various challenges arose in the design of the autoscaling subsystem of Knative. After all, the ability to scale workloads down to zero and quickly scale them back up as demand arrives is arguably the most important trait of a serverless architecture. While the Knative team worked on a solution to automatically scale pods of deployed services by monitoring their activities and incoming traffic, an early and seemingly simple design decision caused a fundamental point of disagreement with our expected serverless architecture.
Briefly, the original Knative autoscaling component assumed that deployed services would always want to start a pod on deployment. This meant that the system could assume that the deployed image was usable and “runnable” before it attempted to create scaling constraints around its execution. In essence, services deployed to Knative always scaled to one first and then either scaled up to more pods or scaled down to zero when incoming requests stopped flowing.
While usable, this method meant that deployed workloads always incurred execution time whether it was needed or not (which means users might be “charged” even if their service never receives a request). The previous IBM serverless system made the opposite assumption: deployed workloads were scaled purely on demand and not by virtue of being deployed. This started an effort by the IBM team to submit changes to Knative autoscaling that would enable “scale to zero on deploy.” After various attempts, and even after considering a change to the way Knative collects metrics on pods, the feature was accepted and is now an integral part of the current releases.
The Knative Autoscaler differs from the Horizontal Pod Autoscaler (HPA or the default Kubernetes autoscaler) in a number of ways. While the HPA expects to receive metrics such as CPU and memory usages from the Kubernetes system, the Knative Autoscaler is tightly integrated with the Knative system itself. It is tuned for request-based autoscaling and for fast responsiveness to changes in request count.
Unlike a traditional, non-serverless system, the Knative Autoscaler must be optimized for bursty workloads. For example, if a service is used to respond to GitHub events, it might have zero requests for a long time and then many requests in a short period. Traditional autoscaling systems and strategies struggle to correctly respond to these sorts of workloads. We added various features to the Knative Autoscaler to help it handle these types of stochastic use cases. One example is a ScaleDownDelay configuration that controls how long the Autoscaler waits after it sees a reduced request count before it scales down replicas.
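To make the idea of a scale-down delay concrete, here is a minimal sketch in Python. It is not the Knative Autoscaler implementation; the class name, the tick-based window, and the target of 10 in-flight requests per replica are all assumptions chosen for illustration. The sketch scales up as soon as load rises but scales down only after every sample inside the delay window agrees on the lower value.

```python
import math
from collections import deque


class DelayedAutoscaler:
    """Request-based scaler that postpones scale-down (hypothetical, illustrative only)."""

    def __init__(self, target_per_replica: int, scale_down_delay_ticks: int):
        self.target = target_per_replica
        # Keep the current sample plus one sample per tick of delay.
        self.window = deque(maxlen=scale_down_delay_ticks + 1)

    def desired_replicas(self, in_flight_requests: int) -> int:
        # Replicas needed right now for the observed concurrency.
        instant = math.ceil(in_flight_requests / self.target)
        self.window.append(instant)
        # Taking the maximum over the window means a burst is honored
        # immediately, while a drop only takes effect once the burst
        # has aged out of the window.
        return max(self.window)


scaler = DelayedAutoscaler(target_per_replica=10, scale_down_delay_ticks=2)
history = [scaler.desired_replicas(n) for n in [0, 95, 0, 0, 0]]
print(history)
```

In this run the burst of 95 requests raises the replica count at once, and the count stays up for two further ticks before it returns to zero, which avoids tearing down pods in the middle of a bursty stream of events.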
Challenge 4. Event autoscaling
Knative Eventing has autoscaling challenges around two areas, push-based event processing and pull-based event processing.
Push-based event processing is when events are “pushed” into eventing components. When events are pushed over HTTP (such as webhooks), Knative Serving autoscaling can be used. For example, see the Knative GitHub Source, which is implemented by using Knative Serving to provide an HTTP webhook to receive events. However, events might arrive over protocols other than HTTP. This leads to the first challenge: how to autoscale Knative Eventing components when events are received over non-HTTP protocols.
The following diagram shows an event source that sends events using HTTP connections to eventing components. Since HTTP is used, scaling of components can be achieved by using Knative Serving Autoscaler to monitor HTTP connections.
Pull-based event processing is when eventing components pull events from eventing systems directly. In such cases, autoscaling is more complicated because the eventing component must scale based on metrics from the event producer (such as number of events that must be pulled).
The following diagram shows an event source that uses a non-HTTP protocol (such as Kafka) to send events to the eventing components. These components are scaled by a non-Knative autoscaler such as KEDA, which monitors a metric of the event source (for example, the number of messages in a Kafka topic).
For pull-based event processing, the Kubernetes Event-driven Autoscaling (KEDA) project might provide an autoscaling solution as it allows us to define triggers based on the eventing system metrics. KEDA triggers are used to scale eventing components. This approach was explored and a prototype is available in Knative Eventing.
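The scaling decision behind a pull-based trigger can be sketched as follows. This is a simplified illustration and not KEDA's actual algorithm or API; the function name, its parameters, and the default bounds are hypothetical. The assumption is that the trigger reports a backlog (for example, unread messages in a Kafka topic) and that each consumer replica should handle at most a fixed share of that backlog.

```python
import math


def replicas_for_lag(total_lag: int, lag_threshold: int,
                     min_replicas: int = 0, max_replicas: int = 10) -> int:
    """Return how many consumers to run for a given backlog (illustrative only)."""
    if total_lag <= 0:
        # Nothing to pull, so fall back to the floor (zero allows scale to zero).
        return min_replicas
    # One replica per lag_threshold messages of backlog, rounded up.
    wanted = math.ceil(total_lag / lag_threshold)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(wanted, max_replicas))


for lag in (0, 250, 5000):
    print(lag, replicas_for_lag(lag, lag_threshold=100))
```

The key difference from request-based scaling is the input: the metric comes from the event producer rather than from traffic that reaches the component.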
In the future, Knative Serving might support protocols other than HTTP and that might help with push-based autoscaling. However, until other protocols are supported, the most straightforward solution is to use a pull-based approach for eventing systems that do not support HTTP push.
Another autoscaling challenge is related to increasing density and sharing for eventing components across namespaces. In such scenarios, instead of having every namespace own all eventing data plane components, they share system-level components, such as channels or brokers, that autoscale and deliver events to different namespaces.
Challenge 5. Async requests
When you design a decoupled distributed system, it is necessary to create loose communications between components. Such an approach is increasingly important when the system needs to scale.
The several ways to achieve these loose communications include an event bus and asynchronous calls between components. The latter is a common pattern where the request from one component to the other does not block or wait for the response from the server, but instead returns immediately. The result of the call is determined afterward through an agreed-upon strategy, such as a callback or a shared database.
In the initial release of Knative Serving, service requests are synchronous by design. Any request to a Knative service is blocked until the service completes the request. It is not always reasonable or expected for services to run in a blocking or synchronous manner. What if the service does computation, or work, that takes up to 10 or 15 minutes? What if the user does not care about immediately receiving the result of a particular function? What if the function ultimately writes to a shared database that can be inspected later?
This type of user experience drove us to begin conversations with the Knative community about support for asynchronous service requests as a first class invocation model for all services. We collaborated with the Knative community to build a proposal and prototype for this asynchronous service support.
This add-on component enables service calls to be made by including the HTTP header Prefer: respond-async. When this header is provided, the service returns a 202 Accepted response to the user and the request does not have to wait for the service request to complete processing. Powering this feature is a new controller that looks for the asynchronous ingress class and then creates a new ingress with the proper routing rules. As the following diagram shows, asynchronous requests are stored in a shared queue, which triggers the queue consumer to make the synchronous request to the service at a subsequent time.
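The request flow described above can be sketched in a few lines of Python. This is a toy, in-process model and not the code of the async component: the function names are hypothetical, an in-memory queue stands in for the shared queue, and no real HTTP or ingress is involved. Only the Prefer: respond-async header and the 202 Accepted status come from the description above.

```python
import queue

pending = queue.Queue()  # stands in for the shared queue


def slow_service(payload: str) -> str:
    """The synchronous service that does the real work (placeholder logic)."""
    return payload.upper()


def front_door(headers: dict, payload: str):
    """Return (status, body); defer the work when the caller asks for async."""
    if headers.get("Prefer", "").strip().lower() == "respond-async":
        pending.put(payload)   # store the request for later
        return 202, ""         # answer immediately with 202 Accepted
    return 200, slow_service(payload)


def drain(results: list) -> None:
    """Queue consumer: replay each stored request synchronously."""
    while not pending.empty():
        results.append(slow_service(pending.get()))


status_async, _ = front_door({"Prefer": "respond-async"}, "resize image")
status_sync, body = front_door({}, "ping")
done = []
drain(done)
print(status_async, status_sync, body, done)
```

The caller that sends the header gets its answer before any work runs; the consumer performs the same synchronous call afterward, so the service itself does not need to change.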
We plan to support services that are always asynchronous, in which case the header would not be required. You can follow along with our progress at github.com/knative-sandbox/async-component.
Challenge 6. Event sources
Knative Eventing is the component that enables sourcing, transforming, and consuming of events. The system is flexible enough to allow various sources to be integrated into a deployed Knative installation. Event sources are varied. Examples are existing cloud systems and services such as databases and object storage systems. An event source can also be the platform itself, such as the Kubernetes cluster that Knative runs on, or even the environment where the software is developed, such as GitHub and GitLab.
The challenge for eventing is to enable an easy path for event source integration. First, we need the ability to create and integrate event sources into Knative. Second, to complete the path from source to sink, we need the ability to create brokers and channels that can deliver the events from these sources.
The Knative community is great at creating a varied set of event sources that can be used as examples for similar types of events. However, making these event sources even simpler to create is a work in progress and an important ongoing challenge to ensure the acceptance of Knative.
Challenge 7. Extending
Kubernetes is a resounding success for many reasons. In particular, the platform is extensible by design. Indeed, Knative itself uses K8s extensions by adding a series of CRDs, which together define Knative.
But what about Knative itself? Should it be extensible? One simple answer is yes, it should be. For example, Knative Eventing has defined plug points and templates for adding new sources, brokers, and channels to the system. This extensibility resulted in over 20 different source definitions.
Another aspect of Knative that is extensible by design is the client CLI. It goes one step further and defines a plug-in model and interface. Using this extension point, the community submitted a series of plug-ins for diagnostics, administration, manipulating event sources, and migrating Knative clusters. More are sure to follow this initial set.
Another extension area is defining a function-as-a-service layer for Knative. There are many ways to define this layer, such as Knative client plug-ins, and the community explored how to add this feature. While the effort is not yet successful, mainly because Knative governance first needed refinement, it raised the need for an official model for extending Knative. These efforts pushed for such a model, which is now included in the new governance model that the community adopted in the early fall of 2020.
Challenge 8. Ease of use
The Knative Client working group was created to provide a simpler user interface for Knative. From the start, the goal was to create a CLI that simplifies the user experience, but does not limit a user’s ability to access the full set of features that Knative offers.
The CLI exposes the core functionality of Knative as a set of imperative commands. However, in all cases, the user can export and import the associated YAML file for the objects that are created by hand or that result from running the commands one at a time.
Additionally, as mentioned in Challenge 7, the CLI is extensible with both internal and external plug-ins. Internal plug-ins are built into a special release of the CLI that embeds the plug-in’s command code in the CLI binary. External plug-ins add commands to the CLI dynamically. They are added by placing the plug-in executable file in a known location of the user’s file system.
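The external plug-in mechanism can be illustrated with a short Python sketch of directory-based discovery. It is not the CLI's actual code: the function name is hypothetical, and the "kn-" file name prefix and the use of a single plug-in directory are assumptions made for the example. The sketch assumes a POSIX file system where executability is expressed through permission bits.

```python
import os
import tempfile


def discover_plugins(plugin_dir: str, prefix: str = "kn-") -> list:
    """Return command names for executable files in plugin_dir that start with prefix."""
    found = []
    for name in sorted(os.listdir(plugin_dir)):
        path = os.path.join(plugin_dir, name)
        is_executable = os.path.isfile(path) and os.access(path, os.X_OK)
        if name.startswith(prefix) and is_executable:
            # The command name is the file name without the prefix.
            found.append(name[len(prefix):])
    return found


# Build a throwaway plug-in directory to demonstrate discovery.
with tempfile.TemporaryDirectory() as d:
    for name, mode in [("kn-admin", 0o755), ("kn-notes.txt", 0o644), ("other", 0o755)]:
        path = os.path.join(d, name)
        with open(path, "w") as f:
            f.write("#!/bin/sh\n")
        os.chmod(path, mode)
    plugins = discover_plugins(d)
print(plugins)
```

Because discovery depends only on a file's name, location, and executable bit, a plug-in can be written in any language and installed without rebuilding the CLI.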
Challenge 9. Upgrade and rollback support in the Knative Operator
The Knative Operator provides the capabilities to install, manage, and configure Knative components. Before the 0.16 release, one Knative Operator version supported exactly one Knative version. If you needed to upgrade or roll back Knative, you had to install the newer or older version of the Knative Operator. With its 0.16 release, the Knative Operator made a breakthrough by implementing support for multiple Knative versions.
Knative applies semantic versioning in terms of major.minor.patch version numbers. There is a “minus 3” principle for minor numbers: the earliest version that the Knative Operator supports is the current minor version minus 3. For example, Operator at 0.18 supports Knative from 0.15 to 0.18. This feature was introduced in Knative Operator 0.16.0. You can specify the version of the Knative component with the spec.version field in the Custom Resource.
The Knative Operator tries to make the Knative experience much easier. Knative has automated mechanisms so that you don’t have to worry about cleaning up obsolete or useless resources. With multiple-version support, you upgrade or roll back smoothly by changing the spec.version field in the Custom Resource. One rule you need to pay attention to is that the Knative Operator cannot upgrade or roll back across multiple minor versions. You must go one minor version up or down in each step. The patch number can be any number, as long as it is available in the releases.
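The two version rules (the minus-3 support window and the one-minor-version-per-step rule) can be worked through with a small Python sketch. The function names are hypothetical and this is not Operator code; it only models 0.x minor numbers and ignores patch numbers, which the text says can be any released value.

```python
def supported_minors(operator_minor: int, window: int = 3) -> range:
    """Minor versions that an Operator at 0.<operator_minor> can manage."""
    return range(max(operator_minor - window, 0), operator_minor + 1)


def upgrade_path(current_minor: int, target_minor: int, operator_minor: int) -> list:
    """List each spec.version value to apply, moving one minor version per step."""
    allowed = supported_minors(operator_minor)
    if current_minor not in allowed or target_minor not in allowed:
        raise ValueError("version outside the range this Operator supports")
    # Walk up for an upgrade, down for a rollback.
    step = 1 if target_minor >= current_minor else -1
    return [f"0.{m}" for m in range(current_minor + step, target_minor + step, step)]


print(list(supported_minors(18)))   # versions an Operator at 0.18 can manage
print(upgrade_path(15, 18, 18))     # upgrade, one minor version at a time
print(upgrade_path(18, 16, 18))     # rollback, one minor version at a time
```

Each entry in the returned list corresponds to one edit of spec.version, so moving from 0.15 to 0.18 takes three separate changes rather than one.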
Challenge 10. Input formats
As we mentioned earlier, Knative previously had a build component that became the Tekton project. The build component addressed a usability issue with a serverless platform, which is that users tend to bring their applications in different input formats. Knative proper, and K8s itself, require a container image to sit in a container registry as the input format. But what if I have a Git repository with source code and a Dockerfile in it? Or just a Git repository without a Dockerfile?
Moving Tekton to its own separate project had the advantage that it evolved into a general purpose, in-cluster build mechanism. However, while we worked on IBM Cloud Code Engine, we realized that it is a necessary step but not sufficient. To allow for a user experience such as “here is my source code, run it serverless for me,” at least two more ingredients are necessary: an orchestrator component that sits in front of Tekton and a content library that helps to fill up potential gaps during the build process.
Luckily, these missing components can be found inside the open source ecosystem. For example, the orchestrator can be addressed by projects such as Shipwright or KPack. The content library can be addressed by the Cloud Native Buildpacks from the Cloud Native Computing Foundation (CNCF) and an implementation of them such as Paketo.
Combining these into a system allows a serverless platform to start from any input format of an application:
- Pure source code, which then gets enhanced with a buildpack, built, and the build result published to a container registry.
- Source code and a Dockerfile, which then gets built and the build result published to a container registry.
- A ready-made image, which lives in a container registry.
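The three input formats above amount to a small decision procedure, sketched here in Python. The function name, its parameters, and the step labels are hypothetical and do not correspond to the API of Tekton, Shipwright, or any buildpack implementation; the sketch only shows how an orchestrator might pick a path.

```python
def build_plan(has_source: bool, has_dockerfile: bool, image: str = "") -> list:
    """Return the ordered steps needed to get from the given input to a running image."""
    if image:
        # A ready-made image needs no build at all.
        return ["deploy image"]
    if not has_source:
        raise ValueError("need either source code or a container image")
    if has_dockerfile:
        return ["build with Dockerfile", "publish to registry", "deploy image"]
    # Pure source code: a buildpack supplies what the Dockerfile would have.
    return ["apply buildpack", "build", "publish to registry", "deploy image"]


print(build_plan(has_source=True, has_dockerfile=False))
print(build_plan(has_source=True, has_dockerfile=True))
print(build_plan(has_source=False, has_dockerfile=False, image="registry.example/app"))
```

Every path ends with an image in a registry, which is the only input format that Knative and K8s themselves accept.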
Further enhancements in the future are possible, such as a special functions buildpack where the input source is a function source code.
Open source governance challenges
It’s not surprising that a project such as Knative, which tries to address a common and popular approach to cloud-native applications, had growing pains. The early popularity of Knative resulted in significant name recognition even before the project was used in any products or services.
This early popularity increased as companies adopted and started to use the project although the current release is not yet at 1.0. A series of challenges surfaced around governance and extensions to Knative, and how to keep the project vibrant and welcoming. We explore three in particular here.
Challenge 11. Trademark
As with all projects, naming has an important part to play. When Knative was released, it was not easy to predict that the project and the name would gain so much popularity. In many ways, the same might be said of Kubernetes. Unfortunately, name recognition comes with a drawback.
The major issue is a matter of law; here is a brief synopsis of the facts and the impact on the developers and contributors of Knative. Briefly, Google owns the trademark for the Knative name. To maintain and enforce a trademark, the owner must be, or be part of, a legal entity. Since Knative is neither part of an organization nor a legal entity of its own, Google opted to keep and protect the trademark. This can cause issues because it means that one company among over a dozen contributors has unilateral ownership of an important part of the project.
Initially the Steering Committee (SC) controlled the project and the trademark. However, as more contributors made strides in the project and the governance was opened to others, the trademark became a sticking point that kept the community from moving forward. After a few months of heavy debates and participation from the entire community, a solution emerged.
Briefly, the essence of the solution was to separate the control of the project from control of the trademark. The project continues to be steered and controlled by an SC and Technical Oversight Committee (TOC) that are elected from the contributors and serve for a limited time. A secondary committee was formed to maintain control over the trademark. Membership of the Trademark Committee is based on contributions to the project. As of now, there are three members: Google, IBM and Red Hat, and VMware. Their goals are to define and protect the Knative trademark and brand.
Challenge 12. Bureaucracy
All organizations suffer a form of bureaucracy as they grow. Therefore, it’s not surprising that bureaucratic aspects surfaced for Knative in the first year of full development of the project in the open. The specific item that we want to mention is that many TOC seats were filled with non-contributors or by chosen representatives from the founding organizations, instead of people organically elected from the contributors.
This is not necessarily a bad thing, since initially to bootstrap the project you need to choose leadership and stewardship from seasoned and experienced open source developers instead of the ones that contributed to the nascent project. However, how does a project evolve after the initial bootstrap period to result in an inclusive and varied technical committee?
To alleviate this common shortcoming, many open source projects evolve to include elected leaders instead of chosen ones. This issue surfaced many times during the public debates about governance changes. While the issue is not fully resolved, a clear path and goal for the TOC and SC is to favor leaders that are chosen and promoted from the people who contribute directly to the code.
Challenge 13. How to keep Knative welcoming
It’s no secret that the lifeblood of any open source project is its ability to sustain a vibrant group of contributors. Knative is no different. From the start, the leaders of the Knative project did a fantastic job of welcoming and growing a community of diverse participants from large and small organizations.
As of early 2021, the github.com/knative community stands strong at 495 members, 23 repositories, and more than 10 organizations actively participating. And these metrics do not include the emerging github.com/knative-sandbox organization where extensions projects exist.
The reason that we list how to keep Knative welcoming as a challenge is more out of fear than current reality. The issues around trademarks made it clear that as Knative becomes even more successful, there is a risk that the community might be pressured to follow rules from a minority. While the governance updates resolved the immediate issues and significant progress was made, we are not 100% confident that the fundamental issue is resolved.
One missing component of the governance that might alleviate our concerns would be for Knative (and its trademark) to be either spun out into an independent, nonprofit organization or, better, joined to one of the existing open source organizations that accept similar projects, such as the CNCF. While this sounds like a simple solution, the reality is that Knative did not join an independent organization because members of the current steering and technical committees hold different philosophical views on the subject. Therefore, we are not sure that this will happen. Nonetheless, it appears that the recent governance updates are the result of good faith efforts from all involved, so we are hopeful that the current “truce” will hold.
Knative is a maturing platform for serverless workloads on top of K8s. It experienced more than 20 releases, includes mechanisms for extensions, and has a growing, vibrant, open source community.
That said, it is not without its challenges. Some are technical and some are not. In this blog post, we highlighted 13 challenges in three areas. Many of these challenges are resolved and we highlighted the solutions; and a few challenges are in flight, therefore we hinted at the solution in progress.
We feel confident that, while not perfect, the Knative community is technically strong, has a culture rooted in inclusion, and is overall welcoming. These traits should allow it to overcome the challenges we listed and future ones that might emerge. We are cautiously optimistic about a great future for this community and the success of the overall project.
We would like to thank Michael Behrendt, Michael Brown, Nimesh Bhatia, Morgan Bauer, Steve Dake, Andrea Frittoli, Daisy Guo, Roland Huss, Nima Kaviani, Lance Liu, Ying Liu, Ryan Moe, Paul Morie, Michael Petersen, Navid Shaikh, Lin Sun, Jeremias Werner, Grace Zhang, Jordan Zhang, Yu Zhuang, and many more in the Knative community at large.