Open source and AI at IBM
The past, present, and future of open source and AI at IBM
This is the first in a series of blog posts about open source technology and artificial intelligence at IBM. In this blog, we identify two mega-trends and then zero in on IBM developer outreach efforts around code, content, and community. Future blogs will focus on IBM research and IBM product team efforts.
The AI mega-trend: Your data is becoming your AI. This is a good thing. Eventually, your AI will truly augment your intelligence and help you become more creative and productive, both at work and on personal tasks. However, we still have a long way to go to fully realize these predicted benefits. For less than a decade, systems like Watson, Siri, and Alexa have been steadily adding capabilities. For some tasks in some contexts they work great, but as a general AI companion to which you can delegate complex tasks when help is most needed, they still fall short (Forbus 2016). Nevertheless, step by step, “our data is becoming our AI,” and this is a mega-trend that today’s students and working professionals need to understand better and actively engage in shaping. AI capabilities packaged as a service, with increasing power and performance, are becoming commonplace in business and society, and we need better ways of learning and working together to shape them (Spohrer and Banavar 2015).
The open source mega-trend: Fortunately, to help us learn and work together to shape AI capabilities, the AI mega-trend is riding on an even bigger mega-trend: open source. As important as it is to understand AI, it is arguably even more important to learn about and engage in open source communities. Why? Because the organizations that we all depend on for business and societal services are increasingly powered by open source software. In other words, open source is far more than a few programmers sharing code. In fact, while the world today quite literally runs on open source, those familiar with the original goals of open source hope for even more benefits in the future for individuals, not just corporations (Mark 2018). Certainly, open source software has finally arrived in full force at corporations. Why has this happened? Simply put, corporations that make all or part of the code associated with their products available as open source have achieved benefits that go far beyond the “all bugs are shallow” (software quality) benefits first observed (Raymond 1999). The ability to work with ecosystem partners to co-create value faster and learn better together is the driving force behind IBM’s approach to open source, and behind that of many other corporations that IBM collaborates with on open source projects (Moore and Ferris 2018). For example, IBM’s proposed Red Hat acquisition, Microsoft’s acquisition of GitHub, and Google’s acquisition of Kaggle reflect the growing importance of code, content, and community to open enterprises. Code is about open source software with test cases. Content includes open data, documentation, and tutorials. Community is all about people in a wide range of roles and organizations co-creating value and learning faster together. The IBM Developer Way is about code, content, and community, and there are opportunities for everyone to get more involved.
It has taken some time to get to this point in open source, and it is worth reviewing the history of open source (Raymond 1999; Ing 2018; Brown 2018; Mark 2018; Moore and Ferris 2018), summarized here in four stages:
Sharing: People have been writing and sharing “open source code” from the dawn of programming and programming languages (1950-1980), but in terms of wide-scale acceptance in mission-critical enterprise applications, much more was needed beyond a willingness to write, share, and use code. The long history of IBM in open source and corporate use of open source software has gone through three more stages (renaissances) in just the last three decades from our perspective.
Licensing: In the 1990s, it was about Linux in the enterprise: getting the right licenses for commercial use of larger-scale, community-driven open source projects, along with trusted certifications for commercial-grade, mission-critical open source in large enterprises. This important step (licensing) finally cracked open the door to corporate developers contributing more actively to the open source tools that they were using, and to acceptance of certified open source in the enterprise.
Open governance: In the late 1990s and early 2000s, it was about the flourishing of foundations and the establishment of open governance organizations. Building on licensing and open governance, by the 2010s the open source ecosystem included major corporations, start-ups, GitHub, Red Hat Software, and a growing number of repos used by corporations. The rise of cloud computing, big data availability, and data science, together with new algorithms for human-level pattern recognition (entity detection) in speech, images, and text, and for artificial intelligence, spurred a growing number of repos for scaling, such as Spark, and one repo (Google TensorFlow) even reached 100,000 stars for the first time in 2018. This step (open governance) opened the door more widely for corporations to embrace open source strategically while mitigating the risk of one entity having too much control over the future of an open source code base.
Code, content, community: With a solid foundation in place, the current renaissance in open source is actually about a return to the roots of code, content, and community to democratize development in the open enterprise. The democratization of development means that all of the diverse roles in the enterprise, both technical and non-technical (for example, programmers, software engineers, data scientists, data engineers, devops, offering management, subject matter experts (SMEs), marketing professionals, and communications), will participate in code, content, and community in new ways to accelerate value co-creation and learning together. It means that a growing number of enterprise processes have been transformed into an ongoing search for newer and better open source components (inputs and outputs) that are organized into end-to-end, continuously improving workflows (services) that drive business outcomes. This current renaissance (code, content, and community) is both a return to the roots of open source, democratizing access to powerful technologies so that we can co-create the future together, and a way for workers in corporations and governments to benefit from and contribute to an open enterprise powered by people with their own AI. Frankly, we are still figuring out what this means in terms of employees having more individual freedom and responsibility (Whitehurst 2015). Open source communities are a core part of corporate strategy, based on business models that actively seek to engage employees, customers, partners, competitors, and others in the ecosystem to co-create service innovations as a community (Spohrer, Kwan, and Fisk 2014). How can these types of communities be sustained and co-exist with more traditional strategies and business models (Brown 2018)? We will find out together because there is no turning back now.
What do code, content, and community look like in an open enterprise? In the remainder of this blog, we describe specific projects and people working on developer outreach for code, content, and community, covering open data and AI as well as system performance, in different parts of the IBM Measuring AI Progress with Cognitive Opentech Group (MAP COG).
There is a tremendous amount of open source AI activity in IBM, for example, in research and in product teams. This blog focuses on activities in the open technologies group in IBM. We anticipate that there will be more blogs that describe other IBM open source AI activities.
History: Building on the past
Recently, the Apache Spark community announced its release of v2.4.0, the fifth major release in the 2.x code stream. Since its initial launch in 2009, Apache Spark has made great strides, driven in large part by a passionate open source community. Apache Spark 2.0 marked a significant leap forward in functionality, stability, and performance. IBM identified Spark as one of its strategic open source projects and established the Spark Technology Center at Watson West in San Francisco to focus on expanding Spark’s core technology, making it enterprise- and cloud-ready, and accelerating the business value of Spark in business applications. From the establishment of the Center until the Apache Spark v2.4.0 release, the Spark Technology Center made over 1300 commits with 69,000+ lines of new code in the areas of Spark Core, SQL, MLlib, Streaming, PySpark, and SparkR, which makes IBM one of the top five contributors to Apache Spark. You can always see the latest contributions at our JIRA Dashboard.
In early 2018, with the accelerated development of AI and especially recent advances in deep learning technology, the Spark Technology Center at IBM expanded its mission. Now part of the IBM Digital Business Group, the center has relaunched as CODAIT (the Center for Open source, Data and AI Technology). Note that codait is a form of the French verb coder (to code). Designated developers and committers continue to enhance Spark’s core technology through code, content, and community advocacy. In addition, this dedicated team continues to serve as a Spark competence center for IBM product development teams.
Present: Embracing the present
Aspects of data set sharing
Data sets and the sharing of data sets are essential for the success of AI because they provide the histories on which AI systems base their predictions. Important aspects include the ability to share data sets and their descriptions (metadata) across tools and software, applying data governance principles and demonstrating compliance, and integrating subsequent data modifications appropriately.
The Egeria project emerged at the ODPi consortium in 2018 to define APIs and descriptors for metadata sharing without imposing a single tool or repository to manage all metadata. Egeria makes it possible for tools and data repositories to publish metadata about the data that they manage, and to consume metadata descriptions from other tools so that the consumed data is interpreted correctly. Egeria also provides guidelines and templates for data governance and compliance.
There is much discussion on how data sets should be licensed. The Community Data License Agreement (CDLA) from the Linux Foundation is intended to provide for data sets what open source software licenses provided for software and comes in two variants:
The CDLA-Sharing license embodies the principles of copyleft so that downstream recipients of data that is published under the CDLA-Sharing agreement can use and modify that data, and are required to share their changes to the data.
The CDLA-Permissive agreement is similar to permissive open source licenses so that anyone can use and modify the data that is published under the CDLA-Permissive agreement without being obliged to share any of their changes or modifications.
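For developers who track data set licenses in a pipeline, the difference between the two variants can be sketched as a small lookup. This is only an illustration of the distinction described above, not a reading of the license texts; the `DataLicense` class, the `CDLA_VARIANTS` table, and the `must_share_changes` helper are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataLicense:
    """Hypothetical summary of a data license, for illustration only."""
    name: str
    may_use_and_modify: bool
    share_changes_required: bool


# Both variants allow use and modification; only the Sharing variant
# obliges downstream recipients to share their changes (per the text above).
CDLA_VARIANTS = {
    "CDLA-Sharing": DataLicense("CDLA-Sharing", True, True),
    "CDLA-Permissive": DataLicense("CDLA-Permissive", True, False),
}


def must_share_changes(license_name: str) -> bool:
    """Return True if changes to the data must be shared under this variant."""
    return CDLA_VARIANTS[license_name].share_changes_required


if __name__ == "__main__":
    for name in CDLA_VARIANTS:
        print(name, "-> share changes required:", must_share_changes(name))
```

A pipeline could call a check like `must_share_changes` before merging a modified data set into an internal-only store, flagging data that carries a share-back obligation.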
Aspects of artificial intelligence/machine learning/deep learning
The Adversarial Robustness Toolbox can detect and mitigate malicious attacks against deep learning models.
FfDL (Fabric for Deep Learning) is an award-winning open source project (Infoworld 2018 BOSSIE, open source Machine Learning tools), and an enhanced version is included in IBM Watson Machine Learning.
Large Model Support (LMS) for TensorFlow on PowerAI.
These technologies and other systems will be discussed in more depth in future blogs; they are just a few examples of the open source technologies that IBM makes available to developers and data scientists. Beyond the open source code itself, we also publish code patterns, blogs, tutorials, and articles on IBM Developer to help developers and data scientists better understand what they can build with open source technologies and how. Our goal is to enable developers and data scientists to transform industries through open source data and AI technologies. This work is grounded in open source code, high-quality content, and building community around open source data and AI technologies.
Future: Co-inspiring and co-creating together
Increasingly, foundations are providing congenial and productive environments for open source collaboration across institutions and enthusiasts, making it possible for individuals to contribute and channel their expertise alongside large enterprises, start-ups, faculty, and students. Examples of foundations that have emerged in AI and related domains include the ODPi consortium for data and the LF Deep Learning Foundation (LFDL), both at The Linux Foundation. LFDL incorporates a number of projects such as Acumos AI, which is a platform and framework that makes it easy to build, share, and deploy AI apps. Performance benchmarks are emerging in the area of AI, such as MLPerf, DAWNBench, DeepBench, and TensorFlow benchmarks, and we will see more foundations engaging in this area to help estimate the resources that are needed to develop and run AI apps.
Get involved and contribute
The open source AI world is thriving. If you have an interest in AI, there are many ways that you can contribute, such as adding features to, testing, or documenting open source frameworks like TensorFlow and PyTorch or model exchange formats like PFA, PMML, and ONNX. You can adopt, retrain, or build and train models and contribute them to asset exchanges such as MAX or Acumos AI. Graph technology forms an underpinning for reasoning, and engaging with the JanusGraph project gives you an understanding of how property graphs work. The ODPi Egeria project welcomes participants to its weekly calls on Thursdays. The Linux Foundation TAC calls are every other Thursday and include recordings of previous talks. If you are a performance expert, consider joining one of the AI performance benchmark initiatives. If you have an interest in ethics and trust, you can get involved in critical projects like AI Fairness 360 and the Adversarial Robustness Toolbox.
References
Brown TC (2018) A framework for thinking about Open Source Sustainability? 2018 Jul 02
Forbus KD (2016) Software social organisms: implications for measuring AI progress. AI Magazine. 2016 Apr 13;37(1):85-90.
Ing D (2018) Open Innovation Learning: Theory building on open sourcing while private sourcing. Foreword by Spohrer J. 2018 Feb 20
Mark J (2018) Why open source failed. 2018 Jul 30
Moore T and Ferris C (2018) IBM’s approach to open technologies. 2018 Oct 27
Raymond E (1999) The cathedral and the bazaar. Knowledge, Technology & Policy. 1999 Sep 1;12(3):23-49.
Spohrer J, Banavar G (2015) Cognition as a service: An industry perspective. AI Magazine. 2015 Dec 1;36(4):71-81.
Spohrer J, Kwan SK, Fisk RP (2014) Marketing: a service science and arts perspective. Handbook of service marketing research, Ed Rust RT and Huang MH. Pp. 489-526.
Whitehurst J (2015) The open organization: Igniting passion and performance. Harvard Business Review Press; 2015 May 12.