Open Source AI and Data: How best to keep up with rapid advances?
The new LF AI and Data foundation serves as a single source of truth for developers and data scientists who want to use and contribute to open source AI and data projects.
The LF AI Foundation (LF AI), the Linux Foundation's umbrella organization for open source AI, is merging with ODPi, which focuses on big data in the enterprise, including governance, business intelligence, and data science education. The merged foundation will be called LF AI and Data. IBM believes this move is great for the AI and data open source space and that the new LF AI and Data Foundation will pave the way for stronger, safer open source AI and data projects.
Why is this important?
The world’s technology increasingly runs on open source software and data. Open source AI software development has led to advances in AI pattern recognition, including image recognition, speech recognition, and entity extraction in text, that were only possible because researchers were able to use open data sets and open source software to benchmark and compare systems and approaches.
The data you or your organization create influences and is influenced by AI. Increasingly, both productivity and quality of service depend on data-driven AI systems across business and society. And those AI systems are largely built on open source software and data sets.
LF AI and Data offers one place to go for open source AI information
AI developers and data scientists can struggle to keep up with the rapid advances in open source AI and data. LF AI and Data offers a primary source of information and community investment in open source projects.
For example, the LF AI Landscape tool tracks nearly 300 AI-related open source community projects, including TensorFlow, PyTorch, scikit-learn, and many others that collectively generate approximately one million lines of code changes every two weeks.
Each project has its own information card that includes details such as the number of GitHub stars and the organizations that host its code repositories (repos) on GitHub. Currently, the projects collectively have over 1.5 million stars; the public companies with projects have a combined market cap of $13.94T; and the startup companies with projects have combined funding of $53.9B (according to Crunchbase). By the time you check, the numbers will probably be higher still!
The foundation’s landscape of data and AI-related projects continues to grow, adding a few new popular projects each month. The landscape groups projects into twelve categories: Data (the largest category), Model (the second largest category), Machine Learning, Deep Learning, Reinforcement Learning, Programming, Notebook Environment, Trusted and Responsible AI, Distributed Computing, Security and Privacy, Natural Language Processing, and Education.
Why is this important to the IBM Approach to Open Technologies?
At IBM, we are especially excited about the merger because, like other members of LF AI and ODPi, we see it as an investment in the future: building an industry-standard open source data and AI infrastructure. IBM has a long history of supporting open source, and we firmly believe that hosting projects in foundations with multivendor governance and vendor-neutral ownership of assets is best for their growth and long-term sustainability.
To date, the key projects IBM contributed to LF AI and ODPi, and now hosted by LF AI and Data, include:
- Egeria (graduated): Every AI project generates enormous amounts of metadata. Egeria provides an Apache 2.0-licensed open metadata and governance system, frameworks, APIs, event payloads, and interchange protocols to enable tools, engines, and platforms to exchange metadata. This helps users get the most value from data, while also ensuring the data is properly governed.
- Trusted AI (three projects, all incubating): The adoption of AI will only happen broadly in business and society if AI is both accurate and trusted.
- AI Fairness 360 (incubating): The AI Fairness 360 toolkit is an extensible open-source library containing techniques developed by the research community to help detect and mitigate bias in machine learning models throughout the AI application lifecycle. The AI Fairness 360 package is available in both Python and R.
- AI Explainability 360 (incubating): The AI Explainability 360 toolkit is an open-source library that supports interpretability and explainability of data sets and machine learning models. The AI Explainability 360 Python package includes a comprehensive set of algorithms that cover different dimensions of explanations along with proxy explainability metrics.
- Adversarial Robustness Toolbox (incubating): Adversarial Robustness Toolbox (ART) is a Python library for machine learning security. ART provides tools that enable developers and researchers to defend and evaluate machine learning models and applications against the adversarial threats of Evasion, Poisoning, Extraction, and Inference. ART supports all popular machine learning frameworks (TensorFlow, Keras, PyTorch, MXNet, scikit-learn, XGBoost, LightGBM, CatBoost, GPy, etc.), all data types (images, tables, audio, video, etc.) and machine learning tasks (classification, object detection, speech recognition, generation, certification, etc.).
- OpenDS4All (incubating): OpenDS4All seeks to accelerate the creation of data science curricula at academic institutions. While a great deal of online material is available for data science, including online courses, we recognize that the best way for many students to learn (and for many institutions to deliver) content is through a combination of lectures, recitation or flipped classroom activities, and hands-on assignments. OpenDS4All attempts to fill this important niche. Our goal is to provide recommendations, slide sets, sample Jupyter Notebooks, and other materials for creating, customizing, and delivering data science and data engineering education. The project hosts educational modules that may be used as building blocks for a data science curriculum.
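To make the bias-detection idea behind the Trusted AI projects concrete, here is a plain-Python sketch of two classic group-fairness metrics, statistical parity difference and disparate impact. This is not the aif360 API, which operates on richer dataset objects; the function names and toy data below are purely illustrative.

```python
# Illustrative sketch of group-fairness metrics of the kind AI Fairness 360
# computes. Names and data are hypothetical, not the real aif360 API.

def selection_rate(outcomes):
    """Fraction of favorable (1) outcomes in a group."""
    return sum(outcomes) / len(outcomes)

def statistical_parity_difference(unprivileged, privileged):
    """Favorable-rate difference between groups; 0 means parity."""
    return selection_rate(unprivileged) - selection_rate(privileged)

def disparate_impact(unprivileged, privileged):
    """Favorable-rate ratio; 1 means parity, values below ~0.8 are often flagged."""
    return selection_rate(unprivileged) / selection_rate(privileged)

# Hypothetical model decisions (1 = favorable outcome, e.g. loan approved):
unpriv = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]   # 20% favorable
priv   = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]   # 60% favorable

print(round(statistical_parity_difference(unpriv, priv), 3))  # → -0.4
print(round(disparate_impact(unpriv, priv), 3))               # → 0.333
```

Toolkits such as AI Fairness 360 pair metrics like these with mitigation algorithms that can be applied before, during, or after model training.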
From the Trusted AI projects to Egeria for open metadata governance, IBM sees the value of investing in open source foundations to host projects that are also key to IBM products. IBM Cloud Pak for Data is built on industry-standard open source infrastructure with proprietary extensions, providing a fully integrated superset of Red Hat's Open Data Hub for running AI workloads on OpenShift. The Red Hat Marketplace also has a growing number of ecosystem partners, such as Anaconda, that provide ready-to-use components for data and AI pipelines. The new foundation will help support the open source projects that underpin our most important data and AI products.
How and why to get involved?
For organizations interested in hosting projects in LF AI and Data or joining as a member, the website provides easy-to-follow links and instructions for taking the first steps. For developers interested in growing their technical eminence in a project, review the existing projects hosted at LF AI and Data and join the community Slack workspace.
For individuals and organizations, as your data becomes your AI, LF AI and Data is an open community of people and resources that can help along the way. Please get involved with LF AI and Data, starting with the Slack workspace, and join us on this exciting journey!