by Romeo Kienzler | Published December 5, 2018
Acknowledgement: Thanks to Kevin Turner for reviewing this document multiple times and for his valuable input.
Data scientists tend to use ad hoc approaches. We see a lot of creative hacking of scripts in different programming languages on different machine learning frameworks distributed all over the place on servers and client machines. I’m not complaining about the way data scientists work. I’ve found myself in such highly creative modes many times when I accomplished something significant.
Having complete freedom of choice with programming languages, tools, and frameworks fosters creative thinking and innovation. But at the end of the day, data scientists must fully shape their assets before delivery because unpolished assets bring many pitfalls, which I describe below.
From a data scientist's perspective, it's common sense that the actual technology doesn't matter much from a functional point of view because the models and algorithms that are used are defined mathematically. Therefore, the single source of truth is the mathematical definition of the algorithm. For non-functional requirements, this view doesn't quite hold. For example, the availability and cost of experts for a certain programming language and technology vary heavily. When it comes to maintenance, the chosen technology has a major impact on a project's success.
Data scientists tend to use programming languages and frameworks in which they are most skilled. This starts with open source technologies like R and R-Studio with its unmanageable universe of packages and libraries and its inelegant and hard-to-maintain syntax. The runner-up is Python with its well-structured and well-organized syntax and associated frameworks Pandas and Scikit-Learn. On the other side of the tools spectrum are completely visual “low-code/no-code” open source tools like Node-RED, KNIME, RapidMiner, and Weka and commercial offerings like SPSS Modeler.
"The technology that I know best" is fine for a proof of concept (PoC), hackathon, or start-up style project. However, when it comes to industry- and enterprise-scale projects, some architectural guidance on technology usage must be in place, in whatever form it manifests.
Given the previous problem statement, it might be obvious to you that uncontrolled growth of data science assets can't be tolerated in an enterprise setting. In large enterprises there is a lot of churn in projects and human resources; for example, external consultants with specific skills are often hired for only a short time, attached to a specific project. Usually, when someone leaves the project, their knowledge leaves with them. Therefore, it's essential that data science assets aren't just collections of scripts implemented in different programming languages, lying around in various locations and environments. Because of the non-collaborative nature under which many data science assets are developed, their reusability is often limited. Ad hoc documentation, poor code quality, complex and mixed technologies, and a broader lack of expertise are the main drivers of this problem. After these issues are addressed, assets become reusable and increase dramatically in value. For example, if uncoordinated, every data scientist might re-create the ETL (Extract, Transform, Load), data quality assessment, and feature engineering pipeline for the same data source, which can lead to significant overhead and poor quality.
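To make the reuse point concrete, here is a minimal sketch of how an ETL and feature-engineering pipeline can be packaged as a shared asset instead of an ad hoc script. All names (`Pipeline`, `clean_nulls`, `add_ratio`) are illustrative, not part of any framework mentioned in this article:

```python
# Minimal sketch of a reusable feature-engineering pipeline.
# Step functions are hypothetical stand-ins for real ETL logic.

from typing import Callable, List

Row = dict

class Pipeline:
    """Chains transformation steps so the same ETL logic can be shared."""
    def __init__(self, steps: List[Callable[[Row], Row]]):
        self.steps = steps

    def run(self, rows: List[Row]) -> List[Row]:
        for step in self.steps:
            rows = [step(r) for r in rows]
        return rows

def clean_nulls(row: Row) -> Row:
    # Data quality assessment step: replace missing values with 0.
    return {k: (0 if v is None else v) for k, v in row.items()}

def add_ratio(row: Row) -> Row:
    # Feature engineering step: derive a new feature from existing ones.
    row["ratio"] = row["a"] / row["b"] if row["b"] else 0.0
    return row

pipeline = Pipeline([clean_nulls, add_ratio])
result = pipeline.run([{"a": 2, "b": 4}, {"a": 1, "b": None}])
```

Because the steps are ordinary functions registered in one place, a second data scientist working on the same data source can reuse the pipeline rather than re-creating it.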
Data scientists are great thinkers. Common sense tells them that brains don't scale, so data scientists tend to work alone, at their own pace and in their own manner. If they're stuck, websites like stackexchange.com can become their best resources for help. Maybe it's ignorance, or maybe only a lack of equally skilled peers, but the best technical data scientists often don't excel at collaboration. To outsiders, it might look like they follow the mind-set of "Après moi, le déluge": the assets they create aren't shared and organized in a reusable manner. Poor documentation, if present at all, and scattered components make it hard to retrace and replicate previous work. Therefore, a common asset repository and minimum guidelines for proper documentation are essential.
Data scientists are often "hackers" with linear algebra skills and some understanding of business. They are usually not trained software engineers or architects. As stated before, data scientists tend to use the programming languages and frameworks in which they are most skilled and progress rapidly to a solution without necessarily keeping non-functional requirements (NFRs) like scalability, maintainability, and human resource availability in mind. Therefore, I emphasize the need for a solution architect or lead data scientist role to be attached to every major data science project to ensure that NFRs are properly addressed. Supporting such a role with a predefined architectural and process framework is very helpful. But first, let's look at how a traditional enterprise architecture fits into data science projects.
Before we answer this question, let’s start with a short review on traditional enterprise architecture and then evaluate how an architectural methodology and process model fit in.
Architecture hierarchy. Source: IBM Corporation
At the top of the pyramid is the enterprise architect. The enterprise architect defines standards and guidelines that are valid across the enterprise. Some examples include:
The solution architect works within the framework that the enterprise architect defines. This role defines what technological components fit the project or use case. Some examples are:
The application architect then defines the application within the framework of the solution architecture. Examples of this include:
Finally, the data architect defines the data-related components such as:
So where does the all-mighty, creative data scientist cowboy fit in here? First, let's define which of the roles defined above a data scientist might partially take over, and with which roles they might interact.
Let's look at the roles again from top to bottom. To make this more illustrative, let's use a metaphor from urban design. An enterprise architect is the one who designs a whole city; they define the sewerage systems and roads, for example. A solution architect designs individual houses, whereas an application architect designs the kitchen, and a data architect oversees the electrical installation and the water supply system.
Finally, the data scientist is responsible for creating the most advanced kitchen ever! They won't just take an off-the-shelf kitchen. They take individual, ready-made components, but also create original parts where necessary. The data scientist interacts mostly with the application architect. If the kitchen has special requirements, the data architect might be needed to provide the infrastructure. Keeping this metaphor in mind, how would the kitchen look if it were created by the data scientist alone? It would be a functional kitchen with a lot of features, but most likely lacking some usability. For example, to start the oven you need to log in to a Raspberry Pi and run a shell script. And because parts have been taken from different vendors, including some custom-made hardware, the design of the kitchen might be ugly. Finally, there would be a lot of functionality, but some of it is not needed and most of it is undocumented.
Going back to IT, this example illustrates the answer to the original question. Where does our all-mighty, creative data scientist cowboy fit in here?
The data scientist would rarely interact with the enterprise architect. They might interact with the solution architect but will work closely with the application architect and data architect. They don’t need to take over their roles, but they must be able to step into their shoes and understand their thinking. Because data science is an emerging and innovative field, the data scientist must speak at eye level with the architects (which is not the case for an application developer or a database administrator) to transform and influence the enterprise architecture.
I’ll conclude with an example to illustrate what I mean by this. Consider architectural guidelines in which an R-Studio Server is the standard data science platform in the enterprise and all data science projects must use R. This software was approved by the enterprise architect and the on-premises R-Studio Server self-service portal was designed by the solution architect. The data scientist finds a Keras code snippet in Python using the TensorFlow back end that pushes model performance to the moon. This code is open source and maintained by one of the most intelligent brains in artificial intelligence. The data scientist just needs an hour to plug this snippet into the data processing pipeline running on their notebook (yes, they prototype on their notebook because they really don’t like the R-Studio Server installation provided to them). So, what do you think should happen here?
In the old days of the all-mighty architects in an enterprise, the data scientist would have been forced to port the code to R (using a less sophisticated deep learning framework). But here’s the potential. If the data scientist wants to use this code snippet, they should be able to do so. But if this is done without guidance, we end up in the Wild West of data science.
Therefore, let’s look at existing process models and reference architectures to see whether and how we can merge the traditional field of architecture with the emerging field of data science.
CRISP-DM, which stands for Cross-industry Standard Process for Data Mining, is the most widely used open standard process model – if a process model is used at all, of course. CRISP-DM defines a set of phases that make up a data science project. Most importantly, transitions between those phases are bidirectional and the whole process is iterative: after you've reached the final stage, you just start the whole process again and refine your work. The following figure illustrates this process.
The CRISP-DM process model. By Kenneth Jensen, based on:
ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.0/en/ModelerCRISPDM.pdf – Creative Commons Attribution-Share Alike 3.0 Unported license
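The iterative, bidirectional nature of CRISP-DM can be sketched in a few lines of code. The phase names below are the standard CRISP-DM phases; the loop logic itself is only an illustration:

```python
# Sketch of CRISP-DM's iterative, bidirectional phase flow.
# Phase names are the standard CRISP-DM phases; the transition
# function is an illustrative simplification.

CRISP_DM_PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]

def next_phase(current: str, go_back: bool = False) -> str:
    """Move forward through the phases, or step back one phase.
    After Deployment, the whole process starts over."""
    i = CRISP_DM_PHASES.index(current)
    if go_back:
        return CRISP_DM_PHASES[max(i - 1, 0)]
    return CRISP_DM_PHASES[(i + 1) % len(CRISP_DM_PHASES)]
```

The modulo wrap-around captures the key point of the model: finishing Deployment doesn't end the project, it starts the next refinement iteration.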
In my opinion, this process model is already a good start. But because it is a process model only, it assumes that the architectural decisions about the technology used and the NFRs have already been addressed. This makes CRISP-DM a very good model in technologically settled environments like traditional enterprise data warehousing or business intelligence projects.
In a rapidly evolving field like data science, it is not flexible enough.
Due to shortcomings in CRISP-DM, in 2015 IBM released the Analytics Solutions Unified Method for Data Mining/Predictive Analytics (ASUM-DM) process model. It is based on CRISP-DM but extends it with tasks and activities on infrastructure, operations, project, and deployment, and adds templates and guidelines to all the tasks. There is an open version of ASUM-DM available, but the full-scale version is available only to IBM clients. (For more information, contact firstname.lastname@example.org.)
ASUM-DM is part of a more generic framework called Analytics Solutions Unified Method (ASUM) that provides product- and solution-specific implementation roadmaps covering all IBM Analytics products.
ASUM-DM borrows the process model from ASUM, which is illustrated below.
Analytics Solutions Unified Method (ASUM) Process Model. Source: IBM Corporation
Analytics Solutions Unified Method (ASUM) Process Model Detail. Source: IBM
When the Manifesto for Agile Software Development was published in 2001, heavyweight processes like Waterfall or the V-Model went out of vogue. The main reason for this paradigm shift was the software development crisis of the 1990s, when software development just couldn't keep up with the rapidly growing expectations of business stakeholders on time-to-market and flexibility.
Because enterprise clients often have a hard time transitioning to agile processes, IBM created the IBM Cloud Garage Method, an agile software architecture method that is tailored to enterprise transformation. Again, this method is organized in different stages, as shown in the following image.
The IBM Cloud Garage Method. Source: IBM Corporation
A key thing to notice is that cultural change is in the middle of this hexagon. This means that without cultural change the method is doomed to fail. This is important to keep in mind. In the context of data science, we have a head start because data scientists tend to favor lightweight process models, if used at all.
In the IBM Cloud Garage Method, every practitioner sits in the same boat. Source: IBM
Here's a summary of all six phases surrounding the cultural change.
Design thinking is the new requirements engineering. Design thinking has its roots in the 1960s, but IBM was one of the major contributors in applying the method to the IT industry. Although usually stated in more complex terms, design thinking in my opinion has only one purpose: switch your brain into creative mode. Therefore, writing and drawing are used over speaking and typing. By stepping back, you'll be able to see, understand, and create the bigger picture.
Design thinking has the user experience in mind and a clear emphasis on the business behind the offering. So these key questions are answered:
The outcome of every think phase is the definition of a minimum viable product (MVP).
The cloud platform revolution is the key enabler for fast prototyping. You can get your prototype running in hours instead of days or weeks, which shortens the iteration cycle by an order of magnitude. This way, user feedback can be gathered daily. Some best practices for this phase include:
Daily delivery has two prerequisites. First, build and deployment must be fully automated using a tool chain. Second, every commit to the source code repository must result in a fully production-ready product that users can test at any time. Cloud-based solutions tackle this requirement and let developers concentrate on coding.
Continuous Integration and Continuous Delivery. Source: IBM Corporation
When using a cloud runtime, operational aspects of a project are handled by cloud services. Depending on requirements, this can happen in public, private, or hybrid clouds and at an infrastructure, platform, or service level. This way, the operations team can often be made obsolete, and developers can concentrate on adding value to the project. Some best practices for this phase include:
High-Availability, Auto-Scaling, and Fault-Tolerance in an intercontinental cloud deployment.
Source: IBM Corporation
Because you're building on fully managed cloud runtimes, adding intercontinental high availability/failover, continuous monitoring, and dynamic scaling is no longer a challenge and can simply be activated. Some best practices for this phase include:
Due to the very short iteration cycles and continuous user feedback, hypotheses can be tested immediately to make informed decisions and drive findings that can be added to the backlog for further pivoting. Some best practices for this phase include:
Evidence-based hypothesis testing example. Source: IBM Corporation
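As a minimal illustration of evidence-based hypothesis testing, a two-proportion z-test compares conversion rates from two MVP variants using only the standard library. The numbers are made up; the z-test itself is a standard statistical technique, not something prescribed by the Garage Method:

```python
import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z-statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)            # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical daily feedback: variant B converts 60/500 vs. A's 40/500.
z = two_proportion_z(40, 500, 60, 500)
# |z| > 1.96 would reject equal conversion rates at the 5% level.
```

With short iteration cycles, a test like this can be rerun every day as feedback arrives, and the result recorded as evidence behind each backlog decision.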
Although usually bound to an IBM client engagement, the offerings included in the DataFirst Method Design Engagement Offerings (the IBM DataFirst Method is an instance of the IBM Cloud Garage Method) specifically target IT transformation to get infrastructure, processes, and employees ready for AI. For more information, visit ibm.biz/DataFirstMethod.
IBM DataFirst Method Process Model. Source: IBM Corporation
Every project is different, and every use case needs different technical components. But they all can be described in abstract terms. The following list enumerates and explains them.
IBM Data and Analytics Reference Architecture. Source: IBM Corporation
Now that you have an overview of the current state-of-the-art methods and process models for data science on the cloud, it's time to concentrate on a method that is useful for individual data scientists who want to improve their methodology, minimize architectural overhead, and positively influence enterprise architecture from the bottom up. I call this the Lightweight IBM Cloud Garage Method for Data Science. I'll explain this method in my next article. So stay tuned!