
Architectural thinking in the Wild West of data science

Complete freedom in the choice of programming languages, tools, and frameworks improves creative thinking and evolution.

Archived content

Archive date: 2023-06-02

This content is no longer being updated or maintained. The content is provided “as is.” Given the rapid evolution of technology, some content, steps, or illustrations may have changed.

Acknowledgement: Thanks to Kevin Turner for reviewing this document multiple times and for his valuable input.

Data scientists tend to use ad hoc approaches. We see a lot of creative hacking of scripts in different programming languages and machine learning frameworks, distributed all over the place on servers and client machines. I'm not complaining about the way data scientists work; I've found myself in such highly creative modes many times when I accomplished something significant.

Having complete freedom of choice in programming languages, tools, and frameworks improves creative thinking and evolution. But at the end of the day, data scientists must fully shape their assets before delivery because many pitfalls await if they don't. I describe these pitfalls below.

Technology blindness

From a data scientist's perspective, it's common sense that the actual technology doesn't matter too much from a functional point of view because the models and algorithms that are used are defined mathematically; the single source of truth is the mathematical definition of the algorithm. For non-functional requirements, this view doesn't quite hold. For example, the availability and cost of experts for a certain programming language and technology vary heavily. When it comes to maintenance, the chosen technology has a major impact on a project's success.

Data scientists tend to use the programming languages and frameworks in which they are most skilled. This starts with open source technologies like R and R-Studio, with its unmanageable universe of packages and libraries and its inelegant and hard-to-maintain syntax. The runner-up is Python, with its well-structured and well-organized syntax and its associated frameworks Pandas and Scikit-Learn. On the other side of the tools spectrum are completely visual "low-code/no-code" open source tools like Node-RED, KNIME, RapidMiner, and Weka, and commercial offerings like SPSS Modeler.

'The technology that I know best' is fine for a proof of concept (PoC), hackathon, or start-up style project. However, at industry and enterprise project scale, some architectural guidance on technology usage must be in place, however it might manifest.

Lack of reproducibility and reusability

Given the previous problem statement, it might be obvious to you that uncontrolled growth of data science assets in an enterprise setting can't be completely tolerated. In large enterprises, a lot of churn can happen with projects and human resources; for example, external consultants with specific skills are often hired only for a short time and attached to a specific project. Usually, when someone leaves a project, their knowledge leaves with them. Therefore, it's essential that data science assets aren't just collections of scripts implemented in different programming languages and lying around in various locations and environments. Because of the non-collaborative way in which many data science assets are developed, their reusability is often limited. Ad hoc documentation, poor code quality, complex and mixed technologies, and a broader lack of expertise are the main drivers of this problem. After these issues are addressed, assets become reusable and increase dramatically in value. For example, if uncoordinated, every data scientist might re-create the ETL (Extract – Transform – Load), data quality assessment, and feature engineering pipeline for the same data source, which can lead to significant overhead and poor quality.
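To make this concrete, here is a minimal sketch, not taken from any real project, of how cleansing, feature engineering, and modeling can be packaged as one reusable scikit-learn pipeline instead of scattered ad hoc scripts. The data and column names are invented for illustration.

    # A minimal, hypothetical sketch: the whole asset (imputation, scaling,
    # encoding, and the model) becomes one versionable, shareable object.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Invented raw data as it might come out of an ETL step.
    df = pd.DataFrame({
        "age": [34, 45, None, 23],
        "income": [52000, 64000, 58000, None],
        "segment": ["gold", "silver", "gold", "bronze"],
        "churned": [0, 1, 0, 1],
    })

    numeric = ["age", "income"]
    categorical = ["segment"]

    preprocess = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer()),
                          ("scale", StandardScaler())]), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])

    model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
    model.fit(df[numeric + categorical], df["churned"])

Because the whole pipeline is a single object, it can be stored in a shared repository and reused for the same data source instead of being re-created by every data scientist.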

Lack of collaboration

Data scientists are great thinkers. Common sense tells them that brains don't scale, so they tend to work alone, at their own pace, and in their own manner. If they're stuck, websites like stackexchange.com become their best resources for getting help. Maybe it's ignorance, or maybe just a lack of equally skilled peers, but the best technical data scientists often don't excel at collaboration. To outsiders, it might look like they follow the mindset "Après moi, le déluge": assets that are created aren't shared or organized in a reusable manner. Poor documentation, if present at all, and scattered components make it hard to retrace and replicate previous work. Therefore, a common asset repository and minimum guidelines for proper documentation are essential.

Suboptimal architectural decisions

Data scientists are often "hackers" with linear algebra skills and some understanding of business. They are usually not trained software engineers or architects. As stated before, data scientists tend to use the programming languages and frameworks in which they are most skilled and progress rapidly to a solution without necessarily having non-functional requirements (NFRs) like scalability, maintainability, and human resource availability in mind. Therefore, I emphasize the need for a solution architect or lead data scientist role to be attached to every major data science project to ensure that NFRs are properly addressed. Supporting such a role with a predefined architectural and process framework is very helpful. But first, let's look at how traditional enterprise architecture fits into data science projects.

How much architecture and process is good for a data science project?

Before we answer this question, let's start with a short review of traditional enterprise architecture and then evaluate how an architectural methodology and process model fit in.

Architecture hierarchy. Source: IBM Corporation

At the top of the pyramid is the enterprise architect. The enterprise architect defines standards and guidelines that are valid across the enterprise. Some examples include:

  • Use of open source software is allowed only if it has a permissive license
  • REST calls always need to use HTTPS
  • Use of NoSQL databases requires special approval from the enterprise architecture board

The solution architect works within the framework that the enterprise architect defines. This role defines what technological components fit the project or use case. Some examples are:

  • Historical data must be stored in a Db2 relational database management system (RDBMS)
  • For high-throughput, structured real-time data, Apache Spark Streaming must be used (see the sketch after this list)
  • For low-latency, real-time video stream processing, IBM Streams must be used
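As a hypothetical illustration of the second guideline, here is a minimal sketch of consuming a high-throughput stream with Spark Structured Streaming in Python; the Kafka broker address and topic name are made up.

    # A minimal, hypothetical sketch of reading structured real-time data
    # with Spark Structured Streaming (broker and topic are invented).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("clickstream-demo").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "clicks")
              .load())

    # Count events per key and print the running totals to the console.
    counts = events.groupBy("key").count()

    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()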

The application architect then defines the application within the framework of the solution architecture. Examples of this include:

  • The UI is implemented using the Model-View-Controller (MVC) pattern
  • For standard entities, an object relational mapper is used
  • For complex queries, prepared SQL statements are used (a minimal sketch follows this list)
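For the last item, a prepared (parameterized) statement in Python might look like the following minimal sketch; sqlite3 and the table are used purely for illustration.

    # A minimal, hypothetical sketch of a parameterized (prepared) statement.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, "
                 "name TEXT, country TEXT)")
    conn.execute("INSERT INTO customers (name, country) VALUES (?, ?)",
                 ("Alice", "DE"))

    # The placeholder keeps query and data separate, which prevents SQL
    # injection and lets the database reuse the query plan.
    rows = conn.execute("SELECT id, name FROM customers WHERE country = ?",
                        ("DE",)).fetchall()
    print(rows)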

Finally, the data architect defines the data-related components such as:

  • During ETL, data must be denormalized into a star-model
  • During ETL, all categorical and ordinal fields must be indexed (see the sketch after this list)
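As a hypothetical sketch of the second rule, categorical and ordinal fields can be index-encoded during ETL with pandas; the column names and values are invented.

    # A minimal, hypothetical sketch of indexing categorical fields in ETL.
    import pandas as pd

    df = pd.DataFrame({
        "product_category": ["toys", "books", "toys", "garden"],
        "customer_segment": ["gold", "silver", "gold", "bronze"],
    })

    # Replace each categorical field with an integer index and keep the
    # mapping so the same encoding can be applied downstream.
    mappings = {}
    for col in ["product_category", "customer_segment"]:
        codes, categories = pd.factorize(df[col])
        df[col + "_idx"] = codes
        mappings[col] = list(categories)

    print(df)
    print(mappings)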

So where does the all-mighty, creative data scientist cowboy fit in here? First, we try to define which of the roles defined above a data scientist might partially take on, or with which roles they might interact.

Let's look at the roles again from top to bottom. To make this more illustrative, let's take a metaphor from urban design. An enterprise architect is the one who designs a whole city. They define the sewerage system and the roads, for example. A solution architect would be the one designing individual houses, whereas an application architect designs the kitchen, and a data architect oversees the electrical installation and water supply system.

Finally, the data scientist is responsible for creating the most advanced kitchen ever! They won't just take an off-the-shelf kitchen. They take individual, ready-made components, but also create original parts where necessary. The data scientist interacts with the application architect mostly. If the kitchen has special requirements, the data architect might be necessary to provide the infrastructure. Keeping this metaphor in mind, how would the kitchen look if it was created by the data scientist alone? It would be a functional kitchen with a lot of features, but most likely lacking some usability. For example, to start the oven you need to log in to a Raspberry Pi and run a shell script. And because parts have been taken from different vendors, including some custom-made hardware, the design of the kitchen might be ugly. Finally, there would be a lot of functionality but some of it is not needed and most of it is undocumented.

Going back to IT, this example illustrates the answer to the original question. Where does our all-mighty, creative data scientist cowboy fit in here?

The data scientist would rarely interact with the enterprise architect. They might interact with the solution architect but will work closely with the application architect and data architect. They don't need to take over these roles, but they must be able to step into their shoes and understand their thinking. Because data science is an emerging and innovative field, the data scientist must talk to the architects as a peer (which is not usually the case for an application developer or a database administrator) to transform and influence the enterprise architecture.

I'll conclude with an example to illustrate what I mean by this. Consider architectural guidelines in which an R-Studio Server is the standard data science platform in the enterprise and all data science projects must use R. This software was approved by the enterprise architect and the on-premises R-Studio Server self-service portal was designed by the solution architect. The data scientist finds a Keras code snippet in Python using the TensorFlow back end that pushes model performance to the moon. This code is open source and maintained by one of the most intelligent brains in artificial intelligence. The data scientist just needs an hour to plug this snippet into the data processing pipeline running on their notebook (yes, they prototype on their notebook because they really don't like the R-Studio Server installation provided to them). So, what do you think should happen here?
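For illustration, the snippet in question might look something like this minimal, hypothetical Keras sketch (TensorFlow backend assumed; the random data is a stand-in for the real feature pipeline).

    # A minimal, hypothetical stand-in for the kind of Keras snippet the
    # data scientist found; X and y would come from the existing pipeline.
    import numpy as np
    from tensorflow import keras

    X = np.random.rand(1000, 20).astype("float32")
    y = np.random.randint(0, 2, size=(1000,))

    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(X.shape[1],)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)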

In the old days of the all-mighty architects in an enterprise, the data scientist would have been forced to port the code to R (using a less sophisticated deep learning framework). But here's the potential. If the data scientist wants to use this code snippet, they should be able to do so. But if this is done without guidance, we end up in the Wild West of data science.

Therefore, let's look at existing process models and reference architectures to see whether and how we can merge the traditional field of architecture with the emerging field of data science.

An overview of existing process models for data science

CRISP-DM

CRISP-DM, which stands for Cross-industry Standard Process for Data Mining, is the most widely used open standard process model - if a process model is used at all, of course. CRISP-DM defines a set of phases that make up a data science project. Most importantly, transitions between those phases are bidirectional, and the whole process is iterative: after you've reached the final stage, you start the whole process again and refine your work. The following figure illustrates this process.

crisp-dm-process

The CRISP-DM process model. By Kenneth Jensen, based on: https://kennethagregaardje.wixsite.com/home/crisp-dm - Creative Commons Attribution-Share Alike 3.0 Unported license

In my opinion, this process model is already a good start. But because it is a process model only, it assumes that the architectural decisions on the technology used and the NFRs have already been addressed. This makes CRISP-DM a very good model in technologically settled environments like traditional enterprise data warehousing or business intelligence projects.

In a rapidly evolving field like data science, it is not flexible enough.

ASUM-DM

Due to shortcomings in CRISP-DM, in 2015 IBM released the Analytics Solutions Unified Method for Data Mining/Predictive Analytics (ASUM-DM) process model. It is based on CRISP-DM but extends it with tasks and activities on infrastructure, operations, project, and deployment, and adds templates and guidelines to all the tasks.

ASUM-DM is part of a more generic framework called Analytics Solutions Unified Method (ASUM) that provides product- and solution-specific implementation roadmaps covering all IBM Analytics products.

ASUM-DM borrows the process model from ASUM, which is illustrated below.

asum-method

Analytics Solutions Unified Method (ASUM) Process Model. Source: IBM Corporation

asum-method-detail

Analytics Solutions Unified Method (ASUM) Process Model Detail. Source: IBM Corporation

IBM Cloud Garage Method

When the Manifesto for Agile Software Development was published in 2001, heavy processes like Waterfall or V-Model went out of vogue. The main reason for this paradigm shift was the software development crisis in the 1990s where software development just couldn't keep up with rapidly growing expectations of business stakeholders on time-to-market and flexibility.

Because enterprise clients often have a hard time transitioning to agile processes, IBM created the IBM Cloud Garage Method, an agile software architecture method that is tailored to enterprise transformation. Again, this method is organized in different stages, as shown in the following image.

ibm-cloud-garage-method

The IBM Cloud Garage Method. Source: IBM Corporation

A key thing to notice is that cultural change is in the middle of this hexagon. This means that without cultural change the method is doomed to fail. This is important to keep in mind. In the context of data science, we have a head start because data scientists tend to favor lightweight process models, if used at all.

ibm-cloud-garage-method-boat

In the IBM Cloud Garage Method, every practitioner sits in the same boat. Source: IBM Corporation

Here's a summary of the six phases that surround cultural change.

Think

Design thinking is the new requirements engineering. Design thinking has its roots in the 1960s, but IBM was one of the major contributors to applying the method to the IT industry. Although usually stated in more complex terms, design thinking in my opinion has only one purpose: switching your brain into creative mode. Therefore, writing and drawing are used over speaking and typing. By stepping back, you'll be able to see, understand, and create the bigger picture.

Design thinking has the user experience in mind and a clear emphasis on the business behind the offering. So these key questions are answered:

  • Who: For whom are we building the offering?
  • What: What problem are we trying to solve?
  • How: How are we going to solve the problem?

The outcome of every think phase is the definition of a minimum viable product (MVP).

Code

The cloud platform revolution is the key enabler for fast prototyping. You can get your prototype running in hours instead of days or weeks, which shortens the iteration cycle by an order of magnitude. This way, user feedback can be gathered daily. Some best practices for this phase include:

  • Daily stand-up meetings
  • Pair-programming and test-driven development (see the sketch after this list)
  • Continuous integration
  • Automated testing
  • Refactoring to microservices
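As a small, hypothetical illustration of test-driven development and automated testing, the following pytest-style tests are written first and a simple feature-engineering helper is implemented until they pass; the function and file name are invented.

    # test_features.py -- a minimal, hypothetical TDD example (run with pytest).

    def normalize(values):
        """Scale a list of numbers to the range [0, 1]."""
        lo, hi = min(values), max(values)
        if hi == lo:
            return [0.0 for _ in values]
        return [(v - lo) / (hi - lo) for v in values]

    def test_normalize_scales_to_unit_interval():
        assert normalize([2, 4, 6]) == [0.0, 0.5, 1.0]

    def test_normalize_handles_constant_input():
        assert normalize([5, 5, 5]) == [0.0, 0.0, 0.0]

In a continuous integration setup, tests like these run automatically on every commit.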

Deliver

Daily delivery has two prerequisites. First, build and deployment must be fully automated using a tool chain. Second, every commit to the source code repository must result in a fully production-ready product that users can test at any time. Cloud-based solutions tackle this requirement and let developers concentrate on coding.

continuous-delivery

Continuous Integration and Continuous Delivery. Source: IBM Corporation

Run

When using a cloud runtime, the operational aspects of a project are handled by cloud services. Depending on requirements, this can happen in public, private, or hybrid clouds and at the infrastructure, platform, or service level. This way, the operations team can often be made obsolete, and developers can concentrate on adding value to the project. Some best practices for this phase include:

  • Readiness for high availability
  • Dark launches and feature toggles
  • Auto-scaling

high-availability

High-Availability, Auto-Scaling, and Fault-Tolerance in an intercontinental cloud deployment. Source: IBM Corporation

Manage

Because you're relying on fully managed cloud runtimes, adding intercontinental high availability/failover, continuous monitoring, and dynamic scaling isn't a challenge anymore and can simply be activated. Some best practices for this phase include:

  • Automated monitoring
  • Fast, automated recovery
  • Resiliency

Learn

Due to the very short iteration cycles and continuous user feedback, hypotheses can be tested immediately to make informed decisions and drive findings that can be added to the backlog for further pivoting. Some best practices for this phase include the following (a small A/B testing sketch follows the list):

  • A/B testing
  • Hypothesis-driven development
  • Real-time user behavior analytics
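As a minimal, hypothetical sketch of A/B testing on such user behavior data, a two-proportion z-test compares conversion rates between a control variant and a new variant; the counts below are invented.

    # A minimal, hypothetical A/B test: did variant B convert better than A?
    from statsmodels.stats.proportion import proportions_ztest

    conversions = [120, 155]   # converted users in variants A and B
    visitors = [2400, 2380]    # users exposed to each variant

    # Two-sided z-test for equal conversion rates.
    stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
    print(f"z = {stat:.2f}, p = {p_value:.4f}")

    if p_value < 0.05:
        print("Evidence that the variants differ; consider shipping B.")
    else:
        print("No significant difference yet; keep collecting data.")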

evidence-based-hypothesis-testing

Evidence-based hypothesis testing example. Source: IBM Corporation

IBM DataFirst Method

Although usually bound to an IBM client engagement, the DataFirst Method Design Engagement Offerings (the IBM DataFirst Method is an instance of the IBM Cloud Garage Method) specifically target IT transformation to get infrastructure, processes, and employees ready for AI. For more information, visit IBM DataFirst.

end-to-end-solution

IBM DataFirst Method Process Model. Source: IBM Corporation

The IBM Data and Analytics Reference Architecture

Every project is different, and every use case needs different technical components, but they can all be described in abstract terms. The following list enumerates and explains these components.

  1. Data source: An internal or external data source that includes relational databases, web pages, CSV files, JSON files, text files, video, and audio data.
  2. Enterprise data: Cloud-based solutions tend to extend the enterprise data model. Therefore, it might be necessary to continuously transfer subsets of enterprise data to the cloud.
  3. Streaming analytics: The current state of the art is batch processing. But sometimes the value of a data product can be increased by adding real-time analytics capabilities because most of the world's data loses value within seconds. Think of stock market data, or a vehicle camera that captures a pedestrian crossing a street.
  4. Data integration: The data is cleansed, transformed, and if possible, downstream features are added.
  5. Data repository: The persistent storage for your data.
  6. Discovery and exploration: Getting an idea of what data you have and what it looks like.
  7. Actionable insights: Where most of your work fits. Here you create and evaluate your machine learning and deep learning models.
  8. Applications/Data products: Models are fine, but their value rises when they can be consumed by the ordinary business user. Therefore, you must create a data product. Data products don't necessarily need to stay on the cloud. They can be pushed to mobile or enterprise applications.
  9. Security, information governance, and systems management: An important step that is easily forgotten. For many compliance regulations, it's important to control who has access to which information. An enterprise user is part of the architecture because their requirements might differ from those of a public user, and a cloud user's requirements might differ from those of enterprise users.

analytics-reference-architecture

IBM Data and Analytics Reference Architecture. Source: IBM Corporation

Conclusion

Now that you have an overview of the current state-of-the-art methods and process models for data science on the cloud, it's time to concentrate on a method that is useful for individual data scientists who want to improve their methodologies, minimize architectural overhead, and positively influence enterprise architecture from the bottom up. I call this the Lightweight IBM Cloud Garage Method for Data Science. I'll explain this method in my next article. So stay tuned!