by Divakar Mysore, Shrikant Khupat, Shweta Jain | Updated December 16, 2013 - Published December 17, 2013
Part 3 of this series describes atomic and composite patterns that address the most common and recurring big data problems and their solutions. This article suggests three solution patterns that can be used to architect a big data solution. Each solution pattern uses a composite pattern, which is made up of logical components (also covered in Part 3). At the end of this article, find a list of products and tools that map to the components of each solution pattern.
The following sections describe three solution patterns that can be used to architect a big data solution. To illustrate the patterns, we apply them to a particular use case (how to detect healthcare insurance fraud), but the patterns can be used to address many other business scenarios. Each solution pattern takes advantage of a composite pattern. In the following table, see the list of solution patterns covered here, along with the composite patterns they are based on.
Financial fraud poses a serious risk to all segments of the financial sector. In the United States, insurers lose billions of dollars annually. In India, the loss in 2011 alone totaled INR 300 billion. Apart from the financial loss, insurers are losing business because of customer dissatisfaction. Although many insurance regulatory bodies have defined frameworks and processes to control fraud, they often react to fraud rather than take proactive steps to prevent it. Traditional approaches, such as circulating lists of black-listed customers, insurance agents, and staff, do not resolve the problem of fraud.
This article proposes solution patterns for a big data solution, based on the logical architecture and the composite patterns described in Part 3 of this series.
Insurance fraud is an act or omission intended to gain dishonest or unlawful advantage, either for the party committing the fraud or for other related parties. Broad categories of fraud include:
The insurance regulatory boards have established anti-fraud policies, which include well-defined processes for monitoring fraud, for searching for potential fraud indicators (and publishing a list), and for coordinating with law enforcement agencies. The insurers have staff dedicated to analyzing fraudulent claims.
The insurance regulators have well-defined fraud-detection and mitigation processes. Traditional solutions use models based on historical fraud data, black-listed customers and insurance agents, and regional data about fraud peculiar to a certain area. The data available for detecting fraud is limited to the given insurer’s IT systems and a few external sources.
Current fraud-detection processes are mostly manual and work on limited data sets. Insurers may not be able to investigate all the indicators. Fraud is often detected very late, and it is difficult for the insurer to do adequate follow-up for each fraud case.
Current fraud detection relies on what is known about existing fraud cases, so every time a new type of fraud occurs, insurance companies have to bear the consequences for the first time. Most traditional methods work within a particular data source and cannot accommodate the ever-growing variety of data from different sources. A big data solution can help address these challenges and play an important role in fraud detection for insurance companies.
This solution pattern is based on the store-and-explore composite pattern. It focuses on acquiring and storing the relevant data from various sources inside or outside the enterprise. The data sources shown in Figure 1 are examples only; domain experts can identify the appropriate data sources.
Because a large volume of varied data from many sources must be collected, stored, and processed, this business challenge is a good candidate for a big data solution.
The following diagram shows the solution pattern, mapped onto the logical architecture described in Part 3.
Figure 1 uses data providers from:
The data required for healthcare fraud detection can be acquired from various sources and systems such as banks, medical institutions, social media, and Internet agencies. It includes unstructured data from sources such as blogs, social media, news agencies, reports from various agencies, and X-ray reports. See the data sources layer in Figure 1 for more examples. With big data analytics, the information from these varied sources can be correlated and combined, and — with the help of defined rules — analyzed to determine the possibility of fraud.
In this pattern, the required external data is acquired from data providers who contribute unstructured data that has been preprocessed into a structured or semi-structured format. This data is stored in the big data stores after initial preprocessing. The next step is to identify possible entities and generate ad-hoc reports from the data.
Entity identification is the task of recognizing named elements in the data. All entities required for analysis must be identified, including loose entities that have no relationships to other entities. Entity identification is mostly performed by data scientists and business analysts. Entity resolution can be as simple as identifying single entities or as complex as resolving entities based on data relationships and context. This pattern uses the simple-form entity resolution component.
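As a minimal sketch of what simple-form entity identification might look like, the following Python example pulls named elements out of a free-text claim note with dictionary lookups and regular expressions. The provider names, the POL-nnnnnn policy-ID convention, and the amount format are all assumptions for illustration; a real solution would use dedicated text-analytics and entity-resolution tooling.

```python
import re

# Hypothetical dictionary of known provider names (an assumption).
KNOWN_PROVIDERS = {"City General Hospital", "Lakeside Clinic"}

def identify_entities(text):
    """Return entities found in the text, keyed by entity type."""
    entities = {"providers": [], "policy_ids": [], "amounts": []}
    # Dictionary lookup for provider names.
    for provider in KNOWN_PROVIDERS:
        if provider in text:
            entities["providers"].append(provider)
    # Policy IDs assumed to follow a POL-nnnnnn convention.
    entities["policy_ids"] = re.findall(r"POL-\d{6}", text)
    # Monetary amounts such as "1,200.50".
    entities["amounts"] = re.findall(r"\d[\d,]*\.\d{2}", text)
    return entities

note = "Claim POL-123456 from City General Hospital for 1,200.50"
print(identify_entities(note))
```

Even this crude extraction yields single, loose entities that can later be linked by relationships and context in the more complex resolution step.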
Structured data can simply be converted into the format most appropriate for analysis and stored directly in big data structured storage.
Ad-hoc queries can then be performed on this data to retrieve the information of interest.
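As an illustration of such an ad-hoc query, the sketch below uses Python's built-in sqlite3 module as a stand-in for a big data structured store with a SQL layer. The table layout and column names are assumptions, not part of this article; the query itself (claim count and total claimed amount per policy holder) is the kind of exploratory question this pattern supports.

```python
import sqlite3

# sqlite3 stands in here for a big data structured store; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claims (claim_id TEXT, policy_id TEXT, provider TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?, ?)",
    [("C1", "POL-1", "City General", 900.0),
     ("C2", "POL-1", "City General", 1500.0),
     ("C3", "POL-2", "Lakeside", 300.0)],
)

# Ad-hoc query: claim count and total claimed amount per policy holder.
rows = conn.execute(
    "SELECT policy_id, COUNT(*), SUM(amount) FROM claims "
    "GROUP BY policy_id ORDER BY SUM(amount) DESC"
).fetchall()
print(rows)  # [('POL-1', 2, 2400.0), ('POL-2', 1, 300.0)]
```

On a real big data platform, the same question would typically be asked through a SQL-on-Hadoop engine over far larger data sets.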
As the name implies, organizations typically get started with big data by adopting this pattern. They employ an exploratory approach to assess what kind of insight could be generated from the available data. At this stage, organizations generally do not invest in advanced analytics techniques such as machine learning, feature extraction, and text analytics.
This pattern is more advanced than the getting-started pattern. It predicts fraud at three stages of claim processing:
For cases 1 and 2, the claims can be processed in batch, and the fraud-detection process can be initiated as part of the regular reporting process or on request from the business. Case 3 can be processed in near-real time: the claims request interceptor intercepts the claim request, initiates the fraud-detection process if the indicators report it as a possible fraud case, and then notifies the stakeholders identified in the system. The earlier the fraud is detected, the less severe the risk or loss.
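The near-real-time interception in case 3 can be sketched as follows. This is a hedged, minimal illustration: the indicator checks, the threshold, the blacklist contents, and the status strings are all assumptions, and a production interceptor would sit in the claims-processing pipeline rather than be a standalone function.

```python
# Hypothetical indicator checks; thresholds and names are assumptions.
def high_amount(claim):
    return claim["amount"] > 10000

BLACKLISTED_PROVIDERS = {"Shady Clinic"}

def blacklisted_provider(claim):
    return claim["provider"] in BLACKLISTED_PROVIDERS

INDICATORS = [high_amount, blacklisted_provider]

def intercept(claim, notify):
    """Screen an incoming claim; hold it and alert stakeholders if indicators fire."""
    fired = [check.__name__ for check in INDICATORS if check(claim)]
    if fired:
        notify(claim["claim_id"], fired)  # alert the identified stakeholders
        return "HELD_FOR_REVIEW"
    return "FORWARDED"

alerts = []
status = intercept(
    {"claim_id": "C9", "provider": "Shady Clinic", "amount": 500},
    lambda claim_id, reasons: alerts.append((claim_id, reasons)),
)
print(status, alerts)
```

The same indicator checks could be reused by the batch process for cases 1 and 2, applied to a day's or week's worth of claims at a time.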
Figure 2 uses:
In this pattern, organizations can choose to preprocess unstructured data before analyzing it.
The data is acquired and stored, as-is, in unstructured data storage. It is then preprocessed into a format that the analysis layer can consume. At times, this preprocessing can be complex and time-consuming. Machine-learning techniques can be used for text analytics, and a Hadoop image-processing framework can be useful for processing images. JSON is the most widely used format for the preprocessed output. The preprocessed data is then stored in structured data storage, such as HBase.
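A minimal sketch of this preprocessing step is shown below: a raw, unstructured claim report is reduced to a structured JSON record before being written to a structured store. The field names, the POL-nnnnnn policy-ID convention, and the amount format are illustrative assumptions, not part of the article.

```python
import json
import re

def preprocess(raw_report):
    """Convert a raw, unstructured claim report into a structured JSON record."""
    policy = re.search(r"POL-\d{6}", raw_report)
    amount = re.search(r"\d[\d,]*\.\d{2}", raw_report)
    record = {
        "policy_id": policy.group() if policy else None,
        "amount": float(amount.group().replace(",", "")) if amount else None,
        "raw_length": len(raw_report),
    }
    return json.dumps(record)

print(preprocess("Report for POL-654321: billed 2,100.00 for imaging"))
```

In a real pipeline this transformation would run at scale (for example, as a MapReduce or Spark job), with the resulting records loaded into a store such as HBase.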
The core component in this pattern is the fraud-detection engine, which is composed of the advanced analytics capabilities that help predict fraud. Well-defined and frequently updated fraud indicators help identify fraud, and technology can be used to implement systems that act on them. Common fraud indicators include:
Traditional methods alone are not adequate to predict fraud. Social-network analytics are required to detect links between licensed and non-licensed healthcare providers and to detect relationships between policy holders, medical institutions, associates, suppliers, and partners. Validating the authenticity of documents and finding the credit score of individuals are difficult tasks to accomplish with traditional approaches.
During analysis, the search for all of these indicators can occur simultaneously on a huge volume of data. Every indicator is weighted. The total weight across all indicators indicates the accuracy and severity of the predicted fraud.
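The weighting scheme described above can be sketched as follows. The indicator names, the weights, and the idea of comparing the total against a review threshold are all assumptions for illustration; real weights would be calibrated from historical fraud data.

```python
# Hypothetical indicator weights (assumptions, not from the article).
INDICATOR_WEIGHTS = {
    "duplicate_claim": 0.4,
    "blacklisted_provider": 0.8,
    "unusually_high_amount": 0.5,
}

def fraud_score(fired_indicators):
    """Sum the weights of all indicators that fired for a claim."""
    return sum(INDICATOR_WEIGHTS.get(name, 0.0) for name in fired_indicators)

score = fraud_score(["duplicate_claim", "unusually_high_amount"])
print(round(score, 2))  # 0.9
```

The higher the total weight across all fired indicators, the more confident and severe the fraud prediction, which in turn can drive how urgently the case is escalated.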
When the analysis is complete, alerts and notifications can be sent to relevant stakeholders, and reports can be generated to show the outcome of analysis.
This pattern is suitable for enterprises that need to perform advanced analytics using big data. It involves performing complex preprocessing so that the data can be stored in a form that can be analyzed using advanced techniques, such as feature extraction, entity resolution, text analytics, machine learning, and predictive analytics. This pattern does not involve taking any action or suggesting recommendations on the output of analysis.
The fraud predictions made in the advanced-business-insight solution pattern normally lead to certain actions, such as rejecting the claim, putting it on hold until additional clarification and information is received, or reporting it for legal action. In this pattern, actions are defined for each outcome of the prediction. This outcome-to-action table is referred to as an action-decision matrix.
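An action-decision matrix of this kind can be sketched as a simple lookup table. The outcome labels and action names below are illustrative assumptions; the actions themselves (reject, hold, report) come from the text above.

```python
# Hypothetical outcome-to-action table; labels are assumptions.
ACTION_DECISION_MATRIX = {
    "severe_fraud": "report_for_legal_action",
    "likely_fraud": "reject_claim",
    "suspicious": "hold_for_clarification",
    "clean": "approve_claim",
}

def decide(outcome):
    """Map a prediction outcome to its predefined action."""
    # Default to manual review for outcomes the matrix does not cover.
    return ACTION_DECISION_MATRIX.get(outcome, "manual_review")

print(decide("suspicious"))       # hold_for_clarification
print(decide("unknown_outcome"))  # manual_review
```

In practice, each action would trigger an automated workflow (a claim-rejection notice, a clarification request, or a referral to the investigation team) rather than just return a label.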
Figure 3 uses:
Typically, three kinds of actions can be taken:
This pattern also suits enterprises that need to perform advanced analytics using big data. It uses advanced capabilities to detect fraud, to notify and alert relevant stakeholders, and to initiate automatic workflows that take action based on the outcome of processing.
The following diagram shows how big data software maps to the various components of the logical architecture described in Part 3. These are not the only products, technologies, or solutions that can be used in a big data solution; your own requirements and environment must shape the tools you choose to deploy.
Figure 4 shows big data appliances, such as IBM PureData™ System for Hadoop and IBM PureData System for Analytics, cutting across layers. These appliances have features such as built-in visualization, built-in analytic accelerators, and a single system console. Using an appliance has many advantages. (See Related topics for more information about the IBM PureData System for Hadoop.)
Using big data analytics for detecting fraud has various benefits over traditional approaches. Insurance companies can build systems that include all relevant data sources. An all-encompassing system helps detect uncommon cases of fraud. Techniques such as predictive modeling thoroughly analyze instances of fraud, filter obvious cases, and refer low-incidence fraud cases for further analysis.
A big data solution can also help build a global perspective of the anti-fraud efforts throughout the enterprise. Such a perspective often leads to better fraud detection by linking associated information within the organization. Fraud can occur at a number of source points: claims processing, insurance surrender, premium payment, application for a new policy, or employee-related or third-party fraud. Combined data from various sources enables better predictions.
Analytics technologies enable an organization to extract important information from unstructured data. Although volumes of structured information are stored in data warehouses, most of the crucial information about fraud is in unstructured data, such as third-party reports, which are rarely analyzed. In most insurance agencies, social media data is not appropriately stored or analyzed.
Using business scenarios based on the use case of identifying fraud in the insurance industry, this article describes solution patterns that vary in complexity. The simplest pattern addresses storing data from various sources and doing some initial exploration. The most complex covers how to gain insight from the data and take action based on the analysis.
Each business scenario is mapped to the appropriate atomic and composite patterns that make up the solution pattern. Architects and designers can apply the solution pattern to define the high-level solution and functional components of an appropriate big data solution.