Tag Archives: Text Mining

Big Data, Data Warehousing and Data Mining

Michael Koploy from Software Advice recently posed a question about plain definitions of some basic Business Intelligence concepts: Big Data, Data Warehousing and Data Mining. Although the question seems quite simple, it is thought-provoking because of the changes that BI has been going through over the last year or two. New developments in this area force us to look again at these concepts. Here is my view on these three topics:

Big Data

Simple definition:

The concept of big data is not new, although it has gained popularity in recent years. It describes all the data available to an organization, both structured and unstructured. It is characterized by its large volume, variety and velocity, which make it challenging to analyze. Until recently, organizations tended to limit the amount of information by putting brakes and structure on it through governance and architecture. Too much information was considered a bad thing because systems had limited capacity and capability to process it.

How it is changing:

The old saying ‘garbage in, garbage out’ no longer holds. Organizations have realized that among the garbage there might be a lot of valuable information that could be monetized. This could be done directly or indirectly, and used not only to generate revenue but also to gain competitive advantage. The value of information might not be correctly estimated at the time of its creation or during its initial intended use. Value is often defined by context (to paraphrase, “value is in the eye of the beholder”), and it also varies over time. Traditional BI dealt primarily with structured data, as it was easier to work with and produced results quickly. The rest was mostly ignored or treated as a necessary evil. The problem, however, is that unstructured data constitutes around 80 to 85% of the data within an organization or floating out there on the web, and it could in one way or another be related to the business. Social networks like Facebook and Twitter, blogs, discussions, memos, emails and so on are all equally potential sources of useful information. Winners are separated from losers by the ability to see value where others do not, and the ability to use it.


Data Warehousing

Simple definition:

Traditionally, data warehousing is the process of consolidating and aggregating information from various sources within the organization for use in historical analysis and reporting. The outputs of the analysis are used for operational, tactical or strategic planning. Before the data can be used for these purposes, however, it has to go through a process of cleanup, standardization, normalization, integration and so on. Once stored in the Data Warehouse, it can be aggregated and correlated to find answers to typical business questions.

How it is changing:

Once data is in the Data Warehouse it becomes relatively non-volatile and time variant, representing the subject-oriented historical value of the data. Here is the problem in the new world: the process of standardizing and structuring the data often strips out the most valuable part, the intrinsic relationships between data, which might not be visible at the time when the structuring rules are established. Data Warehouses are usually created with specific goals in mind, and these goals can change relatively quickly. Adjusting a Data Warehouse to fit new goals can be as painful as turning a large ship in a narrow fjord. In the light of Big Data, the whole concept will have to be reevaluated.


Data Mining

Simple definition:

In short, it is the discovery of the true meaning of data from large datasets that integrate structured and unstructured data. These datasets might come from data warehouses or from any other data source. Data mining helps to answer specific business questions that might be unique and might not have predefined processing paths.

How it is changing:

Data mining builds on available data and is thus closely related to the two terms discussed above. Since these terms are changing, so is the concept of data mining. Organizations need to employ innovative techniques such as statistical tools, semantic analysis, neural networks, artificial intelligence and so on to extract information from a combination of structured and unstructured data in order to gain knowledge. This single step is what separates the ‘wheat from the chaff’, winners from losers; it is the ‘holy grail’ of Business Intelligence.

Majority prefers ‘big data’ on premises rather than in the cloud

According to a recent AIIM survey, ‘big data’ adoption is going to double to 17% during the next 12 months. This penetration is expected to increase further to about 60% within the next 3 years. The survey confirms the old truth about the need for a holistic view of the data: over 61% of respondents would like to see integrated information coming from both structured and unstructured sources. Classification of unstructured data seems to be an ongoing problem, with over 70% of organizations finding it easier to find information on the web than on their own internal networks. Although search techniques and tools have improved over the years, the adoption of new technologies seems to be pretty slow. Another big factor in this is poor data governance. With regard to analysis of the data, the requirements don’t seem to be very sophisticated, indicating that organizations still struggle with a strategy for using ‘big data’ effectively. Most respondents would be satisfied simply with basic pattern analysis, keyword correlation, incident prediction and fraud prevention. This seems to be confirmed by the lack of an answer to an important question. When asked about a ‘killer application’ for their business area, over 88% of respondents said that it would make a big difference to their business, but when asked what it would be, the majority declined to answer.

Another interesting fact from the report is that most respondents seem to confuse search with data analytics. Although there is some overlap between the two, the former is about returning results that match selection criteria, while the latter is about processing the data to answer a specific business question.
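
To make the distinction concrete, here is a minimal Python sketch. The feedback records, the list of problem words and both function names are invented for illustration and do not come from the AIIM report; the point is only that search hands back matching records, while analytics processes records into an answer.

```python
from collections import Counter

# Hypothetical feedback records, invented purely for illustration.
FEEDBACK = [
    {"product": "router", "text": "The router keeps dropping the connection"},
    {"product": "router", "text": "Setup was easy and the router works well"},
    {"product": "modem", "text": "Modem failed after a week, connection lost"},
    {"product": "router", "text": "Connection drops every evening"},
]

# Assumed vocabulary that marks a record as a problem report.
PROBLEM_WORDS = {"dropping", "drops", "failed", "lost"}


def search(records, keyword):
    """Search: return the records whose text matches the selection criterion."""
    return [r for r in records if keyword.lower() in r["text"].lower()]


def problems_per_product(records, problem_words):
    """Analytics: process the records to answer a business question,
    here 'which product attracts the most problem reports?'."""
    counts = Counter()
    for r in records:
        words = set(r["text"].lower().replace(",", " ").split())
        if words & problem_words:  # at least one problem word present
            counts[r["product"]] += 1
    return counts


if __name__ == "__main__":
    print(len(search(FEEDBACK, "connection")), "records mention 'connection'")
    print(problems_per_product(FEEDBACK, PROBLEM_WORDS).most_common(1))
```

The search result still has to be read by a person; the analytics result is already shaped like the answer to the question that was asked.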

Lastly, some not so good news for cloud vendors: over 88% of respondents would prefer on-premises big data storage and analysis rather than SaaS solutions. This seems to be related to the perception of poor data protection in externally hosted applications (although only 64% of respondents explicitly stated this). The majority considers business insights to be the organization’s intellectual property. Cloud providers will have to work harder to convince the market, as the data security question will continue to be the primary barrier to cloud adoption.

Text analytics and business intelligence

Text analytics has been getting more popular recently. For years it was perceived as the stepchild of business intelligence. I recently saw research results indicating that most organizations that had implemented business intelligence were still waiting to realize their ROI. I think the problem is that BI, in its current narrow definition of dealing primarily with structured data, gives only partial answers to business questions. After all, only 15 to 20% of the information that organizations deal with is structured.

Interestingly, the concept of business intelligence was first introduced in the IBM Journal in the 1950s by Hans Peter Luhn, in his article “A Business Intelligence System”. He defined it as “automatic method to provide current awareness services to scientists and engineers” and “interrelationships of presented facts in such way as to guide action towards desired goal”. Luhn did not refer selectively to structured data; as a matter of fact, part of his life was devoted to solving the problems of information retrieval and storage faced by libraries and by document and records centers. Even at IBM, computerized methods were still at a very early stage in the 1950s. Over the years, however, as computers became part of business life, the analysis of data took the path of least resistance: the exploration of data that is structured and by its nature fairly straightforward to compare, categorize and mine for trends, data to which one could apply mathematical models. Thus, over time, structured data analysis became almost synonymous with business intelligence. Text analytics survived mainly in business domains such as market research or pharma.

Recently, however, text analytics has been experiencing a renaissance, and there are several reasons for this. One is national security: governments are spending billions of dollars on developing analytical tools that allow them to search ‘big data’ in the shortest possible time to identify threats. Another is that a lot of organizations have realized that they need to listen more to their customers; hence market research disciplines like customer experience management, enterprise feedback management and voice of the customer in CRM are booming. A further factor accelerating the rise of text analytics is the change in the way we communicate, brought about by the latest social technologies and ‘big data’.

The concept of ‘big data’ is somewhat misleading. After all, storage cost and size are not the problem: a lot of companies selling cloud services offer a few gigabytes here and there for free. The issue is not so much the size of the data as the extent and degree of its ‘unstructuredness’. To make sense of the stored information and make use of it, organizations need methods, tools and processes to digest and analyze the data. In the last sentence I made a purposeful distinction between data and information: the former is a set of raw facts, while the latter is data put in context, creating specific meaning for the user. The speed of change in the way we communicate already makes the term ‘text analytics’ old-fashioned. We are now talking about the analysis of all types of unstructured data: not only text, but also voice messages, videos, drawings, pictures and other rich media.

So what is text analytics about? It is simply a set of techniques and models for turning text into data that can be analyzed further, as in traditional business intelligence, allowing organizations to respond to business problems. By generating semantics, text analytics provides a link between search and traditional business intelligence, turning data retrieval into an information delivery mechanism. The process discovers and presents the uncovered facts, business rules and relationships. Several analytical methods are employed in this process, using statistical, linguistic and structural techniques. Here are a few examples:

  • Named entity recognition – identifying the names of people, organizations, locations, symbols and so on in textual sources
  • Similarity detection and disambiguation – using contextual clues to distinguish, for example, that the word “bass” refers to the fish and not to the instrument
  • Pattern based recognition – employing regular expressions, for example to identify and standardize phone numbers, emails, postal codes and so on
  • Concept recognition – clustering data entities around defined ideas
  • Relationship recognition – finding associations between data entities
  • Co-reference recognition – resolving multiple terms that refer to the same object, which can be quite complex. In the two sentences below, the pronoun “He” refers to a different person in each:
    • Paul gave money to Stephen. He had nothing left.
    • Paul gave money to Stephen. He was rich.
  • Sentiment techniques – subjective analysis to discover attitude based on source data – opinion, mood, emotion, sentiment
  • Quantitative analysis – extracting semantic or grammatical relationships between words to find meaning
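
As a rough illustration of the pattern based recognition item above, here is a minimal Python sketch that uses regular expressions to find and standardize emails, phone numbers and postal codes. The patterns, the North American phone format, the Canadian-style postal code and the function name extract_entities are all assumptions made for this example; real-world data would need far more tolerant rules.

```python
import re

# Illustrative patterns only; real formats vary much more widely.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumed North American phone formats, e.g. (416) 555-0199 or 416.555.0199
PHONE_RE = re.compile(r"\(?\b(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})\b")
# Assumed Canadian-style postal codes, e.g. M5V 2T6 or m5v2t6
POSTAL_RE = re.compile(r"\b([A-Za-z]\d[A-Za-z])[ -]?(\d[A-Za-z]\d)\b")


def extract_entities(text: str) -> dict:
    """Find pattern-based entities in free text and return them in a standard form."""
    # Emails: lower-cased so the same address always compares equal.
    emails = [m.group(0).lower() for m in EMAIL_RE.finditer(text)]
    # Phones: rebuilt from the captured digit groups as NNN-NNN-NNNN.
    phones = ["-".join(m.groups()) for m in PHONE_RE.finditer(text)]
    # Postal codes: upper-cased with a single space between the two halves.
    postal = [f"{m.group(1)} {m.group(2)}".upper() for m in POSTAL_RE.finditer(text)]
    return {"emails": emails, "phones": phones, "postal_codes": postal}


if __name__ == "__main__":
    note = ("Call (416) 555-0199 or 416.555.0123, "
            "write to Jane.Doe@Example.com, office at m5v2t6.")
    print(extract_entities(note))
```

Each pattern both finds the value and captures its parts, so the standardization step is just a matter of reassembling the captured groups in one agreed format.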

As we can see, extracting information from unstructured data can be immensely difficult, although it is not an impossible task. It does, however, require quite a lot of commitment from the organization to implement. If done properly, it can help address a lot of the problems that enterprises face today. These problems are related to perceived information overload, poor information governance, and low-quality metadata that leads to poor findability and knowledge. This in turn impacts organizational productivity. As for information external to organizations, the ‘big data’ question of how to monetize social media will drive new technological and business solutions. It seems that this is one of the areas that will experience substantial growth. Using only ‘transactional’ business intelligence based on structured information is insufficient for organizations to get the full picture. The solution is rather ‘integrated business intelligence’, which combines structured and unstructured data to provide answers to business questions.