
Text analytics and business intelligence

Text analytics has been getting more popular recently. For years it was perceived as a stepchild of business intelligence. I recently saw research results indicating that most organizations that implemented business intelligence were still waiting to realize their ROI. I think the problem is that BI, in its current narrow definition of dealing primarily with structured data, gives only partial answers to business questions. After all, only 15 to 20% of the information that organizations deal with is structured.

Interestingly, the concept of business intelligence was first introduced in the IBM Journal in the 1950s by Hans Peter Luhn in his article “A Business Intelligence System”. He defined it as “automatic method to provide current awareness services to scientists and engineers” and “interrelationships of presented facts in such way as to guide action towards desired goal”. Luhn did not refer selectively to structured data; as a matter of fact, part of his life was devoted to solving the problems of information retrieval and storage faced by libraries and by document and records centers. Even for IBM in the 1950s, computerized methods were still at a very early stage. Over the years, however, as computers became part of business life, the analysis of data followed the path of least resistance: exploration of data that is structured and, by its nature, fairly straightforward to compare, categorize and mine for trends; data that one can process with mathematical models. Thus, over time, structured data analysis became almost synonymous with business intelligence. Text analytics survived in business domains such as market research or pharma.

Recently, however, text analytics has been experiencing a renaissance, and there are several reasons for this. One is national security: governments are spending billions of dollars on developing analytical tools that allow them to search ‘big data’ in the shortest possible time to identify threats. Another is that a lot of organizations have realized that they need to listen more to their customers; hence disciplines like customer experience management, enterprise feedback management and voice of the customer in CRM are booming in market research. A further factor accelerating the rise of text analytics is the change in the way we communicate, brought about by the latest social technologies and ‘big data’.

The concept of ‘big data’ is somewhat misleading. After all, storage cost and size are not the problem; many companies selling cloud services offer a few gigabytes here and there for free. The issue is not so much the size of the data as the size and degree of its ‘unstructuredness’. To make sense of the stored information and make use of it, organizations need methods, tools and processes to digest and analyze the data. In the last sentence I made a purposeful distinction between data and information: the former is a set of raw facts, while information is data put in context, creating specific meaning for the user. The speed of change in the way we communicate already makes the term ‘text analytics’ old-fashioned. We are now talking about the analysis of all types of unstructured data: not only text, but also voice messages, videos, drawings, pictures and other rich media.

So what is text analytics about? It is simply a set of techniques and models for turning text into data that can be analyzed further, as in traditional business intelligence, allowing organizations to respond to business problems. By generating semantics, text analytics provides the link between search and traditional business intelligence, turning data retrieval into an information delivery mechanism. The process discovers and presents the uncovered facts, business rules and relationships. Several analytical methods are employed in this process, using statistical, linguistic and structural techniques. Here are a few examples:

  • Named entity recognition – identifying names of people, organizations, locations, symbols and so on in textual sources
  • Similarity detection and disambiguation – using contextual clues to determine, for example, that the word “bass” refers to the fish and not to the instrument
  • Pattern-based recognition – employing regular expressions, for example to identify and standardize phone numbers, emails, postal codes and so on
  • Concept recognition – clustering data entities around defined ideas
  • Relationship recognition – finding associations between data entities
  • Co-reference recognition – resolving multiple terms that refer to the same object, which can be quite complex; in the example below the same pronoun refers to a different person in each sentence:
    • Paul gave money to Stephen. He had nothing left.
    • Paul gave money to Stephen. He was rich.
  • Sentiment techniques – subjective analysis to discover attitude based on source data – opinion, mood, emotion, sentiment
  • Quantitative analysis – extracting semantic or grammatical relationships between words to find meaning
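
To make the pattern-based recognition item from the list above more concrete, here is a minimal Python sketch. The two regular expressions are my own simplified assumptions (the phone pattern only covers 10-digit North American style numbers), and the function names are hypothetical; real-world formats vary far more widely than this.

```python
import re

# Illustrative patterns only; real-world email and phone formats vary widely.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumption: 10-digit numbers with optional brackets and separators.
PHONE_RE = re.compile(r"\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})")


def extract_emails(text):
    """Return all email-like strings found in the text, lowercased."""
    return [match.lower() for match in EMAIL_RE.findall(text)]


def extract_phones(text):
    """Return phone numbers standardized to the form 604-555-0199."""
    return ["-".join(groups) for groups in PHONE_RE.findall(text)]


sample = "Call Paul at (604) 555-0199 or 604.555.0123, or write to Paul@Example.com."
print(extract_emails(sample))
print(extract_phones(sample))
```

The point of the sketch is the second step: the text is not only searched, it is turned into standardized values that a traditional BI tool could then count, join or group.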

As we can see, extracting information from unstructured data can be immensely difficult, although it is not an impossible task. It does, however, require a lot of commitment from the organization implementing it. Done properly, it can help address many of the problems that enterprises face today: perceived information overload, poor information governance, and low-quality metadata that leads to poor findability and knowledge. These in turn impact organizational productivity. As for information external to organizations, the ‘big data’ question of how to monetize social media will drive new technological and business solutions. It seems that this is one of the areas that will experience substantial growth. ‘Transactional’ business intelligence based on structured information alone is insufficient for organizations to get the full picture. The solution is rather an ‘integrated business intelligence’ that combines structured and unstructured data in providing answers to business questions.

SharePoint – Records Center or In-Place Records Management?

SharePoint 2010 brought some new capabilities, but at the same time it challenged implementation teams with some tough decisions. One of them is how to implement records management. In MOSS 2007 it was simple: the only way to achieve the functionality was to set up a Records Center site. In that case, for content to be declared a record, it had to be moved to a separate storage area. SharePoint 2010 now also offers In-Place Records Management: content that is declared a record stays where it was originally, but additional information management policies need to be applied to make sure it is immutable. Which solution is better? Which one should be chosen?

As expected, there is no simple answer to this question: it depends. But once the decision is made, the organization needs to live with its consequences. The way back is costly and time-consuming, which usually makes reversing course unfeasible. So what are the pros and cons of each solution? The list below captures some of the key differences and their potential impact. Please note that some of the functionality was split to reflect the fact that business users and records managers are often driven by conflicting requirements: ease of filing, access, finding information and the ability to collaborate for business users; the ability to restrict access, protection and enforcement of retention rules for records managers.

Retention
  • In-place: Implemented through information management policies by content type. It might provide more flexibility in getting the rules more granular, but at the cost of maintenance complexity.
  • Records Center: Simple – once a record is placed in its bucket, it inherits its retention rules.
  • Comment: Most business users are not concerned with retention; this is of primary interest to records managers. However, what needs to be taken into account when implementing in-place records management is that a record's lifespan might be longer than that of the hosting site. This creates potential problems with records preservation when the site needs to be disposed of. It could lead to a tendency to keep obsolete sites live, exposing the organization to legal and regulatory risks and increased storage costs.

Security/Accessibility
  • In-place: No ability to restrict access to records; the record maintains the same visibility across its lifecycle.
  • Records Center: The content visibility and the ability to see its existence in search results can be restricted.
  • Comment: This could be a concern for records of a sensitive nature, especially in HR and Legal departments, or in the case of mergers and acquisitions.

Findability of information – business user perspective
  • In-place: Excellent, since records reside within their context in their corresponding libraries and folders.
  • Records Center: Might be poor, since the same content types reside in the same buckets.
  • Comment: This category primarily addresses the needs of business users: to locate information quickly and easily. Since in an in-place implementation records are preserved at their source, it is easy to locate the information through its context. In a Records Center implementation, the key success factors are good governance policies, their implementation, and rich, good-quality metadata.

Findability of records / eDiscovery – records manager perspective
  • In-place: Usually good, though the search needs to span multiple sites.
  • Records Center: Good, since all records are located in the Records Center, but eDiscovery will require searching both the sites and the Records Center.
  • Comment: In the case of a Records Center, good-quality metadata is important. eDiscovery of records in the Records Center is fairly straightforward and quick; however, since eDiscovery covers any content, declared as a record or not, it will not eliminate the need to search across all locations.

Ease of records management
  • In-place: Complex, since records are spread across various sites, libraries and folders.
  • Records Center: Easy, since records reside in a central location with common sets of rules.
  • Comment: Managing records declared in-place might become messy. Strict governance and control of the granularity of information management policies is required. The governance must cover how to handle records whose survivability exceeds the site lifespan, as well as define who can un-declare or supersede records per site. Auditing of records management and records reporting becomes more complex.

Ease of site management
  • In-place: Complex, since sites contain both mutable and immutable content.
  • Records Center: Simple – sites contain only documents that are not yet declared as records, or stubs to Records Center content.
  • Comment: Sites with in-place records management become more difficult to manage due to differences in how records and transitory documents are handled. Strict governance is required.

Ability to audit records
  • In-place: More complex.
  • Records Center: Simple.
  • Comment: The ability to audit records in an in-place implementation depends on each site's implementation of audit policies. There are no out-of-the-box compliance reports available. Strict governance is required.

Administrative security
  • In-place: By site administrators.
  • Records Center: By records managers.
  • Comment: In an in-place implementation, site administrators have the ability to manage both transitory documents and records. This might not be desirable for organizations in heavily regulated industries, where sole responsibility for the preservation of records resides with records managers.

Storage
  • In-place: Transitory documents and records reside on the same storage medium.
  • Records Center: Scalability could easily be ensured by placing records on a separate storage medium.
  • Comment: An in-place implementation might lead to increased storage requirements for both documents that are being actively collaborated on and records that might rarely be accessed. Performance issues, security and organizational disaster recovery requirements must be taken into account (this is not the same as simple backups).

Declaring of Document Sets as records
  • In-place: Yes.
  • Records Center: No.
  • Comment: The current version of SharePoint does not allow Document Sets to be declared as records in a Records Center.
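
One governance case from the comparison above, in-place records whose retention period outlasts their hosting site, can be illustrated with a small Python sketch. Everything here is a hypothetical model for illustration: the `Record` class, the `retain_until` field and the site names are my own assumptions, and the code does not call any SharePoint API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    title: str
    site: str
    retain_until: date  # hypothetical: end of the record's retention period


def records_outliving_site(records, site_disposal_dates):
    """Return records whose retention ends after their site's planned disposal."""
    at_risk = []
    for record in records:
        disposal = site_disposal_dates.get(record.site)
        # Sites without a planned disposal date pose no immediate problem.
        if disposal is not None and record.retain_until > disposal:
            at_risk.append(record)
    return at_risk


records = [
    Record("Contract A", "project-x", date(2020, 12, 31)),
    Record("Meeting notes", "project-x", date(2013, 6, 30)),
    Record("HR policy", "hr-portal", date(2018, 1, 1)),
]
site_disposal_dates = {"project-x": date(2014, 1, 1)}  # hr-portal: no disposal planned

for record in records_outliving_site(records, site_disposal_dates):
    print(record.title, "must be preserved beyond the disposal of", record.site)
```

A report along these lines is the kind of check the governance process would need before any site with in-place records is retired.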


So how do we determine which one is more suitable for a given organization? Several factors will ultimately influence the decision, such as:

–          Company culture – strict or more relaxed

–          How heavily regulated is the industry

–          What are the legal, regulatory and statutory requirements

–          Existing processes for handling records – is there already dedicated staff to manage records?

–          Business continuity planning requirements

–          Existing business processes – are Document Sets well suited to the organization? (This is a weak point, however, as I am sure that Microsoft is going to come up with a solution for Document Sets handling soon.)

–          Information growth rate and proliferation of sites and sites collections

The decision on the method of implementing records management should not be taken lightly, as it will have long-term impacts on costs, change management, user adoption, governance, site and records management, compliance and more. There is no easy way back.

Where is that tap?

Here is the latest example of poor record keeping and its associated costs, as it happened last week in the area where I live: “Ruptured Line not on maps: Fortis”. In short, an excavator ruptured a natural gas line, which resulted in the evacuation of the whole neighborhood, organizing transport of residents to temporary locations, closed businesses, rerouted traffic, and the full presence of police, fire and rescue services. It took a while for Fortis, the natural gas provider, to locate the leak and cut off the supply. The contractor was not at fault here: before digging, they checked with Fortis whether there were any pipes in the area. After the fact, Fortis stated that the pipe was more than 40 years old and was not indicated on the map. I suspect that in reality the pipe was on a map, as it was supplying gas to a building that no longer exists. Rather, the problem was that Fortis was not able to locate the latest version of the map, and based its excavation approval on outdated records.

The positive side of this event is that it should be fairly easy for Fortis to develop and approve a business case for an improved records management system. One of the biggest problems facing information management projects is that they are always low priority, due to the intangibility of most of their benefits and risks. There is always something more important that generates revenue. Document and records management are mostly perceived as cost centers, until accidents like this happen. Fortunately, in this case there was no further damage and nobody was injured. But this is definitely an opportunity to quantify the costs and risks in the business case and get the problem fixed. In this case, these will be the costs of the emergency services, the evacuation, the investigation, rectifying the problem and so on. Safety, health and environment risks will come at the top of the priorities, and let’s not forget about reputational risks: protecting the public trust, and protecting the organization in litigation, should any follow. One door closes, another opens….

Is Email on its way out?

Recently I read some predictions that email is an idea of the past and is eventually going to vanish. Although I do not agree with this statement in its entirety, there is some merit in this way of thinking. Email might soon share the same fate as the phone (not to mention epistolography; does anybody still remember the art of writing letters?). At the forefront of this new development is Atos, I think the first organization to officially ban the use of email, replacing it with more collaborative tools. They must know what they are doing; after all, this organization is pretty large, with 42 offices around the world and 74,000 employees. As a matter of fact, a couple of years ago I worked for a company with over 25 offices across the world, and the instant messenger was our primary contact tool. With the rapid eruption of social networking technologies, near real-time collaboration and cloud platforms, the importance of email is going to diminish. As the Atos CEO said, their employees were getting 200 emails per day on average, of which only 10% were useful, and middle managers were spending 25% of their time searching for information. From my personal experience, this sounds right.

On the other hand, social technologies bring new challenges from the point of view of information management, for example how to treat their content as records, how to deal with its retention, and how to retain the knowledge. The bigger challenge, however, is personal productivity: if everyone is chatting with everyone, nobody has time to do any work. This type of collaboration cannot be a replacement for the ability to store, search, find and use information. So information management is now becoming even more important. Before the big wave hits, destroying efficiency instead of enabling it, workers must know where to find information and have easy access to it, rather than trying to find it by chatting. This is where email has an advantage: with tools like Outlook, search is quite simple and it is easy to associate content with its business context. Governance has a key role to play here. On one of our recent programs we implemented a policy of blocking 50% of the time for focusing on the work that was planned, including collaborating ‘within’ the teams, and devoting the rest of the time to coordination with other teams, planning, meetings, answering emails, administrative work and so on.

Overall, there is no doubt: as our world changes dramatically when it comes to communication and collaboration, our information management strategy and governance need to adjust accordingly.

Three things that annoy me in SharePoint

No doubt about it, SharePoint is a good tool when it comes to document management and collaboration. However, there are a couple of problems that still keep this product from being great. For example, when it comes to implementing taxonomy and search, there are at least three things that require workarounds.

1. Cannot delete custom content types.

Once you have created a content type, that’s it, you are done: you won’t be able to delete it. Sure, there is a link in Site Settings to delete the content type; the only problem is that SharePoint will not allow you to do it. Instead, you are going to get messages that the content type is in use, even if you have ensured that it was unlinked. There are some blog posts showing how to work around this problem, but all of them require running action queries directly against the MS SQL content database. Obviously this can be done, but it is not really feasible for a production environment in most organizations. To avoid this issue, implementation teams need to make sure that the taxonomy is tight on paper, and then test it with a pilot before the production implementation.

2. Drop-off library works only with Document type items.

The drop-off library is a great concept: it allows you to build a set of rules that automatically move documents to their corresponding libraries, based on their content type. Unfortunately this works only for the Document type, or for your own custom types that inherit from Document. So if your customers would like to use it for images or audio files, they will have to move the files manually to their target locations. This could become confusing: for one type they can use the drop-off library, for the others they cannot. So, when planning the implementation, consider this while aligning the end-user processes, and if you still decide to benefit from this functionality, make sure that the change management team gives it enough attention.
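
The routing behaviour described above can be pictured with a short Python sketch. This is only a conceptual model, not SharePoint code: the content type hierarchy, the rule table and the function names are all invented for illustration, and the hierarchy is deliberately arranged to mirror the limitation described in this post rather than SharePoint's actual content type tree.

```python
# Hypothetical content type hierarchy: child -> parent (None marks the root).
PARENT = {
    "Item": None,
    "Document": "Item",
    "Invoice": "Document",
    "Contract": "Document",
    "Image": "Item",  # assumption for illustration: not derived from Document
}

# Hypothetical routing rules: content type -> target library.
RULES = {
    "Invoice": "Finance/Invoices",
    "Contract": "Legal/Contracts",
    "Image": "Media/Images",
}


def inherits_from_document(content_type):
    """Walk up the hierarchy; True if the type is Document or derives from it."""
    current = content_type
    while current is not None:
        if current == "Document":
            return True
        current = PARENT.get(current)
    return False


def route(content_type):
    """Return the target library, or None when the file must be moved manually."""
    if not inherits_from_document(content_type):
        return None
    return RULES.get(content_type)


for content_type in ("Invoice", "Contract", "Image"):
    print(content_type, "->", route(content_type) or "manual move required")
```

Note that the rule for the hypothetical "Image" type is never applied, which is exactly the inconsistency end users would notice.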

3. Lack of native support for indexing of PDF files.

PDF has become the standard when a user wants to make a document portable, lightweight and read-only. Unfortunately, the SharePoint 2010 indexing service does not currently support this file type out of the box. There are a couple of add-ons that can be installed, but they vary in performance, quality and cost. I believe this is such an important feature that it should be part of the out-of-the-box installation.

These are small things, but they make life more difficult; hopefully SharePoint 2012 will address them.

Classification or Search?

A couple of days ago there was an interesting post by Michael Schrage in which he questioned the need for information classification in today’s (mostly electronic) world. I often hear the same opinion from people who rely primarily on MS Outlook for storing and searching their documents. Apart from the fact that it rubs IT administrators and records managers the wrong way, there is some merit in his way of thinking. People usually get what they want: the information can be easily found and is easily accessible.

But why is it like this, and does it apply to all documents? First of all, we live in a world where information governance lies somewhere on a continuum between total ‘anarchy’, where all documents live unorganized in one place, and ‘tyranny’, where every document is classified and tracked from the moment it is created. One end of the spectrum could be considered the home of free-spirited, right-brain people, the other that of left-brain bureaucrats, or ‘Type As’ as Schrage describes them. But reality lies somewhere in between; each of us leans to a smaller or larger degree towards one end of the spectrum or the other. My personal belief is that, to be really productive and creative, we as individuals and as organizations need to balance on the edge between chaos and tyranny. To Schrage’s point, people quite often waste their time classifying information that does not have to be classified. But then why do we classify in the first place? There are a couple of objectives. The first one is the most obvious: to find information easily, and this is what Schrage is referring to.

Not long ago, when documents existed only in physical form, people invented classification to locate and find information. A good example is the Dewey Decimal Classification system used in libraries. First you locate a book based on its class and subject; once you have found it, you use its index to find information within it. Electronic documents pushed the limits of such systems further, giving new capabilities and opportunities for search.

In the case of my personal account with MS Outlook or with Twitter, Schrage is right. The value of classifying my emails for the purpose of search is low. Outlook is pretty good and flexible, allowing me to locate the information I need fairly quickly. But why is it like this? Primarily because MS Outlook captures all the needed metadata describing the context of an email automatically, with me spending no time on it. Sender address, date sent, date received, subject and content are all searchable. Additionally, the email threads functionality makes it easier to dig deeper into messages when needed. This works so well because I am intimately familiar with my emails, and can easily recollect and associate the information with its context. But it is not going to be the same if I inherit a mailbox from someone else. Although search might help with narrowing the results, I will need more to figure out what a message is about and whether it corresponds to what I am looking for. So, as per Schrage’s point, this does work for my personal productivity, but it will not help in an organization where I have to collaborate.
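
To show why automatically captured metadata makes search work without any manual classification, here is a minimal Python sketch. The `Email` class and the `search` function are hypothetical and have nothing to do with how Outlook is actually implemented; they only model the idea that sender, date, subject and content are available for filtering at no cost to the user.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Email:
    sender: str
    sent: date
    subject: str
    body: str


def search(emails, sender=None, sent_after=None, text=None):
    """Filter emails by automatically captured metadata and a free-text term."""
    results = []
    for email in emails:
        if sender and sender.lower() not in email.sender.lower():
            continue
        if sent_after and email.sent <= sent_after:
            continue
        if text and text.lower() not in (email.subject + " " + email.body).lower():
            continue
        results.append(email)
    return results


mailbox = [
    Email("paul@example.com", date(2012, 3, 1), "Budget review", "Please see the attached figures."),
    Email("stephen@example.com", date(2012, 3, 5), "Lunch", "Are you free on Friday?"),
    Email("paul@example.com", date(2012, 1, 10), "Budget draft", "First draft of the budget."),
]

for hit in search(mailbox, sender="paul", sent_after=date(2012, 2, 1), text="budget"):
    print(hit.sent, hit.subject)
```

The filters narrow the results, but deciding whether a hit is the message I need still relies on my own familiarity with the mailbox, which is the part that does not transfer to someone who inherits it.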

So, although I agree that classification is not needed here, and as a matter of fact could even be restrictive, the key to success is the metadata describing the content. In the case of Outlook, as I already mentioned, some of it is captured automatically. In other cases, however, the metadata needs to be added to keep the context with the content. This could be done manually, but that is what most people perceive as a ‘waste’ activity. It could be automatic, and to some degree this is possible, as with MS Office documents. However, there will still be some metadata that only the author can decide, as it corresponds to his or her intentions. Additionally, the metadata itself may need its own classification or hierarchy to be meaningful.

So search and findability are one of the objectives of classification. Another one, especially important for organizations, is records classification. Records should be kept for the periods of time prescribed in retention schedules, usually based on document type classification. So here classification is not going to disappear.

In summary, I agree that the importance of classification will diminish as technology evolves. Automatic classification will definitely help, but it is not there yet. As artificial intelligence tools become more truly ‘intelligent’ and the capability of systems to analyze the content of data increases, the need for manual classification will shrink. But the real purpose behind the scenes will remain: the accuracy and completeness of the metadata. Tools like Google Search or SharePoint 2010 with the FAST search engine are on the right track to narrow the search scope and mine the results. The ability to use enterprise keywords, together with good search analytics, will help with findability. The need for classification will not disappear, but it will become of limited importance to most users.