Tag Archives: Classification

Majority prefers ‘big data’ on premises rather than in the cloud

According to a recent AIIM survey, ‘big data’ adoption is going to double to 17% during the next 12 months, and this penetration is expected to increase further to about 60% within the next 3 years. The survey confirms an old truth, the need for a holistic view of the data: over 61% of respondents would like to see integrated information coming from both structured and unstructured sources. Classification of unstructured data seems to be an ongoing problem, with over 70% of organizations finding it easier to find information on the web than on their own internal networks. Although search techniques and tools have improved over the years, the adoption of new technologies seems to be pretty slow. Another big factor in this is poor data governance.

With regard to analysis of the data, the requirements don’t seem to be very sophisticated, indicating that organizations still struggle with a strategy for using ‘big data’ effectively. Most respondents would be satisfied simply with basic pattern analysis, keyword correlation, incident prediction and fraud prevention. This seems to be confirmed by the lack of an answer to an important question. When asked about a ‘killer application’ for their business area, over 88% of respondents said that it would make a big difference in their business, but when asked what it would be, the majority declined to answer.

Another interesting fact from the report is that most respondents seem to confuse search with data analytics. Although there is some overlap between the two, the former is about returning results that match selection criteria, while the latter is about processing the data to answer a specific business question.
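To make the distinction concrete, here is a minimal sketch in Python. The sample records, the field names and both function names are made up for illustration and do not come from the AIIM report.

```python
from collections import Counter

# Hypothetical sample data; the fields are illustrative only.
INCIDENTS = [
    {"id": 1, "dept": "finance", "text": "suspected invoice fraud"},
    {"id": 2, "dept": "sales", "text": "late delivery complaint"},
    {"id": 3, "dept": "finance", "text": "duplicate invoice payment"},
]


def search(records, keyword):
    """Search: return the records that match a selection criterion."""
    return [r for r in records if keyword in r["text"]]


def incidents_per_department(records):
    """Analytics: process the records to answer a business question,
    here 'which departments report the most incidents?'."""
    return Counter(r["dept"] for r in records)
```

The first function only hands back matching items, and the reader still has to interpret them. The second one aggregates over all items and returns the answer itself.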

Lastly, some not so good news for cloud vendors: over 88% of respondents would prefer on-premises big data storage and analysis rather than SaaS solutions. This seems to be related to a perception of poor data protection in externally hosted applications (although only 64% of respondents explicitly stated this). The majority consider business insights to be the organization’s intellectual property. Cloud providers will have to work harder to convince the market, as the data security question will continue to be the primary barrier to cloud adoption.

Implementation of Records Management in SharePoint 2010 is not trivial

Records management implementation in SharePoint is not a trivial thing. I wrote about this on a couple of occasions in the past. Earlier this week there was an interesting presentation from ARMA, expanding on some of these topics.

First of all, the SharePoint out-of-the-box implementation will provide only a partial, and rather informal, records solution. Many people consider the Department of Defense DoD 5015.2 records management requirements overkill. This might be true for most non-governmental organizations, although ARMA identified that, of the 168 requirements in DoD 5015.2, at minimum 105 are those that make a system a robust records management application. SharePoint 2010 satisfies 72 of these requirements. That leaves a gap of 33 requirements that need to be addressed. There are two ways of doing this: customizing the SharePoint implementation or getting third-party add-ons to handle the records management. Both solutions have their own pros and cons related to costs, licensing, training and operational support requirements.

Among the issues that need to be addressed are:

– Centralized file plan, linked to a retention schedule. As I wrote earlier, this requires the use of a records center rather than in-place records management.

– Securing, management and maintenance of the file plan by the records managers. This includes securing the top levels of the file plan hierarchy while allowing delegated departmental records clerks to create and maintain the third level of subject and case file folders.

– Proper disposition process. SharePoint OOTB handles automatic deletions, but the disposition process needs to be customized, including records qualification, reviews, approvals, cutoff times, and record state status updates.

– Distinction between subject records and case file records. The significant difference between the two relates to the disposition process above: the entire content of the Document Set in a case file record must be disposed of at the same time, preventing users from destroying the record partially.

– Centralized management of Information Management Policies in SharePoint, due to the required security levels. The Information Management Gallery is not enough, and this also impacts the ability to implement in-place records management, where control of these policies and maintenance of the security quickly becomes impractical.

– Ability to monitor ingestion of records, their classification status, and retention events. This includes bulk uploads and changes to records metadata. Even at the document level this is currently a huge pain in SharePoint.

– Proper metadata, collected and updated along the way, to manage records across their lifecycle. The specific records-related metadata needs to be defined and implemented during the rollout.

– MS Outlook integration, with the ability to declare emails with their attachments as records and to add records-specific metadata.

In either case, customization of SharePoint or integration of third-party add-ons requires a lot of thought, planning, and tough decision making.
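To illustrate two of the issues above, the link between the file plan and the retention schedule and the all-or-nothing disposal of case files, here is a minimal sketch in plain Python. It is not SharePoint code: the categories, retention periods, class names and function names are all assumptions made up for the example, and a real disposition process would add the reviews and approvals mentioned above.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical retention schedule: file plan category -> years kept after cutoff.
RETENTION_YEARS = {"contracts": 7, "correspondence": 2}


@dataclass
class Record:
    """A subject record: a single document with its own cutoff date."""
    name: str
    category: str  # file plan category
    cutoff: date   # date on which the retention clock starts


@dataclass
class CaseFile:
    """A case file: one cutoff date shared by every document inside it."""
    name: str
    category: str
    cutoff: date
    documents: list = field(default_factory=list)


def add_years(d, years):
    """Add whole years to a date, mapping 29 February to 28 February if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def disposal_date(category, cutoff):
    """Look up the retention period for the category and apply it to the cutoff."""
    return add_years(cutoff, RETENTION_YEARS[category])


def is_due(item, today):
    """True once the retention period of a record or case file has elapsed."""
    return today >= disposal_date(item.category, item.cutoff)


def documents_to_dispose(case, today):
    """Return every document in the case file or none, never a partial set."""
    return list(case.documents) if is_due(case, today) else []
```

Because the cutoff and the category live on the case file rather than on the individual documents, there is no way in this model to dispose of only part of a case.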

Three things that annoy me in SharePoint

No doubt about it, SharePoint is a good tool when it comes to document management and collaboration. However, there are a couple of problems that still keep this product from being great. For example, when it comes to implementing taxonomy and search, there are at least three things that require workarounds.

1. Cannot delete custom content types.

Once you have created a content type, that’s it, you are done: you won’t be able to delete it. Sure, there is a link in Site Settings to delete the content type; the only problem is that SharePoint will not allow you to do it. Instead, you are going to get messages that the content type is in use, even if you have ensured that it was unlinked. There are some blog posts showing how to work around this problem, but all of them require running direct action queries on the MS SQL content database. Obviously this can be done, but it is not really feasible for a production environment in most organizations. To avoid this issue, implementation teams need to make sure that the taxonomy is tight on paper, and then test it with a pilot before production implementation.

2. Drop-off library works only with Document type items.

The drop-off library is a great concept, allowing you to build a set of rules that automatically move documents to the corresponding libraries based on their content type. Unfortunately this works only for Document types, or your own custom types inherited from the Document type. So if your customers would like to use it for images or audio files, they will have to move the files manually to their target locations. This could become confusing: for one type they can use the drop-off library, for the others they cannot. So, when planning the implementation, consider this while aligning the end-user processes, and if you still decide to benefit from this functionality, make sure that the change management team gives it enough attention.
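The behaviour described above can be modelled in a few lines of plain Python. This is only a sketch of the idea, not the SharePoint API: the rule table, the library names, the set of Document-derived types and the function name are all hypothetical.

```python
# Hypothetical routing rules: content type -> target library.
ROUTING_RULES = {
    "Invoice": "Finance Documents",
    "Contract": "Legal Documents",
}

# Content types assumed, for this sketch, to inherit from Document.
DOCUMENT_TYPES = {"Document", "Invoice", "Contract"}


def route(content_type):
    """Return the target library for an item dropped into the drop-off library.

    Returns None when no automatic move happens: either the type is not
    derived from Document (for example an image or audio file), or no rule
    is defined for it. In both cases the file has to be moved manually.
    """
    if content_type not in DOCUMENT_TYPES:
        return None
    return ROUTING_RULES.get(content_type)
```

In this model `route("Invoice")` finds a target library while `route("Image")` does not, which is exactly the inconsistency end users will run into.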

3. Lack of native support for indexing of PDF files.

PDF has become the standard when a user wants to make a document portable, lightweight and read-only. Unfortunately the SharePoint 2010 indexing service currently does not support this type of file. There are a couple of add-ons that could be installed, but they vary in performance, quality and cost. I believe that this is such an important feature that it should be part of the out-of-the-box installation.

These are small things, but they make life more difficult. Hopefully SharePoint 2012 will address them.

Classification or Search?

A couple of days ago, there was an interesting post by Michael Schrage in which he questioned the need for information classification in today’s (mostly electronic) world. I often hear the same opinion from people who rely primarily on MS Outlook for storage and search of their documents. Apart from the fact that it rubs IT administrators and records managers the wrong way, there is some merit in his way of thinking. People usually get what they want: the information can be found easily and is easily accessible.

But why is it like this, and is it applicable to all documents? First of all, we live in a world where information governance lies somewhere on a continuum between total ‘anarchy’, where all documents live unorganized in one place, and ‘tyranny’, where every document, from the moment it is created, is classified and tracked. One side of the spectrum could be considered to be for free-spirited, right-brain people, the other for left-brain bureaucrats, or ‘Type As’ as Schrage describes them. But reality lies somewhere in between; each of us leans to a smaller or larger degree toward one end of the spectrum or the other. My personal belief is that, for us as individuals as well as for organizations, to be really productive and creative we need to balance on the edge between chaos and tyranny. To Schrage’s point, people quite often waste their time classifying information that does not have to be classified. But then why do we classify in the first place? There are a couple of objectives. The first one is the most obvious, to find information easily, and this is what Schrage is referring to.

Not long ago, when documents existed only in physical form, people invented classification to locate and find information. A good example is the Dewey Decimal Classification system used in libraries. First you locate a book based on its class and subject; once you have found it, you use the index to find information within it. Electronic documents pushed the limits of such a system further, giving new capabilities and opportunities to search.

In the case of my personal account with MS Outlook or with Twitter, Schrage is right. The value of classifying my emails for the purpose of search is low. Outlook is pretty good and flexible, allowing me to locate the information I need fairly quickly. But why is it like this? Primarily because MS Outlook captures all the needed metadata describing the context of the email automatically, with me spending no time on it. Sender address, date sent, date received, subject, and content are all searchable. Additionally, the email threads functionality makes it easier to dig deeper into messages when needed. This works so well because I am intimately familiar with my emails and can easily recollect the information and associate it with its context. But it will not be the same if I inherit a mailbox from someone else. Although search might help with narrowing the results, I will need more to figure out what a message is about and whether it corresponds to what I am looking for. So, as per Schrage’s point, this does work for my personal productivity, but it will not help in an organization where I have to collaborate.

So, although I agree that classification is not needed here, and as a matter of fact could even be restrictive, the key to success is the metadata describing the content. In the case of Outlook, as I already mentioned, some of it is captured automatically. In other cases, however, the metadata needs to be added to keep the context with the content. This could be manual, but that is what most people perceive as a ‘waste’ activity. It could be automatic, and to some degree this is possible, as with MS Office documents. However, there will still be some metadata that only the author can decide, as it corresponds to his or her intentions. Additionally, the metadata itself could need its own classification or hierarchy to be meaningful.

So search and findability are one of the objectives of classification. Another one, especially important for organizations, is records classification. Records should be kept for the periods of time prescribed in retention schedules, which are usually based on document type classification. So here classification is not going to disappear.

In summary, I agree that the importance of classification will diminish as technology evolves. Automatic classification will definitely help, but it is not there yet. As artificial intelligence tools become more truly ‘intelligent’ and the capability of systems to analyze the content of the data increases, the need for manual classification will be limited. But the real purpose behind the scenes will remain: the accuracy and completeness of the metadata. Tools like Google Search or SharePoint 2010 with the FAST search engine are on the right track to narrow the search scope and mine the results. The ability to use enterprise keywords, together with good search analytics, will help with findability. The need for classification will not disappear, however; it will simply become of limited importance to most users.