Using Tesseract-OCR within the Client

In a previous post I showed how to generate OCR renditions via PowerShell.  The process worked quite well, and the accuracy is higher than other solutions.  After that post I went to upload the PowerShell scripts to GitHub and decided to re-run each script against a new dataset.

As I ran the OCR script I noticed a few things I did not like about it:

  1. The script ran fine for hours and then bombed because the search results went stale
  2. I must remember to run the script after each import of records, or no OCR renditions get generated
  3. I had to create a custom property to track whether an OCR rendition was generated

To overcome these challenges I'll need to write some code.  Time to break out Visual Studio and build a new solution.  So let's dive right in!  


I opened up Microsoft Visual Studio 2017 and created a new solution with two projects: a C# class library for the add-in, and a C# class library for the Ocr functionality.  Here I'm splitting the Ocr functionality into a separate project because in the next post I'll create an event processor plug-in.  To make this work I updated the first project to reference the second and set a build dependency between the two.

Next I implemented the ITrimAddIn interface and organized the interface stubs into logical regions, as shown below.  I also created a folder named MenuLinks and created two new classes within: UpdateOcrRendition and RemoveOcrRendition.  Those classes will expose the menu options to the users within the client.

2017-11-14_8-03-16.png

The two menu link classes are defined as follows:

 
using HP.HPTRIM.SDK;
 
namespace CMRamble.Ocr.ClientAddin.MenuLinks
{
    public class UpdateOcrRendition : TrimMenuLink
    {
        public const int LINK_ID = 8002;
        public override int MenuID => LINK_ID;
        public override string Name => "Update Ocr Rendition";
        public override string Description => "Uses the document content to generate OCR text";
        public override bool SupportsTagged => true;
 
    }
}
 
 
using HP.HPTRIM.SDK;
namespace CMRamble.Ocr.ClientAddin.MenuLinks
{
    public class RemoveOcrRendition : TrimMenuLink
    {
        public const int LINK_ID = 8003;
        public override int MenuID => LINK_ID;
        public override string Name => "Remove Ocr Rendition";
        public override string Description => "Remove any Ocr Renditions";
        public override bool SupportsTagged => true;
    }
}
 

Now in the Add-in class I create a local variable to store the array of MenuLinks, update the Initialise interface stub to instantiate that array, and then force the GetMenuLinks method to return that array....

private TrimMenuLink[] links;
public override void Initialise(Database db)
{
    links = new TrimMenuLink[2] { new MenuLinks.UpdateOcrRendition(), new MenuLinks.RemoveOcrRendition() };
}
public override TrimMenuLink[] GetMenuLinks()
{
    return links;
}

Next up I need to complete the IsMenuItemEnabled method.  I do this by switching on the command link ID passed into the method.  I compare it to the constant value that backs my menu link IDs.  If you look closely at the code below, you'll notice that I'm calling "HasOcrRendition" when the link matches my RemoveOcrRendition link.  There is no such method in the out-of-the-box .Net SDK.  Here I'll be calling a static extension method contained inside the other library.  I'm doing this because I know I'll need that same capability (to know if there is an Ocr rendition) across multiple libraries.  It also makes the code easier to read.

public override bool IsMenuItemEnabled(int cmdId, TrimMainObject forObject)
{
    switch (cmdId)
    {
        case MenuLinks.UpdateOcrRendition.LINK_ID:
            return forObject.TrimType == BaseObjectTypes.Record && ((HP.HPTRIM.SDK.Record)forObject).IsElectronic;
        case MenuLinks.RemoveOcrRendition.LINK_ID:
            return forObject.TrimType == BaseObjectTypes.Record && ((Record)forObject).HasOcrRendition();
        default:
            return false;
    }
}

The last two methods I need to implement within my record add-in are named "ExecuteLink".  Here I'll hand the implementation details off to a static class contained within my second project.  Doing so makes this code easy to understand and even easier to maintain.

public override void ExecuteLink(int cmdId, TrimMainObject forObject, ref bool itemWasChanged)
{
    HP.HPTRIM.SDK.Record record = forObject as HP.HPTRIM.SDK.Record;
    if ((HP.HPTRIM.SDK.Record)record != null)
    {
        switch (cmdId)
        {
            case MenuLinks.UpdateOcrRendition.LINK_ID:
                RecordController.UpdateOcrRendition(record);
                break;
            case MenuLinks.RemoveOcrRendition.LINK_ID:
                RecordController.RemoveOcrRendition(record);
                break;
            default:
                break;
        }
    }
}
public override void ExecuteLink(int cmdId, TrimMainObjectSearch forTaggedObjects)
{
    switch (cmdId)
    {
        case MenuLinks.UpdateOcrRendition.LINK_ID:
            RecordController.UpdateOcrRenditions(forTaggedObjects);
            break;
        case MenuLinks.RemoveOcrRendition.LINK_ID:
            RecordController.RemoveOcrRenditions(forTaggedObjects);
            break;
        default:
            break;
    }
}

Now I need to build the desired functionality within the solution's second project.  To start I'll go ahead and import the tesseract library via the NuGet package manager.  As of this post the latest stable version was 3.0.2.  Note that I also imported the CM .Net SDK and System.Drawing.

2017-11-14_8-21-48.png

Next I downloaded the latest English language data files and placed them into the required tessdata sub-folder.  I also updated the properties of each so that they copy to the output folder if needed.

2017-11-14_8-29-59.png

I decided to implement the remove OCR rendition feature next.  One method will work on a single record and a second method will work on a set of tagged objects (same approach as with the Client Addin).  To make it super simple I'm not presenting any sort of user interface or options.

#region Remove Ocr Rendition
public static bool RemoveOcrRendition(Record record)
{
    return record.RemoveOcrRendition();
}
public static void RemoveOcrRenditions(TrimMainObjectSearch forTaggedObjects)
{
    foreach (var result in forTaggedObjects)
    {
        HP.HPTRIM.SDK.Record record = result as HP.HPTRIM.SDK.Record;
        if ((HP.HPTRIM.SDK.Record)record != null)
        {
            RemoveOcrRendition(record);
        }
    }
} 
#endregion

I again used an extension method, this time naming it "RemoveOcrRendition".  I create a new class named "RecordExtensions", mark it static, and implement the functionality.  I also add one last extension method that handles the creation of a new ocr rendition.  The contents of that class are included below.

using HP.HPTRIM.SDK;
namespace CMRamble.Ocr
{
    public static class RecordExtensions
    {
        public static void AddOcrRendition(this Record record, string fileName)
        {
            if (record.HasOcrRendition()) record.RemoveOcrRendition();
            record.ChildRenditions.NewRendition(fileName, RenditionType.Ocr, "Ocr");
        }
        public static bool RemoveOcrRendition(this Record record)
        {
            bool removed = false;
            for (uint i = 0; i < record.ChildRenditions.Count; i++)
            {
                RecordRendition rendition = record.ChildRenditions.getItem(i) as RecordRendition;
                if ((RecordRendition)rendition != null && rendition.TypeOfRendition == RenditionType.Ocr)
                {
                    rendition.Delete();
                    removed = true;
                }
            }
            record.Save();
            return removed;
        }
        public static bool HasOcrRendition(this Record record)
        {
            for (uint i = 0; i < record.ChildRenditions.Count; i++)
            {
                RecordRendition rendition = record.ChildRenditions.getItem(i) as RecordRendition;
                if ((RecordRendition)rendition != null && rendition.TypeOfRendition == RenditionType.Ocr)
                {
                    return true;
                }
            }
            return false;
        }
    }
}

Now that I have the remove ocr rendition functionality complete I can move on to the update functionality.  In order to OCR the file I must first extract it to disk.  Then I can extract the text by calling the tesseract library and saving the results back as a new ocr rendition.  The code below implements this within the Record Controller class (which is invoked by the addin).

#region Update Ocr Rendition
public static bool UpdateOcrRendition(Record record)
{
    bool success = false;
    string extractedFilePath = string.Empty;
    string ocrFilePath = string.Empty;
    try
    {
        // get a temp working location on disk
        var rootDirectory = Path.Combine(Path.GetTempPath(), "cmramble_ocr");
        if (!Directory.Exists(rootDirectory)) Directory.CreateDirectory(rootDirectory);
        // formulate file name to extract, delete if exists for some reason
        extractedFilePath = Path.Combine(rootDirectory, $"{record.Uri}.{record.Extension}");
        ocrFilePath = Path.Combine(rootDirectory, $"{record.Uri}.txt");
        FileHelper.Delete(extractedFilePath);
        FileHelper.Delete(ocrFilePath);
        // fetch document
        record.GetDocument(extractedFilePath, false, "OCR", string.Empty);
        // get the OCR text
        ocrFilePath = TextExtractor.ExtractFromFile(extractedFilePath);
        // use record extension method that removes existing OCR rendition (if exists)
        record.AddOcrRendition(ocrFilePath);
        record.Save();
        success = true;
    }
    catch (Exception)
    {
        // any failure leaves success == false; the error itself is not logged here
    }
    finally
    {
        FileHelper.Delete(extractedFilePath);
        FileHelper.Delete(ocrFilePath);
    }
    return success;
}
public static void UpdateOcrRenditions(TrimMainObjectSearch forTaggedObjects)
{
    foreach (var result in forTaggedObjects)
    {
        HP.HPTRIM.SDK.Record record = result as HP.HPTRIM.SDK.Record;
        if ((HP.HPTRIM.SDK.Record)record != null)
        {
            UpdateOcrRendition(record);
        }
    }
}
#endregion
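
The FileHelper class used above comes from the CMRamble.Ocr.Util namespace and is not listed in this post.  A minimal sketch of what such a helper could look like follows.  The bool return value, the guard for empty paths, and the decision to swallow IO errors are assumptions made for illustration; the class in the downloadable source may differ.

```csharp
using System;
using System.IO;

namespace CMRamble.Ocr.Util
{
    // Assumed shape of the FileHelper referenced above; the real class is not shown in this post.
    public static class FileHelper
    {
        // Deletes the file if it exists. Returns false only when the delete was attempted and failed.
        public static bool Delete(string path)
        {
            // the controller may pass string.Empty from its finally block, so guard against it
            if (string.IsNullOrWhiteSpace(path)) return true;
            try
            {
                if (File.Exists(path)) File.Delete(path);
                return true;
            }
            catch (IOException)
            {
                return false;   // file is locked or still in use
            }
            catch (UnauthorizedAccessException)
            {
                return false;   // read-only file or missing permission
            }
        }
    }
}
```

The empty-path guard matters because UpdateOcrRendition initialises both paths to string.Empty and always calls Delete in its finally block, and File.Delete throws an ArgumentException when handed an empty string.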

I placed all of the tesseract logic into a new class named TextExtractor.  Within that class I have one method that takes a file name and returns the name of a file containing all of the ocr text.  If I use tesseract on a PDF though it will give me back the text layers from within the PDF, which defeats my goal.  I want tesseract to OCR the images within the PDF. 

To accomplish that I used the Xpdf command line utility pdftopng, which renders each page of the PDF to a PNG image on disk.  I then iterate over each image (just like I did within the original PowerShell script) to generate new OCR content.  As each image is processed the results are appended to an ocr text file.  That text file is what is returned to the record controller.

using CMRamble.Ocr.Util;
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using Tesseract;
namespace CMRamble.Ocr
{
    public static class TextExtractor
    {
        /// <summary>
        /// Exports all images from PDF and then runs OCR over each image, returning the name of the file on disk holding the OCR results
        /// </summary>
        /// <param name="filePath">Source file to be OCR'd</param>
        /// <returns>Name of file containing OCR contents</returns>
        public static string ExtractFromFile(string filePath)
        {
            var ocrFileName = string.Empty;
            var extension = Path.GetExtension(filePath).ToLower();
            if (extension.Equals(".pdf"))
            {   
                // must break out the original images within the PDF and then OCR those
                var localDirectory = Path.Combine(Path.GetDirectoryName(filePath), Path.GetFileNameWithoutExtension(filePath));
                ocrFileName = Path.Combine(Path.GetDirectoryName(filePath), Path.GetFileNameWithoutExtension(filePath) + ".txt");
                FileHelper.Delete(ocrFileName);
                // call xpdf util pdftopng passing PDF and location to place images
                Process p = new Process();
                p.StartInfo.UseShellExecute = false;
                p.StartInfo.RedirectStandardOutput = true;
                p.StartInfo.FileName = "pdftopng";
                p.StartInfo.Arguments = $"\"{filePath}\" \"{localDirectory}\"";
                p.Start();
                string output = p.StandardOutput.ReadToEnd();
                p.WaitForExit();
                // find all the images that were extracted
                var images = Directory.GetFiles(Directory.GetParent(localDirectory).FullName, "*.png").ToList();
                foreach (var image in images)
                {
                    // spin up an OCR engine and have it dump text to the OCR text file
                    using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
                    {
                        using (var img = Pix.LoadFromFile(image))
                        {
                            using (var page = engine.Process(img))
                            {
                                File.AppendAllText(ocrFileName, page.GetText() + Environment.NewLine);
                            }
                        }
                    }
                    // clean-up as we go along
                    File.Delete(image);
                }
            }
            return ocrFileName;
        }
    }
}
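
One thing to watch in the code above: the "*.png" search picks up every PNG in the shared working folder, and Directory.GetFiles does not guarantee the order of the names it returns.  A minimal sketch of a narrower lookup is below.  It assumes pdftopng names its output "<root>-<zero-padded page number>.png" (which is how the Xpdf documentation describes it), and PageImageFinder is a hypothetical helper name, not part of the original solution.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class PageImageFinder
{
    // Returns the page images written for one PNG root, sorted by file name.
    // Assumes the "<root>-<zero-padded page number>.png" naming noted in the lead-in,
    // so that a name sort is also a page-order sort.
    public static List<string> FindPageImages(string pngRoot)
    {
        var directory = Path.GetDirectoryName(pngRoot);
        var prefix = Path.GetFileName(pngRoot);
        return Directory.GetFiles(directory, prefix + "-*.png")
            .OrderBy(path => path, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

In ExtractFromFile this could be called as PageImageFinder.FindPageImages(localDirectory), since localDirectory is the root value handed to pdftopng.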

All done!  I can now compile the add-in and play with it.  First I added the menu links to my home ribbon.  As you can see below, clicking the remove ocr rendition link changes the number of renditions available.

2017-11-14_8-54-24.gif

Along the same lines, if I click update ocr rendition then the number of renditions is increased...

2017-11-14_8-59-56.gif

In the next post I'll incorporate the same functionality within an event processor plugin, so that all records have their content OCR'd via tesseract.  

You can download the full source for this solution here: 

https://github.com/HPECM/Community/tree/master/CMRamble/Ocr

Export Mania 2017 - Record Addin

This is the third of four posts trying to tackle how to achieve the export of a meta-data file along with electronic documents.  We want the electronic documents to have the record number in their file names, instead of the standard (read: oddball) naming conventions of the various features.  In this post I'll show how to create a custom record addin that achieves the requirement.

So let's dive right on in!


I created a new C# Class library, imported the CM .Net SDK (HP.HPTRIM.SDK), and created an Export class that will implement the ITrimAddin interface.

2017-10-16_19-03-38.png

Next I'll use the Quick Action feature of Visual Studio to implement the interface.  It generates all of the required members and methods, but each one throws an exception.  I immediately reorganized what was generated and updated it so that it does not throw exceptions.

Collapsed appearance of the class

I find it helpful to organize the members and methods into regions reflective of the features & functionality.  For this particular add-in I will ignore the "Save and Delete Events" and "Field Customization" regions.  Currently my private members and public properties regions look as shown below.

#region Private members
private string errorMessage;
#endregion
 
#region Public Properties
public override string ErrorMessage => errorMessage;
#endregion

If I expand my Initialization region I see two methods: Initialise and Setup.  Initialise is invoked the first time the add-in is loaded within the client.  Setup is invoked when a new object is added.  For now I don't truly need to do anything in either, but in the future I would use the initialise method to load any information needed for the user (maybe I'd fetch the last extraction path from the registry, a bunch of configuration data from somewhere in CM, etc).

#region Initialization
public override void Initialise(Database db)
{
}
public override void Setup(TrimMainObject newObject)
{
}
#endregion

Next I need to tackle the external link region.  There are two types of methods in this region: ones that deal with the display of menu links and the others that actually perform an action.  My starting code is shown below.  

#region External Link
public override TrimMenuLink[] GetMenuLinks()
{
    return null;
}
public override bool IsMenuItemEnabled(int cmdId, TrimMainObject forObject)
{
    return false;
}
public override void ExecuteLink(int cmdId, TrimMainObject forObject, ref bool itemWasChanged)
{
 
}
public override void ExecuteLink(int cmdId, TrimMainObjectSearch forTaggedObjects)
{
}
#endregion

First I'll tackle the menu links.  The TrimMenuLink class, shown below, is marked as abstract.  This means I need to create my own concrete class deriving from it. 

2017-10-16_18-20-50.png

Note that the constructor is marked protected.  Thankfully, because of that, I can eventually do some creative things with MenuLink (maybe another blog post someday).  For now I'll just add a class to my project named "ExportRecordMenuLink".  I apply the same process to it once it's generated, giving me the results below.

using HP.HPTRIM.SDK;
 
namespace CMRamble.Addin.Record.Export
{
    public class ExportRecordMenuLink : TrimMenuLink
    {
        public override int MenuID => 8001;
 
        public override string Name => "Export Record";
 
        public override string Description => "Exports records to disk using Record Number as file name";
 
        public override bool SupportsTagged => true;
    }
}

Now that I've got a Menu Link for my add-in, I go back to my main class and make a few adjustments.  First I might as well create a private member to store an array of menu links.  Then I go into the Initialise method and assign it a new array (one that contains my new menu link).   Last, I have the GetMenuLinks method return that array.  

private TrimMenuLink[] links;
 
public override void Initialise(Database db)
{
    links = new TrimMenuLink[1] { new ExportRecordMenuLink() };
}
public override TrimMenuLink[] GetMenuLinks()
{
    return links;
}

The IsMenuItemEnabled method will be invoked each time a record is "selected" within the client.  For my scenario I want to evaluate if the object is a record and if it has an electronic document attached.  Though I also need to ensure the command ID matches the one I've created in the ExportRecordMenuLink.

public override bool IsMenuItemEnabled(int cmdId, TrimMainObject forObject)
{
    return (links[0].MenuID == cmdId && forObject.TrimType == BaseObjectTypes.Record && ((HP.HPTRIM.SDK.Record)forObject).IsElectronic);
}

Almost done!  There are two methods left to implement, both of which are named "ExecuteLink".  The first deals with the invocation of the add-in on one object.  The second deals with the invocation of the add-in with a collection of objects.  I'm not going to waste time doing fancy class design and appropriate refactoring.... so pardon my code.  

public override void ExecuteLink(int cmdId, TrimMainObject forObject, ref bool itemWasChanged)
{
    HP.HPTRIM.SDK.Record record = forObject as HP.HPTRIM.SDK.Record;
    if ( (HP.HPTRIM.SDK.Record)record != null && links[0].MenuID == cmdId )
    {
        FolderBrowserDialog directorySelector = new FolderBrowserDialog() { Description = "Select a directory to place the electronic documents", ShowNewFolderButton = true };
        if (directorySelector.ShowDialog() == DialogResult.OK)
        {
            string outputPath = Path.Combine(directorySelector.SelectedPath, $"{record.Number}.{record.Extension}");
            record.GetDocument(outputPath, false, string.Empty, string.Empty);
        }
    }
}

In the code above I prompt the user for the destination directory (where the files should be placed).  Then I formulate the output path and extract the document.  I should be removing any invalid characters from the record number (slashes are acceptable in the number but not on the disk), but again you can do that in your own implementation.
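
For the invalid-character clean-up mentioned above, a minimal sketch could look like the following.  FileNameSanitizer and ToSafeFileName are hypothetical names introduced for illustration, and the underscore replacement is just one possible choice.

```csharp
using System.IO;
using System.Linq;

public static class FileNameSanitizer
{
    // Replaces every character the file system rejects in a file name with the substitute character.
    public static string ToSafeFileName(string recordNumber, char replacement = '_')
    {
        var invalid = Path.GetInvalidFileNameChars();
        return new string(recordNumber
            .Select(c => invalid.Contains(c) ? replacement : c)
            .ToArray());
    }
}
```

The output path would then be built from FileNameSanitizer.ToSafeFileName(record.Number) instead of record.Number.  Keep in mind that two different record numbers (say 2017/1 and 2017_1) would map to the same file name under this scheme, so a real implementation may want to detect collisions.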

I repeat the process for the next method and end up with the code shown below.

public override void ExecuteLink(int cmdId, TrimMainObjectSearch forTaggedObjects)
{
    if ( links[0].MenuID == cmdId )
    {
        FolderBrowserDialog directorySelector = new FolderBrowserDialog() { Description = "Select a directory to place the electronic documents", ShowNewFolderButton = true };
        if (directorySelector.ShowDialog() == DialogResult.OK)
        {
            foreach (var taggedObject in forTaggedObjects)
            {
                HP.HPTRIM.SDK.Record record = taggedObject as HP.HPTRIM.SDK.Record;
                if ((HP.HPTRIM.SDK.Record)record != null)
                {
                    string outputPath = Path.Combine(directorySelector.SelectedPath, $"{record.Number}.{record.Extension}");
                    record.GetDocument(outputPath, false, string.Empty, string.Empty);
                }
            }
        }
    }
}

All done!  Now, again, I left out the meta-data file & user interface for now.  If anyone is interested then add a comment and I'll create another post.  For now I'll compile this and add it to my instance of Content Manager.  

2017-10-16_18-52-41.png

Here's what it looks like so far for the end-user.

2017-10-16_18-53-56.png

A wise admin would encourage users to place this on a ribbon custom group, as shown below.

2017-10-16_18-56-04.png

When I execute the add-in on one record I get prompted for where to place it...

2017-10-16_18-57-08.png

Success!  It gave me my electronic document with the correct file name.

2017-10-16_18-58-41.png

Now if I try it after tagging all the records, the exact same pop-up appears and all my documents are extracted properly.

2017-10-16_19-00-20.png

Hopefully with this post I've shown how easy it is to create custom add-ins.  These add-ins don't necessarily need to be deployed to everyone, but oftentimes that is the case.  That's the main reason people shy away from them.  But deploying these is nowhere near as complicated as most make it seem.

You can download the full source here.

Making email management as painless as possible

In a previous post I showed how a check-in style leads to newly cataloged emails kicking-off a workflow.  The check-in style resulted in a new label in gmail that I tied to a filter.  Items surfaced via the filter got attached to my gmail label, which in-turn created a record that initiated a workflow.  It's a one-time setup that works well for handling my online form submission.

I get other emails though.  Emails related to projects, for instance.  Some projects may last for 2 days, whilst others may last for 2 months. When I'm on the road I don't want to have to stop and set up a new check-in style for a new project.  In fact, I don't really want to have to set up a new check-in style at all.  Why am I adding/removing check-in styles?  Surely there's got to be a better way.

What I want to have happen


I want my check-in styles to be managed automatically.  Specifically I want:

  1. A check-in style to exist if there are any incomplete workflows where I'm assigned to an activity and the initiating record is a container.
  2. A check-in style to be removed if the workflow referenced above is completed.
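
Those two rules boil down to one question per person and container: is there still an incomplete workflow with an activity assigned to that person?  A minimal sketch of that decision is below.  It uses a hypothetical stand-in class (AssignedActivity) rather than the SDK's Activity and Workflow types, purely to show the logic.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in holding only the facts the rules need; not an SDK type.
public class AssignedActivity
{
    public string Assignee { get; set; }
    public string ContainerNumber { get; set; }   // number of the initiating record, when it is a container
    public bool WorkflowComplete { get; set; }
}

public static class CheckInStyleRules
{
    // True while at least one incomplete workflow on this container has an activity assigned to this person.
    // When this flips to false, the check-in style should be removed.
    public static bool ShouldHaveStyle(IEnumerable<AssignedActivity> activities, string assignee, string containerNumber)
    {
        return activities.Any(a =>
            a.Assignee == assignee &&
            a.ContainerNumber == containerNumber &&
            !a.WorkflowComplete);
    }
}
```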

 

 

Time to break out Visual Studio and write some code! 

If you're not a techie you can scroll to the very end to see the results.


First things first, I create a new class library project.  I referenced the .Net SDK and imported the log4net library via the NuGet package manager.  That gives me a solution that looks like this.

2017-10-08_22-52-20.png

I specify that my Addin implements the Trim Event Processor AddIn interface, which requires that I create a method named ProcessEvent.  In that method I test that the event I react to is one of the relevant events from an activity.  If it is, I call the ManageCheckInStyle method I create next.  Otherwise I exit out of my addin.  

namespace CMRamble.EventProcessor.WorkflowCheckInStyles
{
    public class Addin : TrimEventProcessorAddIn
    {
        private static readonly ILog log = LogManager.GetLogger(typeof(Addin));
        public override void ProcessEvent(Database db, TrimEvent evt)
        {
            XmlConfigurator.Configure();
            switch (evt.EventType)
            {
                case Events.ActivityAssigned:
                case Events.ActivityReassigned:
                case Events.ActivityCompleted:
                case Events.ActivityUndone:
                case Events.ActivityCurrent:
                case Events.ActivitySkipped:
                case Events.ActivityNeedsAuthorization:
                    ManageCheckInStyle(db, evt);
                    break;
                default:
                    break;
            }
        }
 
    }
}

Now I create the skeleton of my ManageCheckInStyle method, as shown below.

private void ManageCheckInStyle(Database db, TrimEvent evt)
{
    try
    {
        Activity activity = new Activity(db, evt.ObjectUri);
        if ( activity != null && activity.AssignedTo != null )
        {
            log.Debug($"Activity Uri {evt.ObjectUri}");
            Workflow workflow = activity.Workflow;
            if ( workflow != null && (workflow.Initiator != null && workflow.Initiator.RecordType.UsualBehaviour == RecordBehaviour.Folder) )
            {
                if ( workflow.IsComplete )
                {
                    log.Debug($"Workflow Uri {workflow.Uri} Is Completed");
                }
                else
                {
                    log.Debug($"Workflow Uri {workflow.Uri} not completed");
                }
            }
        }
    }
    catch ( TrimException ex )
    {
        log.Error($"Exception: {ex.Message}", ex);
    }
    finally
    {
    }
}

Then I implemented each of the requirements I laid out at the top of the post (something is assigned, and workflow is completed).  First I code the logic for when something is assigned to me.

// ensure that there is a check-in style for this container
TrimMainObjectSearch styleSearch = new TrimMainObjectSearch(db, BaseObjectTypes.CheckinStyle)
{
    SearchString = $"owner:{activity.AssignedTo.Uri} container:{workflow.Initiator.Number}"
};
if ( styleSearch.Count == 0 )
{
    log.Debug($"Creating new check-in style");
    CheckinStyle style = new CheckinStyle(workflow.Initiator);
    style.RecordType = new RecordType(db, "Document");
    style.Name = workflow.Initiator.Title;
    style.StyleOwner = activity.AssignedTo;
    style.MoveToDeletedItems = false;
    style.Save();
} else
{
    log.Info("Check-in style already exists");
}

Last,  I create the logic for when the workflow is completed.

// when no other assigned activities for this container 
TrimMainObjectSearch activitySearch = new TrimMainObjectSearch(db, BaseObjectTypes.Activity)
{
    SearchString = $"workflow:[initiator:{workflow.Initiator.Number}] assignee:{activity.AssignedTo.Uri} not done"
};
if (activitySearch.Count == 0 )
{   // there are no other assigned activities
    TrimMainObjectSearch styleSearch = new TrimMainObjectSearch(db, BaseObjectTypes.CheckinStyle)
    {
        SearchString = $"owner:{activity.AssignedTo.Uri} container:{workflow.Initiator.Number}"
    };
    foreach (CheckinStyle style in styleSearch)
    {
        style.Delete();
    }
}

That's it.  I hit compile and I have a valid add-in.  I go and register it in the Enterprise Studio, as shown below.

2017-10-08_23-01-02.png

To test I go find a project folder and attach a workflow (assigning it to myself).  I then flip on over to my email and see the title of that project folder is now a label in my email (or a folder if using Outlook).

2017-10-08_23-15-40.png

If I complete the workflow the check-in style is automatically removed.  Sweet!  Now I can focus on my real work, and not the constant maintenance of my linkage between email & content manager.