Automating the generation of Tesseract OCR text renditions
Although IDOL will index the contents of PDF documents, it does not perform its own OCR of the content (at least the OEM connector for CM does not). In the JFK archives this means I can only search on the stamped annotation on each image. Even if IDOL re-OCR'd documents, I couldn't easily extract the words it finds. I need to do that when researching records, performing a retention analysis, culling keywords for a record hold, or writing scope notes for categorization purposes. In the previous post I created a record add-in that generated a plain text file holding the OCR content from the tesseract engine.
Moving forward, I want to automate these OCR tasks. For instance, any time a new document is attached we should have a new OCR rendition generated. I think it makes sense to take the solution from the previous post and add to it. The event processor plugin I create should call the same logic as the client add-in. If this approach works out, I can then add a ServiceAPI plugin to expose the same functionality in that framework.
So I took the code from the last post and added another C# class library. I added one class that derives from the event processor add-in class. It requires one method to be implemented: ProcessEvent. Within that method I check whether the record is being reindexed, the document has been replaced, the document has been attached, or a rendition has changed. If so, I call the methods from the TextExtractor library used in the previous post.
```csharp
using HP.HPTRIM.SDK;
using System;
using System.IO;
using System.Reflection;

namespace CMRamble.Ocr.EventProcessorAddin
{
    public class Addin : TrimEventProcessorAddIn
    {
        #region Event Processing
        public override void ProcessEvent(Database db, TrimEvent evt)
        {
            Record record = null;
            RecordRendition rendition;
            // only react to events raised against records
            if (evt.ObjectType == BaseObjectTypes.Record)
            {
                switch (evt.EventType)
                {
                    // any of these events means the OCR rendition may be stale
                    case Events.ReindexWords:
                    case Events.DocReplaced:
                    case Events.DocAttached:
                    case Events.DocRenditionRemoved:
                        record = db.FindTrimObjectByUri(BaseObjectTypes.Record, evt.ObjectUri) as Record;
                        RecordController.UpdateOcrRendition(record, AssemblyDirectory);
                        break;
                    case Events.DocRenditionAdded:
                        record = db.FindTrimObjectByUri(BaseObjectTypes.Record, evt.ObjectUri) as Record;
                        var eventRendition = record.ChildRenditions.FindChildByUri(evt.RelatedObjectUri) as RecordRendition;
                        if (eventRendition != null && eventRendition.TypeOfRendition == RenditionType.Original)
                        {
                            // if added an original
                            rendition = eventRendition;
                            RecordController.UpdateOcrRendition(record, rendition, Path.Combine(AssemblyDirectory, "tessdata\\"));
                        }
                        break;
                    default:
                        break;
                }
            }
        }
        #endregion

        // directory that holds this assembly (not the host process's working directory)
        public static string AssemblyDirectory
        {
            get
            {
                string codeBase = Assembly.GetExecutingAssembly().CodeBase;
                UriBuilder uri = new UriBuilder(codeBase);
                string path = Uri.UnescapeDataString(uri.Path);
                return Path.GetDirectoryName(path);
            }
        }
    }
}
```
Note that I created the AssemblyDirectory property so that the tesseract OCR path can be located correctly. Since this is spawned from TRIMEvent.exe the executing directory is the installation path of Content Manager. The tesseract language files are in a different location though. To work around this I pass the AssemblyDirectory property into the TextExtractor.
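To make the difference concrete, here is a minimal standalone sketch that compares where a relative tessdata path resolves against a path built from the assembly's own location. The `TessDataLocator` class and its method names are hypothetical, it assumes the tessdata folder sits next to the compiled assembly, and it uses `Assembly.Location` instead of the `CodeBase` property shown above.

```csharp
using System;
using System.IO;
using System.Reflection;

public static class TessDataLocator
{
    // Directory containing this compiled assembly, regardless of which
    // process loaded it or what that process's working directory is.
    // (Location can be empty for single-file deployments; not handled here.)
    public static string AssemblyDirectory()
    {
        string location = Assembly.GetExecutingAssembly().Location;
        return Path.GetDirectoryName(location);
    }

    // Assumption: the tessdata folder is deployed beside the assembly.
    public static string TessDataDirectory()
    {
        return Path.Combine(AssemblyDirectory(), "tessdata");
    }

    // Where a relative "tessdata" path would resolve instead: under the
    // current working directory of the host process.
    public static string RelativeTessDataDirectory()
    {
        return Path.GetFullPath(Path.Combine(Environment.CurrentDirectory, "tessdata"));
    }
}
```

When the host process is started from another folder, as with TRIMEvent.exe, the two methods would return different directories, which is why the assembly-based path is the one handed to the OCR code.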
I updated the UpdateOcrRendition method in the RecordController class so that it accepts the assembly path. If the assembly path is not passed, I default to the original value, which is a relative path. The record add-in can then be updated to match this approach.
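The fallback could look something like the sketch below. This is not the actual RecordController code; `TessDataPath` and `Resolve` are hypothetical names used only to illustrate an optional parameter that falls back to the relative default.

```csharp
using System;
using System.IO;

public static class TessDataPath
{
    // The relative default that the record add-in relied on originally.
    public const string DefaultRelativePath = @"./tessdata";

    // Hypothetical helper: use the caller's assembly path when one is
    // supplied, otherwise fall back to the relative default.
    public static string Resolve(string assemblyPath = null)
    {
        if (string.IsNullOrWhiteSpace(assemblyPath))
            return DefaultRelativePath;
        return Path.Combine(assemblyPath, "tessdata");
    }
}
```

Existing callers that pass nothing keep their old behavior, while the event processor can pass its assembly directory explicitly.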
Within the TextExtractor class I added a parameter to the required method. I could then pass it directly into the tesseract engine during instantiation.
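Since a wrong path tends to surface only when the engine is constructed, it may be worth checking the folder first. The sketch below is an assumption-laden illustration: `TessDataValidator` is a hypothetical helper, and it assumes the path handed to the engine is the folder that directly contains the `<language>.traineddata` files (for example `eng.traineddata`).

```csharp
using System;
using System.IO;

public static class TessDataValidator
{
    // Tesseract language files are named "<language>.traineddata".
    public static string TrainedDataFile(string tessDataDirectory, string language)
    {
        return Path.Combine(tessDataDirectory, language + ".traineddata");
    }

    // True when the language file is present in the given folder.
    public static bool HasLanguage(string tessDataDirectory, string language)
    {
        return File.Exists(TrainedDataFile(tessDataDirectory, language));
    }
}
```

A caller could run `TessDataValidator.HasLanguage(tessData, "eng")` before creating the engine and skip the record if the file is missing.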
If you expand upon this concept you can see how it's possible to use different languages or trained data. For now I need to go back and add one additional method. In the event processor I reacted when a new rendition was added, but I hadn't implemented the logic behind it. So I need to create a record controller method that works for renditions.
```csharp
public static bool OcrRendition(Record record, RecordRendition sourceRendition, string tessData = @"./tessdata")
{
    bool success = false;
    string extractedFilePath = string.Empty;
    string ocrFilePath = string.Empty;
    try
    {
        // get a temp working location on disk
        var rootDirectory = Path.Combine(Path.GetTempPath(), "cmramble_ocr");
        if (!Directory.Exists(rootDirectory)) Directory.CreateDirectory(rootDirectory);
        // formulate file name to extract, delete if exists for some reason
        extractedFilePath = Path.Combine(rootDirectory, $"{sourceRendition.Uri}.{sourceRendition.Extension}");
        ocrFilePath = Path.Combine(rootDirectory, $"{sourceRendition.Uri}.txt");
        FileHelper.Delete(extractedFilePath);
        FileHelper.Delete(ocrFilePath);
        // fetch document
        var extract = sourceRendition.GetExtractDocument();
        extract.FileName = Path.GetFileName(extractedFilePath);
        extract.DoExtract(Path.GetDirectoryName(extractedFilePath), true, false, "");
        if (!String.IsNullOrWhiteSpace(extract.FileName) && File.Exists(extractedFilePath))
        {
            ocrFilePath = TextExtractor.ExtractFromFile(extractedFilePath, tessData);
            // use record extension method that removes existing OCR rendition (if exists)
            record.AddOcrRendition(ocrFilePath);
            record.Save();
            success = true;
        }
    }
    catch (Exception)
    {
        // errors are swallowed; the caller only sees success == false
    }
    finally
    {
        // always clean up the temp files
        FileHelper.Delete(extractedFilePath);
        FileHelper.Delete(ocrFilePath);
    }
    return success;
}
```
Duplicating code is never a great idea, I know. This is just for fun though, so I'm not going to stress about it. Now I hit compile and then register my event processor add-in.
I then enabled the configuration status and saved/deployed...
Over in the client I removed the OCR rendition by using the custom button on my home ribbon...
When I then monitor the event processor I can see something's been queued!
A few minutes later I've got a new OCR rendition attached.
Progress! The next thing I need to do is train tesseract. Many of these records are typed rather than handwritten. That means I should be able to create a set of trained data that improves the confidence of the OCR text. Additionally, I'd like to be able to compare the text from the original PDF with the tesseract results.