Generating Keywords for the JFK Archive

To recap where I'm at in this current series (JFK Archives):

  1. I've imported 6685 records into Content Manager.
  2. I've OCR'd each of the PDFs I downloaded from NARA using Adobe Acrobat Pro.
  3. I've attached the OCR'd PDF as the main document and added two renditions: the original PDF and an OCR plain text file.  
  4. For audio records the original rendition is the recording and the PDF becomes a "transcript" (re-captioned Rendition Type of Other1).

Now I'm exploring my results.  When using the standard thick client I've got to constantly click back and forth between different tabs in the view pane.  That approach works well in many other types of solutions, but for a "reading room" (of sorts) this archive is going to be challenging to work with. 

2017-11-02_5-38-40.gif

I wish the view pane were split horizontally so that I could have both preview & properties.  Notice that when I preview the PDFs I cannot select any of the text that was OCR'd.  I don't truly need to be able to select those words; but, if I could, I would know that the words I could select were searchable.

2017-11-02_5-50-26.png

When I used Adobe Acrobat Pro to OCR these documents I left it with the default "Searchable Image" output option.  I could have selected "Editable Text and Images".  Doing so would have generated searchable PDFs where the text could be selected.

It took almost a full day for one computer to generate the OCR'd PDFs, and it crashed 3 times over that day.  Adobe gobbles up memory until it ultimately crashes, but I could pick up where it left off.  The thought of going through this process again isn't appealing.

If my goal is to be able to see the OCR'd text, I can take a look at the OCR text rendition.  Unfortunately this requires me to open the properties dialog, click onto the rendition tab, find the rendition and select it.  Plus I'm finding the contents less and less appealing with each viewing.

I don't think that is English

The original scans aren't great and the OCR'd text is often illegible.  Why can't I just strip out all of the noise and maybe generate a count of unique words?  I think I'll take a stab at extracting all of the unique words from the OCR'd text, sorting them by most frequently used first, and then saving the results into a new property named "NARA Content Keywords".

Words are ordered from most frequent to least frequent

I think I should exclude all "words" that are just a single character.  I see a few underscores, so I'll need to filter those out as well.  Might as well add a little more spacing between the words while I'm at it.
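A minimal sketch of that counting step, run on a made-up sample string rather than a real OCR rendition. The function name `Get-WordFrequency` and its parameters are hypothetical, and lower-casing every word before counting is my own assumption here, not something the approach requires.

```powershell
# Hypothetical helper: count words in a block of OCR text, most frequent first.
function Get-WordFrequency {
    param(
        [string]$Text,
        [string[]]$NoiseWords = @()
    )
    $counts = @{}
    # [A-Za-z0-9]+ instead of \w+ so underscores never become part of a "word"
    foreach ($m in [regex]::Matches($Text, '[A-Za-z0-9]+')) {
        $word = $m.Value.ToLower()
        if ($word.Length -le 1) { continue }           # skip single characters
        if ($NoiseWords -contains $word) { continue }  # skip noise words
        $counts[$word] = 1 + [int]$counts[$word]       # missing key counts as 0
    }
    # most frequent first; ties are broken alphabetically so the order is stable
    $counts.GetEnumerator() |
        Sort-Object @{ Expression = 'Value'; Descending = $true },
                    @{ Expression = 'Name';  Descending = $false } |
        ForEach-Object { $_.Name }
}

# Invented sample text, only to show the shape of the result
$sample   = 'The agent met the _ source in London. The source a b London London'
$keywords = Get-WordFrequency -Text $sample -NoiseWords @('the', 'in')
$keywords -join '  '   # london  source  agent  met
```

Joining with two spaces gives the slightly wider spacing between words in the keyword field.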

2017-11-02_7-33-19.png

Each of these words is content searchable.  I picked a weird word "otential" (probably should have been "potential") and sure enough it came back in a content search.  Some of these words should be excluded and considered to be noise.  So I filter those out and now have something more interesting to review.

2017-11-02_7-44-12.png

As I ran my PowerShell script to do this on all of my records I noticed another issue: Adobe Acrobat generated a 0-byte OCR text file.  When I opened the output from Adobe there were no words in the "OCR'd PDF" either.  Talk about frustrating!
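A quick way to see how widespread the problem is would be to list every empty OCR text file on disk. This is only a sketch: the folder below is a throwaway demo folder with two invented files, and the `-ocr.txt` suffix is an assumption about how the Acrobat output was named.

```powershell
# Hypothetical demo folder; point $ocrFolder at the real Acrobat output folder instead.
$ocrFolder = Join-Path ([System.IO.Path]::GetTempPath()) 'ocr-check-demo'
New-Item -ItemType Directory -Path $ocrFolder -Force | Out-Null

# Demo data only: one empty OCR file and one with content
New-Item -ItemType File -Path (Join-Path $ocrFolder 'doc1-ocr.txt') -Force | Out-Null
Set-Content -Path (Join-Path $ocrFolder 'doc2-ocr.txt') -Value 'some recognised text'

# Every OCR text file that Acrobat left empty
$emptyOcr = @(Get-ChildItem -Path $ocrFolder -Filter '*-ocr.txt' -File |
    Where-Object { $_.Length -eq 0 })
$emptyOcr | ForEach-Object { $_.Name }   # doc1-ocr.txt
```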

2017-11-02_8-56-18.png

I can still get what I need, though; I just need to be a bit more creative.  Since I'm in PowerShell and working with these records, I can leverage two free, open-source tools: pdftopng and PSImaging.  I can use pdftopng to render each page of a PDF as a PNG image and PSImaging to extract the text from those images.  Then I can organize my words and save them back to the record.  I'll create a second field so that I can compare the differences.

File system activity from the powershell script: extract PDF, extract images, push OCR into txt

If I look at these word lists within Content Manager, I can see a stark difference.  I still need to do some refining on my noise list, but I'm excited to see some useful keywords surfacing from this effort.  One main reason is that PSImaging leverages Tesseract OCR (from Google), which, in my experience, produces better OCR results.

2017-11-02_9-32-33.png

Here's the PowerShell I used to generate this last round of keywords.

Clear-Host
Add-Type -Path "D:\Program Files\Hewlett Packard Enterprise\Content Manager\HP.HPTRIM.SDK.dll"
$db = New-Object HP.HPTRIM.SDK.Database
$db.Connect()
Write-Progress -Activity "Generating OCR Keywords" -Status "Loading" -PercentComplete 0
#prep a temp spot to store extracted PDFs, page images and OCR text files
$tmpFolder = Join-Path ([System.IO.Path]::GetTempPath()) "cmramble"
if ( (Test-Path $tmpFolder) -eq $false ) { New-Item -Path $tmpFolder -ItemType Directory | Out-Null }
#prep word regex and noise word list
$regex = [regex]"(\w+)"
$noiseWords = @("the", "to", "and", "subject", "or", "of", "is", "in", "be", "he", "that", "with", "was", "on", "have", "had", "as", "has", "at", "but", "no", "his", "these", "from", "any", "there")
$records = New-Object HP.HPTRIM.SDK.TrimMainObjectSearch $db, Record
$records.SearchString = "electronic"
$x = 0
foreach ( $result in $records ) 
{
    $x++
    $record = [HP.HPTRIM.SDK.Record]$result
    if ( $record -ne $null ) 
    {
        Write-Progress -Activity "Generating OCR Keywords" -Status "$($record.Number)" -PercentComplete (($x/$records.Count)*100)
        for ( $i = 0; $i -lt $record.ChildRenditions.Count; $i++ ) 
        {
            $rendition = $record.ChildRenditions.getItem($i)
            #find original rendition
            if ( $rendition.TypeOfRendition -eq [HP.HPTRIM.SDK.RenditionType]::Original ) 
            {
                $words = [ordered]@{}
                #extract it as a PDF (the name must match $localFileName below)
                $extract = $rendition.GetExtractDocument()
                $extract.FileName = "$($record.Uri).pdf"
                $extract.DoExtract("$($tmpFolder)", $true, $false, $null)
                $localFileName = "$($tmpFolder)\$($record.Uri).pdf"
                #get a storage spot for the page image(s)
                $pngRoot = "$($tmpFolder)\$($record.Uri)\"
                if ( (Test-Path $pngRoot) -eq $false ) { New-Item -ItemType Directory -Path $pngRoot | Out-Null }
                #render each page as a 300 dpi PNG
                &pdftopng -r 300 "$localFileName" "$pngRoot" 2>&1 | Out-Null
                #generate OCR from each image (Export-ImageText comes from PSImaging)
                $ocrTxt = "$($tmpFolder)\$($record.Uri).txt"
                Get-ChildItem $pngRoot -Filter *.png | ForEach-Object {
                    Export-ImageText $_.FullName | Add-Content $ocrTxt
                    Remove-Item $_.FullName -Force
                }
                #count each word that is longer than one character and not a noise word
                Get-Content $ocrTxt | Where-Object {$_ -match $regex} | ForEach-Object {
                    #named $found because $matches is a PowerShell automatic variable
                    $found = $regex.Matches($_)
                    foreach ( $match in $found ) 
                    {
                        if ( ($match.Value.Length -gt 1) -and ($noiseWords -notcontains $match.Value.ToLower()) ) {
                            if ( $words.Contains($match.Value) ) 
                            {
                                $words[$match.Value]++
                            } else {
                                $words.Add($match.Value, 1)
                            }
                        }
                    }
                }
                #reorder words, most frequent first
                $sorted = $words.GetEnumerator() | Sort-Object Value -Descending
                #generate string of just the words (no counts)
                $wordText = ($sorted | ForEach-Object { $_.Name }) -join '  '
                #stuff into CM
                $record.SetFieldValue($db.FindTrimObjectByName([HP.HPTRIM.SDK.BaseObjectTypes]::FieldDefinition, "NARA OCR Keywords"), (New-Object HP.HPTRIM.SDK.UserFieldValue($wordText)))
                #replace OCR txt; use a separate counter so the outer loop's $i is not clobbered,
                #and walk backwards so deleting an item cannot skip the next one
                for ( $j = $record.ChildRenditions.Count - 1; $j -ge 0; $j-- ) 
                {
                    $existing = $record.ChildRenditions.getItem($j)
                    #remove any OCR
                    if ( $existing.TypeOfRendition -eq [HP.HPTRIM.SDK.RenditionType]::Ocr ) 
                    {
                        $existing.Delete()
                    }
                }
                $record.ChildRenditions.NewRendition($ocrTxt, [HP.HPTRIM.SDK.RenditionType]::Ocr, "OCR") | Out-Null
                $record.Save() | Out-Null
                #clean up temp files
                Remove-Item $ocrTxt -Force
                Remove-Item $pngRoot -Recurse -Force
                Remove-Item -Path $localFileName
            }
        }
    }
}

Audio, Renditions, and Searchable PDFs

I ran the import and noticed a few errors.  There were 17 records in the collection with two file names in the "File Name" field.  I could not find an explanation for this on the NARA website, though the metadata makes it clear that each of these records is a magnetic tape.
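Finding those rows in the tab-separated metadata is straightforward. The sketch below uses invented sample rows; the column names `Record Num` and `File Name` and the `;` separator are taken from the attachment script later in this post, while the values and the trimming of whitespace are assumptions for illustration.

```powershell
# Made-up rows in the same tab-separated shape as the NARA metadata file
$tsv = @(
    "Record Num`tFile Name"
    "100-0001`tdoc-a.pdf"
    "100-0002`ttape-b.pdf;tape-b.wav"
)
$rows = $tsv | ConvertFrom-Csv -Delimiter "`t"

# Rows whose "File Name" value holds more than one file name
$multi = @($rows | Where-Object { $_.'File Name' -like '*;*' })
foreach ($row in $multi) {
    $parts = @($row.'File Name'.Split(';') | ForEach-Object { $_.Trim() })
    '{0}: {1} files ({2})' -f $row.'Record Num', $parts.Count, ($parts -join ', ')
}
```

Against the real file, `Get-Content` on the TSV would replace the `$tsv` array.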

The 17 records with one Audio and one Transcript

Up to this point I had stored just the PDFs within Content Manager, no different than if I had dragged-and-dropped them into CM.  Now I have an audio file and supporting transcript.  I noticed a problem with these transcripts, which helped me realize I have a problem with all of them.  They haven't been OCR'd and I therefore cannot perform an effective analysis of the contents.

Annotations go into IDOL... the rest is useless to IDOL

In the image above I can see the word "release" in the annotation and the word "London" in the image.  If I run document content searches for these two words, users might expect to see at least one result for each word.

936 records have "released" in content

No records found with "London" in content

I did perform a full reindex of IDOL after I imported my records.  The counts looked good and I received no errors.  The size of the index is much smaller than I expected though.  Below you can see the total size of the archive is ~9GB.  You generally hear people say IDOL should be 10%-20% the size of the document store.  Mine is about 2%.

Document store size 

IDOL Size -- smaller than expected
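To put rough numbers on that comparison, here is a small sketch of the arithmetic. The function names are hypothetical, the commented-out folder paths are placeholders rather than real locations, and the sizes in the worked example are the approximate figures quoted above, not exact measurements.

```powershell
# Hypothetical helper: total size in bytes of every file under a folder
function Get-FolderSizeBytes {
    param([Parameter(Mandatory)][string]$Path)
    $total = [long]0
    foreach ($f in Get-ChildItem -Path $Path -Recurse -File) { $total += $f.Length }
    $total
}

# Hypothetical helper: index size as a percentage of the document store size
function Get-IndexRatioPercent {
    param([long]$StoreBytes, [long]$IndexBytes)
    if ($StoreBytes -le 0) { throw 'StoreBytes must be greater than zero' }
    [math]::Round(100 * $IndexBytes / $StoreBytes, 1)
}

# Worked example with the rough figures from this post: ~9 GB store, index at ~2%
$storeBytes = 9GB
$indexBytes = [long]($storeBytes * 0.02)
Get-IndexRatioPercent -StoreBytes $storeBytes -IndexBytes $indexBytes

# The 10%-20% rule of thumb would put the index somewhere in this range
'Rule-of-thumb range: {0:N1} GB to {1:N1} GB' -f ($storeBytes * 0.10 / 1GB), ($storeBytes * 0.20 / 1GB)

# With real folders (placeholder paths):
# Get-IndexRatioPercent -StoreBytes (Get-FolderSizeBytes 'X:\DocumentStore') -IndexBytes (Get-FolderSizeBytes 'X:\IDOL')
```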

Luckily I have Adobe Acrobat Pro and can create a new action to enhance the existing PDF images, OCR the content, save a plain text copy, and save a searchable PDF.  As shown below, doing this one time often makes more sense than implementing something more complicated within Content Manager.

2017-10-30_20-54-45.png

On the disk I can see there are now two new files for each PDF....

There is a new OCR PDF and OCR TXT file for each PDF

I can fix my records in Content Manager with a PowerShell script.  Here I want to iterate over each of the records in the NARA-provided TSV file and attach the files according to my own logic.  If it's an audio recording then that gets saved as an original rendition, with the original PDF saved as a transcript (Rendition Type Other1).  Then the OCR'd PDF becomes the document for each record and the OCR txt file gets attached as an OCR rendition.

#assumes $db (a connected Database), $metadataFile (the NARA TSV) and $docFolder (the downloaded and OCR'd files) are already defined
Write-Progress -Activity "Attaching Documents" -Status "Loading Metadata" -PercentComplete 0
$metaData = Get-Content -Path $metadataFile | ConvertFrom-Csv -Delimiter "`t"
$x = 0
foreach ( $meta in $metaData ) 
{
    Write-Progress -Activity "Attaching Documents" -Status "$($meta.'Record Num')" -PercentComplete (($x/$metaData.Length)*100) 
    $record = New-Object HP.HPTRIM.SDK.Record -ArgumentList $db, $meta.'Record Num'
    $original = $meta.'File Name'
    try {
        if ( $original.Contains(';') ) 
        {
            #two file names: the PDF first, then the audio file
            $pdfFileName = $original.Split(';')[0]
            $audioFileName = $original.Split(';')[1]
            #store WAV as original
            $record.ChildRenditions.NewRendition("$($docFolder)\$($audioFileName)", [HP.HPTRIM.SDK.RenditionType]::Original, "Audio Recording") | Out-Null
            #store the PDF as a transcript (re-captioned Other1)
            $record.ChildRenditions.NewRendition("$($docFolder)\$($pdfFileName)", [HP.HPTRIM.SDK.RenditionType]::Other1, "Transcription") | Out-Null
            #store the OCR'd PDF as main object
            $pdfOcrFileName = [System.IO.Path]::GetFileNameWithoutExtension($pdfFileName) + "-ocr.pdf"
            $record.SetDocument("$($docFolder)\$($pdfOcrFileName)", $false, $false, "")
            #store OCR TXT as rendition (full path, like every other file)
            $ocrTxtFileName = [System.IO.Path]::GetFileNameWithoutExtension($pdfFileName) + "-ocr.txt"
            $record.ChildRenditions.NewRendition("$($docFolder)\$($ocrTxtFileName)", [HP.HPTRIM.SDK.RenditionType]::Ocr, "Adobe Acrobat") | Out-Null
        } else {
            #store OCR'd PDF as main object
            $pdfOcrFileName = [System.IO.Path]::GetFileNameWithoutExtension($original) + "-ocr.pdf"
            $record.SetDocument("$($docFolder)\$($pdfOcrFileName)", $false, $false, "")
            #store file as original
            $record.ChildRenditions.NewRendition("$($docFolder)\$($original)", [HP.HPTRIM.SDK.RenditionType]::Original, "Official Record") | Out-Null
            #store OCR TXT as rendition
            $ocrTxtFileName = [System.IO.Path]::GetFileNameWithoutExtension($original) + "-ocr.txt"
            $record.ChildRenditions.NewRendition("$($docFolder)\$($ocrTxtFileName)", [HP.HPTRIM.SDK.RenditionType]::Ocr, "Adobe Acrobat") | Out-Null
        }
        $record.Save()
    } catch {
        Write-Error $_
    }
    $x++
}
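Before running a script like the one above, it may be worth checking that every file it expects is actually on disk. The sketch below derives the expected names from a "File Name" value using the same `-ocr.pdf` and `-ocr.txt` convention; `Get-ExpectedFileNames` is a hypothetical helper, and trimming whitespace around the names is my assumption.

```powershell
# Hypothetical helper: list the files the attachment logic expects for one record
function Get-ExpectedFileNames {
    param([string]$FileNameField)
    $parts = @($FileNameField.Split(';') | ForEach-Object { $_.Trim() })
    $pdf   = $parts[0]                                          # the PDF is listed first
    $base  = [System.IO.Path]::GetFileNameWithoutExtension($pdf)
    $names = @($pdf, "${base}-ocr.pdf", "${base}-ocr.txt")
    if ($parts.Count -gt 1) { $names += $parts[1] }             # audio file, when present
    $names
}

# Invented example values
Get-ExpectedFileNames 'doc-a.pdf'
Get-ExpectedFileNames 'tape-b.pdf;tape-b.wav'

# Against the real data, report anything missing from $docFolder:
# Get-ExpectedFileNames $meta.'File Name' |
#     Where-Object { -not (Test-Path (Join-Path $docFolder $_)) }
```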

Technically I don't need to store the OCR txt file as a rendition, but IDOL does not give me the ability to "extract" the words it has found within the documents.  I need these words available to me for my document analysis and search requirements.

IDOL will also need some time to finish its own indexing of all these changes.  I know exactly how much space was used by both the document store and IDOL before I made these changes.  It will be interesting to see how much each grows.