Recreating the NARA JFK Assassination Archive within Content Manager

This past week the National Archives released numerous records from their JFK Assassination collection.  As I reviewed what they've released I found myself trying to imagine how it might be managed within Content Manager (NARA itself uses OpenText).  What sort of configuration would facilitate a declassification/review/release process?  How would I handle new legislation and its impact on my records retention?  I hope to address these thoughts, and a few more, over the coming posts.

Before digging into those advanced topics, I'm going to try to import the released records into a new instance of Content Manager.  I started by visiting the 2017 release page for the collection.  The page shows the metadata for the records, a link to download each, and a link to download an Excel spreadsheet containing all of the metadata.


I downloaded the spreadsheet and opened it to take a look at the columns.  I should be able to map each field to a property within Content Manager, or decide not to import it at all.  A few of the fields will need lookup sets created to support them.

Spreadsheet fields and my thoughts on mapping each:

  • File Name -- used to build the URL of the electronic record, which will need to be downloaded and then attached to the CM record
  • Record Num -- the NARA record number, which will also be used as the CM record number
  • NARA Release Date -- record date published
  • Formerly Withheld -- create a new custom string property with supporting lookup set
  • Agency -- create a new custom string property with supporting lookup set
  • Doc Date -- record date created
  • Doc Type -- create a new custom string property with supporting lookup set
  • File Num -- this is interesting because values refer to container titles (not container record numbers), so initially I'll map it to a CM custom string property and later determine what to do with containers.
  • To Name -- create a new custom string property
  • From Name -- create a new custom string property
  • Title -- record title
  • Num Pages -- create a new custom number property
  • Originator -- record owner
  • Record Series -- if containers are used then this series would most likely be indicated on that container, so for now I'll import the values into a new custom string property and tackle it in a later post.
  • Review Date -- create a new custom date property
  • Comments -- here I'll create a new custom string property.  Although I could use the existing record notes property, there are behavioral settings at the record type level that affect users' ability to interact with that field, so I don't think it should be relied upon.
  • Pages Released -- create a new custom number property

With the mapping details worked out I can move on to the configuration.  I always step through configuration in the following order: security, locations, schedules, classifications, thesaurus terms (keywords), lookup sets, custom properties, record types and system settings.  The remainder of this post will step through each and at the end I'll have a dataset with all of these records imported.  

Terminology

Within the National Archives some CM concepts are referenced differently.  For instance, Security Levels and Caveats are referred to as Security Classifications and Supplemental Markings, respectively, so I first changed the terminology option on the Compliance tab of the system options to "US Common".


Security

With my terminology now aligned with NARA's, I created the standard security classification rankings, though I'm not going to create any supplemental markings at the moment.  Since I intend to demonstrate a CM declassification process, it makes sense to have a few levels available as soon as possible, even if I don't use them during my initial import.


Locations

The filter option within my spreadsheet application shows that there are only a handful of distinct Originator values, which I can use as a location list.  Each of these locations should have a default security profile with the "Top Secret" security classification, since each could possibly have records at that level.


I have a few options with regard to locations:

  1. Manually create each location, especially since there are fewer than twenty distinct originators in the spreadsheet.
  2. Use a spreadsheet formula/function to export a new worksheet that can then be imported via DataPort's spreadsheet formatter.
  3. Save the spreadsheet as a tab-delimited file and then use PowerShell to create the locations (giving me greater flexibility, but requiring some manual effort).

I decided to go the last route because I can easily re-use what I create.  I downloaded the workbook from NARA, converted it from Excel to a tab-delimited file, and then wrote this PowerShell script to extract the unique originators and create a location for each:

Add-Type -Path "D:\Program Files\Hewlett Packard Enterprise\Content Manager\HP.HPTRIM.SDK.dll"
$db = New-Object HP.HPTRIM.SDK.Database
$db.Connect()
$metadataFile = "F:\Dropbox\CMRamble\JFK\nara_jfk_2017release.tsv"
# pull the unique originators out of the tab-delimited metadata
$owners = Get-Content -Path $metadataFile | ConvertFrom-Csv -Delimiter "`t" | Select-Object "Originator" | Sort-Object -Property "Originator" -Unique
foreach ( $owner in $owners ) 
{
    if ( [String]::IsNullOrWhiteSpace($owner.Originator) -ne $true ) {
        $location = New-Object HP.HPTRIM.SDK.Location -ArgumentList $db
        $location.TypeOfLocation = [HP.HPTRIM.SDK.LocationType]::Organization
        $location.IsWithin = $true
        $location.SortName = $owner.Originator
        $location.Surname = $owner.Originator
        # default security profile so each organization can hold Top Secret records
        $location.SecurityString = "Top Secret"
        $location.Save()
        Write-Host "Created $($location.FullFormattedName)"
    }
}

Schedules, Classifications, Thesaurus Terms

I don't yet know how I'll leverage these features, so for now I don't need to do anything with them.

Lookup Sets

According to my field mapping I need three lookup sets: Formerly Withheld, Agency, and Doc Type.  I could manually create each of these, use DataPort, or use a PowerShell script.  Since I already have a PowerShell script handy that can extract unique values from a column, I elected to go the PowerShell route.

Here is my PowerShell script that creates each of the three lookup sets and adds an item for each unique value:

Add-Type -Path "D:\Program Files\Hewlett Packard Enterprise\Content Manager\HP.HPTRIM.SDK.dll"
$db = New-Object HP.HPTRIM.SDK.Database
$db.Connect()
$metadataFile = "F:\Dropbox\CMRamble\JFK\nara_jfk_2017release.tsv"
$metaData = Get-Content -Path $metadataFile | ConvertFrom-Csv -Delimiter "`t" 
$fields = @("Formerly Withheld", "Agency", "Doc Type")
foreach ( $field in $fields ) 
{
    $fieldItems = $metaData | Select-Object $field | Sort-Object -Property $field -Unique
    # fetch the lookup set by name; the constructor throws if it doesn't exist yet
    $lookupSet = $null
    try { $lookupSet = New-Object HP.HPTRIM.SDK.LookupSet -ArgumentList $db, $field -ErrorAction Stop } catch { }
    if ( $lookupSet -eq $null ) 
    { 
        $lookupSet = New-Object HP.HPTRIM.SDK.LookupSet -ArgumentList $db 
        $lookupSet.Name = $field
        $lookupSet.Save()
        Write-Host " Created '$($field)' set"
    } 
    foreach ( $item in $fieldItems ) 
    {
        $itemValue = $($item.$($field))
        if ( [String]::IsNullOrWhiteSpace($itemValue) ) { continue }
        # same pattern: only add the item if it can't already be fetched by name
        $lookupItem = $null
        try { $lookupItem = New-Object HP.HPTRIM.SDK.LookupItem -ArgumentList $db, $itemValue -ErrorAction Stop } catch { }
        if ( $lookupItem -eq $null ) 
        {
            $lookupItem = New-Object HP.HPTRIM.SDK.LookupItem -ArgumentList $lookupSet 
            $lookupItem.Name = $itemValue
            $lookupItem.Save()
            Write-Host " Added '$($itemValue)' item"
        }
    }
}

After running the script, each of the three lookup sets appears in my dataset, populated with its unique values.

Custom Properties

Next I created each of the required custom properties, as defined in the mapping notes at the top of this post.  As usual, I don't like doing this manually via the client; PowerShell makes it very easy!

Here's my PowerShell script to create a property for every column in the spreadsheet.  Note that I also created custom properties for some of the columns that I've mapped to stock record properties (title, for instance).  I decided to store the original data on the record and then populate the stock properties as needed (which will be done for title, doc date, and number).

Add-Type -Path "D:\Program Files\Hewlett Packard Enterprise\Content Manager\HP.HPTRIM.SDK.dll"
$db = New-Object HP.HPTRIM.SDK.Database
$db.Connect()
 
$lookupFields = @("NARA Formerly Withheld", "NARA Agency", "NARA Doc Type")
foreach ( $lookupField in $lookupFields ) 
{
    $field = New-Object HP.HPTRIM.SDK.FieldDefinition $db
    $field.Name = $lookupField
    # the lookup sets were created without the "NARA " prefix, so strip it here
    $field.LookupSet = New-Object HP.HPTRIM.SDK.LookupSet $db, ($lookupField -replace '^NARA ', '')
    $field.Save()
 
    Write-Host "Created Property for Lookup Set '$($lookupField)'"
}
$stringFields = @("NARA File Num", "NARA To Name", "NARA From Name", "NARA Title", "NARA Originator", "NARA Record Series", "NARA Comments", "NARA Record Num", "NARA File Name")
foreach ( $stringField  in $stringFields ) 
{
    $field = New-Object HP.HPTRIM.SDK.FieldDefinition $db
    $field.Name = $stringField
    $field.Format = [HP.HPTRIM.SDK.UserFieldFormats]::String
    $field.Length = 50
    $field.Save()
 
    Write-Host "Created Property '$($stringField)' as String with length $($field.Length)"
}
$numberFields = @("NARA Num Pages", "NARA Pages Released")
foreach ( $numberField  in $numberFields ) 
{
    $field = New-Object HP.HPTRIM.SDK.FieldDefinition $db
    $field.Name = $numberField
    $field.Format = [HP.HPTRIM.SDK.UserFieldFormats]::Number
    $field.Save()
 
    Write-Host "Created Property '$($numberField)' as Number"
}
 
$dateFields = @("NARA Release Date", "NARA Doc Date", "NARA Review Date")
foreach ( $dateField in $dateFields ) 
{
    $field = New-Object HP.HPTRIM.SDK.FieldDefinition $db
    $field.Name = $dateField 
    $field.Format = [HP.HPTRIM.SDK.UserFieldFormats]::Date
    $field.Save()
 
    Write-Host "Created Property '$($dateField)' as Date"
}

The script's output confirms each property as it's created.

Record Types

My dataset was prepopulated during creation because I elected to have the standard data template loaded (an option on the initialize page of the new dataset creation wizard).  I can update the pre-created Document record type so that it supports the records I want to import.  I'll work with folders/containers in a later post.

I tagged all of the available custom properties on the Document record type.

I also changed the numbering pattern to a string of x's, which allows me to stuff any value into the record number field during import.

System Settings

At this point I'm ready to import the entire JFK archive.  I don't need to worry about any of the system settings, so I'll leave all of the default options in place (except terminology, which I changed earlier).

Import the Archive

In order to use DataPort I would need to manipulate the data source, because I want to store several of the fields twice.  Take the title, for instance: I want to save the Title column into both the record title and a custom property named "NARA Title".  I could duplicate the column within the spreadsheet and then craft a DataPort project that maps the two fields, or I could write a PowerShell script that handles these ETL (Extract, Transform, Load) tasks myself.

Here's the PowerShell script I crafted to import the spreadsheet (saved as a tab-delimited file):

Add-Type -Path "D:\Program Files\Hewlett Packard Enterprise\Content Manager\HP.HPTRIM.SDK.dll"
$db = New-Object HP.HPTRIM.SDK.Database
$db.Connect()
$metadataFile = "F:\Dropbox\CMRamble\JFK\nara_jfk_2017release.tsv"
$metaData = Get-Content -Path $metadataFile | ConvertFrom-Csv -Delimiter "`t"
$doc = New-Object HP.HPTRIM.SDK.RecordType -ArgumentList $db, "Document"
$fields = @("Agency", "Comments", "Doc Date", "Doc Type", "File Name", "File Num", "Formerly Withheld", "From Name", "NARA Release Date", "Num Pages", "Originator", "Record Num", "Record Series", "To Name", "Review Date", "Title")
foreach ( $meta in $metaData ) 
{
    # download the record if needed
    $localFileName = "$PSScriptRoot\$($meta.'File Name')"
    if ( (Test-Path -Path $localFileName) -eq $false ) 
    {
        Write-Host "Downloading $($meta.'File Name')"
        [Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
        $url = "https://www.archives.gov/files/research/jfk/releases/$(($meta.'File Name').ToLower())"
        $wc = New-Object System.Net.WebClient
        $wc.DownloadFile($url, $localFileName)
    }
    # check if the record already exists (makes the script safe to re-run)
    $record = $null
    $existingObject = New-Object HP.HPTRIM.SDK.TrimMainObjectSearch $db, ([HP.HPTRIM.SDK.BaseObjectTypes]::Record)
    $existingObject.SearchString = "number:$($meta.'Record Num')"
    if ( $existingObject.Count -eq 1 ) 
    {
        foreach ( $rec in $existingObject ) {
            $record = [HP.HPTRIM.SDK.Record]($rec)    
            break
        }
    } else {
        $record = New-Object HP.HPTRIM.SDK.Record -ArgumentList $db, $doc
        $record.LongNumber = $meta.'Record Num'
        $record.SetDocument($localFileName)
    }
    # fall back to the file name when the Title column is empty
    $record.TypedTitle = (&{If([String]::IsNullOrWhiteSpace($meta.Title)) { $meta.'File Name' } else { $meta.Title }})
    # attach the document if necessary
    if ( $record.IsElectronic -eq $false ) {
        $record.SetDocument($localFileName)
    }
    # populate custom properties ("NARA Release Date" already carries the prefix)
    foreach ( $field in $fields ) 
    {
        if ( $field -eq "NARA Release Date" ) {
            $record.SetFieldValue((New-Object HP.HPTRIM.SDK.FieldDefinition -ArgumentList $db, $field), (New-Object HP.HPTRIM.SDK.UserFieldValue -ArgumentList $meta.$field))
        } else {
            $record.SetFieldValue((New-Object HP.HPTRIM.SDK.FieldDefinition -ArgumentList $db, "NARA $($field)"), (New-Object HP.HPTRIM.SDK.UserFieldValue -ArgumentList $meta.$field))
        }
    }
    $record.Save()
    Write-Host "Imported Record $($meta.'Record Num'): $($record.TypedTitle)"
}

Sweet!  I executed the script and now I've got the full NARA JFK Archive housed within an instance of Content Manager!
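As a quick sanity check, here's a minimal sketch that counts the Document records now in the dataset, reusing the search pattern from the import script (and assuming the same $db connection):

# count the imported records by searching for everything of the Document type
$search = New-Object HP.HPTRIM.SDK.TrimMainObjectSearch $db, ([HP.HPTRIM.SDK.BaseObjectTypes]::Record)
$search.SearchString = "type:Document"
Write-Host "Document records in dataset: $($search.Count)"

The count should line up with the number of rows in the spreadsheet; if it falls short, re-running the import script picks up the stragglers, since it updates records that already exist rather than creating duplicates.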

My next goal with this archive: set up IDOL and analyze these documents!  Stay tuned.

Adding Geolocation to my JSON Import

My initial import simply mapped the facility ID into the expanded record number property and the name into the title.  I can see from the JSON response that there is a "latlng" property I could also import: two real numbers separated by a comma.  If I map it to the GPS Location field within Content Manager, I get this error message:

Details: Setting property failed. Item Index: 295, Property Caption: 'GPS Location', Value: 27.769625, -82.767725    Exception message: You have not entered a correct geographic location. Try POINT(-122.15437906 37.443134073) or use the map interface to mark a point.

Funny how DataPort is so demanding with regard to import formats, and funnier still that it gives you no capability to transform data during the "port".  I'll need to add some features to my JSON import data formatter: a value prefix, a suffix, and a geolocation converter for each property.

I'll use the prefix on record numbers moving forward (everyone does that).  I'll possibly use suffixes on record numbers, but more likely on record titles (inserting a name, facility, region, etc. into the title).  I'll use a dodgy static mapping for the geolocation converter (whatever gets my current data into the POINT structure), as sketched below.
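Here's a minimal sketch of that static conversion, using a hypothetical helper named Convert-LatLngToPoint: it swaps the comma-separated "lat, lng" pair from the JSON source into the "POINT(lng lat)" string the error message asks for.

function Convert-LatLngToPoint {
    param ( [string]$latlng )
    # split "27.769625, -82.767725" into its two halves and trim whitespace
    $parts = $latlng.Split(',') | ForEach-Object { $_.Trim() }
    if ( $parts.Count -ne 2 ) { return $null }
    # note the order: POINT wants longitude first, then latitude
    return "POINT($($parts[1]) $($parts[0]))"
}

Convert-LatLngToPoint "27.769625, -82.767725"   # yields: POINT(-82.767725 27.769625)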


Now when I import from this JSON source I'll have additional record number protections and an importable geolocation.  Also notice that I'm exposing two properties to Content Manager, name and title, both of which point to the "name1" property of the original source.  Since DataPort only allows you to match one import source property to one Content Manager property, you cannot otherwise re-use a source property; in my example I want to push a copy of the original value into a second field, and having this flexibility in my formatter gives me just what I need.
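If you'd rather pre-process the source than build this into a formatter, the same duplication can be scripted; here's a rough sketch, assuming a hypothetical facilities.json whose objects carry the name1 property:

# duplicate "name1" into a new "title" property so DataPort can map
# each copy to a different Content Manager field
$items = Get-Content -Path "facilities.json" -Raw | ConvertFrom-Json
foreach ( $item in $items ) {
    $item | Add-Member -NotePropertyName "title" -NotePropertyValue $item.name1
}
$items | ConvertTo-Json -Depth 5 | Set-Content -Path "facilities_expanded.json"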

DataPort's Text File Import Byte Order Mark Requirement

Ever since DataPort was released I've heard people grumble about the requirement that text files include a Byte Order Mark.  Eventually I too found it annoying.  Might as well create a new DataPort import formatter that solves the problem.

The easiest solution is to add a Byte Order Mark when there isn't one, which takes just four or five lines of code.  Once that's done, DataPort's configuration needs to be updated to reflect the new option and we're all set.
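If you just need to fix a single file right now, the same idea works as a few lines of PowerShell; a minimal sketch, with a hypothetical file path:

# prepend a UTF-8 BOM (EF BB BF) to a text file that lacks one
$path = "C:\temp\import.txt"
$bytes = [System.IO.File]::ReadAllBytes($path)
$hasBom = $bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF
if ( -not $hasBom ) {
    [System.IO.File]::WriteAllBytes($path, [byte[]](0xEF, 0xBB, 0xBF) + $bytes)
}

The formatter built below does essentially the same thing from inside DataPort.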

Feel free to use my copy, or follow the instructions below to recreate it yourself.


Here's how I accomplished it:

First I decompiled the out-of-the-box DataPort import formatter (ImportDataFormatterTab) using ILSpy and copied the contents to my clipboard.

Next I launched Visual Studio and created a new class library project named "ImportDataFormatterTabAutoBOM".

I removed the default class Visual Studio gave me and added a new class named "ImportDataFormatterTabAutoBOM".  I then replaced the content of the new class file with the content from ILSpy and immediately added "AutoBOM" to the class name.

I removed unnecessary references and imported those that were required.

Two errors have to be fixed before continuing:

First, replace the assert with code that shows the assert message via a Windows Forms message box:

if (this.m_reader == null)
{
    System.Windows.Forms.MessageBox.Show("m_reader == null. StartImport must be called prior to calling GetNextItem");
    return null;
}

Next, replace the Browse method as follows:

public string Browse(Form parentForm, string searchPrefix, Point bottomRight, Dictionary<AdditionalDataKeys, DescriptiveData> additionalData)
{
    string fileName = searchPrefix;
    using ( OpenFileDialog ofd = new OpenFileDialog() )
    {
        if ( ofd.ShowDialog() == DialogResult.OK )
        {
            this.VerifyBOM(fileName = ofd.FileName);
        }
    }
    return fileName;
    //string text = Helper.BrowseForFileOpen(parentForm, searchPrefix, "Text Files|*.txt|All Files|*.*");
    //this.VerifyBOM(text);
    //return text;
}

Finally, add code to the VerifyBOM method that adds a BOM when one isn't detected in the file:

private void VerifyBOM(string fileName)
{
    if (!string.IsNullOrWhiteSpace(fileName))
    {
        if (this.m_fileHasBom.HasValue && (string.IsNullOrWhiteSpace(this.m_fileName) || this.m_fileName == fileName))
        {
            this.m_fileHasBom = new bool?(this.m_fileHasBom.Value);
        }
        else
        {
            if (!HP.HPTRIM.Framework.IO.IOHelper.FileHasBOM(fileName))
            {
                // no byte order mark... let's fix that and try again
                string fileContent = System.IO.File.ReadAllText(fileName);
                using (TextWriter writer = File.CreateText(fileName))
                {
                    writer.Write("\xfeff");
                    writer.Write(fileContent);
                    writer.Close();
                }
            }
            this.m_fileHasBom = new bool?(HP.HPTRIM.Framework.IO.IOHelper.FileHasBOM(fileName));
        }
        if (!this.m_fileHasBom.Value)
        {
            throw new Exception(TrimApplicationBase.GetMessage(MessageIds.dp_err_noBOM, fileName));
        }
    }
}

Compile the solution, open Windows Explorer to the debug output folder, and copy the output file to the clipboard.

Paste the compiled class library into the Content Manager installation directory.

Navigate to the AppData directory that holds the DataPort preferences.

Edit the Import Data Formatters configuration file and duplicate the existing Import Formatter section:

  <DataFormatterDefinition>
    <DisplayName>Tab Delimited</DisplayName>
    <AssemblyName>HP.HPTRIM.DataPort.Common.dll</AssemblyName>
    <ClassName>HP.HPTRIM.DataPort.Framework.DataFormatters.ImportDataFormatterTab</ClassName>
  </DataFormatterDefinition>

Tweak the values to reflect what was just built in the class library:

<ArrayOfDataFormatterDefinition xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <DataFormatterDefinition>
    <DisplayName>Tab Delimited</DisplayName>
    <AssemblyName>HP.HPTRIM.DataPort.Common.dll</AssemblyName>
    <ClassName>HP.HPTRIM.DataPort.Framework.DataFormatters.ImportDataFormatterTab</ClassName>
  </DataFormatterDefinition>
  <DataFormatterDefinition>
    <DisplayName>Tab Delimited (Auto BOM)</DisplayName>
    <AssemblyName>ImportDataFormatterTabAutoBOM.dll</AssemblyName>
    <ClassName>HP.HPTRIM.DataPort.Framework.DataFormatters.ImportDataFormatterTabAutoBOM</ClassName>
  </DataFormatterDefinition>
</ArrayOfDataFormatterDefinition>

Finally, launch DataPort and use the new "Tab Delimited (Auto BOM)" formatter.