Vigilante Archivist (part 1)

After reading a few news articles (1,2,3) about shady drug treatment facilities, I've decided to see how to go about archiving, analyzing, and monitoring how these facilities promote their activities. I'll accomplish this with a mixture of Content Manager, PowerShell, Node.js, Tesseract-OCR, and archive.org.

Before building anything new, I did some googling to find out what sort of solutions already exist for archiving websites. I found that archive.org already performs this task. No point in re-inventing the wheel, right?

If I look in Content Manager I can see that a box was created for one of the facilities referenced in the articles...

2018-02-19_8-42-57.png

So I flip over to the web archive and search for the website...

2018-02-19_8-42-02.png

If I click on any given day then I can see what the website looked like on that date...

2018-02-19_8-45-23.png

Sweet... so I don't necessarily have to write my own crawler to index sites. I'll just need a script that checks each website and stores the date it was last indexed by web.archive.org. Eventually this script may do other things, and it will be scheduled to run once daily.

2018-02-15_20-12-26.png

After running the script I can see the changes within the dataset....

2018-02-19_8-47-18.png

I found that 347 of the 1100 facilities have no historical snapshots on the web archive. A few disallow crawlers outright, preventing web.archive.org from indexing them. For instance, the first box in the image above placed a "robots.txt" file at the root of its site.

2018-02-19_8-50-43.png
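This condition can also be detected up front by fetching each site's robots.txt and looking for a blanket disallow rule. Here's a minimal sketch, assuming the stored website URL includes its scheme; the URL below is a placeholder:

# Fetch a site's robots.txt and look for a blanket disallow rule
# Note: a simplistic check; a proper parser would respect user-agent sections
$Website = "http://www.example.com"   # placeholder for a facility website
try {
    $RobotsTxt = Invoke-RestMethod "$Website/robots.txt"
    if ( $RobotsTxt -match '(?m)^\s*Disallow:\s*/\s*$' ) {
        Write-Host "$Website disallows all crawlers"
    }
} catch {
    Write-Host "$Website has no reachable robots.txt"
}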

Since roughly a third of the facilities can't be archived in this manner, I need to explore other options. For now I can start analyzing these sites on a more routine basis. I can also request that web.archive.org index these sites as I detect changes, as sketched below.
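The Wayback Machine exposes a "Save Page Now" endpoint at https://web.archive.org/save/, so requesting a fresh capture can be as simple as a single GET request. A minimal sketch, with no rate limiting or error handling and a placeholder URL:

# Ask the Wayback Machine to capture a fresh snapshot of a page
$Website = "http://www.example.com"   # placeholder for a facility website
Invoke-WebRequest -Uri "https://web.archive.org/save/$Website" | Out-Null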


In the meantime, here's the PowerShell script I used for this task...

Clear-Host
# Import the .NET SDK for Content Manager
Add-Type -Path "d:\Program Files\Hewlett Packard Enterprise\Content Manager\HP.HPTRIM.SDK.dll"
 
# Instantiate a connection to the default dataset
$Database = New-Object HP.HPTRIM.SDK.Database
$Database.Connect()
 
$FacilityRecordTypeName = "Facility"
$FacilityWebsiteFieldName = "Facility Website"
$FacilityLastSnapshotFieldName = "Facility Last Snapshot"
 
# Fetch the custom field definitions by name
$FacilityWebsite = [HP.HPTRIM.SDK.FieldDefinition]$Database.FindTrimObjectByName([HP.HPTRIM.SDK.BaseObjectTypes]::FieldDefinition, $FacilityWebsiteFieldName)
$FacilityLastSnapshot = [HP.HPTRIM.SDK.FieldDefinition]$Database.FindTrimObjectByName([HP.HPTRIM.SDK.BaseObjectTypes]::FieldDefinition, $FacilityLastSnapshotFieldName)
 
# Search for all facility records that have a website
$Facilities = New-Object HP.HPTRIM.SDK.TrimMainObjectSearch -ArgumentList $Database, Record
$Facilities.SearchString = "type:[name:$($FacilityRecordTypeName)] and $($FacilityWebsite.SearchClauseName):*"
 
$RowNumber = 0
foreach ( $Record in $Facilities ) {
    $RowNumber++
    $Facility = [HP.HPTRIM.SDK.Record]$Record
    Write-Progress -Activity "Checking Records" -Status "$($Facility.Number) - $($Facility.Title)" -PercentComplete (($RowNumber/($Facilities.Count))*100)
    
    # Read the website from the record, then query the Wayback Machine availability API
    $Website = $Facility.GetFieldValueAsString($FacilityWebsite, [HP.HPTRIM.SDK.StringDisplayType]::Default, $false)
    $WayBackUrl = "http://archive.org/wayback/available?url=" + $Website
    try {
        $Response = Invoke-RestMethod $WayBackUrl
        if ( $null -ne $Response ) {
            if ( $null -eq $Response.archived_snapshots.closest ) {
                # No snapshot exists, so store an empty date
                $Facility.SetFieldValue($FacilityLastSnapshot, (New-Object HP.HPTRIM.SDK.UserFieldValue(New-Object HP.HPTRIM.SDK.TrimDateTime)))
            } else {
                # Store the timestamp of the most recent snapshot
                $Facility.SetFieldValue($FacilityLastSnapshot, (New-Object HP.HPTRIM.SDK.UserFieldValue($Response.archived_snapshots.closest.timestamp)))
            }
            $Facility.Save()
        }
    } catch {
        Write-Host "    Error: $($_)"
    }
 
}
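For reference, the availability endpoint returns a small JSON document that Invoke-RestMethod deserializes into an object. Roughly, and with illustrative values:

# Query the availability API for a single site
$Response = Invoke-RestMethod "http://archive.org/wayback/available?url=example.com"
# $Response.archived_snapshots.closest exposes properties such as:
#   available : True
#   url       : http://web.archive.org/web/20180101000000/http://example.com/
#   timestamp : 20180101000000
#   status    : 200
$Response.archived_snapshots.closest | Format-List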

Additional/Custom Properties in Elasticsearch

When the content index is initially created, it contains field mappings for many of the stock properties within Content Manager.

Field Mapping for the default ES index (as generated by Enterprise Studio)

When you create a custom property, you indicate whether the field should be present in the content index. The option is checked by default.

2017-12-08_14-05-44.png

Once you click Next and then Finish, the custom property is available within Content Manager. However, it is not yet created in Elasticsearch: it will be added to the index mapping the first time a document has a value in this property. Once a value has been indexed, the field is listed in the index mapping.

2017-12-08_14-32-23.png
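You can confirm this by asking Elasticsearch for the index mapping directly. A quick sketch, assuming Elasticsearch listens on localhost:9200 and using a placeholder index name:

# Retrieve the field mapping for the content index
# "cm_index" is a placeholder; substitute the actual CM index name
Invoke-RestMethod "http://localhost:9200/cm_index/_mapping" | ConvertTo-Json -Depth 10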

As I add additional meta-data fields to each record, more meta-data is added to the Elasticsearch index. Below is an example record after many fields have been added, but where there's no content in the electronic file.

Note that document content can be empty

The document in the example was one I downloaded from NARA and saved into CM. Before releasing it, they scrubbed all meta-data from the file. Since KeyView (the product used to extract text) found no text, nothing could be added to the content field in CM. I corrected this by installing the Tesseract OCR plugins and generating a new OCR rendition.

2017-12-09_23-20-23.png

If I now delete a property from Content Manager entirely, it gives me a few warnings about data but nothing about the content index.

2017-12-09_23-29-34.png

Since I deleted the agency field, I wanted to check what is now contained within the index. When I search on "KISS/SCOW" (without quotes), the value from the previous example, I get an error message explaining that the slash in the value doesn't parse to a valid query.

2017-12-10_14-23-25.png

If I surround the value with quotes then it parses correctly and shows me a result.  

2017-12-10_14-26-51.png
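The slash is a reserved character in the Elasticsearch query string syntax (it delimits regular expressions), which is why quoting the value as a phrase works. The same behavior can be reproduced against the index directly; the host and index name below are placeholders:

# Unquoted, the slash begins a regular expression and the query fails to parse:
#   GET /cm_index/_search?q=KISS/SCOW
# Quoted as a phrase (%22 is an encoded double quote), the query parses:
Invoke-RestMethod "http://localhost:9200/cm_index/_search?q=%22KISS/SCOW%22"
# Escaping the slash with a backslash also parses:
Invoke-RestMethod "http://localhost:9200/cm_index/_search?q=KISS\/SCOW"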

When I check the record via Kibana I can see that the KISS/SCOW string exists in both the document content and the record meta-data...

2017-12-10_14-34-33.png

There are several options within the client with regard to reindexing. The first is an option on the administrative record context menu.

2017-12-11_13-37-53.png

Invoking this action prompts the administrator for the type of reindex to be performed. As you can see below, there is no mention of the document content. Submitting the reindex request results in an event being queued for each of the selected options.

2017-12-11_13-38-39.png

This resulted in no changes to the Elasticsearch index for this record. However, if I use the Administration ribbon to perform a manual re-index of just one record, as shown below, then the value in the custom property is removed.

Custom Property Meta-data Removed from Elasticsearch

A quick peek at the index's mapping shows that the custom property remains known to the index. Without that mapping we would lose sight of the fact that the value remains on other records.

Elasticsearch Head Information Window for CM Index

Now I need to decide what I want to do with the rest of the dataset. There are another ~6600 records that had data in this field. I could reindex the entire lot via the Administration ribbon, like I just did. I could also write a script that adjusts Elasticsearch entirely outside the scope of CM, as sketched below. Lastly, I could table the issue until the next upgrade (which for most organizations happens every 2-4 years).
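If I went the scripting route, Elasticsearch's update-by-query API could strip the orphaned field from every document that still carries it, without involving CM at all. A rough sketch, with placeholder index and field names, that I would only run against a copy of the index first:

# Remove the orphaned field from every document that still contains it
# "cm_index" and "agency" are placeholders for the real index and field names
$Body = @'
{
  "script": { "source": "ctx._source.remove('agency')" },
  "query": { "exists": { "field": "agency" } }
}
'@
Invoke-RestMethod -Uri "http://localhost:9200/cm_index/_update_by_query" -Method Post -Body $Body -ContentType "application/json"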

Routine CSV Export without DataPort

This post is in response to a question on the forum. The question asked how DataPort could be used to routinely export three fields from a record saved search into one CSV file and two fields from a location saved search into a separate CSV file. Here I'll show how to accomplish this with PowerShell instead.

First I wrote out the script...

# Import the .NET SDK for Content Manager
Add-Type -Path "D:\Program Files\Hewlett Packard Enterprise\Content Manager\HP.HPTRIM.SDK.dll"
 
$recordSearchSaveName = "My Saved Records"
$userSavedSearchName = "My Saved Users"
$recordCsv = "c:\temp\records.csv"
$userCsv = "C:\temp\locations.csv"
 
$db = New-Object HP.HPTRIM.SDK.Database
$db.Connect()
 
# Pull three fields from each record in the record saved search
$records = New-Object HP.HPTRIM.SDK.TrimMainObjectSearch -ArgumentList $db, Record
$records.SearchString = "saved:[$($recordSearchSaveName)]"
 
$recordResults = New-Object System.Collections.ArrayList
foreach ( $record in $records ) {
    $rec = [HP.HPTRIM.SDK.Record]$record
    $recordResults.Add([PSCustomObject]@{
        title   = $rec.Title
        number  = $rec.Number
        barcode = $rec.Barcode
    }) | Out-Null
}
 
# Pull two fields from each location in the location saved search
$users = New-Object HP.HPTRIM.SDK.TrimMainObjectSearch -ArgumentList $db, Location
$users.SearchString = "saved:[$($userSavedSearchName)]"
 
$userResults = New-Object System.Collections.ArrayList
foreach ( $user in $users ) {
    $loc = [HP.HPTRIM.SDK.Location]$user
    $userResults.Add([PSCustomObject]@{
        title  = $loc.FullFormattedName
        number = $loc.Barcode
    }) | Out-Null
}
 
$recordResults | Export-Csv -Path $recordCsv -NoTypeInformation
Write-Host "Exported Records to $recordCsv"
$userResults | Export-Csv -Path $userCsv -NoTypeInformation
Write-Host "Exported Locations to $userCsv"

Then I ran it...

2018-02-09_11-02-05.png

And my results matched my expectations...

2018-02-09_11-05-34.png

Now just schedule this PowerShell script to run on whatever schedule you need!
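On a recent Windows machine the ScheduledTasks module makes that a few lines of PowerShell. A minimal sketch, with a placeholder script path and run time:

# Register a task that runs the export script daily at 6:00 AM
# The script path is a placeholder; point it at wherever the export script lives
$Action  = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-NoProfile -File C:\Scripts\ExportSavedSearches.ps1"
$Trigger = New-ScheduledTaskTrigger -Daily -At 6am
Register-ScheduledTask -TaskName "CM CSV Export" -Action $Action -Trigger $Trigger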