Vigilante Archivist (part 1)
After reading a few news articles (1,2,3) about shady drug treatment facilities, I've decided to see how to go about archiving, analyzing, and monitoring how these facilities promote their activities. I'll accomplish this with a mixture of Content Manager, PowerShell, Node.js, Tesseract-OCR, and archive.org.
Before building anything new, I did some googling to find out what sort of solutions already exist to archive websites. I found archive.org and see that they already perform this task. No point in re-inventing the wheel, right?
If I look in Content Manager I can see that a box was created for one of the facilities referenced in the articles...
So I flip over to the web archive and search for the website...
If I click on any given day then I can see what the website looked like on that date...
Sweet.... so I don't necessarily have to write my own crawler to index sites. I'll just need a script that checks each website and stores the last date it was indexed by web.archive.org. Eventually this script may do other things and will be scheduled to run once daily.
After running the script I can see the changes within the dataset....
I found that 347 of the 1,100 facilities have no historical snapshots on the web archive. A few disallow crawlers outright, preventing web.archive.org from indexing them. For instance, the facility in the first box in the image above has placed a "robots.txt" file at the root of its site.
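To separate "blocked by robots.txt" from "simply never crawled", something like the following could be used. It is only a rough sketch: the function name Test-RobotsBlocksAll is made up for illustration, the agent name "ia_archiver" is an assumption about which user agent the site targets, and the parsing is a simplified reading of robots.txt (it only looks for a blanket "Disallow: /" in a group that applies to all agents or to the named agent).

```powershell
# Hypothetical helper, not part of any SDK. Returns $true when the robots.txt
# text contains "Disallow: /" in a group addressed to "*" or to $UserAgent.
# Simplification: Allow rules and more specific groups are ignored.
function Test-RobotsBlocksAll {
    param(
        [string]$RobotsText,
        [string]$UserAgent = 'ia_archiver'   # assumed agent name
    )
    $applies = $false       # does the current group apply to us?
    $inAgentLines = $false  # are we inside a run of User-agent lines?
    foreach ($line in ($RobotsText -split '\r?\n')) {
        # Drop comments and surrounding whitespace
        $clean = ($line -replace '#.*$', '').Trim()
        if ($clean -eq '') { continue }
        $name, $value = $clean -split ':', 2
        $name = $name.Trim().ToLowerInvariant()
        $value = ([string]$value).Trim()
        if ($name -eq 'user-agent') {
            # A User-agent line after a rule line starts a new group
            if (-not $inAgentLines) { $applies = $false; $inAgentLines = $true }
            if ($value -eq '*' -or $value -eq $UserAgent) { $applies = $true }
        } else {
            $inAgentLines = $false
            if ($applies -and $name -eq 'disallow' -and $value -eq '/') { return $true }
        }
    }
    return $false
}

# Possible usage (network call, so left commented out):
# $robots = (Invoke-WebRequest -Uri "http://example.com/robots.txt" -UseBasicParsing).Content
# Test-RobotsBlocksAll -RobotsText $robots
```

A missing robots.txt would surface as an error from the web request, so a real script would want to treat that case as "not blocked".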
Since nearly a third of the facilities can't be archived in this manner, I need to explore other options. For now I can start analyzing these sites on a more routine basis. I can also request that web.archive.org index these sites as I detect changes.
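Requesting a fresh capture could look roughly like this. The sketch assumes the Wayback Machine's "Save Page Now" address form (https://web.archive.org/save/ followed by the target URL); the function name Get-WaybackSaveUrl is hypothetical, and I haven't checked what limits or response format apply, so treat the commented request as something to verify against archive.org's own documentation.

```powershell
# Hypothetical helper: builds a Save Page Now request URL for a website.
# Assumes the stored website value may or may not include a scheme.
function Get-WaybackSaveUrl {
    param([Parameter(Mandatory)][string]$Website)
    $target = $Website.Trim()
    # Default to http:// when the stored value is a bare host name
    if ($target -notmatch '^https?://') { $target = "http://$target" }
    return "https://web.archive.org/save/$target"
}

# Possible usage (network call, so left commented out):
# Invoke-WebRequest -Uri (Get-WaybackSaveUrl 'example.com') -UseBasicParsing
```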
In the meantime, here's the PowerShell script I used for this task...
Clear-Host

# Import the .NET SDK for Content Manager
Add-Type -Path "d:\Program Files\Hewlett Packard Enterprise\Content Manager\HP.HPTRIM.SDK.dll"

# Instantiate a connection to the default dataset
$Database = New-Object HP.HPTRIM.SDK.Database
$Database.Connect()

$FacilityRecordTypeName = "Facility"
$FacilityWebsiteFieldName = "Facility Website"
$FacilityLastSnapshotFieldName = "Facility Last Snapshot"

# Look up the two additional field definitions by name
$FacilityWebsite = [HP.HPTRIM.SDK.FieldDefinition]$Database.FindTrimObjectByName([HP.HPTRIM.SDK.BaseObjectTypes]::FieldDefinition, $FacilityWebsiteFieldName)
$FacilityLastSnapshot = [HP.HPTRIM.SDK.FieldDefinition]$Database.FindTrimObjectByName([HP.HPTRIM.SDK.BaseObjectTypes]::FieldDefinition, $FacilityLastSnapshotFieldName)

# Find every facility record that has a website recorded
$Facilities = New-Object HP.HPTRIM.SDK.TrimMainObjectSearch -ArgumentList $Database, Record
$Facilities.SearchString = "type:[name:$($FacilityRecordTypeName)] $($FacilityWebsite.SearchClauseName):*"

$RowNumber = 0
foreach ( $Record in $Facilities ) {
    $RowNumber++
    $Facility = [HP.HPTRIM.SDK.Record]$Record
    Write-Progress -Activity "Checking Records" -Status "$($Facility.Number): $($Facility.Title)" -PercentComplete (($RowNumber/($Facilities.Count))*100)

    # Ask the Wayback Machine for the closest snapshot of this website
    $Website = $Facility.GetFieldValueAsString($FacilityWebsite, [HP.HPTRIM.SDK.StringDisplayType]::Default, $false)
    $WayBackUrl = ("http://archive.org/wayback/available?url=" + $Website)
    try {
        $Response = Invoke-RestMethod $WayBackUrl
        if ( $null -ne $Response ) {
            if ( $null -eq $Response.archived_snapshots.closest ) {
                # No snapshot available: store an empty date
                $Facility.SetFieldValue($FacilityLastSnapshot, (New-Object HP.HPTRIM.SDK.UserFieldValue(New-Object HP.HPTRIM.SDK.TrimDateTime)))
                $Facility.Save()
            } else {
                # Store the timestamp of the closest snapshot
                $Facility.SetFieldValue($FacilityLastSnapshot, (New-Object HP.HPTRIM.SDK.UserFieldValue($Response.archived_snapshots.closest.timestamp)))
                $Facility.Save()
            }
        }
    } catch {
        Write-Host " Error: $($_)"
    }
}
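The availability API returns the snapshot time as a 14-digit string such as "20130919044612" (yyyyMMddHHmmss). The script above hands that string straight to the field. If a proper date value is wanted, for comparisons or reporting, a small conversion along these lines might help. The function name ConvertFrom-WaybackTimestamp is hypothetical, and treating the value as UTC is my assumption about how the Wayback Machine records times.

```powershell
# Hypothetical helper: converts a Wayback timestamp (yyyyMMddHHmmss) into a
# [datetime]. Assumes the timestamp is expressed in UTC.
function ConvertFrom-WaybackTimestamp {
    param([Parameter(Mandatory)][string]$Timestamp)
    $styles = [System.Globalization.DateTimeStyles]'AssumeUniversal, AdjustToUniversal'
    return [datetime]::ParseExact(
        $Timestamp,
        'yyyyMMddHHmmss',
        [System.Globalization.CultureInfo]::InvariantCulture,
        $styles)
}

# Possible usage with a response from the availability API:
# $When = ConvertFrom-WaybackTimestamp $Response.archived_snapshots.closest.timestamp
```

ParseExact throws on anything that doesn't match the pattern, so an unexpected value would land in the existing catch block rather than being stored silently.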