Sunday, July 29, 2012

Convert thousands of files from .doc to .html with PowerShell (The Saga!)

Hello! After a long absence I am back, fresh from a 2-day battle against PowerShell and the Office interop assemblies. So, first, what I wanted to do (and actually did :) )...


Convert thousands of Word .doc files to .html for a customer project. The first option, doing it by hand, was out of the question, as there really were thousands of .doc files. So let's go for something smarter: a script.


Then I found this amazing piece of code at Technet:
param([string]$docpath,[string]$htmlpath = $docpath)
$srcfiles = Get-ChildItem $docPath -filter "*.doc"
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatFilteredHTML");
$word = new-object -comobject word.application
$word.Visible = $False
function saveas-filteredhtml
  {
    $opendoc = $word.documents.open($doc.FullName);
    $opendoc.saveas([ref]"$htmlpath\$doc.fullname.html", [ref]$saveFormat);
    $opendoc.close();
  }
ForEach ($doc in $srcfiles)
  {
    Write-Host "Processing :" $doc.FullName
    saveas-filteredhtml
    $doc = $null
  }
$word.quit();


Source: http://blogs.technet.com/b/bshukla/archive/2011/09/27/3347395.aspx


OK, perfect, right? Not yet. After copying and saving as a .ps1 file, the errors started...



Unable to find type [Microsoft.Office.Interop.Word.WdSaveFormat]: make sure that the assembly containing this type is loaded.
At c:\path\Doc2Html.ps1:5 char:73
+ $saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat] <<<< , "wdFormatFilteredHTML");
    + CategoryInfo          : InvalidOperation: (Microsoft.Offic...rd.WdSaveFormat:String) [], RuntimeException
    + FullyQualifiedErrorId : TypeNotFound



Oops... something is wrong here. So, let's troubleshoot.


1. Download and install the Office Interop Assemblies for your Office version (http://msdn.microsoft.com/en-us/library/15s06t57.aspx). For Office 2007, the download link is http://www.microsoft.com/en-us/download/details.aspx?id=18346


2. Very important: run the script from a directory in the PATH. In my case, after installing the interop assembly I still got the same errors. Then I realized that I was trying to run the script from a USB drive. I copied the file to C:\Windows\system32 (that is, %SystemRoot%\system32 on my machine).
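If the type still cannot be found, one alternative worth trying is to load the interop assembly explicitly instead of relying on where the script lives. This is an assumption on my side, not something from the Technet post: the sketch below tries Add-Type and, if that fails, falls back to 10, which is the numeric value of wdFormatFilteredHTML in the WdSaveFormat enumeration.

```powershell
# Sketch: resolve the save format without depending on the script location.
try {
    # Assumes the Word interop assembly is installed and registered in the GAC.
    Add-Type -AssemblyName Microsoft.Office.Interop.Word -ErrorAction Stop
    $saveFormat = [Microsoft.Office.Interop.Word.WdSaveFormat]::wdFormatFilteredHTML
}
catch {
    # Fallback: the numeric value of wdFormatFilteredHTML.
    $saveFormat = 10
}
```

Either way, $saveFormat ends up holding a value that represents filtered HTML. I have not verified that the plain number works with every Word version, so treat the fallback as a starting point.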


So, now it ran, right? Not yet :( More strange errors:


Method invocation failed because [System.__ComObject] doesn't contain a method named 'SaveAS'.


What now? After Googling (or Binging...) I found that this error is related to regional and language settings. My Office is Portuguese (Brazil), but my PC runs Windows 7 Enterprise in English. I found a similar error in this thread, http://depsharee.blogspot.com.br/2011_08_01_archive.html, which linked to a Microsoft page saying that this really is a bug: http://support.microsoft.com/default.aspx?scid=kb;en-us;320369

OK, so now I just add the hint from the blog to my PowerShell code? No. The hint is for C#, so I had to search for the PowerShell version. I finally found it here: http://stackoverflow.com/questions/4105224/how-to-set-culture-in-powershell

$currentThread = [System.Threading.Thread]::CurrentThread
$culture = [System.Globalization.CultureInfo]::InvariantCulture
$currentThread.CurrentCulture = $culture
$currentThread.CurrentUICulture = $culture

I added the lines above to the script, and the conversion finally started: in the first test it converted 50 docs to HTML perfectly.
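The four lines above change the culture of the current thread and leave it changed. If the rest of a script should keep the original regional settings, the change can be wrapped and undone afterwards. Here is a minimal sketch of that idea; Invoke-WithCulture is a hypothetical helper name of mine, not an existing cmdlet.

```powershell
# Hypothetical helper: run a script block under a given culture, then restore the old one.
function Invoke-WithCulture {
    param(
        [System.Globalization.CultureInfo]$Culture,
        [scriptblock]$ScriptBlock
    )
    $thread = [System.Threading.Thread]::CurrentThread
    $oldCulture = $thread.CurrentCulture
    $oldUICulture = $thread.CurrentUICulture
    try {
        $thread.CurrentCulture = $Culture
        $thread.CurrentUICulture = $Culture
        & $ScriptBlock
    }
    finally {
        # Runs even if the script block throws.
        $thread.CurrentCulture = $oldCulture
        $thread.CurrentUICulture = $oldUICulture
    }
}

# Example call: report the culture name seen inside the block.
$invariant = [System.Globalization.CultureInfo]::InvariantCulture
$nameInside = Invoke-WithCulture -Culture $invariant -ScriptBlock {
    [System.Threading.Thread]::CurrentThread.CurrentCulture.Name
}
```

In the conversion script, the Word calls would go inside the script block instead of the one-liner used here.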

Now my final code, with the regional settings:


# Convert .doc to .html
param([string]$docpath,[string]$htmlpath = $docpath)

# Regional settings workaround: run the Word automation under the invariant culture
$currentThread = [System.Threading.Thread]::CurrentThread
$culture = [System.Globalization.CultureInfo]::InvariantCulture
$currentThread.CurrentCulture = $culture
$currentThread.CurrentUICulture = $culture

$srcfiles = Get-ChildItem $docPath -filter "*.doc"
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatFilteredHTML");
$word = new-object -comobject word.application
$word.Visible = $False

# Uses $doc, $word, $htmlpath and $saveFormat from the script scope
function saveas-filteredhtml
  {
    # Build the target path from the file name without its extension
    $name = $doc.basename
    $savepath = "$htmlpath\" + $name + ".html"
    write-host $name
    Write-Host $savepath
    $opendoc = $word.documents.open($doc.FullName);
    $opendoc.saveas([ref]$savepath, [ref]$saveFormat);
    $opendoc.close();
  }

ForEach ($doc in $srcfiles)
  {
    Write-Host "Processing :" $doc.FullName
    saveas-filteredhtml
    $doc = $null
  }
$word.quit();
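One difference from the Technet original is how the save path is built. Inside a double-quoted string PowerShell expands only the variable itself, so a property access such as $doc.fullname is not evaluated unless it is wrapped in $( ). The small sketch below illustrates this with made-up sample values (the file name and folder are hypothetical).

```powershell
# Hypothetical sample values, for illustration only.
$doc = [System.IO.FileInfo]'report.doc'
$htmlpath = 'out'

# Only $doc is expanded here; ".BaseName.html" is appended as literal text.
$wrong = "$htmlpath\$doc.BaseName.html"

# $() evaluates the property access inside the string.
$right = "$htmlpath\$($doc.BaseName).html"

Write-Host $wrong
Write-Host $right
```

Building the path by concatenation, as in my final code, avoids the problem in the same way.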

I hope this piece of code helps someone... please comment!