I needed to convert thousands of Word .doc files to .html for a customer project. Doing it by hand was out of the question, since there really were thousands of .doc files, so I went looking for a way to script it.
Then I found this handy piece of code on TechNet:
param([string]$docpath,[string]$htmlpath = $docpath)
$srcfiles = Get-ChildItem $docPath -filter "*.doc"
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatFilteredHTML");
$word = new-object -comobject word.application
$word.Visible = $False
function saveas-filteredhtml
{
    $opendoc = $word.documents.open($doc.FullName);
    $opendoc.saveas([ref]"$htmlpath\$doc.fullname.html", [ref]$saveFormat);
    $opendoc.close();
}
ForEach ($doc in $srcfiles)
{
    Write-Host "Processing :" $doc.FullName
    saveas-filteredhtml
    $doc = $null
}
$word.quit();
Source: http://blogs.technet.com/b/bshukla/archive/2011/09/27/3347395.aspx
OK, perfect, right? Not yet. After I copied it and saved it as a .ps1 file, the errors started:
Unable to find type [Microsoft.Office.Interop.Word.WdSaveFormat]: make sure that the assembly containing this type is loaded.
At c:\path\Doc2Html.ps1:5 char:73
+ $saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat] <<<< , "wdFormatFilteredHTML");
+ CategoryInfo : InvalidOperation: (Microsoft.Offic...rd.WdSaveFormat:String) [], RuntimeException
+ FullyQualifiedErrorId : TypeNotFound
Oops... something is wrong here. So, let's troubleshoot.
1. Download and install the Office Interop Assemblies for your Office version (http://msdn.microsoft.com/en-us/library/15s06t57.aspx). For Office 2007, the download link is here: http://www.microsoft.com/en-us/download/details.aspx?id=18346
2. Very important: run the script from a directory in the PATH. After I installed the Interop Assemblies I still got the same errors. Then I realized that I was trying to run the script from a USB drive. I copied the file to %SystemRoot%\system32 (C:\Windows\system32 in my case).
So now it ran, right? Not yet :( Another strange error:
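As a quick check after installing the assemblies, it may help to load the Word interop assembly explicitly before the type is used. This is only a sketch, assuming the assembly is registered under the name Microsoft.Office.Interop.Word; the fallback is my own addition and relies on 10 being the numeric value of wdFormatFilteredHTML in the WdSaveFormat enumeration.

```powershell
# Try to load the Word interop assembly explicitly.
# Assumption: it is registered under this name after installing the interop assemblies.
try {
    Add-Type -AssemblyName Microsoft.Office.Interop.Word -ErrorAction Stop
    $saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatFilteredHTML")
}
catch {
    # Fallback (assumption, not part of the original script):
    # 10 is the numeric value of wdFormatFilteredHTML.
    Write-Host "Interop assembly not available, using the numeric save format instead"
    $saveFormat = 10
}
```

If the catch branch runs, the "Unable to find type" error is avoided, but whether SaveAs accepts the plain number on your machine is something to test with a single document first.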
Method invocation failed because [System.__ComObject] doesn't contain a method named 'SaveAS'.
What now? After Googling (or Binging...) I found that this error is related to regional and language settings. My Office is Portuguese (Brazil), but my PC runs Windows 7 Enterprise in English. I found a similar error in this thread, http://depsharee.blogspot.com.br/2011_08_01_archive.html, which links to a Microsoft page confirming that this really is a bug: http://support.microsoft.com/default.aspx?scid=kb;en-us;320369
OK, so now I just add the hint from the blog to my PowerShell code? No. The hint is for C#, so I had to search for the PowerShell version. I finally found it here: http://stackoverflow.com/questions/4105224/how-to-set-culture-in-powershell
$currentThread = [System.Threading.Thread]::CurrentThread
$culture = [System.Globalization.CultureInfo]::InvariantCulture
$currentThread.CurrentCulture = $culture
$currentThread.CurrentUICulture = $culture
I added the lines above to the script and the conversion finally started. In the first test it converted 50 docs to HTML perfectly.
Now my final code, with the regional settings:
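The culture lines change the settings of the current thread for the rest of the run. If you prefer to put the original settings back afterwards, something like the following might work. It is a sketch only: Invoke-WithCulture is a hypothetical helper name I made up, not a built-in cmdlet and not part of the Stack Overflow answer.

```powershell
# Hypothetical helper (not in the original script): run a script block under a
# given culture and restore the previous thread settings afterwards.
function Invoke-WithCulture {
    param(
        [System.Globalization.CultureInfo]$Culture,
        [scriptblock]$Script
    )
    $thread = [System.Threading.Thread]::CurrentThread
    $oldCulture = $thread.CurrentCulture
    $oldUICulture = $thread.CurrentUICulture
    try {
        $thread.CurrentCulture = $Culture
        $thread.CurrentUICulture = $Culture
        & $Script
    }
    finally {
        # Runs even if the script block throws
        $thread.CurrentCulture = $oldCulture
        $thread.CurrentUICulture = $oldUICulture
    }
}

# Usage: read the culture name that is active inside the block
$seen = Invoke-WithCulture ([System.Globalization.CultureInfo]::InvariantCulture) {
    [System.Threading.Thread]::CurrentThread.CurrentCulture.Name
}
```

In the conversion script, the script block would contain the loop that calls Word instead of the one-line example used here.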
# Convert .doc to .html
param([string]$docpath,[string]$htmlpath = $docpath)

# Regional settings workaround: run the Word calls under the invariant culture
$currentThread = [System.Threading.Thread]::CurrentThread
$culture = [System.Globalization.CultureInfo]::InvariantCulture
$currentThread.CurrentCulture = $culture
$currentThread.CurrentUICulture = $culture

$srcfiles = Get-ChildItem $docPath -filter "*.doc"
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatFilteredHTML");
$word = new-object -comobject word.application
$word.Visible = $False

# Uses $doc from the calling loop
function saveas-filteredhtml
{
    $name = $doc.basename
    $savepath = "$htmlpath\" + $name + ".html"
    write-host $name
    Write-Host $savepath
    $opendoc = $word.documents.open($doc.FullName);
    $opendoc.saveas([ref]$savepath, [ref]$saveFormat);
    $opendoc.close();
}

ForEach ($doc in $srcfiles)
{
    Write-Host "Processing :" $doc.FullName
    saveas-filteredhtml
    $doc = $null
}
$word.quit();
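One more difference from the TechNet version: the final script builds the save path from $doc.basename instead of writing $doc.fullname directly inside the quoted string. Inside double quotes PowerShell expands only the variable, not the property that follows it. A small sketch with made-up sample values (the object below is just a stand-in for a file returned by Get-ChildItem):

```powershell
# Stand-in for a file object; the property values are hypothetical samples
$doc = New-Object PSObject -Property @{ BaseName = 'report'; FullName = 'C:\docs\report.doc' }
$htmlpath = 'C:\out'

# Only $doc is expanded here; ".BaseName.html" is appended as literal text
$wrong = "$htmlpath\$doc.BaseName.html"

# A subexpression $(...) evaluates the property access inside the string
$right = "$htmlpath\$($doc.BaseName).html"

Write-Host $wrong
Write-Host $right
```

Building the path by concatenation, as the final script does, avoids the problem in the same way as the $(...) form.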
I hope this piece of code helps someone... please comment!