Extracting text from a Word document

Although I'm writing my mashup book in Microsoft Word, I'd like to publish it in a variety of formats, including HTML, several flavors of XML, PDF, and wiki markup. There are several ways to extract content from my Word documents: Word macros, external scripts that use the COM interface, or saving the Word 2003 documents as Word XML. I'm partial to using Python for some simple text extraction as a first step:

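
import win32com.client
# Start Word through its COM automation interface (requires pywin32 and Word)
wd = win32com.client.Dispatch("Word.Application")
# A raw string needs only single backslashes in the Windows path
doc = wd.Documents.Open(r'D:\Document\PersonalInfoRemixBook\858Xch05__.doc')
# Content.Text is the plain text of the whole document body
print(doc.Content.Text)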
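
The raw text that comes back from Content.Text is one long string, so a natural next step toward HTML or wiki markup is to break it into paragraphs. Here is a minimal sketch of that, not a finished converter. The function names `split_word_text` and `extract_paragraphs` are my own inventions for illustration. It relies on Word ending each paragraph with a carriage return, marking manual line breaks with a vertical tab, and marking the end of table cells with character 7. It also assumes the script owns the Word instance it starts, so closing it afterwards is safe.

```python
def split_word_text(raw):
    """Split the string from doc.Content.Text into a list of paragraphs."""
    # Word uses "\x0b" for a manual line break and "\x07" as an
    # end-of-cell marker in tables; normalize the first, drop the second.
    cleaned = raw.replace("\x0b", "\n").replace("\x07", "")
    # Paragraphs end with "\r"; skip the empty ones.
    return [p.strip() for p in cleaned.split("\r") if p.strip()]


def extract_paragraphs(path):
    """Open a .doc file through COM and return its paragraphs (hypothetical helper)."""
    import win32com.client  # pywin32; only works on Windows with Word installed

    wd = win32com.client.Dispatch("Word.Application")
    try:
        doc = wd.Documents.Open(path)
        try:
            return split_word_text(doc.Content.Text)
        finally:
            doc.Close(False)  # close the document without saving changes
    finally:
        # Assumption: this script started Word, so nothing else is open in it.
        wd.Quit()
```

Keeping the splitting logic separate from the COM calls means it can be tried out on any string, without Word running.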

I've not been able to find complete reference documentation for the Word 2003 object model: the "Word 2003 Object Model" page was blank for me under "Objects". It's probably best to look at the documentation for Office XP instead.
