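Converting HTML files to DOC with Python
I had a user manual in HTML format that needed to be converted to Word format, with multiple HTML files merged into a single doc. I previously tried another tool for this, but the procedure was cumbersome.
Python can do the job too, and wrapping it in a script makes the whole thing much more convenient.
How it works: pywin32 drives Office Word through COM to do the conversion and merging, so the script must run on Windows with Word installed.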
Installing pywin32:
- Download and run the installer that matches your version: https://sourceforge.net/projects/pywin32/
- Or use pip: https://pypi.python.org/pypi/pypiwin32
    pip install pypiwin32
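Because the approach depends on Windows and pywin32, a script can check for both before doing anything else. This is a minimal sketch of my own, not part of the original script: `com_prerequisites_met` is a hypothetical helper name, and it only checks the platform and whether the `win32com` package can be found. It does not verify that Word itself is installed.

```python
import importlib.util
import sys


def com_prerequisites_met():
    """Return True if running on Windows and the win32com package can be found."""
    on_windows = sys.platform == "win32"
    has_pywin32 = importlib.util.find_spec("win32com") is not None
    return on_windows and has_pywin32


if __name__ == "__main__":
    if not com_prerequisites_met():
        print("This script needs Windows with pywin32 installed (pip install pypiwin32).")
```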
Example: source code
Here are a few of the techniques I used:
- Convert HTML to doc
    doc = word.Documents.Add(filePath)
    doc.SaveAs(docFile, FileFormat=0)  # 0 = word.WdSaveFormat.wdFormatDocument
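The two calls above can be wrapped in a small function so that each HTML file is converted and closed in one step. This is a sketch under assumptions: `doc_path_for` and `convert_html_to_doc` are hypothetical names, the output naming scheme (same file name with a `.doc` extension) is my choice, and `word` is expected to be the Word application object obtained through pywin32.

```python
from pathlib import Path

WD_FORMAT_DOCUMENT = 0  # word.WdSaveFormat.wdFormatDocument


def doc_path_for(html_path, out_dir):
    """Map e.g. manual/intro.html to <out_dir>/intro.doc."""
    return Path(out_dir) / (Path(html_path).stem + ".doc")


def convert_html_to_doc(word, html_path, out_dir):
    """Open one HTML file in Word and save it as .doc; return the new path."""
    doc_file = doc_path_for(html_path, out_dir)
    doc = word.Documents.Add(str(html_path))
    try:
        doc.SaveAs(str(doc_file), FileFormat=WD_FORMAT_DOCUMENT)
    finally:
        doc.Close(False)  # close without asking to save again
    return doc_file
```

On Windows, `word` would typically come from `win32com.client.Dispatch('Word.Application')`. Passing absolute paths is probably the safer choice, since Word may resolve relative paths against its own working directory.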
- Merging
    finalDoc.Application.Selection.Range.InsertFile(docFile)
    finalDoc.Application.Selection.Range.InsertBreak(3)  # 3 = word.WdBreakType.wdSectionBreakContinuous
    finalDoc.Application.Selection.EndKey(6, 0)  # 6 = word.WdUnits.wdStory, 0 = word.WdMovementType.wdMove
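To merge several converted files, the same three calls can be repeated in a loop. The sketch below mirrors the call sequence from the snippet above; `append_docs` is a hypothetical name, and `final_doc` is assumed to be the Word document object that receives the content.

```python
WD_SECTION_BREAK_CONTINUOUS = 3  # word.WdBreakType.wdSectionBreakContinuous
WD_STORY = 6                     # word.WdUnits.wdStory
WD_MOVE = 0                      # word.WdMovementType.wdMove


def append_docs(final_doc, doc_files):
    """Insert each file at the selection, add a section break, then jump to the end."""
    selection = final_doc.Application.Selection
    count = 0
    for doc_file in doc_files:
        selection.Range.InsertFile(str(doc_file))
        selection.Range.InsertBreak(WD_SECTION_BREAK_CONTINUOUS)
        # Move the cursor to the end of the document so the next file is appended.
        selection.EndKey(WD_STORY, WD_MOVE)
        count += 1
    return count
```

The order of `doc_files` determines the order of the chapters in the merged document, so the caller should sort the list first.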
- Switch to Print Layout view
    finalDoc.ActiveWindow.View.Type = 3  # 3 = word.WdViewType.wdPrintView
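The snippets in this post pass Word's enumeration values as bare numbers. One way to keep them readable is to collect them in Python enums. This is only a sketch: the class names follow the Word enumeration names quoted in the comments, but the classes themselves are defined here and are not provided by pywin32.

```python
from enum import IntEnum


class WdSaveFormat(IntEnum):
    wdFormatDocument = 0


class WdBreakType(IntEnum):
    wdSectionBreakContinuous = 3


class WdUnits(IntEnum):
    wdStory = 6


class WdMovementType(IntEnum):
    wdMove = 0


class WdViewType(IntEnum):
    wdPrintView = 3


class WdAutoFitBehavior(IntEnum):
    wdAutoFitWindow = 2


class WdInlineShapeType(IntEnum):
    wdInlineShapeLinkedPicture = 4
```

An `IntEnum` member is also an `int`, so on the pure-Python side it compares equal to the bare number used in the original snippets. Wrapping it in `int(...)` before handing it to a COM call is a cautious extra step.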
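- Make tables auto-fit the page
    i = 0
    while i < len(finalDoc.Tables):
        try:
            finalDoc.Tables[i].AutoFitBehavior(2)  # 2 = word.WdAutoFitBehavior.wdAutoFitWindow
        except Exception as e:
            # Skip tables that Word cannot auto-fit and keep going
            print('failed to auto-fit table ' + str(i) + ': ' + str(e))
        i = i + 1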
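The table loop can also be written as a function that reports which tables could not be adjusted instead of stopping at the first failure. This is a sketch with a hypothetical name, `autofit_tables`; it assumes `doc.Tables` supports `len()` and zero-based indexing the way the snippet above uses it.

```python
WD_AUTO_FIT_WINDOW = 2  # word.WdAutoFitBehavior.wdAutoFitWindow


def autofit_tables(doc):
    """Apply window auto-fit to every table; return indexes of tables that failed."""
    failed = []
    for index in range(len(doc.Tables)):
        try:
            doc.Tables[index].AutoFitBehavior(WD_AUTO_FIT_WINDOW)
        except Exception:
            # A COM failure on one table should not abort the whole pass.
            failed.append(index)
    return failed
```

An empty return value means every table was adjusted; otherwise the indexes show which tables need a manual look.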
- Embed linked pictures in the document
    i = 0
    s = len(doc.InlineShapes)
    while i < s:
        if doc.InlineShapes[i].Type == 4:  # 4 = word.WdInlineShapeType.wdInlineShapeLinkedPicture
            doc.InlineShapes[i].LinkFormat.Update()
            link = doc.InlineShapes[i].LinkFormat.SourceFullName
            print(r' '*16 + 'going to handle picture: ' + str(i) + "/" + str(s) + ' -->' + link)
            doc.InlineShapes[i].LinkFormat.SavePictureWithDocument = True
        i = i + 1
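The same picture-embedding steps can be packaged as a function that returns the links it handled, which makes it easy to log or verify the result afterwards. This is a sketch: `embed_linked_pictures` is a hypothetical name, and it uses the same properties as the snippet above.

```python
WD_INLINE_SHAPE_LINKED_PICTURE = 4  # word.WdInlineShapeType.wdInlineShapeLinkedPicture


def embed_linked_pictures(doc):
    """Store every linked picture inside the document; return the source paths."""
    embedded = []
    for index in range(len(doc.InlineShapes)):
        shape = doc.InlineShapes[index]
        if shape.Type != WD_INLINE_SHAPE_LINKED_PICTURE:
            continue  # already embedded, or not a picture at all
        link_format = shape.LinkFormat
        link_format.Update()  # refresh the picture from its source first
        embedded.append(link_format.SourceFullName)
        link_format.SavePictureWithDocument = True
    return embedded
```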
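- When closing Word, it reports "This file is in use by another application or user" and prompts to save normal.dot. Marking the Normal template as saved avoids the prompt:
    word.NormalTemplate.Saved = 1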
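To make sure the Normal template is always marked as saved before Word exits, even when a conversion step raises, the cleanup can go into a context manager. This is a sketch of my own: `word_session` is a hypothetical helper, and `word` is assumed to be an already created Word application object.

```python
from contextlib import contextmanager


@contextmanager
def word_session(word):
    """Yield the Word application and always shut it down cleanly afterwards."""
    try:
        yield word
    finally:
        # Mark the Normal template as saved so Word does not prompt on exit.
        word.NormalTemplate.Saved = True
        word.Quit()
```

Usage would look like `with word_session(win32com.client.Dispatch('Word.Application')) as word: ...`, with the conversion and merge steps inside the `with` block.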
- Multithreading
    import pythoncom, win32com.client, threading, time

    def start():
        # Initialize
        pythoncom.CoInitialize()  # fixes com_error: (-2147221008, 'CoInitialize has not been called.', None, None)
        # Get instance
        xl = win32com.client.Dispatch('Excel.Application')
        # Create id
        xl_id = pythoncom.CoMarshalInterThreadInterfaceInStream(pythoncom.IID_IDispatch, xl)
        # Pass the id to the new thread
        thread = threading.Thread(target=run_in_thread, kwargs={'xl_id': xl_id})
        thread.start()
        # Wait for child to finish
        thread.join()

    def run_in_thread(xl_id):
        # Initialize
        pythoncom.CoInitialize()
        # Get instance from the id
        xl = win32com.client.Dispatch(
            pythoncom.CoGetInterfaceAndReleaseStream(xl_id, pythoncom.IID_IDispatch)
        )
        time.sleep(5)

    if __name__ == '__main__':
        start()
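When the worker thread can create its own Word instance rather than sharing one, the per-thread initialization can be hidden in a small wrapper. This is a sketch under assumptions: `run_with_com` is a hypothetical helper, the `com` parameter exists only so the function can be exercised without pywin32, and the `CoUninitialize` call is an addition that the example above does not make.

```python
import threading


def run_with_com(target, *args, com=None):
    """Run target(*args) in a new thread with COM initialized; return its result."""
    if com is None:
        import pythoncom as com  # pywin32 module, only available on Windows

    result = {}

    def worker():
        com.CoInitialize()  # every thread that uses COM must initialize it first
        try:
            result["value"] = target(*args)
        finally:
            com.CoUninitialize()

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return result.get("value")
```

The function passed as `target` should create the COM objects it needs inside the thread. Handing an object created in another thread straight to the worker is the case the marshalling calls in the example above are meant for.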
References:
https://github.com/zhoucc/easyDatasheet/blob/master/win32com.txt
http://blog.csdn.net/chenjl1031/article/details/8905354
http://blog.csdn.net/lzl001/article/details/8435048
http://msdn.microsoft.com/en-us/library/office/ff837519(v=office.15).aspx
http://www.extendoffice.com/documents/word/635-word-remove-all-hyperlinks.html
http://www.galalaly.me/index.php/2011/09/use-python-to-parse-microsoft-word-documents-using-pywin32-library/
http://www.cnblogs.com/Ss_Andy/archive/2010/09/25/1834386.html
https://stackoverflow.com/questions/26764978/using-win32com-with-multithreading