Friday, April 27, 2007

save pictures from MS Word

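Extracting pictures is an important part of parsing Word documents. Searching Google for Word picture extraction mostly turns up VBA approaches: write a script, embed it in the Word document, and run it. I am not familiar with VBA, and that approach does not meet my requirements for processing Word documents anyway. What I need is to parse the document from the outside: a program opens the Word file and reads its contents, which is quite different from the embedded VBA approach. In fact, since the upgrade to .NET Framework 2.0, VSTO has provided many convenient ways to work with Office files, Word included. This post describes how to extract pictures from Word with the help of VSTO and the clipboard.
Business requirement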

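There is a Word document that contains component descriptions and some pictures. Each picture has to be found and saved to a directory, then replaced in the document by a tag that points to the saved picture, for example: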

[img]img/PORT.doc/picture_2.Jpeg[img]

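The goal is that, once the component information has been parsed further and stored in a database, the website's presentation layer can locate and display the picture directly from the tag.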
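
As a minimal sketch of the tag format described above, the helper below builds the relative picture path and the tag text. The class name PictureTag and its two methods are hypothetical names introduced only for illustration; the path layout simply follows the example tag.

```csharp
using System;

public static class PictureTag
{
    // Relative path of the n-th picture of a document,
    // e.g. "img/PORT.doc/picture_2.Jpeg".
    public static string BuildPath(string documentName, int index)
    {
        return "img/" + documentName + "/picture_" + index + ".Jpeg";
    }

    // Wraps the path in the [img]...[img] marker used in this post.
    public static string BuildTag(string documentName, int index)
    {
        return string.Format("[img]{0}[img]", BuildPath(documentName, index));
    }
}
```

Calling PictureTag.BuildTag("PORT.doc", 2) yields the example tag shown above.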
A workable solution


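To extract the pictures we first have to find them in the Word document. A picture exists in Word in one of two forms, Shape or InlineShape, so both collections must be checked or some pictures will be missed. Not every Shape or InlineShape is a picture, though, so the type has to be checked first: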

A Shape is a picture if its type is one of:
MsoShapeType.msoPicture
MsoShapeType.msoLinkedPicture

An InlineShape is a picture if its type is one of:
WdInlineShapeType.wdInlineShapePicture
WdInlineShapeType.wdInlineShapeLinkedPicture

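Once a picture has been found, it is copied to the clipboard and can then be saved. The basic steps are:

1. Open the document. Because the pictures will be replaced, the document must be editable, so ReadOnly is set to false.
2. Read all the shapes, both Shape and InlineShape.
3. Take one shape and check whether it is a picture. If it is, select it and copy it to the clipboard.
4. Save the picture on the clipboard to the target directory.
5. Find the picture's position in the Word document and insert the picture tag in front of it.
6. Delete the picture from the Word document.
7. Continue with the next shape.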

Opening the document

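object missing = Missing.Value;  // System.Reflection.Missing, stands in for omitted optional arguments
object readOnly = false;         // the document will be edited, so it must not be opened read-only
object isVisible = true;         // assumption: the original post does not show this value
object o_fileName = fileName;

oWordApp = new ApplicationClass();
Document wordDoc = oWordApp.Documents.Open(ref o_fileName,
    ref missing, ref readOnly,
    ref missing, ref missing, ref missing,
    ref missing, ref missing, ref missing,
    ref missing, ref missing, ref isVisible,
    ref missing, ref missing, ref missing, ref missing);
wordDoc.Activate();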

Reading all the shapes

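// Copy both collections into one list before processing,
// since pictures are deleted from the document later on.
IList shapes = new ArrayList();
foreach (Shape shape in doc.Shapes)
{
    shapes.Add(shape);
}
foreach (InlineShape shape in doc.InlineShapes)
{
    shapes.Add(shape);
}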



Checking whether a shape is a picture

if (isCommonShape)
{
    commonShape = (Shape) shape;
    isPicture = (commonShape.Type == MsoShapeType.msoPicture ||
                 commonShape.Type == MsoShapeType.msoLinkedPicture);
}
else if (isInlineShape)
{
    inlineShape = (InlineShape) shape;
    isPicture = (inlineShape.Type == WdInlineShapeType.wdInlineShapePicture ||
                 inlineShape.Type == WdInlineShapeType.wdInlineShapeLinkedPicture);
}

Selecting the picture and copying it to the clipboard

// Shape.Select takes a Replace argument; InlineShape.Select takes none.
if (isCommonShape)
{
    commonShape.Select(ref missing);
}
else
{
    inlineShape.Select();
}

wordApp.Selection.CopyAsPicture();

Saving the picture to the target directory

System.Windows.Forms.Clipboard.GetImage().Save(fileNameOfPict, ImageFormat.Jpeg);
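
One caveat, stated as an assumption about the hosting program rather than something the original code shows: the Windows Forms Clipboard methods expect the calling thread to be in single-threaded apartment (STA) mode, which is not the default for console programs or worker threads. The sketch below runs a piece of work on a dedicated STA thread. StaRunner is a hypothetical helper introduced for illustration, not part of the original code.

```csharp
using System;
using System.Threading;

public static class StaRunner
{
    // Runs the given work on a dedicated STA thread and waits for it to finish.
    // An exception thrown by the work is captured and rethrown to the caller.
    public static void Run(ThreadStart work)
    {
        if (work == null) throw new ArgumentNullException("work");

        Exception failure = null;
        Thread thread = new Thread(delegate()
        {
            try { work(); }
            catch (Exception ex) { failure = ex; }
        });
        thread.SetApartmentState(ApartmentState.STA); // must be set before Start
        thread.Start();
        thread.Join();

        if (failure != null)
            throw new InvalidOperationException("Work on the STA thread failed.", failure);
    }
}
```

With such a helper, the extraction could be started as StaRunner.Run(delegate() { ProcessAllPicturesOfDoc(wordDoc); }); in a Windows Forms program whose Main method is already marked [STAThread], this wrapper should not be needed.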

Inserting the picture tag

object start = wordApp.Selection.Start; // start position of the selected shape
doc.Range(ref start, ref start).Text = string.Format("[img]{0}[img]", fileNameOfPict);

Deleting the picture from the Word document

commonShape.Delete(); // for an InlineShape, call inlineShape.Delete() instead




The complete code

// Namespaces this code relies on
using System;
using System.Collections;
using System.Drawing.Imaging;
using System.IO;
using System.Reflection;
using Microsoft.Office.Interop.Word;
using MsoShapeType = Microsoft.Office.Core.MsoShapeType;

// oWordApp is the ApplicationClass instance created when the document was opened.
public void ProcessAllPicturesOfDoc(Document doc)
{
    IList shapes = new ArrayList();
    foreach (Shape shape in doc.Shapes)
    {
        shapes.Add(shape);
    }
    foreach (InlineShape shape in doc.InlineShapes)
    {
        shapes.Add(shape);
    }
    ExtractShape(shapes, doc, oWordApp);
}

public void ExtractShape(IList shapes, Document doc, ApplicationClass wordApp)
{
    object missing = Missing.Value;
    string pictDirect = "img/" + doc.Name + "/";
    int i = 0;

    foreach (object shape in shapes)
    {
        bool isPicture;
        bool isCommonShape = shape is Shape;
        bool isInlineShape = shape is InlineShape;

        Shape commonShape = null;
        InlineShape inlineShape = null;
        // check whether the shape is a picture
        if (isCommonShape)
        {
            commonShape = (Shape) shape;
            isPicture = (commonShape.Type == MsoShapeType.msoPicture ||
                         commonShape.Type == MsoShapeType.msoLinkedPicture);
        }
        else if (isInlineShape)
        {
            inlineShape = (InlineShape) shape;
            isPicture = (inlineShape.Type == WdInlineShapeType.wdInlineShapePicture ||
                         inlineShape.Type == WdInlineShapeType.wdInlineShapeLinkedPicture);
        }
        else
        {
            throw new Exception("unknown shape");
        }

        if (isPicture)
        {
            i++;
            // select the range of the shape
            // note: the two Select methods have different signatures
            if (isCommonShape)
            {
                commonShape.Select(ref missing);
            }
            else
            {
                inlineShape.Select();
            }
            // copy the picture to the clipboard
            wordApp.Selection.CopyAsPicture();
            if (System.Windows.Forms.Clipboard.ContainsImage())
            {
                if (!Directory.Exists(pictDirect))
                    Directory.CreateDirectory(pictDirect);
                string fileNameOfPict = pictDirect + "picture_" + i.ToString() + ".Jpeg";
                // save the picture
                System.Windows.Forms.Clipboard.GetImage().Save(fileNameOfPict, ImageFormat.Jpeg);
                // insert the img tag at the start position of the shape
                object start = wordApp.Selection.Start;
                doc.Range(ref start, ref start).Text = string.Format("[img]{0}[img]", fileNameOfPict);
                // delete the picture
                if (isCommonShape)
                {
                    commonShape.Delete();
                }
                else
                {
                    inlineShape.Delete();
                }
            }
            else
            {
                throw new Exception("an error occurred while copying the picture");
            }
        }
    }
}

The code above was tested with VS2005 + Office 2003 + Windows XP SP2.
