Friday 29 August 2008

Extracting Data from Word

A bit of an 'interesting' aside today - looking at a 'low-cost' process for exporting some data from Word files.

Marking up Word Docs
As part of my investigation into metadata, Chris Gray has sent me 20 Programme Specification documents to look at. One key issue is consistency: the data entered is either missing (not known or not required) or is not in a set format.

To get consistency, the data may need to be reviewed and processed so that it is fit for purpose. Luckily, Word allows areas of text to be marked up using XML tags, and the whole document can then be saved as an XML 'data file'.

The Marking up process

[1] First, an XSD file needs to be created - this is an XML Schema file that contains the elements to be used for marking up (sadly I cannot find a Dublin Core version, so I am using one that I have written myself) - ProgSpec.xsd.
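As a rough illustration of what such a schema file might contain, here is a minimal sketch in Python. It is not the actual ProgSpec.xsd: the namespace (http://example.org/progspec) and the element names (ProgrammeTitle, AwardingBody, UCASCode) are made up for the example, and a real schema would list whatever fields the Programme Specifications actually hold.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema: the namespace and element names are assumptions
# made for illustration, not the contents of the real ProgSpec.xsd.
PROGSPEC_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/progspec"
           xmlns="http://example.org/progspec"
           elementFormDefault="qualified">
  <xs:element name="ProgrammeSpecification">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element name="ProgrammeTitle" type="xs:string" minOccurs="0"/>
        <xs:element name="AwardingBody" type="xs:string" minOccurs="0"/>
        <xs:element name="UCASCode" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def declared_elements(xsd_text):
    """Return the names of all elements declared in the schema, in document order."""
    root = ET.fromstring(xsd_text)
    return [el.get("name") for el in root.iter("{%s}element" % XS)]


def write_schema(path="ProgSpec.xsd"):
    """Write the sketch schema to disk so it can be added in Word."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(PROGSPEC_XSD)
```

Calling write_schema() would produce a file that could be tried in the next step, and declared_elements(PROGSPEC_XSD) is a quick check that the schema is well-formed and lists the elements you expect.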


[2] Open a Word file then go to Tools > Templates and Add-ins > XML Schema Tab > add schema

Apply the schema to the whole document. After this, it is simply a matter of highlighting text in the document and then selecting the element in the schema tree that it is to be marked up with.

Once the document has been marked up, save it as an XML file, but make sure you select 'save data only', as this removes all the presentation information that Word generates.

When you view this document in a web browser, you will see the 'pure' data that you have marked up.

What this file gives you is an XML document from which data can be extracted.
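A minimal sketch of that extraction step in Python, using only the standard library. The sample XML, its namespace and its element names are hypothetical stand-ins for a 'data only' file saved from Word, so treat this as an outline under those assumptions rather than a tested recipe for real Programme Specifications. It also flags fields that are empty or absent, which is the consistency problem described earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical 'data only' output; real files will use whatever
# namespace and element names the applied schema defines.
SAMPLE = """<ProgrammeSpecification xmlns="http://example.org/progspec">
  <ProgrammeTitle>BSc Computing</ProgrammeTitle>
  <AwardingBody>  Example University </AwardingBody>
  <UCASCode></UCASCode>
</ProgrammeSpecification>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_fields(xml_text):
    """Map each leaf element's name to its trimmed text (None if empty)."""
    root = ET.fromstring(xml_text)
    fields = {}
    for el in root.iter():
        if len(el) == 0:  # leaf element: no child elements
            text = (el.text or "").strip()
            fields[local_name(el.tag)] = text or None
    return fields


def missing_fields(fields, expected):
    """List the expected field names that are absent or empty."""
    return [name for name in expected if not fields.get(name)]


fields = extract_fields(SAMPLE)
gaps = missing_fields(
    fields, ["ProgrammeTitle", "AwardingBody", "UCASCode", "Duration"]
)
```

Here fields holds the two populated values plus an empty UCASCode, and gaps lists UCASCode (present but empty) and Duration (never marked up). Run over all 20 documents, a report like this would show where the data is missing before any further processing.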

This process would obviously be a lot easier (or not necessary at all) if the documents were already marked up.

1 comment:

atlas245 said...

I thought the post made some good points on extracting data. For extracting data I use Python for simple things. Data extraction can be a time-consuming process, but for larger projects like files, the web, or documents I tried http://www.extractingdata.com, which worked great; they build quick custom screen scrapers and data extraction and data parsing programs.