Friday, 29 August 2008

Extracting Data from Word

A bit of an 'interesting' aside today - looking at a 'low-cost' process for exporting some data from Word files.

Marking up Word Docs
As part of my investigation into meta-data, Chris Gray has sent me 20 Programme Specification documents to look at. One key issue relates to consistency, inasmuch as the data entered is either missing (not known or not required) or not in a set format.

In response to this, and in order to get consistency, there may be a need to review the data and process it for purpose. Luckily, Word does allow areas of text to be marked up using XML tags and the whole document to then be saved as an XML 'data file'.

The Marking up process

[1] Firstly, an xsd file needs to be created - this is an XML Schema file that contains the elements of the schema to be used for marking up (sadly I cannot find a Dublin Core version - so I am using one that I have written myself) - ProgSpec.xsd.
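For reference, a minimal sketch of what a schema along these lines might contain - note the element names here are illustrative assumptions, not the actual contents of ProgSpec.xsd:

```python
# Sketch: write a minimal XML Schema for marking up Programme Specifications.
# The element names are illustrative guesses - the real ProgSpec.xsd may differ.
PROG_SPEC_XSD = """<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="urn:progspec" xmlns="urn:progspec"
           elementFormDefault="qualified">
  <xs:element name="progSpec">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="programmeTitle" type="xs:string"/>
        <xs:element name="finalAward" type="xs:string"/>
        <xs:element name="teachingInstitution" type="xs:string"/>
        <xs:element name="dateOfProduction" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

with open("ProgSpec.xsd", "w", encoding="utf-8") as f:
    f.write(PROG_SPEC_XSD)
```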


[2] Open a Word file, then go to Tools > Templates and Add-ins > XML Schema tab > Add Schema

Apply the schema to the whole document. After this, it is simply a matter of highlighting text in the document and then selecting which element in the schema tree it should be marked up with.

Once the document has been marked up, save it as an XML file, but make sure you select 'Save data only', as this removes all the presentation information that Word generates.

When you view this document in a web browser - you will see the 'pure' data that you have marked up.

What this gives you is an XML document from which the data can be extracted.
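As a quick sketch of that extraction step - assuming the illustrative element names from the schema sketch above, something like this would pull the marked-up values out of the saved file:

```python
# Sketch: read the 'save data only' XML file and pull out the marked-up values.
# Assumes the illustrative element names and namespace from the schema above.
import xml.etree.ElementTree as ET

NS = {"ps": "urn:progspec"}

tree = ET.parse("progspec_data.xml")  # the data-only file saved from Word
root = tree.getroot()

for field in ("programmeTitle", "finalAward", "teachingInstitution", "dateOfProduction"):
    node = root.find(f"ps:{field}", NS)
    value = node.text.strip() if node is not None and node.text else "(missing)"
    print(field, "=", value)
```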

This process would obviously be a lot easier (and indeed unnecessary) if the document was already marked up.

Tuesday, 19 August 2008

Metadata Schemas Are Not Neutral

Choosing which meta-data scheme to use is still proving difficult, as it is very hard to work out how each would actually be used - I am still quite keen on building my own using DC.

Interesting article: 'Metacrap: Putting the torch to seven straw-men of the meta-utopia'

Representing the collection

One aspect of the validation process that needs to be reflected in the repository is that it is a collection of documents. How we record this is important. As noted in previous meetings, using the category function is not appropriate, as it can only be used if the 'taxonomy' is not going to change - something that cannot be guaranteed for faculties, subjects etc.

In regard to the validation collections - Chris Gray said that he organised them by creating folders named after the title of the award and the date - e.g. MA Media Management 02/06/08 - into which he put all related documents.

In this folder there could then be sub-folders related to the types of documents - namely Pre-Val, Val Report and Post-Val.

Therefore, the key metadata to collect here is:
Collection Title: e.g. MA Media Management
Pre-Validation Document: e.g. Programme Specification (possibly one of many docs)
Validation Panel Document: e.g. Validation Report (possibly only one)
Post-Validation Document: e.g. Amended Student Handbook (possibly one of many docs)

Based on this idea - for each and every document recorded in the collection, you would enter the following metadata:
  1. Programme Title
  2. University Faculty / School
  3. Method of Delivery
  4. Final Award
  5. Mode of Attendance
  6. TheSiS / Award Code
  7. Teaching Institution
  8. Accreditation / Professional / Statutory Body
  9. Date of Production (validation)

The 'Final Award' (supra-level) + 'Programme Title' + 'Date of Production' could be used to create the 'Collection Title'. The example given would therefore be MA Media Management 02/06/08, recorded against each and every document being uploaded.
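Deriving the 'Collection Title' is then mechanical - a minimal sketch (the function and field names are my own, for illustration):

```python
# Sketch: build the 'Collection Title' from Final Award + Programme Title
# + Date of Production. Names and date format are illustrative assumptions.
def collection_title(final_award: str, programme_title: str, date_of_production: str) -> str:
    return f"{final_award} {programme_title} {date_of_production}"

print(collection_title("MA", "Media Management", "02/06/08"))
# -> "MA Media Management 02/06/08"
```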

With respect to the type of document being uploaded - something would need to be recorded to state whether it was a 'Pre-Validation' or a 'Post-Validation' submitted document - or even a 'Validation Panel' document.

Hive Explorer

Have been looking at Hive Explorer (a cut down Hive interface).

One feature I looked at was the 'import from file' option in the metadata section of the upload screen.

It turns out that you can create an xml file (referencing a metadata scheme) that contains the metadata you want to record for the document you are uploading.

You can use this as the metadata input for your file.

Why is this interesting?


From a purely technical and implementation perspective - we now know that Hive can access a document and use it to populate metadata fields. This means that if you can generate an XML file that contains the metadata you want to capture, there is a 'relatively' simple process available to upload this data at the same time as uploading a document.
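I don't yet know the exact format Hive expects for this file, but as a sketch, generating one programmatically might look like the following - the element names and structure are assumptions that would need checking against an exported example:

```python
# Sketch: generate a metadata XML file to feed Hive's 'import from file' option.
# The element names and overall structure are assumptions - the format Hive
# actually expects would need to be confirmed against a real exported file.
import xml.etree.ElementTree as ET

metadata = {
    "programmeTitle": "Media Management",
    "finalAward": "MA",
    "dateValidated": "02/06/08",
}

root = ET.Element("metadata")
for name, value in metadata.items():
    child = ET.SubElement(root, name)
    child.text = value

ET.ElementTree(root).write("item_metadata.xml", encoding="utf-8", xml_declaration=True)
```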

Once uploaded - view the metadata for the item using the 'Export Item Metadata' option. This loads an XML view of the document in a web browser - and this should match what was written in the XML file that was uploaded with the document.


How useful this is will become clearer in time....work continues....

Contact from Andy Powell Re: DC Metadata

An email from Andy Powell - regarding extending the Dublin Core metadata tree for this project:

Ben,
I no longer work at UKOLN and I no longer have much to do with metadata so this is a brief response.


Looking at your list of properties none look like they are sub-properties of the DC 15 with the exception of Validation Date.

On that basis, I would create a whole new set of properties using a namespace called something like 'qa' (for which you will need to assign a URI, e.g. http://purl.org/qa/).

Suggested properties listed below:
> 1 Teaching Institution (e.g. Staffordshire University, SURF, UK Non-SURF, Overseas) qa:teachingInstitution (http://purl.org/qa/teachingInstitution) - note URLs for further properties follow this format

> 2 Accreditation / Professional / Statutory Body qa:statutaryBody

> 3 Final Award (e.g. CertHE, BA, BA (Hons), BSc) qa:finalAward

> 4 Programme Title qa:programmeTitle

> 5 UCAS Code(s) qa:UCASCode

> 6 QAA Subject Benchmarking Group(s) qa:QAASubjectBenchmarkingGroup

> 7 Date of Production (validation date) qa:dateValidated

> 8 University Faculty / School qa:facultyOrSchool

> 9 Mode of Attendance / Delivery Method (e.g. Part Time / Full Time) qa:modeOfAttendance

> 10 Method of Delivery (e.g. Blended, Distance, Face-2-Face) qa:methodOfDelivery

> 11 THESIS / Award course code qa:courseCode

These properties need to be declared using RDFS (along the same lines as the DCMI properties at http://purl.org/dc/terms/), with dateValidated flagged as a subproperty of dcterms:date. You can put the RDFS anywhere you like, then set up the http://purl.org/qa/ URL to redirect to that place.

Hope this helps,
Andy


Although good to know - this could be more work than necessary.
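Still, for the record, here is a minimal sketch of what declaring one of the suggested properties might look like in practice, using the Python rdflib library - the label text and output format are my own choices, not from the email:

```python
# Sketch: declare one of the suggested 'qa' properties in RDFS using rdflib,
# flagging qa:dateValidated as a subproperty of dcterms:date as suggested.
# The label text is an illustrative assumption.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

QA = Namespace("http://purl.org/qa/")
DCTERMS = Namespace("http://purl.org/dc/terms/")

g = Graph()
g.bind("qa", QA)
g.bind("dcterms", DCTERMS)

g.add((QA.dateValidated, RDF.type, RDF.Property))
g.add((QA.dateValidated, RDFS.label, Literal("Date Validated", lang="en")))
g.add((QA.dateValidated, RDFS.subPropertyOf, DCTERMS.date))

# Serialise as RDF/XML - this is the file http://purl.org/qa/ would redirect to.
print(g.serialize(format="xml"))
```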

Friday, 8 August 2008

The technical solution

Technical meeting - identifying use-cases and current practices
Attended: Chris Gray, Sam Rowley, Song Y and myself.

We looked at how the current validation documents were submitted, recorded (stored) and used by the QA dept.
  • Historically, documents were submitted in paper format (in a box file) - these were recorded and stored in the QA dept.
  • Recently, validation documents were also submitted in an electronic format and these were stored in a shared drive. The folder structure for this was by faculty >> award code.
  • We decided that trying to replicate this in the repository would not be appropriate or necessary - principally because faculties and organizational structures change. However, it was agreed that recording the faculty name in the metadata (somewhere) would be useful for searching purposes, as it would be a key searchable field. The use case would be a user finding validations that were submitted by a particular faculty.
  • With respect to the documents that are collected, these would be:
    • Programme Specifications
    • Handbook (Student and Award)
    • Module Descriptor
    • Mentor Handbook (Foundation Degrees)
    • Validation Report (often in pdf format)
    • Generic Validation Support Documents (could be multiple instances)
  • Essentially, we noted that the key documentation of interest would be associated with validations that had been successful (not that unsuccessful validations wouldn't be interesting - it was just an issue of ethics). With this in mind, the QA documents could be 'graded' into the following 'types':
    • Pre-Validation Documents (originally submitted)
    • Validation Report (conditions for success)
    • Post-Validation Documents (amended for success)
Meta-data parsing - is it a pipe dream?
We discussed the feasibility of using a program to extract key words from the validation documents, to assist in completing the task of entering the key metadata that needs to be recorded by the DIVAS system. Essentially, the key document of interest is the 'Programme Specification', which has some key fields that match the type of metadata that needs recording:
  • Awarding Body
  • Teaching Institution
  • Accreditation by Professional / Statutory Body
  • Final Awards
  • Programme Title
  • UCAS Codes (possibly not required for metadata)
  • QAA Subject Benchmarking Group (possibly not required for metadata)
  • Date of Production
Not recorded on the Programme Specification - but identified as being required:
  • University Faculty / School
  • Method of Delivery (e.g. Face-2-Face, Blended)
  • Mode of Attendance / Delivery Method (e.g. PT / FT)
  • TheSiS / Award Code

If this is technically possible, the idea is to use this functionality to populate fields in an interface that can be used to assist someone in uploading documents to HIVE. This interface would therefore assist the user in completing the following tasks:
  • Input and record the key metadata for the validation documents (for all documents)
  • Upload documents (as though they were a collection), also indicating which were 'Pre' and 'Post' validation documents (along with the main validation document)
More work is being carried out by Sam, Song and myself into how this can be achieved technically.
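As a first sketch of the parsing idea - since Programme Specifications present these fields as labelled entries, a naive label-matching pass might be enough to pre-populate an upload form. The labels and 'Label: value' layout below are assumptions based on the documents seen so far:

```python
# Sketch: naive label matching to pull candidate metadata values out of the
# plain text of a Programme Specification. Assumes 'Label: value' style lines,
# which would need checking against the real documents.
import re

LABELS = [
    "Awarding Body",
    "Teaching Institution",
    "Final Awards",
    "Programme Title",
    "Date of Production",
]

def parse_prog_spec(text: str) -> dict:
    found = {}
    for label in LABELS:
        match = re.search(rf"{re.escape(label)}\s*[:\-]\s*(.+)", text)
        if match:
            found[label] = match.group(1).strip()
    return found

sample = "Programme Title: Media Management\nFinal Awards: MA"
print(parse_prog_spec(sample))
```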

Back to the LOM?
For some weeks I have been looking at the most appropriate/useful metadata scheme to use (in conjunction with HIVE). After my meeting with library colleagues, it was noted that using a simple scheme was appropriate, so it was assumed that Dublin Core may be the most useful. However, with the team now looking at using the API functionality of HIVE, it could be argued that behind a simple interface we can use a more complex scheme (like LOM) that offers a greater range of fields/attributes - as the user would not be intimidated by all the fields that needed to be populated (many of which would be extraneous and confusing). I will look into which LOM fields could be used to record the key metadata for the project.

Work in progress
  1. Understanding and exploring the API functionality of HIVE for the purposes of creating a user-friendly way of interacting with HIVE - to complete the following tasks:
    • Uploading documents into HIVE
    • Searching of HIVE
    • Embedding an API for HIVE in NING
  2. Investigating how to extract data from Word documents
  3. Investigate the LOM scheme for HIVE - what needs to be recorded
Future work would involve building in advanced functionality of an API - such as recording metadata for all uploaded documents (like a collection), recording the 'pre' and 'post' validation status of each document, and representing any relationships between documents (to reflect the validation process).

Useful report: Nine questions to guide you in choosing a metadata schema

Building a solution

A week of meetings
Have had a good run of meetings with colleagues about this project - the result of which has been an emergent solution to the requirements - with the following dimensions:
  • Technical
  • Social / Community of Practice
  • QA processes
  • User / Outputs
DIVAS Meeting
The second meeting so far, and very useful in moving things forward. In terms of what was discussed, the following was highlighted:
  • A lot of the 'value' delivered by the project can be delivered through the community of practice application (NING). Essentially, the meta-data is limited in what it can capture, mainly due to the data being validation specific and too varied to be isolated. Fundamentally, the metadata is useful in top-level searching of validation documents, but the 'hidden' value can really only be discussed in a more text friendly environment.
  • The NING social network (and events) will allow people to offer support alongside the outputs of the validation process.
  • The issue of document findability was raised (as it is a key concern) and the method through which it can be presented to users (through NING?). The technical members identified a case to investigate the HIVE API functionality to offer a customizable solution to presenting search data. (See http://en.wikipedia.org/wiki/API)
  • The HIVE API was offered as something which would be useful to investigate, not only for this project, but as an exercise that would be useful in exploring the capabilities of the repository. In addition, this API approach would also be useful in simplifying the process of how metadata is collected and recorded against any uploaded documents.
  • The purpose or 'vision' of the project was discussed - aiming to identify some over-arching requirements for the project, in terms of what outputs are required for it to be successful - i.e. what a good solution would look like. It was noted that a good solution would involve matching current QA processes (in regard to recording documentation/findability); having a system that can be 'interrogated' by a user in order to find validation examples that match their area of interest (this would include a 'list' of related documents for that specific validation); and a search tool that could be embedded or available within the 'context' of a supportive social network.
  • A further meeting was agreed between Sam Rowley, Song Ye, Myself and Chris Gray - to discuss QA processes, technical requirements and use-case scenarios.

Friday, 1 August 2008

Finding the value (Distilling)

Recording the 'value' found in the validation documents is quite difficult. One of the main problems is knowing what the value really is. On one level, the validation documents, particularly the reports, are full of text-based information that is context specific, so the issue here is one of qualifying the information and quantifying how it should be represented.

Reviewing selected 'authentic' documents

I have received some selected validation documents to review. These documents include validation report documents and the originally submitted material. From my initial assessment, it would appear that most have issues very specific to the validation process that would be difficult to categorise for the casual reviewer. However, in terms of some broad themes, these come under the following headings/categories:
  • Student Handbook Revisions
  • Module Handbook Revisions
  • Module Descriptors Revisions
  • Some specific Blackboard VLE issues to resolve
It appears that most of the validation reports reviewed would reference most (if not all) of these headings/categories.

In terms of where to go next - the question is how to collect 'keywords' or headings that would be helpful in identifying a validation report as something which has specific issues of interest. This could be done by 'atomising' these headings into sub-keywords/sub-headings - but this would require creating numerous metadata fields to complete.
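As a rough sketch of what that 'atomising' might look like in practice - the sub-keywords here are purely illustrative, not drawn from the documents:

```python
# Sketch: 'atomise' the broad report headings into candidate sub-keywords
# that could drive extra metadata fields. The sub-keywords are illustrative
# placeholders only - the real ones would come from reviewing the reports.
SUB_KEYWORDS = {
    "Student Handbook Revisions": ["assessment criteria", "appeals process"],
    "Module Handbook Revisions": ["reading lists", "learning outcomes"],
    "Module Descriptors Revisions": ["credit values", "prerequisites"],
    "Blackboard VLE issues": ["course shell setup", "enrolment"],
}

for heading, keywords in SUB_KEYWORDS.items():
    print(heading, "->", ", ".join(keywords))
```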

In terms of value - the community of practice social site would be useful in 're-purposing' and presenting these documents, and thus may offer more of a value-adding exercise than relying on keywords etc. - which are in essence only valuable with respect to findability and 'marking up' documents as being of interest to the user (to investigate further - drill down).

Work continues.....