Greenstone tutorial exercise

Enhanced Word document handling

By default, Greenstone processes Word documents by converting them to HTML with a third-party program, wvWare, which does not always convert them well. If you are using Windows and have Microsoft Word installed, you can use Windows native scripting to get a better conversion. If the original document was hierarchically structured using Word styles, these styles can be used to structure the resulting HTML. Word document properties can also be extracted as metadata.

  1. In your digital library, preview the reports collection. Look at the HTML versions of the Word documents and notice that they have no structure: they have been converted to flat documents.

Using Windows native scripting

  1. In the Librarian Interface, open up the reports collection. Switch to the Design panel and select the Document Plugins section on the left-hand side. Double click the WordPlug plugin and switch on the windows_scripting option.

  1. Build the collection. You will notice that Microsoft Word is started for each Word document: the document is saved as HTML from Word itself, to get a better conversion. Preview the collection. In the Titles A-Z list, notice that word03.doc and word06.doc now have a book icon, rather than a page icon. These two documents now appear with hierarchical structure, but they are the only ones that do.

    The default behaviour for WordPlug with windows_scripting is to section the document based on "Heading 1", "Heading 2", "Heading 3" styles. If you open up the word03.doc or word06.doc documents in Word, you will see that the sections use these Heading styles.

    Note: to view style information in Word, select Format → Styles and Formatting from the menu, and a side bar will appear on the right-hand side. Click on a section heading and its formatting information will be displayed in this side bar.

  1. Some of the documents do not use styles (e.g. word01.doc) and no structure can be extracted from them. Some documents use user-defined styles. WordPlug can be configured to use these styles instead of Heading 1, Heading 2 etc. Next we will configure WordPlug to use the styles found in word05.doc.
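
The default sectioning on "Heading 1", "Heading 2" and "Heading 3" styles can be pictured with a small Python sketch. This is only an illustration of the idea, not WordPlug's actual code: the `build_outline` function, the dictionary layout and the sample paragraphs are all hypothetical.

```python
# Heading styles named in the tutorial; mapping them to depths 1-3 is an assumption.
HEADING_LEVELS = {"Heading 1": 1, "Heading 2": 2, "Heading 3": 3}


def build_outline(paragraphs):
    """Nest (style, text) paragraphs into sections according to heading level."""
    root = {"title": None, "level": 0, "text": [], "children": []}
    stack = [root]  # path from the root to the section currently being filled
    for style, text in paragraphs:
        level = HEADING_LEVELS.get(style)
        if level is None:
            # Not a heading style: the text belongs to the current section.
            stack[-1]["text"].append(text)
            continue
        # Close every open section at the same or a deeper level.
        while stack[-1]["level"] >= level:
            stack.pop()
        section = {"title": text, "level": level, "text": [], "children": []}
        stack[-1]["children"].append(section)
        stack.append(section)
    return root


# Hypothetical paragraphs, as (Word style, text) pairs.
paragraphs = [
    ("Heading 1", "Introduction"),
    ("Normal", "Some opening text."),
    ("Heading 2", "Background"),
    ("Normal", "More text."),
    ("Heading 1", "Methods"),
]
outline = build_outline(paragraphs)
flat = build_outline([("Normal", "No styles here.")])
```

In this sketch a document with no heading styles (like `flat`) ends up with no child sections at all, which mirrors why a document such as word01.doc stays flat.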

Modes in the Librarian Interface

  1. The Librarian Interface can operate in four modes. Go to File → Preferences... → Mode and see the four modes and what functionality they provide access to. Librarian is the default mode.

  1. Change the mode to Library Systems Specialist because you will need to use regular expressions to set up the style options in the next part of the exercise.

Defining styles

  1. Open up word05.doc in Word (by double-clicking on it in the Gather pane), and examine the title and section heading styles. You will see that various user-defined header styles are set.

  1. In the Document Plugins section of the Design panel, select WordPlug and click <Configure Plugin...>. Four types of header can be set:

    • level1_header (level1Header1|level1Header2|...)
    • level2_header (level2Header1|level2Header2|...)
    • level3_header (level3Header1|level3Header2|...)
    • title_header (titleHeader1|titleHeader2|...)

    These header options define which styles should be considered as title, level 1, level 2 and level 3 styles.

    Set the options as follows (spaces are removed when converting to HTML styles):

    level1_header: (SammaryHeader|ChapterTitle|ReferenceHeading)
    level2_header: SectionHeading
    title_header: PaperTitle

    If you can't see these options in the WordPlug configuration pane, check that you are in Library Systems Specialist mode as described above.

    Once these are set, click <OK>.

  1. Close any documents that are still open in Word, as this can prevent the build process from completing correctly.

  1. Build the collection and preview it. Look in particular at word05.doc. You will see that this document is now also hierarchically structured.

    If you have documents with different formatting styles, you can use (...|...) to specify all of the different styles.
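
To see how the (...|...) values behave as regular expressions, the following Python sketch classifies style names against the option values set above. It is a rough illustration under stated assumptions: the `classify_style` helper is hypothetical, and matching the whole style name (rather than a substring) is an assumption about how WordPlug applies the expressions. The style names with spaces are invented examples of what Word might show before spaces are removed.

```python
import re

# Option values as set in this exercise (level3_header is not set here).
HEADER_OPTIONS = {
    "title_header": "PaperTitle",
    "level1_header": "(SammaryHeader|ChapterTitle|ReferenceHeading)",
    "level2_header": "SectionHeading",
}


def classify_style(word_style):
    """Return the matching header option for a Word style name, or None."""
    # Spaces are removed when converting to HTML styles.
    html_style = word_style.replace(" ", "")
    for option, pattern in HEADER_OPTIONS.items():
        # Assumption: the whole style name must match the expression.
        if re.fullmatch(pattern, html_style):
            return option
    return None


results = {
    style: classify_style(style)
    for style in ["Chapter Title", "Section Heading", "PaperTitle", "Normal"]
}
```

A style that matches none of the expressions, such as "Normal" here, is treated as ordinary body text in this sketch.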

Removing pre-defined table of contents

  1. If you look at word06.doc you will see that it now has two tables of contents. One is generated by Greenstone based on the document's styles, the other was already defined in the Word document. WordPlug can be configured to remove predefined tables of contents and tables of figures. The tables must be defined with Word styles in order for this to work.

  1. To remove the tables of contents and figures from word06.doc, switch on the delete_toc option in WordPlug. Set the toc_header option to (MsoToc1|MsoToc2|MsoToc3|MsoTof). In this document, the table of contents and list of figures use these four style names. Click <OK>.

  1. Build and preview the collection. word06.doc should now have only one table of contents.

  1. Switch the Librarian Interface back to Librarian mode (File → Preferences... → Mode).
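
The effect of delete_toc together with toc_header can be approximated with a short Python sketch that drops every paragraph whose class matches the toc_header expression. This is a simplified illustration, not the plugin's implementation: the `strip_toc` function, the paragraph regular expression and the sample HTML are assumptions, and real HTML saved from Word is considerably messier.

```python
import re

# The toc_header value used in this exercise.
TOC_HEADER = "(MsoToc1|MsoToc2|MsoToc3|MsoTof)"

# Simplified pattern for a paragraph with a class attribute (quoted or not).
PARAGRAPH = re.compile(r'<p class="?([A-Za-z0-9]+)"?[^>]*>.*?</p>\s*', re.DOTALL)


def strip_toc(html, toc_pattern=TOC_HEADER):
    """Remove paragraphs whose class name matches the toc_header expression."""

    def keep_or_drop(match):
        if re.fullmatch(toc_pattern, match.group(1)):
            return ""  # table of contents or table of figures entry
        return match.group(0)  # any other paragraph is left untouched

    return PARAGRAPH.sub(keep_or_drop, html)


# Hypothetical fragment of HTML saved from Word.
sample = (
    "<p class=MsoToc1>1 Introduction</p>\n"
    "<p class=MsoTof>Figure 1: Overview</p>\n"
    "<p class=MsoNormal>Body text.</p>\n"
)
cleaned = strip_toc(sample)
```

Because the removal is driven purely by style names, a table of contents typed in by hand without Word styles would pass through unchanged, which is why the tables must be defined with Word styles for this option to work.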

Extracting document properties as metadata

  1. Word document properties can be extracted as metadata. By default, only the Title will be extracted. Other properties can be extracted using the metadata_fields option.

  1. In the Enrich panel, look at the metadata that has been extracted for word05.doc and word06.doc. Now open the documents in Word and look at what properties have been set (File → Properties). They have Title, Author, Subject, and Keywords properties. WordPlug can be configured to look for these properties and extract them.

  1. In the Design panel, under Document Plugins, configure WordPlug once again. Switch on the configuration option metadata_fields. Set the value to

    Title,Author<Creator>,Subject,Keywords<Subject>

    This will make WordPlug try to extract Title, Author, Subject and Keywords metadata. Title and Subject will be saved with the same name, while Author will be saved as Creator metadata, and Keywords as Subject metadata.

  1. Make sure you have closed all the documents that were opened, then rebuild the collection.

  1. Look at the metadata for the two documents again in the Enrich panel. You should now see ex.Creator and ex.Subject metadata items. This metadata can now be used in display formats, browsing classifiers, and so on.
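
The metadata_fields value used above follows a simple pattern: a comma-separated list of property names, each optionally followed by the metadata name to save it under in angle brackets. The Python sketch below shows one way to read such a value. The helper names and the sample property values are hypothetical, and collecting several values under one metadata name in a list is an assumption, not a description of how Greenstone stores them.

```python
def parse_metadata_fields(spec):
    """Map each Word property name to the metadata name it is saved under."""
    mapping = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if item.endswith(">") and "<" in item:
            # "Author<Creator>": save the Author property as Creator.
            prop, target = item[:-1].split("<", 1)
        else:
            # "Title": save under the same name.
            prop, target = item, item
        mapping[prop.strip()] = target.strip()
    return mapping


def to_extracted_metadata(properties, mapping):
    """Rename document properties, using the ex. prefix seen in the Enrich panel."""
    metadata = {}
    for prop, target in mapping.items():
        value = properties.get(prop)
        if value:
            metadata.setdefault("ex." + target, []).append(value)
    return metadata


mapping = parse_metadata_fields("Title,Author<Creator>,Subject,Keywords<Subject>")

# Hypothetical property values for one document.
properties = {
    "Title": "A Report",
    "Author": "J. Smith",
    "Subject": "Libraries",
    "Keywords": "greenstone, word",
}
metadata = to_extracted_metadata(properties, mapping)
```

Here both Subject and Keywords end up under ex.Subject, matching the renaming described in the exercise.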