Greenstone tutorial exercise
A collection of Word and PDF files
You will need some source files like those in the sample_files → Word_and_PDF folder.
-
Start a new collection called reports (File → New...), base it on -- New Collection --, and choose Dublin Core as the metadata set.
-
Copy all the files from sample_files → Word_and_PDF → Documents into the collection. You can select multiple files by clicking on the first one and shift-clicking on the last one, and drag them all across together. (This is the normal technique of multiple selection.)
-
Switch to the Create panel, and build and preview the collection.
Viewing the extracted metadata
-
Again, this collection contains no manually assigned metadata. All the information that appears (title and filename) is extracted automatically from the documents themselves. Because of this, the quality of some of the title metadata is suspect.
-
Back in the Librarian Interface, click the Enrich tab to view the automatically extracted metadata. You will need to scroll down to see the extracted metadata, which begins with "ex.".
-
Check whether the ex.Title metadata is correct for some of the documents by opening them. You can open a document from the Librarian Interface by double clicking on it.
-
The extracted Title metadata for some documents is incorrect. For example, the Titles for pdf01.pdf and word03.doc (the same document in different formats) have missed out the second line. The Title for pdf03.pdf has the wrong text altogether. The PostScript documents (cluster.ps and langmodl.ps) do not have extracted titles: what appears in the Titles A-Z list is just the first few characters of the document.
Manually adding metadata to documents in a collection
-
In the Enrich panel, manually add Dublin Core dc.Title metadata to those documents which have incorrect ex.Title metadata. Select word03.doc and double-click to open it. Copy the title of this document ("Greenstone: A comprehensive open-source digital library software system") and return to the Librarian Interface. Scroll up or down in the metadata table until you can see dc.Title. Click in the value box and paste in the metadata.
-
Now add dc.Creator information for the same document. You can add more than one value for the same field: when you press Enter in a metadata value field, a new empty field of the same type will be generated. Add each author separately as dc.Creator metadata.
-
Close the document (in Microsoft Word) when you have finished copying metadata from it. External programs opened when viewing documents must be closed before building the collection, otherwise errors can occur.
-
Next add dc.Title and dc.Creator metadata for a few of the other documents.
-
You will notice as you add more values, they appear in the Existing values for ... box below the metadata table. If you are adding the same metadata value to more than one document, you can select it from this list. For example, pdf01.pdf and word03.doc share the same Title; and many documents have common authors.
If you build and preview your collection at this point, you will see that the Titles A-Z now shows your new Titles. However, the dc.Creator metadata is not displayed. You need to alter the collection design to use the new Dublin Core metadata.
Collection design; branding a collection with an image
-
Change to the Design panel, which is split into several sections. The first section, General, appears. This allows you to modify the values you provided when defining the collection, if desired. You can also brand the collection using a suitable image.
-
Click on the <Browse...> button associated with URL to 'about page' icon:, and browse to the image sample_files → Word_and_PDF → wrdpdf.gif on your computer. When you select this image, Greenstone automatically generates an appropriate URL for the image. Preview the collection: you should see the new image at the top left of the page.
Information on the General page does not require a rebuild of the collection to take effect. Just go to the Create panel and click <Preview Collection>.
-
If you are on the web, you can easily make your own Greenstone-style icon by going to
and following the instructions there.
Document plugins
-
Back in the Librarian Interface, look at the Document Plugins section of the Design panel, by clicking on this in the list to the left. Here you can add, configure or remove plugins to be used in the collection. You do not have to remove any plugins, but removing the ones the collection does not use will speed up processing a little. In this case we have only Word, PDF, RTF, and PostScript documents, and can remove the ZIPPlug, TEXTPlug, HTMLPlug, EMAILPlug, ImagePlug, ISISPlug and NULPlug plugins. To delete a plugin, select it and click <Remove Plugin>. GAPlug is required for any type of source collection and should not be removed.
The next section is Search Types. In this exercise, we will not make any changes to this section.
Search indexes
-
The next step in the Design panel is Search Indexes. These specify what parts of the collection are searchable (e.g. searching by title and author). Delete the ex.Source index, which is not particularly useful, by selecting it and clicking <Remove Index>.
-
Modify the ex.Title index to include dc.Title by selecting the index in the Assigned Indexes box and then selecting dc.Title from the Build index on: box. Click <Replace Index>. Searching this index will search both dc.Title and ex.Title metadata. If you want to restrict searching to just the manually added dc.Title metadata, deselect ex.Title from the Build index on: box and click <Replace Index>.
-
You can add indexes based on any metadata. Add a new index based on dc.Creator. Change the Index Name: field to "authors", and select dc.Creator in the Build index on: list. You will need to deselect the ex.Title and dc.Title metadata items. Click <Add Index>.
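To make the idea of an index built on several metadata fields more concrete, here is a minimal Python sketch. It is not Greenstone's indexer: the helper build_index, the document names and the sample values are all hypothetical, and the word splitting is deliberately simplistic. It only illustrates why an index built on both dc.Title and ex.Title finds a word in either field, while an index built on dc.Title alone ignores extracted titles.

```python
from collections import defaultdict


def build_index(docs, fields):
    """Map each lowercase word found in the given metadata fields to the
    set of document names containing it (hypothetical helper)."""
    index = defaultdict(set)
    for name, metadata in docs.items():
        for field in fields:
            for value in metadata.get(field, []):
                for word in value.lower().split():
                    index[word].add(name)
    return index


# Invented sample metadata, for illustration only.
docs = {
    "doc_a": {
        "dc.Title": ["A digital library system"],
        "ex.Title": ["A digital"],
    },
    "doc_b": {
        "ex.Title": ["Wrong text"],
    },
}

# Index on both fields versus index on the manually added field only.
titles_both = build_index(docs, ["dc.Title", "ex.Title"])
titles_dc_only = build_index(docs, ["dc.Title"])

print(sorted(titles_both.get("wrong", set())))       # found via ex.Title
print(sorted(titles_dc_only.get("wrong", set())))    # not found: doc_b has no dc.Title
print(sorted(titles_dc_only.get("library", set())))  # found via dc.Title
```

The same trade-off applies to the choice made in the steps above: a combined index finds more documents, but it also matches extracted titles that may be wrong.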
The next two sections are Partition Indexes and Cross-Collection Search. In this exercise, we will not make any changes to these.
Browsing classifiers
-
The Browsing Classifiers section adds "classifiers," which provide the collection with browsing functions. Go to this section and observe that Greenstone has provided two AZList classifiers, based on ex.Title and ex.Source metadata. These correspond to the Titles A-Z and Filenames buttons on the collection's access bar.
Remove the ex.Source classifier by selecting it and clicking <Remove Classifier>.
-
Modify the ex.Title classifier to use dc.Title instead. Select the classifier and click <Configure Classifier...>. In the metadata box, select dc.Title instead of ex.Title. Click <OK>.
-
Now add an AZCompactList classifier for dc.Creator. Select AZCompactList from the Select classifier to add: drop-down list and click <Add Classifier...>. A popup window Configuring Arguments appears. Select dc.Creator from the metadata drop-down list and click <OK>.
AZCompactList is like AZList, except that values that appear multiple times in the hierarchy are automatically grouped together and a new node, shown as a bookshelf icon, is formed.
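The grouping behaviour described above can be sketched in a few lines of Python. This is an illustration only, not Greenstone's implementation: the function compact_list, its min_group parameter and the sample data are hypothetical. A value shared by at least min_group documents becomes a bookshelf holding those documents; any other value points directly at its single document.

```python
from collections import defaultdict


def compact_list(docs, field, min_group=2):
    """Return (value, entry) pairs sorted by value. min_group is a
    hypothetical threshold: values used by at least that many documents
    are grouped into a 'bookshelf' entry."""
    by_value = defaultdict(list)
    for name, metadata in docs.items():
        for value in metadata.get(field, []):
            by_value[value].append(name)

    result = []
    for value in sorted(by_value):
        names = sorted(by_value[value])
        if len(names) >= min_group:
            result.append((value, {"bookshelf": names}))
        else:
            result.append((value, names[0]))
    return result


# Invented sample metadata, for illustration only.
docs = {
    "doc_a": {"dc.Creator": ["Author A", "Author B"]},
    "doc_b": {"dc.Creator": ["Author A"]},
}

grouped = compact_list(docs, "dc.Creator")
for value, entry in grouped:
    print(value, "->", entry)
```

With min_group set to 1 in this sketch, every value gets its own bookshelf, even when only one document carries it.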
The last three sections are Format Features, Translate Text and Metadata Sets. In this exercise, we will not make any changes to these.
-
Switch to the Create panel, and build and preview the collection.
-
Check that all the facilities work properly. There should be three full-text indexes, called text, titles, and authors. The Titles A-Z list should display all the documents to which you have assigned dc.Title metadata (and only those documents). The Authors A-Z list should show one bookshelf for each author you have assigned as dc.Creator, and clicking on that bookshelf should take you to all the documents they authored.
Classifying on multiple metadata
-
The new Titles A-Z list shows only those documents which have been assigned dc.Title metadata. For many documents, extracted Titles may be fine, and it is impractical to add the same metadata again as dc.Title. Fortunately there is a way we can use both metadata types in one classifier: specify a list of metadata names in the classifier.
-
In the Browsing Classifiers section of the Design panel, select the AZList for dc.Title in the Currently Assigned Classifiers box and click <Configure Classifier...>. Note that you can achieve the same result by double-clicking on the classifier.
-
In the metadata field, type ",ex.Title" after the "dc.Title"—i.e. make it read
dc.Title,ex.Title
-
If you have already done the Enhanced Word document handling exercise, some of the documents will have extracted ex.Creator metadata, and some will have dc.Creator. To use both of these in the Creators classifier, make a similar change to the AZCompactList: make the metadata field read dc.Creator,ex.Creator.
You may notice that AZCompactList has two options after the metadata option: firstvalueonly and allvalues. Manually added metadata can be used to replace or enhance automatically extracted metadata, and these options control exactly which pieces of metadata a document is classified by.
For example, say we have two documents. Document 1 has four Creators specified (dc.Creator = dcA, dc.Creator = dcB, ex.Creator = exA, ex.Creator = exB), while document 2 has three (ex.Creator = exA, ex.Creator = exB, ex.Creator = exC). The following table shows which metadata values each document is classified by, for the different classifier options:
AZCompactList options | Document 1 | Document 2 |
-metadata dc.Creator,ex.Creator | dcA, dcB | exA, exB, exC |
-metadata dc.Creator,ex.Creator -firstvalueonly | dcA | exA |
-metadata dc.Creator,ex.Creator -allvalues | dcA, dcB, exA, exB | exA, exB, exC |
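The selection rules in the table can be written out as a short Python sketch. It models only the behaviour shown in the table and is not taken from Greenstone's source: the function classification_values and its keyword arguments are hypothetical names standing in for the firstvalueonly and allvalues options.

```python
def classification_values(metadata, fields, first_value_only=False, all_values=False):
    """Return the values a document is classified by.

    metadata maps a field name to its list of values; fields is the
    ordered list given to the classifier, e.g. dc.Creator then ex.Creator.
    If both flags are set, all_values wins in this sketch.
    """
    if all_values:
        # Every value of every listed field, in field order.
        return [v for f in fields for v in metadata.get(f, [])]
    for f in fields:
        values = metadata.get(f, [])
        if values:
            # Use only the first listed field that has any values.
            return values[:1] if first_value_only else list(values)
    return []


fields = ["dc.Creator", "ex.Creator"]
doc1 = {"dc.Creator": ["dcA", "dcB"], "ex.Creator": ["exA", "exB"]}
doc2 = {"ex.Creator": ["exA", "exB", "exC"]}

for label, options in [
    ("default", {}),
    ("firstvalueonly", {"first_value_only": True}),
    ("allvalues", {"all_values": True}),
]:
    print(label,
          classification_values(doc1, fields, **options),
          classification_values(doc2, fields, **options))
```

Each printed line corresponds to one row of the table: Document 1 falls back to ex.Creator values only when allvalues is used, while Document 2, which has no dc.Creator, always uses its ex.Creator values.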
-
Build the collection again and preview it. Now all of the documents should appear in the Titles A-Z list (and extracted Creators should appear in the Authors A-Z list).
Extracted metadata is unreliable. But it is very cheap! On the other hand, manually assigned metadata is reliable, but expensive. The previous section of this exercise has shown how to aim for the best of both worlds by using extracted metadata but correcting it when it is wrong. While this may not satisfy the professional librarian, it could provide a useful compromise for the music teacher who wants to get their collection together with a minimum of effort.