Greenstone tutorial exercise

Sample files: niupepa.zip
Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.70w

Scanned image collection

Here we build a small replica of Niupepa, the Maori Newspaper collection, using five newspapers taken from two newspaper series. It allows full text searching and browsing by title and date. When a newspaper is viewed, a preview image and its corresponding plain text are presented side by side, with a "go to page" navigation feature at the top of the page.

The collection involves a mixture of plugins, classifiers, and format statements. The bulk of the work is done by PagedImgPlug, a plugin designed precisely for the kind of data we have in this example. For each document, an "item" file is prepared that specifies a list of image files that constitute the document, tagged with their page number and (optionally) accompanied by a text file containing the machine-readable version of the image, which is used for full text searching. Three newspapers in our collection (all from the series "Te Whetu o Te Tau") have text representations, and two (from "Te Waka o Te Iwi") have images only. Item files can also specify metadata. In our example the newspaper series is recorded as ex.Title and its date of publication as ex.Date. Issue ex.Volume and ex.Number metadata is also recorded, where appropriate. This metadata is extracted as part of the building process.
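
To make the idea of an item file concrete, here is a minimal Python sketch that parses one. The layout it assumes (metadata lines of the form `<Name>value`, followed by page lines of the form `pagenumber:imagefile:textfile`, with the text file optional) is an assumption made for illustration and is not spelled out in this exercise; check the files in sample_items for the real syntax. The file names, the metadata values and the helper `parse_item` are all hypothetical.

```python
import re

# Hypothetical item file content. The layout is an assumption, and the
# file names and metadata values are made up for illustration.
SAMPLE_ITEM = """\
<Title>Te Whetu o Te Tau
<Date>18580601
<Volume>1
<Number>1
1:images/page1.gif:text/page1.txt
2:images/page2.gif:
"""

# A metadata line looks like <Name>value in this assumed layout.
META_LINE = re.compile(r"^<([^>]+)>(.*)$")


def parse_item(text):
    """Return (metadata, pages) parsed from item-file text.

    metadata maps names such as "Title" to their values; pages is a list
    of dicts with "number", "image" and "text" keys ("text" is None when
    no text file is given).
    """
    metadata = {}
    pages = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = META_LINE.match(line)
        if match:
            metadata[match.group(1)] = match.group(2).strip()
            continue
        # Page line: number, image file, optional text file.
        parts = line.split(":")
        parts += [""] * (3 - len(parts))  # pad so short lines still unpack
        number, image, text_file = parts[:3]
        pages.append({"number": number, "image": image, "text": text_file or None})
    return metadata, pages


metadata, pages = parse_item(SAMPLE_ITEM)
print(metadata["Title"], metadata["Date"])
for page in pages:
    print(page["number"], page["image"], page["text"])
```

In this sketch the second page has an image but no text file, which mirrors the Te Waka o Te Iwi newspapers that have images only.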

  1. Start a new collection called Paged Images and fill out the fields with appropriate information: it is a collection sourced from an excerpt of Niupepa documents; the only metadata used is document title and date, and these are extracted from the "item" files included in the source documents so no metadata set need be stipulated.

  1. In the Gather panel, open the sample_files → niupepa → sample_items folder and drag the two subfolders into your collection on the right-hand side. A popup window asks whether you want to add PagedImgPlug to the collection: click <Add Plugin>, because this plugin will be needed to process the item files.

  1. Some of the files you have just dragged in are the newspaper images; others are text files that contain the text extracted from these images. We want these to be processed by PagedImgPlug, not ImagePlug or TEXTPlug. Switch to the Design panel and delete ImagePlug and TEXTPlug. While you are at it, you could tidy things up by deleting ZIPPlug and all plugins from HTMLPlug to NULPlug as well, since they will not be used. GAPlug and PagedImgPlug remain.

  1. Open up the configuration window for PagedImgPlug by double-clicking on the plugin. Switch on its screenview configuration option by checking the box. The source images we use were scanned at high resolution and are large files for a browser to download. The screenview option generates smaller screen-resolution images of each page when the collection is built. Click <OK>.

  1. Now go to the Create panel, build the collection and preview the result. Search for "waka" and view one of the titles listed (all three appear as Te Whetu o Te Tau). Browse by Titles A-Z and view one of the Te Waka o Te Iwi newspapers. Note that only the Te Whetu o Te Tau newspapers have text; Te Waka o Te Iwi papers don't.

This collection was built with Greenstone's default settings. You can locate items of interest, but the information is less clearly and attractively presented than in the full Niupepa collection.

Grouping documents by series title and displaying dates within each group

Under Titles A-Z, documents from the same series are repeated without any distinguishing features such as date, volume or number. It would be better to group them by series title and display other information within each group. This can be accomplished by using an AZCompactList classifier rather than AZList and tuning the classifier's format statement.

  1. In the Design panel, under the Browsing Classifiers section, delete the AZList classifiers for ex.Source and ex.Title.

  1. Now add an AZCompactList classifier, setting its metadata option to ex.Title, and add a DateList classifier, setting its metadata option to ex.Date.

  1. In the Format Features section, select the ex.Title classifier in the Choose Feature list, and VList in the Affected Component list. Delete the contents of the HTML Format String box, and add the following text. (This format statement can be copied and pasted from the file sample_files → niupepa → formats → titles_tweak.txt.)

    <td valign="top">[link][icon][/link]</td>
    <td valign="top">
    {If}{[numleafdocs],[ex.Title] ([numleafdocs]),
    {If}{[ex.Volume],Volume [ex.Volume] }
    {If}{[ex.Number],Number [ex.Number] }
    {If}{[ex.Date], [ex.Date]}}
    </td>

    Click <Add Format>.

  1. Build the collection, and preview the new Titles A-Z list.

    As a consequence of using the AZCompactList classifier, bookshelf icons appear when titles are browsed. This revised format statement shows in brackets how many items each bookshelf contains. It works by exploiting the fact that only bookshelf icons define [numleafdocs] metadata. For document nodes, Title is not displayed; instead, Volume, Number and Date information are displayed if present.
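
The branching in the format statement above can be mimicked in a few lines of Python. This is only a sketch of the logic, not how Greenstone evaluates format strings: the function `format_title_entry`, the use of plain dicts for nodes, and the sample metadata values are assumptions for illustration, and whitespace handling is simplified.

```python
def format_title_entry(node):
    """Mimic the branching of the Titles A-Z format statement.

    node is a plain dict standing in for a classifier entry; only
    bookshelves carry a "numleafdocs" value.
    """
    if node.get("numleafdocs"):
        # Bookshelf: series title plus the number of newspapers inside.
        return f"{node['ex.Title']} ({node['numleafdocs']})"
    # Document: skip the title and show whichever of these fields exist.
    parts = []
    if node.get("ex.Volume"):
        parts.append(f"Volume {node['ex.Volume']}")
    if node.get("ex.Number"):
        parts.append(f"Number {node['ex.Number']}")
    if node.get("ex.Date"):
        parts.append(node["ex.Date"])
    return " ".join(parts)


# Hypothetical nodes; the Volume, Number and Date values are made up.
bookshelf = {"ex.Title": "Te Whetu o Te Tau", "numleafdocs": 3}
document = {
    "ex.Title": "Te Whetu o Te Tau",
    "ex.Volume": "1",
    "ex.Number": "2",
    "ex.Date": "18580601",
}

print(format_title_entry(bookshelf))  # Te Whetu o Te Tau (3)
print(format_title_entry(document))   # Volume 1 Number 2 18580601
```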

Displaying scanned images and suppressing dummy text

When you reach a newspaper, only its associated text is displayed. When either of the Te Waka o Te Iwi newspapers is accessed, the document view presents the message "This document has no text.". No scanned image information (screen-view resolution or otherwise) is shown, even though it has been computed and stored with the document. This can be fixed by a format statement that modifies the default behaviour for DocumentText.

  1. In the Format Features section of the Design panel, select the DocumentText format statement. The default format string displays the document's plain text, which, if there is none, is set to "This document has no text.". Change this to the following text. (This format statement can be copied and pasted from the file sample_files → niupepa → formats → doc_tweak.txt.)

    <table><tr>
    <td valign=top>[srclink][screenicon][/srclink]</td>
    <td valign=top>[Text]</td>
    </tr></table>

    and click <Replace Format>.

    Including [screenicon] has the effect of embedding the screen-sized image generated by switching the screenview option on in PagedImgPlug. It is hyperlinked to the original image by the construct [srclink]...[/srclink].

    This modification will display the screenview image, but does nothing about the dummy text "This document has no text.", which will still be displayed. To get rid of this, edit the DocumentText format statement again and replace

    <td valign=top>[Text]</td>

    with

    {If}{[Text] ne "This document has no text. ",<td valign=top>[Text]</td>}

    and click <Replace Format>.

  1. Preview the collection and view one of the Te Waka o Te Iwi documents. The line "This document has no text." should now be gone. (Note that it is important to get the text exactly right for this to work, including the space after the ".".)
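
Why the trailing space matters can be seen in a small Python model of the comparison. This is a sketch of the idea only, assuming an exact string match as the note above suggests; the function `text_cell` and the constant `NO_TEXT` are hypothetical names, not part of Greenstone.

```python
# Placeholder text as the exercise says it must be matched: note the
# trailing space after the full stop.
NO_TEXT = "This document has no text. "


def text_cell(text):
    """Return the table cell for the text column, or "" for the placeholder."""
    if text != NO_TEXT:
        return f"<td valign=top>{text}</td>"
    return ""


# Exact match: the cell is suppressed.
print(repr(text_cell("This document has no text. ")))
# Missing trailing space: the strings differ, so the cell is still emitted.
print(repr(text_cell("This document has no text.")))
```

An exact comparison treats the two strings as different, so a format statement written without the trailing space would leave the dummy text on the page.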

Searching at page level

  1. The newspaper documents are split into sections, one per page. For large documents, it is useful to be able to search on sections rather than documents. This allows users to more easily locate the relevant information in the document.

  1. Go to the Search Indexes section of the Design panel. Remove the ex.Source index. Select the document:text index in the Assigned Indexes box, and change the Index Name: to "whole newspapers". Click <Replace Index>. Create a new index: set the Index Name: to "newspaper pages", keep text selected in Build index on:, and change At the level: to section. Click <Add Index>. Click <Set Default Index> on the right hand side to make the "newspaper pages" index the default.

  1. Build and preview the collection. Compare searching in the "whole newspapers" index with searching in the "newspaper pages" index. A useful search term for this collection is "aroha".

  1. You will notice that when searching for individual pages, the newspaper image is displayed in the search results. As these images are very large, this is not very useful. Go to Format Features in the Librarian Interface and select the VList format statement from the list of assigned format statements. Remove the second line from the HTML Format String:

    <td valign="top">[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td>

    While we are here, let's remove the filename from the display. Remove the following from the last line:

    {If}{[ex.Source],<br><i>([ex.Source])</i>}

    Click <Replace Format>.

    Preview the collection—the search results should be back to normal.

  1. Now you will notice that page level search results only show the Title of the page (the page number), and not the Title of the newspaper. We'll modify the format statement to show the newspaper title as well as the page number. Also, let's add Volume and Number information.

    In the Format Features section of the Design panel, select Search in Choose Feature, and VList in Affected Component. The previous changes modified VList, so they will apply to all VLists that don't have specific format statements. These next changes are made to SearchVList so will only apply to search results.

    The extracted Title for the current section is specified as [ex.Title] while the Title for the parent section is [parent:ex.Title]. Since the same SearchVList format statement is used when searching both whole newspapers and newspaper pages, we need to make sure it works in both cases.

    Set the format statement to the following text (it can be copied and pasted from the file sample_files → niupepa → formats → search_tweak.txt.)

    <td valign="top">[link][icon][/link]</td>
    <td valign="top">
    {If}{[parent:ex.Title],[parent:ex.Title]
    {If}{[parent:ex.Volume],Volume [parent:ex.Volume] }
    {If}{[parent:ex.Number],Number [parent:ex.Number]}: Page [ex.Title],
    [ex.Title] {If}{[ex.Volume], Volume [ex.Volume] }
    {If}{[ex.Number], Number [ex.Number] }}
    <br/><i>({Or}{[parent:ex.Date],[ex.Date]})</i>
    </td>

    Click <Add Format>.

    Preview the search results. Items display newspaper title, Volume, Number and Date if available, and pages also display the page number.
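
The way the search format statement handles both indexes can be sketched in Python. This models only the branching on [parent:ex.Title] and the {Or} fallback for the date; the function names, the dict-based nodes and the sample metadata values are assumptions for illustration, and the HTML markup is left out.

```python
def issue_label(meta):
    """Return "Title Volume v Number n", omitting fields that are absent."""
    label = meta["ex.Title"]
    if meta.get("ex.Volume"):
        label += f" Volume {meta['ex.Volume']}"
    if meta.get("ex.Number"):
        label += f" Number {meta['ex.Number']}"
    return label


def format_search_result(node, parent=None):
    """Mimic the branching of the search results format statement.

    node holds the metadata of the matched item; parent holds the
    metadata of the enclosing newspaper when the match is a page, and is
    None when the match is a whole newspaper.
    """
    parent = parent or {}
    if parent.get("ex.Title"):
        # Page-level hit: newspaper details come from the parent, and
        # the page number is the page's own ex.Title.
        label = f"{issue_label(parent)}: Page {node['ex.Title']}"
    else:
        # Document-level hit: everything comes from the node itself.
        label = issue_label(node)
    # Like {Or}: use the parent's date if present, else the node's own.
    date = parent.get("ex.Date") or node.get("ex.Date", "")
    return f"{label} ({date})"


# Hypothetical metadata; the Volume, Number and Date values are made up.
newspaper = {
    "ex.Title": "Te Whetu o Te Tau",
    "ex.Volume": "1",
    "ex.Number": "2",
    "ex.Date": "18580601",
}
page = {"ex.Title": "3"}

print(format_search_result(newspaper))
# Te Whetu o Te Tau Volume 1 Number 2 (18580601)
print(format_search_result(page, newspaper))
# Te Whetu o Te Tau Volume 1 Number 2: Page 3 (18580601)
```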

In the collection you have just built, newspapers are grouped by series title, and dates are supplied alongside each one to distinguish it from others in the same series. Users can browse chronologically by date, and when a newspaper page is viewed a preview image is shown on the left that displays the original high-resolution version when clicked, accompanied on the right by the plain-text version of that newspaper (if available).