Greenstone tutorial exercise

Sample files: tudor.zip
Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.70w

A large collection of HTML files—Tudor

  1. Invoke the Greenstone Librarian Interface (from the Windows Start menu) and start a new collection called tudor (use the File menu). Fill out the pop-up dialog with appropriate values and leave Dublin Core, which is selected by default, as the metadata set.

  1. In the Gather panel, open the tudor folder in sample_files.

  1. Drag englishhistory.net from the left-hand side to the right to include it in your tudor collection.

  1. Switch to the Create panel and click <Build Collection>.

  1. When building has finished, preview the collection.

Extracting more metadata from the HTML

  1. The browsing facilities in this collection (Titles A-Z and Filenames) are based entirely on extracted metadata. Return to the Enrich panel in the Librarian Interface and examine the metadata that has been extracted for some of the files.

  1. Many HTML documents contain metadata in <meta> tags in the <head> of the page. Open the englishhistory.net → tudor → monarchs → boleyn.html file by navigating to it in the tree on the left-hand side and double-clicking it. This will open it in a web browser. View the HTML source of the page (View → Source in Internet Explorer, View → Page Source in Mozilla). You will notice that this page has page_topic, content and author metadata.

  1. By default, HTMLPlug only looks for Title metadata. Configure the plugin so that it looks for the other metadata too. Switch to the Design panel and select the Document Plugins section. Select the plugin HTMLPlug line and click <Configure Plugin...>. A pop-up window appears. Switch on the metadata_fields option, and set the value to

    Title,Author,Page_topic,Content

    Make sure that you have copied this exactly, with no spaces. Click <OK>.

  1. Switch to the Create panel and rebuild the collection. Go back to the Enrich panel and look at the extracted metadata for some of the HTML files in englishhistory.net → tudor → monarchs. The new metadata should now be visible.
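
To make the idea of pulling named fields out of <meta> tags concrete, here is a minimal Python sketch. It is an illustration only and is not how HTMLPlug itself is implemented. The names MetaTagCollector, extract_fields and sample_page are hypothetical, the sample page is made up, and the case-insensitive matching of field names is an assumption of this sketch. The title of the sample page sits in a <title> element, which this sketch does not read, so Title is not returned.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags (illustration only)."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            # Key by lower-cased name so the later lookup ignores case
            # (an assumption of this sketch).
            self.found[name.lower()] = content


def extract_fields(html_text, metadata_fields):
    """Return {field: value} for each requested field present in the page."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    # The list is split on commas only, so a stray space becomes part of a name.
    wanted = metadata_fields.split(",")
    return {
        field: collector.found[field.lower()]
        for field in wanted
        if field.lower() in collector.found
    }


# Hypothetical page, loosely modelled on the kind of tags described above.
sample_page = """<html><head>
<title>Sample</title>
<meta name="author" content="Example Author">
<meta name="page_topic" content="Example topic">
</head><body></body></html>"""

print(extract_fields(sample_page, "Title,Author,Page_topic,Content"))
```

In this sketch a value such as "Title, Author" (with a space) finds nothing, because " Author" is not the same name as "Author". That is one plausible reason for typing the list with no spaces.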

Blocking the stray images

You've probably noticed that the collection contains a few stray image files as well as the HTML documents. They should not be there. Many of the HTML documents include images, and Greenstone tries to work out which images belong to HTML pages so that only the remaining images are considered for inclusion in the collection as documents in their own right. In this case it hasn't been completely successful. (This is because the web site from which these files were downloaded occasionally departs from the usual convention of hierarchical structuring.)

  1. Switch back to the Document Plugins section of the Design panel. Beside plugin HTMLPlug you will see -smart_block. This is the option that attempts to identify images in the HTML pages and block them from inclusion. In this case, it's not smart enough! Configure plugin HTMLPlug again, scroll down the list of options to locate smart_block, and switch it off.

  1. Rebuild and preview the collection. The collection is exactly as before except that the stray images are now suppressed. This works because plugins operate as a pipeline: each file is passed to the plugins in turn until one is found that can process it. By default (i.e. without smart_block) the HTML plugin blocks all images, so none of them reach a later plugin, which is appropriate for this collection.
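
The pipeline behaviour described above can be sketched in a few lines of Python. This is a simplified model, not Greenstone code: HtmlStage, ImageStage and run_pipeline are hypothetical names, the image file names are invented, and the "smart" mode is reduced to a fixed set of image names assumed to have been recognised as belonging to pages.

```python
import os

IMAGE_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png"}


def extension(filename):
    """Lower-cased file extension, including the leading dot."""
    return os.path.splitext(filename)[1].lower()


class HtmlStage:
    """Hypothetical stand-in for an HTML plugin."""

    def __init__(self, block_all_images, referenced_images=()):
        self.block_all_images = block_all_images
        self.referenced_images = set(referenced_images)

    def handle(self, filename):
        if extension(filename) in {".htm", ".html"}:
            return "processed by HtmlStage"
        if extension(filename) in IMAGE_EXTENSIONS:
            if self.block_all_images or filename in self.referenced_images:
                # Blocking: claim the file so that no later stage sees it.
                return "blocked by HtmlStage"
        return None  # not claimed: pass the file down the pipeline


class ImageStage:
    """Hypothetical stand-in for an image plugin later in the pipeline."""

    def handle(self, filename):
        if extension(filename) in IMAGE_EXTENSIONS:
            return "processed by ImageStage"
        return None


def run_pipeline(stages, filename):
    """Offer the file to each stage in turn until one claims it."""
    for stage in stages:
        result = stage.handle(filename)
        if result is not None:
            return result
    return "unclaimed"


# "Smart" mode: only images recognised as belonging to a page are blocked.
smart = [HtmlStage(False, {"crown.gif"}), ImageStage()]
# Default mode in this model: every image is blocked by the HTML stage.
default = [HtmlStage(True), ImageStage()]

for name in ("boleyn.html", "crown.gif", "stray.jpg"):
    print(name, "|", run_pipeline(smart, name), "|", run_pipeline(default, name))
```

In the smart configuration the unrecognised image falls through to the image stage and would become a document of its own. With blanket blocking it never gets that far.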

Looking at different views of the files in the Gather and Enrich panels

  1. Switch to the Gather panel and in the right-hand side open englishhistory.net → tudor.

  1. Change the Show Files menu for the right-hand side from All Files to HTM & HTML. Notice the files displayed above are filtered accordingly, to show only files of this type.

  1. Change the Show Files menu to Images. Again, the files shown above alter.

  1. Now set the Show Files menu back to All Files, otherwise you may get confused later. Remember, if the Gather or Enrich panels do not seem to be showing all your files, this filter setting could be the cause.
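
The effect of the Show Files menu can be pictured as a filter on file names. The Python sketch below is only a model of that idea. The FILTERS table and show_files are hypothetical, the patterns listed for each menu entry are assumptions rather than the Librarian Interface's actual definitions, and the image file name is invented.

```python
from fnmatch import fnmatch

# Assumed patterns for each menu entry (illustrative only).
FILTERS = {
    "All Files": ["*"],
    "HTM & HTML": ["*.htm", "*.html"],
    "Images": ["*.gif", "*.jpg", "*.jpeg", "*.png"],
}


def show_files(filenames, filter_name):
    """Return the names that match any pattern of the chosen filter."""
    patterns = FILTERS[filter_name]
    return [
        name
        for name in filenames
        # Lower-case the name so matching ignores case on every platform.
        if any(fnmatch(name.lower(), pattern) for pattern in patterns)
    ]


listing = ["boleyn.html", "portrait.JPG", "index.htm"]
print(show_files(listing, "HTM & HTML"))
print(show_files(listing, "Images"))
print(show_files(listing, "All Files"))
```

Nothing is removed from the listing itself. A filter other than All Files only hides the names that do not match, which is why files can appear to be missing.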