Greenstone tutorial exercise

Sample files: Word_and_PDF.zip
Devised for Greenstone version: 2.70
Modified for Greenstone version: 2.70w

Enhanced PDF handling

Greenstone converts PDF files to HTML using third-party software: pdftohtml.pl. This lets users view these documents even if they don't have PDF software installed. Unfortunately, the formatting of the resulting HTML files is sometimes poor.

This exercise explores some extra options to the PDF plugin which may produce a nicer version for display. Some of these options use the standard pdftohtml program; others use ImageMagick and Ghostscript to convert the file to a series of images. Ghostscript is a program that can convert PostScript and PDF files to other formats. You can download it from http://www.cs.wisc.edu/~ghost/ (follow the link to the current stable release).

  1. In the Librarian Interface, start a new collection called "PDF collection" and base it on -- New Collection --.

    In the Gather panel, drag just the PDF documents from sample_files → Word_and_PDF → Documents into the new collection. Also drag in the PDF documents from sample_files → Word_and_PDF → difficult_pdf.

    Go to the Create panel and build the collection. Examine the output from the build process. You will notice that one of the documents could not be processed. The following messages are shown: "The file pdf05-notext.pdf was recognised but could not be processed by any plugin.", and "15 documents were processed and included in the collection. 1 was rejected".

  1. Preview the collection and view the documents. pdf05-notext.pdf does not appear as it could not be processed. pdf06-weirdchars.pdf was processed but looks very strange. The other PDF documents appear as one long document, with no sections.

Modes in the Librarian Interface

The Librarian Interface can operate in different modes. The default mode is Librarian mode. We can use Expert mode to work out why the PDF file pdf05-notext.pdf could not be processed.

  1. Use the Preferences... item on the File menu to switch to Expert mode and then build the collection again. The Create panel looks different in Expert mode because it gives more options: locate the <Build Collection> button, near the bottom of the window, and click it. Now a message appears saying that the file could not be processed, and why. Amongst all the output, we get the following message: "Error: PDF contains no extractable text. Could not convert pdf05-notext.pdf to HTML format". pdftohtml.pl cannot convert a PDF file to HTML if the PDF file has no extractable text.

  1. We recommend that you switch back to Librarian mode for subsequent exercises, to avoid confusion.

Splitting PDFs into sections

  1. In the Document Plugins section of the Design panel, configure PDFPlug. Switch on the use_sections option.

    Build and preview the collection. View the text versions of some of the PDF documents. Note that these are now split into a series of pages, and a "go to page" box is provided. The format is still a bit ugly though.

Using image format

  1. If conversion to HTML doesn't produce the result you like, PDF documents can be converted to a series of images, one per page or slide. This requires ImageMagick and Ghostscript to be installed.

  1. In the Document Plugins section, configure PDFPlug. Set the convert_to option to one of the image types, e.g. pagedimg_jpg. Switch off the use_sections option, as it is not used with image conversion.

  1. Build the collection and preview. All PDF documents have been processed and divided into sections, but each section displays "This document has no text.". When PDF documents are converted to images, no text is extracted.

  1. In order to view the documents properly, you will need to modify the format statement. In the Format Features section on the Design panel, select the DocumentText format statement. Replace

    [Text]

    with

    [srcicon]

  1. Preview the collection. Images from the document are now displayed instead of the extracted text. Both pdf05-notext.pdf and pdf06-weirdchars.pdf display nicely now.

    In this collection, we only have PDF documents and they have all been converted to images. If we had other document types in the collection, we should use a different format statement, such as:

    {If}{[parent:FileFormat] eq PDF,[srcicon],[Text]}

    FileFormat is an extracted metadata item which shows the format of the source document. We can use this to test whether the documents are PDF or not: for PDF documents, display [srcicon], for other documents, display [Text].
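
The conditional format statement above can be read as a simple branch. The Python sketch below only illustrates that logic and is not Greenstone code: the `document_text` function, the metadata dictionaries, and the sample values in them are all hypothetical stand-ins for what the format statement does when a document is displayed.

```python
def document_text(metadata: dict) -> str:
    """Mimic {If}{[parent:FileFormat] eq PDF,[srcicon],[Text]}.

    `metadata` is a hypothetical dict of the values Greenstone would
    substitute for [parent:FileFormat], [srcicon] and [Text].
    """
    if metadata.get("FileFormat") == "PDF":
        # PDF documents were converted to images, so show the image.
        return metadata.get("srcicon", "")
    # Every other document type still has extracted text to show.
    return metadata.get("Text", "")


# Illustrative values only; real values come from the built collection.
pdf_page = {"FileFormat": "PDF", "srcicon": "<img src='page1.jpg'>", "Text": ""}
word_doc = {"FileFormat": "Word", "Text": "Extracted text of the document"}

print(document_text(pdf_page))   # <img src='page1.jpg'>
print(document_text(word_doc))   # Extracted text of the document
```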

Using process_exp to control document processing (advanced)

  1. Processing all of the PDF documents using an image type may not give the best result for your collection. The images will look nice, but as no text is extracted, searching the full text will not be available for these documents. The best solution would be to process most of the PDF files as HTML, and only use the image format where HTML doesn't work.

  1. We achieve this by adding two PDFPlug plugins to the collection, with different options. Currently, the Librarian Interface does not allow you to add the same plugin twice to the collection (with the exception of UnknownPlug). You will need to edit the collection configuration file by hand.

    Close the collection in the Librarian Interface. Then open Greenstone → collect → pdfcolle → etc → collect.cfg using a text editor, e.g. WordPad. In the list of plugins, add another PDFPlug, i.e.

    plugin PDFPlug

    Don't worry about the options here - we will add these using the Librarian Interface.

    Note that if you ever need to edit a collection's collect.cfg file by hand, you must close the collection in the Librarian Interface first, otherwise the next time it saves the file, it will overwrite your changes.

  1. Open up the collection again in the Librarian Interface, and go to the Gather panel. Make a new folder called "notext": right click in the collection panel and select New folder from the menu. Change the Folder Name to "notext", and click <OK>.

    Move the two PDF files that have problems with HTML conversion (pdf05-notext.pdf and pdf06-weirdchars.pdf) into this folder by drag and drop. We will set up the plugins so that PDF files in this notext folder are processed differently to the other PDF files.

  1. Switch to the Document Plugins section of the Design panel. You will see that there are two PDFPlug plugins in the list.

  1. Switch to Library Systems Specialist mode (File → Preferences... → Mode), as you will need to use regular expressions in the options.

  1. Configure the two PDFPlug plugins so that the options look like the following:

    plugin PDFPlug -convert_to pagedimg_jpg -process_exp "notext.*\.pdf"
    plugin PDFPlug -convert_to html -use_sections

    The pagedimg_jpg version must come earlier in the list than the html version. The process_exp for the first PDFPlug matches any PDF files in the notext directory, so that plugin processes them. The second PDFPlug will process any PDF files that are not processed by the first one.

    Note that all plugins have the process_exp option, and this can be used to customize which documents are processed by which plugin. This option is only visible in Library Systems Specialist and Expert modes.

    Change back to Librarian mode.

  1. Edit the DocumentText format statement. PDF files processed as HTML will not have images to display, so we need to make sure they get text displayed instead. Change [srcicon] to {Or}{[srcicon],[Text]}.

  1. Build and preview the collection. All PDF documents should look relatively nice. Try searching this collection. Searches will find the PDFs that were converted to HTML (try e.g. "bibliography"), but not the ones that were converted to images (try searching for "banana" or "METS").
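
The plugin ordering and the {Or} format statement used in this section can be sketched in a few lines of Python. This is a rough model under stated assumptions, not Greenstone's implementation: it assumes each file path is tested against the plugins in list order and the first match wins, the `\.pdf$` pattern for the second plugin is an assumed default, and the `import/` prefix, the file `pdf01.pdf`, and the function names are hypothetical.

```python
import re

# Plugins in collect.cfg order. The first pattern is the process_exp from
# this exercise; the second is an ASSUMED default for a plain PDFPlug.
PLUGINS = [
    ("PDFPlug -convert_to pagedimg_jpg", re.compile(r"notext.*\.pdf")),
    ("PDFPlug -convert_to html -use_sections", re.compile(r"\.pdf$", re.IGNORECASE)),
]


def choose_plugin(path: str):
    """Return the first plugin whose pattern matches the path, else None."""
    for name, pattern in PLUGINS:
        if pattern.search(path):
            return name
    return None


def display_or(srcicon: str, text: str) -> str:
    """Mimic {Or}{[srcicon],[Text]}: use the first value that is not empty."""
    return srcicon or text


# A file in the notext folder is claimed by the image plugin.
print(choose_plugin("import/notext/pdf06-weirdchars.pdf"))
# PDFPlug -convert_to pagedimg_jpg

# Any other PDF (hypothetical name) falls through to the HTML plugin.
print(choose_plugin("import/pdf01.pdf"))
# PDFPlug -convert_to html -use_sections
```

Swapping the two entries in `PLUGINS` shows why the order matters in this model: the general `\.pdf$` pattern would then claim every PDF, and the image conversion would never be used.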