Greenstone tutorial exercise

Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.70w

Downloading files from the web

The Greenstone Librarian Interface's Download panel allows you to download individual files, parts of websites, and indeed whole websites, from the web.

  1. Start a new collection called webtudor, and base it on -- New Collection --

  1. In a web browser, visit http://englishhistory.net, follow the link to Tudor England, and click <Enter>. You should be at the URL

    http://englishhistory.net/tudor/contents.html

    This is where we started the downloading process to obtain the files you have been using for the tudor collection. You could do the same thing by copying this URL from the web browser, pasting it into the Download panel, and clicking the <Download> button. However, several megabytes will be downloaded, which might strain your network resources—or your patience! For a faster exercise we focus on a smaller section of the site.

  1. In the Download panel, enter this URL

    http://englishhistory.net/tudor/citizens/

    into the Source URL: box. There are several options that govern how the download process proceeds. To copy just the citizens section of the website, select Only mirror files below this URL. If you don't do this (or if you miss out the terminating "/"), the downloading process will follow links to other areas of the englishhistory.net website and grab those as well. Set Download Depth: to Unlimited—we want to follow as many links as necessary to download all the pages.
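
The effect of the terminating "/" can be illustrated with a short Python sketch. It is not Greenstone's own code: the helper names `mirror_root` and `is_below` and the page name `example.html` are hypothetical, and the rule shown (everything up to the last "/" of the source URL counts as the folder to stay below) is an assumption about how a "stay below this URL" option typically behaves.

```python
from urllib.parse import urlsplit


def mirror_root(source_url):
    """Return (host, folder) that the mirror is assumed to stay below."""
    parts = urlsplit(source_url)
    path = parts.path or "/"
    # Everything up to and including the last "/" is treated as the folder.
    folder = path[: path.rfind("/") + 1]
    return parts.netloc.lower(), folder


def is_below(source_url, link_url):
    """True if link_url is on the same host and inside the source folder."""
    host, folder = mirror_root(source_url)
    link = urlsplit(link_url)
    return link.netloc.lower() == host and (link.path or "/").startswith(folder)


with_slash = "http://englishhistory.net/tudor/citizens/"
without_slash = "http://englishhistory.net/tudor/citizens"
contents = "http://englishhistory.net/tudor/contents.html"

# A page inside the citizens section (hypothetical file name) is in scope.
print(is_below(with_slash, with_slash + "example.html"))
# With the trailing "/", the Tudor contents page is out of scope.
print(is_below(with_slash, contents))
# Without it, the folder becomes /tudor/ and the contents page is in scope.
print(is_below(without_slash, contents))
```

Under this assumption, leaving off the "/" makes the last path segment look like a file name, so the download is only confined to its parent folder.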

  1. If your computer is behind a firewall or proxy server, you will need to edit the proxy settings in the Librarian Interface. Open the Connection tab in File → Preferences... and switch on the Use proxy connection? checkbox. Enter the proxy server address and port number in the Proxy Host: and Proxy Port: boxes. Click <OK>.
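
For readers unsure what the Proxy Host: and Proxy Port: values represent, the following Python sketch shows how a generic HTTP client combines them into a single proxy address. It uses Python's standard `urllib.request` module rather than anything from Greenstone, and the host `proxy.example.org`, the port `8080`, and the helper names are placeholders chosen for illustration.

```python
import urllib.request


def proxy_url(host, port):
    """Combine a proxy host and port into the URL form most clients expect."""
    return f"http://{host}:{port}"


def build_proxy_opener(host, port):
    """Build an opener that routes HTTP requests through the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url(host, port)})
    # No network traffic happens here; the opener is only configured.
    return urllib.request.build_opener(handler)


# Placeholder values: substitute the address and port of your own proxy.
opener = build_proxy_opener("proxy.example.org", 8080)
print(proxy_url("proxy.example.org", 8080))
```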

  1. Now click <Download>. If you have set proxy information in Preferences..., a popup will ask for your user name and password. Once the download has started, a progress bar appears in the lower half of the panel and reports the progress of the download.

    More detailed information can be obtained by clicking <View Log>. The process can be paused and restarted as needed, or stopped altogether by clicking <Close>. Downloading can be a lengthy process involving multiple sites, and so Greenstone allows additional downloads to be queued up. When new URLs are pasted into the Source URL: box and <Download> clicked, a new progress bar is appended to those already present in the lower half of the panel. When the currently active download item completes, the next is started automatically.
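
The queueing behaviour described above can be modelled in a few lines of Python. This is only a simulation of the idea (one active download, the rest waiting in order); the class `DownloadQueue` and its methods are hypothetical and do not correspond to Greenstone's internals, and nothing is actually downloaded.

```python
from collections import deque


class DownloadQueue:
    """Toy model: one active download at a time, others wait in FIFO order."""

    def __init__(self):
        self.active = None      # URL currently being downloaded, if any
        self.pending = deque()  # URLs waiting for their turn
        self.finished = []      # URLs whose download has completed

    def add(self, url):
        """Queue a URL; it starts immediately if nothing else is active."""
        if self.active is None:
            self.active = url
        else:
            self.pending.append(url)

    def complete_active(self):
        """Mark the active download as done and start the next one, if any."""
        if self.active is None:
            return
        self.finished.append(self.active)
        self.active = self.pending.popleft() if self.pending else None


queue = DownloadQueue()
queue.add("http://englishhistory.net/tudor/citizens/")
queue.add("http://englishhistory.net/tudor/contents.html")
print(queue.active)      # the first URL is active, the second is waiting
queue.complete_active()
print(queue.active)      # the second URL has started automatically
```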

  1. Downloaded files are stored in a top-level folder called Downloaded Files that appears on the left-hand side of the Gather panel. You may not need all the downloaded files; choose the ones you want by dragging them from this folder into the collection area on the right-hand side, just as you did earlier when selecting data from the sample_files folder. In this example we will include everything that has been downloaded.

    Select the englishhistory.net folder within Downloaded Files and drag it across into the collection area.

  1. Switch to the Create panel to build and preview the collection. It is smaller than the previous collection because we included only the citizens files. However, these now represent the latest versions of the documents.