Metadata Crawler

Introduction

Metadata Crawler (aka “Crawler”) is an open source tool designed to automate the discovery and creation of metadata records. If configured to read from a PostgreSQL database such as your SDW, it will create a metadata record in Gemini 2.3 format for each spatial table that it finds.
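For context, the discovery step boils down to listing the spatial tables that the database itself advertises. The sketch below is a minimal illustration only, assuming a PostGIS-enabled SDW, the psycopg2 driver and a hypothetical connection string; it is not Crawler's actual code.

  # Minimal sketch: list the spatial tables a PostGIS database advertises.
  # Assumptions: psycopg2 is installed and the SDW exposes the standard
  # PostGIS geometry_columns view. This is NOT Crawler's real implementation.
  import psycopg2

  conn = psycopg2.connect("dbname=sdw user=readonly host=localhost")  # hypothetical
  with conn, conn.cursor() as cur:
      cur.execute("""
          SELECT f_table_schema, f_table_name, srid, type
          FROM geometry_columns
          ORDER BY f_table_schema, f_table_name
      """)
      for schema, table, srid, geom_type in cur.fetchall():
          # One Gemini 2.3 record would correspond to each table listed here.
          print(f"{schema}.{table}: {geom_type}, EPSG:{srid}")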

Crawler can auto-generate the correct metadata for some fields but not for others, as shown in the table below. For the fields it can’t generate, we ask you to fill in a spreadsheet (more details below). See https://astuntech.atlassian.net/wiki/spaces/ISHAREHELP/pages/150765635 for more details on the meaning of the elements.

Note: this is not the full set of fields required for Gemini 2.3 compliance. The ones we don’t think you’ll need to change from the default are pre-entered for you; they can always be changed in GeoNetwork later if you wish!

Metadata Field                     Method
Abstract                           Spreadsheet
Alternative Title                  Spreadsheet
Bounding Box                       Calculated by Crawler
Dataset Reference Date             Calculated by Crawler
Extent Keyword                     Default
Keyword (INSPIRE and free text)    Spreadsheet
Limitations on Public Access       Default
Lineage                            Default
Maintenance Update Frequency       Spreadsheet
Responsible Organisation           Spreadsheet
Spatial Reference System           Calculated by Crawler
Spatial Resolution                 Spreadsheet
Temporal Extent                    Default
Title                              Calculated by Crawler
Topic Category                     Default
Use Constraints                    Default
Metadata Point of Contact          Spreadsheet

Key:

  • Calculated by Crawler: auto-calculated values for each dataset.

  • Spreadsheet: individual values for each metadata record. See below for more details.

  • Default: placeholder values for all records. You can change these later on a per-record basis if you need to.
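As an example of a value that is “Calculated by Crawler”, a dataset’s bounding box can be derived straight from the geometry in the table. The snippet below is a sketch only, using the standard PostGIS ST_Extent and ST_Transform functions against a hypothetical table and column; it is not Crawler’s actual implementation.

  # Sketch: derive a WGS84 bounding box for one spatial table.
  # The table and geometry column names are hypothetical.
  import psycopg2

  conn = psycopg2.connect("dbname=sdw user=readonly host=localhost")  # hypothetical
  with conn, conn.cursor() as cur:
      # ST_Extent aggregates every geometry; ST_Transform converts to
      # latitude/longitude (EPSG:4326) as used by a Gemini bounding box.
      cur.execute(
          "SELECT ST_Extent(ST_Transform(wkb_geometry, 4326)) "
          "FROM public.parks"
      )
      print(cur.fetchone()[0])  # e.g. BOX(-1.9 52.3,-1.7 52.5)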

Spreadsheets

To enable you to complete the manual fields listed in the table above, we will auto-generate two spreadsheets for you, populated with each metadata record title and the fields you need to complete. We have split the fields across two spreadsheets to make data entry easier: one will contain contact details and organisational responsibility, while the other will contain the abstracts, alternative titles and the other remaining fields. Where a field takes a controlled value, such as the maintenance update frequency, lookups have been provided.
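To give a feel for what the data-entry spreadsheets look like once saved as CSV, the sketch below writes a single row with some plausible column headings. The column names and values here are assumptions for illustration only; the spreadsheets we send you define the real layout and the controlled lookup values.

  # Sketch only: an illustrative layout for the "details" CSV.
  # Column names and values are hypothetical, not the real template.
  import csv

  columns = ["Title", "Alternative Title", "Abstract", "Keywords",
             "Maintenance Update Frequency", "Spatial Resolution"]

  with open("metadata_details.csv", "w", newline="") as f:
      writer = csv.DictWriter(f, fieldnames=columns)
      writer.writeheader()
      writer.writerow({
          "Title": "public.parks",
          "Alternative Title": "Parks and Open Spaces",
          "Abstract": "Boundaries of council-maintained parks.",
          "Keywords": "Protected sites; parks",
          "Maintenance Update Frequency": "monthly",  # controlled value from a lookup
          "Spatial Resolution": "1000",
      })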

Workflow

  1. We derive the spreadsheets based on the spatial tables within your SDW, and send them to you for completion.

  2. Crawler runs in your VPC, and creates metadata records as xml files in an AWS S3 bucket.

    • At this point, the fields populated are those calculated by Crawler, and the default ones.

  3. GeoNetwork harvests the records from the S3 bucket and assigns them to your metadata portal.

  4. You send the completed spreadsheets as CSV email attachments to an email address we provide, which is associated with a second AWS S3 bucket.

  5. Python scripts on the GeoNetwork server extract the CSV files from the emails and use the values in them to update the metadata records with your individual entries (see the sketch after this list).

  6. You check the records in your metadata portal and make any further changes that you need to.
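The sketch below shows the general shape of step 5: fetch a completed CSV from the second S3 bucket and pair each row with a metadata record by its title. It assumes the boto3 library plus hypothetical bucket, key and column names; the real scripts on the GeoNetwork server are maintained by Astun and differ in detail.

  # Sketch of step 5: read a completed CSV from S3 and match rows to records.
  # Bucket, key and column names are hypothetical; not the real Astun scripts.
  import csv
  import io

  import boto3

  s3 = boto3.client("s3")
  obj = s3.get_object(Bucket="example-metadata-inbox", Key="metadata_details.csv")
  rows = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))

  for row in rows:
      # In the real workflow each row's values are written into the matching
      # GeoNetwork record; here we just show which record would be updated.
      print(f"Would update '{row['Title']}' with abstract: {row['Abstract'][:40]}...")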

Automating the Crawler Workflow

After your metadata catalog has been pre-populated with metadata, Crawler can be run as a scheduled task to pick up new tables or changes to existing ones (such as a bounding box change).

For subsequent updates, Crawler will pick up new tables and changes to existing tables, but it can’t deal with deletions (metadata records for deleted tables should be explicitly retired in GeoNetwork).

To populate the “spreadsheet fields” from the table above for new records, you can either create a new spreadsheet containing just those records, or edit the records directly in GeoNetwork. Which approach is best will depend on the number of records you have to edit.

Crawler doesn’t run on a Windows server, so you’ll need to speak to your Astun consultant about scheduling the times that Crawler runs.