Metadata Crawler

Introduction

Metadata Crawler (aka “Crawler”) is an open source tool designed to automate the discovery and creation of metadata records. If configured to read from a PostgreSQL database such as your SDW, it will create a metadata record in Gemini 2.3 format for each spatial table that it finds.
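For context, the discovery step boils down to listing the spatial tables that the database itself advertises. The sketch below is a minimal illustration only, assuming a PostGIS-enabled SDW, the psycopg2 driver and a hypothetical connection string; it is not Crawler's actual code.

  # Minimal sketch: list the spatial tables a PostGIS database advertises.
  # Assumptions: psycopg2 is installed and the SDW exposes the standard
  # PostGIS geometry_columns view. This is NOT Crawler's real implementation.
  import psycopg2

  conn = psycopg2.connect("dbname=sdw user=readonly host=localhost")  # hypothetical
  with conn, conn.cursor() as cur:
      cur.execute("""
          SELECT f_table_schema, f_table_name, srid, type
          FROM geometry_columns
          ORDER BY f_table_schema, f_table_name
      """)
      for schema, table, srid, geom_type in cur.fetchall():
          # One Gemini 2.3 record would correspond to each table listed here.
          print(f"{schema}.{table}: {geom_type}, EPSG:{srid}")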

Crawler can auto-generate the correct metadata for some fields but not for others, as shown in the table below. For the fields it can’t generate, we ask you to fill in a spreadsheet (more details below). See https://astuntech.atlassian.net/wiki/spaces/ISHAREHELP/pages/150765635 for more details on the meaning of the elements.

Note: this is not the full set of fields required for Gemini 2.3 compliance. The ones we don’t think you’ll need to change from the default are pre-entered for you; they can always be changed in GeoNetwork later if you wish!

Metadata Field                     Method
Abstract                           Spreadsheet
Alternative Title                  Spreadsheet
Bounding Box                       Calculated by Crawler
Dataset Reference Date             Calculated by Crawler
Extent Keyword                     Default
Keyword (INSPIRE and free text)    Spreadsheet
Limitations on Public Access       Default
Lineage                            Default
Maintenance Update Frequency       Spreadsheet
Responsible Organisation           Spreadsheet
Spatial Reference System           Calculated by Crawler
Spatial Resolution                 Spreadsheet
Temporal Extent                    Default
Title                              Calculated by Crawler
Topic Category                     Default
Use Constraints                    Default
Metadata Point of Contact          Spreadsheet

Key:

  • Calculated by Crawler: auto-calculated values for each dataset.

  • Spreadsheet: individual values for each metadata record. See below for more details.

  • Default: placeholder values for all records. You can change these later on a per-record basis if you need to.
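As an example of a value that is “Calculated by Crawler”, a dataset’s bounding box can be derived straight from the geometry in the table. The snippet below is a sketch only, using the standard PostGIS ST_Extent and ST_Transform functions against a hypothetical table and column; it is not Crawler’s actual implementation.

  # Sketch: derive a WGS84 bounding box for one spatial table.
  # The table and geometry column names are hypothetical.
  import psycopg2

  conn = psycopg2.connect("dbname=sdw user=readonly host=localhost")  # hypothetical
  with conn, conn.cursor() as cur:
      # ST_Extent aggregates every geometry; ST_Transform converts to
      # latitude/longitude (EPSG:4326) as used by a Gemini bounding box.
      cur.execute(
          "SELECT ST_Extent(ST_Transform(wkb_geometry, 4326)) "
          "FROM public.parks"
      )
      print(cur.fetchone()[0])  # e.g. BOX(-1.9 52.3,-1.7 52.5)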

Spreadsheets

To enable you to complete the manual fields listed in the table above, we will auto-generate two spreadsheets for you, populated with each metadata record title and the fields you need to complete. We have split the fields across two spreadsheets to make data entry easier: one will contain contact details and organisational responsibility, while the other will contain the abstracts, alternative titles and the other remaining fields. Where a field takes a controlled value, such as the maintenance update frequency, lookups have been provided.
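To give a feel for what the data-entry spreadsheets look like once saved as CSV, the sketch below writes a single row with some plausible column headings. The column names and values here are assumptions for illustration only; the spreadsheets we send you define the real layout and the controlled lookup values.

  # Sketch only: an illustrative layout for the "details" CSV.
  # Column names and values are hypothetical, not the real template.
  import csv

  columns = ["Title", "Alternative Title", "Abstract", "Keywords",
             "Maintenance Update Frequency", "Spatial Resolution"]

  with open("metadata_details.csv", "w", newline="") as f:
      writer = csv.DictWriter(f, fieldnames=columns)
      writer.writeheader()
      writer.writerow({
          "Title": "public.parks",
          "Alternative Title": "Parks and Open Spaces",
          "Abstract": "Boundaries of council-maintained parks.",
          "Keywords": "Protected sites; parks",
          "Maintenance Update Frequency": "monthly",  # controlled value from a lookup
          "Spatial Resolution": "1000",
      })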

Workflow

  1. We derive the spreadsheets based on the spatial tables within your SDW, and send them to you for completion.

  2. Crawler runs in your VPC, and creates metadata records as xml files in an AWS S3 bucket.

    • At this point, the fields populated are those calculated by Crawler, and the default ones.

  3. GeoNetwork harvests the records from the S3 bucket and assigns them to your metadata portal.

  4. You send the completed spreadsheets as CSV email attachments to an email address we provide, which is associated with a second AWS S3 bucket.

  5. Python scripts on the GeoNetwork server extract the CSV files from the emails and use the values in them to update the metadata records with your individual entries (see the sketch after this list).

  6. You check the records in your metadata portal and make any further changes that you need to.
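The sketch below shows the general shape of step 5: fetch a completed CSV from the second S3 bucket and pair each row with a metadata record by its title. It assumes the boto3 library plus hypothetical bucket, key and column names; the real scripts on the GeoNetwork server are maintained by Astun and differ in detail.

  # Sketch of step 5: read a completed CSV from S3 and match rows to records.
  # Bucket, key and column names are hypothetical; not the real Astun scripts.
  import csv
  import io

  import boto3

  s3 = boto3.client("s3")
  obj = s3.get_object(Bucket="example-metadata-inbox", Key="metadata_details.csv")
  rows = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))

  for row in rows:
      # In the real workflow each row's values are written into the matching
      # GeoNetwork record; here we just show which record would be updated.
      print(f"Would update '{row['Title']}' with abstract: {row['Abstract'][:40]}...")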

Automating the Crawler Workflow

After your metadata catalog has been pre-populated with metadata, Crawler can be run as a scheduled task to pick up new tables or changes to existing ones (such as a bounding box change).

For subsequent updates, Crawler will pick up new tables and changes to existing tables, but it can’t deal with deletions (metadata records for deleted tables should be explicitly retired in GeoNetwork).

To populate the “spreadsheet fields” from the table above for new records, you can either create a new spreadsheet containing just those records, or edit the records directly in GeoNetwork. Which approach is best will depend on the number of records you have to edit.

Crawler doesn’t run on a Windows server, so you’ll need to speak to your Astun consultant about scheduling the times that Crawler runs.