Clean up data from Remote services

When we use data that comes from remote services such as Elgin Roadworks or Uniform we have no control over the data that gets sent through. Often this includes badly encoded characters, which look like gobbledegook when displayed as part of iShare. Luckily we can strip out these characters with a regular expression string replacement function in PostgreSQL.

String replacement functions are described here. The one we need is:

regexp_replace(string text,pattern text, replacementtext [, flags text])

This takes 4 parameters (the first three are mandatory, the last is optional).

The string or column name that we want to find the text in
The regular expression pattern that we are going to search for
The text we want to use as a replacement
Whether we want to just replace the first example that we find, or every example

The regular expression that we're going to use will strip out all characters that are not in the basic Latin character set (comprising lower and upper case letters, numbers, and basic punctuation).

Control characters fall in the range \u0000 to \u001F and include things such as NULL and escape characters.

Basic Pattern

[^\u0000-\u007F]

This means "any character NOT (^) in the Unicode range u0000-u007F.

To make this work in PostgreSQL, dealing with escaping quotes and so on, we need to write it as follows:

Pattern for PostgreSQL

E'[^\\u0020-\\u007E\\u00A3\\u00A9\\u00AE]'

This matches the following characters in the range \\u0020-\\u007E:

! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~

as well as the individual characters \\u00A3 \\u00A9 \\u00AE:

£ © ®

A good reference that can be used to identify unicode values is http://unicode-table.com/en/.

If you wanted to include the Yen sign (¥) then the Unicode number is listed as U+00A5 which would be represented as \\u00A5 when used in the regular expression.

Select records containing invalid characters in a column

To test a column to see if any rows contain characters that are not listed in the above regular expression you can use the Select statement to query the data.

It outputs the original column and a version with the non-standard characters replaced with the pipe symbol (|) so you can see the position of the characters. Only affected rows are returned.

Assuming we are checking a column called notes_1 in a table called traffic_transport.highways_schms then query would be as follows:

Example Select statement

select notes_1, regexp_replace(notes_1, E'[^\\u0020-\\u007E\\u00A3\\u00A9\\u00AE]', '|', 'g') as check from traffic_transport.highways_schms where notes_1 similar to E'%[^\\u0020-\\u007E\\u00A3\\u00A9\\u00AE]%';

The optional flag needs to be set to 'g' for 'greedy' to replace all examples of the pattern in the given string.

It will need to be run for each textual column.

Update records containing invalid characters in a column

Using the same example as above, to update the text in a column removing characters outside the range matched by the regular expression use:

Example Update statement

update traffic_transport.highways_schms set notes_1 = regexp_replace(notes_1, E'[^\\u0020-\\u007E\\u00A3\\u00A9\\u00AE]', '', 'g');

The optional flag needs to be set to 'g' for 'greedy' to replace all examples of the pattern in the given string.

It will need to be run for each textual column.

This can be included in an at_sys_create_table or an at_sys_executecmd Stored Procedure Task in iShare Studio (from v6.0.0 these are called wkf_create_table & wkf_executecmd respectively, under the -Workflow- Function filter).