Clean up data from Remote services
When we use data that comes from remote services such as Elgin Roadworks or Uniform we have no control over the data that gets sent through. Often this includes badly encoded characters, which look like gobbledegook when displayed as part of iShare. Luckily we can strip out these characters with a regular expression string replacement function in PostgreSQL.
String replacement functions are described here. The one we need is:
regexp_replace(string text,pattern text, replacementtext [, flags text])
This takes 4 parameters (the first three are mandatory, the last is optional).
- The string or column name that we want to find the text in
- The regular expression pattern that we are going to search for
- The text we want to use as a replacement
- Whether we want to just replace the first example that we find, or every example
[^\u0000-\u007F]
E'[^\\u0020-\\u007E\\u00A3\\u00A9\\u00AE]'
This matches the following characters in the range \\u0020-\\u007E:
! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~
as well as the individual characters \\u00A3 \\u00A9 \\u00AE:
£ © ®
select notes_1, regexp_replace(notes_1, E'[^\\u0020-\\u007E\\u00A3\\u00A9\\u00AE]', '|', 'g') as check from traffic_transport.highways_schms where notes_1 similar to E'%[^\\u0020-\\u007E\\u00A3\\u00A9\\u00AE]%';
The optional flag needs to be set to 'g' for 'greedy' to replace all examples of the pattern in the given string.
It will need to be run for each textual column.
Update records containing invalid characters in a column
update traffic_transport.highways_schms set notes_1 = regexp_replace(notes_1, E'[^\\u0020-\\u007E\\u00A3\\u00A9\\u00AE]', '', 'g');
The optional flag needs to be set to 'g' for 'greedy' to replace all examples of the pattern in the given string.
It will need to be run for each textual column.
This can be included in an at_sys_create_table or an at_sys_executecmd Stored Procedure Task in iShare Studio (from v6.0.0 these are called wkf_create_table & wkf_executecmd respectively, under the -Workflow- Function filter).