Monday, 20 June 2011

Removing Special Characters

Lets say you have a database that have names of people and that some names have special characters, for instance, Zeélie,Wentzël,Oösthüizen or Oosthuÿsen. For some reason you have a third party application that you send through the names you have in your database, for example, a payroll system. But this system validates names and does't accept names with special characters. Therefore, you want to send through names without the special characters to this application, here is how you can do it in SQL:

We start by creating a table that will hold the special characters and the equivalent text character, for instance, é is equivalent to e.

CREATE TABLE dbo.tblSpecialCharacters (SeqId int identity(1,1),SpecialCharacter nvarchar(50),TextCharacter nvarchar(50))

INSERT INTO dbo.tblSpecialCharacters (SpecialCharacter,TextCharacter)
VALUES ('é','e')

INSERT INTO dbo.tblSpecialCharacters (SpecialCharacter,TextCharacter)
VALUES ('ë','e')

INSERT INTO dbo.tblSpecialCharacters (SpecialCharacter,TextCharacter)
VALUES ('ö','o')

INSERT INTO dbo.tblSpecialCharacters (SpecialCharacter,TextCharacter)
VALUES ('ü','u')

-- Create a Tally table that we are going to use to loop through the names with special characters. 
-- For this exmaple we are just going to end at 200, you can however, go as far as 20000 or more
-- depending on your needs.
SELECT TOP 100 IDENTITY(INT,1,1) AS N INTO dbo.tblTally FROM Master.dbo.SysColumns sc1,Master.dbo.SysColumns sc2

--- Add a Primary Key to maximize performance
  ALTER TABLE dbo.Tally
    ADD CONSTRAINT PK_Tally_N 
        PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

Now create a function that will be removing the special characters:

CREATE FUNCTION [dbo].[RemoveCharacter]  
(  
@name nvarchar(200)  
)  
RETURNS nvarchar(200)
AS  
BEGIN  

declare @numCharacters int
-- Get the number of characters that the name has
set @numCharacters = (select count(s.SpecialCharacter)
FROM SpecialCharacters s,Tally t 
WHERE t.n <= len(@name) 
and s.SpecialCharacter=substring(@name,t.n,1))

declare @count int 
-- Set counter to 1
set @count = 1

-- Loop through the number of characters that needs to be 
-- replaced 
while @count<=@numCharacters
begin
-- Replace the character with the equivalent text.
-- Make sure you select only one character at a time. This way the subquery does't return more than
-- one record set.
set @name = replace(@name,(SELECT TOP 1(s.SpecialCharacter) FROM SpecialCharacters s,Tally t WHERE t.n <= len(@name) and s.SpecialCharacter=substring(@name,t.n,1)),
(SELECT TOP 1(s.TextCharacter) FROM SpecialCharacters s,Tally t WHERE t.n <= len(@name) and s.SpecialCharacter=substring(@name,t.n,1)))

set @count = @count + 1
end 
RETURN  
-- return the name without the characters.
@name
END  

Now lets say you have a table of people, in this example we will use a temporary table of employees who have special characters as follows:

CREATE TABLE #tblEmployees (SeqId int identity(1,1), Firstname nvarchar(200), Surname nvarchar(200))

INSERT INTO #tblEmployees (Firstname,Surname)
VALUES ('Tinashe','Zeélie')

INSERT INTO #tblEmployees (Firstname,Surname)
VALUES ('Takura','Wentzël')

INSERT INTO #tblEmployees (Firstname,Surname)
VALUES ('Tafadzwa','Oösthüizen')

INSERT INTO #tblEmployees (Firstname,Surname)
VALUES ('Peter','Oosthuÿsen')

SELECT FIRSTNAME,
dbo.RemoveCharacter(SURNAME) AS SURNAME
FROM #tblEmployees

DROP TABLE #tblEmployees

Results:

FirstnameSurname
TinasheZeelie
TakuraWentzel
TafadzwaOosthuizen
PeterOosthuysen
From the results, the characters are removed. This is just a small record set. You can test to see how the function fares, when lets say you have to pass through more than 10000 records of employees. Or that you have to do an inner join with another table, for instance the salary table.

This is just one of the ways you can remove special characters in SQL, I am sure there are other ways this can be achieved. If you have any additions, subtractions or suggestions please do let me know. Hope this helps someone.