I’ve always wondered how this worked in SQL Server… 🤔
Estimated Reading Time: 13 minutes
Difficulty: Intermediate
This post is purely illustrative and educational. It showcases a working version of the Soundex algorithm in Python3+. This code is not to be used in production as is and should be modified or rewritten to meet the requirements and standards of a production-level implementation.
If you would rather skip to the code, or this article is TLDR, no worries! See the code here.
The Soundex Algorithm is a phonetic algorithm that converts an alphanumeric string into a 4 character code by how it is pronounced in English. The 4 character code is comprised of one letter followed by three digits. The first time I came across this algorithm was with a task I was given to perform address and name matching for a skiptracing worflow using Stored Procedures in SQL Server 2000 (eeeeek! 😬 ). It made absolutely no sense to me at the time how it was able to determine how “Geoff” and “Jeff” SOUND alike to a computer. As I made a list of algorithms to familiarize myself with and implement this year, SOUNDEX made it on the list, and here I am. This algorithm can be implemented in any language, I decided to use Python3+ to become more fluent in it.
So how do we get V532 from “VanDeusen”?
The algorithm seems to be straight-forward. However, it has a few gotchas that can be observed in many available open-source implementations which will be addressed below. Let’s list the rules of the algorithm (they can be implemented in varying orders) :
From SOUNDEX Wikipedia
- Remove all non-alpha characters
- If two of the same characters are adjacent in the original alphanumeric string, only retain the first character.
- If two of the same characters are separated by h, w, y only retain the first character.
- If two of the same characters are separated by a, e, i, o, u retain both characters
- Retain the first character of the alphanumeric string and drop all other occurrences of a, e, i, o, u, y, h, w, y
- Replace consonants with digits as follows (after the first letter):
- b, f, p, v = 1
- c, g, j, k, q, s, x, z = 2
- d, t = 3
- l = 4
- m, n = 5
- r = 6
- If there are too few characters in the alphanumeric string to assign three numbers, append zeros until there are three numbers. If there are four or more numbers, retain only the first three
Vowels and the Highway (hwy)
One of the first steps is to Keep it Short and Simple (KISS Principle). We don’t need a clever algorithm to handle vowels and h, w, y we simply need a code. Uppercase characters will be used to make a lexicon given the rules above.
| A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 0 | 1 | 2 | 9 | 0 | 2 | 2 | 4 | 5 | 5 | 0 | 1 | 2 | 6 | 2 | 3 | 0 | 1 | 9 | 2 | 9 | 2 |
Implementation Step #1 – Create the lexicon
Moving from A to Z, use the mappings provided by the algorithm to create a string that can later be iterated character by character. 0’s represent vowels, and 9’s represent h, w, y.
# codings is a representation of the alphabet according to the SOUNDEX algorithm
# 0 represents vowels : 'A','E','I','O','U', where 9 represents 'H','W','Y'.
# Both 0 and 9 are skipped charcters.
codings : str = "01230129022455012623019292"
Implementation Step #2 – Lexicon translation
This step is the “coding” step, where the input is transformed into its coded output as per the lexicon. This implementation breaks apart the transformation step from the code generation step for readability and separation of concerns.
# use an uppercase representation of the input for easier validation and iterate over each character
for char in list(input.upper()) :
# get the ascii character code for the current character
charCode : int = ord(char)
# VALIDATON : if the char is not [A-Z] skip it
if charCode < 65 and charCode > 90 :
continue
# SOUNDEX : Step 1a. Retain the first letter of the input
if len(finalCode) < 1 :
finalCode.append(char)
# SOUNDEX : Step 2. Replace consonants with digits (after the first letter):
code : int = codings[charCode - 65]
inputCoded.append(code)
- Get the ASCII representation of the current uppercase character of the iteration
- Given A=65 and Z=90 as ASCII character codes, determine if the character is inclusively between 65 and 90, if not the character skips any further processing (the equivalent of deleting the character in the output of the routine)
- Extract the first character of the input to use as the prefix of the final code
- Use the lexicon to obtain the character value per the specification. When using ASCII it is easy to subtract A from 65 to get “0” which can be used as an index for the lexicon. Append this character to the representation of the input as coded numbers
The above steps will translate a name into a string of numbers to be parsed in the next step so VanDeusen will be translated into 105300205
| V | A | N | D | E | U | S | E | N |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 5 | 3 | 0 | 0 | 2 | 0 | 5 |
Implementation Step #3 – SOUNDEX Code Creation
This step creates the 4 character code from the lexicon translation. While this implementation can be merged with the previous step, this iteration is kept apart from the prior iteration for readability and separation of concerns.
# iterate over the coded input, the index is needed to compare like characters with a 0 or 9 between them
for index, char in enumerate(inputCoded) :
# prevent further iteration once the finalCode has reached the max length according to the SOUNDEX algorithm
if len(finalCode) == 4 :
break
# determine if the current character is a skippable (or to be removed) character as per the algorithm
skipchar : bool = char in ['0','9']
# SOUNDEX : Step 3b. Remove all the zero (0 or 9) digits (performed by skipping)
if skipchar :
continue
# store the previous character if available
if index > 0 :
previousChar = inputCoded[index-1]
# the first character should be skipped for processing
else :
continue
# SOUNDEX : Step 3a. Replace all adjacent same digits with one digit (performed by skipping)
if char == previousChar :
continue
# SOUNDEX : Step 3.
#
# If two or more letters with the same number are adjacent in the original word (before step 1),
# only retain the first letter; also two letters with the same number separated by 'h' , 'w' or 'y'
# are coded as a single number, whereas such letters separated by a vowel are coded twice.
# This rule also applies to the first letter.
if index > 2 and inputCoded[index-2] == char :
if previousChar == '0' :
finalCode.append(char)
continue
# characters equaling 9 are skipped if the previous and next value are the same
elif previousChar == '9' :
continue
finalCode.append(char)
- If the length of the finalCode is 4 we can stop iterating as that is the maximum number of characters needed for code completion
- If the current character is 0 or 9 it can be skipped or not added to the final code
- If there is a previous character available it should be stored for future iterations
- If this character is the first character it can be skipped as it was already added in Implementation Step #2
- Deduplicate adjacent identical characters
- If two identical characters are separated by a vowel ( 0 ) append the character to the final code
- If two identical characters are separated by h, w, y then skip this character (it is not added to the final code)
- Finally, if this iteration hasn’t skipped, append the character to the final output
| V | N | D | S |
|---|---|---|---|
| V | 5 | 3 | 2 |
The above steps will convert 105300205 into the final output of V532. Remember that the first character “V” was extracted in Implementation Step #2
Thats it!
This was an extremely interesting algorithm that proved to be challenging at times. This implementation may differ from SQL implementations as eluded to by the Wiki article, but this is the traditional approach that meets the conditions specified :
Using this algorithm, both “Robert” and “Rupert” return the same string “R163” while “Rubin” yields “R150”. “Ashcraft” and “Ashcroft” both yield “A261”. “Tymczak” yields “T522” not “T520” (the chars ‘z’ and ‘k’ in the name are coded as 2 twice since a vowel lies in between them). “Pfister” yields “P236” not “P123” (the first two letters have the same number and are coded once as ‘P’), and “Honeyman” yields “H555”.
