Name scanning in Tipalti
Multiple organizations in the US (OFAC, the Dept. of State, BIS, etc.) publish lists containing the names of individuals and companies targeted under one or more sanctions programs.
In the event that an entity from one of these sanctions lists is potentially involved in a money transfer, additional due diligence must be conducted before proceeding.
Tipalti helps its customers maintain compliance by scanning these lists for every payee being paid.
This proves to be a challenging process. The difficulties stem from two main issues:
- The comparison is a name-to-name comparison, meaning we need to compare two strings that can vary widely while still being linguistically equivalent.
- Existing algorithms that are commonly used for name comparison can perform poorly on foreign names that are transcribed into English.
Common algorithms for name comparison and why they don’t always work
Our scanning process consists of multiple stages in which we run algorithms such as Jaro-Winkler distance, Levenshtein distance, Soundex similarity, etc. Each algorithm operates at a different stage of the process and serves a different purpose, but there are times when even all of these algorithms combined do not produce a sufficient result.
Let’s look at the following example: the Chinese name 习 is usually transcribed to English as “Xi”, but since its approximate pronunciation in English is “Shi” or “Shee”, we need to treat those forms as equivalent. How do the algorithms mentioned above perform when comparing “Xi” with “Shi”?
Comparison of "Xi" and "Shi"

Algorithm       Score
---------       -----
Levenshtein     33 (out of 100)
Jaro-Winkler    61 (out of 100)
Soundex         NOT EQUAL (X000 vs S000)
As you can see, all of the scores here are too low to signal that there’s a match.
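For reference, here is a minimal sketch of how these comparisons can be computed. It uses Python with the jellyfish string-metrics library, which is our choice for illustration and not necessarily what runs in Tipalti’s pipeline; exact scores can differ slightly between implementations and normalization choices.

# A minimal sketch of the three comparisons above, using the jellyfish
# string-metrics library (illustrative choice only).
import jellyfish

a, b = "xi", "shi"

# Levenshtein: convert the raw edit distance into a 0-100 similarity,
# as in the table above.
distance = jellyfish.levenshtein_distance(a, b)
levenshtein_score = round(100 * (1 - distance / max(len(a), len(b))))

# Jaro-Winkler: already a 0-1 similarity, scaled to 0-100.
jaro_winkler_score = round(100 * jellyfish.jaro_winkler_similarity(a, b))

# Soundex: compare the phonetic codes ("X000" vs "S000").
soundex_equal = jellyfish.soundex(a) == jellyfish.soundex(b)

print("Levenshtein:", levenshtein_score)
print("Jaro-Winkler:", jaro_winkler_score)
print("Soundex equal:", soundex_equal)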
Let’s see how we handle these cases in Tipalti.
Tipalti’s phonetic rules engine
Given the problem of non-English names that can be transcribed to English in multiple forms, we needed to design a solution that can generate all the different forms of the same name.
We do this by creating rule-based patterns that can describe multiple forms of the same string. Here’s an example:
Let’s assume my Hebrew name can be transcribed to English as both “Shauli” and “Shaoli”. In this case, we can create a rule that treats “au” and “ao” as interchangeable. This rule would look like “{au,ao}”, and a matching pattern for my name would look like “sh{au,ao}li”. This pattern describes both forms of my name.
We can also incorporate two rules in the same pattern. For example, let’s add another rule for the suffix of my name: “{li,ly,lee}”.
Now, when we apply these two rules together, we get the pattern “sh{au,ao}{li,ly,lee}”, which matches the following strings: “shauli”, “shauly”, “shaulee”, “shaoli”, “shaoly”, “shaolee”.
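To make the pattern concrete, here is a small illustrative sketch (ours, not the production engine) that expands a brace pattern like this into every string it describes:

# Expand a brace pattern such as "sh{au,ao}{li,ly,lee}" into all the
# concrete strings it describes. Illustrative sketch only.
import re
from itertools import product

def expand_pattern(pattern):
    # Split the pattern into literal text and {a,b,c} alternative groups.
    parts = re.split(r"(\{[^}]*\})", pattern)
    choices = []
    for part in parts:
        if part.startswith("{") and part.endswith("}"):
            choices.append(part[1:-1].split(","))   # alternatives
        elif part:
            choices.append([part])                  # literal text
    # The Cartesian product of all the choices gives every combination.
    return ["".join(combo) for combo in product(*choices)]

print(expand_pattern("sh{au,ao}{li,ly,lee}"))
# ['shauli', 'shauly', 'shaulee', 'shaoli', 'shaoly', 'shaolee']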
In order to create this set of rules, we work with linguistic experts who specialize in different languages.
Let’s take a look under the hood of our phonetic rules engine.
How does the phonetic rules engine work?
This is a general outline of the engine algorithm (pseudo-code):
// This method receives a name (e.g. "shauli") and a set of phonetic rules (e.g. "{ao,au}")
// and returns all the equivalent strings these rules allow (e.g. "shauli", "shaoli").
GenerateEquivalentNames(name, rules)
1. Find all the appearances of the rules in the name.
2. Integrate the rules into the name at their matching positions (e.g. the output will look like "sh{au,ao}li").
3. Generate all string combinations from the above pattern.
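A rough, self-contained translation of that outline might look like the sketch below. It assumes rules are written in the “{au,ao}” notation shown earlier and skips edge cases (overlapping rules, casing, multiple matches of the same rule) that the real engine would need to handle.

# Sketch of the GenerateEquivalentNames outline above. Assumes rules use
# the "{au,ao}" notation; simplified for illustration.
import re
from itertools import product

def expand_pattern(pattern):
    # Same helper as in the earlier sketch: expand "sh{au,ao}li"-style
    # patterns into every concrete string they describe.
    parts = re.split(r"(\{[^}]*\})", pattern)
    choices = [p[1:-1].split(",") if p.startswith("{") else [p]
               for p in parts if p]
    return ["".join(combo) for combo in product(*choices)]

def generate_equivalent_names(name, rules):
    pattern = name
    for rule in rules:
        # Step 1: find an appearance of the rule (any of its alternatives)
        # in the name.
        for alternative in rule.strip("{}").split(","):
            if alternative in pattern:
                # Step 2: integrate the rule group at the matching position.
                pattern = pattern.replace(alternative, rule, 1)
                break
    # Step 3: generate all string combinations from the resulting pattern,
    # e.g. "sh{au,ao}{li,ly,lee}".
    return expand_pattern(pattern)

print(generate_equivalent_names("shauli", ["{au,ao}", "{li,ly,lee}"]))
# ['shauli', 'shauly', 'shaulee', 'shaoli', 'shaoly', 'shaolee']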
After generating the list of equivalent names, we scan each of them against the sanctions lists.
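As a hedged illustration of that last step, the loop below screens every generated form against a toy sanctions list using one of the similarity metrics from earlier; the list contents and the threshold are made up for the example and are not Tipalti’s actual screening pipeline.

# Hypothetical screening loop: every generated form of the payee name is
# compared against every sanctioned name, and close matches are flagged
# for further review. The list and threshold are illustrative only.
import jellyfish

SANCTIONED_NAMES = ["shaoly"]   # stand-in for the real sanctions lists
MATCH_THRESHOLD = 0.9           # illustrative value only

def screen(equivalent_names):
    hits = []
    for candidate in equivalent_names:
        for sanctioned in SANCTIONED_NAMES:
            score = jellyfish.jaro_winkler_similarity(candidate, sanctioned)
            if score >= MATCH_THRESHOLD:
                hits.append((candidate, sanctioned, round(score, 2)))
    return hits

print(screen(["shauli", "shauly", "shaulee", "shaoli", "shaoly", "shaolee"]))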
Conclusion
This approach may increase false positives, but for a security control that is the preferred trade-off. It’s better to block conservatively at first and let a payment proceed only after some form of human verification than to be less discriminating and risk missing a true match. We’re trying to adhere to federal law, after all.
If you have specific questions about how the engine works, please feel free to contact me or leave a comment.