This post will be fairly abstract, since I can't share data or scripts from work, but I'll rewrite a few things and hopefully make it easy to follow without the real data or scenario.
Background
Say you have a large table of data that you need to match some things on, but it sometimes (rarely) contains characters outside of the normal ASCII range. That's the range you are probably used to if you only speak English. Outside of English, though, there are a huge number of characters in regular use - umlauts, accents, Cyrillic letters, and so on. Since I work on data that mostly comes from user-submitted forms on a website, I often have to deal with these 'weird' characters (our terminology for them).
So what do we do?
There are a few options for what you can do with these characters. In our situation, we are going to use them to build strings to match two different inputs on, so we want them standardized so that both sides come out the same. This is where the problems start.
Options
Scrub them entirely
The first, and probably easiest, option would be to scrub them entirely. It's also one of the worst things we can do: if a person enters their information once with a 'weird' character, and another time with it transliterated by hand into a standard character, the two entries will no longer match each other.
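As a minimal sketch of why that hurts (the names and the scrubbing regex here are made up purely for illustration), deleting everything outside the printable ASCII range turns 'Müller' into 'Mller', which no longer matches a hand-transliterated 'Muller':
use strict;
use warnings;
use utf8;

# Illustrative values only
my $submitted = "Müller";   # what the user typed, with a 'weird' character
my $retyped   = "Muller";   # the same person, transliterating it themselves

# Scrub everything outside printable ASCII: "Müller" becomes "Mller"
(my $scrubbed = $submitted) =~ s/[^\x20-\x7E]//g;

print "no match\n" unless $scrubbed eq $retyped;   # prints "no match"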
Replace them with something else using a common tool
There are a bunch of tools out there that try to transliterate the characters for you. Since I'm mostly on Linux at this point, working in various scripting languages, I'm used to iconv. It can be used with something like
cat file | iconv -f utf-8 -t ascii//translit > output
This takes a UTF-8 encoded file and attempts to convert it down to plain ASCII. It tries, but I've never seen it work well - it usually just spits out question marks in place of characters it can't figure out.
Come up with a specialized character remap for your data
This is what I did. I pulled the data out to a file, did a ton of cleaning on it, and actually figured out which characters could be removed, which should be transliterated, and which I could ignore.
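The end result is basically a hand-built lookup table. Here's a rough sketch of the idea - the standardize helper and these specific mappings are made up for illustration, not our real table, and the analysis below is what tells you which characters actually need to be in it:
use strict;
use warnings;
use utf8;

# Example mappings only - the real table came out of the analysis below
my %remap = (
    "\x{00E9}" => 'e',    # é -> e
    "\x{00FC}" => 'u',    # ü -> u
    "\x{2019}" => "'",    # curly quote -> straight apostrophe
    "\x{00A0}" => ' ',    # non-breaking space -> regular space
);

sub standardize {
    my ($str) = @_;
    # swap each character for its mapped replacement, or keep it as-is
    return join '', map { exists $remap{$_} ? $remap{$_} : $_ } split //, $str;
}

print standardize("Müller’s café"), "\n";    # prints: Muller's cafe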
How did I do it?
It took me a long time to get to the point of having something working, particularly because it was hard to get all of my available tools to play well with UTF-8 and Unicode characters.
Finally, I got something.
Let's assume that you have a big file of data. I'm going to step through how I analyzed the file and figured out what I needed to do.
Step One - Clean out the good stuff
For this, I used the totally awesome and fast command-line utility tr. Its name is short for translate, and it is typically used to switch one character for another. It can also delete character ranges, which is perfect for our purposes. It's also extremely fast - on my 100MB file, it finished in about 3 seconds.
cat infile | tr -d '[:alnum:] \t\n\r' > outfile
This will delete (-d) any character that is a standard letter or number (A-Z, a-z, and 0-9), as well as spaces, tabs, newlines, and carriage returns. This means we end up with a mostly cleaned file containing only the weird characters and punctuation, which also makes the rest of the work go a lot faster. For my 100MB file, it reduced it to about 1.2MB.
Step Two - Convert all the remaining characters to their Unicode codepoints
Unicode assigns a codepoint to pretty much any character, and a codepoint is really easy to google if you want to figure out what a character actually is. I used a Perl one-liner for this, which I cobbled together from a StackOverflow answer:
perl -C7 -ne 'for(split(//)){print sprintf("U+%04X", ord($_))." ".$_."\n"}' < infile > outfile
What this does is this:
- -C7 sets STDIN, STDOUT, and STDERR to UTF-8
- -n loops over every line in the file and runs the quoted code on it
Here is the perl program itself, with a documented explanation:
for (split(//)) { # for every line perl reads, split on nothing, giving a character-by-character loop
    print sprintf("U+%04X", ord($_)) . " " . $_ . "\n"; # ord returns the numeric codepoint of the character
}
It's really simple, but the real magic here was figuring out the sprintf - "U+%04X" formats the character's number as the familiar zero-padded hex codepoint, like U+00E9.
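As a quick illustration of what that sprintf produces (assuming a UTF-8 terminal; the -Mutf8 is there so Perl treats the literal é as one character):
perl -Mutf8 -e 'printf "U+%04X\n", ord("é")'    # prints U+00E9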
Run over the whole file, the one-liner spits out one line per character, but won't group them up...yet.
Step Three - Count how many times each one happens
This was more bash utilities:
cat infile | sort | uniq -c | sort -nr > outfile
The first sort groups identical lines together, uniq -c counts how many times each one is seen, and the second sort (-nr) orders them by that count, highest first, before we stick the result in a file.
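The output ends up as one line per distinct character - a count, the codepoint, and the character itself - with the most common at the top. These particular counts and characters are made up just to show the shape:
   1432 U+2019 ’
    307 U+00E9 é
     58 U+00D6 Ö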
Step Four (optional) - Remove the remaining ASCII characters
This was a much bigger program, and a little more complex. I'll post most of it here, but it is missing a few things that were too specific to our environment to really show.
What this does is build an array of the codepoint strings for the characters we absolutely know are OK - ASCII codes 32 through 126, the printable range. I build that in a foreach loop, because I like clarity in my Perl when it is not a one-liner.
Once we have our valid characters, we loop over the lines of the input file (the output from step three) and check whether any of those valid codepoints appear in each line. I did a double foreach on this, because it was a little clearer than some of the more compact techniques.
# @input_lines is assumed to already hold the lines of the step three output
# (reading the file in is one of the environment-specific parts I left out)
my @valid_chars = (32..126);
my @comp_array;
foreach my $chr (@valid_chars) {
    # build the "U+0020" .. "U+007E" strings for the printable ASCII range
    push(@comp_array, sprintf("U+%04X", $chr));
}

foreach my $inline (@input_lines) {
    my $has_match = 0;
    foreach my $comp_line (@comp_array) {
        # \Q...\E quotes the "+" so it matches literally instead of acting as a regex quantifier
        my $re = qr/\Q$comp_line\E/;
        $has_match++ if ($inline =~ $re);
    }
    # keep only the lines whose codepoint is not plain printable ASCII
    print "$inline\n" if ($has_match == 0);
}
Then all you have to do is re-sort the output (the same sort -nr from step three works), and you have a pretty accurate representation of the weird characters in your data.