SQL for Address Standardization: An Engineer's Approach

How to Use SQL to Standardize Messy Addresses

As a software engineer, I've encountered my fair share of data nightmares. One that keeps popping up? Inconsistent address data. Whether you're building a CRM, an e-commerce platform, or just trying to make sense of a legacy database, messy addresses can be a real headache.

Recently, I tackled this problem head-on using SQL. In this post, I'll share the approach I developed for standardizing addresses using nothing but good old SQL. If you're dealing with address data and want a no-nonsense solution, stick around.

Why Bother with Using SQL for Address Standardization?

Before we dive into the code, let's quickly touch on why this matters:

Clean Data = Happy Life: Trust me, future you will thank present you for standardizing those addresses.
Search That Actually Works: Ever tried to find "123 Main St" when your database has it as "123 Main Street"? Yeah, not fun.
Accurate Analytics: Can't do proper analysis if half your New York addresses are "NY" and the other half are "New York".
Efficient Operations: If you're doing any kind of mailing or shipping, standardized addresses are a must.

The Toolkit for SQL Address Standardization

Alright, let's get our hands dirty with some SQL. I've broken this down into several macros, each handling a different part of the address. We'll look at standardizing the street address, city, state, and ZIP code.

Taming the Street Address with SQL

First up, the street address. This is often the messiest part, with abbreviations all over the place. Here's a macro I cooked up to handle it:

{% macro standardize_address1(address_field) %}
    UPPER(
        REGEXP_REPLACE(
            -- Replace common abbreviations
            REGEXP_REPLACE(
                REGEXP_REPLACE(
                    REGEXP_REPLACE(
                        REGEXP_REPLACE(
                            REGEXP_REPLACE(
                                REGEXP_REPLACE(
                                    REGEXP_REPLACE(
                                        {{ address_field }},
                                        '\\b(STREET|STRT|STRE|STRT.)\\b', 'ST'
                                    ),
                                    '\\b(AVENUE|AVENU|AVNUE|AVEN|AVN|AVE.)\\b', 'AVE'
                                ),
                                '\\b(ROAD|RD.)\\b', 'RD'
                            ),
                            '\\b(DRIVE|DRIV|DRV|DR.)\\b', 'DR'
                        ),
                        '\\b(BOULEVARD|BOULV|BLVD|BLV|BLVD.)\\b', 'BLVD'
                    ),
                    '\\b(COURT|CRT|CT.)\\b', 'CT'
                ),
                '\\b(PLACE|PLC|PL.)\\b', 'PL'
            ),
            -- Remove any non-alphanumeric characters except spaces and periods
            '[^A-Za-z0-9. ]', ''
        )
    )
{% endmacro %}

What's happening here:

Everything gets uppercased. Why? Consistency, my friends.
Common abbreviations get standardized. "STREET", "STRT", "ST." all become "ST".
Any weird characters that aren't letters, numbers, spaces, or periods? Gone.

City Names: Keep It Simple

City names are usually less problematic, but we still want to ensure consistency:

{% macro standardize_city(city) %}
    INITCAP(TRIM(REGEXP_REPLACE({{ city }}, '\\s+', ' ')))
{% endmacro %}

This little macro:

Squashes multiple spaces into single spaces.
Capitalizes the first letter of each word (hello, INITCAP function).
Trims any leading or trailing spaces, because why not?

States: Two Letters, No More, No Less

Ah, states. Full names, abbreviations, postal codes - it's a mess. Let's fix that:

{% macro standardize_state(state) %}
    CASE
        WHEN UPPER(TRIM({{ state }})) IN ('ALABAMA', 'ALA', 'AL') THEN 'AL'
        WHEN UPPER(TRIM({{ state }})) IN ('ALASKA', 'AK') THEN 'AK'
        -- ... (other states)
        WHEN UPPER(TRIM({{ state }})) IN ('WYOMING', 'WYO', 'WY') THEN 'WY'
        ELSE UPPER(TRIM({{ state }}))
    END
{% endmacro %}

This macro is all about consistency:

Uppercase and trim the input. Always.
Match against known variations (full name, common abbreviations).
Convert everything to the standard two-letter postal code.
If we don't recognize it? Return the input, but uppercased and trimmed.

ZIP Codes: Five Digits of Fun

Last but not least, ZIP codes:

{% macro standardize_zip(zip_code) %}
    CASE
        WHEN REGEXP_LIKE({{ zip_code }}, '^[0-9]{5}(-[0-9]{4})?$') THEN LEFT({{ zip_code }}, 5)
        WHEN REGEXP_LIKE({{ zip_code }}, '^[0-9]{9}$') THEN LEFT({{ zip_code }}, 5)
        ELSE NULL
    END
{% endmacro %}

Here's what's going on:

We check if it's a valid 5-digit ZIP code, with or without the extra four digits.
If it is, we take the first five digits.
If it's not in a format we recognize? NULL. Because bad data is worse than no data.

Putting It All Together

Now, let's see how we use these macros in a real query:

SELECT
    {{ standardize_address1('raw_address') }} AS standardized_address,
    {{ standardize_city('raw_city') }} AS standardized_city,
    {{ standardize_state('raw_state') }} AS standardized_state,
    {{ standardize_zip('raw_zip') }} AS standardized_zip
FROM
    raw_address_table

Run this query, and voilà - you've got yourself a table of clean, standardized addresses.

Wrapping Up

There you have it - a set of SQL macros to whip your address data into shape. Is it perfect? Nope. Will it solve all your address problems? Probably not. But it's a solid start, and it's saved me countless hours of data cleaning.

Remember, this is just a starting point. Depending on your specific needs and the quirks of your data, you might need to tweak these macros. And for really complex address validation, you might want to look into specialized services.

But for many cases, this SQL-based approach can get you 80% of the way there with minimal fuss. And in my book, that's a win.

Got any tips for improving these macros? Or horror stories about messy address data? Reach out directly contact. Happy coding!