Matt Bourke

May 31 2010

Regex day email validation

Posted by Matt at 4:09 PM
6 comments
- Categories: Coldfusion

Email Validation Tutorial for Regex newbies.


[a-zA-Z0-9_]+[a-zA-Z0-9_./]+@[a-zA-Z0-9_]+\.[a-zA-Z]+\.?[a-zA-Z]+

 

This tutorial is aimed at people who'll see the above regex and think "WTF is this?"


Regex is literally one of the easiest things for a programmer to learn, far easier than CF and HTML, but if you're unfamiliar with regex, it looks... weird...

 

Today is Regular Expression Day (http://www.regular-expression-day.com/), so I thought I'd write (what I hope is) a simple tutorial for people who know nothing about regex.

 

Note: This is the first time I've posted code on my blog, so don't expect it to be colour coded or very well laid out. I'm also rushing this a bit in my lunch break, so I apologise for any grammar, typos, spelling mistakes, etc.


OK, so let's get started building a regex for validating the email addresses below.


Note: I'm not 100% sure what exactly is valid for an email address, so treat this as an example for learning basic regex rather than the correct regex for email address validation.


Open up a new file in Eclipse, Dreamweaver, etc. and copy in the email addresses below:
_super.man@abc.com
super.man@abc.com
super.m/a/n@abc.com
super.man@abc.co.uk
super.man@abc.travel
super_man@abc.com
superMan@abc.com
123superman@abc.com
123@abc.com
.superman@abc.com

The above is all you'll need in your new file. Now hit CTRL + F, tick the "regular expression" box, untick "match case" and you're ready :)

So what will the pattern look like?
Well, every email address has an @ symbol.

Every email address also has text on either side of the @, so let's create a pattern using square brackets [].
Regex will match any one of the characters you place inside square brackets.
[a-zA-Z0-9_] will match a single character that is a lower case letter a-z, an upper case letter A-Z or a digit 0-9. I've also included an underscore, as many email addresses contain them.

Search for [a-zA-Z0-9_]@ and it will find the following
n@
etc., so any one character from the pattern followed by an @ symbol.
But if we add a + to the end of the pattern, it will match as many characters in a row as fit the pattern (at least one), stopping where the pattern is broken.
[a-zA-Z0-9_]+@

will now find

man@ etc
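
If you'd rather try this from ColdFusion than from an editor's find box, here is a rough sketch using reMatchNoCase(), which returns an array of everything the pattern matched. The variable names are just made up for this example.

```cfml
<!--- Sketch only: the variable names are made up for this example --->
<cfset testEmail = "super.man@abc.com" />

<!--- One character from the square brackets, then an @ --->
<cfset arrSingle = reMatchNoCase("[a-zA-Z0-9_]@", testEmail) />

<!--- One or more characters from the square brackets, then an @ --->
<cfset arrPlus = reMatchNoCase("[a-zA-Z0-9_]+@", testEmail) />

<cfoutput>
  Without the plus: #arrayToList(arrSingle)#<br />
  With the plus: #arrayToList(arrPlus)#
</cfoutput>
```

The first line should show n@ and the second man@, the same results the find box gives.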
If we add a dot to the pattern (inside square brackets a dot just means a literal dot), like
[a-zA-Z0-9_.]+@
it will now find 
_super.man@
super.man@ etc


However, we don't want a dot in the first position, as email addresses cannot start with a dot. What we can do is create 2 patterns, one without a dot followed by a new one with a dot
[a-zA-Z0-9_]+[a-zA-Z0-9_.]+@ (notice they both have a + plus symbol, so they'll both get repeated until the pattern is broken).
Because each pattern needs at least one character, this also means there must be at least two characters before the @.
The above regex will find all email addresses except for the one containing slashes.
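
To see what splitting the pattern in two actually changes, here is a small sketch (again with made-up variable names) that runs both versions against the address that starts with a dot.

```cfml
<!--- Sketch only: the variable names are made up for this example --->
<cfset testEmail = ".superman@abc.com" />

<!--- One pattern that allows a dot anywhere, including the first character --->
<cfset arrOnePattern = reMatchNoCase("[a-zA-Z0-9_.]+@", testEmail) />

<!--- Two patterns: the first cannot be a dot, the rest can --->
<cfset arrTwoPatterns = reMatchNoCase("[a-zA-Z0-9_]+[a-zA-Z0-9_.]+@", testEmail) />

<cfoutput>
  One pattern: #arrayToList(arrOnePattern)#<br />
  Two patterns: #arrayToList(arrTwoPatterns)#
</cfoutput>
```

The first should show .superman@ and the second superman@. Note that the search still finds something in that address, it just skips over the leading dot.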


So let's add a slash to the second pattern
[a-zA-Z0-9_]+[a-zA-Z0-9_./]+@
Now this will also return

super.m/a/n@


Slashes in email addresses are perfectly valid; however, I'm not sure if an email address can start with a slash.
Now we can test the first part of an email address, but this is of little use unless we can validate the second half.


Let's add a pattern straight after the @ symbol.
[a-zA-Z0-9_]+[a-zA-Z0-9_./]+@[a-zA-Z0-9_]+
This will now validate all email addresses up to the dot.


So let's add a dot
[a-zA-Z0-9_]+[a-zA-Z0-9_./]+@[a-zA-Z0-9_]+.
Now if you run this it'll work; however, it's actually broken. Try it on the following email address.
super.man@abc&.com


The dot is a "special character" in regex: outside square brackets it matches any character. Try it by appending a + after it and see what you get.
[a-zA-Z0-9_]+[a-zA-Z0-9_./]+@[a-zA-Z0-9_]+.+


What we want to do is escape it. For this we'll use a \ backslash, which will escape any special character.
[a-zA-Z0-9_]+[a-zA-Z0-9_./]+@[a-zA-Z0-9_]+\.


This should now treat the following as invalid.
super.man@abc&.com
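
Here is a quick sketch of the same check in ColdFusion, comparing the unescaped and escaped dot. reFindNoCase() returns the position of the match, or 0 if nothing was found. The variable names are made up for this example.

```cfml
<!--- Sketch only: the variable names are made up for this example --->
<cfset testEmail = "super.man@abc&.com" />

<!--- Unescaped dot: matches any character, so the & slips through --->
<cfset rgxAnyChar = "[a-zA-Z0-9_]+[a-zA-Z0-9_./]+@[a-zA-Z0-9_]+." />

<!--- Escaped dot: only a real dot will do --->
<cfset rgxRealDot = "[a-zA-Z0-9_]+[a-zA-Z0-9_./]+@[a-zA-Z0-9_]+\." />

<cfoutput>
  Unescaped dot: #reFindNoCase(rgxAnyChar, testEmail)#<br />
  Escaped dot: #reFindNoCase(rgxRealDot, testEmail)#
</cfoutput>
```

The unescaped version should report a match at position 1, while the escaped version should return 0.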


Let's append another pattern.
[a-zA-Z0-9_]+[a-zA-Z0-9_./]+@[a-zA-Z0-9_]+\.[a-zA-Z]+
This will validate .com at the end, but also dot anything else made of letters.
What if we want to check for .co.uk
or .com.au etc.?


[a-zA-Z0-9_]+[a-zA-Z0-9_./]+@[a-zA-Z0-9_]+\.[a-zA-Z]+\.[a-zA-Z]+
We only need to check for an extra Dot and another pattern of Alphabet characters.
This latest addition to the expression will find
super.man@abc.co.uk


Great! right?
The problem now is that it will no longer find
super.man@abc.com as it doesn't contain a second dot.

So what to do? Make it optional!!
[a-zA-Z0-9_]+[a-zA-Z0-9_./]+@[a-zA-Z0-9_]+\.[a-zA-Z]+\.?[a-zA-Z]+
Notice we've added the ? question mark after the dot; this makes the preceding item (here, the escaped dot) optional.
This will now validate all of the email addresses.
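
As a rough sketch of the difference the question mark makes, the loop below runs both versions of the regex against a .com and a .co.uk address. The variable names are made up for this example.

```cfml
<!--- Sketch only: the variable names are made up for this example --->
<cfset rgxTwoDots = "[a-zA-Z0-9_]+[a-zA-Z0-9_./]+@[a-zA-Z0-9_]+\.[a-zA-Z]+\.[a-zA-Z]+" />
<cfset rgxOptionalDot = "[a-zA-Z0-9_]+[a-zA-Z0-9_./]+@[a-zA-Z0-9_]+\.[a-zA-Z]+\.?[a-zA-Z]+" />

<cfloop index="testEmail" list="super.man@abc.com,super.man@abc.co.uk">
  <cfoutput>
    #testEmail#:
    second dot required = #reFindNoCase(rgxTwoDots, testEmail)#,
    second dot optional = #reFindNoCase(rgxOptionalDot, testEmail)#<br />
  </cfoutput>
</cfloop>
```

With the second dot required, the .com address should return 0 and the .co.uk address 1. With the optional dot, both should return 1.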


If they don't contain a second dot, the following
[a-zA-Z0-9_]+[a-zA-Z0-9_./]+@[a-zA-Z0-9_]+\.[a-zA-Z]+\.?[a-zA-Z]+
behaves the same as
[a-zA-Z0-9_]+[a-zA-Z0-9_./]+@[a-zA-Z0-9_]+\.[a-zA-Z]+[a-zA-Z]+
Which is almost the same as
[a-zA-Z0-9_]+[a-zA-Z0-9_./]+@[a-zA-Z0-9_]+\.[a-zA-Z]+
The only difference is that two + patterns in a row need at least two letters after the dot, where a single one would accept just one.


Now, as my lunch break is fast ending, I'll show you a little trick before sending you on your way to becoming a regex guru.

Wouldn't it be better if we could make our regex shorter by using some shorthand characters? Well, regex is great for this.
\w is identical to [a-zA-Z0-9_] (unless you're using dot net, where \w also matches letters from other alphabets by default)
So....
[a-zA-Z0-9_]+[a-zA-Z0-9_./]+@[a-zA-Z0-9_]+\.[a-zA-Z]+\.?[a-zA-Z]+
is identical to
[\w]+[\w./]+@[\w_]+\.[a-zA-Z]+\.?[a-zA-Z]+
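
If you want to convince yourself the two really behave the same in ColdFusion, a sketch like the one below runs both against a few of the test addresses and prints the position each one reports. The variable names are made up for this example.

```cfml
<!--- Sketch only: the variable names are made up for this example --->
<cfset rgxLong = "[a-zA-Z0-9_]+[a-zA-Z0-9_./]+@[a-zA-Z0-9_]+\.[a-zA-Z]+\.?[a-zA-Z]+" />
<cfset rgxShort = "[\w]+[\w./]+@[\w_]+\.[a-zA-Z]+\.?[a-zA-Z]+" />
<cfset lstTest = "_super.man@abc.com,super.m/a/n@abc.co.uk,123@abc.com,.superman@abc.com" />

<cfloop index="testEmail" list="#lstTest#">
  <!--- Position of the first match for each version (0 means no match) --->
  <cfset posLong = reFindNoCase(rgxLong, testEmail) />
  <cfset posShort = reFindNoCase(rgxShort, testEmail) />
  <cfoutput>
    #testEmail#: long = #posLong#, short = #posShort#<br />
  </cfoutput>
</cfloop>
```

The two numbers on each line should agree. The extra underscore in [\w_] is harmless, since \w already includes it.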

By now you've probably thought, but this will also validate
test@test.com.testttttttttt
Well, you're correct. For this you could add a character limit.
[\w]+[\w./]+@[\w_]+\.[a-zA-Z]+\.?[a-zA-Z]{2,4}
The {2,4} says "repeat the last pattern between 2 and 4 times."
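
To see what the limit does to the match itself, here is a sketch that prints the text each version picks out of that address. The variable names are made up for this example.

```cfml
<!--- Sketch only: the variable names are made up for this example --->
<cfset testEmail = "test@test.com.testttttttttt" />

<!--- Last pattern repeated as often as it likes --->
<cfset arrNoLimit = reMatchNoCase("[\w]+[\w./]+@[\w_]+\.[a-zA-Z]+\.?[a-zA-Z]+", testEmail) />

<!--- Last pattern limited to between 2 and 4 letters --->
<cfset arrLimit = reMatchNoCase("[\w]+[\w./]+@[\w_]+\.[a-zA-Z]+\.?[a-zA-Z]{2,4}", testEmail) />

<cfoutput>
  No limit: #arrayToList(arrNoLimit)#<br />
  With limit: #arrayToList(arrLimit)#
</cfoutput>
```

Without the limit the whole address should come back. With it the match should stop at test@test.com.test, so on its own the limit only shortens the match: the regex still finds something inside the string.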

Now if we want to validate, say, a form field using the reFindNoCase() function, we need to tell it where the start and end of the string are.
We do this by adding a ^ to the start of the expression and a $ to the end
<cfset rgxValidateEmail = "^[\w]+[\w./]+@[\w_]+\.[a-zA-Z]+\.?[a-zA-Z]{2,4}$" />
<cfif reFindNoCase(rgxValidateEmail, trim(form.email))>
  <!--- the email address passed validation --->
</cfif>

If you run the code below, it will flag the last email address as invalid. However, if you don't include the ^ and $, it will return all addresses as valid. In fact, the last one will return a value of 2, as that is the position where the match was found (the dot is in the first position).

<cfset rgxValidateEmail = "^[\w]+[\w./]+@[\w_]+\.[a-zA-Z]+\.?[a-zA-Z]{2,4}$" />
<cfset lstEmail = "_super.man@abc.com,super.man@abc.com,super.man@abc_123.com,super.m/a/n@abc.com,super.man@abc.co.uk,super.man@abc.travel,super_man@abc.com,superMan@abc.com,123superman@abc.com,123@abc.com,.superman@abc.com" />

<cfloop index="ListElement" list="#lstEmail#">
    <cfoutput>
        <cfif reFindNoCase(rgxValidateEmail, trim(ListElement))>
            #ListElement# <span style="color:green">Valid</span>
        <cfelse>
            #ListElement# <span style="color:red">Invalid</span>
        </cfif>
        <br />
    </cfoutput>
</cfloop>
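
To check the point about ^ and $ on its own, this sketch runs the address that starts with a dot through the regex with and without them. The variable names are made up for this example.

```cfml
<!--- Sketch only: the variable names are made up for this example --->
<cfset rgxAnchored = "^[\w]+[\w./]+@[\w_]+\.[a-zA-Z]+\.?[a-zA-Z]{2,4}$" />
<cfset rgxUnanchored = "[\w]+[\w./]+@[\w_]+\.[a-zA-Z]+\.?[a-zA-Z]{2,4}" />
<cfset testEmail = ".superman@abc.com" />

<cfoutput>
  With start and end markers: #reFindNoCase(rgxAnchored, testEmail)#<br />
  Without them: #reFindNoCase(rgxUnanchored, testEmail)#
</cfoutput>
```

The first should return 0 (invalid) and the second 2, because without the markers the regex is free to start matching at the second character.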

Now that you have a basic understanding of regex, you must realise that this validation will work for most email addresses you would come across,
but... what if someone uses more characters like % etc.? Well, you could add those to the pattern, but what if they use Japanese text?
Well, the following is meant to cover far more of what an email address is allowed to contain (although, looking at it, it only lists ASCII characters, so I wouldn't expect it to accept Japanese text either).
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Nice, isn't it?


What I suggest you do now is visit the following site.
http://www.regular-expressions.info
To understand why my regex isn't perfect, please see
http://www.regular-expressions.info/email.html
To get started with your own regular expressions, check out
http://www.regular-expressions.info/reference.html
It is the main page that I have bookmarked in my browser.

to do: polish up this tutorial and fix up the layout

Please let me know below if this has helped you understand a little regex.


 

 

Comments


Jim wrote on 06/01/10 6:55 AM

Nice work Matt. Reg Ex is definitely worth learning, it makes some of those mind bending problems actually quite easy. Some government agencies will actually give out regex code to ensure that you validate against their standard. For example the BS 7666 standard for postcodes uses the following.
[code]
(GIR 0AA|[A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]|[A-HK-Y][0-9]([0-9]|[ABEHMNPRV-Y]))|[0-9][A-HJKPS-UW]) [0-9][ABD-HJLNP-UW-Z]{2})
[/code]

As my new buddy and Coldbox guru, Luis Majano would say. 'They're Super Powerful, I would recommend you read more about them' But to be fair he says that about a lot of stuff!

Not sure about slashes in emails though, are you sure about that?

Jim wrote on 06/01/10 6:56 AM

Ah doesn't look like this shonky blog supports the [code][/code] tag. So if you're using the postcode example above, I would leave those off.

Who wrote this crappy site anyway?

matt wrote on 06/01/10 7:19 AM

Hi Jim,

I need to play around a lot more with what the blog can and can't do. I only realised today I hadn't implemented a preview page yet lol.
I was reading around and it turns out email addresses can have some weird characters. Many networks may have slashes in email addresses, which makes it hard to decide what kind of regex validation you want for an email address: too heavy and you might block some real email addresses, a little weak and you get some weird random addresses coming through your form.

Barny wrote on 06/02/10 3:00 AM

Does the regex cover '+' symbols in the user part of the address? One of the nifty things gmail lets you do is add to your address and have it all delivered to your gmail address, so if you had
example@gmail.com
then
example+whatever@gmail.com
will also be delivered to your address

http://www.googletutor.com/2005/06/11/gmail-plus-aliases/

Barny wrote on 06/02/10 3:01 AM

Oh, nice article BTW

Matt wrote on 06/02/10 3:41 AM

This regex is more of a tutorial rather than an actual recommended regex. As I mentioned in it, I'm not sure exactly what is valid in an email, as there are some weird characters you can have, like the slashes for example.
I think you can actually have john+test@test.com as an email address, I thought a + is allowed. I'm guessing Gmail doesn't allow people to register test+test@gmail.com, otherwise it'll break Gmail's feature.

The regex tutorial I wrote will not correctly validate all email addresses, however someone learning regex will now know where to begin in writing one that does, for this I included the links at the bottom of the article.

I just chose email addresses for the tutorial as it's one of the easiest things to learn from (I think).
I may actually write my own regex for an email form I'll put on this site, for this I'll be sure to take note of the gmail feature you mention.
cheers

© 2010 Matt Bourke