Larry Ullman's Book Forums


Sure, Larry, I understand – and after all, it's your forum, isn't it? You can even ban me from it at the click of a button whenever you wish.

 

But try to look at it from a different perspective. What if someone is searching Google for "regex to validate email" and finds this already popular thread in your forum? They will be attracted by the discussion, and then they may become curious: "Who is this guy, Larry Ullman? Oh, interesting, he wrote quite a few books on PHP! And on other languages! Why don't I check them out!"

 

And then that person will buy your books, read them, learn from them, and become a better web developer and programmer, and in the end there will be one more person in the world who likes and respects you for your teaching and writing abilities.

 

Is this a quixotic pursuit? Most likely, yes. It may even be herculean. But I think it's worth it. 

 

On the other hand – learning to forge effective regular expressions – is that a quixotic pursuit? I don't think so, I think it's actually a very reasonable and rational pursuit. 


Dimitri, I think Larry has already provided you with more than enough links and information to give you what you want.

Beyond what he has already given you, what do you want?

What is it that you still want to know (and please be specific), because I simply can't understand what else you want/need to know?

All the info you need is already available from what Larry has given you (which is an awful lot, I think).


Sometimes, it's not the destination that's important, but rather the journey itself. I get that part completely, Dimitri.

That being said, you have to respect other people's time. Larry moving on from this thread once a satisfying answer has been provided can't really be looked down upon. However, especially as a student and hobby programmer, I can appreciate the interest. Sometimes, those interests must be pursued on your own. Most people don't have the time, even if they'd probably want to and find the topic interesting.

I'm not really interested in regexes, and I definitely don't have the knowledge to help you. Just wanted to rant a little, as usual... And give you a little nod. ;)


Antonio, point taken, thank you. 

 

Ditto HartleySan, but to answer your question about what else I want to know, it's this:

 

I'm well aware I'm caught in the newbie trap of trying to invent the perfect regex to match an email address. But my goal isn't to match all possible versions of an email address; I just want to figure out a regex that would work with a majority of normal email addresses. The purpose of that "quixotic pursuit" is not to validate emails in any practical application, but rather to master the PHP flavor of regex, using email validation as an example for lack of a better one. And I'm stuck with that goal, because the chapter on regex in Larry's book is where that book suddenly became challenging for me. So if Larry wants to bail out, it's fine with me, and it's definitely his right, but I can't.

 

Does this make any sense to you?   

 

So, back to the "what do I still want to know" question:

 

I'm actually satisfied with the part of the most recent version of that regex that comes after the @. I looked far and wide and for the life of me I can't come up with the kind of valid domain name that wouldn't be matched by that part, so I'm okay with it. 

 

What comes before @, however, I don't like at all. It sucks. It would validate anything: _______@somewebsite.com, .@somewebsite.com, and so on. As I said, I want it to validate a reasonable majority of normal email addresses. 

 

So I want to figure out ways to improve it, and that's why I ask questions about it. And I could, of course, be doing that on some other guy's forum, but since it's Larry's book that I'm studying, it's only logical to ask the questions here. 

 

Help from someone knowledgeable would be appreciated, even though I obviously can't insist.


The first time I read this book, I also got stuck on the regex chapter (and I ended up leaving it for the longest time before I finally went back to it and really learned it). It's probably the hardest single chapter in the book.

In addition, I totally understand your interest in pursuing a good regex for educational reasons. That's fine.

What I can't understand is the following:

 

 

I just want to figure out the regex that would work with a majority of normal email addresses.

 

This is a very relative and arbitrary thing to say.

To be honest, I get the feeling that you yourself don't know exactly what you mean by the above. That's why I think you should take the time to sit down and think critically about what you want while taking notes.

 

To give an example, you might decide something like, "The local part of the address can contain underscores, but it must start with and contain a letter."

After you clearly and explicitly define what you want, then you'll have a better chance of trying to get what you want.
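Just to illustrate how a written-down rule like that turns into a pattern, here is a rough sketch. It assumes the rule is "the local part must start with a letter, and may then contain letters, digits, underscores, dots, and hyphens"; the part after the @ is simply borrowed from the book's pattern, and the sample addresses are only for testing.

```php
<?php
// Hypothetical pattern for the example rule: the first character of the
// local part must be a letter; the rest may be word characters, dots, or hyphens.
$pattern = '/^[A-Za-z][\w.-]*@[\w.-]+\.[A-Za-z]{2,6}$/';

$samples = [
    'some_name@somewebsite.com', // starts with a letter
    '_______@somewebsite.com',   // starts with an underscore
    '.@somewebsite.com',         // starts with a dot
];

foreach ($samples as $email) {
    $ok = preg_match($pattern, $email) === 1;
    echo str_pad($email, 28), $ok ? 'accepted' : 'rejected', PHP_EOL;
}
```

Only the first sample is accepted. A different rule would need a different pattern, which is exactly why the rule has to be pinned down first.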

Please take some time to think carefully about this, do the proper research, and then come back and let us know what you want if you can't get it.

At this point though, I think you have all the info you need to get what you want, but more than anything, you need to take the time to really sit down and think about it a lot. Slowly, it'll all start to make sense, and you should be all right after that.

 

To help guide you a bit, I have a feeling that the main thing you want to know that you don't know yet has to do with lookarounds. Using lookarounds, you can write regexes for things like, "The string must contain at least one letter and one number, and they can be in any order."

For more info about lookarounds, please see the following:

http://www.regular-expressions.info/lookaround.html
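As a small sketch of the "at least one letter and one number, in any order" idea, assuming the string may only contain ASCII letters and digits (that restriction is my assumption, not part of the rule itself):

```php
<?php
// Each (?=...) is a lookahead: it checks a condition from the current
// position without consuming any characters.
//   (?=.*[A-Za-z])  somewhere ahead there is a letter
//   (?=.*[0-9])     somewhere ahead there is a digit
// Only then does [A-Za-z0-9]+ consume the whole string.
$pattern = '/^(?=.*[A-Za-z])(?=.*[0-9])[A-Za-z0-9]+$/';

foreach (['abc123', '123abc', 'abcdef', '123456'] as $candidate) {
    $ok = preg_match($pattern, $candidate) === 1;
    echo str_pad($candidate, 8), $ok ? 'match' : 'no match', PHP_EOL;
}
```

The first two strings match and the last two do not, because one of the lookaheads fails for each of them.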


Thanks, HartleySan, and you're absolutely right: I'm now trying to define a set of requirements that would validate a reasonably large share of the most probable forms of whatever comes before the @ in an email address, and I'm right in the middle of researching that. Obviously, I'm not just sitting around waiting for someone to solve my problems: for the last few days I've been doing mostly research on regex.

 

Thanks for the resource on lookarounds, it's indeed very valuable!  


  • 9 months later...

I have a question about email matching with regular expressions. 

 

Chapter 14, Pg. 445, contains the following email matching pattern:

 

^[\w.-]+@[\w.-]+\.[A-Za-z]{2,6}$

 

I may be wrong, but wouldn't it match something like this?

 

somename@some_website.com

 

I would say yes, it would, because the shortcut \w after the @ matches letters, digits, and the underscore, according to the PHP manual. So we cannot exclude the underscore if we still want to use \w immediately after the @ sign.
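To check this quickly, here is a minimal sketch that runs the Chapter 14 pattern through preg_match(). The / delimiters and the list of samples are my own choices; the last sample is made up for contrast.

```php
<?php
// The pattern quoted above, wrapped in / delimiters for preg_match().
$pattern = '/^[\w.-]+@[\w.-]+\.[A-Za-z]{2,6}$/';

$samples = [
    'somename@some_website.com', // underscore after the @
    'somename@some-website.com', // hyphen after the @
    '_______@somewebsite.com',   // underscores only before the @
    'no-at-sign.example.com',    // made-up sample with no @ at all
];

foreach ($samples as $email) {
    $matched = preg_match($pattern, $email) === 1;
    echo str_pad($email, 30), $matched ? 'match' : 'no match', PHP_EOL;
}
```

The first three samples match, because \w covers the underscore on both sides of the @; only the last one fails.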

 

If you don't want the underscore after the @ sign, you may follow HartleySan's character class, [A-Za-z0-9]. However, this character class will reject a valid email such as somename@some-website.com, because it contains a dash (-).

 

So, to my understanding of this topic, if we want to allow any letters, digits, and hyphens, but not underscores, between the @ and the . in an email address, as per the original quest for the 'perfect' one, we can only go with a class like [-A-Za-z0-9] or [A-Za-z0-9-]. The hyphen has to sit at the very start or end of the class to be taken literally, and there must be no space inside the brackets, or spaces would be allowed too. (It does no harm if you want to escape the hyphen, as per Larry's syntax above.)
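A rough sketch of that idea follows. It keeps the overall shape of the book's pattern and only swaps the class after the @; I have also kept the dot in that class so subdomains still pass, which is my own assumption rather than something stated above.

```php
<?php
// Hypothetical variant: [A-Za-z0-9.-] after the @ instead of [\w.-],
// so underscores are rejected there while hyphens and dots still pass.
$pattern = '/^[\w.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$/';

$samples = [
    'somename@some_website.com', // underscore after the @
    'somename@some-website.com', // hyphen after the @
    'somename@some website.com', // space after the @
];

foreach ($samples as $email) {
    $ok = preg_match($pattern, $email) === 1;
    echo str_pad($email, 28), $ok ? 'accepted' : 'rejected', PHP_EOL;
}
```

Only the hyphenated address is accepted. The space is rejected because the class itself contains no space.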

 

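Besides, we could test another pattern like [\w.-]+[^_] between the @ and the ., but note that it only keeps an underscore out of the last position before the dot (and lets any other character in there), so it does not really exclude underscores.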

 

Hope this may help!


What is a valid email address? That's a really important question. I'm actually allowed to create the following email addresses on my host:

 

_@juvenorge.com

_._@juvenorge.com

!@juvenorge.com

#@juvenorge.com

$@juvenorge.com

=@juvenorge.com

?@juvenorge.com

^@juvenorge.com

 

I don't know if these are regarded as valid generally, but that being said, I can create them and probably send and receive emails from them. That at least brings up a few interesting questions for you to answer.


Yep. As one of the top answers on Stack Overflow about validating email addresses says, it's actually impossible to do completely with a regex alone.

And really, what matters is not so much the characters used in the email address as the fact that the address is real and, hopefully, one that the user actually owns.

 

Anyway, I say use filter_var and move on.
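For anyone who hasn't used it, a minimal sketch looks like this. The wrapper function name is just something I made up for the example; filter_var() and FILTER_VALIDATE_EMAIL are the real PHP pieces.

```php
<?php
// Hypothetical helper: true when PHP's built-in email filter accepts the string.
// filter_var() returns the filtered string on success and false on failure.
function isPlausibleEmail(string $email): bool
{
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(isPlausibleEmail('somename@some-website.com')); // bool(true)
var_dump(isPlausibleEmail('not an email'));              // bool(false)
```

This still only checks the syntax; it says nothing about whether the mailbox exists or belongs to the user.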


Yes, they can be regarded as valid ones.

 

In my opinion, a valid email address can be seen as a kind of virtual address (compare it to a physical home address) that someone creates to stay in contact with other people, and also as a means of exchanging digital messages with one another. So, as long as it meets the universal, programmatic structure of only one @ sign and at least one . (dot) after it, it is considered valid.

And a valid email address can never be a dead one, only a live one. I mean that a valid email address means nothing to computers, but it means something to human beings.

Therefore, your hosting company (human beings) may reason that different people prefer different kinds of email addresses. So they tolerate non-alphanumeric characters, and tell their computers (machines) to accept them.

So I think that when a programmer wants to validate an email address, he must:

1/ validate and sanitize the email structure syntactically (for example, there is no more than one @ sign in the structure).

2/ ask the email user or the email server to confirm that the email is valid and real (there is at least one person who owns it).

 

Then he is successful....right?
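The two steps could be sketched roughly like this. The function name and the token length are my own assumptions, and actually emailing the token and checking it when the user clicks the link is left out.

```php
<?php
// Step 1 (hypothetical helper): the loose structural check described above,
// i.e. exactly one @ sign, a non-empty part before it, and a dot after it.
function hasBasicEmailShape(string $email): bool
{
    if (substr_count($email, '@') !== 1) {
        return false;
    }
    [$local, $domain] = explode('@', $email);
    return $local !== '' && strpos($domain, '.') !== false;
}

// Step 2: a random token that would be mailed to the address, so that only
// someone who can read that mailbox is able to confirm it.
$token = bin2hex(random_bytes(16)); // 32 hexadecimal characters

var_dump(hasBasicEmailShape('_._@juvenorge.com')); // bool(true)
var_dump(hasBasicEmailShape('a@@b.com'));          // bool(false)
```

The structural check is deliberately loose; the confirmation step is what shows that a person really owns the address.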


  • 4 weeks later...

 

Dimitri, this regular expression allows for all possible valid email address syntaxes while not allowing any invalid email address:

 

 

(??:\r\n)?[ \t])*(??:(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(??:\r\n)?[ \t]))*"(??:
\r\n)?[ \t])*)(?:\.(??:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(??:\r\n)?[ 
\t]))*"(??:\r\n)?[ \t])*))*@(??:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](??:\r\n)?[ \t])*)(?:\.(??:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(??:\r\n)?[ \t]))*"(??:\r\n)
?[ \t])*)*\<(??:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](??:\r\n)?[
 \t])*)(?:\.(??:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](??:\r\n)?[ \t]
)*))*(?:,@(??:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[
 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](??:\r\n)?[ \t])*
)(?:\.(??:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](??:\r\n)?[ \t])*))*)
*?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(??:\r\n)?[ \t]))*"(??:\r
\n)?[ \t])*)(?:\.(??:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(??:\r\n)?[ \t
]))*"(??:\r\n)?[ \t])*))*@(??:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
??:\r\n)?[ \t])*)(?:\.(??:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](??
:\r\n)?[ \t])*))*\>(??:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(??
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(??:\r\n)?
[ \t]))*"(??:\r\n)?[ \t])*)*?:(?:\r\n)?[ \t])*(??:(?:[^()<>@,;:\\".\[\] 
\000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(??:\r\n)?[ \t]))*"(??:\r\n)?[ \t])*)(?:\.(??:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(??:\r\n)?[ \t]))*"(??:\r\n)?[ \t])*))*@(??:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](??:\r\n)?[ \t])*)(?:\.(??:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](??:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
??:\r\n)?[ \t]))*"(??:\r\n)?[ \t])*)*\<(??:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](??:\r\n)?[ \t])*)(?:\.(??:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](??:\r\n)?[ \t])*))*(?:,@(??:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](??:\r\n)?[ \t])*)(?:\.(??:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
\000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](??:\r\n)?[ \t])*))*)*?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(??:\r\n)?[ \t]))*"(??:\r\n)?[ \t])*)(?:\.(??:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(??:\r\n)?[ \t]))*"(??:\r\n)?[ \t])*))*@(??:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](??:\r\n)?[ \t])*)(?:\.(??:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](??:\r\n)?[ \t])*))*\>(??:\r\n)?[ \t])*)(?:,\s*(
??:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(??:\r\n)?[ \t]))*"(??:\r\n)?[ \t])*)(?:\.(?
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(??:\r\n)?[ \t]))*"(??:\r\n)?[ \t
])*))*@(??:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](??:\r\n)?[ \t])*)(?
:\.(??:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](??:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(??:\r\n)?[ \t]))*"(??:\r\n)?[ \t])*)*\<(??:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](??:\r\n)?[ \t])*)(?:\.(??:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](??:\r\n)?[ \t])*))*(?:,@(??:\r\n)?[
 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](??:\r\n)?[ \t])*)(?:\.(??:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](??:\r\n)?[ \t])*))*)*?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(??:\r\n)?[ \t]))*"(??:\r\n)?[ \t])*)(?:\.(??:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(??:\r\n)?[ \t]))*"(??:\r\n)?[ \t])
*))*@(??:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](??:\r\n)?[ \t])*)(?:\
.(??:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(??:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](??:\r\n)?[ \t])*))*\>(?
?:\r\n)?[ \t])*))*)?;\s*) 
That does not allow for comments in email addresses, though, which are technically allowed. This is why most developers either use filter_var() or use a minimal regular expression that just catches obvious fakes.

 

 

LOL...sorry, I almost felt something like that coming, and then it came, better than I expected.. 

