Constructing a regular expression that matches URIs

URI matching is commonly needed, most notably for URLs: chat clients use it to turn addresses in otherwise plain (and not hyper-) text into links. However, many regexes that are supposed to do exactly that fail when they encounter uncommon yet valid characters, because programmers don’t follow the RFC (many probably don’t even read it).

Additionally, users are stupid: while RFC 3986 recommends angle brackets (chevrons) for delimiting URIs in plain text, people often use parentheses instead. When developers try to compensate for this, they create undesired, and more often than not unexpected, behaviour: links created from perfectly valid URIs are suddenly broken. See, for example, the chat client Gajim (and also the bugtracker / wiki Trac).

According to RFC 3986, subsection 1.1.1, each URI begins with a scheme name, which, according to subsection 3.1, consists of a sequence of characters beginning with a letter and followed by any combination of letters, digits, plus (“+”), period (“.”), or hyphen (“-”). Therefore, the correct regular expression for a scheme name is [A-Za-z][A-Za-z0-9\+\.\-]* .
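
The scheme expression is easy to try out in isolation. What follows is only a minimal sketch in Python (my choice of language for illustration); the helper name is_scheme is hypothetical and not part of any library. It uses a full match so that the whole string, not just a prefix, has to be a valid scheme name.

```python
import re

# Scheme name as described above: one letter, then any mix of
# letters, digits, "+", "." and "-".
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9\+\.\-]*")


def is_scheme(name):
    """Hypothetical helper: True if the whole string is a valid scheme name."""
    return SCHEME.fullmatch(name) is not None


# A few candidates: the last two start with a digit and a hyphen
# respectively, so they are not valid scheme names.
for candidate in ("http", "svn+ssh", "z39.50r", "3com", "-x"):
    print(candidate, is_scheme(candidate))
```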

After the scheme name, a colon (“:”) follows. The rest is scheme-specific syntax; according to sections 2.2 and 2.3 we only know that it uses a limited set of characters, namely those reserved for delimiting data (“:”, “/”, “?”, “#”, “[”, “]”, “@”, “!”, “$”, “&”, “'”, “(”, “)”, “*”, “+”, “,”, “;”, “=”) and the unreserved ones, which include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde. This extends the regular expression to [A-Za-z][A-Za-z0-9\+\.\-]*:[A-Za-z0-9\.\-_~:/\?#\[\]@!\$&'\(\)\*\+,;=]+ with the metacharacters (“[”, “]”, “$”, “.”, “?”, “*”, “+”, “(”, “)”) and the range specifier (“-”) properly escaped with a backslash, of course.
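
To see what this expression picks out of running text, here is a small Python sketch. The helper first_uri and the sample strings are made up for illustration. Note how a character outside the allowed set (a space, or the "ß" in the third sample) simply ends the match.

```python
import re

# Scheme, colon, then one or more reserved or unreserved characters.
URI_PLAIN = re.compile(
    r"[A-Za-z][A-Za-z0-9\+\.\-]*:[A-Za-z0-9\.\-_~:/\?#\[\]@!\$&'\(\)\*\+,;=]+"
)


def first_uri(text):
    """Hypothetical helper: return the first match in text, or None."""
    match = URI_PLAIN.search(text)
    return match.group(0) if match else None


samples = (
    "mail me at mailto:user@example.com today",
    "http://example.com/a-b_c~d",
    "http://example.com/straße",  # the match stops in front of the "ß"
    "no link here",
)
for sample in samples:
    print(repr(first_uri(sample)))
```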

"But wait", you may be thinking right now, "how can I include other characters - umlauts, for example - in URIs, then?" Well, you can't. But you can describe a resource that contains characters not listed in the above paragraph by means of percent-encoding, a method detailed in section 2.1 to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component. A percent-encoded character takes the form of a percent character ("%") followed by two hexadecimal digits - the space character, for example, is encoded as "%20". This gives us the expression %[A-Fa-f0-9]{2} , which can be added to the existing URI matching expression: [A-Za-z][A-Za-z0-9\+\.\-]*:([A-Za-z0-9\.\-_~:/\?#\[\]@!\$&'\(\)\*\+,;=]|%[A-Fa-f0-9]{2})+ will catch any valid URI (and probably some invalid ones too).

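As a rough illustration of percent-encoding together with the extended expression, the following Python sketch encodes a string containing umlauts with urllib.parse.quote from the standard library and checks the result. The helper is_uri and the example.com address are assumptions made up for this example. A lone "%" or a "%" followed by non-hexadecimal characters is rejected.

```python
import re
from urllib.parse import quote

# The expression from above, split over two lines for readability.
URI = re.compile(
    r"[A-Za-z][A-Za-z0-9\+\.\-]*:"
    r"([A-Za-z0-9\.\-_~:/\?#\[\]@!\$&'\(\)\*\+,;=]|%[A-Fa-f0-9]{2})+"
)


def is_uri(text):
    """Hypothetical helper: True if the whole string matches the expression."""
    return URI.fullmatch(text) is not None


# quote() percent-encodes the UTF-8 octets of characters outside its safe set.
encoded = "http://example.com/" + quote("Müller Straße")

print(encoded, is_uri(encoded))
print(is_uri("http://example.com/Müller"))  # raw umlaut, not encoded
print(is_uri("http://example.com/100%"))    # "%" without two hex digits
```
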
Now, what about the parentheses problem mentioned at the beginning? A simple solution is to define an additional expression that matches URIs, but only if they are preceded by an opening parenthesis (this feature is called "positive lookbehind") and followed by a closing parenthesis ("positive lookahead"). We get (?<=\()[A-Za-z][A-Za-z0-9\+\.\-]*:([A-Za-z0-9\.\-_~:/\?#\[\]@!\$&'\(\)\*\+,;=]|%[A-Fa-f0-9]{2})+(?=\)) . Combining the two massive expressions by means of a simple OR yields the final result: ((?<=\()[A-Za-z][A-Za-z0-9\+\.\-]*:([A-Za-z0-9\.\-_~:/\?#\[\]@!\$&'\(\)\*\+,;=]|%[A-Fa-f0-9]{2})+(?=\)))|([A-Za-z][A-Za-z0-9\+\.\-]*:([A-Za-z0-9\.\-_~:/\?#\[\]@!\$&'\(\)\*\+,;=]|%[A-Fa-f0-9]{2})+) .
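
The behaviour of the combined expression is easiest to judge on a few inputs. The Python sketch below is an assumption-laden illustration, not how Gajim does it: the names BODY, LINKIFY and find_uris are hypothetical, the pattern is assembled by string concatenation instead of being written out twice, and the inner group is non-capturing, which does not change what is matched.

```python
import re

# The plain URI expression, with a non-capturing group (?:...) inside.
BODY = (
    r"[A-Za-z][A-Za-z0-9\+\.\-]*:"
    r"(?:[A-Za-z0-9\.\-_~:/\?#\[\]@!\$&'\(\)\*\+,;=]|%[A-Fa-f0-9]{2})+"
)

# First alternative: URI wrapped directly in parentheses (lookbehind and
# lookahead). Second alternative: the plain URI expression.
LINKIFY = re.compile(r"(?<=\()" + BODY + r"(?=\))" + "|" + BODY)


def find_uris(text):
    """Hypothetical helper: list every match found in text."""
    return [match.group(0) for match in LINKIFY.finditer(text)]


for sample in (
    "see (http://example.com/page) for details",
    "http://en.wikipedia.org/wiki/Tree_(disambiguation)",
    "(http://en.wikipedia.org/wiki/Tree_(disambiguation))",
    "(see http://example.com/)",
):
    print(find_uris(sample))
```

In the first three samples the surrounding parentheses are left out of the match, while parentheses that belong to the URI are kept. The last sample shows a limitation: the opening parenthesis does not directly precede the URI, so the lookbehind does not apply and the closing parenthesis ends up in the match.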

Update: Shortly after Gajim implemented it, it became clear that I had forgotten to escape the literal hyphen ("-") inside the character classes, so hyphens in URIs were not matched. This has since been corrected (in this post and in Gajim).

Since regular expressions can be locale-sensitive, I suggest using the C locale.

26 June 2008 by admin
Categories: Original content, Technology
