<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Die Welt ist gar nicht so. &#187; uri</title>
	<atom:link href="http://blog.dieweltistgarnichtso.net/tag/uri/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.dieweltistgarnichtso.net</link>
	<description>Sie ist ganz anders.</description>
	<lastBuildDate>Sat, 04 Sep 2010 13:45:05 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Constructing a regular expression that matches URIs</title>
		<link>http://blog.dieweltistgarnichtso.net/constructing-a-regular-expression-that-matches-uris</link>
		<comments>http://blog.dieweltistgarnichtso.net/constructing-a-regular-expression-that-matches-uris#comments</comments>
		<pubDate>Thu, 26 Jun 2008 19:27:25 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Originärer Inhalt]]></category>
		<category><![CDATA[Technik]]></category>
		<category><![CDATA[gajim]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[uri]]></category>

		<guid isPermaLink="false">http://blog.dieweltistgarnichtso.net/?p=34</guid>
		<description><![CDATA[
URI matching is commonly needed, most notably for URL matching &#8211; chat clients use this to create links in what is otherwise plain (and not hyper-) text. However, many regexes that are supposed to do exactly that fail on encountering uncommon, yet valid characters, because programmers don&#8217;t follow the RFC (many probably don&#8217;t even read [...]]]></description>
			<content:encoded><![CDATA[<p>
<a href="http://en.wikipedia.org/wiki/URI">URI</a> matching is commonly needed, most notably for <a href="http://en.wikipedia.org/wiki/URI">URL</a> matching &#8211; chat clients use this to create links in what is otherwise plain (and not hyper-) text. However, many regexes that are supposed to do exactly that fail on encountering uncommon, yet valid characters, because programmers don&#8217;t follow the RFC (many probably don&#8217;t even read it).
</p>
<p>
Additionally, users are <em>stupid</em>: While <a href="http://tools.ietf.org/html/rfc3986">RFC 3986</a> recommends <a href="http://en.wikipedia.org/wiki/Brackets#Angle_brackets_or_chevrons_.E2.8C.A9.C2.A0.E2.8C.AA"><em>chevrons</em></a> for delimiting URIs in running text, people often use <a href="http://en.wikipedia.org/wiki/Brackets#Parentheses__.28_.29"><em>parentheses</em></a> instead. When developers try to compensate for this, they create undesired &#8211; and, more often than not, unexpected &#8211; behaviour: Links created from <em>perfectly valid URIs</em> are suddenly broken &#8211; see, for example, <a href="http://trac.gajim.org/ticket/3715">the chat client Gajim</a> (and also the bugtracker / wiki Trac).
</p><p>
According to RFC 3986, <a href="http://tools.ietf.org/html/rfc3986#section-1.1.1">subsection 1.1.1</a>, <q>[e]ach URI begins with a <em>scheme name</em></q>, and according to <a href="http://tools.ietf.org/html/rfc3986#section-3.1">subsection 3.1</a>, scheme names <q>consist of a sequence of characters beginning with a letter and followed by any combination of letters, digits, plus (&#8220;+&#8221;), period (&#8220;.&#8221;), or hyphen (&#8220;-&#8221;)</q>. Therefore, the correct regular expression for a scheme name is
<code>
[A-Za-z][A-Za-z0-9\+\.\-]*
</code>
.
</p>
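<p>
As a quick check, the scheme expression can be tried out in Python (the language Gajim is written in). This is only a minimal sketch: the helper name <code>is_scheme</code> is made up for illustration, and <code>fullmatch</code> is used so that the whole string has to be a scheme name.
</p>

```python
import re

# Scheme name per RFC 3986, section 3.1: one letter, followed by any
# combination of letters, digits, "+", "." and "-".
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9\+\.\-]*")


def is_scheme(name):
    """Return True if the whole string is a syntactically valid scheme name.

    Hypothetical helper, not part of Gajim or any library.
    """
    return SCHEME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("http", "svn+ssh", "3com", ""):
        print(repr(candidate), is_scheme(candidate))
```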
<p>
After the scheme name, a colon (&#8220;:&#8221;) follows &#8211; the rest is scheme-specific syntax; according to sections <a href="http://tools.ietf.org/html/rfc3986#section-2.2">2.2</a> and <a href="http://tools.ietf.org/html/rfc3986#section-2.3">2.3</a> we only know that it uses a limited set of characters, namely those reserved for delimiting data (&#8220;:&#8221;, &#8220;/&#8221;, &#8220;?&#8221;, &#8220;#&#8221;, &#8220;[&#8221;, &#8220;]&#8221;, &#8220;@&#8221;, &#8220;!&#8221;, &#8220;$&#8221;, &#8220;&amp;&#8221;, &#8220;'&#8221;, &#8220;(&#8221;, &#8220;)&#8221;, &#8220;*&#8221;, &#8220;+&#8221;, &#8220;,&#8221;, &#8220;;&#8221;, &#8220;=&#8221;) and the unreserved ones, which <q>include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde</q>. This extends the regular expression to
<code>
[A-Za-z][A-Za-z0-9\+\.\-]*:[A-Za-z0-9\.\-_~:/\?#\[\]@!\$&amp;'\(\)\*\+,;=]+
</code>
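&#8211; with the metacharacters (&#8220;[&#8221;, &#8220;\&#8221;, &#8220;$&#8221;, &#8220;.&#8221;, &#8220;?&#8221;, &#8220;*&#8221;, &#8220;+&#8221;, &#8220;(&#8221;, &#8220;)&#8221;) <ins>and the range specifier (&#8220;-&#8221;)</ins> properly escaped, of course.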
</p>
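<p>
To see what this intermediate expression does on real input, here is a small Python sketch. The helper <code>first_uri</code> and the sample strings are assumptions made for illustration. Note how the match simply stops at the first character outside the allowed set.
</p>

```python
import re

# Scheme, colon, then one or more reserved or unreserved characters.
# Percent-encoding is deliberately not handled yet at this stage.
URI_NO_PCT = re.compile(
    r"[A-Za-z][A-Za-z0-9\+\.\-]*:"
    r"[A-Za-z0-9\.\-_~:/\?#\[\]@!\$&'\(\)\*\+,;=]+"
)


def first_uri(text):
    """Return the first URI-like substring of text, or None.

    Hypothetical helper for illustration only.
    """
    match = URI_NO_PCT.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(first_uri("mail me at mailto:user@example.com please"))
    # The u-umlaut is not an allowed character, so the match ends before it.
    print(first_uri("http://example.com/gr\u00fcn"))
```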
<p>
"But wait", you may be thinking right now, "how can I include other characters - umlauts, for example - in URIs, then ?" Well, you <em>can't</em>. But you <em>can</em> describe a resource that contains characters not listed in the above paragraph by means of <em>percent-encoding</em>, a method detailed in <a href="http://tools.ietf.org/html/rfc3986#section-2.1">section 2.1</a> to <q>represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component</q>. A percent-encoded character takes the form of a percent character ("%"), followed by two hexadecimal digits - the space character, for example, is encoded as "%20". This gives us the expression
<code>
%[A-Fa-f0-9]{2}
</code>
, which can be added to the existing URI matching expression:
<code>
[A-Za-z][A-Za-z0-9\+\.\-]*:([A-Za-z0-9\.\-_~:/\?#\[\]@!\$&amp;'\(\)\*\+,;=]|%[A-Fa-f0-9]{2})+
</code>
This expression will catch any valid URI (and probably some invalid ones too).
</p>
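<p>
A minimal Python sketch of how a chat client might use the expression with percent-encoding included. The function name <code>find_uris</code> is an assumption, not Gajim's actual code. <code>finditer</code> is used instead of <code>findall</code> because the expression contains a capturing group, and <code>findall</code> would return only that group.
</p>

```python
import re

# Scheme, colon, then one or more allowed characters or percent-encoded octets.
URI = re.compile(
    r"[A-Za-z][A-Za-z0-9\+\.\-]*:"
    r"([A-Za-z0-9\.\-_~:/\?#\[\]@!\$&'\(\)\*\+,;=]|%[A-Fa-f0-9]{2})+"
)


def find_uris(text):
    """Return every URI-like substring of text, in order of appearance.

    Hypothetical helper for illustration only.
    """
    return [match.group(0) for match in URI.finditer(text)]


if __name__ == "__main__":
    print(find_uris("see http://example.com/a%20b and xmpp:user@example.org"))
    # A lone "%" is not a valid percent-encoding, so the match ends before it.
    print(find_uris("http://example.com/100%"))
```

<p>
One caveat with this sketch: a full stop directly after a URI is an allowed character and therefore becomes part of the match.
</p>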
<p>
Now, what about the parenthesis problem that surfaced in the beginning? A simple solution is to define an additional expression that matches URIs, but only if they are preceded by an opening parenthesis (this feature is called "positive lookbehind") and followed by a closing parenthesis ("positive lookahead"). We get
<code>
(?&lt;=\()[A-Za-z][A-Za-z0-9\+\.\-]*:([A-Za-z0-9\.\-_~:/\?#\[\]@!\$&amp;'\(\)\*\+,;=]|%[A-Fa-f0-9]{2})+(?=\))
</code>
.
Combining the two massive expressions by means of a simple <em>OR</em> yields the final result:
<code>
((?&lt;=\()[A-Za-z][A-Za-z0-9\+\.\-]*:([A-Za-z0-9\.\-_~:/\?#\[\]@!\$&amp;'\(\)\*\+,;=]|%[A-Fa-f0-9]{2})+(?=\)))|([A-Za-z][A-Za-z0-9\+\.\-]*:([A-Za-z0-9\.\-_~:/\?#\[\]@!\$&amp;'\(\)\*\+,;=]|%[A-Fa-f0-9]{2})+)
</code>
.
</p>
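<p>
The combined expression can be sketched in Python as follows. The names <code>URI_BODY</code>, <code>URI_IN_TEXT</code> and <code>extract_uris</code> are made up for this example, and the groups are written as non-capturing <code>(?:...)</code>, which is a small deviation from the expression above that does not change what is matched. The parenthesised alternative comes first, so it is tried before the plain one at every position.
</p>

```python
import re

# The URI expression built up so far, with non-capturing groups.
URI_BODY = (
    r"[A-Za-z][A-Za-z0-9\+\.\-]*:"
    r"(?:[A-Za-z0-9\.\-_~:/\?#\[\]@!\$&'\(\)\*\+,;=]|%[A-Fa-f0-9]{2})+"
)

# First alternative: a URI enclosed in parentheses (lookbehind and lookahead
# keep the parentheses out of the match). Second alternative: any other URI.
URI_IN_TEXT = re.compile(
    r"(?:(?<=\()" + URI_BODY + r"(?=\)))|(?:" + URI_BODY + r")"
)


def extract_uris(text):
    """Return all URIs found in text (hypothetical helper)."""
    return [match.group(0) for match in URI_IN_TEXT.finditer(text)]


if __name__ == "__main__":
    # The enclosing parentheses are not part of the link.
    print(extract_uris("(http://example.com/)"))
    # Parentheses inside a bare URI stay intact.
    print(extract_uris("http://en.wikipedia.org/wiki/URI_(disambiguation)"))
```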

<ins><p>
<strong>Update:</strong> Shortly after Gajim <a href="http://trac.gajim.org/changeset/9845">implemented</a> it, it became clear that I had forgotten to escape the literal minus sign ("-"), which wouldn't be matched then. This has since been corrected (in this post and <a href="http://trac.gajim.org/changeset/9852">in Gajim</a>).
</p></ins>]]></content:encoded>
			<wfw:commentRss>http://blog.dieweltistgarnichtso.net/constructing-a-regular-expression-that-matches-uris/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
