<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Die Welt ist gar nicht so. &#187; uri</title>
	<atom:link href="http://blog.dieweltistgarnichtso.net/tag/uri/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.dieweltistgarnichtso.net</link>
	<description>Sie ist ganz anders.</description>
	<lastBuildDate>Mon, 23 Sep 2013 15:41:20 +0000</lastBuildDate>
	<language>de-DE</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=4.0.35</generator>
	<item>
		<title>Constructing a regular expression that matches URIs</title>
		<link>http://blog.dieweltistgarnichtso.net/constructing-a-regular-expression-that-matches-uris</link>
		<comments>http://blog.dieweltistgarnichtso.net/constructing-a-regular-expression-that-matches-uris#comments</comments>
		<pubDate>Thu, 26 Jun 2008 19:27:25 +0000</pubDate>
		<dc:creator><![CDATA[admin]]></dc:creator>
				<category><![CDATA[Originärer Inhalt]]></category>
		<category><![CDATA[Technik]]></category>
		<category><![CDATA[gajim]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[uri]]></category>

		<guid isPermaLink="false">http://blog.dieweltistgarnichtso.net/?p=34</guid>
		<description><![CDATA[URI matching is commonly needed, most notably for URL matching &#8211; chat clients use this to create links in what is otherwise plain (and not hyper-) text. However, many regexes that are supposed to do exactly that fail on encountering &#8230; <a href="http://blog.dieweltistgarnichtso.net/constructing-a-regular-expression-that-matches-uris">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>
<a href="http://en.wikipedia.org/wiki/URI">URI</a> matching is commonly needed, most notably for <a href="http://en.wikipedia.org/wiki/URI">URL</a> matching &#8211; chat clients use it to turn addresses in otherwise plain (and not hyper-) text into links. However, many regexes that are supposed to do exactly that fail on encountering uncommon yet valid characters, because programmers don&#8217;t follow the RFC (many probably don&#8217;t even read it).
</p>
<p>
Additionally, users are <em>stupid</em>: Although <a href="http://tools.ietf.org/html/rfc3986">RFC 3986</a> recommends <a href="http://en.wikipedia.org/wiki/Brackets#Angle_brackets_or_chevrons_.E2.8C.A9.C2.A0.E2.8C.AA"><em>chevrons</em></a> for delimiting URIs, people often use <a href="http://en.wikipedia.org/wiki/Brackets#Parentheses__.28_.29"><em>parentheses</em></a> instead. When developers try to compensate for this, they create undesired &#8211; and, more often than not, unexpected &#8211; behaviour: Links created from <em>perfectly valid URIs</em> are suddenly broken &#8211; see, for example, <a href="http://trac.gajim.org/ticket/3715">the chat client Gajim</a> (and also the bugtracker / wiki Trac).
</p><p>
According to RFC 3986, <a href="http://tools.ietf.org/html/rfc3986#section-1.1.1">subsection 1.1.1</a>, <q>URI[s] begin[s] with a <em>scheme name</em></q>, which, according to <a href="http://tools.ietf.org/html/rfc3986#section-3.1">subsection 3.1</a>, <q>consist[s] of a sequence of characters beginning with a letter and followed by any combination of letters, digits, plus (&#8220;+&#8221;), period (&#8220;.&#8221;), or hyphen (&#8220;-&#8221;)</q>. Therefore, the correct regular expression for a scheme name is
<code>
[A-Za-z][A-Za-z0-9\+\.\-]*
</code>
.
</p>
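<p>
As a minimal sketch, the scheme-name expression above can be tried out in Python. The names <code>SCHEME</code> and <code>is_scheme</code> are made up for this illustration and do not come from Gajim or any library.
</p>

```python
import re

# Scheme-name pattern from the text above:
# one letter, then any mix of letters, digits, "+", "." and "-".
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9\+\.\-]*")


def is_scheme(name):
    """Return True if the whole string is a syntactically valid scheme name.

    Hypothetical helper, only for illustration.
    """
    return SCHEME.fullmatch(name) is not None


print(is_scheme("http"))     # True
print(is_scheme("svn+ssh"))  # True
print(is_scheme("3com"))     # False: a scheme must start with a letter
```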
<p>
After the scheme name, a colon (&#8220;:&#8221;) follows &#8211; the rest is scheme-specific syntax; according to sections <a href="http://tools.ietf.org/html/rfc3986#section-2.2">2.2</a> and <a href="http://tools.ietf.org/html/rfc3986#section-2.3">2.3</a> we only know that it uses a limited set of characters, namely those reserved for delimiting data (&#8220;:&#8221;, &#8220;/&#8221;, &#8220;?&#8221;, &#8220;#&#8221;, &#8220;[&#8221;, &#8220;]&#8221;, &#8220;@&#8221;, &#8220;!&#8221;, &#8220;$&#8221;, &#8220;&amp;&#8221;, &#8220;&#39;&#8221;, &#8220;(&#8221;, &#8220;)&#8221;, &#8220;*&#8221;, &#8220;+&#8221;, &#8220;,&#8221;, &#8220;;&#8221;, &#8220;=&#8221;) and unreserved ones, which <q>include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde</q>. This extends the regular expression to
<code>
[A-Za-z][A-Za-z0-9\+\.\-]*:[A-Za-z0-9\.\-_~:/\?#\[\]@!\$&amp;'\(\)\*\+,;=]+
</code>
&#8211; metacharacters (&#8220;[&#8221;, &#8220;\&#8221;, &#8220;$&#8221;, &#8220;.&#8221;, &#8220;?&#8221;, &#8220;*&#8221;, &#8220;+&#8221;, &#8220;(&#8221;, &#8220;)&#8221;) <ins>and the range specifier (&#8220;-&#8221;)</ins> properly escaped, of course.
</p>
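<p>
To see the extended expression at work, here is a small Python sketch that pulls URIs out of a line of chat text. The sample sentence and the name <code>URI</code> are assumptions made for this example only.
</p>

```python
import re

# Scheme, colon, then one or more reserved or unreserved characters,
# exactly as in the expression above (percent-encoding is not handled here).
URI = re.compile(
    r"[A-Za-z][A-Za-z0-9\+\.\-]*:"
    r"[A-Za-z0-9\.\-_~:/\?#\[\]@!\$&'\(\)\*\+,;=]+"
)

# Made-up chat message for illustration.
text = "see http://example.org/a_b?x=1#frag and mailto:user@example.org now"

print(URI.findall(text))
# ['http://example.org/a_b?x=1#frag', 'mailto:user@example.org']
```

<p>
Note that the match simply runs until the first character outside the allowed set, so punctuation directly after a URI (a full stop ending a sentence, say) becomes part of the match.
</p>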
<p>
&#8220;But wait&#8221;, you may be thinking right now, &#8220;how can I include other characters &#8211; umlauts, for example &#8211; in URIs, then?&#8221; Well, you <em>can&#8217;t</em>. But you <em>can</em> describe a resource that contains characters not listed in the above paragraph by means of <em>percent-encoding</em>, a method detailed in <a href="http://tools.ietf.org/html/rfc3986#section-2.1">section 2.1</a> to <q>represent a data octet in a component when that octet&#8217;s corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component</q>. A percent-encoded character takes the form of a percent character (&#8220;%&#8221;), followed by two hexadecimal digits &#8211; the space character, for example, is encoded as &#8220;%20&#8221;. This gives us the expression
<code>
%[A-Fa-f0-9]{2}
</code>
, which can be added to the existing URI matching expression:
<code>
[A-Za-z][A-Za-z0-9\+\.\-]*:([A-Za-z0-9\.\-_~:/\?#\[\]@!\$&amp;'\(\)\*\+,;=]|%[A-Fa-f0-9]{2})+
</code>
This expression matches any valid URI with a non-empty scheme-specific part (and probably some invalid strings too).
</p>
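<p>
The following Python sketch shows percent-encoding together with the expression above. It uses <code>urllib.parse.quote</code> from the standard library, which encodes text as UTF-8 by default; the example path and the names <code>PCT</code> and <code>URI</code> are assumptions for illustration.
</p>

```python
import re
from urllib.parse import quote

# A single percent-encoded octet: "%" followed by two hexadecimal digits.
PCT = re.compile(r"%[A-Fa-f0-9]{2}")

# The URI expression from above, with percent-encoded octets allowed.
URI = re.compile(
    r"[A-Za-z][A-Za-z0-9\+\.\-]*:"
    r"([A-Za-z0-9\.\-_~:/\?#\[\]@!\$&'\(\)\*\+,;=]|%[A-Fa-f0-9]{2})+"
)

# An umlaut and a space are not allowed literally, so they get encoded.
encoded = quote("Köln Hbf")
print(encoded)               # K%C3%B6ln%20Hbf
print(PCT.findall(encoded))  # ['%C3', '%B6', '%20']

good = "http://example.org/" + encoded
bad = "http://example.org/Köln Hbf"
print(URI.fullmatch(good) is not None)  # True
print(URI.fullmatch(bad) is not None)   # False
```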
<p>
Now, what about the parenthesis problem that surfaced in the beginning? A simple solution is to define an additional expression that matches URIs only if they are preceded by an opening parenthesis (this feature is called &#8220;positive lookbehind&#8221;) and followed by a closing parenthesis (&#8220;positive lookahead&#8221;). We get
<code>
(?&lt;=\()[A-Za-z][A-Za-z0-9\+\.\-]*:([A-Za-z0-9\.\-_~:/\?#\[\]@!\$&amp;'\(\)\*\+,;=]|%[A-Fa-f0-9]{2})+(?=\))
</code>
.
Combining the two massive expressions by means of a simple <em>OR</em> yields the final result:
<code>
((?&lt;=\()[A-Za-z][A-Za-z0-9\+\.\-]*:([A-Za-z0-9\.\-_~:/\?#\[\]@!\$&amp;'\(\)\*\+,;=]|%[A-Fa-f0-9]{2})+(?=\)))|([A-Za-z][A-Za-z0-9\+\.\-]*:([A-Za-z0-9\.\-_~:/\?#\[\]@!\$&amp;'\(\)\*\+,;=]|%[A-Fa-f0-9]{2})+)
</code>
.
</p>
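<p>
A rough Python sketch of the combined expression follows. To keep it readable, it builds the pattern from a shared part and uses non-capturing groups (<code>(?:...)</code>) instead of the capturing groups shown above; that change, and the names <code>URI_BODY</code>, <code>LINK</code> and <code>find_uris</code>, are assumptions of this example rather than what Gajim does.
</p>

```python
import re

# Shared part: scheme, colon, then allowed characters or percent-encoded octets.
URI_BODY = (
    r"[A-Za-z][A-Za-z0-9\+\.\-]*:"
    r"(?:[A-Za-z0-9\.\-_~:/\?#\[\]@!\$&'\(\)\*\+,;=]|%[A-Fa-f0-9]{2})+"
)

# First alternative: a URI wrapped in parentheses (lookbehind and lookahead).
# Second alternative: a URI anywhere else.
LINK = re.compile(r"(?<=\()" + URI_BODY + r"(?=\))" + "|" + URI_BODY)


def find_uris(text):
    """Return every URI found in text (hypothetical helper)."""
    return [match.group(0) for match in LINK.finditer(text)]


# The wrapping parentheses are left out of the match ...
print(find_uris("(http://example.org/)"))
# ['http://example.org/']

# ... while parentheses that belong to the URI are kept.
print(find_uris("see http://en.wikipedia.org/wiki/URI_(disambiguation) now"))
# ['http://en.wikipedia.org/wiki/URI_(disambiguation)']
```

<p>
The first alternative works because the greedy match backtracks by one character until the lookahead finds the closing parenthesis.
</p>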

<ins><p>
<strong>Update:</strong> Shortly after Gajim <a href="http://trac.gajim.org/changeset/9845">implemented</a> it, it became clear that I had forgotten to escape the literal minus sign (&#8220;-&#8221;): unescaped, it acted as a range specifier inside the character class, so a literal &#8220;-&#8221; was not matched. This has since been corrected (in this post and <a href="http://trac.gajim.org/changeset/9852">in Gajim</a>).
</p></ins>
<ins datetime="2013-01-24T21:24:11+00:00">
<p>
Since <a href="http://www.regular-expressions.info/posixbrackets.html#eq">regular expressions can be locale-sensitive</a>, I suggest using the <a href="http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html#tag_07_02"><i>C</i> locale</a>.
</p>
</ins>]]></content:encoded>
			<wfw:commentRss>http://blog.dieweltistgarnichtso.net/constructing-a-regular-expression-that-matches-uris/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
