<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Notes on subtleties of HTTP implementation — Luke T. Shumaker</title>
  <link rel="stylesheet" href="assets/style.css">
  <link rel="alternate" type="application/atom+xml" href="./index.atom" name="web log entries"/>
</head>
<body>
<header><a href="/">Luke T. Shumaker</a> » <a href=/blog>blog</a> » http-notes</header>
<article>
<h1 id="notes-on-subtleties-of-http-implementation">Notes on subtleties
of HTTP implementation</h1>
<p>I may add to this as time goes on, but I’ve written up some notes on
the subtleties of HTTP/1.1 message syntax as specified in RFC 7230.</p>
<h2 id="why-the-absolute-form-is-used-for-proxy-requests">Why the
absolute-form is used for proxy requests</h2>
<p><a
href="https://tools.ietf.org/html/rfc7230#section-5.3.2">RFC7230§5.3.2</a>
says that a (non-CONNECT) request to an HTTP proxy should look like</p>
<pre><code>GET http://authority/path HTTP/1.1</code></pre>
<p>rather than the usual</p>
<pre><code>GET /path HTTP/1.1
Host: authority</code></pre>
<p>The RFC doesn’t give a hint as to why the message syntax is different
here.</p>
<p><a
href="https://parsiya.net/blog/2016-07-28-thick-client-proxying---part-6-how-https-proxies-work/#3-1-1-why-not-use-the-host-header">A
blog post by Parsia Hakimian</a> claims that this is a
legacy behavior inherited from HTTP/1.0, which had proxies, but not the
Host header field. That is mostly true. But we can also observe that
the usual syntax does not allow specifying a URI scheme, which means
that we cannot specify a transport. Sure, the only two HTTP transports
we might expect to use today are TCP (scheme: http) and TLS (scheme:
https), and TLS requires that we use a CONNECT request to the proxy,
meaning that the only option left is a TCP transport; but that is no
reason to avoid building generality into the protocol.</p>
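<p>To make the difference concrete, here is a minimal sketch of building
both forms from the same URL. The function name <code>request_head</code>
and the <code>example.com</code> URLs are made up for illustration. Note
that it sends a Host field in both cases, since §5.4 requires a client to
send one in every HTTP/1.1 request, even when the target is in
absolute-form.</p>

```python
from urllib.parse import urlsplit


def request_head(url, via_proxy):
    """Build the head of a GET request for `url` (hypothetical helper).

    via_proxy=True  uses absolute-form (RFC 7230 section 5.3.2)
    via_proxy=False uses origin-form   (RFC 7230 section 5.3.1)
    """
    parts = urlsplit(url)
    path = parts.path or "/"  # origin-form may not be empty
    if parts.query:
        path += "?" + parts.query
    if via_proxy:
        # The scheme and authority travel in the request line itself,
        # so the proxy knows which transport to use.
        target = f"{parts.scheme}://{parts.netloc}{path}"
    else:
        # The scheme is implied by the connection we are already on.
        target = path
    return f"GET {target} HTTP/1.1\r\nHost: {parts.netloc}\r\n\r\n"


if __name__ == "__main__":
    print(request_head("http://example.com/path", via_proxy=True))
    print(request_head("http://example.com/path", via_proxy=False))
```

<p>Only the absolute-form carries the scheme; in the origin-form it has
nowhere to go.</p>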
<h2 id="on-taking-short-cuts-based-on-early-header-field-values">On
taking short-cuts based on early header field values</h2>
<p><a
href="https://tools.ietf.org/html/rfc7230#section-3.2.2">RFC7230§3.2.2</a>
says:</p>
<blockquote>
<pre><code>The order in which header fields with differing field names are
received is not significant.  However, it is good practice to send
header fields that contain control data first, such as Host on
requests and Date on responses, so that implementations can decide
when not to handle a message as early as possible.</code></pre>
</blockquote>
<p>Which is great! We can make an optimization!</p>
<p>But this is only a valid optimization for deciding to <em>not
handle</em> a message. You cannot use it to route a request to a backend
early. Part of the reason is that <a
href="https://tools.ietf.org/html/rfc7230#section-5.4">§5.4</a> tells us
we must inspect the entire header field set to know if we need to
respond with a 400 status code:</p>
<blockquote>
<pre><code>A server MUST respond with a 400 (Bad Request) status code to any
HTTP/1.1 request message that lacks a Host header field and to any
request message that contains more than one Host header field or a
Host header field with an invalid field-value.</code></pre>
</blockquote>
<p>However, if I decide not to handle a request based on the Host header
field, the correct thing to do is to send a 404 status code. Sending a
404 implies that I have parsed the remainder of the header field set and
found the message syntax valid. We need to parse the entire field-set to
know if we need to send a 400 or a 404. Did this just kill the
possibility of using the optimization?</p>
<p>Well, there are a number of “A server MUST respond with a XXX code
if” rules that can all be triggered on the same request. So we get to
choose which to use. And fortunately for optimizing implementations, <a
href="https://tools.ietf.org/html/rfc7230#section-3.2.5">§3.2.5</a> gave
us:</p>
<blockquote>
<pre><code>A server that receives a            ...           set of fields,
larger than it wishes to process MUST respond with an appropriate 4xx
(Client Error) status code.</code></pre>
</blockquote>
<p>Because the header field set is longer than we want to process (we
want to short-cut processing, after all), we are free to respond with
whichever 4xx status code we like!</p>
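<p>As a rough sketch of the short-cut, assuming the header field lines
arrive one at a time from some iterator: the names
<code>KNOWN_HOSTS</code> and <code>early_status</code> are hypothetical,
and a real server would also have to handle ports, obsolete line folding,
and so on.</p>

```python
# Hypothetical set of virtual hosts this server is willing to handle.
KNOWN_HOSTS = {"example.com", "www.example.com"}


def early_status(header_lines):
    """Scan header field lines in the order received.

    Returns 404 as soon as a Host we do not serve is seen, 400 if a
    request we would otherwise handle has zero or several Host fields,
    and None if the request should be handled normally.
    """
    hosts = []
    for line in header_lines:
        name, _, value = line.partition(":")
        if name.strip().lower() != "host":
            continue
        host = value.strip().lower()
        if host not in KNOWN_HOSTS:
            # Short-cut: stop reading here. The rest of the field set is
            # "larger than we wish to process", so any 4xx is allowed.
            return 404
        hosts.append(host)
    # Only a request we intend to handle gets the full section 5.4 check.
    if len(hosts) != 1:
        return 400
    return None


if __name__ == "__main__":
    fields = iter(["Host: nope.example", "Host: example.com", "Accept: */*"])
    print(early_status(fields))  # decided after the first line
    print(next(fields))          # the second line was never consumed
```

<p>The short-cut only ever fires in the “not handle” direction; a request
for a host we do serve still has its whole field set inspected before
anything is routed.</p>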
<h2 id="on-normalizing-target-uris">On normalizing target URIs</h2>
<p>An implementer is tempted to normalize URIs all over the place, just
for safety and sanitation. After all, <a
href="https://tools.ietf.org/html/rfc3986#section-6.1">RFC3986§6.1</a>
says it’s safe!</p>
<p>Unfortunately, most URI normalization implementations will normalize
an empty path to “/”. Which is not always safe; <a
href="https://tools.ietf.org/html/rfc7230#section-2.7.3">RFC7230§2.7.3</a>,
which defines this “equivalence”, actually says:</p>
<blockquote>
<pre><code>                                        When not being used in
absolute form as the request target of an OPTIONS request, an empty
path component is equivalent to an absolute path of &quot;/&quot;, so the
normal form is to provide a path of &quot;/&quot; instead.</code></pre>
</blockquote>
<p>Which means we can’t use the usual normalization implementation if we
are making an OPTIONS request!</p>
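<p>A normalizer that respects this caveat might look something like the
following sketch. The function name <code>normalize_target</code> is made
up, and it only deals with the empty-path rule, not the rest of the
RFC 3986 normalizations.</p>

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_target(method, uri):
    """Normalize an absolute-form request target (hypothetical helper).

    An empty path becomes "/", except for OPTIONS requests, where
    RFC 7230 section 2.7.3 says the two are not equivalent.
    """
    parts = urlsplit(uri)
    if parts.path == "" and method.upper() != "OPTIONS":
        parts = parts._replace(path="/")
    return urlunsplit(parts)


if __name__ == "__main__":
    print(normalize_target("GET", "http://example.com"))
    print(normalize_target("OPTIONS", "http://example.com"))
```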
<p>Why is that? Well, if we turn to <a
href="https://tools.ietf.org/html/rfc7230#section-5.3.4">§5.3.4</a>, we
find the answer. One of the special cases in which the request target is
not a URI is that we may use “*” as the target for an OPTIONS request
to request information about the origin server itself, rather than a
resource on that server.</p>
<p>However, as discussed above, the target in a request to a proxy must
be an absolute URI (and <a
href="https://tools.ietf.org/html/rfc7230#section-5.3.2">§5.3.2</a> says
that the origin server must also understand this syntax). So, we must
define a way to map “*” to an absolute URI.</p>
<p>Naively, one might be tempted to use “/*” as the path. But that would
make it impossible to have a resource actually named “/*”. So, we must
define a special case in the URI syntax that doesn’t obstruct a real
path.</p>
<p>If we didn’t have this special case in the URI normalization rules,
and the OPTIONS handler of the last proxy server treated the “/” path
the same as an empty path, then it would be impossible to request
OPTIONS for the “/” resource, as it would get translated into “*” and
treated as OPTIONS for the entire server.</p>
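<p>The translation that the last proxy performs can be sketched as below,
following the rule in §5.3.4 that an OPTIONS request whose absolute-form
target has an empty path and no query is forwarded with a target of “*”.
The function name <code>origin_target</code> and the
<code>example.org</code> URLs are illustrative only.</p>

```python
from urllib.parse import urlsplit


def origin_target(method, absolute_uri):
    """Turn a proxy's absolute-form target into the target sent to the origin."""
    parts = urlsplit(absolute_uri)
    if method.upper() == "OPTIONS" and parts.path == "" and parts.query == "":
        # Empty path on OPTIONS means "the server itself".
        return "*"
    # Otherwise use origin-form, where an empty path is written as "/".
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


if __name__ == "__main__":
    print(origin_target("OPTIONS", "http://example.org:8001"))   # server-wide
    print(origin_target("OPTIONS", "http://example.org:8001/"))  # the "/" resource
```

<p>Because the empty path and “/” stay distinct for OPTIONS, both the
server-wide request and the request for the “/” resource remain
expressible through a proxy.</p>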

</article>
<footer>

<p>The content of this page is Copyright © 2016 <a href="mailto:lukeshu@lukeshu.com">Luke T. Shumaker</a>.</p>
<p>This page is licensed under the <a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a> license.</p>
</footer>
</body>
</html>