Notes on subtleties of HTTP implementation
==========================================
---
date: "2016-09-30"
---

I may add to this as time goes on, but I've written up some notes on
subtleties of HTTP/1.1 message syntax as specified in RFC 7230.

# Why the absolute-form is used for proxy requests

[RFC7230§5.3.2][] says that a (non-CONNECT) request to an HTTP
proxy should look like

    GET http://authority/path HTTP/1.1

rather than the usual

    GET /path HTTP/1.1
    Host: authority

The RFC doesn't give a hint as to why the message syntax is different
here.

[A blog post by Parsia Hakimian][why-absform] claims that the reason
is that it's a legacy behavior inherited from HTTP/1.0, which had
proxies, but not the Host header field.  Which is mostly true.  But we
can also realize that the usual syntax does not allow specifying a URI
scheme, which means that we cannot specify a transport.  Sure, the
only two HTTP transports we might expect to use today are TCP (scheme:
http) and TLS (scheme: https), and TLS requires we use a CONNECT
request to the proxy, meaning that the only option left is a TCP
transport; but that is no reason to avoid building generality into the
protocol.
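
To make the difference concrete, here is a minimal Python sketch of a
client building the start of a request in either form.  The function
name `request_head` and the `via_proxy` flag are made up for
illustration; they are not from the RFC.  Note that an HTTP/1.1 client
sends a Host header field in both cases, so it is only the request
target that changes.

```python
from urllib.parse import urlsplit


def request_head(method: str, url: str, via_proxy: bool) -> str:
    """Build the request line and Host field for `url` (hypothetical helper)."""
    parts = urlsplit(url)
    if via_proxy:
        # absolute-form: the scheme and authority travel in the target itself.
        target = url
    else:
        # origin-form: path and query only; the scheme is implied by
        # whatever connection we are already talking over.
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
    return f"{method} {target} HTTP/1.1\r\nHost: {parts.netloc}\r\n\r\n"
```

Only the absolute-form carries the scheme, which is the piece of
information the origin-form has no place for.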

# On taking short-cuts based on early header field values

[RFC7230§3.2.2][] says:

>     The order in which header fields with differing field names are
>     received is not significant.  However, it is good practice to send
>     header fields that contain control data first, such as Host on
>     requests and Date on responses, so that implementations can decide
>     when not to handle a message as early as possible.

Which is great!  We can make an optimization!

But this is only a valid optimization for deciding to *not handle* a
message.  You cannot use it to decide early to route a message to a
backend.  Part of the reason is that [§5.4][RFC7230§5.4] tells
us we must inspect the entire header field set to know if we need to
respond with a 400 status code:

>     A server MUST respond with a 400 (Bad Request) status code to any
>     HTTP/1.1 request message that lacks a Host header field and to any
>     request message that contains more than one Host header field or a
>     Host header field with an invalid field-value.
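
As a rough Python sketch of that rule, assuming the header fields have
already been parsed into `(name, value)` pairs: the function name
`is_bad_request` is made up, and the validity check on the field-value
is a crude stand-in for the real `uri-host [ ":" port ]` grammar.  The
point is that the answer depends on having seen every field.

```python
def is_bad_request(fields):
    """Return True if the Host rule quoted above demands a 400.

    `fields` is a list of (name, value) pairs for the whole header
    field set; field names are compared case-insensitively.
    """
    hosts = [value for name, value in fields if name.lower() == "host"]
    if len(hosts) != 1:
        # Missing Host, or more than one Host.
        return True
    value = hosts[0].strip()
    # Crude stand-in for real field-value validation (an assumption):
    # whitespace and these delimiters cannot appear in a host or port.
    return any(c.isspace() or c in "/?#@" for c in value)
```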

However, if I decide not to handle a request based on the Host header
field, the correct thing to do is to send a 404 status code.  Which
implies that I have parsed the remainder of the header field set to
validate the message syntax.  We need to parse the entire field-set to
know if we need to send a 400 or a 404.  Did this just kill the
possibility of using the optimization?

Well, there are a number of "A server MUST respond with a XXX code if"
rules that can all be triggered on the same request.  So we get to
choose which to use.  And fortunately for optimizing implementations,
[§3.2.5][RFC7230§3.2.5] gave us:

>     A server that receives a            ...           set of fields,
>     larger than it wishes to process MUST respond with an appropriate 4xx
>     (Client Error) status code.

Since the header field set is longer than we want to process (since we
want to short-cut processing), we are free to respond with whichever
4XX status code we like!
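
Here is a minimal Python sketch of that short-cut, under some loud
assumptions: `SERVED_HOSTS` and `early_reject` are hypothetical names,
header lines arrive one at a time from an iterator, and obsolete line
folding is ignored.  It stops reading as soon as it sees a Host it
does not serve, so a later duplicate Host (which would otherwise earn
a 400) is never looked at.

```python
SERVED_HOSTS = {"example.com"}  # hypothetical set of hosts we handle


def early_reject(lines):
    """Consume header lines lazily; return a 4xx status to bail out
    early, or None if nothing told us to stop.

    None does NOT mean the request is valid: the caller still has to
    validate the complete field set before routing anywhere.
    """
    for line in lines:
        if line == "":
            break  # blank line: end of the header section
        name, _, value = line.partition(":")
        if name.lower() == "host" and value.strip().lower() not in SERVED_HOSTS:
            # We stop processing here, so the rest of the field set is
            # "larger than we wish to process" and any 4xx is allowed.
            return 404
    return None
```

Note the asymmetry: the early exit only ever produces a rejection.
The `None` path falls through to full parsing, which is why this does
not work as a way to route early.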

# On normalizing target URIs

An implementer is tempted to normalize URIs all over the place, just
for safety and sanitation.  After all,
[RFC3986§6.1][] says it's safe!

Unfortunately, most URI normalization implementations will normalize an
empty path to "/".  Which is not always safe; [RFC7230§2.7.3][], which
defines this "equivalence", actually says:

>                                             When not being used in
>     absolute form as the request target of an OPTIONS request, an empty
>     path component is equivalent to an absolute path of "/", so the
>     normal form is to provide a path of "/" instead.

Which means we can't use the usual normalization implementation if we
are making an OPTIONS request!
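
A minimal Python sketch of a normalizer that respects this exception,
using the standard library's `urllib.parse`; the function name
`normalize_target` is made up, and it only handles the empty-path rule,
not the rest of RFC 3986 normalization.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_target(method: str, uri: str) -> str:
    """Rewrite an empty path to "/", except for OPTIONS requests."""
    parts = urlsplit(uri)
    # Method names are case-sensitive, so an exact comparison is right.
    if parts.netloc and parts.path == "" and method != "OPTIONS":
        parts = parts._replace(path="/")
    return urlunsplit(parts)
```

So `normalize_target("GET", "http://example.com")` gains a trailing
"/", while the same URI on an OPTIONS request is left alone.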

Why is that?  Well, if we turn to [§5.3.4][RFC7230§5.3.4], we find the
answer.  One of the special cases for when the request target is not a
URI is that we may use "\*" as the target for an OPTIONS request to
request information about the origin server itself, rather than a
resource on that server.

However, as discussed above, the target in a request to a proxy must
be an absolute URI (and [§5.3.2][RFC7230§5.3.2] says that the origin
server must also understand this syntax).  So, we must define a way to
map "\*" to an absolute URI.

Naively, one might be tempted to use "/\*" as the path.  But that
would make it impossible to have a resource actually named "/\*".  So,
we must define a special case in the URI syntax that doesn't obstruct
a real path.

If we didn't have this special case in the URI normalization rules,
and the OPTIONS handler of the last proxy server treated the "/" path
the same as an empty path, then it would be impossible to request
OPTIONS for the "/" resource, as it would get translated into
"\*" and treated as OPTIONS for the entire server.

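The mapping at the last proxy can be sketched in Python like this.
The name `forwarded_target` is made up, and the sketch assumes the
proxy has already decided it is the last hop before the origin
server.  Only an OPTIONS request whose URI has an empty path and no
query becomes "\*"; a path of "/" stays a path.

```python
from urllib.parse import urlsplit


def forwarded_target(method: str, absolute_uri: str) -> str:
    """Request target the last proxy sends on to the origin server."""
    parts = urlsplit(absolute_uri)
    if method == "OPTIONS" and parts.path == "" and parts.query == "":
        # Empty path on OPTIONS: the request is about the server itself.
        return "*"
    # Everything else is forwarded in origin-form.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target
```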
[RFC3986§6.1]: https://tools.ietf.org/html/rfc3986#section-6.1
[RFC7230§2.7.3]: https://tools.ietf.org/html/rfc7230#section-2.7.3
[RFC7230§3.2.2]: https://tools.ietf.org/html/rfc7230#section-3.2.2
[RFC7230§3.2.5]: https://tools.ietf.org/html/rfc7230#section-3.2.5
[RFC7230§5.3.2]: https://tools.ietf.org/html/rfc7230#section-5.3.2
[RFC7230§5.3.4]: https://tools.ietf.org/html/rfc7230#section-5.3.4
[RFC7230§5.4]: https://tools.ietf.org/html/rfc7230#section-5.4
[why-absform]: https://parsiya.net/blog/2016-07-28-thick-client-proxying---part-6-how-https-proxies-work/#3-1-1-why-not-use-the-host-header