Notes on subtleties of HTTP implementation
==========================================
---
date: "2016-09-30"
---

I may add to this as time goes on, but I've written up some notes on
subtleties of HTTP/1.1 message syntax as specified in RFC 7230.

# Why the absolute-form is used for proxy requests

[RFC7230§5.3.2][] says that a (non-CONNECT) request to an HTTP
proxy should look like

    GET http://authority/path HTTP/1.1

rather than the usual

    GET /path HTTP/1.1
    Host: authority

But it doesn't give a hint as to why the message syntax is different
here.

[A blog post by Parsia Hakimian][why-absform] claims that the reason
is that it's a legacy behavior inherited from HTTP/1.0, which had
proxies, but not the Host header field.  Which is mostly true.  But we
can also realize that the usual syntax does not allow specifying a URI
scheme, which means that we cannot specify a transport.  Sure, the
only two HTTP transports we might expect to use today are TCP (scheme:
http) and TLS (scheme: https), and TLS requires we use a CONNECT
request to the proxy, meaning that the only option left is a TCP
transport; but that is no reason to avoid building generality into the
protocol.
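To make the difference concrete, here is a minimal sketch of building both forms from the same URI.  The function name `request_head` and the `via_proxy` flag are my own inventions for illustration, and the sketch ignores the OPTIONS special case.

```python
from urllib.parse import urlsplit


def request_head(method: str, url: str, via_proxy: bool) -> str:
    """Build the request line and Host field for `url` (hypothetical helper)."""
    parts = urlsplit(url)
    # An empty path is sent as "/" here; OPTIONS is a special case that
    # this sketch deliberately ignores.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    if via_proxy:
        # absolute-form: the scheme and authority travel in the request line
        target = f"{parts.scheme}://{parts.netloc}{path}"
    else:
        # origin-form: only the path and query; the scheme is not sent at all
        target = path
    # An HTTP/1.1 client sends Host in both cases.
    return f"{method} {target} HTTP/1.1\r\nHost: {parts.netloc}\r\n\r\n"
```

Only the `via_proxy=True` output carries the scheme anywhere in the message, which is the point made above.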

# On taking short-cuts based on early header field values

[RFC7230§3.2.2][] says:

>     The order in which header fields with differing field names are
>     received is not significant.  However, it is good practice to send
>     header fields that contain control data first, such as Host on
>     requests and Date on responses, so that implementations can decide
>     when not to handle a message as early as possible.

I took that as a notice that I can use the first Host or similar
header to quickly route along to my sub-component before I've parsed
the entire header field set.

However, it later states in [§5.4][RFC7230§5.4]:

>     A server MUST respond with a 400 (Bad Request) status code to any
>     HTTP/1.1 request message that lacks a Host header field and to any
>     request message that contains more than one Host header field or a
>     Host header field with an invalid field-value.

Which means that I must parse the entire header field set.
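As a rough sketch of what that check looks like once the whole header field set is parsed: `host_status` is a hypothetical helper, the fields are assumed to arrive as (name, value) pairs, and the validity test is a deliberate simplification rather than the full grammar.

```python
from typing import List, Optional, Tuple


def host_status(fields: List[Tuple[str, str]]) -> Optional[int]:
    """Return 400 if the Host rule from §5.4 is violated, else None."""
    # Field names are case-insensitive, so compare in lower case.
    hosts = [value for name, value in fields if name.lower() == "host"]
    if len(hosts) != 1:
        return 400  # missing Host, or more than one Host
    value = hosts[0].strip()
    # Simplified validity check (an assumption, not the real grammar):
    # reject embedded whitespace, a path, or userinfo.
    if any(ch.isspace() for ch in value) or "/" in value or "@" in value:
        return 400
    return None
```

Note that a duplicate Host can appear anywhere in the list, which is why the check needs every field.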

However, if I look a bit closer at §3.2.2, I see that this short-cut
is only valid for deciding to *not handle* a message; if I am handling
it, I cannot use this short-cut.

Except that if I decide not to handle a request based on the Host
header field, the correct thing to do is to send a 404 status code.
Which implies that I have parsed the remainder of the header field set
to validate the message syntax.  Oh no, what do I do?

Well, there are a number of "A server MUST respond with a XXX code if"
rules that can all be triggered on the same request.  So we get to
choose which to use.

And fortunately for optimizing implementations,
[§3.2.5][RFC7230§3.2.5] gave us:

>     A server that receives a            ...           set of fields,
>     larger than it wishes to process MUST respond with an appropriate 4xx
>     (Client Error) status code.

And since the header field set is longer than we want to process
(since we want to short-cut processing), we are free to respond with
whichever 4XX status code we like!
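Here is a sketch of that short-cut, assuming the header lines arrive lazily as an iterator of strings with the empty line marking the end of the header section.  The name `early_status` and the `known_hosts` set are hypothetical, and answering 404 is just the choice discussed above.

```python
from typing import Iterable, Optional, Set


def early_status(field_lines: Iterable[str], known_hosts: Set[str]) -> Optional[int]:
    """Decide on the first Host field whether to reject the request early."""
    for line in field_lines:
        if line == "":
            break  # end of the header section without seeing Host
        name, _, value = line.partition(":")
        if name.lower() == "host":
            if value.strip().lower() not in known_hosts:
                # Not ours: stop reading here and answer with a 4xx.
                return 404
            # Ours: no early rejection, so the caller still has to parse
            # and validate the remaining fields.
            return None
    return 400  # no Host field at all
```

When the first Host is unknown, nothing after that line is read from the iterator, which is the whole optimization.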

# On normalizing target URIs

An implementer is tempted to normalize URIs all over the place, just
for safety and sanitation.  After all,
[RFC3986§6.1][] says it's safe!

Unfortunately, most URI normalizer implementations will normalize an
empty path to "/".  Which is not always safe;
[RFC7230§2.7.3][], which defines this
"equivalence", actually says:

>                                             When not being used in
>     absolute form as the request target of an OPTIONS request, an empty
>     path component is equivalent to an absolute path of "/", so the
>     normal form is to provide a path of "/" instead.

Which means we can't use the usual normalizer implementation if we are
making an OPTIONS request!
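A minimal sketch of a normalizer that respects this exception, covering only the empty-path rule and nothing else from RFC 3986; `normalize_target` is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_target(method: str, uri: str) -> str:
    """Normalize an empty path to "/", except for OPTIONS requests."""
    parts = urlsplit(uri)
    # Method names are case-sensitive, so compare exactly.
    if parts.path == "" and method != "OPTIONS":
        parts = parts._replace(path="/")
    return urlunsplit(parts)
```

For OPTIONS the empty path is left alone, so the distinction between "no path" and "/" survives normalization.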

Why is that?  Well, if we turn to [§5.3.4][RFC7230§5.3.4], we
find the answer.  One of the special cases in which the request target
is not a URI is that we may use "\*" as the target for an OPTIONS
request to request information about the origin server itself, rather
than a resource on that server.

However, as discussed above, the target in a request to a proxy must
be an absolute URI (and [§5.3.2][RFC7230§5.3.2] says that the
origin server must also understand this syntax).  So, we must define a
way to map "\*" to an absolute URI.

Naively, one might be tempted to use "/\*" as the path.  But that
would make it impossible to have a resource actually named "/\*".  So,
we must define a special case in the URI syntax that doesn't obstruct
a real path.

If we didn't have this special case in the URI normalizer, and we
handled the "/" path as the same as empty in the OPTIONS handler of
the last proxy server, then it would be impossible to request OPTIONS
for the "/" resource, as it would get translated into "\*" and
treated as OPTIONS for the entire server.
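The round trip can be sketched as follows.  Both function names are hypothetical, the `http` scheme is hard-coded as an assumption, and the second function follows the §5.3.4 rule that the last proxy sends "\*" when an OPTIONS request arrives with an empty path and no query.

```python
from urllib.parse import urlsplit


def proxy_target(method: str, target: str, authority: str) -> str:
    """Request-target a client sends to a proxy (absolute-form)."""
    if method == "OPTIONS" and target == "*":
        # "*" maps to an absolute URI with an empty path.
        return f"http://{authority}"
    return f"http://{authority}{target}"


def origin_target(method: str, absolute_uri: str) -> str:
    """Request-target the last proxy sends on to the origin server."""
    parts = urlsplit(absolute_uri)
    if method == "OPTIONS" and parts.path == "" and parts.query == "":
        return "*"  # empty path, no query: OPTIONS for the server itself
    path = parts.path or "/"
    return path + ("?" + parts.query if parts.query else "")
```

With this mapping, an OPTIONS request for "/" stays "/" end to end, and only the empty path turns back into "\*".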

[RFC3986§6.1]: https://tools.ietf.org/html/rfc3986#section-6.1
[RFC7230§2.7.3]: https://tools.ietf.org/html/rfc7230#section-2.7.3
[RFC7230§3.2.2]: https://tools.ietf.org/html/rfc7230#section-3.2.2
[RFC7230§3.2.5]: https://tools.ietf.org/html/rfc7230#section-3.2.5
[RFC7230§5.3.2]: https://tools.ietf.org/html/rfc7230#section-5.3.2
[RFC7230§5.3.4]: https://tools.ietf.org/html/rfc7230#section-5.3.4
[RFC7230§5.4]: https://tools.ietf.org/html/rfc7230#section-5.4
[why-absform]: https://parsiya.net/blog/2016-07-28-thick-client-proxying---part-6-how-https-proxies-work/#3-1-1-why-not-use-the-host-header