From 36e8932c2d3d42f7651ab6aae7af62175ba172e1 Mon Sep 17 00:00:00 2001
From: Luke Shumaker
Date: Fri, 30 Sep 2016 18:54:11 -0400
Subject: add http

---
 public/http-notes.md | 131 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 131 insertions(+)
 create mode 100644 public/http-notes.md

diff --git a/public/http-notes.md b/public/http-notes.md
new file mode 100644
index 0000000..9b08663
--- /dev/null
+++ b/public/http-notes.md
@@ -0,0 +1,131 @@
+Notes on subtleties of HTTP implementation
+==========================================
+---
+date: "2016-09-30"
+---
+
+I may add to this as time goes on, but I've written up some notes on
+subtleties of HTTP/1.1 message syntax as specified in RFC 7230.
+
+# Why the absolute-form is used for proxy requests
+
+[RFC7230§5.3.2][] says that a (non-CONNECT) request to an HTTP
+proxy should look like
+
+    GET http://authority/path HTTP/1.1
+
+rather than the usual
+
+    GET /path HTTP/1.1
+    Host: authority
+
+It doesn't give a hint as to why the message syntax is different
+here.
+
+[A blog post by Parsia Hakimian][why-absform] claims that the reason
+is that it's a legacy behavior inherited from HTTP/1.0, which had
+proxies, but not the Host header field. Which is mostly true. But we
+can also realize that the usual syntax does not allow specifying a URI
+scheme, which means that we cannot specify a transport. Sure, the
+only two HTTP transports we might expect to use today are TCP (scheme:
+http) and TLS (scheme: https), and TLS requires we use a CONNECT
+request to the proxy, meaning that the only option left is a TCP
+transport; but that is no reason to avoid building generality into the
+protocol.
+
+# On taking short-cuts based on early header field values
+
+[RFC7230§3.2.2][] says:
+
+> The order in which header fields with differing field names are
+> received is not significant.
+> However, it is good practice to send
+> header fields that contain control data first, such as Host on
+> requests and Date on responses, so that implementations can decide
+> when not to handle a message as early as possible.
+
+I took that as a notice that I can use the first Host or similar
+header to quickly route along to my sub-component before I've parsed
+the entire header field set.
+
+However, it later states in [§5.4][RFC7230§5.4]:
+
+> A server MUST respond with a 400 (Bad Request) status code to any
+> HTTP/1.1 request message that lacks a Host header field and to any
+> request message that contains more than one Host header field or a
+> Host header field with an invalid field-value.
+
+Which means that I must parse the entire header field set.
+
+However, if I look a bit closer at §3.2.2, I see that this short-cut
+is only valid for deciding to *not handle* a message; if I am handling
+it, I cannot use this short-cut.
+
+Except that if I decide not to handle a request based on the Host
+header field, the correct thing to do is to send a 404 status code.
+Which implies that I have parsed the remainder of the header field set
+to validate the message syntax. Oh no, what do I do?
+
+Well, there are a number of "A server MUST respond with a XXX code if"
+rules that can all be triggered on the same request. So we get to
+choose which to use.
+
+And fortunately for optimizing implementations,
+[§3.2.5][RFC7230§3.2.5] gave us:
+
+> A server that receives a ... set of fields,
+> larger than it wishes to process MUST respond with an appropriate 4xx
+> (Client Error) status code.
+
+And since the header field set is longer than we want to process
+(since we want to short-cut processing), we are free to respond with
+whichever 4XX status code we like!
+
+# On normalizing target URIs
+
+An implementer is tempted to normalize URIs all over the place, just
+for safety and sanitation. After all,
+[RFC3986§6.1][] says it's safe!
+
+Unfortunately, most URI normalizer implementations will normalize an
+empty path to "/". Which is not always safe;
+[RFC7230§2.7.3][], which defines this
+"equivalence", actually says:
+
+> When not being used in
+> absolute form as the request target of an OPTIONS request, an empty
+> path component is equivalent to an absolute path of "/", so the
+> normal form is to provide a path of "/" instead.
+
+Which means we can't use the usual normalizer implementation if we are
+making an OPTIONS request!
+
+Why is that? Well, if we turn to [§5.3.4][RFC7230§5.3.4], we
+find the answer. One of the special cases for when the request target
+is not a URI is that we may use "\*" as the target for an OPTIONS
+request to request information about the origin server itself, rather
+than a resource on that server.
+
+However, as discussed above, the target in a request to a proxy must
+be an absolute URI (and [§5.3.2][RFC7230§5.3.2] says that the
+origin server must also understand this syntax). So, we must define a
+way to map "\*" to an absolute URI.
+
+Naively, one might be tempted to use "/\*" as the path. But that
+would make it impossible to have a resource actually named "/\*". So,
+we must define a special case in the URI syntax that doesn't obstruct
+a real path.
+
+If we didn't have this special case in the URI normalizer, and we
+handled the "/" path as the same as empty in the OPTIONS handler of
+the last proxy server, then it would be impossible to request OPTIONS
+for the "/" resource, as it would get translated into "\*" and
+treated as OPTIONS for the entire server.
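To make the special case concrete, here is a minimal sketch in Python. It is not taken from any real implementation: the helper names `normalize_target` and `origin_target_from_last_proxy` are hypothetical, and it assumes the target is already in absolute form. The first helper normalizes an empty path to "/" except for OPTIONS; the second shows how the last proxy might map an absolute-form target to what it sends to the origin server.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_target(method: str, target: str) -> str:
    """Normalize an absolute-form request target (hypothetical helper).

    An empty path is rewritten to "/", except for OPTIONS, where an
    empty path means "the server itself" and must be left alone.
    """
    parts = urlsplit(target)
    path = parts.path
    if path == "" and method != "OPTIONS":
        path = "/"
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))


def origin_target_from_last_proxy(method: str, target: str) -> str:
    """Request target the last proxy would send to the origin server
    (hypothetical helper, assuming an absolute-form input)."""
    parts = urlsplit(target)
    # OPTIONS with an empty path and no query maps to the "*" target.
    if method == "OPTIONS" and parts.path == "" and parts.query == "":
        return "*"
    path = parts.path or "/"
    return path + ("?" + parts.query if parts.query else "")
```

With these, `OPTIONS http://example.com` stays distinct from `OPTIONS http://example.com/`: the former is forwarded as "\*", the latter as "/". A normalizer that rewrote the empty path unconditionally would collapse the two.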
+
+[RFC3986§6.1]: https://tools.ietf.org/html/rfc3986#section-6.1
+[RFC7230§2.7.3]: https://tools.ietf.org/html/rfc7230#section-2.7.3
+[RFC7230§3.2.2]: https://tools.ietf.org/html/rfc7230#section-3.2.2
+[RFC7230§3.2.5]: https://tools.ietf.org/html/rfc7230#section-3.2.5
+[RFC7230§5.3.2]: https://tools.ietf.org/html/rfc7230#section-5.3.2
+[RFC7230§5.3.4]: https://tools.ietf.org/html/rfc7230#section-5.3.4
+[RFC7230§5.4]: https://tools.ietf.org/html/rfc7230#section-5.4
+[why-absform]: https://parsiya.net/blog/2016-07-28-thick-client-proxying---part-6-how-https-proxies-work/#3-1-1-why-not-use-the-host-header
--
cgit v1.2.3