Notes on subtleties of HTTP implementation
==========================================
---
date: "2016-09-30"
---

I may add to this as time goes on, but I've written up some notes on subtleties of HTTP/1.1 message syntax as specified in RFC 7230.

## Why the absolute-form is used for proxy requests

[RFC7230§5.3.2][] says that a (non-CONNECT) request to an HTTP proxy should look like

    GET http://authority/path HTTP/1.1

rather than the usual

    GET /path HTTP/1.1
    Host: authority

but it doesn't give a hint as to why the message syntax is different here.

[A blog post by Parsia Hakimian][why-absform] claims that the reason is that it's a legacy behavior inherited from HTTP/1.0, which had proxies, but not the Host header field. Which is mostly true. But we can also realize that the usual syntax does not allow specifying a URI scheme, which means that we cannot specify a transport. Sure, the only two HTTP transports we might expect to use today are TCP (scheme: http) and TLS (scheme: https), and TLS requires that we use a CONNECT request to the proxy, meaning that the only option left is a TCP transport; but that is no reason to avoid building generality into the protocol.

## On taking short-cuts based on early header field values

[RFC7230§3.2.2][] says:

> The order in which header fields with differing field names are
> received is not significant. However, it is good practice to send
> header fields that contain control data first, such as Host on
> requests and Date on responses, so that implementations can decide
> when not to handle a message as early as possible.

Which is great! We can make an optimization!

But this is only a valid optimization for deciding to *not handle* a message. You cannot use it to decide early to route a message to a backend.

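
As a rough illustration of that asymmetry, here is a minimal Python sketch. The names `Decision` and `scan_header_fields` are hypothetical, the input is assumed to be already split into "Name: value" lines, and the sketch does not validate the Host field-value. A Host we don't serve lets us stop reading immediately, while a Host we do serve still forces us to read to the end of the field set before routing.

```python
from enum import Enum


class Decision(Enum):
    REJECT = "reject"            # may be decided before the field set ends
    NEED_MORE = "need more"      # keep reading header fields
    ROUTE = "route"              # only decided once every field has been seen
    BAD_REQUEST = "bad request"  # missing or repeated Host field


def scan_header_fields(lines, served_hosts):
    """Yield one Decision per header line, ending at the first final one.

    `lines` is an iterable of "Name: value" strings, terminated by an
    empty string standing in for the blank line that ends the field set.
    """
    host = None
    for line in lines:
        if line == "":
            # End of the field set: only now do we know Host was unique.
            yield Decision.ROUTE if host is not None else Decision.BAD_REQUEST
            return
        name, _, value = line.partition(":")
        if name.strip().lower() != "host":
            yield Decision.NEED_MORE
            continue
        if host is not None:
            # A second Host field makes the message invalid (RFC 7230 §5.4).
            yield Decision.BAD_REQUEST
            return
        host = value.strip().lower()
        if host not in served_hosts:
            # Short-cut: we will not handle this message, so stop reading.
            yield Decision.REJECT
            return
        # A Host we serve is not enough to route yet; keep reading.
        yield Decision.NEED_MORE
```

Consuming the generator with `list(...)` shows how many lines had to be read: a request for an unknown host ends after its Host line, while a request for a served host only ends at the blank line.
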
Part of the reason you cannot route early is that [§5.4][RFC7230§5.4] tells us we must inspect the entire header field set to know if we need to respond with a 400 status code:

> A server MUST respond with a 400 (Bad Request) status code to any
> HTTP/1.1 request message that lacks a Host header field and to any
> request message that contains more than one Host header field or a
> Host header field with an invalid field-value.

However, if I decide not to handle a request based on the Host header field, the correct thing to do is to send a 404 status code. That implies that I have parsed the remainder of the header field set to validate the message syntax: we need to parse the entire field set to know whether to send a 400 or a 404. Did this just kill the possibility of using the optimization?

Well, there are a number of "A server MUST respond with a XXX code if" rules that can all be triggered on the same request. So we get to choose which to use. And fortunately for optimizing implementations, [§3.2.5][RFC7230§3.2.5] gave us:

> A server that receives a ... set of fields,
> larger than it wishes to process MUST respond with an appropriate 4xx
> (Client Error) status code.

Since the header field set is longer than we want to process (since we want to short-cut processing), we are free to respond with whichever 4xx status code we like!

## On normalizing target URIs

An implementer is tempted to normalize URIs all over the place, just for safety and sanitation. After all, [RFC3986§6.1][] says it's safe!

Unfortunately, most URI normalization implementations will normalize an empty path to "/". That is not always safe; [RFC7230§2.7.3][], which defines this "equivalence", actually says:

> When not being used in
> absolute form as the request target of an OPTIONS request, an empty
> path component is equivalent to an absolute path of "/", so the
> normal form is to provide a path of "/" instead.

Which means we can't use the usual normalization implementation if we are making an OPTIONS request!

Why is that? Well, if we turn to [§5.3.4][RFC7230§5.3.4], we find the answer. One of the special cases for when the request target is not a URI is that we may use "\*" as the target for an OPTIONS request to request information about the origin server itself, rather than a resource on that server.

However, as discussed above, the target in a request to a proxy must be an absolute URI (and [§5.3.2][RFC7230§5.3.2] says that the origin server must also understand this syntax). So, we must define a way to map "\*" to an absolute URI.

Naively, one might be tempted to use "/\*" as the path. But that would make it impossible to have a resource actually named "/\*". So, we must define a special case in the URI syntax that doesn't obstruct a real path.

If we didn't have this special case in the URI normalization rules, and the OPTIONS handler of the last proxy server treated the "/" path the same as an empty path, then it would be impossible to request OPTIONS for the "/" resource, as it would get translated into "\*" and treated as OPTIONS for the entire server.

[RFC3986§6.1]: https://tools.ietf.org/html/rfc3986#section-6.1
[RFC7230§2.7.3]: https://tools.ietf.org/html/rfc7230#section-2.7.3
[RFC7230§3.2.2]: https://tools.ietf.org/html/rfc7230#section-3.2.2
[RFC7230§3.2.5]: https://tools.ietf.org/html/rfc7230#section-3.2.5
[RFC7230§5.3.2]: https://tools.ietf.org/html/rfc7230#section-5.3.2
[RFC7230§5.3.4]: https://tools.ietf.org/html/rfc7230#section-5.3.4
[RFC7230§5.4]: https://tools.ietf.org/html/rfc7230#section-5.4
[why-absform]: https://parsiya.net/blog/2016-07-28-thick-client-proxying---part-6-how-https-proxies-work/#3-1-1-why-not-use-the-host-header
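
To make the empty-path special case concrete, here is a minimal Python sketch built on the standard `urllib.parse` module. The function names `normalize_target` and `origin_request_target` are hypothetical, and the sketch only deals with the path rule discussed in this section, not with the rest of URI normalization. The first function leaves an empty path alone for OPTIONS; the second shows what I assume a last proxy does when turning an absolute-form target into the target it sends to the origin server.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_target(method, absolute_uri):
    """Normalize the path of an absolute-form request target.

    An empty path becomes "/", except for OPTIONS, where an empty path
    is the absolute-form spelling of the "*" target and must be kept.
    """
    parts = urlsplit(absolute_uri)
    path = parts.path
    if path == "" and method != "OPTIONS":
        path = "/"
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


def origin_request_target(method, absolute_uri):
    """Request target a last proxy would send to the origin server."""
    parts = urlsplit(absolute_uri)
    if method == "OPTIONS" and parts.path == "" and parts.query == "":
        # Empty path and no query on OPTIONS: this is about the whole server.
        return "*"
    path = parts.path or "/"
    return path + ("?" + parts.query if parts.query else "")
```

With these, `OPTIONS http://example.com` maps to "\*", while `OPTIONS http://example.com/` and `OPTIONS http://example.com/*` keep their real paths, so none of the three shadows another.
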