Notes on subtleties of HTTP implementation
I may add to this as time goes on, but I’ve written up some notes on subtleties of HTTP/1.1 message syntax as specified in RFC 7230.
Why the absolute-form is used for proxy requests
RFC7230§5.3.2 says that a (non-CONNECT) request to an HTTP proxy should look like
GET http://authority/path HTTP/1.1
rather than the usual
GET /path HTTP/1.1
Host: authority
And doesn’t give a hint as to why the message syntax is different here.
A blog post by Parsia Hakimian claims that this is legacy behavior inherited from HTTP/1.0, which had proxies but not the Host header field. That is mostly true. But we can also notice that the usual syntax has no place for a URI scheme, which means that the client cannot tell the proxy which transport to use. Sure, the only two HTTP transports we might expect to use today are TCP (scheme: http) and TLS (scheme: https), and TLS requires a CONNECT request to the proxy, so the only option left is a TCP transport; but that is no reason to avoid building generality into the protocol.
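To make the difference concrete, here is a minimal sketch of how a client might build the head of a request in either form. The function name `request_head` and its `via_proxy` flag are my own inventions for illustration, and the sketch ignores userinfo, fragments, and every header field other than Host.

```python
from urllib.parse import urlsplit


def request_head(method: str, url: str, via_proxy: bool) -> bytes:
    """Build the request line and Host field for a non-CONNECT request.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    if via_proxy:
        # absolute-form: the scheme and authority travel in the request line,
        # so the proxy learns which transport to use toward the origin.
        target = f"{parts.scheme}://{parts.netloc}{path}"
    else:
        # origin-form: there is nowhere to put the scheme; it is implied
        # by the connection the request arrives on.
        target = path
    # HTTP/1.1 clients send Host in both forms.
    head = f"{method} {target} HTTP/1.1\r\nHost: {parts.netloc}\r\n\r\n"
    return head.encode("ascii")
```

Calling `request_head("GET", "http://authority/path", via_proxy=True)` yields a request line of `GET http://authority/path HTTP/1.1`, while `via_proxy=False` yields `GET /path HTTP/1.1`; only the first carries the scheme.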
On taking short-cuts based on early header field values
RFC7230§3.2.2 says:
The order in which header fields with differing field names are received is not significant. However, it is good practice to send header fields that contain control data first, such as Host on requests and Date on responses, so that implementations can decide when not to handle a message as early as possible.
Which is great! We can make an optimization!
But be careful: this is only a valid optimization for deciding not to handle a message. You cannot use it to route a request to a backend early. Part of the reason is that §5.4 tells us we must inspect the entire header field set to know if we need to respond with a 400 status code:
A server MUST respond with a 400 (Bad Request) status code to any HTTP/1.1 request message that lacks a Host header field and to any request message that contains more than one Host header field or a Host header field with an invalid field-value.
However, if I decide not to handle a request based on the Host header field, the correct thing to do is to send a 404 status code. Which implies that I have parsed the remainder of the header field set to validate the message syntax. We need to parse the entire field-set to know if we need to send a 400 or a 404. Did this just kill the possibility of using the optimization?
Well, there are a number of “A server MUST respond with a XXX code if” rules that can all be triggered on the same request. So we get to choose which to use. And fortunately for optimizing implementations, §3.2.5 gave us:
A server that receives a ... set of fields, larger than it wishes to process MUST respond with an appropriate 4xx (Client Error) status code.
Since the header field set is longer than we want to process (since we want to short-cut processing), we are free to respond with whichever 4XX status code we like!
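As a rough sketch of what that short-cut could look like: the function below reads header lines one at a time and stops as soon as it sees a Host it does not serve. `KNOWN_HOSTS` and `early_status` are hypothetical names, the input is assumed to be the header lines after the request line, and details such as obs-fold, ports in the Host value, and field-name validation are deliberately left out.

```python
from typing import Iterable, Optional

# Hypothetical set of hosts this server is willing to handle.
KNOWN_HOSTS = {"example.com"}


def early_status(header_lines: Iterable[bytes]) -> Optional[int]:
    """Return a 4xx status as soon as one is known, or None to keep going."""
    host_seen = False
    for line in header_lines:
        if line in (b"\r\n", b"\n", b""):
            break  # blank line: end of the header section
        name, sep, value = line.partition(b":")
        if not sep:
            return 400  # not a "name: value" line
        if name.lower() != b"host":
            continue
        if host_seen:
            return 400  # more than one Host header field
        host_seen = True
        host = value.strip().decode("ascii", "replace").lower()
        if host not in KNOWN_HOSTS:
            # Short-cut: we stop here and never look at the remaining fields,
            # so we cannot know whether a 400 would also have applied.
            return 404
    # Reaching this point means the whole field set was read.
    return None if host_seen else 400
```

Note the asymmetry: the rejecting path returns early, but the accepting path still has to read through to the blank line before it can say the Host field was unique.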
On normalizing target URIs
An implementer is tempted to normalize URIs all over the place, just for safety and sanitation. After all, RFC3986§6.1 says it’s safe!
Unfortunately, most URI normalization implementations will normalize an empty path to “/”. Which is not always safe; RFC7230§2.7.3, which defines this “equivalence”, actually says:
When not being used in absolute form as the request target of an OPTIONS request, an empty path component is equivalent to an absolute path of "/", so the normal form is to provide a path of "/" instead.
Which means we can’t use the usual normalization implementation if we are making an OPTIONS request!
Why is that? Well, if we turn to §5.3.4, we find the answer. One of the special cases in which the request target is not a URI is asterisk-form: we may use “*” as the target of an OPTIONS request to ask for information about the origin server itself, rather than about a resource on that server.
However, as discussed above, the target in a request to a proxy must be an absolute URI (and §5.3.2 says that the origin server must also understand this syntax). So, we must define a way to map “*” to an absolute URI.
Naively, one might be tempted to use “/*” as the path. But that would make it impossible to have a resource actually named “/*”. So, we must define a special case in the URI syntax that doesn’t obstruct a real path.
If we didn’t have this special case in the URI normalization rules, and the last proxy server’s OPTIONS handler treated the “/” path the same as an empty path, then it would be impossible to request OPTIONS for the “/” resource, as it would get translated into “*” and treated as OPTIONS for the entire server.
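Here is a small sketch of both halves of this: a normalizer that leaves the empty path alone for OPTIONS, and the mapping the last proxy applies when forwarding to the origin server. The names `normalize_target` and `last_proxy_target` are made up for this post, and the normalizer only handles the empty-path rule, not the rest of RFC3986§6.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_target(method: str, url: str) -> str:
    """Normalize an empty path to "/", except for OPTIONS requests."""
    parts = urlsplit(url)
    if parts.path == "" and method != "OPTIONS":
        parts = parts._replace(path="/")
    return urlunsplit(parts)


def last_proxy_target(method: str, url: str) -> str:
    """Request target the last proxy sends to the origin server."""
    parts = urlsplit(url)
    if method == "OPTIONS" and parts.path == "" and parts.query == "":
        # Empty path and no query on OPTIONS means "the server itself".
        return "*"
    path = parts.path or "/"
    return path + ("?" + parts.query if parts.query else "")
```

With these, `OPTIONS http://authority` is forwarded as `OPTIONS *`, while `OPTIONS http://authority/` is forwarded as `OPTIONS /`. Run a generic normalizer first and the two become indistinguishable.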