Notes on subtleties of HTTP implementation
==========================================
---
date: "2016-09-30"
---

I may add to this as time goes on, but I've written up some notes on
subtleties of HTTP/1.1 message syntax as specified in RFC 7230.

## Why the absolute-form is used for proxy requests

[RFC7230§5.3.2][] says that a (non-CONNECT) request to an HTTP
proxy should look like

    GET http://authority/path HTTP/1.1

rather than the usual

    GET /path HTTP/1.1
    Host: authority

The RFC doesn't give a hint as to why the message syntax is different
here.

[A blog post by Parsia Hakimian][why-absform] claims that the reason
is that it's a legacy behavior inherited from HTTP/1.0, which had
proxies, but not the Host header field.  Which is mostly true.  But
notice also that the usual syntax has no place for a URI scheme, which
means that we cannot specify a transport.  Sure, the only two HTTP
transports we might expect to use today are TCP (scheme: http) and TLS
(scheme: https), and TLS requires a CONNECT request to the proxy,
leaving TCP as the only option; but that is no reason to avoid
building generality into the protocol.
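
To make the difference concrete, here is a minimal sketch of building
the head of a GET request in both forms.  The function name
`get_request_head` and the `via_proxy` flag are made up for
illustration, and it leans on Python's `urllib.parse` rather than a
real HTTP client.

```python
from urllib.parse import urlsplit


def get_request_head(url: str, via_proxy: bool) -> str:
    """Build the request line and Host field for a GET of `url`.

    Hypothetical helper for illustration only; a real client would add
    more header fields.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    if via_proxy:
        # absolute-form: the scheme and authority travel in the request line
        target = f"{parts.scheme}://{parts.netloc}{path}"
    else:
        # origin-form: only the path; the scheme is implied by the connection
        target = path
    # A client still sends Host in both forms (RFC 7230 section 5.4).
    return f"GET {target} HTTP/1.1\r\nHost: {parts.netloc}\r\n\r\n"
```

Only the absolute-form carries the scheme; in the origin-form it has
to be inferred from how the connection was made.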

## On taking short-cuts based on early header field values

[RFC7230§3.2.2][] says:

>     The order in which header fields with differing field names are
>     received is not significant.  However, it is good practice to send
>     header fields that contain control data first, such as Host on
>     requests and Date on responses, so that implementations can decide
>     when not to handle a message as early as possible.

Which is great!  We can make an optimization!

But this is only a valid optimization for deciding to *not handle* a
message.  You cannot use it to decide early to route a request to a
backend.  Part of the reason is that [§5.4][RFC7230§5.4] tells us we
must inspect the entire header field set to know if we need to respond
with a 400 status code:

>     A server MUST respond with a 400 (Bad Request) status code to any
>     HTTP/1.1 request message that lacks a Host header field and to any
>     request message that contains more than one Host header field or a
>     Host header field with an invalid field-value.

However, if I decide not to handle a request based on the Host header
field, the correct thing to do is to send a 404 status code.  Sending
a 404 implies that I have parsed the remainder of the header field set
and found the message syntax valid; we need to parse the entire
field set to know whether to send a 400 or a 404.  Did this just kill
the possibility of using the optimization?

Well, there are a number of "A server MUST respond with a XXX code if"
rules, and several of them can be triggered by the same request.  When
that happens, we get to choose which one to apply.  And fortunately
for optimizing implementations, [§3.2.5][RFC7230§3.2.5] gives us:

>     A server that receives a            ...           set of fields,
>     larger than it wishes to process MUST respond with an appropriate 4xx
>     (Client Error) status code.

Because we want to short-cut processing, the header field set is by
definition longer than we wish to process, so we are free to respond
with whichever 4XX status code we like!
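
As a rough sketch of that short-cut, assuming the header section
arrives as an iterator of already-split lines: the names `early_reject`
and `KNOWN_HOSTS` are invented for this example, and the choice of 404
is just one of the 4XX codes we are free to pick.

```python
from typing import Iterable, Optional

# Hypothetical set of hosts this server is willing to handle.
KNOWN_HOSTS = {"example.com"}


def early_reject(header_lines: Iterable[str]) -> Optional[int]:
    """Return a 4XX status as soon as a Host we do not serve is seen.

    Returns None when no early decision can be made; the caller must
    then validate the whole field set (duplicate Host, missing Host,
    and so on) before routing the request anywhere.
    """
    for line in header_lines:
        if line == "":
            return None  # blank line: end of the header section
        name, _, value = line.partition(":")
        if name.lower() == "host" and value.strip().lower() not in KNOWN_HOSTS:
            # Stop here; the remaining fields are "larger than we wish
            # to process", so any 4XX is allowed.
            return 404
    return None
```

Note that the function only ever decides to *not handle* a message.  A
Host that we do serve yields None rather than an early "route to
backend", because a second Host field further down would still require
a 400.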

## On normalizing target URIs

An implementer is tempted to normalize URIs all over the place, just
for safety and sanitation.  After all,
[RFC3986§6.1][] says it's safe!

Unfortunately, most URI normalization implementations will normalize an
empty path to "/".  Which is not always safe; [RFC7230§2.7.3][], which
defines this "equivalence", actually says:

>                                             When not being used in
>     absolute form as the request target of an OPTIONS request, an empty
>     path component is equivalent to an absolute path of "/", so the
>     normal form is to provide a path of "/" instead.

Which means we can't use the usual normalization implementation if we
are making an OPTIONS request!

Why is that?  Well, if we turn to [§5.3.4][RFC7230§5.3.4], we find the
answer.  One of the special cases where the request target is not a
URI is that we may use "\*" as the target of an OPTIONS request to
request information about the origin server itself, rather than a
resource on that server.

However, as discussed above, the target in a request to a proxy must
be an absolute URI (and [§5.3.2][RFC7230§5.3.2] says that the origin
server must also understand this syntax).  So, we must define a way to
map "\*" to an absolute URI.

Naively, one might be tempted to use "/\*" as the path.  But that
would make it impossible to have a resource actually named "/\*".  So,
we must define a special case in the URI syntax that doesn't obstruct
a real path.

If we didn't have this special case in the URI normalization rules,
and the OPTIONS handler of the last proxy server treated the "/" path
the same as an empty path, then it would be impossible to request
OPTIONS for the "/" resource, as it would get translated into "\*" and
treated as OPTIONS for the entire server.
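
A small sketch of what this means for code, under the assumption that
normalization is done with Python's `urllib.parse`.  Both function
names are hypothetical, and only the empty-path rule is handled (no
case folding or dot-segment removal).

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_target(method: str, uri: str) -> str:
    """Normalize an empty path to "/", except for OPTIONS requests."""
    parts = urlsplit(uri)
    path = parts.path
    if path == "" and method != "OPTIONS":
        path = "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def origin_options_target(uri: str) -> str:
    """Request target the last proxy sends to the origin for OPTIONS.

    An absolute URI with an empty path and no query maps to "*"
    (RFC 7230 section 5.3.4); anything else becomes origin-form.
    """
    parts = urlsplit(uri)
    if parts.path == "" and parts.query == "":
        return "*"
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target
```

With these, `OPTIONS http://example.com` reaches the origin server as
`OPTIONS *`, while `OPTIONS http://example.com/` stays a request about
the "/" resource.  Running a generic normalizer first would collapse
the two.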

[RFC3986§6.1]: https://tools.ietf.org/html/rfc3986#section-6.1
[RFC7230§2.7.3]: https://tools.ietf.org/html/rfc7230#section-2.7.3
[RFC7230§3.2.2]: https://tools.ietf.org/html/rfc7230#section-3.2.2
[RFC7230§3.2.5]: https://tools.ietf.org/html/rfc7230#section-3.2.5
[RFC7230§5.3.2]: https://tools.ietf.org/html/rfc7230#section-5.3.2
[RFC7230§5.3.4]: https://tools.ietf.org/html/rfc7230#section-5.3.4
[RFC7230§5.4]: https://tools.ietf.org/html/rfc7230#section-5.4
[why-absform]: https://parsiya.net/blog/2016-07-28-thick-client-proxying---part-6-how-https-proxies-work/#3-1-1-why-not-use-the-host-header