author Luke Shumaker <lukeshu@sbcglobal.net> 2016-09-30 18:54:11 -0400
committer Luke Shumaker <lukeshu@sbcglobal.net> 2016-09-30 18:54:11 -0400
commit 36e8932c2d3d42f7651ab6aae7af62175ba172e1
tree 7cb60e8f63f93a612ebea2ce07f32d154b63752f
parent 60573896b4d4cc6820172cc9ad2d4355f7168662
add http
-rw-r--r-- public/http-notes.md 131
1 files changed, 131 insertions, 0 deletions
diff --git a/public/http-notes.md b/public/http-notes.md
new file mode 100644
index 0000000..9b08663
--- /dev/null
+++ b/public/http-notes.md
@@ -0,0 +1,131 @@
+Notes on subtleties of HTTP implementation
+==========================================
+---
+date: "2016-09-30"
+---
+
+I may add to this as time goes on, but I've written up some notes on
+subtleties of HTTP/1.1 message syntax as specified in RFC 7230.
+
+# Why the absolute-form is used for proxy requests
+
+[RFC7230§5.3.2][] says that a (non-CONNECT) request to an HTTP
+proxy should look like
+
+    GET http://authority/path HTTP/1.1
+
+rather than the usual
+
+    GET /path HTTP/1.1
+    Host: authority
+
+The RFC doesn't give a hint as to why the message syntax is different
+here.
+
+[A blog post by Parsia Hakimian][why-absform] claims that the reason
+is that it's a legacy behavior inherited from HTTP/1.0, which had
+proxies, but not the Host header field. Which is mostly true. But
+note also that the usual syntax does not allow specifying a URI
+scheme, which means that we cannot specify a transport. Sure, the
+only two HTTP transports we might expect to use today are TCP (scheme:
+http) and TLS (scheme: https), and TLS requires we use a CONNECT
+request to the proxy, meaning that the only option left is a TCP
+transport; but that is no reason to avoid building generality into the
+protocol.
+
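To make the two request shapes concrete, here is a minimal Python sketch. The function name `request_head` and its `via_proxy` flag are hypothetical, invented for illustration. It sends Host in both forms, on the understanding that RFC 7230 §5.4 has HTTP/1.1 clients send Host in every request, even though the proxy example above omits it for brevity.

```python
from urllib.parse import urlsplit


def request_head(url: str, via_proxy: bool) -> str:
    """Build the request line and Host field for a GET of `url`.

    Illustrative only: real clients also send other header fields.
    """
    parts = urlsplit(url)
    # An empty path is sent as "/" here, which is fine for GET.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    if via_proxy:
        # absolute-form: scheme and authority travel in the request line,
        # so the proxy learns which transport to use for the next hop.
        target = f"{parts.scheme}://{parts.netloc}{path}"
    else:
        # origin-form: only path and query; the scheme is implied by
        # the connection the request arrives on.
        target = path
    return f"GET {target} HTTP/1.1\r\nHost: {parts.netloc}\r\n\r\n"
```

Calling `request_head("http://example.com/a?b=1", True)` yields a request line that still carries `http://`, while the `False` variant drops the scheme entirely, which is the loss of generality described above.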
+# On taking short-cuts based on early header field values
+
+[RFC7230§3.2.2][] says:
+
+> The order in which header fields with differing field names are
+> received is not significant. However, it is good practice to send
+> header fields that contain control data first, such as Host on
+> requests and Date on responses, so that implementations can decide
+> when not to handle a message as early as possible.
+
+I took that as a notice that I can use the first Host or similar
+header field to quickly route the request along to a sub-component
+before I've parsed the entire header field set.
+
+However, it later states in [§5.4][RFC7230§5.4]:
+
+> A server MUST respond with a 400 (Bad Request) status code to any
+> HTTP/1.1 request message that lacks a Host header field and to any
+> request message that contains more than one Host header field or a
+> Host header field with an invalid field-value.
+
+Which means that I must parse the entire header field set.
+
+However, if I look a bit closer at §3.2.2, I see that this short-cut
+is only valid for deciding to *not handle* a message; if I am handling
+it, I cannot use this short-cut.
+
+Except that if I decide not to handle a request based on the Host
+header field, the correct thing to do is to send a 404 status code.
+Which implies that I have parsed the remainder of the header field set
+to validate the message syntax. Oh no, what do I do?
+
+Well, there are a number of "A server MUST respond with a XXX code if"
+rules that can all be triggered on the same request. So we get to
+choose which to use.
+
+And fortunately for optimizing implementations,
+[§3.2.5][RFC7230§3.2.5] gave us:
+
+> A server that receives a ... set of fields,
+> larger than it wishes to process MUST respond with an appropriate 4xx
+> (Client Error) status code.
+
+And since the header field set is longer than we want to process
+(since we want to short-cut processing), we are free to respond with
+whichever 4XX status code we like!
+
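The §5.4 rule quoted above is why a strict implementation cannot stop at the first Host field. A rough sketch of that check, assuming the header section has already been split into (name, value) pairs; `check_host` and its crude validity test are hypothetical stand-ins, not a complete validator:

```python
def check_host(fields):
    """Apply the RFC 7230 section 5.4 Host rule to an HTTP/1.1 request.

    `fields` is a list of (name, value) pairs in received order.
    Returns (None, host) if the request may proceed, or (400, None)
    if it must be rejected.
    """
    # Field names are case-insensitive, and a duplicate Host may appear
    # anywhere, so every field has to be examined.
    hosts = [value.strip() for name, value in fields if name.lower() == "host"]
    if len(hosts) != 1:
        return 400, None
    host = hosts[0]
    # Crude stand-in for real field-value validation (an assumption):
    # reject whitespace and characters that cannot appear in host[:port].
    if any(ch.isspace() or ch in "/?#@" for ch in host):
        return 400, None
    return None, host
```

A short-cutting server would skip the full scan and fall back on the §3.2.5 argument above; this sketch shows what the non-short-cut path has to do.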
+# On normalizing target URIs
+
+An implementer is tempted to normalize URIs all over the place, just
+for safety and sanitation. After all,
+[RFC3986§6.1][] says it's safe!
+
+Unfortunately, most URI normalizer implementations will normalize an
+empty path to "/". Which is not always safe;
+[RFC7230§2.7.3][], which defines this
+"equivalence", actually says:
+
+> When not being used in
+> absolute form as the request target of an OPTIONS request, an empty
+> path component is equivalent to an absolute path of "/", so the
+> normal form is to provide a path of "/" instead.
+
+Which means we can't use the usual normalizer implementation if we are
+making an OPTIONS request!
+
+Why is that? Well, if we turn to [§5.3.4][RFC7230§5.3.4], we
+find the answer. One of the special cases for when the request target
+is not a URI, is that we may use "\*" as the target for an OPTIONS
+request to request information about the origin server itself, rather
+than a resource on that server.
+
+However, as discussed above, the target in a request to a proxy must
+be an absolute URI (and [§5.3.2][RFC7230§5.3.2] says that the
+origin server must also understand this syntax). So, we must define a
+way to map "\*" to an absolute URI.
+
+Naively, one might be tempted to use "/\*" as the path. But that
+would make it impossible to have a resource actually named "/\*". So,
+we must define a special case in the URI syntax that doesn't obstruct
+a real path: the empty path.
+
+If we didn't have this special case in the URI normalizer, and we
+treated the "/" path the same as an empty path in the OPTIONS handler
+of the last proxy server, then it would be impossible to request
+OPTIONS for the "/" resource, as it would get translated into "\*" and
+treated as OPTIONS for the entire server.
+
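The OPTIONS special case can be sketched in Python as follows. Both function names are hypothetical. The second one follows the §5.3.4 rule, as I read it, that the last proxy sends "\*" when the absolute-form target has an empty path and no query; treat it as an illustration rather than a complete proxy implementation.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_target(method: str, target: str) -> str:
    """Normalize an absolute-form target, leaving OPTIONS empty paths alone."""
    parts = urlsplit(target)
    if parts.netloc and parts.path == "":
        if method == "OPTIONS":
            # An empty path here means "the server itself"; rewriting it
            # to "/" would turn it into a request about the "/" resource.
            return target
        parts = parts._replace(path="/")
    return urlunsplit(parts)


def forward_options_target(target: str) -> str:
    """Request target the last proxy sends to the origin server for OPTIONS."""
    parts = urlsplit(target)
    if parts.path == "" and parts.query == "":
        return "*"  # server-wide OPTIONS
    path = parts.path or "/"
    return path + ("?" + parts.query if parts.query else "")
```

With these, `http://example.com` and `http://example.com/` stay distinct for OPTIONS: the first is forwarded as `*`, the second as `/`.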
+[RFC3986§6.1]: https://tools.ietf.org/html/rfc3986#section-6.1
+[RFC7230§2.7.3]: https://tools.ietf.org/html/rfc7230#section-2.7.3
+[RFC7230§3.2.2]: https://tools.ietf.org/html/rfc7230#section-3.2.2
+[RFC7230§3.2.5]: https://tools.ietf.org/html/rfc7230#section-3.2.5
+[RFC7230§5.3.2]: https://tools.ietf.org/html/rfc7230#section-5.3.2
+[RFC7230§5.3.4]: https://tools.ietf.org/html/rfc7230#section-5.3.4
+[RFC7230§5.4]: https://tools.ietf.org/html/rfc7230#section-5.4
+[why-absform]: https://parsiya.net/blog/2016-07-28-thick-client-proxying---part-6-how-https-proxies-work/#3-1-1-why-not-use-the-host-header