Sunday, April 17, 2011

Day 45 - location regexp or no regexp

Sometimes when you are wandering in a foreign land, there is something at the limit of your vision that is always there... You never look at it straight because... Well, because it's not really something you're interested in. But after some time you always end up looking at it carefully. I guess this is curiosity or just a side effect of repetition. So, I have seen so many locations here and there that I decided to have a closer look at them.

The location documentation is really good and probably answers all the questions you might have trying to get your configuration working. And you should stop reading this post right about now. Still there? Good, I was kidding... Let's take a more pragmatic approach than the official documentation and look at examples. I would love to call them "real-life" examples but they are not. They are more "teaching" examples: built specifically to show something.

A location is really the basic building block of an nginx configuration and it is very different from what you are used to with Apache (for example :)). So, this gets everybody confused at the beginning but once you get used to it, you'll come to like (or even love) it. A location block is where you define how to handle requests that match this location. At a very high level, you could describe nginx request processing as:

  1. Parse the request.
  2. Find the matching location.
  3. Process the request based on what is in this location.

There are different kinds of locations. We are going to start with the basic ones: the "exact" location and the "starts with" location (yes, I'm using the Echo module, it's just so convenient for this kind of thing):

    location /foo {
        echo foo;
    }
    location = /foo/bar {
        echo foo-bar-eq;
    }
    location /foo/bar {
        echo foo-bar;
    }

The location with the = is the "exact" location and it's the easiest one to explain: if the request is on the URI that is after the = sign, the content of the location block is used to process the request. So, in the example above, GET /foo/bar returns foo-bar-eq (with a \n appended because it's the way of the echo module - or the unix command for that matter).
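
A common place to meet the exact location outside of teaching examples is the site root. The sketch below assumes the same Echo module setup as the rest of this post; the response strings are made up for illustration.

    # Only the request for "/" itself ends up here
    location = / {
        echo root-only;
    }
    # Every other URI starts with "/" and ends up here
    location / {
        echo everything-else;
    }

With this, GET / should be answered by the first block and any other URI, such as GET /index.html, by the second.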

If, like me, you are stupid enough to try it, you will find that having the same exact location twice does not work and prevents nginx from starting with:

    [emerg]: duplicate location "/foo/bar" in .../nginx.conf

Treating duplicate locations as an error also works for the other types of locations (provided location type and URI are the same). And it's a pretty damn good feature because this way you won't get errors due to an extra copy-paste.

The other type of location in the example above is the "starts with" location. If the request starts with the URI defined, then this location will be used. In our example, that's why GET /foo/bari returns foo-bar, just like GET /foo/bar/ but unlike GET /foo/bor which returns foo. You saw it coming: there is some prioritization going on here. nginx uses the longest "starts with" location that matches the request, not the first one in the configuration order. That's why our config returns foo-bar and not foo when requesting GET /foo/bari.
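
To convince yourself that configuration order plays no role here, the two "starts with" locations can be swapped. This is the same Echo setup as above, with the exact location left out for brevity:

    # Longer prefix first this time; the result should not change
    location /foo/bar {
        echo foo-bar;
    }
    location /foo {
        echo foo;
    }

GET /foo/bari should still return foo-bar and GET /foo/bor should still return foo.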

Now, let's add some hot sauce to this with "regular expression" locations. Yes, of course nginx supports them. If not, how would you forward all the ".php" requests to your fastcgi backend for processing? So, let's have a look at this configuration:

    location ~ ^/foo {
        echo foo;
    }
    location ~ ^/foo/bar {
        echo foo-bar;
    }

~ is the operator identifying that the following string should be considered a "regexp" (there is also a ~* operator that makes the regular expression match case insensitive). If you know about regular expressions and look closely at the ones above, you cannot fail to see the similarity with the previous example (reminder for those of you who skipped the regular expressions course: ^ is the regular expression equivalent of "starts with"). And this is where things get messy: GET /foo returns foo, but GET /foo/bari returns foo as well. Yeah, baby! If you want the same behavior as before, well you just have to write your configuration this way:

    location ~ ^/foo/bar {
        echo foo-bar;
    }
    location ~ ^/foo {
        echo foo;
    }

So, with regular expressions the order of appearance in your config file is important. Yes, this is bad and you should be careful with it (especially when your configuration grows and the regular expressions don't fit on one screen anymore...). Putting the most specific regular expressions first is probably the most natural way (please note that sometimes there is no way to figure out which expression is the most specific, as in ~ /foo/ versus ~ /bar/: both match /a/foo/bar/b and /a/bar/foo/b...).
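
To make the ~ versus ~* difference concrete, here is a small sketch in the same Echo style. The extensions and response strings are arbitrary choices for illustration, and the case-sensitivity remarks assume a build for a case-sensitive system such as a typical Linux box.

    # Case-sensitive: should match /index.php but not /index.PHP
    location ~ \.php$ {
        echo php;
    }
    # Case-insensitive: should match /logo.png as well as /LOGO.PNG
    location ~* \.(png|jpg)$ {
        echo image;
    }

Neither regexp can match the other's URIs, so here the order of the two blocks does not matter; it only starts to matter when two regexps can match the same request, as with ^/foo and ^/foo/bar above.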

Now, if you think this is the end of it, think again and look at this:

    location /foo/bar {
        echo foo-bar-starts-with;
    }
    location ~ ^/foo/bar {
        echo foo-bar-regexp;
    }

What is the response to GET /foo/bari? Les jeux sont faits, rien ne va plus... And the winner is foo-bar-regexp. Yes, "regexp" locations take precedence over "starts with" locations (regardless of the order of appearance, as you noticed from the example).

Now is a good time to introduce the "starts with before regexp" location (operator is ^~):

    location ~ ^/foo/bar {
        echo foo-bar-regexp;
    }
    location ^~ /foo/bar {
        echo foo-bar-starts-before-regexp;
    }

As you would expect, GET /foo/bari returns foo-bar-starts-before-regexp. The operator is kind of misleading because it's very similar to the regexp one. To remember it, I use the following trick: ~ is for regexp and ^ denotes both the beginning and the negation in the regexp world. So ^~ "almost naturally" conveys the meanings that it comes before regexp and that it is not a regexp. YMMV, of course... ;)
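
A typical use of ^~, sketched here with made-up paths and response strings, is to shield a directory from a catch-all regexp:

    # Anything under /images/ stops here; regexps are not tried
    location ^~ /images/ {
        echo images-dir;
    }
    # Catch-all for image extensions everywhere else
    location ~* \.(gif|jpg|jpeg)$ {
        echo image-by-extension;
    }

GET /images/logo.gif should return images-dir because the ^~ location matches and the regexps are never tried, while GET /docs/logo.gif falls through to the regexp and returns image-by-extension.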

The last type of location is very different from the rest of the gang because it is not used during "the search for a location" on a request. It is a "named" location and will be available only as the target of a rewrite (or something similar). Here is an example:

    location @/foo/bari {
        echo foo-bar-internal;
    }
    location ^~ /foo/bar {
        echo foo-bar-starts-before-regexp;
    }

Just like before, a GET /foo/bari returns foo-bar-starts-before-regexp. So, we just proved that named locations are not used for external request processing. I'm not going to give you an example of using a named location in a rewrite rule because this is far beyond the subject of this post. Now, please note that the @ operator is not followed by a space (unlike the other location operators). Actually, it is pretty much standard to use something that "does not look like a URI" as the name of a named location. Something like location @fallback.
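
For the curious, here is a minimal sketch of how a named location is usually reached, through try_files rather than a rewrite rule. The name @fallback and the response string are only placeholders, and the sketch assumes the usual root setup so that try_files has somewhere to look for files:

    location /foo {
        # Serve the file matching the URI if it exists, otherwise jump to @fallback
        try_files $uri @fallback;
    }
    location @fallback {
        echo fallback;
    }

If no file matches the request URI, nginx hands the request over to @fallback internally; a client can never ask for @fallback directly.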

So, with all this you see the order of precedence for locations:

  1. "Exact" location (= operator).
  2. "Starts with before regexp" location (^~ operator), provided it is the longest "starts with" match for the request. Inside this, longest strings come first.
  3. "Regular expression" location (~ or ~* operator). Inside this, order of precedence is order of appearance in configuration file.
  4. "Starts with" location (no operator). Inside this, longest strings come first.
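
The list above can be tried out with one location of each kind. This is only a sketch: the URIs and response strings are invented for the purpose, and the expected answers in the comments follow from the rules described in this post.

    # GET /foo/bar           -> exact
    location = /foo/bar {
        echo exact;
    }
    # GET /foo/static/a.jpg  -> starts-before-regexp (the regexp below is not tried)
    location ^~ /foo/static {
        echo starts-before-regexp;
    }
    # GET /foo/a.jpg         -> regexp
    location ~ \.jpg$ {
        echo regexp;
    }
    # GET /foo/baz           -> starts-with
    location /foo {
        echo starts-with;
    }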

You see there is quite a bit of room for shooting oneself in the foot. That is probably why you will see most of Igor's configurations (in replies to the mailing-list) avoid using regular expressions. I think this has to do with how difficult it is to figure out which location is used when reading the configuration, but also with the fact that the regexp trial-and-error method to find the matching location is slow compared to the method used for literal strings (a tree lookup). So, the recommended way seems to be to use "nested locations" like this:

    location /thumb- {
        location ~ ^/thumb-(.*)\.jpg$ {
            [...]
        }
    }

With this, not only do you have a cleaner configuration (it's easier to figure out what happens to /thumb-... requests) but you also save CPU cycles by not even trying a regexp match (that will fail) on all requests that don't start with /thumb-. Oh, the wiki page says nested locations are not recommended, but if the creator himself uses them, why shouldn't we? ;)

That should give you a little bit to ponder till my next post.

4 comments:

  1. You're welcome. And there are more gory details on locations coming... ;)

  2. Helpful post! thanks.

  3. Awesome tutorial, I wish I have found it 2 months ago.
