Tuesday, May 17, 2011

Day 51 - proxy_pass and resolver

Today I took a tour of the proxy and upstream directives and I found things I did not expect. It sounds like I could always start my posts with this. Let me start with the basic idea: nginx is good "in front" of another web server because it buffers requests and responses and minimizes the time resources are "locked" by a request in a backend server. That's why you can serve more requests per second with an Apache behind a nginx than with an Apache all by itself. Kind of counter-intuitive at the beginning (adding layers usually makes things slower, not faster) but that's the way it is. Now, my idea was: let's do the same with a crawler and see what happens. Let's say you use wget (if you're a curl kind of guy, just go ahead, Jesus still loves you... ;)) to crawl pages. The question is: would wget benefit from having a nginx between it and the web it's trying to crawl?

So, instead of going:

wget 'http://www.nginx-discovery.com'

I would go:

wget 'http://localhost:1970/crawl/www.nginx-discovery.com/'

Of course with my nginx listening on port 1970 on my localhost. Yes, this is a weird idea, but no more than running nginx on top of Apache to make it faster...
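
To make the mapping between the two wget commands above explicit, here is a minimal sketch. The function name crawl_url is made up for illustration, and it assumes the nginx on localhost:1970 with the /crawl/ prefix described above, and http-only targets.

```shell
#!/bin/sh
# Hypothetical helper: turn a target URL into the URL served by the local nginx.
# Assumes nginx listens on localhost:1970 and exposes the /crawl/ prefix.
crawl_url() {
  target=${1#http://}          # drop the scheme, if any
  case "$target" in
    */*) ;;                    # already has a path part
    *) target="$target/" ;;    # bare host: add the trailing slash
  esac
  printf 'http://localhost:1970/crawl/%s\n' "$target"
}

crawl_url 'http://www.nginx-discovery.com'
# prints http://localhost:1970/crawl/www.nginx-discovery.com/
```

With this in place, the crawl would be started with something like: wget "$(crawl_url 'http://www.nginx-discovery.com')"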

Good news: there is a proxy_pass directive used to implement reverse proxies and that should do the trick. If (yes, there is an if) we manage to extract the target of the crawl from the URL and use it as a backend. You know me, I am a big Test::Nginx fan (especially the flavor with my improvements ;)). So, I used it to test this idea:

=== TEST 1
--- config
location ^~ /crawl {
    location ~ "^/crawl/(.*)/(.*)" {
        proxy_pass $1/$2;
    }
}
--- request
GET /crawl/www.google.com/
--- response_body_like
<title>Google</title>

Don't even bother to try this at home: it fails lamentably with a 500 ("Internal server error"). So, let's dig into the error logs:

[error] 7664#0: *1 invalid URL prefix in "www.google.com/", [...]

Mmm, let's add the infamous http:

=== TEST 2
--- config
location ^~ /crawl {
    location ~ "^/crawl/(.*)/(.*)" {
        proxy_pass http://$1/$2;
    }
}
--- request
GET /crawl/www.google.com/
--- response_body_like
<title>Google</title>

Different result but still in the 5xx family: "502 Bad Gateway". Logs are different too:

[error] 7741#0: *1 no resolver defined to resolve www.google.com, [...]

What? nginx doesn't know about google? And it has been around the web for more than 9 years? Maybe it knows about Altavista or HotBot... ;) More seriously, a little search on the nginx wiki gets you to the resolver directive. Yes, nginx has its own resolver mechanism which it does not inherit from the system (can you spell resolv.conf in Windows?). Let's try it:

=== TEST 3
--- config
location ^~ /crawl {
    resolver 8.8.8.8;
    location ~ "^/crawl/(.*)/(.*)" {
        proxy_pass http://$1/$2;
    }
}
--- request
GET /crawl/www.google.com/
--- response_body_like
<title>Google</title>

It could have worked... At least, Google DNS knows about www.google.com, which is reassuring. There are only two little things:

1. I'm in France and my IP is French, so the answer I actually get is a 302 redirect. So I changed www.google.com to www.google.fr.

2.

# })();
# </script>'
#     doesn't match '(?s-xim:<title>Google</title>
# )'

I never get that straight with Test::Nginx. The trailing newline is part of the data... :( So, I should have done:

--- response_body_like : <title>Google</title>

Champagne !!! It worked.

Now that we got it running, let's figure out why we need a resolver. Well, Linux, POSIX and the like basically offer one way to get an IP from a name: gethostbyname (and its more recent sibling getaddrinfo, which blocks just the same). If you take time to read the man page (always a safe thing to do... ;)) you'll realise there is a lot to do to resolve a name: open files, ask NIS or YP what they think about it, ask a DNS server (maybe in a few different ways). And all this is synchronous. Now that you are used to the nginx way, you know how bad this is and you don't want to go down the ugly synchronous way. So Igor, faithful to himself, reimplemented a DNS lookup (with an in-memory cache, mind you) just to avoid calling this ugly blocking gethostbyname... And that's why we have this extra resolver directive. Yes, coding the fastest web server comes at a price...

Now, before I let you go, I want to tell you the following test passes:

=== TEST 4
--- config
location = /crawl/www.google.fr/ {
    proxy_pass http://www.google.fr/;
}
--- request
GET /crawl/www.google.fr/
--- response_body_like : <title>Google</title>

Yes, there is no resolver... And actually, I could probably define the IP address of www.google.fr in my /etc/hosts and nginx would use that value. Would nginx use the ugly blocker after all?

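The answer is yes, when it knows it doesn't matter. Here, the proxy_pass directive contains no variable, so it can be fully "determined" at configuration time. And during configuration, calls to ugly blocking functions are not that much of a problem. So, in this situation gethostbyname is used. As soon as the target depends on a variable (like $1 in the previous tests), the name can only be resolved per request, and that is where the resolver comes in.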

Of course, using two completely different ways of doing things to do things that are very similar can be misleading and cause trouble for the newbie. But, hey I don't drive a Ferrari (for fear of hitting my fig-tree)... ;)

Wednesday, May 4, 2011

Day 50 - which version, which modules ? Or ngx-build.

Today, I figured I would give a guy on the mailing list a hand. Poor guy asked the question a couple of times and got no answer. Of course, I don't know the answer but I figured I could learn a few things trying to help. So, I started creating a support directory in my nginx directory and another sub-directory for his case (map-proxy). Yes, that's the way I am: I like it when my room is in order. Now, I figured I would use Test::Nginx to set up something quickly. So, I started wondering: which version of nginx (I have at least three in my nginx directory)? Which modules? I need to completely set up an environment for it. But I'm not going to download the full source just for that... Not having an easy way to set up a test environment for a specific nginx configuration has been a rock in my shoe (or something else somewhere else) for quite some time now and I decided to fix it, at last...

This looks again like I'm going to get side-tracked and get something else done than what was originally planned. And I'll drag you along... ;) That's what adventure trips are for, aren't they?

So, I'm sick of having to figure out what is the right version of nginx for the task at hand and where I put this in my directories. That's where my dream started: a perfect world where I could order a nginx 0.8.54 with just the echo module. Or even better, the module I'm working on.

Here is the usage I would like to have:

Usage: ngx-build [main-source] [module-source]
  main-source      Can be a VERSION or a DIRECTORY.
                   Look for main nginx source in DIRECTORY.
                   Look for main nginx source in a directory
                   named nginx-VERSION in $HOME, $HOME/nginx
                   and $NGX_ROOT (if set).
  module-source    DIRECTORY or PATTERN. Use DIRECTORY to indicate
                   a module directory (has a config file). Append
                   * before and after PATTERN and look 
                   for module directories in $HOME,
                   $HOME/nginx and $NGX_ROOT (if set) that match
                   the resulting pattern.

Example: ngx-build 0.8.54 echo

Invoking this should run configure with the appropriate options, make the executable and, last but not least, copy it into the current folder.

That gave me something like this ngx-build script (feel free to use/patch/do whatever you like):

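#!/bin/sh
# Remember where we were launched from: the binary is copied back here.
launch_path=`pwd`

# Keep the first candidate that looks like an nginx source tree.
test_ngx_path() {
  if test -z "$my_ngx_path" && test -d "$1" && test -f "$1/src/core/nginx.c"; then
    my_ngx_path="$1"
  fi
}

test_ngx_path "$1"
if test "$NGX_ROOT"; then
  test_ngx_path "$NGX_ROOT/nginx-$1"
fi
test_ngx_path "$HOME/nginx-$1"
test_ngx_path "$HOME/nginx/nginx-$1"

if test -z "$my_ngx_path"; then
  echo "ngx-build: no nginx source found for '$1'" >&2
  exit 1
fi

# Keep the first candidate that has a config file. A glob may expand to
# several directories, so look at every argument.
test_module_path() {
  for candidate in "$@"; do
    if test -z "$my_module_arg" && test -f "$candidate/config"; then
      my_module_arg="--add-module=$candidate"
    fi
  done
}

# Only look for a module when one was asked for.
if test -n "$2"; then
  test_module_path "$2"
  if test "$NGX_ROOT"; then
    test_module_path "$NGX_ROOT"/*"$2"*
  fi
  test_module_path "$HOME"/nginx/*"$2"*
fi

# configure must be run from the directory it lives in.
cd "$my_ngx_path" || exit 1
if test -z "$my_module_arg"; then
  ./configure --with-debug
else
  ./configure "$my_module_arg" --with-debug
fi

# We are already in the source tree, so plain make is enough.
make
cp objs/nginx "$launch_path"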

Besides the usual problems you get when trying to write shell scripts (how many escapes, what is the right syntax for testing something, etc.), there is one particular thing you should know about the configure script of nginx: it MUST be run in the folder where it is. Yes, most people follow the usual mantra: "configure && make && make install" or another slightly different version. But sometimes you would like to do it differently. Well, you are more likely to run into problems on less-travelled roads. Or are you? At least with software, this is the way (also known as: if it's not tested, it's not going to work).
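
If you want to honour this constraint without moving the calling script out of its own directory, a subshell does the job. This is only a sketch: configure_in_tree is a name I made up, and the path in the usage line is an assumption about where your sources live.

```shell
#!/bin/sh
# Hypothetical helper: run ./configure from inside the source tree
# while leaving the caller's working directory untouched.
configure_in_tree() {
  src=$1
  shift
  # The parentheses start a subshell, so the cd is undone when it exits.
  ( cd "$src" && ./configure "$@" )
}
```

Usage would look like: configure_in_tree "$HOME/nginx/nginx-0.8.54" --with-debug. The remaining arguments are passed to configure as they are.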

And I have at least two uses for this script:

% ngx-build 0.8.54 echo
% ngx-build 0.9.5 `pwd`

It would be better if the second invocation could use . for the current directory, but that does not seem to work out of the box and I don't care to spend the extra time on it for now. If anybody does, or wants to add some extra feature, just let me know.
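
For the record, my guess (untested) is that . fails because the module path is recorded as given and the script then changes into the nginx source directory, so --add-module=. no longer points at the module. One possible way around it would be to make the path absolute first. A sketch, with abs_dir being a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: print the absolute path of an existing directory.
# Prints nothing and returns non-zero if the argument is empty or not a directory.
abs_dir() {
  test -n "$1" && ( cd "$1" 2>/dev/null && pwd )
}
```

The script could then call test_module_path "$(abs_dir "$2")" instead of test_module_path "$2", which would let ngx-build 0.9.5 . behave like the `pwd` version.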

Oh, in the meantime someone else answered the question. Next time, maybe I won't be side-tracked...