Tuesday, May 17, 2011

Day 51 - proxy_pass and resolver

Today I took a tour of the proxy and upstream directives and I found things I did not expect. It sounds like I could always start my posts with this. Let me start with the basic idea: nginx is good "in front" of another web server because it buffers requests and responses and minimizes the time resources are "locked" by a request in a backend server. That's why you can serve more requests per second with an Apache behind an nginx than with an Apache all by itself. Kind of counter-intuitive at first (adding layers usually makes things slower, not faster) but that's the way it is. Now, my idea was: let's do the same with a crawler and see what happens. Let's say you use wget (if you're a curl kind of guy, just go ahead, Jesus still loves you... ;)) to crawl pages. The question is: would wget benefit from having an nginx between it and the web it's trying to crawl?

So, instead of going:

wget 'http://www.nginx-discovery.com'

I would go:

wget 'http://localhost:1970/crawl/www.nginx-discovery.com/'

Of course with my nginx listening on port 1970 on my localhost. Yes, this is a weird idea, but no more than running nginx on top of Apache to make it faster...

Good news: there is a proxy_pass directive used to implement reverse proxies and that should do the trick, if (yes, there is an if) we manage to extract the target of the crawl from the URL and use it as the backend. You know me, I am a big Test::Nginx fan (especially the flavor with my improvements ;)). So, I used it to test this idea:

=== TEST 1
--- config
location ^~ /crawl {
    location ~ "^/crawl/(.*)/(.*)" {
        proxy_pass $1/$2;
    }
}
--- request
GET /crawl/www.google.com/
--- response_body_like
<title>Google</title>

Don't even bother to try this at home: it fails lamentably with a 500 ("Internal server error"). So, let's dig into the error logs:

[error] 7664#0: *1 invalid URL prefix in "www.google.com/", [...]

Mmm, let's add the infamous http:// prefix:

=== TEST 2
--- config
location ^~ /crawl {
    location ~ "^/crawl/(.*)/(.*)" {
        proxy_pass http://$1/$2;
    }
}
--- request
GET /crawl/www.google.com/
--- response_body_like
<title>Google</title>

Different result but still in the 5xx family: "502 Bad Gateway". Logs are different too:

[error] 7741#0: *1 no resolver defined to resolve www.google.com, [...]

What? nginx doesn't know about Google? And it has been around the web for more than 9 years? Maybe it knows about Altavista or HotBot... ;) More seriously, a quick search on the nginx wiki gets you to the resolver directive. Yes, nginx has its own resolver mechanism, which it does not inherit from the system (can you spell resolv.conf in Windows?). Let's try it:

=== TEST 3
--- config
location ^~ /crawl {
    resolver 8.8.8.8;
    location ~ "^/crawl/(.*)/(.*)" {
        proxy_pass http://$1/$2;
    }
}
--- request
GET /crawl/www.google.com/
--- response_body_like
<title>Google</title>

It could have worked... At least, Google DNS knows about www.google.com, which is reassuring. There are only two little things:

1. I'm in France and my IP is French, so the answer I actually get is a 302 redirect. I changed www.google.com to www.google.fr.

2. The pattern still did not match:

# })();
# </script>'
#     doesn't match '(?s-xim:<title>Google</title>
# )'

I never get that straight with Test::Nginx. The trailing newline of a multi-line section is part of the data... :( So, I should have done:

--- response_body_like : <title>Google</title>

Champagne !!! It worked.
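For those who don't use Test::Nginx, here is a rough sketch of what a standalone nginx.conf for this experiment might look like. The port 1970 and the /crawl prefix come from the wget example above. Binding to 127.0.0.1, the stricter ([^/]+) host capture and the $is_args$args suffix are my own assumptions, not something the test above checks.

```nginx
# Sketch only: a local "crawl proxy" matching the wget example.
events {}

http {
    server {
        # Assumption: loopback only, so this does not become an open proxy.
        listen 127.0.0.1:1970;

        location ^~ /crawl {
            # Needed because the proxy_pass target below contains variables.
            resolver 8.8.8.8;

            # $1 = target host, $2 = rest of the path.
            # Assumption: [^/]+ instead of .* so that the host capture
            # stops at the first slash.
            location ~ "^/crawl/([^/]+)/(.*)" {
                # Assumption: the query string is appended explicitly,
                # because a URI built from variables is sent as written.
                proxy_pass http://$1/$2$is_args$args;
            }
        }
    }
}
```

With this running, wget 'http://localhost:1970/crawl/www.nginx-discovery.com/' should be forwarded to http://www.nginx-discovery.com/.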

Now that we have it running, let's figure out why we need a resolver. Well, Linux, POSIX and the like offer essentially one way to get an IP from a name: gethostbyname (getaddrinfo, its modern successor, has the same problem). If you take the time to read the man page (always a safe thing to do... ;)) you'll realise there is a lot to do to resolve a name: open files, ask NIS or YP what they think about it, ask a DNS server (maybe in a few different ways). And all this is synchronous. Now that you are used to the nginx way, you know how bad this is and you don't want to go down the ugly synchronous road. So Igor, faithful to himself, reimplemented a DNS lookup (with an in-memory cache, mind you) just to avoid calling this ugly blocking gethostbyname... And that's why we have this extra resolver directive. Yes, coding the fastest web server comes at a price...
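The resolver's cache and timeout can be tuned from the configuration. A minimal sketch, assuming an nginx release more recent than the one used for this post (several name server addresses and the valid parameter only arrived in later versions). The values are arbitrary examples, not recommendations:

```nginx
location ^~ /crawl {
    # Two name servers. valid= overrides how long answers stay
    # in the in-memory cache.
    resolver 8.8.8.8 8.8.4.4 valid=300s;

    # How long to wait for a name to resolve before giving up.
    resolver_timeout 5s;

    location ~ "^/crawl/(.*)/(.*)" {
        proxy_pass http://$1/$2;
    }
}
```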

Now, before I let you go, I want to tell you the following test passes:

=== TEST 4
--- config
location = /crawl/www.google.fr/ {
    proxy_pass http://www.google.fr/;
}
--- request
GET /crawl/www.google.fr/
--- response_body_like : <title>Google</title>

Yes, there is no resolver... And actually, I could probably define the IP address of www.google.fr in my /etc/hosts and nginx would use that value. Would nginx use the ugly blocker after all?

The answer is yes, when it knows it doesn't matter. Here, the proxy_pass directive can be fully "determined" at configuration time. And during configuration, calls to ugly blocking functions are not that much of a problem. So, in this situation gethostbyname is used.
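To make the difference concrete, here is a small side-by-side sketch. The location names and the $backend variable are made up for illustration. Only the contrast between a literal host name and a host name held in a variable matters:

```nginx
# Case 1: literal host name. It is resolved once, while the
# configuration is loaded, with the system's blocking resolver
# (so /etc/hosts applies). No resolver directive is needed.
location = /static/ {
    proxy_pass http://www.google.fr/;
}

# Case 2: host name in a variable. The target is only known per
# request, so nginx uses its own non-blocking resolver and insists
# on a resolver directive.
location = /dynamic/ {
    resolver 8.8.8.8;
    set $backend "www.google.fr";   # hypothetical variable name
    proxy_pass http://$backend/;
}
```

One side effect, as far as I can tell: in case 1 the address is only looked up again when the configuration is reloaded, while in case 2 it follows the resolver cache.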

Of course, using two completely different ways of doing things to do things that are very similar can be misleading and cause trouble for the newbie. But, hey I don't drive a Ferrari (for fear of hitting my fig-tree)... ;)

12 comments:

  1. I've been trying to figure out what the cache expiry time is for a resolved dns entry. Any insight into that?

  2. Hi nil,

    From a brief look at the code, I would say it's 300 sec (i.e. 5 minutes) but that surprises me a lot as I would have expected Igor to use the DNS TTL. I'll look into it more carefully if I can find a little bit of time.

  3. Thanks, I did finally notice:

    r->valid = 300;

    in ngx_resolver.c.

    It does seem a little arbitrary, but it works for me.

  4. Got confirmation of the cache at 5 minutes from Maxim in this message: Does nginx honor DNS TTLs for proxy upstreams

  5. Just in case anybody comes here: there have been updates on this. nginx now supports multiple resolvers *and* respects TTLs. Here is the official documentation: http://nginx.org/en/docs/http/ngx_http_core_module.html#resolver

  6. Perfect, this post saved a lot of headache.

  7. You should consider adding your blog to http://planet.ngx.cc/. It's very much in its infancy, but we're getting there. Please message me if you're interested in adding your blog and having troubles.

  8. ..."there is a proxy_pass directive used to implement reverse proxies" - shouldn't the proxy_pass directive be described as implementing a forward proxy?

    Replies
    1. From Igor Sysoev himself:

      "nginx is not intended to run as a forward proxy."

      See: http://forum.nginx.org/read.php?2,196115,196153#msg-196153

  9. Holy crap you just solved a problem for me. I was trying to implement this solution described here: http://dgtool.blogspot.com/2013/02/nginx-as-sticky-balancer-for-ha-using.html

    However we are using hostnames instead of IP addresses as in the blog post. I neglected to check my error.log like an idiot or I would have sorted this out long ago. Thank you!

  10. this worked like a charm, thanks
