Today I took a tour of the proxy and upstream directives and I found things I did not expect. It sounds like I could start all my posts with this sentence. Let me start with the basic idea: nginx is good "in front" of another web server because it buffers requests and responses and minimizes the time resources are "locked" by a request in a backend server. That's why you can serve more requests per second with an Apache behind an nginx than with an Apache all by itself. Kind of counter-intuitive at first (adding layers usually makes things slower, not faster) but that's the way it is. Now, my idea was: let's do the same with a crawler and see what happens. Let's say you use wget (if you're a curl kind of guy, just go ahead, Jesus still loves you... ;)) to crawl pages. The question is: would wget benefit from having an nginx between it and the web it's trying to crawl?
So, instead of going:
wget 'http://www.nginx-discovery.com'
I would go:
wget 'http://localhost:1970/crawl/www.nginx-discovery.com/'
Of course with my nginx listening on port 1970 on my localhost. Yes, this is a weird idea, but no more than running nginx on top of Apache to make it faster...
Good news: there is a proxy_pass directive used to implement reverse proxies, and it should do the trick. If (yes, there is an if) we manage to extract the target of the crawl from the URL, we can use it as a backend.
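Just to set the stage, the classic reverse-proxy setup (the one that makes Apache faster) is only a few lines of configuration. Here is a minimal sketch, with a made-up backend address on port 8080:

```nginx
server {
    listen 1970;

    location / {
        # Forward everything to a hypothetical backend (e.g. Apache)
        # listening on 127.0.0.1:8080. nginx buffers the request and
        # the response, freeing the backend as early as possible.
        proxy_pass http://127.0.0.1:8080;
    }
}
```

The crawling experiment is the same idea, except the backend is not fixed: it has to come out of the requested URL.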
You know me, I am a big Test::Nginx fan (especially the flavor with my improvements ;)). So, I used it to test this idea:
=== TEST 1
--- config
location ^~ /crawl {
    location ~ "^/crawl/(.*)/(.*)" {
        proxy_pass $1/$2;
    }
}
--- request
GET /crawl/www.google.com/
--- response_body_like
<title>Google</title>
Don't even bother trying this at home: it fails miserably with a 500 ("Internal server error"). So, let's dig into the error logs:
[error] 7664#0: *1 invalid URL prefix in "www.google.com/", [...]
Mmm, let's add the infamous http:// prefix:

=== TEST 2
--- config
location ^~ /crawl {
    location ~ "^/crawl/(.*)/(.*)" {
        proxy_pass http://$1/$2;
    }
}
--- request
GET /crawl/www.google.com/
--- response_body_like
<title>Google</title>
Different result but still in the 5xx family: "502 Bad Gateway". Logs are different too:
[error] 7741#0: *1 no resolver defined to resolve www.google.com, [...]
What? nginx doesn't know about google? And it has been around the web for more than 9 years? Maybe it knows about Altavista or HotBot... ;) More seriously, a little search on the nginx wiki gets you to the resolver directive. Yes, nginx has its own resolver mechanism, which it does not inherit from the system (can you spell resolv.conf in Windows?). Let's try it:
=== TEST 3
--- config
location ^~ /crawl {
    resolver 8.8.8.8;
    location ~ "^/crawl/(.*)/(.*)" {
        proxy_pass http://$1/$2;
    }
}
--- request
GET /crawl/www.google.com/
--- response_body_like
<title>Google</title>
It could have worked... At least, Google's DNS knows about www.google.com, which is reassuring. There are only two little things:
1. I'm in France and my IP is French, so the answer I actually get is a 302 redirect. I changed www.google.com to www.google.fr.
2. The test still failed, with output ending in:

# })();
# </script>'
# doesn't match '(?s-xim:<title>Google</title>
# )'
I never get that straight with Test::Nginx. The CR/LF is part of the data... :( So, I should have done:
--- response_body_like : <title>Google</title>
Champagne !!! It worked.
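For the record, putting the two fixes together (the .fr domain and the inline form of response_body_like, which keeps the trailing newline out of the pattern), the passing test would look something like this:

```
=== TEST 3
--- config
location ^~ /crawl {
    resolver 8.8.8.8;
    location ~ "^/crawl/(.*)/(.*)" {
        proxy_pass http://$1/$2;
    }
}
--- request
GET /crawl/www.google.fr/
--- response_body_like : <title>Google</title>
```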
Now that we got it running, let's figure out why we need a resolver. Well, Linux, POSIX and the like offer only one way to get an IP from a name: gethostbyname. If you take the time to read the man page (always a safe thing to do... ;)) you'll realise there is a lot to do to resolve a name: open files, ask NIS or YP what they think about it, ask a DNS server (maybe in a few different ways). And all this is synchronous. Now that you are used to the nginx way, you know how bad this is and you don't want to go down the ugly synchronous road. So Igor, faithful to himself, reimplemented a DNS lookup (with an in-memory cache, mind you) just to avoid calling this ugly blocking gethostbyname... And that's why we have this extra resolver directive. Yes, coding the fastest web server comes at a price...
Now, before I let you go, I want to tell you the following test passes:
=== TEST 4
--- config
location = /crawl/www.google.fr/ {
    proxy_pass http://www.google.fr/;
}
--- request
GET /crawl/www.google.fr/
--- response_body_like : <title>Google</title>
Yes, there is no resolver... And actually, I could probably define the www.google.fr IP address in my /etc/hosts and nginx would use that value. Would nginx use the ugly blocker after all? The answer is yes, but here it doesn't matter: this proxy_pass directive can be fully "determined" at configuration time, and during configuration, calls to ugly blocking functions are not that much of a problem. So, in this situation gethostbyname is used.
Of course, using two completely different mechanisms to do things that are very similar can be misleading and cause trouble for the newbie. But, hey, I don't drive a Ferrari (for fear of hitting my fig-tree)... ;)