nginx discovery journey: January 2011

Monday, January 31, 2011

Day 9 - nginx master and servant (aka workers)

Today, I tried to figure out how nginx scripts work. So, I finally got around to installing an IDE. I could have looked for the best IDE for C programming but I did not feel like spending too much time reading reviews. So, I went for what I know: Eclipse. This has nothing to do with nginx but I must say I'm not happy with the community's attitude (for once). On one side you have gcj, which seems to have difficulties getting its VM to work; on another side you have the Eclipse team, who claim "to test only for the java/sun/Oracle (whatever their name is) VM"; on yet another side you have the packaging team, who package gcj; and at the end of the day you have the poor user who gets something that doesn't work and must break his installation by adding the VM through the java/sun/Oracle RPM and tar xvzf-ing eclipse. That was not my point but I couldn't resist making it.

So, wanting to see nginx scripts in action, I started my favorite IDE, set a breakpoint in the script compilation code (ngx_http_script.c for those of you who care) and started nginx with the 'out-of-the-box' configuration. That's when I realized what the "process-model" of nginx is. It's actually pretty standard: one master process that reads the configuration file and spawns "worker" processes that will do the job. When you send the appropriate signal to the master/parent process (which is what nginx -s reload ends up doing), then your master will just load the new configuration, stop the existing workers and spawn new ones (which now have the new configuration loaded). Simple, nice, efficient.
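
To fix the idea in my head, here is a minimal sketch of that master/worker pattern in plain POSIX C. None of this is nginx code: spawn_workers, stop_workers, reload_workers and MAX_WORKERS are names I made up, the workers do nothing but sleep, and the order used during a reload (new generation first, old one retired afterwards) is my assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_WORKERS 4

/* Hypothetical worker body: a real worker would run an event loop here.
 * This one just sleeps until the master terminates it with SIGTERM. */
static void worker_loop(void)
{
    signal(SIGTERM, SIG_DFL);   /* make sure SIGTERM ends the worker */
    for (;;) {
        pause();
    }
}

/* Fork up to `n` workers and record their pids. Returns how many started. */
static int spawn_workers(pid_t *pids, int n)
{
    assert(pids != NULL);
    int started = 0;
    for (int i = 0; i < n && i < MAX_WORKERS; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            break;              /* fork failed: stop spawning */
        }
        if (pid == 0) {
            worker_loop();      /* child: never returns */
        }
        pids[started++] = pid;  /* parent: remember the worker */
    }
    return started;
}

/* Ask every worker to stop and wait for it. Returns how many were reaped. */
static int stop_workers(const pid_t *pids, int n)
{
    int reaped = 0;
    for (int i = 0; i < n; i++) {
        kill(pids[i], SIGTERM);
        if (waitpid(pids[i], NULL, 0) == pids[i]) {
            reaped++;
        }
    }
    return reaped;
}

/* What a reload amounts to in this sketch: start a new generation
 * (which would see the new configuration), then retire the old one. */
static int reload_workers(pid_t *pids, int n)
{
    pid_t fresh[MAX_WORKERS];
    int started = spawn_workers(fresh, n);
    stop_workers(pids, n);
    for (int i = 0; i < started; i++) {
        pids[i] = fresh[i];
    }
    return started;
}
```

A real master would call something like reload_workers from its main loop after noticing the reload signal, rather than from the signal handler itself.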

As I said, I was expecting to see scripts in action. More specifically, I was convinced the logging mechanism was using them (to handle the log_format directive) but I was obviously mistaken because I hit my breakpoint only once, and that was to compile "$proxy_host" for the http_proxy module. Therefore I looked at the logging mechanism and discovered that it uses its own code for parsing and managing log_format. I was a bit disappointed and I really wonder why it is so. It probably has to do with the fact that logging was implemented before the scripts and never refactored after scripts made it into the source tree. Of course, this is just a guess on my part...

Friday, January 28, 2011

Day 8 - nginx scripts : another name for templates ?

Discovered scripts (the chapter I'm translating). Surprisingly, scripts look like
a basic template mechanism. It's built into the "HTTP" part of nginx. You give it a string with
${variable_name} references in it and it will create some "byte-code" that
you can later run to produce strings. Of course, the variable referenced by "variable_name" must
have been previously extracted. And this is where you use the variables I
mentioned yesterday.
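
Here is how I picture the compile-then-run idea, as a small C sketch. It is not how ngx_http_script.c does it: op_t, script_t, script_compile, script_run and lookup_var are all hypothetical names, and the two variables ("host" and "uri") and their values are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_OPS 16

/* One "instruction": copy literal text, or copy a variable's value. */
typedef struct {
    int         is_var;   /* 0 = literal, 1 = variable reference */
    const char *text;     /* points into the template, no copy */
    size_t      len;
} op_t;

typedef struct {
    op_t   ops[MAX_OPS];
    size_t count;
} script_t;

/* Hypothetical variable lookup: only "host" and "uri" exist here. */
static const char *lookup_var(const char *name, size_t len)
{
    if (len == 4 && memcmp(name, "host", 4) == 0) return "example.org";
    if (len == 3 && memcmp(name, "uri", 3) == 0)  return "/index.html";
    return "";
}

/* Compile pass: split the template into literal and variable ops.
 * Returns 0 on success, -1 on malformed input or too many ops. */
static int script_compile(script_t *s, const char *tpl)
{
    assert(s != NULL && tpl != NULL);
    s->count = 0;
    const char *p = tpl;
    while (*p != '\0') {
        if (s->count == MAX_OPS) return -1;
        op_t *op = &s->ops[s->count];
        if (p[0] == '$' && p[1] == '{') {
            const char *end = strchr(p + 2, '}');
            if (end == NULL) return -1;      /* unterminated ${ */
            op->is_var = 1;
            op->text = p + 2;
            op->len = (size_t)(end - (p + 2));
            p = end + 1;
        } else {
            const char *next = strstr(p + 1, "${");
            if (next == NULL) next = p + strlen(p);
            op->is_var = 0;
            op->text = p;
            op->len = (size_t)(next - p);
            p = next;
        }
        s->count++;
    }
    return 0;
}

/* Run pass: execute the ops into `out`. Returns the length written,
 * or -1 if `out` is too small. */
static int script_run(const script_t *s, char *out, size_t cap)
{
    size_t used = 0;
    for (size_t i = 0; i < s->count; i++) {
        const op_t *op = &s->ops[i];
        const char *src = op->text;
        size_t len = op->len;
        if (op->is_var) {
            src = lookup_var(op->text, op->len);
            len = strlen(src);
        }
        if (used + len + 1 > cap) return -1;
        memcpy(out + used, src, len);
        used += len;
    }
    if (cap == 0) return -1;
    out[used] = '\0';
    return (int)used;
}
```

The point of the split is that the parsing cost is paid once, when the template is compiled, while the run pass is a simple loop over already-decoded ops.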

This definitely looks like another hardcore, performance-oriented implementation
of a "simple" feature. Beyond the fact that it confirms that "the devil is in the
details", it also makes me wonder "how far are you ready to go to
make something fast" (the ngx_http_script.c file itself is almost 2000 lines
long)? To me, this is further proof that Igor is not much of a compromise guy. But then
again, with compromises he wouldn't have written the fastest web server on
the surface of the planet...

Thursday, January 27, 2011

Day 7 - bye bye buffers, hello variables

I moved out of the buffers/chains chapter with the very serious conviction that when I proofread my translation I will throw it away and redo it.

So, I moved to the next chapter: variables. At this point, it's not clear to me whether a variable is a thing that can be converted to/from a string or if it is even more generic than that : like a value holder for which you can specify a getter/setter (which could translate from/to string but also do something even niftier). It definitely includes a mechanism for caching (to avoid unnecessary calls to the getter, I assume) which can be activated or not for each variable.

From reading the code, you can cache things like "nginx_version" (not really a surprise) but you cannot cache "request_method" (I would qualify this as half a surprise). Therefore I would tend to think that the caching mechanism has only one level of granularity : the life of a process. It does not seem to support caching for the lifetime of a request ("request_method" cannot change in the middle of processing a request). But then again, this is all pretty cloudy.
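
My current mental model, written down as a C sketch so I can compare it with the real thing later. Everything here is hypothetical (var_t, var_value, the getters and the version string are mine, not nginx's): a variable is a getter plus a flag saying whether the first result may be reused.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical value holder: a getter plus an optional cached result. */
typedef struct var_s var_t;
typedef const char *(*var_getter_t)(void *ctx);

struct var_s {
    const char  *name;
    var_getter_t get;        /* computes the value on demand */
    int          cacheable;  /* 1 = the first result may be reused */
    int          valid;      /* 1 = `cached` holds a result */
    const char  *cached;
};

/* Return the variable's value, calling the getter only when needed. */
static const char *var_value(var_t *v, void *ctx)
{
    assert(v != NULL && v->get != NULL);
    if (v->cacheable && v->valid) {
        return v->cached;
    }
    const char *val = v->get(ctx);
    if (v->cacheable) {
        v->cached = val;
        v->valid = 1;
    }
    return val;
}

/* Two example getters; each counts its calls through `ctx`. */
static const char *get_version(void *ctx)
{
    ++*(int *)ctx;
    return "0.8.54";
}

static const char *get_method(void *ctx)
{
    ++*(int *)ctx;
    return "GET";
}
```

With a variable declared cacheable, reading it twice calls the getter once; with the flag off, the getter runs on every read. Where the cache lives (process, request, something else) is exactly the part I have not figured out yet, so the sketch simply keeps it inside the variable.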

This thing is really a journey: the more I explore, the more I realize how little I understand this thing. But I still hope that at some point I will reach a mountain from which I can say "I get the whole picture". But for now, I'm in the middle of the jungle, trying not to be killed by wild beasts.

I should have mentioned it yesterday but forgot: I subscribed to the nginx users mailing list. Definitely more traffic than on the development one but to be completely honest, there is not a big difference in the topics. The devel mailing list does not have many discussions about the actual development of nginx.

Wednesday, January 26, 2011

Day 6 - the C10K problem, nginx buffers and chains

The C10K problem page is a must-read for anyone who is really interested in highly-scalable event-driven architecture. Now, it's a bit outdated and some of the problems mentioned (including the "thundering herd" problem) seem to have been solved a long time ago by our friends working on the kernel. But it took a very long time for the kernel to provide mechanisms that support event-driven user-space software. That is also probably a good reason why nginx could not exist sooner. The lesson I get from this (and this is definitely one I never thought about) is "if the thing is asynchronous, don't make it synchronous just because it makes your life easier". Eventually, you will have to bite the bullet. So, the sooner, the better.

Now, to keep you posted on the translation front, I'm in the hardcore part of nginx : buffers and chains of buffers. I must say, I feel like I'm in the middle of the ocean with no land in sight : LOST. The two concepts are closely tied together but they are also closely tied to the way content is generated and processed. And that part comes later in the paper I'm translating (or at least, that's what I believe). So, it's like I'm understanding the words but not the meaning of the sentence.

Enough about how I feel. High-level, a buffer points to a chunk of memory and keeps track of where nginx stands in terms of "processing" this data. Chains are a succession of such buffers. I think the whole idea behind this is to avoid copying memory back and forth. Let's say you have a constant string you want added at the end of every single response. There is no need to allocate a copy of this string for each request : you build a buffer that points to the constant string (allocated by your compiler somewhere in memory) and "chain" this buffer just after the original response buffer. I'm oversimplifying here but I think this is the idea. At some point (when I am sure I fully understand it), I will have to draw a good diagram. I definitely think it could be useful. That will come when I'm done with the translation. So, I'm @TODOing this then...
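
Until the diagram exists, here is the constant-footer example as a C sketch. This is my simplified picture, not nginx's actual structures: buf_t, chain_t, buf_init_footer, chain_append and chain_write are hypothetical names, and the real buffers surely carry more than two pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A buffer does not own memory: it points at it and tracks progress. */
typedef struct {
    const char *pos;   /* next byte still to be processed */
    const char *last;  /* one past the end of the data */
} buf_t;

/* A chain is a singly linked list of buffers. */
typedef struct chain_s {
    buf_t          *buf;
    struct chain_s *next;
} chain_t;

static const char footer[] = "\n-- served by a sketch --\n";

/* Point `b` at the constant footer: no allocation, no copy. */
static void buf_init_footer(buf_t *b)
{
    assert(b != NULL);
    b->pos = footer;
    b->last = footer + sizeof footer - 1;
}

/* Append `link` at the end of the chain starting at `head`. */
static void chain_append(chain_t *head, chain_t *link)
{
    assert(head != NULL && link != NULL);
    while (head->next != NULL) {
        head = head->next;
    }
    head->next = link;
    link->next = NULL;
}

/* "Send" up to `cap` bytes into `out`, advancing each buffer's pos so
 * a later call resumes where this one stopped. Returns bytes written. */
static size_t chain_write(chain_t *c, char *out, size_t cap)
{
    size_t used = 0;
    for (; c != NULL && used < cap; c = c->next) {
        size_t n = (size_t)(c->buf->last - c->buf->pos);
        if (n > cap - used) {
            n = cap - used;
        }
        memcpy(out + used, c->buf->pos, n);
        c->buf->pos += n;
        used += n;
    }
    return used;
}
```

The only copy happens at the very end, in chain_write, which stands in for the socket write. Because each buffer remembers its own pos, a short write is not a problem: the next call picks up where the previous one stopped.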

One more revelation for the day : there seems to be one chain per memory allocation pool. Cannot begin to fathom why. But I'm sure there is a good reason. Maybe future events will shed some light on this.

Tuesday, January 25, 2011

Day 5 - nginx is asynchronous but it also manages memory. Introducing buffers and chains

Done with the first chapters. Of course, when I say "done", it means I did a first translation. I still have to review it. And, I don't know why but I have the feeling that when I come back to this I will be like "I didn't understand a word of what I was saying" and will want only one thing: redo it. I guess that's why compilers always have multiple passes: one to get it wrong but get an idea and one to get it right... ;)

These first chapters were general principles about nginx: it's asynchronous (that's why it's so fast and cheap on CPU) and if you code a module, just make sure it is as well if you don't want to break everything. Now, that was not a surprise: after all, the nginx home page claims it's one of the few web servers to solve the C10K problem. But it looks like this thing goes much further than that. Being asynchronous is just the tip of the iceberg : it also has its own memory management mechanisms. If you were planning to do malloc/free, think again. Surprising at first glance, but when you come to think of it, and once you have seen what happens to a good old Apache when it starts getting more than 1500 requests per second, you figure out this is definitely something to be careful with. For those of you who haven't seen Apache at 1500 rps (the actual number really depends on available memory and configuration), I'll save you the burden of trying: your whole system starts swapping and all you can do is pray that people don't start hitting the infamous F5 key (aka the "it doesn't work but if I keep hitting it harder and harder maybe it's going to work" key).
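
To give an idea of what "its own memory management" can mean, here is a minimal sketch of a memory pool in C. It is only the general technique, with names of my own (pool_t, pool_create, pool_alloc, pool_destroy, POOL_ALIGN); nginx's pools are certainly more elaborate than one fixed-size block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_ALIGN 16   /* assumed alignment, enough for basic types */

/* Hypothetical fixed-size pool: one malloc up front, bump allocation
 * afterwards, and a single free for everything at the end. */
typedef struct {
    unsigned char *start;
    size_t         used;
    size_t         size;
} pool_t;

/* Returns 0 on success, -1 if the underlying malloc fails. */
static int pool_create(pool_t *p, size_t size)
{
    assert(p != NULL);
    p->start = malloc(size);
    p->used = 0;
    p->size = size;
    return p->start != NULL ? 0 : -1;
}

/* Carve `n` bytes out of the pool at a POOL_ALIGN-aligned offset.
 * Returns NULL when the pool is exhausted. */
static void *pool_alloc(pool_t *p, size_t n)
{
    size_t off = (p->used + POOL_ALIGN - 1) & ~(size_t)(POOL_ALIGN - 1);
    if (off > p->size || n > p->size - off) {
        return NULL;
    }
    p->used = off + n;
    return p->start + off;
}

/* Release every allocation at once, e.g. when a request is finished. */
static void pool_destroy(pool_t *p)
{
    free(p->start);
    p->start = NULL;
    p->used = 0;
    p->size = 0;
}
```

The attraction is that individual allocations cost a few additions, nothing is freed piece by piece, and forgetting a free cannot leak: everything tied to the pool goes away in one call.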

Anyway, next chapters are about buffers and chains. I'll probably have a look at the C10K article before going there. This buffers/chains thing is the core of the nginx architecture and I know it's going to be difficult to get through it. So, I feel like I need as much preparation as I can get. And the C10K paper is a good start.

Monday, January 24, 2011

Day 4 - subscription to development mailing list and nginx code cross-referenced

Subscribed to the English development mailing list and looked a bit at the archive/stats. The Russian mailing list is definitely more active than the English one. And after starting to translate Valery's page (even with help from Google Translate and Babelfish) I know I won't pick up Russian just by reading it...

This project being so tightly coupled to Russian probably explains why it took so long for such a killer thing to make it out into the wild and be recognized and used by the community. Nobody likes using something whose documentation is all in a language they don't understand. But obviously, a lot of effort went there over the past few years. At least for the "user's guide/manual" part. The hardcore stuff (architecture, coding of modules, etc.) is still not that obvious. And don't expect any help from the comments in the code : there are almost none...

The fact that there does not seem to be any official code repository for it is probably another reason why it took so long to take off.

So, I'm working on the translation and as I don't speak Russian (already mentioned that), I had to look at the code. There is a very handy cross-reference of the code online. It's on an old code base (0.8.20 vs. 0.8.54(stable)/0.9.2(dev)) but it's here : nginx code cross-reference. Very useful if you don't have a modern IDE handy.

Saturday, January 22, 2011

Day 3 - reading day. Evan again and a newcomer : agentzh

Read Evan Miller's module guide and advanced module guide.
Still a lot of things I don't understand. And I'm not talking about the nitty-gritty details of implementation. I'm talking about the overview of the thing. To me it looks like this thing is very complicated. I'm probably missing some of the high-level understanding. How I'm going to get this is still a mystery... But I keep hoping.

As this was a "reading" day, I went to agentzh's presentation The state of the art of nginx.conf scripting. Not a lot of information helping me here. This is more a catalogue of modules developed by agentzh than an explanation of how nginx works. But I am in awe of what he is doing. And I feel like I'm grasping some of the nginx philosophy : it's an engine (hence the name, I guess) and if you want something done, write a module of your own. Even if this module is super-specific (like a module for tracking). It is probably not a very scalable approach to development for most people (a typical web application has hundreds if not thousands of "actions" : I wouldn't dare code all that as a module in nginx/C). However, for something that must be highly scalable (talking lots of requests per second here) and does not have a lot of business logic, this approach might be a true competitive advantage in situations where the technical cost of the transaction is a fair share of the total cost of the transaction (I'm thinking post-crisis ad-serving here). Anyway, this is rambling so I'd better stop...

Read a bit about the configuration of nginx modules at nginx's wiki: nginx modules. It definitely reinforced the idea that nginx is all about module development and you pretty much need one for everything, kitchen sink included.

Friday, January 21, 2011

Day 2 - building nginx, Evan Miller and Valery Kholodkov

Did the "basic" yum install:
yum install gcc make autoconf
That was not enough. Looks like I need a few extra things:

Perl Compatible RegEx:
yum install pcre-devel
and OpenSSL:
yum install openssl-devel

Now, the usual ./configure && make and everything works. Looks like at least I got the thing compiled. It was not that difficult after all...

Looking at the official website, I found a pretty interesting piece of documentation : Emiller's guide to nginx module development. Love the part about Batman. Not that I find the analogy very pertinent (sorry Evan) but it's catchy and fun. Besides that, the thing is clear and well written (from what I can tell). Wish I could do the same. Kudos to Evan (the first name of my 5-year-old "lover", if anyone cares ;)) Miller.

There is another interesting page from Valery Kholodkov: Development of modules for nginx. Only one caveat... It's in Russian and I don't speak Russian. So, as I won't learn Russian, I'm just going to get some help from Google Translate and, based on what comes out plus some code reading (I can speak K&R C better than Russian), I'll try to translate the document myself. It feels like I'm at this stage where I'm going to pay back to the community all (or part of, at least) I got from it. I'll keep you posted and I hope I will avoid the main translation traps (the Italians say "traduttore, traditore", i.e. translators are traitors).

And one more thing (after that I'll be done for today): Valery started a blog of his own. It's called nginx guts. If you are really interested in nginx internals, you're probably better off reading his blog than mine. Now if you want to follow me in my discovery of nginx, feel free to keep reading... ;)

Thursday, January 20, 2011

Day 1 - nginx does http and mail

First look at nginx internals. Guess what? I tar xvzfed it, of course.
And looked at it. First reaction : pretty cool, simple structure, not too many files. What the heck, I've been doing this for 15 years, I can handle this...

Hey, the thing has a mail directory : probably means it can do a mailer job. Better or worse than qmail and/or postfix? I don't care : that's not what I'm here for.

Moving forward directly to the http part. It has modules. It has a core/nginx.c : probably where the main thing is. Looking at it. Yeah, it's here. Something is missing, though : comments. No worries, it's just main, function names are pretty explicit : I can understand. If I had written it, I probably wouldn't have commented it myself.

The cycle thing looks interesting. Looking at core/ngx_cycle.c. This is getting serious. Probably too serious for now. I'll have to look at it later, when I have a better understanding of the big picture. Still no comments. I'm starting to think "you'd better get used to this".

Looking at the hierarchy... http/modules/... A ssi_filter thingie. Very promising for what I want to do. Opening http/modules/ngx_http_ssi_filter_module.c. Trying to figure out what is there. Missing the big picture. Still no comments. Three files, no comments. Now, I'm sure I won't get much help from here. I guess I really entered the community : UTSL is one of the open-source trademarks. At least you can read the source. This is better than not being allowed to. Proprietary source might be a bit more documented, but not always, trust me... And at least, here, I won't have misleading comments of a developer who understood neither the need nor what he did.

Anyway, going back to the main thing... Aside from the command array, the various inits and the configurations/options handling, the ngx_single_process_cycle and ngx_master_process_cycle are clearly the next steps in my exploration. Just have to find them. Let's grep, baby.

And here you go : src/os/unix/ngx_process_cycle. Wasn't expecting to get OS specific this early. Maybe I'm wrong and this is not OS specific but the hierarchy definitely gives a hint in this direction. It starts with an impressive list of sigaddset. Well, this is what nginx is supposed to be all about : signal handling, events, etc. Not too surprised here. Trying to man sigaddset : "No manual entry for sigaddset". Stupid me : my machine doesn't even have developer tools (gcc/make), let alone library documentation. I guess we'll continue this after I'm done with a bunch of yum install...
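
For the record, and since man could not help me, this is roughly what a list of sigaddset calls is for. A minimal POSIX sketch, not the nginx code: build_sigset and block_signals are names I made up, and which signals go into the set is up to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Build a signal set from a list of signal numbers.
 * Returns 0 on success, -1 if any call fails. */
static int build_sigset(sigset_t *set, const int *signals, size_t n)
{
    assert(set != NULL);
    if (sigemptyset(set) != 0) {
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        if (sigaddset(set, signals[i]) != 0) {
            return -1;
        }
    }
    return 0;
}

/* Block the set in the calling process: blocked signals stay pending
 * instead of interrupting the code at an arbitrary point. */
static int block_signals(const sigset_t *set)
{
    return sigprocmask(SIG_BLOCK, set, NULL);
}
```

A process that blocks its signals this way can then wait for them at a single, well-chosen place in its main loop (sigsuspend is the usual POSIX call for that), which fits nicely with an event-driven design.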

Wednesday, January 19, 2011

Dell XPS 16 wifi not working with Fedora 14

This is the first step of my journey discovering nginx : I got my new computer (Dell Studio XPS 16 - core i7 - 6GB RAM - 256GB SSD) and installed Fedora 14 on it.
Great machine : not very efficient energy-wise but what would you expect with 8 cores?

Pretty much everything worked out of the box. Except the wifi. It ships with a Broadcom 4312 and the driver for this is proprietary. So you actually have to get it from Broadcom and "extract" it to your install as clearly explained on this page : extraction and installation of b43 driver on fedora. The rest of the page (about the b43 driver) is interesting as well... Enjoy the read.