If the first getaddrinfo() call fails with "Name or service not known" (EAI_NONAME), subsequent calls seem to go straight to DNS instead of checking the cache first (nscd logs show no lookup attempts, and tcpdump shows traffic to the DNS server). If the first call succeeds in getting an address, all later getaddrinfo() calls go to nscd first, as expected.

I'm compiling against glibc-2.13 for ARM Linux. In my rc.d, nscd is started before my daemon. nscd is set to disallow shared caches and to maintain a host cache. I am using the nscd from busybox (0.47). nsswitch.conf is set so that host lookups check cache/files/dns. host.conf is set to check files/bind.

My daemon is calling getaddrinfo().

I have debug logging enabled for nscd, and the logs show that the connection from the client that had started to read the DNS response closes with a "Broken Pipe" error.

After that, the log shows lookups from other daemons that use the cache (so I know nscd has not locked up), but the daemon that got EAI_NONAME never contacts nscd for a cache lookup again.

If I restart the daemon, I get the same behaviour if the first DNS query times out again.

Is there something in glibc that is invalidating my daemon's link to the cache? Is there a way to reconnect my daemon to the cache without restarting it (similar to forcing a resolv.conf re-load via res_init())?

Kevin Stricker
colin.mc
  • "*... is there a way to reconnect my daemon to the cache without restarting it ...*" Did you try making your daemon call `getaddrinfo()` "**really**" often, say 100+ times? Try it and monitor access to `nscd`. I cannot test this here, but there is a chance your daemon will then test the `nscd` connection again and, if successful, use it from then on. – alk Jun 16 '13 at 10:28
  • "*If I restart the daemon, I get the same behaviour ...*" You are referring to your own daemon here, not to `nscd`, aren't you? – alk Jun 16 '13 at 10:34
  • Btw: I got the idea mentioned above from inspecting eglibc's sources. – alk Jun 16 '13 at 11:10

1 Answer

As alk mentions in his comment, retrying getaddrinfo() more than 100 times should force an nscd query.
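The retry idea can be tried out with a small helper along these lines. This is only a sketch: `nudge_resolver` is a hypothetical name, not a glibc function, and it assumes that every getaddrinfo() call for a non-numeric host name advances the internal retry count regardless of whether the lookup itself succeeds.

```c
#include <assert.h>
#include <netdb.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (not part of glibc): calls getaddrinfo() `count`
   times for `host`, frees every result, and returns how many of the
   calls succeeded. */
int nudge_resolver(const char *host, int count)
{
    assert(host != NULL && count >= 0);

    struct addrinfo hints = {0};
    hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;

    int succeeded = 0;
    for (int i = 0; i < count; i++) {
        struct addrinfo *res = NULL;
        if (getaddrinfo(host, NULL, &hints, &res) == 0) {
            succeeded++;
            freeaddrinfo(res);        /* only valid after a successful call */
        }
    }
    return succeeded;
}
```

A daemon could call something like `nudge_resolver(name, 101)` after seeing EAI_NONAME and then watch the nscd debug log for a new lookup. Each iteration may generate real DNS traffic, so this is better suited to diagnosing the behaviour than to production use.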


To understand why, let us take a quick peek into the flow of execution inside getaddrinfo().

  1. getaddrinfo() calls gaih_inet().

  2. gaih_inet() performs the following operations on __nss_not_use_nscd_hosts:

    • Checks whether it is a positive integer.
    • If so, increments it.
    • Checks whether it now exceeds the retry count NSS_NSCD_RETRY.

      • If both of the above conditions are satisfied, the count is reset to zero.

      • It attempts to query nscd ONLY while the count is zero. While the count is
        positive, nscd is ignored, which lasts for about NSS_NSCD_RETRY calls to
        getaddrinfo() before the count is reset and nscd is tried again.

  3. Also __nss_not_use_nscd_hosts is modified internally by nscd in the following places

Based on the above, it can be concluded that
getaddrinfo() does NOT query nscd every single time.

The client-side state kept inside glibc (tracked by __nss_not_use_nscd_hosts)
decides whether getaddrinfo() ends up calling nscd or not.
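The counter behaviour can be modelled in a few lines of C. This is a simplified sketch based on my reading of the glibc source, not the actual implementation: `not_use_nscd_hosts` stands in for `__nss_not_use_nscd_hosts`, the function names are hypothetical, and the model assumes that a failed nscd connection sets the count to 1 and that nscd is tried only while the count is zero.

```c
#include <assert.h>
#include <stdbool.h>

#define NSS_NSCD_RETRY 100  /* the retry limit cited in this answer */

/* Stand-in for glibc's __nss_not_use_nscd_hosts (model only). */
static int not_use_nscd_hosts = 0;

/* Assumption: a failed connection to nscd marks it as unusable. */
void mark_nscd_unavailable(void)
{
    not_use_nscd_hosts = 1;
}

/* Models the check done on each lookup: a positive count is incremented
   and, once it exceeds NSS_NSCD_RETRY, reset to zero. Returns true when
   this call would try nscd. */
bool should_query_nscd(void)
{
    assert(not_use_nscd_hosts >= 0);
    if (not_use_nscd_hosts > 0 && ++not_use_nscd_hosts > NSS_NSCD_RETRY)
        not_use_nscd_hosts = 0;
    return not_use_nscd_hosts == 0;
}

/* Counts lookups up to and including the first one that tries nscd. */
int calls_until_nscd_retry(void)
{
    int calls = 0;
    do {
        calls++;
    } while (!should_query_nscd());
    return calls;
}
```

In this model, once nscd has been marked unavailable, the next 99 lookups skip it and the 100th tries it again, which matches the "retry about 100 times" advice above.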

To really force one's way around the 100-retry limitation, one could modify NSS_NSCD_RETRY and rebuild libc to deviate from the standard behaviour. However, I am not sure whether this would cause unintended regressions elsewhere.

Reference : Patch that introduced the __nss_not_use_nscd_hosts logic in getaddrinfo().

TheCodeArtist