I am adding usage of a small library to a large existing piece of software and would like to analyze the library's overhead and its attribution, in finer detail than in-and-out rdtsc() or gettimeofday() calls allow. With rdtsc() I can get a sense of the latency of calls into my library's functions, but I cannot attribute that latency unless I can also see whether branches are being mispredicted, whether the caches are missing, and so on.

I looked into PAPI, as I imagined reading certain hardware events on entry to and exit from a routine in my library, within the context of the bigger binary, but it seems I would need a specific kernel module for PAPI to work for me (Linux 2.6.18, Intel Xeon 5570). There is also VTune, which is geared specifically toward Intel processors, but it seems to profile the entire binary rather than specific code snippets (the 3-4 calls into my library).
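For context, the coarse in-and-out timing I can do today looks roughly like the sketch below. It is a simplified illustration, not my real code: `lib_call` and `time_lib_call` are hypothetical names standing in for one of the library entry points and its wrapper, and it assumes GCC or Clang on x86, where `__rdtsc()` is available from `<x86intrin.h>`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>  /* __rdtsc(); assumes GCC/Clang on x86 */

/* Hypothetical stand-in for one of the 3-4 calls into the library. */
static int lib_call(int x)
{
    volatile int acc = 0;           /* volatile so the loop is not optimized away */
    for (int i = 0; i < 1000; ++i)
        acc += x ^ i;
    return acc;
}

/* Brackets a single call with TSC reads and returns the elapsed ticks.
 * The call's result is stored through `out`.
 * Note: rdtsc is not a serializing instruction, so the reads may be
 * reordered slightly relative to the surrounding code. */
static uint64_t time_lib_call(int arg, int *out)
{
    uint64_t start = __rdtsc();
    *out = lib_call(arg);
    uint64_t end = __rdtsc();
    return end - start;
}
```

Calling `time_lib_call(3, &result)` gives me a tick count for that one call, but nothing about why it took that long. That gap is what I am hoping hardware counters can fill.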
Is there a way for me to use VTune for my goal, or possibly something else that can give me access to such counters without having to patch my kernel?