I have searched to see if this question has been asked before, but I could not find anything. While searching, however, I found many interesting points about optimization in this answer and in the other answers to a question about optimization.

My question is: which is the most efficient/fastest way to set the elements of a large array to zero in C?

The program will track a large number of particles, >>>1000. Each particle is described by several variables some of which will need to be reset to zero every time around a loop, which will be executed >>>1000 times. The exact number of particles that can be handled will depend on the efficiency of the code.

The choices seem to be the following, ordered by my guess from least efficient to most efficient. (I describe them with indicative code fragments. This is not code that can run, just something to indicate the strategy. I realise that loop unrolling might be a good idea, but for simplicity it is not included below.)

  1. N particles are represented by an array of structures, each of which contains all the information about one particle:
  /*structure definition*/
struct particle {
  double a;
  double b;
  ....
};

  /*memory allocation*/
struct particle * part; 
part = (struct particle *)calloc(N,sizeof(struct particle));

  /*routine to set some particle variables to zero*/
for (i=0;i<N;i++)
{
  part[i].a=0;
  part[i].b=0;
  .... etc....
}
  2. N particles are represented by several arrays held in one structure that contains all the information about the ensemble of particles:
  /*structure definition*/
struct ensemble {
  double * a;
  double * b;
  ....
};

  /*memory allocation*/
struct ensemble group; 
group.a = (double *)calloc(N,sizeof(double));
group.b = (double *)calloc(N,sizeof(double));

  /*routine to set some particle variables to zero*/
for (i=0;i<N;i++)
{
  group.a[i]=0;
  group.b[i]=0;
  .... etc....
}
  3. Exactly the same as 2) above, but the variables are reset to zero with
  /*routine to set some particle variables to zero*/
free(group.a); group.a = (double *)calloc(N,sizeof(double));
free(group.b); group.b = (double *)calloc(N,sizeof(double));
  4. Instinctively I think there must be an easier way than 3) to write 0 to memory, one which does not require freeing and then reallocating large amounts of memory every time around the loop. The answers to this question mention memset, which I am guessing would work, provided that setting everything to zero bytewise gives doubles with values of 0.0000000e00.

  5. Same as 2), 3), 4) above, but instead of using any data structure just grab memory for separate arrays.

  /*memory allocation*/
double * a, * b, ... ; 
a = (double *)calloc(N,sizeof(double));
b = (double *)calloc(N,sizeof(double));

  /*routine to set some particle variables to zero*/
for (i=0;i<N;i++)
{
  a[i]=0;
  b[i]=0;
  .... etc....
}
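The `memset` idea from option 4 could look roughly like the sketch below. It assumes the platform stores `double` in IEEE 754 format, where all-zero bytes represent 0.0; `zero_doubles` and `all_zero` are hypothetical helper names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Zero n doubles in place with a single memset call.
   Assumption: all-bits-zero is the representation of 0.0 (true for IEEE 754). */
void zero_doubles(double *x, size_t n)
{
    assert(x != NULL);
    memset(x, 0, n * sizeof *x);
}

/* Return 1 if every element compares equal to 0.0, otherwise 0.
   Useful as a one-off sanity check of the assumption above. */
int all_zero(const double *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (x[i] != 0.0)
            return 0;
    }
    return 1;
}
```

With the arrays from option 5, the reset step would then be `zero_doubles(a, N); zero_doubles(b, N);`, with no `free`/`calloc` pair inside the loop.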

Finally, I saw a claim that *(a+i)=0 would be quicker than a[i]=0, but for readability the code above uses a[i] array indexing.
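For what it is worth, the C language defines `a[i]` as `*((a)+(i))`, so the two spellings should name the same element and a compiler is free to generate the same code for both. A small sketch to illustrate this (the function names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Zero an array using subscript notation. */
void zero_indexed(double *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;
}

/* Zero an array using explicit pointer arithmetic. */
void zero_pointer(double *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        *(a + i) = 0.0;
}

/* Assert that both spellings refer to the same address for every index. */
void check_same_element(double *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        assert(&a[i] == a + i);
}
```

Comparing the assembly emitted for `zero_indexed` and `zero_pointer` at the same optimization level would be one way to confirm this for a particular compiler.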

I also guess that the compiler, with optimization flags turned on, may do some of these things itself.

I would be really interested to hear which would be expected to be fastest and how much improvement might be obtained with each refinement.

tom
  • If your system uses IEEE754 format for floating point (which it most likely does), then just use `memset`. – dbush Feb 18 '21 at 23:03
  • The thing about `*(a+i)` being faster than `a[i]` is nonsense: to the compiler they are literally identical in meaning. – psmears Feb 18 '21 at 23:06
  • As _dbush_ says, use `memset`. This means option 1. It has the highest cache performance of any of the methods – Craig Estey Feb 18 '21 at 23:07
  • The fastest way to do something is not to do it at all: Design your algorithm so it is not necessary. A possible second fastest way is to incorporate it into something you are already doing. (In particular, if you are working with some thing X, and X must be set to zero before its next use, then set it to zero now, while it is still in cache, to avoid forcing a large `memset` to fetch things from memory.) A possible third fastest way is to use `bzero` if available or `memset` otherwise, as their authors worked on this more than you did. (Any common floating-point format uses all-0 bits for 0.) – Eric Postpischil Feb 18 '21 at 23:10
  • Some questions: Is it likely that many/most particles will get their values un-zeroed in each loop, or is it likely that only a few will in any given iteration? Do the particles interact much with each other, or are they largely independent? That is, when processing one particle, are you likely to need to read/modify values from another? Is there any sort of multi-threading going on (and if not, have you considered this)? – psmears Feb 18 '21 at 23:11
  • @psmears all particles will need to have a certain number of variables zeroed in each loop because none of these variables will remain zero during a loop, but they all need to start the next loop at zero – tom Feb 18 '21 at 23:39
  • @EricPostpischil - interesting - so would it be more efficient, if there is a calculation with `a` or `b`, to zero it immediately afterwards so it is ready for the next loop? – tom Feb 18 '21 at 23:40
  • @CraigEstey but in option 1 the variables that need to be zeroed are in structures with other variables that need to hold their values... Each particle may have about 30 variables, 20 of which need to be reset to zero and the rest left alone. – tom Feb 18 '21 at 23:43
  • Many thanks for all the helpful comments :-) - want to do an @allreplies... but not sure if that is possible – tom Feb 18 '21 at 23:45
  • @dbush - would there be any difference in speed between 4 and 5, where the arrays are part of a structure or just independent? – tom Feb 18 '21 at 23:47
  • @tom Doesn't matter. It's an array either way. – dbush Feb 18 '21 at 23:49
  • That would have been one of my questions (re. variables to maintain prior state vs. reset). There are _many_ ways to do this. Split the reset variables into a separate `struct` from the stateful parts. How much does a given "round" depend on a prior round? You might be able to do [with (e.g.) 8 CPUs] 16 rounds in parallel [in different threads]. Or, you could use (e.g.) `cuda` and spread things over a 1000 GPU cores. To determine this, we'd need to see representative code of what you're trying to do, overall, _not_ just the reset part. – Craig Estey Feb 18 '21 at 23:54
  • @CraigEstey - Ah interesting - all the reset variables could be lumped into a single structure so that there would be just one huge `memset` command to clear all of them down to zero. I take your point about seeing more of the code. Yes, I was wondering about trying to distribute over GPUs... I tried to keep the question focussed on this aspect of the code rather than give a full description of everything for several reasons, e.g. the exact strategy of the rest of the code is not fixed, and the number of calculations to do for each particle for each loop is not fixed yet – tom Feb 19 '21 at 01:17
  • I wouldn't blindly trust memset being the fastest way (or, say, fast enough to fully exhaust the theoretical memory bandwidth). I remember a while back examining the newer MSIL intrinsics for zeroing memory in .NET, and I was able to program something faster (but only for a fixed length and alignment). I did use SSE, but I don't remember if that was the deciding factor. Fully expending memory bandwidth single-threadedly on x86[-64] is more involved than you would think. – dialer Feb 19 '21 at 02:08
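
The suggestion in the comments to split the reset variables away from the stateful ones, so that one `memset` clears them all, might be sketched as follows. The field names (`x`, `v`, `a`, `b`), the struct names and the `ensemble_*` functions are all hypothetical, and the sketch again assumes IEEE 754 doubles so that all-zero bytes mean 0.0.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fields that must keep their values across iterations. */
struct particle_state {
    double x;
    double v;
};

/* Hypothetical fields that must start every iteration at zero. */
struct particle_scratch {
    double a;
    double b;
};

struct ensemble {
    size_t n;
    struct particle_state *state;     /* n persistent records */
    struct particle_scratch *scratch; /* n records cleared each iteration */
};

/* Allocate both arrays (zero-initialised); returns 0 on success, -1 on failure. */
int ensemble_init(struct ensemble *e, size_t n)
{
    e->n = n;
    e->state = calloc(n, sizeof *e->state);
    e->scratch = calloc(n, sizeof *e->scratch);
    if (e->state == NULL || e->scratch == NULL) {
        free(e->state);
        free(e->scratch);
        e->state = NULL;
        e->scratch = NULL;
        return -1;
    }
    return 0;
}

/* Clear every reset variable of every particle with one memset;
   the persistent state array is not touched. */
void ensemble_reset_scratch(struct ensemble *e)
{
    assert(e->scratch != NULL);
    memset(e->scratch, 0, e->n * sizeof *e->scratch);
}

/* Release both arrays. */
void ensemble_free(struct ensemble *e)
{
    free(e->state);
    free(e->scratch);
    e->state = NULL;
    e->scratch = NULL;
}
```

Whether this layout beats per-variable arrays would depend on how the rest of the loop accesses the data, so it is worth timing both on the real workload.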