
We are using Pivotal GemFire as a cache for our data. We recently migrated from GemFire 8.2.1 to 9.5.1 with exactly the same regions, data, and indexes, but index creation on one particular region, which has an entry count of 7,284,500, is taking far too long. The cache server is defined with Spring Data GemFire v2.4.1.RELEASE. Below is the configuration of the problematic region:

<gfe:replicated-region id="someRegion"
            shortcut="REPLICATE_PERSISTENT" concurrency-level="100"
            persistent="true" disk-synchronous="true" statistics="true">
            <gfe:eviction action="OVERFLOW_TO_DISK" type="ENTRY_COUNT"
                    threshold="1000"/>
</gfe:replicated-region>

Below are the index definitions:

<gfe:index id="someRegion_idx1" expression="o1.var1" from="/someRegion o1" />
<gfe:index id="someRegion_idx2" expression="o2.var2" from="/someRegion o2"/>
<gfe:index id="someRegion_idx3" expression="o3.var3" from="/someRegion o3"/>
<gfe:index id="someRegion_idx4" expression="o4.var4" from="/someRegion o4"/>
<gfe:index id="someRegion_idx5" expression="o5.var5" from="/someRegion o5"/>
<gfe:index id="someRegion_idx6" expression="o6.var6" from="/someRegion o6"/>
<gfe:index id="someRegion_idx7" expression="o7.var7" from="/someRegion o7"/>
<gfe:index id="someRegion_idx8" expression="o8.var8" from="/someRegion o8"/>

Below is the cache definition:

<gfe:cache
    properties-ref="gemfireProperties"
    close="true"
    critical-heap-percentage="85"
    eviction-heap-percentage="75"
    pdx-serializer-ref="pdxSerializer"
    pdx-persistent="true"
    pdx-read-serialized="true"
    pdx-ignore-unread-fields="false" />

Below are the Java parameters:

java -Xms50G -Xmx80G -XX:+UseConcMarkSweepGC \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSInitiatingOccupancyFraction=70 \
-XX:+ScavengeBeforeFullGC -XX:+CMSScavengeBeforeRemark \
-XX:+UseParNewGC -XX:+UseLargePages \
-XX:+DisableExplicitGC \
-Ddw.appname=$APPNAME \
-Dgemfire.Query.VERBOSE=true \
-Dgemfire.QueryService.allowUntrustedMethodInvocation=true \
-DDistributionManager.MAX_THREADS=20 \
-DDistributionManager.MAX_FE_THREADS=10 \
-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.port=11809 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Dconfig=/config/location/ \
com.my.package.cacheServer

When run without -XX:+ScavengeBeforeFullGC, -XX:+CMSScavengeBeforeRemark, and -XX:+DisableExplicitGC, we used to get the following error while the indexes were being applied:

org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests

We tried increasing the member-timeout property from 5000 to 300000 but the same issue persisted.

After adding the GC-related Java parameters above, each index takes around 24 minutes to apply, but this time without errors. As a result the server takes far too long to come up, together with around 15 other regions. None of the other regions show this issue (the region in question has the largest entry count; the others hold around 500K to 3M entries).

KCK

2 Answers

4

There are a few things I see from your configuration that need to be adjusted. For some of this I will need to speculate, as I do not know your general tenured heap consumption.

  1. Xmx must equal Xms. Set both to 80g, as growing the heap can cause major issues.
  2. Explicitly set your NewSize = MaxNewSize. If I could see GC logs I could help, but I'm going to give this configuration as a starting point.

Set NewSize and MaxNewSize to 9gb, SurvivorRatio to 1, and TargetSurvivorRatio to 85, and add the PrintTenuringDistribution flag to help us fine-tune.
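
As a rough sketch only, these recommendations might look like the following on your existing command line (the 9gb young generation is the starting point suggested above, to be tuned against GC logs, not a measured value):

# Heap sized per the recommendations above; the Scavenge* flags are dropped
# as suggested in point 3 below, while DisableExplicitGC is kept.
java -Xms80G -Xmx80G -XX:+UseConcMarkSweepGC \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSInitiatingOccupancyFraction=70 \
-XX:NewSize=9g -XX:MaxNewSize=9g \
-XX:SurvivorRatio=1 -XX:TargetSurvivorRatio=85 \
-XX:+PrintTenuringDistribution \
-XX:+UseParNewGC -XX:+UseLargePages \
-XX:+DisableExplicitGC \
-Ddw.appname=$APPNAME \
-Dgemfire.Query.VERBOSE=true \
-Dgemfire.QueryService.allowUntrustedMethodInvocation=true \
-DDistributionManager.MAX_THREADS=20 \
-DDistributionManager.MAX_FE_THREADS=10 \
-Dcom.sun.management.jmxremote \
-Dcom.sun.management.jmxremote.port=11809 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-Dconfig=/config/location/ \
com.my.package.cacheServer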

  3. I am not a fan of the Scavenge flags, as they cause even more thrashing when not finely tuned. For now you can keep them in, but I would remove ScavengeBeforeFullGC and CMSScavengeBeforeRemark and keep the DisableExplicitGC flag. More importantly, while I read that your behavior changes based upon using these flags, finding a correlation between index creation time and these flags is a stretch. What is more likely is that members are becoming unresponsive due to a bad heap configuration, so let's solve that.

  4. With respect to your eviction configuration, I see you say that you have 7+ million entries in this "problem" region, and yet you have an eviction algorithm where you overflow to disk all but the first 1000 entries. Why? Overflow to disk is something to use to handle bursts of activity, not as a given. Perhaps you are having disk issues that drive some aspects of this, or perhaps needing to access all of these entries on disk is the problem. Have you experienced this issue when all entries are actually in the heap?

  5. Enable GC logs with all the flags set to print GC details, datestamps, etc. (see the sketch after this list).

  6. If you do not yet have statistics enabled for GemFire, please enable those as well (also covered in the sketch below).

  7. If you are finding the member-timeout is insufficient, it is likely that you have issues in your environment. Those should be addressed rather than increasing the member-timeout to cover them up.
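
For points 5 and 6, here is a sketch of flags that could be appended to the command line above; the log and archive file paths are placeholders, and the gemfire.-prefixed system properties could equally be set in the gemfireProperties bean referenced by properties-ref:

# GC logging (Java 8 flags; the log path is a placeholder)
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
-Xloggc:/var/log/gemfire/cacheserver-gc.log
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=50M

# GemFire statistics sampling and archive, expressed as system properties
# for the statistic-sampling-enabled and statistic-archive-file properties
-Dgemfire.statistic-sampling-enabled=true
-Dgemfire.statistic-archive-file=/var/log/gemfire/cacheserver.gfs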

3

Regarding the index creation time - as David pointed out, you have configured this region to have almost all of the data on disk.

That will make index creation more expensive because the process of index creation has to read all of the entries from disk.

However, you can make your index creation much faster with this configuration if you use the define attribute on your indexes:

<gfe:index id="someRegion_idx3" expression="o3.var3" from="/someRegion o3" define="true"/>

This will cause all of your indexes to be created in one pass at the end of the initialization of your ApplicationContext. So hopefully your total time will be closer to 24 minutes because GemFire will only have to scan through all of your data on disk once.

See https://docs.spring.io/spring-gemfire/docs/current/reference/html/#_defining_indexes for more information on defining indexes.

This doesn't really explain your garbage collection issues - I would look at David's answer for more details there.

Dan Smith