Riemann becomes unresponsive when CPU count is increased #1032
Comments
If I understood you correctly: at 192 GB, Riemann used 100% of CPU; however, when you resized the box to 256 GB, Riemann had a slow startup and extreme CPU usage? This sounds like a GC issue. Try tuning the GC and look for … . If this happens to be a problem, try switching to ZGC. However, be aware it needs way more heap than ParallelGC.
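For reference, a minimal sketch of what that could look like on JDK 13; the jar and config paths are assumptions, not the reporter's actual command line:

```sh
# A sketch, assuming Riemann is started via `java -jar`; paths are made up.
# Unified GC logging (JDK 9+) shows whether the stalls line up with GC pauses:
java -Xlog:gc*:file=/var/log/riemann/gc.log:time,uptime,tags \
     -jar riemann.jar /etc/riemann/riemann.config

# ZGC is still experimental on JDK 13, so it must be unlocked explicitly,
# and it needs noticeably more heap headroom than ParallelGC:
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xmx96g \
     -jar riemann.jar /etc/riemann/riemann.config
```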
Some updates we tried afterwards:
We tried the ZGC algorithm; on the 96 GB machine, the heap quickly rose to 75 GB, but the CPU just got stuck at 100%. What I've noticed is that if we use ParallelGC (or don't set any GC algorithm), Riemann is at least usable, ranging between 70 and 100. I'll probably set the GC flags and see. What could be the reason for the CPU to sit at 100% even when I increase the CPU core count from 48 to 64, though? 🤔
On that note, if it really is GC (which is probably the reason), one more question that strikes me: why is the JVM limiting the heap size here despite having a large amount of headroom? 🤔
Unless you explicitly set a maximum heap size (e.g. via -Xmx or -XX:MaxRAMPercentage), the JVM caps the heap at a default fraction of the machine's physical RAM.
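A quick way to verify this (a stock JDK one-liner; nothing Riemann-specific assumed):

```sh
# Prints the effective MaxHeapSize the JVM computed for this machine
# (the version banner goes to stderr, hence the redirect):
java -XX:+PrintFlagsFinal -version 2>/dev/null | grep -i maxheapsize
```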
You mentioned in the first post that you are already monitoring the stream and netty queue sizes; try collecting JMX metrics from Riemann, or attach a tool like VisualVM. There you'll see how much time is spent in GC. I think ParallelGC will use N or N-1 threads (where N == number of cores) by default. To be sure, check it with `java -XX:+PrintFlagsFinal -version 2>&1 | grep ParallelGCThreads`. Since you already have the VM at 100% CPU usage for a while, I'd review the Riemann configuration and streaming rules first. Heavy use of …
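As a hedged sketch, JMX can be exposed so VisualVM can attach remotely, and the GC thread count pinned, like this; the port is an arbitrary choice and the insecure settings are for a trusted network only:

```sh
# Sketch only: opens an unauthenticated JMX port (9999 is arbitrary) and caps
# ParallelGC at 8 GC threads instead of the core-count default.
java -Dcom.sun.management.jmxremote \
     -Dcom.sun.management.jmxremote.port=9999 \
     -Dcom.sun.management.jmxremote.authenticate=false \
     -Dcom.sun.management.jmxremote.ssl=false \
     -XX:+UseParallelGC -XX:ParallelGCThreads=8 \
     -jar riemann.jar /etc/riemann/riemann.config
```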
Since the Riemann rules are maintained by several teams, and there are hundreds of services that rely on this stack, items like …
In that case, I'm not sure how this can be considered a bug on Riemann's side; it is hard to reproduce without the exact setup you are using.
I agree with @sanel here. One thing that could help is moving up in JDK versions and trying the newer GC options.
Updates:
JVM
Currently, we are on … . Initially, Riemann was on JVM 8 and used CMS.
However, on moving to JDK 13 (and JDK 17 after that), G1GC didn't seem to help: it was unstable, memory kept building up over a few days, and it kept crashing. ZGC also didn't seem to help. Parallel GC as a throughput collector was the most stable. (We would probably go back and see whether GC can be changed to a better option than stop-the-world.) However, from the stack traces, it didn't seem to be GC. Even switching to …

What is concerning is that, with the new JVM as well, the moment we upgrade to a bigger VM (m5.16xlarge [64 CPU/256 GB] or c5n.18xlarge [72 CPU/192 GB]), Riemann starts to become unresponsive and CPU stays at 100% with load touching 120~150. In the given screenshot, the VM was recreated to try the bigger VM size: the yellow and green parts of the graph are the smaller 48-CPU machines, and the blue region is when we tried the higher-configuration 72-CPU machine. So this issue of high CPU only exists on the big VM (72 or 64 CPU). Switching back to the smaller (48-CPU) machine brings everything back to normal and stable. The slight tapering towards the end of the blue region is when we tried to remove rules from the …

I am attaching the two thread dumps. It seems like most threads are parked.
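As a side note, thread dumps and GC samples like the ones mentioned above can be captured with stock JDK tools (a sketch; `<pid>` stands for Riemann's process id):

```sh
# Capture a thread dump (jcmd ships with the JDK):
jcmd <pid> Thread.print > riemann-threads-$(date +%s).txt

# Sample GC utilisation every second, 10 samples, to see whether GC or the
# streams are eating the CPU:
jstat -gcutil <pid> 1000 10
```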
Sadly, it is still hard to figure out what is happening without seeing the full … I'd try these things in the given order: …
I like the riemann-configuration-example from @mcorbin and find it easy to scale and debug. Try reorganizing your streams like this:

```clojure
(let [index (index)]
  (streams
    index
    influx
    cpu-check-stream
    mem-check-stream
    ;; disk-check-stream ;; disk stream is disabled
    custom-stream))
```

By simply commenting out/disabling specific streams, you can quickly see how Riemann behaves.
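Since the rules here are maintained by several teams, one possible extension of that layout (a sketch; the file names and `team-*-streams` vars are hypothetical) is to keep each team's rules in its own file and load them with Riemann's `include`:

```clojure
;; riemann.config -- sketch only; each included file is expected to (def ...)
;; the var named next to it. File names are made up for illustration.
(include "streams/team-a.clj") ; defines team-a-streams
(include "streams/team-b.clj") ; defines team-b-streams

(let [index (index)]
  (streams
    index
    influx
    team-a-streams
    team-b-streams))
```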
Describe the bug
Currently we are on a machine with 48 CPUs and 192 GB of RAM (AWS m5.12xlarge). We've had an almost 2x increase in the number of metrics in the system, so the VMs are usually at 100% CPU with a system load between 80 and 100; it's extremely loaded. This causes our Riemann instance to fail occasionally with a high stream and netty queue size.
Naturally, we thought of resizing our instance to a larger VM. We decided to go with a 64-CPU, 256 GB machine (m5.16xlarge).
However, the moment we resize the VM, Riemann becomes unresponsive within a few moments of starting up. We see that the `riemann executor stream-processor queue size` almost shoots up to 5~10k before no metrics are seen anymore. If we run top on the VM, the CPU is at around 6200% and the system load is close to 110.
The metrics are forwarded to an InfluxDB downstream (its CPU and memory are around 30%~40%).
Bringing the VM back to the original size solves it (as in, Riemann is responsive, but unstable at 100% because of the load).
We haven't updated Riemann, so it shouldn't be an update issue. We are using JDK 13.
On the Java side, we've configured the heap to 75% with the MaxRAM value, and the GC is +ParallelGC. (These settings had been working perfectly for us until this new load increase.)
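For clarity, here is roughly what that JVM configuration amounts to (a sketch of the flags as described, possibly via -XX:MaxRAMPercentage; not the exact command line used):

```sh
# 75% of machine RAM for the heap, ParallelGC as the collector:
java -XX:MaxRAMPercentage=75.0 -XX:+UseParallelGC \
     -jar riemann.jar /etc/riemann/riemann.config
```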
Expected behavior
Riemann works on the larger VM and handles the higher load.