TLDR: Below are multiple stories of how robots have failed, along with specific things to validate on your robotic system to minimize hidden failures due to resource issues. In most robotics teams, resource optimization is massively overlooked, leading to a huge number of headaches and significantly lower overall performance.
“Over half of the engineers we asked didn’t know what their CPU usage looks like, but expressed worries about it.”
In a previous blog we announced our set of tools for monitoring and logging robot resources.
I am going to walk through a set of learnings from the last 3 years, along with design patterns and best practices for high-quality, high-performance tuning of robotic software.
If a robot becomes unstable, it can lead to a huge number of failures, including the scariest (and one I have seen many times): the runaway robot.
These best practices were built over time through many experiences with trying to debug code both on the multiple robots we test on and on many of our customers’ robots - then realizing we didn’t know enough to actually succeed - and then either designing systems or setting in place design patterns to resolve these issues. Some of this information is ROS-centric, other parts are common sense, and much of it comes from having to aggregate thousands of hours of Linux and robotics logs.
We have seen everything from out-of-control robots driving up walls because processes failed, to unstable algorithms that only broke under specific connectivity or lighting conditions, to unknown bugs that were easily fixed once we could see the remotely logged data.
If you are building or using robots, these tricks and best practices can be a life saver. They have been for us!
A Few Examples of Failing Robots
Robot Always Died On Sundays
For about a month, we had a robotic system that never quite felt stable. It would work during the week, but we started to recognize a pattern: most Monday mornings, it had shut off by the time we came into the office. At that point, we didn’t have great resource debugging tools, so it took a while to identify the issue.
This robot has a Python process with steadily increasing memory consumption, indicative of a memory leak (see the yellow line going up over time). These kinds of subtle issues become easy to identify and narrow down to a particular process by reviewing graphs of resources over time.
It was caused by a very slow memory leak, which would cause a small ROS node to shut down only after being up for multiple days. This node was considered a “required” one, so it would shut down the whole ROS system and the robot would stop charging. Usually, detecting this would require introspecting the Python-based code with YAPPI or a different tool, but when we could step back and visualize resource usage over time, it became very clear (see picture) that there was linear growth in memory usage.
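If you want to capture that kind of graph without a full monitoring stack, below is a minimal sketch that periodically logs the resident memory of any process matching a name filter to a CSV you can graph later. The `my_node.py` filter, the output path, and the interval are placeholders, and it assumes `psutil` is installed.

```python
# Minimal sketch: log per-process resident memory (RSS) to a CSV for graphing.
# The "my_node.py" name filter and the output path are placeholders.
import csv
import time

import psutil


def log_memory(name_filter="my_node.py", path="memory_log.csv", interval_s=60):
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            now = time.time()
            for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
                cmdline = " ".join(proc.info["cmdline"] or [])
                if name_filter in cmdline:
                    rss_mb = proc.info["memory_info"].rss / 1e6
                    writer.writerow([now, proc.info["pid"], round(rss_mb, 1)])
            f.flush()
            time.sleep(interval_s)


if __name__ == "__main__":
    log_memory()
```

Left running for a few days, a steady upward slope in that log is exactly the linear growth we only spotted once we graphed it.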
Robot Driving Up Walls
Once, we had a robot that would seem to go haywire and would stop responding to driving commands during demos. It usually ended with the robot driving up a wall or spinning in place until the emergency stop could be activated.
Over time, we realized that there was a WiFi dead spot in a location we didn’t expect, combined with a latching low-level motor controller bug that only occurred when the disconnect happened during an active drive command.
Once we finally monitored network connectivity thoroughly and spotted the correlation, the bug was found and fixed, and the network quality was quickly improved.
Safety Note: We also recommend adding an IMU to all mobile robots, so that if they tilt too far in any direction (or fall over), they automatically do a safety stop. Also, while lidar is not a guarantee of safety, you should still have a simple velocity limiter so the robot will not drive into objects that appear at the lidar’s level.
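As an illustration of that IMU safety stop, here is a minimal rospy sketch. The `/imu/data` and `/cmd_vel` topic names and the 30-degree limit are assumptions to adapt to your robot, and a production system would latch the stop and sit between your planner and the motor driver rather than just publishing alongside it.

```python
#!/usr/bin/env python
# Sketch of an IMU-based tilt cutoff: if roll or pitch exceeds a threshold,
# publish a zero-velocity command. Topic names and the 30-degree limit are
# assumptions; a real system would latch the stop via a command mux.
import math

import rospy
from geometry_msgs.msg import Twist
from sensor_msgs.msg import Imu
from tf.transformations import euler_from_quaternion

MAX_TILT_RAD = math.radians(30.0)


class TiltStop(object):
    def __init__(self):
        self.cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
        rospy.Subscriber("/imu/data", Imu, self.on_imu)

    def on_imu(self, msg):
        q = msg.orientation
        roll, pitch, _ = euler_from_quaternion([q.x, q.y, q.z, q.w])
        if abs(roll) > MAX_TILT_RAD or abs(pitch) > MAX_TILT_RAD:
            rospy.logwarn_throttle(1.0, "Tilt limit exceeded - sending zero velocity")
            self.cmd_pub.publish(Twist())  # all-zero velocity command


if __name__ == "__main__":
    rospy.init_node("tilt_safety_stop")
    TiltStop()
    rospy.spin()
```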
Cameras Failing During Demos Only
For whatever reason, having stable cameras on a robot seems to be a recurring issue. Overall, it usually turns out that the camera is fine, but the secondary resources (drivers, USB Bandwidth, CPU for ROS nodes) end up creating stability issues.
In this case, there was an algorithm that had a varying computational time. When the lighting changed, it went way up. This was non-intuitive, so it took a while to actually start profiling process-level CPU usage over time and comparing it to stimuli.
When we were able to monitor and graph process CPU over time, and compare that to the times the cameras failed, the team was able to finally sort out what was going on.
Hardware Note: USB cameras, and USB in general, perform very poorly for robotic applications. Often it only takes a bump for the USB bus to reset, re-enumerating all the devices and shutting down all your drivers. While starting with USB is the current norm due to cost and simplicity, check out CSI cameras if you are using boards like the Jetson or RPi that support them, and use industrial-quality connectors for everything.
But… it’s hard to find reliable and easy to use resource monitoring tools
Using `top`, `dmesg` and grepping through logs is not scalable and only shows the current (or near-real-time) state of the system. In the previous examples, this led to significantly longer debugging times because the causal correlation of a problem over time was not clearly evident.
There are open-source packages that help you analyze system resources in great detail, but they are often specific to one setup, no longer maintained, hard to set up, or don’t stream the data as it accumulates - an essential feature when your robot shuts down. Most importantly, these numbers often require an understanding of the intricacies of the setup (is this CPU usage for one core?). Robotics engineers are already expected to be knowledgeable about a wide variety of subjects - let’s not put more weight on their shoulders.
Below are a set of best practices we have found from years of tuning robots, along with simple ways to check for stability (using both Freedom’s tools and standard Linux ones). I am going to use graphs from Freedom’s robotic resource monitoring feature for the examples here, but you can record your data any way you want.
Cloud server system and network monitoring has made a ton of sense for IT infrastructure - we just built it right for robotics.
When using the Robotic Resource Monitor, you can easily set up automatic, remote alerts when any of your resources hit too high a usage.
There are many lower-level tools that you can use separately from Freedom (we actually incorporate data from many of them in the resource monitor). Here are a few you should check out:
- Standard Linux monitoring tools - htop, df, dmesg, lsusb, nvidia-smi, tegrastats and others
- ROS tools like `rostopic hz`, `roswtf`, etc.
- Services such as Datadog, Splunk, New Relic
CPU, Temperature, Memory and Disk Usage
Because the system needs headroom to shift load between processes, CPU, memory and disk usage should never be near 100%.
Here, you can see how the green CPU usage drops suddenly and then spikes back up. If there are significant fluctuations in CPU usage, you should check to identify what processes are triggering them and if they can cause a perfect storm above 100% of the compute’s capabilities.
Suggestions on best practices:
- Total CPU < 80% - Processes will spike to 100% for short periods of time, so you need to keep your average CPU usage low enough to absorb these spikes and stay performant.
- Total Memory < 70% - Similar to CPU usage, all programs will need swap space and the ability to allocate additional chunks of memory.
- Total Disk < 75% - SSDs are so inexpensive now that you shouldn’t come close. If disk usage reaches 95%, you should shut down the robot’s ability to move or interact, because processes can start dying arbitrarily as file I/O fails.
- Process-level CPU < 60% of an individual core - Unless you have a specific algorithm that cannot be split, most architectures should allow decoupling into multiple ROS nodes or separate processes that interface cleanly with each other.
- Process-level memory < 25% - These days, RAM is cheap and you can re-architect algorithms to work in very small footprints. Most algorithms should be able to run in under 1 GB, leaving most of the RAM for other uses.
- CPU temperature normally < 60 °C - It can max out at 80 °C but should normally stay well below that. If it gets too hot, the CPU is throttled, and processes that previously fit in their time budget may no longer keep up. If the temperature rises over time, improve the cooling - especially for outdoor robots, where the shell can turn the robot into an oven: the interior can reach 60 °C or more even without the added heat from the CPU, so the robot can overheat easily. The same can happen with indoor robots that lack ventilation.
1. Create a resource “gatekeeper” dead-robot switch
If any of the max-usage best practices above fall out of bounds, disable motion and other safety-critical interactions on the robot and report back with an alert. If any of these core resources run out of headroom, overall performance can degrade and the number of system errors can skyrocket to the point where you can’t even report that something went wrong or stay in control of the system.
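A minimal sketch of such a gatekeeper is below, assuming `psutil` is available and that your motion stack honors a latched `/motion_enabled` flag; that topic name, the limits, and the check rate are all illustrative.

```python
# Minimal sketch of a resource "gatekeeper": if CPU, memory, or disk usage
# crosses the limits above, publish a latched motion-disable flag and log an
# error. The /motion_enabled topic and the exact limits are assumptions.
import psutil
import rospy
from std_msgs.msg import Bool

LIMITS = {"cpu": 80.0, "memory": 70.0, "disk": 75.0}


def resources_ok():
    cpu = psutil.cpu_percent(interval=1.0)
    mem = psutil.virtual_memory().percent
    disk = psutil.disk_usage("/").percent
    ok = cpu < LIMITS["cpu"] and mem < LIMITS["memory"] and disk < LIMITS["disk"]
    if not ok:
        rospy.logerr("Resource limit exceeded: cpu=%.0f%% mem=%.0f%% disk=%.0f%%",
                     cpu, mem, disk)
    return ok


if __name__ == "__main__":
    rospy.init_node("resource_gatekeeper")
    pub = rospy.Publisher("/motion_enabled", Bool, queue_size=1, latch=True)
    rate = rospy.Rate(0.2)  # check every 5 seconds
    while not rospy.is_shutdown():
        pub.publish(Bool(data=resources_ok()))
        rate.sleep()
```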
2. Find slow memory leaks with multi-day graphs
When you zoom out on your resource monitoring, you can look for ramps in memory usage over time to identify leaks, and correlate them back to the specific process that caused them by expanding that process's details. `top` is a great place to start, but it doesn’t graph things cleanly over time. You can expand any process taking more than 1% of CPU or RAM on your compute in the resource monitor tab of Freedom.
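If you are logging memory yourself (for example with the CSV sketch earlier), a simple line fit over many hours makes the ramps hard to miss. The 5 MB/hour growth threshold below is purely illustrative.

```python
# Sketch: fit a line to logged per-process memory samples and flag anything
# that grows steadily. Assumes a CSV of (timestamp, pid, rss_mb) rows, like
# the logger sketch earlier; the 5 MB/hour threshold is illustrative.
import csv
from collections import defaultdict

import numpy as np

GROWTH_LIMIT_MB_PER_HOUR = 5.0

samples = defaultdict(list)  # pid -> list of (timestamp, rss_mb)
with open("memory_log.csv") as f:
    for ts, pid, rss_mb in csv.reader(f):
        samples[pid].append((float(ts), float(rss_mb)))

for pid, points in samples.items():
    if len(points) < 10:
        continue  # not enough data for a meaningful fit
    hours = np.array([p[0] for p in points]) / 3600.0
    rss = np.array([p[1] for p in points])
    slope = np.polyfit(hours, rss, 1)[0]  # MB per hour
    if slope > GROWTH_LIMIT_MB_PER_HOUR:
        print("PID %s looks like a leak: +%.1f MB/hour" % (pid, slope))
```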
3. Check changing PIDs for process resets
If the PID of your process (ex: ROS Node) keeps changing over time, then it is restarting and that is usually caused by a crash/fault. Most times, this isn’t noticed as the process is automatically restarted, but the point where it failed usually hides a resource failure or code exception. In the resource monitor, you can see the exact time each process started and stopped.
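A tiny sketch of that check is below; the process name filter is a placeholder, and in practice you would feed the restart events into your alerting rather than just printing them.

```python
# Sketch: warn when a watched process restarts (its PID changes).
# The "my_node.py" filter is a placeholder for your own node or driver.
import time

import psutil


def find_pid(name_filter):
    for proc in psutil.process_iter(["pid", "cmdline"]):
        if name_filter in " ".join(proc.info["cmdline"] or []):
            return proc.info["pid"]
    return None


def watch(name_filter="my_node.py", interval_s=10):
    last_pid = find_pid(name_filter)
    while True:
        time.sleep(interval_s)
        pid = find_pid(name_filter)
        if pid != last_pid:
            print("%s: PID changed %s -> %s (restart or crash?)"
                  % (name_filter, last_pid, pid))
            last_pid = pid


if __name__ == "__main__":
    watch()
```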
4. Zoom out… a lot… and squint at the data
This may not seem scientific, but our brains are great pattern matchers. In the resource monitor, you can load multiple days of data (it may take a little while though!). This will allow you to start to see patterns - is RAM or CPU oscillating or rising over time and does it correlate with different nodes, processes, or connectivity changing?
We have caught background packages we had installed that spiked CPU usage heavily - but only once a day. Spikes like that can cause hidden crashes later on.
5. Upgrade your compute and offload processes
Consider upgrading your Raspberry Pi to an NVIDIA Jetson or your NUC to a higher-powered version. Many robots start out with the cheapest, weakest processing available. That can work great for a while, but when your average resource values start to top out, you will have periods of time where the compute is no longer performant because of bursts of usage you can’t see in the averages.
It’s also a good idea to offload critical processes to separate processors. If you only have a single-core compute, move to 4 or 8 cores.
GPU Usage and Temperature
The GPU is much like the CPU. The main difference is that it isn’t used for general computation, so you can predict its usage more accurately. This means you can push it almost to its limits - but you don’t want to push past them.
95% GPU usage is completely normal, but don’t let it hit 100% or it becomes resource constrained.
Suggestions on best practices:
- Usage < 95% - You can push the limits here more, as your code will be the only thing accessing the GPU.
- Temperature < 85 °C - If you go above this, the GPU will start to go into a safety shutdown mode.
6. Validate GPU never hits 100%
It is completely fine for GPU usage to be at 95%, but if it ever hits 100%, it means you are resource constrained. You can scan along the graph and verify that even though it is close, it does not max out.
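If you want a quick way to catch saturation between runs, a small sketch is below. It assumes a single NVIDIA GPU with `nvidia-smi` on the path; on a Jetson you would parse `tegrastats` output instead.

```python
# Sketch: sample GPU utilization and temperature by shelling out to nvidia-smi.
# Assumes a single NVIDIA GPU with nvidia-smi available; Jetson users would
# parse tegrastats output instead.
import subprocess
import time


def sample_gpu():
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=utilization.gpu,temperature.gpu",
        "--format=csv,noheader,nounits",
    ]).decode()
    util, temp = [int(x) for x in out.strip().split(",")]
    return util, temp


while True:
    util, temp = sample_gpu()
    if util >= 100 or temp >= 85:
        print("GPU limit hit: %d%% utilization at %d C" % (util, temp))
    time.sleep(5)
```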
There are multiple tools you can use for profiling TensorFlow algorithms and optimizing them. The simplest way to lower your usage is to just turn down the update rate. Do you really need to identify every coffee mug at 30 Hz, or can you do it at 5 Hz and interpolate?
7. Track temperature in the sun and shadows
Sometimes, GPUs overheat after 4-5 hours of usage. You should double check the heat sinks and internal temperature of the robot to make sure there is enough ventilation if the temperature continues to rise, especially in outdoor robots in the sun and indoor robots which lack airflow.
Usually, simple additions of fans and better heat sinks fix this issue.
8. Test in complex and changing environments
Most algorithms are not O(1) in computational load, but instead O(n) up to O(n^2) in the number of stimuli. Therefore, testing a robot while stationary or in a lab will produce significantly different (and usually simpler) results than running it in the more complex and chaotic real world.
We have seen robots fail because of something as simple as a light bulb with an IR signature that destroyed an algorithm’s ability to localize an object, creating a significant loop.
So, just go out and test - and record the results.
9. Benchmark against YOUR data sets, not a university’s
Once you have a case where you max out resources, you can replay camera and lidar inputs into the system and record benchmarks on GPU and CPU usage to know when the efficiency has changed. This may seem trivial, but most companies use open-source data sets which do not represent their real-world baseline.
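One lightweight way to do this, assuming your inputs are recorded as ROS bags, is to replay a bag from your own robot while sampling CPU. The bag filename below is a placeholder, and your perception stack is assumed to already be running alongside it.

```python
# Sketch: replay a bag recorded on your own robot and sample total CPU while
# your perception stack processes it. The bag filename is a placeholder, and
# your nodes are assumed to already be running.
import subprocess

import psutil

bag = subprocess.Popen(["rosbag", "play", "--clock", "my_robot_run.bag"])

samples = []
while bag.poll() is None:  # sample once per second until playback finishes
    samples.append(psutil.cpu_percent(interval=1.0))

if samples:
    print("CPU during replay: avg %.1f%%, peak %.1f%%"
          % (sum(samples) / len(samples), max(samples)))
```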
And, with all of these, review your usage and temperature over time after each run and across many hours.
Message Data Frequency, Size and Queue
This ROS topic’s update rate varies wildly when it should be constant, showing that something in the system is resource constrained and dropping data.
While many algorithms produce updates at 30-500 Hz, unstable update rates, gaps in data and network congestion can show that the system is resource limited.
10. Set constant message frequencies
Unless a message is an asynchronous log, classification event or other non-constant data, it should be produced at a very constant rate during normal operations. If the camera FPS, algorithm updates or other messages show up with wobbling graphs, you should check them out.
On the Freedom platform, in addition to seeing the data in the Robotic Resource Monitor, you can set Smart Alerts to track the minimum and maximum frequency of your messages, which will send an alert if the frequency goes out of bounds.
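You can also do a rough version of this check inside your own stack. The sketch below tracks a topic's rate in rospy and warns when it drifts; the topic, message type, and the 25-35 Hz window are assumptions.

```python
# Sketch: watch a topic's publish rate and warn when it drifts out of bounds.
# The /camera/image_raw topic and the 25-35 Hz window are assumptions.
import collections

import rospy
from sensor_msgs.msg import Image

MIN_HZ, MAX_HZ = 25.0, 35.0
stamps = collections.deque(maxlen=50)


def on_msg(_msg):
    stamps.append(rospy.get_time())
    if len(stamps) < stamps.maxlen:
        return  # wait for a full window of samples
    hz = (len(stamps) - 1) / (stamps[-1] - stamps[0])
    if not MIN_HZ <= hz <= MAX_HZ:
        rospy.logwarn_throttle(5.0, "camera rate %.1f Hz is out of bounds", hz)


rospy.init_node("rate_watchdog")
rospy.Subscriber("/camera/image_raw", Image, on_msg)
rospy.spin()
```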
11. Send the minimum number of messages
The network bandwidth and serialization/deserialization CPU cost for messages using Protobuf and ROS is actually pretty high when you are delivering them to multiple listeners. Some messages, like images, lidar or large matrix payloads, can add multiple percent to CPU usage if their frequency doubles. So rate-limit the creation of intra-machine data. You can optimize bandwidth in your device settings on the Freedom platform.
12. Only send messages if someone listens
If there is no subscriber, don’t clutter your internal message bandwidth with unused data. Use Publisher.getNumSubscribers(). This may seem trivial, but most people don’t use it.
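Here is a minimal rospy version of that check, where the equivalent call is `get_num_connections()` (`getNumSubscribers()` is the roscpp name). The debug overlay publisher is a made-up example.

```python
# Sketch: skip expensive work entirely when nobody is subscribed. In rospy the
# check is get_num_connections(); getNumSubscribers() is the roscpp equivalent.
# The /debug/overlay publisher is a made-up example.
import rospy
from sensor_msgs.msg import Image

rospy.init_node("debug_image_node")
pub = rospy.Publisher("/debug/overlay", Image, queue_size=1)


def publish_debug_overlay(_event):
    if pub.get_num_connections() == 0:
        return  # nobody is listening: skip rendering and serialization
    msg = Image()  # a real node would render the debug overlay here
    pub.publish(msg)


rospy.Timer(rospy.Duration(0.2), publish_debug_overlay)  # ~5 Hz
rospy.spin()
```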
13. Decimate image and point cloud sizes or frequencies
Most algorithms do not need 60 Hz 4K images or 20,000 points to achieve a stable result. Those messages can take up 99% of your bandwidth if you are not careful, so using a minimal frequency and down-sampling images and other payloads can really help stabilize other parts of the system.
By ruthlessly analyzing what your downstream algorithms need, you can usually cut your bandwidth at least in half - and with open-source packages that were created in a lab, often cut it 10x - without a significant decrease in the quality of their output.
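As one concrete example, here is a sketch that throttles and downsamples an image stream before republishing it for downstream consumers. It assumes `cv_bridge` and OpenCV are installed, and the topic names, the ~5 Hz rate, and the half-resolution target are illustrative.

```python
# Sketch: throttle an image stream to ~5 Hz and downsample it to half
# resolution before republishing. Topic names, the rate, and the scale factor
# are illustrative; assumes cv_bridge and OpenCV are installed.
import cv2
import rospy
from cv_bridge import CvBridge
from sensor_msgs.msg import Image

bridge = CvBridge()
last_pub = 0.0
MIN_PERIOD = 0.2  # seconds between output frames (~5 Hz)


def on_image(msg):
    global last_pub
    now = rospy.get_time()
    if now - last_pub < MIN_PERIOD:
        return  # drop frames to cap the output rate
    last_pub = now
    img = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
    small = cv2.resize(img, (0, 0), fx=0.5, fy=0.5)  # a quarter of the pixels
    pub.publish(bridge.cv2_to_imgmsg(small, encoding="bgr8"))


rospy.init_node("image_decimator")
pub = rospy.Publisher("/camera/image_small", Image, queue_size=1)
rospy.Subscriber("/camera/image_raw", Image, on_image)
rospy.spin()
```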
14. Again, zoom out to see messages over time
When you zoom out, you can see aberrations, like changes in number of topics, messages and bandwidth. When you check these out, you will see the places things go wrong. Usually, each part of the system should perform in roughly the same manner. If your messages are changing wildly or spiking, there is usually a cause.
One time, we caught that the sun coming up tripled the bandwidth needed for a camera that was normally static.
Network Bandwidth
One of the most important elements for autonomous robots is their connectivity to a system for controlling them at a high level. If either local or internet connectivity goes out, is impeded or drops information, it can create an unsafe and uncertain computational environment.
This robot has significantly varying network usage when an operator takes over. Given a good connection, this may be acceptable; however, the max bandwidth should be profiled to validate that it is sufficient, or bandwidth should be lowered overall. You can also see short spikes in bandwidth usage due to unstable network speed, even on a wired or high-quality network.
15. Only upload the sensor data you need to review
You don’t need to record everything to the cloud. Change the bandwidth settings for a device to upload just 1 Hz of data from a 50 Hz stream.
This allows you to compute data at 30 Hz if you need to, but only upload the much slower stream necessary for normal debugging.
16. Use the worst quality “safe” videos possible for piloting
While a 1 Mbps video stream looks great, an operator usually does not need that high a resolution, and it can swamp most cellular connections (and really increase their cost). You can adjust this in your device’s bandwidth settings.
17. Check users’ browser/app/db connections thoroughly
It isn’t just the connectivity of the robot that matters. If the operator has a connection with low bandwidth, it could have a loop-back effect where packets are not getting delivered.
By running a speed test on your users’ systems, and checking that they can actually stream the data you need, you can find edge cases. Especially when a user is in a corn field or on a construction site, the high-quality stream you had in the office may not work as well.
18. Don’t trust the network. Recheck everything each day.
Connection speeds change. Even if you could upload everything on a WiFi or cellular network in the lab, run tests in your customer’s location, or drive around the neighborhood you are going to be in and verify that you don’t see network lags.
We have seen fleets of delivery robots working successfully one day and the next day, having major connectivity issues. The underlying infrastructure of cellular and fixed networks changes significantly with time of day, day of week and many other things.
Bonus 19. Add a dead-robot connectivity switch
If you have a mobile robot, data will drop. Design safety algorithms to account for this, such as simple dead-robot switches which stop all mobility as a “gatekeeper” for motor drivers if any resource usage is out of bounds, or if a network connection’s signal strength, speed or connectivity dips or errors occur when sending data.
Our latest Freedom Agent includes a connectivity switch if you are using Freedom Pilot. If it disconnects, it will send in a 0-velocity command no matter what, so the robot doesn’t drive away.
The graph above shows the overall network traffic and, below it, the component from WebRTC teleoperation. When it is turned on, it uses 3x more bandwidth than the rest of the robot combined. By tuning your overall bandwidth usage to leave room for these necessary bursts, you can make sure services are not starved. If this is not possible, then turning down the teleoperation bandwidth is necessary.
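If you are not using Freedom Pilot, a minimal sketch of such a connectivity switch looks like the following. The `/teleop/heartbeat` topic, its message type, and the 0.5 s timeout are assumptions, and a real implementation would sit between your teleop source and the motor driver.

```python
# Sketch of a connectivity dead-man switch: if no teleop heartbeat arrives
# within a timeout, publish a zero-velocity command. The /teleop/heartbeat
# topic, its Header message type, and the 0.5 s timeout are assumptions.
import rospy
from geometry_msgs.msg import Twist
from std_msgs.msg import Header

TIMEOUT_S = 0.5
last_heartbeat = None


def on_heartbeat(_msg):
    global last_heartbeat
    last_heartbeat = rospy.get_time()


def check(_event):
    if last_heartbeat is None or rospy.get_time() - last_heartbeat > TIMEOUT_S:
        cmd_pub.publish(Twist())  # all-zero velocity until the link recovers


rospy.init_node("connectivity_deadman")
cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
rospy.Subscriber("/teleop/heartbeat", Header, on_heartbeat)
rospy.Timer(rospy.Duration(0.1), check)
rospy.spin()
```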
When can I use this?
Resource optimization for robots is live and ready to work with your robot right now. It’s a one-line install on any Linux system and takes less than 5 minutes to set up and try. We’re eager to see what the community does with it and would love to hear how it’s being used.
If you want to try this now, you can quickly sign up for free here and get your first robot going.