The intricate dance of policies, software, and hardware in robotics often hides a fundamental truth: even the most elegant AI algorithms can be crippled by low-level system inefficiencies. This was the core message delivered by Rishabh Garg, a Robotics Engineer at Tesla Optimus, during his insightful presentation at the AI Engineer World's Fair. Garg meticulously detailed the often-overlooked challenges in achieving robust, high-performance robotics, emphasizing that unexplained robot behavior frequently stems not from the control policy itself, but from the underlying software and electrical systems.
Garg began by illustrating a seemingly straightforward robotics system: sensors feeding data to a CPU/GPU, which processes it through a control policy and sends commands to actuators over a CAN bus. The expectation for a simple loop (receive data, apply policy, send output) is a consistent, predictable cycle time, perhaps two milliseconds per iteration. Yet reality often deviates sharply: when deployed, such a loop exhibited unexpected gaps, introducing delays that compromised real-time performance.
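To make the shape of this loop concrete, here is a minimal sketch in C++. The types and the `read_sensors` / `policy_step` / `send_can_commands` helpers are hypothetical stand-ins rather than any real Optimus code; the point is the structure, a strictly sequential RX, compute, TX cycle on a fixed 2 ms schedule.

```cpp
#include <chrono>
#include <thread>

struct SensorData {};        // stand-in for joint encoder / IMU readings
struct ActuatorCommand {};   // stand-in for per-motor commands

// Hypothetical stubs for the real sensor, policy, and CAN interfaces.
SensorData read_sensors() { return {}; }                       // blocking RX over CAN
ActuatorCommand policy_step(const SensorData&) { return {}; }  // run the control policy
void send_can_commands(const ActuatorCommand&) {}              // blocking TX over CAN

int main() {
    using namespace std::chrono;
    constexpr auto kPeriod = milliseconds(2);      // target 2 ms cycle
    auto next_wakeup = steady_clock::now();

    while (true) {
        SensorData data = read_sensors();          // RX phase
        ActuatorCommand cmd = policy_step(data);   // policy phase
        send_can_commands(cmd);                    // TX phase

        next_wakeup += kPeriod;                    // fixed-rate schedule, no drift
        std::this_thread::sleep_until(next_wakeup);
    }
}
```

Because RX, compute, and TX run back-to-back on one thread, any time spent on the bus is time stolen directly from the 2 ms budget.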
The culprit, Garg revealed, was the communication protocol itself. A seemingly small detail, a roughly 100-bit message on a 1 Megabit per second CAN bus, accumulates significant transmission time: each frame occupies about 100 microseconds on the wire. As Garg calculated, sending and receiving just ten messages therefore consumes a full millisecond, half of the expected two-millisecond loop cycle. This demonstrated a crucial insight: even seemingly minor data-transmission overheads can drastically impact real-time system performance, revealing the critical need for meticulous system-level understanding beyond just algorithmic design.
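The arithmetic is easy to verify. A compile-time sanity check of the figures quoted in the talk (the frame size and bus rate are the talk's round numbers, not measurements):

```cpp
#include <cstdint>

// A ~100-bit frame on a 1 Mbit/s classic CAN bus occupies ~100 us of bus time.
constexpr std::int64_t kBitsPerFrame  = 100;
constexpr std::int64_t kBusBitsPerSec = 1'000'000;
constexpr std::int64_t kFrameTimeUs   = kBitsPerFrame * 1'000'000 / kBusBitsPerSec;

static_assert(kFrameTimeUs == 100, "one frame = 100 us of bus time");
static_assert(10 * kFrameTimeUs == 1000, "ten frames = a full millisecond");

int main() {}  // nothing to run; the checks above fail the build if the math is off
```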
To mitigate this, a common solution involves multithreading and pipelining, separating communication tasks from policy execution. This "new design" aims to parallelize operations, allowing data reception for the next policy iteration to begin while the current policy is still processing. While this improves cycle time, it introduces new, insidious issues. Garg noted that the system began "stuttering intermittently," with motors "ramping up abnormally, trying to catch up." This perplexing behavior, initially appearing as a policy flaw, was actually a symptom of desynchronization.
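A minimal sketch of such a pipelined design, again with hypothetical stand-in types and stubs: a comms thread keeps a shared "mailbox" filled with the newest sample while the policy thread consumes whatever is there.

```cpp
#include <mutex>
#include <thread>

struct SensorData {};
struct ActuatorCommand {};

// Hypothetical stubs for the real CAN and policy interfaces.
SensorData read_sensors() { return {}; }
ActuatorCommand policy_step(const SensorData&) { return {}; }
void send_can_commands(const ActuatorCommand&) {}

// Shared mailbox: the comms thread writes the newest sample,
// the policy thread reads whatever is currently there.
std::mutex mtx;
SensorData latest;

void comms_loop() {
    while (true) {
        SensorData s = read_sensors();           // overlaps with policy execution
        std::lock_guard<std::mutex> lk(mtx);
        latest = s;                              // publish the newest sample
    }
}

void policy_loop() {
    while (true) {
        SensorData s;
        {
            std::lock_guard<std::mutex> lk(mtx);
            s = latest;                          // may silently reuse a stale sample
        }
        send_can_commands(policy_step(s));       // TX proceeds regardless of freshness
    }
}

int main() {
    std::thread comms(comms_loop), policy(policy_loop);
    comms.join();
    policy.join();
}
```

Note that nothing in this design forces the two threads into lockstep: the policy thread can read the same sample twice or skip one entirely, which is precisely the desynchronization described next.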
Further investigation through cycle-time plots, which graph the time since the last message against the message index, unveiled the true problem. Instead of a consistent two-millisecond interval, some messages arrived back-to-back while others were significantly delayed. This "jitter" was traced to desynchronization in both the transmission (TX) and reception (RX) phases. When the policy took longer than expected, a transmission phase could be missed, queuing data; the next iteration would then send two messages consecutively, producing erratic actuator commands. Similarly, a missed reception phase could force the policy to reuse stale data, leading to overcompensation and the observed "catching up" behavior. Resolving this requires kernel primitives like condition variables or semaphores for precise synchronization, or, in simpler systems, adding "padding" (intentional delays) to create a buffer against timing variations.
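A minimal sketch of the condition-variable handshake, reusing the same stand-in `SensorData` type: the policy thread blocks until the comms thread flags a fresh sample, so stale data can never be consumed silently.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

struct SensorData {};

std::mutex mtx;
std::condition_variable cv;
SensorData latest;
bool fresh = false;   // set by the comms thread, cleared by the policy thread

// Comms side: publish a sample and wake the policy thread.
void publish(const SensorData& s) {
    {
        std::lock_guard<std::mutex> lk(mtx);
        latest = s;
        fresh = true;
    }
    cv.notify_one();
}

// Policy side: block until a *new* sample arrives, so stale data is never reused.
SensorData wait_for_fresh_sample() {
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, [] { return fresh; });   // sleeps through spurious wakeups
    fresh = false;                       // consume the sample exactly once
    return latest;
}

int main() {
    std::thread producer([] { publish(SensorData{}); });
    SensorData s = wait_for_fresh_sample();   // returns only after publish() ran
    producer.join();
    (void)s;
}
```

The same handshake in reverse (the comms thread waiting on a "command ready" flag) keeps the TX phase from ever sending two messages back-to-back.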
Another often-overlooked pitfall is logging. While seemingly benign for debugging, excessive or blocking logging can be catastrophic: Garg pointed out that syncing logs to disk can take up to 30 milliseconds on a Raspberry Pi with an SD card. In a naive, blocking logging setup, a dropped message triggers a log write, which takes so long that the next message is also dropped, producing a cascade and a "complete blackout on the CAN bus." The remedy is asynchronous logging, handled on a separate CPU core, so the real-time loop is never starved (a minimal sketch follows below). Finally, priority inversion, where high-priority user-space processes inadvertently block the lower-priority kernel mechanisms responsible for fetching data, can further exacerbate message drops.
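Here is a minimal sketch of such an asynchronous logger, not any particular production implementation: the real-time loop enqueues and returns immediately, dropping messages if the bounded queue is full, while a background thread (ideally pinned to a spare core) performs the slow writes.

```cpp
#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>

// Bounded queue between the control loop and a background writer thread.
// The hot path only enqueues; slow disk I/O happens on the writer thread.
// When the queue is full, the message is dropped rather than letting a
// 30 ms disk sync stall the CAN loop.
class AsyncLogger {
public:
    bool log(std::string msg) {            // called from the real-time loop
        std::lock_guard<std::mutex> lk(mtx_);
        if (queue_.size() >= kMaxQueued) return false;  // drop, never block
        queue_.push_back(std::move(msg));
        cv_.notify_one();
        return true;
    }

    void stop() {                          // request shutdown after draining
        {
            std::lock_guard<std::mutex> lk(mtx_);
            stopping_ = true;
        }
        cv_.notify_one();
    }

    void run() {                           // body of the background writer thread
        for (;;) {
            std::unique_lock<std::mutex> lk(mtx_);
            cv_.wait(lk, [this] { return stopping_ || !queue_.empty(); });
            if (queue_.empty()) return;    // stopped and fully drained
            std::string msg = std::move(queue_.front());
            queue_.pop_front();
            lk.unlock();
            std::clog << msg << '\n';      // the slow part, off the hot path
        }
    }

private:
    static constexpr std::size_t kMaxQueued = 1024;
    std::mutex mtx_;
    std::condition_variable cv_;
    std::deque<std::string> queue_;
    bool stopping_ = false;
};

int main() {
    AsyncLogger logger;
    std::thread writer(&AsyncLogger::run, &logger);
    logger.log("CAN RX gap detected");     // returns immediately, never blocks
    logger.stop();
    writer.join();
}
```

A production version would typically replace the mutex-guarded deque with a lock-free single-producer ring buffer so the hot path never takes a lock at all, but the structure is the same: the control loop hands off the work and keeps moving.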

