robot [ roh-bot ] (noun)

1. A read/write API to physical reality

2. A source of training tokens about The Self

Two years ago I left Google Robotics and wrote about the various career options that I considered before joining 1X Technologies (née Halodi Robotics) to lead their AI efforts. A lot has happened in the AI space since then, so I’m reflecting on where I think the field is headed as we all continue down our “Roads to Rome”.

How’s 1X Going?

Just about as startup-y as one can get: busy, fast-paced, many hats. I recruited a great team, introduced the company to some Bay Area investors, established the 1X California office, trained a lot of neural networks, and deployed EVEs at customer sites for guarding applications. We are still in the early days of our mission to create abundant physical labor, but I wanted to share two things I’m really proud of the 1X AI team for accomplishing:

  1. Because we take an end-to-end neural network approach to autonomy, our capability scaling is no longer constrained by how fast we can write code. All of the capabilities in this video involved no coding; they were learned from data that our Android Operations team collected and trained on.

  2. 1X is the first robotics company (to my knowledge) to have our data collectors train the capabilities themselves. This really decreases the time-to-a-good-model, because the people collecting data can get very fast feedback on how good their data is and how much data they actually need to solve the robotic task. I predict this will become a widespread paradigm in how robot data is collected in the future.

We’re now embarking on a much more ambitious scale-up of our models at 1X, and in doing so I see a lot of parallels to the kinds of challenges that LLM teams work on.

All AI Software Converges to Robotics Software

I credit Nikolaus West for originating this idea, which I will elaborate on here with my own interpretation.

When I speak to AI researchers and engineers who are contemplating their career moves, robotics is often not at the top of their list. I am guessing that many technologists look to past returns of robotics businesses (which are bad) and extrapolate them to future returns. There are many shiny, easier-to-fund areas to apply the ML skillset to today: chatbots, generative AI, assistive agents, disrupting search, AI tutors, coding copilots, advancing scientific progress, re-inventing computing interfaces, etc.

ML deployed in a pure software environment is easier because the world of bits is predictable. You can move some bits from A to B and trust that they show up at their destination with perfect integrity. You can make an API call to some server over the Internet and assume that it will just work. Even if it fails, the set of failure modes is known ahead of time, so you can handle all of them.

In robotics, all of the information outside of the robot is unknown. Your future sensor observations, given your actions, are unknown. You also don’t know where you are, where anything else is, what will happen if you make contact with something, whether the light turned on after you flipped the switch, or whether you even flipped the switch at all. Even a seemingly trivial task like telling the difference between riding an elevator down vs. being hoisted up in a gantry is hard, because the forces experienced by the inertial measurement unit (IMU) sensor look similar in both scenarios. A little bit of ignorance propagates very quickly, and soon your robot ends up on the floor having a seizure because it thinks that it still has a chance at maintaining balance.
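To make the IMU ambiguity concrete, here is a toy sketch (hypothetical numbers, not 1X code). An ideal accelerometer measures specific force, f = a - g, so it is blind to velocity: any two motions with the same acceleration produce identical readings, no matter which direction the robot is actually moving.

```python
import numpy as np

G = np.array([0.0, 0.0, -9.81])  # gravity in the world frame (z points up)

def imu_specific_force(accel):
    """An ideal accelerometer measures specific force f = a - g.
    Only acceleration matters; velocity is invisible to the sensor."""
    return np.array(accel) - G

# Riding an elevator DOWN at constant speed: zero acceleration.
elevator_down = imu_specific_force([0.0, 0.0, 0.0])

# Being hoisted UP in a gantry at constant speed: also zero acceleration.
gantry_up = imu_specific_force([0.0, 0.0, 0.0])

# The two readings are identical, so the IMU alone cannot distinguish them.
assert np.allclose(elevator_down, gantry_up)
```

Disambiguating the two requires fusing other information (vision, kinematics, context), which is exactly where a little bit of ignorance starts to compound.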

As our AI software systems start to touch the real world, like doing customer support or ordering your Uber for you, they will run into many of the same engineering challenges that robotics faces today; the longer a program interacts with a source of entropy, the less formal guarantees we can make about the correctness of our program’s behavior. Even if you are not building a physical robot, your codebase ends up looking a lot like a modern robotics software stack. I spend an unreasonable amount of my time implementing more scalable data loaders and logging infrastructure, and making sure that when I log data, I can re-order all of them into a temporally causal sequence for a transformer. Sound familiar?
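The “temporally causal sequence” step above can be sketched as a k-way merge over per-sensor logs. This is a minimal illustration with made-up stream names and timestamps, not the actual 1X pipeline: each subsystem logs its own locally-ordered stream, and a lazy merge by timestamp produces one causal sequence ready to be tokenized for a transformer.

```python
import heapq

# Hypothetical log streams from different subsystems. Each is locally
# sorted by timestamp but they are interleaved arbitrarily on disk.
camera_log = [(0.00, "camera", "frame_0"), (0.10, "camera", "frame_1")]
imu_log    = [(0.02, "imu", "accel_0"), (0.04, "imu", "accel_1")]
action_log = [(0.03, "action", "grip_open")]

# heapq.merge lazily merges the already-sorted streams into a single
# temporally causal sequence without loading everything into memory.
causal_sequence = list(
    heapq.merge(camera_log, imu_log, action_log, key=lambda event: event[0])
)
```

The key property is that `heapq.merge` only requires each input stream to be sorted, which matches how real loggers write data: per-source order is cheap, global order is not.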

Category: Logging and Search
How to store, index, and query large amounts of autonomously collected data?
  • Robotics challenges: Efficient i.i.d. sampling of video sequences to feed a large number of GPUs is tricky. There are too many tokens; how can we extract fewer tokens from large amounts of video?
  • LLM / GenAI challenges: The same storage, indexing, and I/O problems arise when training video generation models. There are not enough tokens; where can we get more?

Category: Calibrated confidence
How do you know if the model is able to deal with a given situation correctly?
  • Robotics challenges: How do you know if the robot will perform the task?
  • LLM / GenAI challenges: How does an LLM know if it is able to factually respond to a question?

Category: Simulation and search
Can we know the (potentially dangerous) consequences of an action before we actually take it?
  • Robotics challenges: Simulations lack enough fidelity to accurately model many real-world phenomena, and learning world models over raw sensor data (e.g. images) is hard.
  • LLM / GenAI challenges: LLMs cannot inductively or deductively reason well enough that we can just throw compute at the problem and reason our way to all answers, the way we might for AlphaGo.

Category: Self-improvement
How to self-improve from interactions in the real world?
  • Robotics challenges: Building a data engine.
  • LLM / GenAI challenges: Because evaluation is nebulous, so goes optimization.
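As one concrete illustration of the i.i.d.-sampling problem on the robotics side, here is a toy sampler (hypothetical, not any company’s actual data loader) that draws fixed-length training windows uniformly over all valid positions across variable-length episodes, so that the batch distribution is not biased by episode boundaries:

```python
import random

def sample_window(episode_lengths, window=8, rng=random):
    """Draw one (episode, start) pair uniformly over ALL valid window
    positions in the dataset, so long episodes contribute proportionally
    more samples and minibatches are closer to i.i.d."""
    # Number of valid start positions in each episode.
    starts = [max(0, n - window + 1) for n in episode_lengths]
    total = sum(starts)
    r = rng.randrange(total)
    for episode, n in enumerate(starts):
        if r < n:
            return episode, r  # episode index, start frame
        r -= n

# e.g. three episodes of 100, 20, and 8 frames respectively
episode, start = sample_window([100, 20, 8], window=8)
```

Naively sampling an episode first and then a window inside it would over-represent short episodes; weighting by valid start positions avoids that. Real data loaders layer sharding, prefetch, and decode on top of this, which is where the I/O pain lives.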

All of these problems are tough, but solvable. Even though most AI companies and labs won’t ever have to think about actuator hardware, electromagnetic interference, or the safety implications of fast-moving limbs, the robotics + research engineering skill set will be integral to the future of all software, not just the software for controlling robots.

If you accept the premise that the engineering and infrastructure problems in LLMs are the same as those in robotics, then we should expect that disembodied AGI and robotic AGI happen at roughly the same time. The hardware is ready and all of the pieces are already there in the form of research papers published over the last 10 years.

More Scattered Thoughts

  • A lot of AI researchers still think that general-purpose robotics is decades away. But remember that ChatGPT happened seemingly overnight. I think this is going to happen for robotics as well. Once this happens, computing itself will be completely transformed. You can think of the entire world of atoms as the memory of a very large computer, and general-purpose humanoid robots become a read/write API to physical reality. How cool would it be if any kid with a laptop could replant a forest, or build a factory, or clean up all the trash in San Francisco in a single evening?
  • There are roughly three strategies to get widespread distribution of robots. The first is a software-only approach: you build an “almighty brain” for controlling robots, and every robotics hardware vendor comes to you, begging for access to the brain API. The upside of this approach is that if you can build a model that no one else can, then you get fat software margins and everyone gives you their data; OpenAI’s GPT4 is perhaps the best example of this. The downside is that your hardware partner probably doesn’t want to give you their data, their customers don’t want to give it up either, and the whole communication pipeline moves slowly. The second approach is to start with a narrow domain, vertically integrating hardware and software, and expand from there: think autonomous lawnmowers, forklifts, and robot arms in workcells picking packages. The upside is that this is how most robotics companies provide value today; the downside is that they never seem to break out of their niche and go fully general-purpose. The last approach is general-purpose hardware and general-purpose software for general-purpose use cases. The downside is that no one has ever solved this, but the upside is that the TAM is infinite. That’s the approach that companies like 1X, Figure, and Tesla are taking.
  • Big LLM companies (OpenAI, Anthropic, Google) spend a lot of compute training a large model once (e.g. GPT4-base), and then post-train it to do other things, like be an assistant or understand image tokens. As base models get exponentially more expensive to train, all researchers, no matter what institution they are at, will face the same engineering constraint: there are only enough resources to train the biggest model once. All post-training capabilities need to be derived from that base model, and because it’s hard to anticipate what the downstream tasks will look like, you must prepare the base model for all possible tasks. In other words, your foundation model’s training objective should be the full generative model of the data, such as an autoregressive next-token predictor (e.g. GPT), a diffusion process (e.g. a video generative model like Sora), or some combination of the two. If you spend your base-model budget on a conditional density modeling problem, e.g. “predict all the robot actions from the video”, it might not be a good base model for many tasks that you might care about later. This only becomes more true as the cost of the base model grows.
  • Despite the AI gold rush that we are now in, it is still very non-obvious how to turn 10M USD worth of GPU-hours into 10M+ USD in incremental margin (besides something like mining crypto). This is one of the main questions I’m working on now. Any startup that raised 10-100M USD to train their own big neural network from scratch in the last 2 years ended up paying an enormous capex cost for something that basically every AI startup gets for free today. I do not mean to imply that scaling up in a bold bet to train an AGI is a bad idea; I just think that the companies best positioned to do this are the players with the lowest cost-of-compute (in the same way that Berkshire Hathaway has a negative cost-of-capital when investing insurance float). If you are a startup scaling up a model in a high-cost-of-capital environment, you had better be disciplined about your scaling laws and metrics as they pertain to capability (see my point above). Many startups look to how Google turned billions in R&D into many multiples of that via RankBrain, but they forget that this required building the Google Search business first. As such, I think the vast majority of successful startups will be the ones that can nimbly ride the tide of open-source weights.
  • I predict a lot of departures in the coming months from the current generation of autonomous vehicle companies. Simultaneously, there is no better time to start a brand new AV company than right now.
  • A lot of HN commenters were skeptical about the FAANG compensation numbers I put in my blog post two years ago. Since ChatGPT and the OpenAI-GDM-Anthropic talent wars, the numbers have only gotten crazier. I’ve spoken to PhD students who ask for seven-figure salaries. This makes me think back to 2016, when John Schulman making $275k at OpenAI felt like a lot to me.
  • Outside of my day job, I wrote a book, angel invested in some startups, and joined Tortus in a part-time advisory capacity as their Chief Science Advisor. Tortus makes co-pilot software that helps automate back-office workflows for clinicians, like summarizing consultations. The other day, I heard a testimonial from a doctor who said he now has free evenings and can take a longer lunch break because he no longer has to spend that time typing up letters and SOAP notes. Needless to say, AI technology has made him more productive and given him time back. We’ve charted a pretty exciting Road to Rome that is quite different from the approach I take at 1X.
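The earlier point about base-model objectives (full generative model vs. conditional density model) can be illustrated with toy per-token losses. This is a schematic sketch with made-up numbers, not a real training loop: by the chain rule, log p(video, actions) = log p(video) + log p(actions | video), and a conditional objective simply drops the first term by masking video tokens out of the loss, so the model never has to learn the video distribution at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# One interleaved sequence of video ("v") and action ("a") tokens, with
# illustrative per-token negative log-likelihoods.
token_types = np.array(["v", "v", "a", "v", "a"])
nll = rng.uniform(0.5, 2.0, size=len(token_types))

# Full generative objective: model p(video, actions). Every token
# contributes to the loss, so the base model must also learn p(video).
full_loss = nll.mean()

# Conditional objective: model only p(actions | video). Video tokens are
# masked out of the loss, and p(video) can remain unlearned.
action_mask = token_types == "a"
conditional_loss = nll[action_mask].mean()
```

The practical consequence is the one argued above: the masked objective is cheaper per task, but the resulting model has learned strictly less about the data, which limits what post-training can later extract from it.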