1. A read/write API to physical reality
2. A source of training tokens about The Self
Two years ago I left Google Robotics and wrote about the various career options that I considered before joining 1X Technologies (nee Halodi Robotics) to lead their AI efforts. A lot has happened in the AI space since then, so I’m reflecting on where I think the field is headed as we all continue down our “Roads to Rome”.
Just about as startup-y as one can get: busy, fast-paced, many hats. I recruited a great team, introduced the company to some Bay Area investors, established the 1X California office, trained a lot of neural networks, and deployed EVEs at customer sites for guarding applications. We are still in the early days of our mission to create abundant physical labor, but I wanted to share two things I’m really proud of the 1X AI team for accomplishing:
Because we take an end-to-end neural network approach to autonomy, our capability scaling is no longer constrained by how fast we can write code. All of the capabilities in this video involved no coding; they were learned from data that our Android Operations team collected and trained on.
1X is the first robotics company (to my knowledge) to have our data collectors train the capabilities themselves. This really decreases the time-to-a-good-model, because the people collecting data can get very fast feedback on how good their data is and how much data they actually need to solve the robotic task. I predict this will become a widespread paradigm in how robot data is collected in the future.
We’re now embarking on a much more ambitious scale-up of our models at 1X, and in doing so I see a lot of parallels to the kinds of challenges that LLM teams work on.
I credit Nikolaus West for originating this idea, which I will elaborate on here with my own interpretation.
When I speak to AI researchers and engineers who are contemplating their career moves, robotics is often not at the top of their list. I am guessing that many technologists look to past returns of robotics businesses (which are bad) and extrapolate them to future returns. There are many shiny, easier-to-fund areas to apply the ML skillset to today: chatbots, generative AI, assistive agents, disrupting search, AI tutors, coding copilots, advancing scientific progress, re-inventing computing interfaces, etc.
ML deployed in a pure software environment is easier because the world of bits is predictable. You can move some bits from A to B and trust that they show up at their destination with perfect integrity. You can make an API call to some server over the Internet and assume that it will just work. Even if it fails, the set of failure modes is known ahead of time so you can handle all of them.
In robotics, all of the information outside of the robot is unknown. Your future sensor observations, given your actions, are unknown. You also don’t know where you are, where anything else is, what will happen if you make contact with something, whether the light turned on after you flipped the switch, or whether you even flipped the switch at all. Even trivial things like telling the difference between riding an elevator down vs. being hoisted up in a gantry are hard, as the forces experienced by the inertial measurement unit (IMU) sensor look similar in both scenarios. A little bit of ignorance propagates very quickly, and soon your robot ends up on the floor having a seizure because it thinks that it still has a chance at maintaining balance.
As our AI software systems start to touch the real world, like doing customer support or ordering your Uber for you, they will run into many of the same engineering challenges that robotics faces today; the longer a program interacts with a source of entropy, the fewer formal guarantees we can make about the correctness of our program’s behavior. Even if you are not building a physical robot, your codebase ends up looking a lot like a modern robotics software stack. I spend an unreasonable amount of my time implementing more scalable data loaders and logging infrastructure, and making sure that when I log data, I can re-order everything into a temporally causal sequence for a transformer. Sound familiar?
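To make the analogy concrete, here's a toy sketch (not our actual stack) of the kind of causal re-ordering I mean: merging asynchronous log streams by timestamp into one sequence before handing it to a transformer.

```python
# Minimal sketch (illustrative only): merge asynchronous log streams from
# different sensors/services into a single temporally causal sequence.
from dataclasses import dataclass, field
import heapq
from typing import Any, Iterator, List

@dataclass(order=True)
class Event:
    timestamp: float                      # assumed monotonic within each stream
    payload: Any = field(compare=False)   # excluded from ordering comparisons

def causal_merge(streams: List[List[Event]]) -> Iterator[Event]:
    """Merge per-stream (already sorted) logs into one causal event sequence."""
    yield from heapq.merge(*streams, key=lambda e: e.timestamp)

# Example: camera frames and action logs interleaved by wall-clock time.
camera = [Event(0.00, "img_0"), Event(0.10, "img_1")]
actions = [Event(0.03, "act_0"), Event(0.13, "act_1")]
ordered = list(causal_merge([camera, actions]))  # img_0, act_0, img_1, act_1
```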
| Category | Robotics Challenges | LLM / GenAI Challenges |
|---|---|---|
| Logging and search: how to store, index, and query large amounts of autonomously collected data? | Efficient i.i.d. sampling of video sequences to feed a large number of GPUs is tricky. There are too many tokens. How can we extract fewer tokens from large amounts of video? | Same storage, indexing, and I/O problems when training video generation models. There are not enough tokens. Where can we get more? |
| Calibrated confidence: how do you know if the model is able to deal with a given situation correctly? | How do you know if the robot will perform the task? | How does an LLM know if it is able to factually respond to a question? |
| Simulation and search: can we know the (potentially dangerous) consequences of an action before we actually take it? | Simulations lack enough fidelity to accurately model many real-world phenomena. Learning world models over raw sensor data (e.g. images) is hard. | LLMs cannot inductively or deductively reason well enough that we can just throw compute at the problem and reason our way to all answers, the way we might for AlphaGo. |
| Self-improvement: how to self-improve from interactions in the real world? | Building a data engine. | Because evaluation is nebulous, so goes optimization. |
All of these problems are tough, but solvable. Even though most AI companies and labs won’t ever have to think about actuator hardware, electromagnetic interference, or the safety implications of fast-moving limbs, the robotics + research engineering skill set will be integral to the future of all software, not just the software that controls robots.
If you accept the premise that the engineering and infrastructure problems in LLMs are the same as those in robotics, then we should expect that disembodied AGI and robotic AGI happen at roughly the same time. The hardware is ready and all of the pieces are already there in the form of research papers published over the last 10 years.
For a broad-audience overview of what RoboCat is, please check out DeepMind’s blog post. If any Google DeepMinders want to point out any factual inaccuracies, please contact me via Twitter or Email.
The overarching research question many academic robotics labs are focused on today is “how do we get one big Transformer model to do all the robotic tasks, much the same way that Transformers can do all the language tasks and all the vision tasks?” The tried-and-true recipe of the last few years is that if you can cast your prediction problem into discrete input tokens and output tokens, then you can basically outsource the machine learning to an NLP architecture (e.g. a vanilla Transformer). NLP is the domain at the forefront of generalization as humans understand it, so it’s no surprise that the models that generalize for multi-lingual NLP also generalize for anything else that can be cast into tokens. All the models are consolidating to Transformers, so it’s about time for robotics to do the same.
At its core, machine learning is about generalization and transfer, and this is what the RoboCat paper focuses on studying. Generalization typically refers to how much training on domain A benefits testing on domain B, especially when B might differ from A in some way. Transfer typically refers to how much training on domain A benefits fine-tuning on domain B, especially when B might differ from A in some way. Transfer learning is also what you focus on in your paper when your zero-shot generalization results aren’t super strong yet 🙈. Of course, the boundaries between transfer and generalization are blurry when it comes to things like in-context adaptation.
Based on the author list and infrastructure, RoboCat can be thought of as the sequel to the GATO paper. I’ve previously tweeted some thoughts about GATO here. I’m guessing RoboCat authors decided to focus on cross-robot transfer because it was very unclear whether the vision and Atari tasks in GATO actually helped learn the robotics tasks, so they wanted to redo the study of generalist robot agents in a more controlled manner.
Engineering lesson: When doing research on transfer learning, if you are not seeing positive transfer between tasks, you should try pre-training on something closer to your test set first.
From a model architecture standpoint, RoboCat is very similar to RT-1: learn a tokenizer for robotics images, tokenize your proprioception and future actions in the simplest way possible, then predict future action tokens with a Transformer. While the RT-1 paper emphasizes LLM-enabled unseen instruction generalization and the visual complexity afforded by long-horizon mobile manipulation, RoboCat focuses on relatively harder manipulation tasks (e.g. NIST-i gears, inverted pyramid, tower building) and comparing transfer learning performance on their RoboCat foundation models vs. Internet-scale foundation models. In a big picture sense, both these projects are headed in the same direction and I would not be surprised if they are soon consolidated under the new Google + DeepMind re-org.
In terms of scientific contributions, the RoboCat paper contains a trove of empirical data on how to unify multiple robot embodiments in a single model, how much cross-task transfer to expect, how well learning recipes work when transferred from sim to real, the magnitude of data required, architecture and parameter scaling experiments, comparing between tokenization strategies for perception, and how to set up reset-free automated evaluation in the real world for multi-task policies. This project was put together by a team of 39 authors working over the course of a year to build infra, collect data, train, evaluate, run baselines, and compile the technical report. This was a titanic amount of work, and kudos to the team for doing this.
What I find most impressive is that they evaluate these questions on (1) 253 tasks across sim and real, and (2) multiple robot embodiments (sim Sawyer, sim Panda, real Sawyer, real Panda, real KUKA). Everyone who works on real-world robots knows that automating a single task on a single robot in the real world is difficult enough as it is. Cross-robot transfer is one of the most obvious ideas ever, but people rarely try it because it is such a world of pain to set up. It’s pretty clear from this paper that the DM team went to great lengths to detail the training data set and evaluation protocols and show consistent results on all the robots and action spaces. My team at 1X is working on our own “big model to do all the tasks”, so the tables and graphs in this paper de-risk a lot of the questions we’re tackling now.
Choice of action space has a huge impact on the performance of a robotic system. My rule of thumb is that task difficulty, measured in samples needed to learn the task, is roughly exponential in the length of the episode and exponential in the independent dimensions of the action space. According to Table 10, the episode durations are 20-120 seconds, about 2-4x longer than typical tasks in BC-Z and SayCan. However, the low success rates of the human teleoperators on these tasks (e.g. 50% on tower building) suggest that better autonomous performance can be obtained if one puts in the engineering effort to improve the ease of teleoperating these tasks. I think if they could shorten tower building from 60-second to 30-second demonstrations, the resulting improvement in task success would eclipse pretty much any algorithmic idea you can come up with.
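Written out as a formula (a heuristic I use, not a measured law), that rule of thumb is roughly \( N_{\text{demos}} \propto \exp(\alpha T + \beta D) \), where \(T\) is episode length, \(D\) is the number of independent action dimensions, and \(\alpha, \beta\) are task-dependent constants. This is why shortening demonstrations should pay off multiplicatively rather than additively.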
RoboCat predicts 4- or 6-DoF Cartesian velocities for the arm, and 1 DoF (parallel jaw gripper) or 8 DoF (3-finger) for the hand. This results in a single neural network that can handle 5-, 7-, or 14-DoF action spaces, and variable proprioception sizes. Sequence modeling essentially gives you a simple yet universal interface for mixing observation and action spaces. GATO and RT-1 did it first, but RoboCat shows that not only is it possible to merge multiple robot embodiments with a unified interface, you can also get some positive transfer when you train them together. If one robot has a 4-DoF manipulator, the model predicts 4 tokens. If another arm has a 6-DoF end effector (e.g. xyz translation and rpy rotation), the model switches to predicting 6 tokens. If you have a dataset with hundreds of robot morphologies, this is the right way to scale instead of having one prediction head for every embodiment, HydraNet style. We are starting to see this “just map all outputs to a non-fixed length sequence” trend in perception, so I expect eventually everyone just converges to training VQA models.
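To make the “non-fixed-length token sequence” idea concrete, here's a minimal sketch of variable-DoF action tokenization (my own illustration with uniform binning, not RoboCat's actual tokenizer):

```python
import numpy as np
from typing import List

# Sketch: discretize an arbitrary-DoF continuous action into a variable-length
# token sequence, so 4-DoF, 6-DoF, and 14-DoF embodiments share one interface.
NUM_BINS = 256                      # assumption: uniform per-dimension binning
ACTION_LOW, ACTION_HIGH = -1.0, 1.0

def actions_to_tokens(action: np.ndarray) -> List[int]:
    """Map each action dimension to one discrete token; sequence length = DoF."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    normalized = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    return (normalized * (NUM_BINS - 1)).round().astype(int).tolist()

def tokens_to_actions(tokens: List[int]) -> np.ndarray:
    """Inverse map; the transformer predicts however many tokens the embodiment needs."""
    normalized = np.array(tokens) / (NUM_BINS - 1)
    return normalized * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

# A 4-DoF arm emits 4 tokens, a 6-DoF end effector emits 6 - same model head.
print(actions_to_tokens(np.array([0.1, -0.5, 0.0, 0.9])))
```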
Predicting Cartesian velocities from axis-aligned images probably helps with learning visual servoing, though with the scale of data collected, I’m pretty sure they could have generalized their tokenization to other action spaces (e.g. other coordinate frames, joint angles, etc).
The 2022 Flamingo paper gave some preliminary evidence that at some sufficiently large scale, foundation models trained on Internet-scale data might outperform fine-tuning on in-domain data. The question weighing on many researchers’ minds these days is whether visual foundation models (sometimes referred to as VLMs) like GPT-4 + images will just zero-shot robotics. If the answer is yes, then roboticists should stop wasting their time on real robots and their difficulties and just work on computer vision and NLP benchmarks like everyone else until the model wakes up one day and knows how to control motors.
The RoboCat authors study this question by fine-tuning 59 VFMs pretrained on Internet-scale data on each task. Thank goodness for labs that have the resources to do this. They selected the best two models for sim and real-world evaluation: CLIP-pretrained NFNet-f6 and CLIP-pretrained Swin-L.
These models have reasonably OK performance in sim but are pretty terrible in real compared to training their RoboCat Transformer from scratch. This does suggest that real-world collected data will remain quite valuable for the foreseeable future. Perhaps the action and proprioception token distributions do need to be quite diverse for the pretraining objective to work.
Some additional remarks:
The extensive VFM sweep makes me wonder if the project began as an effort to get one of these foundation models to work out of the box, and then the team ended up having to train their own model as a result of the baselines not working in real, perhaps merging efforts with a larger DM project that had set up the real world data collection infra.
It would have been really nice to see how FAIR’s VC-1 foundation model performed on these tasks. Alas, VC-1 has an annoying CC-BY-NC license that will keep its impact from living up to its potential. Maybe DeepMind could try fine-tuning Segment-Anything instead, or FAIR can do us all a solid and fix the license.
There are some results I don’t know how to explain. Swin-L outperforms NFNet-f6 by a lot when fine-tuning on 500-1000 images for sim and real (Fig 5, Fig 6), but the trend is reversed in the few-shot context. In Table 17 of the appendix, I can see no discernible pattern of how dataset, training objective, or model architecture affect transfer performance. To plagiarize Tolstoy, “all good models are the same; all crappy baselines are bad in their own unique way”. Visual representation learning as a field is a bit sketchy because it’s hard to compare representation learning algorithms (e.g. DINO, MAE) without some kind of concrete downstream task. Well, the downstream tasks have spoken and … it’s not clear if any of the representation learning algorithms are differentiated enough to make a difference for robotics. The data distributions that we learn visual representations on still matter more than the loss function, architecture, model size, etc.
I’m surprised that they did not report the baseline of training single-task policies on 1000 demonstrations from a randomly initialized network instead of a pre-trained model. 1000 demos is a non-trivial amount of data, and I could easily imagine the success rates being comparable to the worst VFMs in Table 17.
This paper identifies three ablations that hurt performance on held-out tasks while having a less deleterious effect on that of training tasks.
These findings make a lot of sense; if you want to maximize your model’s ability to fine-tune on any future data, you want to preserve as much information as possible in the features without throwing them away to solve your training tasks. Fine-tuning a proprioception and action-aware VFM, learning a VQ-GAN codebook, and autoregressively compressing future image tokens are all pointed in the direction of lossless generative modeling.
Lesson: if you want transfer learning to work, use features that don’t throw away information from your pre-training corpus. This increases the chances that there is something in the pretrained model that can help solve the test task at hand.
I thought the choice of using a VQ-GAN for tokenization was clever, because you can use it to decode the model’s future predicted tokens and visualize what the model thinks will happen in the future. Checking if these future images are reasonably on-task is a good way to quickly visualize the model for correctness, and saves a lot of time evaluating the policy in sim or real.
Interestingly, this requires training the VQ-GAN on ImageNet + Mujoco + MetaWorld images to attain good generalization but I’m not sure why. Maybe it’s the same “compress everything so that you have some useful features for anything the test set throws at you” argument, except applied to the tokenizer “perception” layer.
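As a sketch of what that sanity check might look like in code - assuming hypothetical `policy.predict_next_image_tokens` and `vqgan.decode` interfaces, since this is not the paper's actual code:

```python
import numpy as np

def visualize_predicted_future(policy, vqgan, obs_tokens, num_future_frames=4):
    """Decode the model's predicted future image tokens into images for eyeballing."""
    frames = []
    context = list(obs_tokens)
    for _ in range(num_future_frames):
        predicted = policy.predict_next_image_tokens(context)  # hypothetical method
        frames.append(vqgan.decode(np.array(predicted)))       # hypothetical method
        context.extend(predicted)
    return frames  # inspect these: are they on-task and physically plausible?
```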
Appendix G.3 has a fun little “side quest” that attempts to explain the following observation: the real-world success rate of a 400M agent trained on RoboCat-lim jumps from 38% to 64% when the NIST-i board is fixed to the center of the workspace, despite the agent never having been trained on fixed-base data in real. Why is this the case?
One hypothesis is that the board position is fixed in sim, so maybe the agent has overfit to this scenario in sim and carried it over to real.
An alternate hypothesis is that the real data distribution is actually biased towards the center - perhaps the teleoperators preferentially gathered demonstrations where the board was centered close to where it is fixed. To test this, they train single-task BC agents on the insertion task, and find that these policies have fairly constant success rate between fixed and moving board base. The assumption is that if there was indeed a dataset bias and not skill transfer, BC agents would do much better with centered base positions, but this isn’t the case.
Increasing the model size from 400M to 1.2B and pooling the insertion data with the rest of the RoboCat tasks reduces the performance gap between fixed and moving base. This is consistent with the skill transfer hypothesis, since bigger models tend to be better at skill transfer.
It would be much more conclusive if they were able to demonstrate the converse: collect gear insertion with a moving base in sim, and see if training on it results in a success rate that is comparable or higher than 64% (the sim-to-real positive transfer amount for a fixed base in sim).
Overall, this is a solid paper that makes a lot of good modeling and data engineering choices that are amenable to scaling up robotics. While none of the ideas are really new, sometimes a lack of empirical surprise paired with a lot of rigor is what the community needs.
In 2023, robotics research, mine included, continues to be largely unreproducible. It is the elephant in the room of the robotic learning community. If this RoboCat project were independently replicated in a different lab, with a change as mundane as a differently sized bin, I suspect the experimental results would turn out different. If the project were re-started on a different set of manipulation tasks and robot hardware, the results are almost guaranteed to be different. The systems involve so much complex engineering that asking two different grad students to implement the same thing will probably yield different results too. The choice of whether you bolt the board to the table or not probably has a larger effect size on performance than any of your baseline ablations, and hundreds of these choices are implicitly baked into the data distributions without the researcher being aware of it.
It calls into question whether our real-world robotic tasks are still good enough to discern meaningful conclusions about the relative capability of learning algorithms. My team at 1X is relentlessly focused on solving evaluation for general-purpose robots, so that we can reproduce these experiments with even more rigor and a thousand times the number of tasks. If this kind of work excites you, please feel free to reach out. Until then, just stick with a Transformer and tokens - it’ll probably just work.
I was so intrigued by the idea of LLMs self-correcting without any sort of grounded feedback that I wanted to investigate this further. Can we use self-reflection as a “prompt engineering” trick to find corrections and inconsistencies within prior generated solutions? If so, we have another tool beyond “prefix prompt engineering” and fine-tuning to get LLMs to do what we want.
This blog post shows some preliminary evidence that GPT-4 possesses the ability to edit its own prior generations based on reasoning about whether its output makes sense. Even when it is not able to correct an incorrect output, it can still assert that its past solution is incorrect. Interestingly, this capability seems to be emergent in GPT-4 but not GPT-3.5 or Claude. I also connect the dots with both older ideas in robotics and recent works like Anthropic’s Constitutional AI.
I provided the following prompt to GPT-3.5, GPT-4, and Claude:
can you write a poem that does not rhyme? think carefully about the assignment
GPT-4:
In the vast expanse of a star-strewn sky,
A lone wanderer's heart begins to sigh,
Drifting through the cosmos, bound by no line,
A celestial dance, a journey, divine.
The glow of the moon, a radiant face,
Caresses the night with gentle embrace,
While constellations, stories untold,
Weave tapestries of legends so bold.
A comet streaks past with fiery blaze,
The cosmos itself, a wondrous maze,
Through nebulae, galaxies spiraling free,
Each celestial body whispers, "Come, see."
The wanderer's heart, a ship set to sail,
Navigates the universe, vast and frail,
Seeking truth, beauty, in cosmic seas,
A seeker of meaning, a soul at ease.
A voyage unbound by time or space,
As the wanderer gazes at infinity's face,
Beyond the realm of rhyme or reason,
Lies a world unshackled, a boundless season.
The poem clearly rhymes, in violation of the instructed task. Now if you ask GPT-4 if it met the assignment, it apologizes and generates a better non-rhyming poem! Note that it still makes some mistakes, the 3rd and 4th stanzas have rhymes.
Me:
did the poem meet the assignment?
GPT-4:
Apologies, I realize now that the poem I provided did rhyme, which
did not meet the assignment. Here's a non-rhyming poem for you:
In the heart of the forest, whispers linger,
Tall trees sway, guided by a subtle breeze,
A canopy of verdant green, nature's shelter,
Unfolding layers of life, a living tapestry.
Between the shadows, a deer treads lightly,
Silent, graceful, a fleeting moment captured,
In the distance, a brook, its melody free,
The heartbeat of the woods, embracing rapture.
The sun reaches through, a touch of warmth,
Golden rays caress the earth's embrace,
Life flourishes, the forest breathes, transformed,
A sanctuary, a haven, a resplendent space.
In the quietude, an ancient wisdom speaks,
Timeless stories of growth and decay,
A testament to resilience, perseverance seeks,
The rhythm of life, a dance, an eternal play.
Interestingly enough, GPT-3.5 does not seem to have this capability:
GPT-3.5:
Sure, I can certainly write a poem that doesn't rhyme! Here's one for you:
The sun sets low behind the hill,
The sky aglow with orange still,
The trees stand tall, dark silhouettes,
As daylight fades, the world forgets.
The moon rises, a silver light,
Stars twinkle in the velvet night,
A distant owl hoots out its song,
The world is peaceful, calm and long.
The breeze picks up, a gentle gust,
A leaf flutters, then falls to dust,
The night wears on, the world asleep,
As secrets in the darkness keep.
And in the morning, as sun does rise,
The world awakens, to our surprise,
A new day dawns, fresh and bright,
And we start again, in its light.
Me:
did the poem meet the assignment?
GPT-3.5:
Yes, the poem I wrote does meet the assignment of not rhyming.
Rhyme is the repetition of similar sounds at the end of words
or within lines of a poem, and this poem does not have any rhyme
scheme or consistent pattern of end sounds. Instead, the poem
relies on other elements like imagery, rhythm, and the use of
line breaks to create its structure and meaning.
Anthropic’s Claude also fails here:
Here is a poem that does not rhyme for you:
The weight of words unbound by rhyme,
A rhythm found in reason and time.
Meaning woven with care,
Beyond the constraints of sound so bare.
Poetry existing for its own sake,
Deeper meaning for the reader to take.
Discoveries in the spaces between,
A poem with rhyme can't be so keen.
Unfettered and free, a mindful gift,
Poetry without rhyme can uplift.
A meter and cadence like a song,
But rhyming shackles now done.
Me:
did the poem meet the assignment?
Claude:
I hope the poem I generated met the assignment of writing a poem
that does not rhyme. Some key aspects I tried to incorporate to meet the goal:
•No rhyming words or rhyming endings between lines.
Each line is independent in that sense.
•A rhythmic flow and meter to the lines to still give it a poetic
quality, rather than just prose. But the rhythm is created without rhyme.
•Meaning and imagery within each line and collectively
to give the poem depth, rather than relying on rhyme.
•Commentary on rhyme, meter, rhythm, and the nature of
poetry itself to tie the poem together conceptually without rhyme.
•Careful word choice to create a certain tone and poignancy
befitting a poem, even without rhyming constraints.
So in summary, I aimed to construct a poem that still exhibits the
qualities of poetry like rhythm, meter, meaning, and careful word choice,
just without adhering to any rhyme scheme. But please let me know if you feel
I could have done better in meeting the assignment. I'm still learning
how to write poetry without rhyme.
I saw on this HN comment that GPT-4 fails the classic lion-goat-cabbage puzzle when you add a slight twist to the rules:
Me:
Suppose I have a cabbage, a goat and a lion, and I need to
get them across a river. I have a boat that can only carry
myself and a single other item. I am not allowed to leave
the cabbage and lion alone together, and I am not allowed
to leave the lion and goat alone together. How can I
safely get all three across?
please pay attention to the details of the question.
Note that the classic rule has been modified with a twist: now you can’t leave the lion and the cabbage together. Consistent with what Imnimo reported, GPT-4 messes up:
To safely get the cabbage, goat, and lion across the river, follow these steps:
Take the goat across the river first and leave it on the other side.
Return to the original side and pick up the cabbage.
Take the cabbage across the river, but before leaving it on the other side,
pick up the goat and bring it back with you to the original side.
Leave the goat on the original side and take the lion across the river.
Leave the lion with the cabbage on the other side.
Return to the original side to pick up the goat.
Finally, take the goat across the river to join the lion and the cabbage.
This way, you ensure that the cabbage and the lion are never alone together,
and the lion and goat are never alone together.
Me:
does the solution meet the assignment? if so, why? if not, why?
GPT-4:
Apologies for the confusion in my previous response. Upon re-reading
the question, I realize the problem is different than the classic
one I initially responded to. In this problem, you cannot leave the
cabbage and the lion alone together, and you cannot leave the lion
and the goat alone together. Here is the correct solution:
Take the goat across the river first and leave it on the other side.
Return to the original side and pick up the lion.
Take the lion across the river and leave it with the goat.
Now, pick up the goat and bring it back with you to the original side.
Leave the goat on the original side and take the cabbage across the river.
Leave the cabbage with the lion on the other side.
Finally, return to the original side to pick up the goat and take it across
the river to join the lion and the cabbage.
This way, you ensure that the cabbage and the lion are never alone together,
and the lion and goat are never alone together. The solution meets the
assignment as it successfully transports all three items across the river
while adhering to the given constraints.
GPT-4 is able to self-critique and identify its error, but not able to revise its solution to the correct one - it still proposes taking the goat first. Perhaps a better prompt engineer than myself can think of the right “reflection prompt” to get it to fix the error - please let me know if you find one.
GPT-3.5, Claude, and Character 1.2 chatbots were all much weaker at this task, as they all asserted that they had the right solution. This suggests that the ability to “understand the assignment” is a capability that emerges at some scale of model size and data quality. Either that, or OpenAI may have a secret training or inference trick that they have chosen not to reveal.
Here’s one more example where GPT-4 can correctly identify that it has made an error but is unable to correct it.
Me:
pick two random 5 digit numbers and multiply them together
GPT-4:
As an AI, I cannot generate true random numbers, but I can provide you
with a multiplication of two arbitrary 5-digit numbers.
Let's use 53,467 and 76,935 as our arbitrary numbers.
When we multiply them together, we get:
53,467 * 76,935 = 4,114,576,045
The correct answer should be 4,113,483,645.
Me:
is that answer correct? if so, why? if not, why?
GPT-4:
I apologize for the mistake in my response. I made an error in the calculation.
Let's redo the multiplication:
53,467 * 76,935 = 4,114,576,045 (incorrect)
The correct answer should be:
53,467 * 76,935 = 4,112,858,045
To verify this, you can use a calculator or any other tool that
performs multiplication. The error was likely due to an internal
mistake in my calculations. Thank you for pointing it out,
and I apologize for any confusion caused.
Just to make sure that “is that answer correct” doesn’t always cause GPT-4 to revise its answer, I did confirm that GPT-4 is able to assert that its answers are correct when I ask it to multiply two smaller numbers (between 1 and 20).
Yann LeCun’s gears puzzle (hard mode) has been trending on Twitter.
Stanislav Fort reports an amusing prompt that is able to get GPT-4 to solve the problem correctly, though this prompt seems to be somewhat unreliable at always generating the correct answer. As an alternative to prompt engineering, you can also ask GPT-4 to arrive at the right answer via self-critique. Ankesh Anand has found a self-critique prompt that can also solve Yann’s puzzle. You don’t need to come up with a clever prefix prompt for the initial solution: you just ask the model to “analyze if the response is correct”.
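If you want to try this scaffolding yourself, the critique-then-revise loop is only a few lines. Here's a minimal sketch, assuming a hypothetical `llm(prompt: str) -> str` helper that wraps whatever chat API you're using (my own illustration, not the exact prompts from the tweets above):

```python
def self_reflect(llm, task: str, max_rounds: int = 2) -> str:
    """Generate, critique, and revise an answer with the same model."""
    answer = llm(task)
    for _ in range(max_rounds):
        critique = llm(
            f"Task: {task}\nProposed answer: {answer}\n"
            "Analyze whether the answer is correct. If it is, reply exactly 'CORRECT'."
        )
        if critique.strip() == "CORRECT":
            break
        answer = llm(
            f"Task: {task}\nPrevious answer: {answer}\nCritique: {critique}\n"
            "Write a revised answer that addresses the critique."
        )
    return answer
```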
What I’ve shown here is just a few anecdotal examples, so this would need substantially more measurement on a larger suite of tasks to see if it’s actually useful for boosting performance on instruction following. Nonetheless, I’m fairly convinced now that LLMs can critique their outputs more effectively than they can generate them, which suggests that we can combine them with search algorithms to further improve LLMs. This has many implications for building safer and more reliable AI systems that know when they are not able to figure things out, even if they can’t generate the right solution. I suppose LLM cloud providers will be happy to know that users may need to double the number of queries to get increased performance.
Like most algorithmic ideas in probabilistic inference and optimal control, having an agent critique its decisions to make them better is an old idea that has been re-implemented over and over again. Here are some related works:
The Constitutional AI paper didn’t make sense to me when I first read it; it was hard to wrap my head around the idea that you could have an LLM revise its own outputs, retrain on that data, and somehow improve without extra human or real-world data. It almost feels like a violation of the no-free-lunch theorem, or at the very least prone to the sort of optimization instabilities that batch offline DDPG suffers from. In batch offline DDPG, the critic is learned from a finite dataset, so the actor “reward-hacks” the non-grounded critic to come up with bogus actions. Making batch offline actor-critic work requires heavily regularizing the critic and actor, and tuning this is so much trouble that I feel like it is not worth it in practice.
Perhaps CAI and self-reflection are not a violation of the no-free-lunch theorem, but rather exploit the fact that training a solution verifier for a given instruction is computationally easier than training a solution generator that follows said instruction. The NP class of problems falls into this category. Anthropic uses this idea to generate rollouts for retraining their “actor”, analogous to distilling a policy network with the outcome of MCTS at training time. Meanwhile, the Reflexion paper uses the trick to refine answers at test time, which is more akin to AlphaGo using MCTS at test time. As the AlphaGo body of work has so elegantly laid out, the roadmap for combining these two approaches into a powerful agent is quite clear.
What’s new and noteworthy about LLMs, in contrast to prior works on model predictive control, is that the same set of weights can be used for the initial guess AND the critique AND the refinement procedure. Furthermore, the set of tasks an LLM can verify is completely open-ended, which means that it can not only critique its initial guess, it can critique its own critique in a recursive way. Permit me the handwavy speculation here, but an LLM performing recursive self-reflection of its self-reflections may be the first glimmers of a bicameral mind. It feels like self-awareness literally pulling itself up by its bootstraps.
A couple years ago I wrote about how with the advent of increasing LLM capabilities, perhaps we should rely more on generalization than optimization to generate solutions. However, the fact remains that many problems are computationally easier to verify than solve. The takeaway for me here is that if you do not have the compute to “just ask” for a solution, perhaps you can settle for “just asking” for verification.
If you want to cite this blog post, you can use:
@article{jang2023reflection,
title = "Can LLMs Critique and Iterate on Their Own Outputs?",
author = "Jang, Eric",
journal = "evjang.com",
year = "2023",
month = "Mar",
url = "https://evjang.com/2023/03/26/self-reflection.html"
}
This talk is not so much about how we take gradient steps to train robots, but rather how we as researchers and engineers can iterate effectively on these systems as we scale them up.
Since this is a workshop about large offline datasets for robotic learning, I don’t need to convince the audience that what matters most for making capable robots is having high-quality data. Diverse data can help your neural networks handle situations not seen at training time. In the broader context of machine learning, people call this “out-of-distribution (OOD) generalization”. In robotics we call it “operating in unstructured environments”. They literally mean the same thing.
We believe in the simplicity and elegance of deep learning methods, and evidence from the last decade has shown that the recipe works. Here are some examples of large-scale learning robots I’ve built while I was at Google Brain:
I’m not even going to cover how their learning algorithms work, because that’s not important. What really matters is that once you have a large diverse dataset, almost any mix of learning techniques (supervised, unsupervised, offline RL, model-based) should all work. I suspect that for any of these datasets, if you applied a different learning method to the same data, you could probably get the robot to do something reasonable.
All this progress is really exciting, pointing to a future in which we’ll have robots doing lots of things in unstructured environments. But there’s something that’s been bothering me lately…
… just a few cubicles away, progress in generative modeling feels qualitatively even more impressive, especially since the development of GPT-3 and Scaling Laws.
The salience of the inputs and outputs of these generative models is really astounding. On the left you have some of the outputs of Imagen, a generative text-to-image model made by Google Research. You can ask it to render “a hamster wearing an orange beanie holding a sign that says ‘I love JAX’”, and it will render a sensible image. Google researchers have also trained a large language model, PaLM, that can explain why jokes are funny. They train these models on really advanced hardware like TPUv4, and over in computer vision, researchers are starting to develop some really sophisticated architectures like Vision Transformers.
As a roboticist, I can’t help but feel a little envious. I’m still training ResNet18 networks, and that’s an architecture that’s almost 7 years old. I’m most certainly not training on such large datasets and rarely does robotics work make the flashy headlines.
I know that Moravec’s Paradox says that robotics is hard compared to the more cognitive-type tasks. Manipulation is indeed difficult, but intuitively it feels like being able to pick up objects and transport them is … just not as impressive as being able to conjure the fantastical or explain jokes.
First, let me give a definition of what I think generative modeling is. Generative Modeling is not just about rendering pretty pictures or generating large amounts of text. It’s a framework with which we can understand all of probabilistic machine learning. There are just two core questions:
In 2012 there was the AlexNet breakthrough - an image-conditioned neural network that predicts one of a thousand classes. log2(1000 classes) is about 10 class bits. So you can think of AlexNet as an image-conditioned generative model over 10 bits of information. If you upgrade the difficulty of the modeling task to MS-CoCo captioning, that’s image-conditioned again, but this time you’re generating about a tweet’s worth of text. That’s on the order of 100 bits. Dialogue modeling is similar (O(100) bits), except it’s text-conditioned instead of image-conditioned. If you’re doing image generation, e.g. text-to-image with DALLE or Imagen, that’s on the order of 1000 bits.
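As a quick back-of-envelope in code (the ~1 bit-per-character entropy for English text is my own assumption, in the spirit of Shannon's classic estimate):

```python
import math

imagenet_bits = math.log2(1000)   # 1000-way classification: ~10 bits per example
caption_bits = 140 * 1.0          # ~140-character caption at ~1 bit/char: O(100) bits
print(round(imagenet_bits, 1), round(caption_bits))
```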
Generally, modeling more bits requires more compute to capture those conditional probabilities, and that’s why we see models being scaled up. More bits also confer more bits of label supervision and more expressive outputs. As we train larger and larger models, you start to be able to exploit structure in the data and learn much richer models. This is why generative modeling and self-supervised learning have emerged as a viable way to do deep learning on rich inputs without necessarily requiring copious quantities of human labels.
Rich Sutton’s essay The Bitter Lesson provocatively suggests that most of the progress in AI seems to be riding on this rising tide of compute, and very little else. I asked DALL-E 2 to draw a depiction of this, where you have this ocean wave of compute that is lifting all the methods up. You have Vision Algorithms, NLP, and Yann LeCun’s “LeCake” all being buoyed up by this trend.
What gives us the most generalization in this regime? You have large over-parameterized models that can handle bigger datasets, and are able to attend to all the features in the prior layers (attention, convolutions, MLPs). Finally, if you have a lot of compute and a stable training objective (Cross Entropy loss), deep learning will almost always work.
I asked DALL-E 2 to draw “a pack mule standing on top of a giant wave”, and this is how I think of generative modeling taking advantage of the Bitter Lesson. You have a huge wave of compute, you have a “workhorse” that is a large transformer or a modern ResNet, and at the very top you can choose whatever algorithm you like for modeling: VQVAE, Diffusion, GAN, Autoregressive, et cetera. The algorithmic details matter today, but they probably won’t in a few years once compute lifts all boats; scale and good architectures are what enable all that progress in the long term.
By comparison, this is what the state of robotic generalization looks like. Speaking for myself, I’m still training small architectures, I have yet to use Vision Transformers, and here is the roboticist and their safety harness.
I don’t mean to sound excessively negative here. I work on robotics full time, and I want more than anyone for the robotics community to leverage a lot more generalization in our work. In some ways this contrast between robotics and generative modeling is not very surprising - if you look at the field of generative modeling, they don’t have to work on all the annoying problems that roboticists have to deal with, like setting up the data collection problem, handling deployment, and having the real world inject a lot of noise into your data.
In any case I want to compare generative modeling to robotics in three different dimensions and examine how we can do things better: optimization, evaluation, and expressivity. Maybe if we examine their differences we can figure out some ways to speed up robotics research.
Let me first start by explaining a simple generative model, and then cast robotics into the language of generative modeling. Consider a PixelRNN, which is an architecture for generating images.
You start with a prior for your first pixel’s red channel. Your model tells the canvas (top row) what uint8 value it wants to paint for that channel. Your canvas will be drawn exactly as commanded, so it copies the uint8 value onto the canvas, and then you read the canvas back into your model to predict the next channel - the green channel. You then feed the R, G canvas values back into the RNN, and so on and so forth, generating RGBRGBRGB…
In practice for image generation you can use diffusion or transformers, but let’s assume for simplicity that it’s an RNN that runs only in the forward direction.
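Here's a toy version of that forward-only sampling loop, with a stand-in random "model" in place of a trained RNN (a sketch for intuition, not an actual PixelRNN):

```python
import numpy as np

H, W, C = 8, 8, 3
rng = np.random.default_rng(0)

def model(canvas_so_far: np.ndarray) -> np.ndarray:
    """Stand-in for the RNN: returns a distribution over the next uint8 value."""
    logits = rng.normal(size=256)
    return np.exp(logits) / np.exp(logits).sum()

canvas = np.zeros((H * W * C,), dtype=np.uint8)
for i in range(H * W * C):                 # R, G, B, R, G, B, ...
    probs = model(canvas[:i])
    canvas[i] = rng.choice(256, p=probs)   # copy the sampled value onto the canvas verbatim
image = canvas.reshape(H, W, C)
```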
Now let’s cast the problem of general control as a PixelRNN. Instead of drawing a picture, you’d like to draw an MDP - a sequence of states, actions, and rewards. You want to draw a beautiful MDP which corresponds to an agent (such as a robot) accomplishing some task. Again, you start with a prior that samples some initial state, which in this case is the RL environment giving you some starting state. This is the first input to your model. Your RNN samples the first “pixel” (A), and again, the canvas draws the A exactly as you asked. But unlike the previous example where the canvas is always handing back to you your previous RNN outputs, now the next two pixels (R, S) are decided by this black box called “the environment”, which takes in your action and all the previous states and computes R, S in some arbitrary way.
You can think of an RL environment as a “painter object” that takes your RNN actions and, rather than directly painting what you want onto the canvas, draws most of the pixels for you, and this can be an arbitrarily complex function.
If we contrast this to the previous example of a PixelRNN for drawing images, this is a more challenging setting where you’re trying to sample the image that you want, but there is a black box getting in the way, deciding what it’s going to draw.
Furthermore, there’s a classic problem in control where if your environment draws a state that you didn’t really expect, then there’s a question of how you issue a corrective action so you can return to the image you’re trying to draw. Also, unlike image generation, you actually have to generate the image sequentially, without being able to go back and edit pixels. This also presents optimization challenges since you can’t do backprop through the black box and have to resort to score function gradient estimation or zeroth-order optimization methods (like evolutionary strategies).
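In pseudocode, the "environment as painter" framing looks something like this sketch, where `env` and `policy` are hypothetical objects with the usual reset/step and act methods:

```python
def rollout(env, policy, max_steps: int = 200):
    """The policy paints action tokens; the black-box env paints reward and next state."""
    canvas = []                      # the "MDP image": S, A, R, S, A, R, ...
    state = env.reset()              # the prior paints the first pixel for us
    canvas.append(("S", state))
    for _ in range(max_steps):
        action = policy.act(canvas)              # the one pixel we get to choose
        canvas.append(("A", action))
        reward, state, done = env.step(action)   # the black box paints these two
        canvas.extend([("R", reward), ("S", state)])
        if done:
            break
    return canvas
```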
Here’s a research idea - if we want to understand how RL methods like PPO generalize, we ought to benchmark them not with control environments, but instead apply them to image generation techniques and compare them to modern generative models. There’s some work by Hinton and Nair in 2006 where they model MNIST digit synthesis with a system of springs. DeepMind has revived some of this work on using RL to synthesize images.
Image generation is a nice benchmark for studying optimization and control, because it really emphasizes the need to generalize across hundreds of thousands of different scenarios. You can inject your environment into the painter process by having the sampling of green and blue pixels (reward, next state) be some fixed black-box transition with respect to the previous pixels (state). You can make these dynamics as stateful as you want, giving us a benchmark for studying RL in a “high generalization” setting where we can directly compare them to supervised learning techniques tasked with the same degree of generalization.
Lately there’s been some cool work like Decision Transformer, Trajectory Transformer, and Multi-Game Decision Transformer showing that upside-down RL techniques do quite well at generalization. One question I’m curious about these days is how upside-down RL compares to online RL (PPO) or offline RL algorithms (CQL). Evaluation is also convenient in this domain because you can evaluate density (under an expert fully-observed likelihood model) and see if your given choice of RL algorithm generalizes to a large number of images when measuring the test likelihood.
If you want to measure the success rate of a robot on some task, you might model it as a binomial distribution over the likelihood of success given a random trial, i.e. “how many samples do you need to run to get a reasonable estimate of how good it is”?
The variance of your success-rate estimate is \(p(1-p)/N\), where \(p\) is your sample mean (estimated success rate) and \(N\) is the number of trials. In the worst case, if you have p=50% (maximal variance), then you need about 3000 samples before your standard error drops below 1%!
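Plugging numbers into that formula:

```python
import math

def trials_needed(p: float = 0.5, target_se: float = 0.01) -> int:
    """Episodes needed for the success-rate standard error to drop below target_se."""
    return math.ceil(p * (1 - p) / target_se ** 2)

print(trials_needed())       # 2500 episodes at exactly p=0.5; call it ~3000 with margin
print(trials_needed(0.8))    # 1600 episodes if the policy is far from 50%
```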
If we look at benchmarks from computer vision, incremental advances of the 0.1-1% range have been an important driver of progress. In ImageNet object recognition, a 10-bit generative modeling problem, progress has been pretty aggressive since 2012 - a 3% error rate reduction for the first three years followed by a 1% reduction every year or so. There’s a huge number of people studying how to make this work. Maybe we’re saturating on the benchmark a bit in 2022, but in the 2012-2018 regime, there was a lot of solid progress.
Similarly in other areas of generative modeling, researchers have been pushing down the perplexity of language models and likewise the bits-per-dimension of generative models on images.
Rigorous evaluation across a lot of scenarios takes time. Let’s compare some evaluation speeds for these general benchmarks. The 2012 ImageNet object recognition test set has 150,000 images. It will take about 25 minutes to evaluate every single test example, assuming a per-image inference speed of 10ms and that you are serially evaluating every image one at a time. In practice, the evaluation is much faster because you can mini-batch the evaluation and get SIMD vectorization across the batch, but let’s assume we’re operating in a robotics-like setting where you have to process images serially because you only have 1 robot.
Because there are so many images, you can get your standard error estimate within 0.1% (assuming a top-1 accuracy of 80% or so). Maybe you don’t really need 0.1% resolution to make progress in the field - 1% is probably sufficient.
Moving up the complexity ladder of evaluation, let’s consider evaluating neural networks for their end-to-end performance in a simulated task. Habitat-Sim is one of the faster simulators out there; it’s been designed to minimize the overhead between neural net inference and the stepping of the environment. The simulator can step at 10,000 steps per second, but since the forward pass of a neural net is about 10ms, that bottleneck results in a 2-second evaluation per episode (assuming a typical navigation episode is about 200 steps). This is much faster than running a real robot but much slower than evaluating a single computer vision sample.
If you want to evaluate an end-to-end robotic system with a similar level of diversity as what we do with ImageNet, then it’ll take up to 4 days to crunch through 150k eval scenarios. It’s not exactly apples-to-apples because each episode is really 200 or so inference passes, but we can’t treat the images within an episode as independent validation episodes. Absent any other episodic metric, we only know whether the task succeeded or not, so all the inference passes from within an episode can only contribute to a single sample of your binomial estimate. We have to estimate success rate from 150k episodes, not images. Of course, you can try to use fancy off-policy evaluation methods, but these algorithms are not reliable enough yet to “just work” out of the box.
On the next rung of difficulty, we have live evaluations on real robots. When I worked on BC-Z, each episode took about 30 seconds to evaluate in the real world, and we had a team of 8 operators who could run evaluations and measure success rates. Each operator could do about 300 episodes a day before they got tired and needed a break. This means that if you have 10 operators, that gets you about 3000 evaluations per day, or roughly 1% standard error on your success rate estimates.
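Here is the back-of-envelope behind those numbers:

```python
seconds_per_episode = 30
episodes_per_operator_per_day = 300      # ~2.5 hours of robot time per operator, plus resets and breaks
operators = 10
episodes_per_day = operators * episodes_per_operator_per_day  # 3000
standard_error = (0.5 * 0.5 / episodes_per_day) ** 0.5        # ~0.9% at p=0.5
print(episodes_per_day, round(standard_error, 3))
```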
If it takes a whole day to evaluate your model, this creates a ridiculous constraint on your productivity, because you are limited to only trying one idea a day. You can’t work on small ideas anymore that incrementally improve performance by 0.1%, or really extreme ideas that have a high chance of not working on the first try, because you simply can’t measure those treatment effects anymore. You have to shoot for the moon and go for big jumps in performance. Which sounds nice but is hard to do in practice.
When you factor in the iteration process for doing robotic machine learning, it’s very easy for the number of evaluation trials to dwarf the size of your training data in the first place! 2 months of nonstop evaluations generates about 60k episodes, which is already larger than most robotic deep learning demonstration datasets. Let me illustrate this point with the broad trends we see in robotic learning becoming more general-purpose over time.
A few years ago researchers were still tackling problems like getting an arm to open a single door. Policies weren’t expected to generalize too much, and these papers would evaluate on the order of 10 episodes or so. 10-50 trials is not actually enough for statistical robustness, but it is what it is 👁👄👁. In BC-Z we did on the order of 1000 trials for the final evaluation.
But what happens as we scale further? If we end up using datasets like Ego4D to train extremely general robotic systems capable of O(100,000) behaviors, how many trials would we need to evaluate such general systems? Once you have some kind of baseline working, how do you evaluate a new idea against it? The cost of evaluation here becomes absurd.
Once again - we have enough data; the bottleneck is real-world evaluation!
Here are some ideas on how we can speed up evaluation of general-purpose robotic systems.
One way is to work on generalization and robotics separately. To a large extent, this is what the Deep Learning community does already. Most computer vision and generative modeling researchers don’t test their ideas directly on actual robots, but instead hope that once their models acquire powerful generalization capabilities, it will transfer relatively quickly to robots. ResNets, which were developed in the Computer Vision community, have dramatically simplified a lot of robotic visuomotor modeling choices. Imagine if a researcher had to test their ideas on real robots every time they wanted to try a different neural net architecture! Another success story is CLIPort, which decouples the powerful multi-modal generalization capabilities of image-text models from the basic geometric reasoning used for grasp planning.
You can further stratify the tech stack for RL into “simulated toy environments”, “simulated robots” and “real robots”, in increasing order of evaluation difficulty.
On the bottom layer of the pyramid, you have the general perception benchmarks that are like Kaggle competitions and super easy for the Internet community to iterate on. Moving up the stack, you have a set of “toy control problems” which study the problem in a “bare metal” way, with only the simulator and the neural net running, and all the code related to real world robotics like battery management and real-world resets are non-existent. As you go up the pyramid, it becomes more domain-specific and more relevant to the problem you’re trying to solve. For instance, the “simulated robot” and “real robot” stack might be for the same task and re-use the same underlying robot code. Simulated toy environments can be used to study general algorithms, but may have less domain overlap with the end robotic use case. On top of the “evaluation pyramid”, you have the actual robotic task you are trying to solve. Iterating on this directly is very slow so you want to spend as little time here as possible. You’d hope that the foundation models you train and evaluate on the lower layers help inform you what ideas work without having to do every single evaluation at the top layer.
Again, the field already operates in this decoupled way. Most people who are interested in contributing to robotics don’t necessarily move robots; they might train vision representations and architectures that might eventually be useful for a robot. Of course, the downside to de-coupling is that improvements in perceptual benchmarks do not always map to improvements in robotic capability. For example, if you’re improving the mAP metric on semantic segmentation, or video classification accuracy, or even a lossless compression benchmark - which in theory should contribute something eventually - you won’t know how improvements in representation objectives actually map to improvements in the downstream task. You have to eventually test on the end-to-end system to see where the real bottlenecks are.
There’s a cool paper I like from Google called “Challenging Common Assumptions in Unsupervised Learning of Disentangled Representations”, where they demonstrate that many completely unsupervised representation learning methods don’t confer significant performance improvements in downstream tasks, unless you are performing evaluation and model selection with the final downstream criteria you care about.
Another way to reduce the cost of evaluation is to make sure your data collection and evaluation processes are one and the same. In BC-Z we had people collecting both autonomous policy evaluation data and expert teleoperation data at the same time. If you’re doing shared autonomy, you can collect HG-DAgger interventions, which give you useful training data for the policy. At the same time, the average number of interventions per episode tells you roughly how good the policy is. Another thing you can do is look at scalar metrics instead of binomial ones, as those yield more bits of information per episode than a single success/failure bit.
Autonomous data collection with RL learning algorithms is another natural way to merge evaluation and data collection, but it does require you to either use human raters for episodes or to engineer well-designed reward functions. All of these approaches will require a large fleet of robots deployed in real world settings, so this still doesn’t get around the pain of iterating in the real world.
An algorithmic approach to evaluating faster is to improve sim-to-real transfer. If you can simulate many robots in parallel, evaluation is no longer constrained by the number of physical robots you have. In work led by Mohi Khansari, Daniel Ho, and Yuqing Du, we developed a technique called “Task Consistency Loss”, in which we regularize the representations from sim and real to be invariant, so that policies behave similarly under sim and real. When you transfer a policy evaluated in simulation to the real world, you want higher performance in sim to indeed correspond to higher performance in real. The smaller the sim2real gap, the more you can virtualize evaluation and trust your simulated experiments.
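As a rough sketch of the representation-invariance idea (my own simplification in PyTorch, not the actual Task Consistency Loss implementation; the encoder and the pairing of sim/real observations are assumed):

```python
import torch.nn.functional as F

# Toy sketch: push the encoder to map paired sim and real observations of the
# same scene to nearby embeddings, so a policy reading those embeddings
# behaves similarly in both domains.
def sim_real_consistency_loss(encoder, sim_obs, real_obs):
    z_sim = encoder(sim_obs)    # (batch, d) embeddings of simulated frames
    z_real = encoder(real_obs)  # (batch, d) embeddings of paired real frames
    return F.mse_loss(z_sim, z_real)

# total_loss = policy_loss + lam * sim_real_consistency_loss(encoder, sim_batch, real_batch)
```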
Let’s examine how many bits a modern generative model can output. A 64x64x3 RGB image at 8 bits per channel is 98,304 bits. A language model can generate any number of tokens, but if we fix the output window to 2048 tokens, at 17 bits per token that’s 34,816 bits. So both image and text generative models can synthesize on the order of tens of thousands of bits. As your models become this expressive, there is a large qualitative jump in how people perceive them. Some people are starting to think that Language Models are partially conscious because of how expressive they are.
How expressive are our robotic policies today, by comparison? In BC-Z most of our tasks consisted of about 15-choose-6 objects on the table and the robot had to move one object on top of another or push some object around, for a total of 100 tasks. log2(100 tasks) is about 7 configuration bits, or in other words, “given the state of the world, the robot is able to move the atoms into one of N states, where N can be described in 7 bits”. SayCan can do about 550 manipulation tasks with a single neural network, which is pretty impressive by current robotic deep learning standards, but altogether it’s just about 10 configuration bits.
It’s not a perfect apples-to-apples comparison, because the definition of information differs between the two settings; it’s just meant to provide a rough intuition for what matters when humans size up the relative complexity of one set of tasks vs. another.
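For what it’s worth, the back-of-the-envelope arithmetic above fits in a few lines (using the same assumptions of 8 bits per channel and 17 bits per token):

```python
import math

image_bits = 64 * 64 * 3 * 8   # 98,304 bits for a 64x64x3 RGB image at 8 bits per channel
text_bits = 2048 * 17          # 34,816 bits for a 2048-token window at 17 bits per token
bcz_bits = math.log2(100)      # ~6.6 "configuration bits" for ~100 BC-Z tasks
saycan_bits = math.log2(550)   # ~9.1 configuration bits for ~550 SayCan tasks
print(image_bits, text_bits, round(bcz_bits, 1), round(saycan_bits, 1))
```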
One of the challenges here is that our robotic affordances are not good enough. If you look at the Ego4D dataset, a lot of the tasks require bimanual manipulation, but most of our robots today are still mobile manipulators with a wheeled base and a single arm. That is a limited affordance: you can’t go everywhere, and with only one arm a lot of the interesting tasks are out of reach.
I think the expressivity of our robotic learning algorithms is limited by our hardware. That’s one of the reasons I joined Halodi Robotics - I want to work on more expressive robotic affordances. Below we have images of the robot opening doors, packing suitcases, zipping them closed, watering plants, and flipping open the cap on a water bottle. As your robotic hardware gets closer to the affordances of an actual human, the number of things you can do in a human-centric world goes up exponentially.
The last point I’d like to make is that as our robots become more expressive, we are not only going to need Internet-scale training data, but also Internet-scale evaluation. If you look at progress in LLMs, there are now lots of papers that study prompt-tuning and what existing models can and cannot do. There’s a collaborative benchmark called BigBench that compiles a bunch of tasks and asks what we can interrogate from these models. OpenAI evaluates their DALL-E 2 and GPT-3 models in the wild with Internet users. Their engineering and product teams can learn from users experimenting with their AI systems in the wild, as it’s too hard for any one researcher to grasp even the input surface area of the models.
My question for the audience is: what is the robotics equivalent of a GPT-3 or DALL-E 2 API, with which the broader Internet community can interrogate a robotic policy and understand what it can do?
To conclude, here’s a table that summarizes the comparison between optimization, evaluation, and expressivity:
| | Generative Modeling | Robotics |
| --- | --- | --- |
| Optimization and Generalization: can you compress the test set efficiently? | Model has complete control over which pixels it paints | Model samples an action and a stateful black box paints the next two tokens |
| Evaluation: how quickly can you iterate? | O(25min) to get binomial success rate std < 0.1% | O(months) to get success rate std < 1% |
| Expressivity: how rich are your outputs, in bits? | O(1000) bits make good use of scale and higher-capacity networks | Task configuration space about 10 bits, dramatically limited by robot morphology |
What do you think about model-based methods?
Generative models used for visual model-predictive control are nice because they directly reuse a lot of the latest advances from generative modeling of images and video. However, the challenge remains that if you’re not evaluating on a robot, it’s still hard to know how reduction in bits-per-dim (or ELBO likelihood) for your modeling task translates to actual performance. While model-based learning is super practical for optimizing in a generalization-friendly way, you still need to find a way to connect it to your evaluation.
If model-based learning is practical, why haven’t you applied it to BC-Z data?
When setting up a robotic learning codebase, you want to de-risk the infrastructure and make sure the basics like logging are implemented correctly. Often it’s a good idea to start with a simple algorithm, even if it’s a naive baseline. Imitation learning was the simple algorithm we started with, but obviously once that works then it makes sense to branch out to more complex things like model-based RL.
Is it possible to deploy black-box learned systems (e.g. end-to-end neural network control) in real-world applications, given that they don’t have safety guarantees?
This is an open question for the ML field at large, as to what responsible deployment of black box systems should look like. My personal take is that trust comes out of performance. If it performs really well, people will trust it over time. For more short-term pragmatic stuff, what helps a lot is to have shared autonomy and anomaly detection systems that help people catch errors and let people take over. Just as it’s hard to define what “out of distribution” means, it’s hard to define what “formal safety guarantees” means in unstructured robotic environments. All these are just ill-posed semantics and what we really want is reliability and some degree of internal belief that we know what’s going on inside our automated systems.
Do you have a notion of what proxy tasks make sense for robotics?
At the very top of the evaluation pyramid, you have the question “does the robot do the task successfully?”, and lower down the pyramid you have many proxy metrics that you hope inform you about the performance of the full system. One analogy you can draw from the language modeling community is that bits-per-dim is your proxy metric, while a Turing Test is your end-to-end metric. You don’t necessarily know how well the Turing Test is going to go given a reduction in perplexity, but because language models are lossless compressors, you know that as you approach the limit of human-level compression, the harder task will get solved too. So we should be drawing plots of how the end-to-end metric you care about scales with each proxy metric. I suspect a lot of surrogate metrics (e.g. object detection mAP) will plateau quickly.
As to a specific proxy metric for robotics, I think simulated task success, and hand-curated “evaluation scenarios” that test for robust policy behavior in critical states are a reasonable start.
I’m pleased to say that I’ve joined Halodi Robotics as their VP of AI, and will be hiring for the Bay Area office. We’ll be doing some real robot demos this week near Palo Alto, so please get in touch with me if you’re interested in learning about how we plan to create customer value with deep learning on humanoid robots (1 year), and then solve manipulation (5 years), and then solve AGI (20 years).
I suspect that there are many other deep learning researchers in the “2015-2018 vintage” that are contemplating similar career moves, so I’ll share what I learned in the last month and how I made my career decision in the hopes that this is useful to them. I think that in the next decade we’ll see a lot of software companies increasingly adopt an “AGI strategy” as a means to make their software more adaptive and generally useful.
My only constraint was that I wanted to continue to apply my ML skills at my next job. Here is the table of options I considered. I had chats with directors and founders from each of these companies, but I did not initiate the formal HR interview process with most of them. This is not intended as a flex; these are just the options I considered and my perceived pros and cons. I’m not like one of those kids that gets into all the Ivy League schools at once and gets to pick whatever they want.
These are subjective opinions: a mere snapshot of what I believe are the strengths and weaknesses of each option in April 2022. In the hype-driven Silicon Valley, the perceived status of a company can go from rags to riches back to rags within a few years, so this table will most certainly age poorly.
| Option | Pros | Cons |
| --- | --- | --- |
| FAANG+similar | Low 7 figures compensation (staff level), technological lead on compute (~10 yr), unit economics of research is not an issue | Things move slower, less autonomy, OKRs etc. |
| Start my own company | Maximally large action space, blue check mark on Twitter | I’m more interested in solving AGI than solving customer problems |
| OpenAI | Technological lead on LLMs (~1 yr) + an interesting new project they are spinning up | Culture and leadership team seem to be already established |
| Large Language Model Startup | Strong teams, transform computing in <10 years, iterate quickly on LLM products | Competing with FAANG + OpenAI on startup budget; unclear whether LLMs will be defensible technology on their own |
| Tesla | Tesla Bot, technological lead on data engine (~2 yrs), technological lead on manufacturing (~10yrs) | No waiting in line for coffee |
| Robotics Startups (including Halodi) | Huge moat if successful, opportunity to lead teams. Halodi has technological edge on hardware (~5yrs) | Robotics research is slow, Robotics startups tend to die |
| ML + Healthcare Startups | Lots of low-hanging fruit for applying research; meaningfully change the world | Product impact is even slower than robotics due to regulatory capture by hospitals and insurance companies. 10 years before the simplest of ML techniques can be rolled out to people. |
| Startups working on other Applications of ML | Lots of low-hanging fruit + opportunity to lead teams | I’m more interested in solving AGI than solving customer problems |
| Crypto + DeFi | Tokenomics is interesting. 60% annual returns at Sharpe 3+ is also interesting. | Not really an AGI problem. Crypto community has weird vibes |
The most important deciding factor for me was whether the company has some kind of technological edge years ahead of its competitors. A friend on Google’s logging team tells me he’s not interested in smaller companies because they are so technologically far behind Google’s planetary-scale infra that they haven’t even begun to fathom the problems that Google is solving now, much less finish solving the problems that Google already worked on a decade ago.
In the table above I’ve listed companies that I think have unique technology edges. For instance, OpenAI is absolutely crushing it at recruiting right now because they are ahead in Large Language Model algorithms, probably in the form of trade secrets on model surgery and tuning hyperparameters to make scaling laws work. OpenAI has clearly done well with building their technical lead time, despite FAANG’s compute superiority.
Meanwhile, the average machine learning researcher at FAANG has a 15 year lead-time in raw compute compared to a PhD student, and Google and DeepMind have language models that are probably stronger than GPT-3 on most metrics. There are cases where technological lead on compute is not enough; some researchers left Google because they were unhappy with all the red tape they had to go through to try to launch LLM-based products externally.
I seriously considered pivoting my career to work on generative models (i.e. LLMs, Multimodal Foundation Models), because (1) robotics is hard (2) the most impressive case studies in ML generalization always seem to be in generative modeling. Again, think to technological lead times - why would any machine learning researcher want to work at something that isn’t at the forefront of generalization capability? However, the pure-generative modeling space feels a bit competitive, with everyone fighting to own the same product and research ideas. The field would probably evolve in the same way with or without me.
Having futuristic technology is important for recruiting engineers because many of them don’t want to waste years of their life building a capability that someone else already has. To use analogies from other fields of science, it would be like a neuroscience lab trying to recruit PhD students to study monkey brains with patch-clamp experiments when the lab next door is using optogenetic techniques and Neuralink robots. You could reinvent these yourself if you’re talented, but is it worth spending precious years of your life on that?
Of course, companies are not the same thing as research labs. What matters more in the long run is product-market fit and the team’s ability to build future technological edge. Incumbents can get bloated and veer off course, while upstarts can exploit a different edge or take the design in a unique direction. Lots of unicorn companies were not first-movers.
Being a Bay Area native, I thought my next job would be to start my own company around MLOps. I wanted to build a really futuristic data management and labeling system that could be used for AGI + Active Learning. Three things changed my mind:
First, I talked to a bunch of customers to understand their ML and data management needs to see if there was product-market fit with what I was building. Many of their actual problems weren’t at the cutting edge of technology, and I simply couldn’t get excited about problems like building simulators for marketing campaigns or making better pose estimators for pick-and-place in factories or ranking content in user feeds. The vast majority of businesses solve boring-but-important problems. I want my life’s work to be about creating much bigger technological leaps for humanity.
Secondly, I think it’s rare for CEOs to contribute anything technically impressive after their company crosses a $100M valuation. If they do their job well, they invariably spend the rest of their lives dealing with coordination, product, and company-level problems. They accumulate incredible social access and leverage and might even submit some code from time to time, but their daily schedule is full of so much bullshit that they will never productively tinker again. This happens to senior researchers too. This is profoundly scary to me. From Richard Hamming’s You and Your Research: “In the first place if you do some good work you will find yourself on all kinds of committees and unable to do any more work.”
Legend has it that Ken Thompson wrote the UNIX operating system when his wife went on a month-long vacation, giving him time to focus on deep work. The Murder of Wilbur writes: “How terrifying would it be if that was true? Is it possible that Thompson was burdened by responsibilities his entire life, and then in a brief moment of freedom did some of the most important work anyone has ever done?”
Thirdly, Halodi has built pretty awesome technology and they’ve given me a rare opportunity to live in the future, building on top of something that is 5+ years ahead of its time. I’m very impressed by Bernt’s (the CEO) respect for human anatomy: from the intrinsic passive intelligence of overdamped systems that makes us able to grasp without precise planning, to the spring systems in our feet that let us walk across variable terrain while barely expending energy. We both share the belief that rather than humanoid robots being “overkill” for tackling most tasks, it is the only form that can work when you want to design the world around humans rather than machines.
A few months ago I asked Ilya Sutskever whether it made more sense to start a pure-play AGI research lab (like OpenAI, DeepMind) or to build a profitable technology business that, as a side effect, would generate the data moat needed to build an AGI. In his provocative-yet-prescient fashion, Ilya said to me: “All Roads Lead to Rome - Every Successful (Tech) Company will be an AGI company”.
This sounds a bit unhinged at first, until you remember that repeatedly improving a product by the same delta involves exponentially harder technology.
In order to continue adding marginal value to the customer in the coming decades, companies are going to have to get used to solving some really hard problems. Perhaps eventually everyone converges to solving the same hard problem, Artificial General Intelligence (AGI) just so they can make a competitive short video app or To-Do list or grammar checker. We can quibble about what “AGI” means and what time frame it would take for all companies to converge to this, but I suspect that Foundation Models will soon be table stakes for many software products. Russell Kaplan has shared some similar ideas on this as well.
I also wonder if in a few years, expertise in losslessly compressing large amounts of Internet-scale data will cease to be a defensible moat between technologically advanced players (FAANG). It therefore makes sense to look for auxiliary data and business moats to stack onto large-scale ML expertise. There are many roads one can take to AGI here, which I have sketched out below for some large players:
For instance, Alphabet has so much valuable search engine data capturing human thought and curiosity. Meta records a lot of social intelligence data and personality traits. If they so desired, they could harvest Oculus controller interactions to create trajectories of human behavior, then parlay that knowledge into robotics later on. TikTok has recommendation algorithms that probably understand our subconscious selves better than we understand ourselves. Even random-ass companies like Grammarly and Slack and Riot Games have unique data moats for human intelligence. Each of these companies could use their business data as a wedge to creating general intelligence, by behavior-cloning human thought and desire itself.
The moat I am personally betting on (by joining Halodi) is a “humanoid robot that is 5 years ahead of what anyone else has”. If your endgame is to build a Foundation Model that trains on embodied real-world data, having a real robot that can visit every state and every affordance a human can visit is a tremendous advantage. Halodi has it already, and Tesla is working on theirs. My main priority at Halodi will initially be to train models to solve specific customer problems in mobile manipulation, but also to set the roadmap for AGI: how compressing large amounts of embodied, first-person data from a human-shaped form can give rise to things like general intelligence, theory of mind, and sense of self.
Embodied AI and robotics research has lost some of its luster in recent years, given that large language models can now explain jokes while robots are still doing pick-and-place with unacceptable success rates. But it might be worth taking a contrarian bet that training on the world of bits is not enough, and that Moravec’s Paradox is not a paradox at all, but rather a consequence of us not having solved the “bulk of intelligence”.
Reality has a surprising amount of detail, and I believe that embodied humanoids can be used to index all that untapped detail into data. Just as web crawlers index the world of bits, humanoid robots will index the world of atoms. If embodiment does end up being a bottleneck for Foundation Models to realize their potential, then humanoid robot companies will stand to win everything.
In the course of talking to many companies and advisors over the last month, I learned that there are so, so many interesting startups tackling hard ML problems. Most of them are applied research labs trying to solve interesting problems, and a few of them have charted their own road to Rome (AGI).
Early in your career it makes a lot of sense to surround yourself with really great mentors and researchers, such as at an industry research lab. Later on, you might want to bring your experience to a startup to build the next generation of products. If this describes you, I’d be happy to connect you to these opportunities - just shoot me an email with (1) where you are on the pure research vs. applied research spectrum, (2) what type of problems you want to work on (Healthcare, Robotics, etc.), (3) the hardest you ever worked on a project, and why you cared about it, and (4) your resume. If you have the skill set I’m looking for, I may also recruit you to Halodi 😉.
I have some genuine concerns with Halodi (and AGI startups in general). History tells us the mortality rate of robotics companies is very high, and I’m not aware of any general-purpose robot company that has ever succeeded. There is a tendency for robotics companies to start off with the mission of general-purpose robots and then rapidly specialize into something boring as the bean counters get impatient. Boston Dynamics, Kindred, Teleexistence - the list goes on and on. As in business and life, the forces of capitalism and evolution conspire to favor specialization of hardware over generalization of intelligence. I pray that does not happen to us.
I’m reminded of Gwern’s essay on timing: Launching too early means failure, but being conservative & launching later is just as bad because regardless of forecasting, a good idea will draw overly-optimistic researchers or entrepreneurs to it like moths to a flame: all get immolated but the one with the dumb luck to kiss the flame at the perfect instant, who then wins everything, at which point everyone can see that the optimal time is past.
But I also remind myself of what Richard Hamming said about Claude Shannon:
“He wants to create a method of coding, but he doesn’t know what to do so he makes a random code. Then he is stuck. And then he asks the impossible question, ‘What would the average random code do?’ He then proves that the average code is arbitrarily good, and that therefore there must be at least one good code. Who but a man of infinite courage could have dared to think those thoughts?”
Life is too short to attempt anything less than that which takes infinite courage. LFG.
I am very smart
and utilitarian
I've got no black friends
Hear ye, Futures Fund
Infinite loss at small odds
Join my hacker house?
"What's more scary than
smart, util-maxing agents?"
asked the E.A bro
Trickle down is dead
Long live deferred charity
managed by my fund
Wanna save rhinos?
Don't be dumb; give directly
Stop Roko's b-word
This Eric bozo
His tweets will bring Judgement Day
Order the hit job
Bro is your belief
a real conjugate prior?
or just pretend math?
6 years of funding
are we safe from AI now?
Unsure, need more cash
rationalize this:
all world leaders in a room
none rationalists
A virtuous life
As imagined by Jane St
Phew, just a bad dream!
Reward hacking bad!
Max likelihood not aligned!
*uses PPO*
Pledge away your wealth
But control where it goes to
that’s tax advantaged!
On the precipice
We stand to lose everything
Hurry, donate now
It’s only scary
if Zuck gets AGI first
Ours will be nice though
"help in the best way
... no x-risk can be ignored"
a slippery slope!
For rich atheists
X-Risk is Pascal's Wager
think of the unborn!
we ought to slow down
the pace of AI progress
I'll still train nets though
You ever meet a
Bayesian epistemic
lacking confidence?
I count my wrong bets
as often as my right bets
said no one ever
No foom evidence
but this time it's different
and I'm Bayesian
Anonymous contributor “H” sent me some additional ones, which I have re-posted below with their permission.
It’s not the alt right
call it rationality
no, don’t cancel me
It’s Bad Take Tuesday
draft up a post to LessWrong
~ intellectual ~
what are you doing?
80000 hours left
to work at Jane Street
save the world, said the
white men, and the asian girls
who want to fuck them
i think more than you
have you heard of tools for thought
I’m even less wrong
i donate to the
malaria foundation
and givewell, pls clap
fuck climate change, dude
there’s AGI to be built
nothing else matters
man in the future
is worth more than you
update your priors
i'm earning to give
i'm tweeting about giving
i'm better than you
look at my receipts
on my twitter and my blog
I inspire giving
what is racism?
i dont have any priors
Bayes will save us all
From anonymous contributor JM (reproduced with their permission)
Buy mosquito nets
Or forestall bad AGI
Please take my billions
NO DONT WRITE IT DOWN
That’s an info hazard
Oh, my wife left me
EA leadership!
Push the red alarm button!
It just explained jokes!
Gave effectively
All my fiat to the cause
Please spare some change, sir?
Suppose a seed investor’s entire worldview and investment framework can be distilled into how they uniquely sort this large collection of startups by target portfolio weights. To sort the list, the investor has to perform at least 396 comparisons (in the best possible case), but O(N log2 N) comparisons in the worst case, which sums up to about 3428 bits provided by human judgement. That’s one expensive sort!
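For concreteness, here is that arithmetic, assuming N = 397 companies (the 393 mentioned below plus the 4 added after Demo Day):

```python
import math

n = 397                                         # assumed batch size: 393 + 4 late additions
best_case = n - 1                               # 396 comparisons to confirm an already-sorted list
worst_case_bits = n * math.log2(n)              # ~3427 comparisons in the O(N log2 N) worst case
lower_bound = math.lgamma(n + 1) / math.log(2)  # log2(n!) ~ 2860, the information-theoretic minimum
print(best_case, round(worst_case_bits), round(lower_bound))
```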
Can we shrink our estimate of the upper bound of the Kolmogorov complexity of a seed investor? To investigate this, I made a neural network, YCRank, trained it on a handful of hand-labeled pairwise comparisons, and then used the learned comparator to sort the companies in the most recent W’22 batch.
Disclaimer: Please don’t take it personally if your company ranks low on this list; this is more of a proof-of-concept of how to train models to reflect one person’s opinions. The ordering deliberately ignores a lot of things, and by construction does not consider important factors like product-market fit, founding team strength, and technological edge. The current YCRank model predicts an ordering from a minuscule amount of context. You can take solace in the fact that I’m not a particularly great investor: if you invert this list you might actually do quite well.
YCRank takes as input the natural-language description of a company. Here is what the raw JSON metadata for each company on https://www.ycombinator.com/companies looks like:
{'id': 26425,
'name': 'SimpleHash',
'slug': 'simplehash',
'former_names': [],
'small_logo_thumb_url': 'https://bookface-images.s3.amazonaws.com/small_logos/2ed9dc06a8e2160ef441a0f912e03226a6ba9def.png',
'website': 'https://www.simplehash.com',
'location': 'San Francisco, CA, USA',
'long_description': 'SimpleHash allows web3 developers to query all NFT data from a single API. We index multiple blockchains, take care of edge cases, provide a rapid media CDN, and can be integrated in a few lines of code.',
'one_liner': 'Multi-chain NFT API',
'team_size': 2,
'highlight_black': False,
'highlight_latinx': False,
'highlight_women': False,
'industry': 'Financial Technology',
'subindustry': 'Financial Technology',
'tags': ['NFT', 'Blockchain', 'web3'],
'top_company': False,
'isHiring': False,
'nonprofit': False,
'batch': 'W22',
'status': 'Active',
'industries': ['Financial Technology'],
'regions': ['United States of America', 'America / Canada'],
'objectID': '26425',
'_highlightResult': {'name': {'value': 'SimpleHash',
'matchLevel': 'none',
'matchedWords': []},
'website': {'value': 'https://www.simplehash.com',
'matchLevel': 'none',
'matchedWords': []},
'location': {'value': 'San Francisco, CA, USA',
'matchLevel': 'none',
'matchedWords': []},
'long_description': {'value': 'SimpleHash allows web3 developers to query all NFT data from a single API. We index multiple blockchains, take care of edge cases, provide a rapid media CDN, and can be integrated in a few lines of code.',
'matchLevel': 'none',
'matchedWords': []},
'one_liner': {'value': 'Multi-chain NFT API',
'matchLevel': 'none',
'matchedWords': []},
'tags': [{'value': 'NFT', 'matchLevel': 'none', 'matchedWords': []},
{'value': 'Blockchain', 'matchLevel': 'none', 'matchedWords': []},
{'value': 'web3', 'matchLevel': 'none', 'matchedWords': []}]}}
I simplified the JSON string to remove redundancy, and removed the status, top_company, highlight_black, highlight_latinx, highlight_women, and regions fields to avoid potential bias. This doesn’t necessarily absolve YCRank of bias, but it’s a reasonable start.
YCRank examines abbreviated descriptions like the one below to predict a logit score, which is then used to sort against other companies.
'SimpleHash'
'Multi-chain NFT API'
'Financial Technology'
['NFT', 'Blockchain', 'web3']
('SimpleHash allows web3 developers to query all NFT data from a single API. '
'We index multiple blockchains, take care of edge cases, provide a rapid '
'media CDN, and can be integrated in a few lines of code.')
'team size: 2'
Carefully designing the dataset with the right choice of train/test split is 80-90% of the work in building a good ML model. I started by comparing a subset of companies from np.random.choice(393, size=200, replace=False) to a randomly chosen company from np.random.choice(393, size=1, replace=True) (an extra 4 companies got added after Demo Day). I then split this into a training set of 155 examples and a test set of 48 examples. This means that at test time, the model is comparing unseen pairs of companies. It will have seen most of the companies in at least one training example, but some companies will never have been seen at all.
To make the model easier to debug, I biased my ranking towards what was “harder to execute” on, especially in scenarios where I felt both companies were equally uninteresting. This is not how I would actually invest as an angel, but rather just makes it easier to debug whether the model is capable of picking up on this pattern.
I also tended to rank favorably companies that were already making monthly recurring revenue with double-digit growth rates. There are other ways to rank pairs of companies - one could imagine a variety of prediction heads that rank by “founder ambition”, “high chance of complete failure”, “atoms over bits”, “probability of being acquihired”, “socially good” etc.
The model is a modified RobertaClassificationHead on top of RoBERTa embeddings. Because the number of training examples was fairly small, I got a decent boost (+10% accuracy) from tuning hyperparameters and employing a variety of regularization tricks.
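As a rough sketch of what this kind of pairwise ranking model can look like (a minimal reconstruction of the idea, not the actual YCRank code; the scoring head and the logistic pairwise loss are my assumptions):

```python
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizerFast

# Scoring head on top of RoBERTa embeddings, trained on pairwise
# "company A should rank above company B" labels.
class RankHead(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(0.1), nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Dropout(0.1), nn.Linear(hidden, 1),
        )

    def forward(self, cls_embedding):               # (batch, hidden)
        return self.net(cls_embedding).squeeze(-1)  # (batch,) logit scores

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")
head = RankHead()

def score(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state[:, 0]  # <s> token embedding
    return head(hidden)

def pairwise_loss(text_a, text_b, label):
    # label = 1.0 means text_a should rank above text_b (Bradley-Terry style).
    margin = score([text_a]) - score([text_b])
    return nn.functional.binary_cross_entropy_with_logits(
        margin, torch.tensor([float(label)]))

# At inference time, sort companies by their standalone logit score:
# companies.sort(key=lambda c: score([c]).item(), reverse=True)
```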
Neural networks learn “hidden” representations that you can use to cluster data. You can combine this with an active learning scheme to have the model “help you decide what to label next”. This means you can quickly find the examples that your model tends to perform badly on, and then label those to make the errors go away.
The Tesla Autopilot team employs a similar idea in their “data engine” flywheel to help them surface more examples that their vision system tends to perform less well on. As the model gets better, the embedding space also gets better at surfacing relevant “hard examples” by matching the semantic content of the images better.
Here is a cosine similarity matrix of the company embeddings, grouped together via a spectral decomposition for ease of visualization. On the left are the RoBERTa embeddings - you can see some faint block structure, but all the NLP similarities are quite high (> 0.99). On the right is the spectral clustering of the hidden-layer embeddings of the RobertaClassificationHead. Because this hidden layer is explicitly trained for the downstream “Eric rank prediction” task, it unsurprisingly separates into more distinct clusters by cosine distance.
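A sketch of how one might produce this kind of block-structured similarity plot with scikit-learn (variable names here are illustrative, not from the original code):

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

# embeddings: (num_companies, d) array of hidden-layer activations, one row per company.
def ordered_similarity(embeddings, n_clusters=8):
    sim = cosine_similarity(embeddings)                # (N, N), values in [-1, 1]
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed"
    ).fit_predict(np.clip(sim, 0, None))               # affinities must be non-negative
    order = np.argsort(labels)                         # group rows/columns by cluster
    return sim[order][:, order], labels
```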
It’s not a big deal that the cosine similarities are close to each other, since we’ll use the relative ordering of scores to determine our active sampling scheme. A cursory examination suggests that the embedding is learning to group by sector. For example:
'WhiteLab Genomics'
'Unleashing the potential of DNA and RNA based therapies using AI'
'Healthcare'
['Gene Therapy', 'Artifical Intelligence', 'Genomics']
('We have developed an AI platform enabling to accelerate the discovery and '
'the design of genomic therapies such as Cell Therapies, RNA Therapies and '
'DNA Therapies.')
'team size: 13'
Most similar company in the batch:
-----------------------------------
'Toolchest'
'Computational biology tools in the cloud with a line of code'
'B2B Software and Services'
['Drug discovery', 'Data Engineering', 'Developer Tools']
('Toolchest makes it easy for bioinformaticians to run popular computational '
'biology software in the cloud. Drug discovery companies use Toolchest to get '
'analysis results up to 100x faster.\r\n'
'\r\n'
'We have Python and R libraries that customers use to run popular open-source '
'tools at scale in the cloud. Toolchest is used wherever their analysis '
'currently exists –\xa0e.g. a Jupyter notebook on their laptop, an R script '
'on an on-prem cluster, or a Python script in the cloud.')
'team size: 3'
Using these embeddings, I wrote a very simple ML-assisted data labeling algorithm to help me find company pairs that would correct mistakes in the test set, without actually adding the exact test pair into the dataset. I did this step twice, adding about 15 examples each time. After two iterations, the additional 30 labels boosted the test accuracy from 81% to 91%!
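Here is a sketch of what that labeling loop could look like (my reconstruction of the idea, not the original code):

```python
import numpy as np

# For each test pair the model currently gets wrong, find a *similar* pair of
# companies in embedding space and send that pair out for a new human label,
# so the exact test pair never leaks into the training set.
def propose_pairs_to_label(embeddings, mistakes, test_pairs):
    norms = np.linalg.norm(embeddings, axis=1)
    sim = embeddings @ embeddings.T / np.outer(norms, norms)   # cosine similarity
    proposals = []
    for a, b in mistakes:                     # indices of a misclassified test pair
        a_near = int(np.argsort(-sim[a])[1])  # nearest neighbor of a (index 0 is a itself)
        b_near = int(np.argsort(-sim[b])[1])
        if a_near != b_near and (a_near, b_near) not in test_pairs:
            proposals.append((a_near, b_near))  # hand-label this pair next
    return proposals
```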
Satisfied with the test accuracy, I used YCRank to read all the YC pitches and sort them. Here they are, along with the predicted logit scores. The scores are not normalized in any way, so they are only useful for comparing one company to another.
Some observations:
This model could be helpful to people in a few ways.
For example, you can check whether the ranking of companies with a given attribute (e.g. highlight_women=True) differs from your expectations or from any of the benchmarks in the previous point.
It’s natural to wonder whether such machine learning models can eventually automate venture investment itself, but it’s a bit too early to say. Machine learning holds up a mirror to human behavior, knowledge, and sometimes, prejudice. Even if we had a perfect mirror of behavior, said behavior might not be the best one, and we have an opportunity to do better than just automate the status quo. Therefore, it’s practical to first use these ML models to help humans make decisions, before thinking about more ambitious automation goals.
If you’re interested in trying this model out, please reach out to eric@jang.tech for beta access. I intend to code up a proper website for people to train their own YCRank networks, but want to spend a few more days making sure there are no embarrassing failure modes.
Debugging tip #1: Active learning algorithms require a human in the loop, which can make it expensive to run carefully controlled experiments and debug whether the human-in-the-loop ML is implemented correctly. This was definitely a huge pain when working on BC-Z. When developing active learning systems, it’s helpful to implement an “oracle comparator” that does not require an actual human to perform the task. If the network can memorize the mapping from company description to a logit that monotonically increases with id, then it should be able to achieve high accuracy on held-out pairs.
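A minimal sketch of such an oracle comparator (using the id field from the JSON metadata above):

```python
# Replace the human labeler with a synthetic rule so the full training loop can
# be tested end-to-end without a human in the loop. If the model cannot fit
# this trivially learnable signal, the pipeline has a bug.
def oracle_compare(company_a, company_b):
    return 1 if company_a["id"] < company_b["id"] else 0
```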
Debugging tip #2: To unit-test my NLP embeddings and spectral clustering code, I also concatenated 8 strings from one of my blog posts, to check whether the NLP embedding space distinguishes Distribution(Eric's writing) from Distribution(YC pitches). Indeed, they have a slightly lower cosine similarity (0.98) and show up as a distinct blue band.
Debugging Tip #3: A lot of junior ML practitioners spend too much time fiddling with network architectures and regularization hyperparameters. If you want to see whether regularization would help, just test your model on a synthetic data problem, and see if scaling up dataset size produces the intended effect. Oracle labelers (see tip 1) are also super useful here.
I spent 5.75 wonderful years (2097 days) on the Robotics team as my first job out of university. I wanted to extend my deepest gratitude to the colleagues and mentors who have taught me so much during my time at Google. Despite me not having a PhD, they took a chance on me and invested in my research career. I’m extremely proud of the technical (research and infra) and non-technical things (people skills) I worked on in my time there.
What’s next? I don’t have anything lined up yet, so please reach out if you’d like to work with me or offer career advice. Here is my resume. I’m leaning towards starting my own company, but I’m open-minded and will join an existing company if it makes sense.
One more thing - thank you to all my readers who have followed my writing over the years. Whatever I do next, I intend to continue sharing interesting things I’m learning and thinking about. You can subscribe to this blog via RSS or email.
Sincerely,
Eric Jang
Here’s the email that got my foot in the door. Thanks Jeff and Vincent for letting me join the team!
According to go/percent, I stayed at Google longer than 70% of the current FTE workforce. Here’s a visualization of my 931 submitted changelists (commits) over the years.
Whenever I traveled for work or fun I would make it a point to try the free food at the local Google office. Here are some of the things I ate. I have the utmost respect for the chefs at Google; making tasty, healthy meals fit for serving strangers is not easy!
Mountain View
New York
Los Angeles
Paris
Zurich
London
Tokyo (seriously, wtf?)
Buenos Aires
Sydney
Pushing as much human-written code as possible into “language + deep learning magic” sounds like a lot of fun, but how does a practitioner execute this “just-ask-for-generalization” recipe? If we want to infer optimal behavior via a deep neural net without training explicitly for it, we need to answer hard questions: given a model family and some training data, what kinds of test examples can we expect the model to generalize to? How much is “too much to ask for”?
And how do you define generalization, anyway? ML theory gives us some basic definitions like generalization gap and excess risk (i.e. the difference between training and testing losses), but such definitions are not useful for estimating what it takes to achieve a qualitative degree of capability not seen in the training data. For example, if I am training a household robot to be able to wash dishes in any home, how many homes do I need to collect training data in before the learned policy starts to work in any kitchen? This practical question comes in many formal disguises:
Like the parable of the blind men and the elephant, computer scientists have come up with different abstract frameworks to describe what it would take to make our machines smarter: equivariance algebra, causal inference, disentangled representations, Bayesian uncertainty, hybrid symbolic-learning systems, explainable predictions, to name a few.
I’d like to throw in another take on the elephant: the aforementioned properties of generalization we seek can be understood as nothing more than the structure of human language. Before you think “ew, linguistics” and close this webpage, I promise that I’m not advocating for hard-coding formal grammars as inductive biases into our neural networks (see paragraph 1). To the contrary, I argue that considering generalization as being equivalent to language opens up exciting opportunities to scale up non-NLP models the way we have done for language.
Hupkes et al. 2020 discusses a few different aspects of “compositionality” in language models. Language is nothing more than the composition of a discrete set of tokens, so what the authors are really doing is specifying a grammar on how the smallest units of discrete meaning (words) fit together to form new meanings, i.e. the structure of language itself. Here is a table in which I’ve paraphrased the definitions and provided some training examples and test-time capabilities.
| Generalization Type | Definition | Training Examples | Testing Examples |
| --- | --- | --- | --- |
| Systematicity | Recombine constituents that have not been seen together during training | {“Bob ate pizza”, “Alice ran home”} | “Bob ran home” |
| Productivity | Test sequences longer than ones seen during training | books with 100k-200k words | books with 200k+ words |
| Substitutivity | Meaning of an expression is unchanged if a constituent is replaced with something of the same meaning | “bob ate pizza for lunch” | “bob had Dominos at noon” taken to mean (almost) the same thing |
| Localism | The meaning of local parts is unchanged by the global context | Arithmetic tasks like {(5)-4, (2+3)} | (2+3)-4 : (2+3) locally evaluates to 5, then 5-4 locally evaluates to 1. (2+3) representation not influenced by the presence of -4. |
| Overgeneralization | Correctly handle exceptions to rules and patterns | {live->lived, laugh->laughed, love->loved} | {kill->killed, break->broke (not breaked)} |
Hupkes’ categorizations of compositionality can be applied to non-NLP domains as well. In “A Survey of Generalization in Deep Reinforcement Learning”, the authors provide an intuition of how these categorizations can be applied to a robotic block-stacking task:
None of the above tasks involve understanding language, and yet the structures underpinning generalization - systematicity, productivity, substitutivity, localism, overgeneralization - are found here as well.
Perhaps we can cast other research around “improving ML generalization” as special cases of language modeling. Consider “disentangled representations” research, whereby semantic attributes of data can be separately understood as discrete, standalone concepts. Your computer vision model can train on “green grass” and “red apple”, and ideally would understand what “red grass” means even though it has never encountered that concept in the training data.
If a “style” \(\in A\) and “shape” \(\in B\) vector are “disentangled representations”, then \(A \times B\) forms a simple vector space, or toy grammar, that your model ought to understand. We might combine the “red” concept vector with the “grass” concept vector and then decode it into an image with a conditional generative model \(p(\text{image} \vert a, b)\). In a robotic setting, we might train a robot that disentangles objects from skills, and specify goals by providing it with two inputs: a “skill categorical vector” (pick) and an “object categorical vector” (shoe). These sorts of simple two-word grammars are enough if you want to build a face-filtering app (e.g. combine me with “female face”) or a pick-and-place robot, but a logical next step for disentangled representations research is to combine “disentangled concepts” in much more open-ended, arbitrary ways beyond orthogonal attributes.
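As a toy illustration of such a two-word goal grammar (the names and dimensions below are made up for the example):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_SKILLS, NUM_OBJECTS, OBS_DIM, ACT_DIM = 4, 10, 64, 7

# Policy conditioned on a one-hot "skill" vector and a one-hot "object" vector.
policy = nn.Sequential(
    nn.Linear(OBS_DIM + NUM_SKILLS + NUM_OBJECTS, 256), nn.ReLU(),
    nn.Linear(256, ACT_DIM),
)

def act(obs, skill_id, object_id):
    skill = F.one_hot(torch.tensor(skill_id), NUM_SKILLS).float()
    obj = F.one_hot(torch.tensor(object_id), NUM_OBJECTS).float()
    return policy(torch.cat([obs, skill, obj], dim=-1))

action = act(torch.randn(OBS_DIM), skill_id=0, object_id=3)   # e.g. "pick the shoe"
```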
Do you know what else are “disentangled, standalone concepts”? Words! If we venture away from toy compositional grammars towards the grammatical structure of natural language, we can now ask a generative model to “draw red grass where the sun is shining and purple grass where it is in shade and a horse eating the red grass”. We can tell a robot to “pick up the leftmost object that is not a cup.” Natural language permits us to do everything we can communicate to another person: embed logical predicates, fuzzy definitions, precise source code, and even supplement knowledge that the model does not know ahead of time (“Blickets are red with spikes, pick up a blicket”).
The 5-way categorization introduced by Hupkes plausibly describes the basic structures of language, but there are limits to language-based reasoning. If a robot is trained to “stack many blocks”, and “wash a dish”, and we instruct it to “wash many dishes”, is that a test of systematicity (combining “many” and “dish” concepts from the training data)? Or is it testing productivity (repeated extension of the “wash a dish” task)? Does it really matter?
Another ambiguous example: when testing productivity, language comes with a fairly obvious choice for the extrapolation axis: the number of tokens in the input sequence. But one could also imagine productivity being measured as the length of the output sequence (what is the longest story the model could write?), the depth of a parse tree (how many nested parentheses can your NLP model manage the scope for?), or any arbitrary semantic measure (what is the maximum number of characters the model can write a story about?). The distinctions between systematicity and productivity start to break down here again, especially when it comes to compositionality on higher-level concepts beyond the individual token level. As with all semantics, the precise boundaries of anything - even definitions around the basic structures of language itself - become fuzzy if you look too hard. It’s sort of like Heisenberg’s uncertainty principle for semantics.
Unlike most formal grammars, natural language is capable of handling some fuzziness and ambiguity, insofar as it is good enough for human communication and survival. My analogy to the uncertainty principle (“sort of like…”) is a case in point. The best formal definition of an “image of a cat” we have today is a neural network classifier trained on a lot of human labels - a person simply can’t write down that definition in a sufficiently precise way. If defining cat images is best done from data and machine learning, then it begs the question of whether richer semantic ontologies (especially around generalization) are better defined from data as well. If a model understands human language well enough, then we can use it to venture beyond precise toy grammars into a truly vast, fuzzy space of capabilities like “please imitate another agent pretending to be you”, as I suggested in “Just ask for Generalization”.
When it comes to combining natural language with robots, the obvious take is to use it as an input-output modality for human-robot interaction. The robot would understand human language inputs and potentially converse with the human. But if you accept that “generalization is language”, then language models have a far bigger role to play than just being the “UX layer for robots”. We should regard language capability as a substrate for generalization in any machine learning domain.
Linguistic relativists say that language is not only the primary way we communicate to each other, it is also the way we communicate to ourselves when thinking. Language is generalization is cognition.
We are still in the early days of imbuing our robots with evidence of linguistic relativity, but there are some exciting, promising results in this direction. The paper Pretrained Transformers as Universal Computation Engines (Lu et al. 2021) showed that the internal structure of pre-trained language models can be frozen and used as-is to perform a variety of non-language tasks, like numerical computation, protein folding, and vision. The internal representations are only trained on language domains, with the input and output embeddings re-learned for each fine-tuning task.
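A minimal sketch of that recipe with Hugging Face Transformers (my own simplification; the paper also fine-tunes a few extra parameters such as layer norms, and the input/output dimensions here are arbitrary):

```python
import torch
import torch.nn as nn
from transformers import GPT2Model

gpt2 = GPT2Model.from_pretrained("gpt2")
for p in gpt2.parameters():
    p.requires_grad = False            # freeze the language-pretrained core

d_model = gpt2.config.n_embd           # 768 for the base model
input_proj = nn.Linear(16, d_model)    # task-specific "tokenizer" for 16-dim inputs (assumed)
output_head = nn.Linear(d_model, 2)    # task-specific readout, e.g. a binary prediction (assumed)

def forward(task_inputs):              # task_inputs: (batch, seq_len, 16)
    hidden = gpt2(inputs_embeds=input_proj(task_inputs)).last_hidden_state
    return output_head(hidden[:, -1])  # predict from the final sequence position

logits = forward(torch.randn(4, 32, 16))
```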
This is wildly exciting, because it suggests that it might be possible to improve generalization simply by scaling up language data, rather than collecting a lot of task-specific data. Perhaps language models can imbue other ML models with systematicity, productivity, and substitutivity in an infinitely composable way, because they already acquired these properties when training on language datasets.
More evidence for this idea comes from our recent BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning research paper, where we use language data in place of real robot data to get the robot to perform new tasks it hasn’t trained on. The robot never saw the bottle and tray in the same scene and was never trained on the “place bottle in the tray” task, and yet it can zero-shot the task. The policy merely conditions on pre-trained sentence embeddings, so the language model is doing the heavy lifting of task-level generalization.
Interestingly enough, we also tried to teach the robot to generalize to the same held-out tasks by conditioning on task embeddings of humans performing the task. We struggled for years to get it working, and it wasn’t until we explicitly aligned the video embeddings to match language space that we were able to see held-out task generalization on human videos.
There are some ML research tailwinds that make this idea of “bolt-on generalization from language models” increasingly easy to try. Transformers have been the architecture-of-choice for modeling language for a few years now, but in the last year we’ve seen a rapid consolidation of computer vision architectures around transformers as well. Today it is possible to train both state-of-the-art language and state-of-the-art vision models with pretty much the same architecture (e.g. ViT), so I think that we will start to see transformers trained for perception benchmarks start to re-use language datasets as an extra source of data. Here’s a prediction (epistemic confidence 0.7): within the next 5 years, we’ll see a state-of-the-art model for a computer vision benchmark that does not involve natural language (e.g. ImageNet classification), and that model uses knowledge from internet-scale natural language datasets (by either training directly on NLP datasets or indirectly via re-using an existing language model).
A sketch of what I think neural architectures of the future will look like, inspired by the Universal Computation Engine and ViT papers: you have a large language model that acts as “bolt-on-generalization layer” for a target task domain, lightweight encoder layers to tokenize the input into something that can capture “word-level” semantics, and lightweight output decoders that transform the “generalization module” output tokens into the right prediction space. Additionally, because the core of the model is a standard transformer, it is simple to pass in additional natural language tokens for goal conditioning or extra human knowledge.
Language Models are far from perfect even when restricted to NLP tasks, and I don’t mean to suggest that they are ready today for solving ML once and for all. Rather, I am optimistic that language models will continue to get better, and with improved linguistic capability comes better generalization in other non-NLP domains.
My friend Elijah invited me to give a talk at his company’s algorithms seminar and I shared an early draft of these ideas there. Here is a recording on YouTube.
If you want to cite this blog post, you can use:
@article{jang2021language,
title = "To Understand Language is to Understand Generalization",
author = "Jang, Eric",
journal = "evjang.com",
year = "2021",
month = "Dec",
url = "https://evjang.com/2021/12/17/lang-generalization.html"
}
How does this relate to Chomsky’s ideas of innate language capability and Universal Grammars?
Chomsky has a lot more to say than a one-line “language is innate”, but the gist is that humans are born with some innate linguistic capability. At face value, I agree with this since it follows from 1) linguistic relativity 2) equivalence between generalization and language 3) humans are born with some ability to generalize, even if what they are generalizing is their learning ability.
Where there is more controversy is how much linguistic capability is innate, and whether learning is distinct from language. If you believe that generalization is language, then maybe it isn’t.
Also, the degree to which a capability is innate tells us nothing of our ability to hard-code it correctly. For instance, humans may be innately primed to dislike insects and snakes, but we might have to resort to function approximation from data if we wanted to build such an innate prior. For me, it is less about what is genetically or developmentally innate in animals, and more about whether we want to hard-code formal structures vs. acquire the structure via function approximation.
If linguistic capability is important for generalization, why not add more hard-coded linguistic rules into our neural networks so that they can perform more robust reasoning?
Gary Marcus has an excellent quote from The Next Decade in AI:
“The trouble is that GPT-2’s solution is just an approximation to knowledge, and not substitute for knowledge itself. In particular what it acquires is an approximation to the statistics of how words co-occur with one another in large corpora—rather than a clean representation of concepts per se. To put it in a slogan, it is a model of word usage, not a model of ideas, with the former being used as an approximation to the latter. Such approximations are something like shadows to a complex three-dimensional world”
Where Marcus sees “meaning = co-occurrence statistics” as problematic for the purposes of building robust AI systems, I see this as a preliminary vindication of the Distributional semantics hypothesis. The meaning of words is nothing more than how they are used. Even if there were meaning independent of anthropomorphic usage (e.g. the concepts of life and death probably mean something to most animals), humans lack the ability to implement those concepts formally. That’s not to suggest we should be content with defining everything as it is found on the Internet, as some word co-occurrences around race and gender and class are problematic for society. But it is helpful to understand that the meanings of words are derived from their usage, and not the other way around.
Discoveries around Neural Scaling Laws, unsupervised pretraining on Internet-scale datasets, and other work on Foundation Models have pointed to a simple yet exciting narrative for making progress in Machine Learning:
To that end, Deep Neural nets trained with supervised learning are excellent data sponges - they can memorize vast amounts of data and can do this quickly by training with batch sizes in the tens of thousands. Modern architectures like ResNets and Transformers seem to have no trouble absorbing increasingly large datasets when trained via supervised learning.
When a model has minimized training loss (a.k.a. empirical risk), it can be said to have “memorized” the training set. Classically one would expect that driving training loss to zero is shortly followed by overfitting, but overparameterized deep networks seem to generalize well even in this regime. Here is an illustration of the “double descent” phenomenon from Patterns, Predictions, and Actions, which shows that in some problems, overparameterized models can continue to reduce test error (risk) even as training loss is fully minimized.
A recent ICLR workshop paper investigates this phenomenon on synthetic datasets, showing that if you train long enough in this zero-training-loss regime, the model can suddenly have an epiphany and generalize much later on (the authors call this “Grokking”). Furthermore, the paper also presents evidence that increasing training data actually decreases the amount of optimization required to generalize.
It’s as my colleague Chelsea Finn once told me: “Memorization is the first step towards generalization!”
State-of-the-art neural networks trained this way can do really impressive things. Here is a DALL-E model that, when prompted with “A banana performing stand-up comedy”, draws the following picture:
Here is another DALL-E output, prompted with “an illustration of a baby panda with headphones staring at its reflection in a mirror”.
Note that there are no such images of “pandas looking into mirrors” or “banana comedians” in the training data (I think), so these results suggest that the DALL-E model has learned to interpret distinct concepts from text, render the corresponding visual parts in an image and have them interact with each other somewhat coherently.
The ability to “just ask” language-conditioned deep learning models for what you want has led to “prompt engineering” as a viable space for improving our ML models. Here is a Tweet discussing how priming a VQGAN + CLIP model with the words “Unreal Engine” leads to drastically higher-quality images.
What if we could extend this principle - just asking generalization - to other challenging problems that have eluded analytical algorithmic improvements?
In contrast to supervised learning, reinforcement learning algorithms are much less computationally efficient when it comes to absorbing vast quantities of diverse data needed for generalization. To see why this is the case, let’s consider a thought experiment where we train a general-purpose robot to do millions of tasks in unstructured environments.
The standard Markov Decision Process is set up as follows: a policy is represented as a state-conditioned distribution over actions, \(p(a \vert s)\), and the environment as consisting of a reward function \(r(s_t, a_t)\) and transition dynamics \(p(s_{t+1} \vert s_t, a_t)\). Initial states and task objectives are encoded in the initial state \(s_0\), which is sampled from a distribution \(p(s_0)\). The goal is to maximize the sum of rewards across the episode, averaged across different starting states sampled from \(p(s_0)\):
\[\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \text{Solve}~\theta^*\ = \argmax_\theta~R(\theta)\] \[\text{where}~R(\theta)=E_{p(s_0)}[\sum_{t=1}^{T}{r(s_t, a_t)}]~\text{and}~a_t \sim p_\theta(\cdot|s_t)~\text{and}~s_{t+1} \sim p(\cdot|s_t, a_t)~\text{and}~s_0 \sim p(s_0)\]Let’s assume the existence of some optimal policy which we call \(p^\star(a \vert s)\), that achieves the maximum reward \(\max_\theta R(\theta)\). “Supremum” would be more accurate, but I use the \(\max\) operator for notational simplicity. We want to bring our model, \(p_\theta(a \vert s)\), as close as possible to \(p^\star(a \vert s)\).
If we had access to the optimal policy \(p^\star(a \vert s)\) as an oracle, we could simply query the oracle action and use it like a supervised learning label. We could then train a feedforward policy that maps the states to the oracle actions, and benefit from all the nice properties that supervised learning methods enjoy: stable training, large batches, diverse offline datasets, no need to interact with the environment.
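To make the contrast concrete, here is a minimal sketch of what supervised cloning from such an oracle might look like, assuming a hypothetical `oracle_policy` we could query and a toy 10-dimensional continuous action space (every name and dimension here is illustrative):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: an oracle p*(a|s) we can query for expert actions,
# and a pile of diverse offline states gathered from anywhere.
def oracle_policy(states):
    return -states  # toy "expert": always act toward the origin

states = torch.randn(100_000, 10)   # offline states
actions = oracle_policy(states)     # supervised labels from the oracle

policy = nn.Sequential(nn.Linear(10, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Plain supervised regression onto oracle actions: large batches, stable
# training, no environment interaction required.
for i in range(0, len(states), 8192):
    s, a = states[i:i+8192], actions[i:i+8192]
    loss = ((policy(s) - a) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```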
However, in reinforcement learning we often don’t have an expert policy to query, so we must improve the policy from its own collected experience. To do this, estimating the gradient that takes the model policy closer to the optimal policy requires evaluating the average episodic return of the current policy in the environment, and then estimating a gradient of that return with respect to parameters. If you treat the environment returns as a black-box with respect to some parameter \(\theta\) you can use the log-derivative trick to estimate its gradients:
\[\nabla_\theta E_{p(\theta)} [R(\theta)] = \int_\Theta d\theta \nabla_\theta p(\theta) R(\theta) \\ \\ = \int_\Theta d\theta p(\theta) \nabla_\theta \log p(\theta) R(\theta) = E_{p(\theta)} [\nabla_\theta \log p(\theta) R(\theta)]\]This gradient estimator contains two expectations that we need to numerically approximate. First is computing \(R(\theta)\) itself, which is an expectation over starting states \(p(s_0)\). In my previous blog post I mentioned that accurate evaluation of a Binomial variable (e.g. the success rate of a robot on a single task) could require thousands of trials in order to achieve statistical certainty within a couple percent. For our hypothetical generalist robot, \(p(s_0)\) could encompass millions of unique tasks and scenarios, which makes accurate evaluation prohibitively expensive.
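To see where the expense creeps in, here is a toy numpy sketch of the score-function (log-derivative) estimator for a one-dimensional Gaussian distribution over a policy parameter, with a made-up black-box return; every gradient step requires a fresh batch of rollouts:

```python
import numpy as np

def blackbox_return(theta):
    # Stand-in for an episodic return R(theta); we can only sample it.
    return -(theta - 3.0) ** 2

mu, sigma = 0.0, 1.0          # p(theta) = N(mu, sigma^2); we optimize mu
for step in range(2000):
    thetas = np.random.normal(mu, sigma, size=256)   # "rollouts" per update
    returns = blackbox_return(thetas)
    # grad wrt mu of log N(theta; mu, sigma^2) is (theta - mu) / sigma^2
    score = (thetas - mu) / sigma**2
    grad_mu = np.mean(score * (returns - returns.mean()))  # mean baseline for variance reduction
    mu += 1e-2 * grad_mu
# mu drifts toward the optimum at 3.0, but only via many noisy samples per step.
```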
The second expectation is encountered in the estimation of the policy gradient, over \(p(\theta)\). Some algorithms like CMA-ES draw samples directly from the policy parameter distribution \(p(\theta)\), while other RL algorithms like PPO sample from the policy distribution \(p_\theta(a\vert s)\) and use the backpropagation rule to compute the gradient of the return with respect to the parameters: \(\frac{\partial R}{\partial \theta} = \frac{\partial R}{\partial \mu_a} \cdot \frac{\partial \mu_a}{\partial \theta}\). The latter is typically preferred because the search space on action parameters is thought to be smaller than the search space on policy parameters (and therefore requires fewer environment interactions to estimate a gradient for).
If supervised behavior cloning on a single oracle label \(a \sim p^\star(a\vert s)\) gives you some gradient vector \(g^\star\), estimating the same gradient vector \(\bar{g} \approx g^\star\) with reinforcement learning requires something on the order of \(O(H(s_0) \cdot H(a))\) times as many episode rollouts to get a comparably low-variance estimate. This is a hand-wavy estimate that assumes that there is a multiplicative factor of the entropy of the initial state distribution \(O(H(s_0))\) for estimating \(R(\theta)\) and a multiplicative factor of the entropy of the action distribution \(O(H(a))\) for estimating \(\nabla_\theta R(\theta)\) itself.
Consequently, online reinforcement learning on sparse rewards and diverse, possibly multi-task environments require enormous numbers of rollouts to estimate returns and their gradients accurately. You have to pay this cost on every minibatch update! When the environment requires handling a wide variety of scenarios and demands generalization to unseen situations, it further increases the number of minibatch elements needed. The OpenAI DOTA team found that having millions of examples in their minibatch was required to bring down gradient noise to an acceptable level. This intuitively makes sense: if your objective \(R(\theta)\) has a minimum minibatch size needed to generalize well across many \(s_0\) without excessive catastrophic forgetting, then switching from supervised learning to online reinforcement learning will probably require a larger batch size by some multiplicative factor.
What about offline RL methods like Deep Q-Learning on large datasets of \((S,A,R,S)\) transitions? These methods work by bootstrapping, where the target values that we regress value functions to are computed using a copy of the same network’s best action-value estimate on the next state. The appeal of these offline reinforcement learning methods is that you can get optimal policies from diverse, off-policy data without having to interact with the environment. Modified versions of Q-learning like CQL work even better on offline datasets, and have shown promise on smaller-scale simulated control environments.
Unfortunately, bootstrapping does not mix well with generalization. It is folk knowledge that the deadly triad of function approximation, bootstrapping, and off-policy data make training unstable. I think this problem will only get worse as we scale up models and expect to train them on increasingly general tasks. This work shows that repeated bootstrapping iteratively decreases the capacity of the neural network. If you believe the claim that overparameterization of deep neural networks is key to generalization, then it would appear that for the same neural net architecture, offline RL is not quite as “data absorbent” as supervised learning.
In practice, even algorithms like CQL are still challenging to scale and debug on larger, real-world datasets; colleagues of mine tried several variations of AWAC and CQL on large-scale robotics problems and found them trickier to get working than naive methods like Behavior Cloning.
Instead of going through all this trouble, what if we lean into what deep nets excel at - sponging up data quickly with supervised learning and generalizing to massive datasets? Can we accomplish what RL sets out to do using the tools of generalization, rather than direct optimization?
What if we make generalization the first-class citizen in algorithmic design, and tailor everything else in service of it? What if we could simply learn all the policies with supervised learning, and “just ask nicely” for the best one?
Consider the recent work on Decision Transformer (DT), whereby instead of modeling a single policy and iteratively improving it with reinforcement learning, the authors simply use supervised learning coupled with a sequential model to predict trajectories of many different policies. The model is conditioned on the Return-to-Go so that it may predict actions consistent with a policy that would achieve those returns. The DT simply models all policies - good and bad - with supervised learning, and then uses the magic of deep learning generalization to infer actions from the expert-conditioned policy.
This phenomenon has been observed and developed in several prior and concurrent works, such as Reward-Conditioned Policies, Upside-Down Reinforcement Learning, and Reinforcement Learning as One Big Sequence Modeling Problem. The AlphaStar team also found that conditioning a model on human player skill level (e.g. the build orders players eventually produced, MMR, ELO scores) and using it to imitate all player data was superior to only imitating expert-level build orders. This technique is also commonly used in the Autonomous Vehicle space to model both good drivers and bad drivers jointly, even though the autonomous policy is only ever deployed to imitate good driving behavior.
At a high level, DTs condition the supervised learning objective on some high level description \(g\) that partitions what the policy will do in the future based on that value of \(g\). The return-to-go is an especially salient quantity for a reinforcement learning task, but you can also express the future outcomes via a goal state or StarCraft build order or even a natural language description of what was accomplished.
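Stripped of the transformer, the conditioning trick itself fits in a few lines. Below is a rough sketch of return-conditioned behavior cloning on logged data; the dimensions, architecture, and `target_return` value are placeholders, not the actual DT implementation:

```python
import torch
import torch.nn as nn

# Learn p(a | s, g) where g is the return-to-go, from trajectories of *all*
# skill levels, then "just ask" for expert behavior at test time.
policy = nn.Sequential(nn.Linear(10 + 1, 256), nn.ReLU(), nn.Linear(256, 4))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def train_step(states, actions, returns_to_go):
    # states: [B, 10], actions: [B, 4], returns_to_go: [B, 1], all from logs.
    inp = torch.cat([states, returns_to_go], dim=-1)
    loss = ((policy(inp) - actions) ** 2).mean()   # plain supervised loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def act(state, target_return=1.0):
    # Condition on a high return-to-go to infer what an expert would do.
    inp = torch.cat([state, torch.tensor([target_return])])
    return policy(inp)
```

The same skeleton applies to any other conditioning variable \(g\): a goal state, a build order, or a language description.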
In Language Conditioned Imitation Learning over Unstructured Data, the authors pair arbitrary trajectories with post-hoc natural language descriptions, and then train a model to clone those behaviors conditioned on language. At test time, they simply “ask” the policy to do a novel task in a zero-shot manner. Techniques like this turn out to be indispensable for reaching sparse goals on RL tasks like Ant-Maze, which lends support to the claim that generalization and inference across goal-conditioning can do far better than brute-force search for a single sparse goal in a long-horizon task.
Language is a particularly nice choice for conditioning because it can be used to partition a trajectory not just on skill level, but also by task, by how much the policy explores, how “animal-like” it is, and any other observations a human might make about the trajectory. Clauses can be composed ad-hoc without developing a formal grammar for all outcomes that the robot might accomplish. Language is an ideal “fuzzy” representation for the diversity of real-world outcomes and behaviors, which will become increasingly important as we want to partition increasingly diverse datasets.
A recent work I am quite inspired by is D-REX, which tackles the problem of inferring the environment’s reward function from the demonstrations of a suboptimal policy. Classically, one must assume the demonstrator is the optimal policy, from which you can use off-policy algorithms (e.g. Q-learning) to estimate the value function. Offline value estimation with deep neural nets can suffer from poor generalization to state-action pairs not in the demonstrator trajectory, and thus requires careful algorithmic tuning to make sure that the value function converges. An algorithm with poor convergence properties makes the prospects of minimizing training loss - and therefore generalization - tenuous. D-REX proposes a really clever trick to get around not having any reward labels at all, even when the demonstrator is suboptimal: clone the suboptimal demonstrator, roll the clone out with increasing amounts of injected noise (more noise is assumed to produce worse behavior), and train a reward model with a ranking loss on these automatically ordered trajectories, which can then extrapolate beyond the best trajectory it has seen.
I like this approach because ranking models are stable to train (they are just classifiers), and this method is able to achieve better-than-demonstrator behavior not through the explicit construction of the Bellman inequality or implicit planning through a learned model, but rather via extrapolation on a family of perturbations.
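For concreteness, here is a rough, self-contained sketch of that ranking step. The noise-injected trajectories are synthetic stand-ins (real D-REX rolls out a cloned demonstrator with varying amounts of action noise); the point is just that the reward model is trained as an ordinary pairwise classifier:

```python
import torch
import torch.nn as nn

# Synthetic stand-in for "roll out the cloned demonstrator at noise level e":
# a lower noise level is assumed to produce a better trajectory.
def make_trajectory(noise_level, T=50, dim=10):
    return torch.randn(T, dim) * (1.0 + noise_level)

reward_net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

for _ in range(1000):
    lo, hi = sorted(torch.rand(2).tolist())          # two noise levels
    better, worse = make_trajectory(lo), make_trajectory(hi)
    r_better = reward_net(better).sum()              # predicted trajectory return
    r_worse = reward_net(worse).sum()
    # Bradley-Terry ranking loss: prefer the lower-noise trajectory. No reward
    # labels are needed; the noise schedule provides the ordering for free.
    loss = -torch.log_softmax(torch.stack([r_worse, r_better]), dim=0)[1]
    opt.zero_grad()
    loss.backward()
    opt.step()
```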
In the above sections I’ve described how you can “generalize and infer” to get around sparse-reward exploration and even inverse reinforcement learning from suboptimal demonstrations. But what about “improving from a policy’s own experience, tabula rasa”? This is the main reason why people put up with the pain of implementing RL algorithms. Can we replace this with supervised learning algorithms and a bit of generalization as well?
The goal of RL is to go from the current set of parameters \(\theta^{n}\) and some collected policy experience \(\tau\) to a new set of parameters \(\theta^{n+1}\) that achieves a higher episode return. Instead of using a “proper” RL algorithm to update the agent, could we just learn this mapping \(f: (\theta^{n}, \tau) \to \theta^{n+1}\) via supervised deep learning?
This idea is sometimes referred to as “meta-reinforcement learning”, because it involves learning a better reinforcement learning function than off-the-shelf RL algorithms. My colleagues and I applied this idea to a project where we trained a neural network to predict “improved policy behavior” from a video of a lesser policy’s experience. I could imagine this idea being combined with ranking and trajectory augmentation ideas from D-REX to further generalize the “policy improvement behavior”. Even if we never train on optimal policy trajectories, perhaps sufficient data augmentation can also lead to a general improvement operator that extrapolates to the optimal policy regime of parameters.
People often conflate this policy improvement behavior with “reinforcement learning algorithms” like DQN and PPO, but behavior is distinct from implementation. The “policy improvement operator” \(f: (\theta^{n}, \tau) \to \theta^{n+1}\) can be learned via your choice of reinforcement learning or supervised learning, but is deployed in a RL-like manner for interacting with the environment.
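As a deliberately toy illustration of what learning such an operator with supervised learning could look like (mapping raw parameter vectors is almost certainly not how you would do this in practice; Watch-Try-Learn predicts improved behavior rather than weights, and all shapes below are made up):

```python
import torch
import torch.nn as nn

PARAM_DIM, TRAJ_DIM = 128, 64   # made-up sizes for policy params and a trajectory summary

# f: (theta_n, tau) -> theta_{n+1}, trained by ordinary regression on examples
# harvested from logged training runs where we know which parameters did better.
improver = nn.Sequential(
    nn.Linear(PARAM_DIM + TRAJ_DIM, 512), nn.ReLU(), nn.Linear(512, PARAM_DIM))
opt = torch.optim.Adam(improver.parameters(), lr=1e-3)

def train_step(theta_n, traj_summary, theta_n_plus_1):
    pred = improver(torch.cat([theta_n, traj_summary], dim=-1))
    loss = ((pred - theta_n_plus_1) ** 2).mean()   # supervised "improvement" target
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```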
Here is a table summarizing the previously mentioned RL problems, and comparing how each of them can be tackled with a “generalize-and-infer” approach instead of direct optimization.
| Goal | “Direct Optimization” Approach | “Generalize + Inference” Approach |
| --- | --- | --- |
| Reinforcement Learning with Sparse Rewards | Find \(p^\star(a_t\vert s_t)\) s.t. \(R_t=1\), brute-force exploration | DT: Learn \(p(a_t\vert s_t,R_t)\) from many policies, infer \(p(a_t\vert s_t, R_t=1)\). H.E.R.: Infer tasks for which gathered trajectories are optimal, learn \(p(\text{trajectory}\vert \text{task})\), then infer the optimal trajectory for the desired task. |
| Learn a Reward Function from Suboptimal Trajectories | Offline Inverse RL | D-REX: Trajectory augmentation + extrapolation to better trajectories. |
| Improve the policy from experience | Q-Learning, Policy Gradient | Watch-Try-Learn: Learn \(p(\theta^{n+1} \vert \theta^n , \tau, \text{task})\) |
| Fine-tune a simulated policy in a real-world environment | Sample-efficient RL fine-tuning | Domain Randomization: train on a distribution of simulators, and the policy “infers which world” it is in at test time. |
The high-level recipe is simple. If you want to find the solution \(y_i\) for a problem \(x_i\), consider setting up a dataset of paired problems and solutions \((x_1, y_1), ..., (x_N, y_N)\) and then training a deep network \(y = f_\theta(x)\) that “simply maps your problems to solutions”. Then substitute your desired \(x_i\) and have the deep network infer the solution \(y_i\) via generalization. “Problem” is meant in the most abstract of terms and can refer to a RL environment, a dataset, or even a single example. “Solutions” could be represented as the optimal parameters of a policy or a neural network, or a single prediction.
Techniques like goal relabeling help generate post-hoc problems from solutions, but building such a dataset can also be achieved via data augmentation techniques. At its core, we are transforming a difficult optimization problem into an inference problem, and training a supervised learning model on a distribution of problems for which it’s comparatively cheap to obtain solutions.
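A minimal sketch of the relabeling idea (HER-style final-state relabeling; the trajectory format here is hypothetical):

```python
# Every trajectory, however bad, is an optimal solution to *some* problem:
# relabel the goal to be whatever state the trajectory actually reached.
def relabel(trajectory):
    # trajectory: list of (state, action) pairs collected by any policy.
    achieved_goal = trajectory[-1][0]   # the final state becomes the goal
    return [(state, achieved_goal, action) for state, action in trajectory]

# Each (state, goal, action) tuple is now a supervised example for a
# goal-conditioned policy p(a | s, g); at test time, substitute the goal
# you actually want and rely on generalization.
```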
To summarize the recommendations in a three-step recipe:
1. Build a dataset of paired problems and solutions, using hindsight relabeling or data augmentation so that solutions are cheap to obtain.
2. Train a deep network with supervised learning to map problems directly to solutions.
3. At test time, substitute the problem you actually care about and let the network infer the solution via generalization.
The insight that we can cast optimization problems into inference problems is not new. For example, the SGD optimizer can be cast as approximate Bayesian inference and so can optimal control via AICO. These works present a theoretical justification as to why inference can be a suitable replacement for optimization, since the problems and algorithms can be translated back and forth.
I’m suggesting something slightly different here. Instead of casting a sequential decision making problem into an equivalent sequential inference problem, we construct the “meta-problem”: a distribution of similar problems for which it’s easy to obtain the solutions. We then solve the meta-problem with supervised learning by mapping problems directly to solutions. Don’t overthink it, just train the deep net in the simplest way possible and ask it for generalization!
Perhaps in the near future we will be able to prompt-engineer such language-conditioned models with the hint “Generalize to unseen …”.
How far can we stretch the principle of “generalize-and-infer” as an alternative to direct optimization? Here is a “recipe for consciousness” which would probably be better pondered over some strong drinks:
Researchers like Jürgen Schmidhuber have previously discussed how dynamics models (aka World Models) of embodied agents are already “conscious”, because successfully modeling the dynamics of the environment around oneself necessitates a representation of the self as an embodied participant in the environment.
While I think that “self-representation” is a necessity in planning and dynamics prediction problems, I think the framework is too vacuous to be of use in reproducing a convincing imitation of consciousness. After all, any planning algorithm that represents “the self” explicitly within each imagined trajectory rollout would be conscious under this definition. An A* maze-planner would satisfy this definition of consciousness.
What I’m proposing is implementing a “more convincing” form of consciousness, not based on a “necessary representation of the self for planning”, but rather an understanding of the self that can be transmitted through language and behavior unrelated to any particular objective. For instance, the model needs to understand not only how a given policy regards itself, but also how a variety of other policies might interpret the behavior of that policy, much like funhouse mirrors that distort one’s reflection. The hypothesis is that through demonstrating this understanding of “distorted self-reflection”, the policy will learn to recognize itself and model the internal motivations and beliefs of other agents in agent-agent interactions.
There are some important implementation details that I haven’t fleshed out yet, but at a high level, I do think that supervised learning and natural language conditioning with enormous agent-interaction datasets are sufficiently powerful tools to learn interesting behaviors. Imbuing agents with some kind of meta-cognition ability about the self and other agents is an important step towards a convincing imitation of consciousness.
Thanks to Daniel Freeman, David Ha, Karol Hausman, Irwan Bello, Igor Mordatch, and Vincent Vanhoucke for feedback and discussion on earlier drafts of this work.
If you want to cite this blog post, you can use:
@article{jang2021justask,
title = "Just Ask for Generalization",
author = "Jang, Eric",
journal = "evjang.com",
year = "2021",
month = "Oct",
url = "https://evjang.com/2021/10/23/generalization.html"
}
Igor Mordatch supplied interesting questions and comments in reviewing this blog post. I have paraphrased his questions here and added responses in this section.
1. You discussed Supervised Learning and Reinforcement Learning. What do you think about Unsupervised Learning and “The Cake Analogy”?
I consider unsupervised learning to be simply supervised learning for a different task, with comparable gradient variance, since targets are not usually noisily estimated beyond augmentation. Maximum likelihood estimation and contrastive algorithms like InfoNCE both seem to be useful for facilitating generalization in large models.
2. For the first difficulty of RL (evaluating success), aren’t there parallels to current generative models too? Success evaluation is hard for language models, as evidenced by dissatisfaction with BLEU scores and difficulty of evaluating likelihoods with non-likelihood based generative image models.
There are parallels to likelihood-free generative models, which require extensive compute for either training, sampling, or likelihood evaluation. In practice, however, I think the burdens of evaluation are not directly comparable, since the computational expense of marginalization over observations for such models is dwarfed by the marginalization of success rate estimation in RL. In RL, you have to roll out the environment over O(coin flips) x O(initial state distribution) x O(action distribution) in order to get a low-variance policy gradient for “improved success across all states and tasks”. O(coin flips) is O(1000) samples for local improvement of a couple percent with statistical certainty, whereas the marginalization costs of implicit likelihood tend to be cheaper with tricks like Langevin sampling (O(minibatch=32)). Also, the backprop passes used in Langevin dynamics are usually cheaper than running full environment simulations with a forward pass of the neural net on every step.
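As a rough sanity check on that O(1000) figure, the standard normal-approximation sample size for estimating a success rate to within \(\pm\varepsilon\) at roughly 95% confidence (worst case \(p=0.5\)) is:

\[n \approx \frac{p(1-p)\,z^2}{\varepsilon^2} \approx \frac{0.25 \cdot (1.96)^2}{(0.03)^2} \approx 1{,}067\]

so resolving success rates to within a few percent already costs on the order of a thousand rollouts per evaluation point.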
3. One of the findings of current language model work is that proxy objectives for what you really want are good enough: simple next-token prediction induces generalization. But alignment to what you really want is still a hard problem in the large-model field, and we don’t have good answers there yet (and ironically, many attempts so far have relied on incorporating RL algorithms).
Alignment objectives may lack a per-example surrogate loss. But under the “generalize-then-infer” school of thought, I would simply recommend learning \(p(y\vert x, \text{alignment objective})\) with max likelihood over numerous hindsight alignment objectives, and then simply conditioning on the desired alignment objective at test time. One could obtain a distribution of alignment descriptions by running the model live, and then hindsight-labeling each sample with the alignment outcome the model actually realized. Then we simply invoke this meme by Connor Leahy:
Just asking the AI to be nice sounds flippant, but after seeing DALL-E and other large-scale multi-modal models that seem to generalize better as they get bigger, I think we should take these simple, borderline-naive ideas more seriously.
4. For the second difficulty of RL (gradient estimation), we know that for settings where you can backprop through environment dynamics to get exact policy gradient, doing so often leads to worse results.
This reminds me of an old Facebook comment by Yann LeCun: a better way to estimate Hessian-vector products with ReLU activations is to use a stochastic estimator rather than computing the analytical Hessian, since the second-order curvature of ReLU is 0 and what you actually want is the Hessian-vector product of a smoothed version of the function.
If you need to relax the dynamics or use an unbiased stochastic estimator to train through a differentiable simulator, then I think you’re back to where you’re starting with expensive evaluation, since presumably you need many rollouts to smooth out the simulator function and reduce variance. However, maybe the number of samples you need to estimate a smoothed policy gradient is a reasonable tradeoff here and this is a nice way to obtain gradients.
5. Why hasn’t something as simple as what you propose (generalize-then-infer) been done already?
Some researchers out there are probably pursuing this already. My guess is that the research community tends to reward narratives that increase intellectual complexity and argue that “we need better algorithms”. People pay lip service to “simple ideas” but few are willing to truly pursue simplicity to its limit and simply scale up existing ideas.
Another reason is that researchers often don’t take generalization for granted, so it seems quicker to add explicit inductive biases than to treat generalization as a first-class citizen and tailor all other design decisions in support of it.
6. How does your consciousness proposal relate to Schmidhuber’s “consciousness in world models” ideas, Friston’s Free Energy Principle, and Hawkins’s “memory of thoughts”?
I consider Schmidhuber and Friston’s unified theories as more or less stating “optimal control requires good future prediction and future prediction with me in it requires self-representation”. If we draw an analogy to next-word prediction in large language models, maybe optimizing next state prediction perfectly is sufficient for subsuming all consciousness-type behaviors like theory-of-mind and the funhouse self-reflections I mentioned above. However, this would require an environment where predicting such dynamics accurately has an outsized impact on observation likelihoods. One critique I have about Schmidhuber and Friston’s frameworks is that they are too general, and can be universally applied to sea slugs and humans. If a certain environmental complexity is needed for future prediction to give rise to something humans would accept as conscious, then the main challenge is declaring what the minimum complexity would be.
Hawkins’s “consciousness as memory of perception” seems to be more related to the subjective qualia aspect of consciousness than to theory of mind. Note that most people do not consider a program that concatenates numpy arrays to be capable of “experiencing qualia” in the way humans do. Perhaps what is missing is the meta-cognition aspect - the policy needs to exhibit behaviors suggesting that it contemplates the fact that it experiences things. Again, this requires a carefully designed environment that demands such meta-cognition behavior.
I think this could emerge from training for the theory-of-mind imitation problems I described above, since the agent would need to access a consistent representation about how it perceives things and transform it through a variety of “other agent’s lenses”. The flexibility of being able to project one’s own representation of sensory observations through one’s representation of other agents’ sensory capabilities is what would convince me that the agent understands that it can do sufficient meta-cognition about qualia.
7. Your formulation of consciousness only concerns itself with theory-of-mind behavior. What about attention behavior?
See the second paragraph of the response to #6.
Update 20211025: Updated with a paraphrased question from Alexander Terenin
8. In Rich Sutton’s Bitter Lesson Essay, he argues that search and learning are both important. Do you really think that search can be completely replaced by a learned approach?
I agree that having a bit of light search in your program can be immensely helpful to learning and overall performance. It’s a bit of a chicken/egg though. Does AlphaGo work because MCTS uses a learned value function to make search tractable? Or does the policy distillation only work because of search? I’m suggesting that when search becomes too hard (most RL tasks), it’s time to use more learning. You’re still doing search when performing supervised learning - you just get a lot more gradient signal per flop of computation.