
  • A Crash Course of Planning for Perception Engineers in Autonomous Driving
    by Patrick Langechuan Liu on June 30, 2024 at 12:42 am

The fundamentals of planning and decision-making

AlphaGo, ChatGPT and FSD (image credit: Elena Popova, Karthik Sridasyam and Jonathan Kemper on Unsplash)

A classical modular autonomous driving system typically consists of perception, prediction, planning, and control. Until around 2023, AI (artificial intelligence) or ML (machine learning) primarily enhanced perception in most mass-production autonomous driving systems, with its influence diminishing in downstream components. In stark contrast to the low integration of AI in the planning stack, end-to-end perception systems (such as the BEV, or birds-eye-view perception pipeline) have been deployed in mass production vehicles.

Classical modular design of an autonomous driving stack, 2023 and prior (chart created by author)

There are multiple reasons for this. A classical stack based on a human-crafted framework is more explainable and can be iterated faster to fix field test issues (within hours) than machine learning-driven features (which could take days or weeks). However, it does not make sense to let readily available human driving data sit idle. Moreover, increasing computing power is more scalable than expanding the engineering team.

Fortunately, there has been a strong trend in both academia and industry to change this situation. First, downstream modules are becoming increasingly data-driven and may also be integrated via different interfaces, such as the one proposed in CVPR 2023's best paper, UniAD. Moreover, driven by the ever-growing wave of Generative AI, a single unified vision-language-action (VLA) model shows great potential for handling complex robotics tasks (RT-2 in academia, TeslaBot and 1X in industry) and autonomous driving (GAIA-1, DriveVLM in academia, and Wayve AI driver, Tesla FSD in industry). This brings the toolsets of AI and data-driven development from the perception stack to the planning stack.

This blog post aims to introduce the problem settings, existing methodologies, and challenges of the planning stack, in the form of a crash course for perception engineers. As a perception engineer, I finally had some time over the past couple of weeks to systematically learn the classical planning stack, and I would like to share what I learned. I will also share my thoughts on how AI can help from the perspective of an AI practitioner.

The intended audience for this post is AI practitioners who work in the field of autonomous driving, in particular, perception engineers.

The article is a bit long (11100 words), and the table of contents below will most likely help those who want to do quick ctrl+F searches with the keywords.

Table of Contents (ToC)

Why learn planning?
What is planning?
  The problem formulation
  The Glossary of Planning
  Behavior Planning
  Frenet vs Cartesian systems
Classical tools — the troika of planning
  Searching
  Sampling
  Optimization
Industry practices of planning
  Path-speed decoupled planning
  Joint spatiotemporal planning
Decision making
  What and why?
  MDP and POMDP
  Value iteration and Policy iteration
  AlphaGo and MCTS — when nets meet trees
  MPDM (and successors) in autonomous driving
Industry practices of decision making
  Trees
  No trees
Self-Reflections
  Why NN in planning?
  What about e2e NN planners?
  Can we do without prediction?
  Can we do with just nets but no trees?
  Can we use LLMs to make decisions?
The trend of evolution

Why learn planning?

This brings us to an interesting question: why learn planning, especially the classical stack, in the era of AI?

From a problem-solving perspective, understanding your customers' challenges better will enable you, as a perception engineer, to serve your downstream customers more effectively, even if your main focus remains on perception work.

Machine learning is a tool, not a solution. The most efficient way to solve problems is to combine new tools with domain knowledge, especially those with solid mathematical formulations. Domain knowledge-inspired learning methods are likely to be more data-efficient. As planning transitions from rule-based to ML-based systems, even with early prototypes and products of end-to-end systems hitting the road, there is a need for engineers who can deeply understand both the fundamentals of planning and machine learning. Despite these changes, classical and learning methods will likely continue to coexist for a considerable period, perhaps shifting from an 8:2 to a 2:8 ratio. It is almost essential for engineers working in this field to understand both worlds.

From a value-driven development perspective, understanding the limitations of classical methods is crucial. This insight allows you to effectively utilize new ML tools to design a system that addresses current issues and delivers immediate impact.

Additionally, planning is a critical part of all autonomous agents, not just in autonomous driving. Understanding what planning is and how it works will enable more ML talents to work on this exciting topic and contribute to the development of truly autonomous agents, whether they are cars or other forms of automation.

What is planning?

The problem formulation

As the "brain" of autonomous vehicles, the planning system is crucial for the safe and efficient driving of vehicles. The goal of the planner is to generate trajectories that are safe, comfortable, and efficient in progressing towards the goal. In other words, safety, comfort, and efficiency are the three key objectives for planning.

As input to the planning system, all perception outputs are required, including static road structures, dynamic road agents, free space generated by occupancy networks, and traffic wait conditions. The planning system must also ensure vehicle comfort by monitoring acceleration and jerk for smooth trajectories, while considering interaction and traffic courtesy.

The planning system generates trajectories in the format of a sequence of waypoints for the ego vehicle's low-level controller to track. Specifically, these waypoints represent the future positions of the ego vehicle at a series of fixed time stamps. For example, each point might be 0.4 seconds apart, covering an 8-second planning horizon, resulting in a total of 20 waypoints.

A classical planning stack roughly consists of global route planning, local behavior planning, and local trajectory planning. Global route planning provides a road-level path from the start point to the end point on a global map. Local behavior planning decides on a semantic driving action type (e.g., car following, nudging, side passing, yielding, and overtaking) for the next several seconds. Based on the behavior type decided by the behavior planning module, local trajectory planning generates a short-term trajectory. Global route planning is typically provided by a map service once navigation is set and is beyond the scope of this post.
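Before moving on, here is a minimal sketch of the waypoint-based trajectory interface described above. The 0.4-second spacing and 8-second horizon are simply the example numbers from the text, and the field names are illustrative rather than a reference implementation of any particular stack.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Waypoint:
    t: float   # time offset from now, in seconds
    x: float   # position (Cartesian x/y here; could equally be Frenet s/l)
    y: float
    v: float   # target speed at this waypoint, in m/s

def waypoint_timestamps(dt: float = 0.4, horizon: float = 8.0) -> List[float]:
    """Fixed timestamps at which the planner outputs future ego positions."""
    n = int(horizon / dt)                      # 8.0 / 0.4 = 20 waypoints
    return [round(dt * (i + 1), 2) for i in range(n)]

# A planned trajectory is then just a list of Waypoint objects at these
# timestamps, which the low-level controller tracks with steering/throttle/brake.
print(waypoint_timestamps())                   # [0.4, 0.8, ..., 8.0]
```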
We will focus on behavior planning and trajectory planning from now on.

Behavior planning and trajectory generation can work explicitly in tandem or be combined into a single process. In explicit methods, behavior planning and trajectory generation are distinct processes operating within a hierarchical framework, working at different frequencies, with behavior planning at 1–5 Hz and trajectory planning at 10–20 Hz. Although this decoupled approach is highly efficient most of the time, adapting it to different scenarios may require significant modifications and fine-tuning. More advanced planning systems combine the two into a single optimization problem. This approach ensures feasibility and optimality without any compromise.

Classification of planning design approaches (source: Fluid Dynamics Planner)

The Glossary of Planning

You might have noticed that the terminology used in the above section and the image do not completely match. There is no standard terminology that everyone uses. Across both academia and industry, it is not uncommon for engineers to use different names to refer to the same concept and the same name to refer to different concepts. This indicates that planning in autonomous driving is still under active development and has not fully converged.

Here, I list the notation used in this post and briefly explain other notions present in the literature.

Planning: A top-level concept, parallel to control, that generates trajectory waypoints. Together, planning and control are jointly referred to as PnC (planning and control).

Control: A top-level concept that takes in trajectory waypoints and generates high-frequency steering, throttle, and brake commands for actuators to execute. Control is relatively well-established compared to other areas and is beyond the scope of this post, despite the common notion of PnC.

Prediction: A top-level concept that predicts the future trajectories of traffic agents other than the ego vehicle. Prediction can be considered a lightweight planner for other agents and is also called motion prediction.

Behavior Planning: A module that produces high-level semantic actions (e.g., lane change, overtake) and typically generates a coarse trajectory. It is also known as task planning or decision making, particularly in the context of interactions.

Motion Planning: A module that takes in semantic actions and produces smooth, feasible trajectory waypoints for the duration of the planning horizon for control to execute. It is also referred to as trajectory planning.

Trajectory Planning: Another term for motion planning.

Decision Making: Behavior planning with a focus on interactions. Without ego-agent interaction, it is simply referred to as behavior planning. It is also known as tactical decision making.

Route Planning: Finds the preferred route over road networks, also known as mission planning.

Model-Based Approach: In planning, this refers to manually crafted frameworks used in the classical planning stack, as opposed to neural network models. Model-based methods contrast with learning-based methods.

Multimodality: In the context of planning, this typically refers to multiple intentions. This contrasts with multimodality in the context of multimodal sensor inputs to perception or multimodal large language models (such as VLM or VLA).

Reference Line: A local (several hundred meters) and coarse path based on global routing information and the current state of the ego vehicle.

Frenet Coordinates: A coordinate system based on a reference line.
Frenet simplifies a curvy path in Cartesian coordinates to a straight tunnel model. See below for a more detailed introduction.

Trajectory: A 3D spatiotemporal curve, in the form of (x, y, t) in Cartesian coordinates or (s, l, t) in Frenet coordinates. A trajectory is composed of both path and speed.

Path: A 2D spatial curve, in the form of (x, y) in Cartesian coordinates or (s, l) in Frenet coordinates.

Semantic Action: A high-level abstraction of action (e.g., car following, nudge, side pass, yield, overtake) with clear human intention. Also referred to as intention, policy, maneuver, or primitive motion.

Action: A term with no fixed meaning. It can refer to the output of control (high-frequency steering, throttle, and brake commands for actuators to execute) or the output of planning (trajectory waypoints). Semantic action refers to the output of behavior planning.

Different literature may use various notations and concepts. Here are some examples:

Decision Making System: Sometimes includes planning and control as well (source: A Survey of Motion Planning and Control Techniques for Self-driving Urban Vehicles, and BEVGPT).

Motion Planning: Sometimes is the top-level planning concept and includes behavior planning and trajectory planning (source: Towards A General-Purpose Motion Planning for Autonomous Vehicles Using Fluid Dynamics).

Planning: Sometimes includes behavior planning, motion planning, and also route planning.

These variations illustrate the diversity in terminology and the evolving nature of the field.

Behavior Planning

As a machine learning engineer, you may notice that the behavior planning module is a heavily manually crafted intermediate module. There is no consensus on the exact form and content of its output. Concretely, the output of behavior planning can be a reference path or object labeling on ego maneuvers (e.g., pass from the left or right-hand side, pass or yield). The term "semantic action" has no strict definition and no fixed methods.

The decoupling of behavior planning and motion planning increases efficiency in solving the extremely high-dimensional action space of autonomous vehicles. The actions of an autonomous vehicle need to be reasoned about at typically 10 Hz or more (the time resolution of the waypoints), and most of these actions are relatively straightforward, like going straight. After decoupling, the behavior planning layer only needs to reason about future scenarios at a relatively coarse resolution, while the motion planning layer operates in the local solution space based on the decision made by behavior planning. Another benefit of behavior planning is converting non-convex optimization to convex optimization, which we will discuss further below.

Frenet vs Cartesian systems

The Frenet coordinate system is a widely adopted system that merits its own introduction section. The Frenet frame simplifies trajectory planning by independently managing lateral and longitudinal movements relative to a reference path. The s coordinate represents longitudinal displacement (distance along the road), while the l (or d) coordinate represents lateral displacement (side position relative to the reference path).

Frenet simplifies a curvy path in Cartesian coordinates to a straight tunnel model. This transformation converts non-linear road boundary constraints on curvy roads into linear ones, significantly simplifying the subsequent optimization problems.
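To make the (s, l) notation concrete, here is a toy sketch of projecting a Cartesian point onto a densely sampled reference line to obtain approximate Frenet coordinates. It uses a nearest-point projection only; a production implementation would interpolate between reference points and account for curvature.

```python
import math
from typing import List, Tuple

def cartesian_to_frenet(point: Tuple[float, float],
                        ref_line: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Approximate (s, l) of a point with respect to a polyline reference line."""
    # Accumulated arc length s at each reference point.
    s_acc = [0.0]
    for (x0, y0), (x1, y1) in zip(ref_line[:-1], ref_line[1:]):
        s_acc.append(s_acc[-1] + math.hypot(x1 - x0, y1 - y0))

    # Nearest reference point (coarse projection).
    dists = [math.hypot(point[0] - x, point[1] - y) for x, y in ref_line]
    i = min(range(len(ref_line)), key=dists.__getitem__)

    # Sign of the lateral offset comes from the cross product with the local tangent.
    j = min(i + 1, len(ref_line) - 1)
    tx, ty = ref_line[j][0] - ref_line[i][0], ref_line[j][1] - ref_line[i][1]
    px, py = point[0] - ref_line[i][0], point[1] - ref_line[i][1]
    side = 1.0 if (tx * py - ty * px) >= 0 else -1.0
    return s_acc[i], side * dists[i]

# Toy usage: a straight reference line along x; a point 2 m to the left at x = 5
# maps to roughly s = 5, l = +2.
ref = [(float(x), 0.0) for x in range(11)]
print(cartesian_to_frenet((5.0, 2.0), ref))
```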
Additionally, humans perceive longitudinal and lateral movements differently, and the Frenet frame allows for separate and more flexible optimization of these movements.

Schematics on the conversion from Cartesian frame to Frenet frame (source: Cartesian Planner)

The Frenet coordinate system requires a clean, structured road graph with low-curvature lanes. In practice, it is preferred for structured roads with small curvature, such as highways or city expressways. However, the issues with the Frenet coordinate system are amplified with increasing reference line curvature, so it should be used cautiously on structured roads with high curvature, like city intersections with guide lines.

For unstructured roads, such as ports, mining areas, parking lots, or intersections without guide lines, the more flexible Cartesian coordinate system is recommended. The Cartesian system is better suited for these environments because it can handle higher curvature and less structured scenarios more effectively.

Classical tools — the troika of planning

Planning in autonomous driving involves computing a trajectory from an initial high-dimensional state (including position, time, velocity, acceleration, and jerk) to a target subspace, ensuring all constraints are satisfied. Searching, sampling, and optimization are the three most widely used tools for planning.

Searching

Classical graph-search methods are popular in planning and are used in route/mission planning on structured roads or directly in motion planning to find the best path in unstructured environments (such as parking or urban intersections, especially mapless scenarios). There is a clear evolution path, from Dijkstra's algorithm to A* (A-star), and further to hybrid A*.

Dijkstra's algorithm explores all possible paths to find the shortest one, making it a blind (uninformed) search algorithm. It is a systematic method that guarantees the optimal path, but it is inefficient to deploy. As shown in the chart below, it explores almost all directions. Essentially, Dijkstra's algorithm is a breadth-first search (BFS) weighted by movement costs. To improve efficiency, we can use information about the location of the target to trim down the search space.

Visualization of Dijkstra's algorithm and A-star search (source: PathFinding.js, example inspired by RedBlobGames)

The A* algorithm uses heuristics to prioritize paths that appear to be leading closer to the goal, making it more efficient. It combines the cost so far (Dijkstra) with the cost to go (heuristics, essentially greedy best-first). A* only guarantees the shortest path if the heuristic is admissible and consistent. If the heuristic is poor, A* can perform worse than the Dijkstra baseline and may degenerate into a greedy best-first search.

In the specific application of autonomous driving, the hybrid A* algorithm further improves A* by considering vehicle kinematics. A path produced by plain A* may not satisfy kinematic constraints and cannot be tracked accurately (for example, the steering angle is typically limited to within 40 degrees). While A* operates in grid space for both state and action, hybrid A* separates them, maintaining the state in the grid but allowing continuous action according to kinematics.

Analytical expansion (shot to goal) is another key innovation proposed by hybrid A*. A natural enhancement to A* is to connect the most recently explored nodes to the goal using a non-colliding straight line. If this is possible, we have found the solution.
In hybrid A*, this straight line is replaced by Dubins and Reeds-Shepp (RS) curves, which comply with vehicle kinematics. This early-stopping method strikes a balance between optimality and feasibility by focusing more on feasibility on the side farther from the start (closer to the goal).

Hybrid A* is used heavily in parking scenarios and mapless urban intersections. Here is a very nice video showcasing how it works in a parking scenario.

Hybrid A-star algorithm with analytical expansion (source: the 2010 IJRR Hybrid A-star paper and 2012 Udacity class)

Sampling

Another popular method of planning is sampling. The well-known Monte Carlo method is a random sampling method. In essence, sampling involves selecting many candidates randomly or according to a prior, and then selecting the best one according to a defined cost. For sampling-based methods, the fast evaluation of many options is critical, as it directly impacts the real-time performance of the autonomous driving system.

Large Language Models (LLMs) essentially provide samples, and there needs to be an evaluator with a defined cost that aligns with human preferences. This evaluation process ensures that the selected output meets the desired criteria and quality standards.

Sampling can occur in a parameterized solution space if we already know the analytical solution to a given problem or subproblem. For example, we typically want to minimize the time integral of the squared jerk (jerk is the third derivative of the position p(t), written as three dots over p, where each dot denotes one order of derivative with respect to time), among other criteria.

Minimizing squared jerk for driving comfort (source: Werling et al, ICRA 2010)

It can be mathematically proven that quintic (5th order) polynomials provide the jerk-optimal connection between two states in a position-velocity-acceleration space, even when additional cost terms are considered. By sampling in this parameter space of quintic polynomials, we can find the one with the minimum cost to get the approximate solution. The cost takes into account factors such as speed, acceleration, jerk limit, and collision checks. This approach essentially solves the optimization problem through sampling.

Sampling of lateral movement time profiles (source: Werling et al, ICRA 2010)

Sampling-based methods have inspired numerous ML papers, including CoverNet, Lift-Splat-Shoot, NMP, and MP3. These methods replace mathematically sound quintic polynomials with human driving behavior, utilizing a large database. The evaluation of trajectories can be easily parallelized, which further supports the use of sampling-based methods. This approach effectively leverages a vast amount of expert demonstrations to mimic human-like driving behavior, while avoiding random sampling of acceleration and steering profiles.

Sampling from human-driving data for data-driven planning methods (source: NMP, CoverNet and Lift-splat-shoot)

Optimization

Optimization finds the best solution to a problem by maximizing or minimizing a specific objective function under given constraints. In neural network training, a similar principle is followed using gradient descent and backpropagation to adjust the network's weights. However, in optimization tasks outside of neural networks, models are usually less complex, and more effective methods than gradient descent are often employed.
For example, while gradient descent can be applied to Quadratic Programming, it is generally not the most efficient method.

In autonomous driving, the planning cost to optimize typically considers dynamic objects for obstacle avoidance, static road structures for following lanes, navigation information to ensure the correct route, and ego status to evaluate smoothness.

Optimization can be categorized into convex and non-convex types. The key distinction is that in a convex optimization scenario, there is only one global optimum, which is also the local optimum. This characteristic makes the result unaffected by the initial solution of the optimization problem. For non-convex optimization, the initial solution matters a lot, as illustrated in the chart below.

Convex vs non-convex optimization (source: Stanford course materials)

Since planning involves highly non-convex optimization with many local optima, it heavily depends on the initial solution. Additionally, convex optimization typically runs much faster and is therefore preferred for onboard real-time applications such as autonomous driving. A typical approach is to use convex optimization in conjunction with other methods to outline a convex solution space first. This is the mathematical foundation behind separating behavior planning and motion planning, where finding a good initial solution is the role of behavior planning.

Take obstacle avoidance as a concrete example, which typically introduces non-convex problems. If we know the nudging direction, then it becomes a convex optimization problem, with the obstacle position acting as a lower or upper bound constraint for the optimization problem. If we don't know the nudging direction, we need to decide first which direction to nudge, making the problem a convex one for motion planning to solve. This nudging direction decision falls under behavior planning.

Of course, we can do direct optimization of non-convex optimization problems with tools such as projected gradient descent, alternating minimization, particle swarm optimization (PSO), and genetic algorithms. However, this is beyond the scope of this post.

A convex path planning problem vs a non-convex one (chart made by author)

The solution process of the convex vs non-convex path planning problem (chart made by author)

How do we make such decisions? We can use the aforementioned search or sampling methods to address non-convex problems. Sampling-based methods scatter many options across the parameter space, effectively handling non-convex issues similarly to searching.

You may also question why deciding which direction to nudge from is enough to guarantee the problem space is convex. To explain this, we need to discuss topology. In path space, similar feasible paths can transform continuously into each other without obstacle interference. These similar paths, grouped as "homotopy classes" in the formal language of topology, can all be explored using a single initial solution homotopic to them. All these paths form a driving corridor, illustrated as the red or green shaded area in the image above. For a 3D spatiotemporal case, please refer to the QCraft tech blog.

We can utilize the Generalized Voronoi diagram to enumerate all homotopy classes, which roughly correspond to the different decision paths available to us. However, this topic delves into advanced mathematical concepts that are beyond the scope of this blog post.

The key to solving optimization problems efficiently lies in the capabilities of the optimization solver.
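Returning to the nudge example above, the sketch below shows how fixing the nudge direction turns obstacle avoidance into a convex problem: once behavior planning decides to pass on the left, the obstacle simply becomes a lower bound on the lateral offset, and motion planning reduces to a small QP. This is only a toy illustration (it assumes the cvxpy library is available; the stations, numbers, and corridor bounds are made up).

```python
import cvxpy as cp
import numpy as np

n = 30                        # stations along the reference line (Frenet s)
l = cp.Variable(n)            # lateral offset at each station (Frenet l)

# Cost: stay near the lane center and keep the path smooth (small 2nd differences).
cost = cp.sum_squares(l) + 10.0 * cp.sum_squares(cp.diff(l, 2))

# Behavior planning decided to nudge LEFT of an obstacle occupying stations 12-17,
# so the obstacle turns into a linear lower-bound constraint.
constraints = [
    l[12:18] >= 1.5,           # stay at least 1.5 m to the left over the obstacle
    cp.abs(l) <= 3.0,          # stay inside the drivable corridor
    l[0] == 0.0, l[-1] == 0.0  # start and end on the lane center
]

cp.Problem(cp.Minimize(cost), constraints).solve()
print(np.round(l.value, 2))    # a smooth path bulging to the left around stations 12-17
```

Had the decision been to nudge right instead, the same constraint would flip to an upper bound; without that decision, the feasible set splits into two disconnected corridors and the problem is no longer convex.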
Typically, a solver requires approximately 10 milliseconds to plan a trajectory. If we can boost this efficiency tenfold, it can significantly impact algorithm design. This exact improvement was highlighted during Tesla AI Day 2022. A similar enhancement has occurred in perception systems, transitioning from 2D perception to Bird's Eye View (BEV) as available computing power scaled up tenfold. With a more efficient optimizer, more options can be calculated and evaluated, thereby reducing the importance of the decision-making process. However, engineering an efficient optimization solver demands substantial engineering resources.

"Every time compute scales up by 10x, algorithms will evolve to the next generation." — The unverified law of algorithm evolution

Industry Practices of Planning

A key differentiator in various planning systems is whether they are spatiotemporally decoupled. Concretely, spatiotemporally decoupled methods plan in spatial dimensions first to generate a path, and then plan the speed profile along this path. This approach is also known as path-speed decoupling.

Path-speed decoupling is often referred to as lateral-longitudinal (lat-long) decoupling, where lateral (lat) planning corresponds to path planning and longitudinal (long) planning corresponds to speed planning. This terminology seems to originate from the Frenet coordinate system, which we introduced earlier.

Decoupled solutions are easier to implement and can solve about 95% of issues. In contrast, coupled solutions have a higher theoretical performance ceiling but are more challenging to implement. They involve more parameters to tune and require a more principled approach to parameter tuning.

The comparison of decoupled and joint planning (source: made by the author, inspired by Qcraft)

Pros and cons of decoupled vs joint spatiotemporal planning (chart made by author)

Path-speed decoupled planning

We can take the Baidu Apollo EM planner as an example of a system that uses path-speed decoupled planning.

The EM planner significantly reduces computational complexity by transforming a three-dimensional station-lateral-speed problem into two two-dimensional problems: station-lateral and station-speed. At the core of Apollo's EM planner is an iterative Expectation-Maximization (EM) step, consisting of path optimization and speed optimization. Each step is divided into an E-step (projection and formulation in a 2D state space) and an M-step (optimization in the 2D state space). The E-step involves projecting the 3D problem into either a Frenet SL frame or an ST speed tracking frame.

The EM iteration in Apollo EM planner (source: Baidu Apollo EM planner)

The M-step (maximization step) in both path and speed optimization involves solving non-convex optimization problems. For path optimization, this means deciding whether to nudge an object on the left or right side, while for speed optimization, it involves deciding whether to overtake or yield to a dynamic object crossing the path. The Apollo EM planner addresses these non-convex optimization challenges using a two-step process: Dynamic Programming (DP) followed by Quadratic Programming (QP).

DP uses a sampling or searching algorithm to generate a rough initial solution, effectively pruning the non-convex space into a convex space. QP then takes the coarse DP results as input and optimizes them within the convex space provided by DP.
In essence, DP focuses on feasibility, and QP refines the solution to achieve optimality within the convex constraints.

In our defined terminology, Path DP corresponds to lateral BP, Path QP to lateral MP, Speed DP to longitudinal BP, and Speed QP to longitudinal MP. Thus, the process involves conducting BP (behavior planning) followed by MP (motion planning) in both the path and speed steps.

A full autonomous driving stack with path-speed decoupled planning (chart made by author)

Joint spatiotemporal planning

Although decoupled planning can resolve 95% of cases in autonomous driving, the remaining 5% involve challenging dynamic interactions where a decoupled solution often results in suboptimal trajectories. In these complex scenarios, demonstrating intelligence is crucial, making it a very hot topic in the field.

For example, in narrow-space passing, the optimal behavior might be to either decelerate to yield or accelerate to pass. Such behaviors are not achievable within the decoupled solution space and require joint optimization. Joint optimization allows for a more integrated approach, considering both path and speed simultaneously to handle intricate dynamic interactions effectively.

A full autonomous driving stack with joint spatiotemporal planning (chart made by author)

However, there are significant challenges in joint spatiotemporal planning. Firstly, solving the non-convex problem directly in a higher-dimensional state space is more challenging and time-consuming than using a decoupled solution. Secondly, considering interactions in spatiotemporal joint planning is even more complex. We will cover this topic in more detail later when we discuss decision-making.

Here we introduce two solving methods: brute force search and constructing a spatiotemporal corridor for optimization.

Brute force search occurs directly in 3D spatiotemporal space (2D in space and 1D in time), and can be performed in either XYT (Cartesian) or SLT (Frenet) coordinates. We will take SLT as an example. SLT space is long and flat, similar to an energy bar: it is elongated along the S and T dimensions and thin in the L dimension. For brute force search, we can use hybrid A-star, with the cost being a combination of progress cost and cost to go. During optimization, we must conform to search constraints that prevent reversing in both the s and t dimensions.

Overtake by lane change in spatiotemporal lattice (source: Spatiotemporal optimization with A*)

Another method is constructing a spatiotemporal corridor, essentially a curve with the footprint of a car winding through a 3D spatiotemporal state space (SLT, for example). The SSC (spatiotemporal semantic corridor, RAL 2019) encodes requirements given by semantic elements into a semantic corridor, generating a safe trajectory accordingly. The semantic corridor consists of a series of mutually connected collision-free cubes with dynamical constraints posed by the semantic elements in the spatiotemporal domain. Within each cube, the problem becomes a convex optimization problem that can be solved using Quadratic Programming (QP).

SSC still requires a BP (Behavior Planning) module to provide a coarse driving trajectory. Complex semantic elements of the environment are projected into the spatiotemporal domain with respect to the reference lane. EPSILON (TRO 2021) showcases a system where SSC serves as the motion planner working in tandem with a behavior planner. In the next section, we will discuss behavior planning, especially focusing on interaction.
In this context, behavior planning is usually referred to as decision making.

An illustration of the spatiotemporal corridor (source: SSC)

Decision making

What and why?

Decision making in autonomous driving is essentially behavior planning, but with a focus on interaction with other traffic agents. The assumption is that other agents are mostly rational and will respond to our behavior in a predictable manner, which we can describe as "noisily rational."

People may question the necessity of decision making when advanced planning tools are available. However, two key aspects — uncertainty and interaction — introduce a probabilistic nature to the environment, primarily due to the presence of dynamic objects. Interaction is the most challenging part of autonomous driving, distinguishing it from general robotics. Autonomous vehicles must not only navigate but also anticipate and react to the behavior of other agents, making robust decision-making essential for safety and efficiency.

In a deterministic (purely geometric) world without interaction, decision making would be unnecessary, and planning through searching, sampling, and optimization would suffice. Brute force searching in the 3D XYT space could serve as a general solution.

In most classical autonomous driving stacks, a prediction-then-plan approach is adopted, assuming zero-order interaction between the ego vehicle and other vehicles. This approach treats prediction outputs as deterministic, requiring the ego vehicle to react accordingly. This leads to overly conservative behavior, exemplified by the "freezing robot" problem. In such cases, prediction fills the entire spatiotemporal space, preventing actions like lane changes in crowded conditions — something humans manage more effectively.

To handle stochastic strategies, Markov Decision Processes (MDP) or Partially Observable Markov Decision Processes (POMDP) frameworks are essential. These approaches shift the focus from geometry to probability, addressing chaotic uncertainty. By assuming that traffic agents behave rationally or at least noisily rationally, decision making can help create a safe driving corridor in the otherwise chaotic spatiotemporal space.

Among the three overarching goals of planning — safety, comfort, and efficiency — decision making primarily enhances efficiency. Conservative actions can maximize safety and comfort, but effective negotiation with other road agents, achievable through decision making, is essential for optimal efficiency. Effective decision making also displays intelligence.

MDP and POMDP

We will first introduce Markov Decision Processes (MDP) and Partially Observable Markov Decision Processes (POMDP), followed by their systematic solutions, such as value iteration and policy iteration.

A Markov Process (MP) is a type of stochastic process that deals with dynamic random phenomena, unlike static probability. In a Markov Process, the future state depends only on the current state, making it sufficient for prediction. For autonomous driving, the relevant state may only include the last second of data, expanding the state space to allow for a shorter history window.

A Markov Decision Process (MDP) extends a Markov Process to include decision-making by introducing action. MDPs model decision-making where outcomes are partly random and partly controlled by the decision maker or agent.
An MDP can be modeled with five factors:

State (S): The state of the environment.
Action (A): The actions the agent can take to affect the environment.
Reward (R): The reward the environment provides to the agent as a result of the action.
Transition Probability (P): The probability of transitioning from the old state to a new state upon the agent's action.
Gamma (γ): A discount factor for future rewards.

This is also the common framework used by reinforcement learning (RL), which is essentially an MDP. The goal of MDP or RL is to maximize the cumulative reward received in the long run. This requires the agent to make good decisions given a state from the environment, according to a policy.

A policy, π, is a mapping from each state, s ∈ S, and action, a ∈ A(s), to the probability π(a|s) of taking action a when in state s. MDP or RL studies the problem of how to derive the optimal policy.

The agent-environment interface in MDP and RL (source: Reinforcement Learning: An Introduction)

A Partially Observable Markov Decision Process (POMDP) adds an extra layer of complexity by recognizing that states cannot be directly observed but rather inferred through observations. In a POMDP, the agent maintains a belief — a probability distribution over possible states — to estimate the state of the environment. Autonomous driving scenarios are better represented by POMDPs due to their inherent uncertainties and the partial observability of the environment. An MDP can be considered a special case of a POMDP where the observation perfectly reveals the state.

MDP vs POMDP (source: POMDPs as stochastic contingent planning)

POMDPs can actively collect information, leading to actions that gather necessary data, demonstrating the intelligent behavior of these models. This capability is particularly valuable in scenarios like waiting at intersections, where gathering information about other vehicles' intentions and the state of the traffic light is crucial for making safe and efficient decisions.

Value iteration and Policy iteration

Value iteration and policy iteration are systematic methods for solving MDP or POMDP problems. While these methods are not commonly used in real-world applications due to their complexity, understanding them provides insight into exact solutions and how they can be simplified in practice, such as using MCTS in AlphaGo or MPDM in autonomous driving.

To find the best policy in an MDP, we must assess the potential or expected reward from a state, or more specifically, from an action taken in that state. This expected reward includes not just the immediate reward but also all future rewards, formally known as the return or cumulative discounted reward. (For a deeper understanding, refer to "Reinforcement Learning: An Introduction," often considered the definitive guide on the subject.)

The value function (V) characterizes the quality of states by summing the expected returns. The action-value function (Q) assesses the quality of actions for a given state. Both functions are defined according to a given policy. The Bellman Optimality Equation states that an optimal policy will choose the action that maximizes the immediate reward plus the expected future rewards from the resulting new states. In simple terms, the Bellman Optimality Equation advises considering both the immediate reward and the future consequences of an action.
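Using the notation of the five factors above, the standard form of the Bellman Optimality Equation (the relation depicted in the chart below) can be written as:

$$
V^*(s) \;=\; \max_{a \in A(s)} \Big[\, R(s, a) \;+\; \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big]
$$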
For example, when switching jobs, consider not only the immediate pay raise (R) but also the future value (S') the new position offers.

Bellman's equation of optimality (chart made by author)

It is relatively straightforward to extract the optimal policy from the Bellman Optimality Equation once the optimal value function is available. But how do we find this optimal value function? This is where value iteration comes to the rescue.

Extract best policy from optimal values (chart made by author)

Value iteration finds the best policy by repeatedly updating the value of each state until it stabilizes. This process is derived by turning the Bellman Optimality Equation into an update rule. Essentially, we use the optimal future picture to guide the iteration toward it. In plain language, "fake it until you make it!"

Update value functions under the guidance of Bellman's Equation (chart made by author)

Value iteration is guaranteed to converge for finite state spaces, regardless of the initial values assigned to the states (for a detailed proof, please refer to the Bible of RL). If the discount factor gamma is set to 0, meaning we only consider immediate rewards, the value iteration will converge after just one iteration. A smaller gamma leads to faster convergence because the horizon of consideration is shorter, though it may not always be the best option for solving concrete problems. Balancing the discount factor is a key aspect of engineering practice.

One might ask how this works if all states are initialized to zero. The immediate reward in the Bellman Equation is crucial for bringing in additional information and breaking the initial symmetry. Think about the states that immediately lead to the goal state; their value propagates through the state space like a virus. In plain language, it's about making small wins, frequently.

Value and policy functions interact until they converge to optimum together (source: Reinforcement Learning: An Introduction)

However, value iteration also suffers from inefficiency. It requires taking the optimal action at each iteration by considering all possible actions, similar to Dijkstra's algorithm. While it demonstrates feasibility as a basic approach, it is typically not practical for real-world applications.

The contrast of Bellman Equation and Bellman Optimality Equation (chart made by author)

Policy iteration improves on this by taking actions according to the current policy and updating it based on the Bellman Equation (not the Bellman Optimality Equation). Policy iteration decouples policy evaluation from policy improvement, making it a much faster solution. Each step is taken based on a given policy instead of exploring all possible actions to find the one that maximizes the objective. Although each iteration of policy iteration can be more computationally intensive due to the policy evaluation step, it generally results in a faster convergence overall.

In simple terms, if you can only fully evaluate the consequence of one action, it's better to use your own judgment and do your best with the current information available.

AlphaGo and MCTS — when nets meet trees

We have all heard the unbelievable story of AlphaGo beating the best human player in 2016. AlphaGo formulates the gameplay of Go as an MDP and solves it with Monte Carlo Tree Search (MCTS). But why not use value iteration or policy iteration?

Value iteration and policy iteration are systematic, iterative methods that solve MDP problems.
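As a minimal illustration of what such a systematic sweep looks like, here is a value-iteration sketch on a tiny, fully specified MDP. The transition table and rewards are invented purely for illustration; the point is that every state and every action must be evaluated on every sweep.

```python
import numpy as np

# Toy MDP: 4 states, 2 actions; P[a][s][s'] transition probabilities, R[s][a] rewards.
n_states, n_actions, gamma = 4, 2, 0.9
P = np.zeros((n_actions, n_states, n_states))
P[0] = np.eye(n_states)                                   # action 0: stay put
P[1] = np.roll(np.eye(n_states), shift=1, axis=1)         # action 1: move to the next state
R = np.array([[0.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0],                                  # reward for moving out of state 2
              [0.0, 0.0]])

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality update: evaluate every action in every state, keep the best.
    Q = R + gamma * np.einsum("asn,n->sa", P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-6:                   # stop once the values stabilize
        break
    V = V_new

policy = Q.argmax(axis=1)       # greedy policy extracted from the converged values
print(np.round(V, 3), policy)
```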
However, even with improved policy iteration, it still requires performing time-consuming operations to update the value of every state. A standard 19×19 Go board has roughly 2×10^170 possible states. This vast number of states makes it intractable to solve with traditional value iteration or policy iteration techniques.

AlphaGo and its successors use a Monte Carlo tree search (MCTS) algorithm to find their moves, guided by a value network and a policy network, trained on both human and computer play. Let's take a look at vanilla MCTS first.

The four steps of MCTS by AlphaGo, combining both value network and policy network (source: AlphaGo, Nature 2016)

Monte Carlo Tree Search (MCTS) is a method for policy estimation that focuses on decision-making from the current state. One iteration involves a four-step process: selection, expansion, simulation (or evaluation), and backup.

Selection: The algorithm follows the most promising path based on previous simulations until it reaches a leaf node, a position not yet fully explored.
Expansion: One or more child nodes are added to represent possible next moves from the leaf node.
Simulation (Evaluation): The algorithm plays out a random game from the new node until the end, known as a "rollout." This assesses the potential outcome from the expanded node by simulating random moves until a terminal state is reached.
Backup: The algorithm updates the values of the nodes on the path taken based on the game's result. If the outcome is a win, the value of the nodes increases; if it is a loss, the value decreases. This process propagates the result of the rollout back up the tree, refining the policy based on simulated outcomes.

After a given number of iterations, MCTS provides the percentage frequency with which immediate actions were selected from the root during simulations. During inference, the action with the most visits is selected. Here is an interactive illustration of MCTS with the game of tic-tac-toe for simplicity.

MCTS in AlphaGo is enhanced by two neural networks. The value network evaluates the winning rate from a given state (board configuration). The policy network evaluates the action distribution for all possible moves. These neural networks improve MCTS by reducing the effective depth and breadth of the search tree. The policy network helps in sampling actions, focusing the search on promising moves, while the value network provides a more accurate evaluation of positions, reducing the need for extensive rollouts. This combination allows AlphaGo to perform efficient and effective searches in the vast state space of Go.

The policy network and value network of AlphaGo (source: AlphaGo, Nature 2016)

In the expansion step, the policy network samples the most likely positions, effectively pruning the breadth of the search space. In the evaluation step, the value network provides an instinctive scoring of the position, while a faster, lightweight policy network performs rollouts until the game ends to collect rewards. MCTS then uses a weighted sum of the evaluations from both networks to make the final assessment.

Note that a single evaluation of the value network approaches the accuracy of Monte Carlo rollouts using the RL policy network, but with 15,000 times less computation. This mirrors the fast-slow system design, akin to intuition versus reasoning, or System 1 versus System 2 as described by Nobel laureate Daniel Kahneman.
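For readers who prefer code over prose, here is a schematic, toy implementation of the four vanilla MCTS steps listed above, on a made-up one-player game. It is purely illustrative and has nothing to do with Go or driving beyond the structure of the algorithm; the UCB constant and iteration count are arbitrary choices.

```python
import math
import random

# Toy one-player "game": start at 0, choose +1 or -1 for 6 steps;
# the terminal reward is 1 if the final sum is at least 4, otherwise 0.
ACTIONS, DEPTH = (1, -1), 6

class Node:
    def __init__(self, state, parent=None):
        self.state = state                    # (current_sum, steps_taken)
        self.parent, self.children = parent, {}
        self.visits, self.value = 0, 0.0

def rollout(state):
    """3. Simulation: play random moves until the end and return the reward."""
    s, d = state
    while d < DEPTH:
        s, d = s + random.choice(ACTIONS), d + 1
    return 1.0 if s >= 4 else 0.0

def mcts(n_iter=2000, c=1.4):
    root = Node((0, 0))
    for _ in range(n_iter):
        node = root
        # 1. Selection: follow the UCB score while the node is fully expanded.
        while node.state[1] < DEPTH and len(node.children) == len(ACTIONS):
            node = max(node.children.values(),
                       key=lambda ch: ch.value / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits))
        # 2. Expansion: add one untried child, unless the node is terminal.
        if node.state[1] < DEPTH:
            a = next(a for a in ACTIONS if a not in node.children)
            node.children[a] = Node((node.state[0] + a, node.state[1] + 1), parent=node)
            node = node.children[a]
        # 3 + 4. Simulation and backup: propagate the rollout result to the root.
        reward = rollout(node.state)
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # The recommended first move is the most visited child of the root.
    return max(root.children, key=lambda a: root.children[a].visits)

random.seed(0)
print(mcts())   # prints 1: only mostly-"+1" sequences can reach a sum of 4
```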
Similar fast-slow designs can be observed in more recent works, such as DriveVLM.

To be exact, AlphaGo incorporates two slow-fast systems at different levels. On the macro level, the policy network selects moves while the faster rollout policy network evaluates these moves. On the micro level, the faster rollout policy network can be approximated by a value network that directly predicts the winning rate of board positions.

What can we learn from AlphaGo for autonomous driving? AlphaGo demonstrates the importance of extracting an excellent policy using a robust world model (simulation). Similarly, autonomous driving requires a highly accurate simulation to effectively leverage algorithms similar to those used by AlphaGo. This approach underscores the value of combining strong policy networks with detailed, precise simulations to enhance decision-making and optimize performance in complex, dynamic environments.

MPDM (and successors) in autonomous driving

In the game of Go, all states are immediately available to both players, making it a perfect information game where observation equals state. This allows the game to be characterized by an MDP process. In contrast, autonomous driving is a POMDP process, as the states can only be estimated through observation.

POMDPs connect perception and planning in a principled way. The typical solution for a POMDP is similar to that for an MDP, with a limited lookahead. However, the main challenges lie in the curse of dimensionality (explosion in state space) and the complex interactions with other agents. To make real-time progress tractable, domain-specific assumptions are typically made to simplify the POMDP problem.

MPDM (and the two follow-ups, and the white paper) is one pioneering study in this direction. MPDM reduces the POMDP to a closed-loop forward simulation of a finite, discrete set of semantic-level policies, rather than evaluating every possible control input for every vehicle. This approach addresses the curse of dimensionality by focusing on a manageable number of meaningful policies, allowing for effective real-time decision-making in autonomous driving scenarios.

Semantic actions help control the curse of dimensionality (source: EPSILON)

The assumptions of MPDM are twofold. First, much of the decision-making by human drivers involves discrete high-level semantic actions (e.g., slowing, accelerating, lane-changing, stopping). These actions are referred to as policies in this context. The second implicit assumption concerns other agents: other vehicles will make reasonably safe decisions. Once a vehicle's policy is decided, its action (trajectory) is determined.

The framework of MPDM (chart created by author)

MPDM first selects one policy for the ego vehicle from many options (hence the "multi-policy" in its name) and selects one policy for each nearby agent based on their respective predictions. It then performs forward simulation (similar to a fast rollout in MCTS). The best interaction scenario after evaluation is then passed on to motion planning, such as the Spatiotemporal Semantic Corridor (SSC) mentioned in the joint spatiotemporal planning section.

MPDM enables intelligent and human-like behavior, such as actively cutting into dense traffic flow even when no sufficient gap is present. This is not possible with a predict-then-plan pipeline, which does not explicitly consider interactions.
The prediction module in MPDM is tightly integrated with the behavior planning module through forward simulation.

MPDM assumes a single policy throughout the decision horizon (10 seconds). Essentially, MPDM adopts an MCTS approach that is one layer deep and very wide, considering all possible agent predictions. This leaves room for improvement, inspiring many follow-up works such as EUDM, EPSILON, and MARC. For example, EUDM considers more flexible ego policies and assigns a policy tree with a depth of four, with each policy covering a time duration of 2 seconds over an 8-second decision horizon. To compensate for the extra computation induced by the increased tree depth, EUDM performs more efficient width pruning by guided branching, identifying critical scenarios and key vehicles. This approach explores a more balanced policy tree.

The forward simulation in MPDM and EUDM uses very simplistic driver models (the Intelligent Driver Model, or IDM, for longitudinal simulation, and Pure Pursuit, or PP, for lateral simulation). MPDM points out that high-fidelity realism matters less than the closed-loop nature itself, as long as policy-level decisions are not affected by low-level action execution inaccuracies.

The conceptual diagram of decision making, where prediction, BP and MP integrate tightly (chart created by author)

Contingency planning in the context of autonomous driving involves generating multiple potential trajectories to account for various possible future scenarios. A key motivating example is that experienced drivers anticipate multiple future scenarios and always keep a safe backup plan. This anticipatory approach leads to a smoother driving experience, even when cars perform sudden cut-ins into the ego lane.

A critical aspect of contingency planning is deferring the decision bifurcation point. This means delaying the point at which different potential trajectories diverge, allowing the ego vehicle more time to gather information and respond to different outcomes. By doing so, the vehicle can make more informed decisions, resulting in smoother and more confident driving behaviors, similar to those of an experienced driver.

Risk-aware contingency planning (source: MARC, RAL 2023)

MARC also combines behavior planning and motion planning together, which extends the notion and utility of forward simulation. In other words, MPDM and EUDM still use a policy tree for high-level behavior planning and rely on other motion planning pipelines such as spatiotemporal semantic corridors (SSC), because ego motion in the policy tree is still characterized by heavily quantized behavior buckets. MARC extends this by keeping the quantized behavior for agents other than the ego vehicle but using more refined motion planning directly in the forward rollout. In a way, it is a hybrid approach, where "hybrid" carries a similar meaning to that in hybrid A*: a mix of discrete and continuous.

One possible drawback of MPDM and all its follow-up works is their reliance on simple policies designed for highway-like structured environments, such as lane keeping and lane changing. This reliance may limit the capability of forward simulation to handle complex interactions. To address this, following the example of MPDM, the key to making POMDPs more effective is to simplify the action and state space through the growth of a high-level policy tree.
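As a side note, the Intelligent Driver Model used for those longitudinal rollouts is simple enough to sketch in a few lines. This is the standard IDM formula; the parameter values and the small closed-loop simulation around it are only illustrative.

```python
import math

def idm_acceleration(v, v_lead, gap,
                     v0=15.0,     # desired speed (m/s)
                     T=1.5,       # desired time headway (s)
                     s0=2.0,      # minimum gap (m)
                     a_max=1.5,   # maximum acceleration (m/s^2)
                     b=2.0,       # comfortable deceleration (m/s^2)
                     delta=4.0):
    """Intelligent Driver Model: longitudinal acceleration of the following car."""
    dv = v - v_lead
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)

# One policy rollout in closed loop: the ego car approaches a slower lead vehicle.
v, x, dt = 12.0, 0.0, 0.2
v_lead, x_lead = 8.0, 40.0
for _ in range(50):                            # simulate 10 seconds
    a = idm_acceleration(v, v_lead, gap=x_lead - x)
    v = max(0.0, v + a * dt)
    x += v * dt
    x_lead += v_lead * dt
print(round(v, 2), round(x_lead - x, 2))       # ego speed approaches the lead's; the gap shrinks toward a safe distance
```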
Returning to the policy tree: it might be possible to create a more flexible one, for example, by enumerating spatiotemporal relative position tags for all relevant objects and then performing guided branching.

Industry practices of decision making

Decision-making remains a hot topic in current research. Even classical optimization methods have not been fully explored yet. Machine learning methods could shine and have a disruptive impact, especially with the advent of Large Language Models (LLMs), empowered by techniques like Chain of Thought (CoT) or Monte Carlo Tree Search (MCTS).

Trees

Trees are systematic ways to perform decision-making. Tesla AI Day 2021 and 2022 showcased their decision-making capabilities, heavily influenced by AlphaGo and the subsequent MuZero, to address highly complex interactions.

At a high level, Tesla's approach performs behavior planning (decision making) followed by motion planning. It searches for a convex corridor first and then feeds it into continuous optimization, using spatiotemporal joint planning. This approach effectively addresses scenarios such as narrow passing, a typical bottleneck for path-speed decoupled planning.

Neural network heuristics guided MCTS (source: Tesla AI Day 2021)

Tesla also adopts a hybrid system that combines data-driven and physics-based checks. Starting with defined goals, Tesla's system generates seed trajectories and evaluates key scenarios. It then branches out to create more scenario variants, such as asserting or yielding to a traffic agent. Such an interaction search over the policy tree is showcased in the presentations of the years 2021 and 2022.

One highlight of Tesla's use of machine learning is the acceleration of tree search via trajectory optimization. For each node, Tesla uses physics-based optimization and a neural planner, achieving a 10 ms vs. 100 µs time frame — resulting in a 10x to 100x improvement. The neural network is trained with expert demonstrations and offline optimizers.

Trajectory scoring is performed by combining classical physics-based checks (such as collision checks and comfort analysis) with neural network evaluators that predict intervention likelihood and rate human-likeness. This scoring helps prune the search space, focusing computation on the most promising outcomes.

While many argue that machine learning should be applied to high-level decision-making, Tesla uses ML fundamentally to accelerate optimization and, consequently, tree search.

The Monte Carlo Tree Search (MCTS) method appears to be an ultimate tool for decision-making. Interestingly, those studying Large Language Models (LLMs) are trying to incorporate MCTS into LLMs, while those working on autonomous driving are attempting to replace MCTS with LLMs.

As of roughly two years ago, Tesla's technology followed this approach. However, since March 2024, Tesla's Full Self-Driving (FSD) has switched to a more end-to-end approach, significantly different from their earlier methods.

No trees

We can still consider interactions without explicitly growing trees. Ad-hoc logic can be implemented to perform one-order interaction between prediction and planning. Even one-order interaction can already generate good behavior, as demonstrated by TuSimple.
MPDM, in its original form, is essentially one-order interaction, but executed in a more principled and extendable way.

Multi-order interaction between prediction and planning (source: TuSimple AI Day, in Chinese, translated by author)

TuSimple has also demonstrated the capability to perform contingency planning, similar to the approach proposed in MARC (though MARC can also accommodate a customized risk preference).

Contingency planning (source: TuSimple AI Day, in Chinese, translated by author)

Self-Reflections

After learning the basic building blocks of classical planning systems, including behavior planning, motion planning, and the principled way to handle interaction through decision-making, I have been reflecting on potential bottlenecks in the system and how machine learning (ML) and neural networks (NN) may help. I am documenting my thought process here for future reference and for others who may have similar questions. Note that the information in this section may contain personal biases and speculations.

Why NN in planning?

Let's look at the problem from three different perspectives: in the existing modular pipeline, as an end-to-end (e2e) NN planner, or as an e2e autonomous driving system.

Going back to the drawing board, let's review the problem formulation of a planning system in autonomous driving. The goal is to obtain a trajectory that ensures safety, comfort, and efficiency in a highly uncertain and interactive environment, all while adhering to real-time engineering constraints onboard the vehicle. These factors are summarized as goals, environments, and constraints in the chart below.

The potentials of NN in planning (chart made by author)

Uncertainty in autonomous driving can refer to uncertainty in perception (observation) and in predicting long-term agent behaviors into the future. Planning systems must also handle the uncertainty in future trajectory predictions of other agents. As discussed earlier, a principled decision-making system is an effective way to manage this.

Additionally, a typically overlooked aspect is that planning must tolerate uncertain, imperfect, and sometimes incomplete perception results, especially in the current age of vision-centric and HD map-less driving. Having a Standard Definition (SD) map onboard as a prior helps alleviate this uncertainty, but it still poses significant challenges to a heavily handcrafted planner system. This perception uncertainty was considered a solved problem by Level 4 (L4) autonomous driving companies through the heavy use of Lidar and HD maps. However, it has resurfaced as the industry moves toward mass-production autonomous driving solutions without these two crutches. An NN planner is more robust and can handle largely imperfect and incomplete perception results, which is key to mass-production vision-centric and HD map-less Advanced Driver Assistance Systems (ADAS).

Interaction should be treated with a principled decision-making system such as Monte Carlo Tree Search (MCTS) or a simplified version of MPDM. The main challenge is dealing with the curse of dimensionality (combinatorial explosion) by growing a balanced policy tree with smart pruning through domain knowledge of autonomous driving. MPDM and its variants, in both academia and industry (e.g., Tesla), provide good examples of how to grow this tree in a balanced way.

NNs can also enhance the real-time performance of planners by speeding up motion planning optimization. This can shift the compute load from CPU to GPU, achieving orders of magnitude speedup.
A tenfold increase in optimization speed can fundamentally impact high-level algorithm design, such as MCTS.Trajectories also need to be more human-like. Human likeness and takeover predictors can be trained with the vast amount of human driving data available. It is more scalable to increase the compute pool rather than maintain a growing army of engineering talents.The NN-based planning stack can leverage human-driving data more effectively (Chart created by author)What about e2e NN planners?An end-to-end (e2e) neural network (NN) planner still constitutes a modular autonomous driving (AD) design, accepting structured perception results (and potentially latent features) as its input. This approach combines prediction, decision, and planning into a single network. Companies such as DeepRoute (2022) and Huawei (2024) claim to utilize this method. Note that relevant raw sensor inputs, such as navigation and ego vehicle information, are omitted here.A full autonomous driving stack with an e2e planner (chart made by author)This e2e planner can be further developed into an end-to-end autonomous driving system that combines both perception and planning. This is what Wayve’s LINGO-2 (2024) and Tesla’s FSDv12 (2024) claim to achieve.The benefits of this approach are twofold. First, it addresses perception issues. There are many aspects of driving that we cannot easily model explicitly with commonly used perception interfaces. For example, it is quite challenging to handcraft a driving system to nudge around a puddle of water or slow down for dips or potholes. While passing intermediate perception features might help, it may not fundamentally resolve the issue.Additionally, emergent behavior will likely help resolve corner cases more systematically. The intelligent handling of edge cases, such as the examples above, may result from the emergent behavior of large models.A full autonomous driving stack with a one-model e2e driver (chart made by author)My speculation is that, in its ultimate form, the end-to-end (e2e) driver would be a large vision and action-native multimodal model enhanced by Monte Carlo Tree Search (MCTS), assuming no computational constraints.A world model in autonomous driving, as of 2024 consensus, is typically a multimodal model covering at least vision and action modes (or a VA model). While language can be beneficial for accelerating training, adding controllability, and providing explainability, it is not essential. In its fully developed form, a world model would be a VLA (vision-language-action) model.There are at least two approaches to developing a world model:Video-Native Model: Train a model to predict future video frames, conditioned on or outputting accompanying actions, as demonstrated by models like GAIA-1.Multimodality Adaptors: Start with a pretrained Large Language Model (LLM) and add multimodality adaptors, as seen in models like Lingo-2, RT2, or ApolloFM. These multimodal LLMs are not native to vision or action but require significantly less training resources.A world model can produce a policy itself through the action output, allowing it to drive the vehicle directly. Alternatively, MCTS can query the world model and use its policy outputs to guide the search. 
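A minimal sketch of what "querying the world model" inside the search could look like, under heavy assumptions: the world-model interface, its outputs, and the AlphaZero-style selection rule below are hypothetical placeholders for illustration, not a description of any published system.

```python
import math
import random

# Hypothetical world-model interface: given a state and an action, return an
# imagined next state; given a state, return action priors and a value estimate.
def world_model_step(state, action):
    return state + [action]                                   # toy "imagined" rollout

def world_model_policy_value(state):
    priors = {"yield": 0.5, "assert": 0.3, "brake": 0.2}       # made-up priors
    value = random.uniform(-1, 1)                              # made-up value estimate
    return priors, value

def puct_select(node_stats, priors, c_puct=1.5):
    """AlphaZero-style selection: exploit high value, explore high-prior,
    rarely visited actions."""
    total_visits = sum(s["n"] for s in node_stats.values()) + 1
    def score(a):
        s = node_stats[a]
        q = s["w"] / s["n"] if s["n"] else 0.0
        u = c_puct * priors[a] * math.sqrt(total_visits) / (1 + s["n"])
        return q + u
    return max(priors, key=score)

# One illustrative expansion step from the root.
root_state = []
priors, _ = world_model_policy_value(root_state)
stats = {a: {"n": 0, "w": 0.0} for a in priors}
action = puct_select(stats, priors)
next_state = world_model_step(root_state, action)
_, leaf_value = world_model_policy_value(next_state)
stats[action]["n"] += 1
stats[action]["w"] += leaf_value
print(action, stats)
```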
This World Model-MCTS approach, while much more computationally intensive, could have a higher ceiling in handling corner cases due to its explicit reasoning logic.

Can we do without prediction?

Most current motion prediction modules represent the future trajectories of agents other than the ego vehicle as one or multiple discrete trajectories. It remains a question whether this prediction-planning interface is sufficient or necessary.

In a classical modular pipeline, prediction is still needed. However, a predict-then-plan pipeline definitely caps the upper limit of autonomous driving systems, as discussed in the decision-making section. A more critical question is how to integrate this prediction module more effectively into the overall autonomous driving stack. Prediction should aid decision-making, and a queryable prediction module within an overall decision-making framework, such as MPDM and its variants, is preferred. There are no severe issues with concrete trajectory predictions as long as they are integrated correctly, such as through policy tree rollouts.

Another issue with prediction is that open-loop Key Performance Indicators (KPIs), such as Average Displacement Error (ADE) and Final Displacement Error (FDE), are not effective metrics, as they fail to reflect the impact on planning. Instead, metrics like recall and precision at the intent level should be considered.

In an end-to-end system, an explicit prediction module may not be necessary, but implicit supervision, along with other domain knowledge from a classical stack, can definitely help or at least boost the data efficiency of the learning system. Evaluating the prediction behavior, whether explicit or implicit, will also be helpful in debugging such an e2e system.

Can we do with just nets but no trees?

Conclusions first: for an assistant, neural networks (nets) can achieve very high, even superhuman, performance. For agents, I believe that using a tree structure is still beneficial (though not necessarily a must).

First of all, trees can boost nets. Trees enhance the performance of a given policy, whether it is NN-based or not. In AlphaGo, even with a policy network trained via supervised learning and reinforcement learning, the overall performance was still inferior to the MCTS-based AlphaGo, which integrates the policy network as one component.

Second, nets can distill trees. In AlphaGo, MCTS used both a value network and the reward from a fast rollout policy network to evaluate a node (state or board position) in the tree. The AlphaGo paper also mentioned that while a value function alone could be used, combining the two yielded the best results. The value network essentially distilled the knowledge from the policy rollout by directly learning the state-value pair. This is akin to how humans distill the logical thinking of the slow System 2 into the fast, intuitive responses of System 1. Daniel Kahneman, in his book "Thinking, Fast and Slow," describes how a chess master can quickly recognize patterns and make rapid decisions after years of practice, whereas a novice would require significant effort to achieve similar results. Similarly, the value network in AlphaGo was trained to provide a fast evaluation of a given board position.

Grandmaster-Level Chess Without Search (source: DeepMind, 2024)

Recent papers explore the upper limits of this fast system with neural networks.
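The recipe behind this kind of distillation can be sketched in a few lines. This is a deliberately toy illustration of the idea, not the actual AlphaGo or DeepMind pipeline: a slow search labels positions with values, and a fast network is trained by supervised regression on those labels so it can later evaluate positions without searching.

```python
import torch
import torch.nn as nn

# Toy "nets distill trees" loop: a slow (here, fake) search produces value
# labels for states, and a small network learns to predict them directly.
# The search stub and the 16-dim state encoding are invented for illustration.

def slow_search_value(states: torch.Tensor) -> torch.Tensor:
    # Stand-in for an expensive tree search; here just a fixed nonlinear rule.
    return torch.tanh(states.sum(dim=-1, keepdim=True))

fast_value_net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
opt = torch.optim.Adam(fast_value_net.parameters(), lr=1e-3)

for step in range(200):
    states = torch.randn(64, 16)                  # sampled "positions"
    targets = slow_search_value(states)           # expensive labels, computed offline
    loss = nn.functional.mse_loss(fast_value_net(states), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()

# The distilled net now gives a fast, System-1-style evaluation of new positions.
print(float(loss))
```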
The “chess without search” paper demonstrates that with sufficient data (prepared through tree search using a conventional algorithm), it is possible to achieve grandmaster-level proficiency. There is a clear “scaling law” related to data size and model size, indicating that as the amount of data and the complexity of the model increase, so does the proficiency of the system.So here we are with a power duo: trees boost nets, and nets distill trees. This positive feedback loop is essentially what AlphaZero uses to bootstrap itself to reach superhuman performance in multiple games.The same principles apply to the development of large language models (LLMs). For games, since we have clearly defined rewards as wins or losses, we can use forward rollout to determine the value of a certain action or state. For LLMs, the rewards are not as clear-cut as in the game of Go, so we rely on human preferences to rate the models via reinforcement learning with human feedback (RLHF). However, with models like ChatGPT already trained, we can use supervised fine-tuning (SFT), which is essentially imitation learning, to distill smaller yet still powerful models without RLHF.Returning to the original question, nets can achieve extremely high performance with large quantities of high-quality data. This could be good enough for an assistant, depending on the tolerance for errors, but it may not be sufficient for an autonomous agent. For systems targeting driving assistance (ADAS), nets via imitation learning may be adequate.Trees can significantly boost the performance of nets with an explicit reasoning loop, making them perhaps more suitable for fully autonomous agents. The extent of the tree or reasoning loop depends on the return on investment of engineering resources. For example, even one order of interaction can provide substantial benefits, as demonstrated in TuSimple AI Day.Can we use LLMs to make decisions?From the summary below of the hottest representatives of AI systems, we can see that LLMs are not designed to perform decision-making. In essence, LLMs are trained to complete documents, and even SFT-aligned LLM assistants treat dialogues as a special type of document (completing a dialogue record).Representative AI products as of 2024 (chart made by author)I do not fully agree with recent claims that LLMs are slow systems (System 2). They are unnecessarily slow in inference due to hardware constraints, but in their vanilla form, LLMs are fast systems as they cannot perform counterfactual checks. Prompting techniques such as Chain of Thought (CoT) or Tree of Thoughts (ToT) are actually simplified forms of MCTS, making LLMs function more like slower systems.There is extensive research trying to integrate full-blown MCTS with LLMs. Specifically, LLM-MCTS (NeurIPS 2023) treats the LLM as a commonsense “world model” and uses LLM-induced policy actions as a heuristic to guide the search. LLM-MCTS outperforms both MCTS alone and policies induced by LLMs by a wide margin for complex, novel tasks. The highly speculated Q-star from OpenAI seems to follow the same approach of boosting LLMs with MCTS, as the name suggests.The trend of evolutionBelow is a rough evolution of the planning stack in autonomous driving. It is rough as the listed solutions are not necessarily more advanced than the ones above, and their debut may not follow the exact chronological order. Nonetheless, we can observe general trends. 
Note that the listed representative solutions from the industry are based on my interpretation of various press releases and could be subject to error.One trend is the movement towards a more end-to-end design with more modules consolidated into one. We see the stack evolve from path-speed decoupled planning to joint spatiotemporal planning, and from a predict-then-plan system to a joint prediction and planning system. Another trend is the increasing incorporation of machine learning-based components, especially in the last three stages. These two trends converge towards an end-to-end NN planner (without perception) or even an end-to-end NN driver (with perception).A rough history of evolution of planning (Chart made by author)TakeawaysML as a Tool: Machine learning is a tool, not a standalone solution. It can assist with planning even in current modular designs.Full Formulation: Start with a full problem formulation, then make reasonable assumptions to balance performance and resources. This helps create a clear direction for a future-proof system design and allows for improvements as resources increase. Recall the transition from POMDP’s formulation to engineering solutions like AlphaGo’s MCTS and MPDM.Adapting Algorithms: Theoretically beautiful algorithms (e.g., Dijkstra and Value Iteration) are great for understanding concepts but need adaptation for practical engineering (Value Iteration to MCTS as Dijkstra’s algorithm to Hybrid A-star).Deterministic vs. Stochastic: Planning excels in resolving deterministic (not necessarily static) scenes. Decision-making in stochastic scenes is the most challenging task toward full autonomy.Contingency Planning: This can help merge multiple futures into a common action. It’s beneficial to be aggressive to the degree that you can always resort to a backup plan.End-to-end Models: Whether an end-to-end model can solve full autonomy remains unclear. It may still need classical methods like MCTS. Neural networks can handle assistants, while trees can manage agents.AcknowledgmentsThis blog post is heavily inspired by Dr. Wenchao Ding’s course on planning on Shenlan Xueyuan (深蓝学院).Thanks for the heavy and inspiring discussion with Naiyan Wang and Jingwei Zhao. Thanks to 论文推土机, Invictus, XXF, PYZ and JCL for giving critical feedback to the initial draft. 
Thanks for the insightful discussion with Professor Wei Zhan from Berkeley on trends in academia.

Reference

End-To-End Planning of Autonomous Driving in Industry and Academia: 2022–2023, arXiv 2024
BEVGPT: Generative Pre-trained Large Model for Autonomous Driving Prediction, Decision-Making, and Planning, AAAI 2024
Towards A General-Purpose Motion Planning for Autonomous Vehicles Using Fluid Dynamics, arXiv 2024
TuSimple AI Day, in Chinese with English subtitles on Bilibili, 2023/07
Tech blog on joint spatiotemporal planning by Qcraft, in Chinese on Zhihu, 2022/08
A review of the entire autonomous driving stack, in Chinese on Zhihu, 2018/12
Tesla AI Day Planning, in Chinese on Zhihu, 2022/10
Technical blog on ApolloFM, in Chinese by Tsinghua AIR, 2024
Optimal Trajectory Generation for Dynamic Street Scenarios in a Frenet Frame, ICRA 2010
MP3: A Unified Model to Map, Perceive, Predict and Plan, CVPR 2021
NMP: End-to-end Interpretable Neural Motion Planner, CVPR 2019 oral
Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D, ECCV 2020
CoverNet: Multimodal Behavior Prediction using Trajectory Sets, CVPR 2020
Baidu Apollo EM Motion Planner, Baidu, 2018
AlphaGo: Mastering the game of Go with deep neural networks and tree search, Nature 2016
AlphaZero: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, Science 2017
MuZero: Mastering Atari, Go, chess and shogi by planning with a learned model, Nature 2020
ToT: Tree of Thoughts: Deliberate Problem Solving with Large Language Models, NeurIPS 2023 oral
CoT: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, NeurIPS 2022
LLM-MCTS: Large Language Models as Commonsense Knowledge for Large-Scale Task Planning, NeurIPS 2023
MPDM: Multipolicy decision-making in dynamic, uncertain environments for autonomous driving, ICRA 2015
MPDM2: Multipolicy Decision-Making for Autonomous Driving via Changepoint-based Behavior Prediction, RSS 2015
MPDM3: Multipolicy decision-making for autonomous driving via changepoint-based behavior prediction: Theory and experiment, RSS 2017
EUDM: Efficient Uncertainty-aware Decision-making for Automated Driving Using Guided Branching, ICRA 2020
MARC: Multipolicy and Risk-aware Contingency Planning for Autonomous Driving, RAL 2023
EPSILON: An Efficient Planning System for Automated Vehicles in Highly Interactive Environments, TRO 2021

A Crash Course of Planning for Perception Engineers in Autonomous Driving was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

  • My 30-Day Map Challenge 2023
    by Bryan R. Vallejo on June 29, 2024 at 4:20 pm

    An overview of selected map topics and algorithms. Continue reading on Towards Data Science »

  • From Vision Transformers to Masked Autoencoders in 5 Minutes
    by Essam Wisam on June 29, 2024 at 4:14 pm

    A Straightforward Guide on How NLP Tasks Generalize to Computer Vision

Nearly all natural language processing tasks, ranging from language modeling and masked word prediction to translation and question-answering, were revolutionized when the transformer architecture made its debut in 2017. It didn't take more than 2–3 years for transformers to also excel in computer vision tasks. In this story, we explore two fundamental architectures that enabled transformers to break into the world of computer vision.

Table of Contents
· The Vision Transformer ∘ Key Idea ∘ Operation ∘ Hybrid Architecture ∘ Loss of Structure ∘ Results ∘ Self-supervised Learning by Masking
· Masked Autoencoder Vision Transformer ∘ Key Idea ∘ Architecture ∘ Final Remark and Example

The Vision Transformer

Image from Paper: "An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale"

Key Idea

The vision transformer is simply meant to generalize the standard transformer architecture to process and learn from image input. There is a key idea about the architecture that the authors were transparent enough to highlight: "Inspired by the Transformer scaling successes in NLP, we experiment with applying a standard Transformer directly to images, with the fewest possible modifications."

Operation

It's valid to take "fewest possible modifications" quite literally, because they make essentially zero modifications. What they actually modify is the input structure:

· In NLP, the transformer encoder takes a sequence of one-hot vectors (or, equivalently, token indices) that represent the input sentence/paragraph and returns a sequence of contextual embedding vectors that can be used for further tasks (e.g., classification).
· To generalize to CV, the vision transformer takes a sequence of patch vectors that represent the input image and returns a sequence of contextual embedding vectors that can be used for further tasks (e.g., classification).

In particular, suppose the input images have dimensions (n,n,3). To pass such an image as input to the transformer, the vision transformer does the following:

· Divides it into k² patches for some k (e.g., k=3), as in the figure above.
· Each patch is now (n/k,n/k,3); the next step is to flatten each patch into a vector.
· The patch vector will have dimensionality 3*(n/k)*(n/k). For example, if the image is (900,900,3) and we use k=3, then a patch vector will have dimensionality 300*300*3, representing the pixel values in the flattened patch. In the paper, the authors use k=16, hence the paper's name "An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale". Instead of feeding a one-hot vector representing a word, they feed a vector of pixels representing a patch of the image.

The rest of the operations remain as in the original transformer encoder:

· The patch vectors pass through a trainable embedding layer.
· Positional embeddings are added to each vector to maintain a sense of spatial information in the image.
· The output is num_patches encoder representations (one for each patch), which can be used for classification at the patch or image level.
· More often (and as in the paper), a CLS token is prepended, and the representation corresponding to it is used to make a prediction over the whole image (similar to BERT).

How about the transformer decoder? Well, remember that it's just like the transformer encoder; the difference is that it uses masked self-attention instead of self-attention (but the same input signature remains). In any case, you should expect to rarely use a decoder-only transformer architecture here, because simply predicting the next patch is not usually a task of great interest.
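Putting the encoder-side steps above into code, here is a minimal sketch of the patching and embedding pipeline. The image size, number of patches per side, and embedding dimension are chosen for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

# Split the image into k x k patches, flatten each patch, project it with a
# trainable embedding, prepend a CLS token, and add positional embeddings.

def patchify(images, k):
    """(B, 3, n, n) -> (B, k*k, 3*(n//k)*(n//k)) flattened patch vectors."""
    B, C, n, _ = images.shape
    p = n // k
    patches = images.unfold(2, p, p).unfold(3, p, p)          # (B, C, k, k, p, p)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, k * k, C * p * p)
    return patches

B, n, k, d = 2, 96, 16, 192                                    # batch, image size, patches/side, embed dim
x = torch.randn(B, 3, n, n)
patches = patchify(x, k)                                       # (2, 256, 3*6*6 = 108)

embed = nn.Linear(patches.shape[-1], d)                        # trainable patch embedding
cls = nn.Parameter(torch.zeros(1, 1, d))                       # CLS token
pos = nn.Parameter(torch.zeros(1, k * k + 1, d))               # positional embeddings

tokens = torch.cat([cls.expand(B, -1, -1), embed(patches)], dim=1) + pos
print(tokens.shape)                                            # torch.Size([2, 257, 192])
```

From here, the token sequence goes through standard transformer encoder blocks unchanged.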
Hybrid Architecture

The authors also mention that it's possible to start with a CNN feature map instead of the image itself to form a hybrid architecture (the CNN feeding its output to the vision transformer). In this case, we think of the input as a generic (n,n,p) feature map, and a patch vector will have dimensions (n/k)*(n/k)*p.

Loss of Structure

It may cross your mind that this architecture shouldn't work so well, because it treats the image as a linear structure when it isn't. The authors acknowledge that this is intentional, noting: "The two-dimensional neighborhood structure is used very sparingly…position embeddings at initialization time carry no information about the 2D positions of the patches and all spatial relations between the patches have to be learned from scratch". We will see that the transformer is able to learn this, as evidenced by its good performance in their experiments and, more importantly, by the architecture in the next paper.

Results

The main verdict from the results is that vision transformers tend not to outperform CNN-based models on small datasets, but approach or outperform CNN-based models on larger datasets, and either way require significantly less compute:

Table from Paper: "An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale"

Here we see that for the JFT-300M dataset (which has 300M images), the ViT models pre-trained on the dataset outperform ResNet-based baselines while taking substantially fewer computational resources to pre-train. The largest vision transformer they used (ViT-Huge, with 632M parameters and k=16) used about 25% of the compute of the ResNet-based model and still outperformed it. The performance doesn't even degrade that much with ViT-Large, which uses only <6.8% of the compute. Meanwhile, others have also reported results where the ResNet performed significantly better when trained on ImageNet-1K, which has just 1.3M images.

Self-supervised Learning by Masking

The authors performed a preliminary exploration of masked patch prediction for self-supervision, mimicking the masked language modeling task used in BERT (i.e., masking out patches and attempting to predict them): "We employ the masked patch prediction objective for preliminary self-supervision experiments. To do so we corrupt 50% of patch embeddings by either replacing their embeddings with a learnable [mask] embedding (80%), a random other patch embedding (10%) or just keeping them as is (10%)." With self-supervised pre-training, their smaller ViT-Base/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% over training from scratch, though still about 4% behind supervised pre-training.
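The quoted corruption scheme maps to only a few lines of code. This is a rough sketch under my own assumptions about shapes and batching, not the paper's implementation.

```python
import torch

# Corrupt 50% of patch embeddings; of the corrupted ones, replace 80% with a
# learnable [mask] embedding, 10% with another patch embedding, and keep 10% as is.

def corrupt_patches(patch_emb, mask_emb, corrupt_ratio=0.5):
    B, N, D = patch_emb.shape
    out = patch_emb.clone()
    corrupt = torch.rand(B, N) < corrupt_ratio              # which patches to corrupt
    roll = torch.rand(B, N)                                  # 80/10/10 split among corrupted
    use_mask = corrupt & (roll < 0.8)
    use_rand = corrupt & (roll >= 0.8) & (roll < 0.9)        # the remaining 10% stay untouched

    out[use_mask] = mask_emb                                 # learnable [mask] vector
    rand_idx = torch.randint(0, N, (int(use_rand.sum()),))
    out[use_rand] = patch_emb[0, rand_idx]                   # other patch embeddings (simplified: sampled from one image)
    return out, corrupt

emb = torch.randn(2, 196, 768)
mask_emb = torch.zeros(768)
corrupted, target_mask = corrupt_patches(emb, mask_emb)
print(corrupted.shape, target_mask.float().mean().item())    # ~0.5 of patches corrupted
```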
Masked Autoencoder Vision Transformer

Image from Paper: Masked Autoencoders Are Scalable Vision Learners

Key Idea

As we have seen from the vision transformer paper, the gains from pretraining by masking patches in input images were not as significant as in ordinary NLP, where masked pretraining can lead to state-of-the-art results in some fine-tuning tasks.

This paper proposes a vision transformer architecture involving an encoder and a decoder that, when pretrained with masking, results in significant improvements over the base vision transformer model (as much as 6% improvement compared to training a base-size vision transformer in a supervised fashion).

Image from Paper: Masked Autoencoders Are Scalable Vision Learners

The figure shows some samples of (input, output, ground truth). It's an autoencoder in the sense that it tries to reconstruct the input while filling in the missing patches.

Architecture

Their encoder is simply the ordinary vision transformer encoder we explained earlier. In training and inference, it takes only the "observed" patches.

Meanwhile, their decoder is also simply the ordinary vision transformer encoder, but it takes:

· Mask token vectors for the missing patches
· Encoder output vectors for the known patches

So for an image [[A, B, X], [C, X, X], [X, D, E]], where X denotes a missing patch, the decoder will take the sequence of patch vectors [Enc(A), Enc(B), Vec(X), Enc(C), Vec(X), Vec(X), Vec(X), Enc(D), Enc(E)]. Enc returns the encoder output vector given the patch vector, and Vec(X) is a shared vector used to represent a missing token.

The last layer in the decoder is a linear layer that maps the contextual embeddings (produced by the vision transformer encoder inside the decoder) to a vector of length equal to the patch size. The loss function is the mean squared error between the original patch vector and the one predicted by this layer. In the loss function, we only look at the decoder predictions for the masked tokens and ignore those corresponding to the present ones (i.e., Dec(A), Dec(B), Dec(C), etc.).

Final Remark and Example

It may be surprising that the authors suggest masking about 75% of the patches in the images; BERT would mask only about 15% of the words. They justify this as follows: "Images are natural signals with heavy spatial redundancy — e.g., a missing patch can be recovered from neighboring patches with little high-level understanding of parts, objects, and scenes. To overcome this difference and encourage learning useful features, we mask a very high portion of random patches."

Want to try it out yourself? Check out this demo notebook by NielsRogge.

This is all for this story. We went through a journey to understand how fundamental transformer models generalize to the computer vision world. I hope you have found it clear, insightful, and worth your time.

References:
[1] Dosovitskiy, A. et al. (2021) An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale, arXiv.org. Available at: https://arxiv.org/abs/2010.11929 (Accessed: 28 June 2024).
[2] He, K. et al. (2021) Masked Autoencoders Are Scalable Vision Learners, arXiv.org. Available at: https://arxiv.org/abs/2111.06377 (Accessed: 28 June 2024).

From Vision Transformers to Masked Autoencoders in 5 Minutes was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

  • AutoRound: Accurate Low-bit Quantization for LLMs
    by Benjamin Marie on June 29, 2024 at 3:47 pm

    Between quantization-aware training and post-training quantization. Continue reading on Towards Data Science »

  • Data Engineering, Redefined
    by Bernd Wessely on June 29, 2024 at 3:41 pm

    How data engineering is practiced today and why we should redefine itWhat is the Problem with Data Engineering Today?If you search for a clear definition of what data engineering actually is, you’ll get so many different proposals that it leaves you with more questions than answers.But as I want to explain what needs to be redefined, I’ll better use one of the more popular definitions that clearly represents the current state and mess we all face:Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering. A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning.— Joe Reis and Matt Housley in “Fundamentals of Data Engineering”That is a fine definition and now, what is the mess?Let’s look at the first sentence, where I highlight the important part that we should delve into:…take in raw data and produce high-quality, consistent information that supports downstream use cases…Accordingly, data engineering takes raw data and transforms it to (produces) information that supports use cases. Only two examples are given, like analysis or machine learning, but I would assume that this includes all other potential use cases.The data transformation is what drives me and all my fellow data engineers crazy. Data transformation is the monumental task of applying the right logic to raw data to transform it into information that enables all kinds of intelligent use cases.To apply the right logic is actually the main task of applications. Applications are the systems that implement the logic that drives the business (use cases) — I continue to refer to it as an application and implicitly also mean services that are small enough to fit into the microservices architecture. The applications are usually built by application developers (software engineers if you like). But to meet our current definition of data engineering, the data engineers must now implement business logic. The whole mess starts with this wrong approach.I have written an article about that topic, where I stress that “Data Engineering is Software Engineering…”. Unfortunately, we already have millions of brittle data pipelines that have been implemented by data engineers. These pipelines sometimes — or regrettably, even oftentimes — do not have the same software quality that you would expect from an application. But the bigger problem is the fact that these pipelines often contain uncoordinated and therefore incorrect and sometimes even hidden business logic.However, the solution is not that all data engineers should now be turned into application developers. Data engineers still need to be qualified software engineers, but they should by no means turn into application developers. Instead, I advocate a redefinition of data engineering as “all about the movement, manipulation, and management of data”. This definition comes from the book “What Is Data Engineering? by Lewis Gavin (O’Reilly, 2019)”. 
However, and this is a clear difference to current practices, we should limit manipulation to purely technical ones.We should no longer allow the development and use of business logic outside of applications.To be very clear, data engineering should not implement business logic. The trend in modern application development is actually to keep stateless application logic separate from state management. We do not put application logic in the database and we do not put persistent state (or data) in the application. In the functional programming community they joke “We believe in the separation of church and state”. If you now think, “Where is the joke?”, then this might help. But now without any jokes: “We should believe in the separation of business logic and business data”. Accordingly, I believe we should explicitly leave data concerns to the data engineer and logic concerns to the application developer.What are “technical manipulations” that still are allowed for the data engineer, you might ask. I would define this as any manipulation to data that does not change or add new business information. We can still partition, bucket, reformat, normalize, index, technically aggregate, etc., but as soon as real business logic is necessary, we should address it to the application developers in the business domain responsible for the respective data set.Why Data Engineering Got SidetrackedThe simple and obvious principle of separation of concerns – Image by authorWhy have we moved away from this simple and obvious principle?I think this shift can be attributed to the rapid evolution of databases into multifunctional systems. Initially, databases served as simple, durable storage solutions for business data. They provided very helpful abstractions to offload functionality to persist data from the real business logic in the applications. However, vendors quickly enhanced these systems by embedding software development functionality in their database products to attract application developers. This integration transformed databases from mere data repositories into comprehensive platforms, incorporating sophisticated programming languages and tools for full-fledged software development. Consequently, databases evolved into powerful transformation engines, enabling data specialists to implement business logic outside traditional applications. The demand for this shift was further amplified by the advent of large-scale data warehouses, designed to consolidate scattered data storage — a problem that became more pronounced with the rise of microservices architecture. This technological progression made it practical and efficient to combine business logic with business data within the database.In the end, not all software engineers succumbed to the temptation of bundling their application logic within the database, preserving hope for a cleaner separation. As data continued to grow in volume and complexity, big data tools like Hadoop and its successors emerged, even replacing traditional databases in some areas. This shift presented an opportunity to move business logic out of the database and back to application developers. However, the notion that data engineering encompasses more than just data movement and management had already taken root. 
We had developed numerous tools to support business intelligence, advanced analytics, and complex transformation pipelines, allowing the implementation of sophisticated business logic. These tools have become integral components of the modern data stack (MDS), establishing data engineering as its own discipline. The MDS comprises a comprehensive suite of tools for data wrangling and transformation, but these tools remain largely unfamiliar to the typical application developer or software engineer. Despite the potential to "turn the database inside out" and relocate business logic back to the application layer, we failed to fully embrace this opportunity. The unfortunate practice of implementing business logic remains with data engineers to this day.

How Data Engineering Can Be Redefined

Let's more precisely define what "all about the movement, manipulation, and management of data" involves.

Data engineers can and should provide the most mature tools and platforms to be used by application developers to handle data. This is also the main idea behind the "self-serve data platform" in the data mesh. However, the responsibility of defining and maintaining the business logic remains within the business domains. The people there know the business far better, and know what business transformation logic should be applied to the data.

Okay, so what about these nice ideas like data warehouse systems and, more generally, the overall "data engineering lifecycle" as defined by Joe Reis and Matt Housley?

Data Engineering Lifecycle completed – Image by author, based on Joe Reis' and Matt Housley's view

"Data pipelines" are really just agents between applications, but if business logic is to be implemented in the pipelines, these systems should be considered applications in their own right within the enterprise, applications that should be maintained by application developers in the business domains and not by data engineers.

The nice and clear flow of data from source ("Generation", as they name it in their reference) to serving for consumers is actually an idealized simplification. The output from "Reverse ETL" is data that again serves as input for downstream applications. And not only "Reverse ETL" but also "Analytics" and "Machine Learning" create output to be consumed by analytical and operational downstream applications. The journey of data through the organization does not end with the training or application of an ML model or the creation of a business analysis. The results of these applications need to be further processed within the company and therefore integrated into the overall business process flow (indicated by the blue boxes and arrows). This view blurs the rigid distinction still practiced between the operational and analytical planes, the elimination of which I have described as a core target of data mesh.

So what is the real task of the data engineer? We should evolve the enterprise architecture to enable application developers to take the business logic back into their applications, while allowing seamless exchange and sharing of data between these applications. Data engineers effectively build the "Data Infrastructure" and all the tooling that integrates business logic through a data mesh of interconnected business applications. This is far more complicated than just providing a collection of different database types, a data warehouse, or a data lake, whether on-premises or in the cloud.
It comprises the implementation of tools and infrastructure, the definition and application of governance and data sharing principles (including modeling) that ultimately enables the universal data supply in the enterprise. However, the implementation of business logic should definitely be outside the scope of data engineering.Data Infrastructure as the data engineers’ services to be used by all application developers — Image by authorThis redefined data engineering practice is in line with the adapted data mesh approach that I proposed in this three-part series:Challenges and Solutions in Data Mesh – Part 1Challenges and Solutions in Data Mesh — Part 2Challenges and Solutions in Data Mesh — Part 3Data Engineering, Redefined was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

 
