Li Auto to Fully Push No-Map NOA in July, Releases New End-to-End + VLM Autonomous Driving Architecture

May 12, 2024

The no-map NOA, usable nationwide, will be fully pushed to all Li Auto AD Max users in July

Fully automatic AES and all-round low-speed AEB will also be pushed in July

A new autonomous driving technology architecture based on end-to-end models, VLM visual language models, and world models will be released

The early bird plan for end-to-end + VLM is launched

On July 5, 2024, at the 2024 Intelligent Driving Summer Conference, Li Auto announced that it will fully push the nationwide-usable no-map NOA to all Li Auto AD Max users in July, along with the fully automatic AES (Automatic Emergency Steering) and all-round low-speed AEB (Automatic Emergency Braking) functions. At the same time, Li Auto released a new autonomous driving technology architecture based on end-to-end models, VLM visual language models, and world models, and launched an early bird plan for the new architecture.

In terms of intelligent driving products, the no-map NOA no longer relies on high-precision maps or prior information and can be used wherever navigation coverage exists across the country. With spatiotemporal joint planning, it delivers a smoother detour experience. The no-map NOA can also select routes over an ultra-long visual range and passes smoothly even through complex intersections. At the same time, it respects users' psychological safety boundaries, using centimeter-level micro-maneuvers to deliver a smooth and reassuring intelligent driving experience. In addition, the soon-to-be-pushed AES function triggers fully automatically without human assistance, avoiding more high-risk accidents, while the all-round low-speed AEB further expands the covered active-safety risk scenarios, effectively reducing the frequent minor scrapes and scratches of low-speed maneuvering.

In terms of autonomous driving technology, the new architecture is composed of end-to-end models, VLM visual language models, and world models. The end-to-end model handles conventional driving behaviors: a single model runs from sensor input to driving trajectory output, making information transmission, inference computation, and model iteration more efficient and driving behavior more human-like. The VLM visual language model has strong logical reasoning ability; it can understand complex road conditions, navigation maps, and traffic rules, and handle difficult unknown scenarios. Meanwhile, the autonomous driving system learns and is tested in a virtual environment built on the world model, which combines reconstruction and generation to build test scenarios that conform to real-world patterns and generalize well.

Li Auto's Senior Vice President of Product, Fan Haoyu, said: "Li Auto has always insisted on polishing the product experience together with users. From pushing the feature to the first thousand experience users in May this year, to expanding to more than ten thousand experience users in June, we have accumulated over a million kilometers of no-map NOA driving mileage across the country. After the full push, all 240,000 Li Auto AD Max owners will be using the country's current leading intelligent driving product, which is a sincere and substantial upgrade."

Li Auto's Vice President of Intelligent Driving R&D, Lang Xianpeng, said: "From the start of full-stack in-house development in 2021 to the release of a new autonomous driving technology architecture today, Li Auto's autonomous driving R&D has never stopped exploring. We have combined the end-to-end model with the VLM visual language model to deliver the industry's first dual-system deployment on the vehicle side, and for the first time successfully deployed a VLM visual language model on a vehicle-side chip. This industry-leading new architecture is a milestone technical breakthrough in the field of autonomous driving."

No-Map NOA: Four Major Capability Upgrades for Efficient Driving Nationwide


The no-map NOA to be rolled out in July brings four major capability upgrades that comprehensively enhance the user experience. First, thanks to fully enhanced perception, understanding, and road-structure construction capabilities, the no-map NOA no longer relies on prior information. Users can activate NOA wherever city navigation coverage exists across the country, and even in more unusual settings such as narrow alleys and rural roads.

Second, efficient spatiotemporal joint planning makes avoiding and detouring around road obstacles smoother. Spatiotemporal joint planning synchronizes lateral and longitudinal planning: by continuously predicting the spatial interaction between the ego vehicle and other vehicles, it plans all drivable trajectories within the future time window. Having learned from high-quality samples, the vehicle quickly selects the optimal trajectory and executes detour maneuvers decisively and safely.
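
The idea of planning jointly over space and time can be sketched as a simple candidate search. The sketch below is an illustrative toy, not Li Auto's implementation: it samples trajectories over a few lateral offsets and speed changes, predicts a lead vehicle's motion over the time window, and keeps the lowest-cost feasible trajectory. All numbers and cost weights are invented for illustration.

```python
def predict_other(t, x0=20.0, v=5.0):
    """Constant-velocity prediction of a slower lead vehicle's position."""
    return x0 + v * t

def plan(horizon=5.0, dt=0.5, ego_v=10.0):
    """Jointly search lateral offsets and speed changes; return the best pair."""
    best, best_cost = None, float("inf")
    for lateral in (-1.5, 0.0, 1.5):      # metres of lateral offset (detour)
        for dv in (-3.0, 0.0, 2.0):       # m/s speed change
            cost, feasible, t = 0.0, True, dt
            while t <= horizon:           # roll out the future time window
                ego_x = (ego_v + dv) * t
                gap = abs(ego_x - predict_other(t))
                if lateral == 0.0 and gap < 5.0:   # same-lane near-collision
                    feasible = False
                    break
                cost += 0.3 * abs(lateral) + 0.2 * abs(dv)  # comfort penalty
                t += dt
            if feasible and cost < best_cost:
                best, best_cost = (lateral, dv), cost
    return best

# Staying in lane at constant speed would close on the lead vehicle, so the
# cheapest feasible plan here is a small lateral detour at unchanged speed.
print(plan())  # → (-1.5, 0.0)
```

Because lateral and longitudinal choices are searched together, a detour can win over hard braking when its combined cost is lower, which is the point of joint rather than decoupled planning.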

At complex urban intersections, the no-map NOA's route-selection capability has also improved significantly. It combines a BEV visual model with navigation-matching algorithms, perceiving real-time changes in curbs, road-surface arrow markings, and intersection features, and fully fusing lane structure with navigation features, which effectively solves the problem of structuring complex intersections. With the ability to select routes over an ultra-long visual range, intersection passage is more stable.

At the same time, the no-map NOA takes users' psychological safety boundaries seriously, delivering a smoother and more reassuring driving experience with centimeter-level micro-maneuvering. Through an occupancy network with early fusion of LiDAR and vision, the vehicle can recognize irregular obstacles over a larger range and with higher perception accuracy, enabling earlier and more accurate prediction of other road users' behavior. Thanks to this, the vehicle maintains a reasonable distance from other road users and times its acceleration and deceleration more appropriately, effectively enhancing the user's sense of safety while driving.

Advanced Active Safety Capabilities and Expanded Coverage Scenarios


In the field of active safety, Li Auto has built a comprehensive library of safety risk scenarios, classified by frequency and danger level, and continuously broadens the coverage of risk scenarios. In July, it will push the fully automatic AES and all-round low-speed AEB functions to users.

To address physical-limit scenarios in which AEB alone cannot avoid an accident, Li Auto has introduced the fully automatic AES (Automatic Emergency Steering) function. At high speeds, the reaction time left for the active safety system is extremely short; in some cases, even with AEB triggered and full braking force applied, the vehicle cannot stop in time. At that point, AES triggers promptly and automatically performs an emergency steering maneuver, without any manual steering input, to avoid the target ahead, effectively preventing accidents in extreme scenarios.

The all-round low-speed AEB targets parking and low-speed driving scenarios, providing 360-degree active safety protection. In complex underground parking environments, obstacles such as pillars, pedestrians, and other vehicles raise the risk of scrapes. The all-round low-speed AEB effectively identifies forward, backward, and lateral collision risks and brakes urgently in time, giving users a more reassuring daily driving experience.
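
Why steering can succeed where braking cannot at high speed follows from basic kinematics. The sketch below is a generic physics illustration, not Li Auto's triggering logic: the deceleration, lateral-acceleration, and offset figures are assumed round numbers.

```python
import math

def braking_distance(v, decel=8.0):
    """Distance needed to stop from speed v (m/s) under full braking (m/s^2)."""
    return v * v / (2.0 * decel)

def swerve_distance(v, lateral=2.0, lat_accel=5.0):
    """Longitudinal distance covered while steering `lateral` metres aside."""
    t = math.sqrt(2.0 * lateral / lat_accel)  # time to complete the offset
    return v * t

def choose_maneuver(v, gap):
    """Prefer braking; fall back to steering when braking cannot stop in time."""
    if braking_distance(v) <= gap:
        return "brake"
    if swerve_distance(v) <= gap:
        return "steer"
    return "brake"  # mitigate impact even if a collision is unavoidable

# At 120 km/h with 50 m of free space, full braking needs roughly 69 m, but a
# 2 m swerve completes in about 30 m, so steering is the viable escape.
print(choose_maneuver(120 / 3.6, 50.0))  # → steer
```

The crossover is the essence of AES: braking distance grows with the square of speed, while the distance consumed by a fixed lateral offset grows only linearly, so above some speed an evasive swerve fits into a gap that braking cannot.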

Breakthroughs in autonomous driving technology, with dual systems for enhanced intelligence

Li Auto's new autonomous driving technology architecture is inspired by Nobel laureate Daniel Kahneman's theory of fast and slow systems, simulating human thinking and decision-making processes in the field of autonomous driving to form a more intelligent, more human-like driving solution.

The fast system, also known as System 1, is adept at handling simple tasks and is the intuition formed by human experience and habits, sufficient to deal with 95% of the routine scenarios when driving a vehicle. The slow system, also known as System 2, is the logical reasoning, complex analysis, and computational ability formed by humans through deeper understanding and learning, used to solve complex and even unknown traffic scenarios when driving a vehicle, accounting for about 5% of daily driving. System 1 and System 2 work together to ensure high efficiency in most scenarios and high limits in a few scenarios, forming the foundation for human cognition, understanding of the world, and decision-making.

Based on the fast-and-slow-system theory, Li Auto has formed a prototype of its autonomous driving algorithm architecture. System 1 is implemented by the end-to-end model, which responds quickly and efficiently: it receives sensor input and directly outputs a driving trajectory to control the vehicle. System 2 is implemented by the VLM visual language model, which receives sensor input, performs logical reasoning, and outputs decision information to System 1. The capabilities of this dual-system stack are also trained and validated in the cloud using the world model.
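
The dual-system division of labor can be sketched as a simple dispatch loop: a fast policy runs on every frame, while a slower reasoner runs at a lower frequency and feeds decision hints forward. Everything here (the class names, the "flooded_road" scene tag, the rates and speeds) is an invented illustration of the idea, not the production design.

```python
class System1:
    """Fast path: per-frame sensor input -> trajectory, biased by any hint."""
    def act(self, frame, hint=None):
        speed = 10.0 if hint is None else hint["target_speed"]
        return {"frame": frame, "target_speed": speed}

class System2:
    """Slow path: reasons about hard scenes and emits decision information."""
    def reason(self, frame):
        if frame.get("scene") == "flooded_road":
            return {"target_speed": 4.0}   # slow down on a risky surface
        return None

def drive(frames, slow_every=5):
    s1, s2, hint, out = System1(), System2(), None, []
    for i, frame in enumerate(frames):
        if i % slow_every == 0:            # System 2 runs at lower frequency
            hint = s2.reason(frame) or hint
        out.append(s1.act(frame, hint))    # System 1 runs every frame
    return out

log = drive([{"scene": "normal"}] * 5 + [{"scene": "flooded_road"}] * 5)
print(log[-1]["target_speed"])  # → 4.0
```

The key property the sketch preserves is asymmetry: System 1 never waits on System 2, so routine control stays fast, while System 2's slower conclusions still steer behavior once available.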

Efficient End-to-End Model

The input to the end-to-end model consists mainly of camera and LiDAR data. Multi-sensor features are extracted and fused through a CNN backbone network and projected into BEV space. To strengthen the model's representation ability, Li Auto has also designed a memory module with both temporal and spatial memory. The model's input additionally includes vehicle status information and navigation information; after being encoded by the Transformer model, these are decoded together with the BEV features to detect dynamic obstacles, road structure, and general obstacles, and to plan the driving trajectory.
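
The dataflow just described (sensor features, BEV projection, temporal memory, then decoding with ego state and navigation) can be sketched with stub functions standing in for the learned components. All names, shapes, and numbers below are illustrative assumptions, not the real network.

```python
from collections import deque

def backbone(camera, lidar):
    """Stand-in for the CNN backbone: extract per-sensor features."""
    return {"cam_feat": sum(camera), "lidar_feat": sum(lidar)}

def to_bev(feats):
    """Stand-in for projecting fused sensor features into BEV space."""
    return feats["cam_feat"] + feats["lidar_feat"]

class Memory:
    """Toy temporal memory: keeps the last N BEV features."""
    def __init__(self, n=4):
        self.buf = deque(maxlen=n)
    def update(self, bev):
        self.buf.append(bev)
        return sum(self.buf) / len(self.buf)   # temporally smoothed feature

def decode(bev_feat, ego_state, navigation):
    """Stand-in for the Transformer decoder: outputs a short trajectory."""
    step = ego_state["speed"] * 0.1             # forward progress per step
    heading = 1 if navigation == "left" else 0  # navigation biases the path
    return [(step * i, heading * 0.1 * i) for i in range(1, 4)]

memory = Memory()
bev = to_bev(backbone(camera=[1, 2], lidar=[3]))
smoothed = memory.update(bev)
trajectory = decode(smoothed, ego_state={"speed": 5.0}, navigation="left")
print(trajectory)
```

The shape of the pipeline, not the arithmetic, is the point: one forward pass from raw inputs to a trajectory, with no rule-based module in between.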

Multi-task outputs are produced by a single integrated model with no hand-written rules in between, so the end-to-end model has significant advantages in information transmission, inference computation, and model iteration. In real driving, the end-to-end model demonstrates stronger general obstacle understanding, beyond-line-of-sight navigation, and road-structure understanding, as well as more human-like path planning.

High-Ceiling VLM Visual Language Model


The VLM (Visual Language Model) is built around a unified Transformer model. Prompt text is encoded by a tokenizer, while visual information from the front-facing camera images and navigation map data is encoded and aligned with the text through a text-image alignment module. The model then performs autoregressive inference to output its understanding of the environment, driving decisions, and a driving trajectory, which are passed to System 1 to assist with vehicle control.
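
The inference loop just described (tokenize the prompt, align with image features, then autoregressively emit output tokens) can be sketched with a toy stand-in model. The vocabulary, the hash-based tokenizer, and the hand-written `step` function are placeholders for the real Transformer, chosen only to make the loop runnable.

```python
VOCAB = ["<end>", "slow_down", "keep_lane", "wet_road"]

def tokenize(prompt):
    """Stand-in tokenizer: map each word to a small integer id."""
    return [hash(w) % 97 for w in prompt.split()]

def align(text_tokens, image_feat):
    """Stand-in for text-image alignment: mix both modalities into one state."""
    return sum(text_tokens) + image_feat

def step(state, generated):
    """Stand-in for one Transformer forward pass: pick the next token id."""
    if "wet_road" not in generated:
        return VOCAB.index("wet_road")     # first describe the environment
    if "slow_down" not in generated:
        return VOCAB.index("slow_down")    # then emit a driving decision
    return VOCAB.index("<end>")

def generate(prompt, image_feat, max_tokens=8):
    state, out = align(tokenize(prompt), image_feat), []
    while len(out) < max_tokens:           # autoregressive decoding loop
        tok = VOCAB[step(state, out)]
        if tok == "<end>":
            break
        out.append(tok)                    # each token conditions the next
    return out

print(generate("describe the road ahead", image_feat=42))
```

Output here is an environment description followed by a decision, mirroring how the VLM's reasoning is handed down to System 1 as decision information rather than direct control.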

Li Auto's VLM has 2.2 billion parameters and demonstrates a strong understanding of complex traffic environments in the physical world; it can handle unknown scenarios encountered for the first time. The VLM can recognize environmental information such as road-surface condition and lighting, prompting System 1 to adjust vehicle speed for safe and comfortable driving. It also has a stronger ability to understand navigation maps, working with the vehicle to correct navigation and prevent wrong turns while driving. In addition, the VLM can comprehend complex traffic rules such as bus lanes, tidal lanes, and time-restricted roads, and make reasonable decisions while driving.

World Model Combining Reconstruction and Generation


Li Auto's world model combines reconstruction and generation. It uses 3DGS (3D Gaussian Splatting) to reconstruct real data and supplements it with generative models to produce new viewpoints. During scene reconstruction, dynamic and static elements are separated: the static environment is reconstructed, while dynamic objects are reconstructed and rendered from new viewpoints. After re-rendering, the scene forms a 3D physical world in which dynamic assets can be freely edited and adjusted, achieving partial generalization of the scene. Compared with reconstruction, generative models generalize more strongly: conditions such as weather, lighting, and traffic flow can be specified to generate new scenes that conform to real-world patterns, for evaluating the autonomous driving system's adaptability under various conditions.
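
The reconstruct-then-generate recipe can be sketched schematically: separate a logged scene into static and dynamic elements, reuse the static reconstruction as-is, then enumerate variants over generated conditions and edited traffic. The dictionary-based scene representation below is a deliberate simplification of a 3DGS reconstruction, used only to show the workflow.

```python
import itertools

def reconstruct(log):
    """Separate a logged scene into static environment and dynamic assets."""
    return {
        "static": [o for o in log if o["type"] == "static"],
        "dynamic": [o for o in log if o["type"] == "dynamic"],
    }

def generate_variants(scene, weathers, densities):
    """Generate new scenarios by varying conditions and dynamic traffic."""
    for weather, density in itertools.product(weathers, densities):
        yield {
            "static": scene["static"],              # reconstructed as-is
            "dynamic": scene["dynamic"] * density,  # edited traffic flow
            "weather": weather,                     # generated condition
        }

log = [{"type": "static", "id": "road"}, {"type": "dynamic", "id": "car_1"}]
scene = reconstruct(log)
variants = list(generate_variants(scene, ["rain", "night"], [1, 3]))
print(len(variants))  # 2 weathers x 2 traffic densities = 4 scenarios
```

One reconstructed drive thus fans out into many evaluation scenarios, which is what gives the combined approach its test coverage beyond the raw logged data.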

The combination of reconstruction and generation creates a superior virtual environment for the autonomous driving system to learn and be tested in, giving the system an efficient closed-loop iteration capability and ensuring its safety and reliability.
