Classical planning is one of the earliest subareas of AI, in which domain knowledge is leveraged to perform long-horizon reasoning and solve complex planning problems. Despite these impressive successes, one limitation of classical planners is the assumption that state transitions take place in closed worlds, which makes them less robust to unforeseen situations in open worlds. Classical planners also assume that the current world state is provided beforehand, which can be unrealistic in practice. To address these two limitations, we propose a novel framework, called DKPrompt, that visually grounds a classical planner through a vision-language model (VLM) for open-world planning. A unique feature of DKPrompt is its use of the action description knowledge of classical planners to tailor VLM prompts before and after each action, equipping classical planners with active perception and situational awareness. Results from quantitative experiments show that DKPrompt outperforms naive classical planners, pure VLM-based planners, and several other competitive baselines in task completion rate.
An overview of DKPrompt. By querying the robot's current observation against the domain knowledge~(i.e., action preconditions and effects) as VQA tasks, DKPrompt can call the classical planner to generate a new valid plan from the updated world state. Note that DKPrompt queries only about predicates. The left shows how DKPrompt checks every precondition of the action to be executed next; the right shows how it verifies that all expected action effects hold after execution. When any precondition or effect is unsatisfied, the planner's world state is updated and replanning is triggered.
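The check-and-replan loop described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: `vqa` stands in for a real VLM answering yes/no predicate questions from an image, `planner` stands in for a real classical (e.g., PDDL) planner, and the world, actions, and predicate names are all hypothetical.

```python
# Toy world state: predicate name -> truth value. In the real system, truth
# values come from VLM queries over camera observations, not a dictionary.
world = {"door_open": False, "holding_cup": False, "cup_on_table": False}

# Hypothetical action descriptions with preconditions ("pre") and effects ("eff").
actions = {
    "open_door": {"pre": [], "eff": ["door_open"]},
    "pick_cup": {"pre": ["door_open"], "eff": ["holding_cup"]},
    "place_cup": {"pre": ["holding_cup"], "eff": ["cup_on_table"]},
}


def vqa(obs, predicate):
    # Stub for one VQA query: "Is <predicate> true in the current observation?"
    # A real system would prompt a vision-language model here.
    return obs[predicate]


def execute(name):
    # Stub action execution: directly applies the action's effects to the world.
    for e in actions[name]["eff"]:
        world[e] = True


def planner(state, goal):
    # Stub classical planner: returns the fixed action sequence, skipping
    # actions whose effects already hold in the (updated) state.
    order = ["open_door", "pick_cup", "place_cup"]
    return [a for a in order
            if not all(state.get(e, False) for e in actions[a]["eff"])]


def dkprompt(goal):
    plan = planner(dict(world), goal)
    log = []
    while plan:
        a = plan[0]
        # Before execution: verify every precondition with a VQA query.
        if not all(vqa(world, p) for p in actions[a]["pre"]):
            plan = planner(dict(world), goal)  # replan with updated state
            continue
        execute(a)
        log.append(a)
        # After execution: verify every expected effect with a VQA query.
        if all(vqa(world, e) for e in actions[a]["eff"]):
            plan = plan[1:]
        else:
            plan = planner(dict(world), goal)  # replan with updated state
    return log
```

In this toy run, `dkprompt(["cup_on_table"])` executes the three actions in order; the key point is the structure of the loop, where a precondition or effect that the VQA check reports as unsatisfied sends control back to the planner with the updated state.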