
[Question] Expert score for maze2d environment may be wrong #215

Open

onceagain8 opened this issue Jul 8, 2023 · 2 comments

onceagain8 commented Jul 8, 2023

Summary

  1. The expert scores reported for the maze2d environments are computed incorrectly.
  2. The incorrect scores result from the expert (waypoint) controller not being invoked properly across episodes.
  3. The true expert scores should be higher than the currently reported ones.

Description:

Environment: maze2d

If you use the provided script (scripts/reference_scores/maze2d_controller.py) to compute the expert policy's score, it yields inaccurate (too low) results:

import gym
import numpy as np
import d4rl  # noqa: F401 -- importing d4rl registers the maze2d environments
from d4rl.pointmaze import waypoint_controller

# `args` comes from the script's argparse setup (env_name, num_episodes)
env = gym.make(args.env_name)
env.seed(0)
np.random.seed(0)
controller = waypoint_controller.WaypointController(env.str_maze_spec)

ravg = []
for _ in range(args.num_episodes):
    s = env.reset()
    returns = 0
    for t in range(env._max_episode_steps):
        position = s[0:2]
        velocity = s[2:4]
        # PD waypoint controller: follows a planned path to the current target
        act, done = controller.get_action(position, velocity, env.get_target())
        s, rew, _, _ = env.step(act)
        returns += rew
    ravg.append(returns)
print(args.env_name, 'returns', np.mean(ravg))

The WaypointController (the expert policy) behaves correctly only in the first episode; in subsequent episodes it typically fails to reach the goal in the maze2d environment.

Why does this happen?

The issue lies in the expert policy implemented in d4rl/pointmaze/waypoint_controller.py. Its get_action function, the expert's action-selection routine, contains the following snippet:

if np.linalg.norm(self._target - np.array(self.gridify_state(target))) > 1e-3: 
    #print('New target!', target, 'old:', self._target)
    self._new_target(location, target)

This check means the waypoints are recomputed only when the target changes.

Considering the code in scripts/reference_scores/maze2d_controller.py, self._new_target() is therefore executed only at the beginning of the first episode, because env.reset() does not change the target. In every subsequent episode the waypoints are never recomputed; the stale waypoints from the first trajectory are reused, and the supposedly optimal policy fails to reach the goal.
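
A minimal fix is sketched below (assuming the script above; WaypointController initializes its cached target to a sentinel value, so a freshly constructed controller always re-plans on its first get_action call):

ravg = []
for _ in range(args.num_episodes):
    s = env.reset()
    # Rebuild the controller each episode: this discards the stale waypoints
    # and forces a fresh path to be planned from the new start position.
    controller = waypoint_controller.WaypointController(env.str_maze_spec)
    returns = 0
    for t in range(env._max_episode_steps):
        position, velocity = s[0:2], s[2:4]
        act, done = controller.get_action(position, velocity, env.get_target())
        s, rew, _, _ = env.step(act)
        returns += rew
    ravg.append(returns)

Calling controller._new_target(s[0:2], env.get_target()) right after each env.reset() should work equally well, since it recomputes the waypoints directly.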

Experiment

After adding env.render() to scripts/reference_scores/maze2d_controller.py, I observed that the expert strategy indeed fails to reach the target point. A video has been uploaded to Google Drive:

https://drive.google.com/file/d/13OF_z3hBAzcxX5upg6byVZrteWnJeFao/view?usp=sharing.

After fixing the code so that the waypoints are recomputed each episode, I re-evaluated the expert strategy across the maze2d environments. The results are presented below:

env_name             maze2d-umaze-v1   maze2d-medium-v1   maze2d-large-v1
expert policy (new)  223.48            420.48             551.23
expert policy (old)  161.86            277.39             273.99
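
These expert returns also drive d4rl's score normalization: env.get_normalized_score() rescales a raw return against per-task reference values stored in d4rl/infos.py (REF_MIN_SCORE / REF_MAX_SCORE), and the maze2d REF_MAX_SCORE entries match the "expert policy (old)" row above. A minimal sketch of that normalization (the helper name here is illustrative, not part of d4rl's API):

from d4rl import infos

def normalized_score(env_name, ret):
    # Rescale a raw return against d4rl's per-task reference returns;
    # papers usually report this value multiplied by 100.
    lo = infos.REF_MIN_SCORE[env_name]
    hi = infos.REF_MAX_SCORE[env_name]
    return (ret - lo) / (hi - lo)

# With the old reference values, the corrected expert return would look
# "super-expert", i.e. its normalized score lands well above 100:
print(100 * normalized_score('maze2d-umaze-v1', 223.48))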
onceagain8 changed the title from "[Bug Report] Expert score for maze2d environment may be wrong" to "[Question] Expert score for maze2d environment may be wrong" on Jul 9, 2023
HamedDi81 commented

Hi, I think you're right. I trained a decision transformer in the maze2d-medium-dense-v1 environment and computed the normalized score with env.get_normalized_score(average return over 100 episodes). However, I obtained a score of 56, which does not align with the maximum score of 35 reported in the paper "QDT: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL".
I wanted to know if you have calculated the expert score for maze2d-medium-dense-v1?
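
For reference, that evaluation loop looks roughly like the following (a sketch; policy() is a placeholder for the trained model, and the old gym step API used elsewhere in this thread is assumed):

import gym
import numpy as np
import d4rl  # noqa: F401 -- registers maze2d-medium-dense-v1

env = gym.make('maze2d-medium-dense-v1')
rets = []
for _ in range(100):
    s = env.reset()
    ret = 0.0
    for _ in range(env._max_episode_steps):
        a = policy(s)  # placeholder: the trained decision transformer
        s, r, done, _ = env.step(a)
        ret += r
        if done:
            break
    rets.append(ret)
# get_normalized_score returns a fraction; multiply by 100 to match papers
print(100 * env.get_normalized_score(np.mean(rets)))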


zhyaoch commented Mar 12, 2024

> Hi, I think you're right. I trained a decision transformer in the maze2d-medium-dense-v1 environment and computed the normalized score with env.get_normalized_score(average return over 100 episodes). However, I obtained a score of 56, which does not align with the maximum score of 35 reported in the paper "QDT: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL". I wanted to know if you have calculated the expert score for maze2d-medium-dense-v1?

Hi, I'm also attempting to compute the normalized score with env.get_normalized_score(average return over 100 episodes) on the antmaze tasks, but I can't reproduce the scores reported in the paper. Have you found a solution to this issue?
