
Questions on data efficiency / ablation study on same dataset for other methods #3

Open
Kin-Zhang opened this issue Sep 13, 2022 · 17 comments

Comments

@Kin-Zhang

Kin-Zhang commented Sep 13, 2022

Thanks for your work.

I have a question about the experiment section. I noticed this statement:

Dataset collection We ran a rule-based expert agent on all 8 kinds of towns and weather with 2 FPS, to collect an expert dataset of 3M frames (410 hours).

As far as I know, the dataset sizes in [CVPR'21 TransFuser], [TPAMI'22 TransFuser], [CVPR'22 LAV], and [arXiv'22 TCP] are all around 200-400K frames. [1M = 1,000K]

I couldn't find a results table in the paper where methods are trained on the same dataset; there is only the online leaderboard. Maybe you have some insights that weren't pointed out in the paper, so I'm opening this issue.

Comparing 3M (= 3,000K) frames against 200-400K makes it unclear whether it is the model or the extra data that brings the performance boost. Have you done any experiments on the data efficiency of your model?

Update (2022/11/1): adding the discussion link here: Kin-Zhang/carla-expert#4

@Kin-Zhang Kin-Zhang changed the title from "Lack of experiment on data efficiency / ablation study on same dataset for other methods" to "Questions on data efficiency / ablation study on same dataset for other methods" on Sep 13, 2022
@deepcs233
Collaborator

Hi, Kin-Zhang

The 3M frames cover 8 towns × 21 kinds of weather. In our work, we only use a subset of the towns and weathers, and the training dataset size is about 1.5M. We will update the details in the camera-ready version.

During data collection, the same route is collected repeatedly under different weathers, so the dataset is redundant and we plan to use less data in the future. In an early version, we tried using about 1/5 of the training data, and performance only dropped slightly on our offline benchmark.
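
For readers who want to reproduce this kind of data-efficiency ablation, here is a minimal sketch of frame subsampling (a hypothetical helper, not InterFuser's actual training pipeline; the frame-ID list format is an assumption):

```python
import random

def subsample_frames(frame_ids, keep_ratio=0.2, seed=0):
    """Keep a random fraction of the collected frames (e.g. 1/5)
    for a data-efficiency ablation. Hypothetical helper."""
    rng = random.Random(seed)  # fixed seed keeps the ablation reproducible
    n_keep = int(len(frame_ids) * keep_ratio)
    return sorted(rng.sample(frame_ids, n_keep))

# Example: 3M collected frames -> 600K training frames
all_frames = list(range(3_000_000))
train_frames = subsample_frames(all_frames)
print(len(train_frames))  # 600000
```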

@Kin-Zhang
Author

Thanks for your reply. Maybe this issue could stay open to remind others who want to compare methods on the same dataset with other open-source approaches, and to post their results here.

Looking forward to your implementation.

@Kait0

Kait0 commented Sep 15, 2022

Two minor details to add to this discussion:
The CVPR TransFuser you mentioned only uses 150k frames as per their appendix.

During data collection, the same route is collected repeatedly under different weathers, so the dataset is redundant

If you repeat the same route with different weather, you are implicitly also randomizing the traffic situation / cars encountered, because the traffic manager is not deterministic in CARLA 0.9.10.1.
What will be redundant is the route layout / static environment.

@Kin-Zhang
Author

Kin-Zhang commented Sep 15, 2022

Thanks to @Kait0 for pointing out this detail. I checked that the [TPAMI'22] TransFuser training data is around 228k frames in total, which is also small compared with the others. Is that the dataset size used for the online leaderboard as well?

@Kait0

Kait0 commented Sep 15, 2022

yes

@deepcs233
Collaborator

deepcs233 commented Sep 15, 2022

Two minor details to add to this discussion: The CVPR TransFuser you mentioned only uses 150k frames as per their appendix.

During data collection, the same route is collected repeatedly under different weathers, so the dataset is redundant

If you repeat the same route with different weather, you are implicitly also randomizing the traffic situation / cars encountered, because the traffic manager is not deterministic in CARLA 0.9.10.1. What will be redundant is the route layout / static environment.

In our data collection setting, the same route is collected 21 times (we have 21 kinds of weather). Besides the route layout and static environment being the same, the active scenarios are also exactly the same (because we use the same route file and scenario file in each collection). Actually, there was a bug in an earlier version of the project that caused only about 1/5 of the data to be used for training. Even before we fixed the bug, InterFuser still performed better than TransFuser [CVPR'21].
Finally, we thank you for your interest in our work and your suggestions. We may add some experiments that use a smaller dataset in our CoRL 2022 camera-ready version.
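
To make the collection setting concrete, here is a minimal sketch of how the (route, weather) runs multiply (the route and weather names are placeholders, not the actual InterFuser route/scenario files):

```python
from itertools import product

# Placeholder names; the real collection uses CARLA route/scenario files.
routes = [f"route_{i:02d}" for i in range(50)]
weathers = [f"weather_{i:02d}" for i in range(21)]  # 21 weather presets

# Each route is re-run once per weather preset, so the static layout and
# the scripted scenarios repeat 21x; in CARLA 0.9.10.1 only the weather
# and the non-deterministic background traffic differ between runs.
collection_runs = list(product(routes, weathers))
print(len(collection_runs))  # 50 * 21 = 1050 runs
```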

@deepcs233
Collaborator

Thanks to @Kait0 for pointing out this detail. I checked that the [TPAMI'22] TransFuser training data is around 228k frames in total, which is also small compared with the others. Is that the dataset size used for the online leaderboard as well?

@Kait0 The online leaderboard doesn't provide a dataset. It only provides the training routes and scenarios.

@Kait0

Kait0 commented Sep 15, 2022

@Kait0 The online leaderboard doesn't provide a dataset. It only provides the training routes and scenarios.

I think he was asking whether we used the same dataset for the models we submitted to the online leaderboard (since I worked on that project).

Besides the route layout and static environment being the same, the active scenarios are also exactly the same (because we use the same route file and scenario file in each collection).

My point was that even if you run the route with the exact same scenario and route file, you will still experience different driving situations, because in CARLA 0.9.10.1 neither the physics nor the traffic manager is deterministic (the CARLA team has worked on this in newer releases).
You can see this, for example, in Tables 2 and 3 of your paper: you run the same route multiple times, but the results differ every time (std > 0).
The same effect will occur if you repeat the same training routes/scenarios multiple times during data collection.
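
For concreteness, the statistic under discussion is just the mean and standard deviation of the driving score over repeated evaluations of the same route; a minimal sketch with invented scores (the numbers below are made up, not from the paper):

```python
import statistics

# Invented driving scores from evaluating the SAME route three times in
# CARLA 0.9.10.1; they differ because the traffic is not deterministic.
driving_scores = [51.2, 48.7, 53.4]

mean = statistics.mean(driving_scores)
std = statistics.stdev(driving_scores)  # sample std, as commonly reported
print(f"DS = {mean:.1f} +/- {std:.1f}")  # std > 0 despite identical config
```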

@deepcs233
Collaborator

Yes, the std is greater than 0, but it is a small number. So we claim the dataset is redundant, not identical.

@Kait0

Kait0 commented Sep 15, 2022

I don't think the magnitude of the DS std is a meaningful metric here. Different situations don't necessarily lead to different driving scores, since the model might solve both.
I can imagine that experiencing different constellations of cars in the dataset might be beneficial for training, which is why I mentioned it.
I think it is an open question, since it is unclear how much variation the CARLA randomness actually introduces or whether it helps at all (I don't think there is any paper right now that investigates this).
I would be interested if you made a dataset ablation on this for your camera-ready version, as you mentioned.

@Kin-Zhang
Author

Kin-Zhang commented Sep 15, 2022

Actually, there was a bug in an earlier version of the project that caused only about 1/5 of the data to be used for training. Even before we fixed the bug, InterFuser still performed better than TransFuser [CVPR'21].

Sorry for the misunderstanding here. What I mean is that InterFuser didn't make a fair comparison, unlike CVPR'21 TransFuser and LAV, which compared against other models on the same dataset in their papers.
Controlling variables such as the dataset and its size keeps the comparison fair and shows that it is the model and method, not the larger dataset, that achieve the result. But I know this is a really large workload.

Looking only at the data size: even 1/5 of the data (3000/5 = 600K) is still larger than what TransFuser, LAV, and TCP use. But as I mentioned before, it is a really large workload to fairly compare other methods on your dataset.

So, once the InterFuser code is ready, I believe others will want to compare InterFuser on their own datasets for method research. I'll just leave these comments and discussions here for them to review. Please let us know if you do this. ^v^

the active scenarios are also exactly the same (because we use the same route file and scenario file in each collection)

Even so, the dataset is only partially redundant, as @Kait0 mentioned. I also noticed that the scripted events would be the same, but the TrafficManager behavior of the other NPCs is still random in CARLA 0.9.10.1.

@deepcs233
Collaborator

deepcs233 commented Sep 15, 2022

Thanks, I will leave the discussion here. There are two points that need to be clarified:

  1. The 3000K data size is the full size of the dataset; we don't use all of it to train the agent. In the training stage, we take about 6 towns and 12 weathers, which accounts for roughly 1500K frames.
  2. TCP uses model ensembling, Rails uses 1M frames, and IARL uses 40M frames. LAV compares against TransFuser [CVPR'21] in its paper; however, LAV uses 400K frames and TransFuser uses 150K, which is almost three times as much. Different methods also use different cameras and sensors. A totally fair comparison is difficult, but we can still see the contributions of these works. We will update the detailed experiments in a following version.

@Kait0

Kait0 commented Sep 15, 2022

What I mean is that InterFuser didn't make a fair comparison.

You are holding them to an unreasonable standard here.
Ablations are usually done in papers with the same dataset, but system-level comparisons to other papers are almost never made with a comparable dataset (this also holds for LAV and TransFuser).
Most methods differ widely in the amount of training data, the kind of training labels, sensor inputs, pre-training, training method, runtime, architecture, and engineering.
Progress in CARLA has come from progress on all of these things, not just the architecture.

@Kin-Zhang
Author

Kin-Zhang commented Sep 15, 2022

Ablations are usually done in papers with the same dataset, but system-level comparisons to other papers are almost never made with a comparable dataset (this also holds for LAV and TransFuser).

This issue is about dataset size, not the system level or the other factors you just mentioned. The comments above are also about comparing the dataset and its size, not the whole system level, which includes things like runtime (we normally don't compare that). If this caused any misunderstanding or offense to the authors, I apologize.

Still, the question remains: the large dataset size makes it unclear whether it is the model or the extra data that brings the performance boost.

However, LAV uses 400K frames and TransFuser uses 150K, which is almost three times as much.

Yes, you are right. That's why I wanted to compare LAV with my own expert/dataset size when LAV was published, but I'm still waiting for their collection script, since after checking I found there are actually a lot of labels involved. >.<

This message is left for people who want to take on this task: architecture and methods are important, but the way the dataset is collected (the expert) and the dataset size should also be considered when making comparisons.

@deepcs233
Collaborator

@Kait0 @Kin-Zhang Thanks for your discussions. I will complete the ablation studies on dataset size and update the results here.

@Kait0

Kait0 commented Oct 26, 2022

Perhaps relevant for people interested in this discussion:
https://arxiv.org/abs/2210.14222v1
Some of my colleagues made an ablation study on the effects of dataset scaling in the privileged setting (Table 2b).
They recollected the TransFuser 228k dataset with different seeds at 1x, 2x, and 3x scale.
This randomizes the traffic situations encountered (but the routes/environment stay the same).
For their privileged models, scaling the dataset leads to consistent improvements in DS.
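
For anyone who wants to try a similar recollection, a minimal sketch of seeding the background traffic differently per run (assuming a recent CARLA 0.9.x Python API where `TrafficManager.set_random_device_seed` is available; the collection loop itself is elided):

```python
import carla

def make_seeded_world(seed, host="localhost", port=2000, tm_port=8000):
    """Connect to CARLA and seed the traffic manager so that repeated
    collections of the same routes differ only via `seed`."""
    client = carla.Client(host, port)
    client.set_timeout(10.0)
    world = client.get_world()

    # Synchronous mode with a fixed timestep makes runs more repeatable.
    settings = world.get_settings()
    settings.synchronous_mode = True
    settings.fixed_delta_seconds = 0.05
    world.apply_settings(settings)

    tm = client.get_trafficmanager(tm_port)
    tm.set_synchronous_mode(True)
    tm.set_random_device_seed(seed)  # different seed -> different traffic
    return world, tm

# e.g. recollect the same routes as three differently seeded datasets
for seed in (0, 1, 2):
    world, tm = make_seeded_world(seed)
    # ... run the routes/scenarios and record frames here ...
```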

@Kin-Zhang
Author

Kin-Zhang commented Nov 1, 2022

A screenshot from the paper <PlanT: Explainable Planning Transformers via Object-Level Representations> that Kait0 mentioned:

[screenshot: PlanT Table 2 (b), dataset scaling ablation]

This may make us think further: scaling other methods' datasets by roughly 10x, to 2-3M frames, might also achieve such high scores, or possibly even higher.

The recent NeurIPS work <Model-Based Imitation Learning for Urban Driving> also uses a 2.9M-frame dataset; I've linked their issue as well:
wayveai/mile#4
