
Questions on data efficiency / ablation study on same dataset for other methods #3

Open
Kin-Zhang opened this issue Sep 13, 2022 · 17 comments

Comments

@Kin-Zhang

Kin-Zhang commented Sep 13, 2022

Thanks for your work.

I have a question about the experiment section. I noticed this statement:

Dataset collection We ran a rule-based expert agent on all 8 kinds of towns and weather with 2 FPS, to collect an expert dataset of 3M frames (410 hours).

As far as I know, the dataset sizes in [CVPR'21 TransFuser], [TPAMI'22 TransFuser], [CVPR'22 LAV], and [arXiv'22 TCP] are all around 200-400K frames. [1M = 1,000K]

I couldn't find a results table in the paper where methods are trained on the same dataset; there is only the online leaderboard. Maybe you have some insights that weren't pointed out in the paper, so I'm opening this issue.

Comparing 3M (= 3,000K) frames against 200-400K makes it unclear whether it is the model or the extra data that brings the performance boost. Have you done any experiments on the data efficiency of your model?

Update (2022/11/1): adding the discussion link here: Kin-Zhang/carla-expert#4

@Kin-Zhang Kin-Zhang changed the title from "Lack of experiment on data efficiency / ablation study on same dataset for other methods" to "Questions on data efficiency / ablation study on same dataset for other methods" on Sep 13, 2022
@deepcs233
Collaborator

Hi, Kin-Zhang

The 3M frames cover 8 towns × 21 kinds of weather. In our work, we only use a subset of the towns and weathers, and the training dataset size is about 1.5M. We will update the details in the camera-ready version.

During data collection, the same route is collected repeatedly under different weathers, so the dataset is redundant and we plan to use less data in the future. In an early version, we tried using about 1/5 of the training data, and performance only dropped slightly on our offline benchmark.
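
For readers who want to reproduce this kind of data-efficiency ablation, here is a minimal sketch of frame subsampling (a hypothetical helper, not InterFuser's actual training pipeline; the frame-ID list format is an assumption):

```python
import random

def subsample_frames(frame_ids, keep_ratio=0.2, seed=0):
    """Keep a random fraction of the collected frames (e.g. 1/5)
    for a data-efficiency ablation. Hypothetical helper."""
    rng = random.Random(seed)  # fixed seed keeps the ablation reproducible
    n_keep = int(len(frame_ids) * keep_ratio)
    return sorted(rng.sample(frame_ids, n_keep))

# Example: 3M collected frames -> 600K training frames
all_frames = list(range(3_000_000))
train_frames = subsample_frames(all_frames)
print(len(train_frames))  # 600000
```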

@Kin-Zhang
Author

Thanks for your reply. Maybe this issue could stay open to remind others who want to compare methods on the same dataset with other open-source approaches, and to post their results here.

Looking forward to your implementation.

@Kait0

Kait0 commented Sep 15, 2022

Two minor details to add to this discussion:
The CVPR TransFuser you mentioned only uses 150k frames as per their appendix.

During data collection, the same route is collected repeatedly under different weathers, so the dataset is redundant

If you repeat the same route with different weather, you are implicitly also randomizing the traffic situation / cars encountered, because the traffic manager is not deterministic in CARLA 0.9.10.1.
What will be redundant is the route layout / static environment.

@Kin-Zhang
Author

Kin-Zhang commented Sep 15, 2022

Thanks to @Kait0 for pointing out this detail. I checked that the [TPAMI'22] TransFuser training data is around 228k frames in total, which is also small compared with the others. Is that the dataset size used for the online leaderboard as well?

@Kait0

Kait0 commented Sep 15, 2022

yes

@deepcs233
Collaborator

deepcs233 commented Sep 15, 2022

Two minor details to add to this discussion: The CVPR TransFuser you mentioned only uses 150k frames as per their appendix.

During data collection, the same route is collected repeatedly under different weathers, so the dataset is redundant

If you repeat the same route with different weather, you are implicitly also randomizing the traffic situation / cars encountered, because the traffic manager is not deterministic in CARLA 0.9.10.1. What will be redundant is the route layout / static environment.

In our data collection setting, the same route is collected 21 times (we have 21 kinds of weather). Besides the route layout and static environment being the same, the active scenarios are also exactly the same (because we use the same route file and scenario file in each collection). Actually, there was a bug in an earlier version of the project that caused only about 1/5 of the data to be used for training. Even before we fixed the bug, InterFuser still performed better than TransFuser [CVPR'21].
Finally, we thank you for your interest in our work and your suggestions. We may add some experiments that use a smaller dataset in our CoRL 2022 camera-ready version.
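
To make the collection setting concrete, here is a minimal sketch of how the (route, weather) runs multiply (the route and weather names are placeholders, not the actual InterFuser route/scenario files):

```python
from itertools import product

# Placeholder names; the real collection uses CARLA route/scenario files.
routes = [f"route_{i:02d}" for i in range(50)]
weathers = [f"weather_{i:02d}" for i in range(21)]  # 21 weather presets

# Each route is re-run once per weather preset, so the static layout and
# the scripted scenarios repeat 21x; in CARLA 0.9.10.1 only the weather
# and the non-deterministic background traffic differ between runs.
collection_runs = list(product(routes, weathers))
print(len(collection_runs))  # 50 * 21 = 1050 runs
```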

@deepcs233
Collaborator

Thanks to @Kait0 for pointing out this detail. I checked that the [TPAMI'22] TransFuser training data is around 228k frames in total, which is also small compared with the others. Is that the dataset size used for the online leaderboard as well?

@Kait0 The online leaderboard doesn't provide a dataset. It only provides the training routes and scenarios.

@Kait0

Kait0 commented Sep 15, 2022

@Kait0 The online leaderboard doesn't provide a dataset. It only provides the training routes and scenarios.

I think he was asking whether we used the same dataset for the models we submitted to the online leaderboard (since I worked on that project).

Besides the route layout and static environment being the same, the active scenarios are also exactly the same (because we use the same route file and scenario file in each collection).

My point was that even if you run the route with the exact same scenario and route file, you will still experience different driving situations, because in CARLA 0.9.10.1 neither the physics nor the traffic manager is deterministic (the CARLA team has worked on this in newer releases).
You can see this, for example, in Tables 2 and 3 of your paper: you run the same route multiple times, but the results differ every time (std > 0).
The same effect will occur if you repeat the same training routes/scenarios multiple times during data collection.
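
For concreteness, the statistic under discussion is just the mean and standard deviation of the driving score over repeated evaluations of the same route; a minimal sketch with invented scores (the numbers below are made up, not from the paper):

```python
import statistics

# Invented driving scores from evaluating the SAME route three times in
# CARLA 0.9.10.1; they differ because the traffic is not deterministic.
driving_scores = [51.2, 48.7, 53.4]

mean = statistics.mean(driving_scores)
std = statistics.stdev(driving_scores)  # sample std, as commonly reported
print(f"DS = {mean:.1f} +/- {std:.1f}")  # std > 0 despite identical config
```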

@deepcs233
Collaborator

Yes, the std is greater than 0, but it is a small number. So we claim the dataset is redundant, not identical.

@Kait0

Kait0 commented Sep 15, 2022

I don't think the magnitude of the DS std is a meaningful metric here. Different situations don't necessarily lead to different driving scores, since the model might solve both.
I can imagine that experiencing different constellations of cars in the dataset might be beneficial for training, which is why I mentioned it.
I think it is an open question, since it is unclear how much variation the CARLA randomness actually introduces or whether it helps at all (I don't think there is any paper right now that investigates this).
I would be interested if you made a dataset ablation on this for your camera-ready version, as you mentioned.

@Kin-Zhang
Author

Kin-Zhang commented Sep 15, 2022

Actually, there was a bug in an earlier version of the project that caused only about 1/5 of the data to be used for training. Even before we fixed the bug, InterFuser still performed better than TransFuser [CVPR'21].

Sorry for the misunderstanding here. What I mean is that InterFuser didn't make a fair comparison, unlike CVPR'21 TransFuser and LAV, which compared against other models on the same dataset in their papers.
Controlling variables such as the dataset and its size keeps the comparison fair and shows that it is the model and method, not the larger dataset, that achieve the result. But I know this is a really large workload.

Looking only at the data size: even 1/5 of the data (3000/5 = 600K) is still larger than what TransFuser, LAV, and TCP use. But as I mentioned before, it is a really large workload to fairly compare other methods on your dataset.

So, once the InterFuser code is ready, I believe others will want to compare InterFuser on their own datasets for method research. I'll just leave these comments and discussions here for them to review. Please let us know if you do this. ^v^

the active scenarios are also exactly the same (because we use the same route file and scenario file in each collection)

Even so, the dataset is only partially redundant, as @Kait0 mentioned. I also noticed that the scripted events would be the same, but the TrafficManager behavior of the other NPCs is still random in CARLA 0.9.10.1.

@deepcs233
Collaborator

deepcs233 commented Sep 15, 2022

Thanks, I will leave the discussion here. There are two points that need to be clarified:

  1. The 3000K data size is the full size of the dataset; we don't use all of it to train the agent. In the training stage, we take about 6 towns and 12 weathers, which accounts for roughly 1500K frames.
  2. TCP uses model ensembling, Rails uses 1M frames, and IARL uses 40M frames. LAV compares against TransFuser [CVPR'21] in its paper; however, LAV uses 400K frames and TransFuser uses 150K, which is almost three times as much. Different methods also use different cameras and sensors. A totally fair comparison is difficult, but we can still see the contributions of these works. We will update the detailed experiments in a following version.

@Kait0

Kait0 commented Sep 15, 2022

What I mean is that InterFuser didn't make a fair comparison.

You are holding them to an unreasonable standard here.
Ablations are usually done in papers with the same dataset, but system-level comparisons to other papers are almost never made with a comparable dataset (this also holds for LAV and TransFuser).
Most methods differ widely in the amount of training data, the kind of training labels, sensor inputs, pre-training, training method, runtime, architecture, and engineering.
Progress in CARLA has come from progress on all of these things, not just the architecture.

@Kin-Zhang
Author

Kin-Zhang commented Sep 15, 2022

Ablations are usually done in papers with the same dataset, but system-level comparisons to other papers are almost never made with a comparable dataset (this also holds for LAV and TransFuser).

This issue is about dataset size, not the system level or the other factors you just mentioned. The comments above are also about comparing the dataset and its size, not the whole system level, which includes things like runtime (we normally don't compare that). If this caused any misunderstanding or offense to the authors, I apologize.

Still, the question remains: the large dataset size makes it unclear whether it is the model or the extra data that brings the performance boost.

However, LAV uses 400K frames and TransFuser uses 150K, which is almost three times as much.

Yes, you are right. That's why I wanted to compare LAV with my own expert/dataset size when LAV was published, but I'm still waiting for their collection script, since after checking I found there are actually a lot of labels involved. >.<

This message is left for people who want to take on this task: architecture and methods are important, but the way the dataset is collected (the expert) and the dataset size should also be considered when making comparisons.

@deepcs233
Collaborator

@Kait0 @Kin-Zhang Thanks for your discussions. I will complete the ablation studies on dataset size and update the results here.

@Kait0

Kait0 commented Oct 26, 2022

Perhaps relevant for people interested in this discussion:
https://arxiv.org/abs/2210.14222v1
Some of my colleagues made an ablation study on the effects of dataset scaling in the privileged setting (Table 2b).
They recollected the TransFuser 228k dataset with different seeds at 1x, 2x, and 3x scale.
This randomizes the traffic situations encountered (but the routes/environment stay the same).
For their privileged models, scaling the dataset leads to consistent improvements in DS.
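
For anyone who wants to try a similar recollection, a minimal sketch of seeding the background traffic differently per run (assuming a recent CARLA 0.9.x Python API where `TrafficManager.set_random_device_seed` is available; the collection loop itself is elided):

```python
import carla

def make_seeded_world(seed, host="localhost", port=2000, tm_port=8000):
    """Connect to CARLA and seed the traffic manager so that repeated
    collections of the same routes differ only via `seed`."""
    client = carla.Client(host, port)
    client.set_timeout(10.0)
    world = client.get_world()

    # Synchronous mode with a fixed timestep makes runs more repeatable.
    settings = world.get_settings()
    settings.synchronous_mode = True
    settings.fixed_delta_seconds = 0.05
    world.apply_settings(settings)

    tm = client.get_trafficmanager(tm_port)
    tm.set_synchronous_mode(True)
    tm.set_random_device_seed(seed)  # different seed -> different traffic
    return world, tm

# e.g. recollect the same routes as three differently seeded datasets
for seed in (0, 1, 2):
    world, tm = make_seeded_world(seed)
    # ... run the routes/scenarios and record frames here ...
```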

@Kin-Zhang
Author

Kin-Zhang commented Nov 1, 2022

A screenshot from the paper <PlanT: Explainable Planning Transformers via Object-Level Representations> that Kait0 mentioned:

[screenshot: PlanT Table 2 (b), dataset scaling ablation]

This may make us think further: scaling other methods' datasets by roughly 10x, to 2-3M frames, might also achieve such high scores, or possibly even higher.

The recent NeurIPS work <Model-Based Imitation Learning for Urban Driving> also uses a 2.9M-frame dataset; I've linked their issue as well:
wayveai/mile#4
