Skip to content

Containers: When to Use Which

amirroth edited this page Sep 23, 2021 · 4 revisions

The C++ standard library includes a number of container (i.e., data structure) templates: array, vector, list, stack, queue, deque, map, set, unordered_map, and unordered_set. The use of good data structures is the key to good performance. So ... when to use each one of these? First, a quick lesson about asymptotic behavior and fixed overheads.

Asymptotic behavior and fixed overheads

You've probably heard that "std::map is optimized for search". That is the first part of a larger sentence. The complete version is "std::map is optimized for search relative to other containers when the number of elements in the container is large, and the larger the number of elements the greater its relative advantage over the other containers." This is the concept of asymptotic cost, which in computer science is written as "O(f(N))" and called "big-O" notation (N is the number of elements in the data structure).

Searching for an element in a std::vector has asymptotic behavior of O(N), the cost is proportional to N, the number of elements. Searching for an element in a std::map has asymptotic behavior of O(lg(N)), the cost is proportional to lg(N) (i.e., log base 2 of N). In practice, the growth functions are more complicated than just N or lg(N). There is usually some constant factor applied. For instance, in a std::vector you typically only have to search half the elements before finding the one you need and so the function is really N/2, not N. For similar reasons, the average cost of searching a std::map is lg(N/2) or lg(N)-1, not just lg(N). There are also constant factors involved with traversing the data structure that may be different for different data structures, so the costs are really Cv*(N/2) and Cm*(lg(N)-1). But when N gets very large, the only thing that matters is N so that is the only part that is written in the notation. Suppose Cv (the cost of traversing a std::vector is 1) and Cm (the cost of traversing a node in std::map is 3). When N is 1,000,000, the cost of finding an element in a std::vector is 500,000, the cost of finding the same element in std::map is 60. The difference increases as N gets even larger. When N is 1,000,000,000, the cost of finding something in a std::vector is 500,000,000 and the cost of finding it in a std::map is 90!

Here's the thing about asymptotic behavior and EnergyPlus. In EnergyPlus we never deal with containers of 1,000,000 elements. What is the largest number of elements we deal with? 10,000? 1,000? The costs for searching a std::vector and a std::map each with 1,000 elements are 500 and 30, respectively. When you go to 100 elements, the costs become 50 and 20. When you get to 10 elements they are 5 and 9--if you have 10 elements, it's faster to search a std::vector than a std::map, not to mention to create a std::vector than a std::map! When thinking about which container or data structure to use, it's important not only to think about asymptotic behavior but where that asymptotic behavior takes over (usually somewhere between 10 and 100 elements for most data structures) and whether EnergyPlus is likely to ever hit that crossover point. If you are only expecting 10-20 elements, use the simpler data structure with the lower constant overhead factor. Save the "fancy" data structures that are optimized for certain operations for containers with many elements.

std::array: the best container

The best container--it least as far as EnergyPlus use cases are concerned but more generally too I suspect--is [std::array]. (https://en.cppreference.com/w/cpp/container/array). std::array is an array whose size is fixed at compile-time. This does limit the use of std::array, but when array size is statically known, it's the best option. std::array is a low-overhead, pointer-safe veneer on top of a C-style array. The memory footprint of a std::array is exactly that of a C-style array. Even the number of elements is not saved, it's part of the std::array template type instance as opposed to individual object instances--yes, std::array of the same element type but different element counts are considered different types!! There is no secondary heap allocation, construction, or destruction associated with a std::array. In all other containers, the actual data element storage is allocated dynamically and lives on "the heap" regardless of where the container object itself lives, stack, heap, or static global memory. In a std::array the data-element storage is the container object. If a std::array is declared to be on the stack, the elements it contains are also on the stack. Zero-overhead memory management is one advantage of std::array.

But there's more! Because there is no dynamic memory management, a std::array can be made constexpr assuming the elements it contains can also be made constexpr. constexpr is one of the great new features of C++ (learn more about it here). constexpr data is written into a special section in the executable and is loaded into memory for use with no actual code needing to run. Look at the FluidProperties module to see how useful this can be.

std::array and gsl::span

We have already hinted at one of the traditional limitations of std::array, the fact that std::array of different sizes are considered different types even if they contain the same type of element. This makes using std::array's as function parameters difficult. Thankfully, C++20 provides a fix (and in the meantime, we are using that fix via the C++ guideline support library). That fix is the std::span (or for now gsl::span) template. gsl::span is a lightweight, storage-less reference template for std::array, std::vector, and C-style arrays and pointers. It has the same relationship to std::array as std::string_view has to std::string. gsl::span contains a pointer to storage and a size element. When a parameter is delcared gsl::span and an std::array is passed as an argument, the compiler initializes gsl::span with a pointer to the std::array and the number of elements (which is part of the type so the compiler knows it). Within the callee, the number of elements can be retrieved using the .size() method. Look here for an example of how gsl::span is used.

If you actually look at the example, you saw that gsl::span is declared as a value parameter. Aren't objects supposed to be passed by const & rather than by value? In general, yes. But there are exceptions to every rule, and the exceptions to this one are std::string_view and gsl::span, both of which have only two members (it's actually faster overall to pass an object with two elements by value than to pass a reference to the object and then access both members in the callee) and because both are essentially references and passing references by reference is considered gauche in StackOverflow circles.

TL;DR std::span/gsl::span has removed whatever barriers were remaining to using std::array. Use this container wherever possible!

std::vector, EPVector, and ObjexxFCL::Array1D

There are a significant number of arrays whose sizes are known at compile time, but an even larger number of arrays whose sizes are determined by the contents of the IDF file. For these, you can use std::vector, EPVector (a thin veneer on top of std::vector that provides Fortran-like 1-based indexing functionality), or ObjexxFCL::Array1D (a from-scratch library that provides all the functionality of Fortran arrays). The goal for EnergyPlus should be to prune Fortran-specific functionality and gradually migrate from ObjexxFCL::Array1D to EPVector and finally to std::vector, this last transition also requires moving to 0-based array indexing. You should give preference to std::vector, EPVector, and ObjexxFCL::Array1D in that order, but for the time being all are fine.

There is more to say about multi-dimensional arrays and arrays of objects vs. arrays of scalars, but those will require separate pages.

std::map and std::unordered_map

As alluded to at the top of this page, these containers do have some real world uses, although not nearly as many as one would think. std::map and std::unordered_map are good for large dynamic collections that have to be searched relatively frequently. std::map is implemented as "red-black" (i.e., self-rebalancing) binary tree. std::unordered_map is implemented as a hash table (in my experience, hash tables are by far the second most useful real-world data structure after arrays). There is a bit of constant additive overhead associated with inserting into and searching a hash table that is associated with computing the hash function of the element, but overall insertion and search are faster because with a good hash function a hash table is naturally self-balancing. If traversing the elements in some order (e.g., alphabetic) is important, use std::map otherwise use std::unordered_map.

A good use of std::unordered_map is as a temporary data structure used to link permanent data structures together during initialization. Many EnergyPlus objects reference Zone by name. Once the Zone array is created and sorted, a std::unordered_map<std::string_view, int> ZoneNameUCtoNum can be created to help other objects quickly translate Zone names to indices.

std::set, std::unordered_set, std::queue, std::stack, std::deque, and std::list

I honestly cannot think of too many EnergyPlus use cases for these data structures. If you find yourself using one of them, ask yourself why.

The mistake we all make: containers do not mirror element typing relationships

If you've been programming in C++ for a while, you will have been bitten by this counter-intuitive issue. Don't worry, we all have. Multiple times. What is "this issue" exactly? "This" is the fact that C++ has two polymorphism mechanisms, class and template, and although they can be used together as constructs, their individual polymorphism mechanisms are not compatible and do not compose in the way you would intuitively expect. Specifically:

struct BaseGlobalStruct {
};

struct WindowData : BaseGlobalStruct {
};

std::vector<BaseGlobalStruct> baseGlobalStructVec;
std::vector<WindowData> windowDataVec;

Forget for a second that BaseGlobalStruct is an abstract class (i.e., it has a pure virtual function that cannot be called) and cannot be instantiated. If it could be, would windowDataVec be a subtype of baseGlobalStructVec? Would you be able to write a function that specifies a std::vector<BaseGlobalStruct> & as an argument and pass a std::vector<WindowData> as a parameter to that function? No and no. No template instantiation is a subtype of any other template instantiation even if the template is identical and the template parameters have a subtype relationship. TANSTATS--there ain't no such thing as template subtyping.

If you want to exploit element polymorphism within a container, the container has to be defined as holding pointers to the base class, even if it only contains pointers to a subclass.

std::vector<BaseGlobalStruct *> baseGlobalStructPVec; 
std::vector<BaseGlobalStruct *> windowDataPVec; // contains pointers to WindowData objects but defined in terms of the parent BaseGlobalStruct

Note, this is a characteristic of all templates. Array1D and EPVector will have the same issue.

Clone this wiki locally