ADR 0017: Restrict Dataset to items with matching dimensionality
Status: accepted
Deciders: Jan-Lukas, Neil, Simon, Sunyoung
Date: 2023-07-04
Context
Background
Dataset was conceived as a way to represent a collection of items with compatible dimension extents, but possibly different sets or subsets of dimensions.
Recently we introduced DataGroup, which drops the restriction of compatible dimension extents but, unlike Dataset, does not provide support for joint coordinates.
The addition of DataGroup was triggered by a long series of cases where we realized that Dataset is not useful or flexible enough in real applications.
This is not to say that Dataset is entirely useless, but it is not as useful as we had hoped.
One key area that is not covered by DataGroup is the representation of table-like data (or multi-dimensional generalizations thereof), in a manner similar to pandas.DataFrame.
The recent decision to remove support for attrs, including addition of an aligned flag, has made the situation worse, as this added a number of edge cases where the semantics of Dataset are not clear.
Finally, there have been discussions around structure-of-array data-types.
Dataset could be used for this purpose, but the current semantics may be too complex for comfort.
Analysis
There are two possible ways of reasoning about Dataset.
Firstly, we may argue that while technically complex, the work has already been done, and the problems detailed below are encountered only in edge cases.
Secondly, we can ask ourselves if we would have added Dataset in its current form and shape if we had DataArray and DataGroup.
Concrete problems are:
If an item in a Dataset lacks one or more dimensions of other items, Dataset takes the stance that the value is constant along the missing dimensions. For example, imagine a 2-D temperature array and a 3-D pressure array. Was the temperature measured only at ground level? Then we should maybe not use Dataset, as the temperature is not actually height-independent. In this context, having Dataset with support for lower-dimensional items can be seen as risky, as it may lead to incorrect data analysis.
If we consider two slices of a Dataset containing a lower-dimensional item “abc” then, e.g., addition of these slices will yield a Dataset containing the sum of “abc” with itself. There is no indication that this happens, and it may be surprising to the users.
Reduction operations such as sum() are not truly well-defined. We currently raise if there is a lower-dimensional item when a concrete reduction dim is provided, but support reducing all dimensions. This is inconsistent and potentially not a good choice (note though that a similar problem applies to DataGroup). This implies, e.g., that addition of two Dataset items does not commute with calling sum() on the Dataset.
Complicated logic for selecting coordinates, affecting code as well as documentation needs. If a dataset has a dimension of length 1, should items without that dim have the coordinate?
Slicing a dim can make coords suddenly apply for an item.
Readonly flags in DataArray are required to avoid modifying data in lower-dimensional items via slices of a dataset.
Consistency issues after the introduction of the alignment flag: issues selecting coords with length-2 bin edges #3148 and disappearing unaligned coords #3149.
Dataset.dim may “lie” if an item is 0-D. Maybe it should have raised unless all items are 1-D?
dims and sizes of DataArray imply an order, but they do not for Dataset. This leads to some code overhead and risk of confusion.
Complicated (internal) logic for updating the sizes dict. This is not a problem for the user, except for rare edge cases where size-changing item replacements are not supported although they could be.
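Two of the problems listed above, the silent self-addition of a lower-dimensional item and the non-commuting sum(), can be reproduced in a minimal plain-Python sketch. It does not use Scipp. The dict-of-items model and the helpers slice_x and add are assumptions made for illustration only, with a float standing in for a 0-D item that is treated as constant along x.

```python
# Plain-Python model (not the Scipp API): a "dataset" is a dict of items.
# A list is a 1-D item along x; a float is a 0-D item without an x dimension.
ds = {"counts": [1.0, 2.0, 3.0], "abc": 10.0}


def slice_x(dataset, i):
    """Hypothetical helper: slice along x; 0-D items pass through unchanged."""
    return {
        name: (item[i] if isinstance(item, list) else item)
        for name, item in dataset.items()
    }


def add(left, right):
    """Hypothetical helper: item-wise addition of two sliced datasets."""
    return {name: left[name] + right[name] for name in left}


# Problem: adding two slices adds "abc" to itself without any warning.
total = add(slice_x(ds, 0), slice_x(ds, 1))
print(total)  # {'counts': 3.0, 'abc': 20.0}

# Problem: addition of items does not commute with sum().
# Add first: "abc" is broadcast along x, so it is counted once per element.
add_then_sum = sum(c + ds["abc"] for c in ds["counts"])
# Sum first: each item is reduced on its own, so "abc" is counted once.
sum_then_add = sum(ds["counts"]) + ds["abc"]
print(add_then_sum, sum_then_add)  # 36.0 16.0
```

In this model nothing distinguishes a value that is truly constant along x from one that was simply never measured along x, which is the ambiguity the first problem describes.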
Given the long-term focus of the project, and the limited area of applicability of Dataset with its current semantics, we believe that it is worth considering a change.
While none of the above issues are major, they add up to a significant amount of complexity that may turn out to be hard to manage or justify in the long run.
Even in its current state, several aspects of the above are not well documented, either for developers or for users.
Proposed solution
We propose to restrict Dataset to items with matching dimensionality.
Each item of a dataset will retain a masks dictionary (no change).
We can then think of Scipp providing a natural cascade of objects:
Given a number of arrays that we would like to combine, if sizes and dtype are consistent use DataArray with an extra dimension.
If only sizes are consistent use Dataset.
If sizes are not consistent use DataGroup.
Note that there are also other reasons for choosing Dataset over DataArray, in particular simple string-based column access and efficient addition or removal of items.
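The cascade above can be written down as a small decision function. This is only a sketch of the rule of thumb, not part of Scipp: the function name choose_container and the (sizes, dtype) descriptor format are hypothetical, sizes are compared without regard to dimension order, and the other reasons for preferring Dataset are deliberately ignored.

```python
def choose_container(items):
    """Pick a container following the cascade described above.

    `items` is a list of (sizes, dtype) pairs, where sizes maps dimension
    labels to extents. This descriptor format is hypothetical.
    """
    if not items:
        raise ValueError("need at least one array")
    first_sizes, first_dtype = items[0]
    same_sizes = all(sizes == first_sizes for sizes, _ in items)
    same_dtype = all(dtype == first_dtype for _, dtype in items)
    if same_sizes and same_dtype:
        return "DataArray with an extra dimension"
    if same_sizes:
        return "Dataset"
    return "DataGroup"


# Same sizes and dtype: stack into one DataArray.
print(choose_container([({"x": 3}, "float64"), ({"x": 3}, "float64")]))
# Same sizes, different dtype: table-like Dataset.
print(choose_container([({"x": 3}, "float64"), ({"x": 3}, "string")]))
# Different sizes: fall back to DataGroup.
print(choose_container([({"x": 3}, "float64"), ({"x": 3, "y": 2}, "float64")]))
```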
Decision
Dataset will be restricted to items with matching dimensionality.
Consequences
Positive:
Cleaner and simpler semantics.
Simplified documentation and training.
Simplified code.
Resolves a number of issues where semantics of Dataset are not clear in edge cases, in particular with bin-edge coordinates.
Negative:
Dataset will no longer be able to represent certain types of data. Users will need to resort to DataGroup instead, which has other limitations, such as requiring coordinates to be duplicated. Another option would be to replicate data values of the lower-dimensional items to match the dimensionality of the higher-dimensional items. This would require more memory, but would force the users to be explicit about the meaning of data they want to represent.
User code needs to be migrated.
Existing files with incompatible Dataset data will no longer be readable. We could ignore this issue (as this is unlikely to have been used in practice), or return a DataGroup instead.
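The alternative mentioned above, replicating the values of a lower-dimensional item so that it matches the higher-dimensional items, can be illustrated with a plain-Python sketch. It does not use Scipp. The helper replicate, the nested-list layout, and the numbers are assumptions chosen for illustration, reusing the temperature example from the analysis.

```python
import copy

temperature = [[280.0, 281.0], [282.0, 283.0]]  # dims (x, y), made-up values
n_z = 3  # extent of the missing z dimension in the higher-dimensional item


def replicate(values, n):
    """Hypothetical helper: stack n independent copies along a new outer dim."""
    return [copy.deepcopy(values) for _ in range(n)]


temperature_3d = replicate(temperature, n_z)  # dims (z, x, y)
print(len(temperature_3d), len(temperature_3d[0]), len(temperature_3d[0][0]))  # 3 2 2
```

The replicated item stores n_z times as many values, which is the memory cost noted above. In exchange, the claim that the temperature is the same at every z is now stated explicitly in the data.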