ADR 0017: Restrict Dataset to items with matching dimensionality
Status: accepted
Deciders: Jan-Lukas, Neil, Simon, Sunyoung
Date: 2023-07-04
Context
Background
Dataset was conceived as a way to represent a collection of items with compatible dimension extents, but possibly different sets or subsets of dimensions.
Recently we have introduced DataGroup, which drops the restriction of compatible dimension extents but, unlike Dataset, in turn does not provide support for joint coordinates.
The addition of DataGroup was triggered by a long series of cases where we realized that Dataset is not useful and flexible enough in real applications.
This is not to say that Dataset is entirely useless, but it is not as useful as we had hoped.
One key area that is not covered by DataGroup is the representation of table-like data (or multi-dimensional generalizations thereof), in a manner similar to pandas.DataFrame.
The recent decision to remove support for attrs, including the addition of an aligned flag, has made the situation worse, as it added a number of edge cases where the semantics of Dataset are not clear.
Finally, there have been discussions around structure-of-array data-types.
Dataset could be used for this purpose, but the current semantics may be too complex for comfort.
Analysis
There are two possible ways of reasoning about Dataset.
Firstly, we may argue that while technically complex, the work has already been done, and the problems detailed below are encountered only in edge cases.
Secondly, we can ask ourselves if we would have added Dataset in its current form and shape if we had DataArray and DataGroup.
Concrete problems are:
If an item in a Dataset lacks one or more dimensions of other items, Dataset takes the stance that the value is constant along the missing dimensions. For example, imagine a 2-D temperature array and a 3-D pressure array. Was the temperature measured only at ground level? Then we should maybe not use Dataset, as the temperature is not actually height-independent. In this context, having Dataset with support for lower-dimensional items can be seen as risky, as it may lead to incorrect data analysis.
If we consider two slices of a Dataset containing a lower-dimensional item “abc” then, e.g., addition of these slices will yield a Dataset containing the sum of “abc” with itself. There is no indication that this happens, and it may be surprising to the users.
Reduction operations such as sum() are not truly well-defined. We currently raise if there is a lower-dimensional item when a concrete reduction dim is provided, but support reducing all dimensions. This is inconsistent and potentially not a good choice (note though that a similar problem applies to DataGroup). This implies, e.g., that addition of two Dataset items does not commute with calling sum() on the Dataset.
Complicated logic for selecting coordinates, affecting code as well as documentation needs. If a dataset has a dimension of length 1, should items without that dim have the coordinate? Slicing a dim can make coords suddenly apply for an item.
Readonly flags in DataArray are required to avoid modifying data in lower-dimensional items via slices of a dataset.
Consistency issues after the introduction of the alignment flag: issues selecting coords with length-2 bin edges #3148 and disappearing unaligned coords #3149.
Dataset.dim may “lie” if an item is 0-D. Maybe it should have raised unless all items are 1-D?
dims and sizes of DataArray imply an order, but they do not for Dataset. This leads to some code overhead and risk of confusion.
Complicated (internal) logic for updating the sizes dict. This is not a problem for the user, except for rare edge cases where size-changing item replacements are not supported although they could be.
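The slice-addition and reduction problems listed above can be illustrated with a toy model. The sketch below does not use Scipp: it represents a dataset as a plain dict whose items are either a list (a 1-D item along a hypothetical dimension x) or a float (a 0-D item). The helpers slice_x, add and sum_all are hypothetical stand-ins that only mimic the semantics described above, where a lower-dimensional item is treated as constant along the missing dimension.

```python
# Toy model, NOT the Scipp API: a "dataset" is a dict of items that are
# either 1-D along a hypothetical dim "x" (list) or 0-D (float).


def slice_x(ds, start, stop):
    """Slice 1-D items along x; 0-D items are carried over unchanged."""
    return {k: v[start:stop] if isinstance(v, list) else v for k, v in ds.items()}


def add(a, b):
    """Item-wise addition of two toy datasets with identical keys and shapes."""
    out = {}
    for k in a:
        if isinstance(a[k], list):
            out[k] = [p + q for p, q in zip(a[k], b[k])]
        else:
            out[k] = a[k] + b[k]
    return out


def sum_all(ds):
    """Reduce every item over all of its own dimensions."""
    return {k: sum(v) if isinstance(v, list) else v for k, v in ds.items()}


ds = {"counts": [1.0, 2.0, 3.0, 4.0], "abc": 10.0}

# Adding two slices silently adds the 0-D item "abc" to itself.
total = add(slice_x(ds, 0, 2), slice_x(ds, 2, 4))

# Adding the two items treats "abc" as constant along x, so summing
# afterwards differs from summing each item first and then adding.
add_then_sum = sum(c + ds["abc"] for c in ds["counts"])
reduced = sum_all(ds)
sum_then_add = reduced["counts"] + reduced["abc"]
```

In this toy model, total["abc"] is twice the original value even though nothing in the call indicates it, and add_then_sum and sum_then_add disagree because "abc" contributes once per element along x in one case and only once in the other.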
Given the long-term focus of the project, and the limited area of applicability of Dataset with its current semantics, we believe that it is worth considering a change.
While none of the above issues are major, they add up to a significant amount of complexity that may turn out to be hard to manage or justify in the long run.
Even in its current state, several aspects of the above are not well documented, either for developers or for users.
Proposed solution
We propose to restrict Dataset to items with matching dimensionality.
Each item of a dataset will retain a masks dictionary (no change).
We can then think of Scipp providing a natural cascade of objects:
Given a number of arrays that we would like to combine, if sizes and dtype are consistent, use DataArray with an extra dimension.
If only sizes are consistent, use Dataset.
If sizes are not consistent, use DataGroup.
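The cascade can be written down as a small decision function. This is a minimal sketch and not part of Scipp: choose_container is a hypothetical helper, and each array is described only by a sizes dict (dimension name to length) and a dtype string. It assumes that "consistent sizes" means equal dimension names and lengths regardless of dimension order, which is how Python compares dicts.

```python
# Hypothetical helper, NOT part of Scipp: each array is described only by
# its sizes (dim name -> length) and a dtype string.


def choose_container(arrays):
    """Return the name of the container suggested by the cascade."""
    first_sizes, first_dtype = arrays[0]
    # Assumption: dict equality ignores dimension order.
    same_sizes = all(sizes == first_sizes for sizes, _ in arrays)
    same_dtype = all(dtype == first_dtype for _, dtype in arrays)
    if same_sizes and same_dtype:
        return "DataArray"  # combine along an extra dimension
    if same_sizes:
        return "Dataset"
    return "DataGroup"


a = ({"x": 3, "y": 4}, "float64")
b = ({"x": 3, "y": 4}, "float64")
c = ({"x": 3, "y": 4}, "int64")
d = ({"x": 3}, "float64")

choices = [
    choose_container([a, b]),  # same sizes and dtype
    choose_container([a, c]),  # same sizes, different dtype
    choose_container([a, d]),  # different sizes
]
```

The function only encodes the size and dtype criteria; it deliberately ignores the other reasons for preferring one container over another.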
Note that there are also other reasons for choosing Dataset over DataArray, in particular simple string-based column access and efficient addition or removal of items.
Decision
Dataset will be restricted to items with matching dimensionality.
Consequences
Positive:
Cleaner and simpler semantics.
Simplified documentation and training.
Simplified code.
Resolves a number of issues where semantics of Dataset are not clear in edge cases, in particular with bin-edge coordinates.
Negative:
Dataset will no longer be able to represent certain types of data. Users will need to resort to DataGroup instead, which has other limitations, such as requiring coordinates to be duplicated. Another option would be to replicate data values of the lower-dimensional items to match the dimensionality of the higher-dimensional items. This would require more memory, but would force the users to be explicit about the meaning of data they want to represent.
User code needs to be migrated.
Existing files with incompatible Dataset data will no longer be readable. We could ignore this issue (as this is unlikely to have been used in practice), or return a DataGroup instead.
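The option of replicating data values of lower-dimensional items can be sketched as follows. This is an illustration with nested Python lists, not Scipp code: replicate_along_new_outer_dim is a hypothetical helper, and the 1-D temperature and 2-D pressure values are made up as a reduced version of the 2-D and 3-D example from the analysis.

```python
# Hypothetical helper, NOT part of Scipp: nested lists stand in for arrays.


def replicate_along_new_outer_dim(values, length):
    """Copy a 1-D list `length` times to form a 2-D list of independent rows."""
    return [list(values) for _ in range(length)]


# Made-up data: pressure has dims (z, x), temperature only has dim (x).
pressure = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
temperature = [20.0, 21.0, 22.0]

# The user states explicitly that temperature is the same at every z.
temperature_2d = replicate_along_new_outer_dim(temperature, len(pressure))

# Memory cost grows with the length of the added dimension.
n_values_before = len(temperature)
n_values_after = sum(len(row) for row in temperature_2d)
```

After replication both items have the same dimensionality, at the cost of storing the temperature values once per z entry.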