ADR 0018: More explicit control of replaced dims in bin
and hist
functions#
Status: accepted
Deciders: Simon, JanLukas, Neil, Sunyoung
Date: 20240712
Context#
Background#
Currently functions such as bin
and hist
derive their behavior from looking at whether the coordinate being used for the operation is a “dimensioncoordinate” of the data array, i.e., a coordinate with its name matching a dimension of the data.
Recently we have seen more and more cases where this behavior leads to surprising and too implicit results.
To control the actual behavior a number of workarounds in usercode are required, using combinations of functions such as flatten
, group
, rename_dims
, drop_coords
, transpose
, etc.
Furthermore, there are cases where the default behavior shows suboptimal performance, which can be avoided using further workarounds.
All of this is cumbersome and error prone, so this ADR proposes a more predictable and explicit control of the dimension(s) to be replaced or added.
Proposed solution#
Add a
dim
argument to allbin
andhist
functions to specify the dimension(s) to be replaced.The default
dim=None
means the dimension of the coordinate(s) used for the operation.In case of binned data there can be cases without coordinate, since only the bin contents might have the relevant coordinate. In this case
dim=None
meansdim=()
, i.e., no dimension is replaced but a new one is added.
Example of the new function signature:
def hist(
x: Variable  DataArray,
arg_dict: dict[str, int  Variable]  None = None,
/,
*,
dim: tuple[str, ...]  str  None = None,
**kwargs: int  Variable,
) > DataArray: ...
That is, dim
is keywordonly and can be a single string or a tuple of strings.
Internally, dim=None
is handled as follows:
Use
dim=da.coords[key].dims if key in da.coords else ()
for each key inarg_dict
andkwargs
.If there are multiple keys (in
arg_dict
and/orkwargs
), thedim
argument is defined as the union of all thedim
values defined above.
Analysis#
With the above proposal, we get the following behavior for dense data:
Data dims 
Coord dims 
Result dims 
Old op 
New op 
Change 
Comments (old op) 

(x,) 
(x,) 
(x,) 

same 
✅ 

(x,) 
(x,) 
(y,) 

same 
✅ 

(x,) 
(x,) 
(y,z) 

same 
✅ 

(x,y) 
(y,) 
(z,) 


❌ 

(x,y) 
(x,y) 
(z,) 


❌ 
can also 
(x,y) 
(y,) 
(x,z) 


❌ 

(x,y) 
(x,y) 
(x,z) 


❌ 
In the cases where workarounds such as flatten
are required, we currently get exceptions with a direct call to hist
.
That is, there do not appear to be a significant number of breaking changes, unless someone relied on those exceptions.
For binned data we obtain:
Data dims 
Coord dims 
Result dims 
Old op 
New op 
Change 
Consistent with dense case 

(…) 
(…) 
(…) 

same 
✅ 
 
(x,) 
(x,) 
(x,) 

same 
✅ 
✅ 
(x,) 
 
(y,) 


❌ 
 
(x,) 
(x,) 
(y,) 


❌ 
✅ 
(x,) 
 
(x,y) 


✅ 
 
(x,) 
(x,) 
(x,y) 


❌ 
✅ 
(x,y) 
 
(z,) 


❌ 
 
(x,y) 
(y,) 
(z,) 


❌ 
✅ 
(x,y) 
(x,y) 
(z,) 


❌ 
✅ 
(x,y) 
 
(x,z) 


❌ 
 
(x,y) 
(y,) 
(x,z) 


❌ 
✅ 
(x,y) 
(x,y) 
(x,z) 


❌ 
✅ 
The same holds for hist>bin
in all the above cases.
As for dense data, the actual number of breaking changes appears to be small or even negligible.
We can see that in both the dense and the binned case, the proposed solution represents no change over the current behavior in the typical cases. It does however allow for removing a significant number of “expert” workarounds, and it allows for more explicit control of the operation.
Alternatives#
In the binneddata case for dim=None
it was initially considered defining dim=<internalcontentdim>
instead of using the dimensions of the outer coordinate.
This would have resulted in the following:
Binned data:
Data dims 
Result dims 
Old op 
New op 
Change 
Comments (old op) 

(…) 
(…) 

same 
✅ 

(x,) 
(x,) 

same 
✅ 

(x,) 
(x,) 

same 
✅ 

(x,) 
(y,) 


❌ 
Or use 
(x,) 
(y,) 


❌ 
Or use 
(x,) 
(x,y) 

same 
✅ 

(x,) 
(x,y) 

same 
✅ 

(x,y) 
(x,z) 


❌ 
Or use 
(x,y) 
(x,z) 


❌ 
Or use 
(x,y) 
(z,) 


❌ 
need extra tricks for multithreading 
There appears to be no breaking change here, since as before a new dimension will be added by bin
or hist
, with the default dim=None
.
However, the solution chosen for the proposal is more consistent with the densedata case.
The only real downside of the proposed solution appears to be that in some cases the dim
argument can be omitted but adding/changing a coord to/of the outer data array will require adding a dim
argument since otherwise the output dimensionality will change.
Decision#
Implement proposed solution for all variants of the bin
and hist
functions.
Consequences#
Positive:#
More explicit syntax, instead of relying on operation dimension name matching name and dimension of existing coordinate.
Avoid several workarounds that are currently required in user code for certain operations.
Avoid even more workarounds for optimal performance in user code.
Complexity is absorbed into Scipp.
Negative:#
Need to handle more cases in Scipp itself, with extra need for maintenance and testing. As a lot of this has already been done in special cases elsewhere this should be manageable and should result in a net gain if we consider a wider scope than just Scipp itself.