Architecture

Every piece of software has some kind of architecture, and this is the place to describe it in broad terms, to make it easier for developers to find their way around the code. Following the scheme of a layered architecture as featured by the “Clean Architecture” (Robert Martin) or the “hexagonal architecture”, alternatively known as “ports and adapters” (Alistair Cockburn), the different layers are described successively from the inside out.

Just to make matters a bit more interesting, the evedata package has basically two disjoint interfaces: one towards the data file format (eveH5), the other towards the users of the actual data, e.g. the radiometry and evedataviewer packages. See Fig. 3 for a first overview.

Fig. 3 The two interfaces of the evedata package: eveH5 is the persistence layer of the recorded data and metadata, and evedataviewer and radiometry are Python packages for graphically displaying and processing and analysing these data, respectively. Most people concerned with the actual data will either use evedataviewer or the radiometry package, but not evedata directly.

An alternative view that may be more helpful in the long run, leading to a better overall architecture, rests on the distinction between two dimensions of layers: functional and technical. While for a long time the first line of organisation in code was technical layers, grouping software into functional blocks (packages or whatever the name) that are each independently technically layered preserves the concepts of the application much better. A first idea for the evedata package is presented in Fig. 4.

Fig. 4 A two-dimensional view of the architecture, with technical and functional layers. The primary line of organisation in the code, according to different authors, should be the functional layers or packages, and each of the functional blocks should be organised into three technical layers: boundaries, controllers, entities. The idea and name of these three technical layers goes back to Ivar Jacobson (1992). The boundaries (sometimes called interfaces) contain two different kinds of elements: facades and resources. While resources are typically concerned with the persistence layer and similar things, facades are the user-facing elements providing access to the underlying entities and controllers.

There are three functional layers with dependencies in one direction: measurement, evefile, and scan. Each of the functional layers is divided into three technical layers, i.e. boundaries (“interfaces”), controllers, and entities (BCE). The boundaries layer contains both the “interfaces” (“adapters”) pointing to the user (facades) and those pointing to the underlying infrastructure (resources). Here, “user” can be either actual human users or other functional layers. This means that high-level functional layers shall not directly depend on elements of the lower technical layers of lower-level functional layers, but use the respective facades of the lower-level functional layers. As an example for the evedata package: the measurement functional layer uses the evefile and scan facades of the evefile and scan functional layers.

A corresponding UML package diagram for Fig. 4 is shown in the figure below. Here, only the dependencies within the individual functional layers are shown, not the dependencies between the functional layers. The latter proceed from left to right: measurement > evefile > scan.

Fig. 5 A UML package diagram of the evedata package following the organisation in functional layers that each contain three technical layers, as shown in Fig. 4. To hide the names of the technical layers from the user, one could think of importing the relevant classes (basically the facades) in the __init__.py files of the respective top-level functional packages.

For each of the functional layers, the corresponding technical layers are described below. Deviating from the direction of dependencies as shown in Fig. 4, we start with the evefile functional layer, and for each of the layers we start with the entities and proceed via the controllers to the boundaries. From a user perspective interested in measured data, the journey starts with the data file (eveH5), represented on a low level by the evefile functional layer and on a high level as user interface by the measurement functional layer. The scan functional layer representing the information originally contained in the SCML file, while technically at the bottom of the dependencies chain, is the least interesting from a user’s perspective primarily interested in the data, and is probably the layer fully implemented last.

General remarks on the UML class diagrams

The UML class diagrams in this document try to consistently follow a series of conventions listed below. This list is not meant to be exhaustive and may change over time.

Capitalising attribute types

Attribute types that are default types of the (Python) language are not capitalised.

Attribute types that are instances of self-defined classes are capitalised and spelled exactly as the corresponding class.
Singular and plural forms of attributes

Scalar attributes have singular names.

Attributes containing containers (lists, dictionaries, …) have plural names.
Naming conventions

Generally, naming conventions follow PEP8: class names are in CamelCase, attributes and methods in snake_case.
Attributes of enumerations

No convention has yet been agreed upon. Possibilities would be ALLCAPS (as the attributes could be interpreted as constants) or snake_case.
Dictionaries

Attributes that contain dictionaries as container have the container type followed by curly braces {}, although this seems not to be part of the UML standard.

Important

Partly due to the conventions for the UML class diagrams outlined above and due to the reasons leading to these conventions in the first place, the data model described in the UML class diagrams often differs in subtle details of attribute names from the currently existing data models and, e.g., the SCML schema definition. Eventually, it would be good to agree upon a list of conventions and try to consistently apply them throughout the different interconnected parts (SCML, GUI, engine, evedata, …). These conventions are primarily concerned with a shared vocabulary for the concepts, not with CamelCase vs. snake_case and the like, as this will differ for different languages (and we can agree on mapping rules).

Evefile

Generally, the evefile functional layer, as mentioned already in the Concepts section, provides the interface towards the persistence layer (eveH5 files). This is a rather low-level interface focussing on a faithful representation of all information available in an eveH5 file as well as the corresponding scan description (SCML), as long as the latter is available.

Furthermore, the evefile functional layer provides a stable abstraction of the contents of an eveH5 file and is hence not concerned with different versions of either the eveH5 file structure or the SCML schema. The data model provided via its entities needs to be powerful (and modular) enough to allow for representing all currently existing data files (regardless of their eveH5 and SCML schema versions) and future-proof enough not to change incompatibly (open-closed principle, the “O” in “SOLID”) when new requirements arise.

Important

As the evefile functional layer is not meant as a (human-facing) user interface, it is not concerned with concepts such as “fill modes” (joining), but represents the data “as is”. This means that the different data can generally not be plotted against each other. This is a deliberate decision, as joining data for a (two-dimensional) data array, although generally desirable for (simple) plotting purposes, masks/removes some highly important information, e.g. whether a value has not been measured in the first place, or whether obtaining a value has failed for some reason.

Entities

Entities are the innermost technical layer: everything depends on the entities, but the entities depend on nothing but themselves. Furthermore, entities may have little to no behaviour (i.e., data classes). For the evefile functional layer, the entities consist of three modules: file, data, and metadata, in the order of their dependencies.

file module

Despite the opposite chain of dependencies, starting with the file module seems sensible, as its File class represents a single eveH5 file and provides kind of an entry point.

_images/evedata.evefile.entities.file.svg — Fig. 6 Class hierarchy of the `evefile.entities.file` module. The `File` class is sort of the central interface to the entire subpackage, as this class provides a faithful representation of all information available from a given eveH5 file. To this end, it incorporates instances of classes of the other modules of the subpackage. Furthermore, “Scan” inherits from the identically named facade of the scan functional layer and contains the full information of the SCML file (if the SCML file is present in the eveH5 file).

Points to discuss further (without claiming to be complete)

Split “Scan” and “Station”, similarly to how they are stored in eveH5 in the future?

The scan.boundaries.scan subpackage now contains two facades: Scan with the scan description and Station with the description of all devices (machine, beamline, setup), currently “setup” in the SCML.

The setup part of the SCML file sent to the engine is not necessarily identical to the XML file with the setup description loaded by the engine. Hence, it may make sense to have both stored separately, or have a “Scan” facade that contains only the scan description, and a “Station” facade containing the information on the relevant devices.

Although the engine saves the SCML file to the HDF5 file, it saves the SCML as obtained when submitting a scan. The “only” part the engine currently looks up in the station-related XML (messplatz.xml) is the information necessary for the dynamic snapshots. This somewhat unclear situation led and still leads sometimes to confusion. Which part of what file is saved where in the File class remains to be seen.

Some comments (not discussions any more):

“data”, “snapshots”, “monitors”: lists or dicts?

In the meantime, the three attributes are modelled as dictionaries, with the keys being the names of the corresponding datasets (i.e., the last part of the path within the HDF5 file).
Organisation of datasets in main according to the scan module structure

Despite the current structure of the eveH5 files, datasets will be organised and split according to their use in the different scan modules. In case no SCML is available, a “dummy” scan module will be created containing all the datasets in main.
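The dictionary-based attributes described above can be illustrated with a minimal sketch. It assumes a plain data class; the attribute names follow the text, whereas the helper `dataset_name` and the example path are hypothetical and not taken from the evedata code base:

```python
from dataclasses import dataclass, field


def dataset_name(hdf5_path: str) -> str:
    """Return the last part of an HDF5 path, used here as dictionary key."""
    return hdf5_path.rstrip("/").split("/")[-1]


@dataclass
class File:
    """Hypothetical, stripped-down stand-in for the File entity."""

    data: dict = field(default_factory=dict)
    snapshots: dict = field(default_factory=dict)
    monitors: dict = field(default_factory=dict)


eve_file = File()
# The path below is made up for illustration only.
path = "/c1/main/SimMot:01"
eve_file.data[dataset_name(path)] = [1.0, 2.0, 3.0]
```

Keying by dataset name rather than by list index lets callers address a dataset directly, at the price of requiring unique names within each dictionary.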

data module

Data are organised in “datasets” within HDF5, and the data module provides the relevant entities to describe these datasets. Although currently (as of 08/2024, eve version 2.1) neither average nor interval detector channels save the individual data points, at least the former is a clear need of the engineers/scientists. Hence, the data model already respects this use case. As per position (count) there can be a variable number of measured points, the resulting array is no longer rectangular, but a “ragged array”. While storing such arrays is possible directly in HDF5, the implementation within evedata is entirely independent of the actual representation in the eveH5 file.
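To make the notion of a “ragged array” concrete, here is a small sketch using plain Python lists. The values are invented, and the actual evedata implementation may well use a different container:

```python
from statistics import mean

# One entry per position (count); the number of raw values varies.
raw_values = [
    [1.0, 1.2, 0.8],
    [2.0, 2.2],
    [3.0, 2.9, 3.1, 3.0],
]

# The per-position averages form an ordinary (rectangular) sequence again.
averages = [mean(values) for values in raw_values]
lengths = [len(values) for values in raw_values]
```

The averages correspond to what is currently stored for average channels, while `raw_values` corresponds to the individual data points the data model is prepared for.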

_images/evedata.evefile.entities.data.svg — Fig. 7 Class hierarchy of the `data` module. Each class has a corresponding metadata class in the `metadata` module. While in this diagram, some child classes seem to be identical, they have a different type of metadata (see the `metadata` module below). Generally, having different types serves to discriminate where necessary between detector channels and motor axes. For details on the `ArrayChannelData` and `AreaChannelData` classes, see Fig. 8 and Fig. 10, respectively.

Points to discuss further (without claiming to be complete)

Currently none… ;-)

Some comments (not discussions any more, though):

MonitorData with more than one value per time

MonitorData should have only one value per time, although it can currently not completely be excluded that the same value is monitored multiple times, most probably resulting in identical values at identical times, see #7688, note-11. However, this should be regarded as a bug (and if actually occurring in an eveH5 file, treated in some deterministic way). A special case are monitor data occurring before starting the actual scan, as these all get the special timestamp -1, see #7688, note-10. In this case, only the last (youngest) data point should be retained/used.
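The rule for monitor data recorded before the scan starts could be sketched as follows. The function name is hypothetical, and the sketch assumes that the order of the data points in the dataset reflects the order in which they were recorded, so that “last” means “youngest”:

```python
def drop_stale_prescan_values(timestamps, values, prescan_marker=-1):
    """Keep only the last data point carrying the pre-scan timestamp.

    All other data points are returned unchanged and in their original order.
    """
    prescan_indices = [
        index for index, stamp in enumerate(timestamps) if stamp == prescan_marker
    ]
    stale = set(prescan_indices[:-1])  # everything but the last pre-scan entry
    kept = [
        (stamp, value)
        for index, (stamp, value) in enumerate(zip(timestamps, values))
        if index not in stale
    ]
    return [stamp for stamp, _ in kept], [value for _, value in kept]


timestamps, values = drop_stale_prescan_values(
    [-1, -1, -1, 120, 250], ["a", "b", "c", "d", "e"]
)
```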
Values of MonitorData

MonitorData can have textual (non-numeric) values. This should not be too much of a problem given that numpy can handle string arrays (though before v2.0 only with fixed-size string values; hence, evedata may need to depend on numpy>=2.0).
raw (i.e. individual) values of AverageChannelData and IntervalChannelData

Currently, the measurement program only collects the average values in both cases. However, there is the frequent request to collect the raw values as well. The data structure already supports this.
References to external data/files

There are measurements where for a given position count spectra (1D) or entire images (2D) are recorded. At least for the latter, the data usually reside in external files. Currently, the file name (including the full path, starting with which version of the eveH5 schema?) is stored as value in the dataset in these cases. For a discussion, see #7732. An additional complication: historically, there has been some mismatch between the file number stored in the HDF5 dataset and the actual file number. Hence, some way of correcting the mapping after reading the file needs to be possible.

Generally, spectra (1D data per position count) contained within an eveH5 file in the “arraydata” group are modelled as ArrayChannelData, with the _data attribute being a 2D numpy array. In case of storing images (2D data per position count), these data are modelled as AreaChannelData, with the _data attribute being either a list of 2D/3D numpy arrays (containing the image data for one or different channels), a numpy array of arrays, or a 3D/4D array.

While for a usual HDF5 dataset, the DataImporter object contains the eveH5 filename and dataset path for accessing the data, in case of external data, it contains a list of external filenames/references.
Joined (“filled”) data

Only axis data can and will be filled, and they will be filled differently depending on the channel data they are plotted against.

If we allow several channels to be plotted against one axis, things will get slightly more involved, as the axis data need to be joined with respect to both channels in this case, and probably the channel data filled with NaN values as well. Alternatives would be to have the axis data joined individually for the individual channels, or to delete those points in the channel datasets where the other channel(s) don’t have corresponding values (hooray, there we are again with our different “fill modes”…).

Joining takes place by objects located in the controllers technical layer of the measurement functional layer, and the joined data will be stored in a separate attribute, retaining the original unfilled data.
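As an illustration of what filling axis data with respect to one channel can mean, the following sketch implements a single, assumed fill strategy: for every channel position, use the most recent axis value. The function name and the strategy are assumptions for illustration, not the fill modes actually implemented in the measurement functional layer:

```python
from bisect import bisect_right


def fill_axis(axis_positions, axis_values, channel_positions):
    """Return one axis value per channel position.

    For each channel position, the axis value set at the latest axis position
    not after it is used; None marks positions before the first axis value.
    """
    filled = []
    for position in channel_positions:
        index = bisect_right(axis_positions, position) - 1
        filled.append(axis_values[index] if index >= 0 else None)
    return filled


axis_positions = [2, 5]  # must be sorted in ascending order
axis_values = [10.0, 20.0]
channel_positions = [1, 3, 4, 5, 6]
joined = fill_axis(axis_positions, axis_values, channel_positions)
```

Note that the function returns a new list and leaves `axis_values` untouched, mirroring the idea of storing joined data in a separate attribute while retaining the original unfilled data.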
Detector channels that are redefined within an experiment/scan

Generally, detector channels can be redefined within an experiment/scan, i.e. can have different operational modes (standard/average vs. interval) in different scan modules. Currently (eveH5 v7), all data are stored in the identical dataset on HDF5 level and only by “informed guessing” (if at all possible) can one deduce that they served different purposes. Generally, we need separate datasets on the HDF5 level for detector channels that change their type or attributes within a scan, see #6879, note 16.

The current state of affairs (as of 06/2024) regarding a new eveH5 scheme (v8) is to separate single-point channels from average and interval channels and have average and interval channel datasets per se be suffixed by the scan module ID. Given that one and the same channel can only be used once in a scan module, this should be unique.

While the future way of storing those detector channels in eveH5 files is discussed in #7726, we need a solution for legacy data solving two problems:
1. separating the values for the different channels into separate datasets
This is rather complicated, but probably possible by looking at the different HDF5 datasets where present – although this would require reading the data of the HDF5 datasets if corresponding datasets are available in the “averagemeta” or/and “standarddev” group to check for changes in these data.

Separating the data is, however, only necessary if corresponding datasets are available in the “averagemeta” and/or “standarddev” groups, i.e. loading the data needs to happen only once this condition is met. However, as soon as this condition is met, data for legacy files need to be loaded in order to separate them into distinct datasets. Otherwise, the presumably single detector channel would all of a sudden be split into several datasets when first accessed.
2. sensibly naming the resulting multiple datasets.
Generally, the same strategy as proposed for the new eveH5 scheme should be used here, i.e. suffixing the average and interval detector channels with the scan module ID. Given that one and the same channel can only be used once in a scan module, this should be unique. The type of detector channel can be deduced from the class type.

Getting the scan module ID requires to read the SCML, though, as usually, the SMCounter pseudo-detector channel will not be present. Furthermore, mapping position counts to scan modules is far from simple. Hence, an alternative option may be to suffix the respective datasets with increasing integer numbers, without relation to the scan module ID.
Additional class DeviceData, but not OptionData

Devices seem currently only to be saved as monitors in the “device” section of the eveH5 file and appear as MonitorData. Generally, starting with eve v1.32, all pre-/postscan devices (and options) are automatically stored as monitors, i.e. in the “devices” section of the eveH5 file.

When timestamps of monitor data should be mapped to position counts (while retaining the original monitor data), this most probably means to create new instances of (subclasses of) MeasureData. If these monitor data are devices, this is the case for DeviceData.

Options should generally be mapped to the respective classes the options belong to. For options, we additionally need to distinguish between “scalar” options that do not change within one scan module (and should in the future appear as attributes on the HDF5 level), and options whose values need to be saved for each individual position count (and should in the future appear as additional dataset columns on the HDF5 level).

As of now, scalar options appear as dictionary options in the Metadata class hierarchy, while variable options with individual values per position count appear as dictionary options in the Data class hierarchy. Note that this only applies to options that are not mapped to attributes of an explicit class in the data model.
Dealing with axes read-back values (RBV)

With the advent of precise optical encoders, it turned out that axes do move after arriving at their set point. Hence, for some measurements, axes RBVs are read as pseudo-detector channels. Currently (eveH5 v7), these data end up as detector channels in distinct HDF5 datasets. However, logically they belong to the corresponding axes. Further complications may arise from the fact that there exists the use case for recording these axes RBVs during averaging of detector channel data.

The data model distinguishes between axes with encoders and those without. In the future, for axes with encoders the RBV will be read synchronously to every detector channel readout. This makes filling for those axes unnecessary, as for every detector channel readout there exists a “true” axis value. For average channels, this further means that as many axes RBVs are present as maximum detector channel readouts for this position, resulting in general in a “ragged array”.

For axes without encoders, the RBV does not change after the axis has arrived at its position. Hence, additionally reading these axes RBVs would not make sense, as this would suggest real (measured) values.
Sorting non-monotonic positions in eveH5 datasets

Due to the (intrinsic) way the engine handles scans, position counts can be non-monotonic (#4562, #7722). However, this will usually be a problem for the analysis. Therefore, positions need to be sorted monotonically, and this is done during data import.
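Sorting during import can be sketched as follows; the function name is hypothetical, and plain lists stand in for whatever array type the importer actually uses:

```python
def sort_by_position(positions, values):
    """Return positions and values sorted by ascending position."""
    order = sorted(range(len(positions)), key=positions.__getitem__)
    return [positions[i] for i in order], [values[i] for i in order]


positions, values = sort_by_position(
    [1, 2, 5, 3, 4], [0.1, 0.2, 0.5, 0.3, 0.4]
)
```

Computing the sort order once and applying it to both sequences keeps each value attached to its position count.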

Array channels

Array channels in their general form are channels collecting 1D data. Typical devices used here are MCAs, but oscilloscopes and vector signal analysers (VSA) would be other typical array channels. Hence, for these quite different types of array channels, distinct subclasses of the generic ArrayChannelData class exist, see Fig. 8.

Multi Channel Analysers (MCA) generally collect 1D data and typically have separate regions of interest (ROI) defined, containing the sum of the counts for the given region. For the EPICS MCA record, see https://millenia.cars.aps.anl.gov/software/epics/mcaRecord.html.

_images/mcachannel.svg — Fig. 9 Preliminary data model for the `MCAChannelData` classes. The basic hierarchy is identical to Fig. 7, and here, the relevant part of the metadata class hierarchy from Fig. 12 is shown as well. Separating the `MCAChannelCalibration` class from the `ArrayChannelMetadata` allows to add distinct behaviour, *e.g.* creating calibration curves from the parameters.

Note: The scalar attributes for ArrayChannelROIs will currently be saved as snapshots regardless of whether the actual ROI has been defined/used. Hence, the evedata package needs to decide based on the existence of the actual data whether to create a ROI object and attach it to ArrayChannelData.

The calibration parameters are needed to convert the x axis of the MCA spectrum into a real energy axis. Hence, the MCAChannelCalibration class has a method calibrate() for performing exactly this conversion. The relationship between calibrated units (cal) and channel number (chan) is defined as cal=CALO + chan*CALS + chan^2*CALQ. The first channel in the spectrum is defined as chan=0. However, not all MCAs/SDDs have these calibration values: Ketek SDDs seem to not have these values (internal calibration?).
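The conversion formula given above can be sketched as follows. The attribute names and the signature of `calibrate()` are assumptions; only the formula and the convention that the first channel is `chan=0` are taken from the text:

```python
from dataclasses import dataclass


@dataclass
class MCAChannelCalibration:
    """Hypothetical sketch; attribute names are assumptions."""

    offset: float = 0.0  # CALO
    slope: float = 1.0  # CALS
    quadratic: float = 0.0  # CALQ

    def calibrate(self, n_channels):
        """Return the calibrated axis for channels 0 .. n_channels - 1."""
        return [
            self.offset + chan * self.slope + chan**2 * self.quadratic
            for chan in range(n_channels)
        ]


calibration = MCAChannelCalibration(offset=0.5, slope=2.0, quadratic=0.25)
energies = calibration.calibrate(4)
```

With the default values (offset 0, slope 1, quadratic term 0), the calibrated axis is identical to the channel numbers, which may be a reasonable fallback for detectors that do not provide calibration values.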

The real_time and life_time values can be used to get an idea of the amount of pile-up occurring, i.e. two photons with the same energy reaching the detector within a short time interval and being detected as one photon with twice the energy. Hence, at the latest in the radiometry package, distinct methods for this kind of analysis should be implemented.

Area channels

Area channels are basically 2D channels, i.e., cameras. There are (at least) two distinct types of cameras in use, namely scientific cameras and standard consumer digital cameras for taking pictures (of sample positions in the setup). While scientific cameras usually record only greyscale images, but have additional parameters and can define regions of interest (ROI), consumer cameras are much simpler in terms of their data model and typically record RGB images. These different types of images need to be handled differently in the data processing and analysis pipeline.

_images/scientificcamera.svg — Fig. 11 Preliminary data model for the `ScientificCameraData` classes. The basic hierarchy is identical to Fig. 7, and here, the relevant part of the metadata class hierarchy from Fig. 12 is shown as well. As different area detectors (scientific cameras) have somewhat different options, probably there will appear a basic `AreaChannelData` class with more specific subclasses.

Regarding file names/paths: Some of the scientific cameras are operated from Windows, hence there is usually no unique mapping of paths to actual places of the files, particularly given that Windows allows to map a drive letter to arbitrary network paths. It seems as if these paths are quite different for the different detectors, and therefore, some externally configurable mapping should be used.
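One way such an externally configurable mapping might look is sketched below, using only the Python standard library. The mapping entries, the paths, and the function name are invented for illustration; in practice, the mapping would be read from a configuration file:

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical, externally configurable mapping: Windows prefix -> local path.
PATH_MAPPING = {
    "Z:/detector/images": "/data/detector/images",
}


def map_path(windows_path, mapping=PATH_MAPPING):
    """Translate a Windows path from an eveH5 dataset to a local POSIX path."""
    path = PureWindowsPath(windows_path)
    for prefix, target in mapping.items():
        try:
            relative = path.relative_to(PureWindowsPath(prefix))
        except ValueError:
            continue  # prefix does not match, try the next one
        return PurePosixPath(target).joinpath(*relative.parts)
    raise KeyError(f"No mapping configured for {windows_path!r}")


local = map_path(r"Z:\detector\images\scan_0001\img_0001.tif")
```

Using the pure path classes means the translation works on any operating system, as no file system access takes place.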

Note for Pilatus cameras: Those cameras seem to have three sensors each for temperature and humidity. Probably the simplest solution would be to store those values in an array rather than having three individual fields each. In case of temperature (and humidity) readings for each individual image, the array would become a 2D array.

Points to discuss further (without claiming to be complete)

Label for the ROIs

The camera controller seems to provide the possibility to set labels for ROIs. Is this supported by the EPICS driver currently in use, and could we just add the PV to the template file(s)?
Marker for the ROIs

The camera controller seems to set MinX, MinY, SizeX, SizeY. Is this the generally agreed and consistent way of defining the marker area? What should the four elements of the ScientificCameraROIData.marker attribute represent?

Some comments (not discussions any more, though):

Sample cameras: additional fields

The parameters beam_x and beam_y don’t change between images within a scan module, but are set once when adjusting the camera, and otherwise never change. Hence, they have been moved to the metadata.

Note, however, that currently, these parameters are marked as autoacquire=measurement in the template file.

metadata module

Data without context (i.e. metadata) are mostly useless. Hence, to every class (type) of data in the evefile.data module, there exists a corresponding metadata class.

Note

As compared to the UML schemata for the IDL interface, the decision of whether a certain piece of information belongs to data or metadata is slightly different here. The main reason for this is the problem in current (as of eveH5 v7) files and redefined detector channels, leading to a loss of information that needs to be changed anyway and is discussed above for the data module. With separate datasets for different detector channels, the problem is solved and the immutable metadata belong to the metadata classes (and are converted to attributes on the HDF5 level in the future scheme, v8).

_images/evedata.evefile.metadata.svg — Fig. 12 Class hierarchy of the evefile.metadata module. Each concrete class in the evefile.data module has a corresponding metadata class in this module.

A note on the AbstractDeviceMetadata interface class: The eveH5 dataset corresponding to the TimestampMetadata class is special in the sense of having neither a PV, nor a transport type, nor an id. Several options have been considered to address this problem:

Moving these three attributes down the line and copying them multiple times (feels bad).
Leaving the attributes blank for the “special” dataset (feels bad, too).
Introduce another class in the hierarchy, breaking the parallel to the Data class hierarchy (potentially confusing).
Create a mixin class (abstract interface) with the three attributes and use multiple inheritance/implements.

As is obvious from the UML diagram, the last option has been chosen. The name “DeviceMetadata” resembles the hierarchy in the scan.entities.setup module and clearly distinguishes actual devices from datasets not containing data read from some instrument.
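The mixin approach can be sketched as follows. Only `AbstractDeviceMetadata` and `TimestampMetadata` are names from the text; the `Metadata` base class, the `AxisMetadata` class, the attribute defaults, and the example values are simplified assumptions for illustration:

```python
from dataclasses import dataclass


@dataclass
class Metadata:
    """Base class for all metadata (simplified)."""

    name: str = ""


@dataclass
class AbstractDeviceMetadata:
    """Mixin carrying the attributes only actual devices have."""

    id: str = ""
    pv: str = ""
    access_mode: str = ""


@dataclass
class TimestampMetadata(Metadata):
    """No PV, access mode, or id: the mixin is simply not used."""


@dataclass
class AxisMetadata(Metadata, AbstractDeviceMetadata):
    """An actual device: combines the base class with the mixin."""


axis = AxisMetadata(name="SimMot:01", id="SimMot:01", pv="SimMot:01.RBV", access_mode="ca")
timestamp = TimestampMetadata(name="PosCountTimer")
```

A check such as `isinstance(obj, AbstractDeviceMetadata)` then distinguishes actual devices from the special timestamp dataset without leaving attributes blank.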

Points to discuss further (without claiming to be complete)

Currently none… ;-)

Some comments (not discussions any more, though):

Attributes “pv” and “access_mode”

“pv” is the EPICS process variable, “access_mode” refers to the access mode (local vs. ca, in the future additionally pva). Both are currently (as of eveH5 v7) stored as one attribute “access” in the eveH5 datasets, separated by “:” in the form <access_mode>:<pv>. In the new eveH5 schema (v8), these attributes will be split into two attributes with the corresponding names.
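Splitting the combined “access” attribute of eveH5 v7 files could look like the following sketch. The function name and the example PV are made up; the only assumption taken from the text is the form `<access_mode>:<pv>`:

```python
def split_access(access):
    """Split "<access_mode>:<pv>" at the first colon only.

    PV names may themselves contain colons, so everything after the first
    colon belongs to the PV.
    """
    access_mode, separator, pv = access.partition(":")
    if not separator:
        raise ValueError(f"Expected '<access_mode>:<pv>', got {access!r}")
    return access_mode, pv


access_mode, pv = split_access("ca:BEAMLINE:Motor01.RBV")
```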
Attributes for AverageChannelMetadata and IntervalChannelMetadata

The current model in the UML schemas of data and metadata assumes different data(sets) in case a detector channel gets redefined within a scan, see #7726 and the discussion above. This should be verified and specified.

Todo

Extend Metadata classes to contain all relevant attributes from the SCML/setup description. This should already include metadata regarding the actual devices used not yet available from the SCML/XML but relevant for a true traceability of changes in the setup.

Controllers

Code in the controllers technical layer operates on the entities and provides the required behaviour (the “business logic”).

What may be in here:

Mapping different versions of eveH5 files to the entities
Mapping timestamps to position counts (=> move to measurement subpackage)
Converting MPSKIP scans into average detector channel with adaptive number of recorded points (=> move to measurement subpackage?)
Separating datasets for channels redefined within one scan and currently (up to eveH5 v7) stored in one HDF5 dataset
Extract set values for axes (requires access to SCML)
Correct mapping of file numbers for external files

version_mapping module

For details, see the documentation of the version_mapping module.

Being version agnostic with respect to eveH5 and SCML schema versions is a central aspect of the evedata package. This requires facilities mapping the actual eveH5 files to the data model provided by the entities technical layer of the evefile subpackage. The EveFile facade obtains the correct VersionMapper object via the VersionMapperFactory, providing an HDF5File resource object to the factory. It is the duty of the factory to obtain the “version” attribute from the HDF5File object (possibly requiring to explicitly get the attributes of the root group of the HDF5File object).

_images/evedata.evefile.controllers.version_mapping.svg — Fig. 13 Class hierarchy of the evefile.controllers.version_mapping module, providing the functionality to map different eveH5 file schemas to the data structure provided by the `HDF5File` class. The `VersionMapperFactory` is used to get the correct mapper for a given eveH5 file. For each eveH5 schema version, there exists an individual `VersionMapperVx` class dealing with the version-specific mapping. The idea behind the `Mapping` class is to provide simple mappings for attributes and alike that need not be hard-coded and can be stored externally, *e.g.* in YAML files. This would make it easier to account for (simple) changes.

For each eveH5 schema version, there exists an individual VersionMapperVx class dealing with the version-specific mapping. That part of the mapping common to all versions of the eveH5 schema takes place in the VersionMapper parent class, e.g. removing the chain. The idea behind the Mapping class is to provide simple mappings for attributes and alike that can be stored externally, e.g. in YAML files. This would make it easier to account for (simple) changes.
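The interplay of factory and mapper classes might be sketched like this. The class names follow the text, but the method name `get_mapper`, the registry dictionary, and the use of a plain dictionary as stand-in for the HDF5File resource object are assumptions:

```python
class VersionMapper:
    """Parent class holding the mapping steps common to all versions."""


class VersionMapperV6(VersionMapper):
    """Placeholder for the eveH5 v6 specifics."""


class VersionMapperV7(VersionMapper):
    """Placeholder for the eveH5 v7 specifics."""


class VersionMapperFactory:
    """Select the mapper class from the "version" attribute of the file."""

    # Explicit registry: major schema version -> mapper class.
    mappers = {"6": VersionMapperV6, "7": VersionMapperV7}

    def get_mapper(self, hdf5_file):
        # Assumption: only the major part of the version decides on the mapper.
        major = str(hdf5_file["version"]).split(".")[0]
        try:
            mapper_class = self.mappers[major]
        except KeyError:
            raise NotImplementedError(f"No mapper for eveH5 version {major}")
        return mapper_class()


mapper = VersionMapperFactory().get_mapper({"version": "7.1"})
```

Supporting a new schema version then amounts to adding one subclass and one registry entry, leaving the facade that requests the mapper unchanged.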

Converting MPSKIP scans into average detector channel

Module name: mpskip
Dependencies: Scan, and here in particular the channels used in the scan module with the MPSKIP channel

Given that the data model does not correspond to the current eveH5 structure (v7), it makes sense to convert scans using MPSKIP to “fake” average detector channels storing the individual data points on this level.

Important

MPSKIP is exclusively used by one group, and not only for storing the individual data points for averaging, but also for recording axes RBVs for each individual detector channel readout, due to motor axes changing their position slightly. The axes whose RBVs are recorded using pseudo-detectors are/should be equipped with encoders to ensure that actual values are read. The data model now supports axes objects having more than one value per position to account for this situation. This means, however, that MPSKIP scans with individual position counts for each detector readout need to be converted to scans with multiple values available per individual position.

A typical scan using MPSKIP uses the MPSKIP detector in the innermost scan module. Here, the only motor axis is a counter (defining the maximum number of attempts to record data for a single averaging), and the detector channels are the primary readout (electrometer), often a secondary electrometer, the ring current and lifetime, and a series of RBVs from motor axes. Additionally, the SM-Counter detector is used in this scan module. All the actual motor axes are set in outer scan modules, typically an overall positioning of the sample (by means of a goniometer) in the outer scan module and the monochromator in the inner scan module. This amounts to a doubly nested scan in total, with the actual detector values recorded at positions for which no motor axis has corresponding positions (apart from the RBVs, which are currently pseudo-detector channels and hence marked as channel).

The MPSKIP detector channel datasets get mapped to a SkipData object, and if such an object is present, all detectors in the same scan module (i.e., with identical positions) need to be converted:

Actual detectors (not RBVs as pseudo-detector channels) need to be mapped to either AverageChannelData or AverageNormalizedChannelData.
The Counter (axis) data can be used to conveniently determine the positions for each individual averaging. Afterwards, the data object should be removed.
Axes RBVs (present as pseudo-detector channels) need to be mapped to AxisData objects with the individual axis values stored as ragged array.
The SM-Counter (channel) data can be removed.
What about the Time(r) data? This is the (cumulative) time in seconds.
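
The conversion steps listed above can be sketched as follows. This is a minimal sketch and not the evedata implementation: the function names and the plain-list inputs are hypothetical, and it assumes that the counter restarts for every individual averaging, so that a counter value that does not increase marks the start of a new averaging. Whether the standard deviation should be the population or the sample variant is likewise an assumption.

```python
from statistics import mean, pstdev


def split_by_counter(counter):
    """Return one list of indices per individual averaging.

    Assumption: a new averaging starts whenever the counter value
    does not increase compared to the previous readout.
    """
    groups = []
    for index, value in enumerate(counter):
        if index == 0 or value <= counter[index - 1]:
            groups.append([])
        groups[-1].append(index)
    return groups


def convert(counter, channel, axis_rbv):
    """Collapse individual readouts into one entry per averaging."""
    groups = split_by_counter(counter)
    raw = [[channel[i] for i in group] for group in groups]
    return {
        # Averaged detector values, one per averaging
        "mean": [mean(values) for values in raw],
        # Population standard deviation, defined for a single readout as well
        "std": [pstdev(values) for values in raw],
        # Individual readouts, kept as ragged list of lists
        "raw": raw,
        # Axis RBVs as ragged list: several values per resulting position
        "axis": [[axis_rbv[i] for i in group] for group in groups],
    }
```

For example, a counter sequence `[1, 2, 3, 1, 2]` yields two averagings with three and two individual readouts, respectively, and the axis RBVs end up as two lists of the same lengths.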

If the SCML is present, reading the scan part of the SCML and inferring the motor axes and detector channels where the MPSKIP detector is present makes it much easier to get the names of the data objects that need to be modified. Hence, it might be sensible to (i) implement the minimum functionality of the scan subpackage necessary and (ii) rely on the SCML being present for the time being. A sensible option might be to check whether the SCML is present if a SkipData object has been created and, in those (probably rare) cases where it is missing, to issue a warning that this is currently not supported.

Separating datasets for redefined channels

Module name:: dataset_separation
Dependencies:: Scan, and here the channels defined in a scan module as well as the corresponding scan module ID

Generally, detector channels can be redefined within an experiment/scan, i.e. can have different operational modes (standard/average vs. interval) in different scan modules. Currently (eveH5 v7), all data are stored in the identical dataset on HDF5 level and only by “informed guessing” (if at all possible) can one deduce that they served different purposes. Generally, we need separate datasets on the HDF5 level for detector channels that change their type or attributes within a scan, see #6879, note 16.

The current state of affairs (as of 09/2024) regarding a new eveH5 scheme (v8) is to separate single-point channels from average and interval channels and have average and interval channel datasets per se be suffixed by the scan module ID. Given that one and the same channel can only be used once in a scan module, this should be unique.

While the future way of storing those detector channels in eveH5 files is discussed in #7726, we need a solution for legacy data solving two problems:

separating the values for the different channels into separate datasets

This is rather complicated, but probably possible by looking at the different HDF5 datasets where present – although this would require reading the data of the HDF5 datasets if corresponding datasets are available in the “averagemeta” or/and “standarddev” group to check for changes in these data.

Separating the data is, however, only necessary if corresponding datasets are available in the “averagemeta” or/and “standarddev” groups. I.e., loading the data needs to happen only once this condition is met. However, as soon as this condition is met, data for legacy files need to be loaded in order to separate them into separate datasets. Otherwise, the presumably single detector channel would, surprisingly, be split into several datasets only when it is first accessed.

sensibly naming the resulting multiple datasets.

Generally, the same strategy as proposed for the new eveH5 scheme should be used here, i.e. suffixing the average and interval detector channels with the scan module ID. Given that one and the same channel can only be used once in a scan module, this should be unique. The type of detector channel can be deduced from the class type.

Getting the scan module ID requires reading the SCML, though, as usually the SMCounter pseudo-detector channel will not be present. Furthermore, mapping position counts to scan modules is far from simple. Hence, an alternative option may be to suffix the respective datasets with increasing integer numbers, without relation to the scan module ID.
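
Both steps, separating the values and naming the resulting datasets, can be illustrated with a minimal sketch. All names are hypothetical. The sketch assumes that one hashable label per position is already available that distinguishes the different usages of the channel (e.g. derived from the averaging settings found in “averagemeta”); obtaining these labels is the hard part and not covered here. The underscore used as separator is an assumption as well.

```python
def separate(positions, values, labels):
    """Group (position, value) pairs of one channel by usage label."""
    datasets = {}
    for position, value, label in zip(positions, values, labels):
        datasets.setdefault(label, []).append((position, value))
    return datasets


def name_datasets(channel, datasets, scan_module_ids=None):
    """Suffix the channel name once per separated dataset.

    If a mapping from label to scan module ID is given, the ID is used
    as suffix; otherwise increasing integers serve as fallback.
    """
    named = {}
    for number, label in enumerate(datasets, start=1):
        suffix = scan_module_ids[label] if scan_module_ids else number
        named[f"{channel}_{suffix}"] = datasets[label]
    return named
```

The fallback with increasing integers corresponds to the alternative mentioned above for situations where the scan module ID cannot be obtained.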

Extract set values for axes

Module name:: axes_set_values
Dependencies:: Scan, and here the details of the axes positions defined in each scan module – thus requiring parsing of the different ways of defining axes positions.

The axes positions stored in the HDF5 file are the RBVs after positioning. If, however, an axis never reached the set value due to limit violation or other constraints, this is usually not visible from the HDF5 file, as the severity is typically not recorded. However, the set values for each axis can be inferred from the scan description. Having this information would be helpful for routine checks whether a scan ran as expected. Set values are stored in the set_values attribute of the AxisData class.
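
A routine check of this kind could compare set values and RBVs position by position. The following is a minimal sketch: the function name is hypothetical, plain lists stand in for the data of an AxisData object, and the absolute tolerance is an arbitrary assumption that would in practice depend on the axis.

```python
import math


def positions_off_target(set_values, readback_values, tolerance=1e-3):
    """Return the indices where the RBV deviates from the set value."""
    return [
        index
        for index, (target, actual) in enumerate(
            zip(set_values, readback_values)
        )
        if not math.isclose(target, actual, abs_tol=tolerance)
    ]
```

An empty list would indicate that the axis reached all its set values within the given tolerance.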

Note

In case the axis set values in the SCML are only a reference to an external file, these values cannot reliably be read afterwards. Hence, in this case, probably either a warning should be issued or the situation silently ignored. In a future eveH5 scheme, it may be sensible to store those values in some way in the file.

Correct mapping of file numbers for external files

In the past, there have been some occasions where the stored file numbers for external files (usually images from 2D detectors) did not match the correct files. Hence, we need some mechanism to modify the file numbers after loading a file for a correct mapping. It is unclear so far whether there is a way to automatically and reliably detect when to apply this correction.

Boundaries

What may be in here:

facade:
- EveFile

resources:

HDF5File
Interfaces towards additional files, e.g. images
- Images in particular are usually not stored in the eveH5 files, but only pointers to these files.
- Import routines for the different files (or at least a sensible modular mechanism involving an importer factory) need to be implemented.
- Is the evedata package the correct place for these importers? One could think of the radiometry package as the better place, but on the other hand, the evedataviewer package would need to be able to display those data as well, and hence needs the import to be done.
  
  Given that the evedata package acts as a (complicated) importer, all importer mechanisms for additional data not stored in the HDF5 file should be implemented here.
Interfaces towards other file formats
- One potential candidate for an exchange format would be the NeXus format. However, there is not one NeXus file format, but there are several schemas for different types of experiments. For details, see the NeXus application definitions. Hence, those exporters may better be located in the radiometry package.
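
The modular mechanism involving an importer factory mentioned above for additional files might be sketched as follows. This is purely illustrative: all class and method names are hypothetical, the two importers only read text or raw bytes rather than actual image formats, and selecting the importer by file extension is just one conceivable strategy.

```python
from pathlib import Path


class Importer:
    """Base class; subclasses implement reading one external file format."""

    def __init__(self, source):
        self.source = Path(source)

    def load(self):
        raise NotImplementedError


class TextImporter(Importer):
    def load(self):
        return self.source.read_text()


class RawImporter(Importer):
    """Fallback returning the unprocessed bytes of the file."""

    def load(self):
        return self.source.read_bytes()


class ImporterFactory:
    """Return an importer suitable for a given file."""

    def __init__(self):
        # Hypothetical registry; real formats would be registered here.
        self.importers = {".txt": TextImporter}

    def get_importer(self, source):
        suffix = Path(source).suffix.lower()
        importer_class = self.importers.get(suffix, RawImporter)
        return importer_class(source)
```

Callers would only ever talk to the factory, so that supporting a further file format amounts to adding one importer class and one registry entry.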

evefile module (facade)

_images/evedata.evefile.boundaries.evefile.svg — Fig. 14 Class hierarchy of the evefile.boundaries.evefile module, providing the facade for an eveH5 file. Currently, the basic idea is to inherit from the `File` entity and extend it accordingly, adding behaviour.

As per Fig. 14, the EveFile class inherits from the File class of the entities subpackage. Reading (loading) an eveH5 file results in calling out to HDF5File.read(), followed by mapping the eveH5 contents to the data model. Additionally, for eveH5 v7 and below, datasets for detector channels that have been redefined within one scan and scans using MPSKIP are mapped to the respective datasets accordingly. Last but not least, the corresponding SCML (and setup description, where applicable) is loaded and the metadata contained therein mapped to the metadata of the corresponding datasets.

Some comments (not discussions any more, though):

Metadata from SCML file

There is more information available from the SCML file (and the measurement station/beam line description - but that is generally not available when reading eveH5 files if it is not contained in the SCML). This information needs to be mapped to the respective metadata classes (and those classes be extended accordingly). This mapping will take place here because, according to the scheme of the functional and technical layers, the evefile subpackage depends on the scan subpackage.

Non-monotonic position counts in eveH5 datasets

Due to the (intrinsic) way the engine handles scans, position counts can be non-monotonic (#4562, #7722). However, this will usually be a problem for the analysis. Therefore, the sorting logic is implemented in the entities layer in the load method(s).
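
The sorting itself is straightforward once positions and values are loaded. A minimal sketch, with a hypothetical function name and plain lists standing in for the actual data arrays:

```python
def sort_by_position(positions, values):
    """Return positions and values reordered by ascending position count."""
    # Indices that would sort the positions; sorted() is stable, so entries
    # with identical position counts keep their original order.
    order = sorted(range(len(positions)), key=positions.__getitem__)
    return [positions[i] for i in order], [values[i] for i in order]
```

The important point is that positions and values are reordered with the same index sequence, so each value stays attached to its position count.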

eveh5 module (resource)

The aim of this module is to provide a Python representation (in the form of a hierarchy of objects) of the contents of an eveH5 file that can be mapped to both the evefile and measurement interfaces. While the Python h5py package already provides the low-level access and gets used, the eveh5 module contains Python objects that are independent of an open HDF5 file, represent the hierarchy of HDF5 items (groups and datasets), and contain the attributes of each HDF5 item in the form of a Python dictionary. Furthermore, each object contains a reference to both the original HDF5 file and the HDF5 item, thus making reading dataset data on demand as simple as possible.

_images/evedata.evefile.boundaries.eveh5.svg — Fig. 15 Class hierarchy of the `evedata.evefile.boundaries.eveh5` module. The `HDF5Item` class and children represent the individual HDF5 items on a Python level, similarly to the classes provided in the h5py package, but *without* requiring an open HDF5 file. Furthermore, reading actual data (dataset values) is deferred by default.

As such, the HDF5Item class hierarchy shown above is pretty generic and should work with all eveH5 versions. However, it is not meant as a generic HDF5 interface, as it does make some assumptions based on the eveH5 file structure and format.
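
The general idea of objects that are independent of an open HDF5 file and read data only on demand can be sketched with h5py as follows. The classes `ItemSketch` and `DatasetSketch` and the function `read_structure` are hypothetical stand-ins for illustration and not the classes of the eveh5 module; in particular, the flat dictionary returned here replaces the actual object hierarchy.

```python
import h5py


class ItemSketch:
    """Stand-in for an HDF5 item: file name, item name, and attributes."""

    def __init__(self, filename="", name=""):
        self.filename = filename
        self.name = name
        self.attributes = {}


class DatasetSketch(ItemSketch):
    """Stand-in for an HDF5 dataset whose values are read on first access."""

    def __init__(self, filename="", name=""):
        super().__init__(filename, name)
        self._data = None

    @property
    def data(self):
        if self._data is None:
            # Open the file only for this read and close it right afterwards.
            with h5py.File(self.filename, "r") as file:
                self._data = file[self.name][...]
        return self._data


def read_structure(filename):
    """Collect all items with their attributes, without reading any data."""
    items = {}

    def collect(name, obj):
        cls = DatasetSketch if isinstance(obj, h5py.Dataset) else ItemSketch
        item = cls(filename, name)
        item.attributes = dict(obj.attrs)
        items[name] = item

    with h5py.File(filename, "r") as file:
        file.visititems(collect)
    return items
```

After `read_structure()` returns, no file handle is open any more, and accessing the `data` property of a dataset reopens the file just for that one read.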

Some comments (not discussions any more, though):

Reading the entire content of an eveH5 file at once vs. deferred reading?
- Reading relevant metadata (e.g., to decide about what data to plot) should be rather fast. And generally, only two “columns” will be displayed (as f(x,y) plot) at any given time – at least if we don’t radically change the way data are looked at compared to the IDL Cruncher.
- References to the internal datasets of a given HDF5 file are stored in the corresponding Python data structures (together with the HDF5 file name). Hence, HDF5 files are closed after each operation, so as not to have open file handles that may be problematic (but see the quote from A. Collette below).
- Plotting requires data to be properly filled, although filling will most probably not take place globally once, but on a per-plot basis. See the discussion on fill modes, currently below in the Dataset subpackage section.

From the book “Python and HDF5” by Andrew Collette:

You might wonder what happens if your program crashes with open files. If the program exits with a Python exception, don’t worry! The HDF library will automatically close every open file for you when the application exits.

—Andrew Collette, 2014 (p. 18)

Measurement

Generally, the measurement subpackage, as mentioned already in the Concepts section, provides the interface towards the “user”, where user mostly means the evedataviewer and radiometry packages. However, besides these two Python packages, human users will want to use the evedata package as well. Hence, it should be as human-friendly as possible.

The overall package structure of the evedata package is shown in Fig. 5. Furthermore, a series of (still higher-level) UML schemata for the measurement subpackage are shown below, reflecting the current state of affairs (and thinking).

Note

The mapping of the information contained in both the HDF5 and SCML layers of an eveH5 file to the measurement is far from being properly modelled or understood. This is partly due to the step-wise progress in understanding. On a rather fundamental level, it remains to be decided whether a Measurement should allow for reconstructing how a measurement has actually been carried out (i.e., provide access to the SCML and hence the anatomy of the scan).

What is the main difference between the evefile and the measurement subpackages? Basically, the information contained in an eveH5 file needs to be “interpreted” to be able to process, analyse, and plot the data. While the evefile subpackage provides the necessary data structures to faithfully represent all information contained in an eveH5 file, the measurement subpackage provides the result of an “interpretation” of this information in a way that facilitates data processing, analysis and plotting.

However, the measurement subpackage is still general enough to cope with all the different kinds of measurements the eve measurement program can deal with. Hence, it may be a wise idea to create dedicated dataset classes in the radiometry package for different types of experiments. The NeXus file format may be a good source of inspiration here, particularly their application definitions. The evedataviewer package in contrast aims at displaying whatever kind of measurement has been performed using the eve measurement program. Hence it will deal directly with Measurement (facade) objects of the measurement subpackage.

Arguments against the 2D data array as sensible representation

Currently, one very common and heavily used abstraction of the data contained in an eveH5 file is a two-dimensional data array (basically a table with column headers, implemented as a pandas DataFrame). As it stands, many problems in the data analysis and preprocessing of data come from the inability of this abstraction to properly represent the data. Two obvious cases where this 2D approach simply breaks down are:

subscans – essentially a 2D dataset on its own
adaptive average detector channel saving the individual, non-averaged values (implemented using MPSKIP)

Furthermore, as soon as spectra (1D) or images (2D) are recorded for a given position (count), the 2D data array abstraction breaks down as well.

Another problem inherent in the 2D data array abstraction is the necessary filling of values that have not been obtained. Currently, once the data are filled, there is no way to figure out for an individual position whether a value has actually been recorded (in case of LastFill) or whether a value has not been recorded or recording failed (in case of NaNFill).
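
The loss of information can be illustrated with a simplified last-value fill. This is a sketch under stated assumptions, not the fill implementation of any eve tool: the function name and the dictionary input are hypothetical, and the fill is reduced to "repeat the last known value". The returned mask is exactly the piece of information that a filled 2D array alone no longer contains.

```python
def last_fill(positions, recorded):
    """Fill missing positions with the last known value.

    ``recorded`` maps position counts to actually recorded values.
    Returns the filled values and a mask telling which of them
    have actually been recorded.
    """
    filled, mask = [], []
    last = None
    for position in positions:
        if position in recorded:
            last = recorded[position]
        filled.append(last)
        mask.append(position in recorded)
    return filled, mask
```

For positions 1 to 4 with values recorded only at positions 1 and 3, the filled values at positions 2 and 4 are indistinguishable from recorded ones unless the mask is kept alongside.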

A few ideas/comments for further modelling this subpackage:

evefile represents the eveH5 file, while measurement maps the different datasets to more sensible abstractions.
- Not all abstractions will necessarily be reflected in the future in the eveH5 file. Currently (eveH5 v7), most of the abstractions are clearly not visible there. To deal with this situation, the entities in the evefile subpackage should reflect the future eveH5 scheme and abstractions therein, with the version mapping from the controller technical layer responsible for the mapping of older eveH5 schemata to the entities.