Working with multiple ARM files and Xarray#

When working with multiple ARM files, such as those from the datastream sgpmfrsraod1michE13.c1, errors can occur depending on the parameters used to read in the files. ACT uses xarray's open_mfdataset to read multiple files and passes these parameters through to that function. They include combine (how xarray combines the files), data_vars (whether to concatenate all variables regardless of their dimensions), and many more. Depending on how coordinates and dimensions are laid out in the files, for example when there are several dimensions besides time, the default reader settings in ACT and xarray can fail.

Let's start by looking at a single file from our sgpmfrsraod1michE13.c1 datastream:

import glob

import xarray as xr
import act
files = sorted(glob.glob('sgpmfr_data/*'))
files
['sgpmfr_data/sgpmfrsraod1michE13.c1.20200101.000000.cdf',
 'sgpmfr_data/sgpmfrsraod1michE13.c1.20200102.000000.cdf',
 'sgpmfr_data/sgpmfrsraod1michE13.c1.20200103.000000.cdf',
 'sgpmfr_data/sgpmfrsraod1michE13.c1.20200104.000000.cdf',
 'sgpmfr_data/sgpmfrsraod1michE13.c1.20200105.000000.cdf',
 'sgpmfr_data/sgpmfrsraod1michE13.c1.20200106.000000.cdf',
 'sgpmfr_data/sgpmfrsraod1michE13.c1.20200107.000000.cdf']
ds = act.io.read_arm_netcdf(files[0])
ds
<xarray.Dataset> Size: 3MB
Dimensions:                                  (time: 4320, bench_angle: 181,
                                              wavelength: 750,
                                              Io_interquartile_time: 121,
                                              Io_wavelength: 5,
                                              Io_gauss_time: 61)
Coordinates:
  * time                                     (time) datetime64[ns] 35kB 2020-...
  * bench_angle                              (bench_angle) float32 724B 0.0 ....
  * wavelength                               (wavelength) float32 3kB 325.0 ....
  * Io_interquartile_time                    (Io_interquartile_time) object 968B ...
  * Io_gauss_time                            (Io_gauss_time) object 488B 2019...
Dimensions without coordinates: Io_wavelength
Data variables: (12/245)
    base_time                                datetime64[ns] 8B 2020-01-01
    time_offset                              (time) datetime64[ns] 35kB 2020-...
    hemisp_broadband_raw                     (time) float32 17kB dask.array<chunksize=(4320,), meta=np.ndarray>
    hemisp_narrowband_filter1_raw            (time) float32 17kB dask.array<chunksize=(4320,), meta=np.ndarray>
    hemisp_narrowband_filter2_raw            (time) float32 17kB dask.array<chunksize=(4320,), meta=np.ndarray>
    hemisp_narrowband_filter3_raw            (time) float32 17kB dask.array<chunksize=(4320,), meta=np.ndarray>
    ...                                       ...
    direct_normal_transmittance_filter5      (time) float32 17kB dask.array<chunksize=(4320,), meta=np.ndarray>
    Io_interquartile_values                  (Io_interquartile_time, Io_wavelength) float32 2kB dask.array<chunksize=(121, 5), meta=np.ndarray>
    Io_gauss_values                          (Io_gauss_time, Io_wavelength) float32 1kB dask.array<chunksize=(61, 5), meta=np.ndarray>
    lat                                      float32 4B ...
    lon                                      float32 4B ...
    alt                                      float32 4B ...
Attributes: (12/83)
    command_line:                                mfraod -n mfraod_mfrsr -s sg...
    Conventions:                                 ARM-1.3
    process_version:                             vap-mfraod-2.9-1.el7
    dod_version:                                 mfrsraod1mich-c1-2.8
    input_datastreams:                           sgpmetE13.b1 : 4.41 : 202001...
    site_id:                                     sgp
    ...                                          ...
    Forgan_EndDate:                              20200131
    history:                                     created by user bfaye on mac...
    _file_dates:                                 ['20200101']
    _file_times:                                 ['000000']
    _datastream:                                 sgpmfrsraod1michE13.c1
    _arm_standards_flag:                         1

A single file reads without trouble because nothing has to be concatenated. Looking at the dataset above, though, we can see that it has multiple dimensions such as time, Io_gauss_time, wavelength and more. Reading several of these files with the default setting combine='by_coords' will fail, because xarray then has to work out on its own how to combine the files, which it cannot do reliably given the many dimensions and even dimensionless variables.
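
Before reading several files, it can help to list which variables do not have time among their dimensions, since those are the ones that cause trouble. A minimal sketch on a small synthetic dataset: the variable names mirror the listing above, but the values and sizes are made up, and ds_demo and without_time are hypothetical names used only for illustration.

```python
import numpy as np
import xarray as xr

# Small synthetic stand-in for one daily file (values and sizes are made up).
start = np.datetime64("2020-01-01", "ns")
ds_demo = xr.Dataset(
    data_vars={
        "hemisp_broadband_raw": ("time", np.arange(4, dtype="float32")),
        "Io_gauss_values": (
            ("Io_gauss_time", "Io_wavelength"),
            np.zeros((3, 5), dtype="float32"),
        ),
        "lat": ((), np.float32(36.6)),
        "base_time": ((), start),
    },
    coords={"time": start + np.arange(4) * np.timedelta64(20, "s")},
)

# Variables whose dimensions do not include "time".
# Note that "Io_gauss_time" is a different dimension and does not count.
without_time = [
    name for name, var in ds_demo.data_vars.items() if "time" not in var.dims
]
print(without_time)
```

The same list comprehension can be applied to the ds read from files[0] above.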

An example of parameters that would fail:

ds = act.io.read_arm_netcdf(
    files,
    combine='by_coords',
    join='outer',
    data_vars='all',
    coords='minimal',
)

xarray will either time out or raise a ValueError for base_time and the other variables that do not have a time dimension.

How do we solve this?#

One option is to help xarray determine how to combine these files. There are two ways to do this: we can combine the files by time and ignore variables that do not have a time dimension, or we can combine the files by time and expand the non-time variables along time. In the second case, a variable whose only dimension is wavelength, for example, becomes (time, wavelength). Which to choose depends on whether the user only needs the time variables or also wishes to use the others.

An example of working code that concatenates by time and expands the non-time variables along time:

ds_all = act.io.read_arm_netcdf(
    files,
    combine='nested',
    concat_dim='time',
    join='outer',
    data_vars='all',
)
ds_all
<xarray.Dataset> Size: 3GB
Dimensions:                                  (time: 30240, bench_angle: 181,
                                              wavelength: 750,
                                              Io_interquartile_time: 133,
                                              Io_wavelength: 5,
                                              Io_gauss_time: 67)
Coordinates:
  * time                                     (time) datetime64[ns] 242kB 2020...
  * bench_angle                              (bench_angle) float32 724B 0.0 ....
  * wavelength                               (wavelength) float32 3kB 325.0 ....
  * Io_interquartile_time                    (Io_interquartile_time) object 1kB ...
  * Io_gauss_time                            (Io_gauss_time) object 536B 2019...
Dimensions without coordinates: Io_wavelength
Data variables: (12/245)
    base_time                                (time) datetime64[ns] 242kB 2020...
    time_offset                              (time) datetime64[ns] 242kB 2020...
    hemisp_broadband_raw                     (time) float32 121kB dask.array<chunksize=(4320,), meta=np.ndarray>
    hemisp_narrowband_filter1_raw            (time) float32 121kB dask.array<chunksize=(4320,), meta=np.ndarray>
    hemisp_narrowband_filter2_raw            (time) float32 121kB dask.array<chunksize=(4320,), meta=np.ndarray>
    hemisp_narrowband_filter3_raw            (time) float32 121kB dask.array<chunksize=(4320,), meta=np.ndarray>
    ...                                       ...
    direct_normal_transmittance_filter5      (time) float32 121kB dask.array<chunksize=(4320,), meta=np.ndarray>
    Io_interquartile_values                  (time, Io_interquartile_time, Io_wavelength) float32 80MB dask.array<chunksize=(4320, 121, 5), meta=np.ndarray>
    Io_gauss_values                          (time, Io_gauss_time, Io_wavelength) float32 41MB dask.array<chunksize=(4320, 61, 5), meta=np.ndarray>
    lat                                      (time) float32 121kB 36.6 ... 36.6
    lon                                      (time) float32 121kB -97.49 ... ...
    alt                                      (time) float32 121kB 318.0 ... 3...
Attributes: (12/83)
    command_line:                                mfraod -n mfraod_mfrsr -s sg...
    Conventions:                                 ARM-1.3
    process_version:                             vap-mfraod-2.9-1.el7
    dod_version:                                 mfrsraod1mich-c1-2.8
    input_datastreams:                           sgpmetE13.b1 : 4.41 : 202001...
    site_id:                                     sgp
    ...                                          ...
    Forgan_EndDate:                              20200131
    history:                                     created by user bfaye on mac...
    _file_dates:                                 ['20200101', '20200102', '20...
    _file_times:                                 ['000000', '000000', '000000...
    _datastream:                                 sgpmfrsraod1michE13.c1
    _arm_standards_flag:                         1

So… what is this code doing differently and what do these parameters mean?

To start, by setting combine='nested' and concat_dim='time', we are telling xarray to concatenate the files in the order they appear in the file list, along the time dimension. With data_vars='all' we are also telling xarray to expand the non-time variables along the time dimension, and with join='outer' we are joining using the union of the object indexes. As you can see, we have 30240 times, the combination of all the files' times, and time was added as a dimension to base_time and others.
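
Expanding every variable along time has a memory cost, which is visible in the output above: a single file is reported as 3MB, while ds_all is reported as 3GB. A rough check of the sizes shown for two of the expanded variables, using only the dimension sizes from the ds_all output and 4 bytes per float32 value; nbytes is a hypothetical helper written for this sketch.

```python
from math import prod

FLOAT32_BYTES = 4


def nbytes(*shape):
    """Bytes needed for a float32 array of the given shape."""
    return prod(shape) * FLOAT32_BYTES


# Dimension sizes taken from the ds_all output above.
n_time, n_iq_time, n_gauss_time, n_io_wavelength = 30240, 133, 67, 5

expanded_iq = nbytes(n_time, n_iq_time, n_io_wavelength)
expanded_gauss = nbytes(n_time, n_gauss_time, n_io_wavelength)
original_iq = nbytes(n_iq_time, n_io_wavelength)

print(f"Io_interquartile_values with time:    {expanded_iq / 1e6:.0f} MB")
print(f"Io_gauss_values with time:            {expanded_gauss / 1e6:.0f} MB")
print(f"Io_interquartile_values without time: {original_iq / 1e3:.1f} kB")
```

The first two figures reproduce the 80MB and 41MB listed for these variables in ds_all, so each variable grows by a factor equal to the number of time steps.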

Warning: this will only work if your file list is sorted in the order in which you wish to merge these files. ACT has a built-in sorted call that orders the files by the time in the file name.#
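
When a file list is assembled by hand, one way to guarantee chronological order is to sort on the date and time fields in the file name. A minimal sketch, assuming the names follow the datastream.YYYYMMDD.hhmmss.cdf pattern seen in the file listing above; file_timestamp is a hypothetical helper and not part of ACT or xarray.

```python
from datetime import datetime
from pathlib import Path


def file_timestamp(path):
    """Parse the YYYYMMDD.hhmmss fields that precede the file extension."""
    date_str, time_str = Path(path).name.split(".")[-3:-1]
    return datetime.strptime(date_str + time_str, "%Y%m%d%H%M%S")


# Deliberately shuffled list, for illustration only.
unsorted_files = [
    "sgpmfr_data/sgpmfrsraod1michE13.c1.20200103.000000.cdf",
    "sgpmfr_data/sgpmfrsraod1michE13.c1.20200101.000000.cdf",
    "sgpmfr_data/sgpmfrsraod1michE13.c1.20200102.000000.cdf",
]

ordered_files = sorted(unsorted_files, key=file_timestamp)
print(ordered_files)
```

For files from a single datastream in one directory, a plain sorted(glob.glob(...)) as used earlier gives the same order, because the date and time fields are zero-padded.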

Now, if we do not care about the non-time variables, we can set data_vars='minimal', which concatenates only the variables that have the time dimension, since time is the dimension we are specifying. We also want to set compat='override'. This tells xarray what to do when there is a conflict, which we know there will be for base_time because it has no dimension: xarray skips the comparison and simply uses the first file's data for the variables without a time dimension.

ds_min = xr.open_mfdataset(
    files,
    combine='nested',
    concat_dim='time',
    join='outer',
    data_vars='minimal',
    coords='minimal',
    compat='override',
)
ds_min
<xarray.Dataset> Size: 19MB
Dimensions:                                  (time: 30240, bench_angle: 181,
                                              wavelength: 750,
                                              Io_interquartile_time: 133,
                                              Io_wavelength: 5,
                                              Io_gauss_time: 67)
Coordinates:
  * time                                     (time) datetime64[ns] 242kB 2020...
  * bench_angle                              (bench_angle) float32 724B 0.0 ....
  * wavelength                               (wavelength) float32 3kB 325.0 ....
  * Io_interquartile_time                    (Io_interquartile_time) datetime64[ns] 1kB ...
  * Io_gauss_time                            (Io_gauss_time) datetime64[ns] 536B ...
Dimensions without coordinates: Io_wavelength
Data variables: (12/245)
    base_time                                datetime64[ns] 8B ...
    time_offset                              (time) datetime64[ns] 242kB dask.array<chunksize=(4320,), meta=np.ndarray>
    hemisp_broadband_raw                     (time) float32 121kB dask.array<chunksize=(4320,), meta=np.ndarray>
    hemisp_narrowband_filter1_raw            (time) float32 121kB dask.array<chunksize=(4320,), meta=np.ndarray>
    hemisp_narrowband_filter2_raw            (time) float32 121kB dask.array<chunksize=(4320,), meta=np.ndarray>
    hemisp_narrowband_filter3_raw            (time) float32 121kB dask.array<chunksize=(4320,), meta=np.ndarray>
    ...                                       ...
    direct_normal_transmittance_filter5      (time) float32 121kB dask.array<chunksize=(4320,), meta=np.ndarray>
    Io_interquartile_values                  (Io_interquartile_time, Io_wavelength) float32 3kB dask.array<chunksize=(121, 5), meta=np.ndarray>
    Io_gauss_values                          (Io_gauss_time, Io_wavelength) float32 1kB dask.array<chunksize=(61, 5), meta=np.ndarray>
    lat                                      float32 4B ...
    lon                                      float32 4B ...
    alt                                      float32 4B ...
Attributes: (12/79)
    command_line:                                mfraod -n mfraod_mfrsr -s sg...
    Conventions:                                 ARM-1.3
    process_version:                             vap-mfraod-2.9-1.el7
    dod_version:                                 mfrsraod1mich-c1-2.8
    input_datastreams:                           sgpmetE13.b1 : 4.41 : 202001...
    site_id:                                     sgp
    ...                                          ...
    mfr_internal_longitude:                      -97.485000
    Langley_data_used:                           michalsky algorithm
    pressure_fraction_for_Rayleigh_calculation:  0.959231
    Forgan_StartDate:                            20191202
    Forgan_EndDate:                              20200131
    history:                                     created by user bfaye on mac...
ds_min.base_time.values
np.datetime64('2020-01-01T00:00:00.000000000')

As we can see, the data was still concatenated by time, but the other variables such as base_time, lat, etc. are unchanged and use the first file's data.
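
The difference between the two approaches can also be reproduced without any ARM files. The sketch below builds two small in-memory datasets and joins them with xr.concat, which accepts the same data_vars, coords, compat and join options. The datasets, their values and the make_day helper are made up for illustration, and xr.concat is used here as a stand-in for the multi-file reader rather than as what ACT calls internally.

```python
import numpy as np
import xarray as xr


def make_day(day, lat):
    """Build a tiny synthetic 'daily file': one time series and two scalars."""
    start = np.datetime64(day, "ns")
    return xr.Dataset(
        data_vars={
            "hemisp_broadband_raw": ("time", np.ones(3, dtype="float32")),
            "base_time": ((), start),
            "lat": ((), np.float32(lat)),
        },
        coords={"time": start + np.arange(3) * np.timedelta64(20, "s")},
    )


days = [make_day("2020-01-01", 36.6), make_day("2020-01-02", 36.6)]

# Option 1: expand every variable along time.
expanded = xr.concat(
    days, dim="time", data_vars="all", coords="minimal", join="outer"
)

# Option 2: concatenate only variables that already have time;
# variables without time are taken from the first dataset.
minimal = xr.concat(
    days,
    dim="time",
    data_vars="minimal",
    coords="minimal",
    compat="override",
    join="outer",
)

print(expanded["base_time"].dims, expanded["lat"].dims)
print(minimal["base_time"].dims, minimal["base_time"].values)
```

Here expanded["base_time"] gains the time dimension, while minimal["base_time"] stays scalar and keeps the first dataset's value, mirroring ds_all and ds_min above.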

Summary#

When using act.io.read_arm_netcdf, which uses xarray's open_mfdataset, we need to adjust the parameters when reading multiple files that contain multiple dimensions. By doing so, the user can choose either to concatenate only the variables with a time dimension, or to expand the non-time variables along time so that those variables are also concatenated between files.

For more on xarray's open_mfdataset, see: https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html