nsidc/granule-metgen

MetGenC

The MetGenC toolkit enables Operations staff and data producers to create metadata files conforming to NASA's Common Metadata Repository UMM-G specification and ingest data directly to NASA EOSDIS’s Cumulus archive. Cumulus is an open source cloud-based data ingest, archive, distribution, and management framework developed for NASA's Earth Science data.

Level of Support

This repository is fully supported by NSIDC. If you discover any problems or bugs, please submit an Issue. If you would like to contribute to this repository, you may fork the repository and submit a pull request.

See the LICENSE for details on permissions and warranties. Please contact [email protected] for more information.

Accessing the OPS MetGenC VM and Tips and Assumptions

  • from nusnow: $ vssh production metgenc

  • the one swell foop command line to kick off everything you need to run MetGenC:

    uat
    cd metgenc;source .venv/bin/activate;source metgenc-env.sh cumulus-uat
    
    prod
    cd metgenc;source .venv/bin/activate;source metgenc-env.sh cumulus-prod
    

BE AWARE: IF YOU'VE BEEN TESTING INGEST TO CUAT, WHEN YOU'RE READY TO INGEST TO CPRD, MAKE SURE TO RUN source metgenc-env.sh cumulus-prod. MetGenC will happily let you use the -e prod option, but you need to have the right credentials sourced!! If the creds aren't pointing to the right environment, MetGenC will return:

* The kinesis stream does not exist.
* The staging bucket does not exist.

Commands within the above one-liner detailed:

  • cd into the metgenc directory and activate the venv:

      $ cd metgenc
      $ source .venv/bin/activate
    
  • Before you run end-to-end ingest, be sure to source the AWS credentials:

      $ source metgenc-env.sh cumulus-<uat or prod>
    

Available profiles are cumulus-uat and cumulus-prod.

If you think you've already run it but can't remember, run the following:

        $ aws configure list

The output will either indicate that you need to source your credentials by returning:

Name                    Value             Type    Location
----                    -----             ----    --------
profile             <not set>             None    None
access_key          <not set>             None    None
secret_key          <not set>             None    None
region              <not set>             None    None

Or it'll show that you're all set (AWS comms-wise) for ingesting to Cumulus by returning the following:

Name                         Value             Type    Location
----                         -----             ----    --------
profile                 cumulus-<uat or prod>   env    ['AWS_DEFAULT_PROFILE', 'AWS_PROFILE']
access_key     ****************SQXY             env
secret_key     ****************cJ+5             env
region                    us-west-2     config-file    ~/.aws/config
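
The same check can be scripted before kicking off an ingest. This is a minimal sketch, assuming (as the sample output above suggests) that sourcing metgenc-env.sh exposes the profile through the AWS_PROFILE or AWS_DEFAULT_PROFILE environment variables. The function names are hypothetical and are not part of MetGenC.

```python
import os


def sourced_profile(environ=None):
    """Return the AWS profile named in the environment, or None if unset."""
    env = os.environ if environ is None else environ
    return env.get("AWS_PROFILE") or env.get("AWS_DEFAULT_PROFILE")


def profile_matches(target_env, environ=None):
    """True when the sourced profile is cumulus-<target_env> (uat or prod)."""
    return sourced_profile(environ) == f"cumulus-{target_env}"


if __name__ == "__main__":
    profile = sourced_profile()
    if profile is None:
        print("No AWS profile sourced; run: source metgenc-env.sh cumulus-<uat or prod>")
    else:
        print(f"Sourced profile: {profile}")
```

Calling profile_matches("prod") just before running with -e prod is one way to catch a session that is still pointing at cumulus-uat.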

Assumptions for netCDF files for MetGenC

  • NetCDF files have an extension of .nc (per CF conventions).
  • Projected spatial information is available in coordinate variables having a standard_name attribute value of projection_x_coordinate or projection_y_coordinate.
  • (y[0],x[0]) represents the upper left corner of the spatial coverage.
  • Spatial coordinate values represent the center of the area covered by a measurement.
  • Only one coordinate system is used by all data variables in all science files (i.e. only one grid mapping variable is present in a file, and the content of that variable is the same in every science file).

MetGenC .ini File Assumptions

  • A pixel_size attribute is needed in a data set's .ini file when gridded science files don't include a GeoTransform attribute in the grid mapping variable. The value specified should be just a number with no units (m, km), since the units are assumed to be the same as those of the spatial coordinate variables in the data set's science files.
    • e.g., pixel_size = 25
  • Date/time strings can be parsed using datetime.fromisoformat
  • The checksum_type must be SHA256
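
To check whether a date/time string meets the datetime.fromisoformat assumption, it can be tried directly in Python. The sample strings below are illustrative only, and is_parseable is a hypothetical helper, not a MetGenC function.

```python
from datetime import datetime

# Both forms are accepted by datetime.fromisoformat on Python 3.7 and later.
start = datetime.fromisoformat("2021-11-01T00:00:00+00:00")
end = datetime.fromisoformat("2021-11-01 23:59:59.999")


def is_parseable(value):
    """Report whether datetime.fromisoformat accepts the given string."""
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


print(start, end, is_parseable("11/01/2021"))
```

Note that a trailing "Z" (for example 2021-11-01T00:00:00Z) is only accepted by datetime.fromisoformat from Python 3.11 onward, so the outcome for such strings depends on the interpreter version.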

NetCDF Attributes MetGenC Relies upon to Generate UMM-G json Files

CF Conventions and NSIDC Guidelines (=NSIDC Guidelines for netCDF Attributes) are the driving forces behind emphatically suggesting data producers include the Attributes used by MetGenC in their netCDF files.

  • Required = required
  • RequiredC = conditionally required
  • R+ = highly or strongly recommended
  • R = recommended
  • S = suggested
Attribute used by MetGenC (location in netCDF file) CF Conventions NSIDC Guidelines Notes
time_coverage_start (global) R 1, OC, P
time_coverage_end (global) R 1, OC, P
grid_mapping_name (variable) RequiredC R+ 2
crs_wkt (variable with grid_mapping_name attribute) R 3
GeoTransform (variable with grid_mapping_name attribute) R 4, OC
geospatial_lon_min (global) R
geospatial_lon_max (global) R
geospatial_lat_min (global) R
geospatial_lat_max (global) R
geospatial_bounds (global) R 7, OC
geospatial_bounds_crs (global) ? 8
standard_name, projection_x_coordinate (variable) RequiredC
standard_name, projection_y_coordinate (variable) RequiredC

Notes column key:

OC = Optional configuration attributes (or elements of them) that may be represented in an .ini file in order to allow "nearly" compliant netCDF files to be run with MetGenC without premet/spatial files. See Required and Optional Configuration Elements

P = Premet file attributes that may be specified in a premet file; when used, a premet_dirpath must be defined in the .ini file.

1 = Used to populate the time begin and end UMM-G values; OC .ini attribute for time_coverage_start is time_start_regex = <value>, and for time_coverage_end the .ini attribute is time_coverage_duration = <value>.

2 = A grid mapping variable is required if the horizontal spatial coordinates are not longitude and latitude and the intent of the data provider is to geolocate the data. grid_mapping and grid_mapping_name allow programmatic identification of the variable holding information about the horizontal coordinate reference system.

3 = The crs_wkt ("coordinate reference system well known text") value is handed to the CRS and Transformer modules in pyproj to conveniently deal with the reprojection of (y,x) values to EPSG 4326 (lon, lat) values.

4 = The GeoTransform value provides the pixel size per data value, which is then used to calculate the padding added to x and y values to create a GPolygon enclosing all of the data; OC .ini attribute is pixel_size = <value>.

5 = The values of the coordinate variable identified by the standard_name attribute with a value of projection_x_coordinate are reprojected and thinned to create a GPolygon, bounding rectangle, etc.

6 = The values of the coordinate variable identified by the standard_name attribute with a value of projection_y_coordinate are reprojected and thinned to create a GPolygon, bounding rectangle, etc.

7 = The geospatial_bounds netCDF file global attribute contains spatial boundary information as a WKT POLYGON string. When present and prefer_geospatial_bounds = true is set in the .ini file, MetGenC will use this attribute instead of spatial coordinate values to generate spatial representations of granules in collections with a GEODETIC granule spatial representation. If the geospatial_bounds_crs attribute is also present in netCDF files, coordinates will be transformed to EPSG:4326 if needed. The corresponding .ini parameter is prefer_geospatial_bounds = true/false.

8 = The geospatial_bounds_crs netCDF file global attribute specifies the coordinate reference system for the coordinates in the geospatial_bounds global attribute. It can be an EPSG identifier (e.g., "EPSG:4326") or other CRS format. When present, MetGenC will transform geospatial_bounds coordinates to EPSG:4326 if needed. If prefer_geospatial_bounds is true and no geospatial_bounds_crs attribute exists, the coordinates in the geospatial_bounds attribute are assumed to represent points in EPSG:4326.
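
Note 4 and the cell-center assumption can be illustrated with a small sketch. It assumes the GeoTransform attribute is a GDAL-style string of six space-separated numbers (x origin, pixel width, row rotation, y origin, column rotation, pixel height) and that the padding is half a pixel on each side. Both are assumptions made for illustration; MetGenC's actual padding calculation is not shown in this README, and the function names are hypothetical.

```python
def pixel_size_from_geotransform(geotransform):
    """Read the pixel width (second value) from a GDAL-style GeoTransform string."""
    values = [float(v) for v in geotransform.split()]
    return abs(values[1])


def padded_extent(x_centers, y_centers, pixel_size):
    """Grow the extent of the cell centers by half a pixel on every side."""
    half = pixel_size / 2.0
    return (
        min(x_centers) - half,
        min(y_centers) - half,
        max(x_centers) + half,
        max(y_centers) + half,
    )


# Illustrative values: a 2 x 2 grid of 25000-unit cells.
size = pixel_size_from_geotransform("-3850000 25000 0 5850000 0 -25000")
print(padded_extent([12500.0, 37500.0], [37500.0, 12500.0], size))
```

When a file has no GeoTransform attribute, the pixel_size value from the .ini file would take the place of the value parsed here.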

How to query a netCDF file for presence of MetGenC-Required Attributes

On V0 wherever the data are staged (/disks/restricted_ftp or /disks/sidads_staging, etc.) you can run ncdump to check whether a netCDF representative of the collection's files contains the MetGenC-required attributes. When not reported, that attribute will have to be accommodated by its associated .ini attribute being added to the .ini file. See Required and Optional Configuration Elements for full details/descriptions of these.

ncdump -h <file name.nc> | grep -e time_coverage_start -e time_coverage_end -e GeoTransform -e crs_wkt -e spatial_ref -e grid_mapping_name -e geospatial_bounds -e geospatial_bounds_crs -e 'standard_name = "projection_y_coordinate"' -e 'standard_name = "projection_x_coordinate"'

Geometry Logic

The geometry behind the granule-level spatial representation (point, gpolygon, or bounding rectangle) required for a data set can be implemented by MetGenC via either: file-level metadata (such as a CF/NSIDC Compliant netCDF file), .spatial / .spo files, or its collection-level spatial representation.

When MetGenC is run with netCDF files that are both CF and NSIDC Compliant (for those requirements, refer to the table in NetCDF Attributes MetGenC Relies upon to Generate UMM-G json Files), information from within the file's metadata will be used to generate an appropriate gpolygon or bounding rectangle for each granule.

In some cases, non-netCDF files and/or netCDF files that are non-CF or non-NSIDC compliant will require an operator to define or modify data set details expressed through attributes in an .ini file. In other cases, an operator will need to further modify the .ini file to specify paths to where premet and spatial files are stored for MetGenC to use as input files.

For granules suited to using the spatial extent defined for its collection, a collection_geometry_override = True attribute/value pair can be added to the .ini file (as long as it's a single bounding rectangle, and not two or more bounding rectangles). Setting collection_geometry_override = False in the .ini file will make MetGenC look to the science files or premet/spatial files for the granule-level spatial representation geometry to use.

Geometry Rules

| Granule Spatial Representation Geometry | Granule Spatial Representation Coordinate System (GSRCS) |
| --- | --- |
| GPolygon (GPoly) | Geodetic |
| Bounding Rectangle (BR) | Cartesian |
| Points | Geodetic |

Geometry Logic and Expectations Table

.spo = .spo file associated with each granule, used to directly define the vertices of a gPoly.
.spatial = .spatial file associated with each granule to define either: BR, Point, or the data footprint (i.e., the .spatial simply contains a listing of all coordinates parsed from the science file) for which MetGenC is to generate a detailed, encompassing GPoly.
source num points GSRCS error? expected output comments
.spo any cartesian yes .spo inherently defines GPoly vertices; GPolys cannot be cartesian.
.spo <= 2 geodetic yes At least three points are required to define a GPoly.
.spo > 2 geodetic no GPoly as described by .spo file contents.
.spatial 1 cartesian yes NSIDC data curators always associate a GEODETIC granule spatial representation with point data.
.spatial 1 geodetic no Point as defined by spatial file.
.spatial 2 cartesian no BR as defined by spatial file.
.spatial >= 2 geodetic no GPoly(s) calculated to enclose all points. If spatial_polygon_enabled=true (default) and ≥3 points, uses optimized polygon generation with target coverage and vertex limits.
.spatial > 2 cartesian yes There is no cartesian-associated geometry for GPolys.
science file (NSIDC/CF-compliant netCDF) NA cartesian no BR min/max lon/lat points for BR expected to be included in global attributes.
science file (NSIDC/CF-compliant) 1 or > 2 geodetic no Error if only two points. GPoly calculated from grid perimeter.
science file, non-NSIDC/CF-compliant netCDF or other format NA either no As specified by .ini file. Configuration file must include a spatial_dir value (a path to the directory with valid .spatial or .spo files), or collection_geometry_override = True entry (which must be defined as a single point or a single bounding rectangle).
collection spatial metadata geometry = cartesian with one BR NA cartesian no BR as described in collection metadata.
collection spatial metadata geometry = cartesian with one BR NA geodetic yes Collection geometry and GSRCS must both be cartesian.
collection spatial metadata geometry = cartesian with two or more BR NA cartesian yes Two-part bounding rectangle is not a valid granule-level geometry.
collection spatial metadata geometry specifying one or more points NA NA Not a known use case
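
The .spo and .spatial rows of the table can be read as a small decision function. The sketch below only encodes those rows as written above; it is a reading aid, not MetGenC's implementation, and expected_geometry is a hypothetical name.

```python
def expected_geometry(source, num_points, gsrcs):
    """Return the expected geometry, or raise ValueError for an error row."""
    gsrcs = gsrcs.lower()
    if num_points < 1:
        raise ValueError("At least one point is required")
    if source == ".spo":
        if gsrcs == "cartesian":
            raise ValueError("GPolys cannot be cartesian")
        if num_points <= 2:
            raise ValueError("At least three points are required to define a GPoly")
        return "GPoly"
    if source == ".spatial":
        if num_points == 1:
            if gsrcs == "cartesian":
                raise ValueError("Point data requires a geodetic representation")
            return "Point"
        if gsrcs == "geodetic":
            return "GPoly"  # two or more geodetic points
        if num_points == 2:
            return "BR"  # exactly two cartesian points
        raise ValueError("There is no cartesian-associated geometry for GPolys")
    raise ValueError(f"Unhandled source: {source}")


print(expected_geometry(".spatial", 2, "cartesian"))
```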

Running MetGenC: Its Commands In-depth

help

Show MetGenC's help text:

    $ metgenc --help
    Usage: metgenc [OPTIONS] COMMAND [ARGS]...

      The metgenc utility allows users to create granule-level metadata, stage
      granule files and their associated metadata to Cumulus, and post CNM
      messages.

    Options:
      --help  Show this message and exit.

    Commands:
      info     Summarizes the contents of a configuration file.
      init     Populates a configuration file based on user input.
      process  Processes science files based on configuration file...
      validate Validates the contents of local JSON files.
  • For detailed help on each command, run: metgenc <command name> --help:

      $ metgenc process --help
    

init

The init command can be used to generate a metgenc configuration (i.e., .ini) file for your data set, or edit an existing .ini file.

  • You don't need to run this command if you already have an .ini file that you prefer to copy and edit manually (any text editor will work) to apply to the collection you're ingesting.
  • If running metgenc init, the name of the new ini file you specify needs to include the .ini suffix.
metgenc init --help
Usage: metgenc init [OPTIONS]

  Populates a configuration file based on user input.

Options:
  -c, --config TEXT  Path to configuration file to create or replace
  --help             Show this message and exit

Example running init

$ metgenc init -c ./init/<name of config file to create or modify>.ini
INI RULES:
  • The .ini file's checksum_type = SHA256 should never be edited
  • The kinesis_stream_name and staging_bucket_name should never be edited
  • auth_id and version must accurately reflect the collection's authID and versionID
  • log_dir specifies the directory where metgenc log files will be written. Log files are named metgenc-{config-name}-{timestamp}.log where config-name is the base name of the .ini file and timestamp is in YYYYMMDD-HHMM format. The default log directory is /share/logs/metgenc, but this can be edited to write metgenc logs to a different existing, writable directory location.
  • provider is a free text attribute where, for now, the version of metgenc being run should be documented
    • running metgenc --version will return the current version
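
To predict which log file a run will write, the naming rule above can be reproduced in a few lines. This sketch assumes "base name" means the .ini file name without its directory or suffix; log_file_name is a hypothetical helper, not part of MetGenC.

```python
from datetime import datetime
from pathlib import Path


def log_file_name(config_path, now=None):
    """Build metgenc-{config-name}-{timestamp}.log with a YYYYMMDD-HHMM timestamp."""
    now = now or datetime.now()
    config_name = Path(config_path).stem  # assumed: file name without the .ini suffix
    return f"metgenc-{config_name}-{now:%Y%m%d-%H%M}.log"


print(log_file_name("./init/SNEX23_CSU_GPR.ini"))
```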

Required and Optional Configuration Elements

Some attribute values may be read from the .ini file when they are known for the data set but can't be gleaned from, or don't exist in, the science file(s). Use of these elements can be typical for data sets comprising non-CF/non-NSIDC-compliant netCDF science files, as well as non-netCDF data sets comprising .tif, .csv, .h5, etc. The element values must be manually added to the .ini file, as none are prompted for in the metgenc init functionality.

See this project's GitHub file, fixtures/test.ini for examples.

| .ini element | .ini section | Attribute absent from netCDF file the .ini attribute stands in for | Attribute populated in UMMG | Note |
| --- | --- | --- | --- | --- |
| time_start_regex | Collection | time_coverage_start | BeginningDateTime | 1 |
| time_coverage_duration | Collection | time_coverage_end | EndingDateTime | 2 |
| pixel_size | Collection | GeoTransform | n/a | 3 |

R = Required for all non-netCDF file types (e.g., csv, .tif, .h5, etc) and netCDF files missing the global attribute specified

  1. This regex attribute leverages a date contained in a netCDF file's name to populate the BeginningDateTime attribute of the TemporalExtent field in UMMG files. It must match using the named group (?P<time_coverage_start>).

    • This attribute is meant to be used with "nearly" compliant netCDF files, but not other file types (csv, tif, etc.) since these should rely on premet files containing temporal details for each file.
  2. The time_coverage_duration attribute value specifies the duration to be applied to the time_coverage_start value in order to generate EndingDateTime values in UMMG files. This value is a constant: the same duration is applied to every time_start_regex value gleaned from the files. The time_coverage_duration value must be a valid ISO duration value.

    • This attribute is meant to be used only with "nearly" compliant netCDF files--not any other file types since all other file types will rely on premet files to generate temporal details in output ummg metadata files. Example:
time_start_regex = IRTIT3_(?P<time_coverage_start>\d{8})_
time_coverage_duration = P0DT23H59M59S
  3. The pixel_size attribute is rarely applicable for science files that aren't gridded netCDF (.txt, .csv, .jpg, .tif, etc.); this value is a constant that will be applied to all granule-level metadata.
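
How the time_start_regex and time_coverage_duration example values combine can be sketched as follows. The file name is hypothetical, the date is assumed to be in YYYYMMDD form, and the small duration parser only handles the day/hour/minute/second designators used in the example. MetGenC's own parsing may differ.

```python
import re
from datetime import datetime, timedelta

TIME_START_REGEX = r"IRTIT3_(?P<time_coverage_start>\d{8})_"
TIME_COVERAGE_DURATION = "P0DT23H59M59S"

# Simplified ISO 8601 duration pattern: days plus an optional time part.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text):
    """Convert a simple ISO 8601 duration such as P0DT23H59M59S to a timedelta."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError(f"Unsupported duration: {text}")
    parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    return timedelta(**parts)


def temporal_extent(file_name):
    """Return (begin, end) derived from the file name and the constant duration."""
    match = re.search(TIME_START_REGEX, file_name)
    if match is None:
        raise ValueError(f"No start date found in {file_name}")
    begin = datetime.strptime(match.group("time_coverage_start"), "%Y%m%d")
    return begin, begin + parse_duration(TIME_COVERAGE_DURATION)


# Hypothetical file name, used only to exercise the pattern.
print(temporal_extent("IRTIT3_20140504_example.nc"))
```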

Granule and Browse regex

| .ini element | .ini section | Note |
| --- | --- | --- |
| browse_regex | Collection | 1 |
| granule_regex | Collection | 2 |
| reference_file_regex | Collection | 3 |

Note column:

  1. The file name pattern identifying the browse file(s) accompanying single or multi-file granules. Granules with multiple associated browse files work fine with MetGenC! The default is _brws, change it to reflect the browse file names of the data delivered. This element is prompted for when running metgenc init.
  2. The file name pattern used for multi-file granules to group files together as a granule, based on the elements common amongst their names.
    • This must result in a globally unique product/name (in CNM) and Identifier (as the IdentifierType: ProducerGranuleId in UMM-G) generated for each granule. This .ini element value must be added manually as it's not included in the metgenc init prompts.
  3. The file name pattern identifying a single file for metgenc to reference as the primary file in a multi-file granule. This must be specified whenever working with multi-file granules. This element is prompted for when running metgenc init.
    • In the case of multi-file granules containing a CF-compliant netCDF science file and other supporting files like .tif, or .txt files, etc., specifying the netCDF will allow MetGenC to parse it as it would any other CF-compliant netCDF file, making it so operators don't need to supply premet/spatial files.
INI File Example 1: Use of granule_regex for multi-file granules with no browse

Given the Config file Source and Collection contents:

[Source]
data_dir = data/IPFLT1B_DUCk
premet_dir = premet/ipflt1b
spatial_dir = spatial/ipflt1b

[Collection]
auth_id = IPFLT1B_DUCk
version = 1
provider = OIB; metgenc version 1.10.0rc0
granule_regex = (IPFLT1B_)(?P<granuleid>.+?(?=_)_)?(DUCk)
reference_file_regex = _DUCk.kml

And a multi-file granule comprising the following files:

IPFLT1B_20101226_085033_DUCk.dbf
IPFLT1B_20101226_085033_DUCk.kml
IPFLT1B_20101226_085033_DUCk.shp
IPFLT1B_20101226_085033_DUCk.shx
IPFLT1B_20101226_085033_DUCk.txt

The granule_regex sections:

  • (IPFLT1B_) and (DUCk) identify the 1st and 3rd (the last) Capture Groups, which parse the constants to be included in each granule name: authID and DUCk.

  • The Named Capture Group granuleid (?P<granuleid>.+?(?=_)_)? matches the unique date and time contained in each file name to be included in each granule name, e.g., 20101226_085033_.

  • Thus, IPFLT1B_ and DUCk are combined with the granuleid capture group element to become the producerGranuleId reflected for each granule in EDSC's Granules listing. This will globally, uniquely identify all granules associated with a given collection from any other files in other collections in CUAT or CPROD. In this case that's IPFLT1B_20101226_085033_DUCk. This is reflected in the CNM as the product/name value, and the UMMG as the Identifier value. Note: Ideally there would also be a version ID in this file name, but version wasn't assigned in most IceBridge collection granule names.
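
The behavior of this granule_regex can be checked outside MetGenC with Python's re module. The sketch assumes the granule name is formed by concatenating the captured groups in order, as described above; granule_name is a hypothetical helper, not a MetGenC function.

```python
import re

GRANULE_REGEX = r"(IPFLT1B_)(?P<granuleid>.+?(?=_)_)?(DUCk)"

FILES = [
    "IPFLT1B_20101226_085033_DUCk.dbf",
    "IPFLT1B_20101226_085033_DUCk.kml",
    "IPFLT1B_20101226_085033_DUCk.shp",
    "IPFLT1B_20101226_085033_DUCk.shx",
    "IPFLT1B_20101226_085033_DUCk.txt",
]


def granule_name(file_name):
    """Join the captured groups in order; return None when nothing matches."""
    match = re.search(GRANULE_REGEX, file_name)
    if match is None:
        return None
    return "".join(group for group in match.groups() if group)


# All five files should collapse to a single granule name.
names = {granule_name(name) for name in FILES}
print(names)
```

Running a candidate pattern against a directory listing this way is a quick test that every file in a multi-file granule resolves to the same name before an ingest is attempted.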

INI File Example 2: Single-file granule with good file names and no browse: omit browse_regex and granule_regex

This .ini file's [Source] and [Collection] contents apply to a single-file granule with no browse images:

[Source]
data_dir = /disks/sidads_staging/SNOWEX_metgen/SNEX23_CSU_GPR_metgen/data
premet_dir = /disks/sidads_staging/SNOWEX_metgen/SNEX23_CSU_GPR_metgen/premet
spatial_dir = /disks/sidads_staging/SNOWEX_metgen/SNEX23_CSU_GPR_metgen/spatial

[Collection]
auth_id = SNEX23_CSU_GPR
version = 1
provider = SnowEx

No regexes are necessary since the file name will simply become the granule name.
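
An .ini file like this one can be inspected with Python's standard configparser module. This is a sketch under the assumption that MetGenC .ini files are plain INI files readable this way; load_settings is a hypothetical helper. Interpolation is disabled so that any % characters in regex values are left untouched.

```python
from configparser import ConfigParser

INI_TEXT = """
[Source]
data_dir = /disks/sidads_staging/SNOWEX_metgen/SNEX23_CSU_GPR_metgen/data
premet_dir = /disks/sidads_staging/SNOWEX_metgen/SNEX23_CSU_GPR_metgen/premet
spatial_dir = /disks/sidads_staging/SNOWEX_metgen/SNEX23_CSU_GPR_metgen/spatial

[Collection]
auth_id = SNEX23_CSU_GPR
version = 1
provider = SnowEx
"""


def load_settings(text):
    """Read a few [Source] and [Collection] elements from .ini text."""
    parser = ConfigParser(interpolation=None)
    parser.read_string(text)
    collection = parser["Collection"]
    return {
        "data_dir": parser["Source"]["data_dir"],
        "auth_id": collection["auth_id"],
        "version": collection["version"],
        # Omitted elements come back as None in this sketch.
        "browse_regex": collection.get("browse_regex"),
        "granule_regex": collection.get("granule_regex"),
    }


settings = load_settings(INI_TEXT)
print(settings)
```

Elements left out of the file are reported as None here; MetGenC applies its own defaults for them (for example, _brws for browse_regex).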

INI File Example 3: Single-file granule with good file names and browse images: omit granule_regex

This .ini file's [Source] and [Collection] contents work for single-file granules with browse images:

[Source]
data_dir = ./data/0081

[Collection]
auth_id = NSIDC-0081
version = 2
provider = DPT
browse_regex = _F\d{2}

And two granules + their associated browse files and good granule names:

NSIDC0081_SEAICE_PS_N25km_20211101_v2.0.nc
NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F16.png
NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F17.png
NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F18.png
NSIDC0081_SEAICE_PS_S25km_20211102_v2.0.nc
NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_F16.png
NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_F17.png
NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_F18.png

Only the browse_regex needs to be set to capture that which distinguishes the browse from the science files. In this case that's the presence of _F\d{2}, which matches _F16, _F17, and _F18.
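
A quick way to confirm that a browse_regex separates browse files from science files is to run it over the file names. The sketch assumes a file counts as browse when the pattern is found anywhere in its name; split_browse is a hypothetical helper, not a MetGenC function.

```python
import re

BROWSE_REGEX = r"_F\d{2}"

FILES = [
    "NSIDC0081_SEAICE_PS_N25km_20211101_v2.0.nc",
    "NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F16.png",
    "NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F17.png",
    "NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F18.png",
]


def split_browse(file_names):
    """Return (science, browse) lists based on whether BROWSE_REGEX is found."""
    browse = [name for name in file_names if re.search(BROWSE_REGEX, name)]
    science = [name for name in file_names if not re.search(BROWSE_REGEX, name)]
    return science, browse


science, browse = split_browse(FILES)
print(science)
print(browse)
```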

INI File Example 4: Use of granule_regex and browse_regex for single-file granules with interrupted file names

Given the .ini file's [Source] and [Collection] contents:

[Source]
data_dir = ./data/0081DUCk

[Collection]
auth_id = NSIDC-0081DUCk
version = 2
provider = DPT
browse_regex = _brws
granule_regex = (NSIDC0081_SEAICE_PS_)(?P<granuleid>[NS]{1}\d{2}km_\d{8})(_v2.0_)(?:F\d{2}_)?(DUCk)

And two granules + their associated browse files:

NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_DUCk.nc
NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F16_DUCk_brws.png
NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F17_DUCk_brws.png
NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F18_DUCk_brws.png
NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_DUCk.nc
NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_F16_DUCk_brws.png
NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_F17_DUCk_brws.png
NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_F18_DUCk_brws.png

The browse_regex: This simply identifies the part of the browse file name that distinguishes it as the browse from the science file, in this example: browse_regex = _brws.

The granule_regex sections: In the case where a file name element interrupts what would be a string common to both the science and browse file names, a granule_regex is required to identify the granule name.

  • (NSIDC0081_SEAICE_PS_), (_v2.0_), and (DUCk) identify the 1st, 3rd, and 4th (the last) Capture Groups. These are constants required to be present in each granule's name: authID, version ID, and DUCk (the latter was only relevant for early CUAT testing). These are combined with the following...

  • The Named Capture Group granuleid (?P<granuleid>[NS]{1}\d{2}km_\d{8}) matches the region, resolution, and date elements unique to each file name (e.g., N25km_20211101 and S25km_20211102), which are combined with the elements in the bullet above to form unique granule names.

  • (?:F\d{2}_)? matches the F16_, F17_, and F18_ strings in the browse file names as a Non-capture Group; these elements will be matched but won't be included in granule names.

  • In summary: NSIDC0081_SEAICE_PS_, _v2.0_, and DUCk are combined with the granuleid capture group element, (?P<granuleid>[NS]{1}\d{2}km_\d{8}), to form the producerGranuleId reflected for each granule, e.g., NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_DUCk.nc and NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_DUCk.nc. These are the names that will be shown for the granules in EDSC. They globally, uniquely distinguish granules in a specific collection from any other granules in any other collections in CUAT or CPROD. These names are found in the CNM as the product/name value, and the UMMG metadata file as the Identifier value.

    • If the granule_regex was omitted from the .ini file in this case, the cnm output would only define data and metadata files for ingest; the browse images would not be included!
    • Since metgenc validate doesn't check attribute values, no validation errors are thrown when this happens.
    • This hopefully is largely an example portraying a made-up edge case due to the way I'd added the _DUCk identifier to these files for early MetGenC testing!! But be aware of this if you find yourself dealing with complicated file names where the elements meant to comprise the granule id are interrupted by other elements.
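
To see how the two patterns in this example work together, the file names can be grouped by the name the granule_regex produces, with browse files flagged by the browse_regex. This is a sketch that assumes the granule key is the concatenation of the captured groups (the non-capture group is matched but dropped); group_granules is a hypothetical helper and does not reflect MetGenC's internals.

```python
import re
from collections import defaultdict

BROWSE_REGEX = r"_brws"
GRANULE_REGEX = (
    r"(NSIDC0081_SEAICE_PS_)(?P<granuleid>[NS]{1}\d{2}km_\d{8})"
    r"(_v2.0_)(?:F\d{2}_)?(DUCk)"
)

FILES = [
    "NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_DUCk.nc",
    "NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F16_DUCk_brws.png",
    "NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F17_DUCk_brws.png",
    "NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F18_DUCk_brws.png",
    "NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_DUCk.nc",
    "NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_F16_DUCk_brws.png",
    "NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_F17_DUCk_brws.png",
    "NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_F18_DUCk_brws.png",
]


def group_granules(file_names):
    """Group files under the name built from the captured groups."""
    granules = defaultdict(lambda: {"science": [], "browse": []})
    for name in file_names:
        match = re.search(GRANULE_REGEX, name)
        if match is None:
            continue  # file does not belong to any granule
        key = "".join(match.groups())
        kind = "browse" if re.search(BROWSE_REGEX, name) else "science"
        granules[key][kind].append(name)
    return dict(granules)


grouped = group_granules(FILES)
for key, members in grouped.items():
    print(key, len(members["science"]), len(members["browse"]))
```

Each granule should end up with one science file and three browse files; a browse count of zero is the symptom described in the bullets above.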

Using Premet and Spatial files

When necessary, the following two .ini elements can be added to define the paths to directories containing premet and spatial files—they must be two separate directories, and separate from the data directory. The user will be prompted for these values when running metgenc init.

| .ini element | .ini section |
| --- | --- |
| premet_dir | Source |
| spatial_dir | Source |

  • The spatial_dir defines the path to the directory containing either .spatial or .spo files.
  • The composition of .spatial/.spo and .premet files and their naming convention is to remain exactly as it is/has been for their use with SIPSMetgen, and is described here: https://nsidc.org/sites/default/files/documents/other/guidelines-preliminary-metadata-creation-and-data-product-delivery.pdf
    • This was done to avoid changing existing ops and/or data producer workflows/scripts.
  • Reminder for premets: there should be a compelling reason (i.e., preserving continuity of an existing collection) from the pub team in order to include more attributes than just begin/end date/time. Most, if not all, new data sets requiring premets should see them include only begin/end date/time.

Setting Collection Spatial Extent as Granule Spatial Extent

In cases of data sets where granule spatial information is not available by interrogating the data or via spatial or .spo files, the operator may set a flag to force the metadata representing each granule's spatial extents to be set to that of the collection. The user will be prompted for the collection_geometry_override value when running metgenc init. The default value is False; setting it to True signals MetGenC to use the collection's spatial extent for each granule.

.ini element .ini section
collection_geometry_override Source

Setting Collection Temporal Extent as Granule Temporal Extent

RARELY APPLICABLE (if ever)!! An operator may set an .ini flag to indicate that the collection's temporal extent should be used as every granule's TemporalExtent (SingleDateTime or BeginningDateTime and EndingDateTime) in the granule-level UMMG json. In other words, every granule in a collection would display the same start and end times in EDSC. For most collections, this is likely an ill-advised use case. The operator will be prompted for a collection_temporal_override value when running metgenc init. The default value is False and should likely always be accepted; setting it to True is what would signal MetGenC to set each granule to the collection's TemporalExtent.

.ini element .ini section
collection_temporal_override Source

Spatial Polygon Generation

MetGenC includes optimized polygon generation capabilities for creating spatial coverage polygons from point data, particularly useful for LIDAR flightline data.

When a granule has an associated .spatial file containing geodetic point data (≥3 points), MetGenC will automatically generate an optimized polygon to enclose the data points instead of using the basic point-to-point polygon method. This results in more accurate spatial coverage with fewer vertices.

This feature is optional and is enabled by default in MetGenC.

  • To disable it entirely, edit the .ini file, add a [Spatial] section if necessary, and add the line spatial_polygon_enabled = false. CURRENTLY RECOMMENDED TO SET spatial_polygon_enabled = false WHENEVER .SPO FILES ARE USED.
  • When spatial_polygon_enabled = true (either by default or when set as such in the .ini file) the other parameters listed below can be added to and edited in the .ini file. For the most part, the values shouldn't need to be altered! However, if ingest fails due to GPolygonSpatial errors, the first attribute to add to or edit in the .ini file should be spatial_polygon_cartesian_tolerance by decreasing its coordinate precision (e.g., .0001 => .01) which will increase the distance between gpolygon vertices, expanding the spatial extent.

Configuration Parameters:

| .ini section | .ini element | Type | Default | Description |
| --- | --- | --- | --- | --- |
| Spatial | spatial_polygon_enabled | boolean | true | Enable/disable polygon generation for .spatial files |
| Spatial | spatial_polygon_target_coverage | float | 0.98 | Target data coverage percentage (0.80-1.0) |
| Spatial | spatial_polygon_max_vertices | integer | 100 | Maximum vertices in generated polygon (10-1000) |
| Spatial | spatial_polygon_cartesian_tolerance | float | 0.0001 | Minimum distance between polygon points in degrees (0.00001-0.01) |

Example Spatial Polygon Generation Configuration

Example showing content added to an .ini file, having edited the CMR default vertex tolerance (distance between two vertices) to decrease the precision of the GPoly coordinate pairs listed in the UMMG json files MetGenC generates:

[Spatial]
spatial_polygon_enabled = true
spatial_polygon_target_coverage = 0.98
spatial_polygon_max_vertices = 100
spatial_polygon_cartesian_tolerance = .01

Example showing the key pair added to an .ini file to disable spatial polygon generation:

[Spatial]
spatial_polygon_enabled = false

When Polygon Generation is Applied:

  • ✅ Granule has a .spatial file with ≥3 geodetic points
  • ✅ spatial_polygon_enabled = true (default)
  • ✅ Granule spatial representation is GEODETIC

When Original Behavior is Used:

  • ❌ No .spatial file present (data from other sources)
  • ❌ spatial_polygon_enabled = false
  • ❌ Granule spatial representation is CARTESIAN
  • ❌ Insufficient points (<3) for polygon generation
  • ❌ Polygon generation fails (automatic fallback)

Tolerance Requirements: The spatial_polygon_cartesian_tolerance parameter ensures that generated polygons meet NASA CMR validation requirements. The CMR system requires that each point in a polygon must have a unique spatial location - if two points are closer than the tolerance threshold in both latitude and longitude, they are considered the same point and the polygon becomes invalid. MetGenC automatically filters points during polygon generation to ensure this requirement is met.
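
The uniqueness rule described above can be sketched as a simple filter: a point is dropped when it lies within the tolerance of an already kept point in both longitude and latitude. This is an illustration of the rule only, with a hypothetical function name; it is not MetGenC's filtering code, and it ignores the repeated closing vertex of a polygon ring.

```python
def filter_close_points(points, tolerance=0.0001):
    """Keep (lon, lat) points that are not within tolerance of a kept point on both axes."""
    kept = []
    for lon, lat in points:
        duplicate = any(
            abs(lon - kept_lon) < tolerance and abs(lat - kept_lat) < tolerance
            for kept_lon, kept_lat in kept
        )
        if not duplicate:
            kept.append((lon, lat))
    return kept


points = [(0.0, 0.0), (0.00005, 0.00005), (0.0, 1.0), (1.0, 1.0)]
print(filter_close_points(points))          # default tolerance of 0.0001
print(filter_close_points(points, 0.00001)) # tighter tolerance keeps more points
```

Raising the tolerance (for example from 0.0001 to 0.01) merges more near-duplicate points, which is why it is the first setting to adjust when ingest fails on GPolygonSpatial errors.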

This enhancement is backward compatible - existing workflows continue unchanged, and polygon generation only activates for appropriate .spatial file scenarios.

Geospatial Bounds Configuration

MetGenC can extract polygon vertices directly from the geospatial_bounds netCDF attribute when it contains a WKT POLYGON string. This extracts all polygon vertices as individual points, providing an alternative to the default of using spatial coordinate values to generate a polygon. If no geospatial_bounds_crs attribute exists, the geospatial_bounds value is assumed to represent points in EPSG:4326.

Example Configuration:

[Spatial]
prefer_geospatial_bounds = true

When Geospatial Bounds Extraction is Applied:

  • ✅ Granule spatial representation is GEODETIC
  • ✅ prefer_geospatial_bounds = true in .ini file
  • ✅ NetCDF file contains valid geospatial_bounds global attribute with WKT POLYGON
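
For readers unfamiliar with WKT, the sketch below shows how the vertices of a WKT POLYGON string can be pulled out with the Python standard library alone. It is illustrative only and is not how MetGenC is known to parse the attribute: the function name is hypothetical, only the outer ring is read, and coordinate pairs are returned in the order they appear, so interpreting axis order for the CRS is left to the caller.

```python
import re

# Matches POLYGON ((outer ring), (optional hole), ...) and captures the outer ring.
_POLYGON_RE = re.compile(
    r"\s*POLYGON\s*\(\s*\(([^()]*)\)\s*(?:,\s*\([^()]*\)\s*)*\)\s*",
    flags=re.IGNORECASE,
)


def polygon_vertices_from_wkt(wkt: str):
    """Return the outer-ring vertices of a WKT POLYGON as (x, y) float tuples.

    Hypothetical helper: raises ValueError for anything that is not a
    WKT POLYGON with at least one ring. Any third (z) value is ignored.
    """
    match = _POLYGON_RE.fullmatch(wkt)
    if match is None:
        raise ValueError("not a WKT POLYGON string")
    vertices = []
    for pair in match.group(1).split(","):
        x, y = pair.split()[:2]
        vertices.append((float(x), float(y)))
    return vertices


bounds = "POLYGON ((40.26 -111.29, 41.26 -111.29, 41.26 -110.29, 40.26 -110.29, 40.26 -111.29))"
for vertex in polygon_vertices_from_wkt(bounds):
    print(vertex)
```

In a real workflow the string would come from the netCDF file's geospatial_bounds global attribute rather than a literal.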

info

The info command can be used to display the information within the configuration file as well as MetGenC system default values for data ingest.

metgenc info --help
Usage: metgenc info [OPTIONS]

  Summarizes the contents of a configuration file.

Options:
  -c, --config TEXT  Path to configuration file to display  [required]
  --help             Show this message and exit.

Example running info

metgenc info -c /share/apps/metgenc/SNEX23_CSU_GPR/init/SNEX23_CSU_GPR.ini
                   __
   ____ ___  ___  / /_____ ____  ____  _____
  / __ `__ \/ _ \/ __/ __ `/ _ \/ __ \/ ___/
 / / / / / /  __/ /_/ /_/ /  __/ / / / /__
/_/ /_/ /_/\___/\__/\__, /\___/_/ /_/\___/
                   /____/
Using configuration:
  + environment: uat
  + data_dir: /disks/sidads_staging/SNOWEX_metgen/SNEX23_CSU_GPR_metgen/data
  + auth_id: SNEX23_CSU_GPR
  + version: 1
  + provider: SnowEx
  + local_output_dir: /share/apps/metgenc/SNEX23_CSU_GPR/output
  + ummg_dir: ummg
  + kinesis_stream_name: nsidc-cumulus-uat-external_notification
  + staging_bucket_name: nsidc-cumulus-uat-ingest-staging
  + write_cnm_file: True
  + overwrite_ummg: True
  + checksum_type: SHA256
  + number: 1000000
  + dry_run: False
  + premet_dir: /disks/sidads_staging/SNOWEX_metgen/SNEX23_CSU_GPR_metgen/premet
  + spatial_dir: /disks/sidads_staging/SNOWEX_metgen/SNEX23_CSU_GPR_metgen/spatial
  + collection_geometry_override: False
  + collection_temporal_override: False
  + time_start_regex: None
  + time_coverage_duration: None
  + pixel_size: None
  + browse_regex: _brws
  + granule_regex: None
  + reference_file_regex: None
  + spatial_polygon_enabled: False
  + spatial_polygon_target_coverage: 0.98
  + spatial_polygon_max_vertices: 100
  + spatial_polygon_cartesian_tolerance: 0.0001
  + prefer_geospatial_bounds: False
  + log_dir: /share/logs/metgenc
  + name: SNEX23_CSU_GPR

process

metgenc process --help

Usage: metgenc process [OPTIONS]

  Processes science files based on configuration file contents.

Options:
  -c, --config TEXT   Path to configuration file  [required]
  -d, --dry-run       Don't stage files on S3 or publish messages to Kinesis
  -e, --env TEXT      environment  [default: uat]  #note: this can be set to either `uat` or `prod`
  -n, --number count  Process at most 'count' granules.
  -wc, --write-cnm    Write CNM messages to files.
  -o, --overwrite     Overwrite existing UMM-G files.
  --help              Show this message and exit.

The process command can be run either with or without specifying the -d / --dry-run option.

  • When the dry run option is specified and either the -wc / --write-cnm option is invoked or your config file contains write_cnm_file = true (instead of = false), CNM will be written locally to the output/cnm directory, which the operator must have already created. This lets operators validate and visually QC CNM content before ingesting a collection.
  • When run without the dry run option, metgenc will transfer CNM to AWS, kicking off end-to-end ingest of data and UMM-G files.

Examples running process

The following is an example of using the dry run option (-d) to generate UMM-G and write CNM as files (-wc) for three granules (-n 3):

$ metgenc process -c ./init/test.ini -d -n 3 -wc

This next example would run end-to-end ingest of all granules (assuming < 1000000 granules) in the data directory specified in the test.ini config file, along with their UMM-G files, into the environment given by -e (CUAT for uat, CPRD for prod):

$ metgenc process -c ./init/test.ini -e <uat or prod>

Note: Before running process without the dry run option, post Slack messages to NSIDC's #Cumulus and cloud-ingest-ops channels, and post a quick "done" note when you've finished ingest testing as a courtesy to Cumulus devs and ops folks.

Troubleshooting metgenc process

  • MetGenC processing, metgenc process -d -c init/xxxxx.ini, must be run from the ~/metgenc directory in the VM's virtual environment, e.g., vagrant@vmpolark2:~/metgenc$. If you run it from data/, init/, or any other directory, you'll see errors like:
The configuration is invalid:
  * The data_dir does not exist.
  * The premet_dir does not exist.
  * The spatial_dir does not exist.
  * The local_output_dir does not exist.
  • If running metgenc process fails for other reasons, check for an error message in the metgenc log, which is written by default to /share/logs/metgenc/metgenc-{config-name}-{timestamp}.log.

    • The metgenc.log will spell out the reason for the error for the operator, so the .ini file or paths pointed to in the .ini file can be spiffed up.
  • If running metgenc process without the -d / --dry-run option leads to the following warning:

  The configuration is invalid:
    The kinesis stream does not exist.
    The staging bucket does not exist.

It's almost certainly indicating that you've not sourced the credentials required (cumulus-uat, cumulus-prod) for the environment you're telling MetGenC to process in.

  • If metgenc reports "Successful   : False" for a specific granule, copy the granule's UUID (the last alphanumeric block after the final dash is enough) and grep the metgenc log from that processing run for it, using -A 46 so that only the 46 lines following the match are returned. That'll show you the log details just for that granule!
  e.g., grep -A 46 43eae1561cba metgenc.log
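
If you would rather script this lookup, the following Python sketch finds the newest log for a config and returns the same context that grep -A 46 would. Both helper names are hypothetical. The only facts taken from this README are the log directory and the metgenc-{config-name}-{timestamp}.log naming pattern; the timestamp format is not assumed, so the newest file is chosen by modification time.

```python
from pathlib import Path


def latest_log(log_dir, config_name):
    """Return the most recently modified log for a config, or None if none exist.

    Caveat: the glob also matches configs whose names merely start with
    `config_name` followed by a dash.
    """
    candidates = list(Path(log_dir).glob(f"metgenc-{config_name}-*.log"))
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def lines_after_match(lines, needle, after=46):
    """Yield each line containing `needle` plus the `after` lines following it.

    Mirrors `grep -A <after> <needle>`, minus grep's "--" group separators.
    """
    remaining = 0
    for line in lines:
        if needle in line:
            remaining = after + 1  # the matching line itself plus its context
        if remaining > 0:
            yield line
            remaining -= 1


# Small self-contained demonstration with made-up log lines.
demo = [f"entry {i}" for i in range(10)]
print(list(lines_after_match(demo, "entry 3", after=2)))
```

On the VM, you could pass the open file returned by latest_log("/share/logs/metgenc", "<config name>") to lines_after_match along with the granule id fragment.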

validate

The validate command lets you review the JSON CNM or UMM-G output files created by running process.

metgenc validate --help

Usage: metgenc validate [OPTIONS]

  Validates the contents of local JSON files.

Options:
  -c, --config TEXT  Path to configuration file  [required]
  -t, --type TEXT    JSON content type  [default: cnm]
  --help             Show this message and exit.

Example running validate

$ metgenc validate -c init/modscg.ini -t ummg
$ metgenc validate -c init/modscg.ini

Adding the -t ummg option validates all UMM-G files; -t cnm validates all CNM files that have been written locally. Without the -t option, only the locally written CNM files are validated.

Running the following is an alternate way to validate UMM-G and CNM JSON files, but it can only be run on one file at a time:

$ check-jsonschema --schemafile <path to schema file> <path to CNM or UMM-G file to check>

If running metgenc validate fails, check the metgenc.log for an error message to begin troubleshooting.

Pretty-print a json file in your shell

Handy tip: While not a MetGenC command, an easy way to show a file's contents without having to wade through unformatted JSON chaos is to run: cat <UMM-G or CNM file name> | jq

e.g., running cat /share/apps/metgenc/SNEX23_CSU_GPR/output/cnm/SNEX23_CSU_GPR_FLCF_20230307_20230316_v01.csv.cnm.json | jq will pretty-print the contents of this cnm.json file in the comfort of your own shell!

For Developers

Contributing

Requirements

You can install Poetry either by using the official installer if you’re comfortable following the instructions, or by using a package manager (like Homebrew) if this is more familiar to you. When successfully installed, you should be able to run:

$ poetry --version
Poetry (version 1.8.3)

Installing Dependencies

  • Use Poetry to create and activate a virtual environment

    $ poetry shell
    
  • Install dependencies

    $ poetry install
    

Run tests

$ poetry run pytest

Run tests when source changes

This uses pytest-watcher

$ poetry run ptw . --now --clear

Running the linter for code style issues

$ poetry run ruff check

The ruff tool will check the source code for conformity with various style rules. Some of these can be fixed by ruff itself, and if so, the output will describe how to automatically fix these issues.

The CI/CD pipeline will run these checks whenever new commits are pushed to GitHub, and the results will be available in the GitHub Actions output.

Running the code formatter

$ poetry run ruff format

The ruff tool will check the source code for conformity with source code formatting rules. It will also fix any issues it finds and leave the changes uncommitted so you can review the changes prior to adding them to the codebase.

As with the linter, the CI/CD pipeline will run the formatter when commits are pushed to GitHub.

Ruff integration with your editor

Rather than running ruff manually from the command line, it can be integrated with the editor of your choice. See the ruff editor integration guide.

Releasing

  • Update CHANGELOG.md according to its representation of the current version:

    • If the current "version" in CHANGELOG.md is UNRELEASED, add an entry describing your new changes to the existing change summary list.

    • If the current version in CHANGELOG.md is not a release candidate, add a new line at the top of CHANGELOG.md with a "version" consisting of the string literal UNRELEASED (no quotes surrounding the string). It will be replaced with the release candidate form of an actual version number after the major, minor, or patch version is bumped (see below). Add a list summarizing the changes (thus far) in this new version below the UNRELEASED version entry.

    • If the current version in CHANGELOG.md is a release candidate, add an entry describing your new changes to the existing change summary list for this release candidate version. The release candidate version will be automatically updated when the rc version is bumped (see below).

  • Commit CHANGELOG.md so the working directory is clean.

  • Show the current version and the possible next versions:

      $ bump-my-version show-bump
      1.4.0 ── bump ─┬─ major ─── 2.0.0rc0
                     ├─ minor ─── 1.5.0rc0
                     ├─ patch ─── 1.4.1rc0
                     ├─ release ─ invalid: The part has already the maximum value among ['rc', 'release'] and cannot be bumped.
                     ╰─ rc ────── 1.4.0release1
    
  • If the currently released version of metgenc is not a release candidate and the goal is to start work on a new version, the first step is to create a pre-release version. As an example, if the current version is 1.4.0 and you'd like to release 1.5.0, first create a pre-release for testing:

      $ bump-my-version bump minor
    

    Now the project version will be 1.5.0rc0 -- Release Candidate 0. As testing for this release-candidate proceeds, you can create more release-candidates by:

      $ bump-my-version bump rc
    

    And the version will now be 1.5.0rc1. You can create as many release candidates as needed.

  • When you are ready to do a final release, you can:

      $ bump-my-version bump release
    

    Which will update the version to 1.5.0. After doing any kind of release, you will see the latest commit and tag by looking at git log. You can then push these to GitHub (git push --follow-tags) to trigger the CI/CD workflow.

  • On the GitHub repository, click 'Releases' and follow the steps documented on the GitHub Releases page. Draft a new Release using the version tag created above. By default, the 'Set as the latest release' checkbox will be selected. To publish a pre-release from a release candidate version, be sure to select the 'Set as a pre-release' checkbox. After you have published the (pre-)release in GitHub, the MetGenC Publish GHA workflow will be started. Check that the workflow succeeds on the MetGenC Actions page, and verify that the new MetGenC (pre-)release is available on PyPI.

Credit

This content was developed by the National Snow and Ice Data Center with funding from multiple sources.
