- MetGenC
- Level of Support
- Accessing the OPS MetGenC VM and Tips and Assumptions
- Assumptions for netCDF files for MetGenC
- MetGenC .ini File Assumptions
- NetCDF Attributes MetGenC Relies upon to generate UMM-G json files
- Geometry Logic
- Running MetGenC: Its Commands In-depth
- help
- init
- Required and Optional Configuration Elements
- Granule and Browse regex
- INI File Example 1: Use of granule_regex for multi-file granules with no browse
- INI File Example 2: Single-file granule with good file names and no browse-omit browse_regex and granule_regex
- INI File Example 3: Single-file granule with good file names and browse images-omit granule_regex
- INI File Example 4: Use of granule_regex and browse_regex for single-file granules with interrupted file names
- Using Premet and Spatial Files
- Setting Collection Spatial Extent as Granule Spatial Extent
- Setting Collection Temporal Extent as Granule Temporal Extent
- Spatial Polygon Generation
- info
- process
- validate
- Pretty-print a json file in your shell
- For Developers
The MetGenC toolkit enables Operations staff and data
producers to create metadata files conforming to NASA's Common Metadata Repository UMM-G
specification and ingest data directly to NASA EOSDIS’s Cumulus archive. Cumulus is an
open source cloud-based data ingest, archive, distribution, and management framework
developed for NASA's Earth Science data.
This repository is fully supported by NSIDC. If you discover any problems or bugs, please submit an Issue. If you would like to contribute to this repository, you may fork the repository and submit a pull request.
See the LICENSE for details on permissions and warranties. Please contact [email protected] for more information.
- From nusnow:
$ vssh production metgenc
- The one swell foop command line to kick off everything you need to run MetGenC:
Environment | Command |
---|---|
uat | cd metgenc;source .venv/bin/activate;source metgenc-env.sh cumulus-uat |
prod | cd metgenc;source .venv/bin/activate;source metgenc-env.sh cumulus-prod |
BE AWARE: IF YOU'VE BEEN TESTING INGEST IN CUAT, WHEN YOU'RE READY TO INGEST TO CPRD, MAKE SURE TO RUN source metgenc-env.sh cumulus-prod. MetGenC will happily let you use the -e prod option, but you need to have the right credentials sourced!!
If the creds aren't pointing to the right environment, MetGenC will return:
* The kinesis stream does not exist.
* The staging bucket does not exist.
Commands within the above one-liner detailed:
- cd into, and activate, the venv:
$ cd metgenc
$ source .venv/bin/activate
-
Before you run end-to-end ingest, be sure to source the AWS credentials:
$ source metgenc-env.sh cumulus-<uat or prod>
Available profiles are cumulus-uat and cumulus-prod.
If you think you've already run it but can't remember, run the following:
$ aws configure list
The output will either indicate that you need to source your credentials by returning:
Name Value Type Location
---- ----- ---- --------
profile <not set> None None
access_key <not set> None None
secret_key <not set> None None
region <not set> None None
Or it'll show that you're all set (AWS comms-wise) for ingesting to Cumulus by returning the following:
Name Value Type Location
---- ----- ---- --------
profile cumulus-<uat or prod> env ['AWS_DEFAULT_PROFILE', 'AWS_PROFILE']
access_key ****************SQXY env
secret_key ****************cJ+5 env
region us-west-2 config-file ~/.aws/config
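The same check can be sketched in Python for use in a wrapper script. This is a minimal sketch, assuming the sourced profile is exposed through the AWS_PROFILE or AWS_DEFAULT_PROFILE environment variables (as the aws configure list output above suggests); the function names are hypothetical and not part of MetGenC.

```python
import os

# Profile names come from the text above; the helper names are hypothetical.
PROFILES = {"uat": "cumulus-uat", "prod": "cumulus-prod"}


def sourced_profile(environ=os.environ):
    """Return the AWS profile named in the environment, or None if unset."""
    return environ.get("AWS_PROFILE") or environ.get("AWS_DEFAULT_PROFILE")


def credentials_match(environment, environ=os.environ):
    """True when the sourced profile matches the target environment (uat or prod)."""
    return sourced_profile(environ) == PROFILES[environment]


# Report whether the current shell is ready for each environment.
for env_name in ("uat", "prod"):
    print(env_name, credentials_match(env_name))
```

Running this before metgenc process with -e prod would flag the case where the cumulus-uat credentials are still sourced.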
- NetCDF files have an extension of .nc (per CF conventions).
- Projected spatial information is available in coordinate variables having a standard_name attribute value of projection_x_coordinate or projection_y_coordinate.
- (y[0],x[0]) represents the upper left corner of the spatial coverage.
- Spatial coordinate values represent the center of the area covered by a measurement.
- Only one coordinate system is used by all data variables in all science files (i.e. only one grid mapping variable is present in a file, and the content of that variable is the same in every science file).
- A pixel_size attribute is needed in a data set's .ini file when gridded science files don't include a GeoTransform attribute in the grid mapping variable. The value specified should be just a number; no units (m, km) need to be specified since they're assumed to be the same as the units of the spatial coordinate variables in the data set's science files.
  - e.g., pixel_size = 25
- Date/time strings can be parsed using datetime.fromisoformat
- The checksum_type must be SHA256
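To check the date/time assumption ahead of a run, a candidate string can be passed to datetime.fromisoformat directly. A minimal sketch; the helper name is hypothetical, and the set of accepted formats depends on the Python version (a trailing Z, for example, is only accepted from Python 3.11 onward).

```python
from datetime import datetime


def parses_as_iso(value):
    """Return True if datetime.fromisoformat accepts the string."""
    try:
        datetime.fromisoformat(value)
    except ValueError:
        return False
    return True


print(parses_as_iso("2021-11-01T00:00:00"))        # ISO 8601 date and time
print(parses_as_iso("2021-11-01 00:00:00+00:00"))  # with a UTC offset
print(parses_as_iso("11/01/2021"))                 # not ISO 8601
```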
CF Conventions and NSIDC Guidelines (=NSIDC Guidelines for netCDF Attributes) are the driving forces behind emphatically suggesting data producers include the Attributes used by MetGenC in their netCDF files.
- Required = required
- RequiredC = conditionally required
- R+ = highly or strongly recommended
- R = recommended
- S = suggested
Attribute used by MetGenC (location in netCDF file) | CF Conventions | NSIDC Guidelines | Notes |
---|---|---|---|
time_coverage_start (global) | R | 1, OC, P | |
time_coverage_end (global) | R | 1, OC, P | |
grid_mapping_name (variable) | RequiredC | R+ | 2 |
crs_wkt (variable with grid_mapping_name attribute) | R | 3 | |
GeoTransform (variable with grid_mapping_name attribute) | R | 4, OC | |
geospatial_lon_min (global) | R | ||
geospatial_lon_max (global) | R | ||
geospatial_lat_min (global) | R | ||
geospatial_lat_max (global) | R | ||
geospatial_bounds (global) | R | 7, OC | |
geospatial_bounds_crs (global) | ? | 8 | |
standard_name, projection_x_coordinate (variable) | RequiredC | 5 | |
standard_name, projection_y_coordinate (variable) | RequiredC | 6 | |
Notes column key:
OC = Optional configuration attributes (or elements of them) that may be represented in an .ini file in order to allow "nearly" compliant netCDF files to be run with MetGenC without premet/spatial files. See Required and Optional Configuration Elements.
P = Premet file attributes that may be specified in a premet file; when used, a premet_dir path must be defined in the .ini file.
1 = Used to populate the time begin and end UMM-G values; the OC .ini attribute for time_coverage_start is time_start_regex = <value>, and for time_coverage_end the .ini attribute is time_coverage_duration = <value>.
2 = A grid mapping variable is required if the horizontal spatial coordinates are not longitude and latitude and the intent of the data provider is to geolocate the data. grid_mapping and grid_mapping_name allow programmatic identification of the variable holding information about the horizontal coordinate reference system.
3 = The crs_wkt ("coordinate reference system well known text") value is handed to the CRS and Transformer modules in pyproj to conveniently deal with the reprojection of (y,x) values to EPSG 4326 (lon, lat) values.
4 = The GeoTransform value provides the pixel size per data value, which is then used to calculate the padding added to x and y values to create a GPolygon enclosing all of the data; the OC .ini attribute is pixel_size = <value>.
5 = The values of the coordinate variable identified by the standard_name attribute with a value of projection_x_coordinate are reprojected and thinned to create a GPolygon, bounding rectangle, etc.
6 = The values of the coordinate variable identified by the standard_name attribute with a value of projection_y_coordinate are reprojected and thinned to create a GPolygon, bounding rectangle, etc.
7 = The geospatial_bounds netCDF file global attribute contains spatial boundary information as a WKT POLYGON string. When present and prefer_geospatial_bounds = true is set in the .ini file, MetGenC will use this attribute instead of spatial coordinate values to generate spatial representations of granules in collections with a GEODETIC granule spatial representation. If the geospatial_bounds_crs attribute is also present in netCDF files, coordinates will be transformed to EPSG:4326 if needed. The corresponding .ini parameter is prefer_geospatial_bounds = true/false.
8 = The geospatial_bounds_crs netCDF file global attribute specifies the coordinate reference system for the coordinates in the geospatial_bounds global attribute. It can be an EPSG identifier (e.g., "EPSG:4326") or other CRS format. When present, MetGenC will transform geospatial_bounds coordinates to EPSG:4326 if needed. If prefer_geospatial_bounds is true and no geospatial_bounds_crs attribute exists, the coordinates in the geospatial_bounds attribute are assumed to represent points in EPSG:4326.
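Note 4 and the cell-center assumption can be illustrated with a short sketch. It assumes the GeoTransform attribute is a GDAL-style string of six space-separated numbers (x origin, pixel width, row rotation, y origin, column rotation, pixel height) and that the padding is half a pixel on each side; both are assumptions for illustration, and the function names are hypothetical rather than MetGenC's own.

```python
def pixel_size_from_geotransform(geotransform):
    """Return the pixel width from a GDAL-style GeoTransform string.

    GDAL orders the six values as: x origin, pixel width, row rotation,
    y origin, column rotation, pixel height (negative for north-up grids).
    """
    values = [float(v) for v in geotransform.split()]
    if len(values) != 6:
        raise ValueError("GeoTransform must hold six numbers")
    return abs(values[1])


def padded_extent(x, y, pixel_size):
    """Pad cell-center coordinates by half a pixel (assumed) to reach the cell edges."""
    half = pixel_size / 2.0
    return (min(x) - half, min(y) - half, max(x) + half, max(y) + half)


# Hypothetical 25 km grid: three cell centers along each axis.
size = pixel_size_from_geotransform("-3850000 25000 0 5850000 0 -25000")
print(size)
print(padded_extent([0, 25000, 50000], [50000, 25000, 0], size))
```

When a file has no GeoTransform attribute, the pixel_size value from the .ini file would take the place of the parsed value.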
On V0 wherever the data are staged (/disks/restricted_ftp or /disks/sidads_staging, etc.) you can run ncdump to check whether a netCDF representative of the collection's files contains the MetGenC-required attributes. When not reported, that attribute will have to be accommodated by its associated .ini attribute being added to the .ini file. See Required and Optional Configuration Elements for full details/descriptions of these.
ncdump -h <file name.nc> | grep -e time_coverage_start -e time_coverage_end -e GeoTransform -e crs_wkt -e spatial_ref -e grid_mapping_name -e geospatial_bounds -e geospatial_bounds_crs -e 'standard_name = "projection_y_coordinate"' -e 'standard_name = "projection_x_coordinate"'
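The grep above can also be expressed as a small Python check over saved ncdump -h output, which makes it easy to list the .ini stand-in for each missing attribute. A sketch only: the attribute list follows the table above, the stand-ins follow the optional .ini elements, and the function name and sample header are hypothetical.

```python
import re

# Attribute names from the table above; .ini stand-ins from the optional
# configuration elements. Names of helpers are hypothetical.
ATTRIBUTES = [
    "time_coverage_start", "time_coverage_end", "grid_mapping_name",
    "crs_wkt", "GeoTransform", "geospatial_bounds", "geospatial_bounds_crs",
]
STANDARD_NAMES = ["projection_x_coordinate", "projection_y_coordinate"]
INI_STAND_INS = {
    "time_coverage_start": "time_start_regex",
    "time_coverage_end": "time_coverage_duration",
    "GeoTransform": "pixel_size",
}


def missing_attributes(header_text):
    """Map each attribute absent from `ncdump -h` output to its .ini stand-in (or None)."""
    missing = {}
    for name in ATTRIBUTES:
        # ncdump prints attributes as "variable:name = value ;" or ":name = value ;"
        if not re.search(r":" + re.escape(name) + r"\s*=", header_text):
            missing[name] = INI_STAND_INS.get(name)
    for value in STANDARD_NAMES:
        pattern = r':standard_name\s*=\s*"' + re.escape(value) + '"'
        if not re.search(pattern, header_text):
            missing[value] = None
    return missing


# Hypothetical excerpt of `ncdump -h` output.
SAMPLE_HEADER = """
        x:standard_name = "projection_x_coordinate" ;
        y:standard_name = "projection_y_coordinate" ;
        crs:grid_mapping_name = "polar_stereographic" ;
        crs:crs_wkt = "..." ;
        :time_coverage_start = "2021-11-01T00:00:00" ;
        :geospatial_bounds_crs = "EPSG:4326" ;
"""
print(missing_attributes(SAMPLE_HEADER))
```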
The geometry behind the granule-level spatial representation (point, gpolygon, or bounding rectangle) required for a data set can be implemented by MetGenC via any of: file-level metadata (such as a CF/NSIDC Compliant netCDF file), .spatial / .spo files, or its collection-level spatial representation.
When MetGenC is run with netCDF files that are both CF and NSIDC Compliant (for those requirements, refer to the table: NetCDF Attributes Used to Populate the UMM-G files generated by MetGenC), information from within the file's metadata will be used to generate an appropriate gpolygon or bounding rectangle for each granule.
In some cases, non-netCDF files and netCDF files that are not CF or NSIDC compliant will require an operator to define or modify data set details expressed through attributes in an .ini file. In other cases an operator will also need to modify the .ini file to specify the paths where premet and spatial files are stored for MetGenC to use as input files.
For granules suited to using the spatial extent defined for its collection, a collection_geometry_override = True attribute/value pair can be added to the .ini file (as long as it's a single bounding rectangle, and not two or more bounding rectangles). Setting collection_geometry_override = False in the .ini file will make MetGenC look to the science files or premet/spatial files for the granule-level spatial representation geometry to use.
Granule Spatial Representation Geometry | Granule Spatial Representation Coordinate System (GSRCS) |
---|---|
GPolygon (GPoly) | Geodetic |
Bounding Rectangle (BR) | Cartesian |
Points | Geodetic |
.spo = .spo file associated with each granule, used to directly define the vertices of a gPoly.
.spatial = .spatial file associated with each granule to define either: BR, Point, or the data footprint (i.e., the .spatial simply contains a listing of all coordinates parsed from the science file) for which MetGenC is to generate a detailed, encompassing GPoly.
source | num points | GSRCS | error? | expected output | comments |
---|---|---|---|---|---|
.spo | any | cartesian | yes | .spo inherently defines GPoly vertices; GPolys cannot be cartesian. | |
.spo | <= 2 | geodetic | yes | At least three points are required to define a GPoly. | |
.spo | > 2 | geodetic | no | GPoly as described by .spo file contents. | |
.spatial | 1 | cartesian | yes | NSIDC data curators always associate a GEODETIC granule spatial representation with point data. | |
.spatial | 1 | geodetic | no | Point as defined by spatial file. | |
.spatial | 2 | cartesian | no | BR as defined by spatial file. | |
.spatial | >= 2 | geodetic | no | GPoly(s) calculated to enclose all points. | If spatial_polygon_enabled=true (default) and ≥3 points, uses optimized polygon generation with target coverage and vertex limits. |
.spatial | > 2 | cartesian | yes | There is no cartesian-associated geometry for GPolys. | |
science file (NSIDC/CF-compliant netCDF) | NA | cartesian | no | BR | min/max lon/lat points for BR expected to be included in global attributes. |
science file (NSIDC/CF-compliant) | 1 or > 2 | geodetic | no | Error if only two points. GPoly calculated from grid perimeter. | |
science file, non-NSIDC/CF-compliant netCDF or other format | NA | either | no | As specified by .ini file. | Configuration file must include a spatial_dir value (a path to the directory with valid .spatial or .spo files), or collection_geometry_override = True entry (which must be defined as a single point or a single bounding rectangle). |
collection spatial metadata geometry = cartesian with one BR | NA | cartesian | no | BR as described in collection metadata. | |
collection spatial metadata geometry = cartesian with one BR | NA | geodetic | yes | Collection geometry and GSRCS must both be cartesian. | |
collection spatial metadata geometry = cartesian with two or more BR | NA | cartesian | yes | Two-part bounding rectangle is not a valid granule-level geometry. | |
collection spatial metadata geometry specifying one or more points | NA | NA | Not a known use case |
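The .spo and .spatial rows of the table above can be read as a small decision function. The sketch below only restates those rows; it is not MetGenC's implementation, and the function name and return labels are hypothetical.

```python
def spatial_file_geometry(source, num_points, gsrcs):
    """Return the expected geometry for a .spo or .spatial file, or raise ValueError.

    source: ".spo" or ".spatial"; gsrcs: "geodetic" or "cartesian".
    """
    gsrcs = gsrcs.lower()
    if gsrcs not in ("geodetic", "cartesian"):
        raise ValueError("unknown granule spatial representation: " + gsrcs)
    if num_points < 1:
        raise ValueError("at least one point is required")

    if source == ".spo":
        if gsrcs == "cartesian":
            raise ValueError("GPolys cannot be cartesian")
        if num_points <= 2:
            raise ValueError("at least three points are required to define a GPoly")
        return "GPoly"

    if source == ".spatial":
        if gsrcs == "geodetic":
            # One point stays a Point; two or more are enclosed by GPoly(s).
            return "Point" if num_points == 1 else "GPoly"
        if num_points == 2:
            return "BR"
        raise ValueError("a cartesian .spatial file must hold exactly two points")

    raise ValueError("unsupported source: " + source)


print(spatial_file_geometry(".spatial", 2, "cartesian"))
print(spatial_file_geometry(".spo", 12, "geodetic"))
```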
Show MetGenC's help text:
$ metgenc --help
Usage: metgenc [OPTIONS] COMMAND [ARGS]...
The metgenc utility allows users to create granule-level metadata, stage
granule files and their associated metadata to Cumulus, and post CNM
messages.
Options:
--help Show this message and exit.
Commands:
info Summarizes the contents of a configuration file.
init Populates a configuration file based on user input.
process Processes science files based on configuration file...
validate Validates the contents of local JSON files.
- For detailed help on each command, run metgenc <command name> --help, for example:
$ metgenc process --help
The init command can be used to generate a metgenc configuration (i.e., .ini) file for your data set, or edit an existing .ini file.
- You don't need to run this command if you already have an .ini file that you prefer to copy and edit manually (any text editor will work) to apply to the collection you're ingesting.
- If running metgenc init, the name of the new ini file you specify needs to include the .ini suffix.
metgenc init --help
Usage: metgenc init [OPTIONS]
Populates a configuration file based on user input.
Options:
-c, --config TEXT Path to configuration file to create or replace
--help Show this message and exit
Example running init
$ metgenc init -c ./init/<name of config file to create or modify>.ini
- The .ini file's checksum_type = SHA256 should never be edited.
- The kinesis_stream_name and staging_bucket_name should never be edited.
- auth_id and version must accurately reflect the collection's authID and versionID.
- log_dir specifies the directory where metgenc log files will be written. Log files are named metgenc-{config-name}-{timestamp}.log where config-name is the base name of the .ini file and timestamp is in YYYYMMDD-HHMM format. The default log directory is /share/logs/metgenc, but this can be edited to write metgenc logs to a different existing, writable directory location.
- provider is a free text attribute where, for now, the version of metgenc being run should be documented.
  - Running metgenc --version will return the current version.
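To locate the log for a given run, the naming pattern described above can be reproduced as follows. A sketch that assumes "base name" means the .ini file name without its directory or .ini suffix; the function name is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def log_file_name(config_path, when):
    """Build metgenc-{config-name}-{timestamp}.log for a config file and a run time."""
    config_name = Path(config_path).stem       # assumed: file name without .ini
    timestamp = when.strftime("%Y%m%d-%H%M")   # YYYYMMDD-HHMM
    return f"metgenc-{config_name}-{timestamp}.log"


print(log_file_name("init/SNEX23_CSU_GPR.ini", datetime(2025, 1, 2, 3, 4)))
```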
Some attribute values that are known for the data set may be read from the .ini file when they can't be gleaned from, or don't exist in, the science file(s). Use of these elements can be typical for data sets comprising non-CF/non-NSIDC-compliant netCDF science files, as well as non-netCDF data sets comprising .tif, .csv, .h5, etc. The element values must be manually added to the .ini file, as none are prompted for in the metgenc init functionality.
See this project's GitHub file, fixtures/test.ini, for examples.
.ini element | .ini section | Attribute absent from netCDF file the .ini attribute stands in for | Attribute populated in UMMG | Note |
---|---|---|---|---|
time_start_regex | Collection | time_coverage_start | BeginningDateTime | 1 |
time_coverage_duration | Collection | time_coverage_end | EndingDateTime | 2 |
pixel_size | Collection | GeoTransform | n/a | 3 |
R = Required for all non-netCDF file types (e.g., csv, .tif, .h5, etc) and netCDF files missing the global attribute specified
1. time_start_regex: this regex attribute uses a date contained in a netCDF file's name to populate the UMMG TemporalExtent attribute BeginningDateTime. It must match using the named group (?P<time_coverage_start>).
   - This attribute is meant to be used with "nearly" compliant netCDF files, but not other file types (csv, tif, etc.) since these should rely on premet files containing temporal details for each file.
2. time_coverage_duration: this value specifies the duration to be applied to the time_coverage_start value in order to generate EndingDateTime values in UMMG files. The value is a constant: the same duration is applied to every time_start_regex value gleaned from file names. It must be a valid ISO duration value.
   - This attribute is meant to be used only with "nearly" compliant netCDF files, not any other file types, since all other file types will rely on premet files to generate temporal details in output ummg metadata files. Example:
time_start_regex = IRTIT3_(?P<time_coverage_start>\d{8})_
time_coverage_duration = P0DT23H59M59S
3. pixel_size: rarely applicable for science files that aren't gridded netCDF (.txt, .csv, .jpg, .tif, etc.); this value is a constant that will be applied to all granule-level metadata.
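How the two example values above combine can be sketched in Python. The regex and duration are the ones shown in the example; the file name is hypothetical, the duration parser handles only the day/hour/minute/second fields, and interpreting the captured digits as YYYYMMDD is an assumption about this example rather than a statement of how MetGenC parses them.

```python
import re
from datetime import datetime, timedelta

TIME_START_REGEX = r"IRTIT3_(?P<time_coverage_start>\d{8})_"
TIME_COVERAGE_DURATION = "P0DT23H59M59S"

# Handles only the D/H/M/S fields of an ISO 8601 duration.
DURATION_PATTERN = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text):
    """Convert a simple ISO 8601 duration string into a timedelta."""
    match = DURATION_PATTERN.match(text)
    if match is None:
        raise ValueError("unsupported ISO duration: " + text)
    parts = {key: int(value) for key, value in match.groupdict().items() if value}
    return timedelta(**parts)


def temporal_extent(file_name):
    """Return (begin, end) from the date in the file name plus the fixed duration."""
    match = re.search(TIME_START_REGEX, file_name)
    if match is None:
        raise ValueError("no time_coverage_start found in " + file_name)
    begin = datetime.strptime(match.group("time_coverage_start"), "%Y%m%d")
    return begin, begin + parse_duration(TIME_COVERAGE_DURATION)


# Hypothetical file name matching the example regex.
begin, end = temporal_extent("IRTIT3_20101226_example.nc")
print(begin.isoformat(), end.isoformat())
```

Because the duration is a constant, every granule whose name matches the regex gets the same length of temporal coverage.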
.ini element | .ini section | Note |
---|---|---|
browse_regex | Collection | 1 |
granule_regex | Collection | 2 |
reference_file_regex | Collection | 3 |
Note column:
1. browse_regex: the file name pattern identifying the browse file(s) accompanying single or multi-file granules. Granules with multiple associated browse files work fine with MetGenC! The default is _brws; change it to reflect the browse file names of the data delivered. This element is prompted for when running metgenc init.
2. granule_regex: the file name pattern used for multi-file granules to group files together as a granule using the elements common amongst their names.
   - This must result in a globally unique product/name (in CNM) and Identifier (as the IdentifierType: ProducerGranuleId in UMM-G) generated for each granule. This element value must be added manually as it's not included in the metgenc init prompts.
3. reference_file_regex: the file name pattern identifying a single file for metgenc to reference as the primary file in a multi-file granule. This must be specified whenever working with multi-file granules. This element is prompted for when running metgenc init.
   - In the case of multi-file granules containing a CF-compliant netCDF science file and other supporting files like .tif, or .txt files, etc., specifying the netCDF will allow MetGenC to parse it as it would any other CF-compliant netCDF file, making it so operators don't need to supply premet/spatial files.
Given the Config file Source and Collection contents:
[Source]
data_dir = data/IPFLT1B_DUCk
premet_dir = premet/ipflt1b
spatial_dir = spatial/ipflt1b
[Collection]
auth_id = IPFLT1B_DUCk
version = 1
provider = OIB; metgenc version 1.10.0rc0
granule_regex = (IPFLT1B_)(?P<granuleid>.+?(?=_)_)?(DUCk)
reference_file_regex = _DUCk.kml
And a multi-file granule comprising the following files:
IPFLT1B_20101226_085033_DUCk.dbf
IPFLT1B_20101226_085033_DUCk.kml
IPFLT1B_20101226_085033_DUCk.shp
IPFLT1B_20101226_085033_DUCk.shx
IPFLT1B_20101226_085033_DUCk.txt
The granule_regex sections:
- (IPFLT1B_) and (DUCk) identify the 1st and 3rd (the last) Capture Groups, which parse the constants to be included in each granule name: authID and DUCk.
- The Named Capture Group granuleid, (?P<granuleid>.+?(?=_)_)?, matches the unique date/time element contained in each file name to be included in each granule name, e.g., 20101226_085033_.
- Thus, IPFLT1B_ and DUCk are combined with the granuleid capture group element to become the producerGranuleId reflected for each granule in EDSC's Granules listing. This globally and uniquely distinguishes the granules of a given collection from any other files in other collections in CUAT or CPROD. In this case that's IPFLT1B_20101226_085033_DUCk. This is reflected in the CNM as the product/name value, and the UMMG as the Identifier value. Note: Ideally there would also be a version ID in this file name, but version wasn't assigned in most IceBridge collection granule names.
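The grouping described above can be tried out with Python's re module. The regexes and file names are the ones from this example; joining the captured groups in order is an assumption about how the pieces are combined, and the function name is hypothetical.

```python
import re

GRANULE_REGEX = re.compile(r"(IPFLT1B_)(?P<granuleid>.+?(?=_)_)?(DUCk)")
REFERENCE_FILE_REGEX = re.compile(r"_DUCk.kml")

FILES = [
    "IPFLT1B_20101226_085033_DUCk.dbf",
    "IPFLT1B_20101226_085033_DUCk.kml",
    "IPFLT1B_20101226_085033_DUCk.shp",
    "IPFLT1B_20101226_085033_DUCk.shx",
    "IPFLT1B_20101226_085033_DUCk.txt",
]


def granule_name(file_name):
    """Join every captured group, in order, to form the granule name."""
    match = GRANULE_REGEX.search(file_name)
    if match is None:
        raise ValueError("granule_regex does not match " + file_name)
    # The optional granuleid group is None when absent, so skip it in that case.
    return "".join(group for group in match.groups() if group is not None)


names = {granule_name(name) for name in FILES}
reference_files = [name for name in FILES if REFERENCE_FILE_REGEX.search(name)]
print(names)
print(reference_files)
```

All five files collapse to a single granule name, and only the .kml file matches reference_file_regex.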
INI File Example 2: Single-file granule with good file names and no browse-omit browse_regex and granule_regex
This .ini file's [Source] and [Collection] contents apply to a single-file granule with no browse images:
[Source]
data_dir = /disks/sidads_staging/SNOWEX_metgen/SNEX23_CSU_GPR_metgen/data
premet_dir = /disks/sidads_staging/SNOWEX_metgen/SNEX23_CSU_GPR_metgen/premet
spatial_dir = /disks/sidads_staging/SNOWEX_metgen/SNEX23_CSU_GPR_metgen/spatial
[Collection]
auth_id = SNEX23_CSU_GPR
version = 1
provider = SnowEx
No regex are necessary since the file name will simply become the granule name.
This .ini file's [Source] and [Collection] contents work for single-file granules with browse images:
[Source]
data_dir = ./data/0081
[Collection]
auth_id = NSIDC-0081
version = 2
provider = DPT
browse_regex = _F\d{2}
And two granules + their associated browse files and good granule names:
NSIDC0081_SEAICE_PS_N25km_20211101_v2.0.nc
NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F16.png
NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F17.png
NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F18.png
NSIDC0081_SEAICE_PS_S25km_20211102_v2.0.nc
NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_F16.png
NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_F17.png
NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_F18.png
Only the browse_regex needs to be set, to capture what distinguishes the browse files from the science files. In this case that's _F\d{2}, which matches _F16, _F17, and _F18.
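A quick way to confirm that a browse_regex separates browse files from science files is to run it over a sample of file names. A sketch using the first granule from this example; the function name is hypothetical.

```python
import re

BROWSE_REGEX = re.compile(r"_F\d{2}")

FILES = [
    "NSIDC0081_SEAICE_PS_N25km_20211101_v2.0.nc",
    "NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F16.png",
    "NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F17.png",
    "NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F18.png",
]


def split_browse(files):
    """Return (science, browse) lists according to BROWSE_REGEX."""
    browse = [name for name in files if BROWSE_REGEX.search(name)]
    science = [name for name in files if not BROWSE_REGEX.search(name)]
    return science, browse


science, browse = split_browse(FILES)
print(science)
print(browse)
```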
INI File Example 4: Use of granule_regex and browse_regex for single-file granules with interrupted file names
Given the .ini file's [Source] and [Collection] contents:
[Source]
data_dir = ./data/0081DUCk
[Collection]
auth_id = NSIDC-0081DUCk
version = 2
provider = DPT
browse_regex = _brws
granule_regex = (NSIDC0081_SEAICE_PS_)(?P<granuleid>[NS]{1}\d{2}km_\d{8})(_v2.0_)(?:F\d{2}_)?(DUCk)
And two granules + their associated browse files:
NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_DUCk.nc
NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F16_DUCk_brws.png
NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F17_DUCk_brws.png
NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F18_DUCk_brws.png
NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_DUCk.nc
NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_F16_DUCk_brws.png
NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_F17_DUCk_brws.png
NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_F18_DUCk_brws.png
The browse_regex: This simply identifies the part of the browse file name that distinguishes it as the browse from the science file, in this example: browse_regex = _brws.
The granule_regex sections: In the case where a file name element interrupts what would be a string common to both the science and browse file names, a granule_regex is required to identify the granule name.
- (NSIDC0081_SEAICE_PS_), (_v2.0_), and (DUCk) identify the 1st, 3rd, and 4th (the last) Capture Groups. These are constants required to be present in each granule's name: authID, version ID, and DUCk (the latter was only relevant for early CUAT testing). These are combined with the following...
- The Named Capture Group granuleid, (?P<granuleid>[NS]{1}\d{2}km_\d{8}), matches the region, resolution, and date elements unique to each file name (e.g., N25km_20211101 and S25km_20211102), which are combined with the elements in the bullet above to form unique granule names.
- (?:F\d{2}_)? matches the F16_, F17_, and F18_ strings in the browse file names as a Non-capture Group; these elements will be matched but won't be included in granule names.
- In summary: NSIDC0081_SEAICE_PS_, _v2.0_, and DUCk are combined with the granuleid capture group element, (?P<granuleid>[NS]{1}\d{2}km_\d{8}), to form the producerGranuleId reflected for each granule, e.g., NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_DUCk.nc and NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_DUCk.nc. These are the names that will be shown for the granules in EDSC. They globally, uniquely distinguish granules in a specific collection from any other granules in any other collections in CUAT or CPROD. These names are found in the CNM as the product/name value, and the UMMG metadata file as the Identifier value.
  - If the granule_regex was omitted from the .ini file in this case, the cnm output would only define data and metadata files for ingest; the browse images would not be included!
  - Since metgenc validate doesn't check attribute values, no validation errors are thrown when this happens.
  - This hopefully is largely an example portraying a made-up edge case due to the way I'd added the _DUCk identifier to these files for early MetGenC testing!! But be aware of this if you find yourself dealing with complicated file names where the elements meant to comprise the granule id are interrupted by other elements.
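The effect of the non-capture group can be checked with a few lines of Python. The regexes and file names come from this example; joining the captured groups in order and the helper name are assumptions made for illustration.

```python
import re

GRANULE_REGEX = re.compile(
    r"(NSIDC0081_SEAICE_PS_)(?P<granuleid>[NS]{1}\d{2}km_\d{8})(_v2.0_)(?:F\d{2}_)?(DUCk)"
)
BROWSE_REGEX = re.compile(r"_brws")

FILES = [
    "NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_DUCk.nc",
    "NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F16_DUCk_brws.png",
    "NSIDC0081_SEAICE_PS_N25km_20211101_v2.0_F17_DUCk_brws.png",
    "NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_DUCk.nc",
    "NSIDC0081_SEAICE_PS_S25km_20211102_v2.0_F16_DUCk_brws.png",
]


def group_files(files):
    """Group science and browse files under the name built from the captured groups."""
    granules = {}
    for name in files:
        match = GRANULE_REGEX.search(name)
        if match is None:
            continue
        # The non-capture group (F16_, F17_, ...) is matched but left out of the key.
        key = "".join(match.groups())
        entry = granules.setdefault(key, {"science": [], "browse": []})
        kind = "browse" if BROWSE_REGEX.search(name) else "science"
        entry[kind].append(name)
    return granules


for key, members in group_files(FILES).items():
    print(key, len(members["science"]), len(members["browse"]))
```

Science and browse files share a key even though the F16_/F17_ element interrupts the browse file names.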
When necessary, the following two .ini elements can be added to define the paths to directories containing premet and spatial files. They must be two separate directories, and separate from the data directory. The user will be prompted for these values when running metgenc init.
.ini element | .ini section |
---|---|
premet_dir | Source |
spatial_dir | Source |
- The spatial_dir defines the path to the directory containing either .spatial or .spo files.
- The composition of .spatial/.spo and .premet files and their naming convention is to remain exactly as it is/has been for their use with SIPSMetgen, and is described here: https://nsidc.org/sites/default/files/documents/other/guidelines-preliminary-metadata-creation-and-data-product-delivery.pdf
- This was done to avoid changing existing ops and/or data producer workflows/scripts.
- Reminder for premets: there should be a compelling reason (i.e., preserving continuity of an existing collection) from the pub team in order to include more attributes than just begin/end date/time. Most, if not all, new data sets requiring premets should see them include only begin/end date/time.
In cases of data sets where granule spatial information is not available by interrogating the data or via .spatial or .spo files, the operator may set a flag to force the metadata representing each granule's spatial extents to be set to that of the collection. The user will be prompted for the collection_geometry_override value when running metgenc init. The default value is False; setting it to True signals MetGenC to use the collection's spatial extent for each granule.
.ini element | .ini section |
---|---|
collection_geometry_override | Source |
RARELY APPLICABLE (if ever)!! An operator may set an .ini flag to indicate that the collection's temporal extent should be used to populate every granule's TemporalExtent (SingleDateTime or BeginningDateTime and EndingDateTime) in the granule-level UMMG json. In other words, every granule in a collection would display the same start and end times in EDSC. For most collections, this is likely an ill-advised use case. The operator will be prompted for a collection_temporal_override value when running metgenc init. The default value is False and should likely always be accepted; setting it to True is what would signal MetGenC to set each granule to the collection's TemporalExtent.
.ini element | .ini section |
---|---|
collection_temporal_override | Source |
MetGenC includes optimized polygon generation capabilities for creating spatial coverage polygons from point data, particularly useful for LIDAR flightline data.
When a granule has an associated .spatial file containing geodetic point data (≥3 points), MetGenC will automatically generate an optimized polygon to enclose the data points instead of using the basic point-to-point polygon method. This results in more accurate spatial coverage with fewer vertices.
This feature, while optional, is always enabled by default in MetGenC.
- To disable it entirely, edit the .ini file, add a [Spatial] section if necessary, and add the line spatial_polygon_enabled = false. CURRENTLY RECOMMENDED TO SET spatial_polygon_enabled = false WHENEVER .SPO FILES ARE USED.
- When spatial_polygon_enabled = true (either by default or when set as such in the .ini file) the other parameters listed below can be added to and edited in the .ini file. For the most part, the values shouldn't need to be altered! However, if ingest fails due to GPolygonSpatial errors, the first attribute to add to or edit in the .ini file should be spatial_polygon_cartesian_tolerance, decreasing its coordinate precision (e.g., .0001 => .01), which will increase the distance between gpolygon vertices, expanding the spatial extent.
Configuration Parameters:
.ini section | .ini element | Type | Default | Description |
---|---|---|---|---|
Spatial | spatial_polygon_enabled | boolean | true | Enable/disable polygon generation for .spatial files |
Spatial | spatial_polygon_target_coverage | float | 0.98 | Target data coverage percentage (0.80-1.0) |
Spatial | spatial_polygon_max_vertices | integer | 100 | Maximum vertices in generated polygon (10-1000) |
Spatial | spatial_polygon_cartesian_tolerance | float | 0.0001 | Minimum distance between polygon points in degrees (0.00001-0.01) |
Example showing content added to an .ini file, having edited the CMR default vertex tolerance (distance between two vertices) to decrease the precision of the GPoly coordinate pairs listed in the UMMG json files MetGenC generates:
[Spatial]
spatial_polygon_enabled = true
spatial_polygon_target_coverage = 0.98
spatial_polygon_max_vertices = 100
spatial_polygon_cartesian_tolerance = .01
Example showing the key pair added to an .ini file to disable spatial polygon generation:
[Spatial]
spatial_polygon_enabled = false
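Before a run, the [Spatial] values in an .ini file can be sanity-checked against the defaults and ranges listed in the table above. This sketch uses Python's standard configparser; whether MetGenC itself enforces these ranges is not stated here, so the range check and the function name are assumptions.

```python
import configparser

# Ranges copied from the Configuration Parameters table above.
RANGES = {
    "spatial_polygon_target_coverage": (0.80, 1.0),
    "spatial_polygon_max_vertices": (10, 1000),
    "spatial_polygon_cartesian_tolerance": (0.00001, 0.01),
}


def read_spatial_settings(ini_text):
    """Read the [Spatial] section, applying the documented defaults when values are absent."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section("Spatial"):
        parser.add_section("Spatial")
    section = parser["Spatial"]
    settings = {
        "spatial_polygon_enabled": section.getboolean("spatial_polygon_enabled", fallback=True),
        "spatial_polygon_target_coverage": section.getfloat("spatial_polygon_target_coverage", fallback=0.98),
        "spatial_polygon_max_vertices": section.getint("spatial_polygon_max_vertices", fallback=100),
        "spatial_polygon_cartesian_tolerance": section.getfloat("spatial_polygon_cartesian_tolerance", fallback=0.0001),
    }
    for key, (low, high) in RANGES.items():
        if not low <= settings[key] <= high:
            raise ValueError(f"{key} = {settings[key]} is outside {low}-{high}")
    return settings


print(read_spatial_settings("[Spatial]\nspatial_polygon_cartesian_tolerance = .01\n"))
```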
When Polygon Generation is Applied:
- ✅ Granule has a .spatial file with ≥3 geodetic points
- ✅ spatial_polygon_enabled = true (default)
- ✅ Granule spatial representation is GEODETIC
When Original Behavior is Used:
- ❌ No .spatial file present (data from other sources)
- ❌ spatial_polygon_enabled = false
- ❌ Granule spatial representation is CARTESIAN
- ❌ Insufficient points (<3) for polygon generation
- ❌ Polygon generation fails (automatic fallback)
Tolerance Requirements:
The spatial_polygon_cartesian_tolerance parameter ensures that generated polygons meet NASA CMR validation requirements. The CMR system requires that each point in a polygon must have a unique spatial location: if two points are closer than the tolerance threshold in both latitude and longitude, they are considered the same point and the polygon becomes invalid. MetGenC automatically filters points during polygon generation to ensure this requirement is met.
This enhancement is backward compatible: existing workflows continue unchanged, and polygon generation only activates for appropriate .spatial file scenarios.
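The uniqueness rule described above can be illustrated with a simple filter. This is a sketch of the rule as stated (two points count as the same when they are within the tolerance in both longitude and latitude), not MetGenC's filtering code; the function name is hypothetical.

```python
def filter_by_tolerance(points, tolerance=0.0001):
    """Drop points that fall within the tolerance of an already-kept point.

    points: iterable of (lon, lat) pairs in degrees.
    """
    kept = []
    for lon, lat in points:
        duplicate = any(
            abs(lon - kept_lon) < tolerance and abs(lat - kept_lat) < tolerance
            for kept_lon, kept_lat in kept
        )
        if not duplicate:
            kept.append((lon, lat))
    return kept


points = [(0.0, 0.0), (0.00005, 0.00005), (0.00005, 1.0), (1.0, 1.0)]
print(filter_by_tolerance(points))
```

Here the second point is dropped because it is within the tolerance of the first on both axes, while the third is kept because only its longitude is close.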
MetGenC can extract polygon vertices directly from the geospatial_bounds netCDF attribute when it contains a WKT POLYGON string. This extracts all polygon vertices as individual points, providing an alternative to the default of using spatial coordinate values to generate a polygon.
If no geospatial_bounds_crs attribute exists, the geospatial_bounds value is assumed to represent points in EPSG:4326.
Example Configuration:
[Spatial]
prefer_geospatial_bounds = true
When Geospatial Bounds Extraction is Applied:
- ✅ Granule spatial representation is GEODETIC
- ✅ prefer_geospatial_bounds = true in .ini file
- ✅ NetCDF file contains valid geospatial_bounds global attribute with WKT POLYGON
The info command can be used to display the information within the configuration file as well as MetGenC system default values for data ingest.
metgenc info --help
Usage: metgenc info [OPTIONS]
Summarizes the contents of a configuration file.
Options:
-c, --config TEXT Path to configuration file to display [required]
--help Show this message and exit.
metgenc info -c /share/apps/metgenc/SNEX23_CSU_GPR/init/SNEX23_CSU_GPR.ini
__
____ ___ ___ / /_____ ____ ____ _____
/ __ `__ \/ _ \/ __/ __ `/ _ \/ __ \/ ___/
/ / / / / / __/ /_/ /_/ / __/ / / / /__
/_/ /_/ /_/\___/\__/\__, /\___/_/ /_/\___/
/____/
Using configuration:
+ environment: uat
+ data_dir: /disks/sidads_staging/SNOWEX_metgen/SNEX23_CSU_GPR_metgen/data
+ auth_id: SNEX23_CSU_GPR
+ version: 1
+ provider: SnowEx
+ local_output_dir: /share/apps/metgenc/SNEX23_CSU_GPR/output
+ ummg_dir: ummg
+ kinesis_stream_name: nsidc-cumulus-uat-external_notification
+ staging_bucket_name: nsidc-cumulus-uat-ingest-staging
+ write_cnm_file: True
+ overwrite_ummg: True
+ checksum_type: SHA256
+ number: 1000000
+ dry_run: False
+ premet_dir: /disks/sidads_staging/SNOWEX_metgen/SNEX23_CSU_GPR_metgen/premet
+ spatial_dir: /disks/sidads_staging/SNOWEX_metgen/SNEX23_CSU_GPR_metgen/spatial
+ collection_geometry_override: False
+ collection_temporal_override: False
+ time_start_regex: None
+ time_coverage_duration: None
+ pixel_size: None
+ browse_regex: _brws
+ granule_regex: None
+ reference_file_regex: None
+ spatial_polygon_enabled: False
+ spatial_polygon_target_coverage: 0.98
+ spatial_polygon_max_vertices: 100
+ spatial_polygon_cartesian_tolerance: 0.0001
+ prefer_geospatial_bounds: False
+ log_dir: /share/logs/metgenc
+ name: SNEX23_CSU_GPR
metgenc process --help
Usage: metgenc process [OPTIONS]
Processes science files based on configuration file contents.
Options:
-c, --config TEXT Path to configuration file [required]
-d, --dry-run Don't stage files on S3 or publish messages to Kinesis
-e, --env TEXT environment [default: uat] #note: this can be set to either `uat` or `prod`
-n, --number count Process at most 'count' granules.
-wc, --write-cnm Write CNM messages to files.
-o, --overwrite Overwrite existing UMM-G files.
--help Show this message and exit.
The process command can be run either with or without specifying the -d / --dry-run option.
- When the dry run option is specified and the -wc / --write-cnm option is invoked, or your config file contains write_cnm_file = true (instead of = false), CNM will be written locally to the output/cnm directory (which the operator must have already created!). This lets operators validate and visually QC their content before ingesting a collection.
- When run without the dry run option, metgenc will transfer CNM to AWS, kicking off end-to-end ingest of data and UMM-G files.
The following is an example of using the dry run option (-d) to generate UMM-G and write CNM as files (-wc) for three granules (-n 3):
$ metgenc process -c ./init/test.ini -d -n 3 -wc
This next example would run end-to-end ingest of all granules (assuming < 1000000 granules) in the data directory specified in the test.ini config file, along with their UMM-G files, into the Cumulus environment given with -e (uat by default):
$ metgenc process -c ./init/test.ini -e <uat or prod>
Note: Before running process without the dry run option, post Slack messages to NSIDC's #Cumulus and cloud-ingest-ops channels, and post a quick "done" note when you're done ingest testing as a courtesy to Cumulus devs and ops folks.
- MetGenC processing, metgenc process -d -c init/xxxxx.ini, must be run at the ~/metgenc level in the VM's virtual environment, e.g., vagrant@vmpolark2:~/metgenc$. If you run it in data/, init/, or any other directory, you'll see errors like:
The configuration is invalid:
* The data_dir does not exist.
* The premet_dir does not exist.
* The spatial_dir does not exist.
* The local_output_dir does not exist.
- If running metgenc process fails for other reasons, check for an error message in the metgenc log. By default this is written to /share/logs/metgenc/metgenc-{config-name}-{timestamp}.log.
  - The log will spell out the reason for the error, so the .ini file or the paths it points to can be corrected.
- If running metgenc process without the -d / --dry-run option leads to the following warning:
The configuration is invalid:
The kinesis stream does not exist.
The staging bucket does not exist.
This almost certainly indicates that you haven't sourced the credentials (cumulus-uat, cumulus-prod) required for the environment you're telling MetGenC to process in.
- If metgenc reports "Successful : False" for a specific granule, copy its UUID (just the last alphanumeric block after the final dash is adequate), then grep the metgenc log for that processing run for that id, returning only the 46 lines after it. That shows the log details for just that granule, e.g.:
grep -A 46 43eae1561cba metgenc.log
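Where grep isn't handy, the same "match plus N trailing lines" lookup can be sketched in a few lines of Python. The function name and the sample log lines are hypothetical; the sketch only mimics the behavior of grep -A on a list of lines.

```python
def grep_with_context(lines, needle, after=46):
    """Return each line containing `needle` plus the `after` lines that follow it."""
    selected = []
    remaining = 0
    for line in lines:
        if needle in line:
            # Restart the window: this line plus `after` trailing lines.
            remaining = after + 1
        if remaining > 0:
            selected.append(line)
            remaining -= 1
    return selected


# Hypothetical log content, using a short window for readability.
log_lines = [
    "startup",
    "granule id ...-43eae1561cba begin",
    "detail 1",
    "detail 2",
    "detail 3",
]
excerpt = grep_with_context(log_lines, "43eae1561cba", after=2)
print("\n".join(excerpt))
```

To use it on a real log, pass the lines read from the log file (for example, via `open(path).read().splitlines()`) and leave `after` at 46.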
The validate command lets you review the JSON CNM or UMM-G output files created by running process.
metgenc validate --help
Usage: metgenc validate [OPTIONS]
Validates the contents of local JSON files.
Options:
-c, --config TEXT Path to configuration file [required]
-t, --type TEXT JSON content type [default: cnm]
--help Show this message and exit.
Adding the -t ummg option will validate all UMM-G files; -t cnm will validate all CNM that have been written locally:
$ metgenc validate -c init/modscg.ini -t ummg
Without the -t option specified, only locally written CNM will be validated:
$ metgenc validate -c init/modscg.ini
Running the following is an alternate way to validate UMM-G and CNM json files, but it can only be run on one file at a time:
$ check-jsonschema --schemafile <path to schema file> <path to CNM or UMM-G file to check>
If running metgenc validate fails, check the metgenc.log for an error message to begin troubleshooting.
Handy tip: While not a MetGenC command, a handy way to show a file's contents without having to wade through unformatted json chaos is to run:
cat <UMM-G or CNM file name> | jq
e.g., running
cat /share/apps/metgenc/SNEX23_CSU_GPR/output/cnm/SNEX23_CSU_GPR_FLCF_20230307_20230316_v01.csv.cnm.json | jq
will pretty-print the contents of this cnm.json file in the comfort of your own shell!
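If jq isn't installed, Python's standard json module offers a rough equivalent. The sketch below is an assumption-labeled illustration: the function name and the sample JSON text are hypothetical stand-ins for a real UMM-G or CNM file's contents.

```python
import json


def pretty_json(text, indent=2):
    """Re-serialize JSON text with indentation for easier reading."""
    return json.dumps(json.loads(text), indent=indent)


# Hypothetical compact JSON standing in for a UMM-G or CNM file's contents.
compact = '{"GranuleUR": "example", "files": [1, 2]}'
formatted = pretty_json(compact)
print(formatted)
```

The same module can also be run directly from the shell as `python -m json.tool <file>`.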
You can install Poetry either by using the official installer if you’re comfortable following the instructions, or by using a package manager (like Homebrew) if this is more familiar to you. When successfully installed, you should be able to run:
$ poetry --version
Poetry (version 1.8.3)
- Use Poetry to create and activate a virtual environment:
  $ poetry shell
- Install dependencies:
  $ poetry install
$ poetry run pytest
To rerun the tests automatically whenever files change, use pytest-watcher:
$ poetry run ptw . --now --clear
$ poetry run ruff check
The ruff tool will check the source code for conformity with various style rules. Some of these can be fixed by ruff itself, and if so, the output will describe how to automatically fix these issues.
The CI/CD pipeline will run these checks whenever new commits are pushed to GitHub, and the results will be available in the GitHub Actions output.
$ poetry run ruff format
The ruff tool will check the source code for conformity with source code formatting rules. It will also fix any issues it finds and leave the changes uncommitted so you can review the changes prior to adding them to the codebase.
As with the linter, the CI/CD pipeline will run the formatter when commits are pushed to GitHub.
Rather than running ruff manually from the command line, it can be integrated with the editor of your choice. See the ruff editor integration guide.
- Update CHANGELOG.md according to its representation of the current version:
  - If the current "version" in CHANGELOG.md is UNRELEASED, add an entry describing your new changes to the existing change summary list.
  - If the current version in CHANGELOG.md is not a release candidate, add a new line at the top of CHANGELOG.md with a "version" consisting of the string literal UNRELEASED (no quotes surrounding the string). It will be replaced with the release candidate form of an actual version number after the major, minor, or patch version is bumped (see below). Add a list summarizing the changes (thus far) in this new version below the UNRELEASED version entry.
  - If the current version in CHANGELOG.md is a release candidate, add an entry describing your new changes to the existing change summary list for this release candidate version. The release candidate version will be automatically updated when the rc version is bumped (see below).
- Commit CHANGELOG.md so the working directory is clean.
- Show the current version and the possible next versions:
  $ bump-my-version show-bump
  1.4.0 ── bump ─┬─ major ─── 2.0.0rc0
                 ├─ minor ─── 1.5.0rc0
                 ├─ patch ─── 1.4.1rc0
                 ├─ release ─ invalid: The part has already the maximum value among ['rc', 'release'] and cannot be bumped.
                 ╰─ rc ────── 1.4.0release1
- If the currently released version of metgenc is not a release candidate and the goal is to start work on a new version, the first step is to create a pre-release version. As an example, if the current version is 1.4.0 and you'd like to release 1.5.0, first create a pre-release for testing:
  $ bump-my-version bump minor
  Now the project version will be 1.5.0rc0 -- Release Candidate 0. As testing for this release candidate proceeds, you can create more release candidates by:
  $ bump-my-version bump rc
  And the version will now be 1.5.0rc1. You can create as many release candidates as needed.
- When you are ready to do a final release, you can:
  $ bump-my-version bump release
  Which will update the version to 1.5.0. After doing any kind of release, you will see the latest commit and tag by looking at git log. You can then push these to GitHub (git push --follow-tags) to trigger the CI/CD workflow.
- On the GitHub repository, click 'Releases' and follow the steps documented on the GitHub Releases page. Draft a new Release using the version tag created above. By default, the 'Set as the latest release' checkbox will be selected. To publish a pre-release from a release candidate version, be sure to select the 'Set as a pre-release' checkbox. After you have published the (pre-)release in GitHub, the MetGenC Publish GHA workflow will be started. Check that the workflow succeeds on the MetGenC Actions page, and verify that the new MetGenC (pre-)release is available on PyPI.
This content was developed by the National Snow and Ice Data Center with funding from multiple sources.