Not using an ingested column as a raw_column results in error #69

@deliahu

Description

It should be possible to have a column defined in env.data.schema that isn't used as a raw_column (CSV and Parquet).

Application Configuration

Start from the Iris example, then completely remove one of the features (e.g. sepal_width).

To Reproduce

  1. cx deploy

Stack Trace

Starting

INFO:cortex:Starting
INFO:cortex:Ingesting
INFO:cortex:Ingesting iris-1 data from s3a://cortex-examples/iris.csv
ERROR:cortex:An error occurred, see `cx logs raw_column sepal_width` for more details.
Traceback (most recent call last):
  File "/src/spark_job/spark_job.py", line 307, in run_job
    raw_df = ingest_raw_dataset(spark, ctx, cols_to_validate, should_ingest)
  File "/src/spark_job/spark_job.py", line 151, in ingest_raw_dataset
    ingest_df = spark_util.ingest(ctx, spark)
  File "/src/spark_job/spark_util.py", line 223, in ingest
    expected_schema = expected_schema_from_context(ctx)
  File "/src/spark_job/spark_util.py", line 117, in expected_schema_from_context
    for fname in expected_field_names
  File "/src/spark_job/spark_util.py", line 117, in <listcomp>
    for fname in expected_field_names
KeyError: 'petal_width'
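
The `KeyError` at the bottom of the trace is consistent with a list comprehension that looks up every expected field name in a mapping that only contains the configured raw_columns. Below is a minimal plain-Python sketch of that failure mode and one possible tolerant alternative. All names here (`schema_fields`, `raw_columns`, both functions) and the type strings are hypothetical placeholders; none of them are taken from `spark_util.py`, whose actual code is not shown in this issue.

```python
# Hypothetical stand-ins: none of these names come from spark_util.py.
# Columns listed in the ingestion schema (what the file contains).
schema_fields = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

# raw_columns still defined in the app, keyed by column name.
# petal_width has been removed to mirror the KeyError in the trace.
# The type strings are placeholders for illustration only.
raw_columns = {
    "sepal_length": "FLOAT_COLUMN",
    "sepal_width": "FLOAT_COLUMN",
    "petal_length": "FLOAT_COLUMN",
    "class": "STRING_COLUMN",
}


def strict_expected_types(field_names, columns):
    """Look up every schema field directly; raises KeyError on the first
    field that has no matching raw_column."""
    return [(name, columns[name]) for name in field_names]


def tolerant_expected_types(field_names, columns, default_type=None):
    """Same lookup, but fields without a raw_column get default_type
    (None here, meaning 'no expected type for this field')."""
    return [(name, columns.get(name, default_type)) for name in field_names]


if __name__ == "__main__":
    try:
        strict_expected_types(schema_fields, raw_columns)
    except KeyError as err:
        print("strict lookup failed on:", err)

    print(tolerant_expected_types(schema_fields, raw_columns))
```

Whether an unused schema column should be skipped, read with an inferred type, or dropped after ingestion is a design decision this sketch does not settle; it only shows that the lookup itself does not have to fail.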

Version

master

Additional Context

It should also be possible to not ingest all of the columns in the dataset (just Parquet for now); we should test this.
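
One way to pin down the Parquet behavior in a test is to state it as a small pure function: given the columns present in the dataset and the columns listed in the schema, return only the schema columns, and fail only when the schema names a column the dataset does not have. The sketch below is an assumption about the desired behavior, and `select_ingested_columns` is a hypothetical helper, not an existing Cortex function.

```python
def select_ingested_columns(dataset_columns, schema_columns):
    """Return the schema columns to read, in schema order.

    Extra dataset columns are ignored. A schema column that is missing
    from the dataset is treated as an error.
    """
    available = set(dataset_columns)
    missing = [name for name in schema_columns if name not in available]
    if missing:
        raise ValueError("schema columns missing from dataset: " + ", ".join(missing))
    return list(schema_columns)


if __name__ == "__main__":
    dataset = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
    # Ingest only a subset of the dataset's columns.
    print(select_ingested_columns(dataset, ["sepal_length", "class"]))
```

A test for the issue could then assert that ingesting a subset succeeds and that only a schema column absent from the dataset produces an error.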

Labels

bug (Something isn't working)
