Closed
Labels
bug: Something isn't working
Description
It should be possible to define a column in env.data.schema that is not used as a raw_column (for both CSV and Parquet). Currently, ingestion fails with a KeyError (see the stack trace below).
Application Configuration
Start from the Iris example, then completely remove one of the features (e.g. sepal_width).
To Reproduce
- cx deploy
Stack Trace
Starting
INFO:cortex:Starting
INFO:cortex:Ingesting
INFO:cortex:Ingesting iris-1 data from s3a://cortex-examples/iris.csv
ERROR:cortex:An error occurred, see `cx logs raw_column sepal_width` for more details.
Traceback (most recent call last):
  File "/src/spark_job/spark_job.py", line 307, in run_job
    raw_df = ingest_raw_dataset(spark, ctx, cols_to_validate, should_ingest)
  File "/src/spark_job/spark_job.py", line 151, in ingest_raw_dataset
    ingest_df = spark_util.ingest(ctx, spark)
  File "/src/spark_job/spark_util.py", line 223, in ingest
    expected_schema = expected_schema_from_context(ctx)
  File "/src/spark_job/spark_util.py", line 117, in expected_schema_from_context
    for fname in expected_field_names
  File "/src/spark_job/spark_util.py", line 117, in <listcomp>
    for fname in expected_field_names
KeyError: 'petal_width'
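The traceback suggests that a list comprehension looks up every expected field name in a mapping and raises when a name is absent. The actual code in spark_util.py is not shown in this issue, so the following is only a minimal sketch of one possible fix, using plain dicts instead of the real context object. The function name `expected_fields`, the variable names, and the column type strings are hypothetical and chosen for illustration.

```python
from typing import Dict, List


def expected_fields(schema_field_names: List[str],
                    raw_columns: Dict[str, str]) -> Dict[str, str]:
    """Map each schema field that is declared as a raw column to its type.

    Fields that appear in the schema but have no raw_column are skipped
    instead of raising KeyError.
    """
    return {
        fname: raw_columns[fname]
        for fname in schema_field_names
        if fname in raw_columns  # tolerate schema fields without a raw_column
    }


# Hypothetical Iris-like setup where sepal_width was removed from the raw columns.
schema_field_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
raw_columns = {
    "sepal_length": "FLOAT_COLUMN",   # type names are placeholders
    "petal_length": "FLOAT_COLUMN",
    "petal_width": "FLOAT_COLUMN",
    "class": "STRING_COLUMN",
}

fields = expected_fields(schema_field_names, raw_columns)
```

With an unguarded lookup such as `raw_columns[fname]` for every schema field, the same input would raise `KeyError: 'sepal_width'`. Whether skipping is the right behavior, as opposed to ingesting the column without validating it, is a design decision for the maintainers.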
Version
master
Additional Context
It should also be possible to ingest only a subset of the columns in the dataset (Parquet only, for now). We should test this as well.
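One way to reason about the subset case is to compute which dataset columns to read from the declared raw columns, failing only when a declared column is missing from the dataset. This is a hedged sketch in plain Python, not the project's implementation; `columns_to_read` and its arguments are hypothetical names.

```python
from typing import List


def columns_to_read(dataset_columns: List[str],
                    raw_column_names: List[str]) -> List[str]:
    """Return the dataset columns to ingest, preserving dataset order.

    Extra dataset columns are ignored; a declared raw column that is
    missing from the dataset is reported as an error.
    """
    available = set(dataset_columns)
    missing = [c for c in raw_column_names if c not in available]
    if missing:
        raise ValueError(f"raw columns missing from dataset: {missing}")
    wanted = set(raw_column_names)
    return [c for c in dataset_columns if c in wanted]


# Hypothetical example: the dataset has five columns, only three are declared.
dataset_columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
selected = columns_to_read(dataset_columns, ["class", "sepal_length", "petal_width"])
```

A test for the Parquet path could assert that the ingested DataFrame contains exactly the selected columns and that the undeclared ones are never validated.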