Projects from Substrait do not include input fields as output fields #12204

@EpsilonPrime

Describe the bug

According to the Substrait specification, project relations emit all of the input fields followed by the list of new expressions. DataFusion only emits the new expressions.
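To make the expected rule concrete, here is a minimal Python sketch of how a project relation's output fields are formed. The function name `project_output_fields` and its parameters are hypothetical and only illustrate the rule described above; they are not part of Substrait or DataFusion. The optional `output_mapping` argument is meant to mirror an emit remapping, which the plan below does not use (it uses `"direct": {}`).

```python
def project_output_fields(input_fields, expression_names, output_mapping=None):
    """Return the field names a project relation is expected to emit.

    Hypothetical helper for illustration only.
    """
    # Direct emit: every input field first, then each new expression, in order.
    fields = list(input_fields) + list(expression_names)
    # An optional remapping selects and reorders from that combined list.
    if output_mapping is not None:
        fields = [fields[i] for i in output_mapping]
    return fields


input_fields = ["user_id", "name", "paid_for_service"]
print(project_output_fields(input_fields, ["row_number"]))
```

With direct emit this yields the three input fields followed by `row_number`. Getting only `row_number` would require an explicit remapping such as `output_mapping=[3]`.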

To Reproduce

Pass a Substrait plan such as the following to DataFusion. (A literal can be used instead of a window function, but this is what I had handy.)

{
  "extensionUris": [
    {
      "extensionUriAnchor": 1,
      "uri": "/functions_arithmetic.yaml"
    }
  ],
  "extensions": [
    {
      "extensionFunction": {
        "extensionUriReference": 1,
        "functionAnchor": 1,
        "name": "row_number"
      }
    }
  ],
  "relations": [
    {
      "root": {
        "input": {
          "project": {
            "common": {
              "direct": {}
            },
            "input": {
              "read": {
                "common": {
                  "direct": {}
                },
                "baseSchema": {
                  "names": [
                    "user_id",
                    "name",
                    "paid_for_service"
                  ],
                  "struct": {
                    "types": [
                      {
                        "string": {
                          "nullability": "NULLABILITY_REQUIRED"
                        }
                      },
                      {
                        "string": {
                          "nullability": "NULLABILITY_REQUIRED"
                        }
                      },
                      {
                        "bool": {
                          "nullability": "NULLABILITY_REQUIRED"
                        }
                      }
                    ],
                    "nullability": "NULLABILITY_REQUIRED"
                  }
                },
                "namedTable": {
                  "names": [
                    "users"
                  ]
                }
              }
            },
            "expressions": [
              {
                "windowFunction": {
                  "functionReference": 1,
                  "sorts": [
                    {
                      "expr": {
                        "selection": {
                          "directReference": {
                            "structField": {
                              "field": 1
                            }
                          },
                          "rootReference": {}
                        }
                      },
                      "direction": "SORT_DIRECTION_ASC_NULLS_FIRST"
                    }
                  ],
                  "upperBound": {
                    "unbounded": {}
                  },
                  "lowerBound": {
                    "unbounded": {}
                  },
                  "outputType": {
                    "i64": {
                      "nullability": "NULLABILITY_REQUIRED"
                    }
                  },
                  "invocation": 3
                }
              }
            ]
          }
        },
        "names": [
          "user_id",
          "name",
          "paid_for_service",
          "row_number"
        ]
      }
    }
  ],
  "version": {
    "minorNumber": 52,
    "producer": "spark-substrait-gateway"
  }
}

Expected behavior

The plan above should produce 4 columns, matching the 4 names provided: the three input fields (user_id, name, paid_for_service) followed by row_number. Currently, DataFusion returns just one column (row_number) for the project.
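The column count can be checked directly against the plan's JSON. The sketch below uses a trimmed stand-in for the plan in this issue and a hypothetical helper, `expected_project_width`. It assumes a direct-emit project over a read with a flat (non-nested) base schema, so it is not a general Substrait validator.

```python
def expected_project_width(project_rel):
    """Count the columns a direct-emit project over a flat read should produce.

    Hypothetical helper; assumes the project's input is a read relation.
    """
    base_names = project_rel["input"]["read"]["baseSchema"]["names"]
    expressions = project_rel.get("expressions", [])
    # Input fields pass through, and each expression adds one column.
    return len(base_names) + len(expressions)


# Trimmed stand-in for the plan in this issue (types and sorts omitted).
plan_root = {
    "input": {
        "project": {
            "common": {"direct": {}},
            "input": {
                "read": {
                    "baseSchema": {
                        "names": ["user_id", "name", "paid_for_service"]
                    }
                }
            },
            "expressions": [{"windowFunction": {"functionReference": 1}}],
        }
    },
    "names": ["user_id", "name", "paid_for_service", "row_number"],
}

width = expected_project_width(plan_root["input"]["project"])
print(width, len(plan_root["names"]))
```

Here the computed width equals the number of root names. A consumer that emits only the expressions would produce a width of 1 and fail to line up with those names.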

Additional context

No response

Labels

bug (Something isn't working)
