Skip to content
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions datafusion/core/src/dataframe/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -49,7 +49,7 @@ use datafusion_common::{
plan_err, Column, DFSchema, DataFusionError, ParamValues, SchemaError, UnnestOptions,
};
use datafusion_expr::{
avg, count, is_null, max, median, min, stddev, utils::COUNT_STAR_EXPANSION,
avg, count, count_null, max, median, min, stddev, utils::COUNT_STAR_EXPANSION,
TableProviderFilterPushDown, UNNAMED_TABLE,
};

Expand Down Expand Up @@ -534,7 +534,7 @@ impl DataFrame {
vec![],
original_schema_fields
.clone()
.map(|f| count(is_null(col(f.name()))).alias(f.name()))
.map(|f| count_null(col(f.name())).alias(f.name()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a count_null() new function, what do you think about using the existing count function?

Something like count(expr.is_null())

Suggested change
.map(|f| count_null(col(f.name())).alias(f.name()))
.map(|f| count(col(f.name()).is_null()).alias(f.name())))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using count, it sums all non-null Expr while is_null converts Expr to a boolean expression, which is why it doesn't work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Make sense to me. Another classic trick is to use a CASE expression, something like

SELECT  SUM(CASE WHEN x IS NULL THEN 1 ELSE 0 END) as null_count ...

.collect::<Vec<_>>(),
),
// mean aggregation
Expand Down
4 changes: 2 additions & 2 deletions datafusion/core/tests/dataframe/describe.rs
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@ async fn describe() -> Result<()> {
"| describe | id | bool_col | tinyint_col | smallint_col | int_col | bigint_col | float_col | double_col | date_string_col | string_col | timestamp_col | year | month |",
"+------------+-------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------+-------------------------+--------------------+-------------------+",
"| count | 7300.0 | 7300 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300 | 7300 | 7300 | 7300.0 | 7300.0 |",
"| null_count | 7300.0 | 7300 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300 | 7300 | 7300 | 7300.0 | 7300.0 |",
"| null_count | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 |",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that certainly looks better

"| mean | 3649.5 | null | 4.5 | 4.5 | 4.5 | 45.0 | 4.949999964237213 | 45.45 | null | null | null | 2009.5 | 6.526027397260274 |",
"| std | 2107.472815166704 | null | 2.8724780750809518 | 2.8724780750809518 | 2.8724780750809518 | 28.724780750809533 | 3.1597258182544645 | 29.012028558317645 | null | null | null | 0.5000342500942125 | 3.44808750051728 |",
"| min | 0.0 | null | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 01/01/09 | 0 | 2008-12-31T23:00:00 | 2009.0 | 1.0 |",
Expand Down Expand Up @@ -69,7 +69,7 @@ async fn describe_boolean_binary() -> Result<()> {
"| describe | a | b |",
"+------------+------+------+",
"| count | 1 | 1 |",
"| null_count | 1 | 1 |",
"| null_count | 0 | 0 |",
"| mean | null | null |",
"| std | null | null |",
"| min | a | null |",
Expand Down
17 changes: 17 additions & 0 deletions datafusion/expr/src/expr_fn.rs
Original file line number Diff line number Diff line change
Expand Up @@ -215,6 +215,18 @@ pub fn count(expr: Expr) -> Expr {
))
}

/// Create an expression to represent the COUNT(IS NULL x) aggregate function
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this confusing as first because it isn't COUNT(x IS NULL) -- I think a more accurate description is COUNT(x) FILTER (WHERE x IS NULL) or something like that

Suggested change
/// Create an expression to represent the COUNT(IS NULL x) aggregate function
/// Create an aggregate that returns the number `NULL` values in a column

pub fn count_null(expr: Expr) -> Expr {
Expr::AggregateFunction(AggregateFunction::new(
aggregate_function::AggregateFunction::Count,
vec![expr.clone()],
false,
Some(Box::new(expr.is_null())),
None,
None,
))
}

/// Return a new expression with bitwise AND
pub fn bitwise_and(left: Expr, right: Expr) -> Expr {
Expr::BinaryExpr(BinaryExpr::new(
Expand Down Expand Up @@ -448,6 +460,11 @@ pub fn is_null(expr: Expr) -> Expr {
Expr::IsNull(Box::new(expr))
}

/// Create is not null expression
pub fn is_not_null(expr: Expr) -> Expr {
Expr::IsNotNull(Box::new(expr))
}

/// Create is true expression
pub fn is_true(expr: Expr) -> Expr {
Expr::IsTrue(Box::new(expr))
Expand Down