
Non-propagation, skipmissing-related improvements to missing handling #35050

@pdeffebach

There is broad consensus that missing handling could be improved. Many discussions focus on making propagation of missings easier, and those discussions are worth having, but I also want to focus on how skipmissing handling could be improved. Here are my suggestions. There is a lot of overlap with #30596, but this discussion should focus on building ideas and a roadmap rather than on a specific implementation.

  1. Make skipmissing work for multiple iterators, returning a Tuple of iterators. I am working on a PR in Missings.jl to make this work. This will make it easier to work with vectors whose missing values fall in different positions, which is especially useful for plotting.
  2. Make many more functions in Base and the standard libraries, such as cor, accept any iterator rather than only vectors, so we don't need to collect skipmissing results.
  3. Overload Zip so that we can zip together two vectors with missing elements and iterate over the non-missing pairs. Unlike the multi-iterator skipmissing in (1), this returns an iterator of tuples. Both are useful.
  4. Change broadcasting so that we can go from a skipmissing back to a vector with missings in the same locations. It would be nice for skipmissing to have some kind of persistence so that you don't lose the locations of the missings when you collect. This would let you, say, de-mean the elements of a vector (or apply a more complicated function) with respect to the non-missing entries.
  5. Use dispatch for DataFrames (or Tables, NamedTuples, etc.) to simulate Stata's if syntax, where the new data frame is a view into the non-missing elements and uses (4) above to fill in missings where needed. R doesn't have this feature, so I don't think it's obvious to everyone that this is an option. Stata is really great at this. You can write
egen x = y - mean(z) if !missing(v)

And it will apply a filter on everything at the start of the function.
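
To make (1) and (3) more concrete, here is a minimal sketch in plain Julia of what skipping missings across two vectors could look like. The helper name skipmissing_pairs is hypothetical (it is not an existing Base or Missings.jl function), and the sample data is made up for illustration. The last step materializes plain vectors only because cor currently expects vectors, which is the limitation described in (2).

```julia
using Statistics

x = [1.0, missing, 3.0, 4.0, 5.0]
y = [2.0, 5.0, missing, 8.0, 11.0]

# Hypothetical helper: lazily iterate over the pairs in which
# neither element is missing (the "iterator of tuples" from item 3).
skipmissing_pairs(a, b) =
    ((ai, bi) for (ai, bi) in zip(a, b) if !ismissing(ai) && !ismissing(bi))

# Materialize the complete pairs; positions 2 and 3 are dropped
# because one side is missing there.
complete = collect(skipmissing_pairs(x, y))

# The "tuple of vectors" shape from item 1, built from the same pairs.
xs, ys = first.(complete), last.(complete)

# cor needs vectors today, hence the collect above.
r = cor(xs, ys)
```

Both shapes come from the same filter, so a real implementation could presumably share most of its code between them.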

These are concrete changes that can be made without relying on propagation of missings. They would lead to a workflow where you can take a vector, filter out missings in whatever way you like, do things with the vector (the hard part), and then keep the missings in the correct locations.
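
As a rough illustration of that workflow, the sketch below de-means a vector with respect to its non-missing entries and then puts the missings back where they were. The function apply_nonmissing is a hypothetical name used only for this example, not a proposed API, and it assumes that f returns a vector of the same length as its input.

```julia
using Statistics

# Hypothetical helper: apply `f` to the non-missing values of `v`,
# then scatter the results back so missings keep their positions.
# Assumes `f` returns a vector with one result per non-missing input.
function apply_nonmissing(f, v::AbstractVector)
    keep = .!ismissing.(v)            # mask of non-missing positions
    vals = collect(skipmissing(v))    # element type without Missing
    res = f(vals)
    out = Vector{Union{Missing, eltype(res)}}(missing, length(v))
    out[keep] = res                   # fill only the non-missing slots
    return out
end

z = [1.0, missing, 3.0, 5.0]

# De-mean with respect to the non-missing entries (their mean is 3.0).
demeaned = apply_nonmissing(w -> w .- mean(w), z)
# demeaned is [-2.0, missing, 0.0, 2.0]
```

The mask is computed once up front, which is the "persistence" that (4) asks skipmissing itself to carry.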

Labels: missing data (Base.missing and related functionality)
