Description
There is broad consensus that `missing` handling could be improved. Many discussions focus on making propagation of `missing`s easier, and those discussions are worth having, but I also want to focus on how `skipmissing` handling could be improved. Here are my suggestions. There is a lot of overlap with #30596 here, but this discussion should focus on building ideas and a roadmap rather than on a specific implementation.
1. Make `skipmissing` work for multiple iterators, returning a `Tuple` of iterators. I am working on a PR in Missings.jl to make this work. This will make it easier to work with vectors whose missing values are in mismatched positions. This is especially useful for plotting.
2. Make many more functions in Base, like `cor`, accept any iterator and not only vectors, so we don't need to `collect` `skipmissing`s.
3. Overload Zip so that we can zip together two vectors with missing elements and iterate over non-missing pairs. Unlike `skipmissing` in (1) above, this returns an iterator of tuples. Both are useful.
4. Change broadcasting so that we can go from a `skipmissing` back to a vector with `missing`s in the same locations. It would be nice for `skipmissing` to have some kind of persistence so that you don't lose the location of `missing`s when you `collect`. This allows you to, say, de-mean (or apply a more complicated function to) the elements of a vector with respect to the non-missing entries.
5. Use dispatch for DataFrames (or Tables or NamedTuples etc.) to simulate Stata's `if` syntax, where the new dataframe is a view into the non-missing elements and uses (4) above to fill in `missing`s where needed. R doesn't have this feature, so I don't think it's obvious to everyone that this is an option. Stata is really great with this; you can do
egen x = y - mean(z) if !missing(v)
and it will apply the filter to everything at the start of the function.
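To make suggestions (1) through (3) concrete, here is a rough sketch of what iterating over jointly non-missing pairs could look like using only what exists today. `skipmissing_pairs` is a hypothetical name made up for illustration; it is not a function in Base or Missings.jl, and the PR mentioned in (1) may look quite different.

```julia
using Statistics  # standard library; provides cor

x = [1.0, missing, 3.0, 4.0, 5.0]
y = [2.0, 5.0, missing, 11.0, 14.0]

# Hypothetical helper (not part of Base or Missings.jl): lazily iterate over
# the pairs (x[i], y[i]) in which neither element is missing.
skipmissing_pairs(xs, ys) =
    Iterators.filter(p -> !any(ismissing, p), zip(xs, ys))

# Positions 2 and 3 are dropped because one side is missing there.
kept = collect(skipmissing_pairs(x, y))

# cor currently wants vectors, so the pairs have to be materialized first;
# suggestion (2) is about removing the need for this step.
xs = first.(kept)
ys = last.(kept)
r = cor(xs, ys)
```

Note that the filtering has to happen jointly: calling `skipmissing` on `x` and `y` separately would misalign the two vectors.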
These are concrete changes that can be made without relying on propagation of `missing`s. They would lead to a workflow where one is able to take a vector, filter it to remove `missing`s in whatever way you like, do things with the vector (the hard part), and then put the `missing`s back in the correct locations.
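As a minimal sketch of that workflow, the de-meaning example from (4) can be written by hand today. `apply_nonmissing` is a hypothetical helper invented for this illustration, not an existing or proposed API, and it assumes the function passed in returns a vector of the same length as its input.

```julia
using Statistics  # standard library; provides mean

# Hypothetical helper (not an existing Base or Missings.jl function): apply `f`
# to the non-missing entries of `v`, then write the results back into their
# original positions, leaving `missing` wherever the input had it.
function apply_nonmissing(f, v::AbstractVector)
    keep = .!ismissing.(v)              # Bool mask of non-missing slots
    vals = collect(skipmissing(v))      # dense vector with missings dropped
    res = f(vals)                       # assumed to have the same length as vals
    out = Vector{Union{Missing, eltype(res)}}(missing, length(v))
    out[keep] = res                     # scatter results back by position
    return out
end

z = [1.0, missing, 3.0, 5.0]
demeaned = apply_nonmissing(x -> x .- mean(x), z)
```

Here the mean is taken over the three non-missing entries only, and the second slot of `demeaned` is still `missing`.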