Conversation

TomAugspurger
Contributor

No description provided.

@jbrockmendel
Member

does this subsume #28048?

@@ -775,6 +775,7 @@ Reshaping
- Bug where :meth:`DataFrame.equals` returned True incorrectly in some cases when two DataFrames had the same columns in different orders (:issue:`28839`)
- Bug in :meth:`DataFrame.replace` that caused non-numeric replacer's dtype not respected (:issue:`26632`)
- Bug in :func:`melt` where supplying mixed strings and numeric values for ``id_vars`` or ``value_vars`` would incorrectly raise a ``ValueError`` (:issue:`29718`)
- Dtypes are now preserved when transposing a ``DataFrame`` where each column is the same extension dtyep (:issue:``)
Member

dtyep -> dtype

@TomAugspurger
Contributor Author

does this subsume #28048?

Ah, yeah forgot about that one.

@jbrockmendel
Member

Easy to forget about, I closed it to clear the queue and then only reopened it a couple of days ago. I'm going to re-close it in favor of this one; maybe it's worth salvaging some of the tests it implemented.


# Slow, but unavoidable with 1D EAs.
new_values = []
for i in range(len(self)):
Contributor Author

I'm rethinking this approach. This results in `n_rows * n_columns` `__getitem__` calls. My intent was to avoid going through a 2D object-dtype ndarray, but we're essentially doing that with lists. So I think it'll be better to just do `.values.T` and then rebuild the EAs from the object-dtype array.

kwargs.pop("copy", None) # by definition, we're copying
dtype = self._data.blocks[0].dtype
arr_type = dtype.construct_array_type()

Contributor

I would move this logic to `pandas/core/reshape/reshape.py`; it has a lot of similarity to `_unstack_extension_series`.

@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Dec 6, 2019
if (
self._is_homogeneous_type
and len(self._data.blocks)
and is_extension_array_dtype(self._data.blocks[0].dtype)
Member

we can avoid `self._data` references by making this `len(self.dtypes)` and `is_extension_array_dtype(self.dtypes.iloc[0])`
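As a runnable illustration of this suggestion (the frame and the `homogeneous_ea` name are hypothetical; `is_extension_array_dtype` is public pandas API, and `dtypes.nunique()` stands in here for the internal `_is_homogeneous_type` check):

```python
import pandas as pd
from pandas.api.types import is_extension_array_dtype

# Illustrative frame with a single extension dtype across all columns.
df = pd.DataFrame({"a": pd.array([1, 2], dtype="Int64"),
                   "b": pd.array([3, 4], dtype="Int64")})

# The suggested check, phrased against the public dtypes accessor rather
# than the internal self._data block manager.
dtypes = df.dtypes
homogeneous_ea = (
    len(dtypes) > 0
    and dtypes.nunique() == 1  # stands in for _is_homogeneous_type
    and is_extension_array_dtype(dtypes.iloc[0])
)
```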

Member

ditto on line 731 with `self.dtypes`

@TomAugspurger
Contributor Author

Pushed a largeish refactor. We actually don't need `NDFrame.transpose` anymore; `Series` gets it from `IndexOpsMixin`. So I moved the logic to `DataFrame.transpose` and was able to remove all the axes-handling stuff.

maybe it's worth salvaging some of the tests it implemented

Added.

if self._is_homogeneous_type and is_extension_array_dtype(self.iloc[:, 0]):
dtype = self.dtypes.iloc[0]
arr_type = dtype.construct_array_type()
values = self.values
Member

IIRC the single-column case could be done without this casting. Think it's worth special-casing?

Contributor Author

I don't think it can be done without casting in general. We'll still need to reshape the (N, 1) to (1, N).

Member

I'm pretty sure I did this in the other PR, something like:

values = self._data.blocks[0].values
new_vals = [values[[n]] for n in range(len(values))]

(of course, if we had 2D EAs...)

Contributor Author

So, we can avoid casting to an ndarray by making N * P length-1 `__getitem__` calls, which creates N * P extension arrays that are then concatenated into N final EAs.

My prior expectation is that converting to an ndarray and doing `__getitem__` on that will be faster, with roughly the same memory usage.
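A hedged sketch of the per-element alternative under discussion (the names and the concat strategy are illustrative, not the PR's code): each of the N * P length-1 `__getitem__` calls yields a one-element extension array, and the pieces are concatenated into the N columns of the transpose without any intermediate object-dtype ndarray.

```python
import pandas as pd

df = pd.DataFrame({"a": pd.array([1, 2], dtype="Int64"),
                   "b": pd.array([3, 4], dtype="Int64")})
arr_type = df.dtypes.iloc[0].construct_array_type()

# Row i of the original frame becomes column i of the transpose:
# P length-1 slices of the column EAs, concatenated into one EA.
new_cols = [
    arr_type._concat_same_type([df[col].array[[i]] for col in df.columns])
    for i in range(len(df))
]
result = pd.DataFrame(dict(enumerate(new_cols)), index=df.columns)
```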

Member

What is P here? In the end this probably isn't worth bikeshedding (except to add to the pile of "reasons why EAs should support 2D").

Contributor Author

Number of columns.

Member

Right, I was specifically referring to the single-column case.

Member

I'm OK with not doing this optimization here; I just want to make sure we're on the same page about what the available optimization is.

Contributor

@jreback jreback left a comment

looks good

*args, **kwargs
Additional arguments and keywords have no effect but might be
accepted for compatibility with numpy.
*args : tuple, optional
Contributor

don't we still need `**kwargs` for this?

Contributor Author

AFAICT, no. Neither `np.transpose` nor `ndarray.transpose` takes additional keyword arguments.
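A quick check of why `*args` is still worth keeping even though `**kwargs` is not (a hedged sketch; the dispatch behavior is assumed from numpy): `np.transpose(obj)` forwards its `axes` argument positionally to `obj.transpose`, so the method must tolerate extra positional arguments, while no keywords are ever passed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# np.transpose dispatches to df.transpose(axes), passing axes
# positionally; no keyword arguments are involved.
result = np.transpose(df)
```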

@@ -644,50 +644,6 @@ def _set_axis(self, axis, labels):
self._data.set_axis(axis, labels)
self._clear_item_cache()

def transpose(self, *args, **kwargs):
Contributor

does Series have transpose for compat?

Contributor Author

Yes, via IndexOpsMixin.
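For reference, the compat shim is trivially observable (assuming current pandas behavior): for a one-dimensional Series the transpose is, by definition, the object itself.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# IndexOpsMixin.transpose simply returns self for 1D objects, so .T is
# a no-op kept only for numpy compatibility.
t = s.T
```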

@@ -2587,7 +2592,28 @@ def transpose(self, *args, **kwargs):
dtype: object
"""
nv.validate_transpose(args, dict())
return super().transpose(1, 0, **kwargs)
# construct the args
Contributor

don't you think this is better located in `pandas/core/reshape`? (and called as a helper function here)

Contributor Author

Maybe a slight preference for keeping it here just for readability. The reshape part is essentially just a list comprehension:

values = self.values
new_values = [arr_type._from_sequence(row, dtype=dtype) for row in values]

which I don't think warrants its own function. I don't see any places in `/core/reshape.py` that could use this; I believe those are reshaping to / from 1-D things, while this is 2D -> 2D.

But happy to move it if you want / if you see other places that could use parts of this.
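Fleshing the comprehension above out into a self-contained sketch (the wrapper name is hypothetical; `_from_sequence` is the private EA constructor quoted in the comment, and the frame assumes every column shares one extension dtype): each row of the 2D object-dtype `.values` array becomes one column of the transposed frame.

```python
import pandas as pd

def transpose_homogeneous_ea(df: pd.DataFrame) -> pd.DataFrame:
    # Assumes every column has the same extension dtype.
    dtype = df.dtypes.iloc[0]
    arr_type = dtype.construct_array_type()
    values = df.values  # 2D object-dtype ndarray
    # Row i of the original frame becomes column i of the transpose.
    new_values = [arr_type._from_sequence(row, dtype=dtype) for row in values]
    return pd.DataFrame(dict(zip(df.index, new_values)), index=df.columns)

df = pd.DataFrame({"a": pd.array([1, 2], dtype="Int64"),
                   "b": pd.array([3, 4], dtype="Int64")})
result = transpose_homogeneous_ea(df)
```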

@TomAugspurger
Contributor Author

I think this is ready.

else:
new_values = self.values.T
if copy:
new_values = new_values.copy()
Member

if non-homogeneous, then new_values above is already a copy, can avoid re-copying here

Contributor Author

Are we 100% sure about that? Or are there distinct dtypes that `.values` can combine without a copy?

Member

If non-homogeneous then we have multiple blocks, so multiple ndarrays that are going through `np.c_` or something like it, right? AFAIK that has to allocate new data for the output. Are there corner cases we're missing @shoyer?
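This can be sanity-checked directly (a small sketch; the frame is illustrative): for a mixed-dtype frame, `.values` interleaves multiple blocks into a freshly allocated, upcast ndarray, so mutating the result never writes back into the frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})

# Two blocks (int64 and float64) are consolidated into one new float64
# ndarray; the result cannot alias the frame's data.
vals = df.values
vals[0, 0] = 99.0
```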

@@ -755,18 +755,18 @@ def test_pi_sub_isub_offset(self):
rng -= pd.offsets.MonthEnd(5)
tm.assert_index_equal(rng, expected)

def test_pi_add_offset_n_gt1(self, box_transpose_fail):
@pytest.mark.parametrize("transpose", [True, False])
def test_pi_add_offset_n_gt1(self, box_with_array, transpose):
Member

Not a blocker, but the transpose param is sub-optimal. For the DataFrame case it will correctly test both cases, but for EA/Index/Series it will mean duplicate tests. I'll try to come up with something nicer in an upcoming arithmetic-test-specific pass.

df = pd.DataFrame({"a": ser, "b": ser})
result = df.T
assert (result.dtypes == ser.dtype).all()

Member

round_trip = df.T.T
tm.assert_frame_equal(df, round_trip)

?

Contributor Author

Interestingly, the `pd.date_range("2016-04-05 04:30", periods=3).astype("category")` case fails that test. All the values are NaT.

I've xfailed it for now, and likely won't have time to look into it.

Member

sounds good, thanks

@jbrockmendel
Member

A handful of comments, generally looks good

@jreback jreback added this to the 1.0 milestone Dec 26, 2019
@jreback
Contributor

jreback commented Dec 26, 2019

@TomAugspurger merge master when you have a chance (or @jbrockmendel if this is a blocker)

@jbrockmendel
Member

Rebased. Not a blocker for the blockwise PRs.

@jreback jreback merged commit cb5f9d1 into pandas-dev:master Dec 27, 2019
@jreback
Contributor

jreback commented Dec 27, 2019

thanks @TomAugspurger

@jbrockmendel I don't think we had a dedicated issue for this to be closed...

@jbrockmendel
Member

#22120; since it's about cyberpandas, I think we should ask Tom to double-check that this fixes it.

AlexKirko pushed a commit to AlexKirko/pandas that referenced this pull request Dec 29, 2019
keechongtan added a commit to keechongtan/pandas that referenced this pull request Dec 29, 2019
…ndexing-1row-df

* upstream/master: (333 commits)
  CI: troubleshoot Web_and_Docs failing (pandas-dev#30534)
  WARN: Ignore NumbaPerformanceWarning in test suite (pandas-dev#30525)
  DEPR: camelCase in offsets, get_offset (pandas-dev#30340)
  PERF: implement scalar ops blockwise (pandas-dev#29853)
  DEPR: Remove Series.compress (pandas-dev#30514)
  ENH: Add numba engine for rolling apply (pandas-dev#30151)
  [ENH] Add to_markdown method (pandas-dev#30350)
  DEPR: Deprecate pandas.np module (pandas-dev#30386)
  ENH: Add ignore_index for df.drop_duplicates (pandas-dev#30405)
  BUG: The setting xrot=0 in DataFrame.hist() doesn't work with by and subplots pandas-dev#30288 (pandas-dev#30491)
  CI: Fix GBQ Tests (pandas-dev#30478)
  Bug groupby quantile listlike q and int columns (pandas-dev#30485)
  ENH: Add ignore_index for df.sort_values and series.sort_values (pandas-dev#30402)
  TYP: Typing hints in pandas/io/formats/{css,csvs}.py (pandas-dev#30398)
  BUG: raise on non-hashable Index name, closes pandas-dev#29069 (pandas-dev#30335)
  Replace "foo!r" to "repr(foo)" syntax pandas-dev#29886 (pandas-dev#30502)
  BUG: preserve EA dtype in transpose (pandas-dev#30091)
  BLD: add check to prevent tempita name error, clsoes pandas-dev#28836 (pandas-dev#30498)
  REF/TST: method-specific files for test_append (pandas-dev#30503)
  marked unused parameters (pandas-dev#30504)
  ...
@TomAugspurger TomAugspurger deleted the ea-transpose branch February 28, 2020 23:53