Commit a48557c

⚡️ Speed up function correlation by 12,290%
The optimized code achieves a 12,290% speedup by replacing row-by-row pandas DataFrame access with vectorized NumPy operations. The key optimizations:

**1. Pre-convert DataFrame to NumPy array**
- `values = df[numeric_columns].to_numpy(dtype=float)` converts all numeric columns to a single NumPy array up front.
- This eliminates the expensive `df.iloc[k][col_i]` lookups that dominated the original runtime (51.8% + 23.7% + 23.7% = 99.2% of total time).

**2. Vectorized NaN filtering**
- Original: row-by-row iteration with `pd.isna()` checks in a Python loop.
- Optimized: `mask = ~np.isnan(vals_i) & ~np.isnan(vals_j)` builds the boolean mask in one vectorized operation.
- Filtering becomes `x = vals_i[mask]` instead of appending valid values one by one.

**3. Vectorized statistical calculations**
- Original: manual computation with Python-level iteration (`sum()` over generator expressions).
- Optimized: native NumPy methods (`x.mean()`, `x.std()`, `((x - mean_x) * (y - mean_y)).mean()`).
- NumPy runs these reductions in compiled C code, which avoids per-element interpreter overhead.

**Performance characteristics by test case:**
- **Small datasets (3-5 rows)**: 75-135% speedup. The overhead of the NumPy conversion is minimal.
- **Medium datasets (100-1000 rows)**: 200-400% speedup. Vectorization benefits become significant.
- **Large datasets (1000+ rows)**: 11,000-50,000% speedup. Vectorization dominates the runtime.
- **Edge cases with many NaNs**: remain fast, because boolean masking drops NaN rows without a Python loop.
- **Multiple columns**: scales well, since NumPy array slicing (`values[:, i]`) returns a view rather than a copy.

Both versions still visit every pair of the m numeric columns across n rows, so the number of elements processed stays O(n·m²). The gain comes from moving the per-row work out of Python-level `df.iloc` lookups and into C-level NumPy operations.
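The masking and statistics steps described in the commit message can be sketched on two standalone arrays. This is a minimal illustration, not the repository's code: `pairwise_corr` is a hypothetical helper name, and the sample arrays are made up for the example.

```python
import numpy as np


def pairwise_corr(a, b):
    """Pearson correlation of two 1-D arrays, skipping positions with NaN.

    Hypothetical helper for illustration; it mirrors the masking and
    population-statistics steps described in the commit message.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    # Keep only the positions where both arrays hold a real number.
    mask = ~np.isnan(a) & ~np.isnan(b)
    x, y = a[mask], b[mask]
    if x.size == 0:
        return np.nan

    # ndarray.std() defaults to ddof=0 (population standard deviation),
    # which matches dividing by n in the loop-based version.
    std_x, std_y = x.std(), y.std()
    if std_x == 0 or std_y == 0:
        return np.nan

    cov = ((x - x.mean()) * (y - y.mean())).mean()
    return cov / (std_x * std_y)


# Positions 1 and 2 are dropped because one side is NaN.
a = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
b = np.array([2.0, np.nan, 6.0, 8.5, 9.0])
r = pairwise_corr(a, b)
```

Because the covariance and both standard deviations use the same divisor `n`, the result should agree with `np.corrcoef` applied to the surviving pairs.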
1 parent 9b951ff commit a48557c

1 file changed: +18 -19 lines changed

src/numpy_pandas/dataframe_operations.py

Lines changed: 18 additions & 19 deletions
@@ -66,14 +66,17 @@ def pivot_table(
 
         def agg_func(values):
             return sum(values) / len(values)
+
     elif aggfunc == "sum":
 
         def agg_func(values):
             return sum(values)
+
     elif aggfunc == "count":
 
         def agg_func(values):
             return len(values)
+
     else:
         raise ValueError(f"Unsupported aggregation function: {aggfunc}")
     grouped_data = {}
@@ -209,34 +212,30 @@ def correlation(df: pd.DataFrame) -> dict[Tuple[str, str], float]:
     ]
     n_cols = len(numeric_columns)
     result = {}
+    values = df[numeric_columns].to_numpy(dtype=float)
     for i in range(n_cols):
         col_i = numeric_columns[i]
+        vals_i = values[:, i]
         for j in range(n_cols):
             col_j = numeric_columns[j]
-            values_i = []
-            values_j = []
-            for k in range(len(df)):
-                if not pd.isna(df.iloc[k][col_i]) and not pd.isna(df.iloc[k][col_j]):
-                    values_i.append(df.iloc[k][col_i])
-                    values_j.append(df.iloc[k][col_j])
-            n = len(values_i)
+            vals_j = values[:, j]
+            # Vectorized: Only keep rows without NaN in either column
+            mask = ~np.isnan(vals_i) & ~np.isnan(vals_j)
+            x = vals_i[mask]
+            y = vals_j[mask]
+            n = x.size
             if n == 0:
                 result[(col_i, col_j)] = np.nan
                 continue
-            mean_i = sum(values_i) / n
-            mean_j = sum(values_j) / n
-            var_i = sum((x - mean_i) ** 2 for x in values_i) / n
-            var_j = sum((x - mean_j) ** 2 for x in values_j) / n
-            std_i = var_i**0.5
-            std_j = var_j**0.5
-            if std_i == 0 or std_j == 0:
+            mean_x = x.mean()
+            mean_y = y.mean()
+            std_x = x.std()
+            std_y = y.std()
+            if std_x == 0 or std_y == 0:
                 result[(col_i, col_j)] = np.nan
                 continue
-            cov = (
-                sum((values_i[k] - mean_i) * (values_j[k] - mean_j) for k in range(n))
-                / n
-            )
-            corr = cov / (std_i * std_j)
+            cov = ((x - mean_x) * (y - mean_y)).mean()
+            corr = cov / (std_x * std_y)
             result[(col_i, col_j)] = corr
     return result

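One way to gain confidence in a rewrite like this is to run the loop-based and array-based logic side by side on a small frame that includes NaNs and a constant column. The sketch below is an assumption-laden illustration: `correlation_loop` and `correlation_vectorized` are hypothetical standalone functions written in the style of the removed and added lines, they take the column list explicitly instead of detecting numeric columns, and the sample data is invented.

```python
import numpy as np
import pandas as pd


def correlation_loop(df, columns):
    """Row-by-row reference in the style of the removed lines."""
    result = {}
    for col_i in columns:
        for col_j in columns:
            xs, ys = [], []
            for k in range(len(df)):
                a, b = df.iloc[k][col_i], df.iloc[k][col_j]
                if not pd.isna(a) and not pd.isna(b):
                    xs.append(a)
                    ys.append(b)
            n = len(xs)
            if n == 0:
                result[(col_i, col_j)] = np.nan
                continue
            mean_x, mean_y = sum(xs) / n, sum(ys) / n
            std_x = (sum((v - mean_x) ** 2 for v in xs) / n) ** 0.5
            std_y = (sum((v - mean_y) ** 2 for v in ys) / n) ** 0.5
            if std_x == 0 or std_y == 0:
                result[(col_i, col_j)] = np.nan
                continue
            cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(xs, ys)) / n
            result[(col_i, col_j)] = cov / (std_x * std_y)
    return result


def correlation_vectorized(df, columns):
    """Array-based version in the style of the added lines."""
    values = df[columns].to_numpy(dtype=float)
    result = {}
    for i, col_i in enumerate(columns):
        for j, col_j in enumerate(columns):
            # Rows where both columns hold a real number.
            mask = ~np.isnan(values[:, i]) & ~np.isnan(values[:, j])
            x, y = values[mask, i], values[mask, j]
            if x.size == 0 or x.std() == 0 or y.std() == 0:
                result[(col_i, col_j)] = np.nan
                continue
            cov = ((x - x.mean()) * (y - y.mean())).mean()
            result[(col_i, col_j)] = cov / (x.std() * y.std())
    return result


# Made-up sample data: NaNs in "a" and "b", zero variance in "c".
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, np.nan, 4.0, 5.0],
        "b": [2.0, 1.0, 6.0, np.nan, 9.0],
        "c": [5.0, 5.0, 5.0, 5.0, 5.0],
    }
)
cols = ["a", "b", "c"]
old = correlation_loop(df, cols)
new = correlation_vectorized(df, cols)

# equal_nan=True treats a NaN in both results as a match.
agree = all(np.isclose(old[key], new[key], equal_nan=True) for key in old)
print(agree)
```

The comparison uses `np.isclose` rather than `==` because NumPy's summation order can differ from Python's built-in `sum()`, so the two results may differ in the last few bits.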