The optimized code achieves a 12,290% speedup by replacing row-by-row pandas DataFrame access with vectorized NumPy operations. Here are the key optimizations:
**1. Pre-convert DataFrame to NumPy array**
- `values = df[numeric_columns].to_numpy(dtype=float)` converts all numeric columns to a single NumPy array upfront
- This eliminates the expensive `df.iloc[k][col_i]` operations that dominated the original runtime (51.8% + 23.7% + 23.7% = 99.2% of total profiled time); see the sketch below
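A minimal sketch of this step, assuming a DataFrame `df` whose numeric columns are listed in `numeric_columns` (the toy data here is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real input (column names are made up).
df = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, np.nan, 6.0]})
numeric_columns = df.select_dtypes(include="number").columns

# One upfront conversion: all numeric columns land in a single float array.
values = df[numeric_columns].to_numpy(dtype=float)

# Reading column i is now a cheap strided slice instead of n separate
# df.iloc[k][col_i] lookups, each of which materializes a whole row.
vals_i = values[:, 0]
```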
**2. Vectorized NaN filtering**
- Original: Row-by-row iteration with `pd.isna()` checks in Python loops
- Optimized: `mask = ~np.isnan(vals_i) & ~np.isnan(vals_j)` creates boolean mask in one vectorized operation
- Filtering becomes `x = vals_i[mask]` instead of appending valid values one by one (sketched below)
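A sketch of the masking step, reusing the `vals_i`/`vals_j` names from the snippets above (the toy arrays are made up for illustration):

```python
import numpy as np

# Two columns with missing values.
vals_i = np.array([1.0, 2.0, np.nan, 4.0])
vals_j = np.array([np.nan, 5.0, 6.0, 7.0])

# One vectorized pass flags the rows where BOTH columns are present.
mask = ~np.isnan(vals_i) & ~np.isnan(vals_j)

# Boolean indexing replaces the append-one-value-at-a-time loop.
x = vals_i[mask]  # array([2., 4.])
y = vals_j[mask]  # array([5., 7.])
```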
**3. Vectorized statistical calculations**
- Original: Manual computation using Python loops (`sum()`, list comprehensions)
- Optimized: Native NumPy methods (`x.mean()`, `x.std()`, `((x - mean_x) * (y - mean_y)).mean()`)
- NumPy's C-level implementations are orders of magnitude faster than Python loops (see the sketch below)
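Continuing the sketch with the filtered arrays `x` and `y`; the final Pearson-correlation step is an assumption inferred from the covariance expression quoted above:

```python
import numpy as np

x = np.array([2.0, 4.0])  # filtered values from the masking step above
y = np.array([5.0, 7.0])

mean_x, mean_y = x.mean(), y.mean()
std_x, std_y = x.std(), y.std()  # population std (ddof=0)

# Population covariance in a single C-level pass, as quoted above.
cov_xy = ((x - mean_x) * (y - mean_y)).mean()

# Pearson correlation, assuming that is the statistic being computed.
corr = cov_xy / (std_x * std_y)  # -> 1.0 for this toy data
```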
**Performance characteristics by test case:**
- **Small datasets (3-5 rows)**: 75-135% speedup; the fixed cost of the NumPy conversion stays small enough that even tiny inputs come out ahead
- **Medium datasets (100-1000 rows)**: 200-400% speedup; vectorization benefits become significant
- **Large datasets (1000+ rows)**: 11,000-50,000% speedup; the per-element Python overhead being eliminated dominates the original runtime
- **Edge cases with many NaNs**: excellent performance, since boolean masking filters invalid rows in a single vectorized pass
- **Multiple columns**: scales well, since slicing a column out of the array (`values[:, i]`) is very cheap
The optimization does not change the asymptotic amount of arithmetic so much as who performs it. In the original, every element access goes through `df.iloc[k][col_i]`, which materializes an entire row of m values just to read one, so each column pair pays roughly O(n·m) of pandas indexing overhead on top of its O(n) arithmetic, where n is the number of rows and m the number of numeric columns. The optimized version performs the same O(n) arithmetic per pair as a few C-level NumPy passes, which is where the orders-of-magnitude speedup comes from.
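Putting the three steps together, a hypothetical end-to-end version of the optimized routine might look like the following. The function name `pairwise_correlations`, the dict output, and the degenerate-pair guard are assumptions; the individual operations are taken from the description above:

```python
import numpy as np
import pandas as pd

def pairwise_correlations(df: pd.DataFrame) -> dict:
    """Pearson correlation for every pair of numeric columns, skipping NaNs pairwise."""
    numeric_columns = df.select_dtypes(include="number").columns
    # Optimization 1: a single upfront conversion to one float array.
    values = df[numeric_columns].to_numpy(dtype=float)

    result = {}
    for i in range(len(numeric_columns)):
        for j in range(i + 1, len(numeric_columns)):
            vals_i, vals_j = values[:, i], values[:, j]
            # Optimization 2: vectorized NaN filtering with a boolean mask.
            mask = ~np.isnan(vals_i) & ~np.isnan(vals_j)
            x, y = vals_i[mask], vals_j[mask]
            # Guard against degenerate pairs (too few points or zero variance).
            if x.size < 2 or x.std() == 0.0 or y.std() == 0.0:
                result[(numeric_columns[i], numeric_columns[j])] = float("nan")
                continue
            # Optimization 3: C-level mean/std/covariance instead of Python loops.
            cov = ((x - x.mean()) * (y - y.mean())).mean()
            result[(numeric_columns[i], numeric_columns[j])] = cov / (x.std() * y.std())
    return result
```

For example, `pairwise_correlations(pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 6]}))` returns `{('a', 'b'): 1.0}`, since the two columns are perfectly correlated.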