The optimized code achieves a 12,290% speedup by replacing row-by-row pandas DataFrame access with vectorized NumPy operations. Here are the key optimizations:
**1. Pre-convert DataFrame to NumPy array**
- `values = df[numeric_columns].to_numpy(dtype=float)` converts all numeric columns to a single NumPy array upfront
- This eliminates the expensive `df.iloc[k][col_i]` operations that dominated the original runtime (51.8% + 23.7% + 23.7% = 99.2% of total profiled time); see the sketch below
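A minimal sketch of this step, assuming a DataFrame `df` whose numeric columns are listed in `numeric_columns` (the toy data here is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real input (column names are made up).
df = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, np.nan, 6.0]})
numeric_columns = df.select_dtypes(include="number").columns

# One upfront conversion: all numeric columns land in a single float array.
values = df[numeric_columns].to_numpy(dtype=float)

# Reading column i is now a cheap strided slice instead of n separate
# df.iloc[k][col_i] lookups, each of which materializes a whole row.
vals_i = values[:, 0]
```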
**2. Vectorized NaN filtering**
- Original: Row-by-row iteration with `pd.isna()` checks in Python loops
- Optimized: `mask = ~np.isnan(vals_i) & ~np.isnan(vals_j)` creates boolean mask in one vectorized operation
- Filtering becomes `x = vals_i[mask]` instead of appending valid values one by one (sketched below)
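A sketch of the masking step, reusing the `vals_i`/`vals_j` names from the snippets above (the toy arrays are made up for illustration):

```python
import numpy as np

# Two columns with missing values.
vals_i = np.array([1.0, 2.0, np.nan, 4.0])
vals_j = np.array([np.nan, 5.0, 6.0, 7.0])

# One vectorized pass flags the rows where BOTH columns are present.
mask = ~np.isnan(vals_i) & ~np.isnan(vals_j)

# Boolean indexing replaces the append-one-value-at-a-time loop.
x = vals_i[mask]  # array([2., 4.])
y = vals_j[mask]  # array([5., 7.])
```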
**3. Vectorized statistical calculations**
- Original: Manual computation using Python loops (`sum()`, list comprehensions)
- Optimized: Native NumPy methods (`x.mean()`, `x.std()`, `((x - mean_x) * (y - mean_y)).mean()`)
- NumPy's C-level implementations are orders of magnitude faster than Python loops (see the sketch below)
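Continuing the sketch with the filtered arrays `x` and `y`; the final Pearson-correlation step is an assumption inferred from the covariance expression quoted above:

```python
import numpy as np

x = np.array([2.0, 4.0])  # filtered values from the masking step above
y = np.array([5.0, 7.0])

mean_x, mean_y = x.mean(), y.mean()
std_x, std_y = x.std(), y.std()  # population std (ddof=0)

# Population covariance in a single C-level pass, as quoted above.
cov_xy = ((x - mean_x) * (y - mean_y)).mean()

# Pearson correlation, assuming that is the statistic being computed.
corr = cov_xy / (std_x * std_y)  # -> 1.0 for this toy data
```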
**Performance characteristics by test case:**
- **Small datasets (3-5 rows)**: 75-135% speedup; the fixed cost of the NumPy conversion stays small enough that even tiny inputs come out ahead
- **Medium datasets (100-1000 rows)**: 200-400% speedup; vectorization benefits become significant
- **Large datasets (1000+ rows)**: 11,000-50,000% speedup; the per-element Python overhead being eliminated dominates the original runtime
- **Edge cases with many NaNs**: excellent performance, since boolean masking filters invalid rows in a single vectorized pass
- **Multiple columns**: scales well, since slicing a column out of the array (`values[:, i]`) is very cheap
The optimization does not change the asymptotic amount of arithmetic so much as who performs it. In the original, every element access goes through `df.iloc[k][col_i]`, which materializes an entire row of m values just to read one, so each column pair pays roughly O(n·m) of pandas indexing overhead on top of its O(n) arithmetic, where n is the number of rows and m the number of numeric columns. The optimized version performs the same O(n) arithmetic per pair as a few C-level NumPy passes, which is where the orders-of-magnitude speedup comes from.
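Putting the three steps together, a hypothetical end-to-end version of the optimized routine might look like the following. The function name `pairwise_correlations`, the dict output, and the degenerate-pair guard are assumptions; the individual operations are taken from the description above:

```python
import numpy as np
import pandas as pd

def pairwise_correlations(df: pd.DataFrame) -> dict:
    """Pearson correlation for every pair of numeric columns, skipping NaNs pairwise."""
    numeric_columns = df.select_dtypes(include="number").columns
    # Optimization 1: a single upfront conversion to one float array.
    values = df[numeric_columns].to_numpy(dtype=float)

    result = {}
    for i in range(len(numeric_columns)):
        for j in range(i + 1, len(numeric_columns)):
            vals_i, vals_j = values[:, i], values[:, j]
            # Optimization 2: vectorized NaN filtering with a boolean mask.
            mask = ~np.isnan(vals_i) & ~np.isnan(vals_j)
            x, y = vals_i[mask], vals_j[mask]
            # Guard against degenerate pairs (too few points or zero variance).
            if x.size < 2 or x.std() == 0.0 or y.std() == 0.0:
                result[(numeric_columns[i], numeric_columns[j])] = float("nan")
                continue
            # Optimization 3: C-level mean/std/covariance instead of Python loops.
            cov = ((x - x.mean()) * (y - y.mean())).mean()
            result[(numeric_columns[i], numeric_columns[j])] = cov / (x.std() * y.std())
    return result
```

For example, `pairwise_correlations(pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 6]}))` returns `{('a', 'b'): 1.0}`, since the two columns are perfectly correlated.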