codeflash-ai[bot]
@codeflash-ai codeflash-ai bot commented Sep 10, 2025

📄 4,315% (43.15x) speedup for fillna in src/numpy_pandas/dataframe_operations.py

⏱️ Runtime : 93.2 milliseconds → 2.11 milliseconds (best of 149 runs)
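
The headline percentage can be reproduced from the two runtimes. The sketch below assumes the figure is computed as (original - optimized) / optimized; that formula is inferred from the per-test annotations in this report, not stated by Codeflash, and `percent_faster` is a hypothetical helper name.

```python
def percent_faster(original: float, optimized: float) -> float:
    """Relative gain in percent; a negative value means the optimized code is slower.

    Assumed formula (inferred from the report, not confirmed by Codeflash):
    (original - optimized) / optimized.
    """
    return (original - optimized) / optimized * 100.0


# Headline runtimes in milliseconds, as reported above (already rounded).
headline = percent_faster(93.2, 2.11)

# Two per-test annotations from the generated tests, in microseconds.
multiple_nans = percent_faster(75.5, 59.2)  # reported as 27.4% faster
empty_frame = percent_faster(7.04, 34.6)    # reported as 79.7% slower

print(f"{headline:.0f}%  {multiple_nans:.1f}%  {empty_frame:.1f}%")
```

With the rounded inputs the headline comes out near 4,317%, which agrees with the reported 4,315% up to rounding of the displayed runtimes.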

📝 Explanation and details

The optimized code achieves a 43x speedup by replacing an inefficient row-by-row loop with vectorized pandas operations. Here are the key optimizations:

1. Eliminated Expensive Row-by-Row Operations

  • Original: Used `for i in range(len(df))` with `df.iloc[i][column]` and `result.iloc[i, col_idx] = value` inside the loop
  • Optimized: Creates a boolean mask `pd.isna(df[column])` once and uses `result.iloc[mask.values, col_idx] = value` for batch assignment
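
A minimal sketch of the two approaches side by side, reconstructed from the description above rather than copied from `dataframe_operations.py`. The sample frame and the names `looped` and `vectorized` are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, None, 3.0, None], "B": [10, 20, 30, 40]})
column, value = "A", 0.0
col_idx = df.columns.get_loc(column)

# Row-by-row: one scalar read and (possibly) one scalar write per row.
looped = df.copy()
for i in range(len(df)):
    if pd.isna(df.iloc[i][column]):
        looped.iloc[i, col_idx] = value

# Vectorized: build the mask once, then assign every masked row in one call.
mask = pd.isna(df[column])
vectorized = df.copy()
vectorized.iloc[mask.values, col_idx] = value

print(vectorized.equals(looped))
```

Both paths write into a copy, so the input frame keeps its missing values.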

2. Moved Column Index Lookup Outside Loop

  • Original: Called `df.columns.get_loc(column)` inside the assignment (3,460 times in profiling)
  • Optimized: Computed `col_idx` once before the conditional logic
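
The lookup can be hoisted because its result does not depend on the row. A small sketch on an assumed two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, None], "B": [None, 2.0]})

# For unique column labels, get_loc returns the integer position of the column.
col_idx = df.columns.get_loc("B")

# The position depends only on the column labels, not on the row being visited,
# so repeating the call per row yields the same value every time.
per_row = [df.columns.get_loc("B") for _ in range(len(df))]

# A missing label raises KeyError at the lookup, before any data is touched.
try:
    df.columns.get_loc("C")
    missing_raises = False
except KeyError:
    missing_raises = True

print(col_idx, per_row, missing_raises)
```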

3. Added Short-Circuit Logic

  • Original: Always executed the loop regardless of whether NaN values existed
  • Optimized: Uses `if mask.any():` to skip assignment entirely when no NaN values are present
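
A sketch of the short-circuit, assuming the optimized function has the shape described above. `fill_if_needed` and its `assigned` flag are hypothetical and exist only to make the skipped branch observable.

```python
import pandas as pd


def fill_if_needed(df: pd.DataFrame, column: str, value):
    """Hypothetical helper: returns the filled copy and whether an assignment ran."""
    result = df.copy()
    col_idx = df.columns.get_loc(column)
    mask = pd.isna(df[column])
    assigned = False
    if mask.any():  # False for a column with no missing values and for an empty column
        result.iloc[mask.values, col_idx] = value
        assigned = True
    return result, assigned


clean = pd.DataFrame({"A": [1.0, 2.0, 3.0]})
empty = pd.DataFrame({"A": []})
gappy = pd.DataFrame({"A": [1.0, None, 3.0]})

_, clean_assigned = fill_if_needed(clean, "A", 0.0)
_, empty_assigned = fill_if_needed(empty, "A", 0.0)
filled, gappy_assigned = fill_if_needed(gappy, "A", 0.0)

print(clean_assigned, empty_assigned, gappy_assigned)
```

The mask is still built in every case, so the check saves the assignment but not the `isna` pass.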

Performance Impact by Test Case:

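  • Large datasets (500 to 1,000 rows) with NaNs: 5,518% to 27,822% faster, because vectorized operations scale much better than Python loops
  • Small datasets (0 to 4 rows): Mixed results, ranging from 79.8% slower (empty DataFrame, 6.92μs -> 34.2μs) to 78.4% faster (missing-column KeyError path); the fixed cost of mask creation is comparable to the loop cost at this size, and the absolute differences are tens of microseconds
  • No NaN cases: 14,007% to 14,059% faster at 1,000 rows, since no per-row loop runs and the `if mask.any():` check skips the assignment; at 3 rows the change is within about 4% in either direction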

The optimization leverages pandas' internal C implementations for boolean indexing and bulk assignment, which are orders of magnitude faster than Python's interpreted row-by-row operations. This is especially effective for the typical use case of filling multiple missing values in larger datasets.
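
To reproduce the scaling claim locally, a rough timing harness along these lines could be used. It is a sketch with assumed function names (`loop_fill`, `mask_fill`) and a synthetic 1,000-row frame, not the benchmark Codeflash ran, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

size = 1000
df = pd.DataFrame({"a": [np.nan if i % 2 == 0 else float(i) for i in range(size)]})
col_idx = df.columns.get_loc("a")


def loop_fill():
    # Row-by-row variant, as described in the explanation.
    result = df.copy()
    for i in range(len(df)):
        if pd.isna(df.iloc[i]["a"]):
            result.iloc[i, col_idx] = 0.0
    return result


def mask_fill():
    # Mask-based variant with the short-circuit check.
    result = df.copy()
    mask = pd.isna(df["a"])
    if mask.any():
        result.iloc[mask.values, col_idx] = 0.0
    return result


# Best-of-N wall-clock time per call, in seconds.
loop_s = min(timeit.repeat(loop_fill, number=1, repeat=3))
mask_s = min(timeit.repeat(mask_fill, number=1, repeat=20))
print(f"loop: {loop_s * 1e3:.2f} ms, mask: {mask_s * 1e6:.1f} us, ratio: {loop_s / mask_s:.0f}x")
```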

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 38 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
from typing import Any

import pandas as pd
# imports
import pytest  # used for our unit tests
from src.numpy_pandas.dataframe_operations import fillna

# unit tests

# -------------------------
# Basic Test Cases
# -------------------------

def test_fillna_basic_single_nan():
    # Single NaN in a column, should be filled
    df = pd.DataFrame({'A': [1, None, 3]})
    codeflash_output = fillna(df, 'A', 0); filled = codeflash_output # 85.7μs -> 100μs (14.5% slower)

def test_fillna_basic_multiple_nans():
    # Multiple NaNs in a column, all should be filled
    df = pd.DataFrame({'A': [None, 2, None, 4]})
    codeflash_output = fillna(df, 'A', 99); filled = codeflash_output # 75.5μs -> 59.2μs (27.4% faster)

def test_fillna_basic_no_nans():
    # No NaNs present, DataFrame should be unchanged
    df = pd.DataFrame({'A': [1, 2, 3]})
    codeflash_output = fillna(df, 'A', 0); filled = codeflash_output # 33.9μs -> 35.2μs (3.67% slower)

def test_fillna_basic_different_types():
    # Fill NaN in string column
    df = pd.DataFrame({'A': ['foo', None, 'bar']})
    codeflash_output = fillna(df, 'A', 'baz'); filled = codeflash_output # 49.4μs -> 56.5μs (12.7% slower)

def test_fillna_basic_fill_with_none():
    # Fill NaN with None (should remain None)
    df = pd.DataFrame({'A': [None, 2, 3]})
    codeflash_output = fillna(df, 'A', None); filled = codeflash_output # 92.2μs -> 90.0μs (2.46% faster)

# -------------------------
# Edge Test Cases
# -------------------------

def test_fillna_edge_empty_dataframe():
    # Empty DataFrame should remain empty
    df = pd.DataFrame({'A': []})
    codeflash_output = fillna(df, 'A', 0); filled = codeflash_output # 7.04μs -> 34.6μs (79.7% slower)

def test_fillna_edge_all_nans():
    # All values are NaN, all should be filled
    df = pd.DataFrame({'A': [None, float('nan'), pd.NA]})
    codeflash_output = fillna(df, 'A', 42); filled = codeflash_output # 68.5μs -> 54.2μs (26.5% faster)

def test_fillna_edge_column_does_not_exist():
    # Should raise KeyError if column does not exist
    df = pd.DataFrame({'A': [1, None]})
    with pytest.raises(KeyError):
        fillna(df, 'B', 0) # 23.3μs -> 13.5μs (73.1% faster)

def test_fillna_edge_column_with_mixed_types():
    # Column with mixed types and NaN
    df = pd.DataFrame({'A': [1, 'foo', None, 3.5]})
    codeflash_output = fillna(df, 'A', 'bar'); filled = codeflash_output # 54.3μs -> 53.0μs (2.36% faster)

def test_fillna_edge_nan_in_other_columns():
    # Only fill NaN in the specified column
    df = pd.DataFrame({'A': [1, None], 'B': [None, 2]})
    codeflash_output = fillna(df, 'A', 0); filled = codeflash_output # 44.8μs -> 52.1μs (13.9% slower)

def test_fillna_edge_nan_at_first_and_last():
    # NaN at first and last positions
    df = pd.DataFrame({'A': [None, 2, 3, None]})
    codeflash_output = fillna(df, 'A', 7); filled = codeflash_output # 69.5μs -> 51.0μs (36.5% faster)

def test_fillna_edge_nan_is_not_string_nan():
    # 'nan' string should not be treated as NaN
    df = pd.DataFrame({'A': ['nan', None]})
    codeflash_output = fillna(df, 'A', 'filled'); filled = codeflash_output # 40.0μs -> 51.5μs (22.4% slower)

def test_fillna_edge_nan_is_not_zero():
    # 0 should not be treated as NaN
    df = pd.DataFrame({'A': [0, None]})
    codeflash_output = fillna(df, 'A', 5); filled = codeflash_output # 43.4μs -> 50.5μs (14.1% slower)

# -------------------------
# Large Scale Test Cases
# -------------------------

def test_fillna_large_scale_half_nans():
    # DataFrame with 1000 rows, half NaN, half not
    size = 1000
    data = [None if i % 2 == 0 else i for i in range(size)]
    df = pd.DataFrame({'A': data})
    codeflash_output = fillna(df, 'A', 'x'); filled = codeflash_output # 8.94ms -> 132μs (6629% faster)
    for i in range(size):
        if i % 2 == 0:
            pass
        else:
            pass

def test_fillna_large_scale_all_nans():
    # DataFrame with 1000 rows, all NaN
    size = 1000
    df = pd.DataFrame({'A': [None] * size})
    codeflash_output = fillna(df, 'A', 123); filled = codeflash_output # 12.4ms -> 68.5μs (18038% faster)

def test_fillna_large_scale_no_nans():
    # DataFrame with 1000 rows, no NaN
    size = 1000
    df = pd.DataFrame({'A': list(range(size))})
    codeflash_output = fillna(df, 'A', 9999); filled = codeflash_output # 4.90ms -> 34.6μs (14059% faster)

def test_fillna_large_scale_multiple_columns():
    # DataFrame with multiple columns, only fill specified column
    size = 1000
    df = pd.DataFrame({
        'A': [None if i % 3 == 0 else i for i in range(size)],
        'B': [None if i % 5 == 0 else i for i in range(size)],
        'C': list(range(size))
    })
    codeflash_output = fillna(df, 'B', 'filled'); filled = codeflash_output # 13.8ms -> 143μs (9486% faster)
    for i in range(size):
        expected = 'filled' if i % 5 == 0 else i

def test_fillna_large_scale_string_column():
    # DataFrame with string column, some NaNs
    size = 500
    data = ['foo' if i % 4 != 0 else None for i in range(size)]
    df = pd.DataFrame({'A': data})
    codeflash_output = fillna(df, 'A', 'bar'); filled = codeflash_output # 3.48ms -> 62.0μs (5518% faster)
    for i in range(size):
        expected = 'bar' if i % 4 == 0 else 'foo'
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from typing import Any

import pandas as pd
# imports
import pytest  # used for our unit tests
from src.numpy_pandas.dataframe_operations import fillna

# unit tests

# -------------------------
# Basic Test Cases
# -------------------------

def test_fillna_basic_single_nan():
    # One NaN in a column, should be filled
    df = pd.DataFrame({'a': [1, None, 3]})
    codeflash_output = fillna(df, 'a', 0); result = codeflash_output # 52.2μs -> 51.7μs (0.888% faster)
    expected = pd.DataFrame({'a': [1, 0, 3]})

def test_fillna_basic_no_nan():
    # No NaN in the column, should be unchanged
    df = pd.DataFrame({'a': [1, 2, 3]})
    codeflash_output = fillna(df, 'a', 0); result = codeflash_output # 34.0μs -> 33.3μs (2.00% faster)
    expected = pd.DataFrame({'a': [1, 2, 3]})

def test_fillna_basic_multiple_nan():
    # Multiple NaNs in a column, all should be filled
    df = pd.DataFrame({'a': [None, 2, None]})
    codeflash_output = fillna(df, 'a', 99); result = codeflash_output # 62.9μs -> 51.6μs (21.9% faster)
    expected = pd.DataFrame({'a': [99, 2, 99]})

def test_fillna_basic_other_columns_untouched():
    # Only the specified column should be affected
    df = pd.DataFrame({'a': [None, 2, 3], 'b': [4, None, 6]})
    codeflash_output = fillna(df, 'a', 10); result = codeflash_output # 50.5μs -> 51.2μs (1.46% slower)
    expected = pd.DataFrame({'a': [10, 2, 3], 'b': [4, None, 6]})

def test_fillna_basic_fill_with_string():
    # Fill with a string value
    df = pd.DataFrame({'a': [None, 'foo', None]})
    codeflash_output = fillna(df, 'a', 'bar'); result = codeflash_output # 58.8μs -> 51.7μs (13.9% faster)
    expected = pd.DataFrame({'a': ['bar', 'foo', 'bar']})

def test_fillna_basic_fill_with_float():
    # Fill with a float value
    df = pd.DataFrame({'a': [None, 1.5, None]})
    codeflash_output = fillna(df, 'a', 2.5); result = codeflash_output # 66.3μs -> 52.3μs (26.8% faster)
    expected = pd.DataFrame({'a': [2.5, 1.5, 2.5]})

# -------------------------
# Edge Test Cases
# -------------------------

def test_fillna_edge_empty_dataframe():
    # Empty DataFrame should remain empty
    df = pd.DataFrame({'a': []})
    codeflash_output = fillna(df, 'a', 42); result = codeflash_output # 6.92μs -> 34.2μs (79.8% slower)
    expected = pd.DataFrame({'a': []})

def test_fillna_edge_all_nan():
    # All values are NaN, all should be filled
    df = pd.DataFrame({'a': [None, None, None]})
    codeflash_output = fillna(df, 'a', 7); result = codeflash_output # 66.1μs -> 52.3μs (26.5% faster)
    expected = pd.DataFrame({'a': [7, 7, 7]})

def test_fillna_edge_nan_in_other_column():
    # NaN in a different column should not be filled
    df = pd.DataFrame({'a': [1, 2, 3], 'b': [None, 5, 6]})
    codeflash_output = fillna(df, 'a', 0); result = codeflash_output # 48.5μs -> 35.2μs (37.6% faster)
    expected = pd.DataFrame({'a': [1, 2, 3], 'b': [None, 5, 6]})

def test_fillna_edge_column_does_not_exist():
    # Should raise KeyError if column does not exist
    df = pd.DataFrame({'a': [1, 2, 3]})
    with pytest.raises(KeyError):
        fillna(df, 'b', 0) # 22.0μs -> 12.3μs (78.4% faster)

def test_fillna_edge_nan_types():
    # Test with np.nan and pd.NA
    import numpy as np
    df = pd.DataFrame({'a': [np.nan, pd.NA, 3]})
    codeflash_output = fillna(df, 'a', 1); result = codeflash_output # 58.2μs -> 52.3μs (11.2% faster)
    expected = pd.DataFrame({'a': [1, 1, 3]})

def test_fillna_edge_fill_with_none():
    # Filling NaN with None should not change the NaN
    df = pd.DataFrame({'a': [None, 2, None]})
    codeflash_output = fillna(df, 'a', None); result = codeflash_output # 112μs -> 77.3μs (44.9% faster)
    # None is interpreted as NaN in pandas
    expected = pd.DataFrame({'a': [None, 2, None]})

def test_fillna_edge_fill_with_nan():
    # Filling NaN with np.nan should not change the NaN
    import numpy as np
    df = pd.DataFrame({'a': [None, 2, None]})
    codeflash_output = fillna(df, 'a', np.nan); result = codeflash_output # 66.0μs -> 51.9μs (27.0% faster)
    expected = pd.DataFrame({'a': [np.nan, 2, np.nan]})

def test_fillna_edge_column_with_mixed_types():
    # Column with mixed types (int, float, str, NaN)
    import numpy as np
    df = pd.DataFrame({'a': [1, 'x', np.nan, None]})
    codeflash_output = fillna(df, 'a', 'filled'); result = codeflash_output # 62.8μs -> 51.5μs (22.1% faster)
    expected = pd.DataFrame({'a': [1, 'x', 'filled', 'filled']})

def test_fillna_edge_dataframe_not_modified():
    # Ensure original DataFrame is not modified
    df = pd.DataFrame({'a': [None, 2, 3]})
    df_copy = df.copy(deep=True)
    codeflash_output = fillna(df, 'a', 0); _ = codeflash_output # 49.8μs -> 48.7μs (2.14% faster)

def test_fillna_edge_column_with_all_non_nan_types():
    # Column with all non-NaN types (should not be changed)
    df = pd.DataFrame({'a': ['x', 'y', 'z']})
    codeflash_output = fillna(df, 'a', 'foo'); result = codeflash_output # 32.9μs -> 34.2μs (3.78% slower)
    expected = pd.DataFrame({'a': ['x', 'y', 'z']})

# -------------------------
# Large Scale Test Cases
# -------------------------

def test_fillna_large_scale_many_rows():
    # Large DataFrame with some NaNs scattered
    import numpy as np
    size = 1000
    data = [i if i % 10 != 0 else np.nan for i in range(size)]
    df = pd.DataFrame({'a': data})
    codeflash_output = fillna(df, 'a', -1); result = codeflash_output # 5.98ms -> 54.1μs (10942% faster)
    expected_data = [i if i % 10 != 0 else -1 for i in range(size)]
    expected = pd.DataFrame({'a': expected_data})

def test_fillna_large_scale_all_nan():
    # Large DataFrame with all NaNs
    import numpy as np
    size = 1000
    df = pd.DataFrame({'a': [np.nan] * size})
    codeflash_output = fillna(df, 'a', 123); result = codeflash_output # 14.3ms -> 55.2μs (25789% faster)
    expected = pd.DataFrame({'a': [123] * size})

def test_fillna_large_scale_multiple_columns():
    # Large DataFrame with many columns, only one is filled
    import numpy as np
    size = 1000
    df = pd.DataFrame({
        'a': [np.nan if i % 2 == 0 else i for i in range(size)],
        'b': [i for i in range(size)],
        'c': [None if i % 3 == 0 else 'x' for i in range(size)]
    })
    codeflash_output = fillna(df, 'a', 0); result = codeflash_output # 23.0ms -> 82.4μs (27822% faster)
    expected_a = [0 if i % 2 == 0 else i for i in range(size)]
    expected = pd.DataFrame({
        'a': expected_a,
        'b': [i for i in range(size)],
        'c': [None if i % 3 == 0 else 'x' for i in range(size)]
    })

def test_fillna_large_scale_no_nan():
    # Large DataFrame with no NaNs (should be unchanged)
    size = 1000
    df = pd.DataFrame({'a': list(range(size))})
    codeflash_output = fillna(df, 'a', 9999); result = codeflash_output # 4.88ms -> 34.6μs (14007% faster)
    expected = pd.DataFrame({'a': list(range(size))})

To edit these changes, `git checkout codeflash/optimize-fillna-mfejxgdn` and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 September 10, 2025 22:29
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Sep 10, 2025