@codeflash-ai codeflash-ai bot commented Sep 13, 2025

📄 11,311% (113.11x) speedup for find_last_node in src/dsa/nodes.py

⏱️ Runtime: 98.8 milliseconds → 866 microseconds (best of 161 runs)

📝 Explanation and details

The optimization transforms the algorithm from O(N*M) to O(N+M) complexity by precomputing source IDs into a set for O(1) lookups.

Key Changes:

  • Precomputed set: Creates source_ids = {e["source"] for e in edges} once upfront
  • Fast membership check: Replaces all(e["source"] != n["id"] for e in edges) with n["id"] not in source_ids

Why This Is Faster:
The original code performed a nested loop: for each of the N nodes, it compared the node's ID against the sources of the M edges, which costs up to O(N*M) comparisons in the worst case. The optimized version builds a hash set of source IDs once (O(M)), then performs an average constant-time lookup for each node (O(N)), resulting in O(N+M) total complexity.
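The two expressions named under Key Changes can be placed side by side. This is a minimal sketch, not the contents of src/dsa/nodes.py: the PR text only shows the inner expressions, so the `next(...)` wrapper, the function names `find_last_node_original` and `find_last_node_optimized`, and the type aliases are assumptions chosen to match the "returns first found" behavior described in the tests.

```python
from typing import Any, Optional

Node = dict[str, Any]
Edge = dict[str, Any]


def find_last_node_original(nodes: list[Node], edges: list[Edge]) -> Optional[Node]:
    # Assumed shape of the original: for each node, scan the edges until one
    # names it as a source. Worst case O(N*M) comparisons.
    return next((n for n in nodes if all(e["source"] != n["id"] for e in edges)), None)


def find_last_node_optimized(nodes: list[Node], edges: list[Edge]) -> Optional[Node]:
    # Build the set of source IDs once (O(M)), then test each node with an
    # average O(1) membership check (O(N) overall).
    source_ids = {e["source"] for e in edges}
    return next((n for n in nodes if n["id"] not in source_ids), None)


nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
edges = [{"source": "A", "target": "B"}, {"source": "B", "target": "C"}]
print(find_last_node_original(nodes, edges))   # {'id': 'C'}
print(find_last_node_optimized(nodes, edges))  # {'id': 'C'}
```

One design consequence of the set-based form: source IDs must be hashable, because they are stored in a `set`. The comparison-based form only needs `!=`, so inputs with unhashable IDs (a list, for example) would behave differently between the two sketches.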

Performance Benefits by Test Case:

  • Large graphs that force long edge scans see the biggest speedups: roughly 17,000-18,900% for the 1000-node chains and cycle, and about 5,400% for the 100-node fully connected graph
  • Small graphs show moderate improvements (roughly 12-70%) due to reduced constant overhead
  • Star topologies improve by 122-140%; the original returns the first leaf after a single full pass over the edges, so the gain here is a constant factor rather than a change in growth rate
  • Empty inputs run about 22% slower (708ns -> 916ns and 750ns -> 958ns) because of set creation overhead, but the absolute cost is a fraction of a microsecond
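The gap between the chain and star results can be illustrated by counting comparisons rather than timing them. The sketch below is hypothetical: `count_source_comparisons` is a helper introduced here, and it assumes the original evaluates `all(e["source"] != n["id"] for e in edges)` node by node and stops at the first node that passes. The graphs are built the same way as in `test_large_linear_chain` and `test_large_star_graph`.

```python
def count_source_comparisons(nodes, edges):
    """Count the source/id comparisons made by the all()-based scan."""
    comparisons = 0
    for n in nodes:
        is_last = True
        for e in edges:
            comparisons += 1
            if e["source"] == n["id"]:
                is_last = False  # all() short-circuits at the first match
                break
        if is_last:
            break  # the first node without outgoing edges is returned
    return comparisons


# 1000-node chain: A0 -> A1 -> ... -> A999
chain_nodes = [{"id": f"A{i}"} for i in range(1000)]
chain_edges = [{"source": f"A{i}", "target": f"A{i+1}"} for i in range(999)]

# Star: one root with edges to 999 leaves
star_nodes = [{"id": "root"}] + [{"id": f"L{i}"} for i in range(1, 1000)]
star_edges = [{"source": "root", "target": f"L{i}"} for i in range(1, 1000)]

print(count_source_comparisons(chain_nodes, chain_edges))  # 500499
print(count_source_comparisons(star_nodes, star_edges))    # 1000
```

Under these assumptions the chain needs about half a million comparisons, while the set-based version does 999 set insertions and at most 1000 lookups. The star needs only 1000 comparisons in the original, which is consistent with its much smaller measured speedup.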

The optimization returns the same results as the original on all 44 generated regression tests while dramatically improving scalability for larger graphs.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 44 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pytest  # used for our unit tests
from src.dsa.nodes import find_last_node

# unit tests

# -------------------- Basic Test Cases --------------------

def test_single_node_no_edges():
    # One node, no edges: should return that node
    nodes = [{"id": "A"}]
    edges = []
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.17μs -> 1.04μs (12.0% faster)

def test_two_nodes_one_edge():
    # Two nodes, one edge from A to B: last node is B (no outgoing edges)
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A", "target": "B"}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.58μs -> 1.12μs (40.8% faster)

def test_three_nodes_chain():
    # Chain: A -> B -> C, last node is C
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [
        {"source": "A", "target": "B"},
        {"source": "B", "target": "C"}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 2.00μs -> 1.29μs (54.8% faster)

def test_three_nodes_star():
    # Star: A -> B, A -> C. Both B and C have no outgoing edges, should return B (first found)
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [
        {"source": "A", "target": "B"},
        {"source": "A", "target": "C"}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.67μs -> 1.12μs (48.2% faster)

def test_multiple_last_nodes():
    # Multiple nodes without outgoing edges, returns first found
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [{"source": "A", "target": "B"}]
    # B and C have no outgoing edges, B comes first
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.62μs -> 1.08μs (49.9% faster)

# -------------------- Edge Test Cases --------------------

def test_empty_nodes():
    # No nodes: should return None
    nodes = []
    edges = []
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 708ns -> 916ns (22.7% slower)

def test_edges_with_missing_nodes():
    # Edges refer to nodes not present in nodes list
    nodes = [{"id": "A"}]
    edges = [{"source": "B", "target": "A"}, {"source": "C", "target": "A"}]
    # Only node A, which has no outgoing edges
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.33μs -> 1.17μs (14.3% faster)

def test_node_with_self_loop():
    # Node with an edge to itself: not a last node
    nodes = [{"id": "A"}]
    edges = [{"source": "A", "target": "A"}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.17μs -> 1.00μs (16.7% faster)

def test_all_nodes_have_outgoing_edges():
    # Every node has at least one outgoing edge: should return None
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [
        {"source": "A", "target": "B"},
        {"source": "B", "target": "A"}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.58μs -> 1.17μs (35.6% faster)

def test_duplicate_edges():
    # Duplicate edges should not affect result
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [
        {"source": "A", "target": "B"},
        {"source": "A", "target": "B"}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.75μs -> 1.12μs (55.6% faster)

def test_node_with_multiple_outgoing_edges():
    # Node with multiple outgoing edges
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [
        {"source": "A", "target": "B"},
        {"source": "A", "target": "C"}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.58μs -> 1.12μs (40.8% faster)

def test_node_with_incoming_but_no_outgoing_edges():
    # Node with incoming edges but no outgoing edges
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A", "target": "B"}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.54μs -> 1.08μs (42.3% faster)

def test_node_with_no_edges_at_all():
    # Node present but no edges at all
    nodes = [{"id": "X"}]
    edges = []
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.12μs -> 1.00μs (12.5% faster)

def test_edges_with_extra_keys():
    # Edges have extra keys, should not affect result
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [
        {"source": "A", "target": "B", "weight": 5},
        {"source": "A", "target": "B", "label": "foo"}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.67μs -> 1.17μs (42.8% faster)

def test_nodes_with_extra_keys():
    # Nodes have extra keys, should not affect result
    nodes = [{"id": "A", "label": "alpha"}, {"id": "B", "label": "beta"}]
    edges = [{"source": "A", "target": "B"}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.62μs -> 1.08μs (50.0% faster)

def test_nodes_with_non_string_ids():
    # Node ids are integers
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.67μs -> 1.21μs (38.0% faster)

def test_edges_with_non_string_ids():
    # Edges use integer ids
    nodes = [{"id": 10}, {"id": 20}]
    edges = [{"source": 10, "target": 20}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.67μs -> 1.12μs (48.2% faster)

def test_mixed_type_ids():
    # Mixed types, should match by equality
    nodes = [{"id": "1"}, {"id": 2}]
    edges = [{"source": "1", "target": 2}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.71μs -> 1.12μs (51.8% faster)

def test_node_id_none():
    # Node id is None
    nodes = [{"id": None}, {"id": "A"}]
    edges = [{"source": "A", "target": None}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.33μs -> 1.04μs (28.0% faster)

def test_edge_source_none():
    # Edge source is None
    nodes = [{"id": None}, {"id": "A"}]
    edges = [{"source": None, "target": "A"}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.83μs -> 1.17μs (57.2% faster)

# -------------------- Large Scale Test Cases --------------------

def test_large_linear_chain():
    # Large chain: 1000 nodes, A0 -> A1 -> ... -> A999. Last node is A999
    N = 1000
    nodes = [{"id": f"A{i}"} for i in range(N)]
    edges = [{"source": f"A{i}", "target": f"A{i+1}"} for i in range(N-1)]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 19.7ms -> 107μs (18159% faster)

def test_large_star_graph():
    # One root node, 999 leaf nodes, all edges from root to leafs
    N = 1000
    nodes = [{"id": "root"}] + [{"id": f"L{i}"} for i in range(1, N)]
    edges = [{"source": "root", "target": f"L{i}"} for i in range(1, N)]
    # All leaves have no outgoing edges, first leaf is L1
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 39.0μs -> 17.6μs (122% faster)

def test_large_all_connected():
    # Every node connects to every other node (except itself)
    N = 100
    nodes = [{"id": f"N{i}"} for i in range(N)]
    edges = [{"source": f"N{i}", "target": f"N{j}"} for i in range(N) for j in range(N) if i != j]
    # Every node has outgoing edges, should return None
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 19.1ms -> 349μs (5363% faster)

def test_large_sparse_graph():
    # 1000 nodes; only the first 10 (N0..N9) have an outgoing edge, each to the next node
    N = 1000
    nodes = [{"id": f"N{i}"} for i in range(N)]
    edges = [{"source": f"N{i}", "target": f"N{i+1}"} for i in range(10)]
    # N10 is the first node without outgoing edges (after those with outgoing edges)
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 6.21μs -> 2.75μs (126% faster)

def test_large_no_edges():
    # 1000 nodes, no edges, should return first node
    N = 1000
    nodes = [{"id": f"N{i}"} for i in range(N)]
    edges = []
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.12μs -> 1.04μs (8.07% faster)

def test_large_multiple_last_nodes():
    # 1000 nodes, only first node has outgoing edge to second
    N = 1000
    nodes = [{"id": f"N{i}"} for i in range(N)]
    edges = [{"source": "N0", "target": "N1"}]
    # N1..N999 have no outgoing edges, N1 comes first
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.92μs -> 1.38μs (39.3% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest  # used for our unit tests
from src.dsa.nodes import find_last_node

# unit tests

# ----------- BASIC TEST CASES -----------

def test_single_node_no_edges():
    # One node, no edges: should return the node itself
    nodes = [{"id": "A"}]
    edges = []
    codeflash_output = find_last_node(nodes, edges) # 1.12μs -> 916ns (22.8% faster)

def test_two_nodes_one_edge():
    # Two nodes, one edge from A to B: last node is B (no outgoing edges)
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A", "target": "B"}]
    codeflash_output = find_last_node(nodes, edges) # 1.67μs -> 1.04μs (59.9% faster)

def test_three_nodes_chain():
    # Chain: A->B->C, last node is C
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [{"source": "A", "target": "B"}, {"source": "B", "target": "C"}]
    codeflash_output = find_last_node(nodes, edges) # 1.96μs -> 1.25μs (56.6% faster)

def test_multiple_last_nodes():
    # Multiple nodes with no outgoing edges: should return the first one found
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [{"source": "A", "target": "B"}]  # C has no edges, B has no outgoing edges
    # B and C are both "last nodes" by definition; function returns first found (B)
    codeflash_output = find_last_node(nodes, edges) # 1.58μs -> 1.12μs (40.7% faster)

# ----------- EDGE TEST CASES -----------

def test_empty_nodes_and_edges():
    # No nodes, no edges: should return None
    nodes = []
    edges = []
    codeflash_output = find_last_node(nodes, edges) # 750ns -> 958ns (21.7% slower)

def test_nodes_with_self_loops():
    # Node with a self-loop: should not be last node
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A", "target": "A"}]
    # B has no outgoing edges, so should be returned
    codeflash_output = find_last_node(nodes, edges) # 1.58μs -> 1.08μs (46.3% faster)

def test_all_nodes_have_outgoing_edges():
    # Every node is a source in at least one edge: should return None
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A", "target": "B"}, {"source": "B", "target": "A"}]
    codeflash_output = find_last_node(nodes, edges) # 1.67μs -> 1.17μs (42.8% faster)

def test_node_with_multiple_outgoing_edges():
    # Node with multiple outgoing edges, only one node with none
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [
        {"source": "A", "target": "B"},
        {"source": "A", "target": "C"},
        {"source": "B", "target": "C"},
    ]
    # C has no outgoing edges
    codeflash_output = find_last_node(nodes, edges) # 2.04μs -> 1.25μs (63.4% faster)

def test_duplicate_node_ids():
    # Duplicate node IDs: function should treat them as separate nodes
    nodes = [{"id": "A"}, {"id": "A"}]
    edges = [{"source": "A", "target": "A"}]
    # Both nodes have outgoing edges, so should return None
    codeflash_output = find_last_node(nodes, edges) # 1.50μs -> 1.08μs (38.5% faster)

def test_edges_with_nonexistent_nodes():
    # Edges refer to node IDs not present in nodes: should ignore those
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A", "target": "C"}, {"source": "D", "target": "B"}]
    # B has no outgoing edges, so should be returned
    codeflash_output = find_last_node(nodes, edges) # 1.67μs -> 1.12μs (48.1% faster)



def test_edge_missing_target():
    # Edge missing 'target' key: should not affect last node determination
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A"}]
    # B has no outgoing edges, so should be returned
    codeflash_output = find_last_node(nodes, edges) # 2.00μs -> 1.17μs (71.4% faster)

def test_nodes_with_non_string_ids():
    # Node IDs are integers
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges) # 1.71μs -> 1.21μs (41.3% faster)

def test_edges_with_non_string_source():
    # Edge sources are integers
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": 1, "target": "A"}]
    # Both nodes have no outgoing edges, returns first
    codeflash_output = find_last_node(nodes, edges) # 1.38μs -> 1.08μs (27.0% faster)

# ----------- LARGE SCALE TEST CASES -----------

def test_large_linear_chain():
    # Large chain of 1000 nodes: last node is node_999
    nodes = [{"id": f"node_{i}"} for i in range(1000)]
    edges = [{"source": f"node_{i}", "target": f"node_{i+1}"} for i in range(999)]
    codeflash_output = find_last_node(nodes, edges) # 20.0ms -> 116μs (17113% faster)

def test_large_star_topology():
    # One central node with edges to 999 others: all others are last nodes
    nodes = [{"id": "center"}] + [{"id": f"leaf_{i}"} for i in range(999)]
    edges = [{"source": "center", "target": f"leaf_{i}"} for i in range(999)]
    # First leaf is returned
    codeflash_output = find_last_node(nodes, edges) # 42.2μs -> 17.6μs (140% faster)

def test_large_no_edges():
    # 1000 nodes, no edges: first node is returned
    nodes = [{"id": f"node_{i}"} for i in range(1000)]
    edges = []
    codeflash_output = find_last_node(nodes, edges) # 1.17μs -> 1.17μs (0.000% faster)

def test_large_all_nodes_have_outgoing_edges():
    # 1000 nodes, each has at least one outgoing edge: should return None
    nodes = [{"id": f"node_{i}"} for i in range(1000)]
    edges = [{"source": f"node_{i}", "target": f"node_{(i+1)%1000}"} for i in range(1000)]
    codeflash_output = find_last_node(nodes, edges) # 20.0ms -> 109μs (18194% faster)

def test_large_multiple_last_nodes():
    # 1000 nodes, only the last 11 (node_989..node_999) have no outgoing edges
    nodes = [{"id": f"node_{i}"} for i in range(1000)]
    edges = [{"source": f"node_{i}", "target": f"node_{i+1}"} for i in range(989)]
    # nodes 989-999 have no outgoing edges, first one found is node_989
    codeflash_output = find_last_node(nodes, edges) # 19.9ms -> 104μs (18857% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
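
The comment above says the captured output of the original code is compared with that of the optimized code. How Codeflash performs that comparison is not shown in this PR; the following is a hypothetical sketch of the idea, with both function bodies reconstructed from the expressions in the explanation and `check_equivalence` and `CASES` introduced only for illustration.

```python
def find_last_node_original(nodes, edges):
    # Assumed original form: comparison-based scan per node.
    return next((n for n in nodes if all(e["source"] != n["id"] for e in edges)), None)


def find_last_node_optimized(nodes, edges):
    # Assumed optimized form: precomputed set of source IDs.
    source_ids = {e["source"] for e in edges}
    return next((n for n in nodes if n["id"] not in source_ids), None)


# A handful of inputs taken from the generated tests above.
CASES = [
    ([], []),
    ([{"id": "A"}], []),
    ([{"id": "A"}, {"id": "B"}], [{"source": "A", "target": "B"}]),
    ([{"id": "A"}, {"id": "B"}],
     [{"source": "A", "target": "B"}, {"source": "B", "target": "A"}]),
    ([{"id": None}, {"id": "A"}], [{"source": None, "target": "A"}]),
    ([{"id": 1}, {"id": 2}], [{"source": 1, "target": 2}]),
]


def check_equivalence(cases):
    """Return the number of cases checked; raise on the first mismatch."""
    for nodes, edges in cases:
        expected = find_last_node_original(nodes, edges)
        actual = find_last_node_optimized(nodes, edges)
        assert actual == expected, f"mismatch for nodes={nodes!r}, edges={edges!r}"
    return len(cases)


print(check_equivalence(CASES))  # 6
```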

To edit these changes, run `git checkout codeflash/optimize-find_last_node-mfhvkfq0` and push.

@codeflash-ai codeflash-ai bot requested a review from KRRT7 September 13, 2025 06:18
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Sep 13, 2025