⚡️ Speed up function gridmake by 304%
#69
📄 304% (3.04x) speedup for gridmake in quantecon/_ce_util.py

⏱️ Runtime: 12.9 milliseconds → 3.18 milliseconds (best of 7 runs)

📝 Explanation and details
The optimized code achieves a 304% speedup by replacing NumPy's high-level array operations (np.tile, np.repeat, np.column_stack) with a Numba JIT-compiled function that uses explicit loops to construct the Cartesian product directly.

Key Optimization
Numba JIT Compilation: The _gridmake2 function is decorated with @njit(cache=True, fastmath=True), which compiles it to native machine code on first call and caches the result on disk. This removes Python interpreter overhead from the inner loops and enables low-level optimizations.

Why This Works:
- Direct Memory Access: Instead of creating temporary arrays (np.tile creates a full tiled copy, np.repeat creates a full repeated copy), the optimized code writes directly into the output array using explicit loops.
- Memory Efficiency: The original approach allocates several intermediate arrays, where m = x1.shape[0] and n = x2.shape[0]:
  - np.tile(x1, x2.shape[0]) creates a copy of size m*n
  - np.repeat(x2, x1.shape[0]) creates another copy of size m*n
  - np.column_stack creates yet another array, of size 2*m*n, to combine them

  The optimized version allocates only the final output array, once.
- Cache Locality: Sequential loop access patterns provide better CPU cache utilization than NumPy's strided views and copies.
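The loop-based construction described above can be sketched roughly as follows. This is not the PR's code: the names _fill_grid, gridmake2_loop and gridmake2_baseline are hypothetical, only 1-D inputs are handled, and the decorator is plain @njit (the PR also passes cache=True and fastmath=True; on-disk caching requires the function to live in a source file). np.result_type is used here in place of the PR's np.empty addition trick for choosing the output dtype. If Numba is not installed, the sketch falls back to plain Python so the loop logic can still be checked against the np.tile/np.repeat/np.column_stack baseline.

```python
import numpy as np

try:
    from numba import njit  # compiles the loop kernel to machine code
except ImportError:
    # Fallback so the sketch still runs (slowly) without Numba installed.
    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda func: func


@njit
def _fill_grid(x1, x2, out):
    # Hypothetical kernel: write the Cartesian product of two 1-D arrays
    # into a preallocated (m*n, 2) array; x1 varies fastest, as with np.tile.
    m = x1.shape[0]
    n = x2.shape[0]
    for j in range(n):
        for i in range(m):
            row = j * m + i  # rows are filled strictly in order
            out[row, 0] = x1[i]
            out[row, 1] = x2[j]
    return out


def gridmake2_loop(x1, x2):
    # Single allocation: only the final output array is created.
    dtype = np.result_type(x1.dtype, x2.dtype)  # e.g. int64 + float64 -> float64
    out = np.empty((x1.shape[0] * x2.shape[0], 2), dtype=dtype)
    return _fill_grid(x1, x2, out)


def gridmake2_baseline(x1, x2):
    # Baseline pattern: two temporaries of length m*n, then a third array.
    return np.column_stack([np.tile(x1, x2.shape[0]), np.repeat(x2, x1.shape[0])])


x1 = np.array([1, 2, 3])
x2 = np.array([10.0, 20.0])
print(gridmake2_loop(x1, x2))
print(np.array_equal(gridmake2_loop(x1, x2), gridmake2_baseline(x1, x2)))  # True
```

Because row = j * m + i increases by one per iteration, the kernel touches the output in memory order, which is the sequential access pattern the Cache Locality point refers to.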
Dtype Promotion Handling: The optimized code correctly handles mixed dtypes (e.g., int + float) by following NumPy's type-promotion rules: temp = np.empty(1, dtype=x1.dtype) + np.empty(1, dtype=x2.dtype) determines the result dtype. np.result_type(x1.dtype, x2.dtype) gives the same dtype without allocating the throwaway arrays.

Performance Characteristics
The speedup is most significant for small-to-medium arrays (10-1000 elements).

Impact on Workloads
Based on function_references, this function is called in hot paths within quadrature node generation (qnwnorm, _make_multidim_func). These are computationally intensive operations used in numerical integration and computational economics, where gridmake is called inside loops to construct multi-dimensional grids.

The optimization is particularly effective for typical use cases in quantitative economics, where grids of 10-100 points per dimension are common.
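Grids with more than two dimensions can be built by folding a pairwise product over the inputs, so the two-array step runs repeatedly when multi-dimensional nodes are generated. The sketch below is a hypothetical stand-in (gridmake_sketch and _pair are illustrative names, not quantecon's API) that uses the NumPy baseline for the pairwise step; it assumes 1-D inputs and that the first array varies fastest.

```python
import numpy as np


def _pair(grid, x):
    # Hypothetical pairwise step: grid is 1-D (m,) or 2-D (m, k); x is 1-D (n,).
    m = grid.shape[0]
    n = x.shape[0]
    left = np.tile(grid.reshape(m, -1), (n, 1))  # (m*n, k): whole grid repeated n times
    right = np.repeat(x, m)[:, None]             # (m*n, 1): each x[j] repeated m times
    return np.hstack([left, right])


def gridmake_sketch(*arrays):
    # Fold the pairwise product left to right: ((a x b) x c) x ...
    grid = arrays[0]
    for x in arrays[1:]:
        grid = _pair(grid, x)
    return grid


# Three dimensions with 10 points each -> 1000 nodes in 3 columns.
nodes = gridmake_sketch(np.linspace(0, 1, 10), np.linspace(0, 1, 10), np.linspace(0, 1, 10))
print(nodes.shape)  # (1000, 3)
```

Each fold step allocates a new grid whose row count is the product of all sizes seen so far, so the cost of the pairwise step compounds as dimensions are added.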
✅ Correctness verification report:
🌀 Generated Regression Tests
To edit these changes, git checkout codeflash/optimize-gridmake-mjvywofc and push.