Performance Guide
Optimization techniques and performance analysis for OULY components.
Overview
This section covers performance optimization with OULY, including:
Optimization Guide - Techniques for maximizing performance
Memory Profiling - Tools and techniques for memory analysis
Parallel Performance - Scaling and threading considerations
Best Practices - Proven patterns for high-performance applications
Performance Philosophy
OULY is designed with performance as a primary goal:
- Zero-Cost Abstractions
Template-based design intended to compile down to machine code with no runtime overhead from the abstraction.
- Cache-Friendly Design
Contiguous storage and predictable access patterns that keep hot data in cache.
- Lock-Free Algorithms
Atomic operations and work-stealing for minimal contention.
- Memory Layout Optimization
Structure of Arrays (SoA) patterns and cache-friendly access.
Optimization Strategies
Memory Access Patterns
Optimize for cache efficiency:
// Poor: Array of Structures (AoS)
struct Particle { float x, y, z, mass; };
std::vector<Particle> particles_aos;
// Better: Structure of Arrays (SoA), reusing the same Particle definition
ouly::soavector<Particle> particles;
// Access individual components (each field is stored as its own array)
auto& x_coords = particles.get<0>(); // x coordinates
auto& y_coords = particles.get<1>(); // y coordinates
// Process arrays with vectorized operations
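To make the layout difference concrete, here is a minimal sketch in standard C++ only. It does not use or describe `ouly::soavector`; the types `ParticleAoS` and `ParticlesSoA` and the `total_mass` functions are hypothetical names chosen for illustration.

```cpp
#include <cstddef>
#include <vector>

// Array of Structures: the fields of each particle are interleaved in memory.
struct ParticleAoS { float x, y, z, mass; };

// Structure of Arrays: each field lives in its own contiguous array.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
    explicit ParticlesSoA(std::size_t n) : x(n), y(n), z(n), mass(n) {}
    std::size_t size() const { return x.size(); }
};

// Needs only mass, but every loaded cache line also carries x, y and z.
float total_mass(const std::vector<ParticleAoS>& particles) {
    float sum = 0.0f;
    for (const auto& p : particles) sum += p.mass;
    return sum;
}

// Reads one dense float array, so every loaded cache line is fully used.
float total_mass(const ParticlesSoA& particles) {
    float sum = 0.0f;
    for (float m : particles.mass) sum += m;
    return sum;
}
```

Both functions compute the same result. The SoA version tends to be faster when a loop touches only one or two fields, because it reads a quarter of the memory in this example and the loop is easier for the compiler to vectorize.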
Memory Allocation Strategies
Choose allocators based on usage patterns:
// Frame-based allocations (games)
ouly::linear_allocator<> frame_allocator(1024 * 1024);
// Fixed-size objects (object pools)
ouly::pool_allocator<> entity_pool(sizeof(Entity), 10000);
// Growing collections (dynamic content)
ouly::linear_arena_allocator<> dynamic_allocator(1024 * 1024);
Parallel Processing Optimization
Optimize task granularity and work distribution:
// Optimal grain size for parallel algorithms
constexpr size_t OPTIMAL_GRAIN_SIZE = 1000; // Tune for your hardware
// Takes a std::span (C++20, <span>) so each half is a view, not a copy of the data
void parallel_process(ouly::task_context& ctx, std::span<float> data) {
    if (data.size() <= OPTIMAL_GRAIN_SIZE) {
        // Process directly
        process_range(data.begin(), data.end());
        return;
    }
    // Split work
    auto left = data.first(data.size() / 2);
    auto right = data.subspan(data.size() / 2);
    // Process halves in parallel
    auto left_future = ouly::async(ctx, ctx.current_workgroup(),
        [left](auto& task_ctx) { parallel_process(task_ctx, left); });
    parallel_process(ctx, right);
    left_future.wait();
}
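The same grain-size pattern can be tried without the OULY scheduler. The sketch below uses `std::async` from the standard library as a stand-in, so it spawns a thread per split rather than using work-stealing; `parallel_sum` and `kGrainSize` are hypothetical names, and the grain size is an assumption to be tuned by measurement.

```cpp
#include <cstddef>
#include <future>
#include <numeric>

constexpr std::size_t kGrainSize = 1000; // assumption: tune per workload and hardware

// Sums [first, last) by recursive splitting.
// Ranges at or below the grain size are processed serially.
double parallel_sum(const float* first, const float* last) {
    const std::size_t count = static_cast<std::size_t>(last - first);
    if (count <= kGrainSize) {
        return std::accumulate(first, last, 0.0);
    }
    const float* mid = first + count / 2;
    // The left half runs on another thread while this thread handles the right half.
    auto left = std::async(std::launch::async, parallel_sum, first, mid);
    const double right = parallel_sum(mid, last);
    return left.get() + right;
}
```

A grain size that is too small makes task overhead dominate, and one that is too large leaves cores idle, so the threshold is best chosen from measurements on the target machine.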
Compilation Optimization
Enable compiler optimizations for maximum performance:
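# CMake optimization flags
# Note: -march=native targets the build machine's CPU, so the binary may not run on older CPUs
set(CMAKE_CXX_FLAGS_RELEASE "-O3 -DNDEBUG -march=native")
# Enable link-time optimization (CMake adds the compiler's LTO flag, e.g. -flto)
set_property(TARGET your_target PROPERTY INTERPROCEDURAL_OPTIMIZATION TRUE)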
Profiling and Analysis
Memory Profiling with Valgrind
# Memory usage analysis
valgrind --tool=massif ./your_program
ms_print massif.out.* > memory_profile.txt
# Cache miss analysis
valgrind --tool=cachegrind ./your_program
cg_annotate cachegrind.out.* > cache_analysis.txt
CPU Profiling with perf
# Profile CPU usage
perf record -g ./your_program
perf report > cpu_profile.txt
# Analyze memory access patterns
perf mem record ./your_program
perf mem report > memory_access.txt
OULY Built-in Profiling
Enable statistics collection for performance analysis:
// Enable allocator statistics
using DebugConfig = ouly::config<
    ouly::cfg::compute_stats, // Basic statistics
    ouly::cfg::track_memory   // Memory tracking
>;
ouly::linear_allocator<DebugConfig> allocator(1024 * 1024);
// ... use allocator ...
// Access statistics if available (implementation-dependent)
// Check allocator documentation for statistics access methods
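Where the built-in statistics are not accessible, similar counters can be gathered independently. The sketch below is one way to do that with the standard `std::pmr::memory_resource` interface (C++17). It is unrelated to OULY's statistics support; `CountingResource` is a hypothetical name, and the counters are not thread-safe.

```cpp
#include <cstddef>
#include <memory_resource>

// Hypothetical wrapper: forwards to an upstream resource and records usage.
class CountingResource : public std::pmr::memory_resource {
public:
    explicit CountingResource(
        std::pmr::memory_resource* upstream = std::pmr::get_default_resource())
        : upstream_(upstream) {}

    std::size_t allocation_count() const { return allocation_count_; }
    std::size_t bytes_in_use() const { return bytes_in_use_; }
    std::size_t peak_bytes() const { return peak_bytes_; }

private:
    void* do_allocate(std::size_t bytes, std::size_t alignment) override {
        void* p = upstream_->allocate(bytes, alignment);
        ++allocation_count_;
        bytes_in_use_ += bytes;
        if (bytes_in_use_ > peak_bytes_) peak_bytes_ = bytes_in_use_;
        return p;
    }

    void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
        upstream_->deallocate(p, bytes, alignment);
        bytes_in_use_ -= bytes;
    }

    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }

    std::pmr::memory_resource* upstream_;
    std::size_t allocation_count_ = 0;
    std::size_t bytes_in_use_ = 0;
    std::size_t peak_bytes_ = 0;
};
```

Passing a pointer to such a resource to a `std::pmr` container reports how many allocations the container makes and its peak footprint, which is often enough to spot unexpected growth.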
Platform-Specific Optimizations
x86_64 Optimizations
// Enable AVX/AVX2 for vectorized operations
#ifdef __AVX2__
// Use OULY's SIMD-optimized containers
struct Position { float x, y, z; };
ouly::soavector<Position> positions;
#endif
// Optimize for specific CPU architectures
#ifdef __INTEL_COMPILER
#pragma intel optimization_level 3
#endif
ARM Optimizations
// NEON SIMD optimizations
#ifdef __ARM_NEON
// ARM-specific optimizations
#endif
// 64-bit ARM (AArch64), which includes Apple Silicon
#ifdef __aarch64__
// 64-bit ARM optimizations
#endif
Scaling Considerations
Thread Scaling
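// Optimal thread count: no more threads than the hardware offers or the workload can keep busy
// hardware_concurrency() may return 0 and the division may round down to 0, so clamp to at least 1
unsigned int optimal_threads = std::max(1u, std::min(
    std::thread::hardware_concurrency(),
    static_cast<unsigned int>(workload_size / MIN_WORK_PER_THREAD)
));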
ouly::scheduler scheduler(optimal_threads);
// Create workgroups BEFORE begin_execution
auto workgroup = scheduler.create_workgroup();
// Begin execution
scheduler.begin_execution();
// ... use scheduler ...
scheduler.end_execution();
scheduler.shutdown();
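The thread-count rule can be isolated into a small function so that its edge cases are easy to check. This is a sketch under assumptions: `choose_thread_count` and `default_thread_count` are hypothetical helpers, not part of OULY, and the minimum work per thread is a value you would determine by measurement.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// hardware_threads is a parameter so the policy can be checked in isolation.
// The result is always at least 1.
unsigned choose_thread_count(std::size_t workload_size,
                             std::size_t min_work_per_thread,
                             unsigned hardware_threads) {
    const unsigned hw = std::max(1u, hardware_threads);
    if (min_work_per_thread == 0) return hw;
    // Number of threads the workload can keep busy, at least one.
    const std::size_t useful =
        std::max<std::size_t>(1, workload_size / min_work_per_thread);
    return static_cast<unsigned>(std::min<std::size_t>(hw, useful));
}

// Convenience wrapper; std::thread::hardware_concurrency() may return 0
// when the value cannot be determined, which the clamp above handles.
unsigned default_thread_count(std::size_t workload_size,
                              std::size_t min_work_per_thread) {
    return choose_thread_count(workload_size, min_work_per_thread,
                               std::thread::hardware_concurrency());
}
```

For example, 4000 items with a minimum of 1000 items per thread yields 4 threads on an 8-thread machine, while 100 items yields 1.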
Memory Scaling
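// Scale allocator sizes based on expected load
// get_available_memory() stands for an application-provided, platform-specific query
size_t memory_budget = get_available_memory() / 5 * 4; // 80% of available, without a floating-point conversion
ouly::linear_arena_allocator<> allocator(memory_budget);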
Manual NUMA Optimization
// OULY scheduler does not have built-in NUMA support
// For NUMA optimization, manually configure:
#include <numa.h> // Linux NUMA API
void setup_numa_optimization() {
    // 1. Set thread affinity manually
    // 2. Allocate memory on appropriate NUMA nodes
    // 3. Partition work based on NUMA topology
    if (numa_available() >= 0) {
        int num_nodes = numa_num_configured_nodes();
        // Configure based on NUMA topology
    }
}
Performance Testing
Performance Testing with External Tools
For comprehensive performance testing, integrate with external benchmarking libraries:
#include <benchmark/benchmark.h>
#include <ouly/ouly.hpp>
static void BM_LinearAllocator(benchmark::State& state) {
    ouly::linear_allocator<> allocator(1024 * 1024);
    for (auto _ : state) {
        void* ptr = allocator.allocate(64);
        benchmark::DoNotOptimize(ptr);
        allocator.deallocate(ptr, 64);
    }
}
BENCHMARK(BM_LinearAllocator);
// Provides main(); omit if you link against benchmark::benchmark_main instead
BENCHMARK_MAIN();
Performance Monitoring in CI
# Add performance tests to CMake (using external benchmark library)
find_package(benchmark REQUIRED)
add_executable(performance_tests performance_tests.cpp)
target_link_libraries(performance_tests PRIVATE ouly::ouly benchmark::benchmark)
# Run performance tests in CI (tests are only generated once enable_testing() has been called)
enable_testing()
add_test(NAME performance_regression_test COMMAND performance_tests)
Common Performance Pitfalls
Memory Allocation Anti-patterns
// AVOID: Frequent small allocations
for (int i = 0; i < 1000000; ++i) {
    auto* ptr = new int(i); // Very expensive
    delete ptr;
}
// BETTER: Use pool allocator
ouly::pool_allocator<> pool(sizeof(int), 1000000);
for (int i = 0; i < 1000000; ++i) {
    auto* ptr = static_cast<int*>(pool.allocate(sizeof(int)));
    new(ptr) int(i);
    // Batch cleanup later
}
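To show why a pool is cheaper than repeated `new`/`delete`, here is a minimal fixed-size free-list pool in standard C++. It is a sketch only and does not reflect how `ouly::pool_allocator` is implemented; `FixedPool` is a hypothetical name, and the sketch is not thread-safe.

```cpp
#include <cstddef>
#include <memory>
#include <new>

// Hypothetical illustration type; not ouly::pool_allocator.
class FixedPool {
public:
    FixedPool(std::size_t block_size, std::size_t block_count)
        : block_size_(round_up(block_size)),
          storage_(std::make_unique<std::byte[]>(block_size_ * block_count)) {
        // Thread every block onto the free list.
        for (std::size_t i = 0; i < block_count; ++i) {
            push(storage_.get() + i * block_size_);
        }
    }

    // O(1): pops the head of the free list, or returns nullptr when empty.
    void* allocate() {
        if (free_list_ == nullptr) return nullptr;
        Node* node = free_list_;
        free_list_ = node->next;
        return node;
    }

    // O(1): pushes the block back onto the free list.
    void deallocate(void* p) { push(p); }

private:
    struct Node { Node* next; };

    // Each block must be able to hold a Node and keep the next block aligned.
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        if (n < sizeof(Node)) n = sizeof(Node);
        return (n + a - 1) / a * a;
    }

    void push(void* p) { free_list_ = new (p) Node{free_list_}; }

    std::size_t block_size_;
    std::unique_ptr<std::byte[]> storage_;
    Node* free_list_ = nullptr;
};
```

One upfront allocation replaces a million calls into the general-purpose heap, and freed blocks are reused immediately, which also keeps the working set compact.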
Container Growth Anti-patterns
// AVOID: Repeated growth
ouly::dynamic_array<int> numbers;
for (int i = 0; i < 1000000; ++i) {
    numbers.push_back(i); // Multiple reallocations
}
// BETTER: Reserve capacity
ouly::dynamic_array<int> numbers;
numbers.reserve(1000000); // Single allocation
for (int i = 0; i < 1000000; ++i) {
    numbers.push_back(i);
}
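The effect of reserving can be observed directly. The sketch below uses `std::vector` as a stand-in, on the unverified assumption that `ouly::dynamic_array` grows in a comparable way; `count_reallocations` is a hypothetical helper.

```cpp
#include <cstddef>
#include <vector>

// Counts how often push_back changes the capacity, i.e. reallocates the storage.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> numbers;
    if (reserve_first) numbers.reserve(n);
    std::size_t reallocations = 0;
    std::size_t capacity = numbers.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        numbers.push_back(static_cast<int>(i));
        if (numbers.capacity() != capacity) {
            ++reallocations;
            capacity = numbers.capacity();
        }
    }
    return reallocations;
}
```

With `reserve_first` set, the function returns 0. Without it, the count depends on the growth factor of the standard library in use, and each reallocation also copies or moves every existing element.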
Threading Anti-patterns
// AVOID: Too many small tasks
for (int i = 0; i < 1000000; ++i) {
scheduler.submit(workgroup, [i]() { process_single_item(i); });
}
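// BETTER: Batch processing
constexpr size_t TOTAL_ITEMS = 1000000;
constexpr size_t BATCH_SIZE = 1000;
for (size_t i = 0; i < TOTAL_ITEMS; i += BATCH_SIZE) {
    // Both std::min arguments are size_t, so this compiles on every platform
    const size_t end = std::min(i + BATCH_SIZE, TOTAL_ITEMS);
    scheduler.submit(workgroup, [i, end]() {
        for (size_t j = i; j < end; ++j) {
            process_single_item(j);
        }
    });
}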
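The batch boundaries can also be computed up front, which keeps the last, partial batch from being mishandled. This is a sketch with a hypothetical helper, `make_batches`; it is independent of the scheduler API.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Splits [0, total) into half-open [begin, end) ranges of at most batch_size items.
// Returns an empty list when total or batch_size is 0.
std::vector<std::pair<std::size_t, std::size_t>>
make_batches(std::size_t total, std::size_t batch_size) {
    std::vector<std::pair<std::size_t, std::size_t>> batches;
    if (batch_size == 0) return batches;
    batches.reserve(total / batch_size + (total % batch_size != 0 ? 1 : 0));
    std::size_t begin = 0;
    while (begin < total) {
        // The final batch is shortened so it never runs past total.
        const std::size_t end = begin + std::min(batch_size, total - begin);
        batches.emplace_back(begin, end);
        begin = end;
    }
    return batches;
}
```

Each returned pair can then be captured by value in one submitted task, so the number of tasks is the number of batches rather than the number of items.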
Continuous Performance Monitoring
Set up automated performance monitoring using external tools:
# GitHub Actions performance monitoring (using external benchmark tools)
name: Performance Monitoring
on: [push, pull_request]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install benchmark library
        run: |
          git clone https://github.com/google/benchmark.git
          cd benchmark
          cmake -B build -DCMAKE_BUILD_TYPE=Release -DBENCHMARK_ENABLE_TESTING=OFF
          cmake --build build
          sudo cmake --install build
      - name: Build performance tests
        run: |
          cmake -B build -DCMAKE_BUILD_TYPE=Release -DOULY_BUILD_TESTS=ON
          cmake --build build
      - name: Run custom benchmarks
        run: ./build/performance_tests --benchmark_format=json > results.json
Use this guide as a starting point for getting the most out of OULY. For performance-critical applications, make profiling and optimization a regular part of your development workflow.