Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

0x08-e Performance Profiling & Optimization

🇺🇸 English    |    🇨🇳 中文

🇺🇸 English

📦 Code Changes: View Diff

Background: After introducing Cancel, execution time exploded from ~30s to 7+ minutes. We need to identify and fix the issue.

Goal:

  1. Establish architecture-level profiling to pinpoint bottlenecks.
  2. Fix the identified O(N) issues.
  3. Verify improvements with data.

1. Symptoms

Performance collapsed after adding Cancel:

  • Execution Time: ~30s → 7+ minutes
  • Throughput: ~34k ops/s → ~3k ops/s

Hypothesis:

  • Is it the O(N) Cancel scan?
  • VecDeque removal overhead?
  • Something else?

Hypothesis implies guessing. Profiling provides facts.


2. Optimization 1: Order Index

2.1 The Problem

Cancelling requires looking up an order. The naive remove_order_by_id iterates the entire book:

#![allow(unused)]
fn main() {
// Before: O(N) full scan
pub fn remove_order_by_id(&mut self, order_id: u64) -> Option<InternalOrder> {
    for (key, orders) in self.bids.iter_mut() {
        if let Some(pos) = orders.iter().position(|o| o.order_id == order_id) {
            // ...
        }
    }
    // Scan asks...
}
}

2.2 The Solution

Introduce order_index: FxHashMap<OrderId, (Price, Side)> for O(1) lookup.

#![allow(unused)]
fn main() {
pub struct OrderBook {
    asks: BTreeMap<u64, VecDeque<InternalOrder>>,
    bids: BTreeMap<u64, VecDeque<InternalOrder>>,
    order_index: FxHashMap<u64, (u64, Side)>,  // New
    trade_id_counter: u64,
}
}

2.3 Index Maintenance

OperationAction
rest_order()Insert
cancel_order()Remove
remove_order_by_id()Remove
Match FillRemove

2.4 Optimized Implementation

#![allow(unused)]
fn main() {
pub fn remove_order_by_id(&mut self, order_id: u64) -> Option<InternalOrder> {
    // O(1) Lookup
    let (price, side) = self.order_index.remove(&order_id)?;
    
    // O(log n) Find level
    let (book, key) = match side {
        Side::Buy => (&mut self.bids, u64::MAX - price),
        Side::Sell => (&mut self.asks, price),
    };
    
    // O(k) Find in level (k is small)
    let orders = book.get_mut(&key)?;
    let pos = orders.iter().position(|o| o.order_id == order_id)?;
    let order = orders.remove(pos)?;
    
    if orders.is_empty() {
        book.remove(&key);
    }
    
    Some(order)
}
}

2.5 Result 1

MetricBeforeAfter
Time7+ min87s
Throughput~3k ops/s15k ops/s
Boost-5x

Huge improvement! But 87s for 1.3M orders is still slow (15k ops/s). Further analysis is needed.


3. Architecture Profiling

3.1 Design

Measure time at architectural stages:

Order Input
    │
    ▼
┌─────────────────┐
│  1. Pre-Trade   │  ← UBSCore: WAL + Balance Lock
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  2. Matching    │  ← Pure ME: process_order
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  3. Settlement  │  ← UBSCore: settle_trade
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  4. Event Log   │  ← Ledger writes
└─────────────────┘

3.2 PerfMetrics

#![allow(unused)]
fn main() {
pub struct PerfMetrics {
    pub total_pretrade_ns: u64,    // UBSCore WAL + Lock
    pub total_matching_ns: u64,    // Match processing
    pub total_settlement_ns: u64,  // Balance updates
    pub total_event_log_ns: u64,   // Ledger I/O
    
    pub place_count: u64,
    pub cancel_count: u64,
}
}

4. Optimization 2: Matching Engine

4.1 Bottleneck Identification

Profiling revealed Matching Engine used 96% of time. Deep dive found:

#![allow(unused)]
fn main() {
// Problem: Copy ALL price keys on every match
let prices: Vec<u64> = book.asks().keys().copied().collect();
}

With 250k+ price levels in the Cancel test, copying keys O(P) + Alloc every match is disastrous.

4.2 Solution

Use BTreeMap::range() to iterate only relevant prices.

#![allow(unused)]
fn main() {
// Solution: Iterate only valid price range
let max_price = if buy_order.order_type == OrderType::Limit {
    buy_order.price
} else {
    u64::MAX
};
let prices: Vec<u64> = book.asks().range(..=max_price).map(|(&k, _)| k).collect();
}

5. Final Results

5.1 Environment

  • Dataset: 1.3M Orders (1M Place + 300k Cancel)
  • HW: MacBook Pro M1

5.2 Breakdown

=== Performance Breakdown ===
Orders: 1300000, Trades: 538487

1. Pre-Trade:        621.97ms (  3.5%)  [  0.48 µs/order]
2. Matching:       15014.08ms ( 84.0%)  [ 15.01 µs/order]
3. Settlement:        21.57ms (  0.1%)  [  0.04 µs/trade]
4. Event Log:       2206.71ms ( 12.4%)  [  1.70 µs/order]

Total Tracked:     17864.33ms

5.3 Improvements

StageLatency BeforeLatency AfterGain
Matching83.53 µs/order15.01 µs/order5.6x
Cancel LookupO(N)0.29 µs-

6. Comparison Table

VersionTimeThroughputGain
Before optimization7+ min~3k ops/s-
Order Index87s15k ops/s5x
+ BTreeMap range18s72k ops/s24x

7. Summary

7.1 Achievements

OptimizationProblemSolutionResult
Order IndexO(N) CancelFxHashMap0.29 µs
Range QueryFull key copyrange()83→15 µs

7.2 Final Design Pattern

┌─────────────────────────────────────────────────────────┐
│                     OrderBook                           │
│  ┌─────────────────┐    ┌─────────────────────────────┐ │
│  │   order_index   │◄───│  Sync on: rest, cancel,     │ │
│  │ FxHashMap<id,   │    │           match, remove     │ │
│  │   (price,side)> │    └─────────────────────────────┘ │
│  └────────┬────────┘                                    │
│           │ O(1) lookup                                 │
│           ▼                                             │
│  ┌─────────────────┐    ┌─────────────────────────────┐ │
│  │      bids       │    │          asks               │ │
│  │ BTreeMap<price, │    │  BTreeMap<price,            │ │
│  │   VecDeque>     │    │    VecDeque>                │ │
│  │  + range()      │    │    + range()                │ │
│  └─────────────────┘    └─────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘

Optimization Conclusion: From 7 minutes to 18 seconds. 24x boost. 🚀




🇨🇳 中文

📦 代码变更: 查看 Diff

背景:引入 Cancel 功能后,执行时间从 ~30s 暴涨到 7+ 分钟,需要定位并解决问题。

本章目的

  1. 建立正确的架构级 Profiling 方法
  2. 通过 Profiling 精确定位性能瓶颈
  3. 针对性修复发现的问题

关键点:直觉可以指导方向,但必须用 Profiling 数据验证。


1. 问题现象

引入 Cancel 后性能急剧下降:

  • 执行时间:~30s → 7+ 分钟
  • 吞吐量:~34k ops/s → ~3k ops/s

初始假设可能的原因:

  • Cancel 的 O(N) 查找?
  • VecDeque 删除开销?
  • 其他未知问题?

但在 Profile 之前,这些都只是猜测。


2. Order Index 优化(第一次优化)

2.1 问题

撤单操作需要在 OrderBook 中查找订单。原始实现 remove_order_by_id 需要遍历整个订单簿:

#![allow(unused)]
fn main() {
// 优化前:O(N) 全表扫描
pub fn remove_order_by_id(&mut self, order_id: u64) -> Option<InternalOrder> {
    for (key, orders) in self.bids.iter_mut() {
        if let Some(pos) = orders.iter().position(|o| o.order_id == order_id) {
            // ...
        }
    }
    // 再遍历 asks...
}
}

2.2 解决方案

引入 order_index: FxHashMap<OrderId, (Price, Side)> 实现 O(1) 查找:

#![allow(unused)]
fn main() {
pub struct OrderBook {
    asks: BTreeMap<u64, VecDeque<InternalOrder>>,
    bids: BTreeMap<u64, VecDeque<InternalOrder>>,
    order_index: FxHashMap<u64, (u64, Side)>,  // 新增
    trade_id_counter: u64,
}
}

2.3 索引维护

操作索引动作
rest_order()插入
cancel_order()移除
remove_order_by_id()移除
撮合成交移除

2.4 优化后实现

#![allow(unused)]
fn main() {
pub fn remove_order_by_id(&mut self, order_id: u64) -> Option<InternalOrder> {
    // O(1) 查找
    let (price, side) = self.order_index.remove(&order_id)?;
    
    // O(log n) 定位价格层级
    let (book, key) = match side {
        Side::Buy => (&mut self.bids, u64::MAX - price),
        Side::Sell => (&mut self.asks, price),
    };
    
    // O(k) 在价格层级内查找 (k 通常很小)
    let orders = book.get_mut(&key)?;
    let pos = orders.iter().position(|o| o.order_id == order_id)?;
    let order = orders.remove(pos)?;
    
    if orders.is_empty() {
        book.remove(&key);
    }
    
    Some(order)
}
}

2.5 第一次优化结果

指标优化前优化后
执行时间7+ 分钟87s
吞吐量~3k ops/s15k ops/s
提升-5x

提升巨大! 但 87s 处理 130万订单仍然很慢。需要继续分析。


3. 架构级 Profiling(定位真正瓶颈)

3.1 Profiling 设计

按照订单生命周期的顶层架构分阶段计时:

Order Input
    │
    ▼
┌─────────────────┐
│  1. Pre-Trade   │  ← UBSCore: WAL + Balance Lock
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  2. Matching    │  ← Pure ME: process_order
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  3. Settlement  │  ← UBSCore: settle_trade
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  4. Event Log   │  ← Ledger writes
└─────────────────┘

3.2 PerfMetrics 设计

#![allow(unused)]
fn main() {
pub struct PerfMetrics {
    // 顶层架构计时
    pub total_pretrade_ns: u64,    // UBSCore WAL + Lock
    pub total_matching_ns: u64,    // Pure ME
    pub total_settlement_ns: u64,  // Balance updates
    pub total_event_log_ns: u64,   // Ledger writes
    
    // 操作计数
    pub place_count: u64,
    pub cancel_count: u64,
}
}

4. Matching Engine 优化(第二次优化)

4.1 问题定位

通过架构级 Profiling 发现 Matching Engine 占用 96% 时间。深入分析发现:

#![allow(unused)]
fn main() {
// 问题代码:每次 match 都复制所有价格 keys
let prices: Vec<u64> = book.asks().keys().copied().collect();
}

当订单簿有 25万+ 价格层级时,每次 match 都要:

  1. 遍历整个 BTreeMap 收集 keys - O(P)
  2. 分配 Vec 存储 - 内存分配开销
  3. 再遍历 Vec 进行匹配

4.2 优化方案

使用 BTreeMap::range() 只收集匹配范围内的 keys:

#![allow(unused)]
fn main() {
// 优化后:只收集匹配价格范围内的 keys
let max_price = if buy_order.order_type == OrderType::Limit {
    buy_order.price
} else {
    u64::MAX
};
let prices: Vec<u64> = book.asks().range(..=max_price).map(|(&k, _)| k).collect();
}

5. 性能测试结果

5.1 测试环境

  • 数据集:130万订单(100万 Place + 30万 Cancel)
  • 机器:MacBook Pro M1

5.2 最终 Breakdown

=== Performance Breakdown ===
Orders: 1300000 (Place: 1000000, Cancel: 300000), Trades: 538487

1. Pre-Trade:        621.97ms (  3.5%)  [  0.48 µs/order]
2. Matching:       15014.08ms ( 84.0%)  [ 15.01 µs/order]
3. Settlement:        21.57ms (  0.1%)  [  0.04 µs/trade]
4. Event Log:       2206.71ms ( 12.4%)  [  1.70 µs/order]

Total Tracked:     17864.33ms

5.3 优化效果

阶段优化前优化后提升
Matching83.53 µs/order15.01 µs/order5.6x
Cancel LookupO(N)0.29 µs-

6. 执行性能对比

版本执行时间吞吐量改进
优化前 (O(N) 撤单 + 全量 keys)7+ 分钟~3k ops/s-
Order Index 优化87s15k ops/s5x
+ BTreeMap range query18s72k ops/s24x

7. 总结

7.1 优化成果

优化问题解决方案效果
Order IndexO(N) 撤单查找FxHashMap 索引0.29 µs/cancel
BTreeMap range全量 keys 复制range() 范围查询83→15 µs/order

7.2 最终设计模式

┌─────────────────────────────────────────────────────────┐
│                     OrderBook                           │
│  ┌─────────────────┐    ┌─────────────────────────────┐ │
│  │   order_index   │◄───│  Sync on: rest, cancel,     │ │
│  │ FxHashMap<id,   │    │           match, remove     │ │
│  │   (price,side)> │    └─────────────────────────────┘ │
│  └────────┬────────┘                                    │
│           │ O(1) lookup                                 │
│           ▼                                             │
│  ┌─────────────────┐    ┌─────────────────────────────┐ │
│  │      bids       │    │          asks               │ │
│  │ BTreeMap<price, │    │  BTreeMap<price,            │ │
│  │   VecDeque>     │    │    VecDeque>                │ │
│  │  + range()      │    │    + range()                │ │
│  └─────────────────┘    └─────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘

本次优化先到此为止!从 7 分钟到 18 秒,吞吐量提升 24 倍! 🚀