enfinity7B committed on
Commit c298aec · verified · 1 Parent(s): e382dc5

Fix ref_count leak + add results/admission_reserve_2025-04-21.json

results/admission_reserve_2025-04-21.json ADDED
@@ -0,0 +1,72 @@
+ {
+   "experiment": "admission_reserve_sweep",
+   "date": "2025-04-21",
+   "description": "Admission reserve analysis - reserving blocks for burst absorption",
+
+   "finding_1_constrained": {
+     "setup": {
+       "num_requests": 200,
+       "num_blocks": 128,
+       "workload": "prefix-sharing",
+       "system_prompts": 5,
+       "system_prompt_length": 256
+     },
+     "results": {
+       "0pct_reserve": {
+         "throughput_rps": 5.8,
+         "p99_ttft_ms": 33159,
+         "cache_hit_rate": 0.1643,
+         "preemptions": 118
+       },
+       "10pct_reserve": {
+         "throughput_rps": 8.8,
+         "p99_ttft_ms": 21591,
+         "cache_hit_rate": 0.3292,
+         "preemptions": 118,
+         "throughput_improvement_pct": 52,
+         "ttft_improvement_pct": 35
+       }
+     },
+     "insight": "Under high memory pressure, 10% reserve IMPROVES throughput by 52% and reduces p99 TTFT by 35%. Fewer blocks paradoxically helps because it reduces contention and eviction churn."
+   },
+
+   "finding_2_underloaded": {
+     "setup": {
+       "num_requests": 100,
+       "num_blocks": 64,
+       "workload": "prefix-sharing-small"
+     },
+     "results": {
+       "0pct_reserve": {
+         "throughput_rps": 48.5,
+         "p99_ttft_ms": 27,
+         "cache_hit_rate": 0.9918,
+         "preemptions": 0
+       },
+       "10pct_reserve": {
+         "throughput_rps": 23.4,
+         "p99_ttft_ms": 3440,
+         "cache_hit_rate": 0.4283,
+         "preemptions": 31
+       }
+     },
+     "insight": "Under light load, any reserve is harmful. With no memory pressure, reducing blocks just reduces cache capacity without benefit."
+   },
+
+   "key_insights": [
+     "Admission reserve is a CONDITIONAL improvement - only helps at the right operating point",
+     "Under high memory pressure: small reserve (10%) significantly improves throughput and latency",
+     "Under light load: any reserve is harmful (wastes cache capacity)",
+     "The optimal reserve ratio depends on: arrival rate, memory pressure, prefix sharing degree",
+     "APAC needs to ADAPT the reserve dynamically based on observed conditions",
+     "This validates the core thesis: adaptive policy beats any single static setting"
+   ],
+
+   "implications_for_controller": [
+     "Controller should monitor memory pressure (KV usage, eviction rate)",
+     "When pressure is HIGH: activate admission reserve (5-15%)",
+     "When pressure is LOW: zero reserve, maximize cache",
+     "The transition point is an online estimation problem",
+     "Hysteresis needed to avoid oscillation between states"
+   ]
+ }
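
The controller implications listed in this file (reserve on under HIGH pressure, off under LOW, with hysteresis between the two) can be sketched as a two-state switch. This is a minimal illustration and not code from this commit: `ReserveController`, `reserved_blocks`, the 0.90 / 0.70 thresholds, and the round-down rule are all hypothetical, and the sketch watches only KV usage, leaving eviction rate and the online estimation of the transition point aside. The 10% reserve is the one non-zero setting measured above.

```python
from dataclasses import dataclass


@dataclass
class ReserveController:
    """Two-state admission-reserve switch with hysteresis (illustrative only)."""

    high_threshold: float = 0.90  # assumed: turn the reserve on at or above this KV usage
    low_threshold: float = 0.70   # assumed: turn the reserve off at or below this KV usage
    active_reserve: float = 0.10  # the only non-zero reserve measured in the results file
    reserve_active: bool = False

    def __post_init__(self) -> None:
        if not 0.0 <= self.low_threshold < self.high_threshold <= 1.0:
            raise ValueError("need 0 <= low_threshold < high_threshold <= 1")

    def update(self, kv_usage: float) -> float:
        """Take one KV-usage sample in [0, 1]; return the reserve fraction to apply."""
        if not self.reserve_active and kv_usage >= self.high_threshold:
            self.reserve_active = True
        elif self.reserve_active and kv_usage <= self.low_threshold:
            self.reserve_active = False
        # Between the two thresholds the previous state is kept (the hysteresis band).
        return self.active_reserve if self.reserve_active else 0.0


def reserved_blocks(num_blocks: int, reserve_fraction: float) -> int:
    """Whole blocks withheld from admission; rounding down is an assumption."""
    return int(num_blocks * reserve_fraction)


controller = ReserveController()
samples = [0.50, 0.92, 0.80, 0.75, 0.69, 0.80]
decisions = [controller.update(u) for u in samples]
print(decisions)  # [0.0, 0.1, 0.1, 0.1, 0.0, 0.0]
```

The two samples at 0.80 get different decisions depending on the state the controller was already in, which is the oscillation guard the last implication asks for. With a single threshold, usage hovering near that value would flip the reserve on and off on every sample.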