Serpent / AI strategy study / mathematical notes

Serpent AI Strategy Study: From Shortest Paths to Pressure Control

This is not a normal game guide. It is a record of how a small browser game turned into a pathfinding problem, then a resource economy, then a geometric pressure-control problem.


Main claim: Serpent is no longer just a shortest-path variant of Snake. It is a control problem with stochastic resources, geometric pressure, and a hard time constraint.

The first question was: how can the agent eat more food? The final question became: under a 30-second feeding constraint, how can the agent use food, bonuses, walls, and hazards to regulate geometric pressure P=Nq² while keeping score drift positive?

0. Problem redefinition: from a small game to a dynamical system

At first, I thought Serpent was a Snake variant: eat objects, avoid obstacles, survive as long as possible. That description is not wrong, but it is too shallow.

Once Greedy, A*, Lookahead, NoGrow, TailGuard, Depth-2, and MinLength strategies were tested, the game stopped looking like a simple “which AI scores higher” exercise. It exposed a structure: score grows, length grows, size perturbs geometry, walls and hazards trim the body, and timeout forces the system to eat periodically.

So the real question is not “how do we get a high score?” It is: what is the correct long-term control variable in Serpent?

Early objective: maximize score
Middle objective: maximize score subject to survival
Late objective: maximize E[dS/dt] subject to E[dP/dt] ≈ 0, τ < 30s, low death risk

The answer is not simply length, and it is not simply size. The right aggregate variable is:

pressure = length × size²
P(t) = N(t)q(t)²

1. Mathematical abstraction of the game rules

1.1 State variables

Let the current game state be:

X(t) = {N(t), q(t), S(t), τ(t), M(t)}
Symbol | Meaning
N(t) | Snake length, measured in body segments.
q(t) | Snake size or thickness, taking values 1, 2, or 3.
S(t) | Current score.
τ(t) | Time since the last normal food or bonus was eaten.
M(t) | Map state: food, bonus, hazard, obstacles, and the body shape.

The code constants translate directly into constraints: MAX_SIZE = 3, no-progress death at NO_PROGRESS_TIMEOUT = 30000ms, normal food gives +10, a big bonus gives +50, hazards give -20, obstacle score costs are -5/-10/-15, and obstacle segment losses are 1/1/2.
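These constraints fit in a small constants module. MAX_SIZE and NO_PROGRESS_TIMEOUT are the engine's own names as quoted above; the remaining grouping names are mine, added for illustration.

```python
# Engine constants quoted in the text. MAX_SIZE and NO_PROGRESS_TIMEOUT are
# the engine's names; the other identifiers are illustrative labels.
MAX_SIZE = 3                  # q ∈ {1, 2, 3}
NO_PROGRESS_TIMEOUT = 30000   # ms without eating before death

SCORE_FOOD = 10               # normal food
SCORE_BONUS = 50              # big bonus
SCORE_HAZARD = -20            # hazard
WALL_SCORE_COST = {"fragile": 5, "brick": 10, "iron": 15}
WALL_SEGMENT_LOSS = {"fragile": 1, "brick": 1, "iron": 2}
HAZARD_SEGMENT_LOSS = 2
```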

1.2 Action variables

On each tick, the agent directly controls direction:

u(t) ∈ {up, down, left, right}

At the strategy level, however, it is choosing between action types:

Action | Notation | Effect
Eat normal food | F | +10 score, +1 segment, reset τ
Eat big bonus | B | +50 score, +1 segment, size may increase, reset τ
Hit fragile wall | W1 | -5 score, -1 segment, keep at least 1 segment
Hit brick wall | W2 | -10 score, -1 segment, keep at least 1 segment
Hit iron wall | W3 | -15 score, -2 segments, keep at least 1 segment
Eat hazard | H | -20 score, -2 segments, keep at least 1 segment, does not reset τ
Wait / detour | Ø | No score or length change, but consumes timeout margin.
Implementation detail: a hazard can trim length, but it does not refresh lastFoodEaten. So hazards are pressure-relief resources, not survival-clock resources.

2. Basic resource equations

2.1 Length equation

Over a time window, let F be normal-food count, B bonus count, W1/W2/W3 wall-hit counts, and H hazard count. The nominal length change is:

ΔN = F + B - W1 - W2 - 2W3 - 2H

But the game has a crucial lower bound:

N ≥ 1

So the actual trimming is:

actual segment loss = min(penalty loss, N - 1)

This is why a one-segment snake does not get shortened further by walls or hazards. That is not a minor implementation detail; it is one of the reasons small-length steady states can exist.
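The clamped trimming rule can be written as a one-line update; the function name is mine, not the engine's.

```python
def apply_segment_loss(n: int, penalty_loss: int) -> int:
    """Trim penalty_loss segments from a snake of length n,
    but never shrink below the hard lower bound N >= 1."""
    actual_loss = min(penalty_loss, n - 1)
    return n - actual_loss
```

A one-segment snake absorbs any penalty without shrinking, which is exactly what makes the small-length steady states possible.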

2.2 Score equation

ΔS = 10F + 50B - 5W1 - 10W2 - 15W3 - 20H

If obstacle type is treated as a random variable, obstacle probabilities are:

P(W1)=0.55 P(W2)=0.30 P(W3)=0.15

The expected score cost of a random wall hit is:

E[C_wall] = 0.55×5 + 0.30×10 + 0.15×15 = 2.75 + 3 + 2.25 = 8

The expected segment loss of a random wall hit is:

E[L_wall] = 0.55×1 + 0.30×1 + 0.15×2 = 0.55 + 0.30 + 0.30 = 1.15

So the expected cost per trimmed segment is approximately:

8 / 1.15 ≈ 6.96 score / segment

Therefore:

normal food net value ≈ 10 - 6.96 = +3.04
big bonus net value ≈ 50 - 6.96 = +43.04

Key conclusion: as long as the agent can keep trimming away the length it gains, both normal food and big bonuses are positive-return resources.

This is the transition from “avoid obstacles” to “manage a resource account.” A wall is not simply failure; it is paid trimming.
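The whole expected-value account of section 2.2 fits in a few lines of Python; the dictionary names are illustrative.

```python
# Wall-type probabilities, score costs, and segment losses from the text.
P_WALL = {"fragile": 0.55, "brick": 0.30, "iron": 0.15}
SCORE_COST = {"fragile": 5, "brick": 10, "iron": 15}
SEGMENT_LOSS = {"fragile": 1, "brick": 1, "iron": 2}

expected_cost = sum(P_WALL[w] * SCORE_COST[w] for w in P_WALL)    # 8.0
expected_loss = sum(P_WALL[w] * SEGMENT_LOSS[w] for w in P_WALL)  # 1.15
cost_per_segment = expected_cost / expected_loss                  # ≈ 6.96

food_net = 10 - cost_per_segment    # ≈ +3.04
bonus_net = 50 - cost_per_segment   # ≈ +43.04
```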

2.3 Timeout constraint

The snake dies after 30 seconds without eating, so:

τ(t) < 30s

In long-term strategies, food is not only a score source. It is also a survival clock reset. If average food interval is T_food, then:

T_food < 30s

In practice the strategy needs margin:

T_food ≤ 22s–25s

Walls and hazards do not reset τ. The long-run loop must include food or bonus. Walls and hazards can trim pressure, but they cannot keep the game clock alive.
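One way to operationalize the margin is a simple clock check before committing to any non-eating action. The function, its arguments, and the default margin below are illustrative, not engine values.

```python
def timeout_pressure(tau_ms: float, eta_to_food_ms: float,
                     limit_ms: float = 30000,
                     margin_ms: float = 5000) -> bool:
    """True when the survival clock forces the agent to eat now.

    tau_ms: time since the last food or bonus (τ).
    eta_to_food_ms: estimated travel time to the nearest safe food.
    """
    return tau_ms + eta_to_food_ms + margin_ms >= limit_ms
```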

3. Geometric pressure: the real control variable

Early analysis only tracked N = length. That is necessary, but insufficient. In Serpent, each segment occupies more cells as size grows.

Size | Cells per segment
1 | 1×1 = 1
2 | 2×2 = 4
3 | 3×3 = 9

Define approximate geometric pressure:

P = N × q²

A more exact version is:

P_real = |OccupiedCells(snake, q)|

Because thick adjacent segments overlap, Nq² is an upper bound or proxy rather than exact occupied area. But it captures the essential fact: the same length is not the same state under different sizes.
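The proxy itself is trivial to compute:

```python
def pressure(n: int, q: int) -> int:
    """Geometric pressure proxy P = N * q^2.

    An upper bound on occupied cells, since thick adjacent
    segments overlap; good enough as a control variable."""
    return n * q * q
```

For example, a length-5 size-1 snake and a length-2 size-3 snake are very different states: P = 5 versus P = 18.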

3.1 Wall trimming efficiency at different sizes

A random wall hit trims 1.15 segments in expectation. If size is q, expected geometric area removed is:

E[A_trim | q] ≈ 1.15 × q²

Expected score cost remains about 8, so cost per cleared geometric cell is:

Cost_per_cell(q) = 8 / (1.15q²)
Size | Expected area trimmed | Cost per occupied cell
1 | 1.15 | 8 / 1.15 ≈ 6.96
2 | 4.60 | 8 / 4.60 ≈ 1.74
3 | 10.35 | 8 / 10.35 ≈ 0.77

Counterintuitive conclusion: after the snake gets thicker, walls become more efficient geometric pressure valves.

But thickness also makes navigation harder, increases accidental collisions, and reduces reachable space. That is the central tension of Serpent: size improves trimming efficiency while amplifying geometric risk.
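The trimming-efficiency numbers above can be recomputed directly; the defaults are the expected values derived in section 2.2.

```python
def cost_per_cleared_cell(q: int,
                          expected_score_cost: float = 8.0,
                          expected_segment_loss: float = 1.15) -> float:
    """Expected score paid per occupied cell removed by a random wall hit.

    Cost_per_cell(q) = 8 / (1.15 * q^2); falls sharply as size grows."""
    return expected_score_cost / (expected_segment_loss * q * q)
```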

3.2 Steady-state conditions

The mature long-term target is not N ≤ constant. It is:

P = Nq² ≤ P_max
Size | Suggested length cap | Pressure
1 | N ≤ 5 | P ≤ 5
2 | N ≤ 3 | P ≤ 12
3 | N ≤ 2 or 3 | P ≤ 18 or 27

The control objective becomes:

Minimize pressure drift: E[ΔP] ≤ ε
Subject to: E[ΔS] > 0, τ < 30s

4. L1–L17: strategy evolution and implementation details

Each strategy level is not just a version number. It is a problem redefinition. The improvement is not that the code becomes more complicated; it is that the objective function becomes closer to the real structure of the game.

L1–L2

Greedy and A*: shortest-path thinking

baseline / pathfinding

L1 Greedy simply pursues the nearest or highest-return target:

argmin_target distance(head, target) or argmax_target reward(target) / distance(head, target)

It optimizes immediate gain and ignores future reachability, self-collision, pressure, and timeout margin. It is a useful baseline, but not a sustainable strategy.

L2 A* upgrades the problem into real path search:

path* = argmin_path cost(path)
cost(path) = path length
Implementation notes
open_set: priority queue ordered by f(n)=g(n)+h(n)
g(n): actual path length from the head to cell n
h(n): Manhattan distance to the target
blocked: snake body + non-passable obstacles
neighbors: four directions, with wrap-around if the engine allows it

A* is better than Greedy because it respects static blockers. But it still answers only one question: can I reach the target now? It does not answer: after I eat this target, will I still have an exit?

A* replay
Baseline evidence. Useful for showing what pure shortest-path planning misses.
L3a

Lookahead / Chase Tail: safety after eating

reachability-constrained planning

L3a adds the first important safety test: before committing to food, simulate the path to the food, then check whether the new head can still reach the new tail.

simulate(snake, π) → snake'
Safety condition: Reachable(head(snake'), tail(snake')) = true
Implementation notes
def safe_to_eat(game, food):
    path = astar(game.head, food.position, game.blocked_cells)
    if not path:
        return False

    future = simulate(game, path)
    return bfs_reachable(future.head, future.tail, future.blocked_cells)

The tail is a moving exit. If the head can still reach the tail after eating, the agent has at least one dynamic escape route. The limitation is that it still evaluates one food at a time and does not maximize future space.

L3a Lookahead replay
Evidence for the tail-reachability idea.
L4

Economic Decision: a wall is not an obstacle, it is a price

resource-cost optimization

This is the first major conceptual turn. A wall is not an absolute blocker. It is a passable cell with a cost.

Net(target) = Reward(target) - ObstacleCost(path_to_target)

Normal food after a fragile wall is still profitable:

+10 - 5 = +5

A big bonus can remain profitable even after hitting a wall:

+50 - 15 = +35
Implementation notes
def cell_cost(cell):
    if cell.is_empty:
        return 1
    if cell.is_fragile_wall:
        return 1 + wall_weight * 5
    if cell.is_brick_wall:
        return 1 + wall_weight * 10
    if cell.is_iron_wall:
        return 1 + wall_weight * 15
    return INF

The failure mode is also important: the score ledger says bonuses are valuable, but the geometry ledger says size growth can destroy future reachability.

L4 Economic replay
Use this to show wall-as-price behavior, not as a best-score proof.
L4-NoGrow

Why a big bonus can become a liability

score-rush / q=1 meta

NoGrow keeps the useful part of A*/economic planning but refuses the big bonus. The strict version treats bonus cells as forbidden.

q(t) = 1
P = Nq² = N

This simplifies the geometry dramatically. In short-term score-rush mode, preserving mobility can be worth more than the bonus's +50.

Implementation notes
def food_score(food, game):
    if food.type == "big_reward":
        return -INF       # NoGrow-Strict
    return 10 - distance_penalty(food)

The core lesson is not “bonuses are bad.” It is narrower and more precise: in the short-term score-rush meta, size growth can be a negative asset because it increases pressure too quickly.

NoGrow-Strict replay
seed0 reaches 1500. A key stage ceiling.
L5

Space: after eating, how much room remains?

future feasible volume

L5 asks a better question than “can I reach the food?” It asks: after I eat the food, how much reachable space do I still have?

Space(snake') = FloodFillReachableCells(head(snake'), blocked(snake'))
score(path) = reward - obstacle_cost - λ·path_length + μ·Space(after_eat)
Implementation notes
def reachable_area(game, start):
    # Stack-based flood fill; visit order does not matter when we only
    # need the count of reachable cells, so a plain list works as the stack.
    stack = [start]
    visited = {start}
    while stack:
        cell = stack.pop()
        for nxt in neighbors(cell):
            if nxt not in visited and not blocked(nxt):
                visited.add(nxt)
                stack.append(nxt)
    return len(visited)

This is close to preserving a controllable region. It is no longer only about immediate path length; it is about future feasible state volume.

L5 Space replay
seed0 reaches 1500. A strong space-preservation proof case.
L6–L7

TailGuard and Depth-2: the score-rush peak

hard safety constraint + shallow planning

L6 turns tail reachability from a bonus into a hard safety constraint:

maximize Space(after_eat) subject to TailReachable(after_eat)

L7 adds a second visible-food lookahead:

V(s) ≈ R(s,a) + γ max R(s',a')
Implementation notes
def choose_food_tailguard(game):
    candidates = []
    for food in visible_foods(game):
        path = astar(game.head, food.position, game)
        if not path:
            continue
        future = simulate_path(game, path)
        if can_reach_tail(future):
            candidates.append((food, path, evaluate(food, path, future)))
    return best(candidates)

L6/L7 represent the strongest short-term score-rush line: reject bonuses, keep size at 1, eat normal food, and verify that the post-eating ecosystem is still alive.

L6 TailGuard replay
seed0 reaches 2190. The strongest score-rush evidence.
L7 Depth-2 replay
seed0 reaches 1660. Useful for demonstrating shallow two-step planning.
L11–L12

Failed branches: TailStall and CycleCruise

negative evidence

L11 TailStall

Hypothesis: when not eating, follow the tail to stall safely. Failure: local tail reachability is not a sufficient condition for long-term shape health. The body can still fold into a bad topology.

L12 CycleCruise / Hamiltonian

Hypothesis: a fixed Hamiltonian-style cycle prevents self-collision. Failure: Serpent has timeout, random food, dynamic obstacles, hazards, and bonuses. A fixed cruise can be too slow and die by timeout.

Hamiltonian food interval may be O(GRID²)
But Serpent requires: T_food < 30s
L11 TailStall replay
seed0 reaches about 1010, then self-collision. Good failure evidence.
L12 Cycle replay
seed0 reaches 30, then timeout. Good failure evidence.
L8–L17

Long-term control: from slowing a long snake to small-length steady states

endurance-control

The long-control line begins with a simple positive-return loop:

eat 1 food → +10, +1 segment
hit 1 fragile wall → -5, -1 segment
net: +5 score, length unchanged

So in theory:

E[ΔN] = 0
E[ΔS] > 0

L8 tries a fixed target length. L9 relaxes strict equilibrium into small positive drift. L10/L13 hybridize this with safer eating. L14 creates the real turn: instead of trying to stabilize a dangerous long snake, keep the system inside a small-length safe set.

A = {N small}
Keep X(t) ∈ A

L16/L17 then overturn the strong NoGrow assumption: the agent does not need to stay size 1 forever. If pressure is small, even size 3 can remain stable.

A = {q ∈ {1,2,3}, P=Nq² ≤ P_max}
Long-control state machine
if timeout_soon:
    seek_safe_food_or_bonus()
elif pressure_high:
    trim_by_wall_or_hazard()
elif pressure_low:
    take_profitable_food()
else:
    safe_wander_or_follow_tail()
L16 AllSize replay
5000-tick clipped segment. Evidence for the small-thick-snake steady mode.
L17 AllSize A* replay
Long-run steady-state evidence. This is not a full 106-minute replay.

5. Hazard correction

A hazard has this effect:

H = -20 score, -2 segments

So its cost per trimmed segment is:

20 / 2 = 10 score / segment
Trimming resource | Cost per segment
Fragile wall | 5
Random obstacle | 6.96
Iron wall | 7.5
Brick wall | 10
Hazard | 10

Hazards are not high-quality trimming resources. Their value is emergency relief: they cut two segments at once, and they can substitute for walls if no safe wall is nearby.

Boundary: a hazard does not reset the food timer. In L18, it should be emergency pressure relief, not a standard cycle resource.
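The cost-per-segment ranking in the table can be recomputed directly from the penalty numbers; the resource labels are mine.

```python
# Score cost per trimmed segment for each trimming resource.
# Hazard figures ignore the fact that it never resets the food timer.
TRIM_COST_PER_SEGMENT = {
    "fragile_wall": 5 / 1,        # 5.0
    "random_obstacle": 8 / 1.15,  # ≈ 6.96 (expected values, section 2.2)
    "iron_wall": 15 / 2,          # 7.5
    "brick_wall": 10 / 1,         # 10.0
    "hazard": 20 / 2,             # 10.0
}

ranking = sorted(TRIM_COST_PER_SEGMENT, key=TRIM_COST_PER_SEGMENT.get)
# cheapest trimming first: fragile wall, random obstacle, iron wall
```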

6. L18: theoretical design for PressureControl-Profit

The next strategy should not be called MinLength-Profit. A more accurate name is:

L18-PressureControl-Profit

Objective:

maximize E[dS/dt]
subject to:
  P(t) ≤ P_max most of the time
  τ(t) < 30s
  death probability minimized

6.1 State machine

LOW_PRESSURE:
  eat normal food freely
  take a big bonus if its path is safe and reward is worth the pressure shock

MEDIUM_PRESSURE:
  prefer normal food
  avoid big bonus unless timeout pressure is high

HIGH_PRESSURE:
  stop taking rewards
  seek fragile/iron walls for trimming
  use hazard as emergency trimming

TIMEOUT_PRESSURE:
  ignore pressure temporarily; eat the nearest safe food or bonus

RECOVERY_AFTER_BONUS:
  if size just increased, immediately reduce pressure
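The state machine above can be sketched as a classifier over (pressure, clock, recent growth); every threshold below is an illustrative placeholder, not a tuned value from the study.

```python
def classify_state(pressure: float, tau_ms: float, just_grew: bool,
                   p_target: float = 12.0, p_max: float = 18.0,
                   timeout_ms: float = 30000,
                   margin_ms: float = 6000) -> str:
    """Map the current state to one of the 6.1 controller modes.

    Thresholds (p_target, p_max, margin_ms) are illustrative defaults."""
    if tau_ms >= timeout_ms - margin_ms:
        return "TIMEOUT_PRESSURE"      # the survival clock dominates
    if just_grew:
        return "RECOVERY_AFTER_BONUS"  # size shock: reduce pressure first
    if pressure > p_max:
        return "HIGH_PRESSURE"
    if pressure > p_target:
        return "MEDIUM_PRESSURE"
    return "LOW_PRESSURE"
```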

6.2 Per-action value function

Value(action) = immediate_score - λ · P_after - μ · timeout_risk - ν · death_risk - ρ · path_cost + κ · future_reachability

The pressure weight λ should increase with current pressure:

λ = λ0 + λ1·max(0, P - P_target)

The higher pressure gets, the more the controller should value trimming. When pressure is low, it can afford to eat and profit.
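The schedule itself is a one-liner; the λ0 and λ1 defaults are placeholders, not calibrated values.

```python
def pressure_weight(p: float, p_target: float = 12.0,
                    lambda0: float = 0.5, lambda1: float = 0.2) -> float:
    """λ = λ0 + λ1 * max(0, P - P_target).

    Below P_target the controller values eating; above it,
    the weight on trimming grows linearly with excess pressure."""
    return lambda0 + lambda1 * max(0.0, p - p_target)
```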

6.3 Action valuation

Normal food

V(F) = +10 - λ·((N+1)q²)

Big bonus

q<3: V(B) = +50 - λ·((N+1)(q+1)²)
q=3: V(B) = +50 - λ·((N+1)q²)

Wall hit

V(Wi) = -cost_i - path_cost + λ·pressure_reduction_i
pressure_reduction_i ≈ min(loss_i, N-1)q²

Hazard

V(H) = -20 + λ·min(2,N-1)q² - timeout_penalty

A hazard is only worth eating when λ is large: when pressure is high enough that emergency trimming matters more than the score penalty.
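The four valuations can be sketched directly from the formulas above; the function names and zero defaults for path cost and timeout penalty are mine.

```python
def value_food(n: int, q: int, lam: float) -> float:
    """V(F) = +10 - λ * pressure after gaining one segment."""
    return 10 - lam * (n + 1) * q * q

def value_bonus(n: int, q: int, lam: float) -> float:
    """V(B): size may step up (capped at 3), so the pressure shock is larger."""
    q_after = min(q + 1, 3)
    return 50 - lam * (n + 1) * q_after * q_after

def value_wall(n: int, q: int, lam: float, cost: float, loss: int,
               path_cost: float = 0.0) -> float:
    """V(Wi) = -cost_i - path_cost + λ * clamped trimmed area."""
    return -cost - path_cost + lam * min(loss, n - 1) * q * q

def value_hazard(n: int, q: int, lam: float,
                 timeout_penalty: float = 0.0) -> float:
    """V(H) = -20 + λ * trimmed area - timeout penalty (no τ reset)."""
    return -20 + lam * min(2, n - 1) * q * q - timeout_penalty
```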

7. Nine problem redefinitions

Stage | Problem frame | Model | Key change
1 | Snake AI problem | minimize distance to food | Greedy / A*
2 | Safe pathfinding | Reachable(head', tail') = true | Eating must preserve an exit
3 | Economic problem | Net = reward - cost | Walls are paid trimming
4 | NoGrow meta | q=1 fixed | Short-term score-rush rejects bonus
5 | Steady-state control | E[ΔN]=0, E[ΔS]>0 | Wall hits cancel food growth
6 | Low positive drift | 0 ≤ E[ΔN] ≤ ε | Strict cancellation is unnecessary
7 | Small-length safe set | A = {N small} | Do not stabilize a long snake
8 | Pressure control | P = Nq² | Control geometric pressure
9 | Absorb bonus | q ∈ {1,2,3}, P ≤ P_max | Thickness is fine if pressure stays controlled

The real value of this research cycle is that it did not stop at “which AI scored higher.” It kept asking: what is the long-term dynamical system inside this game?

8. What can be claimed, and what cannot

8.1 Weak proposition: Serpent admits long-run positive-return pressure-controlled strategies

Under the following conditions:

  1. The agent can find normal food or a bonus before τ<30s is violated;
  2. The agent can find walls or hazards as trimming resources at finite cost;
  3. The agent controls pressure so that P=Nq² stays near an upper bound;
  4. q≤3, so size is bounded;
  5. Obstacles and hazards do not create a permanent unreachable dead state.

Then simulation evidence and positive expected-value accounting support strategies with:

E[ΔS] > 0
E[ΔP] ≈ 0 or E[ΔP] ≤ ε
survival time very long in simulation

In plain language: as long as the controller can keep bringing geometric pressure back into the safe region, score can keep drifting upward.

8.2 The stronger theorem we cannot claim yet

We cannot honestly claim:

All random seeds survive forever.

The reasons are concrete: map generation is random; food, bonus, and hazard locations can be bad for a while; the agent is not doing perfect dynamic programming; timeout is a hard constraint; size 3 reduces reachability; true occupied cells are more complex than Nq²; and wall-collision accounting may still need calibration.

Careful statement: we have strong simulation evidence and a positive expected-resource account supporting the existence of long-run positive-return pressure-controlled strategies. We do not yet have a formal proof of almost-sure infinite survival across all random environments.

9. Replay citation review

The article does not embed all 22 recordings. It uses selected replays as evidence for specific claims.

Use in article | Replay | Judgment
baseline | astar_seed0.json | Not a high-score run, but useful for showing shortest-path limitations.
wall economy | economic_seed0.json | Shows walls as passable priced cells; not used as best-score evidence.
NoGrow stage ceiling | no_grow_strict_seed0.json | 1500. Core evidence.
space scoring | l5_space_seed0.json | 1500. Good article evidence.
score-rush peak | l6_tail_guard_seed0.json | 2190. The most important score-rush replay.
Depth-2 planning | l7_depth2_seed0.json | 1660. Useful for showing shallow two-step planning.
failed branches | l11_tail_stall_seed0.json / l12_cycle_seed0.json | Accurate negative evidence.
pressure steady state | l16_min_length_all_size_seed0.json / l17_endurance_seed1.json | Used as steady-state segments, not full-game records.

The remaining recordings stay in the replay library as appendices. Long-run or clipped recordings such as L8/L9/L10/L13 should not be cited as final-score proof; they are better used as supplementary strategy evidence.