Serpent / AI strategy study / mathematical notes

Serpent AI Strategy Study: From Shortest Paths to Pressure Control

This is not a normal game guide. It is a record of how a small browser game turned into a pathfinding problem, then a resource economy, then a geometric pressure-control problem.


Main claim: Serpent is no longer just a shortest-path variant of Snake. It is a control problem with stochastic resources, geometric pressure, and a hard time constraint.

The first question was: how can the agent eat more food? The final question became: under a 30-second feeding constraint, how can the agent use food, bonuses, walls, and hazards to regulate geometric pressure P=Nq² while keeping score drift positive?

0. Problem redefinition: from a small game to a dynamical system

At first, I thought Serpent was a Snake variant: eat objects, avoid obstacles, survive as long as possible. That description is not wrong, but it is too shallow.

Once Greedy, A*, Lookahead, NoGrow, TailGuard, Depth-2, and MinLength strategies were tested, the game stopped looking like a simple “which AI scores higher” exercise. It exposed a structure: score grows, length grows, size perturbs geometry, walls and hazards trim the body, and timeout forces the system to eat periodically.

So the real question is not “how do we get a high score?” It is: what is the correct long-term control variable in Serpent?

Early objective: maximize score
Middle objective: maximize score subject to survival
Late objective: maximize E[dS/dt] subject to E[dP/dt] ≈ 0, τ < 30s, low death risk

The answer is not simply length, and it is not simply size. The right aggregate variable is:

pressure = length × size²
P(t) = N(t)q(t)²

1. Mathematical abstraction of the game rules

1.1 State variables

Let the current game state be:

X(t) = {N(t), q(t), S(t), τ(t), M(t)}
Symbol | Meaning
N(t) | Snake length, measured in body segments.
q(t) | Snake size or thickness, taking values 1, 2, or 3.
S(t) | Current score.
τ(t) | Time since the last normal food or bonus was eaten.
M(t) | Map state: food, bonus, hazard, obstacles, and the body shape.

The code constants translate directly into constraints: MAX_SIZE = 3, no-progress death at NO_PROGRESS_TIMEOUT = 30000ms, normal food gives +10, a big bonus gives +50, hazards give -20, obstacle score costs are -5/-10/-15, and obstacle segment losses are 1/1/2.
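These constraints fit in a small constants module. MAX_SIZE and NO_PROGRESS_TIMEOUT are the engine's own names as quoted above; the remaining grouping names are mine, added for illustration.

```python
# Engine constants quoted in the text. MAX_SIZE and NO_PROGRESS_TIMEOUT are
# the engine's names; the other identifiers are illustrative labels.
MAX_SIZE = 3                  # q ∈ {1, 2, 3}
NO_PROGRESS_TIMEOUT = 30000   # ms without eating before death

SCORE_FOOD = 10               # normal food
SCORE_BONUS = 50              # big bonus
SCORE_HAZARD = -20            # hazard
WALL_SCORE_COST = {"fragile": 5, "brick": 10, "iron": 15}
WALL_SEGMENT_LOSS = {"fragile": 1, "brick": 1, "iron": 2}
HAZARD_SEGMENT_LOSS = 2
```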

1.2 Action variables

On each tick, the agent directly controls direction:

u(t) ∈ {up, down, left, right}

At the strategy level, however, it is choosing between action types:

Action | Notation | Effect
Eat normal food | F | +10 score, +1 segment, reset τ
Eat big bonus | B | +50 score, +1 segment, size may increase, reset τ
Hit fragile wall | W1 | -5 score, -1 segment, keep at least 1 segment
Hit brick wall | W2 | -10 score, -1 segment, keep at least 1 segment
Hit iron wall | W3 | -15 score, -2 segments, keep at least 1 segment
Eat hazard | H | -20 score, -2 segments, keep at least 1 segment, does not reset τ
Wait / detour | Ø | No score or length change, but consumes timeout margin.
Implementation detail: a hazard can trim length, but it does not refresh lastFoodEaten. So hazards are pressure-relief resources, not survival-clock resources.

2. Basic resource equations

2.1 Length equation

Over a time window, let F be normal-food count, B bonus count, W1/W2/W3 wall-hit counts, and H hazard count. The nominal length change is:

ΔN = F + B - W1 - W2 - 2W3 - 2H

But the game has a crucial lower bound:

N ≥ 1

So the actual trimming is:

actual segment loss = min(penalty loss, N - 1)

This is why a one-segment snake does not get shortened further by walls or hazards. That is not a minor implementation detail; it is one of the reasons small-length steady states can exist.
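The clamped trimming rule can be written as a one-line update; the function name is mine, not the engine's.

```python
def apply_segment_loss(n: int, penalty_loss: int) -> int:
    """Trim penalty_loss segments from a snake of length n,
    but never shrink below the hard lower bound N >= 1."""
    actual_loss = min(penalty_loss, n - 1)
    return n - actual_loss
```

A one-segment snake absorbs any penalty without shrinking, which is exactly what makes the small-length steady states possible.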

2.2 Score equation

ΔS = 10F + 50B - 5W1 - 10W2 - 15W3 - 20H

If obstacle type is treated as a random variable, obstacle probabilities are:

P(W1)=0.55 P(W2)=0.30 P(W3)=0.15

The expected score cost of a random wall hit is:

E[C_wall] = 0.55×5 + 0.30×10 + 0.15×15 = 2.75 + 3 + 2.25 = 8

The expected segment loss of a random wall hit is:

E[L_wall] = 0.55×1 + 0.30×1 + 0.15×2 = 0.55 + 0.30 + 0.30 = 1.15

So the expected cost per trimmed segment is approximately:

8 / 1.15 ≈ 6.96 score / segment

Therefore:

normal food net value ≈ 10 - 6.96 = +3.04
big bonus net value ≈ 50 - 6.96 = +43.04

Key conclusion: as long as the agent can keep trimming away the length it gains, both normal food and big bonuses are positive-return resources.

This is the transition from “avoid obstacles” to “manage a resource account.” A wall is not simply failure; it is paid trimming.
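The whole expected-value account of section 2.2 fits in a few lines of Python; the dictionary names are illustrative.

```python
# Wall-type probabilities, score costs, and segment losses from the text.
P_WALL = {"fragile": 0.55, "brick": 0.30, "iron": 0.15}
SCORE_COST = {"fragile": 5, "brick": 10, "iron": 15}
SEGMENT_LOSS = {"fragile": 1, "brick": 1, "iron": 2}

expected_cost = sum(P_WALL[w] * SCORE_COST[w] for w in P_WALL)    # 8.0
expected_loss = sum(P_WALL[w] * SEGMENT_LOSS[w] for w in P_WALL)  # 1.15
cost_per_segment = expected_cost / expected_loss                  # ≈ 6.96

food_net = 10 - cost_per_segment    # ≈ +3.04
bonus_net = 50 - cost_per_segment   # ≈ +43.04
```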

2.3 Timeout constraint

The snake dies after 30 seconds without eating, so:

τ(t) < 30s

In long-term strategies, food is not only a score source. It is also a survival clock reset. If average food interval is T_food, then:

T_food < 30s

In practice the strategy needs margin:

T_food ≤ 22s–25s

Walls and hazards do not reset τ. The long-run loop must include food or bonus. Walls and hazards can trim pressure, but they cannot keep the game clock alive.
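One way to operationalize the margin is a simple clock check before committing to any non-eating action. The function, its arguments, and the default margin below are illustrative, not engine values.

```python
def timeout_pressure(tau_ms: float, eta_to_food_ms: float,
                     limit_ms: float = 30000,
                     margin_ms: float = 5000) -> bool:
    """True when the survival clock forces the agent to eat now.

    tau_ms: time since the last food or bonus (τ).
    eta_to_food_ms: estimated travel time to the nearest safe food.
    """
    return tau_ms + eta_to_food_ms + margin_ms >= limit_ms
```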

3. Geometric pressure: the real control variable

Early analysis only tracked N = length. That is necessary, but insufficient. In Serpent, each segment occupies more cells as size grows.

Size | Cells per segment
1 | 1×1 = 1
2 | 2×2 = 4
3 | 3×3 = 9

Define approximate geometric pressure:

P = N × q²

A more exact version is:

P_real = |OccupiedCells(snake, q)|

Because thick adjacent segments overlap, Nq² is an upper bound or proxy rather than exact occupied area. But it captures the essential fact: the same length is not the same state under different sizes.
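The proxy itself is trivial to compute:

```python
def pressure(n: int, q: int) -> int:
    """Geometric pressure proxy P = N * q^2.

    An upper bound on occupied cells, since thick adjacent
    segments overlap; good enough as a control variable."""
    return n * q * q
```

For example, a length-5 size-1 snake and a length-2 size-3 snake are very different states: P = 5 versus P = 18.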

3.1 Wall trimming efficiency at different sizes

A random wall hit trims 1.15 segments in expectation. If size is q, expected geometric area removed is:

E[A_trim | q] ≈ 1.15 × q²

Expected score cost remains about 8, so cost per cleared geometric cell is:

Cost_per_cell(q) = 8 / (1.15q²)
Size | Expected area trimmed | Cost per occupied cell
1 | 1.15 | 8 / 1.15 ≈ 6.96
2 | 4.60 | 8 / 4.60 ≈ 1.74
3 | 10.35 | 8 / 10.35 ≈ 0.77

Counterintuitive conclusion: after the snake gets thicker, walls become more efficient geometric pressure valves.

But thickness also makes navigation harder, increases accidental collisions, and reduces reachable space. That is the central tension of Serpent: size improves trimming efficiency while amplifying geometric risk.
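The trimming-efficiency numbers above can be recomputed directly; the defaults are the expected values derived in section 2.2.

```python
def cost_per_cleared_cell(q: int,
                          expected_score_cost: float = 8.0,
                          expected_segment_loss: float = 1.15) -> float:
    """Expected score paid per occupied cell removed by a random wall hit.

    Cost_per_cell(q) = 8 / (1.15 * q^2); falls sharply as size grows."""
    return expected_score_cost / (expected_segment_loss * q * q)
```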

3.2 Steady-state conditions

The mature long-term target is not N ≤ constant. It is:

P = Nq² ≤ P_max
Size | Suggested length cap | Pressure
1 | N ≤ 5 | P ≤ 5
2 | N ≤ 3 | P ≤ 12
3 | N ≤ 2 or 3 | P ≤ 18 or 27

The control objective becomes:

Minimize pressure drift: E[ΔP] ≤ ε
Subject to: E[ΔS] > 0, τ < 30s

4. L1–L17: strategy evolution and implementation details

Each strategy level is not just a version number. It is a problem redefinition. The improvement is not that the code becomes more complicated; it is that the objective function becomes closer to the real structure of the game.

L1–L2

Greedy and A*: shortest-path thinking

baseline / pathfinding

L1 Greedy simply pursues the nearest or highest-return target:

argmin_target distance(head, target) or argmax_target reward(target) / distance(head, target)

It optimizes immediate gain and ignores future reachability, self-collision, pressure, and timeout margin. It is a useful baseline, but not a sustainable strategy.

L2 A* upgrades the problem into real path search:

path* = argmin_path cost(path)
cost(path) = path length
Implementation notes
open_set: priority queue ordered by f(n)=g(n)+h(n)
g(n): actual path length from the head to cell n
h(n): Manhattan distance to the target
blocked: snake body + non-passable obstacles
neighbors: four directions, with wrap-around if the engine allows it

A* is better than Greedy because it respects static blockers. But it still answers only one question: can I reach the target now? It does not answer: after I eat this target, will I still have an exit?

A* replay
Baseline evidence. Useful for showing what pure shortest-path planning misses.
L3a

Lookahead / Chase Tail: safety after eating

reachability-constrained planning

L3a adds the first important safety test: before committing to food, simulate the path to the food, then check whether the new head can still reach the new tail.

simulate(snake, π) → snake'
Safety condition: Reachable(head(snake'), tail(snake')) = true
Implementation notes
def safe_to_eat(game, food):
    path = astar(game.head, food.position, game.blocked_cells)
    if not path:
        return False

    future = simulate(game, path)
    return bfs_reachable(future.head, future.tail, future.blocked_cells)

The tail is a moving exit. If the head can still reach the tail after eating, the agent has at least one dynamic escape route. The limitation is that it still evaluates one food at a time and does not maximize future space.

L3a Lookahead replay
Evidence for the tail-reachability idea.
L4

Economic Decision: a wall is not an obstacle, it is a price

resource-cost optimization

This is the first major conceptual turn. A wall is not an absolute blocker. It is a passable cell with a cost.

Net(target) = Reward(target) - ObstacleCost(path_to_target)

Normal food after a fragile wall is still profitable:

+10 - 5 = +5

A big bonus can remain profitable even after hitting a wall:

+50 - 15 = +35
Implementation notes
def cell_cost(cell):
    if cell.is_empty:
        return 1
    if cell.is_fragile_wall:
        return 1 + wall_weight * 5
    if cell.is_brick_wall:
        return 1 + wall_weight * 10
    if cell.is_iron_wall:
        return 1 + wall_weight * 15
    return INF

The failure mode is also important: the score ledger says bonuses are valuable, but the geometry ledger says size growth can destroy future reachability.

L4 Economic replay
Use this to show wall-as-price behavior, not as a best-score proof.
L4-NoGrow

Why a big bonus can become a liability

score-rush / q=1 meta

NoGrow keeps the useful part of A*/economic planning but refuses the big bonus. The strict version treats bonus cells as forbidden.

q(t) = 1
P = Nq² = N

This simplifies the geometry dramatically. In short-term score-rush mode, preserving mobility can be worth more than the bonus's +50.

Implementation notes
def food_score(food, game):
    if food.type == "big_reward":
        return -INF       # NoGrow-Strict
    return 10 - distance_penalty(food)

The core lesson is not “bonuses are bad.” It is narrower and more precise: in the short-term score-rush meta, size growth can be a negative asset because it increases pressure too quickly.

NoGrow-Strict replay
seed0 reaches 1500. A key stage ceiling.
L5

Space: after eating, how much room remains?

future feasible volume

L5 asks a better question than “can I reach the food?” It asks: after I eat the food, how much reachable space do I still have?

Space(snake') = FloodFillReachableCells(head(snake'), blocked(snake'))
score(path) = reward - obstacle_cost - λ·path_length + μ·Space(after_eat)
Implementation notes
def reachable_area(game, start):
    # Stack-based flood fill; visit order does not matter when we only
    # need the count of reachable cells, so a plain list works as the stack.
    stack = [start]
    visited = {start}
    while stack:
        cell = stack.pop()
        for nxt in neighbors(cell):
            if nxt not in visited and not blocked(nxt):
                visited.add(nxt)
                stack.append(nxt)
    return len(visited)

This is close to preserving a controllable region. It is no longer only about immediate path length; it is about future feasible state volume.

L5 Space replay
seed0 reaches 1500. A strong space-preservation proof case.
L6–L7

TailGuard and Depth-2: the score-rush peak

hard safety constraint + shallow planning

L6 turns tail reachability from a bonus into a hard safety constraint:

maximize Space(after_eat) subject to TailReachable(after_eat)

L7 adds a second visible-food lookahead:

V(s) ≈ R(s,a) + γ max R(s',a')
Implementation notes
def choose_food_tailguard(game):
    candidates = []
    for food in visible_foods(game):
        path = astar(game.head, food.position, game)
        if not path:
            continue
        future = simulate_path(game, path)
        if can_reach_tail(future):
            candidates.append((food, path, evaluate(food, path, future)))
    return best(candidates)

L6/L7 represent the strongest short-term score-rush line: reject bonuses, keep size at 1, eat normal food, and verify that the post-eating ecosystem is still alive.

L6 TailGuard replay
seed0 reaches 2190. The strongest score-rush evidence.
L7 Depth-2 replay
seed0 reaches 1660. Useful for demonstrating shallow two-step planning.
L11–L12

Failed branches: TailStall and CycleCruise

negative evidence

L11 TailStall

Hypothesis: when not eating, follow the tail to stall safely. Failure: local tail reachability is not a sufficient condition for long-term shape health. The body can still fold into a bad topology.

L12 CycleCruise / Hamiltonian

Hypothesis: a fixed Hamiltonian-style cycle prevents self-collision. Failure: Serpent has timeout, random food, dynamic obstacles, hazards, and bonuses. A fixed cruise can be too slow and die by timeout.

Hamiltonian food interval may be O(GRID²)
But Serpent requires: T_food < 30s
L11 TailStall replay
seed0 reaches about 1010, then self-collision. Good failure evidence.
L12 Cycle replay
seed0 reaches 30, then timeout. Good failure evidence.
L8–L17

Long-term control: from slowing a long snake to small-length steady states

endurance-control

The long-control line begins with a simple positive-return loop:

eat 1 food → +10, +1 segment
hit 1 fragile wall → -5, -1 segment
net: +5 score, length unchanged

So in theory:

E[ΔN] = 0
E[ΔS] > 0

L8 tries a fixed target length. L9 relaxes strict equilibrium into small positive drift. L10/L13 hybridize this with safer eating. L14 creates the real turn: instead of trying to stabilize a dangerous long snake, keep the system inside a small-length safe set.

A = {N small}
Keep X(t) ∈ A

L16/L17 then overturn the strong NoGrow assumption: the agent does not need to stay size 1 forever. If pressure is small, even size 3 can remain stable.

A = {q ∈ {1,2,3}, P=Nq² ≤ P_max}
Long-control state machine
if timeout_soon:
    seek_safe_food_or_bonus()
elif pressure_high:
    trim_by_wall_or_hazard()
elif pressure_low:
    take_profitable_food()
else:
    safe_wander_or_follow_tail()
L16 AllSize replay
5000-tick clipped segment. Evidence for the small-thick-snake steady mode.
L17 AllSize A* replay
Long-run steady-state evidence. This is not a full 106-minute replay.

5. Hazard correction

A hazard has this effect:

H = -20 score, -2 segments

So its cost per trimmed segment is:

20 / 2 = 10 score / segment
Trimming resource | Cost per segment
Fragile wall | 5
Random obstacle | 6.96
Iron wall | 7.5
Brick wall | 10
Hazard | 10

Hazards are not high-quality trimming resources. Their value is emergency relief: they cut two segments at once, and they can substitute for walls if no safe wall is nearby.

Boundary: a hazard does not reset the food timer. In L18, it should be emergency pressure relief, not a standard cycle resource.
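The cost-per-segment ranking in the table can be recomputed directly from the penalty numbers; the resource labels are mine.

```python
# Score cost per trimmed segment for each trimming resource.
# Hazard figures ignore the fact that it never resets the food timer.
TRIM_COST_PER_SEGMENT = {
    "fragile_wall": 5 / 1,        # 5.0
    "random_obstacle": 8 / 1.15,  # ≈ 6.96 (expected values, section 2.2)
    "iron_wall": 15 / 2,          # 7.5
    "brick_wall": 10 / 1,         # 10.0
    "hazard": 20 / 2,             # 10.0
}

ranking = sorted(TRIM_COST_PER_SEGMENT, key=TRIM_COST_PER_SEGMENT.get)
# cheapest trimming first: fragile wall, random obstacle, iron wall
```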

6. L18: theoretical design for PressureControl-Profit

The next strategy should not be called MinLength-Profit. A more accurate name is:

L18-PressureControl-Profit

Objective:

maximize E[dS/dt]
subject to:
  P(t) ≤ P_max most of the time
  τ(t) < 30s
  death probability minimized

6.1 State machine

LOW_PRESSURE:
  eat normal food freely
  take a big bonus if its path is safe and reward is worth the pressure shock

MEDIUM_PRESSURE:
  prefer normal food
  avoid big bonus unless timeout pressure is high

HIGH_PRESSURE:
  stop taking rewards
  seek fragile/iron walls for trimming
  use hazard as emergency trimming

TIMEOUT_PRESSURE:
  ignore pressure temporarily; eat the nearest safe food or bonus

RECOVERY_AFTER_BONUS:
  if size just increased, immediately reduce pressure
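The state machine above can be sketched as a classifier over (pressure, clock, recent growth); every threshold below is an illustrative placeholder, not a tuned value from the study.

```python
def classify_state(pressure: float, tau_ms: float, just_grew: bool,
                   p_target: float = 12.0, p_max: float = 18.0,
                   timeout_ms: float = 30000,
                   margin_ms: float = 6000) -> str:
    """Map the current state to one of the 6.1 controller modes.

    Thresholds (p_target, p_max, margin_ms) are illustrative defaults."""
    if tau_ms >= timeout_ms - margin_ms:
        return "TIMEOUT_PRESSURE"      # the survival clock dominates
    if just_grew:
        return "RECOVERY_AFTER_BONUS"  # size shock: reduce pressure first
    if pressure > p_max:
        return "HIGH_PRESSURE"
    if pressure > p_target:
        return "MEDIUM_PRESSURE"
    return "LOW_PRESSURE"
```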

6.2 Per-action value function

Value(action) = immediate_score - λ · P_after - μ · timeout_risk - ν · death_risk - ρ · path_cost + κ · future_reachability

The pressure weight λ should increase with current pressure:

λ = λ0 + λ1·max(0, P - P_target)

The higher pressure gets, the more the controller should value trimming. When pressure is low, it can afford to eat and profit.
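The schedule itself is a one-liner; the λ0 and λ1 defaults are placeholders, not calibrated values.

```python
def pressure_weight(p: float, p_target: float = 12.0,
                    lambda0: float = 0.5, lambda1: float = 0.2) -> float:
    """λ = λ0 + λ1 * max(0, P - P_target).

    Below P_target the controller values eating; above it,
    the weight on trimming grows linearly with excess pressure."""
    return lambda0 + lambda1 * max(0.0, p - p_target)
```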

6.3 Action valuation

Normal food

V(F) = +10 - λ·((N+1)q²)

Big bonus

q<3: V(B) = +50 - λ·((N+1)(q+1)²)
q=3: V(B) = +50 - λ·((N+1)q²)

Wall hit

V(Wi) = -cost_i - path_cost + λ·pressure_reduction_i
pressure_reduction_i ≈ min(loss_i, N-1)q²

Hazard

V(H) = -20 + λ·min(2,N-1)q² - timeout_penalty

A hazard is only worth eating when λ is large: when pressure is high enough that emergency trimming matters more than the score penalty.
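The four valuations can be sketched directly from the formulas above; the function names and zero defaults for path cost and timeout penalty are mine.

```python
def value_food(n: int, q: int, lam: float) -> float:
    """V(F) = +10 - λ * pressure after gaining one segment."""
    return 10 - lam * (n + 1) * q * q

def value_bonus(n: int, q: int, lam: float) -> float:
    """V(B): size may step up (capped at 3), so the pressure shock is larger."""
    q_after = min(q + 1, 3)
    return 50 - lam * (n + 1) * q_after * q_after

def value_wall(n: int, q: int, lam: float, cost: float, loss: int,
               path_cost: float = 0.0) -> float:
    """V(Wi) = -cost_i - path_cost + λ * clamped trimmed area."""
    return -cost - path_cost + lam * min(loss, n - 1) * q * q

def value_hazard(n: int, q: int, lam: float,
                 timeout_penalty: float = 0.0) -> float:
    """V(H) = -20 + λ * trimmed area - timeout penalty (no τ reset)."""
    return -20 + lam * min(2, n - 1) * q * q - timeout_penalty
```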

7. Nine problem redefinitions

Stage | Problem frame | Model | Key change
1 | Snake AI problem | minimize distance to food | Greedy / A*
2 | Safe pathfinding | Reachable(head', tail') = true | Eating must preserve an exit
3 | Economic problem | Net = reward - cost | Walls are paid trimming
4 | NoGrow meta | q=1 fixed | Short-term score-rush rejects bonus
5 | Steady-state control | E[ΔN]=0, E[ΔS]>0 | Wall hits cancel food growth
6 | Low positive drift | 0 ≤ E[ΔN] ≤ ε | Strict cancellation is unnecessary
7 | Small-length safe set | A = {N small} | Do not stabilize a long snake
8 | Pressure control | P = Nq² | Control geometric pressure
9 | Absorb bonus | q ∈ {1,2,3}, P ≤ P_max | Thickness is fine if pressure stays controlled

The real value of this research cycle is that it did not stop at “which AI scored higher.” It kept asking: what is the long-term dynamical system inside this game?

8. What can be claimed, and what cannot

8.1 Weak proposition: Serpent admits long-run positive-return pressure-controlled strategies

Under the following conditions:

  1. The agent can find normal food or a bonus before τ<30s is violated;
  2. The agent can find walls or hazards as trimming resources at finite cost;
  3. The agent controls pressure so that P=Nq² stays near an upper bound;
  4. q≤3, so size is bounded;
  5. Obstacles and hazards do not create a permanent unreachable dead state.

Then simulation evidence and positive expected-value accounting support strategies with:

E[ΔS] > 0
E[ΔP] ≈ 0 or E[ΔP] ≤ ε
survival time very long in simulation

In plain language: as long as the controller can keep bringing geometric pressure back into the safe region, score can keep drifting upward.

8.2 The stronger theorem we cannot claim yet

We cannot honestly claim:

All random seeds survive forever.

The reasons are concrete: map generation is random; food, bonus, and hazard locations can be bad for a while; the agent is not doing perfect dynamic programming; timeout is a hard constraint; size 3 reduces reachability; true occupied cells are more complex than Nq²; and wall-collision accounting may still need calibration.

Careful statement: we have strong simulation evidence and a positive expected-resource account supporting the existence of long-run positive-return pressure-controlled strategies. We do not yet have a formal proof of almost-sure infinite survival across all random environments.

9. Replay citation review

The article does not embed all 22 recordings. It uses selected replays as evidence for specific claims.

Use in article | Replay | Judgment
baseline | astar_seed0.json | Not a high-score run, but useful for showing shortest-path limitations.
wall economy | economic_seed0.json | Shows walls as passable priced cells; not used as best-score evidence.
NoGrow stage ceiling | no_grow_strict_seed0.json | 1500. Core evidence.
space scoring | l5_space_seed0.json | 1500. Good article evidence.
score-rush peak | l6_tail_guard_seed0.json | 2190. The most important score-rush replay.
Depth-2 planning | l7_depth2_seed0.json | 1660. Useful for showing shallow two-step planning.
failed branches | l11_tail_stall_seed0.json / l12_cycle_seed0.json | Accurate negative evidence.
pressure steady state | l16_min_length_all_size_seed0.json / l17_endurance_seed1.json | Used as steady-state segments, not full-game records.

The remaining recordings stay in the replay library as appendices. Long-run or clipped recordings such as L8/L9/L10/L13 should not be cited as final-score proof; they are better used as supplementary strategy evidence.