Testing method
Monkey baseline a.k.a. random-portfolio control, “dart-throwing monkeys”
A control that answers “is the system’s stock-picking actually better than chance?” It is a random-entry system — the proverbial monkey throwing darts at the stock listings to choose names — run through the exact same rules as the real system (same filters, position sizing, rebalance schedule, exits, and regime filter). The only thing changed is which stocks get picked: the real system’s ranking is replaced by a random draw from the same eligible names.
We run a few hundred such random portfolios (each a different random seed) to build a distribution of outcomes. If the real system isn’t clearly better than that crowd of monkeys, its stock-selection rule isn’t adding anything you couldn’t get by chance. The phrase traces to Burton Malkiel’s A Random Walk Down Wall Street.
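As a rough illustration of the random draw described above, the sketch below builds one random pick list per seed. The tickers, the position count, and the function name `monkey_picks` are made up for the example; in a real run the eligible set would come from the system's own filters on each rebalance date, and every other rule would stay untouched.

```python
import random

def monkey_picks(eligible, n_positions, seed):
    """Pick names at random from the same eligible set the real system saw."""
    rng = random.Random(seed)              # one independent generator per monkey
    return rng.sample(sorted(eligible), n_positions)

# Hypothetical tickers standing in for the filtered universe on one rebalance date.
eligible = ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF", "GGG", "HHH"]

# 200 monkeys, one seed each. Only the selection differs from the real system.
portfolios = [monkey_picks(eligible, 3, seed) for seed in range(200)]
```

Re-running with the same seed reproduces the same portfolio, which keeps the control repeatable.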
Ablation
Removing or swapping one part of a system to see how much that part actually mattered — the way a mechanic unplugs one component to find what a car really needs to run. If you suspect the position-sizing rule is responsible for a result, you run the system again with that rule replaced by a plain equal-weight rule and compare. Whatever changes is attributable to the part you removed; what stays the same was not coming from it. (The term is borrowed from experimental science.)
Gated vs. un-gated
Two ways of running a test. Gated keeps the system’s market filter on (e.g. “only buy when the index is above its 200-day average”), so it stands aside in downturns. Un-gated removes that filter and stays fully invested at all times. Comparing the two shows how much of a result comes from the market-timing filter versus the rest of the system.
Percentile (vs. the monkey distribution)
Where the real system falls within the spread of random (monkey) outcomes. A CAGR at the 90th percentile means the real system beat 90% of the random portfolios. Percentiles are oriented so that higher is always better. For a metric where a smaller magnitude is better, like maximum drawdown, a low percentile is therefore still bad: it means the real system drew down deeper than most of the monkeys. As a rough guide, landing above the 95th percentile is strong evidence of a genuine edge; landing in the middle (40th–60th) is indistinguishable from luck.
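One simple way to compute such a percentile is to count the share of random portfolios the real system strictly beat. The sketch below assumes that convention, and assumes drawdowns are stored as signed values (for example -0.35), so that a deeper drawdown is a smaller number and ranks lower. The figures are made up.

```python
def percentile_rank(real_value, monkey_values):
    """Percent of random portfolios whose value is strictly below the real system's."""
    beaten = sum(1 for v in monkey_values if v < real_value)
    return 100.0 * beaten / len(monkey_values)

# Made-up monkey CAGRs of 1% through 10%.
monkey_cagrs = [0.01 * i for i in range(1, 11)]
cagr_rank = percentile_rank(0.095, monkey_cagrs)

# Made-up signed drawdowns: the real system's -40% is deeper than three of four monkeys.
monkey_drawdowns = [-0.10, -0.20, -0.30, -0.50]
drawdown_rank = percentile_rank(-0.40, monkey_drawdowns)
```

Here the CAGR lands at the 90th percentile while the drawdown lands at the 25th: better returns than most monkeys, but a deeper worst loss.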
Skill excess
The real system’s return minus the median random portfolio’s return, in percentage points per year. Positive means the stock-picking added return over random selection; negative means it subtracted. It puts a number on “how much did the signal actually buy you.”
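The subtraction is simple enough to sketch directly. The function name and the sample returns below are illustrative only.

```python
from statistics import median

def skill_excess(real_cagr, monkey_cagrs):
    """Real CAGR minus the median monkey CAGR, in percentage points per year."""
    return (real_cagr - median(monkey_cagrs)) * 100.0

# Made-up example: real system 8%/yr against monkeys at 4%, 5%, and 6%/yr.
excess_points = skill_excess(0.08, [0.04, 0.05, 0.06])
```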
In-harness test
Running the control (e.g. the monkey baseline) through the system’s own backtest engine, so the only difference is the one thing being tested. This keeps the comparison fair — the same commissions, fills, sizing, and timing apply to both — rather than comparing against a separate, differently-built benchmark.
Buy-and-hold benchmark
The simplest possible alternative: put the money in the index itself (via its ETF, with dividends reinvested) and do nothing. For a long-only equity system this is the benchmark that matters: if a system can’t beat simply owning the index, the work and risk of trading it have to be justified by something else, usually a smoother ride (shallower drawdowns).
Return & risk metrics
CAGR (compound annual growth rate)
The single yearly growth rate that, compounded over the whole test, gets you from the starting balance to the ending balance. $100k growing to $294k over 21 years is about 5.3%/yr. It is the standard way to state “how fast did the money grow,” smoothing over the good and bad years.
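A minimal sketch of the calculation, applied to the balances from the example above. The function name is ours.

```python
def cagr(start_balance, end_balance, years):
    """Constant yearly growth rate that turns start_balance into end_balance."""
    return (end_balance / start_balance) ** (1.0 / years) - 1.0

growth = cagr(100_000, 294_000, 21)
print(f"{growth:.2%} per year")
```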
Maximum drawdown (MaxDD)
The worst peak-to-trough drop in account value over the test — how much you would have lost if you bought at the highest point before the worst decline and held to the bottom. A −35% max drawdown means the account fell 35% from a prior high at its worst. It is the headline measure of pain and the main thing that makes a system hard to stick with.
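The peak-to-trough logic can be sketched as a single pass over the equity curve, tracking the running high and the deepest fall from it. The equity values are made up, and the result is returned as a signed fraction to match the −35% style used above.

```python
def max_drawdown(equity):
    """Worst peak-to-trough decline as a signed fraction (e.g. -0.35 for -35%)."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)                   # highest balance seen so far
        worst = min(worst, value / peak - 1.0)    # current fall from that high
    return worst

# Made-up equity curve: the fall from 120 to 78 is the worst decline.
worst_drop = max_drawdown([100, 120, 78, 90, 130])
```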
Sharpe ratio
Return per unit of volatility: the average return (strictly, the return in excess of a risk-free rate) divided by how much that return bounces around day to day, both annualized. A higher value means more return for the same amount of bounce. It rewards steady gains and penalizes a choppy ride, but it treats upside swings and downside swings the same, which is why it is usually read alongside a drawdown-based measure like Calmar.
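A common way to compute this from daily returns is sketched below. It assumes 252 trading days per year and a risk-free rate that defaults to zero; both are conventions chosen for the example, not something this glossary specifies.

```python
import math
from statistics import mean, stdev

def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized mean excess return divided by annualized volatility."""
    excess = [r - risk_free_daily for r in daily_returns]
    # Mean scales with time, volatility with its square root, hence sqrt(periods).
    return mean(excess) / stdev(excess) * math.sqrt(periods_per_year)

# Made-up daily returns.
returns = [0.010, -0.005, 0.007, 0.002]
ratio = sharpe_ratio(returns)
```

With a zero risk-free rate, doubling every return leaves the ratio unchanged, because the average and the volatility scale together.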
Calmar ratio
Return per unit of drawdown: annualized return divided by the magnitude of the maximum drawdown. Where Sharpe penalizes all volatility, Calmar penalizes only the thing investors actually fear, deep losses. A Calmar of 0.5 means the annual return was half the size of the worst drawdown, for example 10%/yr against a 20% maximum drawdown.
Rolling 36-month Calmar. Instead of one number for the whole history (which can hinge on a single old crash), we compute Calmar over a moving three-year window and report the median. This shows how consistently the system delivered return-per-drawdown across many three-year holding periods, not just over one lucky or unlucky span.
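The rolling version can be sketched as follows, assuming a month-end equity series. The helper names, the treatment of a window with no drawdown (reported as infinity), and the sample data are choices made for this illustration.

```python
from statistics import median

def max_drawdown(equity):
    """Worst peak-to-trough decline as a signed fraction."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = min(worst, value / peak - 1.0)
    return worst

def calmar(equity, years):
    """Annualized return over the span divided by the size of its worst drawdown."""
    growth = (equity[-1] / equity[0]) ** (1.0 / years) - 1.0
    depth = abs(max_drawdown(equity))
    return growth / depth if depth > 0 else float("inf")

def rolling_calmar_median(monthly_equity, window=36):
    """Median Calmar over every window-month span (window + 1 month-end points)."""
    spans = range(len(monthly_equity) - window)
    return median(calmar(monthly_equity[i:i + window + 1], window / 12) for i in spans)

# Made-up series: 1% monthly growth with a 10% dip every twelfth month.
equity = [100 * 1.01 ** m * (0.9 if m % 12 == 6 else 1.0) for m in range(49)]
typical_calmar = rolling_calmar_median(equity)
```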
CAR25 & safe-f (Bandy risk-normalized return)
A way to state a strategy’s return at a controlled level of risk, so a rarely-invested system and a fully-invested one can be compared on equal terms. Using Monte-Carlo resampling of the trade history (the “Bandy” method), it first finds safe-f: the largest position size (allowing leverage) at which the strategy’s tail drawdown stays within a chosen limit (here, a 20% drawdown at the 5th percentile of simulated outcomes, i.e. only the worst 5% of simulated runs draw down deeper than 20%). CAR25 is then the compound annual return at that safe-f, read at the 25th percentile, a deliberately conservative “bad-luck” outcome rather than the average.
The diagnostic value: if safe-f hits its leverage ceiling without the drawdown limit ever binding, the strategy is limited by how much capital it can deploy, not by risk — it has spare risk budget but too few opportunities to use it. That is the signature of a capital-light, high-selectivity system.
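The procedure can be sketched in simplified form as below. This is an illustration under stated assumptions, not a reproduction of the published method: trades are resampled with replacement, the position fraction is searched on a coarse grid, the drawdown limit is read as "the 95th percentile of simulated drawdown depth is at most 20%", and the trade returns, leverage ceiling of 2.0, and all function names are made up.

```python
import random

def simulate_path(trade_returns, f, n_trades, years, rng):
    """One resampled path at fraction f -> (CAGR, max drawdown as a positive fraction)."""
    equity = peak = 1.0
    worst = 0.0
    for _ in range(n_trades):
        equity *= 1.0 + f * rng.choice(trade_returns)   # resample with replacement
        if equity <= 0.0:
            return -1.0, 1.0                             # account wiped out
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return equity ** (1.0 / years) - 1.0, worst

def quantile(values, q):
    """Nearest-rank quantile; adequate for a sketch."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

def run_paths(trade_returns, f, n_trades, years, n_paths=300, seed=0):
    rng = random.Random(seed)   # same seed for every f, so fractions are compared on equal draws
    return [simulate_path(trade_returns, f, n_trades, years, rng) for _ in range(n_paths)]

def safe_f_and_car25(trade_returns, n_trades, years,
                     dd_limit=0.20, max_leverage=2.0, step=0.1):
    """Largest grid fraction whose tail drawdown stays within dd_limit, and CAR25 there."""
    safe_f = 0.0
    for k in range(1, int(round(max_leverage / step)) + 1):
        f = k * step
        depths = [dd for _, dd in run_paths(trade_returns, f, n_trades, years)]
        if quantile(depths, 0.95) <= dd_limit:
            safe_f = f
    if safe_f == 0.0:
        return 0.0, 0.0
    growth = [g for g, _ in run_paths(trade_returns, safe_f, n_trades, years)]
    return safe_f, quantile(growth, 0.25)

# Made-up per-trade returns; 40 trades spread over 2 years.
trades = [0.04, -0.02, 0.03, -0.03, 0.05, -0.01]
safe_f, car25 = safe_f_and_car25(trades, n_trades=40, years=2.0)
hit_ceiling = safe_f >= 2.0 - 1e-9   # True would mean the drawdown limit never bound
```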
Return / MaxDD (full-period Calmar)
CAGR divided by the single worst drawdown over the entire test — the whole-history version of the Calmar ratio. A quick read on how much growth you got for the worst loss you had to endure.
Profit factor
Gross profit divided by gross loss across all trades. Above 1.0 means winners outweighed losers in total dollars; 1.5 means $1.50 won for every $1.00 lost. A value near 1.0 means the system roughly broke even on closed trades, and below 1.0 means it lost money on them.
Win rate
The share of trades that closed profitably. On its own it says little: trend-following systems often win fewer than half their trades but still make money because the winners are much larger than the losers. Read it together with average win vs. average loss.
Expectancy
The average profit or loss on a typical trade — the single number that blends how often you win with how big the wins and losses are:
Expectancy = (Win% × average win) − (Loss% × average loss)
It answers “what do I make per trade, on average?” A high win rate doesn’t guarantee a positive expectancy — if the occasional loss is much larger than the typical win, a 64%-win system can still bleed. Conversely a low-win-rate trend follower can have strong positive expectancy from rare big winners. Multiply expectancy by the number of trades to approximate total profit, so it pairs naturally with trade frequency (how many chances the edge gets per year).
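The three trade-level measures above (profit factor, win rate, expectancy) can all be computed from a list of closed-trade profits and losses. The sketch below uses made-up dollar figures chosen to show a high win rate with negative expectancy; the function name and dictionary keys are ours.

```python
def trade_stats(pnls):
    """Profit factor, win rate, and expectancy from per-trade P&L in dollars."""
    wins = [p for p in pnls if p > 0]
    losses = [-p for p in pnls if p < 0]          # stored as positive amounts
    n = len(pnls)
    win_rate = len(wins) / n
    loss_rate = len(losses) / n
    avg_win = sum(wins) / len(wins) if wins else 0.0
    avg_loss = sum(losses) / len(losses) if losses else 0.0
    return {
        "profit_factor": sum(wins) / sum(losses) if losses else float("inf"),
        "win_rate": win_rate,
        "expectancy": win_rate * avg_win - loss_rate * avg_loss,
    }

# Made-up trades: four small winners and one large loser.
pnls = [100, 100, 100, -400, 50]
stats = trade_stats(pnls)
```

Here 80% of trades win, yet the profit factor is below 1.0 and expectancy is negative; multiplying expectancy by the five trades recovers the total loss of $50.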
System mechanics
Regime gate (regime filter, market filter)
A rule that turns new buying on or off based on the overall market’s health — commonly “only open new positions when the index is above its 200-day moving average.” The idea is to stand aside during sustained downtrends and participate during uptrends. In practice it is often the single biggest driver of a system’s drawdown reduction.
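A minimal sketch of the 200-day rule quoted above, assuming a simple (unweighted) moving average of daily index closes and a gate that stays closed until enough history exists. The function name and the price series are illustrative.

```python
def regime_gate_open(index_closes, lookback=200):
    """True when the latest close is above its simple moving average."""
    if len(index_closes) < lookback:
        return False                      # not enough history: stay out
    average = sum(index_closes[-lookback:]) / lookback
    return index_closes[-1] > average

# Made-up series: a steady uptrend and a steady downtrend.
uptrend = list(range(1, 301))
downtrend = list(range(300, 0, -1))
allow_buys_up = regime_gate_open(uptrend)
allow_buys_down = regime_gate_open(downtrend)
```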
Momentum ranking (slope × R²)
A way to rank stocks by trend strength. Fit a straight line to each stock’s recent log price (log so that a steady percentage gain looks like a straight line). The line’s slope measures how fast the stock is rising; R² (0 to 1) measures how cleanly it follows that line versus bouncing around. Multiplying them favors stocks that are both rising and rising smoothly, and downranks erratic movers.
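The score can be sketched with an ordinary least-squares fit of log price against the day index. The lookback length, any annualization of the slope, and the function name are left as assumptions here; the price series are made up.

```python
import math

def momentum_score(closes):
    """Slope of the least-squares line through log prices, multiplied by its R²."""
    n = len(closes)
    ys = [math.log(c) for c in closes]
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    syy = sum((y - y_mean) ** 2 for y in ys)
    slope = sxy / sxx                                   # log-return per day
    r_squared = sxy * sxy / (sxx * syy) if syy > 0 else 0.0
    return slope * r_squared

# Two made-up stocks with the same underlying 1%/day trend, one smooth, one erratic.
smooth = [100 * 1.01 ** i for i in range(60)]
choppy = [100 * 1.01 ** i * (1.05 if i % 2 else 0.95) for i in range(60)]
smooth_score = momentum_score(smooth)
choppy_score = momentum_score(choppy)
```

The smooth series scores close to its daily log growth rate, while the erratic one is marked down by its lower R².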
Volatility sizing (inverse-volatility, ATR sizing)
Giving each position a size based on how jumpy the stock is, so calmer stocks get more capital and wilder ones get less — the goal being that every position contributes a similar amount of risk. “ATR” (average true range) is the volatility measure used. The alternative, equal weight, simply splits capital evenly across positions.
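A minimal sketch of inverse-volatility weighting, assuming each stock's ATR has already been computed and expressed as a fraction of its price. The tickers, the volatility figures, and the function name are made up.

```python
def inverse_vol_weights(vol_by_ticker):
    """Weights proportional to 1 / volatility, normalized to sum to 1."""
    inverse = {ticker: 1.0 / vol for ticker, vol in vol_by_ticker.items()}
    total = sum(inverse.values())
    return {ticker: value / total for ticker, value in inverse.items()}

# Hypothetical ATR as a fraction of price: WILD is three times as jumpy as CALM.
vols = {"CALM": 0.01, "WILD": 0.03}
weights = inverse_vol_weights(vols)
risk_share = {ticker: weights[ticker] * vols[ticker] for ticker in vols}
```

The calm stock receives three times the capital of the wild one, and weight times volatility comes out equal for both, which is the "similar amount of risk" goal. An equal-weight rule would give each 50%.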
Exposure (capital deployed, time in market)
How much of the account is actually invested versus sitting in cash. A system that is only 60% invested on average is putting less capital at risk than a buy-and-hold investor who is 100% invested, so comparing their raw returns is not apples-to-apples — the lower-exposure system’s return per dollar deployed is higher than its headline number suggests. Two related measures: capital deployed (the fraction of equity invested) and time in market (the fraction of days holding any position). A system that dynamically cuts exposure in downturns trades away upside in bull markets for protection in bear markets.
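Both measures can be read off a daily record of the invested fraction of equity. The sketch below assumes such a record exists (0.0 for all cash, 1.0 for fully invested) and averages it over the test; the numbers are made up.

```python
def exposure_stats(invested_fraction_by_day):
    """Average capital deployed and share of days with any position."""
    days = len(invested_fraction_by_day)
    capital_deployed = sum(invested_fraction_by_day) / days
    time_in_market = sum(1 for f in invested_fraction_by_day if f > 0) / days
    return capital_deployed, time_in_market

# Made-up four-day record: one day in cash, three days partly or fully invested.
capital_deployed, time_in_market = exposure_stats([0.0, 0.5, 1.0, 0.9])
```

Here the system is in the market 75% of the days but has only 60% of its capital deployed on average, which is why the two figures are reported separately.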
Rebalance cadence
How often the system re-ranks its universe and adjusts holdings — daily, weekly, or monthly. Faster cadence reacts sooner (useful around fast crashes) but trades more; slower cadence trades less but can sit in a deteriorating position longer between reviews.
Data & universe
Survivorship-bias-free (SBF)
A dataset that includes the companies that later failed, were acquired, or got dropped from the index — not just the ones that survived to today. Testing only on survivors flatters results, because you are implicitly “knowing” which companies made it. An SBF test sees the losers too, so the result reflects what was actually tradable at the time.
Point-in-time membership
Using the index’s membership as it actually was on each historical date — so a 2008 backtest holds the names that were in the index in 2008, not today’s list. This avoids accidentally trading companies before they were added (or after they were removed), a subtle but common backtest error.
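A point-in-time lookup can be sketched as a filter over add and remove dates. The record layout, tickers, and dates below are hypothetical; a real membership file may be organized differently.

```python
from datetime import date

# Hypothetical records: (ticker, date added, date removed or None if still a member).
HISTORY = [
    ("OLDCO", date(1999, 1, 4), date(2009, 6, 1)),
    ("NEWCO", date(2015, 3, 2), None),
    ("STEADY", date(1995, 1, 3), None),
]

def members_on(history, as_of):
    """Names that were in the index on the given date."""
    return {
        ticker
        for ticker, added, removed in history
        if added <= as_of and (removed is None or as_of < removed)
    }

in_2008 = members_on(HISTORY, date(2008, 9, 15))
in_2020 = members_on(HISTORY, date(2020, 1, 2))
```

The 2008 lookup includes the later-removed name and excludes the later-added one, so the backtest can only trade what was actually in the index at the time.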
Total return (dividend-adjusted)
Price performance plus reinvested dividends. Comparing a system to a buy-and-hold benchmark is only fair on a total-return basis, since the buy-and-hold investor collects dividends too.
Seeds / K=200
A “seed” is the starting value for a random-number generator; a different seed produces a different random portfolio. K=200 means 200 independent random portfolios were run to build the monkey distribution. More seeds give a smoother, more reliable distribution to compare against.