H0 H0 H0 (Santa's hypothesis testing) - Centre for Wood Science & Technology

We all know that Santa is going to find out whether a child has been naughty or nice – but how does he do it?

The first thing we know (Gillespie & Coots, 1934) is that naughty and nice are binary states for Santa. A child either gets a stocking filled with presents or gets coal. We also know that actual naughtiness and niceness are analogue. A child can be a bit naughty or very naughty – but cannot increase naughtiness and niceness at the same time (unlike a chocolate pudding). So we can consider naughtiness and niceness to be a scale that runs from fully naughty at one end, to fully nice at the other. No child is ever so pure, so they’ll move along the scale over time, depending on their moods and opportunities. Over the course of the year, each child will accumulate a distribution of naughtiness and niceness – spending a certain amount of time at different naughtiness levels.

The first question is – how does Santa turn this distribution of naughtiness and niceness into a binary decision and classify each child as being either naughty or nice? Well, if Santa had perfect knowledge of each child he could apply a criterion to their naughty-nice distribution. This could simply be the mean of that distribution over the year, evaluated for each child, compared against a threshold that sorts each child into either the naughty or nice category. However, the problem with mean naughtiness is that a child could spend most of the year being just nice enough, and still get away with being spectacularly naughty on occasion.

A better idea would be to use some concept of percentile naughtiness. Perhaps a criterion like:
“if a child spends more than X% of their time being naughtier than Y they will be classified as naughty.”

This makes more sense if Santa is mostly concerned about the time a child spends being naughty and how naughty they are being. Santa could assess a child’s X percentile naughtiness level and put them in the naughty category if this is naughtier than his Y threshold. (Or, to put it the other way around, check how much time the child has spent being naughtier than Y and put them in the naughty category if this is more than X%.)
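As a rough illustration of why the percentile criterion catches what the mean misses, here is a minimal sketch in Python. The numeric scale (0.0 for fully naughty, 1.0 for fully nice), the function names, the example child and all the threshold values are assumptions made up for this example, not anything Santa has confirmed.

```python
from statistics import mean

# Assumed scale (not from the article): 0.0 = fully naughty, 1.0 = fully nice.
# "levels" is a list of equally spaced readings of one child's behaviour.

def fraction_naughtier_than(levels, y):
    """Share of readings that are naughtier than (i.e. below) level y."""
    return sum(1 for level in levels if level < y) / len(levels)

def is_naughty_by_mean(levels, mean_threshold):
    """Mean criterion: naughty if the average level is below the threshold."""
    return mean(levels) < mean_threshold

def is_naughty_by_percentile(levels, y, x_percent):
    """Percentile criterion: naughty if more than X% of readings are naughtier than Y."""
    return 100 * fraction_naughtier_than(levels, y) > x_percent

# Hypothetical child: just nice enough on 90 days, spectacularly naughty on 10.
year = [0.7] * 90 + [0.0] * 10

print(is_naughty_by_mean(year, mean_threshold=0.5))          # False
print(is_naughty_by_percentile(year, y=0.3, x_percent=5.0))  # True
```

With these made-up numbers the child sails through the mean criterion but is caught by the percentile one, because 10% of the readings are naughtier than Y.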

Santa could even choose his values of X and Y so that the percentage of children classified as naughty each year matches how much coal he has available.
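One way this calibration might look, keeping Y fixed and tuning only X, is sketched below. It assumes Santa already knows, for each child, the fraction of time spent naughtier than Y. The function name `calibrate_x`, the coal budget and the example fractions are all hypothetical.

```python
def calibrate_x(fractions, coal_budget):
    """Smallest X (as a fraction) such that at most coal_budget children
    have spent MORE than X of their time being naughtier than Y."""
    ranked = sorted(fractions, reverse=True)  # naughtiest child first
    if coal_budget >= len(ranked):
        return 0.0  # enough coal for everyone who was ever naughtier than Y
    # The child at index coal_budget is the first one who must be let off.
    return ranked[coal_budget]

# Hypothetical fractions of time spent naughtier than Y, one per child.
fractions = [0.02, 0.10, 0.05, 0.30, 0.01]
x = calibrate_x(fractions, coal_budget=2)
naughty = [f for f in fractions if f > x]

print(x)        # 0.05
print(naughty)  # [0.1, 0.3]
```

Ties at the cut-off are resolved in the child's favour here, so Santa may end up with a lump or two of coal left over.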

But what if Santa doesn’t have perfect knowledge about every child? Children can, after all, be naughty in quite secretive ways. How could he do it? Well, he could have some parameter that he measures that correlates with a child’s X percentile naughtiness. This would allow a relatively simple assessment – which he could even do before he arrives in town (Minority Report style) to enable the elves to plan ahead with their present manufacturing/coal mining. Since this is an estimate rather than an accurate assessment, Santa has to consider two risks (even if he checks his list twice):
Child’s risk: a child is declared naughty when they are actually nice (Type I error – “incorrectly naughty”)
Santa’s risk: a child is declared nice when they are actually naughty (Type II error – “incorrectly nice”)

Somehow, those two risks need to be balanced, and Santa has to make a decision about how acceptable he finds them. He probably considers the child’s risk to be more critical than his risk since he is notoriously jolly.

Once Santa has decided on acceptable levels of two types of risk, he can check that his method of declaring whether a child is naughty or nice is good enough.

The risks are more complicated than simple probabilities, because Santa also needs to take into account the magnitude of his error. Classifying a child as naughty when they have really been borderline nice is a less serious error than classifying a child as naughty when they have been very well behaved all year. Similarly, classifying a child as nice when they have been very very naughty is a more serious error than classifying a child as nice when they have only been borderline naughty.

How can this be done? Well, Santa can’t make too many assumptions about the shape of a child’s naughty-nice distribution since children don’t always behave normally. That would seem to limit Santa to a non-parametric method.

Let’s consider the boundary case of a child that is on the cusp of being classified as naughty (by Santa’s definition of naughty). In this case, the X percentile of their naughty-nice distribution is equal to Santa’s definition of critical naughtiness Y. A tiny bit more naughtiness would have put the child into the naughty category. If Santa wants to be safe in favour of the child he could use a decision threshold (Ylist) level of naughtiness that is a bit more naughty than his definition of critical naughtiness level (Y). That way he is biasing the errors of naughty-nice child sorting in favour of nice. The problem with this is that the probabilities depend on the naughtiness distribution of the child. A child that spends all their waking hours being naughtier than Y but not quite as naughty as Ylist is less likely to be declared as naughty (a correct sorting) than a child that spends just a little more than X% of their time being naughtier than Y and is occasionally naughtier than Ylist (also a correct sorting). This makes it difficult to evaluate the level of Santa’s risk and the child’s risk on the basis of the sorting method alone.

Santa can overcome this problem by thinking just about the percentiles. Remember that other way of stating the criterion:
Check how much time the child has spent being naughtier than Y and put them in the naughty category if this is more than X%

This can be tweaked to balance Santa’s risk and the child’s risk by allowing the percentage of time to be a little higher than Santa’s defined level of X%.

Consider again the boundary case of a child that is on the cusp of being classified as naughty (by Santa’s definition of naughty). In this case, the proportion of time spent being naughtier than Santa’s definition of critical naughtiness Y is equal to Santa’s definition of critical percentage X%. Even a second more spent being naughtier than Y would have put the child into the naughty category.

If Santa wants to be safe in favour of the child he could use a decision threshold on time spent being naughtier than critical naughtiness level Y (Xlist) that is a bit more than his definition of critical naughtiness percentage time (X). That way the risk of a child being graded nice (by Santa’s list) when they are in fact classified (by definition) as naughty can be higher than the risk of a child being graded naughty when they are in fact classified (by definition) as nice.

Santa’s method could be as simple as taking a number (n) of observations of a child’s behaviour and counting up the number of them that are at levels of naughtiness that exceed Y. Santa could then set his list to naughty for that child if the percentage of those n observations is higher than Xlist.

Make a number (n) of random observations of a child. For each observation record a 1 if they are being nice and a 0 if they are not being nice. Naughtiness is estimated by the proportion of 0s.
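The counting rule described above could be sketched as follows. The function name and the example observations are invented for illustration, and Xlist is passed in as a percentage.

```python
def santa_list_entry(observations, x_list_percent):
    """observations: one entry per random observation, 1 = nice, 0 = not nice.
    Returns "naughty" if the share of 0s exceeds Xlist, otherwise "nice"."""
    naughty_share = observations.count(0) / len(observations)
    return "naughty" if 100 * naughty_share > x_list_percent else "nice"

# Hypothetical record of n = 10 observations, two of them not nice (20%).
observations = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]

print(santa_list_entry(observations, x_list_percent=15))  # naughty
print(santa_list_entry(observations, x_list_percent=25))  # nice
```

The same ten observations give a different list entry depending on where Santa puts Xlist, which is exactly the lever he uses to trade his risk against the child's.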

Santa could decide on the number of observations he needs, and his decision level Xlist, in order to meet his acceptable levels of Santa’s risk and the child’s risk. Of course, in order to evaluate the risks, Santa needs to have done some very comprehensive measurements on a sample of children (hopefully non-destructively) to actually see how often children are incorrectly sorted (as “incorrectly naughty” or “incorrectly nice”). Each year he could do some additional comprehensive sampling to make sure his method is still working (in case children have innovated any new methods or patterns of naughtiness).
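If Santa is willing to assume that his n observations are independent and truly random (a big assumption, and not one the article makes), the chance of a child ending up on the naughty list follows a binomial distribution, and both risks can be worked out for any choice of n and Xlist. The sketch below uses that assumption. The function name and every number in it are hypothetical.

```python
from math import comb

def prob_listed_naughty(p, n, x_list):
    """Probability that a child is listed as naughty, ASSUMING n independent
    random observations, each catching the child being naughtier than Y with
    probability p (the child's true fraction of time spent that way).
    The child is listed as naughty when the observed share exceeds x_list."""
    total = 0.0
    for k in range(n + 1):  # k = number of naughty observations
        if k / n > x_list:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total

# Hypothetical set-up: definition X = 10%, decision level Xlist = 15%, n = 20.
n, x_list = 20, 0.15
for p in (0.02, 0.05, 0.10, 0.20, 0.40):
    listed = prob_listed_naughty(p, n, x_list)
    # For p <= X this is the child's risk; for p > X, Santa's risk is 1 - listed.
    print(f"true naughty fraction {p:.2f}: P(listed naughty) = {listed:.3f}")
```

Printing the probability across a range of true naughtiness fractions gives something like an operating characteristic curve. It also shows how each risk shrinks as the child moves further from the borderline, which is the magnitude-of-error point made earlier.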

Santa needs to be careful that all of his observations are representative of the child’s actual behaviour over the course of the whole year. This is probably the biggest challenge for Santa as representativeness cannot be easily specified or corrected for with statistical methods. Santa has to use his judgement and knowledge of children’s behaviour patterns to decide on a sampling protocol (both for his comprehensive measurements and for his assessment of each child) that works well most of the time, and is able to catch the real problem cases (extremely naughty and secretive children).

But what about extreme naughtiness? Well, if Santa doesn’t have perfect knowledge, he doesn’t actually know the peak level of naughtiness of a child. But he could set more criteria for his decision list model based on different levels of naughty. He could add a second check allowing a much smaller percentage of time spent at a higher naughtiness level.
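A second check of this kind might look like the sketch below, where each check pairs a naughtiness level with the largest share of observations allowed beyond it. The 0.0 (fully naughty) to 1.0 (fully nice) scale, the function names and all the threshold values are assumptions for the sake of the example.

```python
# Assumed scale (not from the article): 0.0 = fully naughty, 1.0 = fully nice.

def share_naughtier_than(levels, threshold):
    """Share of observed levels that are naughtier than (below) the threshold."""
    return sum(1 for level in levels if level < threshold) / len(levels)

def tiered_list_entry(levels, checks):
    """checks: list of (naughtiness threshold, maximum allowed share) pairs.
    The child is listed as naughty if ANY check is failed."""
    for threshold, max_share in checks:
        if share_naughtier_than(levels, threshold) > max_share:
            return "naughty"
    return "nice"

# Hypothetical child: 48 good observations and 2 of extreme naughtiness.
levels = [0.8] * 48 + [0.05] * 2

basic_check = [(0.3, 0.10)]                  # up to 10% naughtier than Y
with_extreme_check = [(0.3, 0.10), (0.1, 0.02)]  # plus up to 2% at an extreme level

print(tiered_list_entry(levels, basic_check))         # nice
print(tiered_list_entry(levels, with_extreme_check))  # naughty
```

The child passes the ordinary check, since only 4% of observations are naughtier than Y, but fails the stricter allowance for extreme naughtiness.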

What can we conclude from this? “Be good for goodness sake” is not the best advice. Better advice is to concentrate on spending less time being naughty … or to consolidate all your naughtiness into certain periods of time in order to defeat Santa’s naughtiness sampling protocol.

Do you have any better ideas for how this might work? If so, please write them in a comment – we’d like to know for science reasons.

Reference:
Coots, J.F. & Gillespie, H., “Santa Claus Is Coming to Town”, Decca, 1934

The discussion continues here: “H0 H0 H0 (Santa’s list again)”, 2024.
