Hurray! Another bilious bug defeated!
Last time, we left our intrepid heroes—our desperate digital desperadoes—trapped—at the mercy of the extremely rare and apparently invisible 2,147,483,646! The villainous variable had been masquerading as 2147483, by all accounts a hardworking and kindhearted value, who wishes to say that she is in no way affiliated with that more nefarious number. 2147483646, at last unmasked, still holds the entire system in his iron grip. The Figures and their programmer must somehow find a way to deal with this invidious integer.
For background, check out the previous post on this topic.
Once I knew the right number to look for, the rest of life got in the way, and it wasn’t until days later that I got a chance to fix the problem. I started by having the system print out information on the state of the figure that was throwing the error.
“
Figure 124 at 124, size=76
pointer 2147483646
outer address 124
Outer head 0
other address 168
other head 0
java.lang.ArrayIndexOutOfBoundsException: 2147483646
“
And there it is. The pointer value is the number that’s causing the problem. I wish I’d thought of this test earlier; I would have noticed right off that I had the wrong number, and where the problem was.
Each figure has a memory. It’s an array of int values—those are 32-bit signed integers. An array is a bunch of numbers grouped together for easy access. In this case, those numbers are the figure’s program. The numbers get interpreted, and the figure’s program gets executed. While the figure runs, it can, if it happens to be made up of the right numbers, make a new copy of itself. The figures can also add, delete, or read or write numbers to themselves and each other. It’s a busy life for the squiggly little things.
To get numbers from the array, the system has to reference them by where they are in it. This figure is 76 numbers long; it has 76 numbers in its memory. If the system tries to check a slot that isn’t there, there’s an error, and everything shuts down. Array positions are numbered from zero to the length of the array minus one, so for this figure a bad slot means any negative number, or 76 and any number higher.
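To make the valid range concrete, here is a minimal sketch in Java. The class name, the method name, and the empty 76-slot array are stand-ins invented for illustration, not code from the real system.

```java
public class IndexRangeSketch {
    // Hypothetical helper: true when index names a real slot in memory.
    static boolean isValidIndex(int[] memory, int index) {
        return index >= 0 && index < memory.length;
    }

    public static void main(String[] args) {
        int[] memory = new int[76]; // stand-in for one figure's memory

        System.out.println(isValidIndex(memory, 0));   // first slot
        System.out.println(isValidIndex(memory, 75));  // last slot
        System.out.println(isValidIndex(memory, 76));  // one past the end
        System.out.println(isValidIndex(memory, -1));  // negative

        try {
            int value = memory[76]; // reading a slot that is not there
            System.out.println(value);
        } catch (ArrayIndexOutOfBoundsException e) {
            // The exact message text depends on the Java version.
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```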
It takes three numbers to make up one command. To interpret the commands in a figure’s memory, the system fetches three numbers from the memory array: the ones at pointer, pointer plus one, and pointer plus two. Obviously, if it does that when the pointer is at more than two billion in a figure that is only 76 numbers long, things will go terribly wrong. Except, I thought of that.
It was one of the first things I did. I had to, or the system wouldn’t run. I put in tests to check the current value of the pointer. If the number is negative or too large, the figure is reset and restarted. This can mean that a given figure might do nothing but reset over and over again, but that doesn’t stop the entire world from running, and other more industrious figures can still do their thing.
So, how does this obviously too large value sneak past the safety checks to bring everything to a halt? And why, if the pointer being out of range is the problem, hasn’t the problem happened before? The system has been around for months and has run for hour upon hour, and it only now has this issue?
I decided to check and see if the number the pointer was at was one of the numbers in the memory array. Rather than list out all 76 numbers, I had test code loop through it and set a flag to true if one of the numbers was equal to 2147483646, and added that to the test output.
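The test loop described above might look something like this sketch. The names are hypothetical, and the array contents (including putting the value in slot 10) are made up for the demonstration; the real figure’s memory is not shown in this post.

```java
public class MemorySearchSketch {
    // Hypothetical helper: loop over the memory and flag whether target appears.
    static boolean containsValue(int[] memory, int target) {
        boolean sawNum = false;
        for (int value : memory) {
            if (value == target) {
                sawNum = true;
            }
        }
        return sawNum;
    }

    public static void main(String[] args) {
        int[] memory = new int[76];  // made-up memory, all zeros to start
        memory[10] = 2147483646;     // assumed position, for illustration only

        System.out.println("saw num=" + containsValue(memory, 2147483646));
    }
}
```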
“
saw num=true
“
Looping through in search of the number was the sort of test I’d been running last time, but I was looking for the wrong number so it didn’t help. This time, searching for the right one, I found it right away. That was comforting, since the commands set the pointer to the next value for the next command, so one expects to see the number in the array. If it hadn’t been in the memory, it would have meant that something extremely strange was happening. I was comforted, but still confused.
First the program checks to be certain that the pointer isn’t a negative number, that it isn’t less than zero. 2147483646 is not less than zero and passes that first test. Then, because the program is about to take three numbers, it checks to be certain that pointer plus two isn’t greater than or equal to the size of the memory array. If you just check to be certain the pointer isn’t greater than or equal to the figure’s size, you’ll get an error whenever the pointer is at the very end, or one space up from it. When I wrote that safety check, I thought myself clever. However, I was blithely assuming that any number the pointer might have that was too large would still be too large when increased by two. Unfortunately, computer numbers don’t always act that way.
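The two tests described above can be sketched roughly as follows. This is a reconstruction from the description, with invented names, not the system’s actual code.

```java
public class FlawedPointerCheck {
    // Sketch of the original safety check: reject negative pointers, then
    // reject pointers whose third slot (pointer + 2) falls past the end.
    static boolean passesCheck(int pointer, int size) {
        if (pointer < 0) {
            return false;
        }
        if (pointer + 2 >= size) { // this sum can wrap around for huge pointers
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        int size = 76;
        System.out.println(passesCheck(73, size));         // last pointer with room for three numbers
        System.out.println(passesCheck(74, size));         // too close to the end
        System.out.println(passesCheck(2147483645, size)); // huge, and caught
        System.out.println(passesCheck(2147483646, size)); // huge, yet it slips through
    }
}
```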
If you have a signed variable, a variable that can be a positive or negative number, adding one to the largest number it can represent will turn it into the lowest number it can represent. The largest number a signed 32-bit integer can be is 2147483647. This means that if you add two to the number we’re dealing with, 2147483646, you don’t get 2147483648; you get negative 2147483648. -2147483648 is definitely less than 76, so the number snuck through, and the error happened.
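The wrap-around is easy to see in a few lines of Java. The class and method names here are invented for the demonstration; `Integer.MAX_VALUE`, `Integer.MIN_VALUE`, and `Math.addExact` are standard Java.

```java
public class WrapAroundSketch {
    // Plain int addition: silently wraps around on overflow.
    static int addTwo(int pointer) {
        return pointer + 2;
    }

    // Math.addExact throws ArithmeticException instead of wrapping,
    // which gives a simple way to detect the overflow.
    static boolean overflowsWhenAddingTwo(int pointer) {
        try {
            Math.addExact(pointer, 2);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(Integer.MAX_VALUE);                          // 2147483647
        System.out.println(Integer.MAX_VALUE + 1 == Integer.MIN_VALUE); // true
        System.out.println(addTwo(2147483646));                         // -2147483648
        System.out.println(overflowsWhenAddingTwo(2147483646));         // true
    }
}
```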
That explains everything!
Of the 4,294,967,296 possible numbers an int value can be, only two numbers, 2147483646 and 2147483647, would cause this problem. That means that, if every value is equally likely to turn up as the pointer, the odds of this error happening are two in 4,294,967,296, or one in 2,147,483,648. That explains why the system could run for so long before the error finally showed up.
It was an easy fix, and now I can get back to running experiments, instead of trying to solve issues.
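For the curious, here is one way such a fix could look. This is only a sketch of a possible approach, with invented names, and not necessarily the change that went into the system: instead of adding two to the pointer, it subtracts from the size, which cannot overflow for a real array length.

```java
public class OverflowSafeCheck {
    // Possible fix (an assumption, not the confirmed one): compare the pointer
    // against size - 3, the last position that still leaves room for three
    // numbers. An array length is never negative, so size - 3 cannot wrap.
    static boolean isSafe(int pointer, int size) {
        return pointer >= 0 && pointer <= size - 3;
    }

    public static void main(String[] args) {
        int size = 76;
        System.out.println(isSafe(73, size));         // true: slots 73, 74, 75 exist
        System.out.println(isSafe(74, size));         // false: slot 76 does not exist
        System.out.println(isSafe(2147483646, size)); // false: no longer sneaks through
        System.out.println(isSafe(2147483647, size)); // false
    }
}
```

Other approaches would work too, such as doing the addition in a long, or using `Math.addExact` and treating the exception as a reason to reset the figure.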
Hurray! Another bilious bug defeated!