I have largely been using my LiveJournal page (stenz.livejournal.com) instead of this page as my journal. But I just finished something sufficiently dorky that it is likely better suited for here for a bit - then will later discuss it more in depth there and then also post up the source code to PerlMonks.

I have been trying to figure out how many possible game boards there are for the game Scrabble. It is a far from infinite number, but it is a huge ass number.

I figured, if given the competition dictionary, then you have a fixed pool of words that can be used, and therefore a solvable end number of possible boards that exist.

One could write a program that would just brute force its way through the entire board space and track what all of those boards are.

The problem is when to stop? How do you know when you have all of the possible boards?

It is easy to figure out how many possible boards there would be if you just tried every combination of tiles available on the board space. (there are 100 tiles and 255 spots on the board - that means you can do easy combinatorial math and see that there are something like 10^80 possible boards - that is off of the top of my head from another time I worked it out - that is not exact).

That number is huger than huge. But it is way larger than the real solution space, which is limited by the rules of the game, as well as the competition dictionary which determines that actual words and a limited set need to be formed, not just every combination of tiles on the board.

That should dramatically reduce the space that we are looking at - but by how much?

Well, I wrote some Perl code just now that looks at the competition dictionary. There are over 172K words in there.

I looked at it from the point of view of a blank board and the first starting move.

We know that the starting move is going to be 7 letters long or less.

That allows us to ignore more than 2 thirds of the possible words as our starting moves - down to over 50K possible words that we can start with.

This is a huge break - if that number were huge to start, it would only get huger real fast.

So from that first move, we know that it can be in the number of positions as it is long (a 2 letter word can be in 2 different positions, and a 7 letter word in 7 different positions on the board to start - at least one tile needs to start on the center tile of the board for those that don't know Scrabble).

From there, we look at each letter in the word and determine what words could be made to spawn off of there. If the same letter occurs more than once in a word, we still count the number of words available for it again since we are counting the probability space.

So we add each of those up and then we multiply that number by the number of letters in the word since there are that number of possible starting positions for the word.

So we iterate over the 50K+ words, and then see how many next possible spaces there are. We are one level deep.

To go more levels deep gets more complicated in the programming and the numbers involved start growing a lot - so I only ran it out one level just to see what sorts of numbers we were going to be expecting.

The number I got with the competition dictionary is 147,993,683,161 possible boards for the first two moves.

Things that aren't accounted for at this point:

1) I haven't looked to see what the cutoff for Perl's int is and when I will need to go to Math::BigInt - I presume fairly soon. The only reason that would be a worry is if that number that I got wasn't the real number and instead some overflow or rollover value.

2) I don't account for shifting off of the board. Technically we could start with a 7 letter word and after that use a 12 letter word that goes off the edge of the board - I don't check for that. Not to mention that also would defy the number of letters available to the second move at that time.

3) I don't watch the bag at all - this doesn't simulate real play. Similar to number 2 in that sense. That means that there might be 40 letter Zs on the board, but there should really be no more than 3 (1 allowed and then 2 blanks).

The issue at hand is do we care about the error words creeping in and throwing off the count?

If we can say that we don't care, then that is very good because we can save a huge amount of computation each turn and therefore crank through this quite quickly.

At first I thought it wouldn't matter - but as I write this I see it does matter.

I was thinking in a linear way - which is how us humans normally thing - but this is growing a non-linear way.

For every error that creeps in and is assumed to be a legit word, it is magnified by the possible moves of the start word - so up to 7 times the errors straight off - and then it will have its own children spawned off of itself - and none of those should count towards the total.

So while I initially thought it wasn't that big a deal if a few hundred thousand errors crept into the first one, it does matter because it adds many many more to the end total.

As a result, I am seeing that this code is a rough draft, but currently too large a number...

In the end, that is a good thing - we want this number as low as possible the whole time.

Both for brute forcing, but especially in terms of tracking the number of boards. Each board is a 255 char string - if there are billions of them (which it looks like there are going to be more like trillions at this point) - then can we really track that?

Say we wanted a database that had a table that was just "boards" - could we have trillions of rows? I'm guessing not, but I honestly don't know.

A database that could handle that would be very nice though since you could then have a massively distributed effort to brute force the boards and have the boards submitted and tracked.

This adds in the issue of error checking the incoming data to see if there are the proper number of tiles and the like.

Anyway, this is the first step in all of that.

I'm hoping the addition of error checking code in my Perl script will allow me to get that number down in the millions - but I'm suspecting it will remain in the billions for now - which worries me for later growth.

At least we know that we will only approach the 10^80 figure but never reach it by a long shot.