Building an Automated Crossword Puzzle Generator

October 22, 2010

For my in-progress iPad puzzle game, I needed to generate a whole lot of crossword-like puzzles. I'm not making a crossword puzzle, but the game requires something similar than can be easily adapted from such an algorithm.

Crossword puzzles are basically a grid of letters constrained by shared letters and words coming from a dictionary or word list. It doesn't seem all that difficult, but coming up with a working, reasonably fast algorithm is anything but easy.

There are a few examples of crossword puzzle generators around, notably one in C by Robert Morris. I'd rather build my own in C++, since I have slightly different set of requirements.

There are probably a host of ways to implement such a beast. I tried a couple ways and came up with something that works sufficiently for my purpose. It would seem to be perfect for an evolutionary algorithm, something I find fascinating, but I didn't try that. In any case finding a type of solution is usually complicated by finding a data structure to support the grid, letters, words and intermediate structures, and building word lookups that will be the bulk of the runtime spent.

In my case I built a multi-level structure with a simple array of letters, with a layer above describing where the words will go. My word lookup came from some existing code I used in a prior iPhone game and could be thought of as a simplified regex engine (not how it's implemented) which only supports letters and positional wildcards.

There are also two types of layouts that the algorithm has to consider, one I call "loose" which you see in British crossword puzzles (lots of holes) and "tight", which are generally seen in the corners of US style crosswords (like a 3 x 3 grid with six words). It's possible to write support for loose and fail to work in tight. It took me a few tries to get the tight to work.

My generator will be used offline so speed wasn't terribly important. I profiled the code with Instruments (XCode) and found that the word lookups take 98% of the running time, so optimizing that would likely improve it immensely. Since I am not using a canned regex engine I have plenty of ways to make it faster by pre-calculating the data. I did a few of these when I wrote the code originally. Once I have built a few patterns (the layout of letters and blank spaces) I expect to generate 100-200 puzzles in about an hour.

There is bug in the code as it currently exists which I won't deal with for now; apparently under some circumstance the algorithm can miss a whole subtree of potential solutions. Debugging something that runs anywhere from a thousand to hundreds of thousands of steps where one step might be wrong is massively painful. Add having to use GDB for debugging with a lot of painful printfs of STL data structures. Here I wish I was using Visual Studio's debugger. Maybe XCode 4's new debugger will make life easier once it ships.

My algorithm can be thought of as a depth first solution; if you think of the solution space as a tree, I am trying to go as deep as possible before backing up from dead ends. It might not be the best solution but it works sufficiently in that it returns correct solutions in an acceptable time, even if there is the possibility of unfound solutions.

Currently the algorithm is designed as a single thread. I think creating an algorithm using multiple threads in order to speculatively try branches simultaneously might obtain better runtimes, it isn't required at the moment. The evolutionary option is also likely to result in faster solutions but a much longer development time. Mine took on and off about 4 days this week.

I have to admit this was both fun, and probably the hardest C++ code I have written since I did the memory allocator HeapManager (13 years ago). Now back to Objective-C and the iPad.