CodeWorld post-mortem: Shared project outage
Unfortunately, for the last two weeks, CodeWorld users experienced problems with projects they had shared in the past. Projects shared without source code would fail to load if they had been shared too long ago. However, the same projects continued to work when loaded and run with source code in the CodeWorld environment.
This post explains what happened, who was impacted, and what I’m doing to prevent similar issues in the future.
Background
The problem was with shared projects in CodeWorld. When you create and run a project in CodeWorld, you can easily share it with others by using the Share button at the bottom of the screen.

You have two options for sharing a project. The default is to share with the code. When you choose to share with code, others following your URL can see both the running program and the code that you’ve written. The other choice is to share without the code. When you choose to share without code, your URL will point straight to the running program, but it cannot be used to view your code.

Sharing without code is mainly used in two situations. The first is where you really want to keep your code secret, such as if you are creating an assignment where you want to give your own students an example of what their program should do, but do not want them to have the code to copy. The second is when you are embedding your CodeWorld program into another page, like a blog post, gallery, or other web site. In this case, the full CodeWorld environment and code editor would just get in the way.
What happened
Two weeks ago, I partially broke shares without code. That is, while you could still share your program with code, sharing it without the code no longer worked correctly. Even worse, in some ways, was that it appeared to work at first, but the links that you shared would stop working at a later time.
The consequence of this was mainly that other resources that embed or link to CodeWorld programs would now show blank white pages where the programs were supposed to be. The most important thing I’m aware of that broke is the first few weeks of Joachim Breitner’s CIS 194 lecture notes at http://www.seas.upenn.edu/~cis194/fall16/, which embed a large number of CodeWorld programs into some of the pages. That looked something like this:

As of the early morning of Friday, August 3rd, all links should be working again. In total, these resources were broken for about two weeks. During this period, there were approximately 8,000 attempts to view shared CodeWorld programs that failed to load.
The bug that tracked this as it was happening is here.
A red herring
Initially, I misdiagnosed the problem. Once I noticed the thousands of errors in the server logs, I picked one of the errors to reproduce, so I could watch it occur and find out what went wrong. Something odd happened!
All shared code is stored on the server in files named after a hash of the source code. I discovered that the program that had failed was stored under a different hash than the one computed from its source code! This was suspicious, and I set out immediately to figure out why it was happening.
This was a mistake. I still don’t know why the hash originally computed for that program is different from the one computed today. Maybe there was a bug in the hashing algorithm, in some older version of that library? Who knows, really! In any case, I now do know that this has nothing to do with the shared program outage. In fact, once the code is stored, it no longer matters at all how its file name was computed.
The bad news is that this cost me hours of debugging time. The good news, though, is that I now understand that changing a hash function is not a disaster for CodeWorld. This frees me up to move to a new hash function in the future, instead of the very sub-optimal choice of MD5 that I made seven years ago!
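To make the storage scheme above concrete, here is a minimal sketch in Python. It is not CodeWorld's actual code: the function names, the `.hs` suffix, and the flat directory layout are assumptions made for illustration. It shows why a changed hash function is harmless: the hash only picks a file name when the program is saved, and later lookups use the name from the share link rather than re-hashing the contents.

```python
import hashlib
import tempfile
from pathlib import Path


def store_program(root: Path, source: str) -> str:
    """Save source under a name derived from its hash and return that name."""
    # MD5 appears here only because the post mentions it; any stable hash works.
    name = hashlib.md5(source.encode("utf-8")).hexdigest()
    (root / f"{name}.hs").write_text(source, encoding="utf-8")
    return name


def load_program(root: Path, name: str) -> str:
    """Look up a program by the name in its share link, without re-hashing."""
    return (root / f"{name}.hs").read_text(encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        name = store_program(root, "program = drawingOf(circle(1))")
        # A file saved long ago under a name from some other hash still loads,
        # because lookup never compares the name against the contents.
        (root / "legacy-name.hs").write_text("program = drawingOf(blank)", encoding="utf-8")
        print(load_program(root, name))
        print(load_program(root, "legacy-name"))
```

Under these assumptions, moving to a new hash function only affects the names given to newly shared programs; existing links keep pointing at existing files.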
Why it really happened
When students write code in CodeWorld, it’s written in a dialect of Haskell. In order to run, it needs to be translated to JavaScript. In the past, CodeWorld would store the original Haskell and the translated JavaScript code in the same directory. When a new version of the CodeWorld library is released, part of the process is to delete all that translated JavaScript so that it will be re-translated when needed, using the new library.
As part of an effort to make CodeWorld more scalable for more students, I moved all data from one server’s local disk to a network filesystem (NFS) on July 20th. This made walking through all the directories of shared programs and deleting old translated JavaScript about a thousand times slower! That was too much. In order to make it faster, I moved the translated JavaScript code to a new directory, separate from the Haskell code, so that it can be deleted in one big operation.
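The difference between the two layouts can be sketched roughly as follows. This is an illustrative Python sketch, not the real CodeWorld server code; the directory names and the `.js` and `.hs` suffixes are assumptions. In the old layout, clearing translated output means visiting every project directory. In the new one, everything that must be deleted sits under a single root.

```python
import shutil
import tempfile
from pathlib import Path


def clear_per_project(source_root: Path) -> int:
    """Old layout: visit every project directory and delete each .js file."""
    removed = 0
    # Materialize the list first so files are not deleted during the walk.
    for js_file in list(source_root.rglob("*.js")):
        js_file.unlink()
        removed += 1
    return removed


def clear_build_root(build_root: Path) -> None:
    """New layout: translated output has its own root; remove it in one call."""
    shutil.rmtree(build_root, ignore_errors=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source_root = Path(tmp) / "source"
        build_root = Path(tmp) / "build"
        source_root.mkdir()
        build_root.mkdir()
        (source_root / "abc.hs").write_text("program = drawingOf(blank)")
        (build_root / "abc.js").write_text("// translated output")
        clear_build_root(build_root)
        print(build_root.exists())                # the whole build root is gone
        print((source_root / "abc.hs").exists())  # sources are untouched
```

Note that in this sketch the single call removes the build root itself, not just its contents, so whatever writes translated output afterwards has to be prepared to recreate it.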

When I made this change, though, I needed to make sure that before translating the Haskell to JavaScript, I’d created the directory where the JavaScript is supposed to go. Before, this hadn’t been necessary, because that was the same directory where the Haskell source code already lived. But I forgot one place where this needed to happen. As a result, CodeWorld shared programs would only work if they were first run using a method that created the right directories. After that had happened once, they would work fine, but only until the next time the CodeWorld library was updated and the translated JavaScript was deleted again.
How it was fixed, and lessons learned
The immediate fix was to modify the translation process to always create the output directories whenever they are needed, regardless of how it was run. That change got the site back into a working state.
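The shape of that fix can be shown with a short Python sketch. It is hypothetical code, not the actual change: `write_translated` and the directory names are made up for illustration. The point is that the function writing the translated output creates its own target directory, so no caller has to remember to do it first.

```python
import shutil
import tempfile
from pathlib import Path


def write_translated(build_root: Path, name: str, javascript: str) -> Path:
    """Write translated output, creating the target directory if it is missing."""
    target = build_root / f"{name}.js"
    # Safe to call every time: exist_ok avoids an error when the directory exists.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(javascript, encoding="utf-8")
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_root = Path(tmp) / "build"  # does not exist yet
        print(write_translated(build_root, "abc", "// output").exists())
        # Simulate a library update that wipes all translated output.
        shutil.rmtree(build_root)
        print(write_translated(build_root, "abc", "// output").exists())
```

Putting the directory creation next to the write, rather than in each entry point, means a newly added code path cannot reintroduce the same omission.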

The next task is to prevent something similar from happening again. This problem should have been easily detected, because the web server was giving errors every time someone attempted to view a shared program (8,000 times over two weeks). Unfortunately, the existing monitoring only checks one URL to ensure it’s working. It does not look at errors from actual users. I’ve filed a bug to track extending the monitoring of the web site to look at actual user requests, and page me if a substantial number of them are failing. I’ll work on that in the near future.
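As a rough illustration of what monitoring based on real user requests might check, here is a small Python sketch. Everything in it is an assumption made for the example: the `RequestRecord` type, the 5% threshold, and the minimum request count are placeholders, not a description of the monitoring that will actually be built.

```python
from collections.abc import Sequence
from dataclasses import dataclass


@dataclass
class RequestRecord:
    path: str
    status: int  # HTTP status code returned to the user


def failure_rate(records: Sequence[RequestRecord]) -> float:
    """Fraction of requests that ended in a server error (status 500-599)."""
    if not records:
        return 0.0
    failures = sum(1 for r in records if 500 <= r.status <= 599)
    return failures / len(records)


def should_page(records: Sequence[RequestRecord],
                threshold: float = 0.05,
                min_requests: int = 20) -> bool:
    """Alert only when there is enough traffic and too much of it is failing."""
    return len(records) >= min_requests and failure_rate(records) > threshold


if __name__ == "__main__":
    window = [RequestRecord("/run.html", 200)] * 30 + [RequestRecord("/run.html", 500)] * 10
    print(failure_rate(window))  # 10 failures out of 40 requests
    print(should_page(window))
```

The minimum request count is there so that a single failed request during a quiet hour does not page anyone; the right values would depend on real traffic.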
The big lesson here is that CodeWorld is becoming more of a widespread shared resource. Downtime doesn’t just matter during school sessions. Rather, CodeWorld is a resource that is used for documentation, demos, games, blogs, and other bits and pieces by many people. We ran into some growing pains here. The kind of monitoring I need to add would already exist in any commercial project. Time to step up that game!
Conclusions
I do want to apologize to anyone affected by this mistake, and thank the whole community for their good will and patience. If you made one or more of the 8,000 failed requests, I hope you will keep trying, and that I can win back your trust!
Thanks to our great community.