Thursday, December 15, 2005

FreeBSD ports via authenticated proxy firewall

If you need to use the FreeBSD ports system from behind a proxy, you may have found these articles outlining how to make fetch(1) work over a proxy, and how to replace fetch(1) with wget(1). If you are using FreeBSD 4.10 (I think this applies to 4.x since 4.7) with an authenticated proxy server (one that requires a name and password to access web sites), there is one more step: add DISABLE_SIZE=1 to /etc/make.conf under the FETCH_CMD line to make these versions of FreeBSD ports work with wget.
Hope that helps someone.

Lessons learned in error handling

Error messages, user feedback, status reporting, etc.

Can the operator understand this message?
Can they do anything about it?

First, determine what the operator must do in this event. Then relay that instruction to them clearly, and potentially log it as well, perhaps via email. The message to the operator needs to be unmissable. For the desktop apps in our system, the message appears in big red letters in a separate modal box with a warning sound, and nothing but deliberately using the mouse to hit the close button will allow things to proceed. The message is worded in clear language and contains no unnecessary software/database/network jargon or error codes.

If there is nothing for the operator to do, the message is logged and sent to the engineers, without even notifying (read: disrupting) the operator.
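The routing rule above can be sketched in a few lines. This is only an illustration in Python, not the code from our system: the Fault type, the route_fault function and the two notify callbacks are all hypothetical names, and sending every fault to the engineers (not just the silent ones) is an assumption based on the "potentially logged as well" remark.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Fault:
    """Hypothetical record of something that went wrong."""
    technical_detail: str                        # error codes, jargon: engineers only
    operator_instruction: Optional[str] = None   # plain language, or None if nothing to do


def route_fault(
    fault: Fault,
    notify_operator: Callable[[str], None],
    notify_engineers: Callable[[str], None],
) -> str:
    """Send the fault to whoever can act on it and report who was interrupted."""
    # Assumption: engineers always get the technical detail (log, email, etc.).
    notify_engineers(fault.technical_detail)
    if fault.operator_instruction is not None:
        # Only interrupt the operator when there is something they can do,
        # and only with the plain-language instruction.
        notify_operator(fault.operator_instruction)
        return "operator"
    return "engineers"
```

In a real desktop app, notify_operator would be the big red modal box and notify_engineers the log or email; here they are just callables so the rule itself stays visible.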

This follows on to the next principle: if the software can make an effort at recovery from any given issue, then it should do so.

The desktop apps in our system call COM+ components to do work, and those components call stored procedures in the database that is the heart of the system. Originally the apps just tried to call the component method, and put up a tiny message box if there was an error. The error code and the standard error message came first, sometimes followed by some kind of explanation, perhaps a general one or a simplified one.

Thus, quite often the operators would suddenly discover a little box they could not understand and had no idea of what to do with, and would call the engineers. For all kinds of reasons.

I implemented a simple idea - if the call to the COM+ application server failed, try again before bailing. This means that temporary network outages, or application server restarts, etc, have little to no impact on the operators, where once they would bring the factory to a halt.
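The retry idea is small enough to sketch. The version below is in Python rather than the language of the original desktop apps, and every name in it (call_with_retry, attempts, delay_seconds, sleep) is made up for illustration; the number of attempts and the pause between them are assumptions, not the values used in our system.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation(); if it raises, try again before bailing.

    After `attempts` failed tries the last error is re-raised, so the caller
    can decide what (if anything) to tell the operator.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Exception = RuntimeError("operation was never called")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:  # a real app would catch narrower errors
            last_error = error
            if attempt < attempts:
                # Give a restarting server or a flaky network a moment to recover.
                sleep(delay_seconds)
    raise last_error
```

A call site would wrap the component call, for example call_with_retry(lambda: component.do_work(order)), where component.do_work is a stand-in for whatever method the app invokes. Only when the final attempt fails does the error reach the message-handling code.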

The system was originally designed in the Microsoft culture of software design - users are stupid, and they should learn how to use this system better. This involves just putting up little boxes with confusing messages whenever something goes wrong.

I've tried to use the Apple culture of software design - users are experts in their field, but not in this software system. This means it is up to the software team to make it easy for the experts to do what they do, and my job to keep the software out of their way.

For comparison, the Unix culture of software development simply does not have users - Unix is designed for programmers, which is why we love it.

Users and software - clicking dialogs

There is a concept in software UI that users just click whatever they can to get the software to go away and let them get on with their work. When a dialog pops up users will click the buttons to get rid of it without so much as glancing at what the message says.

Why do they do this? They have been well trained to work this way.

Who trained them? Thousands of poorly designed applications that constantly throw up meaningless dialogs, which poor users just have to click to get rid of.

The first 100 times each user saw a dialog they might have read what it said, maybe even written it down on paper to try to work out what was happening. Once they realise that these messages are useless, and even mostly harmless, they just ignore them. With a culture like that, you'll never get a user to read anything.