The program takes several command line arguments, listed below. By default, it starts with the --nocoor and --no-problem-reports flags. It is easy to change the default flags that the program starts with (see below).
-f                        Display the fact list from the monitor server after each MS query
-q                        Dump the Monitor Server query before sending it to the MS
-r file                   Load the CLIPS rules from file
-s hours                  Generate a shift report at every multiple of this many hours of the day (default is 8)
-d directory              Directory in which to write the shift reports, etc.
--dump-facts              Dump the facts after every iteration (to stdout)
--dump-active-messages    Dump active log messages after every iteration
--matches patname         Print out the CLIPS match info for a rule after every iteration
--offline                 Use defaults for running outside the online system
--nocoor                  Don't send commands to coor
--no-problem-reports      Don't log problem reports
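As a rough illustration of the -s option, the sketch below computes when the next shift report would fall due. It assumes the multiples are counted from local midnight, so the default of 8 gives reports at 00:00, 08:00 and 16:00. The function name next_shift_report is hypothetical and the actual scheduling code in daqAI may work differently.

```python
from datetime import datetime, timedelta


def next_shift_report(now: datetime, hours: int = 8) -> datetime:
    """Return the first time strictly after `now` whose hour of the day
    is a multiple of `hours` (assumption: counted from local midnight)."""
    if not 1 <= hours <= 24:
        raise ValueError("hours must be between 1 and 24")
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    completed = now.hour // hours  # whole periods finished since midnight
    candidate = midnight + timedelta(hours=(completed + 1) * hours)
    # If `hours` does not divide 24, restart the count at the next midnight.
    return min(candidate, midnight + timedelta(days=1))


print(next_shift_report(datetime(2002, 9, 12, 3, 20)))  # 2002-09-12 08:00:00
```

With the default of 8, a program started at 3:20am would write its first report at 8:00am.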
To change the default startup options:
setup daqAI
vi $DAQAI_DIR/bin/start_daqAI.sh
Scroll down in the file until you find the startup line for the daqAI_XMLMonitor program, and alter the command line options there. For example, if you'd like the program to send sclinits to coor when it finds an l2muf problem, remove the --nocoor option from the command line.
The current simple version is running online, and its log files are on the web (thanks to Doug for all the web code, which is what runs the L3 log file system).
Here is an example shift report from the morning of Sept 12, 2002. The beam came in at about 3:20am or so, so this represents what happened for 4 hours at fairly high luminosity (20E30, I think). You can see a list of all shift reports at this link. They are created once every 8 hours.
The program also writes out a much more detailed log file. For example, see this log file. Since log files don't continuously change, I can't point you to an interesting spot. However, here is an example "problem report" for one of the detected sync errors.
System seems to be back to normal (68 seconds)
Here is a history of the problem:
2002-08-12-08:25:42 Event Rate has Dropped to 0Hz
2002-08-12-08:25:51 L2 Crate 34 has its error bit set
2002-08-12-08:25:51 Issuing COOR command: sclinit
2002-08-12-08:25:51 Established reason for downtime: L2 Crate 34 lost sync
2002-08-12-08:25:51 L1 FEB in crate 99 is >50%
2002-08-12-08:25:51 L1 FEB in crate 98 is >50%
2002-08-12-08:25:51 L1 FEB in crate 97 is >50%
2002-08-12-08:25:51 L1 FEB in crate 96 is >50%
2002-08-12-08:25:51 L1 FEB in crate 83 is >50%
2002-08-12-08:25:51 L1 FEB in crate 82 is >50%
2002-08-12-08:25:51 L1 FEB in crate 81 is >50%
2002-08-12-08:25:51 L1 FEB in crate 80 is >50%
2002-08-12-08:25:51 L1 FEB in crate 31 is >50%
2002-08-12-08:25:51 L1 FEB in crate 16 is >50%
2002-08-12-08:25:51 L1 FEB in crate 107 is >50%
2002-08-12-08:25:51 L1 FEB in crate 106 is >50%
2002-08-12-08:25:51 L1 FEB in crate 105 is >50%
2002-08-12-08:25:51 L1 FEB in crate 104 is >50%
2002-08-12-08:25:51 L1 FEB in crate 103 is >50%
2002-08-12-08:25:51 L1 FEB in crate 102 is >50%
2002-08-12-08:25:51 L1 FEB in crate 101 is >50%
2002-08-12-08:25:51 L1 FEB in crate 100 is >50%
2002-08-12-08:26:42 Event Rate has Dropped to 1.276Hz
2002-08-12-08:26:50 System seems to be back to normal (68 seconds)
Yes, sorry, everything is in decimal here. It should be hex (I hate decimal). You can see all the log files at this link. A new log file is started every time the program is restarted (don't click on the "current link" -- there seems to be a Linux kernel/NFS bug that prevents that from working most of the time).
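The 68 seconds quoted in the problem report above can be checked from its own timestamps. The sketch below parses a few of the history lines and takes the difference between the earliest and latest entries. The helper names parse_entry and downtime_seconds are hypothetical and are not part of daqAI.

```python
from datetime import datetime

# Timestamp layout used in the problem report, e.g. 2002-08-12-08:25:42
TIMESTAMP_FORMAT = "%Y-%m-%d-%H:%M:%S"


def parse_entry(line):
    """Split one history line into (timestamp, message)."""
    stamp, _, message = line.partition(" ")
    return datetime.strptime(stamp, TIMESTAMP_FORMAT), message


def downtime_seconds(lines):
    """Seconds between the earliest and latest entries in a history."""
    times = [parse_entry(line)[0] for line in lines]
    return int((max(times) - min(times)).total_seconds())


history = [
    "2002-08-12-08:25:42 Event Rate has Dropped to 0Hz",
    "2002-08-12-08:25:51 Issuing COOR command: sclinit",
    "2002-08-12-08:26:50 System seems to be back to normal (68 seconds)",
]
print(downtime_seconds(history))  # 68
```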
The current version looks for a few specific errors.
Besides recognizing more problems, things on the todo list include (please send suggestions):
If this project goes anywhere I'll get some graphics up. The order of things is as follows:
This code is implemented in two CVS packages. The CLIPS package contains the CLIPS code (version 6.2 currently), and the daqAI package contains everything else. The daqAI_XMLMonitor.cpp file is the overall driver.
You can look at the generic rules file here. The basic idea is to start with the most basic inferences (the daq rate is low, there is a store in, etc.), and then combine them into higher-level inferences (the daq rate is low and there is a store in, so there must be a problem, etc.).
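The layering described above can be sketched in a few lines. This is plain Python rather than CLIPS, and it is only an illustration: the fact names, the LOW_RATE_HZ threshold and the two rules are made up for the example and are not taken from the real rules file.

```python
LOW_RATE_HZ = 10.0  # hypothetical threshold for "the daq rate is low"


def basic_facts(event_rate_hz, store_in):
    """Lowest level: turn raw monitor readings into simple facts."""
    facts = set()
    if event_rate_hz < LOW_RATE_HZ:
        facts.add("daq-rate-low")
    if store_in:
        facts.add("store-in")
    return facts


# Higher levels: each rule is (premises, conclusion). Hypothetical examples.
RULES = [
    (frozenset({"daq-rate-low", "store-in"}), "daq-problem"),
    (frozenset({"daq-problem", "l2-crate-error-bit"}), "l2-crate-lost-sync"),
]


def infer(facts, rules):
    """Apply the rules repeatedly until no new fact can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in derived and premises <= derived:
                derived.add(conclusion)
                changed = True
    return derived


print(sorted(infer(basic_facts(0.0, True), RULES)))
# ['daq-problem', 'daq-rate-low', 'store-in']
```

A low rate with no store in produces only the basic fact and no problem, which is the point of combining the two.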
The expert system is stateless. That is, all inferences are cleared each time through the process loop (see the Design section above). If the system is in a problem state for 60 seconds, the problem identifier rules will fire 60 times (once per second if all goes well). The logging and COOR command infrastructure notices when the set of log messages or COOR command requests changes, and fires off actions (like an sclinit or a log message) only when these sequences change.
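A minimal sketch of that change detection, assuming the infrastructure simply remembers the previous iteration's requests and acts only on the ones that are new. The class name ChangeDetector is hypothetical, and the real daqAI code may decide what to fire differently.

```python
class ChangeDetector:
    """Remember the last requested sequence and report only new entries."""

    def __init__(self):
        self._previous = ()

    def update(self, requested):
        """Return the requests that were not present on the last iteration."""
        current = tuple(requested)
        if current == self._previous:
            return []  # nothing changed, so nothing is fired
        newly_requested = [r for r in current if r not in self._previous]
        self._previous = current
        return newly_requested


# The rules ask for an sclinit on each of 60 iterations of a problem state,
# but the command is issued only on the first one.
detector = ChangeDetector()
issued = [detector.update(["sclinit"]) for _ in range(60)]
print(sum(len(batch) for batch in issued))  # 1
```

Once the problem clears and the request list goes empty, a later request for the same command counts as a change and is issued again.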
This design does make it hard to do "try x, if x fails, then try y". For that you almost want a state machine and the expert system combined, with the expert system firing transitions in the state machine. This is possible, but more work, and not required for the current simple version implemented here.
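For illustration only, here is one way such a combination might look. The stateless expert system reports on each pass whether the problem is still present, and a small state machine remembers which remedy was tried last. Nothing like this exists in the current code. The state names, the remedies "x" and "y", and the RecoveryMachine class are all hypothetical, and a real version would also wait some interval before declaring that a remedy had failed.

```python
# state -> (next state, remedy to try) while the problem persists
TRANSITIONS = {
    "normal": ("tried_x", "x"),
    "tried_x": ("tried_y", "y"),
    "tried_y": ("needs_human", None),
    "needs_human": ("needs_human", None),
}


class RecoveryMachine:
    """Keeps the state that the stateless expert system cannot."""

    def __init__(self):
        self.state = "normal"

    def step(self, problem_present):
        """Advance one step; return the remedy to try, or None."""
        if not problem_present:
            self.state = "normal"  # problem cleared, start over next time
            return None
        self.state, remedy = TRANSITIONS[self.state]
        return remedy


machine = RecoveryMachine()
print([machine.step(True) for _ in range(3)])  # ['x', 'y', None]
```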