I don't really talk about this much, but we deliver a lot of real-time market data to thousands of customers. Part of my responsibility is running a group that monitors the real-time environment, which involves some very strange things like multicast and other unusual requirements that most standard products don't deal with. A good example is tracking the holidays and the open and close times for each of the 200+ exchanges whose feeds we bring into our global POPs.
The infrastructure sees more changes from development than anything else at the company. Figuring out the proper "state" is next to impossible, which makes monitoring a challenge, to say the least. They have a custom tool, and we are working with the developers to get a web services interface on it so we can better understand state before presenting false alarms to the Realtime operators.
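To make the "state" problem a bit more concrete, here is a minimal sketch of the kind of check I have in mind: only raise a stale-feed alert when the exchange is actually expected to be sending data. Everything here is hypothetical (the `ExchangeCalendar` class, the `should_alert` function, the 30-second threshold), and it is not how our custom tool works. It also assumes the caller has already converted the current time to exchange-local time.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, time


@dataclass(frozen=True)
class ExchangeCalendar:
    """Hypothetical per-exchange trading calendar, in exchange-local time."""
    code: str
    open_time: time
    close_time: time
    holidays: frozenset = field(default_factory=frozenset)

    def is_open(self, local_now: datetime) -> bool:
        # Weekends and listed holidays are treated as closed.
        if local_now.weekday() >= 5 or local_now.date() in self.holidays:
            return False
        # Open at open_time (inclusive), closed at close_time (exclusive).
        return self.open_time <= local_now.time() < self.close_time


def should_alert(calendar: ExchangeCalendar,
                 local_now: datetime,
                 seconds_since_last_tick: float,
                 max_silence: float = 30.0) -> bool:
    """Alert on a silent feed only while the exchange should be trading."""
    if not calendar.is_open(local_now):
        # A quiet feed outside trading hours is expected, not an alarm.
        return False
    return seconds_since_last_tick > max_silence
```

A real version would also need half-days, pre-market and after-hours sessions, and proper time zone handling, which is exactly why we want the state to come from a web services interface rather than from rules we maintain by hand.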
We are cleaning up the rules by pushing the responsibility for writing proper rules onto the developers. This should fix things as we audit the existing 75,000 rules in the custom monitoring tool. Going forward, rules are required with each software release.
Part of the problem with all of these custom, old, homebuilt, somewhat crappy tools is that we need to extract non-standard metrics from them and use those metrics for capacity analysis. Aside from the standard system and network capacity planning, there also has to be software capacity planning. The team generates a monthly report, which is very large and complex, taking anywhere from 60 to 100 man-hours to create. I want to automate the report more, or build some kind of self-service reporting or BI portal. These are initial thoughts, but it is something we need to start discussing.
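As a rough illustration of what "automate the report more" could look like, here is a small sketch that rolls raw metric samples up into one average-and-peak row per component, compared against a capacity limit. The function name `summarize_capacity`, the component names, and the input shapes are all made up for the example. Our real metrics are messier and would first have to be extracted from the custom tools.

```python
from collections import defaultdict
from statistics import mean


def summarize_capacity(samples, limits):
    """Roll up (component, value) samples into one report row per component.

    samples: iterable of (component_name, measured_value) pairs (hypothetical shape)
    limits:  dict mapping component_name to its capacity limit (hypothetical)
    """
    by_component = defaultdict(list)
    for component, value in samples:
        by_component[component].append(value)

    rows = []
    for component in sorted(by_component):
        values = by_component[component]
        peak = max(values)
        limit = limits[component]
        rows.append({
            "component": component,
            "avg": mean(values),
            "peak": peak,
            "limit": limit,
            # Peak utilization as a percentage of the capacity limit.
            "peak_pct": round(100.0 * peak / limit, 1),
        })
    return rows
```

Feeding rows like these into a dashboard or BI tool is the self-service piece. The hard part is still getting clean samples out of the existing tools.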
Have a good weekend, please leave comments!