Friday, April 25, 2008

Travels

I've been travelling a bit the last week, and I am off to London for most of next week to work on strategy and organizational structure. I was offered a seat at IBM Pulse, but I don't think I'm going to make it. I will still be attending HP Software Universe in mid-June, which should be interesting given the advances in the HP products from the Mercury and Opsware integrations.

Toolset Changes

As many of you know, we just completed a large purchase of another company, so in the next few weeks I will know what I am going to be doing. In the meantime we are figuring out how to converge our toolsets. There will be more activity here as we work through the issues and determine the final set of products we are going to be pushing out across the whole company. Depending on the deals we make with IBM and other vendors, there may also be other areas where we can standardize.

More to come in this area.

IBM POTs

This week we were offsite at the tech center in NYC for a day trip. We looked at IBM Tivoli Provisioning Manager (TPM), the provisioning and deployment product. It is one of the products we are considering standardizing on. I wanted to get a clear picture of what it can and cannot do versus Opsware SAS. The product looks good, but I still need to write up the full gap analysis. It would definitely meet most of our patching, inventory, and deployment requirements, but it doesn't cover the system administration or the complex audit and control requirements we face due to customer audits and regulatory compliance.

Last week IBM brought us a POC for IBM Tivoli Monitoring (ITM), which is the monitoring platform. It compares to HP OpenView Operations (OVO). We are going to bring it in house to do more testing, but after our initial one day with the product, we found the following:


 

Issues in POC:

  1. Multiple times the agent died, and the server died. Nothing indicated what the error was, and the only fix was a manual restart.
  2. The POC did not go over agent installation.

Environment:

    Pros to ITM:

  1. Reporting is nicer, and based on open standards.
  2. Multiple servers roll up into a single TEMS more easily than in OVO.
  3. More flexible about the operating system, database, and platform the components can run on.
  4. IBM is quicker to support new component versions (OS, application server, etc.).

    Cons to ITM:

  1. Email notifications outside of event escalation are not manageable, except by using command-line calls with email addresses as arguments.
  2. Scenarios applied to groups are not easily manageable, meaning you have to maintain the policy across a lot of notifications.
  3. The UI is not as easy to use; there are fewer wizards to guide the engineer through the workflow of making a change or implementing something new.
  4. Everything seems to run as a separate agent. So you will have a Windows OS agent, a Universal Agent, a Custom Agent, etc., with all of them running as separate services and processes.

Friday, April 4, 2008

Realtime market data systems

I don't really talk about this much, but we deliver a lot of realtime market data to thousands of customers. Part of my responsibility is running a group which monitors the realtime environment. It involves very strange things like multicast, and other abnormal requirements that most standard products don't deal with. A good example is exchange holidays and the open and close times for each of the 200+ exchanges whose feeds we bring into global POPs.
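
To make the exchange-hours problem concrete, here is a minimal sketch of the kind of check a monitoring rule needs before it complains about a silent feed. This is not how our tools do it; `ExchangeCalendar`, `should_alert_on_silence`, and the sample hours are made-up names and values for illustration only.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta, timezone, tzinfo


@dataclass(frozen=True)
class ExchangeCalendar:
    """Hypothetical trading calendar for one exchange."""
    name: str
    tz: tzinfo                 # local time zone of the exchange
    open_time: time            # local opening time
    close_time: time           # local closing time
    holidays: frozenset = frozenset()  # local dates with no trading

    def is_open(self, now_utc: datetime) -> bool:
        # Convert the timestamp to exchange-local time first.
        local = now_utc.astimezone(self.tz)
        if local.weekday() >= 5:           # Saturday or Sunday
            return False
        if local.date() in self.holidays:  # exchange holiday
            return False
        return self.open_time <= local.time() < self.close_time


def should_alert_on_silence(cal: ExchangeCalendar, now_utc: datetime) -> bool:
    """A quiet feed is only worth an alarm while the exchange is open."""
    return cal.is_open(now_utc)


if __name__ == "__main__":
    # Assumed example values: a fixed UTC-4 offset and 09:30-16:00 hours.
    cal = ExchangeCalendar(
        "EXAMPLE", timezone(timedelta(hours=-4)),
        time(9, 30), time(16, 0), frozenset({date(2008, 3, 21)}),
    )
    now = datetime(2008, 4, 25, 15, 0, tzinfo=timezone.utc)
    print(should_alert_on_silence(cal, now))
```

In practice you would pass a real time zone object (for example `zoneinfo.ZoneInfo` on Python 3.9+) instead of a fixed offset so daylight saving is handled, and multiply this by 200+ exchanges with half days and one-off closures.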

The infrastructure has more changes by development than anything else at the company. Figuring out the proper "state" is next to impossible, which makes monitoring a challenge to say the least. They have a custom tool, and we are working with the developers to get a web services interface onto it so we can better understand state before presenting false alarms to the Realtime operators.
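
The idea of checking state before raising an alarm can be sketched roughly like this. The web services interface does not exist yet, so the state lookup is just a callable you pass in; `filter_alarm`, the state names, and the alarm fields are all assumptions for illustration.

```python
from typing import Callable, Optional

# Hypothetical states in which a component is expected to be down.
EXPECTED_DOWN_STATES = {"maintenance", "deploying", "decommissioned"}


def filter_alarm(
    alarm: dict,
    get_state: Callable[[str], Optional[str]],
) -> Optional[dict]:
    """Return the alarm annotated with state, or None if it should be suppressed.

    `get_state` stands in for a call to the (future) state web service.
    It takes a component name and returns a state string, or None if unknown.
    """
    state = get_state(alarm["component"])
    if state in EXPECTED_DOWN_STATES:
        return None  # expected outage, do not bother the operators
    annotated = dict(alarm)  # copy so the caller's alarm is not modified
    annotated["state"] = state if state is not None else "unknown"
    return annotated


if __name__ == "__main__":
    states = {"feed-handler-01": "deploying", "feed-handler-02": "running"}
    for name in ("feed-handler-01", "feed-handler-02", "feed-handler-03"):
        print(name, filter_alarm({"component": name, "msg": "no data"}, states.get))
```

Unknown state still goes through to the operators here, marked as "unknown", on the theory that a missed real outage is worse than one extra alarm.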

We are cleaning up the rules by pushing the responsibility onto developers to write the proper rules. This should fix things as we audit the existing 75,000 rules in the custom monitoring tool. Going forward the rules are required with each software release.

Part of dealing with all of these custom, old, homebuilt, somewhat crappy tools is that we need to extract metrics which are non-standard and use them for capacity analysis. Aside from the standard system and network capacity planning, there also has to be software capacity planning. The team generates a monthly report, which is very large and complex, taking anywhere from 60-100 man hours to create. I want to automate the report more, or build some kind of self-service reporting or BI portal. These are initial thoughts, but something we need to start discussing.
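
As a rough idea of what automating part of that report could look like, here is a small sketch that rolls raw samples up into a per-metric monthly summary. The metric name, the sample shape, and the capacity figures are invented for the example; the real report pulls from tools that are nowhere near this tidy.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_summary(samples, capacity):
    """Summarize (metric, day, value) samples per metric and month.

    `capacity` maps each metric name to its assumed maximum, so the peak
    can be expressed as a percentage of capacity.
    """
    buckets = defaultdict(list)
    for metric, day, value in samples:
        buckets[(metric, day.strftime("%Y-%m"))].append(value)

    rows = []
    for (metric, month), values in sorted(buckets.items()):
        peak = max(values)
        rows.append({
            "metric": metric,
            "month": month,
            "mean": mean(values),
            "peak": peak,
            "peak_pct": round(100 * peak / capacity[metric], 1),
        })
    return rows


if __name__ == "__main__":
    # Hypothetical samples: messages per second on one feed.
    samples = [
        ("msgs_per_sec", date(2008, 3, 3), 4000),
        ("msgs_per_sec", date(2008, 3, 17), 6000),
        ("msgs_per_sec", date(2008, 4, 1), 7000),
    ]
    for row in monthly_summary(samples, {"msgs_per_sec": 10000}):
        print(row)
```

The hard part is not this arithmetic but getting the non-standard metrics out of the custom tools in a consistent shape, which is where most of the man hours go today.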

Have a good weekend, please leave comments!

What’s going on in my crazy life

Well, I bought a new ride this week as a second car: http://www.vw.com/R32/en/us/ It's pretty fun, and I'm happy with the selection. The R32 is more practical than my current car, but also still fun and good in the snow (AWD).

On to more technical matters: we were hoping to accomplish more this week than we in fact did, probably because many people on my teams were ill and one of our major contributors was out of the office for half the week. Here are some highlights of another fun-filled week:

  1. Got the new cluster in hand for the SevOne expansion. We are getting a lot of requests to do diagnostic NetFlow analysis with the product, so we want to start using it more and see what we can do with it.
  2. Implemented Opsware NAS on a physical box, which we are cutting over to on Monday. This is the first V2P we have done. As NAS grows we need more horsepower than our current VMware implementation can offer it.
  3. Integration of the two Netcool systems is progressing; we now have a common schema, and we are testing the schema and some replication.
  4. Had one of my guys show off the nice new BAC environment. We have finally worked out the last of the kinks (some bugs with alerting, etc) and should be live next week.
  5. Alerting from OVOW directly to Netcool was put into place, with an ingenious custom solution and implementation by the guy who manages the OVO team. It still needs some debugging, but should be 100% next week.
  6. POCs:
    1. Stood up environment for OVOW 8.0 testing
  7. Tactical:
    1. We have a ton of reporting requests, and we are trying to get some of them out of the way as quickly as possible.
    2. Backup of the F5 configs centrally (not really our space, but we took it).
    3. I am helping the network guys debug an application/network issue which they have been struggling with for weeks. It should go a lot faster now that I've broken down the debugging steps and requested the proper logs and traces.