April 2007 Archives

Brian K. Jones

Mike Hendrickson posted a chart showing what looks like a downward trend in sales of books to systems folks. I was not shocked by the chart. I was not shocked to see admin books falling off a bit. I *was* shocked to see that there are people within O’Reilly, people “in the know” where technical publishing trends are concerned, who seem surprised to see such a trend.

I think there are several pretty obvious reasons why sales of books targeted at systems administrators are trending downward, and will likely continue to do so:

We’re writing more code

The first reason that comes to mind is that a great many of us are tired of waiting for developers to solve our problems for us, and there are lots of technologies available that are making it easy to roll our own solutions without having to dig into the guts of the systems to get things to integrate properly. In other words, we’re all becoming more developer-like than ever, and are buying more books related to that aspect of our jobs.

A great many admins know Perl, and that used to be the primary language for writing web-based applications. The emergence of Ruby, Python, and PHP as mature web development languages that are also very easy to work with has made admins take a look at turning to web-based tools to solve some of their problems. This has forced them to work not only with the scripting languages themselves, but also with the databases and other interfacing technologies like SNMP, LDAP, NIS, DNS, and whatever else holds data of interest to sysadmins. For evidence, simply look around at how certain classes of admin tools are evolving: even where they aren’t browser-based, they are web-based. Red Hat’s entire Red Hat Network is an XML-RPC-based web service written in Python. Splunk, Nagios, Ganglia: all architected around web services and/or web-based interfaces.

Furthermore, I think projects like Puppet (written in Ruby, by the way - by a sysadmin) show that administrators are tired of developers coming up with unnecessarily complex, convoluted solutions to our problems. We’re taking the bull by the horns and doing it ourselves, to the extent that we can. Those that aren’t writing full-blown tools from the ground up are probably busy writing all the glue code to make existing tools integrate properly or make them do something else properly (developers rarely write thorough, complete code for sysadmin tasks - I’ve contributed code to various projects to help them get LDAP right, for example).

In short, admins like base technologies like DNS servers, database servers, LDAP servers, log servers, and the like. But we *need* tools to help us make decisions faster. We have more data than we can reasonably boil down and use to make decisions before that data becomes irrelevant. We need tools that not only monitor and alert, but also aggregate, integrate, and automate, to help us make decisions about what’s going wrong today, what we’ll grow out of tomorrow, and how users are using our services. There’s still tons of development work to be done.

Another reason for the development shift is outsourcing. Lots of administrative tasks have been, and to some extent are being, outsourced. Of course, coding jobs have also been outsourced. What’s an admin to do? Well, I’ll tell you what lots of admins are doing: first - they’re hanging on to the jobs they have, which means they may be dealing with technology in a more evolutionary way rather than being forced to get up to speed on a particular technology for a new job, as may have been the case in the days of the bubble when turnover almost certainly was greater in our field. Second, they’re broadening their resumes to become more well rounded. Some may even chase after technologies to put on their resumes, or they may go after technologies in use by companies they’d like to work for. Many of these may be code-centric.

We’re not all moving toward development.

The plain fact of the matter is that, as data centers across the globe have matured, gotten bigger, more difficult to manage, etc., companies are forcing administrators into smaller and smaller pigeon holes. These days, generalist administrators seem to be working almost exclusively in shops where there are fewer than 10 administrators. Shops much bigger than that now hire for very highly specialized positions like “mail administrator”, “backup administrator”, “storage administrator”, “network administrator”, and the like. In other words, if there were ever large numbers of administrators who all needed a broad, deep knowledge (read: lots of books!), that’s no longer the case.

Alas, the chart lacks quite a bit of data that might also be helpful in understanding what it’s trying to say. What has happened to the job market? Is there still such a flow of admins into the job market that the market for these books could reasonably be supported? Have the books kept up with new technologies in administration? Have they at least kept up with newer versions of older technologies? Finally, what are the publishers doing to foster growth in book sales in these areas? Heck, I was at O’Reilly’s “OSCON ’06”, and there was just about *nothing* there to attract administrators, unless O’Reilly also believes that we’re slowly starting to focus more on writing code and data storage, in which case, this chart should contain *zero* surprises! :-)

Publishers are part of the problem, too!

Not to harp on the publishers, but admin documentation, as currently published in books, is really only useful to a point. Generalist administrators who have been working in the field for less than 3 years can find more books than they could ever read, and receive tons of knowledge. There’s a decreasing rate of return, however, for admins who pass the 5-7 year mark (depending on environment). There are few books for admins beyond the 7 year mark, because a whole lot of what we deal with starts to become more about site-specific special cases that aren’t covered in most books. Even classics like the BIND book, and the bat book, have occasionally failed me due to some special case or another.

All of this doesn’t account for areas where the publishers have just flat-out failed administrators. The only *really* good book on LDAP is over 1000 pages in size, and not put out by O’Reilly. The O’Reilly LDAP book is good, but it’s not an LDAP book - it’s an OpenLDAP book. SNMP is another extremely useful technology that has received disgustingly poor coverage. Many of the books on the topic are either too specific (SNMP over Wi-Fi, for example), outdated (some books were never updated for SNMPv3), or too focused on using specific tools instead of understanding and using the technology - which is where I felt the O’Reilly SNMP book fell down. There are, meanwhile, very few books that cover things like cfengine, LTSP, PXE/syslinux, and the like, in the detail they really deserve and warrant… and need - if these technologies are going to be made useful to a wider audience.

And what about the new technologies? There are new DNS servers, new automounters, new automation frameworks, new Linux server distributions, more new tools, more databases, more systems programming interfaces, and more languages to use with them. Where are the books? Did we *really* need another SQL book? There’s a whole shelf of them! Where’s my “Python for Systems Administrators”? Where’s my “CFEngine: The Definitive Guide”? Where’s the follow-up to probably the best-selling sysadmin book in 5 years, “Time Management for System Administrators”? Do you *REALLY* think time management is the only soft skill lacking in the sysadmin community? Scour the self-help and business aisles, grab a few books on negotiation, business logic, Sun Tzu, and self-improvement, and throw “for Systems Administrators” or “for Geeks” on the end of each title. Some of them will do well.

I could go on like this for hours, but I should probably let everyone else chime in before their lunch hours are over :-) I just want to say that I think we should’ve all seen this drought coming, and I, for one, am pretty surprised that it’s not more pronounced.

Niel M. Bornstein

Hello again for the first time! Until now I have concentrated my postings on XML.com, but my work has lately taken me into other realms. So let me start by introducing myself to the Sysadmin crowd. I work as a Senior Architect for Novell’s Systems and Resource Management Business Unit. What that means is that I spend most of my time visiting our clients’ data centers and doing proofs of concept of our first data center automation (DCA) product, ZENworks Orchestrator.

Data center automation? Well, that’s what the new data center is all about. It incorporates all the stuff that’s hot (or cool actually) in the data center world: consolidation, virtualization, simplification, rationalization, and a bunch of other -ations. But in short, it’s all about getting you a good night’s sleep.

Anton Chuvakin

After long months of undercover work, CEE is ready to be presented to the world.

Below is an excerpt from a brochure, to be published at MITRE’s site any day now. I do think that the world is ready for another battle for the establishment of a logging standard, after a long string of miserable failures.

“Common Event Expression (CEE™): A standard log language for event interoperability in electronic systems.

CEE standardizes the way computer events are described, logged, and exchanged. By utilizing a common language and syntax, CEE takes the guesswork out of even the most menial of event- or log-related tasks. Tasks including log correlation and aggregation, enterprise-wide log management, auditing, and incident handling which once required expensive, specialized analysts or equipment can now be performed more efficiently and produce better results.

Why CEE?

If multiple systems observe the same occurrence, it should be expected that their description of that event is identical. When combined with relevant event details (time, source, destination), a computer should be able to immediately determine whether two or more logs, data logs, audit logs, alerts, alarms, or audit trails refer to the same event. In order to make this happen, there needs to be a scalable, well-defined way to express events.”

I will post more stuff as well as the link to the detailed brochure, when it is available. Next: four areas of log standardization, recommended by CEE. Stand by!

Brian K. Jones

I’ve seen more than a couple of sites in the past where there are teams of administrators who work together to maintain the system and/or network infrastructure, or the data management infrastructure, or whatever. On these teams there’s often a lot of task overlap even when the team is made up of specialists. For example, it might be the case that no single person on the team is “in charge” of DNS. If there’s a DNS-related issue, whoever sees it first, or whoever is on call, is the person who deals with it.

Inevitably, whoever is dealing with the issue today seems to have a script they’re not very proud of that does something useful that helps them solve some problem that comes up, well, often enough that there’s a script to help deal with it.

The key words these admins tend to use to describe these scripts are “hack”, “brute force”, “spaghetti code”, “inelegant”, “5-liner”, and “quick-n-dirty”. This is by no means a complete list - other slang and colloquialisms abound. They often try to rationalize the hack’s creation by following up with something that makes them seem witty, a comment like “I’m lazy, so I just hacked it together one day”. Putting aside the condescending tone of the comment (it implies that you’re not nearly as “lazy” or witty as they are), the fact of the matter is that the truly lazy among us will find a way to *not* write *any* code if we don’t have to, and to minimize the code we ever have to write. Ever.

How? Well, one way is to take advantage of two common admin traits: a natural tendency toward sloth, and their ability to write and improve code to do useful things. See, in a lot of groups, where there’s task overlap, there’s also code redundancy. In other words, everyone has their own hack they coded in a hurry to do basically the same exact task…. because they’re lazy.

If you’re in a group with lots of disparate and probably redundant code, you might find an unlikely friend in CVS. You can create a repository called “adminstuff”, and under it, import modules representing a problem domain, a service, or something else that makes sense for your needs. While you’re at it, go ahead and create modules for the various directories around your environment that are full of code that’s either managed using something local to the machine like RCS, or not managed at all. This has several benefits:

First, stuff that’s not managed, or is managed using a facility on the local machine, is now in a central location. This is nice because, if the machine croaks on you one day, you don’t have to go to backups - you can instead just do a checkout of the module to another box and get back to work.

Second, everyone winds up writing less code, because instead of having a script per person per task, there’s just a script per task that anyone/everyone in the group can work to improve. This increases the chances that the code will be *less* hackish than it once was.

Third, if you create a read-only user to check out the code, and read-write accounts for the developers in the group, then you get some level of accountability for free, because each person will have to use their own credentials to commit changes to the code.

Fourth, I forgot to mention the ability to roll back to earlier versions of the code if something breaks!

Fifth, and this is one of my favorites, it means you don’t have to ssh to a machine, su to root (or remember to use sudo), and edit the code as root, which just feels dirty to me. Now you can just check the module out to your workstation, work in your own development environment, and check it into CVS when you’re done. This means I don’t have to think about whether I’m using the Solaris vi or the Linux vim install, and I can even use my shiny new IDE I found if I want to. This is a little more convenient than scp’ing code around, or using a root account to copy it to a user directory and chown’ing it or some of the other hacks I’ve seen (and even used once or twice) to get around limitations of not having some code management mechanism in place.

Finally, moving and organizing admin scripting tasks in this way may naturally lead to the creation of an administrative API for your environment, which means everyone writes much, much less code. For example, I grabbed a “hack” someone told me about the other day, and after editing it to take advantage of our API, I was able to cut the line count of that “hack” in half, while simultaneously adding more checks to the operation, making it more robust.

There are other benefits, of course, like the ability to write the enforcement of certain data handling or task-related policies into the single, unified code base instead of hoping everyone is following the policy in whatever hacks they’re using. I’ve also found that I work on more code, because it’s more convenient to work on. If that feeling catches on in your environment, how can it *not* improve how things get done?

If you’re an admin who is new to CVS, I’ve taken some time to write up a CVS cheat sheet that covers some things you might find useful. Enjoy!

Anton Chuvakin

Somebody posted a message to a loganalysis list seeking help with analyzing a trillion log messages. Yes, you’ve heard it right - a trillion. Apart from some naive folks suggesting totally unsuitable vendor solutions, there was one smart post from Jose Nasario (here), which implied that the original poster will need to write some code himself. Why?

Here is why (see also my post to the list): assuming 1 trillion records of 200 bytes each, which is a typical PIX log message size (a bit optimistic, in fact), we are looking at roughly 180TB of uncompressed log data. And we need to analyze it (even if we are not exactly sure for what, hopefully the poster himself knows) … not just store it.
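For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope sketch. The record count and message size come from the estimate above; the scan rate is purely a hypothetical figure picked for illustration, so plug in whatever your own hardware can sustain.

```python
# Back-of-the-envelope sizing for a trillion log messages.
RECORDS = 10**12           # one trillion log messages
BYTES_PER_RECORD = 200     # average message size assumed in the text

total_bytes = RECORDS * BYTES_PER_RECORD

# Decimal terabytes (10**12 bytes) versus binary tebibytes (2**40 bytes).
terabytes = total_bytes / 10**12
tebibytes = total_bytes / 2**40

# Hypothetical sustained sequential read rate: 100 MB/s (an assumption).
SCAN_RATE = 100 * 10**6
scan_days = total_bytes / SCAN_RATE / 86400

print(f"{terabytes:.0f} TB decimal, {tebibytes:.0f} TiB binary")
print(f"one full pass at 100 MB/s: about {scan_days:.0f} days")
```

The "roughly 180TB" figure matches the binary reading (about 182 TiB); in decimal units it is 200 TB. Either way, a single sequential pass at the assumed rate takes weeks, not hours.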

Thus, I hate (ehh, make it “have” :-)) to admit that Jose is probably right: writing purpose-specific code might be the only way out. About a year ago, there was a discussion titled “parsing logs ultra-fast inline” on the firewall-wizards list about something very similar. We can look up some old posts by Marcus Ranum for useful tips on super-fast but purpose-specific log processing.

For example, here he suggests a few specific data structures to “handle truly ginormous amounts of log data quickly” and concludes that “this approach runs faster than hell on even low-end hardware and can crunch through a lot of logs extremely rapidly.” One of the follow-ups really hits the point that I am making here and in my post: “if you put some thought into figuring out what you want to get from your log analysis, you can do it at extremely high speeds.” A few more useful tips are added here.

So, nothing much we can do here - you are writing some code here, buddy :-) And, as far as tips are concerned, here is the “strategy” :-) to solve it:

1. figure out what you want to do

2. write the code to do it

3. run it and wait, wait, wait … possibly for a long time :-)

Indeed, there are many great general purpose log management solutions on the market. However, we all know that there is always that “ginormous” amount of data that calls for custom code, heavily optimized for the task at hand.
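To make the “figure out what you want, then write the code” strategy a bit more concrete, here is a minimal sketch of purpose-specific processing: one pass over the data, answering exactly one question (how often does each value of a single field occur?). The function name, the field position, and the sample lines are all hypothetical; this is not the approach from the mailing list threads, just an illustration of how small such code can be.

```python
from collections import Counter


def count_field(lines, field_index):
    """Count the values of one whitespace-separated field in a single pass."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip short or malformed lines instead of failing mid-run.
        if len(fields) > field_index:
            counts[fields[field_index]] += 1
    return counts


# Made-up sample lines; a real log will have its own layout.
sample = [
    "Apr 10 12:00:01 fw1 deny tcp 10.0.0.5",
    "Apr 10 12:00:02 fw1 permit tcp 10.0.0.6",
    "Apr 10 12:00:03 fw1 deny udp 10.0.0.5",
]

for value, n in count_field(sample, 4).most_common():
    print(n, value)
```

Because the function only iterates, you can hand it an open file or `sys.stdin` instead of a list, and memory use then depends on the number of distinct values rather than on the number of log lines.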

James Turner

The electron, smallest of the three particles that make up the atom, how often do we take this plucky little lepton for granted? But let them stop flowing through the wires leading into our computers, and we quickly realize just how dependent we are on them. This lesson was brought home to me last night, as a late season snow storm took out the power to my house for 4 hours.

This isn’t the first time that PSNH (Power Shutdowns Noticeably Happen) has failed in their contractual obligation to keep the electrons flowing to my house. In fact, we lost power for 3 days in late January, and for 4 days in 1998 during the ice storms. I have pretty much everything in the house on a UPS (7 in all), and have an auto-starting generator on the budget for this year.

But once we take care of something, we tend to put it out of our mind. “No need to worry about power failures, I have a UPS…” But we forget that most solutions come with new problems of their own. I got a practical demonstration of this fact when I started to hear a loud annoying chirping coming out of my rack a couple of times a day, lasting a minute. With that much stuff in my rack, there’s a lot of things that can make noise, but I surmised right off the bat that one of my UPSi was trying to tell me something.

Sure enough, one of them had its “replace battery” light on. Rather than replace the battery, I took the opportunity to order a 1500 VA rack mount UPS, since traditional “cinderblock” UPSi don’t really work well in a rack. Besides, the rack mount ones look cooler. That UPS is due to arrive via, well, UPS today. Unfortunately, when I needed it last night, I was stuck with the old one. It immediately started to chirp in a rapid, panic-inducing manner. Luckily, that UPS doesn’t power either of the two systems in the rack, so I didn’t need to worry about an abrupt shutdown on a PC. It did however power my brand new 24″ Gateway monitor, which began to flash on and off about once a second, as the UPS failed in an interesting mode.

There are two lessons for SysAdmins to take away from this. Firstly, a UPS is not a buy-once-and-forget item. You’re going to need to plan in advance to replace the batteries as they age. In a medium-size datacenter (one large enough to have lots of racks, but small enough not to have centralized conditioned power), it probably makes sense to standardize on a single model of UPS, and keep spare batteries around. Secondly, I relearned something I already knew but had forgotten: you should have all of the critical items for a PC attached to the same UPS as the PC. Because the USB hub that my keyboard and mouse were attached to was on the flaky UPS, I had to scramble to attach them directly into the back of the PC so that I could shut it down.

Chris Josephes

I read an article on Slashdot the other day about a newly released open source application. I read a few of the comments, and I found this one (slightly paraphrased):

$ apt-get fooapp
"You have searched for packages named fooapp in all distributions....Can't find that package."
Sorry, I'm not interested.

The comment suggested that since he can’t try the software in a pre-built distribution then it isn’t worth trying.

Unlike a few years ago when every Linux user ran configure by hand, the speed and convenience of installing packages has put the compiler on the back burner. Packages aren’t a bad thing, but I think it’s a poor reflection of an administrator’s skill set if they shun the development tools that are available for every Unix environment.

I’m not saying that packages should be avoided. I build them myself for software that I have compiled and tested manually. Once packaged, they’re pushed out during the remote installation process. After that, I am considered the distributor for certain applications in the infrastructure, and the primary support contact. I’ll also concede that it doesn’t make much sense to recompile Gnome or KDE if my OS vendor provides a pre-built package, along with support and regular upgrades. I won’t be too optimistic about installing packages that aren’t built or approved by the author of the software.

When I interview an administrator, I usually throw in a couple of programming interview questions, such as, “How do you determine which shared libraries a program requires?” or “What steps would you take to compile the Apache webserver?”. I don’t expect candidates to be full-fledged C programmers, but I think it’s important to know how to build software. The candidates that have demonstrated these skills have also been more proficient in debugging, tracing system calls, and identifying performance problems ahead of time.
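For the first of those questions, the usual answer on Linux is `ldd`. As a rough sketch of how that answer can be folded into admin tooling, the snippet below parses `ldd`-style output into a dictionary. The function names and the sample output are hypothetical, and the exact output format varies between platforms, so treat the parsing as an assumption rather than a specification.

```python
import subprocess


def parse_ldd(output):
    """Map each 'name => path' line of ldd-style output to its path (or None)."""
    libs = {}
    for line in output.splitlines():
        if "=>" not in line:
            continue  # e.g. the vDSO or the dynamic loader itself
        name, _, rest = line.partition("=>")
        rest = rest.strip()
        # Unresolved libraries show "not found"; some entries carry only an address.
        if not rest or rest.startswith("not found") or rest.startswith("("):
            libs[name.strip()] = None
        else:
            libs[name.strip()] = rest.split()[0]
    return libs


def shared_libraries(program):
    """Run ldd on a program (assumes ldd is installed) and parse the result."""
    result = subprocess.run(["ldd", program], capture_output=True,
                            text=True, check=True)
    return parse_ldd(result.stdout)


# Illustrative sample only; real output depends on the system and the binary.
SAMPLE = """\
\tlinux-vdso.so.1 (0x00007ffd5a5f2000)
\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1c2a000000)
\tlibfoo.so.1 => not found
"""

missing = [name for name, path in parse_ldd(SAMPLE).items() if path is None]
print("unresolved:", missing)
```

A check like this is handy after installing a hand-built package: an empty `missing` list for each binary is a quick sign that the runtime dependencies made it onto the box.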

Anton Chuvakin

I’ve been wanting to create those for a loooooong time and finally - here they are (you can guess I’ve been on a long flight :-)). Some are admittedly tongue-in-cheek, but useful nonetheless. So, enjoy Anton’s “Top 11 Reasons to Collect and Preserve Computer Logs”, presented in no particular order:

  1. Before anything else, do you deal with credit cards? Patient info? Are you a government org under FISMA? A financial org? You have to keep ’em - stop reading further.
  2. What if there is a law or a regulation that requires you to retain logs - and you don’t know about it yet? Does the word “compliance” ring a bell?
  3. An auditor comes and asks for logs. Do you want to respond “Eh, what do you mean?”?
  4. A system starts crashing and keeps doing so. Where is the answer? Oops, it was in the logs - you just didn’t retain them …
  5. Somebody posts a piece of your future quarterly report online. Did John Smith do it? How? If not him, who did? Let’s see who touched this document, got logs?
  6. Malware is rampant on your network. Where did it come from? Who is spreading it? Just check the logs - but only if you have them saved.
  7. Your boss comes and says ‘I emailed you this and you ignored it!!’ - ‘No, you didn’t!!!’ Who is right? Only email logs can tell!
  8. Network is slow; somebody is hogging the bandwidth. Let’s catch the bastard! Is your firewall logging? Keep the info at least until you can investigate.
  9. Somebody added a table to your database. Maybe he did something else too - no change control forms were filed. Got database log management? How else would you know?
  10. Disk space is cheap; tape is cheaper still. Save a log! Got SAN or NAS? Save a few of them!
  11. If you plan to throw away a log record, think - are you 100% sure you won’t need it, ever? Exactly! :-) Keep it.

Have more? Feel free to suggest your own reasons below!

Coming soon: “Top 11 Reasons to Look at Your Logs”

Brian K. Jones

Ok, so I’m not completely sold yet. I still have a boatload of Perl code floating about, and for certain things I’m still writing *new* Perl code. However, I was coerced into using Python for a project I’m working on, and I have to say that I think Python is coming on to me.

I try to ignore the furtive glances, and those times when I could swear it’s actually winking at me. I don’t acknowledge the beguiling smiles and greetings I get from Python when I open my laptop. I just get down to the business of coding and pretend none of it ever happened. That pretending is getting harder by the day.

Here’s the thing. I’ve been doing all of my sysadmin scripting in Perl, awk and shell (sometimes together) for a decade. After 10 years, Perl still doesn’t even say “hello” to me. It seems to stand ready to spit my own code all over me whenever I try to talk to it. And just when I’m about ready to call it my friend, just when I think I know it, it completely changes. Well, I’m tired of it. I’m tired of the schizophrenia. I’m tired of the attitude. I’m tired of feeling like a Perl n00b after using it for 10 years.

I’m leaving.

Of course, like lots of relationships, it’s complicated. Over the years, Perl and I have spawned offspring that aren’t going to just disappear because I decide I don’t like Perl anymore. I promise to care for them and keep them up to date.

But from this day forward, I’m going with Python in those places where I can. I *want* to feel confident with a language. I *want* to take advantage of code reuse, self-documenting code, and OO design principles. I *want* to have readable, concise code. I *want* to solve problems that are larger than the everyday “please change my shell” requests. I want to build tools. I want to architect solutions. I want to solve some of the problems sysadmins face, but I had to solve my own big problem first: namely, being a self-hating Perl slinger who was never particularly comfortable with how Perl, at a very high level, works.

If you’re an admin using Python on a regular basis for your admin scripting, let me know how you think it compares to Perl for equivalent tasks. If you’re a Perl coder who has tried and *not* used Python for reasons besides the lack of curly braces, fill us in! If you’re a religious zealot for one language and have never used the other, feel free to move on!