Thursday, 20 April 2017

Sys-Admin/Dev Ops : Assumption is Danger

As a systems admin, or dev ops, or whatever your job title might be, never ever assume that the person you're handing a system to has a clue.  This might seem harsh, but it's true, and proves itself true time and time again.

"Assumption is the mother of all f**k ups"

About a year ago I deployed a system which automatically sent requests to remote machines (via SMS) getting those machines to report their status or send back error information, but also to gather some basic information.

It has run happily for a whole year, it has been all pretty plain sailing, the hours and hours of work I put into it, to automate it and keep it self-sustained have paid off, zero faults, zero down time, self-regulation is the way forward for me; even if it took slightly longer to put the system in place, it has needed no human input for nearing a year!

However, the unit needed to move, about a week ago, it needed to be physically picked up and taken out of my small server room and into the official server room, a dark cupboard basically controlled not by myself or my cohort, but the IT boffins.

Fine, I notified the customers, went off to the IT area, sorted out who I was to hand it to and physically delivered it to the chap, I watched him start to plug it all back together, power, wires, boot, fine....

I assumed he'd do this seamlessly....

Until this morning, well a morning last week, as I post these with a date in the future.  That morning was hell, I walked into a wall of customers not being able to get to their machines, the Easter weekend was looming, performance needed to be monitored, customer sites didn't have regular staff, explaining to temporary cover staff that the system would be off was not a prospect I relished.

To be frank, a lot of flapping going on, more than I expected... IT reported the system back online, but customers didn't stop flapping... Indeed, none of the estate seemed to be able to connect in... 1 hour, 2 hours, I've asked the boffins to check it time and again "It's fine", they tell me.

I look locally, I can't see the controller machine on the network, I can't see it through the remote management console... Where the hell is the machine?

I assure the customers I'll have answers within the hour, I hit social media with the same, this is going very public, and I'm rather annoyed as for a whole year things have run seamlessly, but been ignored; now it's offline for a scheduled purpose and everyone is complaining, I do not want my success wiped away in a flood of negative press.

I call the IT boffins... "we'll look into it"... No, no no, you'll get onto it right now, not look, not glance, answers are needed.  Action from you is needed before my Re-Action goes nuclear.

I wait, five minutes, I was willing to give them ten.... My phone rings...

Them > "Hello?"...
Me > "Answers?"...
Them > "Yeah, you know when you brought it back?"...
Me > "The Machine?"....
Them > "Yes"...
Me > "I remember, why?"...
Them > "Well, it has power"...
Me > "Good"....
Them > "Not really"...
Me > "Why not?"...
Them > "Because that's all it has, it's not been plugged into the network"

I hung up.  They plugged it into the network, I had a slew of data come through... The customers were pacified.

I however was not.

I've had an on the spot review, firstly the IT bod who did this was held to account, second I was held to account for not noticing.

In not noticing I admit that having had it run cleanly for a year I had turned off the performance reports and I admitted I had assumed a network machine being handed to an IT bod would be plugged into the network.  People were not happy, least of all me, but that was the fall out.

However, I then had to do a tertiary clean up and after the Easter break I spoke to three of my main customers, trusted operators, the actual folk who should have been using the machines at the remote sites; not temporary staff; I asked them why they had not noticed.  The replies...  "Because it had worked for so long without an issue", "like you make it work, so we just guess it always is" and "we didn't notice it was offline".

They were very much putting everything into my court, assumption on the part of all parties was to blame.

The lessons learned for me are to now keep checking, keep monitoring, use my automation to report status, to fix faults and if human errors creep in, to let me know.

I'm now off to spec up a service I can run on one of my own servers, just to ping the network machine which went AWOL and receive a report from it to let me know what it's up to, this might be a bit of python or just bash on a cron task, but it's going to be something rather than nothing.
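A minimal sketch of what that check might look like in Python. It assumes the machine exposes some TCP port to probe (a plain ICMP ping would need extra privileges), and the function names are my own invention for illustration, nothing here is the real system:

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False


def status_line(host, port, reachable):
    """Build the one-line report that gets logged or sent on."""
    state = "UP" if reachable else "DOWN"
    return f"{host}:{port} is {state}"


def check(host, port, notify=print):
    """Probe the machine once and pass the report to notify()."""
    reachable = is_reachable(host, port)
    notify(status_line(host, port, reachable))
    return reachable
```

Run from cron every few minutes, something like check("controller.example.lan", 22) (a made-up host name and port) would have flagged the unplugged network cable long before the customers did; swapping notify for a function which sends a mail or an SMS is the obvious next step.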

I will NOT assume again.

Tuesday, 18 April 2017

Software Development : Failed to get Agile

I've just been party to a conversation about a project elsewhere in my work place, my team is not involved, I was observing passively (alright, alright, I was ear-wigging).

The conversation was quite heated, one member of staff was adamant things were fine, whilst another was adamant they were inadequate.  The two of them were at complete loggerheads. The driver of the conversation ran like this:

"We're not really designing software, we're asking everyone's opinion, writing it all down and only picking the things we really need to do"

As an agile developer this is essentially how I run my team, we write every possible item down, everything, and I weight them, schedule them and during our sprint hand-overs we reorg who is going to tackle differing parts of the system to share the experience and share different things.

This chap however, was incredulous... He expressed "WRITING EVERYTHING DOWN" as a bad thing... He only wanted to do the things he felt fit, he wanted to sit down and look at the specification, produce an analysis and ONLY do what he suggested.

This would have made perfect sense to me towards the end of my academic study of software development; before the reality struck home in the work place, and I was flabbergasted to hear this chap simply working twenty something years in the past.

I mean, he's old... This company is old... But, not that old surely?

I literally caught myself tipping my head to one side as if trying to pour those words, and the way they were said, back out of my brain.

He didn't stop there though, he sat and without knowing it essentially dismissed as absurd the complete concept of Agile development; at least Agile as I use it...

"You'd be constantly juggling which task to do next, swapping people on and off tasks.... What would you do?  Meet daily, what would be the point?"

I'm not sure whether this was genuine inflexibility or purposefully derailing the effort to adopt agile beyond the scope of my own team, whichever it was, it sounded and felt extremely awkward.

It makes me wonder quite if anyone outside my team actually uses Agile processes around here...

Sunday, 9 April 2017

Server Admin : How Good is your Backup?

How robust is your back up solution?  Go on, be honest with yourself, how good is it?... Because I've seen a whole host of them and, at this very moment, this is the screen up on one of my servers....


Yes, my RAID 5, just a test RAID 5 with three really bad recycled SAS drives in it, has failed; this doesn't surprise me, but it does delay me because I now have to rebuild the data... However, I know my data is good.... Let's see how good my back up is.

This back up is coming from a DD created raw image of the virtual disk, stored to and soon lifted from my NFS accessible ZFS mirrored back up server.

Therefore you would be right to ask, why are you rebuilding the virtual RAID disk in the above screen shot?  Well, I'm going to test my back up strategy!

I popped the known bad disk and the good disks out, replaced all three and I'm able to test a restore to a new virtual disk set, I have a USB boot drive ready, this is a test.

This kind of test, a real live restore, is sorely missing from so many enterprise set ups, so ask yourself is your back up going to work?
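One way to put a number on "did the restore work" is to checksum the backup image and the restored copy and compare the two. This is only a sketch under assumptions: it presumes the restored virtual disk has been read back out to an image file of the same size as the original dd image (a larger block device would hash differently), and the function names are hypothetical:

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file or raw image in chunks so large images never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(backup_image, restored_image):
    """True when the restored copy is byte-for-byte identical to the backup."""
    return sha256_of(backup_image) == sha256_of(restored_image)
```

A matching checksum only proves the bytes came back intact, booting the restored system is still the real test.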

Wednesday, 5 April 2017

Development : Anti-Hungarian Notation

Whilst cutting code I employ a coding style, which I enforce, whereby I mark the scope of each variable with a prefix on its name.

"l_" for Local
"m_" for Member
"c_" for constant
"e_" for enum

And so forth, for static, parameter and a couple of others.  I also allow compounds of these, so a static constant would be:

"sc_"

This is useful in many languages, and imperative in those which are dynamically typed and do not make you declare your variables, such as Python.
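As a rough illustration of how the prefixes read in Python. The class and its names are invented for this example, and "p_" for a parameter is my assumption here, as only the local, member, constant, enum and static-constant prefixes are spelled out above:

```python
c_MAX_COUNT = 10  # "c_": a module-level constant


class Counter:
    sc_STEP = 1  # "sc_": a static constant shared by every instance

    def __init__(self):
        self.m_count = 0  # "m_": a member of this instance

    def increment(self, p_times):  # "p_": assumed prefix for a parameter
        l_total = self.m_count  # "l_": a local, gone when the method returns
        for _ in range(p_times):
            l_total += Counter.sc_STEP
        # Clamp to the constant, then store the local back into the member.
        self.m_count = min(l_total, c_MAX_COUNT)
        return self.m_count
```

Reading increment() you can tell at a glance which names outlive the call and which do not, without looking up a single declaration.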

Some confuse this with "Hungarian Notation", it's not.  Hungarian notation is the practice of prefixing a type notification to the variable name, for example "an integer called count" might be "iCount".

I have several problems with anyone using Hungarian Notation, and argue against it thus. With modern code completion and IDE lookup tools this is really not needed, with useful and meaningful naming of your variables the type is not needed and finally there are multiple types with the same possible meaning... i.e. "bool", "BYTE" and "std::bitset" are they all 'b'?  What about signing notation, so you compound "unsigned long" as "ul" to the name?

It all gets rather messy, a good name is enough.

However, the scope of the variable might change, the scope might not be enforced, and in non-strict languages you might have a variable go out of scope and then, by assigning to the same name elsewhere, silently create a brand new variable with a blank value, if you don't follow your scopes.

Therefore I can justify my usage and enforcement of this coding standard.

What I can't stand however is when someone listens to my explaining this, they read my coding standards document, they even go as far as having me reject their code during peer review for these reasons, and then they dismiss my comment with the "it's just Hungarian Notation"... Scope is not type, and type does not define scope, don't be fooled!

Friday, 31 March 2017

Linux Server Admin : Bash Kill Processes By Common Name

On my Linux server I've recently wanted to go through and kill a bunch of application instances in one go, this is a server where students have been connecting and running various programs under python, therefore I want to remove from my processes anything called "python".

We can see these in our bash shell with the command:

sudo ps -aux | grep python

To remove all these programs I create the following bash shell script:

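#!/bin/bash
# Count and kill every process whose name matches "python"
k=0
# pgrep prints one matching PID per line, so $i is always a PID
for i in $(pgrep python)
do
  k=`expr $k + 1`
  kill -9 $i
done
logger -s "Closed $k Python Instances"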

Notice k=`exp... this is NOT a single quote (apostrophe), it is the backtick (grave accent); on a UK English keyboard this is the key to the left of the number 1.  It is used to substitute the output of the command into place, so the value counted in k becomes the result of the expression "$k + 1", i.e. k+1.  More about Command Substitution in Bash here.

The call to logger -s places the message both on screen and in syslog for me to review later.

This simply loops through all the matching applications resident and kills them off, I've saved this as a "sh" file, added executable rights with "sudo chmod +x ./killpythons.sh" and I created this to run as a cron job every day at 3am (a pretty safe time, unless I have some students burning the candle at both ends).

That's everything about the bash script, for those of you wondering about the students, they're those folks following my learning examples from my book, which you can buy here.


Thursday, 30 March 2017

People : Some Life Advice (Reading)

My grandfather was quite a serious fellow, an absolutely lovely fellow, he was a (or even "the") chief quarter master for the Royal Mail in Nottinghamshire before he retired.  However, a mere five years after he retired, he had a diagnosis of lung cancer and was dead shortly thereafter.  The first major figure in my life to pass away.

I was eighteen, had just started university, and being honest with myself it affected me deeply, personally, mentally and spiritually.  Personally as my mother then assumed she was the helmsman of the family, never have I seen the monkey leading the organ grinder more ineffectually.  Then mentally, he was gone, the one great intellectual figure in my life was gone.

His intellectual influence is deeper on me today than perhaps I ever realised, I remember when I was around three or four he showed me how to write the figure 8.  I distinctly remember his being behind me, his affirmative arms either side of me as he intoned an 8 before me, and I copied.  There was always paper to doodle on and pens in a drawer, they were the main "play thing" of a wet or dank day, of which there are many in Britain.

He taught me to pronounce things, and to this day I can slap on a lovely accented English, very polite, which he taught me; and which my wife adores when I use it within a telephone conversation.  It empowers me to escape my strong, rough, Nottingham accent whenever I wish, both are part of me, one through nature the other nurture.

I also remember his buying books, or handing me books, the first I remember were the complete Encyclopedia Britannica, a lovely red leather bound set; which must have cost a fortune in the early 1980s; I'd sit and pore over these pages for hours, years later I remember some kid at school going on about French writers, I instantly named Victor Hugo... Thanks to those hours spent with my nose in a book.  Kids today can look things up instantly with the internet, but watch out for the kid who avidly reads anything, they might just be expressing their interests early.

Another day he handed me a huge book, it was a cheat book, with every answer to most all the crossword questions of the day, he was teaching me a lesson... You could sit and think about the intricate layout of the crossword, how things meet, depart and conjoin.  However, you could also just get the bloody answer.  This has become a massive power for me, I feel enabled, even if I don't know anything about a topic, or it's currently not on my mental radar, it is but a quick read away.

I know this, I'm sure reading this, you know it too.  However, how many people out there look at someone judge them and then think they can't do the same as them?  I'd suggest a lot of them do... When really they might just need to read the right book, or take the right advice, don't close yourself off from these people, embrace them, help them, let them help you too.

Wednesday, 29 March 2017

Development : People Error 404 (Scrum)

I talked last week about my implementing three new cards for my Scrum meetings, I'm thinking about another card... a "Person 404 Error"... When I find an empty chair, post scrum, when I expect people to be working feverishly or at least planning their tasks for the day, but instead I find them drinking a coffee in a corridor, eating an apple staring at a wall, or just bemoaning they've been up since 8, I am going to assign them a 404 card.

My reason being, the developers I'm talking about, do not have a fixed start time, they can start anytime they like, so long as they're at the scrum, the scrum is 10 to 15 minutes, and then they have to get on with their task.  Coffee collecting, and apple eating, and especially bemoaning, is not allowed after the scrum for at least an hour, results, progress and some personal planning are required of them.

So, if you're a developer who wants to do the scrum then vanish... 404 you're not found... I'll be watching... And waiting.... And I'll make sure your 404 card is in your least favourite colour!...

That means Pink for you SB!... PINK!.... You have been warned.