Tuesday, 5 May 2020

File Catalogue

I need to write myself a piece of software to catalogue all my files, mark and reduce duplication, and then store them onto my backup medium.  You see, I have a backup routine, but unfortunately since 2016 I've updated it three times... and changed the disks in one of them once, which means I have four copies of my backup.

Not bad, but also not storage efficient, and I want to reduce this to two: one on site and disconnected, one off site and disconnected, with a buffering server in my office for fast retrieval.

To do this, I first need to know what I have.

And I have a lot of files, at least 14TB in the office alone, split between the 500GB mirrored ZFS pool, the file server itself and my local drives, across which I reckon there are at least 3 or 4 copies of everything.

There was also an old machine which had 500GB of storage for family photos, but I've already migrated that to cloud storage.

So we're purely talking about my own files: project files, video clips, edits and lots of code.

Now, my original plan when putting the new PC together was to migrate a single copy to the dual 4TB drives I bought, and from there split it between the new local file server and one remote cold storage server.

However, I never got around to this. I started, but didn't finish... and in the ensuing half year I've of course used some of that 8TB of spare space for other things, further exacerbating the task ahead of me.

NEW plan therefore: write a cataloguing tool to parse everything, get me a checksum and hash of each file's contents so I can uniquely identify each file, and boil all that down into one copy on one of the 4TB drives... overspilling onto the other only if I absolutely have to.
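To make the idea concrete, here is a rough sketch of the cataloguing pass in Python. It is not the finished tool, just one way it could look: the function names (`hash_file`, `build_catalogue`, `find_duplicates`) are made up for illustration, and using SHA-256 plus the file size as the identity of a file is my assumption, nothing more settled than that.

```python
import hashlib
import os
from collections import defaultdict


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_catalogue(root):
    """Walk root and map (size, digest) to every path holding that content."""
    catalogue = defaultdict(list)
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                key = (os.path.getsize(path), hash_file(path))
            except OSError:
                # Unreadable file: leave it out of this sketch's catalogue.
                continue
            catalogue[key].append(path)
    return catalogue


def find_duplicates(catalogue):
    """Keep only the entries where the same content lives in several places."""
    return {key: sorted(paths) for key, paths in catalogue.items() if len(paths) > 1}
```

Running `find_duplicates(build_catalogue("/some/root"))` over each drive in turn would give me the list of contents stored more than once. With 14TB to get through, hashing everything will be slow, so a real version would probably group by size first and only hash files whose sizes collide.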

Then, spew this backup blob onto the external storage. I'm thinking of using the garage file servers I have and an Amazon AWS instance or some other cloud solution. That'll be a mammoth upload, but crucially I have time to identify the location over the next day or so whilst I write the cataloguing software.

The next step will be to keep only the active projects and use the file catalogue, even expanding it with metadata, to make the remote stuff easy to parse.  One tool I'm already thinking of adding is a C++ source tree parser, to give me the namespace, class/struct names and include file pattern for any C++ file it finds, and to build a DOT file graph of the header includes from a project so I can see what it is without retrieving the files themselves.
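The include graph part is the easiest bit to sketch, so here is a minimal version in Python. Everything named here (`scan_includes`, `to_dot`, the extension list) is hypothetical, and it only does a line-by-line regex match for `#include`, so it will happily pick up includes inside block comments or disabled `#if` branches. Namespace and class/struct names would need a proper parser and are not attempted.

```python
import os
import re

# Matches lines such as: #include <vector>   or   # include "foo/bar.h"
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]')
# Assumed set of extensions to treat as C/C++ sources and headers.
CPP_EXTENSIONS = {".h", ".hpp", ".hh", ".c", ".cc", ".cpp", ".cxx"}


def scan_includes(root):
    """Map each source file (path relative to root) to the headers it includes."""
    graph = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if os.path.splitext(name)[1].lower() not in CPP_EXTENSIONS:
                continue
            path = os.path.join(folder, name)
            relative = os.path.relpath(path, root).replace(os.sep, "/")
            includes = []
            with open(path, "r", encoding="utf-8", errors="replace") as handle:
                for line in handle:
                    match = INCLUDE_RE.match(line)
                    if match:
                        includes.append(match.group(1))
            graph[relative] = includes
    return graph


def to_dot(graph, name="includes"):
    """Render the include map as Graphviz DOT text, one edge per include."""
    lines = [f"digraph {name} {{"]
    for source in sorted(graph):
        lines.append(f'    "{source}";')
        for target in graph[source]:
            lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)
```

Writing the result of `to_dot(scan_includes(project_dir))` to a `.dot` file next to the catalogue entry would let me render the graph with Graphviz later, without pulling the project back from cold storage.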

This makes it a complicated task, and it's compounded by the sheer number of files and the diversity of projects and tech tinkering I've undertaken and generally left to languish as I've run out of time or simply moved on to other things.

This whole effort is to move my code bases towards completing tasks on my own terms: my own project, my own way, and in my own time, but finishing them. At the moment at work I'm not in that kind of control and I miss it, I miss being in control of the projects I'm working on... Hey ho, onwards and upwards and all that.
