It’s become a running joke with my partner that when we talk about our day my response is often “I wrote some code”, which is usually the case, though some days I don’t write much code at all. Although my job title is something like “Senior Developer”, the role could be described as “getting stuff done to make it easier for other people to get stuff done”, “automating a lot of processes”, “optimisation and simplification of code and architecture”, “grunt work”, “low level debugging”, “mentoring”, “being a sounding board”, “adding test coverage”, “devops”, “DBA”, and of course “writing code”.
If your company gets big enough you need people to do this stuff. Someone to remove code, simplify processes, sanity check ideas and changes, put things back together that shouldn’t have been taken apart, pay off some of the technical debt to stop it getting out of control.
Software development, at the low level, is just moving data around. All the layers of abstraction we add on top and around it obfuscate that. If you understand the layers, and can get through them when required, you can better fix the problems. I guess amongst all the aforementioned things, “being able to get through the layers” is another part of the job description.
So here is how my week looked as a senior developer. I have added copious footnotes to try to explain some of the terminology. Note that this largely relates to work time, but I ended up adding other times during the week when I interacted with software/hardware that required some debugging. Also note that I’ve obfuscated/redacted some information.
Wednesday
Today started with some testing of recent changes I’d made to fix double encoding of non-ASCII characters1 in messages we were sending out using AWS’ SNS service2. Partly this was down to the rather odd format required (stringified JSON wrapped in JSON?3) but also the usual case of “character encodings go through several layers and one or more of them does the wrong thing, so go and find which layer is doing the wrong thing and fix it. Iterate until you have fixed all the layers, or fix it such that the layer you can’t fix ends up being fixed by association”.
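To make the “stringified JSON wrapped in JSON” point concrete, here’s a minimal sketch of the shape involved - not our actual code, and the payload keys are invented - where the inner document is built as a character string and the whole thing is encoded to UTF-8 exactly once:

use strict;
use warnings;
use utf8;
use Cpanel::JSON::XS;

my $text = "J’ai mangé ton dîner";

my $chars = Cpanel::JSON::XS->new;         # inner document: characters, no UTF-8 encoding yet
my $bytes = Cpanel::JSON::XS->new->utf8;   # outer document: encode to UTF-8 exactly once

my $inner = $chars->encode( { alert => $text } );
my $outer = $bytes->encode( { default => $text, APNS => $inner } );

print $outer, "\n";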
Such is the nature of these kinds of bugs that it appeared we weren’t quite doing the right thing, but upon testing from end to end it worked. So it was merged and deployed. All of this took about an hour or so.
Next I looked into why one of our production servers was crashing under very specific circumstances. This was related to a caching issue4, and we had recently modified our cache to allow larger values (objects, essentially) to be stored, so that was a potential avenue to explore. The issue could be easily replicated on our DEV environment so I decided to start with the low level approach5 - as the server was crashing with a SEGV I assumed it would be relatively easy to get the core dump6 and see exactly what was going on.
Sure enough, after having located the core dump, I could run it through gdb7 to get a back trace:
Core was generated by `...
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f1dbf4e911e in XS_Cpanel__JSON__XS_DESTROY () from .../auto/Cpanel/JSON/XS/XS.so
(gdb) bt
#0 0x00007f1dbf4e911e in XS_Cpanel__JSON__XS_DESTROY () from .../auto/Cpanel/JSON/XS/XS.so
#1 0x00000000004c6d12 in Perl_pp_entersub ()
#2 0x000000000043f875 in Perl_call_sv ()
#3 0x00000000004cb652 in S_curse ()
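For completeness, the mechanics of getting that far are roughly as follows (the binary and core file paths here are placeholders, not the real ones):

ulimit -c unlimited                       # make sure core dumps get written at all
ls -lt /var/tmp/cores/                    # find the newest core file (location varies by setup)
gdb -q /usr/bin/perl /var/tmp/cores/core.12345
(gdb) bt                                  # back trace, as above
(gdb) bt full                             # the same, plus local variables where available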
So this was a problem in some low(er) level C code, something that I could look at but would take an awful lot of time to fix and would most likely introduce other issues. Since this was an issue that needed fixing now I tried some debugging within the code, starting with adding some log statements. Adding the logging made the issue go away, so this was a heisenbug8. Action at a distance.
Since the logging was actually useful stuff, and it fixed the problem, I left it in as the “fix” and documented that in the commit message. Alas, that action at a distance caused other action at a distance and was picked up by the CI server9. I took another approach, and ended up turning the encoder attribute (the thing causing the SEGV) into a plain variable10, which again had enough action at a distance to fix the issue and not break other things. This was all documented in the commit message, and the investigation and fixing took in the region of three hours. To fully fix the issue could be a matter of days or weeks, and needs must so the quick fix won11.
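The shape of that change, with invented names (a sketch rather than the code that shipped), was roughly:

package Our::Cache;   # invented name
use strict;
use warnings;
use Moo;
use Cpanel::JSON::XS;

# before: the encoder was an attribute, so it was destroyed along with the
# object - and that destruction, via Cpanel::JSON::XS's XS code, is where
# the SEGV showed up in the back trace
# has _encoder => (
#     is      => 'ro',
#     lazy    => 1,
#     default => sub { Cpanel::JSON::XS->new->utf8 },
# );

# after: a plain variable scoped to the call, so object destruction has
# nothing interesting left to do
sub thaw {
    my ( $self, $raw ) = @_;
    my $encoder = Cpanel::JSON::XS->new->utf8;
    return $encoder->decode( $raw );
}

1;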
The final big task of the day was updating our IP lookup data12. We use IP2Location for this, so we have to update our DB files every few months (ideally once a month). I have scripts in place to automate most of this, but this time we were moving to a more comprehensive DB format, which would give us extra information like latitude and longitude. So there was the additional task of tweaking the code using the DB to return that information, updating tests, and testing.
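A rough sketch of what “return that information” means with the lite library - the DB file path here is a placeholder, and this assumes the usual Geo::IP2Location method names:

use strict;
use warnings;
use Geo::IP2Location::Lite;

# a DB format that includes latitude/longitude - the path is a placeholder
my $ip2loc = Geo::IP2Location::Lite->open( "/var/data/IP2LOCATION-DB5.BIN" );

my $ip = "8.8.8.8";
printf "%s => %s, %s (%s, %s)\n",
    $ip,
    $ip2loc->get_city( $ip ),
    $ip2loc->get_country_short( $ip ),
    $ip2loc->get_latitude( $ip ),
    $ip2loc->get_longitude( $ip );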
The additional task, which takes the most time here, is updating our base production images13 to contain the DB files. Since the DB files are several hundred MB in size, we don’t deploy them (this speeds up the deployment process) - the code assumes they’re already available on the server. The reason for this is to do with the auto-scaling nature of our production environment14, and having to sync a large file as part of the spin up of new instances slows down the scaling. It can also add extra cost due to (cumulatively) transferring hundreds of GB of data, which isn’t necessary if you just stick the file on the base server image (the AMI).
I should probably look into this again, to see if it’s still necessary, and automate more if I can. I may just be able to put the file on the network share15, which would not require new AMIs. During the process of these updates I realised the library I maintain to parse these DB files was a bit behind the official one16, so I updated and released that as well. All of this took about two or three hours with some interruptions in between.
In between all of the above I managed to review and merge a couple of pull requests on some open source libraries I maintain, and released those.
Commit summary17:
=== geo-ip2location-lite ===
* d65782c - (HEAD -> master, tag: v0.12) bring up to date with recent Geo::IP2Location
changes (18 hours ago) <Lee Johnson>
=== xxxxxxxxx_www ===
* 4bb073ac4 - (lee/update_and_upgrade_ip2location) update and upgrade ip lookup data
(19 hours ago) <Lee Johnson>
* 9350a3bcf - (lee/fix_post_cache_get_crashes) fix SEGV in cache Cpanel::JSON::XS
destruction (20 hours ago) <Lee Johnson>
=== yyyyyyyy_www ===
* ad45b19bb - (HEAD -> lee/update_ip_lookup_data) update IP2Location data (18 hours ago)
<Lee Johnson>
Addendum: I spent almost an hour in the evening trying to get our TV to work with Netflix. The TV could get a network connection but was unable to resolve the Netflix servers. The same thing was happening for YouTube. The attempts to fix it included “turning it off and on again” (several times). Restoring it to factory defaults and reconfiguring the network setup. Rebooting the router. Trying to update the software over the network. No luck. Tomorrow I may try again but use the USB port to update the software.
Because the TV is both figuratively and literally a black box it was impossible for me to see what the root cause of the network issue was. The debugging information is minimal to empty, and the error messages meaningless. The Netflix app itself had slightly better information, which revealed it might be something in DNS18 but that made little sense since my other devices (using the same router) could resolve fine.
Thursday
This morning was spent getting something done that we’ve been meaning to do for a while: enabling websockets19 in one of our apps. The ducks we needed to get in a row for testing this were:
- getting a non self-signed SSL certificate for our DEV environments20 and having that installed as part of the creation of developer VMs
- tweaking our nginx/apache21 configs to allow proxying of requests to wss:// (see the sketch after this list)
- testing this against our staging server, which is closer to our production environment22
- deploying this stuff to production and then figuring out how to make sure our load balancers didn’t cause problems for the websockets23
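For reference, the apache side of the proxying change looks roughly like this - a trimmed sketch with invented ports and hostnames rather than our actual config, assuming SSL is terminated in front of apache (the nginx -> apache -> app setup described in the footnotes):

# requires mod_proxy, mod_proxy_http and mod_proxy_wstunnel:
#   a2enmod proxy proxy_http proxy_wstunnel

<VirtualHost *:80>
    ServerName app.dev.example.com

    # websocket traffic first - ProxyPass rules are matched in order
    ProxyPass        /ws  ws://127.0.0.1:3000/ws

    # everything else goes to the app as normal HTTP
    ProxyPass        /    http://127.0.0.1:3000/
    ProxyPassReverse /    http://127.0.0.1:3000/
</VirtualHost>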
This took most of the morning. It’s actually mostly trivial stuff, but the approach is that it should all be automated (creation of the certs) and the configuration files updated in the development, staging, and production repositories24, so that it’s done once and usable by the rest of the development team without having to repeat any setup.
Given the aforementioned “ducks in a row” I spent the afternoon looking at some higher priority bugs from our other app. This turned into just looking at one bug, which is probably best explained via the commit message25 (some information redacted):
paginate the tenant balances report
page: /foo/bar/XXX
this is a little complicated - the current queries use a group by clause
in different ways and the way this works is we get all tenants for the
agency in question (so 3,940 for agency XXXX) then filter out in the
nested foreach loop - so we *always* do 3950 queries when this page
loads even if we don't display them all
we can't add pagination to the initial query due to the way the nested
query will skip certain tenants, so instead if *and only if* this is the
html version of the report we have to splice the array returned by the
first query to get the correct "page" of results for display
this is very much shoehorned in, and in fact it does change the display
such that we will now show tenants for those deleted properties where the
tenant balance is zero - it's questionable whether or not this is a
regression given the page seems to show a load of redundant / useless
information anyway: when you load the tenant list for agency XXXX you see
422 records, the tenant balances page shows 3,052 rows of which many have
a balance of 0 (why not just skip them?). now we show 3,940
this seems to be the most reasonable trade off between having this page
take forever to load in the interface (and hammer the db with several
thousand queries) or have it load in a few seconds and show a bit more
information. perhaps we can add some filtering as a next step to skip
zero balances and/or inactive/archived tenants?
To explain more - this is some very old, very overloaded, very complicated code lacking any automated test coverage, which started as a report in the early life of the platform. It has reached the point, after many years, that its original implementation doesn’t scale well but the time to rewrite it isn’t worth the investment26.
So the least invasive thing to do for the time being was to add some pagination27, which couldn’t be done in the queries due to their disparate and complex nature so instead was done in the code. Initially I thought this might be some query optimisation work, and indeed one of the queries was still using temporary28, but I couldn’t factor that out without reducing the number of rows returned.
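In code terms the fix amounts to something like the following - a self-contained sketch with invented names, the real change lives inside the existing report code:

use strict;
use warnings;

# placeholders standing in for the request parameters and the first query
my $params  = { page => 2, format => 'html' };
my @tenants = map { { id => $_, name => "tenant $_" } } 1 .. 3940;

my $page     = $params->{page} // 1;
my $per_page = 100;

# only the html version gets spliced down to the requested page - the nested
# per-tenant balance queries then run for at most $per_page tenants
if ( $params->{format} eq 'html' ) {
    my $offset = ( $page - 1 ) * $per_page;
    @tenants = splice( @tenants, $offset, $per_page );
}

printf "showing %d tenants for page %d\n", scalar @tenants, $page;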
I’ve already made a few tweaks in other parts of this code over the years (as I said, it’s very overloaded) and it feels like it is reaching the point of throwing it away and starting again. But, again, needs must and there are other higher priorities.
In between this there were a couple of queries and minor interruptions. The most interesting one was a colleague wanting me to demonstrate squashing their commits down in a branch that I had reviewed29 - this was a branch to add some feedback to users when they put incorrect data in forms, but the work had been done as a background task over a number of weeks due to other higher priority changes.
The result was that the changes were, conceptually, quite simple but had ended up as many, many small commits. In the review I suggested rebasing the commits30 to combine the related changes, since really it only needed to be three or four commits. The problem we had was that, reasonably enough, my colleague had been keeping their branch up to date with the main development line31 and when running the interactive rebase we would get all those commits pulled in from the main line rather than only the ones we wanted to squash down.
I didn’t want to risk conflicts and/or lost work so suggested instead that they branch off of the main line again and then cherry-pick32 their commits into the new branch then rebase that new branch. I’m still not sure of how to handle rebasing a branch that has been kept up to date with the main line, and I recall doing this myself successfully several times in the past but couldn’t see it working here. Any suggestions are welcome33.
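For the record, the workaround boils down to something like this (branch names invented):

# start a clean branch from the tip of the main development line
git checkout master && git pull
git checkout -b albert/form_feedback_v2

# replay only their commits, skipping all the "keep it up to date" merges
git cherry-pick <sha> <sha> <sha>

# the interactive rebase now only contains the commits we care about
git rebase -i master

git push -u origin albert/form_feedback_v2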
Commit summary:
=== xxxxxxxxx_config_no_vm_sync ===
* be6395f - (HEAD -> new_staging/master) add websocket proxying in apache
config (23 hours ago) <Lee Johnson>
=== deployment-scripts ===
* 115af15 - (HEAD -> master, origin/master) albert/websockets_live_widgets ->
staging3 (23 hours ago) <Lee Johnson>
=== yyyyyyyy_www ===
* 85de97e3a - (HEAD -> lee/paginate_tenant_balances_report) paginate the tenant
balances report (16 hours ago) <Lee Johnson>
Addendum: I popped into the gallery after work as Sophie’s computer had not been able to send/receive any email through her gallery account for a couple of weeks. After some debugging the problem seemed related to Windows/Outlook and not the remote end or settings - the same account could be accessed and send/receive email fine from her iPhone34, and no details had been changed. A ping of the remote server35 from the Windows command prompt showed the network was fine, so the reason for it failing to send/receive was unknown. No errors, just a never ending spinner. More black boxes. I suggested Sophie update the operating system and try again.
I didn’t look into the issue with the TV tonight, and instead we plugged the Apple TV box into the TV and watched Netflix fine through that - another case of one device being able to speak to the remote end while the other couldn’t.
Friday
Friday started with some more testing of the websocket changes from yesterday. This was deploying the changes to our staging environment, where they looked good, and then to our production environment36. We realised it wasn’t going to work in our production setup, because of the load balanced and auto-scaling nature of the system, so we ended up reverting the changes37. This was a couple of hours of work, which involved some research and confirming our suspicions about the architecture issues.
I also spent some time responding to feedback (making tweaks) on a review of the pagination changes from yesterday. There wasn’t much to do here, and the main issue was an off-by-one bug38 under certain circumstances (a good catch!).
Today we installed the new AMIs (related to the IP lookup data from Wednesday), so we cycled the production instances. I kept one eye on this while doing other small things. While running some tests for those other things I noticed that new versions of VirtualBox and Vagrant were available39, so since I didn’t have anything pressing I took some time to update my local versions and test them with our development build scripts. The development build is fundamental, and I take the approach of rebuilding my development environment quite frequently40.
I want a developer to be able to go from checking out the necessary repository to having a working development environment in a matter of minutes. Although this is generally the case, trivial little updates to the various tools involved can interrupt this41. Since we use the same stack through all our apps (consistency is a good thing) I started afresh and rebuilt the base image, built a VM, and then ran the entire test suite. All good, and took about 2 hours of work/checking.
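The “matter of minutes” goal, from the developer’s side, is meant to look something like this (repository name invented, and the test command assumes our Perl stack):

git clone git@github.com:example/xxxxxxxxx_www.git
cd xxxxxxxxx_www
vagrant up                        # build and provision the VM from the base image
vagrant ssh -c "prove -lr t/"     # run the test suite inside the VM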
At the end of the day I updated my work Mac to v10.15.3 of the operating system. This was somewhat overdue, and what pushed me over the edge was the machine running slowly enough to become a frustration. This took about an hour.
Commit summary:
=== xxxxxxxxx_config_no_vm_sync ===
* e85fa24 - (HEAD -> production/mojolicious) revert websocket changes
(4 days ago) <Lee Johnson>
* e4e3801 - use non-secure websockets (4 days ago) <Lee Johnson>
* fa554db - enable the proxy_wstunnel apache mod (4 days ago) <Lee Johnson>
* 6d76a7c - (production/mojolicious_websocket_support) add websocket support
to the apache config (4 days ago) <Lee Johnson>
* 3c5e565 - (new_staging/master) remove HTTP Auth from staging3
(4 days ago) <Lee Johnson>
=== xxxxxxxxx_config ===
* 36423e8 - (refs/stash) WIP on dev/albert/websockets: da6d2e0 Merge branch
'dev/master' into dev/albert/websockets (4 days ago) <Lee Johnson>
* 689293a - index on dev/albert/websockets: da6d2e0 Merge branch
'dev/master' into dev/albert/websockets (4 days ago) <Lee Johnson>
=== xxxxxxxxx_www ===
* ce536284e - (lee/revert_websocket_changes) revert websocket changes
(4 days ago) <Lee Johnson>
* 14b666a0e - (tag: deployment/2020/02/07/10/10/57) Merge branch
'albert/websockets_live_widgets' into 'master' (4 days ago) <Lee Johnson>
=== yyyyyyyy_www ===
* cb1bb1e67 - (HEAD -> lee/paginate_tenant_balances_report) paginate the
tenant balances report (4 days ago) <Lee Johnson>
Addendum: There was a common pattern to today’s work - updating stuff. All the time. Sometimes this is what the job can feel like, and as the number of dependencies goes up the time spent updating these dependencies seems to compound.
Saturday
Not a work day, but I was in the gallery all day, and that involved some software wrangling such as posting to Instagram and Facebook, sending out a mailer with Mailchimp, and updating the website with details of the next event.
The website is built using github pages + Jekyll42, which makes factoring out common stuff, putting snippets together, and pushing out changes nice and easy: I edit a plain text markdown document then commit the changes and push out to github. The site is then up to date in about a minute. I can easily preview the changes by running jekyll locally.
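The whole publishing loop is more or less this (the post file name is invented):

bundle exec jekyll serve          # preview locally at http://localhost:4000
git add _posts/2020-02-15-next-event.md
git commit -m "add details of the next event"
git push                          # github pages rebuilds the site in about a minute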
The only thing lacking is localisation, since the site needs to be in both French and English - I checked a couple of years ago and there wasn’t a simple (as in: as simple as my current setup is) way to do this. I may check again soon.
Sunday
Not a work day, but there was some interaction with broken software - we booked a table at a place in Lausanne, through Tripadvisor, which resulted in a confirmation SMS being sent. That told us to install an app to see the details. So I installed the app, registered (making sure to use the same details as were linked to the reservation) and: nothing. The app failed to link up the parts so I couldn’t see the confirmation for the reservation.
The place was only a five minute walk from our hotel so we just wandered along anyway, and all was fine. They had received the reservation no problem.
Monday
I took Monday off, as we went to see live music in Lausanne the night before and I didn’t want to get back late then have an early morning. Plus I had other stuff to do in Lausanne. Also today I didn’t interact with any broken software, other than playing some A Link to the Past Randomizer, which is intentionally broken to make it more interesting43.
Tuesday
Today was half/half DBA44 and development work. I caught up a little bit on emails, as it doesn’t take much to get a backlog. These emails related to reports on batch scripts45 - we have a daily report that shows us anything that might need looking into. This is far more automated than it used to be, and the level of noise has been reduced by several orders of magnitude after I recognised a typical anti-pattern in reporting46.
Other emails related to our database backup process - we do get automated backups from AWS but we don’t just rely on that and also make “off site” backups47. This process is automated and part of it makes a small extract, which we primarily use in our development environment. If the database structure changes you need to also change the extract script so it knows what to do - if you fail to update the extract script it will send you an email to complain that you didn’t update it (but will still complete as much as it can).
Another DBA type thing was reviewing a change removing 44 tables (including benchmarking some schema modifications against a representative data set). You read that correctly: forty-four tables. This is the nature of one of our apps: it’s grown organically and there is still a lot of cruft and dead code that can be removed. I have been taking a “slowly slowly” approach. The change looked good, and the tables were actually empty, so there was minimal risk involved in removing them (there were no longer any references in the code or elsewhere).
Finally, after lunch, I started writing some code. At least, writing a not insubstantial amount of code. This was one of our long term projects to convert a legacy app into a more modern one48. We are fortunate in that we can do this piece by piece, taking one page at a time and converting it. So I started looking at the current implementation, references to it, requirements, and sketched out (in my head) how this would work for the new implementation.
Since we have done this for other pages before, we have an established process in place. And indeed, since the new system is better factored some of the pieces are already in place. It’s likely I will spend the rest of the week completing this task, “writing some code”, amongst the various other tasks that interrupt me.
Commit summary:
=== yyyyyyyy_www ===
* c616fb067 - (HEAD -> lee/c_tenant_update_to_mojo) backend tweaks for
tenant create/update (2 minutes ago) <Lee Johnson>
* d4959cd9c - (lee/paginate_tenant_balances_report) paginate the tenant
balances report (8 hours ago) <Lee Johnson>
=== backups ===
* 9f700d4 - (HEAD -> database_backups) bump yyyyyyyy schema version
(6 hours ago) <Lee Johnson>
-
ASCII. In this case my test string was “J’ai mangé ton dîner”. ↩
-
AWS is Amazon Web Services, which runs most of our live infrastructure out on the web. SNS is one of the services that makes it easy to send notifications to users of our mobile application. ↩
-
JSON is a serialisation format designed to make interprocess communication easy, which looks something { "like" : "this" }. The service in question wraps that in more JSON so we need to escape the reserved characters for the inner string: { "something" : "{ \"like\" : \"this\" }" }, which I find quite annoying. ↩
-
“There are only two hard things in Computer Science: cache invalidation and naming things.” – Phil Karlton ↩
-
Normally I would start at a higher level, possibly try a bisect, or sprinkle some debug statements here and there - but this the process crashing, which meant a core dump, which meant I could quite quickly find the likely culprit. ↩
-
Core dump - a file that contains the state of the program and its memory contents when the program crashed. Extremely useful for doing a post-mortem. ↩
-
gdb is a debugger that we can examine said core dump with. ↩
-
A bug that shows a behaviour change, including fixing itself, when you attempt to debug it. ↩
-
A Continuous Integration server is the thing that runs all (or most) of the tests against your changes whenever you push them out to the remote repository. It means you can find things that you might have broken, even though you assumed that your changes wouldn’t. This all assumes your test coverage is good enough. ↩
-
Essentially this went from being a “strongly typed” attribute on an object to a dynamically typed variable (something you can do in Perl). ↩
-
The behaviour was happening in a third party module, so even if the fix was quick there it would need to go upstream and be fully regression tested before it would be applied. Such is the nature of Perl’s testing ecosystem that we comprehensively test changes against many different operating systems, distributions, versions, etc - and since this was in the lower level C code it would need more testing. ↩
-
Converting an IP address into a geographical address (as near as can be estimated). ↩
-
Servers in our production environment appear as load dictates, sometimes we have a few, sometimes many. So we need to start them up from a base “template” image - when we make some updates we need to update that template. ↩
-
Some of the code is synced across when new servers appear, but if we need to sync large files then that can slow down the time from a new server starting to it being available to serve. So we try to avoid sync of large files. ↩
-
Some files need to be shared across servers, so these are available from a “common” disk. We may or may not eventually move this to S3. ↩
-
I forked the library a few years ago as it had fallen out of maintenance (such is the nature of Perl support these days). They started updating it again recently. ↩
-
I have a shortcut git timesheet that will show me my last 24hr (or four days, or one week) of commits. Handy for the daily standup. ↩
-
DNS is the thing that resolves a hostname (www.google.com) to an IP address so we can actually communicate with the server. When DNS fails, or gives wrong/outdated information, then we end up trying to speak to the wrong thing (or nothing). ↩
-
Websockets are something that allow a channel to be kept open between your browser and the server, so updates can appear on a page without having to have a refresh. ↩
-
Up until now we haven’t been too bothered about this, but best practices and actually wanting a more representative development environment means this point is a good one to actually get on and do this. ↩
-
nginx/apache are the kind of “front door” applications that direct traffic to the right place. They can serve some assets but send other requests to the downstream application running on the box/elsewhere. We use nginx in development to emulate the load balancers that we use in production. So nginx -> apache -> our app ↩
-
The staging setup runs on AWS so has a load balancer, web server, db server, meaning it’s much closer to how the production servers are configured/setup. ↩
-
Historically websockets could be problematic in load balancing as they need to be kept open and pointed at the same server. If your load balancer is doing the right thing, distributing load, then that can interfere with keeping a connection pinned to one server. At least, that is my understanding. ↩
-
We store as much as possible in git repositories - not just code, but configuration files, deployment scripts, infrastructure setup, development environment setup and bootstrapping. Store everything in repositories, there should be very few exceptions. ↩
-
I often write comprehensive commit messages. I’ve blogged about this before. ↩
-
This is a report used by a very small number of users, and it would take a few days (if not longer) to rewrite into something more maintainable. But really we would prefer to just throw it away. Code is a liability. ↩
-
Pagination - instead of showing n thousand results, show twenty results and then allow the user to view the first/next/previous/last page of results or filter down more. ↩
-
A temporary table means the db server has to put the results from the first bit of the query somewhere and then filter them, rather than just filtering on the results from the first bit directly. If you see a query with this in its explain that’s usually a good candidate for optimisation, but the nature of this query meant I couldn’t factor it out. ↩
-
We do code review. ↩
-
Rebasing commits could be translated as “tidy up, combine, and simplify the record of the changes” ↩
-
Keeping your work up to date with the main development line is good over a medium to long term change as you avoid problems when the time arrives to put your work back into the main line. ↩
-
Basically replaying their changes but without the “keep it up to date” bits, since the new branching point would mean it is already up to date. ↩
-
For some reason I thought git rebase took more flags than it actually does. ↩
-
Apple’s black boxes tend to work better than other black boxes. ↩
-
ping is a very basic “can I actually reach this server over the network, and will it respond to me”. ↩
Don’t deploy on a Friday? We handle that on a case-by-case basis and this change was non-impacting so good to roll out and “test” in production. ↩
-
Some information here. On reverting the change - although it was non-impacting we prefer that code/config that is not used not be present. Code is a liability. ↩
-
Off by one bugs are common, and easy to make, as we count from zero when programming (for the vast majority of stuff we do). ↩
-
The tools that we build our development environments (servers) with. ↩
-
At least once a week, often once a day. ↩
-
When I first arrived the development build was mostly a manual process: creating a VM and then running through a script to install dependencies and get the app up and running. If you were reasonably well versed in Linux and the command line (and the various dependencies should anything diverge from the script) this would take about one day of effort. All this stuff is easy to automate, so that’s one of the first things I did and now it takes about 5 minutes. ↩
-
I am considering giving a technical talk about this sometime in the near future. ↩
-
DBA: DataBase Administrator. We are (maybe still) hiring for this role. Get in touch ↩
-
Automated batch processes that take a while to complete, a lot of which run overnight given the nature of the payment industry. ↩
-
“e-mail all the things” is a massive anti-pattern. “e-mail one or two things” is manageable. ↩
-
“Off site” as in “out of the AWS infrastructure” ↩
-
An iterative rewrite of the application over several years done in small steps, with each change regression tested and deployed to production - after a while the old code is then removed. ↩