Eventually Consistent

A highly sporadic technical blog about the things I find interesting and want to gradually store in my brain. About me

Managing Up

When I was a teen or tween, I mentioned to my mom that I never asked her for anything after 9pm or on Fridays. She was tired and would likely say no to giving me a ride, letting me go on a camping trip, whatever it was. My mom still refers to this moment occasionally as a lesson to her in managing up.

You’re probably aware that the impression you make on your boss affects your pay and your performance reviews. You may also have some expectation that they’ll help you improve or grow your career. But when your relationship with your manager feels tense, frustrating, or just flat and awkward, it’s not always obvious what to do. There are many great resources out there on identifying great managers^[1], understanding the your manager’s role^[2], getting the perspective of a manager^[3], or coaching for managers^[4]. I want to talk specifically “managing up”, meaning intentionally navigating the relationships with your reporting chain and anyone else with some authority over you.

It’s a sliding scale

Any relationship takes effort to maintain. When you first start out, that duty will typically mostly fall on your manager. A new grad usually has little experience with anything other than “grading”. Their tasks are well-defined and easily evaluated, and their manager’s job is to set clear expectations and help them ramp up. Over time, you will start to carry more of the responsibility for the relationship as your duties increase in size and ambiguity, your ambitions become more specific, and you potentially report to higher level people whose time is more scarce. Experienced people must show how their work is high quality and high impact, ask deftly for what they need, and identify opportunities and sponsors.

At any stage, having some savvy with your manager will serve you, even if it’s just to make sure they think you’re doing a good job.

Observe

The first step in any new skill to develop curiosity. Managers and senior colleagues are people, and they have individual quirks and values systems. They will not always work the same way or agree with each other.

Values

Think about the people around you. What do they value?

To find out, you can ask them directly. One manager was explicit about being “results driven”, and said that all other skills are useful inasmuch as they further results.

More accurately, you can look at how they tie break difficult decisions. One hard decision is how to spend time. Is their calendar all 1:1s? They might value relationships. Do they make time for article clubs or bringing in external speakers? That points to valuing learning. When the team disagrees, do they go with the decision supported by the majority, the most senior person, or their own technical instinct?

Be aware that stated values and actual values may not be aligned. Once I was told that my projects may not be promotable due to insufficient “technical complexity”, when “managing complexity” was the phrase written into the rubric. This left me nonplussed. What if I chose an elegant solution and reduced complexity, and then my project seemed too simple? To solve this problem, I pointed out that gap to my manager. Second, I really hammered home the technicality of my projects in my self review.

Now that you know, you can figure out how it fits with your values. One of my favorite things to do is figure out how to get a project I think is valuable onto the roadmap by pitching it in terms of something my manager (and usually, the organization at large) considers worthwhile. This can be as simple as rolling refactor costs, which can have nebulous business benefit, into time estimation for a product launch in a very feature-based org.

Last, your values and that of your manager or organization may be completely irreconcilable, at which point it’s time to GTFO.

Fear and aspirations

Sometimes, there is some kind of fear that dwarfs positive values. It’s not a great thing, but it happens. I had a manager who really did not want to be at a loss when asked technical questions by the bigwigs, which resulted in micro-management. In order to keep from getting interrupted, I had to find ways to satisfy that anxiety on my own schedule.

Other times the motives of your manager as an individual will affect things. They might be “resting and vesting” and not be up to fight for you unless you make it very easy for them. Or they might take pride helping you plan and advance your career, in which case take advantage.

Constraints

If you take a human with the same values and hopes and fears and put them into different situations, they will of course act differently. One of the most rewarding questions I ask managers is “what’s on your mind?”. It can explain a lot - they’re pressuring for project completion because the team is new and needs to show progress. There’s an impending reorg in down-low early stages, and that’s why they’re being hand-wavvy about decisions. They spent a lot of time on performance managing a teammate who just exited, and now that it’s resolved, they’ll have more bandwidth.

The question takes the focus off you and your performance, which can be nice, and it gives your manager some space to be heard, which everyone likes. It also has a lovely side effect of helping you figure out if you’d ever want their job and their problems.

As a case study, I specifically relish working on cross-team projects: they’re collaborative and often extremely high impact. The problem is that many organizations put pressure on managers to deliver within their team’s scope and mandate. If part of my work is going to another team, or filling in the cracks of the organization, it will likely be seen as higher cost and lower priority by my direct manager. In order to sell the project, I need to reframe to satisfy my manager’s values and constraints. Alternatively, higher-up managers have a broader view and mandate, and might be able to champion or set direction from above.

Working style

People receive information and requests better in certain modes, or in my mom’s case, certain times of day. I once had a skip level manager who loved charts. Charts are succinct, they’re visual, and this person was busy. If a chart happened to look bad at a glance when the actual content was more subtle, it would randomize the conversation. On the other hand, my direct manager made a small change to turn all of the neutral, grey colored parts of a chart to a light green and it noticeably improved the skip level’s outlook on our work.

People have talents and gaps, too. Some managers will be great at pulling strings, some will be great at laying out the projects to get you to promotion. All of them need feedback and practice to improve. And everyone has emotions, as much as they would like to leave them at home. Having empathy for your manager, without forgetting your own interests, will help you unlock tricky situations.

Align expectations

This is the table stakes of managing up. If you don’t meet expectations, you are likely to be performance managed (not fun), or frustrated at lack of career progression (slightly less but still not fun). First, you need to find out what good work means to your manager. You can ask questions like “what does success look like on this project?”. You can also use tools like a goal setting framework, a career rubric, or a 30-60-90 day plan. This alignent work needs to be redone regularly.

If their expectations seem unfeasible to you, it’s an opportunity to get extra support to bridge the gap, or else present an alternative that is tailored to satisfy their values/constraints/fears/working style etc. You will also need to show evidence that their current expectations are not feasible. If there’s an untenably bad mismatch between your expectations and theirs, it may be time to move on.

Showcase what’s hard about your work

An incident occurred about 2 years into my career:

A project was harder than I thought it would be. I holed myself and hid my struggle, demotivated, but not sure who or how to ask for help. I worked on the wrong things, went about it inefficiently, and I made little progress. Eventually, team leadership noticed, and I had some of the hardest conversations of my career to date.

I wasn’t fundamentally incompetent, but I was doing something unfamiliar and I wasn’t surfacing my challenges or getting the support I needed. Senior team members sometimes identify a potential project and assign it to someone new without thinking too hard. The new person assumes it’s worthwhile, assumes it should be easy, and then runs into a bunch of unexpected complexity. Or, they lack context to gut check the purpose of the project and waste time. If they stay silent, no one can course correct.

Even as a tenured team member, you can encounter many sneaky sources of difficulty: technical debt, organizational impediments, unreliable dependencies. If you explain the novelty of the problem and showcase your solution, people will ooh and aah, the organization will learn, and you’ll get credit. Sometimes they’ll find ways to pitch in and make your life easier. If you give into imposter syndrome and hide the difficulty, everyone loses.

As for how: performance reviews, standups, and one-on-ones are obvious ways to celebrate your challenges. Also, lightning talks and write-ups - things that take an hour or less to prepare (you don’t want to take away too much time from doing the work). Asking questions in public channels. Elevator chats. Twitter.

Nominate yourself for opportunity

This has been said a million times before, but it took me a while to absorb: if you don’t ask for things, no one knows that you want them.

It’s vulnerable and uncomfortable, but it works. I had a manager who tended to shrug off things I asked for when I first brought them up, and then come back a couple of weeks later with progress made. This happened when I gave evidence I was underpaid (lol), but also when I wanted to work on higher impact projects with more stakeholders. Among peers, I’ve vaguely mentioned career goals and then been forwarded related opportunities. Hardest of all, you have to raise your hand: propose that project, submit abstracts to that conference, apply to that job, set up a meeting with that new co-worker. You are a business of one, and no one else cares more than you about your success.

Seek psychological safety

There has been a bunch of jazzy headlines^[5] recently about creating psychological safety (the belief you won’t be punished when you make a mistake) on teams. It also applies to relationships. Your manager will always be balancing multiple interests besides your own, but feeling like you can show that you’re struggling or bring up conflict will make every other part of this post easier.

Psychological safety takes time. Ideally managers go about fostering trust proactively, but you will probably encounter less-than-stellar managers in your career, so the ball is also in your court. The ways in which I’ve sought out trust vary by what I’ve observed about my managee, and takes iteration.

Making their jobs easier helps. Create visibility into your work, deliver on what you promise, present solutions alongside problems. Managers are in a tough spot, since they are accountable for the team’s work but often don’t have bandwidth or skills to directly push it along. Maybe to start, you need to over-communicate and write down everything you do. As you get traction on trust, you can move towards lighter weight mechanisms for syncing their expectations and your work. Going forward I want to make a point of running my priorities past my manager each week.

Giving feedback helps. Start small and use COIN^[6] or other feedback best practices. If you acknowledge your manager’s strengths, and if they can be imperfect in front of you, they are more likely to reciprocate.

Being a full(ish) human around them helps. Talking about bands with one (relatively reserved) manager meant filling awkwardly quiet time during one-on-ones, and afterwards I noticed they brightened up when they saw me in the hall. “How was your weekend” chit-chat can be painful, but open ended question allow people to share at their comfort level and set them at ease.

And finally, literally seeking out individuals who you trust helps. If every interaction with your manager feels like fighting a dragon, it may be time to switch teams or jobs. If you work really well with someone, it’s reasonable to prioritize staying with or following them. It can be stultifying and cliquey to always operate in a bubble of the same comfortable people, but if someone sees your value and creates opportunity for you to grow, it’s a boon.

On sponsorship

When your boss trusts you, they can be your sponsor. A sponsor is someone who does some of this managing up for you. They talk up your accomplishments. They nominate you for opportunity.

A lot has been said about finding mentorship, and I’d argue it’s more clear-cut. A mentor is someone you can learn from, and companies and organizations set up mentorship programs as a part of learning development. All it takes to be mentored is to find someone whose skills and perspective your admire and ask them to chat. All it takes to be a mentor is to give someone your time and answer questions.

A sponsor is taking a chance on you. If they throw your name in for a project, and it goes poorly, they can lose clout. Therefore, a sponsor has to see your value. It helps to find people who see something of themselves in you. Glibly and problematically this could be demographic similarity, better it’s someone whose career experiences are mirror yours, best is someone who aligns on values or viewpoints.

I’ve noticed that people in roles who do not have representation in the higher level of management tend to flounder. It’s harder to show value as the first SRE in an org with no executive understanding of the purpose of reliability. It’s harder to get promoted when the managers who hold the keys have not tried doing your work and get why it’s hard (see above for some ways to combat this).

So how do you find one? All of the above. Notice the values and working style of those who might be able to push your career along. Show what is hard about your work. Communicate what opportunities you want. Foster trust through both vulnerability and excellence. And when you find someone, cherish it and prove them right.

–

Editing help from Chen Lin and Trucy Phan

Python Functional Programming Best Practices

I was asked to write a post for an external site! It’s available here.

t takes a while to develop “taste” in programming, to figure what you think works well and what doesn’t. Your taste and another excellent engineer’s taste might be different, and result in different tradeoffs, but I find it useful and interesting to write down what I think (so far!). This is how I think about Functional Programming in Python.

A Risk Assessment Analogy

I was reading a book recently where a character scuba dives while pregnant. The book is set in the past, so I decided to look up modern medical information on whether this is a good idea.

The answer is “No”. There is no medical reason why the answer is no. To the best of our knowledge, none of the biological systems required for gestation are significantly affected by scuba diving. But most people think about scuba diving as risky, and if they were to guess without doing research, they’d conclude… maybe it’s not a great idea to do while pregnant. So even though the medical establishment agrees that it should theoretically be fine, if something goes wrong, psychologically it is very destructive to look back and see something you could have done differently.

Why am I talking about this in an engineering blog? Because it’s an analogy to failure scenario generation, safety, and verification in any discipline. Many engineers have shipped a piece of code, having told themselves it’s fine to the best of their knowledge, only to see if fall over or emit terrible bugs in production. This is why we often push to a small percentage of real traffic, do bug-bashes and conduct pre-mortems where we imagine different types of failures and what would have caused them. We’re trying to smoke out the unknown unknowns that cause issues. It’s a type of thinking I am actively learning how to lean into. As an optimist, someone who tends to seek out nuance, and a person who has a strong bias towards action, I tend to chafe against risk-aversion without a clear threat model. The term “Cover Your Ass” gets thrown around to describe extreme end of this - wasteful carefulness.

I am now on a team that is business-critical, and one of the lowest layers of the engineering stack. We have many customers for the code we write, including ourselves. There are some kind of things we can break without any consequence - low-priority monitoring, for example. There are things that if they break, it will be obvious, and someone can step in and fix it - like an auto-remediation service that can replaced by hand-operations.

Then there failures that are very hard to reverse, are hard to detect, and that heavily affect other teams and end user - like breaches in our durability and consistency guarantees. If something is permanently deleted, there is no undoing. If an inconsistency is introduced, there may not be repercussions until much later, when system invariant check begin to fire or users notice. These are the areas where defaulting to risk-aversion is the right thing to do.

You can’t guarantee nothing will go wrong, of course. And a good post-mortem will take goodwill into account. But there’s a rule of thumb you can you use: if there’s a serious outage, how would you justify your actions to engineering leadership? If you look at the testing and rollout strategy you’ve planned, and think about ways in which the system might fail, and you can sincerely tell your boss’s boss’s boss that you’ve made reasonable tradeoffs even if something goes wrong, then you’re good to go.

It’s a balance, like all things. There are probably situations where there is an urgent reason to metaphorically scuba dive that is worth the unknown-unknown nature of the risk, even with high stakes. People’s intuitions and risk-friendliness also vary based on personality, and how they’ve seen things fail in the past. A lot of growing as an engineer is fine-tuning that initial response to design decisions. Now when I see code that has a lot of side effects, that relies on implicit state, or that is clever but not clear, it makes me uncomfortable, because I emotionally remember having to debug similar issues in the post. Sometimes the situation is different, and the scary things really are justifiable, or less scary. That discomfort is worth exploring with an open mind, rather than fighting the last war over and over. Going forward, I hope to expand my intuition for system design as well.

Versioning and Release Process

I just wrapped up a couple of years working on things in the general space “release engineering” “continuous integration” and “delivery/auto-update”, with a bit of “build scripting”.

I now irrevocably think about software with two things in mind:

versioning
release process

These two bullet points are very closely related but quite the same thing. Just like software architecture, they are easy to over-engineer if you add a lot of structure too early, or to under-invest in if you wait too long. Ad-hoc designs that were totally logical and understandable at the time often become unwieldy and impair productivity.

First, some definitions.

Versioning

Let’s take versioning to mean any way to point to a snapshot of code. The formats can vary widely. Every commit in git is a distinct version described by a hash and some number of “refs” (branches and tags). End user software is usually published with a version - at the time of this example I’m running Chrome 64.0.3282.167 and PyCharm 2017.2.4 Build # PY-127.4343.27, on a MacBook with MacOS Sierra Version 10.12.6 G1651212.

In addition, versions are often used to embed a pointer into another piece of software, specifying that it should be incorporated as a dependency. There are programming language and operating system specific package managers designed to help us keep track of our versions and do upgrades (npm, apt, pip, brew), and configuration management tools to deploy changes of those versions (chef, puppet, salt, ansible). Dependency upgrades often cause cascading work to maintain compatibility, even just for consumer software. I personally have spent an unreasonable amount of time trying to install the right version of Java or Silverlight (and maybe had to upgrade my browser) to view a webpage or video.

The version formats themselves encode information about release ordering and hierarchy, with any number of digits/letter, a date stamp, a unique unordered id or hash, or a catchy name. Numerals may have hidden information. At Dropbox we used to use even/odd middle digits to keep track of whether a Desktop Client build was experimental, and Ubuntu’s major number is the year of release, the minor is the quarter, and even major numbers imply “Long Term Support”.

Release Process

This is a convenient segue into release process. Ubuntu does not want to have to maintain backwards compatibility forever, so it explicitly publishes a contract on the life cycle of the code. We are not all on the same version of Chrome (if you use Chrome), depending on whether you booted your laptop recently, how successful the auto-updater is, and maybe even whether you did some manual intervention like accepting new permissions or performing a force upgrade. Most larger projects have some way to feed higher risk-tolerance users earlier updates, with internal versions and/or alpha/beta testing programs. All of these versions have some cadence, some threshold of testing or quality, and some way they handle “ship-stopper” bugs. Rollouts can go even beyond the binary version and have A/B testing via dynamic flags to change code pathways depending on the user.

There is a proliferation of dependency management tools because when one piece of software relies on another, they have to pay attention to new releases, evaluate whether it’s a breaking change, maybe patch the code, potentially introduce bugs (from dependency change, from the code change, or from unveiled assumption mismatch between the two¹), and then release… maybe causing similar effort for someone upstream. Not doing this work often causes a panic when end of support happens on a critical library.

What’s more, the same dependency might be used in more than one place in the same codebase or company. Can they be upgraded at the same time? Did someone assume they’d always be pinned together?² Do they have dependencies on each other, further expanding the decision/testing matrix?³

The catch

I’m more or less convinced that when people say they like the low overhead of small companies, they’re not just talking about polynomial communication cost or political hierarchy. When you don’t have very much code, you don’t have very many dependencies.

So, why can’t we all share a standard three digit version number with strict hierarchy, and take out some of the complexity? Sometimes versioning feels like continuously reinventing the wheel - I have had to work on several almost-the-same little libraries that parse and do common operations on version numbers.

The problem is that version numbers are a manifestation of the developers’ assumptions of the life cycle of the code. Developers may want to record the date of the release, or have an ordering hierarchy, or specify build flags, or mark a version as backwards incompatible. In some cases the ecosystem enforces some really strict rollout process. Python’s packages are all versioned in the way I described above, because pip provides a contract for the author and end user.⁴ You get a single snapshot in time, rolling monotonically forward on a single logical branch. End users can cherry pick and patch changes if they want to, but it’s up to them to manage that complexity, eat the consequences of using a new version of a package that is insufficiently battle tested, and upgrade if e.g. a security vulnerability is found.

This doesn’t work well when the consumer is not another software project. Users generally hate having to manage their own updates, and it’s your bad if they end up with a buggy or insecure product. For-profit software orgs also need to figure out what to build, trying experiments early and often. At Dropbox a couple of years ago, we realized that even/odd thing wasn’t going to cut it. We needed to move faster, and to be able to segment metrics by the alpha/beta type even with overlapping rollouts. So we redesigned the system to be a simple three numbers, a major version to indicate a distinct “release train” (ever-hardening logical branch) that would go out to all users, middle/minor version 1-4 to indicate build type, a final point version always incrementing as builds are created on each release train. We made the necessary refactors, thought carefully about the ordering consequences because Dropbox doesn’t support non-manual downgrades, and wrote up a post for our forums users. But we didn’t leave enough space in between the minor versions! When we wanted to indicate new types of thing, like to break out versioning of some subcomponents or introduce a different type of experimental build, we were stuck, and had to do the same headache all over again.

Lessons learned

The end lesson, as hinted above, is to approach versioning and release like architecture. Implement what is useful to you now, but think about assumptions and whether they could change. Design it in a way so that adding new things is possible, and ideally not too painful. Evolving is good, but thrashing on design just causes confusion and overhead for you and your collaborators.

Release processes, like testing, should serve some quantifiable good for your effort. Do you actually care if this thing breaks? Then have more and longer verification periods. Just throwing spaghetti on the wall? Not so much. Don’t put code in stage/alpha/etc for a shorter amount of time than you can actually find the bugs you care about, or you’re just doing pretend quality assurance. And if that time is too long, improve you telemetry so you can find bugs.

Regardless, have a really really good plan for last minute changes, including bugfixes for emergent issues. The biggest take away I got from writing this series of blog posts for Dropbox is that your product is only as good as the test pass/canary period since the last code change. You can and should design code and processes to default to turning things back to a known good state in an emergency. Hot fix after hot fix only works if you can push quickly and are tolerant of poor quality. It doesn’t work well if you need strong consistency, durability, or availability, or care about miffing users with a buggy interface. Quick feedback loops are key - as long as you consciously think about the price you willing to pay for that feedback, whenever it be in hardware costs, user loss, or developer time.

Fun bug I’ve heard about recently: a new hardware specification uncovered a bug in the Go compiler. Go figure ;)
I have seen a system like this. It was good for forcing everyone to be on the same old version forever :).
AKA “Migrapocalypse”.
Turns out this is a bit more flexible than I’d realized, including allowing a variety of pre and post fixes, thanks @Sumana Harihareswara! Check out PEP440 for more detail. Pip is still pretty prescriptive.

Make it Work

I’ve been programming for about 6.5 years, in some capacity or another. I don’t know if you ever try to remember what it was like to be a past version of yourself, but I do sometimes, and it’s hard. Remembering not knowing something is really especially difficult - this is a big challenge in teaching.

I’m better at programming than past me. I’m MUCH better at getting things to work reliably. My effort is more efficiently expended, while my standards for “working” have changed - the code I write might be part of a collaborative project, roll up into a larger system or service with emergent behavior, or have a lot of potential edge cases.

This is what I remember of some key turning points in my “make it work” strategies. Like most learning, a lot of my knowledge came from ~~failure~~ educational misadventure.

Run the code

Like many people I first learned to program in a class. I studied physics, so our goals were usually to get a simulation working to demonstrate a concept, but i could in theory turn in my code without running it at all. If I did, 9 times of of 10 there would be some glaring bug.

So the first line of the defense in getting code to work is to run the code. Sometimes this is still enough - when writing little scripts to assist me I often write once and run once, or a handful of times with minimal troubleshooting. Sometimes, when the environment is tricky or the computation takes a long time, it is actually quite hard to do.

Print statements

Running code only tells you if the inputs and outputs are correct, and maybe any side effects that you can obviously observe. Intermediate checkpoints are incredibly useful. One of the first tools I learned in MatLab for physics computation was disp.

Code review

In my college programming classes, we were encouraged to work with a lab parter. Getting someone else to look at your code can smoke out obvious mistakes - just the other day a coworker pointed out that a bug I thought was complicated was just a misspelling of the word “response”.

Out-of-the-box debugging tools

My second Computer Science course was Data Structures and Algorithms in C++ – which meant memory management. My labmate and I were absolutely stuck on some segfault, until a friend a couple of courses ahead at the time (and later a contributing member of the C++ community) came by and showed us Valgrind, in exchange for oreos. I didn’t totally get what was going on, but having an independent program with a view onto my totally malfunctioning, un-print-statementable code was key.

Static analyzers

You know what’s great? IDEs. During college, I’d used environments given to me by my instructor to learn a language - Dr. Racket, Eclipse, MatLab, etc. I learned Python at Recurse Center (then called Hacker School), in the summer of 2013, and for the first time I had total choice over my workflow. After experimenting with vim plugins, I found PyCharm, and a lot of the rough edges of Python were (and continue) to be smoothed.

You know what’s great about PyCharm? It has a bunch of built in static analyzers, things that parse the code and look for known pitfalls, on top of standard syntax highlighting. PyCharm now supports type hints, which I cannot bang the drum of enough.

You might be thinking “my compiled language doesn’t have these problems”, but languages do not (and maybe should not) implement every feature you want by default. I was delighted to learn recently that there is a comment syntax and checker for null exceptions in Java, and seems to be similar to “strict optional” for Python.

Unittests

My first software engineering job at Juniper Networks was the first time I wrote code other people had to reuse and rely on. I worked on a team of about twelve people. I was encouraged to write unittests, and reviewers would block code reviews if I didn’t.

The thing is, writing unit tests is hard! There is a cost to any kind of testing, and an up-front cost of ramping up on a new set of tools. The big barrier to entry for me was learning how to effectively mock out components. Mocking means replacing real, production-accurate bits of code with fake functions or objects that can parrot whatever you tell them. My tech lead had written his own mocking library for Python, called vmock, and evangelized it to the entire team. Some other parts of code used the standard Python mock library. They had different philosophies, different APIs, and different documentation.

I still run into barriers to entry in learning a new testing paradigm today. I wanted to add a tiny bit of logging to an unfamiliar codebase recently, and reading and understanding the test codebase felt like a huge amount of overhead. Thankfully the existing unit tests on this code saved my butt - that tiny bit of logging had about 3 SyntaxErrors in it.

This brings up why unit testing is useful:

It reduces the chance that small units of code are broken. If 10% of your code has bugs, then the chance that both your test and production code have complimentary bugs that cause false positives should be less than 1%.
It documents the expected behavior for posterity. Assuming the tests are run regularly and kept green, the tests are forced to stay up to date with the code, unlike docstrings or external documentation. It’s a great way to remember what exactly the inputs and outputs to something should look like, or what side effects should hapen.

At this point, unit tests have become such a habit that I often write them for untested code in order to understand it. I also write them for anything new that I will share with others. Unit tests can be the happy outcome of laziness and fear. Writing a unit test when you understand code well is less work than running it over and over by hand, and you’re less likely to embarrass yourself by presenting someone else with broken code.

Integration tests

Early at Dropbox, where I currently work, I was trying to fix a regression in the screenshots feature on a new operating system version that wasn’t yet in wide release. I wrote some new unit tests, ran the code on the target system, and felt pretty confident. In the process, due to subtle differences in the operating system APIs (which I had mocked out) I broke it on every version of the platform except the one I was repairing. It rolled out to some users, who caught it. I could probably point you to the forums post with the reports :/.

Then I learned something about the limitations of unit tests:

Mocks encode your assumptions, and can lead to false positives.

Unit tests are intended to pass or fail deterministically, and therefore cannot rely on outside dependencies. It’s very easy to inaccurately mock a web API, for example. Even if you’re very careful about mocking as little as possible, you will feed the tests a constrained sets of inputs and outputs that may or may not reflect real usage.

Unit tests don’t test UI well.

Any person executing the code would have noticed that a dialog didn’t pop up saying their screenshot was saved. But it would be really annoying and repetitive to run that code on every single platform every time I made a small change.

Enter integration tests. Integration tests often affect multiple system, often from the outside, and usually test features end-to-end. This means they take more thought and time to write. They are are also necessarily more expensive and are more likely to fail inconsistently. Frankly, they are not always worth the expense to write and maintain, but when the tradeoff is between an explosion of manual test cases or integration tests, integration tests are way to go. It is an interesting exercise to figure out how to insert test hooks, and may force you (like unit tests) to improve the modularity of the code.

Slow rollouts, flagging, and logging

My first project at Dropbox touched our installer, a pretty important bit of code for the success of the application. I had written unit tests and done a battery of manual tests. When the new feature was starting to roll out, we got error reports from a handful of users, so we halted rollout, turned off the feature, and furiously investigated.

The root cause was something I never would have thought of - the name of a directory had been changed at some point while the Dropbox was installed, which broke some assumptions in the application surfaced by my code.

My three lessons here: even integration tests or thorough manual tests only catch your known unknowns. You’re still vulnerable to the things you cannot anticipate. Also, having ways to quickly turn off code, from the server if possible, is fantastic for limiting the exposure of risky code. Third, testing in a small subset of users only works if you know the important high level of what’s going on while the code is testing, either from event logging “step complete” or error reporting.

Acceptance testing

This is the rare case where I think I learned from best practices recommended by others. If you are going to cut over from one system to a supposedly parallel one, it is much better the run the new one “dark” and confirm they produce the same output rather than do the swap and wonder why things seem to be different, or why it’s falling over under realistic load. A recent surprise application: hardware misconfigurations. Now I help run a continuous integration cluster, and trying to add new hardware to expand capacity actually bit us when some machines were misconfigured.

“Belt and suspsenders”

A year or so ago I worked a migration of our built application binaries storage. Build scripts are really hard to test end-to-end without just making a build, since they mostly call other scripts and move files around, and I didn’t want to make too many extra builds. I already had a neat little acceptance testing plan with a double write. But it turned out I had runtime errors in the new code, which caused the script to crash.

The takeaway: when there’s something critical and hard to run or write a test for, it’s better to build a moat around your new bit (in this case, a try/except and logging would have done), and fall back to the old code if it errors for any reason. The tricky thing here is that the moat-code itself can be buggy (who amongst us has not mistyped the name of a feature flag on the first try, not I), so that moat-code needs to be well tested using the previous tools.

Configuration is code

A lot of code is at least a little dependent on its running environment. It can be as simple as the flags passed into the main function, or as complicated as the other applications running on the machine at the same time.

I’ve relearned the maxim “configuration is code” over and over in my current context, working on a contiguous integration cluster. For one example, we have a few “pet” machines that have been configured over time by many people. It is neigh on impossible to duplicate these machines, or look to a reference of how they were configured. This is why containers and tools like Chef and Puppet are useful. For another, our scripts run via a coordination service that has a UI with hand-editable JSON configuration. While it may be nice to be able to change the flags passed into scripts with little friction, there is no record of changes and it’s difficult to deploy the changes atomically.

I hope you enjoyed these accounts, and I may add on as I misadventure more and learn more :)

Taming Complexity on Desktop Platform

Part two of my blog post for Dropbox, “Accelerating Iteration Velocity on Dropbox’s Desktop Client” came out last week. I find emergent properties and trying to understand and engineer for complexity really interesting. One of the cool things about my job is that I get to think about the emergent properties of both code and people (specifically my coworkers). The article is about what has worked for us.

What I do at work

This week has been a good week for finishing writing. If you’d like to see what happens when I have a dedicated editor, the feedback of other close colleagues, and a professional audience read this post on the Dropbox Tech blog. It is the first of two on the things my larger team (Desktop Platform) did in 2016 to speed up the rate at which we can release new versions of the Dropbox Desktop Client.

Python Exception Handling

Definition and Background

In software, exceptions date from the late 1960s, when they were introduced into Lisp-based languages. By definition, they are intended to indicate something unusual is happening. Hardware often allows execution to continue in its original flow directly after an exception, but over time these kinds of “resumable” exceptions have been entirely phased out of software. You can read more about this in the Wikipedia article, but the important point is that these days, software exceptions are essentially a way to jump to an entirely different logical path.

Intuitively there are good reasons for exceptions to exist. It is difficult to anticipate the full scope of possible inputs to your program, or the potential outputs of your dependencies. Even if you were able to somehow able to guarantee that your program will only ever be called with the same parameters, cosmic rays happen, disks fail, and network connectivity flakes, and these can manifest in a variety ways that all result in your program needing to metaphorically vomit.

In Python

I primarily program in Python, which is a very flexible programming language. Given that it has “duck typing”, values are happily passed around as long as possible until it is proven that computation cannot succeed. I had noticed a tendency in my own programming and in other code I read to use exception handling as program flow control. Here is a really small example that I saw (paraphrased) in code earlier this week:

import json


def get_foo(filename):
    with open(filename, 'rb') as fobj:
        data = json.load(fobj)
    try:
        return data['foo']
    except KeyError:
        return None

If I run this, all that will ever happen is either foo will be returned, or None will be returned, right?

(no.)

There are a lot of other possible exceptions in this code. What if the data doesn’t turn out to be a dict? As far as I know, this would raise a TypeError. What if data isn’t valid JSON? With the standard json library, it would be a ValueError or JSONDecodeError. What if there isn’t even a file called filename? IOError. All of these are indicative of malformed data, rather than just a value that hasn’t been set.

It’s really tempting to enumerate all of these explicitly:

import json


class MalformedDataError(Exception):
    pass


def get_foo(filename):
    try:
        with open(filename, 'rb') as fobj:
            data = json.load(fobj)
    except IOError, ValueError, JSONDecodeError:
        raise MalformedDataError
        
    try:
        return data['foo']
    except TypeError:
        raise MalformedDataError()
    except KeyError:
        return None

When I look at this however, it feels a little weird… like it’s not the best practice. The exception handlers are being used for two fundamentally different things: an expected, but maybe unusual case, and violation of the underlying assumptions of the function.

Plus, trying to think of every single possible source of exceptions is arduous. It makes me nervous now, even in this toy example of four lines of code non-try/except code. What if the filename isn’t a str? Should I add another handler? There are already a lot of except clauses, and they’re starting to get in the way of the readabilily of the code.

Ok, new idea. Clearly the IndexError should be replaced with a simple get. Probably that would be more performant anyway (maybe some other time I’ll delve into the implementation of exception handlers and their perfomance). Around everything else, I could throw in a general try/except.

import json


class MalformedDataError(Exception):
    pass


def get_foo(filename):
    try:
        with open(filename, 'rb') as fobj:
            data = json.loads(fobj)
        return data.get('foo')
    except Exception:
        raise MalformedDataError()

Suddenly I’m getting a MalformedDataError on every run of this function… Am I missing the file? Is it formatted correctly? Eventually, in a hypothetical debugging session, I’d read closely and add some print statements and figure out that I accidentally used the wrong json load function: load and loads are sneakily similar.

The general try/except is disguising what is clearly a third kind of issue - a bug within my code, locally in this function. Try as I might, the first draft of my code regularly has this kind of “dumb” mistake. Things like misspellings or wrong parameter order are really hard for humans to catch in code review, so we want to fail local bugs early and loudly.

Aside: Exception hierarchies

The possibility of conflating two problems that raise the same exception, but need to be handled differently, is a nerve-wracking part of Python. Say my typo had caused a TypeError in line 16 of the 2nd code snippet, by, for example, trying to index with a mutable type, return data[['foo']]. Even if I tried to switch to a except TypeError around just that line, in case the JSON object was not a dict, the local-code bug would not be uniquely identified.

Conversely, when I write my own libraries, it can be overwhelming to decide whether two situations actually merit the same exception. If I found a list instead of a dict, I could raise TypeError, which is builtin and seems self-explanatory, but might get mixed up with thousands of other TypeErrors from other parts of the stack. Or I could a custom exception WrongJSONObjectError, but then I have to import it into other modules to catch it, and if I make too many my library can become bloated.

I could rabbit-hole further on this code, exploring more potential configurations. Maybe I should check the type of foo before returning. Maybe I could try to catch only the errors I definitely know aren’t my fault, and then stick a pass and retry in there in case there’s some transient error. Hey, filename me be on a network drive and connectivity is flaky. The proliferation is huge, so it’s time to get axiomatic.

It’s worth noting that all of the above full examples pass a standard pep8 linter. They are all patterns that I’ve seen somewhere in thoroughly tested and used production code, and they can be made to work. But the question I want to answer is, what are the best practices to make code as correct, readable, and maintainable as possible?

Classifying exceptions

So far we’ve enumerated three ways exceptions are used:

1. Unusual, expected cases.

It seems better to me to just not use exceptions, and try an if statement instead of relying on the underlying code to fail. The only exception (hehhh) to this rule that I can think of is if there is no way to catch a expected-unusual case early, but I can’t think of a good example for this. The Python language has made the design decision to make StopIteration and handful of other program flow oriented concepts an Exception subclass, but it’s worth noting they are not a subclass of StandardError. Probably our own custom subclasses should respect this nomenclature, using Error for something gone wrong and Exception for something unusual but non-problematic.

Another subtle facet: do we fold these mildly exceptional cases into the expected return type, by maybe returning an empty string if foo expected to be a string, or return something special like None? That is beyond the scope of this specific article, but I will give it more thought. I recommend looking into mypy, a static type analyzer for Python, and its “strict optional” type checking to help keep if foo is None checks under control.

2. A logical error in this section of code.

Logical errors happen, since writing code means writing bugs. Python does its best to find and dispatch obviously line-level buggy code with SyntaxError.

   >>> deg get_foo(filename):
  File "<stdin>", line 1
    deg get_foo(filename):
              ^
SyntaxError: invalid syntax 

In fact, SyntaxErrors AFAIK can’t be caught, which is a pretty good indicator of what should happen when executing locally buggy code - fail fast and with as much explicit information as possible on location and context. Otherwise, local bugs that are not raised immediately will show up as a violated expectation in the next layer up or down of the code.

That brings us to the next type:

3. Other things are effed.

Some other code broke the rules. Ugh. It feels like it shouldn’t be my job to clean up if something else messed up. It’s really attractive, when starting out on a project, to just let all exceptions bubble up, giving other code the benefit of the doubt. If something consistently comes up in testing or production, maybe then it’s worth adding a handler.

But, there is some some more nuance here. In our examples above, the “culprit” of errors could be many different things. They roughly break down into the following:

Problems with dependent systems. The callee of this code did not meet my expectations. Perhaps I called a function, and it raised an error. Could be an error due to a bug in their code, or it could be something further down the stack, like the disk is full. Maybe it returned the wrong type. Maybe the computer exploded.

Problems with depending systems. The caller of the code did not meet my expectations. They passed in a parameter that isn’t valid. They didn’t do the setup I expected. The program was executed in Python 3 and this is Python 2.

However, there are plenty of places where the distinction between these possibilities doesn’t seem particularly obvious. Did the callee raise an error because it failed, or because I violated its assumptions? Or did I pass through a violated assumption from my caller? In the examples above, IOError would be raised by json.loads(filename), regardless of whether the filename didn’t exist (probably the caller’s problem) or the disk was broken (the callee’s problem, since the ultimate callee of any software is hardware).

Design by Contract

The concepts might sound familiar if you’ve heard much about Design by Contract, a paradigm originated by Bertrand Meyer in the late 1990s. Thinking about our functions explicitly as contracts, and then enforcing them like contracts, is actually potentially the solution to both of the top-level problems we’ve come across: attributing error correctly and handling it effectively.

The basic idea behind Design by Contract is that interfaces should have specific, enforceable expectations, guarantees, and invariants. If a “client” meets the expectations, the “supplier” must complete the guarantee. If the “supplier” cannot complete the guarantee, at a minimum it maintains the invariant, and indicates failure. If the “supplier” cannot maintain the invariants, crash the program, because the world is broken and nothing makes sense anymore. An example of an invariant is something like, list size must not be negative, or there must be 0 or 1 winners of a game.

In application

The language Eiffel was designed by Meyer to include these concepts at the syntax level, but we can port a lot of the same benefit to Python. A first step is documenting what the contract actually is. Here is the list of required elements for a contract, quoted from the Wikipedia page:

Acceptable and unacceptable input values or types, and their meanings

Return values or types, and their meanings

Error and exception condition values or types that can occur, and their meanings

Side effects

Preconditions

Postconditions

Invariants

Most Python comment styles have standard ways of specifying the first three item. Static analyzers, like mypy or Pycharm’s code inspection, can even identify bugs by parsing comments before you ever run the code. Taking the last version of our example from above, and applying the Google comment style + type hints, we might end up with something like this:

import json
from typing import Optional


class MalformedDataError(Exception):
    pass


def get_foo(filename: str) -> Optional[int]:
    """Return the integer value of `foo` in JSON-encoded dictionary, if found.

    Args:
        filename: Full or relative path to JSON-encoded dictionary.

    Returns:
        The integer at key `foo`, or `None` if not found.

    Raises:
        MalformedDataError: if the JSON-encoded dictionary cannot be found and parsed to find `foo`.
    """

    try:
        with open(filename, 'rb') as fobj:
            data = json.loads(fobj)
            print data
        if 'foo' in data:
            return int(data['foo'])
        else:
            return None

    except Exception:
        raise MalformedDataError()

We even got a new piece of information! I hadn’t been forced to think about the return type previously, but with this docstring, I realized foo is expected to be an int, and added a bit of code to enforce that.

Side effects and invariants

Generally speaking there aren’t good ways of enumerating or statically checking the execution of side effects, which is a reason to avoid them when possible. A common example would be updating an instance variable on a class, in which case the side effect would be hopefully obvious from the method name and specified in the dostring. At a minimum, side effects should obey any invariants.

This tiny code snippet doesn’t have an obvious invariant, and I theorize this is because it is side-effect free. Conceivably, another function in the module would mutate the dictionary and write it, and our invariant would be that filename would preserve a JSON encoded dictionary, maybe even of a certain schema. The other classic example is a bank account: a transaction can succeed or fail, due to caller error (inputting an invalid currency) or callee error (account database isn’t available), but the accounts should never go negative.

Preconditions and postconditions

Preconditions are things that must be true before a function can execute. They are they responsibility of the “client” in the business contract analogy, and if they are not met, the “supplier” doesn’t have to do anything. Postconditions are the criteria by which the function is graded. If they are not met, and no error is indicated, the bug can be attributed to that function. This maps really cleanly to the caller and callee problems we delineated above.

Left up to the programmer is where to draw the line between pre- and post-. In our toy example above, I’ve implicitly decided to make the preconditions pretty strict: filename must exist and be a JSON encoded dictionary. I could have chosen to, for example, accept a file with a single integer ascii-encoded integer, or return None if the file doesn’t exist. However, I might have been more strict: I do allow dictionaries with a missing foo value, and simply return None, and I’m willing to cast the value at foo to an integer even if it’s a string or float.

Having seen a lot of code that tries to lump similar things together with zillions of if statements, usually in the name of deduplication or convenience and at the cost of unreadability, I would argue that stronger preconditions are better. The deduplication is better done via shared helpers with strong preconditions themselves. Casting it int is probably even a bad idea… but I’ll leave it as is given that the requirements of this program are pretty arbitrary.

My postconditions are pretty simple in this case: return the int or None that corresponds to the value of foo at that path.

Identifying violations

A lot of the ideas in this post are based on this article by Meyer, which has a bit of a vendetta against “Defensive Programming”. In the defensive paradigm, we would anticipate all possible problems and end up with a check at every stage. I added a client to the get_foo function to demonstrate this:

import json
import os
import sys

from typing import Optional

FOO_FILE = 'somefile.json'


class MalformedDataError(Exception):
    pass


def get_foo(filename: str) -> Optional[int]:
    """..."""
    if not (isintance(filename, str) and os.path.exists and os.path.isfile(filename)):
        raise MalformedDataError()

    with open(filename, 'rb') as fobj:
         data = json.loads(fobj)

    if not isinstance(data, dict):
        raise MalformedDataError()
    
    foo = dict.get('foo')

    if foo is not None
        try:
            foo = int(foo)
        except TypeError:
            raise MalformedDataError()

        assert isinstance(foo, int)
        return foo

    return None


def foo_times_five(): -> int
    """Print 'hello' the number of times that foo specifies."""

    if not os.path.isfile(filename):
        return 0
    
    try:
        foo = get_foo(FOO_FILE)
    except MalformedDataError():
        return 0

    assert isintance(foo, (int, None)), "Foo should be an integer value or None"

    if foo is None:
        return 0
    else:
        return foo * 5

Indeed, this is really verbose. But there isn’t a hard and fast rule that the article offers for who is supposed to check. The main thing it argues explicitly against is the exception handler usage I first identified as potentially dangerous above: special but expected cases. My interpretation of Meyer’s work, in the context of Python, is that assertions can be used to make it really clear what is at fault when something goes wrong. In fact, that should be the primary goal of all exception handling: ease of debugging.

I think this the final version I would go with:

import json

FOO_FILE = 'somefile.json'


def get_foo(filename: str) -> Optional[int]:
    """Return the integer value of `foo` in JSON-encoded dictionary, if found.

    Args:
        filename: Full or relative path at which to look for the value of 'foo'. File must exist and be a valid JSON-encoded dictionary.

    Returns:
        The integer-casted value at key `foo`, or `None` if not found.
    """

    with open(filename, 'rb') as fobj:
         data = json.loads(fobj)

    assert isinstance(data, dict), 'Must be a JSON-encoded dictionary'   
       
    if 'foo' in data:
        try:
            foo = int(dict.get('foo'))
        except TypeError:
            return None
    else:
        return None


def foo_times_five(): -> int
    """Print 'hello' the number of times that foo specifies"""

    if not os.path.isfile(FOO_FILE)):
        return 0
    
    foo = get_foo(FOO_FILE)

    assert isintance(foo, (int, None)), "Foo should be an integer value or None"

    if foo is None:
        return 0
    else:
        return foo * 5

My rationale is the following:

A static type checker would take care of the filename not being a string.
A failure of opening the file would be easily attributed to the caller, given the docstring. So, my caller takes care of that check, since it’s using a module-level variable it may not entirely trust (it could get mutated at runtime).
JSONDecodeError is also pretty easily to interpret. Plus, there isn’t really a way to assert for the validity of the file other than to use the json library’s own contracts.
I generally find failing to index a presumed dictionary to be confusing error to debug, especially since mypy and and json itself can’t help us out to catch type issues from validly deserialized data. Therefore I make an explicit check for the type of data.
foo_times_five wants to ensure its own guarantees, and since the * operator works in strings and lists and so on as well as ints, I added an assertion.
Also, in addressing the custom error vs builtin error question, I didn’t think it was necessary to make my own exception subclass. All the assertions I added should completely disrupt the flow of the program, and AssertionError comments are effective for communicating what went wrong.

Handling violations

You’ll notice that my two functions have drastically different ways of handling exceptional cases once they occur. foo_times_five returns an int almost at all costs, while get_foo raises more and asserts more. Once again, this is a really toy example, but the goal is to originate the exception as close to the source of the error as possible. As I think about i, further, there seems to be b

At the boundary of a self-contained component, because that component has violated an invarient and needs to be reset. Self-contained should be emphasized - if the component shares any state, then it could leak that invariant violation to other components before being torn down. This is an argument for distinct processes per component, and I think good part of the reason why microservices, that are deployed and rebooted independently, are increasingly popular. The Midori language is really opinionated about this, (and that blog post is a long-but-good read on how they came to their exception model). Any error that can not be caught statically results in the immediate teardown and recreation of an entire process, which is called “abandonment”.
A great big try/except and automatic reboot clause at the highest application level. This is essentially the equivalent of #1, but we’re giving up on everything. The downside is that any bug that makes it to production will cause crashloops, so this needs to be paired with good logging and reporting.
Really close to code that failed. An important distinction is that it should be near the code that could not fulfill its contract - not at the first level that happened to not be able to continue execution. Interestingly, most of the examples I can think of also qualify as #1, like retrying after network errors, server 500, or refused database connections. We just happened to already have well-defined ways of communicating that something has gone wrong across these “component” boundaries.
Wrapping callee libraries that are designed to have control-flow oriented exceptions.

I would add two more debug-facilitating rules: leave some trace every time an exceptions is strategically swallowed, and be as consistent as possible (perhaps, via documented invariants) about the criteria for exception recoverability across environments and parts of the codebase. Here is why:

A war story

At work, we used to have a general-purpose exception handler that would report stacktraces to a central server, but behaved somewhat differently in “debug” and “stable” contexts. In debug situations, it would bubble up the exception, causing a crash for the developer to notice and fix, since they were right there editing code anyway. In stable, it would continue as is. The intention was noble - we want to minimize crashes for end users. But this made it really difficult for developers to reason about what would happen when their code was promoted to stable, and in a worst case scenario the application could continue running after an error in a untested, unknown state. This problem was thrust into the light when a unittest began failing only when the stable environment flag was on. We systematically got rid of the handler, and also now run all tests with a mock-stable configuration on every commit to guard against other configuration-dependent regressions.

Python Packaging Disambiguated

I’m going to talk about something really boring. That thing is packaging and distributing Python code.

Motivation

I have been programming primarily in Python for about 3.5 years now, happily creating virtualenvs and pip installing things - and sometimes yak shaving non-Python dependencies of those libraries - without any idea how python code is packaged and distributed. And then, one day I had to share a library my team had written with another team, and lo and behold, I wrote my first requirements.txt file. I was inspired to peek under the hood.

It turns out some people think that Python packaging is sucky. For most of my purposes - working on small pet projects or in fully isolated development environments maintained by another engineer, this works fine. The problem is that the ONLY thing that Python tries to encapsulate is Python code - not dependencies, not environment variables, etc. There is some more exposition about that from other people if you’re interested: example.

The biggest hurdle to getting started was understanding the lingo.

Important terms

The PyPI (Python Package Index), previously known as the Cheese Shop, has a glossary with everything that you might want to know. Here are the ones I was previously confused about:

module - In Python, a single .py file is a module. This means that Python can be organized without classes quite nicely. You can put some methods and constants into a file and tada, a logically isolated bit of code.
package - A collection of modules. The tricky thing is that it can either be an import package, which is module that contains other modules that you can import, or a distribution package, described by distribution below. In colloquial usage it seems to mean “a collection of Python code published as a unit”.
distribution - An archive with a version and all kinds of files necessary to make the code run. This term tends to be avoided in the Python community to prevent confusion with Linux distros or larger software projects. Many distributions are binary or built, shortened to bdist, and are binary, platform-specific, and python-version specific. sdist or source distributions are made up only of source files and various other data files you would need. As you might expect, they must be built before they are run.
egg - A built packaging format of Python files with some metadata, introduced by the setuptools library, but gradually being phased out. It can be put direcly on the PYTHONPATH without unpacking, which is kind of nice.
wheel - The new version of the egg. It has many purported benefits, like being more implementaton-agnosticmm and faster than building from source. More info. About two thirds of the most popular Python packages are now wheels on PyPI rather than eggs. Citation.

I also found it mildly interesting to figure out how this somewhat fractured environment came to me.

A brief history

There is a chapter of The Architecture of Open Source dedicated to nitty gritty of python packaging. The CliffNotes version is that there was some turbulence over Python’s packinging libraries, setuptools and distutils. distutils came first, but it lacked some really integral features… like uninstallation. Setuptools was written to be a superset of distutils, but inherited some of the same problems. One key issue is that the same code was written to take care of both publishing and installing Python packages, which meant bad separation of responsibility.

Meanwhile, there was a also some friction between easy_install and pip. Easy_install comes automatically with Python, can install from eggs, and has perfect feature parity on Windows, but pip has a much richer feature set, like keeping track of requirements heirarchies and (most of the time) uninstallation, but is more finicky.

I may follow up with more on how virtualenv and pip actually work, stay tuned!

Linux block devices

One day at work I was presented with a problem. We use Amazon Web Services’ EC2 for a lot of things. An instance was storing lots of data in a specific directory, and it ran out of space.

The machine at hand: an Ubuntu 14.04 instance, the Trusty Tahr. I made a blank micro instance for demo purposes.
The solution: Elastic Block Store, or EBS.

What is an EBS volume?

From the Amazon documentation,

An Amazon EBS volume is a durable, block-level storage device that you can attach to a single EC2 instance.

Basically it’s a bit of storage that can have a filesystem and be treated by the operating system mostly like it would a physical drive you add or remove from your computer. However, it’s connected the the server over the network, and can therefore be umounted and saved or shifted from one instance to the next. They’re a really core AWS product - most EC2 instances these days are “EBS-backed”, meaning an EBS volume taken from a standard snapshot is mounted at the root directory. You can turn off the instance and keep all its data intact if you want.

The other kind of instance is “instance store”, which doesn’t persist and mostly seems to be around for legacy reasons.

Mounting

What does all of this mounting mean? Presumably no steeds are involved.

Well, a computer’s files can be spread across a bunch of different storage devices. Mount is the Unix command that tells your operating system that some storage with a filesystem is available and what its “mount point” is, i.e. the part of the directory tree for which it should be used. If you were dealing with a physical computer sitting on your lap or desk, you might use mount to indicate a new SD drive or DVD. As you might expect, umount is its opposite, and disengages the storage.

Some really simple sample usage:

$mount  # Just tells you what's going on currently. 
/dev/xvda1 on / type ext4 (rw,discard)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/cgroup type tmpfs (rw)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
none on /sys/fs/pstore type pstore (rw)
systemd on /sys/fs/cgroup/systemd type cgroup (rw,noexec,nosuid,nodev,none,name=systemd)

$mount /dev/xvdh ~/data # Would put the xvdf drive in my ~/data direcotry
$umount /dev/xvdh  # Unmount! Same things as `umount ~/data` really

Meanwhile, need to know how much space and you have and where? df is your gal. This is what it looks like for my test instance

/dev/xvda1       8125880 788204   6901864  11% /
none                   4      0         4   0% /sys/fs/cgroup
udev              290380     12    290368   1% /dev
tmpfs              60272    188     60084   1% /run
none                5120      0      5120   0% /run/lock
none              301356      0    301356   0% /run/shm
none              102400      0    102400   0% /run/user

So… where’s my EBS volume?

First of all, talking specifically about AWS, you have to do some configuration dances before you get an actual extra EBS volume anywhere near your instance. As you can see from the output above, there is already a device called /dev/xvda1 hanging out at the root directory, which is the root device.

When you’re creating an instance, there’s a handy button that says “Add new volume” and then you have to choose a name for your volume, use a specific snapshot, and various other specifications. The names are things like “/dev/sdb1”.

What does this name mean? The helpful popover says:

Depending on the block device driver of the selected AMI’s kernel, the device may be attached with a different name than what you specify.

Wat?

A little history of technology

EBS volumes are “block devices”, which means they read data in and out in chunks, do some buffering, allow random access, and in general have nice high level features. Their counterpoint is the character device, which is less abstracted from its hardware implementation. They are sound cards, modems, etc. Their interfaces exist as “special” or “device” files which live in the /dev directory.

The options for block device names in the AWS console pretty much all start with sd[a-z]{2}. This stands for “SCSI disk”, which is pronounced like “scuzzy disk”. SCSI is a protocol for interacting with storage developed long ago. It was the newfangled way to connect storage to your computer compared to IDE cables, which are those ribbons of wire I at least remember my dad pulling out of the computer to detach a floppy drive in the ’90s.

More nomenclature: hd would indicate a hard disk, and xvd is Xen virtual disk. Everything in Amazon is virtualized, and the tricky thing is that only some kinds of instances understand that and adjust operations to account for it. Those that do, like Ubuntu 14.04, convert the sd name you chose to xvd. Anyway, you can suss out what exactly your instance’s OS will think the device is name using this reference.

What does this all mean PRACTICALLY? Like how do I actually get to the volume?

Try lsblk.

$ lsblk
NAME  MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvdh  202:112  0   8G  0 disk
xvda1 202:1    0   8G  0 disk /

Looks like my EBS volume got changed from sdh to xvdh. Yay! It’s virtualized.

Anyway, the next step is to mount it, right? I have to prefix /dev, since that’s the directory the special file is in.

$ sudo mount /dev/xvdh /data
mount: block device /dev/xvdh is write-protected, mounting read-only
mount: you must specify the filesystem type

Ok so actually there’s another step I haven’t explained yet.

Filesystems

I originally thought that filesystems were part of the operating system, since with OSX or Windows (which is what I’ve used historically), you’d have to go out of your way to use a non-standard one. With Linux, choice is abundant. This article has a lot of good things to say clarifying the many Linux filsystem options. For devices that are not crazily large, the standard on Linux these days is ext4 - it’s journaling (meaning it checks in both before and after writes which prevents corruption issues), it’s backwards compatible (so you can mount its predecessor ext filesystems as ext4), and it’s fast.

Important note: EBS volumes do not come with a filesystem already set up. That line about telling you to “specify” in the error above is a red herring. First, you have to make a filesystem with mkfs.

$ sudo mkfs -t ext4 /dev/xvdh
mke2fs 1.42.9 (4-Feb-2014)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
524288 inodes, 2097152 blocks
104857 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=2147483648
64 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

YAYYYYYYY now we can actually sudo mount /dev/xvdh /data and dump all our cat pictures on there.

                /^--^\     /^--^\     /^--^\
                \____/     \____/     \____/
               /      \   /      \   /      \
              |        | |        | |        |
               \__  __/   \__  __/   \__  __/
|^|^|^|^|^|^|^|^|^\ \^|^|^|^/ /^|^|^|^|^\ \^|^|^|^|^|^|^|^|
| | | | | | | | | |\ \| | |/ /| | | | | | \ \ | | | | | | |
##################/ /######\ \###########/ /###############
| | | | | | | | | \/| | | | \/| | | | | |\/ | | | | | | | |
|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|

Some more resources

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html

Special thanks to Andy Brody and Steve Woodrow for pointers and company.

Welcome

Welcome to my blog! I was going to get right down to business, but this felt like a good time to set out what exactly I aim to accomplish by writing things publically available on the internet.

This is primarily a “technical” blog. I certainly love writing witty one-liners, posting about my personal life, or taking pictures of myself eating things - but I have twitter, facebook, and a silly tumblr for all of those respectively.

My goals:

To keep a record of stuff that I’ve learned recently for reinforcement and future reference.
To be useful to others looking for resources. For right now that will be a lot of systems and command-line utility oriented things, hopefully accessible to those who (like me) don’t have a computer architecture course under their belts.
To give myself a reason to go in depth into topics outside of work, like for example, learning how to set up a Jekyll-based blog.