Server Farmin': August 2020

Tuesday, August 18, 2020

Philosophy: Understanding Economic Pressure (Multiply by 100 rule)

This is a stub, more detail to come.

Philosophy: Common Misconceptions about Life in Rural East Africa

This is a stub, more detail to come.

Philosophy: Implicit Power Dynamics vs. Honest User Feedback

This is a stub, more detail to come.

Monday, August 17, 2020

Philosophy: Simplicity vs. Complexity

This is a stub, more detail to come.

Philosophy: Startups vs. Corporations vs. NGOs

I've worked at all different types of companies throughout my career – I started off interning with Procter and Gamble just out of college, decided that I wanted to pursue freelance work for a while, moved from there to startups and eventually to an NGO. Most of my current role at the NGO involves scaling problems – transitioning from a startup culture to a corporate culture. Each of them has its pros and cons, and I still can't say which is best.

Freelance:

Freelance work is fun but unreliable. I've only ever had a few actual freelance jobs (5-10 hours of work or so) – most people don't want to build tiny custom software projects, they want more work commitment. Freelance work seems to be more common for frontend web designers; for backend/embedded programmers like myself there were limited options. This can be a bit stressful when a client stiffs you on a large bill and you need to cover rent for the next month... but the flexibility can be really cool.

My main two lessons from Freelancing: never sign a contract without really thorough review and never do phase-based payment. My first-ever freelance gig, I signed a contract that guaranteed payment after the client reviewed my work, but overlooked the part that would have set a timeline on the lag between my work submissions and their review/approval. My “phase 3” work submission was in 2013, I'm still waiting on that approval payment. Clients always want to do phase-based payment (I pay you $500, you build the thing) but this doesn't take into account agile iteration, startups running out of money, or the unpredictable nature of problem solving. If the client isn't willing to pay by the hour, they're not serious – I no longer consider phase-based work offers and wouldn't recommend them for any other developers.

Startups:

Startups are also fun but unreliable. Most of my freelance work eventually became startup work – the longest contract was with a Biomedical startup developing a new type of blood pressure cuff. In a startup, you'll wear all the hats you can – by the time I left the Biomedical one, I was:

Writing all their software (Embedded PIC C code and a paired Android App)
Doing mathematical modeling on arterial + venous wall pressures (you pick up some random skill sets by working at startups)
Designing experiments to determine if the cuff worked or not
Debugging hardware (including a few hand-soldered fixes to prototype boards)
Analyzing experimental data/running statistics on the result sets
Testing the device in a live operating theater (by far the most stressful moment of my career thus far has been walking into an operating room to test a prototype medical device, dude's on the table about to be cut up, we put the prototype on his arm... and it fails to turn on. ALWAYS. HAVE. A. BACKUP. PLAN. Thankfully we had a second prototype that worked.)

Startups give you a ton of experience, but are really stingy about money (sometimes in needlessly self-destructive ways) and tend to assume that equity means you should work unrealistically long hours. If the startup is offering only equity, they're not serious – if they can't offer hourly pay, they're not worth my time. The pressure in a startup can be overwhelming, there were more fights in the startup environment than I've seen anywhere else.

I'd recommend startup work, but always remember the rule of startups:

Most Startups Fail.

As long as you remember that rule, you should be ok. If you start to forget it (taking equity-only commitments instead of payment, selling your car to fund your startup, etc.) then you need to re-evaluate what you're doing.

Corporations:

My personal take on a corporation vs. startup – in a corporation, the processes that people are executing matter more than the people themselves. This sounds really scary and faceless, but is actually pretty nice when you're a customer of the organization. Can you imagine being unable to use Google Searches because “Bob's the only person who can fix that server and he's out on vacation”? That kind of excuse happens a lot in startups, but is impossible in a corporation – the entire point of working in one is that you're replaceable. I found corporate work to be less exciting than startups and freelancing, but it's very secure. At the end of a 5 year project, you may have found a way to save 0.1% of the production cost of a roll of paper towels and made the shareholders millions of dollars... but if you don't, then the job will usually still exist afterward. I don't think that corporate work is inherently bad, it just wasn't for me the last time I tried it.

Non-Governmental Organizations (NGOs):

Finally there's my current role. Depending on the scale of the NGO, it can feel like a startup or a corporation (One Acre Fund feels like it's in transition from one to the other). Typically these have much lower pay than you'd get for equal corporate work and sparse resources in general. You get to feel like the work you're doing has a positive impact, but I actually wouldn't recommend NGO work for most technical people. My experience has been that NGOs value liberal arts above all else – communication, soft skills and personal interactions matter a lot more than they would at more typical engineering jobs. NGOs seem to have a very short attention span as well, the constant churn of 2-year employees makes it easy to become senior but hard to keep attention on any multi-year project.

So why should you consider working at an NGO when you can get better pay and not have to pretend to be a social butterfly elsewhere? Two words:

Street Cred

Working in Silicon Valley, behind the desk of a major corporation, even at a promising new startup - none of these have street cred. People may not actively dislike you for doing one of these, but you're not exactly cool - look at the media language used for Mark Zuckerberg or Jeff Bezos, whose companies are the best case scenario for a tech startup. Other tech work might make you rich, but it will never make you badass.

Philosophy: Invisible Lines

Here's a thought experiment:

Imagine that you're in America (or the UK, Europe, somewhere in the Global North). You're driving a car and come to a section of the road that has no line painted on it. No cars are oncoming. What do you do?

Most people would say “stay on your side of the road”. You draw an invisible line down the center of the road and stay on your side of it.

One of the patterns that I've consistently seen through many long taxi rides in East Africa – drivers here don't do that. Instead, people drive on whichever side of the road makes the most sense – the one with fewer potholes, less pedestrian traffic, etc.

It's a subtle realization but when you start looking for this mentality, you see it all around you. A lifetime in the West trains you to see invisible lines everywhere – down the center of the road, around other people's houses, between your problems and your neighbor's problems. The lines are drawn differently in rural East Africa (and indeed in any culture other than the one you grew up in). I still don't fully understand it, but in general people in the villages seem to have a very different worldview than my own – one that has much less abstraction and much more concrete reality. A reality where driving on the wrong side of the road or walking through a neighbor's backyard are much more socially acceptable, but interpreting an abstraction like a map doesn't always translate culturally. Of course those are broad generalizations and don't apply to everyone, but they apply to enough people that you need to account for your own invisible lines when operating in rural areas.

This mentality is coded into One Acre Fund's systems in hundreds of subtle ways. There is a single moment that pretty well captures Roster and the OAF business model. It's not the only quirk of our systems, but I find it the most interesting.

During truck deliveries, One Acre Fund sends out excess bags of fertilizer in the likely event of breakage. The interesting question - what should you do if there's an excess bag left at the end of a delivery?

1. Send it back to the warehouse. This is the Western model of business - no client has placed an order, it's easiest to predict inventory and manage sales if the bag is returned.
2. Sell it to a client who wants to buy it. This is the East African model of business - there's a client right there who wants to buy your product, why would you stop them?

In my opinion, One Acre Fund grew because it is built around the second option (not every country does this, we refer to it as "Just In Time" for those that do). Allowing total client flexibility on the day of item delivery is not a standard Western model of business, but we're not in America here. It's not like a post office can just "deliver it tomorrow" or accept item returns after the fact. Trying to force "what works in America" to work here is doomed to failure, no matter how well-funded or internationally applauded.

Generations of foreigners have come to Africa and tried to "Make Sense" of the continent. As far as I can tell, the only foreigners who are capable of "making sense of Africa" are the foreigners who are changed by Africa; those who start to question their own invisible lines rather than projecting them onto others who can't see them.

Technology: Ephemeral Test Environments

This is a stub, more detail to come.

Technology: Forensic Analysis of a Ransomware Attack

On 5/11/2019, I received a fairly routine alert from a developer - "Hey Louis, the deployment server isn't working". At the time, our deployment server hosted Jenkins (a tool used to build code) and Octopus Deploy (a tool used to send the code out into the real world). Both tools are fairly stable, but have some occasional maintenance needs - so I opened up the server through a Remote Desktop Connection to see why they were broken. What I found was a sophisticated Ransomware attack; all of the non-operating system files were encrypted and every directory had a copy of these instructions:

---= GANDCRAB V5.2 =---

***********************UNDER NO CIRCUMSTANCES DO NOT DELETE THIS FILE, UNTIL ALL YOUR DATA IS RECOVERED***********************
*****FAILING TO DO SO, WILL RESULT IN YOUR SYSTEM CORRUPTION, IF THERE ARE DECRYPTION ERRORS*****
Attention!

All your files, documents, photos, databases and other important files are encrypted and have the extension: .DEVPN

The only method of recovering files is to purchase an unique private key. Only we can give you this key and only we can recover your files.
The server with your key is in a closed network TOR. You can get there by the following ways:
----------------------------------------------------------------------------------------
| 0. Download Tor browser - https://www.torproject.org/

| 1. Install Tor browser

| 2. Open Tor Browser

| 3. Open link in TOR browser: http://gandcrabmfe6mnef.onion/8ff5caefabf9673

| 4. Follow the instructions on this page

----------------------------------------------------------------------------------------

Over the next 12 hours, I traced the entry points, pinpointed the security vulnerability that allowed the attack in the first place and worked with our sysadmin to permanently fix the security issue; since May 2019, this attack has not been repeated. I'm pretty proud of this work, forensic analysis isn't my area of expertise so this was outside my normal scope of responsibilities. The following is the writeup I put together at the time of the attack.

Attack Timeline:

DEVPN-MANUAL files (the ransom instructions) were added to every directory; the oldest such file is in the top-level C: directory and was created on 5/11/2019 @ 4:19 AM UTC (5:19 AM server time). This appears to be the first file created by the GandCrab ransomware. Circumstantial evidence in the logs points to this time as the likely start of encryption:

5:19:04 (server time) - SQL services killed unexpectedly

5:19:04 (server time) - Logoff from a service (logon type 5 in the event viewer) for the OAFDEPLOY\OneAcreFundAdmin account. Most likely the jenkins server restarting.

Possible Causes:

Dictionary attack brute forced the passwords to the deployment server

All servers in the OAF network are under constant dictionary attacks, but the passwords are long enough that these probably won’t work.
Unlikely that a dictionary attack happened, the password used for the “OneAcreFundAdmin” account that initiated the attack was long and non-standard.

Keylogger captured passwords for Deployment Server

Possible, but strange that only the deployment server was affected - all people who have access to the deployment server also regularly use other servers

Vulnerability in Jenkins software allowed server access

Most likely attack vector, see detail below

Vulnerability in Octopus Deploy allowed server access

Possible Attack Vector, but requires known/guessed credentials

Man-in-the-middle attack captured part of an RDP session.

Possible, but again strange that only the deployment server was affected

Login History:

Windows Event Viewer logs tend to be cluttered with many login/logoff records from system events, not all of which correspond to real logins. Real logins are type 10 (interactive remote session). No logins of type 10 can be found in the deployment server logs from the start of the log (4/28/2019) up until my logins (5/13/2019). No remote desktop sessions were initiated on the date of the encryption attack. This would indicate that it wasn’t a successful dictionary attack; rather, a different attack surface was used.

Note: It may be possible to modify the login history, it’s not out of the question that an advanced attack could have covered its own tracks.
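The filtering described above can be sketched in a few lines of Python. This is a minimal illustration rather than the tooling used during the investigation: it assumes the Security log has already been exported into a list of dictionaries, and the field names (event_id, logon_type, account, time), the account names and the sample records are all hypothetical. Event ID 4624 is the Windows "successful logon" event; logon type 10 is a remote interactive (RDP) session and type 5 is a service logon.

```python
from datetime import datetime

SUCCESSFUL_LOGON_EVENT = 4624  # Windows Security log: "An account was successfully logged on"
REMOTE_INTERACTIVE = 10        # logon type for RDP / Terminal Services sessions


def remote_logins(events, start, end):
    """Return successful remote-interactive logons with start <= time <= end."""
    return [
        e for e in events
        if e["event_id"] == SUCCESSFUL_LOGON_EVENT
        and e["logon_type"] == REMOTE_INTERACTIVE
        and start <= e["time"] <= end
    ]


# Hypothetical records, shaped like an exported Security log.
sample_events = [
    {"event_id": 4624, "logon_type": 5, "account": "example-service",
     "time": datetime(2019, 5, 11, 5, 19, 4)},   # service logon, not a person
    {"event_id": 4624, "logon_type": 3, "account": "example-share",
     "time": datetime(2019, 5, 11, 5, 0, 0)},    # network logon, not RDP
    {"event_id": 4624, "logon_type": 10, "account": "example-admin",
     "time": datetime(2019, 5, 13, 9, 0, 0)},    # a real RDP session
]

# RDP sessions on the day of the attack, and over the whole log window.
attack_day = remote_logins(
    sample_events, datetime(2019, 5, 11), datetime(2019, 5, 11, 23, 59, 59))
whole_log = remote_logins(
    sample_events, datetime(2019, 4, 28), datetime(2019, 5, 13, 23, 59, 59))
```

With these sample records, attack_day comes back empty while whole_log contains only the 5/13 session, which is the same shape of result that pointed away from a stolen-password login.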

Probable Cause:

On May 10, 2019 Trendmicro wrote about a jenkins vulnerability that allowed arbitrary code execution. This is not a new phenomenon, other jenkins exploits are known. Based on the execution logs, just before the attack several requests came in to the endpoint:

https://78.46.206.200:8443/securityRealm/user/admin/descriptorByName/org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SecureGroovyScript/checkScript

From github, this endpoint appears to allow arbitrary code compilation within a sandbox. Searching for that endpoint brought up a sketchy vulnerability marketplace for just this kind of attack.

Note: at the time of the attack our jenkins server was on v.2.118 and the security module was on v1.40. In January, v1.50 of the security module was released with a warning that the older version allowed code to exit the sandbox and perform arbitrary code execution on the jenkins server.

Evidence Supporting the Jenkins Attack Theory:

No remote desktop sessions logged in on the 11th. This suggests another attack vector was used.

No other servers were encrypted, just the deployment server. This suggests something about the deployment server made it especially vulnerable.

Jenkins error logs were not encrypted by the attack (the ransomware is smart enough to not encrypt system files - they want your money, not a fully bricked system). Error logs from the server show hits on the SecureGroovyScript endpoint just before the attack as well as afterward (see Log notes at end).

Backup files taken in October do not show any SecureGroovyScript endpoint messages in the error logs.

The Secure-Script plugin in question was last updated in January 2018 (v1.40), the backup files and current server setup show that this plugin has not changed. It’s possible that another plugin has changed and started to call the secure-script endpoints for an unknown reason.

The affected endpoint uses a dummy account (admin) that exists on our server but does not have admin privileges. However, the endpoint can be called without logging into the account (allowing anyone to POST arbitrary code without needing a password).

The vulnerability marketplace evidence shows exactly the endpoint that was hit and the version of code running at the time; our server would have had the vulnerability listed as “for sale”.

Lateral Attacks:

After gaining access to the Deployment server, malicious code had access to:

Accounts:

One Acre Fund QA Accounts

SSL Certificates:

Production server certificate (password protected, but with a weak password)

Possible:

Octopus deploy accounts

Jenkins Accounts

MSSQL accounts

Immediate Steps Taken:

Immediately on finding the server intrusion, I locked down all existing test/stage/CI/QA servers to only 1 new RDP account (with the goal of preventing lateral attacks). No changes have been made to the affected deployment server.

I found my most recent server backup from October 2018. It will be a little out of date but not terribly so (possibly 50-100 lines of code were added between October and now).

Next Steps:

Attempt to confirm the jenkins theory. Verify that no other servers were affected by lateral attacks. Verify other servers do not show a history of suspicious logins and/or accurate dictionary attacks using real accounts (1-2 days of forensic analysis). If I’m wrong about the jenkins theory then some other attack vector was used and is probably still open.

Leave the deployment server as-is for now, system changes may fire off additional actions on it. Leave QA/Test servers locked with only 1 RDP account for now.

Set up a new deployment server with jenkins/octopus in docker containers and VPN-only access to prevent jenkins vulnerabilities from spreading outside jenkins. This will probably take 2-4 days and may require some duplicate work recreating the angular deployment configurations that were not present in October.

Move Test/Stage/QA servers inside VPNs and unlock them, changing their passwords/accounts. This will take several days (3-4 days).

Retire the existing deployment server and migrate the DNS records for it.

Change the production server certificates ASAP.

Error Logs/Relevant Data:

Error Log #1: Jenkins Error Log from Encrypted Server (showing CURL attempt on 5/12). Appears to show attempted arbitrary code execution from a POST request. Note that this would have been a day after the Ransomware attack (potentially multiple hijack attempts).

May 12, 2019 4:25:19 AM org.eclipse.jetty.server.handler.ContextHandler$Context log

WARNING: Error while serving https://78.46.206.200:8443/securityRealm/user/admin/descriptorByName/org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SecureGroovyScript/checkScript
java.lang.reflect.InvocationTargetException
at org.kohsuke.stapler.Function$MethodFunction.invoke(Function.java:347)
at org.kohsuke.stapler.Function.bindAndInvoke(Function.java:184)
...
Caused by: groovy.lang.GroovyRuntimeException: Failed to create Script instance for class: class x. Reason: java.io.IOException: Cannot run program "curl": CreateProcess error=2, The system cannot find the file specified
at org.codehaus.groovy.runtime.InvokerHelper.createScript(InvokerHelper.java:466)

Error Log #2: Jenkins error log data from just before the attack (on the encrypted server). Seems to show an attempt to run a script without defining a script body. Possibly an automated probe to determine if our server had the affected version of jenkins.

May 11, 2019 5:17:38 AM org.eclipse.jetty.server.handler.ContextHandler$Context log
WARNING: Error while serving http://78.46.206.200:8080/securityRealm/user/admin/descriptorByName/org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SecureGroovyScript/checkScript
java.lang.reflect.InvocationTargetException
at org.kohsuke.stapler.Function$MethodFunction.invoke(Function.java:347)
...
Caused by: java.lang.IllegalArgumentException: Script text to compile cannot be null!
at groovy.lang.GroovyClassLoader.validate(GroovyClassLoader.java:315)
at groovy.lang.GroovyClassLoader.doParseClass(GroovyClassLoader.java:275)
at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:268)
at groovy.lang.GroovyShell.parseClass(GroovyShell.java:688)

Error Log #3: Jenkins error log from October 2018 (last backup before the attack). No instances of the phrase “Script text to compile cannot be null!” or “org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SecureGroovyScript” could be found in the old logs.
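The log comparison above (hits around the attack, none in the October backup) is easy to repeat with a short script. The following Python sketch is an assumption about how one might do it, not the method used at the time: it relies on the log layout shown in the excerpts, where a timestamp line is followed by the WARNING line containing the URL, and the function name endpoint_hits is made up for illustration. Timestamp parsing uses %b and %p, so it assumes an English locale.

```python
import re
from datetime import datetime

# Marker taken from the URLs in the error logs above.
ENDPOINT_MARKER = "SecureGroovyScript/checkScript"

# Matches the leading timestamp of a Jenkins log entry, e.g. "May 11, 2019 5:17:38 AM".
TIMESTAMP_RE = re.compile(
    r"^([A-Z][a-z]{2} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M)\b")


def endpoint_hits(log_text):
    """Return the timestamp of every log entry that mentions the endpoint."""
    hits = []
    current = None  # timestamp of the entry currently being read
    for line in log_text.splitlines():
        match = TIMESTAMP_RE.match(line.strip())
        if match:
            current = datetime.strptime(match.group(1), "%b %d, %Y %I:%M:%S %p")
        elif ENDPOINT_MARKER in line and current is not None:
            hits.append(current)
    return hits


# Abbreviated copy of the two excerpts shown above.
SAMPLE_LOG = """\
May 11, 2019 5:17:38 AM org.eclipse.jetty.server.handler.ContextHandler$Context log
WARNING: Error while serving http://78.46.206.200:8080/securityRealm/user/admin/descriptorByName/org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SecureGroovyScript/checkScript
java.lang.reflect.InvocationTargetException
May 12, 2019 4:25:19 AM org.eclipse.jetty.server.handler.ContextHandler$Context log
WARNING: Error while serving https://78.46.206.200:8443/securityRealm/user/admin/descriptorByName/org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SecureGroovyScript/checkScript
java.lang.reflect.InvocationTargetException
"""

hits = endpoint_hits(SAMPLE_LOG)
```

Run against the sample, hits holds two timestamps (5/11 at 5:17:38 AM and 5/12 at 4:25:19 AM); run against a log with no probes, such as the October backup, it would return an empty list.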

Retrospective:

As of August 2020, we have yet to see this attack repeated so the Jenkins theory seems most plausible. Thankfully there was no evidence of lateral attacks and the server in question held no customer data. This is a valuable lesson about updates and open-source software - missing a security update can leave you open to a wide variety of vulnerabilities. In general I like open-source software, but this was a harsh lesson on the dangers of trusting open-source code; we now use VPNs for almost all server security except cases where code is explicitly meant to be publicly accessible.

Technology: AlwaysOn Failover Clustering with Replication

This is a stub, more detail to come.

Technology: Configuring 2500 Tablets in 2 Weeks

In late 2018, I became involved with the bulk configuration process for One Acre Fund's tablets. This was... less than stellar to start. The first year of Mobile Enrollment happened before I joined the organization (late 2016), with a scale of 250 tablets. This was nearly a disaster - over-promised features and late bugs combined with a hard deadline on the field side (the rainy season doesn't wait for code bugs) to create some high-stress days for the developers. In 2017 the code was more stable, but the setup process was still rocky - one of our ITO staff members went through the effort of configuring a different password for every device, but this was too complex for field staff so the field trial team just wrote the device passwords directly on the tablet cases. The core problem of tablet setup was partly due to the relative inexperience in the organization with tablets, partly due to the application code itself.

Setup in 2017 required:

Giving the device a password
Setting up a background image
Pre-loading an API Key onto the device by navigating to a URL (this was really hard to do manually)
Setting up an "App Lock" security software - Field Officers found this laughably easy to bypass
Copying required files onto the device (specific versions of various software)

So, in 2018 I had the bright idea of automating the process. This made use of a lot of my previous Android experience - as Android developers will attest, there's a very sophisticated "Android Debug Bridge" maintained by Google. I basically wrote a custom C# frontend that could use parallel task threading to run the same Debug commands on a large number of devices all physically connected through USB hubs to a laptop (nicknamed Hydra because I'm a nerd). Not everything can be automated (such as enabling "Debug" mode), but pre-loading an API key by navigating to a long URL in particular is much, much easier in code than when done manually. However, in 2018 this process still encountered a number of issues:

Spreadsheet-tracking of tablets failed pretty miserably. Field staff tablet tracking went fairly well, but trying to remember which developer had which tablet was a challenge.
Remembering which tablets were configured and which ones weren't was a huge pain point. Staff members repeatedly factory reset and then reconfigured tablets for reasons I still don't understand, in one case performing seven consecutive configure-reset-configure cycles on the same tablet. This seems to have been due to the office layout - staff members were required to clean up all tablets at the end of the day so other staff could use the office, meaning that they mixed the finished and unfinished tablets in the same box.
Deployment of new applications wasn't possible except by altering the Enrollment progressive web app - what went into the field stayed in the field until the device was returned to HQ.
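For readers curious what the Hydra idea looks like in code, here is a rough sketch. Hydra itself was a C# frontend; this is a Python illustration of the same pattern (list the connected devices, then run one adb command per device in parallel), not a port of the original. The function names and the example URL are hypothetical. The adb pieces used are standard: "adb devices" to list serials, "-s" to target one device, and "am start" with a VIEW intent to open a URL.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def parse_adb_devices(output):
    """Extract the serials of ready devices from `adb devices` output."""
    serials = []
    for line in output.splitlines()[1:]:  # first line is "List of devices attached"
        parts = line.split()
        # Skip blank lines and devices in other states (unauthorized, offline).
        if len(parts) == 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


def open_url_command(serial, url):
    """Build the adb command that opens a URL on one specific device."""
    # Note: a URL containing shell metacharacters (such as &) would likely
    # need extra quoting for the device-side shell.
    return ["adb", "-s", serial, "shell", "am", "start",
            "-a", "android.intent.action.VIEW", "-d", url]


def run_on_all(serials, url, runner=subprocess.run, max_workers=8):
    """Run the same command against every device in parallel, in input order."""
    commands = [open_url_command(serial, url) for serial in serials]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))
```

A dry run that only echoes the commands (no devices needed) would look like run_on_all(serials, "https://example.invalid/setup", runner=lambda cmd: cmd); swapping the runner back to subprocess.run executes them for real. Keeping the runner injectable is a design choice for testing, not something taken from Hydra.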

Despite these issues, the 2018 trial went well - so well in fact that the organization increased the trial scope to the entire country. In 2019, we were looking at scaling from 250 tablets to 2500 - in order to handle that, we made a few improvements to the process:

Switching to Google's Admin tools for Mobile Device Management. Tablets were treated as "Company Owned Devices" with access to a Private Play Store, letting us release future applications easily.
Paper tracking - finished devices had stickers attached to them and explicit manual sign-off sheets at each stage of the process (initial setup, QC, etc.). Tablets were moved forward in boxes, all at the same stage of setup. This paper-and-cardboard-box method worked much better than previous high-tech methods.
Large numbers of casual staff with a "blitz" approach (many staff members working for a short time in a dedicated space).
Group Incentives to hit targets (raffling off one of the decommissioned tablets every day that a "setup target" was hit).

For the 2019 setup, this approach scaled pretty well. The collective group of 30 casual staff was able to set up 240 tablets per day, meeting the target of 2 weeks for complete tablet setup and giving us time to film the process in this video:

Note: this video is unofficial and not released by OAF's communications team, it was created for a Tech Division meetup.

However, by late 2019 we had to yet again change direction. The model of tablet used for field setup had changed; all of the customization work done to set up the first type of tablet would have to be redone to apply to the second tablet model. The ITO department had been able to hire a dedicated tablet lead and code changes within the application had made it possible to configure tablets without manually entering a 20-digit API key; Google authentication was now possible because every field staff member had an email address. In something of a zen moment, we decided that the best automation was no automation - ITO was more comfortable with a manual process and the effort to maintain Hydra wasn't worth it. Some things are harder this way (like copying a media folder to each tablet), but the simplicity of the new process makes it possible to do this setup without highly-trained technical staff.

Ultimately there are a lot of lessons to learn here, chief among them:

Physical Tracking (aka Sticker Tracking) beats any high-tech system. It's painful to admit this as a developer but a simple sticker labeled "done" was much more effective than the complex "Tablet tracking spreadsheets uploaded to google drive" method I used in 2018.
The best code is sometimes no code - another harsh lesson, but I'm glad we didn't maintain the Hydra tool past the end of 2019.
Incentives - I was initially concerned in 2019 that Casual staff had a strong incentive to work slowly (slow work = more days worked = more money, regardless of the project delays). I found the group raffle incentive a strong motivator, in general incentives that align the interests of the project with the interests of staff members are a good idea.
Dedicated space - a separate physical space for just tablet configuration proved immensely helpful, I don't think this speed would have been possible in a shared office.

Technology: Building for Offline Environments

This is a stub, more detail to come.

Technology: Deploying to Offline Environments

This is a stub, more detail to come.

Technology: Reinitialization

This is a stub, more detail to come.

Technology: Using SQL Beyond the Limits of SQL

This is a stub, more detail to come.

Technology: Pushing Access Far Past the Limits of Access

This is a stub, more detail to come.

Technology: Bookkeeper Laptop Databases

This is a stub, more detail to come.

Introduction, Part 3: The Technology of One Acre Fund

The Technology of One Acre Fund reflects the philosophy and business practices of the organization, which shouldn't be much of a surprise. Unfortunately, business practices are relatively easy to change and adapt to a new context; Software is much harder to alter and tends to be built in accretion layers. Personally, I believe that the offline system powering One Acre Fund's field operations (Roster) is a key component of the organization's scaling - to really understand it, a story makes more sense than a technical manual.

The Story of Roster

Roster wasn't planned so much as it grew organically. The core business it supports varies widely by geographic area (every OAF country program runs as a startup with near-total control over business practices), but in general consists of:

An intense, short (3-4 weeks) enrollment period to sign up for a season and order products (intent to buy agreements, not legally-enforceable contracts)
A required prepayment amount/percentage by a certain deadline (usually paid in multiple payments over time as clients have cash, not weekly/monthly terms).
A bulk distribution at planting time, utilizing paper management tools, with clients allowed to change their orders on the day of delivery
A bulk distribution at top dress time (second batch of fertilizer used when plants are almost fully grown)
A period to enroll in Top-Up/AddOn orders (additional products)
Multiple additional bulk distributions for Top-Ups/AddOns or Incentives
Flexible Repayment throughout the season (not relying on weekly/monthly payment terms) until the Final Repayment deadline
Analysis of Repayment progress with soft incentives (T-shirts, farm tools, etc.) to encourage a healthy repayment track

Field Tools

For most of its history Roster was basically a way to generate and data-enter these 4 paper forms; over time it's slowly grown to do more things, but they are still the core functionality.

Enrollment form (farmers sign up for a "contract" that's not actually a contract, more like an "intent to buy")
Input Distribution Sheet (which client gets which item off a truck)
Truck Management Sheet (which items go onto which truck)
Client Repayment Progress Report (how much of their loan a client has paid back)

Offline Source-of-Truth

The main difference between Roster and almost any other modern software project is that Roster is explicitly designed to work offline; not "partially online (until you logout and it stops working)", it's meant to operate offline for weeks without an internet connection but still sync data when needed. Other tools and apps offer Offline options, but keep the source-of-truth status as the cloud system; Roster's source-of-truth is the Bookkeeper laptop.
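To make the "laptop is the source-of-truth" idea concrete, here is a toy Python sketch. It is not how Roster is implemented; the class, method and field names are invented for illustration, and the "cloud" is just a dictionary. The point it tries to show is the direction of authority: every write lands on the laptop first and works with no connection, and the cloud copy only catches up when a sync happens.

```python
class BookkeeperLaptop:
    """Toy local store that is authoritative; the cloud only receives copies."""

    def __init__(self, laptop_id):
        self.laptop_id = laptop_id
        self.records = {}   # source of truth, always usable offline
        self.pending = []   # keys changed since the last successful sync

    def save(self, key, value):
        """Write locally; never requires a connection."""
        self.records[key] = value
        if key not in self.pending:
            self.pending.append(key)

    def sync(self, cloud, online):
        """Push pending changes to the cloud copy; return how many were sent."""
        if not online:
            return 0  # nothing is lost, the changes simply stay pending
        for key in self.pending:
            # The cloud keeps each laptop's rows under that laptop's id.
            cloud[(self.laptop_id, key)] = self.records[key]
        sent = len(self.pending)
        self.pending.clear()
        return sent


cloud = {}
laptop = BookkeeperLaptop("laptop-01")       # hypothetical laptop id
laptop.save("client-17", {"repaid": 1200})   # hypothetical client record
laptop.save("client-17", {"repaid": 1500})   # edited again while still offline
laptop.save("client-42", {"repaid": 300})

sent_offline = laptop.sync(cloud, online=False)  # weeks without a connection
sent_online = laptop.sync(cloud, online=True)    # back in range of a network
```

In this run the offline sync sends nothing and the later online sync sends two records, with the cloud receiving the most recent value for client-17. A cloud-first app would invert this: the write would fail (or be provisional) until the server accepted it.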

Early One Acre Fund (2006-2010)

The earliest prototypes of One Acre Fund software were created by non-technical staff members using spreadsheets and Microsoft Access to provide inputs to dozens of farmers in relatively isolated communities. During the early days of the company, cell networks in East Africa were practically non-existent; if you visit any one of the current company residential compounds, you will almost certainly find a "Programming in Access" textbook mixed in with the various agricultural and leadership reference material. For several years, successive generations of technical people would join the organization, throw out the existing Access work and start over; the business model was changing so quickly that it made more sense to throw out the previous system and start over than to add features. Unfortunately, this approach is difficult to scale; throwing out all previous work is a controversial topic in the tech world.

Early Scalability (2010 - 2013)

Starting in 2011, developers decided to continue working on the same Access software rather than throw out the existing work. A small team of developers (with some turnover, but never a team larger than 5 people) started iterating heavily on the Access databases built by non-technical field staff. For a period of about 5 years, the development team had the reputation of being able to quickly and easily add new "hack features" to the software to meet the needs of various trials/business processes. This development team made use of a number of tricks and strategies to "supercharge" Access far beyond the intended uses of the system - I'll provide more technical detail in a separate post, but key among these tricks:

Backing MSSQL databases with merge replication - merge replication is a powerful-but-complex offline tool that makes it possible to run separate databases on offline laptops, with only occasional syncing to a cloud-hosted server. Every laptop maintains a separate, isolated copy of just its own data; the cloud system keeps the complete/merged data set from all combined systems. The jargon can be a bit confusing, so it helps to remember:

Replication = Data Copying from place to place

Merge = Separate Subsets that later get combined together

Obfuscation and security within Access - encrypted database connection strings and security modifications that prevented unauthorized users from viewing the Access code.
Offline data centers - using local networks, multiple laptops could be connected as "assistant" machines to one "Bookkeeper" machine. This makes it possible for One Acre Fund to run medium-scale (10s of laptops) data entry centers in areas that don't have internet connections, then physically move the Bookkeeper machine to an area with a connection to sync it every few days.
Heavy reliance on stored procedures and user-defined functions - while SQL code is not as sophisticated as node.js or C#, it provides more functionality than Access VBA script.
Sophisticated SQL methods - complex dynamic SQL was used rather than VBA, including:

Dynamic SQL statements stored inside a SQL table, distributed by replication and run on application startup to create local keys/indexes
Heavy investments in Indexed Views, Query Tuning and Offline Laptop Performance (a problem as many of the laptops were out-of-date and had poor hardware specifications)
SQL queries that dynamically write other SQL queries (Indexed Views) and SQL with embedded email notifications - I plan to write a post on these "SQL beyond what you think SQL can do" methods
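The "SQL statements stored inside a SQL table and run on startup" trick can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in SQLite module as a stand-in for MSSQL, it skips the replication step that distributes the table, and the table, column and index names are invented for the example.

```python
import sqlite3


def apply_startup_sql(conn):
    """Run every statement stored in the startup_sql table, in order."""
    statements = conn.execute(
        "SELECT statement FROM startup_sql ORDER BY run_order"
    ).fetchall()
    for (statement,) in statements:
        conn.execute(statement)
    conn.commit()
    return len(statements)


# Hypothetical schema: a data table plus a table that holds SQL as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (client_id INTEGER, district_id INTEGER)")
conn.execute("CREATE TABLE startup_sql (run_order INTEGER, statement TEXT)")
conn.execute(
    "INSERT INTO startup_sql VALUES (1, "
    "'CREATE INDEX IF NOT EXISTS ix_clients_district ON clients (district_id)')"
)

# On application startup, the stored statements build the local indexes.
applied = apply_startup_sql(conn)
index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'clients'"
    )
]
```

Because the statements are ordinary rows, changing what every laptop builds at startup becomes a data change rather than a code deployment.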

However, this decision to keep the existing Access code led to a few drawbacks:

Schema design - the database schema was meant for easy printing of field material, not for performance, storage or limitation of redundant information. A whole set of practices (database normalization) exists to help with designing the best possible schemas for performance and storage; these techniques were not used for the Core Business tables of One Acre Fund.
Audit Trail Weaknesses - running code on a local laptop makes it possible for users to alter the time and date settings, bypassing date checks in the code. Strong authentication of users proved quite difficult to enforce at the field level, leading to shared accounts and authentication done by control of the physical laptop.
Tight coupling of infrastructure and business logic - the original Roster system required physical laptops in each geographic location (district), with the idea that each district would run independently and own its own data. The idea of separate districts is used as the "sharding key" - nearly every database table and software function requires a district identifier. This is causing problems now as countries look at "districtless" logic like brick and mortar stores.

At the time, Cell Networks in East Africa were still quite unreliable; early software updates sometimes required sending a person with a flash drive on a multi-hour bus ride across rural Kenya. The Core Program and business model of One Acre Fund were being rapidly iterated in multiple countries, leading to situations where Off-the-Shelf software would have been too hard to adapt in time to meet the business needs.

Early Cloud Systems (2012-2014)

Expanding cell networks across East Africa made it possible to have centrally hosted servers holding the "merged" main database (instead of laptops in East Africa, use a fully-functional website hosted on a cloud provider called Hetzner). An operations website is built using C#/ASP.NET to handle the aggregated data from all the separate data entry laptops. The operations website is much easier to write code for, deploy and manage. However, it doesn't run offline - so core OAF functionalities are still run on the Bookkeeper laptops.

The early operations website was envisioned as a data warehouse to allow analysis by global/high-level country staff, while day-to-day operations would be handled by individual district machines. Some of the first reports were essentially just database dump operations (like the SeasonClients export, now the most commonly used export with 50,000 downloads in 2019).

Later, the ability to add bundle configurations, create inputs and perform other high-level configurations was added. The original idea was still to do only central configuration and reporting for districts, while all client-level data entry and data manipulation would stay under the control of the district.

A few additional tools and business processes were added to this website between 2015 and 2017:

Transactional MSSQL Replication - this technology makes it possible to maintain realtime read-only copies of the Roster database on multiple cloud servers. The jargon for this feature really means:

Replication = Data Copying from place to place

Transactions = Database Events that all happen together

Bulk Uploaders - Uploaders to the operations website make it possible to do bulk data changes directly to a cloud server, rather than separately on a dozen laptops. This represents a major shift away from the District as the owner and source of all data. More and more business functions start being performed centrally and uploaders allow taking advantage of mobile money rather than collecting money in the field. Uploads are optimized for efficiency (it is pretty common for 100,000+ line uploads to happen multiple times in a day). Different countries have different network capability, so highly complex logic around what could happen online and offline was built into the ever-expanding Roster system.
Operations website additional tools (2015-present): More and more tools are created in the operations website, such as Government location entry, client search functions, specific distribution planning, etc. Tools that have overlap between the operations website and Access frontend are written as SQL stored procedures to avoid duplicate work.
Solar Home Systems (2015-2017): Solar Home System paygo codes (using Angaza and Biolite, two SHS providers) are attached to the operations website. This makes it possible for clients to order solar home systems and pay the balance over time, with a monthly code entered into the device to keep it functional. If no code is entered, the SHS stops working. Included in this logic was the idea of product warranties (even warranties on parts of products) to handle malfunctioning solar lights.
Insurance (2015-2018): Insurance payouts were added to the database as part of the OAF model - especially funeral insurance and crop insurance.
Brick-and-Mortar stores + B2B sales (2015-present): Brick-and-Mortar sales are attempted in multiple countries in order to reach new market segments. Some trials have more success than others, with Kenya's Duka project (OAF-managed brick-and-mortar shops) and Rwanda's P-Shop project (OAF partnerships with existing agrodealers) showing promise.

Separation of Finance from Field (2015-Present)

The philosophical balance that I described previously manifests in our core systems - eventually, the need for standard off-the-shelf Western business software rather than homegrown, customized accounting became obvious. While this is a gross oversimplification, the Western-esque parts of One Acre Fund (Finance, Logistics, Global Purchasing) make use of SAP for off-the-shelf capability of running a global-scale business. The uniquely East African parts of the company that require extreme customization continue to be done using the custom Roster software, with aggregated data from both sets of software combined in a data warehouse. This leads to some bizarrely different technology levels across different parts of the organization - finance software (SAP and custom-built frontends hosted in Azure) is used almost exclusively by staff based in capital cities with strong internet connections, while Access Bookkeeper Laptops are still necessary for use in field locations.

Expanding Cell Networks (2015-2020)

Expanding Cell Networks in East Africa make a number of innovations possible:

Mobile Money (2015-present): Mobile money is trialed + added in Kenya - this makes it possible for farmers to repay their balance through SMS/USSD menus. Direct Safaricom integration as well as two aggregators (Lipisha and Beyonic) start to be used. This provides a much clearer audit trail for incoming payments and much faster payment processing; it has since been expanded to almost all OAF countries.
Mobile Enrollment (2016-present): Bookkeeper laptops are designed to facilitate paper contract data entry; at the level of hundreds of thousands of clients, this starts to cause problems (too many paper contracts to enter, even with offline data center tricks). Tablet enrollment is trialed at a scale of ~250 tablets between 2016 and 2018; in 2019 Kenya goes full scale (3000 tablets, one for each FO). Other countries are currently adopting this software; it was originally written as a progressive web app but is now being adapted into a native application.
USSD/SMS Menus (2018-present): Telerivet (a third-party SMS/USSD tool) is used to allow clients to check their balance directly + run trials. Later this functionality is expanded into Mobile Money (a cost-savings measure uses Telerivet + Beyonic in combination). During the 2020 COVID pandemic, rapid reaction through USSD/SMS has allowed the organization to enroll more than 700,000 farmers through a USSD menu.
Field Support (2018-present): Essentially a digitized Client Repayment Progress Report. Field Support was a developer-driven project to rethink the OAF backend using a new offline-capable backend technology (Couchbase). It is implemented on a different platform than other Roster tools (Linux/Docker/Couchbase/node.js) and acts as a caching layer between Roster and tablet users.
Axe Access (2019-present): This is a much-needed push to replace all Access tools with operations website tools. In several locations (especially Kenya) offline functionality isn't required; in just 10 years previously-offline locations now have high-speed internet connections and cell networks, so hard-to-maintain offline tools aren't required. Offline capability makes it difficult to release changes and debug issues, so online tools are simpler to iterate and can make use of cutting-edge tools unavailable offline.
Roster Evolution (2019-present): This is a transformative vision to replace/refactor the Roster system into something new and better. While the logistics of transforming a $100 Million piece of software can be daunting, the next generation of mobile-optimized tools provides exciting potential to more effectively serve a larger number of clients.

Where We Are Now:

As I write this (August 2020), One Acre Fund is undergoing a transition period. We still have Access databases in active use, but are phasing this out in favor of tablets and USSD/SMS-based tools. We're actively transforming our server hosting processes and transitioning from manually-configured Windows servers to Containerized Linux Microservices. While the Coronavirus pandemic has been an international tragedy, it has also forced the organization to adapt far faster than we thought possible, spurring new innovations in One Acre Fund's technology.

My Part in this Story:

I started at One Acre Fund in January of 2017; at the time we were doing regular Access deployments, expanding Mobile Money, doing early trials of Mobile Enrollment and starting to look at USSD/SMS. My role was formally "Development Operations Engineer", now "Database Administrator" - informally I see my role as "make everyone else's job easier". I basically do support engineering - I don't provide value directly to the business, I provide value to the other software developers so that they can work faster and the team can accomplish more. This is a difficult role for a number of reasons - I need to understand what everyone else is doing in enough detail to fit all the pieces together, fix them when they break, provide production support to deploy them into the real world and help identify/fix scaling issues that occur when you have millions of clients. Throughout my time at One Acre Fund, I've taken on a number of support-type projects:

Reinitialization Cleanup (2018): Reinitialization (the process of refreshing the full database on all offline Bookkeeper laptops) is a cumbersome process involving cross-country coordination; every BK laptop must have an internet connection at the same time. The reinitialization code was cleaned up in 2018 to run automated scripts rather than require manual action; this was successful. The 2020 reinitialization was a massive exercise in coordination involving 20 cloud servers, 139 offline laptops, 10 countries and two production support personnel (myself and our sysadmin).
Offline Deployments - deploying code updates to an offline system is not an easy task.
Tablet Configuration - I basically ran a factory in Kenya for two weeks to configure 3000 mobile devices
Production-Grade Security (SSL certificates, database obfuscation, VPNs and forensic analysis)
AlwaysOn Clustering - providing immediate redundancy if our servers go offline
Test Servers/Ephemeral Servers - giving other developers a complete, isolated test environment
Bookkeeper Offline Tools - maintenance of the custom frontends that One Acre Fund has built to allow Merge Replicated-laptops to run throughout East Africa

Finally, through a combination of staff turnover and developer specialization, I'm now left as the only developer with production credentials at One Acre Fund. While other developers understand subsets of the software in more detail than I do (especially specifics of the business logic), I'm the only person at One Acre Fund who is able to understand and make changes to the entire live structure (Access, MSSQL, C#, Angular, node.js, docker, couchbase and soon kubernetes + react). That's a situation I'm actively trying to change by training other developers and writing documents like this one. Along the way, I've acquired strong opinions about the things that One Acre Fund's technology has done well and things that we could definitely stand to improve; this blog is meant to capture both my specific technical experiences and the philosophy that I've acquired during my time here.

Introduction, Part 2: The Philosophy of One Acre Fund

Note: these comments are just my opinions and are not official statements of One Acre Fund.

So... what does One Acre Fund actually do?

This is a deceptively difficult question to answer – I would know, I've been asking it for the last year. One Acre Fund is one of the world's largest providers of agricultural microloans with over a million clients and $100 million in annual cash-flow. If you ask leadership what OAF does, you'll get the donor-friendly answer:

“OAF puts farmers first!”

When you dig a bit deeper, you'll get the answer:

“OAF provides agricultural inputs on credit with flexible repayment”

When you dig a bit deeper than that, you'll get wildly different answers depending on the person you're talking to (“we distribute with trucks” or “we use group-based repayments” or “we have committed field teams with high-quality farmer training” or “we use human-centric design innovations”). All of these things are individually true, but they still miss the bigger picture. Every organization has committed staff members; most NGOs use some variant of human-centric design and you'll be hard-pressed to find a company that doesn't claim status as an innovator. So what is the bigger picture?

As part of transitioning into my current role, I had to spend a lot of time learning OAF's current business model. For most of my first 2 years with OAF, I actively avoided it – the terminology is confusing and the business is not easy to understand, so I focused on technical infrastructure. When my former boss was preparing to leave, focusing on just the architecture was no longer an option... so I spent several months studying the core business model of One Acre Fund. Here's what I found out are (in my personal opinion) the reasons why OAF grew while so many past attempts failed:

Starting in the Field

This is one of my favorite parts about OAF – it started in the field. Not “our researchers are in the field, but they'll be back by lunch” - OAF staff members have lived and worked in rural communities since the organization's founding. When I started at the organization in January 2017, the IT Development team worked on an outside patio and routinely had the neighbor's chickens walking through our software planning meetings. Some of the things that seemed like productivity killers at first (deafening rain on a tin roof blocking a call, intermittent cell connections and electricity, unpredictable delays in transportation) are actually beneficial because they make it obvious when you're designing a solution that won't work ("maybe requiring a 10 gigabyte download isn't the best idea here"). While it's true that we would have been able to work faster in an American or European office building, the solutions we designed wouldn't have worked nearly as well in rural East Africa. Working in an office building makes it much easier to answer the question

"What Can We Do?"

but only working in the field makes it possible to answer the question

"What Should We Do?"

Respect Your Customer

During my onboarding week, I casually mentioned “the farmers we're helping”... and got a firm reprimand from a Kenyan colleague who was in my onboarding cohort. “We're not helping them – they're our clients!”. This is a really big mental shift from what I was accustomed to in the West, but a powerful one. When one person is the "Giver of Aid" and the other the "Receiver of Aid", all sorts of dynamics (even subconscious ones) create an implicit power differential. With clients – even clients who sometimes don't pay back the loan – the power dynamic is reversed and equalized. We are not the bosses of farmers in East Africa - they are our bosses; if we're designing solutions that aren't accepted by farmers, then we need to change our strategy. This mentality is another part of the company's DNA that made the company a success.

Work Offline

This is where I come in – One Acre Fund has by far the most sophisticated offline systems I've ever encountered. They're quite difficult to maintain (I'll write a separate post on that), but the systems grew up in East Africa alongside the company – our core data entry system could keep working even if East Africa had a total two-week-long internet shutdown. Now that the cell network architecture in East Africa is rapidly expanding, this offline capability may not be as necessary as it was in the past – but the strength of our offline systems was one of the major differentiators between OAF and all of its failed competitors.

Balance of Philosophies

In my personal opinion, the single largest factor behind One Acre Fund's growth has been its ability to balance between two incompatible worldviews and philosophies. One of the biggest misconceptions Westerners hold about Africa is thinking of it as a monolith; the real Africa is wildly variable, with different languages/crops/customs/taboos in every geographic region (Kenya alone has 70 distinct ethnic groups). African cities can be nearly indistinguishable from Western ones, while rural areas can be so variable that they defy easy classification. The view from the field tends to be highly localized and does not necessarily translate to an area even 50 kilometers away. While this is a broad generalization and individual people vary a lot, my experience has been that people living in East African rural areas have a very different worldview than people from Western countries or African cities - rural residents tend to be much more concerned with concrete, real, tangible objects than with intangibles and abstractions. In order to create real impact for our clients, we have to understand their concrete needs and adapt the business to match - but such a business will only apply to a region smaller than the average county in America.

This is where the Western philosophy comes in - abstractions, interchangeable parts and economies of scale. Western practices make it possible to run a business that works in many different areas, to bulk order vast quantities of farming inputs and distribute them at low cost. Note that "Western" business practices don't necessarily mean Western people - any business course in Nairobi or Kampala will teach essentially this same school of thought.

One Acre Fund thrives by embracing the two incompatible ideologies - bulk ordering millions of kilos of fertilizer and then selling it with business practices that vary by geographic area. Taken too far, either philosophy would destroy the company; a fully Western business would become too rigid to meet the needs of customers (scale without impact), while a fully Field-based business would be unable to work outside of a tiny geographic area (impact without scale). Only by balancing between the two worldviews can the company provide valuable services for over a million farmers in 10 countries.

Specific Business Practices

This combination of offline systems, field-led systems design and farmers-first mentality led to a set of business practices that is... nonstandard. To be blunt, they make very little business sense from a Western perspective. However, in local context these practices actually fit in really well with East African communities.

Clients Can Cancel on Us, We Can't Cancel on Them

All throughout our systems, the term “contract” is repeatedly referenced. This causes a lot of confusion for our software developers, because we don't actually use contracts. Clients sign up for orders that are vague “intent to buy” agreements. We send out a truck with the farm inputs available; if a client doesn't show up that day, then they don't pick up their inputs and we don't charge them. However, if the client shows up and we don't have their items available, we'll send a follow-up truck to get them their stuff. This puts us in a very one-sided relationship with our clients, but for good reasons. Life in the villages is much less predictable than life in the West – transportation is unreliable, agriculture depends on factors outside anyone's control and people's complex web of social bonds can cause all sorts of unforeseen delays. Charging clients for items they haven't received is not a good way to ingratiate yourself into a community; neither is failing to deliver on a promise.

Years are Meaningless, Seasons are Life-or-Death

Almost all banking and logistics software originates from the Global North, and the banking systems of the Global North were originally built around Northern agriculture. In the Global North, winter matters; in the Global South, rains matter. Unfortunately for the banking software, rainy seasons do not follow the same patterns as snowy seasons. The reliance on yearly budgets, yearly payment terms and yearly interest is so hard-wired into Western thought that we can't imagine anything else. However, our clients don't think this way. Our clients grow food during the rainy season, sell food after it's harvested and may have some money to pay back their loans at unpredictable times depending on factors beyond anyone's control. One of the business adaptations of OAF was to allow farmers to pay back their loan over the course of a season, connecting loan payment to agricultural cycles rather than forcing farmers to adapt to Northern banking practices.

Declining-Balance Interest is Deceitful

Standard practice in the microloan field is declining-balance interest (interest is included in the loan, the faster you pay back the less you owe). However, many of OAF's clients have only a basic education; complex formulas and hidden charges are not a good way to build client trust. OAF charges fixed fees for all goods and services, no hidden charges. We do still want to encourage fast repayment though, so we offer a wide variety of incentive items (T-Shirts, Machetes, Boots, fixed-fee rate reductions, etc.) to those clients who pay us back quickly. This combination of fixed fees and free stuff has been very popular and goes a long way towards building client trust.
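To make the contrast concrete, here is a small Python sketch comparing the two models. Every number in it (a loan of 100, a 2% monthly rate, a fee of 5) is invented for illustration and is not OAF's actual pricing; the repayment schedule of equal monthly instalments is also an assumption.

```python
def declining_balance_interest(principal, monthly_rate, months):
    """Total interest when the loan is repaid in equal monthly instalments
    and interest is charged each month on the amount still owed."""
    balance = principal
    instalment = principal / months
    total_interest = 0.0
    for _ in range(months):
        total_interest += balance * monthly_rate
        balance -= instalment
    return total_interest


def fixed_fee(fee, months):
    """A fixed fee costs the same however long repayment takes."""
    return fee


# Hypothetical figures, chosen only to show the shape of the two models.
for months in (3, 6):
    interest = declining_balance_interest(100, 0.02, months)
    print(months, round(interest, 2), fixed_fee(5, months))
```

Under these made-up figures the declining-balance cost nearly doubles when repayment stretches from three months to six (about 4 versus about 7), while the fixed fee stays at 5. The client has to work through a formula to know the first number in advance; the second is just a price.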

Standard Economic Models Don't Apply

I have my own theories about a more accurate economic model in rural East Africa than “Every man for himself”. I call my theory “Minimal Liquidity” - I don't think anyone at OAF intended to find this effect, but it's pretty well baked into the OAF model now because it seems to work in the field. The theory is that:

1. Many East African communities have essentially socialist economies (at least within tribes)

2. Economic pressure is intense, so people are constantly helping their neighbors at the expense of being able to hold on to income-generating capital

3. The end result is that the community as a whole survives, but individual members can't make the necessary investments to break the cycle of poverty.

While it's not written into OAF's policies, this “Minimal Liquidity” theory informs a lot of key decisions. The biggest decision is when to ship agricultural items – by the theory, if you ship an item too early then it will likely be sold rather than being used as an agricultural input. If your neighbor's kid is sick and you have a bag of fertilizer sitting in your living room that can't be used for 3 months... what kind of neighbor would you be to keep the fertilizer instead of helping?

Client Simplicity is everything – even if that creates operational complexity

OAF does offer agricultural loans, but not in the normal sense. While this varies country by country, one of the factors that helped OAF scale so large in Kenya was “bundling” - what I call “Easy Agriculture Kits”. Green Revolution techniques for getting high yields out of maize require two types of fertilizer, applied several months apart. The wildly successful “Core Product Line” of OAF in Kenya is an “Easy Agriculture Kit” made up of:

The first type of fertilizer (planting fertilizer), delivered just before the rainy season
The second type of fertilizer (top dressing), delivered about two months later
A fertilizer scoop to measure both types of fertilizer
Maize Seeds
Training on how to use the fertilizer properly, delivered in person in the local language (Kenya alone has 68 local languages and most people are at least tri-lingual)
Kit sizes scaled to the acreage the client has, rather than to exact weights/amounts (clients order the “2 Acre Easy Maize Kit”, we handle the exact amounts needed for 2 acres)

Easy enough, right? However, some complications with that kit come up – what if you need 10 kilos of fertilizer for 1 acre of maize and our supplier can only give 6 kilo bags? In the States you would just find a different supplier, but in East Africa the supply chains are much harder to stitch together. Similarly, what if you want to convince clients to take a slightly larger loan by giving price breaks (making it slightly cheaper per unit to buy 3 acres than 2.5 acres)? In an area where most clients have two SIM cards and use one phone number for texting but another for calls to save money, these price breaks can influence a lot of decision making. Nonlinear pricing, nonlinear sizes and generally complex backend processing (what if you only need 8 Kilos per Acre for fertilizer type A, but 10 Kilos for type B?) create frustrating complexity for our software developers... but make total sense at the field level.
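The two complications above (whole bags only, and price breaks) can be sketched in Python. The 10-kilo-per-acre requirement and the 6-kilo bag come from the example in the text; the price tiers and the function names are made up for illustration and are not real OAF prices or code.

```python
import math


def bags_needed(acres, kilos_per_acre, bag_size_kilos):
    """Whole bags required to cover the acreage; partial bags round up."""
    return math.ceil(acres * kilos_per_acre / bag_size_kilos)


# Hypothetical price tiers: (minimum acres, price per acre).
PRICE_TIERS = [(0.0, 100), (3.0, 95)]


def kit_price(acres):
    """Apply the per-acre price of the largest tier the order qualifies for."""
    per_acre = max(
        (tier for tier in PRICE_TIERS if acres >= tier[0]),
        key=lambda tier: tier[0],
    )[1]
    return acres * per_acre


one_acre_bags = bags_needed(1, 10, 6)       # 10 kilos needed, sold in 6s
surplus_kilos = one_acre_bags * 6 - 1 * 10  # fertilizer left over
price_small = kit_price(2.5)
price_large = kit_price(3)
```

With a 6-kilo bag, one acre needs two bags and leaves 2 kilos over, so the amount delivered is a step function of acreage rather than a straight line. The tiered price has the same shape: under these invented tiers, the 3-acre kit costs less per acre than the 2.5-acre kit.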

Putting it all Together

The technology of One Acre Fund is my main area of expertise - the next post talks about what this mixture of offline capability, farmers-first innovation and operational complexity looks like at the systems level.

Introduction, Part 1: A Day in the Life

I wake up to the wind rustling through the banana trees in my yard. Each morning looks alike here - the seasons in Rwanda seem like minor variations on a theme (sometimes rainy, sometimes not). I brush aside the mosquitoes that have settled inside my net, quaff a liter of coffee while admiring the view of the Congo across Lake Kivu, then return to my home office to plug in for the day. Occasionally I'm interrupted by a fresh avocado falling, an avocado-hungry mongoose scurrying between the slats of my white picket fence (incongruously constructed by a previous resident) or a fist-sized spider scurrying out from beneath a hidden surface in my home office. The days pass in a blur of Slack communication, production deployments, occasional outages and other work virtually indistinguishable from what I'd be doing in an American office building. After work I leave the familiarity of the white picket fence to walk through the farming villages that surround us in every direction, practicing my (terrible) Kinyarwanda and making mental notes of what I'm seeing (One Acre Fund-branded T-Shirts, good or bad agriculture practices, the types of cellphones people are using, which houses have electricity and running water).

My name is Louis Racette; I'm 28 years old and for the last 9 months I've been the sole person capable of making changes to a database that coordinates $100 million worth of microloans across 10 countries in East Africa. This blog is part technical (how this system actually works) and part philosophy (what we're doing here).