Buggy Code in Production, Survived

Areca is the name of the billing engine I am working on for Turk Telekom. Funnily enough, it is also the name of the plants we bought to freshen up the office. We wanted the office to feel more alive, so we carried them around and watered them, and when the time came to name the project, Areca somehow became the project name too.

The project itself is not as peaceful as the name sounds. Turk Telekom already has a billing system written in Oracle PL/SQL, which runs once a month to generate bills. What we are building is very different: Areca is a near-real-time billing engine. If a new promotion changes how a user should be billed, the system should reflect that immediately. It serves around 20 million subscribers, so the scale is serious. There is also a deadline: a new law requires Turk Telekom to inform users when they reach 80% of their bill.
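To make the 80% rule a bit more concrete, here is a minimal sketch of how such a check could look. This is not Areca's actual code. The class name, the method name, and the idea of comparing a running total against a single limit amount are all assumptions made for illustration. The point it shows is that a near-real-time engine wants to notify once, at the moment a charge pushes the total across the threshold, and not on every charge after that.

```java
import java.math.BigDecimal;

// Hypothetical sketch: the names and the "limit" concept are assumptions,
// not taken from the real billing engine.
final class UsageThresholdCheck {
    // 80% expressed as an exact decimal to avoid floating-point rounding on money.
    private static final BigDecimal THRESHOLD_RATIO = new BigDecimal("0.80");

    private UsageThresholdCheck() {
    }

    // Returns true only when this charge moves the running total from below
    // the 80% mark to at or above it, so the subscriber is informed once.
    static boolean crossedThreshold(BigDecimal previousTotal,
                                    BigDecimal newTotal,
                                    BigDecimal limit) {
        BigDecimal threshold = limit.multiply(THRESHOLD_RATIO);
        boolean wasBelow = previousTotal.compareTo(threshold) < 0;
        boolean isAtOrAbove = newTotal.compareTo(threshold) >= 0;
        return wasBelow && isAtOrAbove;
    }
}
```

With a limit of 100, a charge that moves the total from 70 to 85 would trigger the notification, while a later charge from 85 to 90 would not.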

We actually started well. The engine looked elegant, and the design looked clean. The trouble started when we began converting the old PL/SQL logic into the new system. That was the point where many bugs started showing up. Things that looked fine at first became much more complicated during implementation. There were many rules, many exceptions, and not much room for mistakes. Under that pressure, some of those bugs made it all the way to production.


What We Have Done Right

First of all, we are a small team. There are five of us, including our boss, the founder. We spend a lot of time together, not only in the office but also outside of it. We have lunches together, dinners together, and after some point you really get to know how people think. I think this helps a lot. Communication is easier, asking for help is easier, and for me it is also a very good learning environment.

Another good thing is that we did not start completely blind. We analyzed Amdocs in detail, and we also looked at similar tools. This gave us many good ideas about the domain itself. We started understanding concepts like buckets, promotions, accumulators and similar things much better. In that sense, I think we got the basics right.

I also think we got the engine right. We cache many things, for example promotions and some customer data. The cache works well, but it is very big, and the JVM does not like that very much. Because of that, we turned to off-heap caching. I had never used anything like that before, but it worked really well for us.
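For readers who have not seen off-heap caching before, here is a very small sketch of the idea. It is not what Areca does internally. The class, its methods, and the append-only layout are assumptions chosen to keep the example short. The values live in a direct `ByteBuffer`, which is allocated outside the normal Java heap, and only a small index of offsets stays on the heap.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an append-only off-heap cache; not thread-safe.
final class OffHeapCacheSketch {
    // Direct buffer: its memory is outside the garbage-collected heap.
    private final ByteBuffer storage;
    // On-heap index: key -> {offset, length} of the value inside the buffer.
    private final Map<String, int[]> index = new HashMap<>();

    OffHeapCacheSketch(int capacityBytes) {
        this.storage = ByteBuffer.allocateDirect(capacityBytes);
    }

    // Appends the value to the buffer. Returns false when it does not fit.
    // Overwriting a key leaves the old bytes behind; a real cache would
    // need eviction or compaction.
    boolean put(String key, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (bytes.length > storage.remaining()) {
            return false;
        }
        int offset = storage.position();
        storage.put(bytes);
        index.put(key, new int[] {offset, bytes.length});
        return true;
    }

    // Copies the value back onto the heap, or returns null if the key is unknown.
    String get(String key) {
        int[] location = index.get(key);
        if (location == null) {
            return null;
        }
        byte[] bytes = new byte[location[1]];
        // duplicate() shares the bytes but has its own position, so reads
        // do not disturb the write position of the main buffer.
        ByteBuffer view = storage.duplicate();
        view.position(location[0]);
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The trade-off is that every read copies and decodes bytes, which costs more than following an object reference. In exchange, the garbage collector does not have to manage the cached values themselves.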

One more important advantage is that we are not guessing all the time. We have access to Turk Telekom’s internal code, so we can actually see how some things work in the existing system. On top of that, engineers from their side also help us when needed. That makes a big difference, because building a system like this without that support would be much harder.

What We Have Done Wrong

We did many things wrong too. First of all, our Scrum meetings did not end on time. They usually ran long and turned into design sessions. The bad part was not only the lost time. The bad part was that these discussions were not always useful. Not everybody took part properly, and sometimes people only cared about their own solution.

Another mistake was testing. We did write tests, but not enough unit tests. We trusted ourselves too much. We thought we knew the code, and we thought we could move faster like that. Later we saw that this was a big mistake. When bugs appeared, finding the real reason was not easy at all.

The PL/SQL conversion also brought many problems. Some parts looked easy at first, but they were not easy in implementation. There were many small details and exceptions. Because of deadline pressure, these bugs reached production before we were really ready.

After that, things became harder and harder. We were tired, stressed and trying to fix many things at the same time. When you work like that, you start making more mistakes. That is exactly what happened to us.

A Reflection

In the end, we survived. I say survived because that is really what it feels like. We found the important bugs just in time and managed to keep going. Next time, maybe we will not be that lucky.

Still, the damage was there. After all those fixes, the code was not in very good shape. It became harder to read, harder to test and harder to change. Sometimes fixing one problem created another one. That is probably the worst part of buggy code in production. You do not only fix bugs, you also fight with the mess left behind.

But I also learned a lot from this period. I learned that a good looking design is not enough. I learned that tests are not something you write only when you have extra time. I learned that being tired changes the way you think and the way you code. And I learned that if a team does not listen to each other well enough, small problems can become very big ones.

So yes, we survived. That is good. But I think the more important thing is this: now we know much better what can go wrong, and next time we should be more ready.
