comcast5-uglycode

This was years ago. I'm not sure exactly when, but at least five years back, before we adopted Databricks. Our team's operations were mostly on AWS. A person from another team asked us to run his code regularly on AWS. In fact, the request did not come to us first: he had already asked a couple of other teams, and they had been unable to run it.

Those teams' AWS experience was a bit light, and they couldn't figure out how to run it. To save face, their excuse was that the code was not ready for prod: too many hard-coded constants, and so on.

While we were running EMR on an as-needed basis, launching clusters through Lambda and terminating them when the job finished, some other teams were running EMR as if it were an on-prem Hadoop cluster, that is, keeping a cluster up for months or even more than a year. The original code was written in that environment. It assumed everything would run with PySpark on an already-running cluster. Then, here is the kicker: the code used xgboost, and to run that module he was installing xgboost on the driver node and running it there.
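For readers who have not seen the as-needed pattern, here is a minimal sketch of a Lambda handler that launches a one-shot EMR cluster with boto3. It is not our actual code. The release label, instance types, role names, and the event fields (`job_name`, `script_s3_uri`, `log_s3_uri`) are all assumptions for illustration. The key setting is `KeepJobFlowAliveWhenNoSteps=False`, which makes EMR shut the cluster down once its steps are done.

```python
def build_transient_emr_request(job_name, script_s3_uri, log_s3_uri):
    """Build keyword arguments for the boto3 EMR client's run_job_flow call.

    Every concrete value here (release label, instance types, role names)
    is an illustrative assumption, not taken from the original setup.
    """
    return {
        "Name": job_name,
        "ReleaseLabel": "emr-5.36.0",  # assumed release
        "LogUri": log_s3_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "driver", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "workers", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # The cluster terminates itself once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [
            {
                "Name": "run-pyspark-job",
                # Do not leave a failed cluster running and billing.
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             script_s3_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }


def lambda_handler(event, context):
    """Lambda entry point: start the cluster and return its id."""
    import boto3  # imported here so the builder above works without AWS access

    emr = boto3.client("emr")
    request = build_transient_emr_request(
        event["job_name"], event["script_s3_uri"], event["log_s3_uri"]
    )
    response = emr.run_job_flow(**request)
    return {"cluster_id": response["JobFlowId"]}
```

With this shape nobody has to remember to turn anything off: the cluster only exists for as long as the job runs.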

The other teams weren’t even able to run the code.

In fact, the code was not runnable. It was broken. It had been copied and pasted from the developer's Jupyter notebook; in places the function definitions and calls didn't match, and there were quite a few other problems.

I went through the code and fixed the obvious bugs. There was also one bug that the developer himself should have fixed, because it was not syntactic; it came from conditions his code created during the run. I told him there was still an issue, and he escalated it to upper management, saying I had not told him where the bug was. If I were him, I would have just tried to figure it out myself.

I refactored a bit to split the code into an EMR portion and an xgboost portion, and ran the latter on EC2. That reduced the overall cost a lot, and there was no longer any need to install things on EMR. A container would make a cleaner runtime environment, but given the size of the job I decided to use EC2. In the end, I put everything in Step Functions, calling EMR first and then a set of EC2 instances. It all worked out beautifully, and performance was good.
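To make the final shape concrete, here is a rough sketch of such a pipeline, assuming "Step" means AWS Step Functions. It builds the state machine definition as a Python dict: create an EMR cluster, run the Spark step, terminate the cluster, then fan out over a set of EC2 workers for the xgboost part. This is a reconstruction, not the original definition. The state names, the `$.partitions` input field, and the Lambda that launches each EC2 worker (`ec2_launcher_arn`) are hypothetical.

```python
import json


def build_pipeline_definition(cluster_parameters, script_s3_uri, ec2_launcher_arn):
    """Return an Amazon States Language definition as a Python dict.

    cluster_parameters: settings for the EMR createCluster task (assumed
        to be supplied by the caller).
    ec2_launcher_arn: ARN of a hypothetical Lambda that starts one EC2
        worker for the xgboost stage.
    """
    return {
        "Comment": "Spark stage on EMR, then xgboost stage on EC2 (illustrative)",
        "StartAt": "CreateCluster",
        "States": {
            "CreateCluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
                "Parameters": cluster_parameters,
                "ResultPath": "$.cluster",
                "Next": "RunSparkStep",
            },
            "RunSparkStep": {
                "Type": "Task",
                # .sync makes the state machine wait until the step completes.
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId.$": "$.cluster.ClusterId",
                    "Step": {
                        "Name": "spark-stage",
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {
                            "Jar": "command-runner.jar",
                            "Args": ["spark-submit", "--deploy-mode",
                                     "cluster", script_s3_uri],
                        },
                    },
                },
                "ResultPath": "$.spark_step",
                "Next": "TerminateCluster",
            },
            "TerminateCluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster.sync",
                "Parameters": {"ClusterId.$": "$.cluster.ClusterId"},
                "ResultPath": None,  # serialized as null: keep the input as-is
                "Next": "RunXgboostOnEc2",
            },
            "RunXgboostOnEc2": {
                "Type": "Map",
                "ItemsPath": "$.partitions",  # hypothetical input field
                "MaxConcurrency": 4,
                "Iterator": {
                    "StartAt": "LaunchWorker",
                    "States": {
                        "LaunchWorker": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::lambda:invoke",
                            "Parameters": {
                                "FunctionName": ec2_launcher_arn,
                                "Payload.$": "$",
                            },
                            "End": True,
                        }
                    },
                },
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    definition = build_pipeline_definition(
        {"Name": "demo-cluster"},
        "s3://example-bucket/job.py",
        "arn:aws:lambda:us-east-1:123456789012:function:launch-xgb-worker",
    )
    print(json.dumps(definition, indent=2))
```

The sketch leaves out error handling (a `Catch` that terminates the cluster if the Spark step fails) and any waiting for the EC2 workers to finish. A real pipeline would need both.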

While the code was not perfect, the idea itself was not bad, so I think it was worth the effort. However, the people involved had no confidence in themselves; they just tried to avoid blame and could not even say, "I don't know how to do it." Sometimes it is good to say, "I can't do it." Be honest with yourself and with others.