comcast2-onefile

Today is that kind of day. I made some progress on this web site and started moving my previous homepage from Rackspace to Hugo + S3.

Speaking of performance improvements at Comcast, there was a task where we needed to create a single file (CSV format) from a table. The on-prem MSSQL package was able to do it well, but in the Hadoop world, which for us currently means Databricks, it is not that simple.

You can use coalesce, but coalescing everything down to one file is expensive in Databricks / Spark and Hadoop, and it sometimes makes the one node that collects all the output run out of memory.

Without knowing the underlying technology, the team relied on coalesce, and the process was brittle and took a long time. I searched and found a very nice way to handle this using IOUtils.
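To give a feel for the idea, here is a minimal sketch in Python. It is not the code from that job, and it runs against a local directory rather than HDFS or DBFS. The assumption is that Spark has already written its usual `part-*` files in parallel, and that each part carries its own CSV header. The function name `merge_part_files` and its parameters are made up for illustration. On Hadoop the same streaming copy would typically go through `FileSystem` streams and `IOUtils.copyBytes`; here `shutil.copyfileobj` plays that role.

```python
import shutil
from pathlib import Path


def merge_part_files(part_dir, target, has_header=True):
    """Concatenate Spark-style part-* files into one CSV by streaming bytes.

    Hypothetical helper: the parts are copied in fixed-size chunks, so
    memory use stays flat no matter how large the table is.
    """
    # Only part-* files are picked up; markers such as _SUCCESS are ignored.
    parts = sorted(Path(part_dir).glob("part-*"))
    with open(target, "wb") as out:
        for index, part in enumerate(parts):
            with open(part, "rb") as src:
                if has_header and index > 0:
                    # Assumes every part repeats the header line;
                    # keep it only from the first part.
                    src.readline()
                # Stream the rest of the part in 1 MiB chunks.
                shutil.copyfileobj(src, out, length=1024 * 1024)
    return len(parts)
```

The point of the design is that the cluster still writes in parallel, and the single-file step becomes a plain byte copy instead of a shuffle into one task.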

I cut the whole processing time by more than half while using half the computing resources it previously needed. I could cut it down further with more testing and a bit of optimization.

The lesson is that knowing the underlying operations and technology is crucial to improving performance.