15
ltlian
1y

Years ago, we were setting up an architecture where we fetch certain data as-is and throw it in CosmosDb. Then we run a daily background job to aggregate and store it as structured data.
The problem is the volume. The calculation step is so intense that it will bring down the host machine, and the insert step will bring down the database in a manner where it takes 30 min or more to become accessible again.

Accommodating for this would need a fundamental change in our setup. Maybe rewriting the queries, data structure, containerizing it for auto scaling, whatever. Back then, this wasn't on the table due to time constraints and, nobody wanted to be the person to open that Pandora's box of turning things upside down when it "basically works".

So the hotfix was to do a 1 second threadsleep for every iteration where needed. It makes the job take upwards of 12 hours where - if the system could endure it - it would normally take a couple minutes.

The solution has grown around this behavior ever since, making it even harder to properly fix now. Whenever there is a new team member there is this ritual of explaining this behavior to them, then discussing solutions until they realize how big of a change it would be, and concluding that it needs to be done, but...

not right now.

Comments
Add Comment