Comments
-
Lambda functions shouldn't call each other in a circular fashion.
You can try introducing SQS between the functions. That decouples them and opens things up for scaling. -
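For illustration, a minimal sketch of that hand-off using boto3 (the queue URL, payload fields, and function name are hypothetical). The first Lambda finishes its own work and pushes a message to SQS; the downstream Lambda is then triggered by the queue instead of being invoked directly:

```python
import json
import boto3

# Hypothetical queue URL - replace with the queue that feeds the next stage.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/etl-stage-two"

sqs = boto3.client("sqs")

def handler(event, context):
    """First Lambda: do its own transformation, then hand off via SQS
    instead of calling the next Lambda directly."""
    result = {"record_id": event.get("record_id"), "status": "transformed"}

    # The next stage consumes this message on its own schedule,
    # so a slow or failing downstream function can't create a call loop.
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(result))
    return {"statusCode": 200}
```

With a dead-letter queue attached, failed messages are also retained instead of silently dropping out of a Lambda-to-Lambda chain. -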
In my opinion, ETL is the same as any other "large" workflow:
you will not get a perfect overview in code.
Don't try to solve an unsolvable problem.
An overview is something graphical, best slimmed down to easily digestible summaries.
As a first step, I usually write all process steps down on flash cards, for example:
"Database - Gather => Summary
Gather data from XY, summarize by Z, export as JSON - DTO: Summary"
"Transform - Summary => Aggregation
Parallel aggregation of Summary DTOs by time range."
The important thing is: note each input and output.
You'll most likely end up with a lot of cards that you can then lay out in order.
I mostly do this on the carpet or - if it fits - on a free wall.
Once you have a detailed overview of each step, you can easily see the flow of data and create a summarized overview - simply by numbering the flash cards and grouping them into logical, isolated process steps.
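The same "note each input and output" idea can also live in code as a tiny manifest, which makes it easy to spot a step whose input nothing produces. A minimal sketch (step names and DTO labels are just the flash-card examples from above):

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str        # e.g. "Database - Gather"
    input_dto: str   # what the step consumes
    output_dto: str  # what the step produces

# The flash cards as data, in pipeline order (illustrative values)
steps = [
    Step("Database - Gather", input_dto="raw rows from XY", output_dto="Summary"),
    Step("Transform - Aggregate", input_dto="Summary", output_dto="Aggregation"),
]

# Sanity check: every step's input must come from outside or from an earlier step
produced = {"raw rows from XY"}  # inputs available from outside the pipeline
for step in steps:
    if step.input_dto not in produced:
        print(f"warning: {step.name} consumes {step.input_dto!r} that nothing produces")
    produced.add(step.output_dto)
```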
In my opinion, this is the most efficient way if you don't have an extremely large whiteboard or something like that.
I've never found a tool / GUI that handled large workflows in a *good* way, mostly due to lack of screen size.
My brain is a very special and burnt piece of hardware though, so that might be why.
When I see the flash cards, I can shuffle them around, take them out, add Post-its when my gut feeling tells me a step is going to melt my face, etc.
It's easier than trying to fiddle around with a very limited interface.
Working on a feature which heavily relies on a data pipeline. I noticed it is a couple of Lambda functions calling each other (fuck you to the guy who made it). The best way to get my sanity back is to build a proper ETL pipeline. Any suggestions for building an ETL in Python with reliability?
Options already considered
1. Celery tasks - Worked well, but there's no overview of individual task progress across Celery tasks
2. Airflow - Gives a good overview, but the docs make less sense than a 10-year-old talking, mostly because they introduced a new syntax and not everything has fully migrated yet. Also no support for reusing DAGs (see the sketch after this list)
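For reference, a minimal sketch of the decorator-based TaskFlow syntax that option 2 refers to, assuming Airflow 2.4+ (task names and values are illustrative):

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def example_etl():
    @task
    def extract():
        # Pretend this pulls rows from a source system
        return {"rows": [1, 2, 3]}

    @task
    def transform(payload):
        # Return values are passed between tasks via XCom
        return {"total": sum(payload["rows"])}

    @task
    def load(result):
        print("loading", result)

    # Chaining the calls defines the task dependencies
    load(transform(extract()))

example_etl()
```

Each @task shows up as its own node in the Airflow UI with per-task status and retries, which is the kind of per-step visibility missing from plain Celery chains.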
question