Data engineers manage the backbone for modern data products. They build the foundations for all data science and analytics efforts. But for the average data engineer, it’s a challenge to make sure jobs are running successfully and data is up to quality standards. For companies whose revenue and operations depend on accurate, on-time data flows, that’s a huge problem.
We built Databand to help data engineers monitor and manage their DAGs. Databand is the only observability solution plugged into the open source ecosystem of tools that leading teams use. Because we are so integrated, we can provide a deeper understanding of how your infrastructure is performing, how much it’s costing you, and how accurate your data is, so that you can unlock data engineering productivity.
The Databand.ai platform orchestrates ML creation and data processing within organizations and gives the data scientists and engineers involved visibility into the process. The platform streamlines the integration, productization, and testing of ML pipelines, enabling the different stakeholders to work together on ML projects in an efficient, frictionless way.
These are some recent contributions we’ve made to Apache Airflow:
Together with our friends from Polidea, we created a new executor for debugging and DAG development. This executor runs a single task instance at a time and works with SQLite and sensors.
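To make the "one task instance at a time" idea concrete, here is a minimal, standalone sketch. It is not Airflow code and not the executor's actual implementation: run_one_at_a_time, the task names, and the upstream mapping are all hypothetical, chosen only to show tasks running strictly in dependency order, one after another, in a single process.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_one_at_a_time(tasks, upstream):
    """Run each task callable in dependency order, never more than one at once.

    tasks:    {task_id: callable}             (hypothetical toy tasks)
    upstream: {task_id: set of upstream ids}  (hypothetical toy dependencies)
    """
    executed = []
    # static_order() yields every task only after all of its upstream tasks.
    for task_id in TopologicalSorter(upstream).static_order():
        tasks[task_id]()  # blocks until this task finishes
        executed.append(task_id)
    return executed


log = []
tasks = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
upstream = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

order = run_one_at_a_time(tasks, upstream)
print(order)
```

Because nothing runs in parallel in this sketch, a breakpoint or print statement inside any task is hit in a predictable order, which is the property that makes a sequential approach convenient while developing a DAG.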
Working with Polidea, we’ve made major progress in optimizing Airflow scheduler performance. In total, tests show 10x faster query performance and over 2,000 fewer queries. The list below shows some of the optimizations that have been pushed so far, with more to come:
[AIRFLOW-6856] Bulk fetch paused_dag_ids
[AIRFLOW-6857] Bulk sync DAGs
[AIRFLOW-6862] Do not check the freshness of fresh DAG
[AIRFLOW-6869] Bulk fetch DAGRuns for _process_task_instances
[AIRFLOW-6881] Bulk fetch DAGRun for create_dag_run
[AIRFLOW-6887] Do not check the state of fresh DAGRun
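Several of these changes share one pattern: replacing a query per item with a single bulk query. The sketch below illustrates that pattern using Python's built-in sqlite3 module. It is only an illustration under stated assumptions, not the Airflow scheduler code: the dag table, its columns, the DAG ids, and both helper functions are hypothetical.

```python
import sqlite3


def make_db():
    # Hypothetical schema: one row per DAG with a paused flag.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dag (dag_id TEXT PRIMARY KEY, is_paused INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO dag VALUES (?, ?)",
        [("etl_daily", 0), ("ml_train", 1), ("reports", 1)],
    )
    return conn


def paused_one_by_one(conn, dag_ids):
    """One SELECT per DAG: the number of queries grows with the number of DAGs."""
    paused, queries = set(), 0
    for dag_id in dag_ids:
        row = conn.execute(
            "SELECT is_paused FROM dag WHERE dag_id = ?", (dag_id,)
        ).fetchone()
        queries += 1
        if row and row[0]:
            paused.add(dag_id)
    return paused, queries


def paused_bulk(conn, dag_ids):
    """A single SELECT for all DAGs, however many ids are passed in."""
    if not dag_ids:
        return set(), 0
    placeholders = ",".join("?" for _ in dag_ids)
    rows = conn.execute(
        f"SELECT dag_id FROM dag WHERE is_paused = 1 AND dag_id IN ({placeholders})",
        list(dag_ids),
    ).fetchall()
    return {row[0] for row in rows}, 1


conn = make_db()
ids = ["etl_daily", "ml_train", "reports"]
slow = paused_one_by_one(conn, ids)
fast = paused_bulk(conn, ids)
print(slow[1], "queries vs", fast[1], "query")
```

Both functions return the same set of paused DAG ids, but the bulk version issues one query regardless of how many DAGs are checked. In a scheduler loop that runs continuously, that difference adds up quickly.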