Airflow Interview Questions and Answers on XComs
Airflow XComs should not sound unfamiliar to you if you are appearing for a data engineering job interview. The list of Apache Airflow interview questions below will give you some good insight into why and how Airflow XComs can be used in DAGs.
43. What will happen if you set ‘catchup=False’ in the DAG and ‘latest_only=True’ for some of the DAG tasks?
Since catchup is set to False in the DAG definition, the scheduler creates a run only for the most recent schedule interval, irrespective of whether latest_only is True or False for any one or all of the DAG's tasks. In other words, catchup=False already ensures that only the latest run executes, so you do not need to set latest_only to True for all the tasks.
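A minimal Airflow 2.x sketch of this situation (the dag_id, dates, and task names are illustrative): with catchup=False, only the most recent interval runs, so the LatestOnlyOperator adds nothing here.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.latest_only import LatestOnlyOperator

# With catchup=False, only the most recent schedule interval is run,
# which makes the LatestOnlyOperator below effectively redundant.
with DAG(
    dag_id="catchup_false_example",   # hypothetical dag_id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    latest_only = LatestOnlyOperator(task_id="latest_only")
    do_work = EmptyOperator(task_id="do_work")
    latest_only >> do_work
```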
12. What is the role of Airflow Operators?
Whenever you define a DAG (directed acyclic graph), it will contain several tasks. Those tasks may be written for entirely different environments: one task might be Python code while another is a bash script file. Since these tasks have dependencies on one another, they have to be orchestrated from a single environment (which in our case is the Python file where our DAG is defined). To solve this, Airflow provides operators as Python classes, where each operator acts as a wrapper around a unit of work, defining the actions that will be completed and minimizing the amount of code we have to write ourselves.
Now, to execute the Python script (Task I), we can use the PythonOperator class, and to execute the bash script file (Task II), we can use the BashOperator class.
And if you want Airflow to send you an email with the status whenever a DAG run or a task completes, there is an EmailOperator class for that as well. Similarly, many more!
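A short sketch of the two-task setup described above (the dag_id, callable, and commands are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def task_one():
    # Task I: a unit of work written in Python
    print("running Task I in Python")

with DAG(
    dag_id="operator_example",        # hypothetical dag_id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # PythonOperator wraps the Python callable; BashOperator wraps a shell command.
    t1 = PythonOperator(task_id="task_i", python_callable=task_one)
    t2 = BashOperator(task_id="task_ii", bash_command="echo 'running Task II in bash'")
    t1 >> t2
```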
29. What are SLAs?
SLA stands for Service Level Agreement; it is the time by which a task or a DAG should have succeeded. If an SLA is missed, an email alert is sent out as per the system configuration, and a note is made in the log. SLA misses can also be viewed in the web UI.
An SLA can be set at the task level by passing a “timedelta” object to the operator's sla argument, e.g. sla=timedelta(seconds=30).
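For example, a minimal sketch (dag_id and task are illustrative) where a task is expected to succeed within 30 seconds of the scheduled run time:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="sla_example",             # hypothetical dag_id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # If this task has not succeeded within 30 seconds of the scheduled
    # run time, an SLA miss is recorded and an alert email is sent
    # (provided email is configured in the Airflow deployment).
    BashOperator(
        task_id="quick_task",
        bash_command="sleep 10",
        sla=timedelta(seconds=30),
    )
```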
39. What is the SparkSQL operator?
It executes Spark SQL queries. This operator runs the query against Spark's Hive metastore service. The query can either be a templated string or a .sql or .hql file, provided the spark-sql binary is on the PATH.
The operator takes a “sql” argument for the templated SQL query (files with the extensions listed in “template_ext”, i.e. .sql or .hql, are rendered as templates), along with the Spark job name and connection ID.
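A sketch using the SparkSqlOperator from the apache-airflow-providers-apache-spark package (the dag_id, table name, and job name are illustrative):

```python
from datetime import datetime

from airflow import DAG
# Requires the apache-airflow-providers-apache-spark package.
from airflow.providers.apache.spark.operators.spark_sql import SparkSqlOperator

with DAG(
    dag_id="spark_sql_example",       # hypothetical dag_id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The query string is templated; it could instead be the path to a
    # .sql or .hql file, since those extensions appear in template_ext.
    SparkSqlOperator(
        task_id="run_spark_sql",
        sql="SELECT COUNT(*) FROM my_hive_table",  # hypothetical table
        master="yarn",
        conn_id="spark_sql_default",
        name="airflow-spark-sql-job",
    )
```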
46. How can you use a set or a subset of parameters in some of the DAG's tasks without explicitly defining them in each task?
We can use the “params” argument. It is a dictionary of DAG-level parameters that are made accessible in Jinja templates, and it can be referenced at the task level. We pass “params” to our DAG as a dictionary of parameters such as {“param1”: “value1”, “param2”: “value2”}, and individual values can then be used as “echo {{ params.param1 }}” in a BashOperator.
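A minimal sketch of DAG-level params (the dag_id and parameter values are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="params_example",          # hypothetical dag_id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    # DAG-level params, visible to every task through Jinja templating
    params={"param1": "value1", "param2": "value2"},
) as dag:
    # Only this task happens to reference param1; no task has to redefine it.
    BashOperator(task_id="print_param", bash_command="echo {{ params.param1 }}")
```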