Thursday 29 December 2022

Data Pipeline for SaaS Application

Introduction





Nowadays most software applications are moving towards the SaaS (Software as a Service) model,

where a single application serves multiple tenants whose data stores are separated either by

schema or by database instance. At the same time, data separation between tenants is a critical

requirement for every SaaS platform. Many clients still find it hard to trust that their data is

completely secure on a SaaS application, fully separated from other tenants' data, and will never

be mixed with it.


It is also a great challenge for data engineers and data architects to plan and design data pipelines

for a SaaS platform.

How a data pipeline for SaaS differs from a normal application pipeline:

The same pipeline logic has to run once per tenant, keep each tenant's data isolated end to end, and pick up new tenants without any change to the pipeline code.



Solution 

Apache Airflow, offered on GCP as the managed service Cloud Composer, provides a way to manage data pipelines for SaaS applications.

Details are in the GCP documentation: Google Cloud Composer Operators — apache-airflow-providers-google Documentation

Cloud Composer is a fully managed workflow orchestration service, enabling you to create, schedule, monitor, and manage workflows that span across clouds and on-premises data centers.

Cloud Composer is built on the popular Apache Airflow open source project and operates using the Python programming language.

By using Cloud Composer instead of a local instance of Apache Airflow, you can benefit from the best of Airflow with no installation or management overhead. Cloud Composer helps you create Airflow environments quickly and use Airflow-native tools, such as the powerful Airflow web interface and command-line tools, so you can focus on your workflows and not your infrastructure.

 




I am assuming the reader is already familiar with Airflow basics such as DAG creation; now I will explain

how to adapt it for SaaS.
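
As a quick refresher, a minimal Airflow DAG looks roughly like this (the DAG id and task here are illustrative names, not from any particular project):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def say_hello():
    # Placeholder task logic; a real pipeline would load or transform data here.
    print("hello from airflow")

with DAG(
    dag_id="basic_example_dag",        # illustrative DAG id
    start_date=datetime(2022, 12, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    hello = PythonOperator(
        task_id="say_hello",
        python_callable=say_hello,
    )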

  

How to make Airflow work for SaaS


  1. Write most of your logic in a separate .py file, organized as a function.

  2. The function should accept the tenant as a parameter.

  3. Create the list of tenants in the DAG, or read it from an Airflow Variable or a configuration file. I recommend reading it from an Airflow Variable.

  4. Call that function directly in the DAG and pass the tenant parameter.

  5. Import the task decorator: from airflow.decorators import task

  6. Apply @task in the DAG to the function (or to a thin wrapper around it) so each call becomes an Airflow task.

  7. Use dynamic task mapping to create a task at runtime for each tenant (see the sketch after this list):

     result = data_load.expand(tenant=tenant_list)
     all_success(result)
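
Here is a minimal sketch of how these steps could fit together, assuming Airflow 2.3+ (which introduced dynamic task mapping). The module tenant_etl, the function load_tenant_data, the Variable name tenant_list, and the all_success task are illustrative names of my own, not part of Airflow:

from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.models import Variable

# Steps 1 and 2: the heavy lifting lives in a separate module (illustrative
# name tenant_etl.py) as a function that accepts the tenant as a parameter.
from tenant_etl import load_tenant_data

with DAG(
    dag_id="saas_tenant_pipeline",     # illustrative DAG id
    start_date=datetime(2022, 12, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Step 3: read the tenant list from an Airflow Variable stored as JSON,
    # e.g. ["tenant_a", "tenant_b"].
    tenant_list = Variable.get("tenant_list", deserialize_json=True)

    # Steps 4-6: a thin @task wrapper around the shared function.
    @task
    def data_load(tenant: str) -> str:
        load_tenant_data(tenant)
        return tenant

    # Downstream task that runs once every mapped tenant load has finished.
    @task
    def all_success(tenants):
        print(f"Loaded data for tenants: {list(tenants)}")

    # Step 7: dynamic task mapping creates one data_load task per tenant at runtime.
    result = data_load.expand(tenant=tenant_list)
    all_success(result)

With this layout, adding a new tenant is just an update to the tenant_list Variable; the DAG picks it up on the next run without any code change.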


In the next blog I will write all the working code for this solution.

I hope you have enjoyed learning; your feedback and comments will be highly appreciated.




