Tuesday 30 April 2024

Data Mesh

 Data Mesh is a relatively new approach to managing and organizing data within organizations, especially large enterprises. It advocates for a decentralized data architecture, where data ownership and management are distributed across different business domains rather than centralized within a single data team or department.

I still remember the major challenge when I proposed a centralized data architecture and built a data team: people with excellent database knowledge lacked domain expertise, people with domain expertise lacked database knowledge, and even a team with both could process the data but not utilize it efficiently.

Key principles and concepts of Data Mesh include:

  1. Domain-oriented decentralized data ownership: Data is owned and managed by the individual business domains within an organization. Each domain is responsible for its data, including data governance, quality, and lifecycle management.

  2. Data as a product: Data is treated as a valuable product that is produced, consumed, and reused across the organization. Each domain acts as a data product team, responsible for the end-to-end delivery of data products that meet the needs of their stakeholders.

  3. Self-serve data infrastructure: Domains have autonomy and control over their data infrastructure and tools, allowing them to choose the technologies and solutions that best suit their requirements. This may involve using cloud-based platforms, data lakes, data warehouses, or other data management systems.

  4. Data mesh architecture: Data Mesh advocates for a modular and scalable architecture that enables seamless integration and interoperability of data across domains. This may involve implementing data pipelines, APIs, event-driven architectures, and data mesh platforms to facilitate data sharing and collaboration.

  5. Data governance and standards: While each domain has autonomy over its data, there is a need for common data governance standards, policies, and practices to ensure consistency, compliance, and interoperability across the organization. This may involve establishing data standards, metadata management, and data quality frameworks.

  6. Cross-functional collaboration: Data Mesh encourages collaboration and communication between different business domains, data teams, and stakeholders. This includes fostering a culture of data literacy, collaboration, and knowledge sharing to unlock the value of data across the organization.

Overall, Data Mesh aims to address the challenges of traditional centralized data architectures, such as limited scalability and agility and the growth of data silos, by promoting a decentralized, domain-oriented approach to data management that empowers individual business domains while ensuring organizational alignment and data governance.
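To make the "data as a product" principle a little more concrete, here is a purely illustrative Python sketch of a data product descriptor that a domain team might publish for its consumers. All field names and values are hypothetical; real implementations vary widely.

# Purely illustrative: a hypothetical descriptor a domain team might publish
# alongside its data product so other domains can discover and trust it.
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    name: str                      # e.g. "orders_daily"
    owner_domain: str              # the business domain that owns the data
    output_port: str               # where consumers read it (table, topic, API)
    schema_version: str            # contract consumers can rely on
    freshness_sla_hours: int       # governance / quality expectation
    quality_checks: list = field(default_factory=list)


orders = DataProduct(
    name="orders_daily",
    owner_domain="sales",
    output_port="bq://sales.orders_daily",
    schema_version="1.2",
    freshness_sla_hours=24,
    quality_checks=["row_count > 0", "no_null_order_id"],
)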

Tuesday 16 May 2023

Access historical data using time travel in BQ



As a data person, it's quite a common requirement to know the state of data as of a given date, for example to see what has changed since a particular date. Generally we create a change log or CDC, but it is not practical to maintain a change log for every table. Nowadays I can't imagine a data warehouse or data system with fewer than 50 tables, and maintaining change logs for many tables is extra overhead. Data keeps changing over time, and sometimes the changes are huge, especially in a data warehouse system. When you have to debug a data issue, knowing the as-of-date state of the data is quite helpful.


BigQuery has added a great feature that solves this problem with minimal effort and almost no additional development.


BigQuery provides FOR SYSTEM_TIME AS OF, which lets us access the data as of a particular point in time.


Step - 1 : Check the current data

SELECT
  *
FROM
  `poc.bq-api-test`
LIMIT
  100;





Step - 2 : Update records 


UPDATE
  `poc.bq-api-test`
SET
  name = 'test2'
WHERE
  id = 1;


Finally :
We can see the data as it was before the change.
This is just a small example; by default, time travel lets us query the state of the data up to 7 days in the past. We can also use this feature to restore a table to its state as of a particular past date (see the sketch after the query below).


SELECT
  *
FROM
  `poc.bq-api-test` FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
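As a quick illustration of the restore idea mentioned above, here is a minimal sketch that writes the time-travel snapshot into a new table using the google-cloud-bigquery Python client. The project and destination table names are hypothetical, and the same result could be achieved with a CREATE TABLE ... AS SELECT statement.

# Minimal sketch: materialize the table as it was 1 hour ago into a new table.
# Assumes google-cloud-bigquery is installed and credentials are configured;
# "my-project.poc.bq_api_test_restored" is a hypothetical destination table.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination="my-project.poc.bq_api_test_restored",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

sql = """
SELECT *
FROM `poc.bq-api-test`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""

client.query(sql, job_config=job_config).result()  # waits for the job to finish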







Conclusion: Using FOR SYSTEM_TIME AS OF, we can travel back within the time-travel window and see the past state of the data. Obviously it has some limitations; the details are in the GCP documentation here.


Access historical data using time travel  |  BigQuery  |  Google Cloud


Here is how to restore an accidentally deleted BigQuery dataset:
How to restore a deleted BigQuery Dataset (derrickqin.com)



Wednesday 12 April 2023

Using ChatGPT with Python

 



Nowadays ChatGPT is a hot cake that everyone wants to try. It's really great for asking any type of question, and most of the time it gives me better and faster answers than Google. I tried the UIs available in the market, but they were not stable, so I finally started using my favourite Python code to call the OpenAI API and ask questions.

I am calling ChatGPT from my Python code; here is a basic code snippet.

Let's create a function first. Before that, you will need to generate an API key to set as openai.api_key.

# -*- coding: utf-8 -*-
"""
Created on Fri Mar 24 13:17:57 2023

@author: p.vikas
"""

import openai
# import speech_recognition as sr


def call_chatGPT(ask=" "):
    # Define OpenAI API key
    openai.api_key = "<key>"

    # Set up the model and prompt
    model_engine = "text-davinci-003"
    prompt = ask

    completion = openai.Completion.create(
        engine=model_engine,
        prompt=prompt,
        max_tokens=1024,
        n=1,
        stop=None,
        temperature=0.5,
    )

    response = completion.choices[0].text
    return response

--------------------------

Now call that function

# -*- coding: utf-8 -*-
"""
Created on Fri Feb 10 16:55:12 2023

@author: p.vikas
"""

from functions import call_chatGPT  # , voice_to_text (only needed for voice input)
import subprocess

voice_file_path = "recording.wav"

# Ask a question
print("Please ask a question")
# subprocess.call(["C:\\Program Files\\MyApp\\MyApp.exe"])
# question = input("Please ask a question..")
question = """how to Develop Your Emotional Intelligence """

# question_text = voice_to_text(voice_file_path)
answer = call_chatGPT(question)
print(answer)

Now call this function from another program and you are all set to ask ChatGPT your questions.

This code is much more stable than the UIs available in the market. It also gives me more flexibility to pass in questions from various sources, write back the answers, and handle the responses programmatically.
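Note that newer versions of the openai Python package (1.0 and later) replace openai.Completion with a chat-based client, and the text-davinci-003 model has since been retired. If you are on a recent version, the same function looks roughly like this sketch (the model name is only an example):

# Sketch for openai>=1.0: the chat completions interface replaces Completion.create.
from openai import OpenAI

client = OpenAI(api_key="<key>")


def call_chatGPT(ask=" "):
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",  # example model name
        messages=[{"role": "user", "content": ask}],
        max_tokens=1024,
        temperature=0.5,
    )
    return completion.choices[0].message.content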

 

Thursday 29 December 2022

Data Pipeline for SaaS Application

Introduction





Nowadays most software applications are moving towards the SaaS (Software as a Service) model, where a single application serves multiple tenants, with the tenants' data stores separated either by schema or by database instance. At the same time, data separation between tenants is a critical part of every SaaS platform. Many clients still find it hard to trust that their data is completely secure, kept separate from other tenants' data on the SaaS application, and will never get mixed with another tenant's data.


It is also a great challenge for data engineers and data architects to plan and design data pipelines for a SaaS platform.

How a data pipeline for SaaS differs from a normal application pipeline:

The main difference is that the same pipeline logic has to run once for every tenant, against that tenant's own schema or database instance, while keeping each tenant's data strictly isolated from the others.


Solution 

Apache Airflow, offered on GCP as Cloud Composer, provides a solution for managing data pipelines for SaaS applications.

Details are available in the GCP documentation: Google Cloud Composer Operators — apache-airflow-providers-google Documentation

Cloud Composer is a fully managed workflow orchestration service, enabling you to create, schedule, monitor, and manage workflows that span across clouds and on-premises data centers.

Cloud Composer is built on the popular Apache Airflow open source project and operates using the Python programming language.

By using Cloud Composer instead of a local instance of Apache Airflow, you can benefit from the best of Airflow with no installation or management overhead. Cloud Composer helps you create Airflow environments quickly and use Airflow-native tools, such as the powerful Airflow web interface and command-line tools, so you can focus on your workflows and not your infrastructure.

 




I am assuming the reader already knows the basics of Airflow DAG creation; now I will explain how to make a DAG work for SaaS.

  

How to make Airflow work for SaaS


  1. Write most of your logic in a separate .py file and structure it as functions.

  2. Each function should accept the tenant as a parameter.

  3. Create a list of tenants in the DAG, or read the tenant list from an Airflow Variable or a configuration file. I recommend reading it from an Airflow Variable.

  4. Call that function directly in the DAG and pass the tenant parameter.

  5. Import the task decorator: from airflow.decorators import task.

  6. Use @task in the DAG just before calling the function.

  7. Use this code to create a task at runtime for each tenant (a minimal sketch follows after this list):
     result = data_load.expand(tenant=tenant_list)
     all_success(result)
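To tie the steps above together, here is a minimal sketch, assuming Airflow 2.3+ (which introduced dynamic task mapping) and a hypothetical helper module etl_functions that holds the actual per-tenant load logic.

# Minimal sketch, assuming Airflow 2.3+ and a hypothetical module
# etl_functions containing the per-tenant load logic (data_load_for_tenant).
from datetime import datetime

from airflow.decorators import dag, task
from airflow.models import Variable

from etl_functions import data_load_for_tenant  # hypothetical helper


@dag(schedule_interval="@daily", start_date=datetime(2022, 12, 1), catchup=False)
def saas_data_pipeline():
    # Tenant list kept in an Airflow Variable, e.g. ["tenant_a", "tenant_b"].
    tenant_list = Variable.get("tenant_list", deserialize_json=True)

    @task
    def data_load(tenant: str):
        # Runs the shared logic against one tenant's schema / database instance.
        return data_load_for_tenant(tenant)

    @task
    def all_success(results):
        # Fan-in step: executes only after every tenant's load has completed.
        print(f"Loaded {len(list(results))} tenants successfully")

    # One task instance is created at runtime for each tenant in the list.
    result = data_load.expand(tenant=tenant_list)
    all_success(result)


saas_data_pipeline()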


In the next blog I will write all the working code for this solution.

I hope you have enjoyed learning; your feedback and comments will be highly appreciated.




