
Thursday, 19 December 2024

The Growing Importance of Data Professionals


Key Skills and the Path to Becoming a Data Person

Data has become one of the most critical assets for organizations across industries. As the volume of data continues to grow exponentially, the demand for skilled professionals who can manage, analyze, and leverage this data has skyrocketed. These professionals, often referred to as "data persons," play a pivotal role in every organization, bridging the gap between complex data and strategic business decisions.

In recent years, the demand for data professionals has increased dramatically, and this trend will only continue as data grows every second. In this blog, I'll share insights from my experience and provide guidance on how to become a successful data person, along with the essential skills and knowledge that an engineer needs to excel in this field.

The Evolving Role of a Data Professional

Throughout my career, I’ve noticed that a data person is not just an engineer focused on raw technical skills; they also need a solid understanding of the end-to-end business processes, much like a business analyst. In many ways, the data person acts as a bridge between the technical data world and the leadership team, including C-level executives. This makes their role not only technical but also strategic and vital for the organization's success.

The Key Skills Every Data Person Needs

From my experience on interview panels, I’ve observed that many candidates often overlook a critical skill: business understanding. While technical expertise is essential, a data person must understand the business context and objectives to ensure that their work aligns with the company's goals.

Another foundational skill that many candidates miss, even at senior levels, is proficiency in SQL. Whether you're a data engineer, data scientist, or data analyst, SQL remains one of the most important tools in the data person's toolkit. Along with SQL, a basic understanding of programming—particularly Python—has become a must-have skill.

The Rise of the Full-Stack Data Person

The role of a data professional has evolved dramatically in recent years. The industry no longer expects a data person to specialize in just one aspect of data management. Today, the expectation is for a full-stack data professional who can handle a range of tasks, including:

  • Data Engineering: Ingesting data from diverse sources, managing data pipelines, and ensuring data integrity.
  • Data Analysis: Analyzing data and ensuring its accuracy to make informed decisions.
  • Machine Learning: Collaborating with data scientists to apply machine learning models and share insights.
  • Data Modeling: Designing data models, at least for staging purposes.
  • Cloud Data Tools: Proficiency in cloud-based data tools (e.g., AWS, Azure, Google Cloud).
  • CI/CD for Data: Understanding CI/CD and infrastructure-as-code tools such as Terraform to automate and streamline data workflows.

This broad set of skills makes the role of the data person more dynamic and challenging, but also more rewarding.

How Technology is Easing the Data Person’s Journey

While the expectations have grown, the tools and technologies available to data professionals have also improved significantly. Modern data engineering tools, fully managed data services, and comprehensive tool documentation have streamlined the process, making it easier for data professionals to do their jobs more efficiently. These advancements have simplified tasks such as data integration, query optimization, and troubleshooting, which once took up significant amounts of time and effort.

Conclusion: The Future of Data Professionals

The role of the data person is more crucial than ever. As organizations continue to rely on data to drive business decisions, the demand for skilled data professionals will only grow. To succeed in this field, it’s important not to overlook the foundational skills of business understanding and SQL. Additionally, given the industry's evolving expectations, becoming a full-stack data person is key to staying relevant in the future.

The landscape of data has changed dramatically, and while the challenges have increased, the tools and resources available to data professionals have made their work more efficient and impactful than ever before. As a result, pursuing a career as a data person is not only a great opportunity today but will continue to be one of the most in-demand roles in the future.

About Me


Hi, I’m Vikas Kumar Pathak, a Data Architect and Data Engineer with 14 years of experience crafting innovative, scalable, and secure data solutions across diverse industries including Finance, Retail, Education, and Customer Loyalty.

I specialize in designing and leading the development of cloud-based data platforms, leveraging tools such as Google Cloud Platform (GCP), AWS, BigQuery, and Redshift. My expertise spans the entire data lifecycle—from architecture and engineering to advanced analytics and machine learning. I’m passionate about building automated, high-performance data systems that drive business growth and decision-making.

Throughout my career, I’ve led teams in designing and implementing complex data solutions, managed large-scale data transformations, and contributed to SaaS-based platforms. I thrive in roles that require both technical proficiency and strategic oversight, ensuring data solutions align with business objectives while maintaining security and scalability.

Currently, as a Sr. Cloud Data Architect & Data Engineer at GIIR Germany GmbH, I’m leading the creation of an enterprise-level Customer Data Platform (CDP), focusing on data security, automation, and seamless integration to deliver actionable customer insights.

I’m always exploring new ways to improve data systems and drive insights, and I’m excited about the future of data-driven decision-making.


Tuesday, 17 December 2024

Data Pipeline as a Service (DPaaS)

Understanding Data Pipeline as a Service (DPaaS): 

I define "Data Pipeline as a Service" (DPaaS) as a concept primarily employed when constructing data pipelines for SaaS applications, facilitating the transfer of data from a SaaS-based OLTP system to an OLAP system.

Data Pipeline as a Service (DPaaS) offers a streamlined solution for building and managing data pipelines, particularly in multi-tenant SaaS environments. It enables the reuse of the same pipeline logic for different tenants while ensuring strict separation of data flows to prevent cross-tenant data leakage—a critical challenge in multi-tenant architectures.


The Multi-Tenant Data Challenge

Consider a common scenario: Your SaaS application needs to build a data warehouse. While the data transfer logic (ETL/ELT process) may be consistent across tenants, merging the pipelines can introduce significant risks:

  1. Data Leakage: Without robust isolation, there’s a chance that data from one tenant could inadvertently flow into another tenant's pipeline, creating security and compliance issues.
  2. Pipeline Dependencies: A failure in one tenant’s pipeline could bring down processes for all tenants, causing widespread disruptions.
  3. Maintenance Complexity: Optimizing or updating pipelines can be error-prone, as it requires modifying logic for each tenant. Missing updates for even a single tenant could lead to inconsistent performance or failures.

Most existing solutions attempt to address these challenges by passing a tenant code as a parameter throughout the ETL process. However, this approach is cumbersome, error-prone, and does not fully mitigate the risk of cross-tenant data leaks.


How DPaaS Solves the Problem

DPaaS introduces a tenant-specific approach to pipeline creation. Instead of managing a single shared pipeline for all tenants, DPaaS allows you to:

  • Isolate Pipelines: Build a shared codebase but instantiate a separate pipeline for each tenant. This ensures complete data isolation and prevents cross-tenant leakage.
  • Improve Reliability: If one tenant’s pipeline encounters an issue, it doesn’t impact others, ensuring seamless operation for unaffected tenants.
  • Simplify Updates: Changes or optimizations are applied at the codebase level and propagated through individual pipeline instantiations. This reduces the risk of missing updates for specific tenants.
  • Backfill and Retry: Backfills and retries can easily be run for a specific tenant instead of rerunning the pipeline for all tenants.
  • Tenant Onboarding: Onboarding a new tenant takes almost zero developer effort; it’s just a matter of adding one more tenant name to the tenant list variable, as sketched below.
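To make this concrete, here is a minimal Python sketch of the idea, assuming an Airflow-style setup where one pipeline instance is generated per tenant from a shared codebase. The tenant names, the PipelineConfig fields, and the run_pipeline steps are hypothetical placeholders, not part of any specific product.

# Minimal sketch: one shared pipeline definition, instantiated once per tenant.
# Tenant names, source/target names, and the ETL steps are illustrative only.

from dataclasses import dataclass

# Onboarding a new tenant is just one more entry in this list.
TENANTS = ["tenant_a", "tenant_b", "tenant_c"]

@dataclass
class PipelineConfig:
    tenant: str
    source_db: str        # tenant-specific OLTP source
    target_dataset: str   # tenant-specific OLAP target

def build_config(tenant: str) -> PipelineConfig:
    """Derive an isolated configuration for exactly one tenant."""
    return PipelineConfig(
        tenant=tenant,
        source_db=f"{tenant}_oltp",
        target_dataset=f"{tenant}_warehouse",
    )

def run_pipeline(config: PipelineConfig) -> None:
    """Shared ETL logic; each call touches only one tenant's data."""
    print(f"[{config.tenant}] extracting from {config.source_db}")
    print(f"[{config.tenant}] loading into {config.target_dataset}")
    # A failure here affects only this tenant's pipeline instance.

if __name__ == "__main__":
    # One isolated pipeline instance per tenant, all sharing the same codebase.
    for tenant in TENANTS:
        run_pipeline(build_config(tenant))

In an orchestrator such as Airflow, the same loop would typically emit one DAG per tenant, which is what makes per-tenant backfills, retries, and onboarding possible without touching the other tenants.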


The Value for SaaS Companies

For SaaS product companies dealing with large-scale data, DPaaS offers a scalable and secure way to manage data pipelines. It eliminates the complexities and risks associated with multi-tenant ETL processes, ensuring compliance, operational reliability, and ease of maintenance—all while enabling consistent performance and scalability.

With DPaaS, you can focus on building efficient pipelines and serving your tenants, confident that their data remains secure and isolated.


"As a Service" Mindset : 

One of the foundational principles for successfully implementing a DPaaS solution is adopting an "as a service" mindset. This approach ensures that teams think about problems from a service-oriented perspective, focusing on scalability, flexibility, and tenant-agnostic solutions. I have shared more details in one of my posts here:

https://sqlvikas.blogspot.com/2024/12/embracing-dpaas-mindset.html


Friday, 13 December 2024

Embracing the "DPaaS" Mindset

 

Embracing the "As a Service" Mindset

One of the foundational principles for successfully implementing a DPaaS solution is adopting an "as a service" mindset. This approach ensures that teams think about problems from a service-oriented perspective, focusing on scalability, flexibility, and tenant-agnostic solutions.

When dealing with tenant-specific change requests, it’s crucial for data engineers to approach the challenge with this service mindset. Instead of immediately resorting to quick fixes or hardcoding solutions—like embedding a tenant code directly into the pipeline—they should explore alternatives that align with the service-first philosophy.

Example Scenario

Imagine a tenant requests a new feature or customization in their pipeline. A traditional approach might involve hardcoding the tenant’s unique identifiers or logic directly into the shared pipeline. While this might work in the short term, it introduces risks like:

  • Complicating the pipeline logic over time.
  • Increasing the chances of cross-tenant data leaks or errors.
  • Making future updates harder to implement across multiple tenants.

Instead, the service mindset encourages engineers to ask:

  • "Can we implement this change in a way that benefits all tenants?"
  • "How can we integrate this feature without tenant-specific hardcoding?"
  • "Does this approach scale as more tenants request similar features?"

By prioritizing reusable, tenant-agnostic solutions, engineers can maintain the integrity and scalability of the DPaaS architecture while still addressing tenant-specific needs.

The Service Mindset in Action

For instance, instead of embedding tenant-specific logic, you could leverage configuration-driven designs. This means creating a flexible pipeline framework where tenant-specific behaviors are defined by configurations stored in a metadata layer. Each pipeline instance pulls its configuration dynamically, ensuring isolation while avoiding custom hardcoding.
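As a rough illustration of this configuration-driven design, the Python sketch below keeps tenant-specific behavior in a metadata dictionary instead of in the pipeline code. The metadata store, the settings, and the sample rows are hypothetical; in a real system the configuration would live in a config table or metadata service rather than in a Python dict.

# Sketch of a configuration-driven pipeline: tenant-specific behavior comes
# from metadata, not from hardcoded tenant logic. All names are illustrative.

from typing import Any, Dict, List

# Stand-in for a metadata layer (in practice: a config table or metadata service).
TENANT_METADATA: Dict[str, Dict[str, Any]] = {
    "tenant_a": {"currency": "EUR", "enable_dedup": True},
    "tenant_b": {"currency": "USD", "enable_dedup": False},
}

def load_config(tenant: str) -> Dict[str, Any]:
    """Each pipeline instance pulls its own configuration dynamically."""
    return TENANT_METADATA[tenant]

def transform(rows: List[Dict[str, Any]], config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Shared transformation logic; behavior is driven purely by the config."""
    if config["enable_dedup"]:
        # Deduplicate on a hypothetical 'id' field.
        rows = list({row["id"]: row for row in rows}.values())
    for row in rows:
        row["currency"] = config["currency"]
    return rows

if __name__ == "__main__":
    sample = [{"id": 1, "amount": 10}, {"id": 1, "amount": 10}]
    print(transform(list(sample), load_config("tenant_a")))  # deduplicated, EUR
    print(transform(list(sample), load_config("tenant_b")))  # kept as-is, USD

Because the pipeline code never branches on a tenant name, adding or changing tenant-specific behavior becomes a metadata change rather than a code change.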

This mindset not only simplifies maintenance and updates but also ensures the pipeline remains robust and scalable as the number of tenants grows. By thinking of pipelines as a service rather than a one-off solution, teams can deliver a secure, efficient, and adaptable experience for all clients.

Tuesday, 30 April 2024

Data Mesh

Data Mesh is a relatively new approach to managing and organizing data within organizations, especially large enterprises. It advocates for a decentralized approach to data architecture, where data ownership and management are distributed across different business domains rather than centralized within a single data team or department.

I still remember the major challenges when I proposed a centralized data architecture and built a data team: some people had excellent database technical knowledge but no domain expertise, others had domain expertise but no database technical knowledge, and even the teams with both could process the data yet were unable to utilize it efficiently.

Key principles and concepts of Data Mesh include:

  1. Domain-oriented decentralized data ownership: Data is owned and managed by the individual business domains within an organization. Each domain is responsible for its data, including data governance, quality, and lifecycle management.

  2. Data as a product: Data is treated as a valuable product that is produced, consumed, and reused across the organization. Each domain acts as a data product team, responsible for the end-to-end delivery of data products that meet the needs of their stakeholders.

  3. Self-serve data infrastructure: Domains have autonomy and control over their data infrastructure and tools, allowing them to choose the technologies and solutions that best suit their requirements. This may involve using cloud-based platforms, data lakes, data warehouses, or other data management systems.

  4. Data mesh architecture: Data Mesh advocates for a modular and scalable architecture that enables seamless integration and interoperability of data across domains. This may involve implementing data pipelines, APIs, event-driven architectures, and data mesh platforms to facilitate data sharing and collaboration.

  5. Data governance and standards: While each domain has autonomy over its data, there is a need for common data governance standards, policies, and practices to ensure consistency, compliance, and interoperability across the organization. This may involve establishing data standards, metadata management, and data quality frameworks.

  6. Cross-functional collaboration: Data Mesh encourages collaboration and communication between different business domains, data teams, and stakeholders. This includes fostering a culture of data literacy, collaboration, and knowledge sharing to unlock the value of data across the organization.

Overall, Data Mesh aims to address the challenges of traditional centralized data architectures, such as scalability, agility, and data silos, by promoting a decentralized, domain-oriented approach to data management that empowers individual business domains while ensuring organizational alignment and data governance.
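To make the "data as a product" principle a little more tangible, here is a small, hypothetical Python sketch of a data product descriptor that a domain team might publish alongside its datasets. The fields, names, and values are purely illustrative and not tied to any particular data mesh platform.

# Hypothetical sketch: a minimal "data product" descriptor a domain team could
# publish so consumers know what the product contains, who owns it, and what
# guarantees come with it. All fields and values are illustrative.

from dataclasses import dataclass, field
from typing import List

@dataclass
class DataProduct:
    name: str                   # e.g. "orders_daily"
    owner_domain: str           # business domain accountable for the data
    description: str
    output_location: str        # table, topic, or API where the data is served
    schema_fields: List[str]    # published, versioned schema
    freshness_sla_hours: int    # how stale the data is allowed to become
    quality_checks: List[str] = field(default_factory=list)

# Example descriptor owned by a hypothetical "sales" domain.
orders_daily = DataProduct(
    name="orders_daily",
    owner_domain="sales",
    description="Daily aggregated orders per customer.",
    output_location="warehouse.sales.orders_daily",
    schema_fields=["order_date", "customer_id", "order_count", "revenue"],
    freshness_sla_hours=24,
    quality_checks=["no_null_customer_id", "revenue_non_negative"],
)

print(f"{orders_daily.name} is owned by the {orders_daily.owner_domain} domain")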

Tuesday, 16 May 2023

Access Historic Data Using Time Travel in BigQuery

As a data professional, it's a common requirement to understand the state of data as of a specific date, such as identifying changes made since a particular point in time. While creating a change log or implementing Change Data Capture (CDC) is a typical approach, it's often impractical to maintain such logs for all tables, especially in systems with a significant number of tables. Modern data warehouses often contain more than 50 tables, and maintaining change logs for many of them can become a substantial overhead. Data evolves constantly, and sometimes these changes are extensive, particularly in data warehouse environments. When debugging data issues, having the ability to view the state of the data as of a specific date can be invaluable.


BigQuery offers a great feature that solves this problem with minimal effort and almost no additional development.


BigQuery provides FOR SYSTEM_TIME AS OF, which lets us access data as it existed at a particular point in time.


Step 1: Query the current state of the table.

SELECT
  *
FROM
  `poc.bq-api-test`
LIMIT
  100;

Step 2: Update records.

UPDATE
  `poc.bq-api-test`
SET
  name = 'test2'
WHERE
  id = 1;


Finally:
We can view the data as it existed prior to any changes. This is just a simple example, showcasing the capability to access the state of the data up to 7 days in the past by default. This feature can also be used to restore a table to its state on a specific past date, among other use cases.


SELECT
  *
FROM
  `poc.bq-api-test` FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
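As a sketch of the restore use case mentioned above, the snippet below uses the google-cloud-bigquery Python client to materialize the table as it looked one hour ago into a new table. The restored table name is an assumption made up for this example; adjust the project, dataset, and table names for your own environment.

# Sketch: materialize the table's state from one hour ago into a new table
# using time travel. The restored table name is an illustrative assumption.

from google.cloud import bigquery

client = bigquery.Client()  # uses default credentials and project

restore_sql = """
CREATE OR REPLACE TABLE `poc.bq-api-test-restored` AS
SELECT
  *
FROM
  `poc.bq-api-test` FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
"""

client.query(restore_sql).result()  # runs the statement and waits for completion
print("Snapshot restored to poc.bq-api-test-restored")

The same CREATE OR REPLACE TABLE ... FOR SYSTEM_TIME AS OF statement can also be run directly in the BigQuery console, without any Python at all.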


Conclusion: Using FOR SYSTEM_TIME AS OF, we can travel back to a past point in time (within the time travel window) and see the state of the data. It does have some limitations; the details are available in the GCP documentation here.


Access historical data using time travel  |  BigQuery  |  Google Cloud


Here is how to restore an accidentally deleted BigQuery dataset:
How to restore a deleted BigQuery Dataset (derrickqin.com)