Architecting a data lineage system for BigQuery

Democratization of data within an organization is essential to help users derive innovative insights for growth. In a big data environment, traceability of where the data in the data warehouse originated and how it flows through a business is critical.

Source: Original post, "Architecting a data lineage system for BigQuery"

This traceability information is called data lineage. Being able to track, manage, and view data lineage simplifies tracing data errors, performing forensics, and identifying data dependencies.

In addition, data lineage has become essential for securing business data. An organization’s data governance practices require tracking all movement of sensitive data, including personally identifiable information (PII). Of key concern is ensuring that metadata stays within the customer’s cloud organization or project.

Data Catalog provides a rich interface to attach business metadata to the swathes of data scattered across Google Cloud in BigQuery, Cloud Storage, Pub/Sub or outside Google Cloud in your on-premises data centers or databases. Data Catalog enables you to organize operational/business metadata for data assets using structured tags. Data Catalog structured tags are user-specified and you can use them to organize complex business and operational metadata, such as entity schema, as well as data lineage.

Common data lineage user journeys

Data lineage is useful in a variety of user journeys that require related but different capabilities. Some user journeys need lineage at the level of relationships between data assets such as tables or datasets, while others need column-level lineage for each table. Another category of user journeys traces data from specific rows in a table and is often referred to as row-level lineage.

Here, we’ll describe our proposed architecture, which focuses on the most commonly used (column-level) granularity for automated data lineage and can be used for the following user journeys:

Impact/dependency analysis

Schema modification of existing data assets, like deprecation and replacement of old data assets, is commonplace in enterprises. Data lineage helps you flag the breaking changes and identify specific tables or BI dashboards that will be impacted by the planned changes.
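
As a rough illustration of impact analysis, the sketch below walks a lineage graph downstream from a changed table. The table names and the `(source, target)` edge format are hypothetical, chosen only for this example; they are not the schema of any particular lineage store.

```python
from collections import defaultdict, deque

# Hypothetical table-level lineage edges: (source_table, target_table).
EDGES = [
    ("raw.orders", "staging.orders_clean"),
    ("staging.orders_clean", "mart.daily_revenue"),
    ("staging.orders_clean", "mart.customer_ltv"),
    ("raw.customers", "mart.customer_ltv"),
]


def downstream_assets(edges, changed):
    """Return every asset reachable from `changed` by following lineage edges."""
    children = defaultdict(set)
    for source, target in edges:
        children[source].add(target)

    impacted = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in impacted:  # also guards against cycles
                impacted.add(child)
                queue.append(child)
    return impacted


# Assets that a schema change to raw.orders could break.
print(sorted(downstream_assets(EDGES, "raw.orders")))
```

A breadth-first walk is used here so that directly dependent assets are discovered before more distant ones, but any graph traversal would yield the same set.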

Data leakage/exfiltration

In a self-service analytics environment, accidental data exfiltration is a high risk and can damage the enterprise's reputation. Data lineage helps identify unexpected data movement, so that data egress goes only to approved projects or locations where it is accessible only to approved people.
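
One simple way to picture this check: given recorded data movements, flag any whose target lands in a project outside an approved list. The project names, the `project.dataset.table` reference format, and the `MOVES` list below are assumptions made for this sketch.

```python
# Hypothetical allow-list of projects that may receive data.
APPROVED_PROJECTS = {"analytics-prod", "finance-prod"}

# Hypothetical lineage records: (source_table, target_table),
# each written as "project.dataset.table".
MOVES = [
    ("analytics-prod.sales.orders", "analytics-prod.mart.revenue"),
    ("analytics-prod.sales.customers", "sandbox-dev.scratch.customers_copy"),
]


def project_of(table_ref):
    """Extract the project ID from a 'project.dataset.table' reference."""
    return table_ref.split(".", 1)[0]


def unapproved_egress(moves, approved):
    """Return the moves whose target project is not on the approved list."""
    return [(src, dst) for src, dst in moves if project_of(dst) not in approved]


for src, dst in unapproved_egress(MOVES, APPROVED_PROJECTS):
    print(f"Unexpected data movement: {src} -> {dst}")
```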

Debugging data correctness/quality

Data quality is often compromised by missing or incorrect raw data as well as incorrect data transformations in the data pipelines. Data lineage enables you to traverse the lineage graph back, troubleshoot the data transformations, and trace the data issues all the way to raw data.
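
The backward traversal described above can be sketched at column level as follows. The `(source_column, target_column, transform)` records and the column names are hypothetical; the point is only to show how each hop exposes the transformation to inspect.

```python
from collections import defaultdict

# Hypothetical column-level lineage: (source_column, target_column, transform).
RECORDS = [
    ("raw.orders.amount_cents", "staging.orders.amount_usd", "amount_cents / 100"),
    ("staging.orders.amount_usd", "mart.revenue.total_usd", "SUM(amount_usd)"),
    ("raw.orders.order_ts", "mart.revenue.day", "DATE(order_ts)"),
]


def trace_upstream(records, column):
    """Walk back from `column`, returning (depth, source, target, transform) hops."""
    parents = defaultdict(list)
    for source, target, transform in records:
        parents[target].append((source, transform))

    hops = []
    stack = [(column, 0)]
    seen = {column}
    while stack:
        current, depth = stack.pop()
        for source, transform in parents.get(current, []):
            hops.append((depth, source, current, transform))
            if source not in seen:
                seen.add(source)
                stack.append((source, depth + 1))
    return hops


# Print each hop from the suspect column back to raw data.
for depth, source, target, transform in trace_upstream(RECORDS, "mart.revenue.total_usd"):
    print("  " * depth + f"{target} <- {source} via {transform}")
```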

Validating data pipelines

Compliance requirements oblige you to ensure that all approved data assets source data exclusively from authorized data sources, and that data pipelines are not erroneously using, for instance, a table that an analyst created for their own use or a table that still contains PII. Data lineage lets you validate and certify that data pipelines adhere to governance requirements.
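
A validation of this kind could look roughly like the sketch below: find the root sources feeding an asset and report any that are not on an authorized list. The table names and the authorized set are invented for illustration.

```python
# Hypothetical table-level lineage edges: (source_table, target_table).
PIPELINE_EDGES = [
    ("raw.sales", "staging.sales_clean"),
    ("staging.sales_clean", "mart.report"),
    ("sandbox.analyst_tmp", "mart.report"),
]

# Hypothetical list of sources that governance has signed off on.
AUTHORIZED_SOURCES = {"raw.sales"}


def unauthorized_sources(edges, asset, authorized):
    """Return root sources of `asset` that are not in `authorized`."""
    parents = {}
    for source, target in edges:
        parents.setdefault(target, set()).add(source)

    roots = set()
    stack = [asset]
    seen = {asset}
    while stack:
        node = stack.pop()
        upstream = parents.get(node)
        if not upstream:
            roots.add(node)  # nothing feeds this node, so it is a root source
            continue
        for parent in upstream:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return roots - set(authorized)


print(unauthorized_sources(PIPELINE_EDGES, "mart.report", AUTHORIZED_SOURCES))
```

An empty result would mean every root source of the asset is authorized; anything returned is a candidate for review before the pipeline is certified.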

Introspection for data scientists

Most data scientists require a close examination of the data lineage graph to really understand the usability of data for their intended purpose. By traversing the data lineage graph and examining the data transformations, you get critical insights into how the data asset was built and how it can be used for building ML models or for generating business insights.

Lineage extraction system

A passive data lineage system is suitable for SQL data warehouses like BigQuery. The lineage extraction process starts by identifying the source entities that the SQL query uses to generate the target entity. Parsing a query requires schema information for its source entities, which comes from the Schema Provider. The Grammar Provider is then used to identify the relationship between output columns and source columns, along with the list of functions/transforms applied to each output column. Here’s a look at the procedure to derive lineage:

[Figure: procedure to derive lineage]

The extracted lineage is recorded using a data model based on a tuple of source, target, and transform information.
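
To make the tuple concrete, here is a minimal sketch of such a record as a Python dataclass. The class name, field names, and example values are assumptions for illustration and are not the schema used by any specific implementation.

```python
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class ColumnLineage:
    """One lineage record: which sources feed a target, and through what transforms."""

    target: str                       # fully qualified output column
    sources: Tuple[str, ...]          # input columns the target is derived from
    transforms: Tuple[str, ...] = ()  # functions applied, if any


# Hypothetical record for an aggregated revenue column.
record = ColumnLineage(
    target="mart.revenue.total_usd",
    sources=("staging.orders.amount_usd",),
    transforms=("SUM",),
)

# Convert to a plain dict, e.g. before serializing the record for storage.
row = asdict(record)
print(row)
```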

A cloud-native lineage solution for your BigQuery serverless data warehouse would consume the BigQuery audit logs in real time from Pub/Sub. An extraction Dataflow pipeline parses the query’s SQL using the ZetaSQL grammar engine, retrieves the table schema from the BigQuery API, and persists the generated lineage in a BigQuery table and as a tag in Data Catalog. The lineage table can then be queried to identify the complete flow of data in the data warehouse. Here’s a look at the architecture:

[Figure: lineage extraction architecture]
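
As a rough sketch of the first step in such a pipeline, the function below pulls the SQL text and destination table out of an audit log entry received as a Pub/Sub message payload. It assumes the entry uses the legacy `AuditData` layout (`protoPayload.serviceData.jobCompletedEvent`); newer log formats place these fields elsewhere, so verify the paths against your own log entries. The function name and the sample payload are hypothetical, and SQL parsing itself is out of scope here.

```python
import json


def extract_query_job(message_data: bytes):
    """Return (target_table, sql) from an audit log entry, or None if absent.

    Assumes the legacy AuditData layout; other log formats differ.
    """
    entry = json.loads(message_data.decode("utf-8"))
    job = (
        entry.get("protoPayload", {})
        .get("serviceData", {})
        .get("jobCompletedEvent", {})
        .get("job", {})
    )
    query_config = job.get("jobConfiguration", {}).get("query", {})
    sql = query_config.get("query")
    destination = query_config.get("destinationTable")
    if not sql or not destination:
        return None  # not a query job, or no destination recorded
    target = ".".join(
        destination.get(key, "") for key in ("projectId", "datasetId", "tableId")
    )
    return target, sql


# Hypothetical payload shaped like a completed query job.
sample = json.dumps({
    "protoPayload": {"serviceData": {"jobCompletedEvent": {"job": {
        "jobConfiguration": {"query": {
            "query": "SELECT SUM(amount_usd) AS total_usd FROM staging.orders",
            "destinationTable": {
                "projectId": "my-project",
                "datasetId": "mart",
                "tableId": "revenue",
            },
        }}
    }}}}
}).encode("utf-8")

print(extract_query_job(sample))
```

The returned pair is what a downstream step would hand to the SQL parser together with the source table schemas.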

Try data lineage for yourself

Enough talk! Deploy your own BigQuery data lineage system by cloning the bigquery-data-lineage GitHub repository, or take it a step further by dynamically propagating the data access policy to derived tables based on the lineage signals.