Building an ESG Data Pipeline: Architecture Guide for Engineering Teams
ESG reporting is fundamentally a data problem. The frameworks — ESRS, GRI, TCFD — specify what to report. The hard engineering work is building reliable pipelines that collect, normalize, calculate, and audit the underlying numbers. Most companies underestimate this until they are six weeks from a filing deadline.
This guide is written for engineering teams tasked with building or evaluating ESG data infrastructure. We cover the full pipeline from ingestion to audit-ready output, with concrete technology choices and the architectural tradeoffs that matter for compliance workloads.
The ESG Data Problem in Concrete Terms
A typical medium-sized tech company needs to track and report:
- Scope 1 emissions: Direct combustion (company vehicles, on-premise generators) — usually low data volume, straightforward sourcing
- Scope 2 emissions: Purchased electricity and heat — requires utility bill data, location-based vs market-based method calculations, renewable energy certificate (REC) tracking
- Scope 3 emissions (15 categories): Business travel, employee commute, purchased goods and services, use of sold products, cloud infrastructure, waste — high volume, heterogeneous sources, highest uncertainty
- Social metrics (S1): Headcount by gender/ethnicity/contract type, pay gap analysis, training hours, turnover — sourced from HR systems
- Governance metrics (G1): Policy documentation, incident registers, supplier assessments — semi-structured document sources
The challenge is that these data points live in: email inboxes (utility PDFs), ERP systems (Accounts Payable), HR platforms (Workday, BambooHR), travel booking systems (Concur, TravelPerk), cloud provider dashboards (AWS, GCP, Azure), and spreadsheets maintained by finance. None of these systems were designed to talk to each other for ESG purposes.
Pipeline Architecture Overview
A production ESG data pipeline has five logical layers:
[Source Systems] → [Ingestion Layer] → [Normalization Layer] → [Calculation Engine] → [Audit Store] → [Reporting Layer]
Let's walk through each.
Layer 1: Ingestion
The ingestion layer connects to source systems and pulls raw data into a staging area. ESG ingestion is unusual because it spans three fundamentally different data types:
Structured API Sources
Cloud carbon APIs (AWS Customer Carbon Footprint Tool, GCP Carbon Footprint, Azure Emissions Insights) export JSON/CSV data with varying schemas. Each provider uses different system boundary definitions, allocation methodologies, and update frequencies. Ingestion here is a standard ETL problem — poll on a schedule, normalize schema differences, store raw responses with timestamps.
Key considerations:
- AWS updates carbon data monthly, with a 3-month lag
- GCP provides near-real-time data but uses market-based methodology by default
- Azure requires the Emissions Impact Dashboard to be explicitly enabled per subscription
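The staging pattern above can be sketched as a small envelope function. This is an illustrative sketch, not any provider's real API schema — the payload fields and field names are assumptions; the point is that the raw response is stored untouched alongside an ingestion timestamp so later normalization can always be re-run from source.

```python
import json
from datetime import datetime, timezone

def stage_raw_response(provider: str, payload: dict) -> dict:
    """Wrap a raw carbon-API response in a staging envelope.

    The raw payload is stored byte-for-byte so normalization logic
    can change later without losing the original data.
    """
    return {
        "provider": provider,                        # "aws", "gcp", "azure"
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "schema_version": 1,                         # bump when envelope changes
        "raw": json.dumps(payload, sort_keys=True),  # untouched source payload
    }

record = stage_raw_response("aws", {"kg_co2e": 1240.5, "period": "2025-11"})
```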
Semi-Structured Document Sources
Utility bills, fuel receipts, travel invoices, and supplier questionnaires arrive as PDFs or images. This requires an OCR pipeline with extraction logic tuned for energy documents.
Architecture pattern:
- Document arrives via email (IMAP polling), file upload, or API webhook
- Pre-processing: image normalization, multi-page PDF splitting
- OCR: Tesseract or a commercial service (Google Document AI, AWS Textract) for base text extraction
- Extraction: LLM-based or rule-based parser to identify: vendor name, billing period, consumption figure, unit (kWh, MWh, therms, GJ), location
- Validation: cross-check extracted values against expected ranges (flag a 10× spike for human review)
- Staging: store extracted values with confidence scores and source document reference
The extraction step is where most custom engineering effort goes. Utility bill formats vary widely by geography and provider. A fine-tuned extraction model on domain-specific bills dramatically outperforms generic OCR post-processing.
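The validation step from the pattern above is simple to sketch. This is a minimal example of the range check, assuming a per-site history of prior readings; the 10× spike threshold matches the rule mentioned earlier, but the right threshold is a policy decision.

```python
def validate_consumption(extracted_kwh: float, history: list[float],
                         spike_factor: float = 10.0) -> str:
    """Range-check an OCR-extracted consumption figure.

    Returns "ok" or "review" -- flagged values go to a human queue
    rather than being silently dropped or silently accepted.
    """
    if not history:
        return "review"  # no baseline yet: always review a site's first bill
    baseline = sum(history) / len(history)
    if extracted_kwh <= 0 or extracted_kwh > spike_factor * baseline:
        return "review"
    return "ok"
```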
Human-Reported Data
Employee commute surveys, supplier questionnaires, and waste manifest data involve structured collection forms feeding into a database. This layer is less technically complex but requires careful schema design — you need to capture both the reported value and the metadata (who reported, when, what methodology they used).
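A minimal schema for such a record might look like the following. The field names are illustrative, not a standard — the point is that the reported value never travels without its who/when/how metadata.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)  # frozen: reported values are never mutated in place
class ReportedValue:
    """A human-reported data point plus the metadata auditors ask about."""
    metric: str            # e.g. "commute_km_per_week"
    value: float
    unit: str
    reported_by: str       # who supplied the figure
    methodology: str       # e.g. "employee survey, 2-week sample"
    reported_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

row = ReportedValue("commute_km_per_week", 86.0, "km",
                    "survey:2026-q1", "employee survey, 2-week sample")
```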
Layer 2: Normalization
Raw data from different sources uses different units, time periods, and reference frames. Normalization standardizes everything before calculation.
Unit Normalization
Energy data arrives in kWh, MWh, therms, GJ, BTU, and tonnes of oil equivalent (toe). All must be converted to a common unit (typically MWh or GJ) using standardized conversion factors. Store the conversion factor applied and its source (IPCC, IEA, DEFRA) alongside every converted figure.
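A sketch of this pattern, converting to MWh while recording the factor and its source on every output row. The kWh/MWh/GJ factors below are standard SI conversions; in production each row of the table should cite its published source.

```python
# Conversion factors to MWh with a provenance note per row.
TO_MWH = {
    "kWh": (0.001, "SI definition"),
    "MWh": (1.0, "identity"),
    "GJ":  (0.277778, "1 GJ = 0.277778 MWh (SI)"),
}

def normalize_energy(value: float, unit: str) -> dict:
    """Convert to MWh, keeping the factor and its source for lineage."""
    factor, source = TO_MWH[unit]
    return {"value_mwh": value * factor,
            "factor": factor,
            "factor_source": source,
            "original": (value, unit)}   # original figure is never discarded
```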
Temporal Alignment
Fiscal year boundaries, billing cycles, and reporting periods rarely align. You need a temporal mapping layer that:
- Prorates utility bills that span reporting period boundaries
- Handles retroactive data corrections (a utility provider issuing a corrected bill for Q2 in October)
- Tracks data vintage (when was this figure collected vs the period it covers)
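The proration step can be sketched with day-count allocation. Linear proration by days is a common simplifying assumption when interval meter data is unavailable — if you use it, document it as part of your methodology.

```python
from datetime import date

def prorate(bill_start: date, bill_end: date, consumption: float,
            period_start: date, period_end: date) -> float:
    """Allocate a bill's consumption to a reporting period by day count.

    Assumes consumption is uniform across the billing period (a
    methodology choice that should be disclosed).
    """
    overlap_start = max(bill_start, period_start)
    overlap_end = min(bill_end, period_end)
    overlap_days = (overlap_end - overlap_start).days + 1
    if overlap_days <= 0:
        return 0.0
    bill_days = (bill_end - bill_start).days + 1
    return consumption * overlap_days / bill_days
```

For example, a 3,100 kWh bill covering 16 Dec 2025 to 15 Jan 2026 allocates 16 of its 31 days to calendar-year 2025.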
Organizational Scope Mapping
ESRS and GHG Protocol both require clear organizational boundaries. For a multi-entity company, you need to map each data point to the correct legal entity, location, and operational boundary (equity share vs control approach). This mapping table needs to be version-controlled — organizational structures change, and historical data must remain mapped to the structure that existed at reporting time.
Layer 3: Calculation Engine
The calculation engine applies emission factors to activity data to produce GHG quantities in CO₂e (carbon dioxide equivalent).
Emission Factor Databases
The core reference datasets:
| Source | Coverage | Update Frequency | Use Case |
|--------|----------|------------------|----------|
| DEFRA UK Government GHG Conversion Factors | UK + international | Annual (March/April) | Travel, fuel, freight |
| IEA Electricity Emission Factors | Country-level grid factors | Annual | Location-based Scope 2 |
| IPCC AR6 GWP values | GHG-level global warming potentials | Per assessment cycle (~7 years) | All GHG calculations |
| US EPA eGRID | US regional grid factors | Annual | US location-based Scope 2 |
| Ecoinvent | Comprehensive LCA database | ~Annual | Scope 3 Category 1 (purchased goods) |
Your calculation engine needs to version-control which emission factor dataset and version was used for each calculation. CSRD and GHG Protocol both require disclosure of the methodologies and sources used — a calculation made with a 2023 DEFRA factor produces a different result than one made with 2025 factors, and both need to be reproducible.
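One way to make this concrete is to key the factor store by dataset, version, and activity, so every calculation records exactly which row it used. The factor values below are illustrative placeholders, not real published DEFRA figures.

```python
# Hypothetical versioned factor store. Keys: (dataset, version, activity).
# Values here are illustrative only -- use the published releases in production.
FACTORS = {
    ("DEFRA", "2023", "natural_gas_kwh"): 0.1830,
    ("DEFRA", "2025", "natural_gas_kwh"): 0.1828,
}

def lookup_factor(dataset: str, version: str, activity: str) -> dict:
    """Return the factor plus a provenance string for the audit trail."""
    value = FACTORS[(dataset, version, activity)]
    return {"kg_co2e_per_unit": value,
            "provenance": f"{dataset} {version} / {activity}"}
```

Because the version is part of the key, recalculating a 2023-vintage figure years later reproduces the original result exactly.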
Calculation Logic
For Scope 1 and 2, calculation is deterministic:
emissions_kg_CO2e = activity_data_value × unit_conversion_factor × emission_factor_kg_CO2e_per_unit
For Scope 3, each of the 15 categories has a different calculation approach:
- Category 1 (Purchased goods and services): Spend-based or supplier-specific method. Spend-based uses EEIO (environmentally extended input-output) models by industry code — high uncertainty, but tractable without supplier data.
- Category 6 (Business travel): Distance-based using booking data (cabin class × distance × DEFRA factor, with the radiative-forcing uplift applied for aviation)
- Category 11 (Use of sold products): For SaaS, this is typically zero or near-zero (no direct energy use in the product itself). For hardware products, this requires lifetime energy use modeling.
- Category 8 (Upstream leased assets): Relevant if you lease data center space or infrastructure from a provider. (Category 13, downstream leased assets, covers the reverse case: assets you own and lease out to others.)
Build the calculation engine as a library of pure functions, one per category, with explicit inputs (activity data, emission factor, methodology choice) and outputs (CO₂e value, uncertainty estimate, methodology documentation string). This architecture makes unit testing trivial and calculation logic auditable.
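A minimal sketch of one such pure function, matching the Scope 1/2 formula given above. The factor value and reference string here are illustrative.

```python
from typing import NamedTuple

class CalcResult(NamedTuple):
    kg_co2e: float
    methodology: str   # human-readable documentation string for the audit trail

def scope1_stationary_combustion(activity_value: float,
                                 unit_conversion: float,
                                 factor_kg_co2e_per_unit: float,
                                 factor_ref: str) -> CalcResult:
    """Pure function: no I/O, no hidden state -- trivially unit-testable."""
    kg = activity_value * unit_conversion * factor_kg_co2e_per_unit
    return CalcResult(kg, f"activity x conversion x factor ({factor_ref})")

result = scope1_stationary_combustion(10_000, 1.0, 0.18, "illustrative factor")
```

Because inputs and outputs are explicit, each category function can be tested against hand-calculated fixtures and its methodology string dropped straight into the audit store.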
Layer 4: Audit Store
This is where most ESG data platforms cut corners, and where auditors will focus. The audit store must satisfy three requirements:
Immutability
Once a data point is committed to the audit store, it cannot be modified in place. Corrections must be applied as new records with an amendment reference to the original. Use append-only storage — event-sourced database patterns, immutable S3 objects, or a ledger-style database.
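The amendment pattern can be sketched with an in-memory append-only store (standing in for an event-sourced database or immutable object storage — the persistence layer is an assumption here):

```python
class AuditStore:
    """Append-only store: corrections are new records, never in-place edits."""

    def __init__(self):
        self._records = []

    def append(self, data: dict, amends=None) -> int:
        rec_id = len(self._records)
        # A correction carries a reference to the record it amends.
        self._records.append({"id": rec_id, "amends": amends, **data})
        return rec_id

    def current(self, rec_id: int) -> dict:
        """Follow the amendment chain to the latest version of a record."""
        latest = self._records[rec_id]
        for rec in self._records:        # amendments are always appended later
            if rec["amends"] == latest["id"]:
                latest = rec
        return latest
```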
Full Lineage
Every reported figure must trace back to:
- The source document or API response (stored in raw form)
- The ingestion timestamp and system
- Any transformations applied
- The emission factor used and its version
- The calculation performed
- The human reviewer who validated (if applicable)
This is not just good practice — CSRD's limited assurance requirement means your auditor will ask to see this chain for sampled data points. A lineage gap is a finding.
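A lineage record that satisfies the chain above might look like the following. Every field name, path, and value here is illustrative, not a standard schema — the check simply enforces that no required link is missing before a figure is allowed into a report.

```python
# Illustrative lineage envelope for one reported figure.
lineage = {
    "figure": {"metric": "scope2_location_based_tco2e", "value": 412.7},
    "source_document": "s3://esg-raw/utility/2025-11/bill-0481.pdf",
    "ingested_at": "2025-12-02T09:14:00Z",
    "transformations": ["ocr_extract v1.3", "kwh_to_mwh", "prorate_fy2025"],
    "emission_factor": {"dataset": "IEA", "version": "2025", "row": "DE grid"},
    "calculation": "activity x factor (location-based)",
    "reviewed_by": "jane.doe",          # optional: human validation step
}

def lineage_complete(rec: dict) -> bool:
    """Reject any figure whose chain back to source has a gap."""
    required = {"figure", "source_document", "ingested_at",
                "transformations", "emission_factor", "calculation"}
    return required <= rec.keys()
```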
Access Control and Timestamping
All access to audit-store data should be logged. Role-based access control prevents post-hoc modification. Cryptographic timestamping (RFC 3161 trusted timestamping) provides non-repudiation evidence for key report submissions.
Layer 5: Reporting Layer
The reporting layer generates the actual disclosures. At minimum this means:
- Quantitative data export: Structured outputs (JSON, CSV) in formats compatible with ESRS data point taxonomy (the XBRL taxonomy for ESRS digital reporting, required for machine-readable submissions)
- Narrative template population: Auto-populating disclosure templates with calculated figures, reducing manual copy-paste errors
- Year-over-year comparison: Delta calculation with automated flags for material changes requiring explanatory narrative
- Assurance package generation: Automated assembly of the evidence package your auditor needs — source documents, calculation records, lineage traces — for sampled data points
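The year-over-year flag above reduces to a small comparison. The 10% threshold here is an assumed placeholder — the actual materiality threshold is a reporting-policy decision, not an engineering one.

```python
def yoy_flag(current: float, prior: float, threshold: float = 0.10) -> bool:
    """Flag a year-over-year change that exceeds the materiality threshold,
    signalling that explanatory narrative is required."""
    if prior == 0:
        return current != 0   # any movement from a zero baseline is material
    return abs(current - prior) / abs(prior) > threshold
```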
Build vs Buy Decision
The market has two types of ESG software:
Full-stack ESG platforms (Persefoni, Watershed, Sweep, Plan A): Handle reporting templates, some data collection, workflow management. Strong on UI and framework alignment. Weak on custom data ingestion — they expect clean structured data and struggle with complex OCR or non-standard source systems.
Data infrastructure tools (dbt, Airbyte, custom pipelines): Give you full control over ingestion and transformation logic but require building the ESG domain logic yourself.
The practical recommendation: use a reporting platform for framework alignment and stakeholder-facing output, but build or customize the data ingestion and normalization layers. The ingestion layer is where your organization's specifics (which ERP, which HR system, which cloud providers) determine what's needed — no off-the-shelf platform handles all combinations well.
Common Engineering Mistakes
1. Storing only aggregated figures. Once you aggregate, you lose the lineage. Store every intermediate step.
2. Manual entry without validation. Human-entered values need range checks and consistency validation at input time, not at report generation time.
3. Ignoring data corrections. Utility companies issue corrected bills. Cloud providers retroactively update carbon data. Your pipeline needs a correction workflow that propagates source-side changes through to reported figures without corrupting the audit history.
4. Single-year data stores. CSRD requires comparative data. Design your schema for multi-year from day one.
5. Conflating methodology versions. If you change from location-based to market-based Scope 2 calculation methodology, the comparison between years is not apples-to-apples. Track methodology versions as first-class entities.
Automation Workflows for ESG Reporting
The most impactful shift companies make when moving from spreadsheet-based ESG reporting to a production pipeline is replacing manual collection steps with automated workflows:
- Scheduled API pulls — Cloud carbon APIs (AWS, GCP, Azure) update monthly; automated ingestion means you never fall behind
- Email-triggered OCR pipelines — Utility bills arriving in a shared inbox automatically route to an extraction queue
- HR system sync — Automated exports from Workday or BambooHR keep workforce metrics current without manual intervention
- Audit trail generation — Every pipeline run produces an immutable log, ready for assurance review
If your team is also pursuing SOC2 certification, the access controls, change management, and audit logging you build for ESG data infrastructure directly satisfy SOC2 security controls. Building both together saves significant engineering time — see our Security & SOC2 Compliance offering for how we approach this.
Getting Started in Three Weeks
A credible MVP ESG data pipeline can be delivered in three weeks if you scope it correctly:
- Week 1: Ingestion layer for your two highest-volume data sources (typically cloud APIs + utility bill OCR). Raw storage with schema versioning.
- Week 2: Normalization and calculation engine for Scope 1 and 2. Emission factor database integration (DEFRA + IEA as a baseline). Audit store with lineage tracking.
- Week 3: Reporting layer export in ESRS format. Assurance package generation. Access controls.
Scope 3 is a follow-on phase — the data collection complexity is higher and requires supplier engagement that cannot be compressed into a three-week sprint.
Talk to us about ESG compliance — 100xai.engineering/solutions/esg-compliance
We build production ESG data pipelines for tech companies preparing for CSRD and investor disclosure requirements. Our three-week delivery model gets you from zero to audit-ready infrastructure without the 18-month consultancy engagement.
Emission factor sources cited reflect publicly available datasets as of early 2026. Factor versions and coverage change annually — always verify against the latest published release from DEFRA, IEA, or EPA for production calculations.
Related Resources
More articles:
- How AI is Transforming ESG Reporting
- CSRD Compliance Checklist 2026
- CSRD Compliance for Tech Companies
Our solutions: ESG Compliance Engineering · Security & SOC2 Compliance
Free Tool: Check your CSRD obligations and get a readiness score with prioritized actions. → CSRD Readiness Calculator