
Designing a GCP Cloud ETL Pipeline for 10TB/Day with Sub-Second Latency

July 11, 2025 | 4 Minute Read

Using Cloud Functions (GCS trigger), Pub/Sub, Dataflow & BigQuery

At Improving, we architected a cloud-native, real-time ETL pipeline on Google Cloud Platform (GCP) that could:

  • Ingest 10TB of data per day

  • Maintain sub-second end-to-end latency

  • Remain fully serverless, reliable, and production-grade

This post details our architecture, the challenges we encountered, and how we optimized the system to meet both performance and cost-efficiency goals.

Objectives

  • Ingestion volume: 10TB/day (~115MB/sec sustained)

  • Latency target: < 1 second from data upload to visibility in BigQuery

  • Design constraints: Serverless, autoscaling, minimal ops

  • Core services: Cloud Functions, Pub/Sub, Dataflow (Apache Beam), BigQuery

Architecture Design & Overview


[Data Producers → GCS Upload]
        ↓  (object finalization event)
[Cloud Function Triggered by GCS]
        ↓
[Pub/Sub Topic (Buffer)]
        ↓
[Dataflow Streaming Job]
        ↓
[BigQuery (Analytics Layer)]

Cloud Function Triggered by GCS Upload

Our ingestion pipeline begins when files are uploaded to Google Cloud Storage. A Cloud Function is automatically triggered by GCS “object finalized” events.

This function (a sketch follows this list):

  • Parses GCS metadata (bucket, filename, timestamp, source)

  • Adds custom tags or schema versioning info

  • Publishes a message to a Pub/Sub topic with metadata
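A minimal sketch of such a function in Java, assuming the Java runtime; the project ID, topic name, and schema tag below are illustrative, not our production values:

```java
import com.google.cloud.functions.CloudEventsFunction;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;
import io.cloudevents.CloudEvent;

public class GcsToPubSub implements CloudEventsFunction {

  // Created once per instance and reused across invocations to avoid
  // per-call socket and auth overhead.
  private static final Publisher publisher;

  static {
    try {
      publisher = Publisher.newBuilder(
          TopicName.of("my-project", "ingest-topic")).build(); // illustrative names
    } catch (Exception e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  @Override
  public void accept(CloudEvent event) throws Exception {
    // The "object finalized" payload is a JSON document describing the object.
    JsonObject gcsEvent = JsonParser
        .parseString(new String(event.getData().toBytes()))
        .getAsJsonObject();

    PubsubMessage message = PubsubMessage.newBuilder()
        .setData(ByteString.copyFromUtf8(gcsEvent.toString()))
        .putAttributes("bucket", gcsEvent.get("bucket").getAsString())
        .putAttributes("name", gcsEvent.get("name").getAsString())
        .putAttributes("schemaVersion", "v1") // illustrative schema tag
        .build();

    // Block until Pub/Sub acks, so a failed publish surfaces as a function
    // error and GCS redelivers the event.
    publisher.publish(message).get();
  }
}
```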

Optimizations

  • Allocated enough memory (~512MB) to reduce cold starts

  • Reused network clients to avoid socket overhead

  • Enabled up to 1,000 concurrent executions while avoiding retry deadlocks (deployment settings sketched below)
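Those settings map onto deploy-time flags. A representative deployment command, where the function, bucket, and region names are placeholders:

```
gcloud functions deploy gcs-to-pubsub \
  --runtime=java17 \
  --entry-point=GcsToPubSub \
  --trigger-bucket=my-ingest-bucket \
  --memory=512MB \
  --max-instances=1000 \
  --region=us-central1
```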

Pub/Sub: Decoupling & Buffering

Pub/Sub acted as our message buffer, enabling:

  • Horizontal scaling of downstream consumers

  • Replay capabilities in case of downstream failure

  • Backpressure protection via flow control

Each message carried metadata needed by Dataflow to load, transform, and route the data.
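For illustration, a message body carrying that metadata might look like the following; the field names are this sketch's convention, not anything Pub/Sub mandates:

```json
{
  "bucket": "my-ingest-bucket",
  "name": "source-a/2025/07/11/events-000123.avro",
  "timestamp": "2025-07-11T10:15:02Z",
  "source": "source-a",
  "schemaVersion": "v1"
}
```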

Dataflow: Stream Processing Engine

We used Dataflow (Apache Beam, Java SDK) in streaming mode to do the following (a pipeline sketch appears after the list):

  • Fetch and parse raw file content from GCS

  • Apply schema-aware transformations

  • Handle deduplication using windowing and watermarking

  • Write enriched data to BigQuery with sub-second delay
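A trimmed-down sketch of such a job. The subscription, table, and column names are hypothetical, and fetching/chunking the referenced GCS object is collapsed into a single DoFn:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class IngestPipeline {

  /** Turns one Pub/Sub notification into BigQuery rows. In the real job this
   *  is where the referenced GCS object is fetched, chunked, and parsed. */
  static class ToRowsFn extends DoFn<String, TableRow> {
    @ProcessElement
    public void process(@Element String json, OutputReceiver<TableRow> out) {
      com.google.gson.JsonObject o =
          com.google.gson.JsonParser.parseString(json).getAsJsonObject();
      out.output(new TableRow()
          .set("source_id", o.get("source").getAsString())
          .set("event_type", "object_finalized")
          .set("object_name", o.get("name").getAsString()));
    }
  }

  public static void main(String[] args) {
    StreamingOptions opts =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
    opts.setStreaming(true);

    Pipeline p = Pipeline.create(opts);
    p.apply("ReadPubSub", PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/ingest-sub"))
        .apply("ToRows", ParDo.of(new ToRowsFn()))
        .setCoder(TableRowJsonCoder.of())
        // Short fixed windows bound how long deduplication state is held,
        // which helps keep the watermark (and end-to-end latency) tight.
        .apply("Window1s", Window.into(FixedWindows.of(Duration.standardSeconds(1))))
        .apply("Dedup", Distinct.withRepresentativeValueFn(
                (TableRow r) -> (String) r.get("object_name"))
            .withRepresentativeType(TypeDescriptors.strings()))
        .apply("WriteBQ", BigQueryIO.writeTableRows()
            .to("my-project:analytics.landing")
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
            .withCreateDisposition(CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));
    p.run();
  }
}
```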

Dataflow Optimizations

  • Enabled Streaming Engine + Runner v2 (options sketched after this list)

  • Used dynamic work rebalancing and auto-scaling (5–200 workers)

  • Split large files into smaller chunks (ParDo + batching)
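These settings correspond to standard Dataflow pipeline options; roughly, and with the worker counts mirroring the range above:

```java
import java.util.Arrays;
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerConfig {
  static DataflowPipelineOptions dataflowOptions() {
    DataflowPipelineOptions opts =
        PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    opts.setRunner(DataflowRunner.class);
    opts.setStreaming(true);
    opts.setEnableStreamingEngine(true);                  // Streaming Engine
    opts.setExperiments(Arrays.asList("use_runner_v2"));  // Runner v2
    opts.setNumWorkers(5);      // starting size of the worker pool
    opts.setMaxNumWorkers(200); // autoscaling ceiling
    return opts;
  }
}
```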

BigQuery: Real-Time Analytics Layer

Data was streamed into a landing table in BigQuery via streaming inserts. A separate scheduled transformation job partitioned and clustered the data into curated tables (an example query follows the list below).

BigQuery Optimizations

  • Partitioned by ingestion date

  • Clustered by source_id, event_type

  • Compacted daily files using scheduled queries

  • Used streaming buffer metrics to detect lag spikes
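As a concrete illustration, the partition/cluster layout and the scheduled transformation can be expressed in SQL; the dataset, table, and column names here are hypothetical:

```sql
-- Curated table: partitioned by ingestion date and clustered by the
-- most common filter columns.
CREATE TABLE IF NOT EXISTS `my-project.analytics.curated`
(
  ingestion_date DATE,
  source_id STRING,
  event_type STRING,
  payload JSON
)
PARTITION BY ingestion_date
CLUSTER BY source_id, event_type;

-- Scheduled query: periodically fold newly landed rows into the curated table.
INSERT INTO `my-project.analytics.curated` (ingestion_date, source_id, event_type, payload)
SELECT DATE(ingest_ts), source_id, event_type, payload
FROM `my-project.analytics.landing`
WHERE ingest_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
```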

Bottlenecks & Challenges


Recovery and Observability

  • Retries: Cloud Functions and Pub/Sub retries + backoff

  • Dead-letter queues: For malformed events or schema mismatches (subscription wiring sketched after this list)

  • Logging: Centralized with Cloud Logging and custom labels

  • Metrics: Data freshness, latency, throughput, and error counts

  • Reprocessing: Manual or scheduled Dataflow replay from GCS + Pub/Sub
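A sketch of wiring the retry and dead-letter pieces together with the Pub/Sub admin client; the project, topic, and subscription names are placeholders:

```java
import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.protobuf.Duration;
import com.google.pubsub.v1.DeadLetterPolicy;
import com.google.pubsub.v1.RetryPolicy;
import com.google.pubsub.v1.Subscription;

public class SubscriptionSetup {
  public static void main(String[] args) throws Exception {
    try (SubscriptionAdminClient admin = SubscriptionAdminClient.create()) {
      admin.createSubscription(
          Subscription.newBuilder()
              .setName("projects/my-project/subscriptions/ingest-sub")
              .setTopic("projects/my-project/topics/ingest-topic")
              // Exponential backoff between redelivery attempts.
              .setRetryPolicy(RetryPolicy.newBuilder()
                  .setMinimumBackoff(Duration.newBuilder().setSeconds(1))
                  .setMaximumBackoff(Duration.newBuilder().setSeconds(60)))
              // After repeated failures, park the message for inspection
              // instead of retrying forever.
              .setDeadLetterPolicy(DeadLetterPolicy.newBuilder()
                  .setDeadLetterTopic("projects/my-project/topics/ingest-dead-letter")
                  .setMaxDeliveryAttempts(5))
              .build());
    }
  }
}
```

For replay, Pub/Sub's seek feature can rewind a subscription to an earlier timestamp or snapshot, which is what makes the manual and scheduled reprocessing above possible.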

Monitoring, Cost Optimization & Failure Recovery

Building a highly performant ETL pipeline isn’t enough; you also need observability and resilience. Here’s how we ensured visibility, cost control, and robust error handling in our 10TB/day GCP ETL pipeline.

Monitoring the Pipeline

We leveraged Cloud Monitoring and Logging to track every component in real time:

Cloud Functions:

  • Monitored invocation count, execution duration, and error rates using Cloud Monitoring dashboards.

  • Configured alerts for failed executions or abnormal spikes in duration.

Pub/Sub:

  • Tracked message backlog and throughput via Pub/Sub metrics.

  • Alerts were set to detect message backlog growth, indicating slow or failing subscribers.

Dataflow:

  • Monitored autoscaling behavior, CPU utilization, memory usage, and system lag.

  • Configured custom lag alerts to catch delays in processing (e.g., high watermarks lagging behind ingestion).

  • Used Dataflow job logs to debug transformation errors, memory issues, or skewed partitions.

BigQuery:

  • Monitored streaming insert errors and throughput using the BigQuery audit logs.

  • Tracked query cost, slot usage, and latency to optimize performance and budget.

Tooling: Cloud Monitoring dashboards were customized for our pipeline, with charts for:

  • Throughput (files/min, records/sec)

  • Latency (end-to-end from GCS to BQ)

  • Job failure alerts

  • Cost metrics per service (BQ storage/query, Dataflow VMs, etc.)

Results

  • Scaled to 10TB/day sustained load

  • Achieved end-to-end latency of ~600ms on average

  • Fully serverless and autoscaling with minimal ops overhead

  • Made real-time querying and downstream analytics available within seconds

Lessons Learned

  • Cloud-native doesn’t mean compromise: sub-second latency is achievable with the right patterns.

  • File-based triggers from GCS are reliable if you design for retries and idempotency.

  • The key to performance is pre-scaling, batch tuning, and decoupling stages.

Conclusion

GCP provides a robust foundation for building high-throughput, low-latency ETL pipelines. By combining event-driven triggers, Pub/Sub decoupling, and streaming dataflows, we were able to handle terabytes of data per day with real-time SLAs.

Whether you’re modernizing a batch system or building greenfield streaming architecture, this design can act as a blueprint. Contact us to learn more or check out our career page to find out how you can be a part of the Improving team.
