A dbt Pipeline from DynamoDB to Redshift
By Chris Gilbert
Migrating from Tableau to dbt on Redshift: What Worked, What Didn’t, and What I Wish I’d Known
Getting data out of DynamoDB, through Redshift, and into something your business can actually use — without losing your mind in the process.
I spent the last couple of years building out a data engineering function from scratch at a small SaaS company. We had a Tableau-based reporting pipeline that involved manual runs, fragile prep flows, and a mounting sense of dread every time someone asked for a new report. I replaced it with dbt, Redshift, and Fivetran, with self-hosted Metabase for business intelligence, and with a detour through several tools that didn’t work out. Here’s what I learned.
The Starting Point: DynamoDB and a Lot of Nested JSON
Our application stores everything in DynamoDB. Projects, audit logs, user records, tenant configuration, and so on. The data model is what can be described as “flexible”. A single project record can have around 10,000 possible attributes, stored as deeply nested JSON. Objects inside arrays inside objects. Some fields are consistent, some are tenant-specific, and some appear to have been added by someone in a hurry and never documented.
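For a flavour of the problem, here’s a heavily trimmed, invented example of the kind of record involved (the real ones are far larger and far less consistent):

```json
{
  "projectId": "p-123",
  "data": {
    "postCode": "NG1 1AA",
    "address": { "postCode": "NG1 1AA", "country": "United Kingdom" },
    "attributes": [
      { "name": "budget", "value": "12000" },
      { "name": "is_active", "value": "Yes" }
    ],
    "workflow": { "steps": [ { "id": 1, "config": { "approvers": ["u-9"] } } ] }
  }
}
```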
This is the fundamental challenge. You can’t report on data you can’t understand, and you can’t understand data that doesn’t have a well-defined schema.
Why Not AWS-Native Tooling?
This is the obvious question. AWS has Glue, Step Functions, the COPY command, and the newer Zero-ETL feature. I evaluated all of them. None of them worked for us.
AWS Glue was ruled out based on experience from a previous project where I’d used it for a data export pipeline. The local development and testing experience is genuinely poor. You’re writing PySpark jobs that you can’t meaningfully run locally, deploying them to find out they don’t work, and iterating through a slow feedback loop that makes TDD impossible. For a small team where I’m the only data engineer, that was a painful experience I didn’t want to repeat.
The COPY command into Redshift doesn’t support complex objects in DynamoDB. Our data is heavily nested. There are objects containing arrays of objects, and COPY just can’t handle that. It would have meant flattening the data before loading, which defeats the purpose.
Zero-ETL (the DynamoDB-to-Redshift integration AWS launched) has the same limitation. It doesn’t support the SUPER data type for complex nested objects. For a simpler DynamoDB schema it might work well, but for ours it was immediately disqualified.
Why Fivetran (and Where It Falls Short)
Fivetran handles the DynamoDB-to-Redshift sync reliably and in a GDPR-compliant way, which matters when your data includes customer PII and your infrastructure needs to stay in the EU. The setup is straightforward: point it at a table, it syncs incrementally, and you get on with your life.
But Fivetran has a significant quirk with Redshift: it doesn’t use the SUPER data type correctly. Instead of preserving the native JSON structure, it converts everything to VARCHAR. This means you lose the ability to query nested JSON natively in Redshift until you re-parse it. Every model in our ingest layer starts with json_parse() calls to convert Fivetran’s varchar columns back into queryable SUPER types.
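To make that concrete, here’s a hypothetical Python sketch of the SELECT list every ingest model opens with (the real models are dbt SQL; the column and schema names here are invented):

```python
# Hypothetical sketch: build the SELECT list an ingest model needs to
# turn Fivetran's varchar JSON columns back into queryable SUPER
# values. Column and schema names are invented.

def ingest_select_list(json_columns, passthrough_columns):
    """Build a Redshift SELECT list that re-parses varchar JSON."""
    parts = [f'"{c}"' for c in passthrough_columns]
    parts += [f'json_parse("{c}") as "{c}"' for c in json_columns]
    return ",\n    ".join(parts)

sql = "select\n    {}\nfrom ingest_fivetran.projects".format(
    ingest_select_list(["settings", "attributes"], ["project_id", "tenant_id"])
)
# sql is now:
# select
#     "project_id",
#     "tenant_id",
#     json_parse("settings") as "settings",
#     json_parse("attributes") as "attributes"
# from ingest_fivetran.projects
```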
There’s also a size limit. Some of our configuration objects (tenant settings with deeply nested workflow definitions) exceed Redshift’s 65,535-byte VARCHAR limit. Fivetran simply can’t load these. For those tables, I wrote Python scripts using dlt (data load tool) that scan DynamoDB directly and load into Redshift via a separate ingest_dlt schema. It’s not elegant having two ingestion paths, but it works.
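Those scripts are mostly plumbing, but one step is worth showing: DynamoDB’s low-level scan API wraps every value in a type tag, which has to be unwrapped into plain Python before dlt can load it. A simplified, stdlib-only sketch of that unwrapping (field names invented):

```python
# Simplified sketch of the unwrapping step our dlt scripts apply to
# raw DynamoDB scan output before loading it into the ingest_dlt
# schema. DynamoDB's low-level API wraps every value in a type tag:
# {"S": ...}, {"N": ...}, {"M": ...}, and so on.

def from_dynamo(value):
    """Unwrap one DynamoDB attribute value into a plain Python value."""
    (tag, inner), = value.items()
    if tag == "S":
        return inner
    if tag == "N":
        # numbers arrive as strings; exponents etc. are ignored here
        return float(inner) if "." in inner else int(inner)
    if tag == "BOOL":
        return inner
    if tag == "NULL":
        return None
    if tag == "L":
        return [from_dynamo(v) for v in inner]
    if tag == "M":
        return {k: from_dynamo(v) for k, v in inner.items()}
    raise ValueError(f"unhandled DynamoDB type tag: {tag}")

raw = {
    "projectId": {"S": "p-123"},
    "settings": {"M": {
        "active": {"BOOL": True},
        "steps": {"L": [{"N": "1"}, {"N": "2"}]},
    }},
}
plain = {k: from_dynamo(v) for k, v in raw.items()}
# plain == {"projectId": "p-123",
#           "settings": {"active": True, "steps": [1, 2]}}
```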
What About the Competition?
I tried several alternatives before settling on Fivetran. Estuary was the most promising, but it handled our complex nested data worse: it either unnested everything into a sprawling mess of dozens of tables, or failed to convert the JSON reliably at all. The fundamental problem is that most ELT tools are designed for relational sources. When your source is a document database with a schema that was never formally defined, you’re outside the usual happy path.
Processing JSON in Redshift: The Pain Points
Redshift’s SUPER data type is powerful but has sharp edges. The main issues we hit:
Nested extraction is verbose and brittle. To get a value like data.address.postCode you need to chain path expressions, handle nulls at every level, and deal with the fact that some records store the same data in different paths (flat data.postCode vs nested data.address.postCode). I built a macro system that handles this with seed-driven configuration. You define the attribute path, any aliases, and the expected data type in a CSV file, and the pipeline handles coalescing and type casting automatically.
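As a hypothetical sketch of what the macro emits, here’s one seed row (canonical attribute name, candidate JSON paths, target type) turned into a single coalesced, casted Redshift expression, written in Python for illustration:

```python
# Hypothetical sketch of what the seed-driven macro emits: given one
# seed row (canonical name, candidate paths, target type), build a
# single coalesced, casted Redshift SUPER path expression.

def super_path(path):
    """Turn 'data.address.postCode' into data."address"."postCode"."""
    root, *rest = path.split(".")
    return root + "".join(f'."{part}"' for part in rest)

def attribute_expr(name, paths, dtype):
    coalesced = "coalesce({})".format(", ".join(super_path(p) for p in paths))
    return f'{coalesced}::{dtype} as "{name}"'

expr = attribute_expr(
    "post_code", ["data.address.postCode", "data.postCode"], "varchar"
)
# expr covers both the nested and the flat spelling of the same field
```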
Dynamic pivoting is expensive. Our project attributes are stored in an EAV (Entity-Attribute-Value) pattern — each attribute is a row with a name and value. To make this usable for reporting, you need to pivot it into a wide table where each attribute becomes a column. With ~170 attributes, this means generating a CASE WHEN statement for each one. I wrote a dynamic_pivot macro that generates this SQL from seed files, with a column cache to avoid the compilation overhead of querying the database for every pivot operation.
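A stripped-down Python sketch of the idea behind dynamic_pivot (the real macro is Jinja and reads its attribute list from seed files; the attribute names and source table here are invented):

```python
# Stripped-down sketch of the dynamic_pivot idea: one aggregated
# CASE WHEN per attribute name turns EAV rows into columns.

def pivot_column(name, dtype):
    value = f"case when attribute_name = '{name}' then attribute_value end"
    if dtype == "boolean":
        # booleans must aggregate with bool_or rather than max
        return f'bool_or(({value})::boolean) as "{name}"'
    return f'max(({value})::{dtype}) as "{name}"'

attributes = [("budget", "numeric"), ("is_active", "boolean")]
columns = ",\n    ".join(pivot_column(n, t) for n, t in attributes)
sql = (
    "select\n    project_id,\n    {}\nfrom project_attributes\n"
    "group by project_id"
).format(columns)
```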
Boolean handling is surprisingly tricky. Source data represents booleans as "true", "Yes", "1", true, and various other formats. The pivot macro needs to normalise all of these into actual Redshift booleans using bool_or() rather than max().
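The same normalisation, sketched in Python (the spellings listed are the kind we saw, not an exhaustive set):

```python
# Illustrative sketch of the normalisation the pivot macro applies
# before bool_or() can aggregate. Unknown spellings fall through to
# None rather than guessing.

TRUTHY = {"true", "yes", "y", "t", "1"}
FALSY = {"false", "no", "n", "f", "0"}

def to_bool(raw):
    """Normalise a raw attribute value into True, False, or None."""
    if isinstance(raw, bool):
        return raw
    if raw is None:
        return None
    text = str(raw).strip().lower()
    if text in TRUTHY:
        return True
    if text in FALSY:
        return False
    return None  # unknown spellings are treated as missing
```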
The dbt Pipeline: Layers and Why Star Schemas Didn’t Work
We settled on a four-layer approach: ingest → isolation → galaxy → reports.
The idea was that the galaxy layer would contain star schemas with clean fact and dimension tables that reporting tools could query directly. In practice, this didn’t survive contact with the data. When a single entity (a project) can have thousands of possible attributes, varies by tenant, and changes structure over time, trying to force it into a star schema creates more problems than it solves. You end up with dozens of dimension tables that are really just different projections of the same underlying JSON.
What actually works is the isolation layer. This is where the real value lives. It takes the raw parsed JSON from the ingest layer and presents it as typed, tested, documented views. It formalises the data schema after the fact. It corrects data inconsistencies (country names normalised, US state codes standardised, duplicate attributes coalesced). And it has test gates everywhere: uniqueness, referential integrity, data type validation, and completeness checks.
The isolation layer is essentially the data contract that the application should have had.
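For illustration, here’s a schema.yml fragment of the kind the isolation layer relies on (model and column names are invented); dbt’s built-in unique, not_null, relationships, and accepted_values tests cover most of the gates described above:

```yaml
# models/isolation/schema.yml -- illustrative fragment
version: 2
models:
  - name: isolation_projects
    columns:
      - name: project_id
        tests:
          - unique
          - not_null
      - name: tenant_id
        tests:
          - not_null
          - relationships:
              to: ref('isolation_tenants')
              field: tenant_id
      - name: country_code
        tests:
          - accepted_values:
              values: ['GB', 'US', 'DE', 'FR']
```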
The Tableau Migration: Death by a Thousand Prep Flows
The original reporting setup was Tableau Prep flows. These are GUI-based data transformation pipelines that Tableau offers as an alternative to writing SQL. In theory, they’re accessible to non-technical users. In practice, they become unmaintainable black boxes that only one person understands.
Our billing process was set up to run through Tableau Prep. It originally took a combined day of work each month across two people, had a history of problems, and even our experts were afraid to touch it. The prep flows also broke whenever Tableau had reliability issues on their end (which happened more than you’d expect from an enterprise product), and debugging meant clicking through dozens of visual steps trying to find where the data went wrong. Rerunning failed steps was also difficult and tricky to automate.
Migrating to dbt meant replacing these with version-controlled SQL, tested in CI, with clear documentation about what each transformation does and why.
We’re down to two customers still using Tableau-based reporting. Everything else runs through dbt, with customer-facing data exported via views that are documented with auto-generated data catalogs.
The pipeline also powers customer-facing analytics directly. We use Metabase with static embedding to surface reporting dashboards inside our web application. Because the dbt models are tested and documented, the Metabase questions sit on top of a reliable foundation rather than querying raw tables and hoping for the best. This is the kind of thing that only becomes possible once you’ve invested in the isolation layer: you can confidently expose data to customers because you know it’s been validated, deduplicated, and correctly typed before it reaches them.
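Metabase’s static embedding works by having the application sign a short-lived JWT naming the dashboard, then iframing an /embed/dashboard URL built from it. A stdlib-only Python sketch (a real app would more likely use PyJWT; the site URL, secret, and dashboard id are placeholders):

```python
# Stdlib-only sketch of Metabase static embedding: sign a short-lived
# HS256 JWT naming the dashboard, then build the /embed/dashboard URL
# the application iframes. Secret, site URL, and id are placeholders.
import base64
import hashlib
import hmac
import json
import time

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def embed_url(site_url, embedding_secret, dashboard_id, ttl_seconds=600):
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps({
        "resource": {"dashboard": dashboard_id},
        "params": {},                       # locked filter values go here
        "exp": int(time.time()) + ttl_seconds,
    }).encode())
    signature = _b64url(hmac.new(
        embedding_secret.encode(), f"{header}.{payload}".encode(), hashlib.sha256
    ).digest())
    return (f"{site_url}/embed/dashboard/"
            f"{header}.{payload}.{signature}#bordered=true&titled=true")

url = embed_url("https://metabase.example.com", "embedding-secret", 42)
```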
What I’d Do Differently
Look at Airbyte first. I didn’t know about it or test it at the time. It’s open-source, self-hostable (which solves the GDPR problem), and has better community support for edge cases than Fivetran. Whether it handles our specific DynamoDB complexity better, I can’t say, but I’d want to find out before committing to a commercial tool.
Invest in data modelling earlier. A well-defined data model makes everything easier. Reporting, feature development, AI analysis and onboarding new engineers are all simpler. Without it, your business rules are implicit, scattered across application code and team knowledge, and nothing gets easier as the system grows. The work I did formalising the schema in dbt seed files and the isolation layer should ideally be done in the application. Our flexible workflows are a double-edged sword: the same flexibility that sells the product makes reporting much harder and data integrity much more difficult to enforce.
Build ReSQL sooner. After wrestling with nested JSON-to-SQL conversion enough times, I built a GUI tool that does the hard work of converting a heavily nested JSON document into a generated relational schema. It’s presently free and available online at resql.gilbert.cloud. It’s also available as a downloadable app at software.gilbert.cloud/resql. If you’re staring at a JSON export wondering how to make sense of it, it might save you some time!
The Takeaway
The migration from Tableau to dbt wasn’t really about swapping one tool for another. It was about bringing engineering discipline to a part of the business that had been running on manual processes and good intentions. Version control, testing, documentation and CI/CD are no longer luxuries for data pipelines. They’re the minimum bar for processing business-critical data.
If you’re in a similar position, with a SaaS product, DynamoDB backend, reports held together with string and optimism, then the tooling exists to fix it. The hard part isn’t the technology. It’s getting the organisation to invest in data quality before the problems become visible to customers.
By that point, it’s often too late to do it cheaply or quickly.
Chris Gilbert is a Staff Engineer based near Nottingham, working across DevOps and Data Engineering. He writes about green technology and the occasional technical deep-dive at blog.gilbert.cloud.