Advice for integrating SQL with Targets/Drake.

Does a SQL workflow fit within the drake / targets framework, or does that run counter to its design?

There are a few challenges I’m running into, and I’m curious how those with more experience with drake/targets think about them:

  • In Drake, if I change an input file for a data frame, the data frame is reconstructed from that file. In a SQL workflow, I would be more interested in persisting the diffs. Moreover, changes in the database table aren’t tied into Drake’s dependency tracking, even though it makes perfect sense for the top-down process to be updated asynchronously. That is, the data frame loaded from the table won’t update if the database table changes, and it would need to be refreshed.

  • In my workflow for populating the database, target granularity is harder to reason about with SQL. I have a table that I want to update multiple times, because the process populates its fields based on prior steps. This multiple-update approach seems to work against Drake/Targets, though I suppose I could just return the new column with the keys each time? Still, it seems like what I want is to refresh my local copy of the table after each persistent change.

Are there any complex examples of using Drake/Targets for SQL projects that both (a) populate a database from scratch and (b) then update it based on input files? And do you think this fits, or is it more of an event-driven problem that needs a different framework?


In my workflow for populating the database, target granularity is harder to reason about with SQL. I have a table that I want to update multiple times, because the process populates its fields based on prior steps. This multiple-update approach seems to work against Drake/Targets, though I suppose I could just return the new column with the keys each time? Still, it seems like what I want is to refresh my local copy of the table after each persistent change.

As you say, targets assumes your outputs are written only once and do not change in subsequent steps (immutability). You can pull input from one table, do some processing, and then write each output table exactly once, to tables you do not also read as input. Even if this does not seem possible at first, you can always combine any two targets into one if it better suits the problem, especially when the work is factored into good functions.
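For concreteness, here is a minimal sketch of that write-once pattern as a _targets.R file. The connection helper, the table names (raw_events, events_clean), and the processing step are placeholder assumptions, not anything specific to your schema; the point is that each database table is written by exactly one target, and the intermediate steps are folded into ordinary R functions.

```r
# _targets.R -- minimal sketch; connection, table names, and helpers are hypothetical.
library(targets)

tar_option_set(packages = c("DBI", "dplyr"))

connect <- function() {
  # Replace with your actual driver and credentials.
  DBI::dbConnect(RSQLite::SQLite(), "example.sqlite")
}

read_raw <- function() {
  con <- connect()
  on.exit(DBI::dbDisconnect(con))
  DBI::dbReadTable(con, "raw_events")
}

clean_events <- function(raw) {
  # All in-memory processing happens here, however many steps it takes.
  dplyr::distinct(raw)
}

write_clean <- function(clean) {
  con <- connect()
  on.exit(DBI::dbDisconnect(con))
  # One write per table, to a table the pipeline never reads back in.
  DBI::dbWriteTable(con, "events_clean", clean, overwrite = TRUE)
  # Return a small fingerprint so downstream targets can depend on the write.
  nrow(clean)
}

list(
  tar_target(raw, read_raw()),
  tar_target(clean, clean_events(raw)),
  tar_target(load_clean, write_clean(clean))
)
```

Returning a small value from the writing target (here just the row count) gives downstream targets something to depend on without re-reading the table.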

In Drake, if I change an input file for a data frame, the data frame is reconstructed from that file. In a SQL workflow, I would be more interested in persisting the diffs.

I do not understand what you mean. Would you elaborate?

Moreover, changes in the database table aren’t tied into Drake’s dependency tracking, even though it makes perfect sense for the top-down process to be updated asynchronously. That is, the data frame loaded from the table won’t update if the database table changes, and it would need to be refreshed.

Cues, as well as functions like tar_change() and tar_force(), let you define custom rules to invalidate a target. For example, if you write a custom function to get the time stamp or hash of a database table, you could plug it into tar_change(). Behind the scenes, tar_change() defines an upstream target that always runs and computes the time stamp (e.g. list(tar_target(x_change, get_timestamp(), cue = tar_cue(mode = "always")), tar_target(x, {x_change; get_table()}))). And then there’s tar_target(..., format = "url") if your database has a URL with a last-modified time or ETag.
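To make that concrete, here is a sketch using tarchetypes::tar_change(). The connect() helper, the events table, and the count-based fingerprint are assumptions for illustration; substitute whatever cheap last-modified query or checksum your database supports.

```r
# _targets.R -- sketch of cue-based invalidation; connect() and "events" are hypothetical.
library(targets)
library(tarchetypes)

tar_option_set(packages = "DBI")

connect <- function() {
  DBI::dbConnect(RSQLite::SQLite(), "example.sqlite")
}

# Cheap fingerprint of the table; swap in a last-modified timestamp,
# max(updated_at), or checksum if your database provides one.
get_table_fingerprint <- function(table) {
  con <- connect()
  on.exit(DBI::dbDisconnect(con))
  DBI::dbGetQuery(con, paste0("SELECT COUNT(*) AS n FROM ", table))
}

get_table <- function(table) {
  con <- connect()
  on.exit(DBI::dbDisconnect(con))
  DBI::dbReadTable(con, table)
}

list(
  # Re-runs get_table("events") whenever the fingerprint changes.
  tar_change(events, get_table("events"), change = get_table_fingerprint("events"))
)
```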
