Hosting biggish data, serving biggish queries

Hi all, I’m interested in solutions people know of to this data-hosting problem:

  • One hosts data that can be represented in an RDBMS. It’s at the multi-GB+ scale and may be hosted on one server or across several.
  • One expects to serve data from queries against this data set via an interactive or REST API.
  • Results of queries are MB to GB in scale.

I’ve seen this become a performance problem a number of times, often when the hosting setup didn’t anticipate the load of large-scale queries: processing them becomes very slow and usually fills up RAM.

Some databases are very good at processing queries that don’t fit in RAM and can stream the output of a query to a file (I like MonetDB a lot for this, and I think Apache Drill can do it even when the data are flat files). So loading the results into memory may not be necessary, but I imagine that other parts of the tooling (like constructing JSON payloads) end up pulling the data into memory or are otherwise very resource intensive.
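
For example, with MonetDB the streaming-to-file part is just a COPY INTO statement. A minimal sketch via DBI/odbc (the DSN, table, columns, and path are all made up):

```r
library(DBI)

# Hypothetical MonetDB connection over ODBC; the DSN name is an assumption.
con <- dbConnect(odbc::odbc(), dsn = "monetdb")

# COPY ... INTO writes the result to a server-side file, so the client never
# materializes the result set in memory.
dbExecute(con, "
  COPY SELECT site_id, year, value FROM measurements WHERE year >= 2015
  INTO '/data/exports/measurements_2015.csv'
  USING DELIMITERS ',', '\\n', '\"'
")
```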

A frequent pattern I see for a lot of academic or government services is to have a website or REST API for small queries, and to direct users to an FTP site, S3, or other static hosting for flat-file dumps of the data. We’re big fans of flat-file dumps at rOpenSci, but this assumes that you can break up the data into the right kinds of outputs, and/or that the user will process it down to the query results they want.

I’ve also seen versions where a user makes a query with an online tool, the query enters a queue, and one receives an email when the result is ready to download (Example here). Clearly this limits the burden on the hosting service.

What other solutions are there to this? I realize the cost of moving the data is the reason for analytics-in-the-DB, but assuming one is OK with the cost of data transfer, can one query and stream without a huge load on servers? It seems doable but non-trivial, like an API request to execute the query and then another request to stream the output file. Maybe I and these government services don’t know about more modern paradigms/databases? Is this something that turnkey services like Google BigQuery or Amazon Relational Database Service offer?


What is the budget like? I imagine that will narrow down the options. For example, I’d guess paying for BigQuery etc. would be the most expensive, but that’s just a guess.

Is the number of unique queries very large, or smallish? If smallish, then maybe the first result is cached long term (until the data are updated) and subsequent calls are faster.
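
If it’s smallish, even something as simple as keying a file cache on a hash of the query text could go a long way. A rough sketch (the directory and helper name are made up):

```r
library(DBI)

cache_dir <- "/data/cache"   # assumption: files in here get served statically

# Return a cached result file for a query, running the query only on a cache miss.
cached_query <- function(con, sql) {
  key <- digest::digest(sql)                      # cache key = hash of the SQL text
  out <- file.path(cache_dir, paste0(key, ".csv"))
  if (!file.exists(out)) {
    res <- dbGetQuery(con, sql)                   # first caller pays the full cost
    readr::write_csv(res, out)
  }
  out                                             # later callers just get the file
}
```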

GBIF uses that model you described of

  1. request made that gets put into a queue
  2. make another request to a different route to get the status of the job
  3. download output when job is done

I’m not sure what their tech stack is for that. I’d probably put together Sidekiq for scheduling, plus whatever database.
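
On the client side, rgbif wraps those three routes. Roughly, with the newer rgbif download interface (a sketch; the taxon key is arbitrary and credentials normally come from the GBIF_USER / GBIF_PWD / GBIF_EMAIL environment variables):

```r
library(rgbif)

# 1. Submit the download request; GBIF queues it server-side.
job <- occ_download(pred("taxonKey", 2436775))

# 2. Poll a different route for the job's status.
occ_download_meta(job)

# 3. Once the status is SUCCEEDED, grab the zip file GBIF wrote out.
zip_path <- occ_download_get(job)
```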

Caching and queueing are good strategies; it just seems to me that queries run into a resource bottleneck due to inefficient data handling. In cases where performance suffers, I often see a loooong wait for a response before the download starts, while the server is using up lots of memory. It seems to me that one should be able to initiate a streaming download considerably faster and not be memory-bound.

I might try to test this with a MonetDB/plumber setup that:

  • Takes a request with parameters for a data query
  • Asynchronously executes a COPY INTO query in MonetDB, which streams the query result to disk. It could be CSV or NDJSON output.
  • Returns the URL path of the output file

I don’t know a lot about how serving static files works, so I’m not sure what happens when one tries to download a file that is actively being written to.
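
Roughly what I have in mind for the endpoint, as a sketch only (the DSN, the export directory, and the idea that /data/exports is served statically elsewhere are all assumptions):

```r
# plumber.R -- sketch only; no input validation, so this is injection-prone as-is.
library(plumber)

export_dir <- "/data/exports"   # assumed to be exposed as static files elsewhere
jobs <- new.env()               # hold process handles so they aren't GC'd mid-query

#* Start an export and return immediately with the path the file will appear at
#* @param tbl   table to query
#* @param where optional SQL filter clause
#* @get /export
function(tbl, where = "1 = 1") {
  id       <- basename(tempfile("export"))
  out_file <- file.path(export_dir, paste0(id, ".csv"))
  # MonetDB's COPY ... INTO writes the result server-side, so neither the DB
  # client nor plumber ever holds the result set in memory.
  sql <- sprintf(
    "COPY SELECT * FROM %s WHERE %s INTO '%s' USING DELIMITERS ',', '\\n', '\"'",
    tbl, where, out_file
  )
  # Run the query in a background R process so the API call returns right away.
  jobs[[id]] <- callr::r_bg(function(sql) {
    con <- DBI::dbConnect(odbc::odbc(), dsn = "monetdb")
    on.exit(DBI::dbDisconnect(con))
    DBI::dbExecute(con, sql)
  }, args = list(sql = sql))
  list(status = "started", download = paste0("/exports/", id, ".csv"))
}
```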

R/plumber can’t be the fastest option though, yeah?

Definitely not, more a proof of concept. Though in theory the idea is that the heavy parts, the query and the serialization, are offloaded to the DB and the file system.

I just published a blog post on a pretty similar analysis:

Hope it helps!


I think I messed up the link…(even though that image is from the blog post):

https://waterdata.usgs.gov/blog/formats/


I got some excellent responses to this on the rOpenSci Slack, too, which I’m summarizing here:

  • It’s often the serialization, not the queries, that is the performance issue (see the sketch after this list).
  • Services like AWS Aurora or Athena largely let you abstract away this scaling issue, and putting them behind serverless functions like AWS Lambda is a good way to go. Aurora is a hosted database that can spin up or down based on query load; Athena is a service that uses Presto (a tool similar to Apache Drill) to query flat files on S3.
  • An RDBMS has some inherent performance overhead because of ACID requirements, and these come along even if you don’t need ACID for read-only, filter-only outputs.
  • When dealing with large outputs you’re suddenly dealing with network reliability as well as speed; turnkey services are helpful for this.
  • Async queueing and pagination are helpful ways to protect your resources.
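
On the serialization point, the difference is basically between building one giant JSON payload in memory and streaming the result out in chunks. A sketch (the connection and query are placeholders):

```r
library(DBI)
library(jsonlite)

# Placeholder connection; the point here is the serialization pattern, not the DB.
con <- dbConnect(odbc::odbc(), dsn = "monetdb")

# Fetch and serialize in batches: each chunk is written out as NDJSON and then
# released, so memory use stays roughly flat regardless of the result size.
res <- dbSendQuery(con, "SELECT * FROM measurements WHERE year >= 2015")
out <- file("results.ndjson", open = "w")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 50000)
  stream_out(chunk, out, verbose = FALSE)
}
close(out)
dbClearResult(res)
```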

Logged in to write the above :slight_smile: And even if it’s in addition to what you finally decide on, you could qualify for the public datasets programme of BigQuery: https://cloud.google.com/bigquery/public-data/

You get 1 TB of queries free each month, and these days you don’t need a credit card to access it.


Hey there…

Athena has been much cheaper for us in production than BigQuery (I have no love for, or stake in, either large money-grubbing service, so this isn’t me just hating on Google). That’s a combination of at-rest costs for compressed parquet data with very intelligent partitioning and per-query costs for well-crafted queries (which are often not dbplyr-auto-generated SQL, unfortunately).

Amazon updated the Athena back-end to support result-set streaming, which provides much better performance than the super-slow 1000-records-per-fetch behavior it had before, but I still use my home-grown async query-results grabber (https://git.rud.is/hrbrmstr/awsathena/src/branch/master/R/collect-async.R) since S3 copies are still faster.

The http (et al) scan/study data sets (ref: https://opendata.rapid7.com) are often yuge and we use mostly date partitioning with those.

However, we have 150-200 honeypots running 24/7 which all dump various small files to central S3 buckets hourly. They get turned into ndjson (some are already in ndjson) and then rolled up into optimized parquet files with various partition schemes.

The partitioning is key (and can be done with plain ol’ ndjson or even CSVs, too) since it speeds up queries (if the queries are written properly) and reduces costs (which are based on how much data Amazon’s customized Presto backend has to scan).
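
e.g., the rollup step looks something like this with the arrow package (a sketch; the paths and columns are made up), and the same partition-pruning idea applies whether Athena, Presto, or Drill is doing the querying:

```r
library(arrow)
library(dplyr)

# Roll hourly NDJSON up into parquet partitioned by the column folks filter on
# most (a date here); file names and columns are hypothetical.
logs <- read_json_arrow("hourly/2019-06-01T00.ndjson")
write_dataset(logs, "honeypot-logs",
              format = "parquet",
              partitioning = "scan_date")

# A query that filters on the partition column only touches the matching
# directories, which is what keeps bytes-scanned (and the bill) down.
open_dataset("honeypot-logs") %>%
  filter(scan_date == "2019-06-01") %>%
  select(src_ip, dst_port) %>%
  collect()
```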

Drill works the same way and we/I also use that in distributed mode for things that need to be queried more often (where there’s a cost savings of using or spinning up dedicated resources vs pay-per-query in Athena).

We’re also in the process of building a caching layer since the old data doesn’t change. That way we can provide an API or UX for folks we don’t want hitting Athena directly (b/c they know not of query optimization).

We also collect the SQL from the queries to see what we should consider moving to a more optimized setup.

Not sure if ^^ helps but I can clarify if anything is wonkily explained.

@hrbrmstr Thanks! Very helpful! By “intelligent partitioning” do you mean that your parquet files are split along the variables one is most likely to filter by? Do you split the same data multiple ways?

It seems that Athena and BigQuery both dump results to hosted storage, which has better performance and fewer limits than returning query results directly, and I imagine that something like this is how you would want to run any such service.
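
From poking around, the Athena flow via the paws package looks roughly like this (a sketch; the bucket, table, and query are placeholders):

```r
library(paws)

athena <- paws::athena()
s3     <- paws::s3()

# Submit the query; Athena writes the full result set to S3 instead of
# returning it inline.
res <- athena$start_query_execution(
  QueryString = "SELECT * FROM scans WHERE scan_date = DATE '2019-06-01'",
  ResultConfiguration = list(OutputLocation = "s3://my-results-bucket/athena/")
)

# Poll until the query finishes.
repeat {
  info  <- athena$get_query_execution(QueryExecutionId = res$QueryExecutionId)
  state <- info$QueryExecution$Status$State
  if (state %in% c("SUCCEEDED", "FAILED", "CANCELLED")) break
  Sys.sleep(2)
}

# Athena leaves the results at <OutputLocation>/<QueryExecutionId>.csv;
# download that object directly rather than paging through the API.
obj <- s3$get_object(
  Bucket = "my-results-bucket",
  Key    = paste0("athena/", res$QueryExecutionId, ".csv")
)
writeBin(obj$Body, "result.csv")
```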

aye. all by some form of date. some by date and query fields we see folks filtering on often.

The really nice thing about both BigQuery and Athena is that you only pay for storage and queries, so if the work is infrequent it’s cheap and you can cost out each query. Athena also just added “workgroups” where you can set a limit (to avoid going over budget).
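
e.g., setting up a capped workgroup with paws (a sketch; the names, bucket, and numbers are made up):

```r
library(paws)
athena <- paws::athena()

# Workgroup with a per-query scan cap so a runaway query can't blow the budget.
athena$create_work_group(
  Name = "public-api",
  Configuration = list(
    BytesScannedCutoffPerQuery = 10 * 1024^3,   # cap each query at ~10 GB scanned
    ResultConfiguration = list(OutputLocation = "s3://my-results-bucket/athena/")
  )
)

# Queries submitted against that workgroup inherit the cap.
athena$start_query_execution(
  QueryString = "SELECT count(*) FROM scans WHERE scan_date = DATE '2019-06-01'",
  WorkGroup   = "public-api"
)
```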