Hi all, I’m interested in solutions people know of to this data-hosting problem:
- One hosts data that can be represented in an RDMS. It’s in the multi-GB+ scale, may be hosted on one server or across servers.
- One expects to serve data from queries to this data set via interactive or REST API
- Results of queries are MB to GB scale
I’ve seen this become a performance problem a number of times. It’s often when the hosting setup didn’t anticipate the load of large-scale queries and processing the queries becomes very slow, usually filling up RAM.
Some databases are very good at processing queries out of RAM and can stream the output of the query to a file (I like MonetDB a lot for this, and I think Apache Drill can do it even if the data are flat files). So loading the results into memory may not be necessary, but I imagine that other parts of the tooling, (like constructing JSON payloads), end up pulling the data into memory or are otherwise very resource intensive.
A frequent pattern I see for a lot of academic or government services is to have an website or REST API for small queries, and direct users to an FTP site, S3, or other static hosting for flat file dumps of the data. We’re big fans of flat file dumps at rOpenSci, but this anticipates that you can break up the data into the right type of outputs, and/or the user will process down to the query results they want.
I’ve also seen versions where a user can make a query with an online tool and this enters a queue, and one receives an email when the result is ready to download (Example here). Clearly this limits the burden on the hosting service.
What other solutions are there to this? I realize the cost of moving the data is in the reason for analytics-in-the-DB, but assuming one is OK with the cost of data transfer, can one query and stream without a huge load on servers? It seems doable but non-trivial, like an API request to execute the query and then another request to stream the output file. Maybe I and these government services don’t know about more modern paradigms/databases? Is this something that turnkey services like Google BigQuery or Amazon Relational Database Service offer?