Hi! I'm Andrew.

Software & big data engineer / photographer

Streaming Reads with Python and Google Cloud Storage

By Andrew Fisher |  Nov 24, 2023  | python, googlecloudplatform, gcp, gcs, dataengineering, featured
In data processing, efficiency and reliability are paramount. As a data engineer, you’ll often need to read files in resource-constrained environments. One common approach is to stream the file and process it in smaller chunks. I recently came across a way to accomplish this using Google Cloud Storage (GCS), Python, and a CRC32C checksum (to verify the file’s integrity). Some reasons why this approach could be useful and why this post exists:
Continue Reading...
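
The excerpt above describes the idea without showing code, so here is a minimal sketch of the general technique, not the post's actual implementation. It reads a file-like object in small chunks, updates a CRC32C checksum as it goes, and compares the result with an expected value encoded as base64 over the big-endian checksum bytes, which is the format GCS reports. The function names (`crc32c_update`, `stream_and_verify`) are hypothetical, and the pure-Python checksum is an assumption made to keep the example self-contained; in practice a compiled CRC32C library would be much faster.

```python
import base64
import io
import struct


def _make_crc32c_table():
    """Build the lookup table for CRC32C (Castagnoli, reflected polynomial)."""
    poly = 0x82F63B78
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_crc32c_table()


def crc32c_update(crc, data):
    """Fold `data` into a running CRC32C value; start with crc=0."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def stream_and_verify(fileobj, expected_b64, chunk_size=1024 * 1024):
    """Read `fileobj` chunk by chunk, then check its CRC32C.

    Returns the number of bytes read; raises ValueError on a mismatch.
    Real per-chunk processing would go where the byte count is updated.
    """
    crc = 0
    total = 0
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        crc = crc32c_update(crc, chunk)
        total += len(chunk)  # placeholder for real chunk processing
    actual_b64 = base64.b64encode(struct.pack(">I", crc)).decode("ascii")
    if actual_b64 != expected_b64:
        raise ValueError(f"CRC32C mismatch: {actual_b64} != {expected_b64}")
    return total


# Stand-in for a remote object: an in-memory stream and its known checksum.
data = b"123456789"
expected = base64.b64encode(struct.pack(">I", 0xE3069283)).decode("ascii")
bytes_read = stream_and_verify(io.BytesIO(data), expected, chunk_size=4)
```

With the `google-cloud-storage` client, the stream would plausibly come from `blob.open("rb")` and the expected value from `blob.crc32c`. Both are assumptions about how the pieces fit together here, not something the excerpt confirms.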

Keyset pagination in PostgreSQL like a pro

By Andrew Fisher |  Nov 6, 2023  | sql, postgres, dataengineering
Why keyset pagination
With infinite-scrolling tables on websites, keyset pagination is a technique that provides approximately constant-time access to subsequent pages as a user scrolls. This approach can be implemented with data stored in a relational database like PostgreSQL. It is more complex to implement than a simple approach like LIMIT + OFFSET pagination, but it keeps query times from growing as you scroll many pages into a result set. The database doesn’t have to load the entire result set, sort it, and then return the specified limit from the given offset.
Continue Reading...
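
To make the contrast with LIMIT + OFFSET concrete, here is a minimal sketch of keyset pagination. It is illustrative only: it uses Python's built-in `sqlite3` module in place of PostgreSQL so that it runs anywhere, and the `items` table, its columns, and the `fetch_page` helper are all hypothetical. Each page is fetched by filtering on the last key seen instead of skipping rows with an offset.

```python
import sqlite3


def fetch_page(conn, after_id, page_size):
    """Return one page of rows whose id is greater than `after_id`.

    The WHERE clause lets the database seek into the primary-key index
    directly, so no earlier rows need to be scanned and discarded.
    """
    rows = conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()
    # The last id on this page becomes the cursor for the next request.
    next_cursor = rows[-1][0] if rows else None
    return rows, next_cursor


def demo():
    """Page through a small in-memory table, three rows at a time."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO items (id, name) VALUES (?, ?)",
        [(i, f"item-{i}") for i in range(1, 8)],
    )
    pages = []
    cursor = 0  # start before the smallest id
    while cursor is not None:
        rows, cursor = fetch_page(conn, cursor, 3)
        if rows:
            pages.append([row[0] for row in rows])
    conn.close()
    return pages


pages = demo()
```

The sketch pages on a single unique column. When the sort order is not unique on its own (for example, a timestamp), the key usually needs a unique tiebreaker column, and the comparison has to cover both columns.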

BigQuery: An interactive analytics benchmark

By Andrew Fisher |  Jul 28, 2021  | bigquery, dataengineering
If you have operational data sitting in BigQuery that powers dashboards through tools like Tableau, Looker, or Apache Superset, putting an exploratory analytics tool on top of your BigQuery datasets lets business and technical users explore the data interactively, and performance is surprisingly good. To measure BigQuery’s performance, an automated test suite ran over a standard dataset at varying sizes, simulating “slice and dice” queries from concurrent users.
Continue Reading...

Snowflake supports interactive analytics at scale

By Andrew Fisher |  Feb 25, 2021  | snowflake, dataengineering
With the proliferation of massively parallel processing (MPP) database technologies like Apache Pinot, Apache Druid, and ClickHouse, there is no shortage of blog posts on the Internet explaining how these technologies are the only ones capable of supporting interactive analytics on large data volumes. That is not the case. Benchmark tests on Snowflake’s platform with wide, denormalized datasets and concurrent query access patterns show that Snowflake offers reasonably fast query performance on large datasets when queried in an iterative, ad-hoc fashion.
Continue Reading...