How to preview a Parquet file on S3 without downloading it
You don't have to pull the whole 1.4 GB Parquet file from S3 to see its schema. The Parquet footer is at the end of the file; HTTP range requests fetch it in milliseconds. Here's the technique, two ways to use it (Python, Sery Link), and the failure modes nobody tells you about.
The thing nobody tells you about Parquet
Parquet files end with a footer, and that is meant literally. The last 8 bytes of a Parquet file are a 4-byte little-endian integer followed by the 4-byte magic PAR1. The integer is the length of the footer, so it tells you how many bytes back to seek to find the file's metadata: column names, data types, row count, row-group boundaries, per-column min/max statistics, and compression info.
That footer is small: typically tens of kilobytes, even for files that store hundreds of millions of rows in tens of gigabytes, though it grows with the number of columns and row groups. So if all you want is "what columns does this file have, and how many rows", you don't need to download the file at all. You just need to fetch the footer.
S3 and any HTTP server that supports the Range header (which is essentially all of them) can serve you bytes N through M of an object. Two range requests get you the whole footer:
- First range request: the last 8 bytes. Read the footer length from the first 4 of them.
- Second range request: the footer-length bytes that end immediately before those last 8. Parse them as the file metadata.
That's it. Roughly two HTTP round-trips. Total bytes transferred: under 100 KB for almost every Parquet file in the wild, even multi-GB ones.
The 1.4 GB stays in S3.
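The two-request dance can be sketched in a few lines of standard-library Python. This is a minimal illustration, not how any particular library does it: read_range is a hypothetical caller-supplied function standing in for an HTTP range request, and the "file" is a synthetic in-memory blob with a placeholder footer rather than real Thrift-encoded metadata.

```python
import struct

MAGIC = b"PAR1"
TRAILER_SIZE = 8  # 4-byte footer length + 4-byte magic


def read_footer_bytes(file_size, read_range):
    """Return the raw footer bytes using two ranged reads.

    read_range(start, length) is a hypothetical caller-supplied function
    that returns `length` bytes starting at byte offset `start`.
    """
    if file_size < TRAILER_SIZE + len(MAGIC):
        raise ValueError("too small to be a Parquet file")

    # Request 1: the last 8 bytes hold the footer length and the magic.
    trailer = read_range(file_size - TRAILER_SIZE, TRAILER_SIZE)
    if trailer[4:] != MAGIC:
        raise ValueError("missing PAR1 magic; not a plain Parquet file")
    (footer_len,) = struct.unpack("<I", trailer[:4])

    footer_start = file_size - TRAILER_SIZE - footer_len
    if footer_start < len(MAGIC):
        raise ValueError("footer length points outside the file")

    # Request 2: the footer itself, which ends right before the trailer.
    return read_range(footer_start, footer_len)


def make_bytes_reader(data):
    """Build a read_range over an in-memory buffer, standing in for HTTP."""
    def read_range(start, length):
        return data[start:start + length]
    return read_range


# Synthetic file: leading magic, fake row data, fake footer, length, magic.
fake_footer = b"<thrift-encoded FileMetaData would go here>"
blob = MAGIC + b"\x00" * 1000 + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC
footer = read_footer_bytes(len(blob), make_bytes_reader(blob))
```

Against a real object, read_range would send a Range header of the form bytes=start-end (the end offset is inclusive), and the returned bytes would still need a Thrift decoder, which is the part libraries such as pyarrow handle for you.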
Two ways to actually do this
1. Python (the "script it in a notebook" path)
pyarrow reads Parquet footers natively over s3fs / fsspec filesystems — no download, just the range-request dance over the wire.
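import pyarrow.parquet as pq
import s3fs

# The region is passed through client_kwargs; key and secret are placeholders.
fs = s3fs.S3FileSystem(key="AKIA…", secret="…", client_kwargs={"region_name": "us-east-1"})
# ParquetFile parses the footer on open; row data is not read until requested.
pf = pq.ParquetFile("my-bucket/data/sales.parquet", filesystem=fs)
print(pf.schema_arrow)             # columns + types
print(pf.metadata.num_rows)        # row count
print(pf.metadata.num_row_groups)  # number of row groups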
Best for: a notebook workflow where the user is already in Python. Worst for: anything resembling a daily browsing experience — typing this is more friction than just clicking a file.
2. Sery Link (the "click the file" path)
We built Sery Link specifically because the "click a file, see what's inside" UX shouldn't require typing SQL. Connect an S3 bucket once (credentials live in your OS keychain, never on our servers), browse the bucket like a folder, and click any Parquet — schema, sample rows, and column profile (null %, unique values, min/max/avg) render inline in under two seconds. Same range-request technique under the hood; we're just doing the typing for you.
It also works on CSVs (streams enough to infer schema), Excel files, and over more than just S3 — across all 9 supported protocols (Local, HTTPS, S3, Drive, SFTP, WebDAV, Dropbox, Azure, OneDrive) plus the 4 S3-compatible presets (B2, Wasabi, R2, GCS).
Free, open source under AGPL-3.0: download. No account required for the local browsing part — the cloud workspace ($19/mo) is the upgrade for AI chat across all your sources and multi-machine search.
The failure modes nobody tells you about
The technique is real, but every layer of it has a sharp edge that you only discover the hard way. Things we've hit building this:
Wrong region returns empty, not an error
S3 has a long-standing rough edge: requesting a bucket in the wrong region sometimes returns an empty result set rather than a clear region-mismatch error. Range-request-based tools inherit this. If you point a tool at us-east-1 for a bucket that actually lives in eu-west-1, you may get back "0 rows" instead of a region error. Always test with a known-non-empty bucket first.
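One way to guard against this is to ask S3 where the bucket lives before trusting an empty result. The sketch below assumes a boto3 S3 client and credentials that are allowed to call get_bucket_location; the helper names bucket_region and check_region are made up for illustration.

```python
def bucket_region(s3_client, bucket):
    """Return the region a bucket lives in, normalising S3's legacy values."""
    response = s3_client.get_bucket_location(Bucket=bucket)
    location = response.get("LocationConstraint")
    if location is None:
        return "us-east-1"  # S3 reports us-east-1 as a null constraint
    if location == "EU":
        return "eu-west-1"  # legacy alias still returned for old buckets
    return location


def check_region(s3_client, bucket, configured_region):
    """Raise early instead of trusting an empty result from the wrong region."""
    actual = bucket_region(s3_client, bucket)
    if actual != configured_region:
        raise ValueError(
            f"bucket {bucket!r} is in {actual}, not {configured_region}"
        )
    return actual
```

With a real client this would be called as check_region(boto3.client("s3"), "my-bucket", "us-east-1") before the first listing or preview.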
Glob brace expansion is not what you think
Most query engines support glob syntax with *, **, and brace expansion {a,b,c}. Against local files it works as documented. Against S3 listings via HTTP, we've observed brace expansion silently returning empty even when matching keys are present. Use plain **/* globs and filter extensions yourself if you can.
One-level-deep prefixes miss partitioned data
s3://bucket/data/*.parquet only matches Parquet files directly under data/. Most real-world data lakes are partitioned — year=2024/month=01/data.parquet lives one or more directories below. Use s3://bucket/data/**/*.parquet for recursive matching. (Sery Link defaults to recursive + extension-filter-after, so this case "just works" — that was a v0.6.1 fix after a real user hit exactly this.)
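The "list recursively, then filter by extension yourself" approach can be sketched as follows. It works on a plain list of object keys, so it makes no assumption about how the listing was obtained; matching_keys and the sample keys are hypothetical, and this is not a description of Sery Link's internals.

```python
from pathlib import PurePosixPath

TABULAR_EXTENSIONS = {".parquet"}


def matching_keys(keys, prefix, recursive=True, extensions=TABULAR_EXTENSIONS):
    """Filter listed object keys by prefix, depth, and file extension."""
    matched = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        remainder = key[len(prefix):]
        # Non-recursive mirrors data/*.parquet: no further "/" is allowed.
        if not recursive and "/" in remainder:
            continue
        if PurePosixPath(key).suffix.lower() in extensions:
            matched.append(key)
    return matched


keys = [
    "data/summary.parquet",
    "data/year=2024/month=01/data.parquet",
    "data/year=2024/month=02/data.parquet",
    "data/year=2024/_SUCCESS",
    "logs/app.parquet",
]
shallow = matching_keys(keys, "data/", recursive=False)  # only the top-level file
deep = matching_keys(keys, "data/")                      # includes the partitions
```

Filtering in code also sidesteps brace expansion entirely: a multi-extension match is just a larger extensions set.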
Footer-only is a Parquet feature, not a CSV feature
Everything in this post applies to Parquet. CSV doesn't have a footer; you can't skip to schema without reading at least the first few rows. The good news: "the first few rows" for a CSV is also a small range request, so schema + sample preview is still bounded — but it's a different code path under the hood.
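The CSV code path can be sketched like this, assuming the first chunk of the object has already been fetched with a single range request. preview_csv_head is a hypothetical helper; it assumes UTF-8 text with a header row and does not handle quoted fields that span the cut point.

```python
import csv
import io


def preview_csv_head(head_bytes, max_rows=5):
    """Parse the header and a few sample rows from the first bytes of a CSV."""
    text = head_bytes.decode("utf-8", errors="replace")
    lines = text.splitlines(keepends=True)
    # A ranged read usually cuts the last line mid-row, so drop any line
    # that is not terminated by a newline.
    if lines and not lines[-1].endswith(("\n", "\r")):
        lines = lines[:-1]
    reader = csv.reader(io.StringIO("".join(lines)))
    header = next(reader, [])
    rows = [row for _, row in zip(range(max_rows), reader)]
    return header, rows


# Simulated first bytes of a larger file; the final row is cut off mid-value.
head = b"id,name,amount\n1,Ada,10.5\n2,Grace,7.25\n3,Lin"
header, rows = preview_csv_head(head)
```

The values come back as strings; type inference over the sample rows would be a separate step on top of this.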
Cost: range requests are GET requests, plus LIST requests to find the file
Each Parquet preview costs you two GET requests (footer length + footer body) plus whatever LIST requests you needed to find the file in the first place. AWS S3 charges roughly $0.0004 per 1,000 GET requests and $0.005 per 1,000 LIST requests. Even at 100 previews a day, this rounds to zero. Don't worry about it.
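To put a number on "rounds to zero", here is the arithmetic as a small sketch. It uses the prices quoted above and assumes, purely for illustration, one LIST request per preview and a 30-day month.

```python
GET_PRICE_PER_1000 = 0.0004  # USD, figure quoted above
LIST_PRICE_PER_1000 = 0.005  # USD, figure quoted above


def monthly_preview_cost(previews_per_day, lists_per_preview=1, days=30):
    """Estimate the monthly request cost in USD for footer-only previews."""
    gets = previews_per_day * 2 * days  # two ranged GETs per preview
    lists = previews_per_day * lists_per_preview * days  # assumed LIST count
    return gets / 1000 * GET_PRICE_PER_1000 + lists / 1000 * LIST_PRICE_PER_1000


cost = monthly_preview_cost(100)
```

At 100 previews a day that is 6,000 GETs ($0.0024) and 3,000 LISTs ($0.015), or a bit under two cents a month under these assumptions.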
When to actually download
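Footer-only previewing covers schema, row count, and column-level min/max statistics; sample rows need one more small read from the first row group. Beyond that, you start paying for data transfer. If you need: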
- Custom aggregation across the whole file (sum, average over millions of rows): you'll pay for the column data, but only for the columns the query touches — Parquet's columnar layout means SELECT col_a doesn't pull col_b through col_z.
- Repeated heavy queries against the same file: consider pulling once, converting to a local Parquet, and querying locally. Sery Link has a one-click Convert button for this — drops the cost of every subsequent query to zero network traffic.
- Joins across multiple files on different cloud-storage backends: each file's data has to land somewhere addressable by the query engine. Sery Link handles the join definition as one SQL statement, not a multi-step download dance.
The take-away
You almost never need to download a Parquet just to see what's inside. The footer-only technique has been available via pyarrow for years; it's just been buried under the assumption that cloud-storage browsers exist to show you file lists, and actual data inspection is a separate tool. We don't think it should be.
That's the whole pitch behind Sery Link: every cloud storage you have, browseable in one app, with preview-without-downloading for tabular files baked in. Try it for free at /download — open source under AGPL-3.0, runs on macOS, Windows, and Linux, no account required for the local browsing part.
Further reading: the official Parquet format spec describes the footer layout in detail. The S3 documentation covers HTTP range requests against objects.