Apache Parquet
Apache Parquet is an open-source, column-oriented data file format optimized for efficient data storage and high-performance analytical query processing.
Parquet is a widely used columnar storage format for big data analytics that can substantially improve query performance and reduce storage costs. Unlike row-based formats such as CSV, Parquet organizes data by column, so query engines can read only the columns a query needs (column pruning, also called projection pushdown) and use stored statistics to skip data that cannot match a filter (predicate pushdown). Storing values of the same type together also makes compression more effective; common codecs include Snappy and Gzip. The gains can be large: for example, one study showed Parquet queries running 34x faster with 99.7% cost savings compared to CSV on a 1 TB dataset. The format is language-agnostic and self-describing: each file carries its schema and other metadata in a footer. This makes it a common interchange format across major ecosystems such as Apache Spark, Hive, and Presto, and cloud services such as Amazon Athena and Google BigQuery.