Apache Arrow

Apache Arrow was founded in 2016 by developers of numerous open source data projects to bring the database and data science communities together to collaborate on shared computational technology. It provides a language-agnostic software framework for developing data analytics applications that process columnar data. Its standardized column-oriented memory format can represent both flat and hierarchical data for efficient analytic operations, reducing costs when working with large data sets. Columnar representation can yield better compression and can speed up certain queries, because compilers and CPUs can apply vectorized operations across the contiguous values of a column. It is common for analytics systems to use Apache Arrow to process data stored in Apache Parquet files.

The Arrow project is split into two parts:

  1. A set of specifications for the in-memory columnar format
  2. Standard libraries for key programming languages

Apache Arrow works with Apache Parquet, Arrow Flight SQL, Apache Spark, NumPy, PySpark, pandas, and other data processing libraries and includes native libraries in C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust.


How Apache Arrow Defragments Data Access


Advantages with Arrow

  • All systems use the same memory format
  • No serialization or deserialization overhead for cross-system communication
  • Interoperable (data exchange)
  • Embeddable (in execution engines, storage layers, etc.)