I need to create a Hive table from Spark SQL which will be in the PARQUET format with SNAPPY compression. The following code creates the table in PARQUET format, but with GZIP compression:

```
hiveContext.sql("create table NEW_TABLE stored as parquet tblproperties ('parquet.compression'='SNAPPY') as select * from OLD_TABLE")
```

But in the Hue "Metastore Tables" -> TABLE -> "Properties" it still shows:

| Parameter | Value |
| --- | --- |
| parquet.compression | SNAPPY |

If I change SNAPPY to any other string, e.g. ABCDE, the code still works fine, except that the compression is still GZIP:

```
hiveContext.sql("create table NEW_TABLE stored as parquet tblproperties ('parquet.compression'='ABCDE') as select * from OLD_TABLE")
```

And Hue "Metastore Tables" -> TABLE -> "Properties" shows:

| Parameter | Value |
| --- | --- |
| parquet.compression | ABCDE |

Note: I tried to run the same query directly from Hive. When the property was equal to SNAPPY, the table was created successfully with proper compression; when it was ABCDE, the query didn't fail, but the table wasn't created. This makes me think that TBLPROPERTIES are just ignored by Spark SQL.

5 Answers, sorted by votes.

Answer 1 (48 votes):

Compression Ratio: GZIP compression uses more CPU resources than Snappy or LZO, but provides a higher compression ratio.

General Usage: GZip is often a good choice for cold data, which is accessed infrequently. Snappy or LZO are a better choice for hot data, which is accessed frequently. It is worth running tests to see if you detect a significant difference.

Splittability: If you need your compressed data to be splittable, the BZip2, LZO, and Snappy formats are splittable, but GZip is not.

Answer 2:

GZIP compresses data about 30% more than Snappy, but reading GZIP data costs roughly 2x the CPU of reading Snappy data. LZO focuses on decompression speed at low CPU usage; higher compression comes at the cost of more CPU. For longer-term/static storage, GZip compression is still better.

Answer 3:

See extensive research, benchmark code, and results in this article (Performance of various general compression algorithms – some of them are unbelievably fast!). Based on the data below, I'd say gzip wins outside of scenarios like streaming, where write-time latency would be important. It's important to keep in mind that speed is essentially compute cost. However, cloud compute is a one-time cost whereas cloud storage is a recurring cost, so the tradeoff depends on the retention period of the data.

Let's test speed and size with large and small parquet files in Python:

```
%timeit pd.read_parquet(path='file.parquet', engine='pyarrow')
%timeit df.to_parquet(path='file.parquet', compression='gzip', engine='pyarrow', index=True)
%timeit df.to_parquet(path='file.parquet', compression='snappy', engine='pyarrow', index=True)
```

Results (small file, 4 KB, Iris dataset): (results table not preserved)

Answer 4:

From the Spark documentation on ORC: the vectorized reader is used for the native ORC tables (e.g., the ones created using the clause USING ORC) when spark.sql.orc.impl is set to native and spark.sql.orc.enableVectorizedReader is set to true. For nested data types (array, map, and struct), the vectorized reader is disabled by default.

```
CREATE TABLE encrypted (
  ssn STRING,
  email STRING,
  name STRING
)
USING ORC
OPTIONS (
  hadoop.security.key.provider.path 'kms://http@localhost:9600/kms',
  orc.key.provider 'hadoop',
  orc.encrypt 'pii:ssn,email',
  orc.mask 'nullify:ssn;sha256:email'
)
```

When writing ORC, the compression option will override orc.compress and spark.sql.orc.compression.codec; if None is set, it uses the value specified in spark.sql.orc.compression.codec.
The value can be one of the known case-insensitive shortened names (none, snappy, zlib, and lzo).
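To make those ORC settings concrete, here is a minimal PySpark sketch (the app name, placeholder DataFrame, and output path are mine, not from the answer) that enables the native vectorized reader and writes ORC with an explicit codec:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-compression-demo").getOrCreate()

# Use the native ORC implementation and its vectorized reader
# (the two configs named in the documentation quote above).
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

df = spark.range(1000)  # placeholder data

# The 'compression' write option overrides orc.compress and
# spark.sql.orc.compression.codec; 'zlib' is one of the shortened names.
df.write.mode("overwrite").option("compression", "zlib").orc("/tmp/orc_zlib_demo")
```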
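Back to the original question: the workaround usually given is to set the session-level codec instead of TBLPROPERTIES, which the question suggests Spark SQL ignores. A sketch, assuming the same hiveContext and the question's placeholder table names:

```python
# Set the Parquet codec for the session; Spark SQL honors this config
# when writing Parquet files, unlike the TBLPROPERTIES route above.
hiveContext.setConf("spark.sql.parquet.compression.codec", "snappy")
hiveContext.sql(
    "create table NEW_TABLE stored as parquet as select * from OLD_TABLE"
)
```

On newer Spark versions the DataFrame writer also accepts the codec directly, e.g. `df.write.option("compression", "snappy").parquet(path)`.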
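The %timeit lines in the benchmark answer assume an interactive IPython session. A self-contained variant of the same comparison (synthetic data stands in for the Iris file; exact numbers will vary by machine, and pyarrow must be installed):

```python
import os
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the benchmark DataFrame.
df = pd.DataFrame(np.random.rand(1_000_000, 4), columns=list("abcd"))

for codec in ("gzip", "snappy"):
    path = f"file_{codec}.parquet"

    start = time.perf_counter()
    df.to_parquet(path, compression=codec, engine="pyarrow", index=True)
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    pd.read_parquet(path, engine="pyarrow")
    read_s = time.perf_counter() - start

    size_mb = os.path.getsize(path) / 1e6
    print(f"{codec:>6}: write {write_s:.2f}s  read {read_s:.2f}s  size {size_mb:.1f} MB")
```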
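And to illustrate the compute-versus-storage tradeoff the benchmark answer raises (compression is a one-time compute cost, storage recurs for the retention period), a back-of-the-envelope sketch; every number here is an invented placeholder, not a real cloud price:

```python
# Hypothetical unit costs -- placeholders only, not real prices.
compute_cost_per_write = {"snappy": 0.10, "gzip": 0.60}   # one-time, per dataset
storage_cost_per_month = {"snappy": 0.50, "gzip": 0.35}   # recurring, per dataset

def total_cost(codec: str, retention_months: int) -> float:
    """One-time compute cost plus recurring storage cost over retention."""
    return compute_cost_per_write[codec] + storage_cost_per_month[codec] * retention_months

for months in (1, 6, 24):
    costs = {c: round(total_cost(c, months), 2) for c in ("snappy", "gzip")}
    print(f"{months:>2} months -> {costs}  cheapest: {min(costs, key=costs.get)}")
```

With these made-up rates, snappy is cheaper for short retention and gzip pulls ahead as the data sits longer, which is the shape of the tradeoff the answer describes.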