I'm just starting off and was wondering if there's any concrete way of setting the compression when writing to a file in Spark? The version of Spark I'm using is Spark 2.2. The call I'm using is `.option("compression", "snappy").avro("output path")`, but when I go to check where the Avro files are saved, I can't tell from the names of the files whether they've been compressed or not. Also, just to say, this is after I've imported "._", so I'm not having any trouble using Avro files. Another way I've seen is to use "tConf", and these would be the commands I'd use in this instance. Neither way causes any errors, so I was wondering which would be the better way, and whether there are any other, more reliable ways of setting the compression when writing files.

Some background on the codecs involved:

Snappy aims for very high speeds and reasonable compression instead of maximum compression or compatibility with other compression libraries. Snappy compression is very fast, but gzip provides greater space savings. Snappy is provided in the Hadoop package along with the other native libraries (such as native gzip compression). Supported for text, RC, Sequence, and Avro files in Impala 2.0 and higher.

Gzip is recommended when achieving the highest level of compression (and therefore the greatest disk-space savings) is desired. Supported for text, RC, Sequence, and Avro files in Impala 2.0 and higher.

For text-based files, use the keywords STORED AS TEXTFILE. If you create a Hive table over an existing data set in HDFS, you need to tell Hive about the format of the files as they are on the filesystem ("schema on read").
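One way to resolve the "can't tell from the file name" problem is to look inside the output files themselves: an Avro object container file records its codec in the `avro.codec` entry of its header metadata. Below is a minimal sketch of a pure-Python header parser you could run against a written part file; the function names (`read_long`, `avro_codec`) are my own, and it assumes the standard Avro container format (magic bytes `Obj\x01` followed by a string-to-bytes metadata map).

```python
import io

def read_long(buf: io.BytesIO) -> int:
    """Decode one zigzag-encoded varint (an Avro 'long')."""
    n, shift = 0, 0
    while True:
        b = buf.read(1)[0]
        n |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (n >> 1) ^ -(n & 1)  # undo zigzag encoding

def avro_codec(header: bytes) -> str:
    """Return the 'avro.codec' value from an Avro container-file header."""
    buf = io.BytesIO(header)
    if buf.read(4) != b"Obj\x01":
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(buf)
        if count == 0:
            break
        if count < 0:  # negative block count: a byte-size long follows
            count = -count
            read_long(buf)
        for _ in range(count):
            key = buf.read(read_long(buf)).decode("utf-8")
            meta[key] = buf.read(read_long(buf))
    # An absent avro.codec entry means the file is uncompressed ("null").
    return meta.get("avro.codec", b"null").decode("utf-8")

if __name__ == "__main__":
    # Hand-crafted demo header: magic, a 2-entry metadata map, end marker.
    # (zigzag varints: 0x04 -> count 2, 0x14 -> string length 10, etc.)
    demo = (b"Obj\x01"
            + b"\x04"
            + b"\x14" + b"avro.codec" + b"\x0c" + b"snappy"
            + b"\x16" + b"avro.schema" + b"\x0a" + b'"int"'
            + b"\x00")
    print(avro_codec(demo))  # -> snappy
```

Against a real output, you would read the file (or at least enough of it to cover the whole header, which includes the full schema) and pass the bytes to `avro_codec`; it should report `snappy`, `deflate`, or `null`, regardless of what the part file is named.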