A key advantage of Apache Spark is that it splits data into multiple partitions and runs operations on all of them in parallel, which lets jobs finish faster. When working with partitioned data, we often need to increase or decrease the number of partitions based on how the data is distributed. The repartition and coalesce methods let us do that.
Generate Sequential and Unique IDs in a Spark Dataframe
Apache Spark is an open-source, general-purpose distributed computing engine used for processing and analyzing large amounts of data. Because that data is distributed across partitions, adding sequential and unique IDs to a Spark DataFrame is not straightforward.
Spark Partitions with Coalesce and Repartition (hash, range, round robin)
A key advantage of Apache Spark is that it splits data into multiple partitions and runs operations on all of them in parallel, which lets jobs finish faster. When working with partitioned data, we often need to increase or decrease the number of partitions based on how the data is distributed. The repartition and coalesce methods let us do that.
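As a rough illustration of the two methods named above, the sketch below changes the partition count of a small DataFrame. It assumes a local Spark installation on the classpath; the object name `PartitionDemo`, the `local[4]` master, and the partition counts are illustrative choices, not taken from the post.

```scala
import org.apache.spark.sql.SparkSession

object PartitionDemo {
  // Returns the partition counts after repartition(8), coalesce(2), and coalesce(16).
  def partitionCounts(): (Int, Int, Int) = {
    // Assumption: a local session is enough for demonstration purposes.
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("partition-demo")
      .getOrCreate()
    try {
      val df = spark.range(0, 1000).toDF("id")

      // repartition performs a full shuffle and can raise or lower the count.
      val wider = df.repartition(8)

      // coalesce merges existing partitions without a full shuffle,
      // so it can only lower the count.
      val narrower = wider.coalesce(2)

      // Asking coalesce for more partitions than exist leaves the count as is.
      val unchanged = wider.coalesce(16)

      (wider.rdd.getNumPartitions,
       narrower.rdd.getNumPartitions,
       unchanged.rdd.getNumPartitions)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(partitionCounts())
}
```

In general, coalesce is the cheaper choice when reducing partitions, while repartition is needed to increase them or to rebalance skewed data.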
Scala String Interpolation
String interpolation refers to the substitution of defined variables or expressions in a given string with their respective values. It allows users to embed variable references directly in processed string literals.
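A minimal sketch of Scala's three standard interpolators (`s`, `f`, and `raw`) may help make this concrete. The object and method names here are hypothetical and chosen only for illustration.

```scala
object InterpolationDemo {
  // s-interpolator: embeds variables ($name) and expressions (${...}).
  def greet(name: String, age: Int): String =
    s"$name will be ${age + 1} next year"

  // f-interpolator: printf-style format specifiers follow each reference.
  def padded(label: String, count: Int): String =
    f"$label%s-$count%03d"

  // raw-interpolator: variables are substituted, but escape sequences
  // such as \n are left as a literal backslash followed by 'n'.
  def rawLine(tag: String): String =
    raw"$tag\n"

  def main(args: Array[String]): Unit = {
    println(greet("Ada", 36))
    println(padded("job", 7))
    println(rawLine("a"))
  }
}
```

The `s` interpolator covers most everyday cases; `f` adds compile-time checked formatting, and `raw` is handy for regular expressions and Windows-style paths.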
Multiline Strings in Scala
You want to create multiline strings within your Scala source code, as you can with the heredoc syntax of other languages, and get help with escaping quotes and other symbols. A heredoc is a way…
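One common way to do this in Scala is a triple-quoted string combined with `stripMargin`. The sketch below is illustrative only; the object name `MultilineDemo` and the sample query text are assumptions, not taken from the post.

```scala
object MultilineDemo {
  // Triple quotes keep line breaks and double quotes literally, so no
  // escaping is needed. stripMargin removes the leading whitespace on
  // each line up to and including the '|' marker.
  val query: String =
    """SELECT name, "age"
      |FROM people
      |WHERE age > 21""".stripMargin

  def main(args: Array[String]): Unit =
    println(query)
}
```

Without `stripMargin`, the indentation used in the source file would become part of the string.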