  • Nope. I sampled the dataset down to around 1,000 rows with PySpark's `sample()`.

    Then a display operation on that tiny dataset took around 8 minutes — roughly like the sketch below.
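
    Roughly what I ran, for reference — names are illustrative: `df` stands in for the original huge DataFrame, the fraction is just whatever works out to ~1,000 rows, and `display()` is the notebook helper:

    ```python
    # df is the huge source DataFrame (placeholder name).
    # sample() takes a fraction, not a row count, so the fraction here is illustrative.
    small_df = df.sample(fraction=1e-6, seed=42)  # still lazy: nothing is read yet
    display(small_df)                             # this is the step that took ~8 minutes
    ```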

    So I’m thinking that maybe Spark’s lazy evaluation has something to do with this? The sample is only a query plan until an action runs, so the display presumably still has to scan the brutally huge original DF.
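
    If that’s the case, materializing the sample once should make later actions fast. A minimal sketch of that idea, same placeholder names as above:

    ```python
    # Cache the sampled rows so the huge source is scanned only once.
    small_df = df.sample(fraction=1e-6, seed=42).cache()
    small_df.count()   # action: triggers the single full scan and fills the cache
    display(small_df)  # later actions are served from the ~1,000 cached rows
    ```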

    I tried creating a dummy df from scratch with 10k rows and displaying it, and as expected it goes pretty fast. So I really think it must somehow be linked to the size of the original df.
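
    The control test looked more or less like this (`spark.range()` standing in for however the dummy df was actually built):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    dummy_df = spark.range(10_000)  # 10k rows, single "id" column, no heavy lineage
    display(dummy_df)               # fast, since there is no giant source to scan
    ```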