  • Nope. I sampled the dataset down to around 1,000 rows with PySpark's `sample()`.

    Then a display operation on that tiny dataset took around 8 minutes — roughly like the sketch below.
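
    Roughly what I ran, for reference — names are illustrative: `df` stands in for the original huge DataFrame, the fraction is just whatever works out to ~1,000 rows, and `display()` is the notebook helper:

    ```python
    # df is the huge source DataFrame (placeholder name).
    # sample() takes a fraction, not a row count, so the fraction here is illustrative.
    small_df = df.sample(fraction=1e-6, seed=42)  # still lazy: nothing is read yet
    display(small_df)                             # this is the step that took ~8 minutes
    ```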

    So I’m thinking that maybe Spark’s lazy evaluation has something to do with this? The sample is only a query plan until an action runs, so the display presumably still has to scan the brutally huge original DF.
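
    If that’s the case, materializing the sample once should make later actions fast. A minimal sketch of that idea, same placeholder names as above:

    ```python
    # Cache the sampled rows so the huge source is scanned only once.
    small_df = df.sample(fraction=1e-6, seed=42).cache()
    small_df.count()   # action: triggers the single full scan and fills the cache
    display(small_df)  # later actions are served from the ~1,000 cached rows
    ```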

    I tried creating a dummy df from scratch with 10k rows and displaying it, and as expected it goes pretty fast. So I really think it must somehow be linked to the size of the original df.
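
    The control test looked more or less like this (`spark.range()` standing in for however the dummy df was actually built):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    dummy_df = spark.range(10_000)  # 10k rows, single "id" column, no heavy lineage
    display(dummy_df)               # fast, since there is no giant source to scan
    ```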