Compiling Spark from Source

Figured I’d stop lurking and first say thank you to the Clear Linux team and everyone who posts here.

Cut to the chase: is it a terrible move to have compiled Spark from source and unpacked/installed it from my user home?

I have a home lab that’s meant to run a Spark cluster. The cluster will be a one-trick pony of sorts – I want to use RStudio and the R API to Spark for ML/modeling. Since my own limitations (skills, patience) meant I could only keep Docker Swarm up, never K8s, I switched back to running Spark on bare metal. So, here’s my “dumb noob” move – I compiled from source after installing the jdk13 bundle via swupd. I set JAVA_HOME per other posts about it, but unpacked and installed from my home directory. It all worked surprisingly well – far less heartache than having Swarm go down or hunting for the right setup for the stacks/images.
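For anyone curious, the environment wiring for a home-directory install like mine ends up being just a few lines in ~/.bashrc. The JDK path and unpack location below are my assumptions, not official paths – adjust them to wherever swupd put the JDK and wherever you unpacked Spark:

```shell
# ~/.bashrc additions for a home-directory Spark install.
# Both paths are assumptions -- point them at your actual JDK and unpack dir.
export JAVA_HOME=/usr/lib/jvm/java-13
export SPARK_HOME="$HOME/spark"
# Put spark-shell/spark-submit (bin) and the start/stop scripts (sbin) on PATH.
export PATH="$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH"
```

With that in place the Spark scripts resolve from anywhere, and tools that look for SPARK_HOME (sparklyr, for instance) can find the install.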

Should I stop and re-do everything?

I understand the stateless approach of CL on a basic level, but this is just a home lab and it’s DIY, learn-by-doing. I’m never going to production or developing something, etc. I simply want to apply existing R skills/knowledge and play with ML through Spark and Synthea data (https://synthetichealth.github.io/synthea/). It’s also pretty crazy how much faster my ten Braswell NUCs are on CL versus Ubuntu – or is that just placebo?

Originally, I picked Clear because of the stacks, and, stuck inside like everyone else, I needed an indoors-only weekend hobby. By the time I had put enough repurposed weekend/commute time into getting my feet wet with Docker, etc., I realized there would be no more direct support for Spark through swupd. I get that move, too, and bet it may have some small part to do with licensing (https://www.scotusblog.com/case-files/cases/google-llc-v-oracle-america-inc/)?

I get confused here. Isn’t Spark available in CL?

I am a little confused also. If …

Then why …

I might be missing something (sorry, it’s kind of a late night here), but if everything works “surprisingly well”, then what’s the matter?

And don’t ever sell yourself short or undervalue a good home lab. Sounds like you’re getting into some interesting stuff!

Oh and no, the faster compile times are not just a placebo. Enjoy! That’s a big part of why many of us are here. Welcome! The learning curve is half the value in a home lab, so just stick to what interests you and I’m sure you’ll figure it out.

Thanks for the fast replies and encouragement.

To clarify: I think the big-data-basic bundle no longer includes Spark.

  • When following this tutorial, a prerequisite step includes installing big-data-basic. However, the current documentation for all available bundles does not list Spark anywhere except buildreq-spark.

  • Even with a fresh install followed by the tutorial above, it appears a swupd update or install of big-data-basic will not populate the Spark examples in /usr/share/defaults/spark. This thread discusses suggested practices for keeping things organized, partly in response to general questions about documentation – and Spark is one example.

doct0rHu requests clarification about the above on this thread.

With the announcements above, I made working assumptions that

  • a) Spark is no longer included or supported through swupd and
  • b) I could either try using the stacks from Docker or
  • c) compile from source, then manually move certain things to where they “used to be” when swupd would install Spark.

b) above proved a bit rough because Docker Swarm did not play nicely with the image/stack Clear provides on Docker Hub. I did learn a lot, and even tried Kubernetes. I had fits of success, but ultimately, since my use case is as described in the initial post of this thread, I moved back to “just Spark” on bare metal.

c) above works now, with the following hiccups:

  1. a reboot of the Spark master means I have to re-run the start scripts for the master and for [script name omitted for sensitivity to pejorative different shade of meaning for same word] to connect workers
  2. RStudio (desktop) throws an error and won’t connect to Spark, so I spun up rstudio-server instead; I haven’t gotten to troubleshoot the error yet because of 1.
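For hiccup 1, one common workaround (a sketch, not Clear-specific guidance) is wrapping the start scripts in a systemd unit so the master comes back after a reboot. Everything below – the user name, the install path, and the unit name – is an assumption for a home-directory install like mine:

```ini
# /etc/systemd/system/spark-master.service -- sketch only; user/paths are assumptions
[Unit]
Description=Apache Spark standalone master
Wants=network-online.target
After=network-online.target

[Service]
# Spark's start scripts launch a daemon and exit, hence Type=forking.
Type=forking
User=youruser
Environment=SPARK_HOME=/home/youruser/spark
ExecStart=/home/youruser/spark/sbin/start-master.sh
ExecStop=/home/youruser/spark/sbin/stop-master.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now spark-master`. A sibling unit on each worker node can run the worker start script the same way, pointed at spark://<master-host>:7077 (7077 is the default standalone master port).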

Also, any tips for tracking the above and keeping a tidy record of it? I wasn’t overstating how much I appreciate the community and existing documentation – it’s a ton of thinking and expertise provided at no cost. To supplement, I keep the following desk references, listed as title (author) and my rating of their utility:

  1. Spark: The Definitive Guide (Bill Chambers) 8 of 10
  2. Hadoop: The Definitive Guide (Tom White) 8 of 10
  3. The Kubernetes Book (Nigel Poulton) 7 of 10
  4. Mastering Spark with R (Javier Luraschi) 4 of 10
  5. Docker Deep Dive (Nigel Poulton) 7 of 10
  6. Ubuntu Unleashed (Matthew Helmke) 10 of 10
  7. Kubernetes: A Step-by-Step Guide for Beginners (Sheldon Miles) 3 of 10

So, I look up whatever I can, but I finally got stuck enough to ask. My Docker/Kubernetes journey pointed me back to “just do Spark on bare metal”.

Giving back/engaging more on my end:

  • Gently, I’d rate the Clear Linux tutorials as 5 of 10, the community posts 10 of 10, and documentation overall 2 of 10.
  • I’m happy to help share/describe more lessons learned, but don’t want to add noise or distraction.
  • None of this is a complaint; I only endeavor to figure out my problem(s) while helping others either avoid or leverage my hard-headed DIY mistakes.
  • I used to use these quite a bit, but wanted Spark without paying through the nose for AWS. I think I can handle the RStudio issues once I get back to them. It would be great to offer something to a community like this – maybe I could help?
  • Creating mixes and helping incubate a community-based 3rd-party repo for bundles is something there’s been chatter about across the forum, and I’d be excited to be part of that effort. I just don’t have real-world experience, but I’d eagerly do what I could, and think I could help with maintenance or just writing documentation.

Yes, unfortunately the docs are misleading, as they haven’t been updated since the big Java purge. I would recommend going with Apache’s build instructions. If you really just need the old defaults to follow the tutorial, you could comb through the old releases – looks like the last one was 33440. Sorry, I’m on my phone or I would provide more info.
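(For anyone landing here later: the Apache route boils down to roughly the commands below. The version number is only an example – grab a current release, and treat this as a sketch of the official “Building Spark” steps rather than gospel.)

```shell
# Fetch and unpack a Spark source release (version is an example).
curl -LO https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1.tgz
tar xzf spark-3.0.1.tgz && cd spark-3.0.1
# Compile with the bundled Maven wrapper, skipping the (very long) test suite:
./build/mvn -DskipTests clean package
# Alternatively, skip compiling entirely and unpack a prebuilt
# spark-3.0.1-bin-hadoop*.tgz into your home directory.
```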


I think you’re right. The last time I used Spark was early 2019. It’s possible Spark is no longer provided.


Thanks again, and excuse the word vomit in my reply. I hadn’t thought to check the old releases and will comb through them now. I also found an example that may help with the reboot issue here: https://gist.github.com/shahmimajid/8d9a3e0b64e2b33e9e8bb59e9fabf980 – but I’ll check the old releases first.