Compiling Spark from Source

Thanks for the fast replies and encouragement.

To clarify, I think the big-data-basic bundle no longer includes Spark.

  • When following this tutorial, a prerequisite step includes installing big-data-basic. However, the current documentation for all available bundles does not list Spark anywhere except under buildreq-spark.

  • Even with a fresh install followed by the tutorial above, it appears that an update or install of big-data-basic will not populate the Spark examples under the /usr/share/defaults/spark path. This thread discusses suggested practices for keeping things organized, partly in response to a general question about documentation; Spark is one example.
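For anyone who wants to reproduce the check, here is a minimal sketch. The swupd invocations only exist on Clear Linux, so they are shown commented out; the directory path is the one from the tutorial above:

```shell
# Illustrative check on Clear Linux; the swupd commands are commented
# out here because they only exist on Clear Linux:
#   sudo swupd bundle-list      # is big-data-basic installed?
#   sudo swupd search spark     # which bundle (if any) mentions spark?
# Confirm whether the Spark example configs were actually populated:
if [ -d /usr/share/defaults/spark ]; then
    echo "spark defaults present"
else
    echo "spark defaults missing"
fi
```

On my machine the final check prints "spark defaults missing" even after installing the bundle, which is what prompted the bullet above.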

doct0rHu requests clarification about the above on this thread.

With the announcements above, I made working assumptions that

  • a) Spark is no longer included or supported through swupd and
  • b) I could either try using the stacks from Docker or
  • c) compile from source, then move certain things over manually to where they “used to be” when swupd would install Spark.
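For anyone following the same path, option c) roughly looked like the sketch below. The release tag, the distribution name, and the /opt/spark prefix are my assumptions, not the paths swupd used to manage; dev/make-distribution.sh is Spark’s own packaging script:

```shell
# Build Spark from source and install it by hand (a sketch; the tag,
# the --name label, and the /opt/spark prefix are assumptions):
git clone https://github.com/apache/spark.git
cd spark
git checkout v3.0.0                   # pick whichever release you need

# Spark's own packaging script produces a runnable distribution tarball:
./dev/make-distribution.sh --name diy --tgz

# Unpack to a prefix of your choosing and wire up the environment:
sudo mkdir -p /opt/spark
sudo tar -xzf spark-*-bin-diy.tgz -C /opt/spark --strip-components=1
export SPARK_HOME=/opt/spark
export PATH="$SPARK_HOME/bin:$PATH"
```

The build pulls a lot of Maven dependencies on the first run, so expect it to take a while.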

Option b) proved a bit rough because Docker Swarm did not play nicely with the image/stack Clear provides on Docker Hub. I learned a lot and even tried Kubernetes. I had fits of success, but ultimately, since my use case is as described in the initial post on this thread, I moved back to “just Spark” on bare metal.

Option c) works now, with the following hiccups:

  1. A reboot of the Spark master means I have to re-run the start scripts for the master and for [script name omitted for sensitivity to the pejorative shade of meaning of the same word] to reconnect the workers.
  2. RStudio (desktop) throws an error and won’t connect to Spark, so I spun up rstudio-server instead; I haven’t been able to troubleshoot the error yet because of 1.
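On hiccup 1, one common way to survive reboots is a systemd unit that runs the start script for you. This is a sketch under assumptions: SPARK_HOME is /opt/spark, a dedicated spark user exists, and the stock sbin scripts daemonize into the background (hence Type=forking):

```ini
# /etc/systemd/system/spark-master.service -- a sketch; the paths and
# the "spark" user are assumptions, adjust to your own layout.
[Unit]
Description=Apache Spark standalone master
After=network-online.target
Wants=network-online.target

[Service]
Type=forking
User=spark
Environment=SPARK_HOME=/opt/spark
ExecStart=/opt/spark/sbin/start-master.sh
ExecStop=/opt/spark/sbin/stop-master.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now spark-master`. A matching unit on each worker box would call `start-worker.sh spark://<master-host>:7077` (that script was renamed in Spark 3.1; earlier releases use the old name the hiccup above alludes to), so the cluster reassembles itself after a reboot.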

Also, any tips for tracking the above and keeping a tidy record of it would be welcome. I wasn’t overstating how much I appreciate the community and the existing documentation; it’s a ton of thinking and expertise provided at no cost. To supplement, I keep these desk references, listed as title (author) and my rating of their utility:

  1. Spark: The Definitive Guide (Bill Chambers) 8 of 10
  2. Hadoop: The Definitive Guide (Tom White) 8 of 10
  3. The Kubernetes Book (Nigel Poulton) 7 of 10
  4. Mastering Spark with R (Javier Luraschi) 4 of 10
  5. Docker Deep Dive (Nigel Poulton) 7 of 10
  6. Ubuntu Unleashed (Matthew Helmke) 10 of 10
  7. Kubernetes: A Step-by-Step Guide for Beginners (Sheldon Miles) 3 of 10

So, I look up whatever I can, but I finally got stuck enough to ask. My Docker/Kubernetes journey pointed me back to “just do Spark on bare metal”.

Giving back/engaging more on my end:

  • Gently, I’d rate the Clear Linux tutorials 5 of 10, the community posts 10 of 10, and the documentation overall 2 of 10.
  • I’m happy to help share/describe more lessons learned, but don’t want to add noise or distraction.
  • None of this is a complaint; I only endeavor to figure out my problem(s) while helping others either avoid or leverage my hard-headed DIY mistakes.
  • I used to use these quite a bit, but I wanted Spark without paying through the nose for AWS. So, I think I can handle the RStudio issues once I get back to them. It would be great to offer something like this to the community, and maybe I could help?
  • Creating mixes and helping incubate a community-based third-party repo for bundles is something that has generated chatter across the forum, and I’d be excited to be part of that effort. I just don’t have real-world experience, but I’d eagerly do what I could, and I think I could help with maintenance or just writing documentation.