
PySpark Project Creation

Awantik Das edited this page Nov 21, 2017 · 21 revisions
  1. Create a project directory

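  2. Copy the launch_spark_submit.sh script here (required if a notebook is also running on the same Spark installation)

    #!/bin/bash
    # Unset the notebook driver so spark-submit runs with the plain Python interpreter
    unset PYSPARK_DRIVER_PYTHON
    spark-submit "$@"
    export PYSPARK_DRIVER_PYTHON=jupyter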

  3. Create the entry program entry.py with a 'main' function

  4. Create another directory, 'additionalCode'

  5. cd additionalCode

  6. Create setup.py

    from setuptools import setup

    setup(
        name='PySparkUtilities',
        version='0.1dev',
        packages=['utilities'],
        license='Creative Commons Attribution-Noncommercial-Share Alike license',
        long_description='An example of how to package code for PySpark',
    )

  7. mkdir utilities

  8. Copy your modules into it (on Python 2.7, utilities also needs an __init__.py file to be importable as a package)

  9. In additionalCode, run: python setup.py bdist_egg

  10. This creates a dist directory.

  11. dist will contain the egg file.

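  12. To run, go back to the project directory and execute: ./launch_spark_submit.sh --master local[4] --py-files additionalCode/dist/PySparkUtilities-0.2.dev0-py2.7.egg entry.py (use the egg file name that actually appears in dist; it depends on the version in setup.py and on the Python version)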
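
The layout produced by the steps above can be sketched as a small Python script. This is only an illustration under stated assumptions: the `scaffold` helper, the contents of `ENTRY_PY`, and the use of a temporary directory are hypothetical and not part of the original instructions. The generated entry.py is a minimal guess at what an entry program with a 'main' function might look like, and the generated setup.py is a shortened version of the one in step 6.

```python
import os
import tempfile

# Hypothetical contents for entry.py. pyspark is only imported when the
# generated file is run through spark-submit, not by this scaffold script.
ENTRY_PY = '''\
from pyspark import SparkContext


def main():
    sc = SparkContext(appName="entry")
    # Modules shipped in the egg via --py-files are importable here,
    # e.g. `from utilities import some_module` (hypothetical module name).
    print(sc.parallelize(range(10)).sum())
    sc.stop()


if __name__ == "__main__":
    main()
'''

# Shortened version of the setup.py from step 6.
SETUP_PY = '''\
from setuptools import setup

setup(
    name='PySparkUtilities',
    version='0.1dev',
    packages=['utilities'],
)
'''


def scaffold(root):
    """Create the directory layout from the steps above under `root`."""
    pkg_dir = os.path.join(root, "additionalCode", "utilities")
    os.makedirs(pkg_dir)
    files = {
        os.path.join(root, "entry.py"): ENTRY_PY,
        os.path.join(root, "additionalCode", "setup.py"): SETUP_PY,
        # Empty __init__.py so that `utilities` is a package.
        os.path.join(pkg_dir, "__init__.py"): "",
    }
    for path, content in files.items():
        with open(path, "w") as handle:
            handle.write(content)
    # Return the created files as paths relative to the project root.
    return sorted(os.path.relpath(path, root) for path in files)


# Build the skeleton in a throwaway directory and list what was created.
project_root = tempfile.mkdtemp(prefix="pyspark_project_")
created = scaffold(project_root)
print(created)
```

After scaffolding, the launch script from step 2 still has to be copied into the project directory, and steps 9 to 12 (building the egg and submitting) are run by hand as described.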
