An example of AWS Glue job and workflow deployment with Terraform, in monorepo style.
To learn more about the decisions behind this structure, check out the supporting articles: https://dev.to/1oglop1/aws-glue-first-experience-part-1-how-to-run-your-code-3pe3
(For simplicity, this solution uses just one bucket and does not deploy a database.)
Requirements:
- AWS Account
- S3 bucket to store the Terraform state.
- Rename `.env.example` to `.env` and set the values.
- Export the environment variables from `.env` using: `set -o allexport; source .env; set +o allexport`
- `docker-compose up -d`
- `docker exec -it glue /bin/bash`
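The exact keys belong to `.env.example` in the repo; purely as a hypothetical sketch (only `TF_VAR_glue_bucket_name` is referenced later in this README, the rest are illustrative), a filled-in `.env` might look like:

```
# Hypothetical values -- see .env.example for the real keys
AWS_DEFAULT_REGION=eu-west-1
AWS_PROFILE=my-profile
TF_VAR_glue_bucket_name=my-glue-example-bucket
```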
Now we are going to work inside the Docker container:
```
make tf-init
make tf-plan
make tf-apply
make jobs-deploy
```
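The real targets live in the repository's Makefile; as a rough, hypothetical sketch of what targets like these typically wrap (the `terraform` directory name and the `jobs-deploy` recipe are assumptions):

```makefile
# Hypothetical sketch -- check the repo's Makefile for the actual recipes
TF_DIR ?= terraform

tf-init:
	cd $(TF_DIR) && terraform init

tf-plan:
	cd $(TF_DIR) && terraform plan

tf-apply:
	cd $(TF_DIR) && terraform apply

jobs-deploy:
	# e.g. upload the job scripts to the bucket Glue reads them from
	aws s3 sync glue/ s3://$(TF_VAR_glue_bucket_name)/glue/
```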
That's it! If everything went well, you can now go to the AWS Glue Console and explore the jobs and workflows.
Or start the workflow from the CLI: `aws glue start-workflow-run --name etl-workflow--simple`
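To check on the run afterwards, `get-workflow-runs` lists the runs of a workflow (using the same workflow name as above):

```
aws glue get-workflow-runs --name etl-workflow--simple
```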
Once you are finished with your observations, remove everything with `make tf-destroy`.
With the release of Glue 2.0, AWS published an official Glue Docker image that you can use for local development of Glue jobs.
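Before the walk-through below, here is a minimal, hypothetical sketch of what a job script in this style might look like, assuming plain `argparse` parameters and a local `SparkSession` (the actual scripts under `glue/data_sources/ds1` are the reference; paths and transformations here are illustrative):

```python
# Hypothetical sketch of a job like raw_to_refined.py -- not the repo's actual code
import argparse
import logging

from pyspark.sql import SparkSession


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--APP_SETTINGS_ENVIRONMENT", default="dev")
    parser.add_argument("--LOG_LEVEL", default="INFO")
    parser.add_argument("--S3_BUCKET", required=True)
    return parser.parse_args()


def main():
    args = parse_args()
    logging.basicConfig(level=args.LOG_LEVEL)

    spark = SparkSession.builder.appName("raw_to_refined").getOrCreate()

    # Read the raw layer, transform, and write the refined layer.
    df = spark.read.csv(f"s3a://{args.S3_BUCKET}/ds1/raw/", header=True)
    df.write.mode("overwrite").parquet(f"s3a://{args.S3_BUCKET}/ds1/refined/")

    spark.stop()


if __name__ == "__main__":
    main()
```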
Example:

- `docker exec -it glue /bin/bash` to connect into our container
- `cd /project/glue/data_sources/ds1/raw_to_refined`
- `pip install -r requirements.txt`
- Run the first job:
  `python raw_to_refined.py --APP_SETTINGS_ENVIRONMENT=dev --LOG_LEVEL=DEBUG --S3_BUCKET=${TF_VAR_glue_bucket_name}`
- `cd /project/glue/data_sources/ds1/refined_to_curated` (this step requires the results of the previous stage, `raw_to_refined`)
- Run the second job:
  `python refined_to_curated.py --APP_SETTINGS_ENVIRONMENT=dev --LOG_LEVEL=DEBUG --S3_BUCKET=${TF_VAR_glue_bucket_name}`
If everything went well, you should see output like this:
```
2020-12-23 14:28:43,278 DEBUG glue_shared.spark_helpers - DF:
+--------------------+-----------+-----------+--------+-------+---+------+-----+-----+------+------+--------+-----+------+-----+---------+
|                name|        mfr|       type|calories|protein|fat|sodium|fiber|carbo|sugars|potass|vitamins|shelf|weight| cups|   rating|
+--------------------+-----------+-----------+--------+-------+---+------+-----+-----+------+------+--------+-----+------+-----+---------+
|              String|Categorical|Categorical|     Int|    Int|Int|   Int|Float|Float|   Int|   Int|     Int|  Int| Float|Float|    Float|
|           100% Bran|          N|          C|      70|      4|  1|   130|   10|    5|     6|   280|      25|    3|     1| 0.33|68.402973|
|   100% Natural Bran|          Q|          C|     120|      3|  5|    15|    2|    8|     8|   135|       0|    3|     1|    1|33.983679|
|            All-Bran|          K|          C|      70|      4|  1|   260|    9|    7|     5|   320|      25|    3|     1| 0.33|59.425505|
|All-Bran with Ext...|          K|          C|      50|      4|  0|   140|   14|    8|     0|   330|      25|    3|     1|  0.5|93.704912|
|      Almond Delight|          R|          C|     110|      2|  2|   200|    1|   14|     8|    -1|      25|    3|     1| 0.75|34.384843|
|Apple Cinnamon Ch...|          G|          C|     110|      2|  2|   180|  1.5| 10.5|    10|    70|      25|    1|     1| 0.75|29.509541|
|         Apple Jacks|          K|          C|     110|      2|  0|   125|    1|   11|    14|    30|      25|    2|     1|    1|33.174094|
|             Basic 4|          G|          C|     130|      3|  2|   210|    2|   18|     8|   100|      25|    3|  1.33| 0.75|37.038562|
|           Bran Chex|          R|          C|      90|      2|  1|   200|    4|   15|     6|   125|      25|    1|     1| 0.67|49.120253|
+--------------------+-----------+-----------+--------+-------+---+------+-----+-----+------+------+--------+-----+------+-----+---------+
only showing top 10 rows
```
The commands above start PySpark inside the container and look for files stored in S3 under `<bucket>/ds1/refined`.
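If you want to check the objects the jobs produced, you can list them with the AWS CLI (the path is assumed from the description above):

```
aws s3 ls s3://${TF_VAR_glue_bucket_name}/ds1/refined/ --recursive
```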
PS. You should avoid running local PySpark on large datasets!
Please keep in mind that the IAM roles used in this example are very broad and should not be used as-is.
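For instance, the S3 permissions of a job role could be scoped to the single bucket instead of wildcards; a hypothetical policy fragment (the bucket name is illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
      "Resource": [
        "arn:aws:s3:::my-glue-example-bucket",
        "arn:aws:s3:::my-glue-example-bucket/*"
      ]
    }
  ]
}
```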