Orchestrating ETL workflow using AWS Step Function


AWS Step Functions is a serverless orchestration service that lets you build workflows as a series of event-driven steps. It is built on state machines and tasks: a state machine is the workflow, each step in the workflow is a state, and a task is a state that represents a single unit of work performed by another AWS service.

Before running this example, we have to create a Glue job based on the following script, following the instructions here. For the worker type, choose 'Standard'; we will use 4 workers.

You will also need to create an IAM role and attach the following policies:

  • arn:aws:iam::aws:policy/AmazonS3FullAccess
  • arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess
  • arn:aws:iam::aws:policy/CloudWatchLogsFullAccess
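
As a rough sketch, the role and the three policy attachments above could be scripted with the AWS CLI. The role name and the service principal below are assumptions, not taken from the post: the sketch assumes the role is the one Step Functions assumes, so swap in glue.amazonaws.com if the role is meant for the Glue job instead.

```shell
# Assumptions (not from the post): the role name and the trusted service.
ROLE_NAME="etl-demo-role"                  # hypothetical role name
SERVICE_PRINCIPAL="states.amazonaws.com"   # assumed: Step Functions assumes the role

# Trust policy that lets the chosen service assume the role.
cat > trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "${SERVICE_PRINCIPAL}" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Create the role, then attach the three managed policies listed above.
create_etl_role() {
  aws iam create-role \
    --role-name "$ROLE_NAME" \
    --assume-role-policy-document file://trust-policy.json
  for policy in AmazonS3FullAccess AWSGlueConsoleFullAccess CloudWatchLogsFullAccess; do
    aws iam attach-role-policy \
      --role-name "$ROLE_NAME" \
      --policy-arn "arn:aws:iam::aws:policy/${policy}"
  done
}
```

Calling create_etl_role with valid AWS credentials runs the four CLI calls; nothing is sent to AWS until the function is invoked.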

This example uses the state machine definition to start the Glue job, wait for the job to complete, query an Athena table, and finally publish the number of rows to SNS.
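
The post's actual definition is not reproduced here, so the following is only a sketch of what a definition with this flow might look like, written to a local file. The state name 'Glue StartJobRun' and the 'JobName.$' mapping come from the error output shown later in this post; the other state names, the query, the workgroup, and every value in angle brackets are placeholder assumptions. The .sync suffix on the Glue and Athena resources is what makes the state machine wait for each call to finish.

```shell
# Sketch only: apart from 'Glue StartJobRun' and 'JobName.$', all names and
# values below are placeholders, not the post's actual definition.
# The quoted 'EOF' keeps the shell from touching the '$' JSONPath expressions.
cat > definition.json <<'EOF'
{
  "Comment": "ETL demo sketch: Glue job, Athena row count, SNS notification",
  "StartAt": "Glue StartJobRun",
  "States": {
    "Glue StartJobRun": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName.$": "$.JobName" },
      "Next": "Athena StartQueryExecution"
    },
    "Athena StartQueryExecution": {
      "Type": "Task",
      "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
      "Parameters": {
        "QueryString": "SELECT COUNT(*) FROM <database>.<table>",
        "WorkGroup": "primary",
        "ResultConfiguration": { "OutputLocation": "s3://<bucket>/athena-results/" }
      },
      "Next": "Athena GetQueryResults"
    },
    "Athena GetQueryResults": {
      "Type": "Task",
      "Resource": "arn:aws:states:::athena:getQueryResults",
      "Parameters": { "QueryExecutionId.$": "$.QueryExecution.QueryExecutionId" },
      "Next": "SNS Publish"
    },
    "SNS Publish": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "<sns-topic-arn>",
        "Message.$": "$.ResultSet.Rows[1].Data[0].VarCharValue"
      },
      "End": true
    }
  }
}
EOF
```

In this sketch the SNS message reads the second row of the Athena result set, because the first row returned for a SELECT query holds the column headers.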

Once the Glue job exists, we can create the state machine from this definition by running the command listed here and supplying the definition as JSON.
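
A minimal sketch of that step, assuming the definition has been saved to a local JSON file. The file name in the usage note is hypothetical, and the role ARN is left as a placeholder to fill in; the name ETLDemo is taken from the execution ARN shown later in this post.

```shell
# Wraps the create-state-machine call so the three inputs are explicit.
create_state_machine() {
  # $1: state machine name, $2: path to the definition JSON, $3: IAM role ARN
  aws stepfunctions create-state-machine \
    --name "$1" \
    --definition "file://$2" \
    --role-arn "$3"
}
```

For example: create_state_machine ETLDemo definition.json "<role-arn>". On success the CLI prints the new stateMachineArn and its creationDate.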

Get the state machine ARN by running the following command, which lists the state machines and their ARNs.
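
One way to do this, sketched here rather than copied from the post, is to list the state machines and filter by name with the CLI's --query option. The helper name is hypothetical.

```shell
# Prints the ARN of the state machine whose name matches $1.
get_state_machine_arn() {
  aws stepfunctions list-state-machines \
    --query "stateMachines[?name=='$1'].stateMachineArn" \
    --output text
}
```

For example: get_state_machine_arn ETLDemo. Running aws stepfunctions list-state-machines without --query returns the full list of names and ARNs instead.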

Then execute the state machine using the following command, passing the state machine ARN retrieved earlier.

$ aws stepfunctions start-execution --state-machine-arn <arn>

{
    "executionArn": "arn:aws:states:us-east-1:376337229415:execution:ETLDemo:905b2d8e-e659-4e18-ba1f-714882100324",
    "startDate": "2022-04-21T02:37:21.064000+01:00"
}

We can check the status of the execution:

$ aws stepfunctions describe-execution --execution-arn "arn:aws:states:us-east-1:376337229415:execution:ETLDemo:905b2d8e-e659-4e18-ba1f-714882100324"


{
    "executionArn": "<arn>",
    "stateMachineArn": "<arn>",
    "name": "905b2d8e-e659-4e18-ba1f-714882100324",
    "status": "FAILED",
    "startDate": "2022-04-21T02:37:21.064000+01:00",
    "stopDate": "2022-04-21T02:38:18.965000+01:00",
    "input": "{}",
    "inputDetails": {
        "included": true
    },
    "traceHeader": "Root=1-6260b551-db5653e799449c7169fc982b;Sampled=1"
}
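
If only the status matters, the same call can be narrowed with --query and polled until the execution leaves the RUNNING state. This is a sketch; the helper name and the default 10-second interval are assumptions.

```shell
# Polls describe-execution until the status is no longer RUNNING.
wait_for_execution() {
  # $1: execution ARN, $2: poll interval in seconds (defaults to 10)
  while :; do
    status="$(aws stepfunctions describe-execution \
      --execution-arn "$1" --query 'status' --output text)"
    echo "status: $status"
    if [ "$status" != "RUNNING" ]; then
      break
    fi
    sleep "${2:-10}"
  done
}
```

For example: wait_for_execution "<execution-arn>" 15. The loop stops on any terminal status, such as SUCCEEDED or FAILED.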

If the execution failed, we can retrieve its execution history. The command below returns the events in reverse order and limits the output to two items, so we get the latest event that failed and the cause of the failure.

$ aws stepfunctions get-execution-history  --execution-arn <enter-arn> --no-include-execution-data --reverse-order --max-items 2

{
    "events": [
        {
            "timestamp": "2022-04-21T02:38:18.965000+01:00",
            "type": "ExecutionFailed",
            "id": 9,
            "previousEventId": 0,
            "executionFailedEventDetails": {
                "error": "States.Runtime",
                "cause": "An error occurred while executing the state 'Glue StartJobRun' (entered at the event id #8). The JSONPath '$.JobName' specified for the field 'JobName.$' could not be found in the input '{}'"
            }
        },
        {
            "timestamp": "2022-04-21T02:38:18.965000+01:00",
            "type": "TaskStateEntered",
            "id": 8,
            "previousEventId": 7,
            "stateEnteredEventDetails": {
                "name": "Glue StartJobRun"
            }
        }
    ],
    "NextToken": "eyJuZXh0VG9rZW4iOiBudWxsLCAiYm90b190cnVuY2F0ZV9hbW91bnQiOiAyfQ=="
}

This error occurred because the input JSON {"JobName": "flights_s3_to_s3"} was not passed via the --input argument, so the execution started with the empty input '{}'. The 'Glue StartJobRun' task requires the JSONPath '$.JobName' to resolve against the input, as defined in the state machine definition.
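
A sketch of the corrected invocation, passing the job name through --input. The helper name is hypothetical; the job name in the usage note is the one given above.

```shell
# Starts an execution and supplies the JobName that the Glue task reads.
start_etl_execution() {
  # $1: state machine ARN, $2: Glue job name
  aws stepfunctions start-execution \
    --state-machine-arn "$1" \
    --input "{\"JobName\": \"$2\"}"
}
```

For example: start_etl_execution "<arn>" flights_s3_to_s3. With this input, '$.JobName' resolves to flights_s3_to_s3 instead of failing against '{}'.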

This post is licensed under CC BY 4.0 by the author.
