Writing portable AWS Glue Jobs


AWS Glue is a somewhat magical service. When it works, it makes ETL downright simple. With Glue Crawlers you catalog your data (be it a database or JSON files), and with Glue Jobs you use the same catalog to transform that data and load it into another store using distributed Spark jobs. And all that while being fully managed and serverless. Indeed, magical.

But that magic breaks down at times. Should you follow the advice of AWS and write your jobs using the Glue library, you will find that your pipelines come with a number of inconvenient limitations. First of all, your job scripts cannot be executed locally on your own machine, forcing you to develop against development endpoints that you have to provision and pay for. This also means that unit tests cannot be written, making it tough to run any production workload with confidence. Second, your jobs are not portable, locking you into AWS. Even migrating to an alternative service on AWS such as AWS EMR becomes impossible. These two points alone make it a scary exercise to commit your ETL to AWS Glue.

Yet we do not have to accept these limitations. If you forego some convenience that Glue brings, it is possible to create portable and unit-testable pipelines.

The problem

The problem at its heart lies in the Glue library, which adds a number of convenient methods but does not run anywhere other than the AWS Glue environment. Using it, a job looks something like the following code snippet in Python:
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping, ResolveChoice, DropNullFields
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

# job bootstrapping
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# transformations using the Glue library
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "source", table_name = "table", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("value", "string", "value", "string")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://target/table"}, format = "parquet", transformation_ctx = "datasink4")

# commit the job for bookmarking the progress
job.commit()


Even though Glue jobs run on Spark, the code looks markedly different from an ordinary PySpark script. The Glue library introduces the concepts of Jobs, DynamicFrames and Transforms. Used together, they let Glue keep track of which files have already been processed (known as job bookmarking) and handle fields with multiple data types gracefully. But writing your ETL like this leaves you with code that can neither be unit-tested nor migrated.

The solution

The solution lies in eliminating the Glue library and using pure PySpark. This possibility is not marketed, nor is it extensively documented, but since Glue Jobs run on Spark, the switch is quickly made. An example would look like the following snippet:
from pyspark.sql import SparkSession

# with the Glue Data Catalog configured as the job's metastore,
# catalog tables can be queried directly with Spark SQL
spark_session = SparkSession.builder \
  .appName("job_name") \
  .getOrCreate()

df = spark_session.sql("SELECT value FROM source.table")

df.write.mode("overwrite").format("parquet").save("s3://target/table/")


Gone are the specific Glue library methods. What you now have is a standard PySpark job that uses the Glue Data Catalog to read data. This job is portable, can be developed locally without a development endpoint, and can be put into a test harness. The world is right again. But for what we have gained, we have sacrificed something else.
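Before turning to what was sacrificed, here is a minimal sketch of what such a test could look like, assuming the job's transformation logic has been extracted into a transform function in a module named job (both names are hypothetical):

import pytest
from pyspark.sql import SparkSession

from job import transform  # hypothetical module and function under test

@pytest.fixture(scope="session")
def spark():
    # a local SparkSession is all the test needs; no development endpoint required
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_transform_keeps_value_column(spark):
    source = spark.createDataFrame([("a",), ("b",)], ["value"])
    result = transform(source)
    assert result.columns == ["value"]
    assert result.count() == 2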

Limitations & conclusion

Removing the Glue library inevitably means forgoing the convenience it brings. DynamicFrames are downright useful, and the Relationalize transform, which flattens nested dataframes into relational tables, has no equally simple alternative in pure PySpark. Most importantly, job bookmarking only works when you load and transform data through the Glue library. In the past, using pure PySpark would also have locked you out of the Glue Data Catalog in your queries, but this has recently been remedied by the AWS Glue team.
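As a rough illustration, nested data can still be flattened by hand in pure PySpark, though nowhere near as conveniently as with Relationalize. The table and column names below are hypothetical:

from pyspark.sql import functions as F

# hypothetical catalog table with a struct column "customer" and an array column "items"
orders = spark_session.sql("SELECT order_id, customer, items FROM source.orders")

# promote a nested struct field to a top-level column
flat = orders.withColumn("customer_name", F.col("customer.name")).drop("customer")

# explode the array column into one row per element
exploded = flat.withColumn("item", F.explode("items")).drop("items")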

Many developers working on production pipelines may have been scared away from AWS Glue because its jobs seemingly cannot be made portable or unit-testable. And while that certainly put us off in the past, it is not a limitation of the service itself. If you are willing to sacrifice some convenience, an alternative exists, and it brings out the best in the service: fully managed, serverless, unit-testable and portable pipelines. No magic lost there.