Writing portable AWS Glue Jobs
AWS Glue is a somewhat magical service. When it works, it makes ETL downright simple. With Glue Crawlers you catalog your data (be it a database or JSON files), and with Glue Jobs you use the same catalog to transform that data and load it into another store using distributed Spark jobs. And all of that while being fully managed and serverless. Indeed, magical.
But that magic breaks down at times. Should you follow the advice of AWS and write your jobs using the Glue library, you will find that your pipelines have a number of inconvenient limitations. First of all, your job scripts cannot be executed locally on your computer, forcing you to develop your code against development endpoints that you have to provision and pay for. This also means that unit tests cannot be written, making it tough to run any production workload with confidence. Second, your jobs are not portable, locking you into AWS. Even migrating to an alternative service such as AWS EMR becomes impossible. These two points alone make it a scary exercise to commit your ETL to AWS Glue.
Yet we do not have to accept these limitations. If you forego some of the convenience that Glue brings, it is possible to create portable and unit-testable pipelines.
The problem
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping, ResolveChoice, DropNullFields
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# job bootstrapping
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# transformations using the Glue library
datasource0 = glueContext.create_dynamic_frame.from_catalog(database="source", table_name="table", transformation_ctx="datasource0")
applymapping1 = ApplyMapping.apply(frame=datasource0, mappings=[("value", "string", "value", "string")], transformation_ctx="applymapping1")
resolvechoice2 = ResolveChoice.apply(frame=applymapping1, choice="make_struct", transformation_ctx="resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame=resolvechoice2, transformation_ctx="dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_options(frame=dropnullfields3, connection_type="s3", connection_options={"path": "s3://target/table"}, format="parquet", transformation_ctx="datasink4")

# commit the job for bookmarking the progress
job.commit()
The solution
from pyspark.sql import SparkSession

spark_session = SparkSession.builder \
    .appName("job_name") \
    .getOrCreate()

df = spark_session.sql("SELECT value FROM table")
df.write.mode("overwrite").format("parquet").save("s3://target/table/")
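Because this script is plain PySpark, it runs on any machine with a local Spark installation, and its transformations can be covered by ordinary unit tests. Below is a minimal sketch of what such a test could look like, assuming the transformation is extracted into a function of its own; the transform function, the column name and the use of pytest are illustrative choices, not part of the job script above.

from pyspark.sql import DataFrame, SparkSession

def transform(df: DataFrame) -> DataFrame:
    # hypothetical transformation pulled out of the job script
    return df.select("value")

def test_transform():
    # a local SparkSession, no Glue or AWS dependencies required
    spark = SparkSession.builder.master("local[1]").appName("test").getOrCreate()
    source = spark.createDataFrame([("a",), ("b",)], ["value"])
    result = transform(source)
    assert result.columns == ["value"]
    assert result.count() == 2

Running pytest on a developer machine exercises the same function that the deployed job imports, so no development endpoint is needed.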
Limitations & conclusion
Removing the Glue library inevitably means that we forego the convenience it brings. DynamicFrames are downright useful, and the Relationalize transform, which flattens nested dataframes into relational tables, has no equally simple alternative in pure PySpark. Most importantly, job bookmarking only works when you load and transform data using the Glue library. In the past, using pure PySpark would also lock you out of the Glue Data Catalog in your queries, but this has recently been remedied by the AWS Glue team.
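To give an idea of what replacing Relationalize involves, here is a rough sketch of flattening a single level of nesting in pure PySpark; the id and address columns are made up for the example, and deeper or variable nesting quickly requires far more code than Relationalize's one-liner.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("flatten_example").getOrCreate()

# hypothetical nested input: an id plus an address struct
nested = spark.createDataFrame(
    [(1, ("Amsterdam", "1011"))],
    "id INT, address STRUCT<city: STRING, zip: STRING>",
)

# manually project each nested field into a top-level column
flat = nested.select(
    col("id"),
    col("address.city").alias("address_city"),
    col("address.zip").alias("address_zip"),
)
flat.show()

Every nested field has to be listed and aliased by hand, which is exactly the bookkeeping that Relationalize does for you.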
Many developers working on pipelines in production environments might have been scared away from AWS Glue because its jobs seemingly are neither portable nor unit-testable. And while that certainly put us off in the past, it is not a limitation of the service itself. If you are willing to sacrifice some convenience, an alternative exists, and it brings out the best in the service: fully managed, serverless, unit-tested and portable pipelines. No magic lost there.