Troubleshooting

  • Job is slow or OOMs (throws an OutOfMemoryError) while using an aggregate like collect_list or sample_call_summary_stats

    • Try disabling the ObjectHashAggregate by setting spark.sql.execution.useObjectHashAggregateExec to false

  • Job is slow or OOMs while writing to partitioned table

    • This error can occur when reading from highly compressed files. Try decreasing spark.files.maxPartitionBytes to a smaller value like 33554432 (32MB)

  • My VCF looks weird after merging VCFs and saving with bigvcf

    • When saving to a VCF, the samples in the genotypes array must be in the same order for each row. This ordering is not guaranteed when using collect_list to join multiple VCFs. Try sorting the array using sort_array.

  • Glow’s behavior changed after a release

  • com.databricks.sql.io.FileReadException: Error while reading file

    • When Glow is registered to access transform functions this also overrides the Spark Context. This can interfere with the checkpointing functionality in Delta Lake in a Databricks environment. To resolve please reset the runtime configurations via spark.sql("RESET") after running Glow transform functions and before checkpointing to Delta Lake, then try again.