Troubleshooting

  • Job is slow or OOMs (throws an OutOfMemoryError) while using an aggregate like collect_list or sample_call_summary_stats

    • Try disabling the ObjectHashAggregate by setting spark.sql.execution.useObjectHashAggregateExec to false
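
A minimal sketch of applying that setting from PySpark, assuming an active SparkSession is passed in. The helper name disable_object_hash_aggregate is hypothetical; only the configuration key comes from the text above.

```python
def disable_object_hash_aggregate(spark):
    """Turn off ObjectHashAggregate for the given SparkSession (hypothetical helper)."""
    # Spark SQL options set through spark.conf are passed here as strings.
    spark.conf.set("spark.sql.execution.useObjectHashAggregateExec", "false")
```

Call it once, for example disable_object_hash_aggregate(spark), before running the aggregation.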

  • Job is slow or OOMs while writing to a partitioned table

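    • This problem can occur when reading from highly compressed files, because each input partition expands to far more data in memory than its size on disk suggests. Try decreasing spark.sql.files.maxPartitionBytes to a smaller value like 33554432 (32 MB)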
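
A minimal sketch of lowering the input partition size from PySpark, assuming an active SparkSession is passed in. It uses the Spark SQL key spark.sql.files.maxPartitionBytes, which governs DataFrame file reads; the helper name shrink_input_partitions and its default argument are hypothetical.

```python
def shrink_input_partitions(spark, max_bytes=32 * 1024 * 1024):
    """Cap the bytes packed into one input partition (hypothetical helper)."""
    # 32 * 1024 * 1024 = 33554432 bytes, the 32 MB value suggested above.
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(max_bytes))
    return max_bytes
```

Set this before reading the compressed files, since it affects how the read is split into partitions.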

  • My VCF looks weird after merging VCFs and saving with bigvcf

    • When saving to a VCF, the samples in the genotypes array must be in the same order for each row. This ordering is not guaranteed when using collect_list to join multiple VCFs. Try sorting the array using sort_array.

  • Glow’s behavior changed after a release