Converting GFF3 to refFlat

  • By: Faith Okamoto
  • Original development: April 2023
  • Writeup: July 2024

Full copy-paste-able code is right before References.

Introduction

Palmer Lab performs genome-wide association studies (GWAS) to determine genetic drivers of complex traits such as addiction. GWAS identify regions of the genome associated with the given phenotype. Association significance is assessed via p-values. However, due to local linkage disequilibrium (LD), i.e. the tendency for physically close variants to be inherited as a unit without recombination, many variants have similar p-values.

To identify the causal signal in a quantitative trait locus (QTL), we visualize GWAS results in a regional association plot (e.g. Gunturkun et al. 2022). These display the association signal in the context of local genes. Genes within or nearby the block of associated variants are more likely to be driving the association signal. These genes can then be prioritized for follow-up analyses.

Figure 1. Example plots from Gunturkun et al. 2022A. Porcupine plot with genotype-phenotype associations for all phenotypes in the study. Variant position (separated by chromosome) on X-axis, significance level on Y-axis. Taken from Figure 3. B. Regional association plot for "latency of entering social zone in SIT" QTL on Chromosome 17. Note the included gene annotations. Taken from Figure 6.

Regional association plots require the positions of genes to plot. We use the refFlat format, as required by LocusZoom. However, the Rat Genome Database (RGDreleases genome annotations in GFF3 format. Thus, we need to convert from one format to the other.

File formats

Note that for annotations on the - strand, start and end are flipped. For example, txStart is the end of transcription. This ensures that end > start.

GFF3

GFF3 is a tab-separated nine-column format. It starts with #-marked comments, which provide metadata such as format version. Column headers are not provided.

Col #Used by scriptColumn nameExampleDescription
1YesseqidChr1chromosome/contig name
2NosourceRGDannotation source
3Yestypeexontype of annotation
4Yesstart533961-indexed start position
5Yesend537011-indexed end position
6Noscore.floating point value
7Yesstrand-annotation strand
8Nophase00-indexed codon start position
9YesattributesParent=mRNARGD11369141;ID=trFeature11369144tag-value pairs
Table 1. Description of each column in a GFF3.

Only the Name (genes only), Parent (non-genes only), and ID attributes are used. Each gene has its own line, and then each of its mRNAs, exons and coding sequences (CDS) has its own line as well.

These annotations are taken from RGD

RefFlat

RefFlat is a tab-separated 11-column format. It starts with a header row. The following table is modified from the UCSC Genome Browser.

Col #FieldExampleDescription
1geneNamePSRC1Common name of gene
2nameNM_001283145Gene identifier
3chromchr1chromosome/contig name
4strand+annotation strand
5txStart1174731170-indexed transcription start position
6txEnd1174763790-indexed transcription end position
7cdsStart1174735210-indexed coding region start position
8cdsEnd1174760750-indexed coding region end position
9exonCount8Number of exons
10exonStarts117473117,117473483,1174736...0-indexed exon start positions
11exonEnds117473147,117473540,1174737...0-indexed exon end positions
Table 2. Description of each column in a refFlat.

Each gene and all of its exons are on the same line.

Conversion method

  1. Read raw file into gff3: the GFF3 is read (ignoring header comments). The two core attributes (ID and Parent) are extracted from the ATTR column and put into their own columns.
  2. Clean up gff3: minor tweaks to make the GFF3 closer to refFlat data.
    • In a GFF3, exons/CDS which are part of an mRNA have the mRNA as their Parent. In a refFlat, exons are directly connected to their parent gene. Thus, the Parent of each mRNA is used to overwrite its children's.
    • Many of the lines in a GFF3 are not needed in a refFlat. These include three_prime_UTRfive_prime_UTRmRNA, and any sort of gene with a Parent (since refFlat needs genes to be the top-level unit).
    • Some genes aren't given a separate exon/CDS row in the GFF3. Since refFlat needs exon annotations, these are given a dummy whole-gene exon.
  3. Reformat into refFlat: for each gene in gff3:
    • Copy data over to form the first six refFlat columns.
    • Take all exons, sort by start coordinate, merge overlapping exons (i.e. use maximum edges for alternative splicing), and distill into exonCountexonStarts, and exonEnds columns.
    • Take most extreme CDS positions for the cdsStart and cdsEnd columns.
  4. Save refFlat: combine Step 3 by-gene information and save to file.

Code

This code requires the package data.table. It was tested for data.table version 1.14.8 and R version 4.2.3.

library(data.table)

# ---- read gff3 basics ----

# https://useast.ensembl.org/info/website/upload/gff3.html
gff3 <- fread(<infile>, header = FALSE,
              col.names = c("SEQID", "SOURCE", "TYPE", "START", "END", 
                            "SCORE", "STRAND", "PHASE", "ATTR"))

# extract <val> from [anything];<name>=<val>;[anything]
extract_attr <- function(name, x) {
  ifelse(grepl(name, x), gsub(paste0(".*;", name, "=*(.*?);.*"), "\\1", x), NA)
}

# the extract_attr() function expects everything to be bordered by semicolons
gff3[, ATTR := paste0(";", ATTR, ";")]
# basic attributes which are needed for multiple purposes; extracting them once
gff3[, ID := extract_attr("ID", ATTR)][, PARENT := extract_attr("Parent", ATTR)]

# ---- clean up gff3 ----

# for chains of feature -> mRNA -> gene, create a pointer from feature -> gene
mRNAs <- gff3[TYPE == "mRNA"]
mRNAs[, PARENT_PARENT := PARENT][, PARENT := NULL]
gff3 <- merge.data.table(gff3, mRNAs[, c("ID", "PARENT_PARENT")], 
                         by.x = "PARENT", by.y = "ID", all.x = TRUE)
gff3[!is.na(PARENT_PARENT), PARENT := PARENT_PARENT][, PARENT_PARENT := NULL]

# we don't care about these separately
gff3 <- gff3[!(TYPE %in% c("three_prime_UTR", "five_prime_UTR", "mRNA"))]
# these are used as filler single-exon things when no exon is known
types_of_genes <- c("pseudogene", "protein_coding_gene", 
                    "ncRNA_gene", "tRNA_gene", "gene")
gff3 <- gff3[!(TYPE %in% types_of_genes & !is.na(PARENT))]

# some genes lack exons or CDS; pretend these are exons and CDS unto themselves
childless_genes <- gff3[TYPE == "gene" & !(ID %in% PARENT)]
gff3 <- rbindlist(list(gff3, 
                       copy(childless_genes)[, TYPE := "exon"][, PARENT := ID], 
                       copy(childless_genes)[, TYPE := "CDS"][, PARENT := ID]))

# ---- extract information for refFlat file ----

# get basic information about each gene
refFlat <- gff3[TYPE == "gene", 
                .(geneName = extract_attr("Name", ATTR), name = ID,
                  chrom = paste0("c", substring(SEQID, 2)), strand = STRAND, 
                  txStart = START, txEnd = END)]

# helper function to determine unique exons given starts and ends
get_exons <- function(starts, ends) {
  exons <- data.table(START = starts, END = ends)
  exons <- exons[!duplicated(exons)][order(START)]
  # merge overlapping exons
  exons <- exons[, .(START = min(START), END = max(END)),
                 by = .(group = cumsum(c(1, tail(START, -1) > head(END, -1))))]
  # extract number and ends of exons
  return(list(exonCount = nrow(exons), 
              exonStarts = paste(exons$START, collapse = ","),
              exonEnds = paste(exons$END, collapse = ",")))
}

# find exons & coding-sequence boundaries by gene
exons <- gff3[TYPE == "exon", get_exons(START, END), by = PARENT]
cds <- gff3[TYPE == "CDS", .(cdsStart = min(START), cdsEnd = max(END)), 
            by = PARENT]

# --- combine into final refFlat ----

# add these columns to the refFlat
refFlat <- merge.data.table(refFlat, merge.data.table(exons, cds), 
                            by.x = "name", by.y = "PARENT")

# https://genome.sph.umich.edu/wiki/LocusZoom_Standalone#Inserting_refFlat
setcolorder(refFlat, c("geneName", "name", "chrom", "strand", "txStart", 
                       "txEnd", "cdsStart", "cdsEnd", "exonCount", 
                       "exonStarts", "exonEnds"))
write.table(refFlat, <outfile>,
            quote = FALSE, sep = "\t", row.names = FALSE)

References

Gunturkun MH, Wang T, Chitre AS, Garcia Martinez A, Holl K, St. Pierre C, Bimschleger H, Gao J, Cheng R, Polesskaya O, et al. 2022. Genome-Wide Association Study on Three Behaviors Tested in an Open Field in Heterogeneous Stock Rats Identifies Multiple Loci Implicated in Psychiatric Disorders. Front Psychiatry. 13. doi:10.3389/fpsyt.2022.790566.