How to: assign ChIP-seq peaks to genes

This tutorial demonstrates one way to assign CTCF ChIP-seq peaks to the nearest genes using bioframe.

import matplotlib.pyplot as plt
import numpy as np

import bioframe

base_dir = "/tmp/bioframe_tutorial_data/"
assembly = "hg38"

Load chromosome sizes

chromsizes = bioframe.fetch_chromsizes(assembly)
chromsizes.tail()

name
chr21     46709983
chr22     50818468
chrX     156040895
chrY      57227415
chrM         16569
Name: length, dtype: int64

chromosomes = bioframe.make_viewframe(chromsizes)

Load CTCF ChIP-seq peaks for HFF from ENCODE

This approach makes use of the narrowPeak schema for bioframe.read_table .

ctcf_peaks = bioframe.read_table(
    "https://www.encodeproject.org/files/ENCFF401MQL/@@download/ENCFF401MQL.bed.gz",
    schema="narrowPeak",
)
ctcf_peaks.head()

	chrom	start	end	name	score	strand	fc	-log10p	-log10q	relSummit
0	chr19	48309541	48309911	.	1000	.	5.04924	-1.0	0.00438	185
1	chr4	130563716	130564086	.	993	.	5.05052	-1.0	0.00432	185
2	chr1	200622507	200622877	.	591	.	5.05489	-1.0	0.00400	185
3	chr5	112848447	112848817	.	869	.	5.05841	-1.0	0.00441	185
4	chr1	145960616	145960986	.	575	.	5.05955	-1.0	0.00439	185

# Filter for selected chromosomes:
ctcf_peaks = bioframe.overlap(ctcf_peaks, chromosomes).dropna(subset=["name_"])[
    ctcf_peaks.columns
]

Get list of genes from UCSC

UCSC genes are stored in .gtf format.

genes_url = (
    "https://hgdownload.soe.ucsc.edu/goldenpath/hg38/bigZips/genes/hg38.ensGene.gtf.gz"
)
genes = bioframe.read_table(genes_url, schema="gtf").query('feature=="CDS"')

genes.head()

## Note this functions to parse the attributes of the genes:
# import bioframe.sandbox.gtf_io
# genes_attr = bioframe.sandbox.gtf_io.parse_gtf_attributes(genes['attributes'])

	chrom	source	feature	start	end	score	strand	frame	attributes
47	chr1	ensGene	CDS	69091	70005	.	+	0	gene_id "ENSG00000186092"; transcript_id "ENST...
112	chr1	ensGene	CDS	182709	182746	.	+	0	gene_id "ENSG00000279928"; transcript_id "ENST...
114	chr1	ensGene	CDS	183114	183240	.	+	1	gene_id "ENSG00000279928"; transcript_id "ENST...
116	chr1	ensGene	CDS	183922	184155	.	+	0	gene_id "ENSG00000279928"; transcript_id "ENST...
122	chr1	ensGene	CDS	185220	185350	.	-	2	gene_id "ENSG00000279457"; transcript_id "ENST...

# Filter for selected chromosomes:
genes = bioframe.overlap(genes, chromosomes).dropna(subset=["name_"])[genes.columns]

Assign each peak to the gene

Setup

Default closests

Here, we want to assign each peak (feature) to a gene (input table).

peaks_closest = bioframe.closest(genes, ctcf_peaks)

# Plot the distribution of distances from peaks to genes:
plt.hist(peaks_closest["distance"], np.arange(0, 1e3, 10))
plt.xlim([0, 1e3])

(0.0, 1000.0)

../_images/4e70fc7e652c1f2ffb8b578424fdc9715befd66656d63813a12f162c32d6d3b4.png

Ignore upstream/downstream peaks from genes (strand-indifferent version)

Sometimes you may want to ignore all the CTCFs upstream from the genes.

By default, bioframe.overlap does not know the orintation of the genes, and thus assumes that the upstream/downstream is defined by the genomic coordinate (upstream is the direction towards the smaller coordinate):

Closests with ignoring

peaks_closest_upstream_nodir = bioframe.closest(
    genes,
    ctcf_peaks,
    ignore_overlaps=False,
    ignore_upstream=False,
    ignore_downstream=True,
    direction_col=None,
)

peaks_closest_downstream_nodir = bioframe.closest(
    genes,
    ctcf_peaks,
    ignore_overlaps=False,
    ignore_upstream=True,
    ignore_downstream=False,
    direction_col=None,
)

Note that distribution did not change much, and upstream and downstream distances are very similar:

plt.hist(
    peaks_closest_upstream_nodir["distance"],
    np.arange(0, 1e4, 100),
    alpha=0.5,
    label="upstream",
)
plt.hist(
    peaks_closest_downstream_nodir["distance"],
    np.arange(0, 1e4, 100),
    alpha=0.5,
    label="downstream",
)
plt.xlim([0, 1e4])
plt.legend()

<matplotlib.legend.Legend at 0x7f33a44537f0>

../_images/8048722794eae30e224b9ed9d9f88a2c801860f9f54a1abe5f4eb506c4555774.png

Ignore upstream/downstream peaks from genes (strand-aware version)

More biologically relevant approach will be to define upstream/downstream by strand of the gene. CTCF upstream of transcription start site might play different role than CTCF after transcription end site.

bioframe.closest has the parameter direction_col to control for that:

Closests with smart ignoring

# Note that "strand" here is the column name in genes table:
peaks_closest_upstream_dir = bioframe.closest(
    genes,
    ctcf_peaks,
    ignore_overlaps=False,
    ignore_upstream=False,
    ignore_downstream=True,
    direction_col="strand",
)

peaks_closest_downstream_dir = bioframe.closest(
    genes,
    ctcf_peaks,
    ignore_overlaps=False,
    ignore_upstream=True,
    ignore_downstream=False,
    direction_col="strand",
)

plt.hist(
    peaks_closest_upstream_dir["distance"],
    np.arange(0, 1e4, 100),
    alpha=0.5,
    label="upstream",
)
plt.hist(
    peaks_closest_downstream_dir["distance"],
    np.arange(0, 1e4, 100),
    alpha=0.5,
    label="downstream",
)
plt.xlim([0, 1e4])
plt.legend()

<matplotlib.legend.Legend at 0x7f338e3e5660>

../_images/56641e08142ee6135a3bc2f1cfd638df61c89a5990e0c6da1abe4aba0116e7e3.png

CTCF peaks upstream of the genes are more enriched at short distances to TSS, if we take the strand into account.