How to: assign TF Motifs to ChIP-seq peaks

This tutorial demonstrates one way to assign CTCF motifs to CTCF ChIP-seq peaks using bioframe.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import bioframe

base_dir = "/tmp/bioframe_tutorial_data/"
assembly = "GRCh38"

Load CTCF ChIP-seq peaks for HFF from ENCODE

This approach makes use of the narrowPeak schema for bioframe.read_table .

ctcf_peaks = bioframe.read_table(
    "https://www.encodeproject.org/files/ENCFF401MQL/@@download/ENCFF401MQL.bed.gz",
    schema="narrowPeak",
)
ctcf_peaks[0:5]

	chrom	start	end	name	score	strand	fc	-log10p	-log10q	relSummit
0	chr19	48309541	48309911	.	1000	.	5.04924	-1.0	0.00438	185
1	chr4	130563716	130564086	.	993	.	5.05052	-1.0	0.00432	185
2	chr1	200622507	200622877	.	591	.	5.05489	-1.0	0.00400	185
3	chr5	112848447	112848817	.	869	.	5.05841	-1.0	0.00441	185
4	chr1	145960616	145960986	.	575	.	5.05955	-1.0	0.00439	185

Get CTCF motifs from JASPAR

### CTCF motif: http://jaspar.genereg.net/matrix/MA0139.1/
jaspar_url = "http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2022/hg38/"
jaspar_motif_file = "MA0139.1.tsv.gz"
ctcf_motifs = bioframe.read_table(
    jaspar_url + jaspar_motif_file, schema="jaspar", skiprows=1
)
ctcf_motifs[0:4]

	chrom	start	end	name	score	pval	strand
0	chr1	11163	11182	CTCF	811	406	-
1	chr1	11222	11241	CTCF	959	804	-
2	chr1	11280	11299	CTCF	939	728	-
3	chr1	11339	11358	CTCF	837	455	-

Overlap peaks & motifs

df_peaks_motifs = bioframe.overlap(
    ctcf_peaks, ctcf_motifs, suffixes=("_1", "_2"), return_index=True
)

There are often multiple motifs overlapping one ChIP-seq peak, and a substantial number of peaks without motifs:

# note that counting motifs per peak can also be handled directly with
# bioframe.count_overlaps but since we re-use df_peaks_motifs below we
# instead use the pandas operations directly
motifs_per_peak = df_peaks_motifs.groupby(["index_1"])["index_2"].count().values

plt.hist(motifs_per_peak, np.arange(0, np.max(motifs_per_peak)))
plt.xlabel("number of overlapping motifs per peak")
plt.ylabel("number of peaks")
plt.semilogy()

print(
    f"fraction of peaks without motifs "
    f"{np.round(np.sum(motifs_per_peak==0)/len(motifs_per_peak), 2)}"
)

fraction of peaks without motifs 0.14

../_images/538d13eb9549896a6ccfbc7c0580a85be279970281b8a1ba1a4395e168de18ba.png

Assign the strongest motif to each peak

# since idxmax does not currently take NA, fill with -1
df_peaks_motifs["pval_2"] = df_peaks_motifs["pval_2"].fillna(-1)
idxmax_peaks_motifs = (
    df_peaks_motifs.groupby(["chrom_1", "start_1", "end_1"])["pval_2"].idxmax().values
)
df_peaks_maxmotif = df_peaks_motifs.loc[idxmax_peaks_motifs]
df_peaks_maxmotif["pval_2"].replace(-1, np.nan, inplace=True)

/tmp/ipykernel_750/2373793568.py:7: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.

  df_peaks_maxmotif["pval_2"].replace(-1, np.nan, inplace=True)

stronger peaks tend to have stronger motifs:

plt.rcParams["font.size"] = 12
df_peaks_maxmotif["fc_1"] = df_peaks_maxmotif["fc_1"].values.astype("float")
plt.scatter(
    df_peaks_maxmotif["fc_1"].values,
    df_peaks_maxmotif["pval_2"].values,
    5,
    alpha=0.5,
    lw=0,
)
plt.xlabel("ENCODE CTCF peak strength, fc")
plt.ylabel("JASPAR CTCF motif strength \n (-log10 pval *100)")
plt.title(
    "corr: "
    + str(np.round(df_peaks_maxmotif["fc_1"].corr(df_peaks_maxmotif["pval_2"]), 2))
);

../_images/d90ae8cb95f0cf88a6b8e05ffa7a3327f279b87ac5f30125d0640582af2ed976.png

We can also ask the reverse question: how many motifs overlap a ChIP-seq peak?

df_motifs_peaks = bioframe.overlap(
    ctcf_motifs, ctcf_peaks, how="left", suffixes=("_1", "_2")
)

m = df_motifs_peaks.sort_values("pval_1")
plt.plot(
    m["pval_1"].values[::-1],
    np.cumsum(pd.isnull(m["chrom_2"].values[::-1]) == 0) / np.arange(1, len(m) + 1),
)
plt.xlabel("pval")
plt.ylabel("probability motif overlaps a peak");

../_images/81ae2c05f8e25cca1dd4a32a9dacfe0a76054fbf43b1f51bb006460299c2f771.png

filter peaks overlapping blacklisted regions

do any of our peaks overlap blacklisted genomic regions?

blacklist = bioframe.read_table(
    "https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz",
    schema="bed3",
)
blacklist[0:3]

	chrom	start	end
0	chr1	628903	635104
1	chr1	5850087	5850571
2	chr1	8909610	8910014

there appears to be a small spike in the number of peaks close to blacklist regions

closest_to_blacklist = bioframe.closest(ctcf_peaks, blacklist)
plt.hist(
    closest_to_blacklist["distance"].astype("Float64").astype("float"),
    np.arange(0, 1e4, 100),
);

../_images/51a3dd9638838daed27a28d89d4bd66c6e902e20482a480262864c4853c6449b.png

to be safe, let’s remove anything +/- 1kb from a blacklisted region

# first let's select the columns we want for our final dataframe of peaks
# with motifs
df_peaks_maxmotif = df_peaks_maxmotif[
    [
        "chrom_1",
        "start_1",
        "end_1",
        "fc_1",
        "chrom_2",
        "start_2",
        "end_2",
        "pval_2",
        "strand_2",
    ]
]
#  then rename columns for convenience when subtracting
for i in df_peaks_maxmotif.keys():
    if "_1" in i:
        df_peaks_maxmotif.rename(columns={i: i.split("_")[0]}, inplace=True)

# now subtract, expanding the blacklist by 1kb
df_peaks_maxmotif_clean = bioframe.subtract(
    df_peaks_maxmotif, bioframe.expand(blacklist, 1000)
)

there it is! we now have a dataframe containing positions of CTCF ChIP peaks, including the strongest motif underlying that peak, and after conservative filtering for proximity to blacklisted regions

df_peaks_maxmotif_clean.iloc[7:15]

	chrom	start	end	fc	chrom_2	start_2	end_2	pval_2	strand_2
7	chr9	124777413	124777783	5.06479	chr9	124777400	124777419	450.0	+
8	chr1	67701045	67701415	5.06708	None	<NA>	<NA>	NaN	None
9	chr10	119859586	119859956	5.08015	chr10	119859591	119859610	611.0	-
10	chr3	66816327	66816697	5.08233	chr3	66816332	66816351	741.0	-
11	chr16	50248791	50249161	5.08249	None	<NA>	<NA>	NaN	None
12	chr19	41431677	41432047	5.11060	chr19	41431802	41431821	477.0	+
13	chr4	131644839	131645209	5.11204	None	<NA>	<NA>	NaN	None
14	chr2	203239519	203239889	5.11817	None	<NA>	<NA>	NaN	None