How to: assign TF Motifs to ChIP-seq peaks

This tutorial demonstrates one way to assign CTCF motifs to CTCF ChIP-seq peaks using bioframe.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import bioframe
base_dir = "/tmp/bioframe_tutorial_data/"
assembly = "GRCh38"

Load CTCF ChIP-seq peaks for HFF from ENCODE

This approach makes use of the narrowPeak schema for bioframe.read_table .

ctcf_peaks = bioframe.read_table(
    "https://www.encodeproject.org/files/ENCFF401MQL/@@download/ENCFF401MQL.bed.gz",
    schema="narrowPeak",
)
ctcf_peaks[0:5]
chrom start end name score strand fc -log10p -log10q relSummit
0 chr19 48309541 48309911 . 1000 . 5.04924 -1.0 0.00438 185
1 chr4 130563716 130564086 . 993 . 5.05052 -1.0 0.00432 185
2 chr1 200622507 200622877 . 591 . 5.05489 -1.0 0.00400 185
3 chr5 112848447 112848817 . 869 . 5.05841 -1.0 0.00441 185
4 chr1 145960616 145960986 . 575 . 5.05955 -1.0 0.00439 185

Get CTCF motifs from JASPAR

### CTCF motif: http://jaspar.genereg.net/matrix/MA0139.1/
jaspar_url = "http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2022/hg38/"
jaspar_motif_file = "MA0139.1.tsv.gz"
ctcf_motifs = bioframe.read_table(
    jaspar_url + jaspar_motif_file, schema="jaspar", skiprows=1
)
ctcf_motifs[0:4]
chrom start end name score pval strand
0 chr1 11163 11182 CTCF 811 406 -
1 chr1 11222 11241 CTCF 959 804 -
2 chr1 11280 11299 CTCF 939 728 -
3 chr1 11339 11358 CTCF 837 455 -

Overlap peaks & motifs

df_peaks_motifs = bioframe.overlap(
    ctcf_peaks, ctcf_motifs, suffixes=("_1", "_2"), return_index=True
)

There are often multiple motifs overlapping one ChIP-seq peak, and a substantial number of peaks without motifs:

# note that counting motifs per peak can also be handled directly with
# bioframe.count_overlaps but since we re-use df_peaks_motifs below we
# instead use the pandas operations directly
motifs_per_peak = df_peaks_motifs.groupby(["index_1"])["index_2"].count().values

plt.hist(motifs_per_peak, np.arange(0, np.max(motifs_per_peak)))
plt.xlabel("number of overlapping motifs per peak")
plt.ylabel("number of peaks")
plt.semilogy()

print(
    f"fraction of peaks without motifs "
    f"{np.round(np.sum(motifs_per_peak==0)/len(motifs_per_peak), 2)}"
)
fraction of peaks without motifs 0.14
../_images/538d13eb9549896a6ccfbc7c0580a85be279970281b8a1ba1a4395e168de18ba.png

Assign the strongest motif to each peak

# since idxmax does not currently take NA, fill with -1
df_peaks_motifs["pval_2"] = df_peaks_motifs["pval_2"].fillna(-1)
idxmax_peaks_motifs = (
    df_peaks_motifs.groupby(["chrom_1", "start_1", "end_1"])["pval_2"].idxmax().values
)
df_peaks_maxmotif = df_peaks_motifs.loc[idxmax_peaks_motifs]
df_peaks_maxmotif["pval_2"].replace(-1, np.nan, inplace=True)
/tmp/ipykernel_750/2373793568.py:7: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_peaks_maxmotif["pval_2"].replace(-1, np.nan, inplace=True)

stronger peaks tend to have stronger motifs:

plt.rcParams["font.size"] = 12
df_peaks_maxmotif["fc_1"] = df_peaks_maxmotif["fc_1"].values.astype("float")
plt.scatter(
    df_peaks_maxmotif["fc_1"].values,
    df_peaks_maxmotif["pval_2"].values,
    5,
    alpha=0.5,
    lw=0,
)
plt.xlabel("ENCODE CTCF peak strength, fc")
plt.ylabel("JASPAR CTCF motif strength \n (-log10 pval *100)")
plt.title(
    "corr: "
    + str(np.round(df_peaks_maxmotif["fc_1"].corr(df_peaks_maxmotif["pval_2"]), 2))
);
../_images/d90ae8cb95f0cf88a6b8e05ffa7a3327f279b87ac5f30125d0640582af2ed976.png

We can also ask the reverse question: how many motifs overlap a ChIP-seq peak?

df_motifs_peaks = bioframe.overlap(
    ctcf_motifs, ctcf_peaks, how="left", suffixes=("_1", "_2")
)
m = df_motifs_peaks.sort_values("pval_1")
plt.plot(
    m["pval_1"].values[::-1],
    np.cumsum(pd.isnull(m["chrom_2"].values[::-1]) == 0) / np.arange(1, len(m) + 1),
)
plt.xlabel("pval")
plt.ylabel("probability motif overlaps a peak");
../_images/81ae2c05f8e25cca1dd4a32a9dacfe0a76054fbf43b1f51bb006460299c2f771.png

filter peaks overlapping blacklisted regions

do any of our peaks overlap blacklisted genomic regions?

blacklist = bioframe.read_table(
    "https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz",
    schema="bed3",
)
blacklist[0:3]
chrom start end
0 chr1 628903 635104
1 chr1 5850087 5850571
2 chr1 8909610 8910014

there appears to be a small spike in the number of peaks close to blacklist regions

closest_to_blacklist = bioframe.closest(ctcf_peaks, blacklist)
plt.hist(
    closest_to_blacklist["distance"].astype("Float64").astype("float"),
    np.arange(0, 1e4, 100),
);
../_images/51a3dd9638838daed27a28d89d4bd66c6e902e20482a480262864c4853c6449b.png

to be safe, let’s remove anything +/- 1kb from a blacklisted region

# first let's select the columns we want for our final dataframe of peaks
# with motifs
df_peaks_maxmotif = df_peaks_maxmotif[
    [
        "chrom_1",
        "start_1",
        "end_1",
        "fc_1",
        "chrom_2",
        "start_2",
        "end_2",
        "pval_2",
        "strand_2",
    ]
]
#  then rename columns for convenience when subtracting
for i in df_peaks_maxmotif.keys():
    if "_1" in i:
        df_peaks_maxmotif.rename(columns={i: i.split("_")[0]}, inplace=True)

# now subtract, expanding the blacklist by 1kb
df_peaks_maxmotif_clean = bioframe.subtract(
    df_peaks_maxmotif, bioframe.expand(blacklist, 1000)
)

there it is! we now have a dataframe containing positions of CTCF ChIP peaks, including the strongest motif underlying that peak, and after conservative filtering for proximity to blacklisted regions

df_peaks_maxmotif_clean.iloc[7:15]
chrom start end fc chrom_2 start_2 end_2 pval_2 strand_2
7 chr9 124777413 124777783 5.06479 chr9 124777400 124777419 450.0 +
8 chr1 67701045 67701415 5.06708 None <NA> <NA> NaN None
9 chr10 119859586 119859956 5.08015 chr10 119859591 119859610 611.0 -
10 chr3 66816327 66816697 5.08233 chr3 66816332 66816351 741.0 -
11 chr16 50248791 50249161 5.08249 None <NA> <NA> NaN None
12 chr19 41431677 41432047 5.11060 chr19 41431802 41431821 477.0 +
13 chr4 131644839 131645209 5.11204 None <NA> <NA> NaN None
14 chr2 203239519 203239889 5.11817 None <NA> <NA> NaN None