Interval operations

Functions:

`assign_view`(df, view_df[, drop_unassigned, ...])	Associates genomic intervals in bedframe `df` with regions in viewframe `view_df`, based on their largest overlap.
`closest`(df1[, df2, k, ignore_overlaps, ...])	For every interval in dataframe df1 find k closest genomic intervals in dataframe df2.
`cluster`(df[, min_dist, cols, on, ...])	Cluster overlapping intervals into groups.
`complement`(df[, view_df, view_name_col, ...])	Find genomic regions in a viewFrame 'view_df' that are not covered by any interval in the dataFrame 'df'.
`count_overlaps`(df1, df2[, suffixes, ...])	Count number of overlapping genomic intervals.
`coverage`(df1, df2[, suffixes, return_input, ...])	Quantify the coverage of intervals from 'df1' by intervals from 'df2'.
`expand`(df[, pad, scale, side, cols])	Expand each interval by an amount specified with pad.
`merge`(df[, min_dist, cols, on])	Merge overlapping intervals.
`overlap`(df1, df2[, how, return_input, ...])	Find pairs of overlapping genomic intervals.
`select`(df, region[, cols])	Return all genomic intervals in a dataframe that overlap a genomic region.
`select_indices`(df, region[, cols])	Return integer indices of all genomic intervals that overlap a query range.
`select_labels`(df, region[, cols])	Return pandas Index labels of all genomic intervals that overlap a query range.
`select_mask`(df, region[, cols])	Return boolean mask for all genomic intervals that overlap a query range.
`setdiff`(df1, df2[, cols1, cols2, on])	Generate a new dataframe of genomic intervals by removing any interval from the first dataframe that overlaps an interval from the second dataframe.
`sort_bedframe`(df[, view_df, reset_index, ...])	Sorts a bedframe 'df'.
`subtract`(df1, df2[, return_index, suffixes, ...])	Generate a new set of genomic intervals by subtracting the second set of genomic intervals from the first.
`trim`(df[, view_df, df_view_col, ...])	Trim each interval to fall within regions specified in the viewframe 'view_df'.

assign_view(df, view_df, drop_unassigned=False, df_view_col='view_region', view_name_col='name', cols=None, cols_view=None)[source]

Associates genomic intervals in bedframe df with regions in viewframe view_df, based on their largest overlap.

Parameters:

df (pandas.DataFrame)
view_df (pandas.DataFrame) – ViewFrame specifying region start and ends for assignment. Attempts to convert dictionary and pd.Series formats to viewFrames.
drop_unassigned (bool) – If True, drop intervals in df that do not overlap a region in the view. Default False.
df_view_col (str) – The column of df used to specify view regions. The associated region in view_df is then used for trimming. If no view_df is provided, uses the chrom column, df[cols[0]]. Default “view_region”.
view_name_col (str) – Column of view_df with region names. Default ‘name’.
cols ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.
cols_view ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals in the view. The default values are ‘chrom’, ‘start’, ‘end’.

Returns:

out_df (dataframe with an associated view region for each interval in)
out_df[view_name_col].

Notes

Resets index.

closest(df1, df2=None, k=1, ignore_overlaps=False, ignore_upstream=False, ignore_downstream=False, direction_col=None, tie_breaking_col=None, return_input=True, return_index=False, return_distance=True, return_overlap=False, suffixes=('', '_'), cols1=None, cols2=None)[source]

For every interval in dataframe df1 find k closest genomic intervals in dataframe df2.

Currently, we are not taking the feature strands into account for filtering. However, the strand can be used for definition of upstream/downstream of the feature (direction).

Note that, unless specified otherwise, overlapping intervals are considered as closest. When multiple intervals are located at the same distance, the ones with the lowest index in df2 are returned.

Parameters:

df1 (pandas.DataFrame) – Two sets of genomic intervals stored as a DataFrame. If df2 is None, find closest non-identical intervals within the same set.
df2 (pandas.DataFrame) – Two sets of genomic intervals stored as a DataFrame. If df2 is None, find closest non-identical intervals within the same set.
k (int) – The number of the closest intervals to report.
ignore_overlaps (bool) – If True, ignore overlapping intervals and return the closest non-overlapping interval.
ignore_upstream (bool) – If True, ignore intervals in df2 that are upstream of intervals in df1, relative to the reference strand or the strand specified by direction_col.
ignore_downstream (bool) – If True, ignore intervals in df2 that are downstream of intervals in df1, relative to the reference strand or the strand specified by direction_col.
direction_col (str) – Name of direction column that will set upstream/downstream orientation for each feature. The column should contain bioframe-compliant strand values (“+”, “-”, “.”).
tie_breaking_col (str) – A column in df2 to use for breaking ties when multiple intervals are located at the same distance. Intervals with lower values will be selected.
return_input (bool) – If True, return input
return_index (bool) – If True, return indices
return_distance (bool) – If True, return distances. Returns zero for overlaps.
return_overlap (bool) – If True, return columns: ‘have_overlap’, ‘overlap_start’, and ‘overlap_end’. Fills df_closest[‘overlap_start’] and df[‘overlap_end’] with None if non-overlapping. Default False.
suffixes ((str, str)) – The suffixes for the columns of the two sets.
cols1 ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. The default values are ‘chrom’, ‘start’, ‘end’.
cols2 ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. The default values are ‘chrom’, ‘start’, ‘end’.

Returns:

df_closest – If no intervals found, returns none.

Return type:

pandas.DataFrame

Notes

By default, direction is defined by the reference genome: everything with smaller coordinate is considered upstream, everything with larger coordinate is considered downstream.

If direction_col is provided, upstream/downstream are relative to the direction column in df1, i.e. features marked “+” and “.” strand will define upstream and downstream as above, while features marked “-” have upstream and downstream reversed: smaller coordinates are downstream and larger coordinates are upstream.

cluster(df, min_dist=0, cols=None, on=None, return_input=True, return_cluster_ids=True, return_cluster_intervals=True)[source]

Cluster overlapping intervals into groups.

Can return numeric ids for these groups (with return_cluster_ids`=True) and/or their genomic coordinates (with `return_cluster_intervals`=True). Also see :func:`merge(), which discards original intervals and returns a new set.

Parameters:

df (pandas.DataFrame)
min_dist (float or None) – If provided, cluster intervals separated by this distance or less. If None, do not cluster non-overlapping intervals. Since bioframe uses semi-open intervals, interval pairs [0,1) and [1,2) do not overlap, but are separated by a distance of 0. Such adjacent intervals are not clustered when min_dist=None, but are clustered when min_dist=0.
cols ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.
on (None or list) – List of column names to perform clustering on independently, passed as an argument to df.groupby before clustering. Default is None. An example useage would be to pass on=['strand'].
return_input (bool) – If True, return input
return_cluster_ids (bool) – If True, return ids for clusters
return_cluster_invervals (bool) – If True, return clustered interval the original interval belongs to

Returns:

df_clustered

Return type:

pd.DataFrame

complement(df, view_df=None, view_name_col='name', cols=None, cols_view=None)[source]

Find genomic regions in a viewFrame ‘view_df’ that are not covered by any interval in the dataFrame ‘df’.

First assigns intervals in ‘df’ to region in ‘view_df’, splitting intervals in ‘df’ as necessary.

Parameters:

df (pandas.DataFrame)
view_df (pandas.Dataframe) – If none, attempts to infer the view from chroms (i.e. df[cols[0]]).
view_name_col (str) – Name of column in view_df with unique reigon names. Default ‘name’.
cols ((str, str, str)) – The names of columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.
cols_view ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals in the view. The default values are ‘chrom’, ‘start’, ‘end’.

Returns:

df_complement

Return type:

pandas.DataFrame

Notes

Discards null intervals in input, and df_complement has regular int dtype.

count_overlaps(df1, df2, suffixes=('', '_'), return_input=True, cols1=None, cols2=None, on=None)[source]

Count number of overlapping genomic intervals.

Parameters:

df1 (pandas.DataFrame) – Two sets of genomic intervals stored as a DataFrame.
df2 (pandas.DataFrame) – Two sets of genomic intervals stored as a DataFrame.
suffixes ((str, str)) – The suffixes for the columns of the two overlapped sets.
return_input (bool) – If True, return columns from input dfs. Default True.
cols1 ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. The default values are ‘chrom’, ‘start’, ‘end’.
cols2 ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. The default values are ‘chrom’, ‘start’, ‘end’.
on (list) – List of additional shared columns to consider as separate groups when considering overlaps. A common use would be passing on=[‘strand’]. Default is None.

Returns:

df_counts

Return type:

pandas.DataFrame

Notes

Resets index.

coverage(df1, df2, suffixes=('', '_'), return_input=True, cols1=None, cols2=None)[source]

Quantify the coverage of intervals from ‘df1’ by intervals from ‘df2’.

For every interval in ‘df1’ find the number of base pairs covered by intervals in ‘df2’. Note this only quantifies whether a basepair in ‘df1’ was covered, as ‘df2’ is merged before calculating coverage.

Parameters:

df1 (pandas.DataFrame) – Two sets of genomic intervals stored as a DataFrame.
df2 (pandas.DataFrame) – Two sets of genomic intervals stored as a DataFrame.
suffixes ((str, str)) – The suffixes for the columns of the two overlapped sets.
return_input (bool) – If True, return input as well as computed coverage
cols1 ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. The default values are ‘chrom’, ‘start’, ‘end’.
cols2 ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. The default values are ‘chrom’, ‘start’, ‘end’.

Returns:

df_coverage

Return type:

pandas.DataFrame

Notes

Resets index.

expand(df, pad=None, scale=None, side='both', cols=None)[source]

Expand each interval by an amount specified with pad.

Negative values for pad shrink the interval, up to the midpoint. Multiplicative rescaling of intervals enabled with scale. Only one of pad or scale can be provided. Often followed by trim().

Parameters:

df (pandas.DataFrame)
pad (int, optional) – The amount by which the intervals are additively expanded on each side. Negative values for pad shrink intervals, but not beyond the interval midpoint. Either pad or scale must be supplied.
scale (float, optional) – The factor by which to scale intervals multiplicatively on each side, e.g scale=2 doubles each interval, scale=0 returns midpoints, and scale=1 returns original intervals. Default False. Either pad or scale must be supplied.
side (str, optional) – Which side to expand, possible values are ‘left’, ‘right’ and ‘both’. Default ‘both’.
cols ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals. Default values are ‘chrom’, ‘start’, ‘end’.

Returns:

df_expanded

Return type:

pandas.DataFrame

Notes

See bioframe.trim() for trimming interals after expansion.

merge(df, min_dist=0, cols=None, on=None)[source]

Merge overlapping intervals.

This returns a new dataframe of genomic intervals, which have the genomic coordinates of the interval cluster groups from the input dataframe. Also cluster(), which returns the assignment of intervals to clusters prior to merging.

Parameters:

df (pandas.DataFrame)
min_dist (float or None) – If provided, merge intervals separated by this distance or less. If None, do not merge non-overlapping intervals. Using min_dist=0 and min_dist=None will bring different results. bioframe uses semi-open intervals, so interval pairs [0,1) and [1,2) do not overlap, but are separated by a distance of 0. Adjacent intervals are not merged when min_dist=None, but are merged when min_dist=0.
cols ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.
on (None or list) – List of column names to perform clustering on independently, passed as an argument to df.groupby before clustering. Default is None. An example useage would be to pass on=['strand'].

Returns:

df_merged – A pandas dataframe with coordinates of merged clusters.

Return type:

pandas.DataFrame

Notes

Resets index.

overlap(df1, df2, how='left', return_input=True, return_index=False, return_overlap=False, suffixes=('', '_'), keep_order=None, cols1=None, cols2=None, on=None, ensure_int=True)[source]

Find pairs of overlapping genomic intervals.

Parameters:

df1 (pandas.DataFrame) – Two sets of genomic intervals stored as a DataFrame.
df2 (pandas.DataFrame) – Two sets of genomic intervals stored as a DataFrame.
how ({'left', 'right', 'outer', 'inner'}, default 'left') – How to handle the overlaps on the two dataframes. left: use the set of intervals in df1 right: use the set of intervals in df2 outer: use the union of the set of intervals from df1 and df2 inner: use intersection of the set of intervals from df1 and df2
return_input (bool, optional) – If True, return columns from input dfs. Default True.
return_index (bool, optional) – If True, return indicies of overlapping pairs as two new columns (‘index’+suffixes[0] and ‘index’+suffixes[1]). Default False.
return_overlap (bool, optional) – If True, return overlapping intervals for the overlapping pairs as two additional columns (overlap_start, overlap_end). When cols1 is modified, start and end are replaced accordingly. When return_overlap is a string, its value is used for naming the overlap columns: return_overlap + “_start”, return_overlap + “_end”. Default False.
suffixes ((str, str), optional) – The suffixes for the columns of the two overlapped sets.
keep_order (bool, optional) – If True and how=’left’, sort the output dataframe to preserve the order of the intervals in df1. Cannot be used with how=’right’/’outer’/’inner’. Default True for how=’left’, and None otherwise. Note that it relies on sorting of index in the original dataframes, and will reorder the output by index.
cols1 ((str, str, str) or None, optional) – The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. The default values are ‘chrom’, ‘start’, ‘end’.
cols2 ((str, str, str) or None, optional) – The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. The default values are ‘chrom’, ‘start’, ‘end’.
on (list or None, optional) – List of additional shared columns to consider as separate groups when considering overlaps. A common use would be passing on=[‘strand’]. Default is None.
ensure_int (bool, optional [default: True]) – If True, ensures that the output dataframe uses integer dtypes for start and end coordinates. This may involve converting coordinate columns to nullable types in outer joins. Default True.

Returns:

df_overlap

Return type:

pandas.DataFrame

Notes

If ensure_int is False, inner joins will preserve coordinate dtypes from the input dataframes, but outer joins will be subject to native type casting rules if missing data is introduced. For example, if df1 uses a NumPy integer dtype for start and/or end, the output dataframe will use the same dtype after an inner join, but, due to casting rules, may produce float64 after a left/right/outer join with missing data stored as NaN. On the other hand, if df1 uses Pandas nullable dtypes, the corresponding coordinate columns will preserve the same dtype in the output, with missing data stored as NA.

select(df, region, cols=None)[source]

Return all genomic intervals in a dataframe that overlap a genomic region.

Parameters:

df (pandas.DataFrame)
region (str or tuple) – The genomic region to select from the dataframe in UCSC-style genomic region string, or triple (chrom, start, end).
cols ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.

Returns:

df

Return type:

pandas.DataFrame

Notes

See core.stringops.parse_region() for more information on region formatting.