Array operations

Low-level operations used to implement the genomic interval operations.

Functions:

arange_multi(starts[, stops, lengths])

Create concatenated ranges of integers for multiple start/length.

closest_intervals(starts1, ends1[, starts2, ...])

For every interval in set 1, return the indices of k closest intervals from set 2.

interweave(a, b)

Interweave two arrays.

merge_intervals(starts, ends[, min_dist])

Merge overlapping intervals.

overlap_intervals(starts1, ends1, starts2, ends2)

Take two sets of intervals and return the indices of pairs of overlapping intervals.

overlap_intervals_outer(starts1, ends1, ...)

Take two sets of intervals and return the indices of pairs of overlapping intervals, as well as the indices of the intervals that do not overlap any other interval.

sum_slices(arr, starts, ends)

Calculate sums of slices of an array.

arange_multi(starts, stops=None, lengths=None)

Create concatenated ranges of integers for multiple start/length.

Parameters:
  • starts (numpy.ndarray) – Starts for each range

  • stops (numpy.ndarray) – Stops for each range

  • lengths (numpy.ndarray) – Lengths for each range. Either stops or lengths must be provided.

Returns:

concat_ranges – Concatenated ranges.

Return type:

numpy.ndarray

Notes

See the following illustrative example:

import numpy as np

starts = np.array([1, 3, 4, 6])
stops = np.array([1, 5, 7, 6])

print(arange_multi(starts, stops))  # [3 4 4 5 6]

From: https://codereview.stackexchange.com/questions/83018/vectorized-numpy-version-of-arange-with-multiple-start-stop
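The concatenation can be sketched in plain NumPy without a Python loop. This is an illustrative reimplementation, not the library's source: the name `arange_multi_sketch` is hypothetical, and the check that exactly one of stops or lengths is given is an assumption about how "either ... must be provided" is enforced.

```python
import numpy as np

def arange_multi_sketch(starts, stops=None, lengths=None):
    """Concatenate the ranges [start, start + length) for every start."""
    starts = np.asarray(starts)
    # Assumption: exactly one of `stops` / `lengths` must be given.
    if (stops is None) == (lengths is None):
        raise ValueError("provide exactly one of stops or lengths")
    if lengths is None:
        lengths = np.asarray(stops) - starts
    lengths = np.asarray(lengths)

    # Position in the output where each range begins.
    offsets = np.cumsum(lengths) - lengths
    # 0, 1, ..., length-1 within each range.
    within = np.arange(int(lengths.sum())) - np.repeat(offsets, lengths)
    return np.repeat(starts, lengths) + within

starts = np.array([1, 3, 4, 6])
stops = np.array([1, 5, 7, 6])
result = arange_multi_sketch(starts, stops)
```

Ranges with zero length (here the first and last) contribute nothing to the output, which is why only the values from [3, 5) and [4, 7) appear.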

closest_intervals(starts1, ends1, starts2=None, ends2=None, k=1, tie_arr=None, ignore_overlaps=False, ignore_upstream=False, ignore_downstream=False, direction=None)

For every interval in set 1, return the indices of k closest intervals from set 2.

Parameters:
  • starts1 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored. If starts2 and ends2 are None, find closest intervals within the same set.

  • ends1 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored. If starts2 and ends2 are None, find closest intervals within the same set.

  • starts2 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored. If starts2 and ends2 are None, find closest intervals within the same set.

  • ends2 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored. If starts2 and ends2 are None, find closest intervals within the same set.

  • k (int) – The number of neighbors to report.

  • tie_arr (numpy.ndarray or None) – Extra data describing intervals in set 2 to break ties when multiple intervals are located at the same distance. Intervals with lower tie_arr values will be given priority.

  • ignore_overlaps (bool) – If True, ignore set 2 intervals that overlap with set 1 intervals.

  • ignore_upstream (bool) – If True, ignore set 2 intervals upstream of set 1 intervals.

  • ignore_downstream (bool) – If True, ignore set 2 intervals downstream of set 1 intervals.

  • direction (numpy.ndarray with dtype bool or None) – Strand vector to define the upstream/downstream orientation of the intervals.

Returns:

closest_ids – An Nx2 array containing the indices of pairs of closest intervals. The 1st column contains ids from the 1st set, the 2nd column has ids from the 2nd set.

Return type:

numpy.ndarray
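To make the shape of the returned array concrete, here is a brute-force sketch of the basic case with two explicit sets. It is not the library's algorithm: `closest_brute_force` is a hypothetical name, the distance is assumed to be the gap between semi-open intervals (0 for overlapping or touching intervals), ties are broken by index order, and the tie_arr, ignore_* and direction options are left out.

```python
import numpy as np

def closest_brute_force(starts1, ends1, starts2, ends2, k=1):
    """Return an Nx2 array of (set 1 index, set 2 index) pairs."""
    starts1, ends1 = np.asarray(starts1), np.asarray(ends1)
    starts2, ends2 = np.asarray(starts2), np.asarray(ends2)

    # Pairwise gap: positive when the intervals are apart, <= 0 otherwise.
    gaps = np.maximum(
        starts2[None, :] - ends1[:, None],  # set 2 interval lies to the right
        starts1[:, None] - ends2[None, :],  # set 2 interval lies to the left
    )
    dists = np.maximum(gaps, 0)

    # For every set 1 interval, take the k nearest set 2 intervals.
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]
    ids1 = np.repeat(np.arange(len(starts1)), nearest.shape[1])
    return np.column_stack([ids1, nearest.ravel()])

closest_ids = closest_brute_force(
    starts1=[0, 10], ends1=[2, 12],
    starts2=[3, 20, 11], ends2=[5, 25, 13],
)
```

The pairwise distance matrix costs O(N·M) memory, so this form is only suitable for checking results on small inputs.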

interweave(a, b)

Interweave two arrays.

Parameters:
  • a (numpy.ndarray) – Array to interweave; a and b must have the same length.

  • b (numpy.ndarray) – Array to interweave; a and b must have the same length.

Returns:

out – Array of interweaved values from a and b.

Return type:

numpy.ndarray

Notes

From https://stackoverflow.com/questions/5347065/interweaving-two-numpy-arrays
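Interweaving can be sketched with strided assignment, as in the linked answer. The function name `interweave_sketch` is hypothetical, and the ordering (elements of a at even positions, elements of b at odd positions) is an assumption about the output layout.

```python
import numpy as np

def interweave_sketch(a, b):
    """Alternate the elements of two equal-length 1D arrays."""
    a, b = np.asarray(a), np.asarray(b)
    if a.shape != b.shape:
        raise ValueError("a and b must have the same length")
    out = np.empty(a.size + b.size, dtype=np.result_type(a, b))
    out[0::2] = a  # even positions come from a
    out[1::2] = b  # odd positions come from b
    return out

woven = interweave_sketch([1, 3, 5], [2, 4, 6])
```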

merge_intervals(starts, ends, min_dist=0)

Merge overlapping intervals.

Parameters:
  • starts (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.

  • ends (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.

  • min_dist (float or None) – If provided, merge intervals separated by this distance or less. If None, do not merge non-overlapping intervals. Note that min_dist=0 and min_dist=None give different results: bioframe uses semi-open intervals, so the intervals [0,1) and [1,2) do not overlap but are separated by a distance of 0. Such intervals are not merged when min_dist=None, but are merged when min_dist=0.

Returns:

  • cluster_ids (numpy.ndarray) – The indices of interval clusters that each interval belongs to.

  • cluster_starts (numpy.ndarray) – The starts of the merged intervals.

  • cluster_ends (numpy.ndarray) – The ends of the merged intervals.

Notes

From https://stackoverflow.com/questions/43600878/merging-overlapping-intervals/58976449#58976449
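The difference between min_dist=0 and min_dist=None is easiest to see in a small sketch. The code below is a sort-and-scan approximation under stated assumptions, not the library's implementation: `merge_sketch` is a hypothetical name, and clusters are assumed to be numbered in order of their start coordinate.

```python
import numpy as np

def merge_sketch(starts, ends, min_dist=0):
    """Cluster semi-open intervals; return (cluster_ids, cluster_starts, cluster_ends)."""
    starts, ends = np.asarray(starts), np.asarray(ends)
    if starts.size == 0:
        return np.array([], dtype=int), starts.copy(), ends.copy()

    order = np.lexsort([ends, starts])  # sort by start, then by end
    s, e = starts[order], ends[order]
    run_end = np.maximum.accumulate(e)  # furthest end seen so far

    if min_dist is None:
        # Only true overlaps merge; touching intervals start a new cluster.
        new_cluster = s[1:] >= run_end[:-1]
    else:
        # Gaps of min_dist or less are bridged.
        new_cluster = s[1:] > run_end[:-1] + min_dist

    sorted_ids = np.concatenate([[0], np.cumsum(new_cluster)])
    cluster_ids = np.empty(len(s), dtype=int)
    cluster_ids[order] = sorted_ids  # map back to the input order

    first = np.flatnonzero(np.concatenate([[True], new_cluster]))
    last = np.concatenate([first[1:] - 1, [len(s) - 1]])
    return cluster_ids, s[first], run_end[last]

starts, ends = np.array([0, 1, 5]), np.array([1, 2, 6])
ids_zero, starts_zero, ends_zero = merge_sketch(starts, ends, min_dist=0)
ids_none, starts_none, ends_none = merge_sketch(starts, ends, min_dist=None)
```

With min_dist=0 the touching intervals [0,1) and [1,2) collapse into [0,2); with min_dist=None all three intervals stay separate.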

overlap_intervals(starts1, ends1, starts2, ends2, closed=False, sort=False)

Take two sets of intervals and return the indices of pairs of overlapping intervals.

Parameters:
  • starts1 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.

  • ends1 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.

  • starts2 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.

  • ends2 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.

  • closed (bool) – If True, then treat intervals as closed and report single-point overlaps.

Returns:

overlap_ids – An Nx2 array containing the indices of pairs of overlapping intervals. The 1st column contains ids from the 1st set, the 2nd column has ids from the 2nd set.

Return type:

numpy.ndarray
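The effect of the closed flag can be illustrated with a brute-force comparison of all pairs. This is a reference sketch for small inputs, not the library's algorithm; `overlap_brute_force` is a hypothetical name, and the row order of the result (sorted by set 1 index, then set 2 index) is a property of this sketch only.

```python
import numpy as np

def overlap_brute_force(starts1, ends1, starts2, ends2, closed=False):
    """Return an Nx2 array of (set 1 index, set 2 index) overlapping pairs."""
    s1 = np.asarray(starts1)[:, None]
    e1 = np.asarray(ends1)[:, None]
    s2 = np.asarray(starts2)[None, :]
    e2 = np.asarray(ends2)[None, :]
    if closed:
        # Closed intervals: sharing a single point counts as an overlap.
        mask = (s1 <= e2) & (s2 <= e1)
    else:
        # Semi-open intervals: touching intervals do not overlap.
        mask = (s1 < e2) & (s2 < e1)
    return np.argwhere(mask)

args = ([0, 5], [5, 10], [5, 20], [7, 30])
open_pairs = overlap_brute_force(*args)
closed_pairs = overlap_brute_force(*args, closed=True)
```

Here the first interval of set 1 ends at 5, exactly where the first interval of set 2 starts, so that pair is reported only when closed=True.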

overlap_intervals_outer(starts1, ends1, starts2, ends2, closed=False)

Take two sets of intervals and return the indices of pairs of overlapping intervals, as well as the indices of the intervals that do not overlap any other interval.

Parameters:
  • starts1 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.

  • ends1 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.

  • starts2 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.

  • ends2 (numpy.ndarray) – Interval coordinates. Warning: if provided as pandas.Series, indices will be ignored.

  • closed (bool) – If True, then treat intervals as closed and report single-point overlaps.

Returns:

  • overlap_ids (numpy.ndarray) – An Nx2 array containing the indices of pairs of overlapping intervals. The 1st column contains ids from the 1st set, the 2nd column has ids from the 2nd set.

  • no_overlap_ids1, no_overlap_ids2 (numpy.ndarray) – Two 1D arrays containing the indices of intervals in sets 1 and 2 respectively that do not overlap with any interval in the other set.
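The three return values can be derived from a single pairwise overlap mask, as in the brute-force sketch below. It is an illustration for small inputs rather than the library's method, and `overlap_outer_brute_force` is a hypothetical name.

```python
import numpy as np

def overlap_outer_brute_force(starts1, ends1, starts2, ends2, closed=False):
    """Return (overlap_ids, no_overlap_ids1, no_overlap_ids2)."""
    s1 = np.asarray(starts1)[:, None]
    e1 = np.asarray(ends1)[:, None]
    s2 = np.asarray(starts2)[None, :]
    e2 = np.asarray(ends2)[None, :]
    if closed:
        mask = (s1 <= e2) & (s2 <= e1)
    else:
        mask = (s1 < e2) & (s2 < e1)

    overlap_ids = np.argwhere(mask)
    # Rows with no True entry: set 1 intervals without any partner.
    no_overlap_ids1 = np.flatnonzero(~mask.any(axis=1))
    # Columns with no True entry: set 2 intervals without any partner.
    no_overlap_ids2 = np.flatnonzero(~mask.any(axis=0))
    return overlap_ids, no_overlap_ids1, no_overlap_ids2

pairs, only1, only2 = overlap_outer_brute_force(
    [0, 5], [5, 10], [5, 20], [7, 30]
)
```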

sum_slices(arr, starts, ends)

Calculate sums of slices of an array.

Parameters:
  • arr (numpy.ndarray)

  • starts (numpy.ndarray) – Starts for each slice

  • ends (numpy.ndarray) – Ends for each slice

Returns:

sums – Sums of the slices.

Return type:

numpy.ndarray
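One way to compute many slice sums at once is to difference a cumulative sum, sketched below. This may differ from what the library does internally; `sum_slices_sketch` is a hypothetical name, and each slice is assumed to be the semi-open range arr[start:end].

```python
import numpy as np

def sum_slices_sketch(arr, starts, ends):
    """Sum arr[start:end] for every (start, end) pair."""
    arr = np.asarray(arr)
    # csum[i] holds the sum of arr[:i], so csum[0] is 0.
    csum = np.concatenate([[0], np.cumsum(arr)])
    return csum[np.asarray(ends)] - csum[np.asarray(starts)]

arr = np.array([1, 2, 3, 4, 5])
sums = sum_slices_sketch(arr, starts=[0, 2, 3], ends=[2, 5, 3])
```

An empty slice (start equal to end) yields 0, matching what `arr[start:end].sum()` would return.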