Additional tools

Functions:

binnify(chromsizes, binsize[, rel_ids])

Divide a genome into evenly sized bins.

digest(fasta_records, enzyme)

Divide a genome into restriction fragments.

frac_gc(df, fasta_records[, mapped_only, ...])

Calculate the fraction of GC basepairs for each interval in a dataframe.

frac_gene_coverage(df, ucsc_mrna)

Calculate number and fraction of overlaps by predicted and verified RNA isoforms for a set of intervals stored in a dataframe.

frac_mapped(df, fasta_records[, return_input])

Calculate the fraction of mapped base-pairs for each interval in a dataframe.

make_chromarms(chromsizes, midpoints[, ...])

Split chromosomes into chromosome arms.

mark_runs(df, col, *[, allow_overlaps, ...])

Mark runs of spatially consecutive intervals sharing the same value of col.

merge_runs(df, col, *[, allow_overlaps, ...])

Merge runs of spatially consecutive intervals sharing the same value of col.

pair_by_distance(df, min_sep, max_sep[, ...])

From a dataframe of genomic intervals, find all unique pairs of intervals that are between min_sep and max_sep bp separated from each other.

seq_gc(seq[, mapped_only])

Calculate the fraction of GC basepairs for a string of nucleotides.

binnify(chromsizes, binsize, rel_ids=False)

Divide a genome into evenly sized bins.

Parameters:
  • chromsizes (Series) – pandas Series indexed by chromosome name with chromosome lengths in bp.

  • binsize (int) – size of bins in bp

Returns:

bintable

Return type:

pandas.DataFrame with columns: ‘chrom’, ‘start’, ‘end’.
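
The binning described above can be illustrated with a small pure-Python sketch. This is not bioframe's implementation: the helper name sketch_binnify is hypothetical, it takes a plain dict instead of a pandas Series, and it assumes the last bin of each chromosome is clipped to the chromosome length.

```python
def sketch_binnify(chromsizes, binsize):
    """Return (chrom, start, end) tuples tiling each chromosome with fixed-size bins.

    Illustrative only: chromsizes is a plain dict of name -> length in bp,
    and the final bin of each chromosome is assumed to be clipped to its length.
    """
    bins = []
    for chrom, length in chromsizes.items():
        for start in range(0, length, binsize):
            bins.append((chrom, start, min(start + binsize, length)))
    return bins


example_bins = sketch_binnify({"chr1": 250, "chr2": 100}, 100)
for row in example_bins:
    print(row)
```

With these inputs chr1 is tiled by two full bins plus a shorter 50 bp bin, and chr2 by a single bin.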

digest(fasta_records, enzyme)

Divide a genome into restriction fragments.

Parameters:
  • fasta_records (OrderedDict) – Dictionary of chromosome names to sequence records. Created by: bioframe.load_fasta('/path/to/fasta.fa')

  • enzyme (str) – Name of restriction enzyme.

Returns:

DataFrame with columns ‘chrom’, ‘start’, ‘end’.

Return type:

pandas.DataFrame
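
To make the idea of restriction fragments concrete, here is a rough standard-library sketch. It is not how digest is implemented: the helper sketch_digest and its site and cut_offset arguments are hypothetical, and it works on a single plain string rather than on FASTA records. The HindIII recognition site AAGCTT, cut after the first A, is used as the example.

```python
import re


def sketch_digest(seq, site, cut_offset):
    """Split seq into (start, end) fragments at every occurrence of site.

    Illustrative only: cut_offset is the cut position within the site,
    e.g. 1 for HindIII (A^AGCTT).
    """
    # A lookahead finds overlapping occurrences of the recognition site.
    pattern = "(?=" + re.escape(site) + ")"
    cuts = [m.start() + cut_offset for m in re.finditer(pattern, seq.upper())]
    edges = [0] + cuts + [len(seq)]
    # Drop zero-length fragments that would arise from a cut at either end.
    return [(s, e) for s, e in zip(edges[:-1], edges[1:]) if e > s]


example_fragments = sketch_digest("GGAAGCTTCCAAGCTTGG", "AAGCTT", 1)
print(example_fragments)
```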

frac_gc(df, fasta_records, mapped_only=True, return_input=True)

Calculate the fraction of GC basepairs for each interval in a dataframe.

Parameters:
  • df (pandas.DataFrame) – A set of genomic intervals stored as a DataFrame.

  • fasta_records (OrderedDict) – Dictionary of chromosome names to sequence records. Created by: bioframe.load_fasta('/path/to/fasta.fa')

  • mapped_only (bool) – If True, ignore ‘N’ in the fasta_records for calculation. If True and there are no mapped base-pairs in an interval, return np.nan.

  • return_input (bool) – If False, only return a Series named ‘GC’.

Returns:

df_mapped – Original dataframe with new column ‘GC’ appended.

Return type:

pd.DataFrame

frac_gene_coverage(df, ucsc_mrna)

Calculate number and fraction of overlaps by predicted and verified RNA isoforms for a set of intervals stored in a dataframe.

Parameters:
  • df (pd.DataFrame) – Set of genomic intervals stored as a dataframe.

  • ucsc_mrna (str or DataFrame) – Name of UCSC genome or all_mrna.txt dataframe from UCSC or similar.

Returns:

df_gene_coverage

Return type:

pd.DataFrame

frac_mapped(df, fasta_records, return_input=True)

Calculate the fraction of mapped base-pairs for each interval in a dataframe.

Parameters:
  • df (pandas.DataFrame) – A set of genomic intervals stored as a DataFrame.

  • fasta_records (OrderedDict) – Dictionary of chromosome names to sequence records. Created by: bioframe.load_fasta('/path/to/fasta.fa')

  • return_input (bool) – If False, only return a Series named ‘frac_mapped’.

Returns:

df_mapped – Original dataframe with new column ‘frac_mapped’ appended.

Return type:

pd.DataFrame
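
As a rough illustration of what "fraction of mapped base-pairs" means, the sketch below counts non-N bases per interval. It is not bioframe's code: the helper sketch_frac_mapped is hypothetical, it uses plain tuples and strings instead of a DataFrame and FASTA records, and returning NaN for an empty interval is a choice made here.

```python
def sketch_frac_mapped(intervals, sequences):
    """Return the fraction of non-N bases for each (chrom, start, end) interval.

    Illustrative only: sequences is a plain dict of chromosome name -> string.
    """
    fractions = []
    for chrom, start, end in intervals:
        sub = sequences[chrom][start:end].upper()
        if not sub:
            # Assumption of this sketch: an empty interval has no defined fraction.
            fractions.append(float("nan"))
        else:
            fractions.append(1 - sub.count("N") / len(sub))
    return fractions


example_sequences = {"chr1": "ACGTNNNNACGT"}
example_intervals = [("chr1", 0, 4), ("chr1", 4, 8), ("chr1", 2, 6)]
example_fractions = sketch_frac_mapped(example_intervals, example_sequences)
print(example_fractions)
```

The three intervals are fully mapped, fully unmapped, and half mapped respectively.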

make_chromarms(chromsizes, midpoints, cols_chroms=('chrom', 'length'), cols_mids=('chrom', 'mid'), suffixes=('_p', '_q'))

Split chromosomes into chromosome arms.

Parameters:
  • chromsizes (pandas.DataFrame or dict-like) – If dict or pandas.Series, a map from chromosomes to lengths in bp. If pandas.DataFrame, a dataframe with columns defined by cols_chroms. If cols_chroms is a triplet (e.g. ‘chrom’, ‘start’, ‘end’), then values in chromsizes[cols_chroms[1]].values must all be zero.

  • midpoints (pandas.DataFrame or dict-like) – Mapping of chromosomes to midpoint (aka centromere) locations. If dict or pandas.Series, a map from chromosomes to midpoints in bp. If pandas.DataFrame, a dataframe with columns defined by cols_mids.

  • cols_chroms ((str, str) or (str, str, str)) – Names of the columns to use when chromsizes is a dataframe: either two columns (chromosome, length) or three columns (chromosome, start, end). Defaults to (‘chrom’, ‘length’).

  • suffixes (tuple, optional) – Suffixes to name chromosome arms. Defaults to (‘_p’, ‘_q’).

Returns:

df_chromarms – 4-column BED-like DataFrame (chrom, start, end, name). Arm names are chromosome names + suffix. Any chromosome not included in midpoints will not be split.

Return type:

pandas.DataFrame
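
The splitting rule can be sketched in a few lines of plain Python. This is only an illustration, not the library's implementation: sketch_chromarms is a hypothetical helper that takes plain dicts, and naming an unsplit chromosome by its bare chromosome name is an assumption of the sketch.

```python
def sketch_chromarms(chromsizes, midpoints, suffixes=("_p", "_q")):
    """Return (chrom, start, end, name) tuples, splitting chromosomes at midpoints.

    Illustrative only: chromsizes and midpoints are plain dicts keyed by
    chromosome name. Chromosomes without a midpoint are kept whole and,
    as an assumption of this sketch, named by the chromosome name alone.
    """
    arms = []
    for chrom, length in chromsizes.items():
        if chrom in midpoints:
            mid = midpoints[chrom]
            arms.append((chrom, 0, mid, chrom + suffixes[0]))
            arms.append((chrom, mid, length, chrom + suffixes[1]))
        else:
            arms.append((chrom, 0, length, chrom))
    return arms


example_arms = sketch_chromarms({"chr1": 100, "chrM": 16}, {"chr1": 40})
for row in example_arms:
    print(row)
```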

mark_runs(df, col, *, allow_overlaps=False, reset_counter=True, run_col='run', cols=None)

Mark runs of spatially consecutive intervals sharing the same value of col.

Parameters:
  • df (DataFrame) – A bioframe dataframe.

  • col (str) – The column to mark runs of values for.

  • allow_overlaps (bool, optional [default: False]) – If True, allow intervals in df to overlap. This may cause unexpected results.

  • reset_counter (bool, optional [default: True]) – If True, reset the run counter for each chromosome.

  • run_col (str, optional [default: 'run']) – The name of the column to store the run numbers in.

Returns:

A reordered copy of the input dataframe with an additional column (named by run_col, ‘run’ by default) marking runs of values in the input column.

Return type:

pandas.DataFrame

Notes

This is similar to cluster(), but only clusters intervals sharing the same value of col.

Examples

>>> df = pd.DataFrame({
...     'chrom': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'start': [0, 100, 200, 300, 400, 500],
...     'end': [100, 200, 300, 400, 500, 600],
...     'value': [1, 1, 1, 2, 2, 2],
... })
>>> mark_runs(df, 'value')
    chrom  start  end  value  run
0   chr1      0  100      1    0
1   chr1    100  200      1    0
2   chr1    200  300      1    0
3   chr1    300  400      2    1
4   chr1    400  500      2    1
5   chr1    500  600      2    1

See also

merge_runs, cluster, merge

merge_runs(df, col, *, allow_overlaps=False, agg=None, cols=None)

Merge runs of spatially consecutive intervals sharing the same value of col.

Parameters:
  • df (DataFrame) – A bioframe dataframe.

  • col (str) – The column to compress runs of values for.

  • allow_overlaps (bool, optional [default: False]) – If True, allow intervals in df to overlap. This may cause unexpected results.

  • agg (dict, optional [default: None]) –

    A dictionary of additional column names and aggregation functions to apply to each run. Takes the format:

    {'agg_name': ('column_name', 'agg_func')}

Returns:

Dataframe with consecutive intervals in the same run merged.

Return type:

pandas.DataFrame

Notes

This is similar to merge(), but only merges intervals sharing the same value of col.

Examples

>>> df = pd.DataFrame({
...     'chrom': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'start': [0, 100, 200, 300, 400, 500],
...     'end': [100, 200, 300, 400, 500, 600],
...     'value': [1, 1, 1, 2, 2, 2],
... })
>>> merge_runs(df, 'value')
    chrom  start  end  value
0   chr1      0  300      1
1   chr1    300  600      2
>>> merge_runs(df, 'value', agg={'sum': ('value', 'sum')})
    chrom  start  end  value  sum
0   chr1      0  300      1    3
1   chr1    300  600      2    6

See also

mark_runs, cluster, merge

pair_by_distance(df, min_sep, max_sep, min_intervening=None, max_intervening=None, relative_to='midpoints', cols=None, return_index=False, keep_order=False, suffixes=('_1', '_2'))

From a dataframe of genomic intervals, find all unique pairs of intervals that are between min_sep and max_sep bp separated from each other.

Parameters:
  • df (pandas.DataFrame) – A BED-like dataframe.

  • min_sep (int) – Minimum separation between intervals in bp. Must be > 0.

  • max_sep (int) – Maximum separation between intervals in bp. Must be >= min_sep.

  • min_intervening (int) – Minimum number of intervening intervals separating pairs. Must be > 0.

  • max_intervening (int) – Maximum number of intervening intervals separating pairs. Must be >= min_intervening.

  • relative_to (str) – Whether to calculate distances between interval “midpoints” or “endpoints”. Default “midpoints”.

  • cols ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.

  • return_index (bool) – If True, return indices of pairs as two new columns (‘index’+suffixes[0] and ‘index’+suffixes[1]). Default False.

  • keep_order (bool, optional) – If True, sort the output dataframe to preserve the order of the intervals in df. Default False. Note that it relies on sorting of the index in the original dataframe, and will reorder the output by index.

  • suffixes ((str, str), optional) – The column name suffixes for the two interval sets in the output. The first interval of each output pair is always upstream of the second.

Returns:

A BEDPE-like dataframe of paired intervals from df.

Return type:

pandas.DataFrame
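
A brute-force sketch of pairing by midpoint distance may help clarify the min_sep and max_sep arguments. It is not the library's algorithm: sketch_pair_by_distance is a hypothetical helper working on plain tuples, it only covers the "midpoints" mode, it ignores the intervening-interval filters, and treating both bounds as inclusive is an assumption.

```python
from itertools import combinations


def sketch_pair_by_distance(intervals, min_sep, max_sep):
    """Return pairs of same-chromosome intervals whose midpoints are
    between min_sep and max_sep bp apart (bounds assumed inclusive).

    Illustrative only: intervals are (chrom, start, end) tuples.
    """
    pairs = []
    # Sorting puts the upstream interval first in every emitted pair.
    for a, b in combinations(sorted(intervals), 2):
        if a[0] != b[0]:
            continue  # never pair intervals across chromosomes
        sep = abs((b[1] + b[2]) / 2 - (a[1] + a[2]) / 2)
        if min_sep <= sep <= max_sep:
            pairs.append((a, b))
    return pairs


example_pairs = sketch_pair_by_distance(
    [("chr1", 0, 10), ("chr1", 20, 30), ("chr1", 100, 110), ("chr2", 0, 10)],
    min_sep=10,
    max_sep=50,
)
print(example_pairs)
```

Only the first two chr1 intervals (midpoints 5 and 25) fall within the 10 to 50 bp window.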

seq_gc(seq, mapped_only=True)

Calculate the fraction of GC basepairs for a string of nucleotides.

Parameters:
  • seq (str) – Basepair input

  • mapped_only (bool) – If True, ignore ‘N’ in the sequence for calculation. If True and there are no mapped base-pairs, return np.nan.

Returns:

gc – Calculated GC content.

Return type:

float
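
The effect of mapped_only can be shown with a standard-library sketch. The helper sketch_seq_gc is hypothetical and is not the library function; it assumes case-insensitive counting and uses math.nan where the documentation mentions np.nan.

```python
import math


def sketch_seq_gc(seq, mapped_only=True):
    """Return the GC fraction of seq, optionally ignoring unmapped 'N' bases.

    Illustrative only: counting is case-insensitive (an assumption), and
    NaN is returned when there are no bases to count.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    nbases = len(seq)
    if mapped_only:
        nbases -= seq.count("N")
    return gc / nbases if nbases > 0 else math.nan


gc_mapped = sketch_seq_gc("ACGTNN")                  # Ns excluded from the denominator
gc_all = sketch_seq_gc("ACGTNN", mapped_only=False)  # Ns counted in the denominator
gc_unmapped = sketch_seq_gc("NNNN")                  # no mapped bases at all
print(gc_mapped, gc_all, gc_unmapped)
```

With mapped_only=True the two N bases are excluded, so "ACGTNN" gives 2/4 instead of 2/6.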