Additional tools
Functions:
- binnify – Divide a genome into evenly sized bins.
- digest – Divide a genome into restriction fragments.
- frac_gc – Calculate the fraction of GC basepairs for each interval in a dataframe.
- frac_gene_coverage – Calculate number and fraction of overlaps by predicted and verified RNA isoforms for a set of intervals stored in a dataframe.
- frac_mapped – Calculate the fraction of mapped base-pairs for each interval in a dataframe.
- make_chromarms – Split chromosomes into chromosome arms.
- mark_runs – Mark runs of spatially consecutive intervals sharing the same value of col.
- merge_runs – Merge runs of spatially consecutive intervals sharing the same value of col.
- pair_by_distance – From a dataframe of genomic intervals, find all unique pairs of intervals that are between min_sep and max_sep bp separated from each other.
- Calculate the fraction of GC basepairs for a string of nucleotides.
- binnify(chromsizes, binsize, rel_ids=False)
Divide a genome into evenly sized bins.
- Parameters:
chromsizes (Series) – pandas Series indexed by chromosome name with chromosome lengths in bp.
binsize (int) – size of bins in bp
- Returns:
bintable
- Return type:
pandas.DataFrame with columns: ‘chrom’, ‘start’, ‘end’.
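To make the binning rule concrete, here is a minimal pure-Python sketch of what "evenly sized bins" means. It is an illustration only, not bioframe's implementation: `binnify_sketch` is a hypothetical name, a plain dict stands in for the pandas Series, and the clipping of the last bin at the chromosome end is an assumption of the sketch.

```python
def binnify_sketch(chromsizes, binsize):
    """Illustrative stand-in for binnify; returns (chrom, start, end) tuples."""
    if binsize <= 0:
        raise ValueError("binsize must be positive")
    bins = []
    for chrom, length in chromsizes.items():
        for start in range(0, length, binsize):
            # Assumption: the final bin is clipped at the chromosome end,
            # so it may be shorter than binsize.
            bins.append((chrom, start, min(start + binsize, length)))
    return bins


bins = binnify_sketch({"chr1": 250, "chr2": 100}, binsize=100)
for row in bins:
    print(row)
```

With these inputs, chr1 yields three bins (the last one 50 bp long) and chr2 yields one.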
- digest(fasta_records, enzyme)
Divide a genome into restriction fragments.
- Parameters:
fasta_records (OrderedDict) – Dictionary of chromosome names to sequence records. Created by: bioframe.load_fasta(‘/path/to/fasta.fa’)
enzyme (str) – Name of restriction enzyme.
- Returns:
Dataframe with columns ‘chrom’, ‘start’, ‘end’.
- Return type:
pandas.DataFrame
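The idea of cutting a sequence into restriction fragments can be sketched in plain Python. This is not how digest is implemented: `digest_sketch`, `site`, and `cut_offset` are hypothetical names, and the sketch takes the recognition site and cut position explicitly instead of an enzyme name. It also assumes a palindromic site, so only one strand is searched.

```python
def digest_sketch(seq, site, cut_offset):
    """Split seq at each occurrence of site; returns (start, end) fragments.

    site       -- recognition sequence in upper case (hypothetical parameter)
    cut_offset -- number of bases into the site where the cut falls
    """
    seq = seq.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        # Continue one base further so overlapping sites are also found.
        pos = seq.find(site, pos + 1)
    edges = [0] + cuts + [len(seq)]
    # Drop empty fragments that arise when a cut falls on a sequence end.
    return [(s, e) for s, e in zip(edges[:-1], edges[1:]) if e > s]


# HindIII recognizes AAGCTT and cuts after the first A (A^AGCTT).
fragments = digest_sketch("GGAAGCTTCCAAGCTTGG", site="AAGCTT", cut_offset=1)
print(fragments)
```

Each fragment is a half-open (start, end) interval, matching the ‘start’ and ‘end’ columns described above.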
- frac_gc(df, fasta_records, mapped_only=True, return_input=True)
Calculate the fraction of GC basepairs for each interval in a dataframe.
- Parameters:
df (pandas.DataFrame) – A set of genomic intervals stored as a DataFrame.
fasta_records (OrderedDict) – Dictionary of chromosome names to sequence records. Created by: bioframe.load_fasta(‘/path/to/fasta.fa’)
mapped_only (bool) – If True, ignore ‘N’ bases in fasta_records when calculating the fraction. If True and an interval contains no mapped base-pairs, return np.nan for that interval.
return_input (bool) – If False, return only a Series of GC fractions (the ‘GC’ column) instead of the full dataframe.
- Returns:
df_mapped – Original dataframe with new column ‘GC’ appended.
- Return type:
pd.DataFrame
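The effect of mapped_only is easiest to see on a single sequence. The sketch below is a hedged illustration of the calculation described above, not bioframe's code: `gc_fraction` is a hypothetical helper, and treating lower-case (soft-masked) bases the same as upper-case ones is an assumption of the sketch.

```python
import math


def gc_fraction(seq, mapped_only=True):
    """Fraction of G/C bases in seq (illustrative helper)."""
    seq = seq.upper()  # assumption: soft-masked bases count like any other
    gc = seq.count("G") + seq.count("C")
    if mapped_only:
        total = len(seq) - seq.count("N")  # ignore unmapped 'N' bases
    else:
        total = len(seq)
    # No usable bases: report NaN instead of dividing by zero.
    return gc / total if total > 0 else math.nan


print(gc_fraction("GGCCAATT"))                 # 4 of 8 bases are G or C
print(gc_fraction("GCNN"))                     # N ignored: 2 of 2
print(gc_fraction("GCNN", mapped_only=False))  # N counted: 2 of 4
print(gc_fraction("NNNN"))                     # no mapped bases
```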
- frac_gene_coverage(df, ucsc_mrna)
Calculate number and fraction of overlaps by predicted and verified RNA isoforms for a set of intervals stored in a dataframe.
- Parameters:
df (pd.DataFrame) – Set of genomic intervals stored as a dataframe.
ucsc_mrna (str or DataFrame) – Name of UCSC genome or all_mrna.txt dataframe from UCSC or similar.
- Returns:
df_gene_coverage
- Return type:
pd.DataFrame
- frac_mapped(df, fasta_records, return_input=True)
Calculate the fraction of mapped base-pairs for each interval in a dataframe.
- Parameters:
df (pandas.DataFrame) – A set of genomic intervals stored as a DataFrame.
fasta_records (OrderedDict) – Dictionary of chromosome names to sequence records. Created by: bioframe.load_fasta(‘/path/to/fasta.fa’)
return_input (bool) – if False, only return Series named frac_mapped.
- Returns:
df_mapped – Original dataframe with new column ‘frac_mapped’ appended.
- Return type:
pd.DataFrame
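As a rough illustration of the per-interval calculation, the following sketch computes the fraction of non-‘N’ bases for a list of intervals. The names are hypothetical, a dict of plain strings stands in for the FASTA records, and returning NaN for empty intervals is an assumption; this is not bioframe's implementation.

```python
import math


def frac_mapped_sketch(intervals, sequences):
    """Fraction of non-N bases per (chrom, start, end) interval."""
    fractions = []
    for chrom, start, end in intervals:
        seq = sequences[chrom][start:end].upper()
        if not seq:
            # Assumption: an empty interval has no defined fraction.
            fractions.append(math.nan)
            continue
        fractions.append(1 - seq.count("N") / len(seq))
    return fractions


sequences = {"chr1": "ACGTNNNNACGT"}
intervals = [("chr1", 0, 4), ("chr1", 2, 6), ("chr1", 4, 8)]
fractions = frac_mapped_sketch(intervals, sequences)
print(fractions)
```

The three intervals cover a fully mapped stretch, a half-mapped stretch, and a run of ‘N’ bases.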
- make_chromarms(chromsizes, midpoints, cols_chroms=('chrom', 'length'), cols_mids=('chrom', 'mid'), suffixes=('_p', '_q'))
Split chromosomes into chromosome arms.
- Parameters:
chromsizes (pandas.DataFrame or dict-like) – If dict or pandas.Series, a map from chromosomes to lengths in bp. If pandas.DataFrame, a dataframe with columns defined by cols_chroms. If cols_chroms is a triplet (e.g. ‘chrom’, ‘start’, ‘end’), then values in chromsizes[cols_chroms[1]].values must all be zero.
midpoints (pandas.DataFrame or dict-like) – Mapping of chromosomes to midpoint (aka centromere) locations. If dict or pandas.Series, a map from chromosomes to midpoints in bp. If pandas.DataFrame, a dataframe with columns defined by cols_mids.
suffixes (tuple, optional) – Suffixes to name chromosome arms. Defaults to p and q.
- Returns:
df_chromarms – 4-column BED-like DataFrame (chrom, start, end, name). Arm names are chromosome names + suffix. Any chromosome not included in midpoints will not be split.
- Return type:
pandas.DataFrame
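The splitting rule can be sketched with plain dicts. This is an illustrative stand-in, not bioframe's implementation: `make_chromarms_sketch` is a hypothetical name, and keeping an unsplit chromosome as one interval named after the chromosome is an assumption of the sketch.

```python
def make_chromarms_sketch(chromsizes, midpoints, suffixes=("_p", "_q")):
    """Return (chrom, start, end, name) tuples, one per chromosome arm."""
    arms = []
    for chrom, length in chromsizes.items():
        mid = midpoints.get(chrom)
        if mid is None:
            # Assumption: chromosomes without a midpoint stay whole and
            # keep the bare chromosome name.
            arms.append((chrom, 0, length, chrom))
        else:
            arms.append((chrom, 0, mid, chrom + suffixes[0]))
            arms.append((chrom, mid, length, chrom + suffixes[1]))
    return arms


arms = make_chromarms_sketch({"chr1": 1000, "chrM": 50}, {"chr1": 400})
for row in arms:
    print(row)
```

Here chr1 is split at 400 bp into chr1_p and chr1_q, while chrM has no midpoint and is left unsplit.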
- mark_runs(df, col, *, allow_overlaps=False, reset_counter=True, run_col='run', cols=None)
Mark runs of spatially consecutive intervals sharing the same value of col.
- Parameters:
df (DataFrame) – A bioframe dataframe.
col (str) – The column to mark runs of values for.
allow_overlaps (bool, optional [default: False]) – If True, allow intervals in df to overlap. This may cause unexpected results.
reset_counter (bool, optional [default: True]) – If True, reset the run counter for each chromosome.
run_col (str, optional [default: 'run']) – The name of the column to store the run numbers in.
- Returns:
A reordered copy of the input dataframe with an additional column ‘run’ marking runs of values in the input column.
- Return type:
pandas.DataFrame
Notes
This is similar to cluster(), but only clusters intervals sharing the same value of col.
Examples
>>> df = pd.DataFrame({
...     'chrom': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'start': [0, 100, 200, 300, 400, 500],
...     'end': [100, 200, 300, 400, 500, 600],
...     'value': [1, 1, 1, 2, 2, 2],
... })
>>> mark_runs(df, 'value')
  chrom  start  end  value  run
0  chr1      0  100      1    0
1  chr1    100  200      1    0
2  chr1    200  300      1    0
3  chr1    300  400      2    1
4  chr1    400  500      2    1
5  chr1    500  600      2    1
See also
merge_runs, cluster, merge
- merge_runs(df, col, *, allow_overlaps=False, agg=None, cols=None)
Merge runs of spatially consecutive intervals sharing the same value of col.
- Parameters:
df (DataFrame) – A bioframe dataframe.
col (str) – The column to compress runs of values for.
allow_overlaps (bool, optional [default: False]) – If True, allow intervals in df to overlap. This may cause unexpected results.
agg (dict, optional [default: None]) – A dictionary of additional column names and aggregation functions to apply to each run. Takes the format:
{'agg_name': ('column_name', 'agg_func')}
- Returns:
Dataframe with consecutive intervals in the same run merged.
- Return type:
pandas.DataFrame
Notes
This is similar to merge(), but only merges intervals sharing the same value of col.
Examples
>>> df = pd.DataFrame({
...     'chrom': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'start': [0, 100, 200, 300, 400, 500],
...     'end': [100, 200, 300, 400, 500, 600],
...     'value': [1, 1, 1, 2, 2, 2],
... })
>>> merge_runs(df, 'value')
  chrom  start  end  value
0  chr1      0  300      1
1  chr1    300  600      2
>>> merge_runs(df, 'value', agg={'sum': ('value', 'sum')})
  chrom  start  end  value  sum
0  chr1      0  300      1    3
1  chr1    300  600      2    6
See also
mark_runs, cluster, merge
- pair_by_distance(df, min_sep, max_sep, min_intervening=None, max_intervening=None, relative_to='midpoints', cols=None, return_index=False, keep_order=False, suffixes=('_1', '_2'))
From a dataframe of genomic intervals, find all unique pairs of intervals that are between min_sep and max_sep bp separated from each other.
- Parameters:
df (pandas.DataFrame) – A BED-like dataframe.
min_sep (int) – Minimum separation between intervals in bp. Must be > 0.
max_sep (int) – Maximum separation between intervals in bp. Must be >= min_sep.
min_intervening (int) – Minimum number of intervening intervals separating pairs. Must be > 0.
max_intervening (int) – Maximum number of intervening intervals separating pairs. Must be >= min_intervening.
relative_to (str) – Whether to calculate distances between interval “midpoints” or “endpoints”. Default “midpoints”.
cols ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.
return_index (bool) – If True, return indices of pairs as two new columns (‘index’+suffixes[0] and ‘index’+suffixes[1]). Default False.
keep_order (bool, optional) – If True, sort the output dataframe to preserve the order of the intervals in df. Default False. Note that it relies on sorting of the index in the original dataframe, and will reorder the output by index.
suffixes ((str, str), optional) – The column name suffixes for the two interval sets in the output. The first interval of each output pair is always upstream of the second.
- Returns:
A BEDPE-like dataframe of paired intervals from df.
- Return type:
pandas.DataFrame
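The pairing rule for relative_to='midpoints' can be illustrated with a brute-force sketch. It is not bioframe's implementation: `pair_by_distance_sketch` is a hypothetical name, the bounds are assumed to be inclusive, and the min_intervening/max_intervening filters are left out for brevity.

```python
from itertools import combinations


def pair_by_distance_sketch(intervals, min_sep, max_sep):
    """Pair (chrom, start, end) intervals whose midpoints are
    min_sep..max_sep bp apart (bounds assumed inclusive)."""
    ordered = sorted(intervals)  # upstream interval comes first in each pair
    pairs = []
    for a, b in combinations(ordered, 2):
        if a[0] != b[0]:
            continue  # only pair intervals on the same chromosome
        sep = abs((b[1] + b[2]) / 2 - (a[1] + a[2]) / 2)
        if min_sep <= sep <= max_sep:
            pairs.append((a, b))
    return pairs


intervals = [
    ("chr1", 0, 100),
    ("chr1", 200, 300),
    ("chr1", 1000, 1100),
    ("chr2", 0, 100),
]
pairs = pair_by_distance_sketch(intervals, min_sep=100, max_sep=500)
print(pairs)
```

Only the first two chr1 intervals qualify: their midpoints (50 and 250) are 200 bp apart, while the third interval is too far from both and chr2 has no partner.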