Additional tools

Functions:

binnify(chromsizes, binsize[, rel_ids])

Divide a genome into evenly sized bins.

digest(fasta_records, enzyme)

Divide a genome into restriction fragments.

frac_gc(df, fasta_records[, mapped_only, ...])

Calculate the fraction of GC basepairs for each interval in a dataframe.

frac_gene_coverage(df, ucsc_mrna)

Calculate number and fraction of overlaps by predicted and verified RNA isoforms for a set of intervals stored in a dataframe.

frac_mapped(df, fasta_records[, return_input])

Calculate the fraction of mapped base-pairs for each interval in a dataframe.

make_chromarms(chromsizes, midpoints[, ...])

Split chromosomes into chromosome arms.

pair_by_distance(df, min_sep, max_sep[, ...])

From a dataframe of genomic intervals, find all unique pairs of intervals that are between min_sep and max_sep bp separated from each other.

seq_gc(seq[, mapped_only])

Calculate the fraction of GC basepairs for a string of nucleotides.

binnify(chromsizes, binsize, rel_ids=False)[source]

Divide a genome into evenly sized bins.

Parameters:
  • chromsizes (Series) – pandas Series indexed by chromosome name with chromosome lengths in bp.

  • binsize (int) – size of bins in bp

Returns:

bintable

Return type:

pandas.DataFrame with columns: ‘chrom’, ‘start’, ‘end’.
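To make the binning concrete, here is a minimal pure-Python sketch of the idea, not the library's implementation. The helper name `binnify_sketch` is hypothetical, it returns plain tuples rather than a DataFrame, and it assumes the last bin of each chromosome is clipped at the chromosome end.

```python
import math


def binnify_sketch(chromsizes, binsize):
    """Tile each chromosome with fixed-size bins as (chrom, start, end) tuples."""
    bins = []
    for chrom, length in chromsizes.items():
        n_bins = math.ceil(length / binsize)
        for i in range(n_bins):
            start = i * binsize
            # Assumption: the final bin is clipped at the chromosome end.
            end = min(start + binsize, length)
            bins.append((chrom, start, end))
    return bins


# Toy chromosome sizes, chosen only for illustration.
bins = binnify_sketch({"chr1": 250, "chr2": 100}, binsize=100)
for row in bins:
    print(row)
```

With the real function, the same input would be passed as a pandas Series indexed by chromosome name, and the result would be a DataFrame.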

digest(fasta_records, enzyme)[source]

Divide a genome into restriction fragments.

Parameters:
  • fasta_records (OrderedDict) – Dictionary of chromosome names to sequence records. Created by: bioframe.load_fasta('/path/to/fasta.fa')

  • enzyme (str) – Name of restriction enzyme.

Returns:

DataFrame with columns ‘chrom’, ‘start’, ‘end’.

Return type:

pandas.DataFrame
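The sketch below illustrates what splitting a genome into restriction fragments means. It is a simplification under stated assumptions: the hypothetical helper `digest_sketch` takes a literal recognition sequence and a cut offset instead of an enzyme name, works on plain strings rather than sequence records, and ignores reverse-complement and ambiguous sites.

```python
def digest_sketch(sequences, site, cut_offset):
    """Split each sequence at every occurrence of `site`.

    The cut is placed `cut_offset` bases after the start of each match.
    Returns (chrom, start, end) tuples covering each sequence end to end.
    """
    fragments = []
    for chrom, seq in sequences.items():
        seq = seq.upper()
        cuts = [0]
        pos = seq.find(site)
        while pos != -1:
            cuts.append(pos + cut_offset)
            pos = seq.find(site, pos + 1)
        cuts.append(len(seq))
        for start, end in zip(cuts[:-1], cuts[1:]):
            if end > start:  # skip empty fragments at sequence boundaries
                fragments.append((chrom, start, end))
    return fragments


# Toy sequence with two AAGCTT sites; cutting after the first A (offset 1).
fragments = digest_sketch({"chr1": "CCAAGCTTGGGGAAGCTTCC"}, "AAGCTT", 1)
print(fragments)
```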

frac_gc(df, fasta_records, mapped_only=True, return_input=True)[source]

Calculate the fraction of GC basepairs for each interval in a dataframe.

Parameters:
  • df (pandas.DataFrame) – A set of genomic intervals stored as a DataFrame.

  • fasta_records (OrderedDict) – Dictionary of chromosome names to sequence records. Created by: bioframe.load_fasta('/path/to/fasta.fa')

  • mapped_only (bool) – If True, ignore ‘N’ bases in fasta_records when calculating. If True and an interval has no mapped base-pairs, return np.nan.

  • return_input (bool) – If False, return only the Series of GC fractions instead of the input dataframe with the column appended.

Returns:

df_mapped – Original dataframe with new column ‘GC’ appended.

Return type:

pd.DataFrame

frac_gene_coverage(df, ucsc_mrna)[source]

Calculate number and fraction of overlaps by predicted and verified RNA isoforms for a set of intervals stored in a dataframe.

Parameters:
  • df (pd.DataFrame) – Set of genomic intervals stored as a dataframe.

  • ucsc_mrna (str or DataFrame) – Name of UCSC genome or all_mrna.txt dataframe from UCSC or similar.

Returns:

df_gene_coverage

Return type:

pd.DataFrame

frac_mapped(df, fasta_records, return_input=True)[source]

Calculate the fraction of mapped base-pairs for each interval in a dataframe.

Parameters:
  • df (pandas.DataFrame) – A set of genomic intervals stored as a DataFrame.

  • fasta_records (OrderedDict) – Dictionary of chromosome names to sequence records. Created by: bioframe.load_fasta('/path/to/fasta.fa')

  • return_input (bool) – If False, only return Series named frac_mapped.

Returns:

df_mapped – Original dataframe with new column ‘frac_mapped’ appended.

Return type:

pd.DataFrame
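As an illustration of the quantity being computed, the sketch below counts non-‘N’ bases per interval in pure Python. The helper name `frac_mapped_sketch` is hypothetical, sequences are plain strings rather than FASTA records, and returning 0.0 for an empty interval is a choice made for this sketch only.

```python
def frac_mapped_sketch(intervals, sequences):
    """Return the fraction of non-N bases for each (chrom, start, end) interval."""
    fractions = []
    for chrom, start, end in intervals:
        sub = sequences[chrom][start:end].upper()
        nbases = len(sub)
        if nbases == 0:
            # Sketch-only choice for empty intervals.
            fractions.append(0.0)
            continue
        fractions.append((nbases - sub.count("N")) / nbases)
    return fractions


# Toy sequence: four mapped bases, four N, four mapped bases.
sequences = {"chr1": "ACGTNNNNACGT"}
intervals = [("chr1", 0, 4), ("chr1", 2, 8), ("chr1", 4, 8)]
fractions = frac_mapped_sketch(intervals, sequences)
print(fractions)
```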

make_chromarms(chromsizes, midpoints, cols_chroms=('chrom', 'length'), cols_mids=('chrom', 'mid'), suffixes=('_p', '_q'))[source]

Split chromosomes into chromosome arms.

Parameters:
  • chromsizes (pandas.DataFrame or dict-like) – If dict or pandas.Series, a map from chromosomes to lengths in bp. If pandas.DataFrame, a dataframe with columns defined by cols_chroms. If cols_chroms is a triplet (e.g. ‘chrom’, ‘start’, ‘end’), then values in chromsizes[cols_chroms[1]].values must all be zero.

  • midpoints (pandas.DataFrame or dict-like) – Mapping of chromosomes to midpoint (aka centromere) locations. If dict or pandas.Series, a map from chromosomes to midpoints in bp. If pandas.DataFrame, a dataframe with columns defined by cols_mids.

  • cols_chroms ((str, str) or (str, str, str)) – Names of the columns to read from chromsizes when it is a DataFrame: either two columns (chromosome and length, default ‘chrom’, ‘length’) or three columns (chromosome, start, end).

  • suffixes (tuple, optional) – Suffixes to name chromosome arms. Defaults to p and q.

Returns:

df_chromarms – 4-column BED-like DataFrame (chrom, start, end, name). Arm names are chromosome names + suffix. Any chromosome not included in midpoints will not be split.

Return type:

pandas.DataFrame
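The arm-splitting logic can be sketched with plain dictionaries. This is an illustrative re-statement rather than the library code: `make_chromarms_sketch` is a hypothetical name, only dict inputs are handled, and an unsplit chromosome is assumed to keep its own name in the name field.

```python
def make_chromarms_sketch(chromsizes, midpoints, suffixes=("_p", "_q")):
    """Return (chrom, start, end, name) tuples, splitting at each midpoint."""
    arms = []
    for chrom, length in chromsizes.items():
        if chrom in midpoints:
            mid = midpoints[chrom]
            arms.append((chrom, 0, mid, chrom + suffixes[0]))
            arms.append((chrom, mid, length, chrom + suffixes[1]))
        else:
            # Assumption: a chromosome without a midpoint is emitted whole,
            # named after the chromosome itself.
            arms.append((chrom, 0, length, chrom))
    return arms


# Toy sizes and midpoint, chosen only for illustration.
arms = make_chromarms_sketch({"chr1": 1000, "chrM": 50}, {"chr1": 400})
for row in arms:
    print(row)
```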

pair_by_distance(df, min_sep, max_sep, min_intervening=None, max_intervening=None, relative_to='midpoints', cols=None, return_index=False, keep_order=False, suffixes=('_1', '_2'))[source]

From a dataframe of genomic intervals, find all unique pairs of intervals that are between min_sep and max_sep bp separated from each other.

Parameters:
  • df (pandas.DataFrame) – A BED-like dataframe.

  • min_sep (int) – Minimum separation between intervals in bp. Must satisfy min_sep > 0.

  • max_sep (int) – Maximum separation between intervals in bp. Must satisfy max_sep >= min_sep.

  • min_intervening (int) – Minimum number of intervening intervals separating pairs. Must satisfy min_intervening > 0.

  • max_intervening (int) – Maximum number of intervening intervals separating pairs. Must satisfy max_intervening >= min_intervening.

  • relative_to (str) – Whether to calculate distances between interval “midpoints” or “endpoints”. Default “midpoints”.

  • cols ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.

  • return_index (bool) – If True, return indices of pairs as two new columns (‘index’+suffixes[0] and ‘index’+suffixes[1]). Default False.

  • keep_order (bool, optional) – If True, sort the output dataframe to preserve the order of the intervals in df. Default False. Note that this relies on the sorting of the index in the original dataframe, and will reorder the output by index.

  • suffixes ((str, str), optional) – The column name suffixes for the two interval sets in the output. The first interval of each output pair is always upstream of the second.

Returns:

A BEDPE-like dataframe of paired intervals from df.

Return type:

pandas.DataFrame
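For intuition, the following sketch pairs intervals by midpoint separation using a brute-force scan. It is not the library's algorithm: `pair_by_distance_sketch` is a hypothetical helper, it covers only the midpoint case, it ignores the intervening-interval filters, and it assumes both separation bounds are inclusive.

```python
from itertools import combinations


def midpoint(interval):
    """Midpoint of a (chrom, start, end) tuple."""
    return (interval[1] + interval[2]) / 2


def pair_by_distance_sketch(intervals, min_sep, max_sep):
    """Return pairs of same-chromosome intervals whose midpoints are
    between min_sep and max_sep bp apart (bounds assumed inclusive)."""
    # Sort so the first member of each pair is the upstream one.
    ordered = sorted(intervals, key=lambda iv: (iv[0], midpoint(iv)))
    pairs = []
    for a, b in combinations(ordered, 2):
        if a[0] != b[0]:
            continue  # only pair intervals on the same chromosome
        sep = midpoint(b) - midpoint(a)
        if min_sep <= sep <= max_sep:
            pairs.append((a, b))
    return pairs


# Toy intervals, chosen only for illustration.
intervals = [("chr1", 0, 10), ("chr1", 20, 30), ("chr1", 100, 110), ("chr2", 0, 10)]
pairs = pair_by_distance_sketch(intervals, min_sep=10, max_sep=50)
print(pairs)
```

The brute-force scan is quadratic in the number of intervals per chromosome, which is fine for a demonstration but not for genome-scale inputs.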

seq_gc(seq, mapped_only=True)[source]

Calculate the fraction of GC basepairs for a string of nucleotides.

Parameters:
  • seq (str) – Basepair input

  • mapped_only (bool) – If True, ignore ‘N’ in the sequence for calculation. If True and there are no mapped base-pairs, return np.nan.

Returns:

gc – calculated gc content.

Return type:

float
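A minimal sketch of the GC calculation described above, written in pure Python. The helper name `seq_gc_sketch` is hypothetical; it assumes case-insensitive counting and uses `math.nan` where the documented function returns np.nan.

```python
import math


def seq_gc_sketch(seq, mapped_only=True):
    """Fraction of G and C bases in seq, optionally ignoring N bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    nbases = len(seq)
    if mapped_only:
        # Exclude unmapped (N) bases from the denominator.
        nbases -= seq.count("N")
    return gc / nbases if nbases > 0 else math.nan


print(seq_gc_sketch("GCNNAT"))                     # 2 GC out of 4 mapped bases
print(seq_gc_sketch("GCNNAT", mapped_only=False))  # 2 GC out of 6 bases
print(seq_gc_sketch("NNNN"))                       # no mapped bases
```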