Construction

Functions:

from_any(regions[, fill_null, name_col, cols])

Attempts to make a genomic interval dataframe with columns [chrom, start, end, name_col] from a variety of input types.

from_dict(regions[, cols])

Makes a dataframe from a dictionary of {str: int} pairs, interpreted as chromosome names and their sizes.

make_viewframe(regions[, check_bounds, ...])

Makes and validates a dataframe view_df out of regions.

sanitize_bedframe(df1[, recast_dtypes, ...])

Attempts to clean a genomic interval dataframe to be a valid bedframe.

from_any(regions, fill_null=False, name_col='name', cols=None)

Attempts to make a genomic interval dataframe with columns [chrom, start, end, name_col] from a variety of input types.

Parameters:
  • regions (supported input) –

    Currently supported inputs:

    • dataframe

    • series of UCSC strings

    • dictionary of {str:int} key value pairs

    • pandas series where the index is interpreted as chromosomes and values are interpreted as end

    • list of tuples or lists, either [(chrom,start,end)] or [(chrom,start,end,name)]

    • tuple of tuples or lists, either [(chrom,start,end)] or [(chrom,start,end,name)]

  • fill_null (False or dictionary) – Accepts a dictionary of {str:int} pairs, interpreted as chromosome sizes. Kept for backwards compatibility. Default False.

  • name_col (str) – Name of the column that holds region names. Only used if a 4-column list is provided. Default “name”.

  • cols ((str,str,str)) – Names for dataframe columns. Default None sets them with get_default_colnames().

Returns:

out_df

Return type:

dataframe
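The input styles listed above all describe the same three-column frame. The sketch below uses only pandas to show that equivalence; it is an illustration, not the library's implementation, and parse_ucsc is a hypothetical helper. With bioframe installed, each of these inputs would instead be passed directly to from_any.

```python
import pandas as pd

COLS = ["chrom", "start", "end"]  # default column names given in this section


def parse_ucsc(region):
    """Hypothetical helper: split 'chrom:start-end' into (chrom, start, end)."""
    chrom, _, span = region.partition(":")
    start, _, end = span.partition("-")
    return chrom, int(start), int(end)


# The same two regions written in three of the supported input styles.
as_tuples = [("chr1", 0, 100), ("chr2", 0, 200)]
as_dict = {"chr1": 100, "chr2": 200}  # values are read as ends; start is 0
as_ucsc = ["chr1:0-100", "chr2:0-200"]

df_tuples = pd.DataFrame(as_tuples, columns=COLS)
df_dict = pd.DataFrame([(c, 0, e) for c, e in as_dict.items()], columns=COLS)
df_ucsc = pd.DataFrame([parse_ucsc(s) for s in as_ucsc], columns=COLS)

# A 4-element tuple adds a name column, called "name" by default (name_col).
df_named = pd.DataFrame([("chr1", 0, 100, "left_arm")], columns=COLS + ["name"])
```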

from_dict(regions, cols=None)

Makes a dataframe from a dictionary of {str: int} pairs, interpreted as chromosome names and their sizes.

Note that {str,(int,int)} dictionaries of tuples are no longer supported!

Parameters:
  • regions (dict)

  • name_col (str) – Default ‘name’. Not present in the signature above.

  • cols ((str, str, str) or None) – The names of the columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.

Returns:

df

Return type:

pandas.DataFrame
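A minimal pandas-only sketch of the conversion described above, including the cols argument. dict_to_intervals is a hypothetical stand-in written for illustration, not the library function. Raising ValueError for tuple values is an assumption about how "no longer supported" is enforced.

```python
import pandas as pd


def dict_to_intervals(chromsizes, cols=("chrom", "start", "end")):
    """Illustrative stand-in: turn {name: size} into (name, 0, size) rows."""
    rows = []
    for chrom, end in chromsizes.items():
        if not isinstance(end, int):
            # Assumption: {str: (int, int)} style values are rejected.
            raise ValueError(f"unsupported value for {chrom!r}: {end!r}")
        rows.append((chrom, 0, end))
    return pd.DataFrame(rows, columns=list(cols))


chromsizes = {"chr1": 100, "chr2": 200}

default_df = dict_to_intervals(chromsizes)
custom_df = dict_to_intervals(chromsizes, cols=("CHROM", "START", "END"))
```

With bioframe installed, the corresponding call would be from_dict(chromsizes), optionally with cols=("CHROM", "START", "END").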

make_viewframe(regions, check_bounds=None, name_style=None, view_name_col='name', cols=None)

Makes and validates a dataframe view_df out of regions.

Parameters:
  • regions (supported input type) –

    Currently supported input types:

    • a dictionary where keys are strings and values are integers {str:int}, specifying regions (chrom, 0, end, chrom)

    • a pandas series of chromosome lengths with index specifying region names

    • a list of tuples [(chrom,start,end), …] or [(chrom,start,end,name), …]

    • a pandas DataFrame, skips to validation step

  • name_style (None or "ucsc") – If None and there is no column view_name_col, propagates values from cols[0]. If “ucsc” and there is no column view_name_col, creates UCSC-style names.

  • check_bounds (None, or chromosome sizes provided as any of the valid formats above) – Optional. If provided, checks that regions in the view are contained by the regions supplied in check_bounds, typically provided as a series of chromosome sizes. Default None.

  • view_name_col (str) – Specifies column name of the view regions. Default ‘name’.

  • cols ((str, str, str) or None) – The names of the columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.

Returns:

view_df

Return type:

dataframe satisfying properties of a view
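The two naming styles and the bounds check can be sketched with pandas alone. This is an illustration of the behavior described above, not the library's code. The "chrom:start-end" format for UCSC-style names and the containment test (start >= 0 and end <= chromosome size) are assumptions made for the sketch.

```python
import pandas as pd

COLS = ["chrom", "start", "end"]
chromsizes = {"chr1": 100, "chr2": 200}

# name_style=None: with no name column, names are copied from cols[0].
whole = pd.DataFrame([(c, 0, n) for c, n in chromsizes.items()], columns=COLS)
whole["name"] = whole["chrom"]

# name_style="ucsc": names are built as "chrom:start-end", which keeps them
# distinct when several regions share a chromosome.
arms = pd.DataFrame([("chr1", 0, 40), ("chr1", 60, 100)], columns=COLS)
arms["name"] = (
    arms["chrom"] + ":" + arms["start"].astype(str) + "-" + arms["end"].astype(str)
)

# check_bounds: every region must lie inside its chromosome.
limits = arms["chrom"].map(chromsizes)
within_bounds = bool(((arms["start"] >= 0) & (arms["end"] <= limits)).all())
```

With bioframe installed, the corresponding calls would be make_viewframe(chromsizes) and make_viewframe(regions, check_bounds=chromsizes, name_style="ucsc").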

sanitize_bedframe(df1, recast_dtypes=True, drop_null=False, start_exceed_end_action=None, cols=None)

Attempts to clean a genomic interval dataframe to be a valid bedframe.

Parameters:
  • df1 (pandas.DataFrame)

  • recast_dtypes (bool) – Whether to attempt to recast column dtypes to pandas nullable dtypes. Default True.

  • drop_null (bool) – Drops rows with pd.NA. Default False.

  • start_exceed_end_action (str or None) –

    Options: ‘flip’ or ‘drop’ or None. Default None.

    • If ‘flip’, attempts to sanitize by flipping intervals with start>end.

    • If ‘drop’, attempts to sanitize by dropping intervals with start>end.

    • If None, does not alter these intervals if present.

  • cols ((str, str, str) or None) – The names of the columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.

Returns:

out_df – Sanitized dataframe satisfying the properties of a bedframe.

Return type:

pandas.DataFrame

Notes

The option start_exceed_end_action='flip' may be useful for gff files with strand information but starts > ends.
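The difference between ‘flip’ and ‘drop’ can be sketched in plain pandas. This only illustrates the two actions on a toy frame. It is not the library's implementation and skips the dtype recasting and null handling described above.

```python
import pandas as pd

# Toy GFF-like table: the minus-strand row has start > end.
df = pd.DataFrame(
    {
        "chrom": ["chr1", "chr1", "chr2"],
        "start": [10, 90, 5],
        "end": [20, 50, 15],
        "strand": ["+", "-", "+"],
    }
)

reversed_rows = df["start"] > df["end"]

# 'flip': swap start and end on the offending rows only.
flipped = df.copy()
flipped.loc[reversed_rows, ["start", "end"]] = df.loc[
    reversed_rows, ["end", "start"]
].to_numpy()

# 'drop': discard the offending rows and renumber the index.
dropped = df.loc[~reversed_rows].reset_index(drop=True)
```

With bioframe installed, the corresponding calls would be sanitize_bedframe(df, start_exceed_end_action='flip') and sanitize_bedframe(df, start_exceed_end_action='drop').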