Construction

Functions:

`from_any`(regions[, fill_null, name_col, cols])	Attempts to make a genomic interval dataframe with columns [chrom, start, end, name_col] from a variety of input types.
`from_dict`(regions[, cols])	Makes a dataframe from a dictionary of {str,int} pairs, interpreted as chromosome names and sizes.
`make_viewframe`(regions[, check_bounds, ...])	Makes and validates a dataframe view_df out of regions.
`sanitize_bedframe`(df1[, recast_dtypes, ...])	Attempts to clean a genomic interval dataframe to be a valid bedframe.

from_any(regions, fill_null=False, name_col='name', cols=None)[source]

Attempts to make a genomic interval dataframe with columns [chrom, start, end, name_col] from a variety of input types.

Parameters:

regions (supported input) –
Currently supported inputs:
- dataframe
- series of UCSC strings
- dictionary of {str:int} key value pairs
- pandas series where the index is interpreted as chromosomes and values are interpreted as end
- list of tuples or lists, either [(chrom,start,end)] or [(chrom,start,end,name)]
- tuple of tuples or lists, either [(chrom,start,end)] or [(chrom,start,end,name)]
fill_null (False or dictionary) – Accepts a dictionary of {str:int} pairs, interpreted as chromosome sizes. Kept for backwards compatibility. Default False.
name_col (str) – Column name. Only used if a 4-column list is provided. Default “name”.
cols ((str,str,str)) – Names for dataframe columns. Default None sets them with get_default_colnames().

Returns:

out_df

Return type:

dataframe
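
The input types listed above can be pictured with a plain-Python sketch. This is not bioframe's implementation: `to_rows` is a hypothetical helper, it returns tuples rather than a dataframe, it assumes that dictionary values are ends with start fixed at 0, that UCSC strings look like "chrom:start-end", and it ignores an optional fourth `name` element for brevity.

```python
def to_rows(regions):
    """Hypothetical helper: normalize a few region inputs to (chrom, start, end) tuples."""
    if isinstance(regions, dict):
        # {chrom: end} pairs; start is assumed to be 0 for every chromosome.
        return [(chrom, 0, int(end)) for chrom, end in regions.items()]

    rows = []
    for item in regions:
        if isinstance(item, str):
            # UCSC-style string such as "chr1:100-200"; commas in numbers are stripped.
            chrom, _, span = item.rpartition(":")
            start, _, end = span.replace(",", "").partition("-")
        else:
            # Tuple or list: (chrom, start, end) or (chrom, start, end, name).
            chrom, start, end = item[:3]
        rows.append((chrom, int(start), int(end)))
    return rows


rows_from_dict = to_rows({"chr1": 1000, "chr2": 500})
rows_from_ucsc = to_rows(["chr1:100-200"])
rows_from_list = to_rows([("chr2", 5, 50, "peak1")])
```

With bioframe itself, the same inputs would be passed directly as the `regions` argument of `from_any`, which returns a dataframe instead of a list of tuples.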

from_dict(regions, cols=None)[source]

Makes a dataframe from a dictionary of {str,int} pairs, interpreted as chromosome names and sizes.

Note that {str,(int,int)} dictionaries of tuples are no longer supported!

Parameters:

regions (dict)
name_col (str) – Default ‘name’.
cols ((str, str, str) or None) – The names of the columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.

Returns:

df

Return type:

pandas.DataFrame

make_viewframe(regions, check_bounds=None, name_style=None, view_name_col='name', cols=None)[source]

Makes and validates a dataframe view_df out of regions.

Parameters:

regions (supported input type) –
Currently supported input types:
- a dictionary where keys are strings and values are integers {str:int}, specifying regions (chrom, 0, end, chrom)
- a pandas series of chromosome lengths with index specifying region names
- a list of tuples [(chrom,start,end), …] or [(chrom,start,end,name), …]
- a pandas DataFrame, skips to validation step
name_style (None or "ucsc") – If None and there is no column view_name_col, propagates values from cols[0]. If “ucsc” and there is no column view_name_col, creates UCSC-style names.
check_bounds (None, or chromosome sizes provided in any of the valid formats above) – Optional. If provided, checks that the regions in the view are contained by the regions supplied in check_bounds, typically a series of chromosome sizes. Default None.
view_name_col (str) – Specifies column name of the view regions. Default ‘name’.
cols ((str, str, str) or None) – The names of the columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.

Returns:

view_df

Return type:

dataframe satisfying properties of a view
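
The interaction of name_style and check_bounds described above can be sketched in plain Python. This is an illustration under stated assumptions, not bioframe's code: `make_view` and its `chromsizes` argument are hypothetical, regions are tuples rather than dataframe rows, and the bounds check assumes each chromosome spans 0 to its size.

```python
def make_view(regions, chromsizes=None, name_style=None):
    """Hypothetical helper: build (chrom, start, end, name) rows for a view."""
    view = []
    for region in regions:
        chrom, start, end = region[:3]
        if len(region) == 4:
            # An explicit name always wins.
            name = region[3]
        elif name_style == "ucsc":
            # UCSC-style name built from the coordinates.
            name = f"{chrom}:{start}-{end}"
        else:
            # Default: propagate the chromosome name.
            name = chrom

        if chromsizes is not None:
            # Assumed containment rule: the region must lie within [0, size].
            if chrom not in chromsizes or start < 0 or end > chromsizes[chrom]:
                raise ValueError(f"region {name} is not contained in the bounds")
        view.append((chrom, start, end, name))
    return view


view_default = make_view([("chr1", 0, 100)])
view_ucsc = make_view([("chr1", 0, 100)], name_style="ucsc")

try:
    make_view([("chr1", 0, 200)], chromsizes={"chr1": 100})
    bounds_error = None
except ValueError as err:
    bounds_error = str(err)
```

Here a region that extends past the supplied chromosome size is rejected with an error instead of being clipped, which is how this sketch reads "checks if regions in the view are contained".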

sanitize_bedframe(df1, recast_dtypes=True, drop_null=False, start_exceed_end_action=None, cols=None)[source]

Attempts to clean a genomic interval dataframe to be a valid bedframe.

Parameters:

df1 (pandas.DataFrame)
recast_dtypes (bool) – Whether to attempt to recast column dtypes to pandas nullable dtypes.
drop_null (bool) – Drops rows with pd.NA. Default False.
start_exceed_end_action (str or None) –
Options: ‘flip’ or ‘drop’ or None. Default None.
- If ‘flip’, attempts to sanitize by flipping intervals with start>end.
- If ‘drop’, attempts to sanitize by dropping intervals with start>end.
- If None, does not alter these intervals if present.
cols ((str, str, str) or None) – The names of the columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.

Returns:

out_df – Sanitized dataframe satisfying the properties of a bedframe.

Return type:

pandas.DataFrame

Notes

The option start_exceed_end_action='flip' may be useful for GFF files that carry strand information but contain intervals with start > end.
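
As a rough illustration of the three start_exceed_end_action options, the following plain-Python sketch applies them to a list of tuples. The helper `handle_start_exceeds_end` is hypothetical and only mirrors the documented semantics; it is not how bioframe implements sanitize_bedframe, which operates on dataframes.

```python
def handle_start_exceeds_end(intervals, action=None):
    """Hypothetical helper: apply 'flip', 'drop', or None to intervals with start > end."""
    if action not in (None, "flip", "drop"):
        raise ValueError("action must be 'flip', 'drop', or None")

    cleaned = []
    for chrom, start, end in intervals:
        if start > end:
            if action == "flip":
                # Swap the coordinates so that start <= end.
                start, end = end, start
            elif action == "drop":
                # Discard the offending interval.
                continue
            # action is None: leave the interval unchanged.
        cleaned.append((chrom, start, end))
    return cleaned


intervals = [("chr1", 10, 20), ("chr1", 50, 30)]
flipped = handle_start_exceeds_end(intervals, action="flip")
dropped = handle_start_exceeds_end(intervals, action="drop")
untouched = handle_start_exceeds_end(intervals)
```

With the documented API, the corresponding call would pass the same option by keyword, as in start_exceed_end_action='flip'.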