Construction
Functions:
from_any | Attempts to make a genomic interval dataframe with columns [chrom, start, end, name_col] from a variety of input types.
from_dict | Makes a dataframe from a dictionary of {str:int} pairs, interpreted as chromosome names and sizes.
make_viewframe | Makes and validates a dataframe view_df out of regions.
sanitize_bedframe | Attempts to clean a genomic interval dataframe to be a valid bedframe.
- from_any(regions, fill_null=False, name_col='name', cols=None)
Attempts to make a genomic interval dataframe with columns [chrom, start, end, name_col] from a variety of input types.
- Parameters:
regions (supported input) –
Currently supported inputs:
dataframe
series of UCSC strings
dictionary of {str:int} key value pairs
pandas series where the index is interpreted as chromosomes and values are interpreted as end
list of tuples or lists, either [(chrom,start,end)] or [(chrom,start,end,name)]
tuple of tuples or lists, either [(chrom,start,end)] or [(chrom,start,end,name)]
fill_null (False or dictionary) – Accepts a dictionary of {str:int} pairs, interpreted as chromosome sizes. Kept for backwards compatibility. Default False.
name_col (str) – Column name. Only used if a 4-column list is provided. Default “name”.
cols ((str,str,str)) – Names for dataframe columns. Default None sets them with get_default_colnames().
- Returns:
out_df
- Return type:
dataframe
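To make the accepted shapes concrete, here is a minimal plain-Python sketch of how two of the inputs listed above (UCSC strings, and lists of 3- or 4-element tuples) could be normalized into rows. It is not bioframe's implementation: `normalize_regions` and `UCSC_PATTERN` are hypothetical names, rows are plain dictionaries rather than a dataframe, and the regular expression is a simplified assumption about what a UCSC-style string looks like.

```python
import re

# Simplified assumption: "chrom:start-end", with optional thousands separators.
UCSC_PATTERN = re.compile(r"^(?P<chrom>[^:]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def normalize_regions(regions, name_col="name"):
    """Hypothetical helper: turn UCSC strings or tuples into row dictionaries."""
    rows = []
    for item in regions:
        if isinstance(item, str):
            match = UCSC_PATTERN.match(item)
            if match is None:
                raise ValueError(f"not a UCSC-style string: {item!r}")
            rows.append({
                "chrom": match["chrom"],
                "start": int(match["start"].replace(",", "")),
                "end": int(match["end"].replace(",", "")),
            })
        elif len(item) == 3:
            chrom, start, end = item
            rows.append({"chrom": chrom, "start": start, "end": end})
        elif len(item) == 4:
            # The fourth element is stored under name_col.
            chrom, start, end, name = item
            rows.append({"chrom": chrom, "start": start, "end": end, name_col: name})
        else:
            raise ValueError(f"expected 3 or 4 fields, got {len(item)}")
    return rows


rows = normalize_regions(["chr1:1,000-2,000", ("chr2", 0, 50, "promoter")])
```

As in the parameter description, the name column only appears for 4-element inputs; 3-element inputs yield just the three coordinate fields in this sketch.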
- from_dict(regions, cols=None)
Makes a dataframe from a dictionary of {str:int} pairs, interpreted as chromosome names and sizes.
Note that {str,(int,int)} dictionaries of tuples are no longer supported!
- Parameters:
regions (dict)
name_col (str) – Default ‘name’.
cols ((str, str, str) or None) – The names of the columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.
- Returns:
df
- Return type:
pandas.DataFrame
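The dictionary convention can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: `dict_to_rows` is a hypothetical name, rows are dictionaries instead of a pandas.DataFrame, each interval is assumed to start at 0 and end at the given size, and the name is assumed to repeat the chromosome. Tuple values are rejected, in line with the note that {str,(int,int)} dictionaries are no longer supported.

```python
def dict_to_rows(regions, cols=("chrom", "start", "end")):
    """Hypothetical helper: expand {chrom: size} into whole-chromosome rows."""
    chrom_col, start_col, end_col = cols
    rows = []
    for chrom, size in regions.items():
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(size, bool) or not isinstance(size, int):
            raise ValueError(f"unsupported value for {chrom!r}: {size!r}")
        # Assumption: intervals span 0..size and are named after the chromosome.
        rows.append({chrom_col: chrom, start_col: 0, end_col: size, "name": chrom})
    return rows


rows = dict_to_rows({"chr1": 1000, "chr2": 500})
renamed = dict_to_rows({"chr1": 1000}, cols=("CHROM", "START", "END"))
```

Passing `cols` only changes the three coordinate column names; the values themselves are unaffected.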
- make_viewframe(regions, check_bounds=None, name_style=None, view_name_col='name', cols=None)
Makes and validates a dataframe view_df out of regions.
- Parameters:
regions (supported input type) –
Currently supported input types:
a dictionary where keys are strings and values are integers {str:int}, specifying regions (chrom, 0, end, chrom)
a pandas series of chromosome lengths with index specifying region names
a list of tuples [(chrom,start,end), …] or [(chrom,start,end,name), …]
a pandas DataFrame, skips to validation step
name_style (None or "ucsc") – If None and there is no column view_name_col, propagate values from cols[0]. If “ucsc” and there is no column view_name_col, create UCSC-style names.
check_bounds (None, or chromosome sizes provided as any of the valid formats above) – Optional. If provided, checks whether regions in the view are contained by the regions supplied in check_bounds, typically provided as a series of chromosome sizes. Default None.
view_name_col (str) – Specifies column name of the view regions. Default ‘name’.
cols ((str, str, str) or None) – The names of the columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.
- Returns:
view_df
- Return type:
dataframe satisfying properties of a view
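The effect of name_style and check_bounds can be illustrated with a small plain-Python sketch. It is a simplified model of the options documented above, not bioframe's implementation: `build_view` is a hypothetical name, regions are 3-element tuples, bounds are given as a {chrom: size} dictionary, and a ValueError is assumed as the failure mode for out-of-bounds regions. Any further properties a view must satisfy are not checked here.

```python
def build_view(regions, chromsizes=None, name_style=None):
    """Hypothetical helper: name regions and optionally check them against bounds."""
    if name_style not in (None, "ucsc"):
        raise ValueError("name_style must be None or 'ucsc'")
    view = []
    for chrom, start, end in regions:
        if chromsizes is not None:
            # A region is contained if it lies within 0..size of a known chromosome.
            if chrom not in chromsizes or start < 0 or end > chromsizes[chrom]:
                raise ValueError(f"{chrom}:{start}-{end} is outside the supplied bounds")
        # None propagates the chromosome name; "ucsc" builds "chrom:start-end".
        name = f"{chrom}:{start}-{end}" if name_style == "ucsc" else chrom
        view.append((chrom, start, end, name))
    return view


arms = build_view(
    [("chr1", 0, 50), ("chr1", 50, 100)],
    chromsizes={"chr1": 100},
    name_style="ucsc",
)
whole = build_view([("chr1", 0, 100)])
```

When several regions share a chromosome, propagating the chromosome name would give them identical names, which is one reason to prefer UCSC-style names in that case.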
- sanitize_bedframe(df1, recast_dtypes=True, drop_null=False, start_exceed_end_action=None, cols=None)
Attempts to clean a genomic interval dataframe to be a valid bedframe.
- Parameters:
df1 (pandas.DataFrame)
recast_dtypes (bool) – Whether to attempt to recast column dtypes to pandas nullable dtypes. Default True.
drop_null (bool) – Drops rows with pd.NA. Default False.
start_exceed_end_action (str or None) –
Options: ‘flip’ or ‘drop’ or None. Default None.
If ‘flip’, attempts to sanitize by flipping intervals with start>end.
If ‘drop’, attempts to sanitize by dropping intervals with start>end.
If None, does not alter these intervals if present.
cols ((str, str, str) or None) – The names of the columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.
- Returns:
out_df – Sanitized dataframe satisfying the properties of a bedframe.
- Return type:
pandas.DataFrame
Notes
The option start_exceed_end_action='flip' may be useful for gff files with strand information but starts > ends.
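The three settings of start_exceed_end_action, together with drop_null, can be sketched in plain Python. This is an illustration only, not the library's code: `sanitize_intervals` is a hypothetical name, intervals are 3-element tuples, None stands in for pd.NA, and dtype recasting and the final bedframe validation are left out.

```python
def sanitize_intervals(rows, drop_null=False, start_exceed_end_action=None):
    """Hypothetical helper: apply the documented null and start>end handling."""
    if start_exceed_end_action not in (None, "flip", "drop"):
        raise ValueError("start_exceed_end_action must be 'flip', 'drop' or None")
    out = []
    for chrom, start, end in rows:
        if chrom is None or start is None or end is None:
            # Rows with missing values are kept as-is unless drop_null is set.
            if not drop_null:
                out.append((chrom, start, end))
            continue
        if start > end:
            if start_exceed_end_action == "flip":
                start, end = end, start
            elif start_exceed_end_action == "drop":
                continue
            # With None, the interval is left unaltered.
        out.append((chrom, start, end))
    return out


rows = [("chr1", 10, 5), ("chr1", 0, 3), (None, 1, 2)]
flipped = sanitize_intervals(rows, start_exceed_end_action="flip")
dropped = sanitize_intervals(rows, start_exceed_end_action="drop")
no_nulls = sanitize_intervals(rows, drop_null=True)
```

In this sketch the two options are independent: drop_null removes rows with missing values, while start_exceed_end_action only affects complete rows whose start exceeds their end.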