Validation

Functions:

is_bedframe(df[, raise_errors, cols])

Checks that required bedframe properties are satisfied for dataframe df.

is_cataloged(df, view_df[, raise_errors, ...])

Tests if all region names in df[df_view_col] are present in view_df[view_name_col].

is_contained(df, view_df[, raise_errors, ...])

Tests if all genomic intervals in a bioframe df are cataloged and do not extend beyond their associated region in the view view_df.

is_covering(df, view_df[, view_name_col, ...])

Tests if a view view_df is covered by the set of genomic intervals in the bedframe df.

is_overlapping(df[, cols])

Tests if any genomic intervals in a bioframe df overlap.

is_sorted(df[, view_df, reset_index, ...])

Tests if a bedframe is changed by sorting.

is_tiling(df, view_df[, raise_errors, ...])

Tests if a view view_df is tiled by the set of genomic intervals in the bedframe df.

is_viewframe(region_df[, raise_errors, ...])

Checks that region_df is a valid viewFrame.

is_bedframe(df, raise_errors=False, cols=None)

Checks that required bedframe properties are satisfied for dataframe df.

This includes:

  • chrom, start, end columns

  • columns have valid dtypes

  • for each interval, if any of chrom, start, end are null, then all are null

  • all starts < ends.

Parameters:
  • df (pandas.DataFrame)

  • raise_errors (bool, optional [default: False]) – If True, raises errors instead of returning a boolean False for invalid properties.

  • cols ((str, str, str) or None) – The names of the columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.

Returns:

is_bedframe

Return type:

bool

Notes

Valid dtypes for chrom are object, string, or categorical. Valid dtypes for start and end are int/Int64Dtype.
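
The four properties listed above can be sketched in plain pandas. This is only an illustration of the documented checks under simplifying assumptions, not bioframe's implementation; the helper name looks_like_bedframe and the sample frames are hypothetical.

```python
import pandas as pd
from pandas.api import types as pdt


def looks_like_bedframe(df, cols=("chrom", "start", "end")):
    """Hypothetical helper mirroring the documented bedframe checks."""
    ck, sk, ek = cols
    # 1. The chrom, start and end columns must all be present.
    if not set(cols).issubset(df.columns):
        return False
    # 2. chrom is object, string or categorical; start and end are integers
    #    (is_integer_dtype also accepts the nullable Int64 dtype).
    chrom_ok = (
        pdt.is_object_dtype(df[ck])
        or pdt.is_string_dtype(df[ck])
        or isinstance(df[ck].dtype, pd.CategoricalDtype)
    )
    if not (chrom_ok and pdt.is_integer_dtype(df[sk]) and pdt.is_integer_dtype(df[ek])):
        return False
    # 3. A row is either fully null or fully non-null across the three columns.
    nulls = df[list(cols)].isna()
    if (nulls.any(axis=1) & ~nulls.all(axis=1)).any():
        return False
    # 4. Every non-null interval has start < end, as stated in the list above.
    valid = ~nulls.any(axis=1)
    return bool((df.loc[valid, sk] < df.loc[valid, ek]).all())


good = pd.DataFrame({"chrom": ["chr1", "chr1"], "start": [0, 10], "end": [5, 20]})
bad = pd.DataFrame({"chrom": ["chr1"], "start": [10], "end": [5]})  # start > end
```

Here looks_like_bedframe(good) passes every check, while bad fails the last one. The documented function additionally offers raise_errors, which this sketch leaves out.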

is_cataloged(df, view_df, raise_errors=False, df_view_col='view_region', view_name_col='name')

Tests if all region names in df[df_view_col] are present in view_df[view_name_col].

Parameters:
  • df (pandas.DataFrame)

  • view_df (pandas.DataFrame)

  • raise_errors (bool) – If True, raises errors instead of returning a boolean False for invalid properties. Default False.

  • df_view_col (str) – Name of column from df that indicates region in view.

  • view_name_col (str) – Name of column from view that specifies region name.

Returns:

is_cataloged

Return type:

bool

Notes

Does not check if names in view_df[view_name_col] are unique.
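
As a rough picture of what "cataloged" means, the membership test can be written with pandas alone. The helper all_regions_cataloged and the sample data are hypothetical and only approximate the documented behaviour.

```python
import pandas as pd


def all_regions_cataloged(df, view_df, df_view_col="view_region", view_name_col="name"):
    """Hypothetical helper: every region label in df appears in the view."""
    return bool(df[df_view_col].isin(view_df[view_name_col]).all())


view_df = pd.DataFrame(
    {"chrom": ["chr1", "chr2"], "start": [0, 0], "end": [100, 50], "name": ["chr1", "chr2"]}
)
df = pd.DataFrame(
    {"chrom": ["chr1", "chr2"], "start": [5, 10], "end": [20, 30], "view_region": ["chr1", "chr2"]}
)
# Same intervals, but the second one points at a region the view does not list.
stray = df.assign(view_region=["chr1", "chrX"])
```

In this sketch df is cataloged and stray is not. As in the note above, duplicated names in the view would go unnoticed.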

is_contained(df, view_df, raise_errors=False, df_view_col=None, view_name_col='name', cols=None, cols_view=None)

Tests if all genomic intervals in a bioframe df are cataloged and do not extend beyond their associated region in the view view_df.

Parameters:
  • df (pandas.DataFrame)

  • view_df (pandas.DataFrame) – Valid viewframe.

  • raise_errors (bool) – If True, raises errors instead of returning a boolean False for invalid properties. Default False.

  • df_view_col – Column from df used to associate intervals with view regions. Default view_region.

  • view_name_col – Column from view_df with view region names. Default name.

  • cols ((str, str, str)) – Column names for chrom, start, end in df.

  • cols_view ((str, str, str)) – Column names for chrom, start, end in view_df.

Returns:

is_contained

Return type:

bool
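
One way to picture containment is to look up each interval's region and compare bounds. The sketch below assumes the default column names, unique view names, and a df that already carries a view_region column; the helper intervals_contained is hypothetical and is not bioframe's code.

```python
import pandas as pd


def intervals_contained(df, view_df, df_view_col="view_region", view_name_col="name"):
    """Hypothetical helper: each interval lies inside its assigned view region."""
    view = view_df.rename(columns={"chrom": "v_chrom", "start": "v_start", "end": "v_end"})
    # A left join keeps every interval; unknown region labels get NaN bounds.
    merged = df.merge(view, left_on=df_view_col, right_on=view_name_col, how="left")
    if merged["v_start"].isna().any():
        return False  # at least one interval is not cataloged
    inside = (
        (merged["chrom"] == merged["v_chrom"])
        & (merged["start"] >= merged["v_start"])
        & (merged["end"] <= merged["v_end"])
    )
    return bool(inside.all())


view_df = pd.DataFrame({"chrom": ["chr1"], "start": [0], "end": [100], "name": ["chr1"]})
inside = pd.DataFrame(
    {"chrom": ["chr1", "chr1"], "start": [5, 90], "end": [20, 100], "view_region": ["chr1", "chr1"]}
)
# The interval 90-120 extends past the region end at 100.
spill = pd.DataFrame({"chrom": ["chr1"], "start": [90], "end": [120], "view_region": ["chr1"]})
```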

is_covering(df, view_df, view_name_col='name', cols=None, cols_view=None)

Tests if a view view_df is covered by the set of genomic intervals in the bedframe df.

The test passes if complement(df, view_df) is empty. Note that it ignores any regions already assigned to intervals in df, because bioframe.ops.complement() re-assigns regions.

Parameters:
  • df (pandas.DataFrame)

  • view_df (pandas.DataFrame) – Valid viewFrame.

  • view_name_col – Column from view_df with view region names. Default name.

  • cols ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. The default values are ‘chrom’, ‘start’, ‘end’.

  • cols_view ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals in view_df, provided separately for each set. The default values are ‘chrom’, ‘start’, ‘end’.

Returns:

is_covering

Return type:

bool
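
The idea of an empty complement can be illustrated with a simple sweep over each view region. This sketch assumes the default column names and half-open intervals; view_is_covered is a hypothetical helper and does not call bioframe.ops.complement().

```python
import pandas as pd


def view_is_covered(df, view_df):
    """Hypothetical helper: no part of any view region is left uncovered."""
    for region in view_df.itertuples(index=False):
        # Intervals on the same chromosome that intersect this region.
        hits = df[
            (df["chrom"] == region.chrom)
            & (df["end"] > region.start)
            & (df["start"] < region.end)
        ]
        reached = region.start  # right edge of the stretch covered so far
        for start, end in sorted(zip(hits["start"], hits["end"])):
            if start > reached:
                return False  # gap between `reached` and `start`
            reached = max(reached, end)
        if reached < region.end:
            return False  # the tail of the region is uncovered
    return True


view_df = pd.DataFrame({"chrom": ["chr1"], "start": [0], "end": [100], "name": ["chr1"]})
full = pd.DataFrame({"chrom": ["chr1", "chr1"], "start": [0, 50], "end": [60, 100]})
gappy = pd.DataFrame({"chrom": ["chr1", "chr1"], "start": [0, 50], "end": [40, 100]})
```

In this sketch full covers the region even though its intervals overlap, while gappy leaves positions 40 to 50 uncovered.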

is_overlapping(df, cols=None)

Tests if any genomic intervals in a bioframe df overlap.

Also see bioframe.ops.merge().

Parameters:
  • df (pandas.DataFrame)

  • cols ((str, str, str) or None) – The names of the columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.

Returns:

is_overlapping

Return type:

bool
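
A minimal pandas sketch of an overlap test follows. It assumes half-open intervals, so intervals that merely touch are not counted as overlapping; that convention and the helper name any_overlap are assumptions of this example rather than statements about bioframe.

```python
import pandas as pd


def any_overlap(df, cols=("chrom", "start", "end")):
    """Hypothetical helper: True if two intervals on one chromosome overlap."""
    ck, sk, ek = cols
    ordered = df.sort_values([ck, sk, ek])
    # Largest end among the earlier intervals on the same chromosome
    # (NaN for the first interval of each chromosome).
    prev_max_end = ordered.groupby(ck)[ek].transform(lambda s: s.cummax().shift())
    # An interval overlaps an earlier one if it starts before that end.
    return bool((ordered[sk] < prev_max_end).any())


overlapping = pd.DataFrame({"chrom": ["chr1", "chr1"], "start": [0, 5], "end": [10, 15]})
touching = pd.DataFrame(
    {"chrom": ["chr1", "chr1", "chr2"], "start": [0, 10, 0], "end": [10, 20, 10]}
)
```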

is_sorted(df, view_df=None, reset_index=True, df_view_col=None, view_name_col='name', cols=None, cols_view=None)

Tests if a bedframe is changed by sorting.

Also see bioframe.ops.sort_bedframe().

Parameters:
  • df (pandas.DataFrame)

  • view_df (pandas.DataFrame | dict-like) – Optional view to pass to sort_bedframe. When it is dict-like, bioframe.make_viewframe() will be used to convert it to a viewframe. If view_df is not provided, df is assumed to be sorted by chrom and start.

  • reset_index (bool) – Optional argument to pass to sort_bedframe.

  • df_view_col (None | str) – Name of column from df that indicates region in view. If None, bioframe.assign_view() will be used to assign view regions. Default None.

  • view_name_col (str) – Name of column from view that specifies unique region name.

  • cols ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. The default values are ‘chrom’, ‘start’, ‘end’.

  • cols_view ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals in view_df, provided separately for each set. The default values are ‘chrom’, ‘start’, ‘end’.

Returns:

is_sorted

Return type:

bool
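
"Changed by sorting" can be illustrated, for the case without a view, by sorting a copy and comparing it with the original. The sketch uses plain pandas ordering (so 'chr10' sorts before 'chr2') and a hypothetical helper unchanged_by_sorting; it does not reproduce sort_bedframe or its view handling.

```python
import pandas as pd


def unchanged_by_sorting(df, cols=("chrom", "start", "end")):
    """Hypothetical helper: df already ordered by chrom, then start."""
    ck, sk, _ = cols
    # A stable sort keeps the original order of ties, so only true
    # out-of-order rows make the sorted copy differ.
    ordered = df.sort_values([ck, sk], kind="stable").reset_index(drop=True)
    return df.reset_index(drop=True).equals(ordered)


in_order = pd.DataFrame(
    {"chrom": ["chr1", "chr1", "chr2"], "start": [0, 10, 0], "end": [5, 20, 8]}
)
shuffled = in_order.iloc[::-1]  # chr2 now comes first
```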

is_tiling(df, view_df, raise_errors=False, df_view_col='view_region', view_name_col='name', cols=None, cols_view=None)

Tests if a view view_df is tiled by the set of genomic intervals in the bedframe df.

This is true if:

  • df is not overlapping

  • df is covering view_df

  • df is contained in view_df

Parameters:
  • df (pandas.DataFrame)

  • view_df (pandas.DataFrame) – Valid viewFrame.

  • raise_errors (bool) – If True, raises errors instead of returning a boolean False for invalid properties. Default False.

  • df_view_col (str) – Name of column from df that indicates region in view.

  • view_name_col (str) – Name of column from view that specifies unique region name.

  • cols ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. The default values are ‘chrom’, ‘start’, ‘end’.

  • cols_view ((str, str, str) or None) – The names of columns containing the chromosome, start and end of the genomic intervals in view_df, provided separately for each set. The default values are ‘chrom’, ‘start’, ‘end’.

Returns:

is_tiling

Return type:

bool
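
For non-empty intervals and a valid view, the three conditions above amount to the intervals of each region lining up end to start from the region's start to its end. The sketch below checks exactly that with pandas; view_is_tiled is a hypothetical helper that assumes the default column names, and it is not bioframe's implementation.

```python
import pandas as pd


def view_is_tiled(df, view_df, df_view_col="view_region", view_name_col="name"):
    """Hypothetical helper: intervals partition every view region exactly."""
    # Contained implies cataloged: every label must exist in the view.
    if not df[df_view_col].isin(view_df[view_name_col]).all():
        return False
    for region in view_df.itertuples(index=False):
        name = getattr(region, view_name_col)
        part = df[df[df_view_col] == name].sort_values("start")
        if part.empty or (part["chrom"] != region.chrom).any():
            return False
        starts = part["start"].to_numpy()
        ends = part["end"].to_numpy()
        edges_match = (
            starts[0] == region.start             # begins at the region start
            and ends[-1] == region.end            # finishes at the region end
            and (starts[1:] == ends[:-1]).all()   # no gaps and no overlaps
        )
        if not edges_match:
            return False
    return True


view_df = pd.DataFrame({"chrom": ["chr1"], "start": [0], "end": [100], "name": ["chr1"]})
tiles = pd.DataFrame(
    {"chrom": ["chr1", "chr1"], "start": [0, 50], "end": [50, 100], "view_region": ["chr1", "chr1"]}
)
gap = tiles.assign(start=[0, 60])      # positions 50-60 are uncovered
overlap = tiles.assign(end=[60, 100])  # positions 50-60 are covered twice
```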

is_viewframe(region_df, raise_errors=False, view_name_col='name', cols=None)

Checks that region_df is a valid viewFrame.

This includes:

  • it satisfies requirements for a bedframe, including columns for (‘chrom’, ‘start’, ‘end’)

  • it has an additional column, view_name_col, with default ‘name’

  • it does not contain null values

  • entries in the view_name_col are unique

  • intervals are non-overlapping.

Parameters:
  • region_df (pandas.DataFrame) – Dataframe of genomic intervals to be tested.

  • raise_errors (bool) – If True, raises errors instead of returning a boolean False for invalid properties. Default False.

  • view_name_col (str) – Name of the column that holds the view region names. Default ‘name’.

  • cols ((str, str, str) or None) – The names of the columns containing the chromosome, start and end of the genomic intervals. The default values are ‘chrom’, ‘start’, ‘end’.

Returns:

is_viewframe

Return type:

bool
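
The viewFrame requirements listed above can be approximated with pandas as follows. The sketch skips the dtype checks, treats touching intervals as non-overlapping, and uses a hypothetical helper name, looks_like_viewframe; it is meant as an illustration rather than a replacement for is_viewframe.

```python
import pandas as pd


def looks_like_viewframe(region_df, view_name_col="name", cols=("chrom", "start", "end")):
    """Hypothetical helper approximating the documented viewFrame checks."""
    ck, sk, ek = cols
    required = [ck, sk, ek, view_name_col]
    # Interval columns plus the name column must exist.
    if not set(required).issubset(region_df.columns):
        return False
    # No null values are allowed in a view.
    if region_df[required].isna().any().any():
        return False
    # Basic bedframe property: start < end for every region.
    if not (region_df[sk] < region_df[ek]).all():
        return False
    # Region names must be unique.
    if region_df[view_name_col].duplicated().any():
        return False
    # Regions on the same chromosome must not overlap.
    ordered = region_df.sort_values([ck, sk, ek])
    prev_max_end = ordered.groupby(ck)[ek].transform(lambda s: s.cummax().shift())
    return not bool((ordered[sk] < prev_max_end).any())


arms = pd.DataFrame(
    {"chrom": ["chr1", "chr1"], "start": [0, 50], "end": [50, 100], "name": ["chr1_p", "chr1_q"]}
)
same_name = arms.assign(name=["chr1", "chr1"])  # duplicated region name
overlapping = arms.assign(start=[0, 40])        # second region starts inside the first
```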