String operations

Functions:

is_complete_ucsc_string(s)

Returns True if a string can be parsed into a fully specified (chrom, start, end) triple.

parse_region(grange[, chromsizes, check_bounds])

Coerce a genomic range string or sequence type into a triple.

parse_region_string(s)

Parse a UCSC-style genomic range string into a triple.

to_ucsc_string(grange)

Convert a grange to a UCSC string.

is_complete_ucsc_string(s: str) → bool

Returns True if a string can be parsed into a fully specified (chrom, start, end) triple.

Parameters:

s (str)

Returns:

True if the string can be parsed and its end coordinate is known.

Return type:

bool
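
As a rough illustration of what "complete" means here, the sketch below accepts only strings that spell out a sequence name, a start, and an end. It is not the library's implementation: `looks_complete` and `_COMPLETE` are hypothetical names, the regular expression is an assumption, and multiplier suffixes such as "1K" are deliberately left out.

```python
import re

# Hypothetical pattern for this sketch: name, colon, start, dash, end.
# The name may not contain whitespace or a colon; digits may carry commas.
_COMPLETE = re.compile(r"^[^\s:]+:(\d[\d,]*)-(\d[\d,]*)$")


def looks_complete(s) -> bool:
    """Return True only when chrom, start, and end are all spelled out."""
    return isinstance(s, str) and _COMPLETE.match(s) is not None
```

Under these assumptions, "chr5:10,100,000-30,000,000" counts as complete, while "chr5" and "chr5:100-" do not.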

parse_region(grange: str | tuple, chromsizes: dict | Series | None = None, *, check_bounds: bool = True) → Tuple[str, int, int]

Coerce a genomic range string or sequence type into a triple.

Parameters:
  • grange (str or tuple) –

    • A UCSC-style genomic range string, e.g. “chr5:10,100,000-30,000,000”.

    • A triple (chrom, start, end), where start or end may be None.

    • A quadruple or higher-order tuple, e.g. (chrom, start, end, name). name and other fields will be ignored.

  • chromsizes (dict or Series, optional) – Lookup table of sequence lengths for bounds checking and for filling in a missing end coordinate.

  • check_bounds (bool, optional [default: True]) – If True, check that the genomic range is within the bounds of the sequence.

Returns:

A well-formed genomic range triple (str, int, int).

Return type:

tuple

Notes

Genomic ranges are interpreted as half-open intervals (0-based starts, 1-based ends) along the length coordinate of a sequence.

Sequence names may contain any character except for whitespace and colon.

The start coordinate should be 0 or greater and the end coordinate should be less than or equal to the length of the sequence, if the latter is known. These are enforced when check_bounds is True.

If the start coordinate is missing, it is assumed to be 0. If the end coordinate is missing and chromsizes are provided, it is replaced with the length of the sequence.

The end coordinate must be greater than or equal to the start.
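
The default-filling and bounds rules described so far can be sketched as a small standalone function. This is an illustration rather than the library's code: `normalize_triple` is a hypothetical name, and raising `ValueError` when a rule is violated or the end cannot be determined is an assumption of the sketch.

```python
from typing import Mapping, Optional, Tuple


def normalize_triple(
    chrom: str,
    start: Optional[int],
    end: Optional[int],
    chromsizes: Optional[Mapping[str, int]] = None,
    check_bounds: bool = True,
) -> Tuple[str, int, int]:
    """Hypothetical helper: fill missing coordinates, then validate them."""
    # Sequence length, if a lookup table was supplied.
    clen = chromsizes[chrom] if chromsizes is not None else None

    # A missing start defaults to 0.
    if start is None:
        start = 0

    # A missing end is replaced by the sequence length when it is known.
    if end is None:
        if clen is None:
            raise ValueError("end is missing and no chromsizes were given")
        end = clen

    # The end may equal the start (an empty interval) but not precede it.
    if end < start:
        raise ValueError("end must be greater than or equal to start")

    # Bounds are enforced only when requested.
    if check_bounds:
        if start < 0:
            raise ValueError("start must be 0 or greater")
        if clen is not None and end > clen:
            raise ValueError("end exceeds the sequence length")

    return chrom, start, end
```

For instance, `normalize_triple("chr1", None, None, {"chr1": 1000})` yields `("chr1", 0, 1000)` under these rules.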

The start and end coordinates may be suffixed with k(b), M(b), or G(b) multipliers (case-insensitive). For example, “chr1:1K-2M” is equivalent to “chr1:1000-2000000”.
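
To make the multiplier rule concrete, here is a minimal sketch that expands a single coordinate token. `expand_coordinate`, `_COORD`, and `_MULTIPLIERS` are hypothetical names, not part of the library, and the sketch assumes whole-number coordinates only.

```python
import re

# Hypothetical lookup: suffix letter (lowercased) to multiplier.
_MULTIPLIERS = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9}

# Digits with optional thousands separators, then an optional k/M/G
# suffix that may itself be followed by "b". Matching ignores case.
_COORD = re.compile(r"^(\d[\d,]*)(?:([kmg])b?)?$", re.IGNORECASE)


def expand_coordinate(token: str) -> int:
    """Turn a token such as '1K', '2Mb', or '10,100,000' into an int."""
    m = _COORD.match(token)
    if m is None:
        raise ValueError(f"not a coordinate: {token!r}")
    digits, unit = m.groups()
    return int(digits.replace(",", "")) * _MULTIPLIERS[(unit or "").lower()]
```

With this helper, `expand_coordinate("1K")` gives 1000 and `expand_coordinate("2M")` gives 2000000, matching the equivalence stated above.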

parse_region_string(s: str) → Tuple[str, int, int]

Parse a UCSC-style genomic range string into a triple.

Parameters:

s (str) – UCSC-style genomic range string, e.g. “chr5:10,100,000-30,000,000”.

Returns:

(str, int or None, int or None)

Return type:

tuple

See also

parse_region
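
The shape of the returned triple can be illustrated with a simplified, regex-based splitter. This is a sketch under stated assumptions, not the library's parser: `split_region` and `_REGION` are hypothetical names, multiplier suffixes are not handled, and a coordinate left out of the string is reported as None.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: a name without whitespace or colons, optionally
# followed by ":start-end", where either coordinate may be left empty.
_REGION = re.compile(
    r"^(?P<chrom>[^\s:]+)(?::(?P<start>[\d,]*)-(?P<end>[\d,]*))?$"
)


def split_region(s: str) -> Tuple[str, Optional[int], Optional[int]]:
    """Split 'chrom:start-end' into (chrom, start or None, end or None)."""
    m = _REGION.match(s.strip())
    if m is None:
        raise ValueError(f"cannot parse region string: {s!r}")

    def to_int(text: Optional[str]) -> Optional[int]:
        # Drop thousands separators; an empty field means "not given".
        digits = (text or "").replace(",", "")
        return int(digits) if digits else None

    return m.group("chrom"), to_int(m.group("start")), to_int(m.group("end"))
```

Here `split_region("chr5:10,100,000-30,000,000")` returns `("chr5", 10100000, 30000000)`, and `split_region("chr5")` returns `("chr5", None, None)`.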

to_ucsc_string(grange: Tuple[str, int, int]) → str

Convert a grange to a UCSC string.

Parameters:

grange (tuple or other iterable) – chrom, start, end

Returns:

UCSC-style genomic range string, ‘{chrom}:{start}-{end}’

Return type:

str
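
The output format amounts to simple string interpolation, as the following sketch shows. `format_region` is a hypothetical stand-in rather than the library function, and ignoring any fields beyond the first three is an assumption of the sketch.

```python
from typing import Iterable


def format_region(grange: Iterable) -> str:
    """Render (chrom, start, end) as 'chrom:start-end'."""
    # Keep only the first three fields; extras such as a name are dropped.
    chrom, start, end = tuple(grange)[:3]
    return f"{chrom}:{start}-{end}"
```

For example, `format_region(("chr5", 10100000, 30000000))` produces "chr5:10100000-30000000", with no thousands separators in the output.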