Resources

Genome assembly metadata

Bioframe provides a collection of genome assembly metadata for commonly used genomes. These are accessible through a convenient dataclass interface via bioframe.assembly_info().

The assemblies are listed in a manifest YAML file, and each assembly has a mandatory companion file called seqinfo that contains the sequence names, lengths, and other information. The records in the manifest file contain the following fields:

  • organism: the organism name

  • provider: the genome assembly provider (e.g., ucsc, ncbi)

  • provider_build: the genome assembly build name (e.g., hg19, GRCh37)

  • release_year: the year of the assembly release

  • seqinfo: path to the seqinfo file

  • cytobands: path to the cytoband file, if available

  • default_roles: default molecular roles to include from the seqinfo file

  • default_units: default assembly units to include from the seqinfo file

  • url: URL to where the corresponding sequence files can be downloaded

The seqinfo file is a TSV file with the following columns (with header):

  • name: canonical sequence name

  • length: sequence length

  • role: role of the sequence or scaffold (e.g., “assembled”, “unlocalized”, “unplaced”)

  • molecule: name of the molecule that the sequence belongs to, if placed

  • unit: assembly unit of the chromosome (e.g., “primary”, “non-nuclear”, “decoy”)

  • aliases: comma-separated list of aliases for the sequence name

We currently do not include sequences with “alt” or “patch” roles in seqinfo files, but we do support the inclusion of additional decoy sequences (as used by so-called NGS analysis sets for human genome assemblies) by marking them as members of a “decoy” assembly unit.
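As a rough illustration of how the role and unit columns can be used to select sequences, the sketch below parses a small seqinfo-style table with the standard library only. It is not bioframe's implementation. The function name select_sequences, the rows chrUn_example and chrDecoy_example, and the molecule and alias values are all made up for this example; only the column layout and the chr1, chr2, and chrM lengths come from this page.

```python
import csv
import io

# Illustrative seqinfo content. The *_example rows and the molecule/alias
# values are hypothetical; only the column layout follows the description.
SEQINFO_TSV = (
    "name\tlength\trole\tmolecule\tunit\taliases\n"
    "chr1\t248956422\tassembled\tchr1\tprimary\t1\n"
    "chr2\t242193529\tassembled\tchr2\tprimary\t2\n"
    "chrM\t16569\tassembled\tchrM\tnon-nuclear\tMT\n"
    "chrUn_example\t1000\tunplaced\t\tprimary\t\n"
    "chrDecoy_example\t2000\tunplaced\t\tdecoy\t\n"
)


def select_sequences(tsv_text, roles, units):
    """Return (name, length) pairs whose role and unit are selected.

    Passing the string "all" for roles or units disables that filter.
    File order is preserved, since it is treated as canonical.
    """
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    selected = []
    for row in rows:
        role_ok = roles == "all" or row["role"] in roles
        unit_ok = units == "all" or row["unit"] in units
        if role_ok and unit_ok:
            selected.append((row["name"], int(row["length"])))
    return selected


# Assembled sequences of the primary unit only.
primary_assembled = select_sequences(SEQINFO_TSV, ("assembled",), ("primary",))

# Any role, but only the primary and decoy units.
with_decoys = select_sequences(SEQINFO_TSV, "all", ("primary", "decoy"))
```

Here primary_assembled keeps chr1 and chr2, while with_decoys additionally keeps the two hypothetical example rows and drops chrM, which belongs to the non-nuclear unit.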

The cytoband file is an optional TSV file with the following columns (with header):

  • chrom: chromosome name

  • start: start position

  • end: end position

  • band: cytogenetic coordinate (name of the band)

  • stain: Giemsa stain result

The order of the sequences in the seqinfo file is treated as canonical. The ordering of the chromosomes in the cytobands file should match the order of the chromosomes in the seqinfo file.
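One possible reading of the ordering requirement is sketched below: each chromosome in the cytoband file should form a single contiguous run of rows, and the runs should follow the seqinfo order. The function name cytoband_order_matches is hypothetical, and the sketch assumes that a cytoband file may omit sequences that are present in seqinfo.

```python
def cytoband_order_matches(seqinfo_names, cytoband_chroms):
    """Check that chromosomes appear in cytobands in seqinfo order.

    seqinfo_names: sequence names in seqinfo (canonical) order.
    cytoband_chroms: the 'chrom' column of the cytoband file, row by row.
    """
    rank = {name: i for i, name in enumerate(seqinfo_names)}

    # Collapse consecutive rows of the same chromosome into one entry.
    runs = []
    for chrom in cytoband_chroms:
        if not runs or runs[-1] != chrom:
            runs.append(chrom)

    # Every chromosome must be known to seqinfo.
    if any(chrom not in rank for chrom in runs):
        return False

    # Each chromosome must appear as one run, and runs must be in order.
    ranks = [rank[chrom] for chrom in runs]
    return len(set(ranks)) == len(ranks) and ranks == sorted(ranks)


names = ["chr1", "chr2", "chrX"]
ok = cytoband_order_matches(names, ["chr1", "chr1", "chr2", "chrX"])
swapped = cytoband_order_matches(names, ["chr2", "chr1"])
```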

The manifest and companion files are stored in the bioframe/io/data directory. New assemblies can be requested by opening an issue on GitHub or by submitting a pull request.

Functions:

assemblies_available()

Get a list of available genome assembly metadata in local storage.

assembly_info(name[, roles, units])

Get information about a genome assembly.

assemblies_available() → DataFrame

Get a list of available genome assembly metadata in local storage.

Returns:

A dataframe with metadata fields for available assemblies, including ‘provider’, ‘provider_build’, ‘default_roles’, ‘default_units’, and names of seqinfo and cytoband files.

Return type:

pandas.DataFrame

assembly_info(name: str, roles: List | Tuple | Literal['all'] | None = None, units: List | Tuple | Literal['all'] | None = None) → GenomeAssembly

Get information about a genome assembly.

Parameters:
  • name (str) – Name of the assembly. If the name contains a dot, it is interpreted as a provider name and a build, e.g. “ucsc.hg38”. Otherwise, the provider is inferred if the build name is unique.

  • roles (list or tuple or "all", optional) – Sequence roles to include in the assembly info, e.g. “assembled”, “unlocalized”, “unplaced”. If not specified, only sequences with the default sequence roles for the assembly are shown.

  • units (list or tuple or "all", optional) – Assembly units to include in the assembly info, e.g. “primary”, “non-nuclear”, “decoy”. If not specified, only sequences from the default units for the assembly are shown.

Returns:

A dataclass containing information about the assembly.

Return type:

GenomeAssembly

Raises:

ValueError – If the assembly name is not found or is not unique.
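The name lookup described above can be sketched as follows. This is a simplified illustration, not the library's code: resolve_assembly and MANIFEST are hypothetical names, the provider and build pairings are illustrative, and "shared_build" is a made-up build that exists only to show the non-unique case.

```python
# Hypothetical (provider, build) pairs standing in for manifest records.
MANIFEST = [
    ("ucsc", "hg19"),
    ("ncbi", "GRCh37"),
    ("ucsc", "hg38"),
    ("ucsc", "shared_build"),  # made-up build listed by two providers
    ("ncbi", "shared_build"),
]


def resolve_assembly(name, manifest=MANIFEST):
    """Return the (provider, build) pair that `name` refers to."""
    if "." in name:
        # Dotted names are read as "<provider>.<build>".
        provider, build = name.split(".", 1)
        matches = [rec for rec in manifest if rec == (provider, build)]
    else:
        # Bare names are matched on the build alone.
        matches = [rec for rec in manifest if rec[1] == name]

    if not matches:
        raise ValueError(f"Assembly not found: {name}")
    if len(matches) > 1:
        raise ValueError(f"Assembly name is not unique: {name}")
    return matches[0]


by_build = resolve_assembly("hg38")
by_provider = resolve_assembly("ncbi.GRCh37")
```

In this sketch, resolve_assembly("shared_build") raises ValueError because two providers list that build, whereas "ucsc.shared_build" resolves unambiguously.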

Examples

>>> hg38 = assembly_info("hg38")
>>> hg38.chromsizes
name
chr1    248956422
chr2    242193529
chr3    198295559
...     ...
>>> assembly_info("hg38", roles=("assembled", "unlocalized"))
>>> assembly_info("ucsc.hg38", units=("primary", "non-nuclear"))

class GenomeAssembly(organism: str, provider: str, provider_build: str, release_year: str, seqinfo: DataFrame, cytobands: DataFrame | None = None, url: str | None = None, alias_dict: Dict[str, str] | None = None)

A dataclass containing information about sequences in a genome assembly.

alias_dict: Dict[str, str] = None
property chromnames: List[str]
property chromsizes: Series
cytobands: DataFrame = None
organism: str
provider: str
provider_build: str
release_year: str
seqinfo: DataFrame
url: str = None
property viewframe: DataFrame

Remote resources

These functions now default to using the local data store, but can be used to obtain chromsizes or centromere positions from UCSC by setting provider="ucsc".

Functions:

fetch_centromeres(db[, provider])

Extract centromere locations for a given assembly 'db' from a variety of file formats in UCSC (cytoband, centromeres) depending on availability, returning a DataFrame.

fetch_chromsizes(db, *[, provider, as_bed, ...])

Fetch chromsizes from local storage or the UCSC database.

fetch_centromeres(db: str, provider: str = 'local') → DataFrame

Extract centromere locations for a given assembly ‘db’ from a variety of file formats in UCSC (cytoband, centromeres) depending on availability, returning a DataFrame.

Parameters:
  • db (str) – Assembly name.

  • provider (str, optional [default: "local"]) – The provider of centromere data. Either “local” for local storage or “ucsc”.

Return type:

DataFrame with centromere ‘chrom’, ‘start’, ‘end’, ‘mid’.

Notes

When provider="local", centromeres are derived from cytoband tables in local storage.

When provider="ucsc", the fallback priority is as follows:

  • UCSC cytoBand

  • UCSC cytoBandIdeo

  • UCSC centromeres.txt

Note that UCSC “gap” files no longer provide centromere information.

Currently only works for human assemblies.
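To illustrate how centromere positions can be derived from a cytoband table, the sketch below collects the bands whose stain is "acen" (the value UCSC uses for centromeric bands) and reports their combined extent. It is an assumption-laden sketch and not bioframe's code: the function name centromeres_from_cytobands is hypothetical, the coordinates are made up, and taking the boundary between the two acen bands as 'mid' is one plausible convention.

```python
def centromeres_from_cytobands(bands):
    """Derive (chrom, start, end, mid) from 'acen' cytoband rows.

    bands: iterable of (chrom, start, end, band, stain) tuples.
    Assumes each centromere is marked by adjacent 'acen' bands,
    typically one on the p arm and one on the q arm.
    """
    acen = {}
    for chrom, start, end, band, stain in bands:
        if stain == "acen":
            acen.setdefault(chrom, []).append((start, end))

    result = []
    for chrom, spans in acen.items():
        spans.sort()
        start = spans[0][0]
        end = spans[-1][1]
        # With exactly two bands, use the p/q boundary as the midpoint;
        # otherwise fall back to the arithmetic midpoint.
        mid = spans[0][1] if len(spans) == 2 else (start + end) // 2
        result.append((chrom, start, end, mid))
    return result


# Made-up coordinates, for illustration only.
BANDS = [
    ("chr1", 0, 100, "p11.2", "gneg"),
    ("chr1", 100, 150, "p11.1", "acen"),
    ("chr1", 150, 200, "q11", "acen"),
    ("chr1", 200, 300, "q12", "gvar"),
]
centromeres = centromeres_from_cytobands(BANDS)
```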

See also

bioframe.assembly_info, bioframe.UCSCClient

fetch_chromsizes(db: str, *, provider: str = 'local', as_bed: bool = False, filter_chroms: bool = True, chrom_patterns: tuple = ('^chr[0-9]+$', '^chr[XY]$', '^chrM$'), natsort: bool = True, **kwargs) → Series | DataFrame

Fetch chromsizes from local storage or the UCSC database.

Parameters:
  • db (str) – Assembly name.

  • provider (str, optional [default: "local"]) – The provider of chromsizes. Either “local” for local storage or “ucsc”.

  • as_bed (bool, optional) – If True, return chromsizes as an interval DataFrame (chrom, start, end) instead of a Series.

The remaining options apply only when provider="ucsc":

  • filter_chroms (bool, optional) – Filter for chromosome names given in chrom_patterns.

  • chrom_patterns (sequence, optional) – Sequence of regular expressions to capture desired sequence names.

  • natsort (bool, optional) – Sort each captured group of names in natural order. Default is True.

  • **kwargs – Passed to pandas.read_csv()
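The interaction of filter_chroms, chrom_patterns, and natsort can be pictured with the standard-library sketch below: names are captured one pattern at a time, each captured group is sorted in natural order, and the groups are concatenated in pattern order. The helper names natural_key and filter_and_sort and the name chrUn_example are hypothetical, and the library's actual behavior may differ in details.

```python
import re

# The default patterns from the signature above.
DEFAULT_PATTERNS = (r"^chr[0-9]+$", r"^chr[XY]$", r"^chrM$")


def natural_key(name):
    """Split digits from text so that 'chr2' sorts before 'chr10'."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


def filter_and_sort(names, patterns=DEFAULT_PATTERNS, natsort=True):
    """Keep names matching any pattern, grouped in pattern order."""
    kept = []
    for pattern in patterns:
        # Capture names for this pattern, skipping any already captured.
        group = [n for n in names if re.search(pattern, n) and n not in kept]
        if natsort:
            group.sort(key=natural_key)
        kept.extend(group)
    return kept


names = ["chr10", "chrM", "chr2", "chrUn_example", "chrY", "chr1", "chrX"]
ordered = filter_and_sort(names)
```

With the default patterns, ordered lists the numbered chromosomes first in natural order, then chrX and chrY, then chrM. The hypothetical chrUn_example is dropped because no pattern matches it.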

Return type:

Series of integer bp lengths indexed by sequence name, or a BED3 DataFrame if as_bed is True.

Notes

For more fine-grained control over the chromsizes from local storage, use bioframe.assembly_info().

Examples

>>> fetch_chromsizes("hg38")
name
chr1     248956422
chr2     242193529
chr3     198295559
...      ...
chrX     156040895
chrY      57227415
chrM         16569
Name: length, dtype: int64
>>> fetch_chromsizes("hg38", as_bed=True)
        chrom      start        end
0        chr1          0  248956422
1        chr2          0  242193529
2        chr3          0  198295559
...      ...
21       chrX          0  156040895
22       chrY          0   57227415
23       chrM          0      16569
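The relationship between the two output forms shown above is simple: each sequence becomes one interval that starts at 0 and ends at the sequence length. A minimal standard-library sketch follows, using a plain dict in place of a pandas Series; the function name chromsizes_to_bed3 is hypothetical.

```python
def chromsizes_to_bed3(chromsizes):
    """Turn {name: length} into (chrom, start, end) rows starting at 0."""
    return [(name, 0, length) for name, length in chromsizes.items()]


# Lengths taken from the example output above.
sizes = {"chr1": 248956422, "chrM": 16569}
bed3 = chromsizes_to_bed3(sizes)
```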

See also

bioframe.assembly_info, bioframe.UCSCClient