Resources

Genome assembly metadata

Bioframe provides a collection of genome assembly metadata for commonly used genomes. These are accessible through a convenient dataclass interface via bioframe.assembly_info().

The assemblies are listed in a manifest YAML file, and each assembly has a mandatory companion file called _seqinfo_ that contains the sequence names, lengths, and other information. The records in the manifest file contain the following fields:

  • organism: the organism name

  • provider: the genome assembly provider (e.g., ucsc, ncbi)

  • provider_build: the genome assembly build name (e.g., hg19, GRCh37)

  • release_year: the year of the assembly release

  • seqinfo: path to the seqinfo file

  • cytobands: path to the cytoband file, if available

  • default_roles: default molecular roles to include from the seqinfo file

  • default_units: default assembly units to include from the seqinfo file

  • url: URL to where the corresponding sequence files can be downloaded

The _seqinfo_ file is a TSV file with the following columns (with header):

  • name: canonical sequence name

  • length: sequence length

  • role: role of the sequence or scaffold (e.g., “assembled”, “unlocalized”, “unplaced”)

  • molecule: name of the molecule that the sequence belongs to, if placed

  • unit: assembly unit of the chromosome (e.g., “primary”, “non-nuclear”, “decoy”)

  • aliases: comma-separated list of aliases for the sequence name

We currently do not include sequences with “alt” or “patch” roles in _seqinfo_ files, but we do support the inclusion of additional decoy sequences (as used by so-called NGS analysis sets for human genome assemblies) by marking them as members of a “decoy” assembly unit.
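
As a rough illustration of the seqinfo layout described above, the following sketch parses a small seqinfo-style table using only the standard library, selects sequences by role and unit, and builds an alias lookup. It is not bioframe's own loading code. The helper names (load_seqinfo, select, build_alias_map) are hypothetical, and all sample rows other than the chr1 and chrM lengths are made up for the example.

```python
import csv
import io

# Hypothetical seqinfo content. Only the chr1 and chrM lengths come from the
# examples in this document; the other rows and all aliases are made up.
SEQINFO_TSV = (
    "name\tlength\trole\tmolecule\tunit\taliases\n"
    "chr1\t248956422\tassembled\tchr1\tprimary\t1\n"
    "chr1_example_random\t1000\tunlocalized\tchr1\tprimary\t\n"
    "chrUn_example\t2000\tunplaced\t\tprimary\t\n"
    "chrM\t16569\tassembled\tchrM\tnon-nuclear\tMT,M\n"
    "chrDecoy_example\t3000\tunplaced\t\tdecoy\t\n"
)


def load_seqinfo(text):
    """Parse seqinfo TSV text into a list of dicts, keeping file order."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    for row in rows:
        row["length"] = int(row["length"])
    return rows


def select(rows, roles, units):
    """Keep rows whose role and unit are both in the requested sets."""
    return [r for r in rows if r["role"] in roles and r["unit"] in units]


def build_alias_map(rows):
    """Map each alias in the comma-separated 'aliases' column to its canonical name."""
    alias_map = {}
    for row in rows:
        for alias in filter(None, row["aliases"].split(",")):
            alias_map[alias] = row["name"]
    return alias_map


rows = load_seqinfo(SEQINFO_TSV)
selected = select(rows, roles={"assembled"}, units={"primary", "non-nuclear"})
chromsizes = {r["name"]: r["length"] for r in selected}
alias_map = build_alias_map(rows)
```

Because the rows are kept in file order, the resulting chromsizes mapping follows the order of the seqinfo table.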

The _cytoband_ file is an optional TSV file with the following columns (with header):

  • chrom: chromosome name

  • start: start position

  • end: end position

  • band: cytogenetic coordinate (name of the band)

  • stain: Giemsa stain result

The order of the sequences in the _seqinfo_ file is treated as canonical. The ordering of the chromosomes in the _cytobands_ file should match the order of the chromosomes in the _seqinfo_ file.
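
The ordering requirement can be checked with a few lines of code. The sketch below is a hypothetical helper, not part of bioframe. It assumes that the cytoband file may omit some sequences, and it only checks that chromosomes first appear in the same relative order as in the seqinfo file.

```python
def cytoband_order_matches(seqinfo_names, cytoband_chroms):
    """Return True if chromosomes first appear in cytoband_chroms in seqinfo order."""
    rank = {name: i for i, name in enumerate(seqinfo_names)}
    # Unique chromosome names in order of first appearance.
    chroms = list(dict.fromkeys(cytoband_chroms))
    if any(chrom not in rank for chrom in chroms):
        return False  # a cytoband chromosome is missing from seqinfo
    ranks = [rank[chrom] for chrom in chroms]
    return ranks == sorted(ranks)


seqinfo_order = ["chr1", "chr2", "chrX", "chrM"]
ok = cytoband_order_matches(seqinfo_order, ["chr1", "chr1", "chr2", "chrX"])
bad = cytoband_order_matches(seqinfo_order, ["chr2", "chr1", "chrX"])
```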

The manifest and companion files are stored in the bioframe/io/data directory. New assemblies can be requested by opening an issue on GitHub or by submitting a pull request.

Functions:

assemblies_available()

Get a list of available genome assembly metadata in local storage.

assembly_info(name[, roles, units])

Get information about a genome assembly.

assemblies_available()

Get a list of available genome assembly metadata in local storage.

Returns:

A dataframe with metadata fields for available assemblies, including ‘provider’, ‘provider_build’, ‘default_roles’, ‘default_units’, and names of seqinfo and cytoband files.

Return type:

pandas.DataFrame

assembly_info(name, roles=None, units=None)

Get information about a genome assembly.

Parameters:
  • name (str) – Name of the assembly. If the name contains a dot, it is interpreted as a provider name and a build, e.g. “ucsc.hg38”. Otherwise, the provider is inferred if the build name is unique.

  • roles (list or tuple or "all", optional) – Sequence roles to include in the assembly info, e.g. “assembled”, “unlocalized”, “unplaced”. If not specified, only sequences with the default roles for the assembly are included.

  • units (list or tuple or "all", optional) – Assembly units to include in the assembly info, e.g. “primary”, “non-nuclear”, “decoy”. If not specified, only sequences from the default units for the assembly are included.

Returns:

A dataclass containing information about the assembly.

Return type:

GenomeAssembly

Raises:

ValueError – If the assembly name is not found or is not unique.
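
The name resolution rule described for the name parameter can be sketched as follows. This is an illustration of the documented behavior under simple assumptions, not bioframe's implementation. The resolve_assembly helper and the "demo1" records are hypothetical, and the manifest is reduced to the two fields needed here.

```python
# Reduced, partly hypothetical manifest: "demo1" is made up to show ambiguity.
MANIFEST = [
    {"provider": "ucsc", "provider_build": "hg19"},
    {"provider": "ncbi", "provider_build": "GRCh37"},
    {"provider": "providerA", "provider_build": "demo1"},
    {"provider": "providerB", "provider_build": "demo1"},
]


def resolve_assembly(name, manifest):
    """Find the single manifest record matching 'build' or 'provider.build'."""
    if "." in name:
        # Assumption: split at the first dot into provider and build.
        provider, build = name.split(".", 1)
        matches = [
            rec for rec in manifest
            if rec["provider"] == provider and rec["provider_build"] == build
        ]
    else:
        matches = [rec for rec in manifest if rec["provider_build"] == name]
    if not matches:
        raise ValueError(f"Assembly not found: {name!r}")
    if len(matches) > 1:
        raise ValueError(f"Assembly name is not unique: {name!r}")
    return matches[0]


record = resolve_assembly("GRCh37", MANIFEST)          # provider inferred
qualified = resolve_assembly("providerB.demo1", MANIFEST)  # provider given
```

Here a bare "demo1" would raise ValueError because two providers share that build name, while the qualified form "providerB.demo1" resolves to a single record.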

Examples

>>> hg38 = assembly_info("hg38")
>>> hg38.chromsizes
name
chr1    248956422
chr2    242193529
chr3    198295559
...     ...
>>> assembly_info("hg38", roles=("assembled", "unlocalized"))
>>> assembly_info("ucsc.hg38", units=("primary", "non-nuclear"))

class GenomeAssembly(organism, provider, provider_build, release_year, seqinfo, cytobands=None, url=None, alias_dict=None)

A dataclass containing information about sequences in a genome assembly.

alias_dict: Dict[str, str] = None
property chromnames: List[str]
property chromsizes: Series
cytobands: DataFrame = None
organism: str
provider: str
provider_build: str
release_year: str
seqinfo: DataFrame
url: str = None
property viewframe: DataFrame

Remote resources

These functions now default to using the local data store, but can be used to obtain chromsizes or centromere positions from UCSC by setting provider="ucsc".

Functions:

fetch_centromeres(db[, provider])

Extract centromere locations for a given assembly 'db' from a variety of file formats in UCSC (cytoband, centromeres) depending on availability, returning a DataFrame.

fetch_chromsizes(db, *[, provider, as_bed, ...])

Fetch chromsizes from local storage or the UCSC database.

fetch_centromeres(db, provider='local')

Extract centromere locations for a given assembly ‘db’ from a variety of file formats in UCSC (cytoband, centromeres) depending on availability, returning a DataFrame.

Parameters:
  • db (str) – Assembly name.

  • provider (str, optional [default: "local"]) – The provider of centromere data. Either “local” for local storage or “ucsc”.

Return type:

DataFrame with centromere ‘chrom’, ‘start’, ‘end’, ‘mid’.

Notes

When provider="local", centromeres are derived from cytoband tables in local storage.

When provider="ucsc", the fallback priority is as follows:

  • UCSC cytoBand

  • UCSC cytoBandIdeo

  • UCSC centromeres.txt

Note that UCSC “gap” files no longer provide centromere information.

Currently only works for human assemblies.
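
To make the cytoband-based derivation more concrete, here is a minimal sketch that computes one centromere interval per chromosome from a cytoband table. It rests on assumptions that are not stated above: centromeric bands are taken to be those with the stain value "acen" (as in UCSC cytoBand tables), and mid is taken as the midpoint of the merged interval. bioframe may compute these differently. The chromosome names and coordinates are invented.

```python
import csv
import io

# Invented cytoband rows using the column layout described earlier.
CYTOBAND_TSV = (
    "chrom\tstart\tend\tband\tstain\n"
    "chrA\t0\t1000\tp12\tgneg\n"
    "chrA\t1000\t1500\tp11\tacen\n"
    "chrA\t1500\t2100\tq11\tacen\n"
    "chrA\t2100\t5000\tq12\tgpos50\n"
    "chrB\t0\t800\tp11\tacen\n"
    "chrB\t800\t1200\tq11\tacen\n"
    "chrB\t1200\t3000\tq12\tgneg\n"
)


def centromeres_from_cytobands(text):
    """Merge 'acen' bands per chromosome into (chrom, start, end, mid) records."""
    spans = {}  # chrom -> [start, end]; dicts preserve first-seen order
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        if row["stain"] != "acen":  # assumption: "acen" marks centromeric bands
            continue
        start, end = int(row["start"]), int(row["end"])
        if row["chrom"] in spans:
            span = spans[row["chrom"]]
            span[0] = min(span[0], start)
            span[1] = max(span[1], end)
        else:
            spans[row["chrom"]] = [start, end]
    # Assumption: 'mid' is the integer midpoint of the merged interval.
    return [
        {"chrom": chrom, "start": s, "end": e, "mid": (s + e) // 2}
        for chrom, (s, e) in spans.items()
    ]


centromeres = centromeres_from_cytobands(CYTOBAND_TSV)
```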

See also

bioframe.assembly_info, bioframe.UCSCClient

fetch_chromsizes(db, *, provider='local', as_bed=False, filter_chroms=True, chrom_patterns=('^chr[0-9]+$', '^chr[XY]$', '^chrM$'), natsort=True, **kwargs)

Fetch chromsizes from local storage or the UCSC database.

Parameters:
  • db (str) – Assembly name.

  • provider (str, optional [default: "local"]) – The provider of chromsizes. Either “local” for local storage or “ucsc”.

  • as_bed (bool, optional) – If True, return chromsizes as an interval DataFrame (chrom, start, end) instead of a Series.

The remaining options only apply to provider="ucsc":

  • filter_chroms (bool, optional) – Filter for chromosome names given in chrom_patterns.

  • chrom_patterns (sequence, optional) – Sequence of regular expressions to capture desired sequence names.

  • natsort (bool, optional) – Sort each captured group of names in natural order. Default is True.

  • **kwargs – Passed to pandas.read_csv()

Return type:

Series of integer bp lengths indexed by sequence name or BED3 DataFrame.

Notes

For more fine-grained control over the chromsizes from local storage, use bioframe.assembly_info().
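
The interplay of filter_chroms, chrom_patterns and natsort can be approximated with the standard library. The sketch below assumes that names are captured pattern by pattern, that each captured group is sorted in natural order, and that the groups are concatenated in pattern order. It is meant as an illustration, not as bioframe's implementation. The function names and the input names are hypothetical.

```python
import re


def natural_key(name):
    """Sort key that compares digit runs numerically, so chr2 sorts before chr10."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


def filter_and_sort(names,
                    patterns=(r"^chr[0-9]+$", r"^chr[XY]$", r"^chrM$"),
                    natsort=True):
    """Capture names matching each pattern in turn and concatenate the groups."""
    ordered = []
    for pattern in patterns:
        group = [name for name in names if re.match(pattern, name)]
        if natsort:
            group.sort(key=natural_key)
        # A name matching several patterns would appear more than once;
        # the default patterns are mutually exclusive.
        ordered.extend(group)
    return ordered


names = ["chr10", "chrM", "chr2", "chrY", "chr1_example_random", "chrX", "chr1"]
filtered = filter_and_sort(names)
unsorted_groups = filter_and_sort(names, natsort=False)
```

With the default patterns, numbered chromosomes come first, then the sex chromosomes, then chrM, and names such as chr1_example_random that match no pattern are dropped.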

Examples

>>> fetch_chromsizes("hg38")
name
chr1     248956422
chr2     242193529
chr3     198295559
...      ...
chrX     156040895
chrY      57227415
chrM         16569
Name: length, dtype: int64
>>> fetch_chromsizes("hg38", as_bed=True)
        chrom      start        end
0        chr1          0  248956422
1        chr2          0  242193529
2        chr3          0  198295559
...      ...
21       chrX          0  156040895
22       chrY          0   57227415
23       chrM          0      16569

See also

bioframe.assembly_info, bioframe.UCSCClient