File I/O

Functions:

load_fasta(filepath_or[, engine])

Load lazy fasta sequences from an indexed fasta file (optionally compressed) or from a collection of uncompressed fasta files.

read_alignments(fp[, chrom, start, end])

Read alignment records into a DataFrame.

read_bigbed(path, chrom[, start, end, engine])

Read intervals from a bigBed file.

read_bigwig(path, chrom[, start, end, engine])

Read intervals from a bigWig file.

read_chromsizes(filepath_or[, ...])

Read a <db>.chrom.sizes or <db>.chromInfo.txt file from the UCSC database, where db is a genome assembly name, as a pandas.Series.

read_pairix(fp, region1[, region2, ...])

Read a pairix-indexed file into DataFrame.

read_tabix(fp[, chrom, start, end])

Read a tabix-indexed file into dataFrame.

read_table(filepath_or[, schema, ...])

Read a tab-delimited file into a data frame.

to_bigbed(df, chromsizes, outpath[, schema, ...])

Save a BED-like dataframe as a binary BigBed file.

to_bigwig(df, chromsizes, outpath[, ...])

Save a bedGraph-like dataframe as a binary BigWig file.

load_fasta(filepath_or, engine='pysam', **kwargs)[source]

Load lazy fasta sequences from an indexed fasta file (optionally compressed) or from a collection of uncompressed fasta files.

Parameters:
  • filepath_or (str or iterable) – If a string, a filepath to a single .fa or .fa.gz file. Assumed to be accompanied by a .fai index file. Depending on the engine, the index may be created on the fly, and some compression formats may not be supported. If not a string, an iterable of fasta file paths each assumed to contain a single sequence.

  • engine ({'pysam', 'pyfaidx'}, optional) – Module to use for loading sequences.

  • kwargs (optional) – Options to pass to pysam.FastaFile or pyfaidx.Fasta.

Return type:

OrderedDict of (lazy) fasta records.

Notes

  • pysam/samtools can read .fai and .gzi indexed files, I think.

  • pyfaidx can handle uncompressed and bgzf compressed files.

read_alignments(fp, chrom=None, start=None, end=None)[source]

Read alignment records into a DataFrame.

read_bigbed(path, chrom, start=None, end=None, engine='auto')[source]

Read intervals from a bigBed file.

Parameters:
  • path (str) – Path or URL to a bigBed file

  • chrom (str)

  • start (int, optional) – Start and end coordinates. Defaults to 0 and chromosome length.

  • end (int, optional) – Start and end coordinates. Defaults to 0 and chromosome length.

  • engine ({"auto", "pybbi", "pybigwig"}) – Library to use for querying the bigBed file.

Return type:

DataFrame

read_bigwig(path, chrom, start=None, end=None, engine='auto')[source]

Read intervals from a bigWig file.

Parameters:
  • path (str) – Path or URL to a bigWig file

  • chrom (str)

  • start (int, optional) – Start and end coordinates. Defaults to 0 and chromosome length.

  • end (int, optional) – Start and end coordinates. Defaults to 0 and chromosome length.

  • engine ({"auto", "pybbi", "pybigwig"}) – Library to use for querying the bigWig file.

Return type:

DataFrame

read_chromsizes(filepath_or, filter_chroms=True, chrom_patterns=('^chr[0-9]+$', '^chr[XY]$', '^chrM$'), natsort=True, as_bed=False, **kwargs)[source]

Read a <db>.chrom.sizes or <db>.chromInfo.txt file from the UCSC database, where db is a genome assembly name, as a pandas.Series.

Parameters:
  • filepath_or (str or file-like) – Path or url to text file, or buffer.

  • filter_chroms (bool, optional) – Filter for chromosome names given in chrom_patterns.

  • chrom_patterns (sequence, optional) – Sequence of regular expressions to capture desired sequence names.

  • natsort (bool, optional) – Sort each captured group of names in natural order. Default is True.

  • as_bed (bool, optional) – If True, return chromsizes as an interval dataframe (chrom, start, end).

  • **kwargs – Passed to pandas.read_csv()

Return type:

Series of integer bp lengths indexed by sequence name or an interval dataframe.

Notes

Mention name patterns

See also

read_pairix(fp, region1, region2=None, chromsizes=None, columns=None, usecols=None, dtypes=None, **kwargs)[source]

Read a pairix-indexed file into DataFrame.

read_tabix(fp, chrom=None, start=None, end=None)[source]

Read a tabix-indexed file into dataFrame.

read_table(filepath_or, schema=None, schema_is_strict=False, **kwargs)[source]

Read a tab-delimited file into a data frame.

Equivalent to pandas.read_table() but supports an additional schema argument to populate column names for common genomic formats.

Parameters:
  • filepath_or (str, path object or file-like object) – Any valid string path is acceptable. The string could be a URL

  • schema (str) – Schema to use for table column names.

  • schema_is_strict (bool) – Whether to check if columns are filled with NAs.

Returns:

df

Return type:

pandas.DataFrame of intervals

to_bigbed(df, chromsizes, outpath, schema='infer', engine='ucsc', path_to_binary=None)[source]

Save a BED-like dataframe as a binary BigBed file.

Parameters:
  • df (pandas.DataFrame) – Data frame with columns ‘chrom’, ‘start’, ‘end’ and one or more value columns

  • chromsizes (pandas.Series) – Series indexed by chromosome name mapping to their lengths in bp

  • outpath (str) – The output BigWig file path

  • value_field (str, optional) – Select the column label of the data frame to generate the track. Default is to use the fourth column.

  • path_to_binary (str, optional) – Provide system path to the bedToBigBed binary.

to_bigwig(df, chromsizes, outpath, value_field=None, engine='ucsc', path_to_binary=None)[source]

Save a bedGraph-like dataframe as a binary BigWig file.

Parameters:
  • df (pandas.DataFrame) – Data frame with columns ‘chrom’, ‘start’, ‘end’ and one or more value columns

  • chromsizes (pandas.Series) – Series indexed by chromosome name mapping to their lengths in bp

  • outpath (str) – The output BigWig file path

  • value_field (str, optional) – Select the column label of the data frame to generate the track. Default is to use the fourth column.

  • path_to_binary (str, optional) – Provide system path to the bedGraphToBigWig binary.

  • engine ({'ucsc', 'bigtools'}, optional) – Engine to use for creating the BigWig file.

to_bed(df, path=None, *, schema='infer', validate_fields=True, require_sorted=False, chromsizes=None, strict_score=False, replace_na=True, na_rep='nan')[source]

Write a DataFrame to a BED file.

Parameters:
  • df (pd.DataFrame) – DataFrame to write.

  • path (str or Path, optional) – Path to write the BED file to. If None, the serialized BED file is returned as a string.

  • schema (str, optional [default: "infer"]) – BED schema to use. If "infer", the schema is inferred from the DataFrame’s columns.

  • validate_fields (bool, optional [default: True]) – Whether to validate the fields of the BED file.

  • require_sorted (bool, optional [default: False]) – Whether to require the BED file to be sorted.

  • chromsizes (dict or pd.Series, optional) – Chromosome sizes to validate against.

  • strict_score (bool, optional [default: False]) – Whether to strictly enforce validation of the score field (0-1000).

  • replace_na (bool, optional [default: True]) – Whether to replace null values of standard BED fields with compliant uninformative values.

  • na_rep (str, optional [default: "nan"]) – String representation of null values if written.

Returns:

The serialized BED file as a string if path is None, otherwise None.

Return type:

str or None