File I/O

Functions:

`load_fasta`(filepath_or[, engine])	Load lazy fasta sequences from an indexed fasta file (optionally compressed) or from a collection of uncompressed fasta files.
`read_bam`(fp[, chrom, start, end])	Read bam records into a DataFrame.
`read_bigbed`(path, chrom[, start, end, engine])	Read intervals from a bigBed file.
`read_bigwig`(path, chrom[, start, end, engine])	Read intervals from a bigWig file.
`read_chromsizes`(filepath_or[, ...])	Read a `<db>.chrom.sizes` or `<db>.chromInfo.txt` file from the UCSC database, where `db` is a genome assembly name, as a pandas.Series.
`read_pairix`(fp, region1[, region2, ...])	Read a pairix-indexed file into DataFrame.
`read_tabix`(fp[, chrom, start, end])	Read a tabix-indexed file into dataFrame.
`read_table`(filepath_or[, schema, ...])	Read a tab-delimited file into a data frame.
`to_bigbed`(df, chromsizes, outpath[, schema, ...])	Save a bedGraph-like dataframe as a binary BigWig track.
`to_bigwig`(df, chromsizes, outpath[, ...])	Save a bedGraph-like dataframe as a binary BigWig track.

load_fasta(filepath_or, engine='pysam', **kwargs)[source]

Load lazy fasta sequences from an indexed fasta file (optionally compressed) or from a collection of uncompressed fasta files.

Parameters:

filepath_or (str or iterable) – If a string, a filepath to a single .fa or .fa.gz file. Assumed to be accompanied by a .fai index file. Depending on the engine, the index may be created on the fly, and some compression formats may not be supported. If not a string, an iterable of fasta file paths each assumed to contain a single sequence.
engine ({'pysam', 'pyfaidx'}, optional) – Module to use for loading sequences.
kwargs (optional) – Options to pass to pysam.FastaFile or pyfaidx.Fasta.

Return type:

OrderedDict of (lazy) fasta records.

Notes

pysam/samtools can read .fai and .gzi indexed files, I think.
pyfaidx can handle uncompressed and bgzf compressed files.

read_bam(fp, chrom=None, start=None, end=None)[source]: Read bam records into a DataFrame.

read_bigbed(path, chrom, start=None, end=None, engine='auto')[source]

Read intervals from a bigBed file.

Parameters:

path (str) – Path or URL to a bigBed file
chrom (str) –
start (int, optional) – Start and end coordinates. Defaults to 0 and chromosome length.
end (int, optional) – Start and end coordinates. Defaults to 0 and chromosome length.
engine ({"auto", "pybbi", "pybigwig"}) – Library to use for querying the bigBed file.

Return type:

DataFrame

read_bigwig(path, chrom, start=None, end=None, engine='auto')[source]

Read intervals from a bigWig file.

Parameters:

path (str) – Path or URL to a bigWig file
chrom (str) –
start (int, optional) – Start and end coordinates. Defaults to 0 and chromosome length.
end (int, optional) – Start and end coordinates. Defaults to 0 and chromosome length.
engine ({"auto", "pybbi", "pybigwig"}) – Library to use for querying the bigWig file.

Return type:

DataFrame

read_chromsizes(filepath_or, filter_chroms=True, chrom_patterns=('^chr[0-9]+$', '^chr[XY]$', '^chrM$'), natsort=True, as_bed=False, **kwargs)[source]

Read a <db>.chrom.sizes or <db>.chromInfo.txt file from the UCSC database, where db is a genome assembly name, as a pandas.Series.

Parameters:

filepath_or (str or file-like) – Path or url to text file, or buffer.
filter_chroms (bool, optional) – Filter for chromosome names given in chrom_patterns.
chrom_patterns (sequence, optional) – Sequence of regular expressions to capture desired sequence names.
natsort (bool, optional) – Sort each captured group of names in natural order. Default is True.
as_bed (bool, optional) – If True, return chromsizes as an interval dataframe (chrom, start, end).
**kwargs – Passed to pandas.read_csv()

Return type:

Series of integer bp lengths indexed by sequence name or an interval dataframe.

Notes

Mention name patterns