File I/O

Functions:

load_fasta(filepath_or[, engine])

Load lazy fasta sequences from an indexed fasta file (optionally compressed) or from a collection of uncompressed fasta files.

read_bam(fp[, chrom, start, end])

Read bam records into a DataFrame.

read_bigbed(path, chrom[, start, end, engine])

Read intervals from a bigBed file.

read_bigwig(path, chrom[, start, end, engine])

Read intervals from a bigWig file.

read_chromsizes(filepath_or[, ...])

Read a <db>.chrom.sizes or <db>.chromInfo.txt file from the UCSC database, where db is a genome assembly name, as a pandas.Series.

read_pairix(fp, region1[, region2, ...])

Read a pairix-indexed file into DataFrame.

read_tabix(fp[, chrom, start, end])

Read a tabix-indexed file into dataFrame.

read_table(filepath_or[, schema, ...])

Read a tab-delimited file into a data frame.

to_bigbed(df, chromsizes, outpath[, schema, ...])

Save a bedGraph-like dataframe as a binary BigWig track.

to_bigwig(df, chromsizes, outpath[, ...])

Save a bedGraph-like dataframe as a binary BigWig track.

load_fasta(filepath_or, engine='pysam', **kwargs)[source]

Load lazy fasta sequences from an indexed fasta file (optionally compressed) or from a collection of uncompressed fasta files.

Parameters:
  • filepath_or (str or iterable) – If a string, a filepath to a single .fa or .fa.gz file. Assumed to be accompanied by a .fai index file. Depending on the engine, the index may be created on the fly, and some compression formats may not be supported. If not a string, an iterable of fasta file paths each assumed to contain a single sequence.

  • engine ({'pysam', 'pyfaidx'}, optional) – Module to use for loading sequences.

  • kwargs (optional) – Options to pass to pysam.FastaFile or pyfaidx.Fasta.

Return type:

OrderedDict of (lazy) fasta records.

Notes

  • pysam/samtools can read .fai and .gzi indexed files, I think.

  • pyfaidx can handle uncompressed and bgzf compressed files.

read_bam(fp, chrom=None, start=None, end=None)[source]

Read bam records into a DataFrame.

read_bigbed(path, chrom, start=None, end=None, engine='auto')[source]

Read intervals from a bigBed file.

Parameters:
  • path (str) – Path or URL to a bigBed file

  • chrom (str) –

  • start (int, optional) – Start and end coordinates. Defaults to 0 and chromosome length.

  • end (int, optional) – Start and end coordinates. Defaults to 0 and chromosome length.

  • engine ({"auto", "pybbi", "pybigwig"}) – Library to use for querying the bigBed file.

Return type:

DataFrame

read_bigwig(path, chrom, start=None, end=None, engine='auto')[source]

Read intervals from a bigWig file.

Parameters:
  • path (str) – Path or URL to a bigWig file

  • chrom (str) –

  • start (int, optional) – Start and end coordinates. Defaults to 0 and chromosome length.

  • end (int, optional) – Start and end coordinates. Defaults to 0 and chromosome length.

  • engine ({"auto", "pybbi", "pybigwig"}) – Library to use for querying the bigWig file.

Return type:

DataFrame

read_chromsizes(filepath_or, filter_chroms=True, chrom_patterns=('^chr[0-9]+$', '^chr[XY]$', '^chrM$'), natsort=True, as_bed=False, **kwargs)[source]

Read a <db>.chrom.sizes or <db>.chromInfo.txt file from the UCSC database, where db is a genome assembly name, as a pandas.Series.

Parameters:
  • filepath_or (str or file-like) – Path or url to text file, or buffer.

  • filter_chroms (bool, optional) – Filter for chromosome names given in chrom_patterns.

  • chrom_patterns (sequence, optional) – Sequence of regular expressions to capture desired sequence names.

  • natsort (bool, optional) – Sort each captured group of names in natural order. Default is True.

  • as_bed (bool, optional) – If True, return chromsizes as an interval dataframe (chrom, start, end).

  • **kwargs – Passed to pandas.read_csv()

Return type:

Series of integer bp lengths indexed by sequence name or an interval dataframe.

Notes

Mention name patterns

See also

read_pairix(fp, region1, region2=None, chromsizes=None, columns=None, usecols=None, dtypes=None, **kwargs)[source]

Read a pairix-indexed file into DataFrame.

read_tabix(fp, chrom=None, start=None, end=None)[source]

Read a tabix-indexed file into dataFrame.

read_table(filepath_or, schema=None, schema_is_strict=False, **kwargs)[source]

Read a tab-delimited file into a data frame.

Equivalent to pandas.read_table() but supports an additional schema argument to populate column names for common genomic formats.

Parameters:
  • filepath_or (str, path object or file-like object) – Any valid string path is acceptable. The string could be a URL

  • schema (str) – Schema to use for table column names.

  • schema_is_strict (bool) – Whether to check if columns are filled with NAs.

Returns:

df

Return type:

pandas.DataFrame of intervals

to_bigbed(df, chromsizes, outpath, schema='bed6', path_to_binary=None)[source]

Save a bedGraph-like dataframe as a binary BigWig track.

Parameters:
  • df (pandas.DataFrame) – Data frame with columns ‘chrom’, ‘start’, ‘end’ and one or more value columns

  • chromsizes (pandas.Series) – Series indexed by chromosome name mapping to their lengths in bp

  • outpath (str) – The output BigWig file path

  • value_field (str, optional) – Select the column label of the data frame to generate the track. Default is to use the fourth column.

  • path_to_binary (str, optional) – Provide system path to the bedGraphToBigWig binary.

to_bigwig(df, chromsizes, outpath, value_field=None, path_to_binary=None)[source]

Save a bedGraph-like dataframe as a binary BigWig track.

Parameters:
  • df (pandas.DataFrame) – Data frame with columns ‘chrom’, ‘start’, ‘end’ and one or more value columns

  • chromsizes (pandas.Series) – Series indexed by chromosome name mapping to their lengths in bp

  • outpath (str) – The output BigWig file path

  • value_field (str, optional) – Select the column label of the data frame to generate the track. Default is to use the fourth column.

  • path_to_binary (str, optional) – Provide system path to the bedGraphToBigWig binary.