File I/O
Functions:
|
Load lazy fasta sequences from an indexed fasta file (optionally compressed) or from a collection of uncompressed fasta files. |
|
Read bam records into a DataFrame. |
|
Read intervals from a bigBed file. |
|
Read intervals from a bigWig file. |
|
Read a |
|
Read a pairix-indexed file into DataFrame. |
|
Read a tabix-indexed file into dataFrame. |
|
Read a tab-delimited file into a data frame. |
|
Save a bedGraph-like dataframe as a binary BigWig track. |
|
Save a bedGraph-like dataframe as a binary BigWig track. |
- load_fasta(filepath_or, engine='pysam', **kwargs)[source]
Load lazy fasta sequences from an indexed fasta file (optionally compressed) or from a collection of uncompressed fasta files.
- Parameters:
filepath_or (str or iterable) – If a string, a filepath to a single .fa or .fa.gz file. Assumed to be accompanied by a .fai index file. Depending on the engine, the index may be created on the fly, and some compression formats may not be supported. If not a string, an iterable of fasta file paths each assumed to contain a single sequence.
engine ({'pysam', 'pyfaidx'}, optional) – Module to use for loading sequences.
kwargs (optional) – Options to pass to
pysam.FastaFile
orpyfaidx.Fasta
.
- Return type:
OrderedDict of (lazy) fasta records.
Notes
pysam/samtools can read .fai and .gzi indexed files, I think.
pyfaidx can handle uncompressed and bgzf compressed files.
- read_bigbed(path, chrom, start=None, end=None, engine='auto')[source]
Read intervals from a bigBed file.
- Parameters:
path (str) – Path or URL to a bigBed file
chrom (str) –
start (int, optional) – Start and end coordinates. Defaults to 0 and chromosome length.
end (int, optional) – Start and end coordinates. Defaults to 0 and chromosome length.
engine ({"auto", "pybbi", "pybigwig"}) – Library to use for querying the bigBed file.
- Return type:
DataFrame
- read_bigwig(path, chrom, start=None, end=None, engine='auto')[source]
Read intervals from a bigWig file.
- Parameters:
path (str) – Path or URL to a bigWig file
chrom (str) –
start (int, optional) – Start and end coordinates. Defaults to 0 and chromosome length.
end (int, optional) – Start and end coordinates. Defaults to 0 and chromosome length.
engine ({"auto", "pybbi", "pybigwig"}) – Library to use for querying the bigWig file.
- Return type:
DataFrame
- read_chromsizes(filepath_or, filter_chroms=True, chrom_patterns=('^chr[0-9]+$', '^chr[XY]$', '^chrM$'), natsort=True, as_bed=False, **kwargs)[source]
Read a
<db>.chrom.sizes
or<db>.chromInfo.txt
file from the UCSC database, wheredb
is a genome assembly name, as a pandas.Series.- Parameters:
filepath_or (str or file-like) – Path or url to text file, or buffer.
filter_chroms (bool, optional) – Filter for chromosome names given in
chrom_patterns
.chrom_patterns (sequence, optional) – Sequence of regular expressions to capture desired sequence names.
natsort (bool, optional) – Sort each captured group of names in natural order. Default is True.
as_bed (bool, optional) – If True, return chromsizes as an interval dataframe (chrom, start, end).
**kwargs – Passed to
pandas.read_csv()
- Return type:
Series of integer bp lengths indexed by sequence name or an interval dataframe.
Notes
Mention name patterns
See also
UCSC assembly terminology: <http://genome.ucsc.edu/FAQ/FAQdownloads.html#download9>
NCBI assembly terminology: <https://www.ncbi.nlm.nih.gov/grc/help/definitions>
- read_pairix(fp, region1, region2=None, chromsizes=None, columns=None, usecols=None, dtypes=None, **kwargs)[source]
Read a pairix-indexed file into DataFrame.
- read_table(filepath_or, schema=None, schema_is_strict=False, **kwargs)[source]
Read a tab-delimited file into a data frame.
Equivalent to
pandas.read_table()
but supports an additional schema argument to populate column names for common genomic formats.- Parameters:
filepath_or (str, path object or file-like object) – Any valid string path is acceptable. The string could be a URL
schema (str) – Schema to use for table column names.
schema_is_strict (bool) – Whether to check if columns are filled with NAs.
- Returns:
df
- Return type:
pandas.DataFrame of intervals
- to_bigbed(df, chromsizes, outpath, schema='bed6', path_to_binary=None)[source]
Save a bedGraph-like dataframe as a binary BigWig track.
- Parameters:
df (pandas.DataFrame) – Data frame with columns ‘chrom’, ‘start’, ‘end’ and one or more value columns
chromsizes (pandas.Series) – Series indexed by chromosome name mapping to their lengths in bp
outpath (str) – The output BigWig file path
value_field (str, optional) – Select the column label of the data frame to generate the track. Default is to use the fourth column.
path_to_binary (str, optional) – Provide system path to the bedGraphToBigWig binary.
- to_bigwig(df, chromsizes, outpath, value_field=None, path_to_binary=None)[source]
Save a bedGraph-like dataframe as a binary BigWig track.
- Parameters:
df (pandas.DataFrame) – Data frame with columns ‘chrom’, ‘start’, ‘end’ and one or more value columns
chromsizes (pandas.Series) – Series indexed by chromosome name mapping to their lengths in bp
outpath (str) – The output BigWig file path
value_field (str, optional) – Select the column label of the data frame to generate the track. Default is to use the fourth column.
path_to_binary (str, optional) – Provide system path to the bedGraphToBigWig binary.