File I/O
Functions:
load_fasta | Load lazy fasta sequences from an indexed fasta file (optionally compressed) or from a collection of uncompressed fasta files.
read_alignments | Read alignment records into a DataFrame.
read_bigbed | Read intervals from a bigBed file.
read_bigwig | Read intervals from a bigWig file.
read_chromsizes | Read a <db>.chrom.sizes or <db>.chromInfo.txt file from the UCSC database, where db is a genome assembly name, as a pandas.Series.
read_pairix | Read a pairix-indexed file into a DataFrame.
read_tabix | Read a tabix-indexed file into a DataFrame.
read_table | Read a tab-delimited file into a data frame.
to_bigbed | Save a BED-like dataframe as a binary BigBed file.
to_bigwig | Save a bedGraph-like dataframe as a binary BigWig file.
- load_fasta(filepath_or, engine='pysam', **kwargs)[source]
Load lazy fasta sequences from an indexed fasta file (optionally compressed) or from a collection of uncompressed fasta files.
- Parameters:
filepath_or (str or iterable) – If a string, a filepath to a single .fa or .fa.gz file. Assumed to be accompanied by a .fai index file. Depending on the engine, the index may be created on the fly, and some compression formats may not be supported. If not a string, an iterable of fasta file paths each assumed to contain a single sequence.
engine ({'pysam', 'pyfaidx'}, optional) – Module to use for loading sequences.
kwargs (optional) – Options to pass to pysam.FastaFile or pyfaidx.Fasta.
- Return type:
OrderedDict of (lazy) fasta records.
Notes
pysam/samtools can read files indexed with .fai and, for bgzip-compressed files, .gzi.
pyfaidx can handle uncompressed and bgzf compressed files.
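The following standard-library sketch illustrates what "lazy" access through an index means: the file is scanned once to record where each sequence starts, and later queries seek straight to the requested bases. It is not bioframe's, pysam's, or pyfaidx's implementation. The helper names build_index and fetch and the demo file contents are hypothetical, and the index layout only mimics the fields of a .fai file.

```python
import os
import tempfile


def build_index(path):
    """Scan a FASTA file once; return {name: (length, offset, line_bases, line_width)}."""
    index = {}
    name = None
    with open(path, "rb") as fh:
        while True:
            line = fh.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                # The first base starts right after the header line.
                index[name] = [0, fh.tell(), 0, 0]
            elif name is not None:
                bases = len(line.rstrip(b"\r\n"))
                rec = index[name]
                if rec[2] == 0:
                    # Record bases per line and bytes per line from the first line.
                    rec[2], rec[3] = bases, len(line)
                rec[0] += bases
    return {k: tuple(v) for k, v in index.items()}


def fetch(path, index, name, start, end):
    """Read bases [start, end) of one sequence by seeking, without loading the file."""
    length, offset, line_bases, line_width = index[name]
    start, end = max(start, 0), min(end, length)
    if start >= end:
        return ""
    # Byte position of base i: whole lines before it plus its position in the line.
    first = offset + (start // line_bases) * line_width + start % line_bases
    last = offset + ((end - 1) // line_bases) * line_width + (end - 1) % line_bases
    with open(path, "rb") as fh:
        fh.seek(first)
        chunk = fh.read(last - first + 1)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


# Made-up demo file with two short sequences.
demo = b">chr1 demo\nACGTAC\nGTTTGA\nCC\n>chr2\nGGGG\n"
with tempfile.NamedTemporaryFile(suffix=".fa", delete=False) as tmp:
    tmp.write(demo)
fai = build_index(tmp.name)
subseq = fetch(tmp.name, fai, "chr1", 4, 9)
os.remove(tmp.name)
```

Only the bytes covering the requested range are read, which is why indexed access stays cheap even for chromosome-sized sequences.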
- read_alignments(fp, chrom=None, start=None, end=None)[source]
Read alignment records into a DataFrame.
- read_bigbed(path, chrom, start=None, end=None, engine='auto')[source]
Read intervals from a bigBed file.
- Parameters:
path (str) – Path or URL to a bigBed file
chrom (str)
start (int, optional) – Start coordinate. Defaults to 0.
end (int, optional) – End coordinate. Defaults to the chromosome length.
engine ({"auto", "pybbi", "pybigwig"}) – Library to use for querying the bigBed file.
- Return type:
DataFrame
- read_bigwig(path, chrom, start=None, end=None, engine='auto')[source]
Read intervals from a bigWig file.
- Parameters:
path (str) – Path or URL to a bigWig file
chrom (str)
start (int, optional) – Start coordinate. Defaults to 0.
end (int, optional) – End coordinate. Defaults to the chromosome length.
engine ({"auto", "pybbi", "pybigwig"}) – Library to use for querying the bigWig file.
- Return type:
DataFrame
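The start and end defaults documented for read_bigbed and read_bigwig can be pictured with a small helper. This is a sketch of the documented behavior only: resolve_interval is a hypothetical name, the chromosome length is made up, and the bounds check is an assumption rather than something the library is known to do.

```python
def resolve_interval(chromsizes, chrom, start=None, end=None):
    """Fill in missing coordinates: start -> 0, end -> chromosome length."""
    length = chromsizes[chrom]
    start = 0 if start is None else start
    end = length if end is None else end
    # Assumption for this sketch: reject intervals outside the chromosome.
    if not 0 <= start <= end <= length:
        raise ValueError(f"invalid interval {chrom}:{start}-{end}")
    return chrom, start, end


whole = resolve_interval({"chr1": 1000}, "chr1")
tail = resolve_interval({"chr1": 1000}, "chr1", start=100)
```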
- read_chromsizes(filepath_or, filter_chroms=True, chrom_patterns=('^chr[0-9]+$', '^chr[XY]$', '^chrM$'), natsort=True, as_bed=False, **kwargs)[source]
Read a <db>.chrom.sizes or <db>.chromInfo.txt file from the UCSC database, where db is a genome assembly name, as a pandas.Series.
- Parameters:
filepath_or (str or file-like) – Path or url to text file, or buffer.
filter_chroms (bool, optional) – Filter for chromosome names given in chrom_patterns.
chrom_patterns (sequence, optional) – Sequence of regular expressions to capture desired sequence names.
natsort (bool, optional) – Sort each captured group of names in natural order. Default is True.
as_bed (bool, optional) – If True, return chromsizes as an interval dataframe (chrom, start, end).
**kwargs – Passed to pandas.read_csv()
- Return type:
Series of integer bp lengths indexed by sequence name or an interval dataframe.
Notes
Mention name patterns
See also
UCSC assembly terminology: <http://genome.ucsc.edu/FAQ/FAQdownloads.html#download9>
NCBI assembly terminology: <https://www.ncbi.nlm.nih.gov/grc/help/definitions>
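How chrom_patterns and natsort interact is easiest to see on a small input. The sketch below uses only the standard library and is not the library's code: each pattern captures a group of names in the order the patterns are given, and names within each group are sorted naturally, so chr2 comes before chr10. The helper names natural_key and select_chroms are hypothetical, and the lengths are made up.

```python
import re


def natural_key(name):
    """Split a name into text and integer parts so 'chr2' sorts before 'chr10'."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


def select_chroms(sizes, patterns=(r"^chr[0-9]+$", r"^chr[XY]$", r"^chrM$")):
    """Keep names matching any pattern, grouped by pattern, naturally sorted per group."""
    selected = {}
    for pattern in patterns:
        group = [name for name in sizes if re.match(pattern, name)]
        for name in sorted(group, key=natural_key):
            selected[name] = sizes[name]
    return selected


# Made-up chrom.sizes content: one "name<TAB>length" pair per line.
text = "chr10\t135\nchrM\t16\nchr2\t243\nchrX\t155\nchr1\t249\nchr1_random\t5\n"
sizes = {}
for line in text.splitlines():
    name, length = line.split("\t")
    sizes[name] = int(length)

chroms = select_chroms(sizes)
```

With the default patterns, chr1_random is dropped because none of the anchored expressions match it.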
- read_pairix(fp, region1, region2=None, chromsizes=None, columns=None, usecols=None, dtypes=None, **kwargs)[source]
Read a pairix-indexed file into a DataFrame.
- read_table(filepath_or, schema=None, schema_is_strict=False, **kwargs)[source]
Read a tab-delimited file into a data frame.
Equivalent to pandas.read_table() but supports an additional schema argument to populate column names for common genomic formats.
- Parameters:
- Returns:
df
- Return type:
pandas.DataFrame of intervals
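The idea behind a schema argument is that headerless genomic files have conventional column names, so a short schema label can stand in for a full list of names. The standard-library sketch below shows this mapping. SCHEMAS is a hypothetical two-entry stand-in using the standard BED3 and BED6 column names, read_with_schema is a hypothetical helper, and the schema labels that read_table actually accepts are not listed in this section.

```python
import csv
import io

# Hypothetical stand-in: conventional column names for two BED variants.
SCHEMAS = {
    "bed3": ["chrom", "start", "end"],
    "bed6": ["chrom", "start", "end", "name", "score", "strand"],
}


def read_with_schema(text, schema):
    """Parse headerless tab-delimited text, naming columns from the schema."""
    names = SCHEMAS[schema]
    records = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        record = dict(zip(names, row))
        # Coordinates are integers; other fields are left as strings here.
        record["start"] = int(record["start"])
        record["end"] = int(record["end"])
        records.append(record)
    return records


records = read_with_schema(
    "chr1\t0\t100\tpeakA\t500\t+\nchr2\t50\t75\tpeakB\t0\t-\n", "bed6"
)
```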
- to_bigbed(df, chromsizes, outpath, schema='infer', engine='ucsc', path_to_binary=None)[source]
Save a BED-like dataframe as a binary BigBed file.
- Parameters:
df (pandas.DataFrame) – Data frame with columns ‘chrom’, ‘start’, ‘end’ and one or more value columns
chromsizes (pandas.Series) – Series indexed by chromosome name mapping to their lengths in bp
outpath (str) – The output BigBed file path
value_field (str, optional) – Select the column label of the data frame to generate the track. Default is to use the fourth column.
path_to_binary (str, optional) – Provide system path to the bedToBigBed binary.
- to_bigwig(df, chromsizes, outpath, value_field=None, engine='ucsc', path_to_binary=None)[source]
Save a bedGraph-like dataframe as a binary BigWig file.
- Parameters:
df (pandas.DataFrame) – Data frame with columns ‘chrom’, ‘start’, ‘end’ and one or more value columns
chromsizes (pandas.Series) – Series indexed by chromosome name mapping to their lengths in bp
outpath (str) – The output BigWig file path
value_field (str, optional) – Select the column label of the data frame to generate the track. Default is to use the fourth column.
path_to_binary (str, optional) – Provide system path to the bedGraphToBigWig binary.
engine ({'ucsc', 'bigtools'}, optional) – Engine to use for creating the BigWig file.
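A bedGraph track has four columns: chrom, start, end and a single value. The sketch below illustrates the documented value_field rule (use the named column, otherwise the fourth column) by building bedGraph text lines from plain tuples. It is not the library's code: bedgraph_lines is a hypothetical helper, the data are made up, and no BigWig file is written.

```python
def bedgraph_lines(columns, rows, value_field=None):
    """Format rows as bedGraph lines, taking values from value_field."""
    if value_field is None:
        # Documented default: fall back to the fourth column.
        value_field = columns[3]
    picks = [columns.index(c) for c in ("chrom", "start", "end", value_field)]
    return ["\t".join(str(row[i]) for i in picks) for row in rows]


columns = ["chrom", "start", "end", "coverage", "gc"]
rows = [("chr1", 0, 100, 1.5, 0.41), ("chr1", 100, 200, 3.0, 0.39)]

default_track = bedgraph_lines(columns, rows)
gc_track = bedgraph_lines(columns, rows, value_field="gc")
```

When a frame carries several value columns, naming value_field explicitly avoids depending on column order.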
- to_bed(df, path=None, *, schema='infer', validate_fields=True, require_sorted=False, chromsizes=None, strict_score=False, replace_na=True, na_rep='nan')[source]
Write a DataFrame to a BED file.
- Parameters:
df (pd.DataFrame) – DataFrame to write.
path (str or Path, optional) – Path to write the BED file to. If None, the serialized BED file is returned as a string.
schema (str, optional [default: "infer"]) – BED schema to use. If "infer", the schema is inferred from the DataFrame’s columns.
validate_fields (bool, optional [default: True]) – Whether to validate the fields of the BED file.
require_sorted (bool, optional [default: False]) – Whether to require the BED file to be sorted.
chromsizes (dict or pd.Series, optional) – Chromosome sizes to validate against.
strict_score (bool, optional [default: False]) – Whether to strictly enforce validation of the score field (0-1000).
replace_na (bool, optional [default: True]) – Whether to replace null values of standard BED fields with compliant uninformative values.
na_rep (str, optional [default: "nan"]) – String representation of null values if written.
- Returns:
The serialized BED file as a string if path is None, otherwise None.
- Return type:
str or None
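The interplay of path, replace_na and na_rep can be sketched without pandas. This is an illustration under stated assumptions, not the library's implementation: serialize_bed is a hypothetical helper, and the placeholder values in BED_NA_DEFAULTS ("." for name and strand, 0 for score) are assumed here as the "compliant uninformative values" the parameter description refers to.

```python
# Assumed placeholders for missing standard BED fields (not confirmed by the source).
BED_NA_DEFAULTS = {"name": ".", "score": 0, "strand": "."}


def serialize_bed(rows, columns, path=None, replace_na=True, na_rep="nan"):
    """Serialize dict rows as BED text; return the text if path is None."""
    lines = []
    for row in rows:
        fields = []
        for col in columns:
            value = row.get(col)
            if value is None:
                if replace_na and col in BED_NA_DEFAULTS:
                    value = BED_NA_DEFAULTS[col]
                else:
                    value = na_rep
            fields.append(str(value))
        lines.append("\t".join(fields))
    text = "\n".join(lines) + "\n"
    if path is None:
        return text
    with open(path, "w") as fh:
        fh.write(text)
    return None


bed6 = ["chrom", "start", "end", "name", "score", "strand"]
rows = [{"chrom": "chr1", "start": 0, "end": 100, "name": None, "score": None, "strand": "+"}]

replaced = serialize_bed(rows, bed6)
raw = serialize_bed(rows, bed6, replace_na=False)
```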