Reading genomic dataframes
import bioframe
Bioframe provides multiple methods in bioframe.io to convert data stored in common genomic file formats to pandas DataFrames.
Reading tabular text data
The most common need is to read tabular data, which can be accomplished with bioframe.read_table. This function wraps pandas.read_csv/pandas.read_table (tab-delimited by default), but lets the user pass a schema (i.e. a list of predefined column names) for common genomic interval-based file formats.
For example,
df = bioframe.read_table(
"https://www.encodeproject.org/files/ENCFF001XKR/@@download/ENCFF001XKR.bed.gz",
schema="bed9",
)
display(df[0:3])
| | chrom | start | end | name | score | strand | thickStart | thickEnd | itemRgb |
---|---|---|---|---|---|---|---|---|---|
0 | chr1 | 193500 | 194500 | . | 400 | + | . | . | 179,45,0 |
1 | chr1 | 618500 | 619500 | . | 700 | + | . | . | 179,45,0 |
2 | chr1 | 974500 | 975500 | . | 1000 | + | . | . | 179,45,0 |
df = bioframe.read_table(
"https://www.encodeproject.org/files/ENCFF401MQL/@@download/ENCFF401MQL.bed.gz",
schema="narrowPeak",
)
display(df[0:3])
| | chrom | start | end | name | score | strand | fc | -log10p | -log10q | relSummit |
---|---|---|---|---|---|---|---|---|---|---|
0 | chr19 | 48309541 | 48309911 | . | 1000 | . | 5.04924 | -1.0 | 0.00438 | 185 |
1 | chr4 | 130563716 | 130564086 | . | 993 | . | 5.05052 | -1.0 | 0.00432 | 185 |
2 | chr1 | 200622507 | 200622877 | . | 591 | . | 5.05489 | -1.0 | 0.00400 | 185 |
df = bioframe.read_table(
"https://www.encodeproject.org/files/ENCFF001VRS/@@download/ENCFF001VRS.bed.gz",
schema="bed12",
)
display(df[0:3])
| | chrom | start | end | name | score | strand | thickStart | thickEnd | itemRgb | blockCount | blockSizes | blockStarts |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr19 | 54331773 | 54620705 | 5C_304_ENm007_FOR_1.5C_304_ENm007_REV_40 | 1000 | . | 54331773 | 54620705 | 0 | 2 | 14528,19855, | 0,269077, |
1 | chr19 | 54461360 | 54620705 | 5C_304_ENm007_FOR_26.5C_304_ENm007_REV_40 | 1000 | . | 54461360 | 54620705 | 0 | 2 | 800,19855, | 0,139490, |
2 | chr5 | 131346229 | 132145236 | 5C_299_ENm002_FOR_241.5C_299_ENm002_REV_33 | 1000 | . | 131346229 | 132145236 | 0 | 2 | 2609,2105, | 0,796902, |
The schema argument looks up the file type in a registry of schemas stored in the bioframe.SCHEMAS dictionary:
bioframe.SCHEMAS["bed6"]
['chrom', 'start', 'end', 'name', 'score', 'strand']
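To make the role of a schema concrete, here is a minimal sketch of roughly what passing a schema amounts to in plain pandas: a headerless, tab-delimited read with an explicit list of column names. It is an approximation for illustration, not bioframe's implementation. The inline text is a small stand-in built from the first six fields of two rows shown earlier.

```python
import io

import pandas as pd

# Column names copied from the bioframe.SCHEMAS["bed6"] output shown above.
BED6 = ["chrom", "start", "end", "name", "score", "strand"]

# Inline stand-in for a BED6 file (first six fields of two rows shown earlier).
text = "chr1\t193500\t194500\t.\t400\t+\nchr1\t618500\t619500\t.\t700\t+\n"

# A headerless, tab-delimited read with explicit column names.
df = pd.read_csv(io.StringIO(text), sep="\t", header=None, names=BED6)
print(df)
```

Because the names come from a registry rather than from the file, the same column labels are applied consistently to every file of a given format.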
UCSC Big Binary Indexed files (BigWig, BigBed)
Bioframe also has convenience functions for reading and writing bigWig and bigBed binary files to and from pandas DataFrames.
bw_url = "http://genome.ucsc.edu/goldenPath/help/examples/bigWigExample.bw"
df = bioframe.read_bigwig(bw_url, "chr21", start=10_000_000, end=10_010_000)
df.head(5)
| | chrom | start | end | value |
---|---|---|---|---|
0 | chr21 | 10000000 | 10000005 | 40.0 |
1 | chr21 | 10000005 | 10000010 | 40.0 |
2 | chr21 | 10000010 | 10000015 | 60.0 |
3 | chr21 | 10000015 | 10000020 | 80.0 |
4 | chr21 | 10000020 | 10000025 | 40.0 |
df["value"] *= 100
df.head(5)
| | chrom | start | end | value |
---|---|---|---|---|
0 | chr21 | 10000000 | 10000005 | 4000.0 |
1 | chr21 | 10000005 | 10000010 | 4000.0 |
2 | chr21 | 10000010 | 10000015 | 6000.0 |
3 | chr21 | 10000015 | 10000020 | 8000.0 |
4 | chr21 | 10000020 | 10000025 | 4000.0 |
chromsizes = bioframe.fetch_chromsizes("hg19")
# bioframe.to_bigwig(df, chromsizes, 'times100.bw')
# note: requires UCSC bedGraphToBigWig binary, which can be installed as
# !conda install -y -c bioconda ucsc-bedgraphtobigwig
bb_url = "http://genome.ucsc.edu/goldenPath/help/examples/bigBedExample.bb"
bioframe.read_bigbed(bb_url, "chr21", start=48000000).head()
| | chrom | start | end |
---|---|---|---|
0 | chr21 | 48003453 | 48003785 |
1 | chr21 | 48003545 | 48003672 |
2 | chr21 | 48018114 | 48019432 |
3 | chr21 | 48018244 | 48018550 |
4 | chr21 | 48018843 | 48019099 |
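Writing a bigWig with the commented-out bioframe.to_bigwig call above depends on the external bedGraphToBigWig binary. The sketch below checks for it before attempting a write. It assumes the binary is looked up on PATH, and the helper name have_bedgraph_to_bigwig is hypothetical, not part of bioframe.

```python
import shutil


def have_bedgraph_to_bigwig() -> bool:
    """Return True if a bedGraphToBigWig executable is found on PATH."""
    # Assumption: the binary is discovered via PATH under this exact name.
    return shutil.which("bedGraphToBigWig") is not None


if have_bedgraph_to_bigwig():
    print("bedGraphToBigWig found; writing bigWig files should be possible")
else:
    print("bedGraphToBigWig not found; install it before writing bigWig files")
```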
Reading genome assembly information
The most fundamental information about a genome assembly is its set of chromosome sizes.
Bioframe provides functions to read a chromosome sizes file as a pandas.Series, with some useful filtering and sorting options:
bioframe.read_chromsizes(
"https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes"
)
chr1 248956422
chr2 242193529
chr3 198295559
chr4 190214555
chr5 181538259
chr6 170805979
chr7 159345973
chr8 145138636
chr9 138394717
chr10 133797422
chr11 135086622
chr12 133275309
chr13 114364328
chr14 107043718
chr15 101991189
chr16 90338345
chr17 83257441
chr18 80373285
chr19 58617616
chr20 64444167
chr21 46709983
chr22 50818468
chrX 156040895
chrY 57227415
chrM 16569
Name: length, dtype: int64
bioframe.read_chromsizes(
"https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes",
filter_chroms=False,
)
chr1 248956422
chr2 242193529
chr3 198295559
chr4 190214555
chr5 181538259
...
chrUn_KI270539v1 993
chrUn_KI270385v1 990
chrUn_KI270423v1 981
chrUn_KI270392v1 971
chrUn_KI270394v1 970
Name: length, Length: 455, dtype: int64
dm6_url = "https://hgdownload.soe.ucsc.edu/goldenPath/dm6/database/chromInfo.txt.gz"
bioframe.read_chromsizes(
dm6_url,
filter_chroms=True,
chrom_patterns=("^chr2L$", "^chr2R$", "^chr3L$", "^chr3R$", "^chr4$", "^chrX$"),
)
chr2L 23513712
chr2R 25286936
chr3L 28110227
chr3R 32079331
chr4 1348131
chrX 23542271
Name: length, dtype: int64
bioframe.read_chromsizes(
dm6_url, chrom_patterns=[r"^chr\d+L$", r"^chr\d+R$", "^chr4$", "^chrX$", "^chrM$"]
)
chr2L 23513712
chr3L 28110227
chr2R 25286936
chr3R 32079331
chr4 1348131
chrX 23542271
chrM 19524
Name: length, dtype: int64
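The output above suggests that chrom_patterns both filters names and groups them in pattern order. Here is a minimal pure-Python sketch of that behavior, not bioframe's implementation. The helper filter_by_patterns is hypothetical, names within a pattern group are simply sorted as strings (bioframe may order them differently), and chrUn_example is a made-up contig included only to show that unmatched names are dropped.

```python
import re

# Sizes copied from the dm6 output above; "chrUn_example" and its size are
# hypothetical, added only to demonstrate filtering.
sizes = {
    "chr2L": 23513712,
    "chr2R": 25286936,
    "chr3L": 28110227,
    "chr3R": 32079331,
    "chr4": 1348131,
    "chrM": 19524,
    "chrX": 23542271,
    "chrUn_example": 1000,
}


def filter_by_patterns(sizes, patterns):
    """Keep names matching any pattern, grouped in the order of the patterns."""
    kept = {}
    for pattern in patterns:
        regex = re.compile(pattern)
        # Sort within a group so the result does not depend on input order.
        for name in sorted(n for n in sizes if regex.search(n)):
            kept.setdefault(name, sizes[name])
    return kept


patterns = [r"^chr\d+L$", r"^chr\d+R$", "^chr4$", "^chrX$", "^chrM$"]
result = filter_by_patterns(sizes, patterns)
print(list(result))
# ['chr2L', 'chr3L', 'chr2R', 'chr3R', 'chr4', 'chrX', 'chrM']
```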
Bioframe provides a convenience function to fetch chromosome sizes from UCSC given an assembly name:
chromsizes = bioframe.fetch_chromsizes("hg38")
chromsizes[-5:]
name
chr21 46709983
chr22 50818468
chrX 156040895
chrY 57227415
chrM 16569
Name: length, dtype: int64
Bioframe can also generate a list of centromere positions using information from some UCSC assemblies:
display(bioframe.fetch_centromeres("hg38")[:3])
| | chrom | start | end | mid |
---|---|---|---|---|
0 | chr1 | 121700000 | 125100000 | 123400000 |
1 | chr2 | 91800000 | 96000000 | 93900000 |
2 | chr3 | 87800000 | 94000000 | 90900000 |
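One common use of centromere midpoints is splitting chromosomes into arms. The sketch below does this by hand for the three rows shown above, using the hg38 chromosome sizes listed earlier. The `_p`/`_q` arm names are an illustrative convention chosen here, and the check that mid is the integer midpoint of start and end holds for these three rows but is not claimed for every assembly.

```python
# Rows copied from the fetch_centromeres("hg38") output above:
# (chrom, start, end, mid)
centromeres = [
    ("chr1", 121700000, 125100000, 123400000),
    ("chr2", 91800000, 96000000, 93900000),
    ("chr3", 87800000, 94000000, 90900000),
]

# hg38 chromosome sizes copied from the read_chromsizes output above.
chromsizes = {"chr1": 248956422, "chr2": 242193529, "chr3": 198295559}

# In these rows, mid is the integer midpoint of start and end.
for chrom, start, end, mid in centromeres:
    assert mid == (start + end) // 2

# Build half-open arm intervals: p arm is [0, mid), q arm is [mid, length).
arms = []
for chrom, _start, _end, mid in centromeres:
    arms.append((chrom, 0, mid, f"{chrom}_p"))
    arms.append((chrom, mid, chromsizes[chrom], f"{chrom}_q"))

for arm in arms:
    print(arm)
```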
These functions are just wrappers for a UCSC client. Users can also use UCSCClient directly:
client = bioframe.UCSCClient("hg38")
client.fetch_cytoband()
| | chrom | start | end | name | gieStain |
---|---|---|---|---|---|
0 | chr1 | 0 | 2300000 | p36.33 | gneg |
1 | chr1 | 2300000 | 5300000 | p36.32 | gpos25 |
2 | chr1 | 5300000 | 7100000 | p36.31 | gneg |
3 | chr1 | 7100000 | 9100000 | p36.23 | gpos25 |
4 | chr1 | 9100000 | 12500000 | p36.22 | gneg |
... | ... | ... | ... | ... | ... |
1544 | chr19_MU273387v1_alt | 0 | 89211 | NaN | gneg |
1545 | chr16_MU273376v1_fix | 0 | 87715 | NaN | gneg |
1546 | chrX_MU273393v1_fix | 0 | 68810 | NaN | gneg |
1547 | chr8_MU273360v1_fix | 0 | 39290 | NaN | gneg |
1548 | chr5_MU273352v1_fix | 0 | 34400 | NaN | gneg |
1549 rows × 5 columns
Curated genome assembly build information
New in v0.5.0
Bioframe also has locally stored information for common genome assembly builds.
For a given provider and assembly build, this API provides additional sequence metadata:
A canonical name for every sequence, usually opting for UCSC-style naming.
A canonical ordering of the sequences.
Each sequence’s length.
An alias dictionary mapping alternative names or aliases to the canonical sequence name.
Each sequence is assigned to an assembly unit: e.g., primary, non-nuclear, decoy.
Each sequence is assigned a role: e.g., assembled molecule, unlocalized, unplaced.
bioframe.assemblies_available()
| | organism | provider | provider_build | release_year | seqinfo | cytobands | default_roles | default_units | url |
---|---|---|---|---|---|---|---|---|---|
0 | homo sapiens | ncbi | GRCh37 | 2009 | hg19.seqinfo.tsv | hg19.cytoband.tsv | [assembled] | [primary, non-nuclear-revised] | https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0... |
1 | homo sapiens | ucsc | hg19 | 2009 | hg19.seqinfo.tsv | hg19.cytoband.tsv | [assembled] | [primary, non-nuclear] | https://hgdownload.soe.ucsc.edu/goldenPath/hg1... |
2 | homo sapiens | ncbi | GRCh38 | 2013 | hg38.seqinfo.tsv | hg38.cytoband.tsv | [assembled] | [primary, non-nuclear] | https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0... |
3 | homo sapiens | ucsc | hg38 | 2013 | hg38.seqinfo.tsv | hg38.cytoband.tsv | [assembled] | [primary, non-nuclear] | https://hgdownload.soe.ucsc.edu/goldenPath/hg3... |
4 | homo sapiens | ncbi | T2T-CHM13v2.0 | 2022 | hs1.seqinfo.tsv | hs1.cytoband.tsv | [assembled] | [primary, non-nuclear] | https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/0... |
5 | homo sapiens | ucsc | hs1 | 2022 | hs1.seqinfo.tsv | hs1.cytoband.tsv | [assembled] | [primary, non-nuclear] | https://hgdownload.soe.ucsc.edu/goldenPath/hs1... |
6 | mus musculus | ncbi | MGSCv37 | 2010 | mm9.seqinfo.tsv | NaN | [assembled] | [primary, non-nuclear] | https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0... |
7 | mus musculus | ucsc | mm9 | 2007 | mm9.seqinfo.tsv | NaN | [assembled] | [primary, non-nuclear] | https://hgdownload.soe.ucsc.edu/goldenPath/mm9... |
8 | mus musculus | ncbi | GRCm38 | 2011 | mm10.seqinfo.tsv | NaN | [assembled] | [primary, non-nuclear] | https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0... |
9 | mus musculus | ucsc | mm10 | 2011 | mm10.seqinfo.tsv | NaN | [assembled] | [primary, non-nuclear] | https://hgdownload.soe.ucsc.edu/goldenPath/mm1... |
10 | mus musculus | ncbi | GRCm39 | 2020 | mm39.seqinfo.tsv | NaN | [assembled] | [primary, non-nuclear] | https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0... |
11 | mus musculus | ucsc | mm39 | 2020 | mm39.seqinfo.tsv | NaN | [assembled] | [primary, non-nuclear] | https://hgdownload.soe.ucsc.edu/goldenPath/mm3... |
12 | drosophila melanogaster | ucsc | dm3 | 2006 | dm3.seqinfo.tsv | NaN | [assembled] | [primary, non-nuclear] | https://hgdownload.soe.ucsc.edu/goldenPath/dm3... |
13 | drosophila melanogaster | ucsc | dm6 | 2014 | dm6.seqinfo.tsv | NaN | [assembled] | [primary, non-nuclear] | https://hgdownload.soe.ucsc.edu/goldenPath/dm6... |
14 | caenorhabditis elegans | ucsc | ce10 | 2010 | ce10.seqinfo.tsv | NaN | [assembled] | [primary, non-nuclear] | https://hgdownload.soe.ucsc.edu/goldenPath/ce1... |
15 | caenorhabditis elegans | ucsc | ce11 | 2013 | ce11.seqinfo.tsv | NaN | [assembled] | [primary, non-nuclear] | https://hgdownload.soe.ucsc.edu/goldenPath/ce1... |
16 | danio rerio | ucsc | danRer10 | 2014 | danRer10.seqinfo.tsv | NaN | [assembled] | [primary, non-nuclear] | https://hgdownload.soe.ucsc.edu/goldenPath/dan... |
17 | danio rerio | ucsc | danRer11 | 2017 | danRer10.seqinfo.tsv | NaN | [assembled] | [primary, non-nuclear] | https://hgdownload.soe.ucsc.edu/goldenPath/dan... |
18 | saccharomyces cerevisiae | ucsc | sacCer3 | 2011 | sacCer3.seqinfo.tsv | NaN | [assembled] | [primary, non-nuclear] | https://hgdownload.soe.ucsc.edu/goldenPath/sac... |
hg38 = bioframe.assembly_info("hg38")
print(hg38.provider, hg38.provider_build)
hg38.seqinfo
ucsc hg38
| | name | length | role | molecule | unit | aliases |
---|---|---|---|---|---|---|
0 | chr1 | 248956422 | assembled | chr1 | primary | 1,CM000663.2,NC_000001.11 |
1 | chr2 | 242193529 | assembled | chr2 | primary | 2,CM000664.2,NC_000002.12 |
2 | chr3 | 198295559 | assembled | chr3 | primary | 3,CM000665.2,NC_000003.12 |
3 | chr4 | 190214555 | assembled | chr4 | primary | 4,CM000666.2,NC_000004.12 |
4 | chr5 | 181538259 | assembled | chr5 | primary | 5,CM000667.2,NC_000005.10 |
5 | chr6 | 170805979 | assembled | chr6 | primary | 6,CM000668.2,NC_000006.12 |
6 | chr7 | 159345973 | assembled | chr7 | primary | 7,CM000669.2,NC_000007.14 |
7 | chr8 | 145138636 | assembled | chr8 | primary | 8,CM000670.2,NC_000008.11 |
8 | chr9 | 138394717 | assembled | chr9 | primary | 9,CM000671.2,NC_000009.12 |
9 | chr10 | 133797422 | assembled | chr10 | primary | 10,CM000672.2,NC_000010.11 |
10 | chr11 | 135086622 | assembled | chr11 | primary | 11,CM000673.2,NC_000011.10 |
11 | chr12 | 133275309 | assembled | chr12 | primary | 12,CM000674.2,NC_000012.12 |
12 | chr13 | 114364328 | assembled | chr13 | primary | 13,CM000675.2,NC_000013.11 |
13 | chr14 | 107043718 | assembled | chr14 | primary | 14,CM000676.2,NC_000014.9 |
14 | chr15 | 101991189 | assembled | chr15 | primary | 15,CM000677.2,NC_000015.10 |
15 | chr16 | 90338345 | assembled | chr16 | primary | 16,CM000678.2,NC_000016.10 |
16 | chr17 | 83257441 | assembled | chr17 | primary | 17,CM000679.2,NC_000017.11 |
17 | chr18 | 80373285 | assembled | chr18 | primary | 18,CM000680.2,NC_000018.10 |
18 | chr19 | 58617616 | assembled | chr19 | primary | 19,CM000681.2,NC_000019.10 |
19 | chr20 | 64444167 | assembled | chr20 | primary | 20,CM000682.2,NC_000020.11 |
20 | chr21 | 46709983 | assembled | chr21 | primary | 21,CM000683.2,NC_000021.9 |
21 | chr22 | 50818468 | assembled | chr22 | primary | 22,CM000684.2,NC_000022.11 |
22 | chrX | 156040895 | assembled | chrX | primary | X,CM000685.2,NC_000023.11 |
23 | chrY | 57227415 | assembled | chrY | primary | Y,CM000686.2,NC_000024.10 |
24 | chrM | 16569 | assembled | chrM | non-nuclear | MT,J01415.2,NC_012920.1 |
hg38.chromsizes
name
chr1 248956422
chr2 242193529
chr3 198295559
chr4 190214555
chr5 181538259
chr6 170805979
chr7 159345973
chr8 145138636
chr9 138394717
chr10 133797422
chr11 135086622
chr12 133275309
chr13 114364328
chr14 107043718
chr15 101991189
chr16 90338345
chr17 83257441
chr18 80373285
chr19 58617616
chr20 64444167
chr21 46709983
chr22 50818468
chrX 156040895
chrY 57227415
chrM 16569
Name: length, dtype: int64
hg38.alias_dict["MT"]
'chrM'
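An alias dictionary like the one queried above can be derived from the comma-separated aliases column of the seqinfo table. The following is a minimal sketch under that assumption, not bioframe's implementation. The helper build_alias_dict is hypothetical, and mapping each canonical name to itself is a design choice made here so that lookups are idempotent.

```python
# (name, aliases) pairs copied from the hg38 seqinfo table above.
seqinfo_rows = [
    ("chr1", "1,CM000663.2,NC_000001.11"),
    ("chrX", "X,CM000685.2,NC_000023.11"),
    ("chrM", "MT,J01415.2,NC_012920.1"),
]


def build_alias_dict(rows):
    """Map every alias (and each canonical name) to its canonical name."""
    alias_dict = {}
    for name, aliases in rows:
        # Assumption: the canonical name also resolves to itself.
        alias_dict[name] = name
        for alias in aliases.split(","):
            if alias:  # skip empty entries from an empty aliases field
                alias_dict[alias] = name
    return alias_dict


alias_dict = build_alias_dict(seqinfo_rows)
print(alias_dict["MT"])  # chrM
```

Such a mapping makes it straightforward to normalize chromosome names from NCBI- or Ensembl-style files to the UCSC-style canonical names.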
bioframe.assembly_info("hg38", roles="all").seqinfo
| | name | length | role | molecule | unit | aliases |
---|---|---|---|---|---|---|
0 | chr1 | 248956422 | assembled | chr1 | primary | 1,CM000663.2,NC_000001.11 |
1 | chr2 | 242193529 | assembled | chr2 | primary | 2,CM000664.2,NC_000002.12 |
2 | chr3 | 198295559 | assembled | chr3 | primary | 3,CM000665.2,NC_000003.12 |
3 | chr4 | 190214555 | assembled | chr4 | primary | 4,CM000666.2,NC_000004.12 |
4 | chr5 | 181538259 | assembled | chr5 | primary | 5,CM000667.2,NC_000005.10 |
... | ... | ... | ... | ... | ... | ... |
189 | chrUn_KI270753v1 | 62944 | unplaced | NaN | primary | HSCHRUN_RANDOM_CTG30,KI270753.1,NT_187508.1 |
190 | chrUn_KI270754v1 | 40191 | unplaced | NaN | primary | HSCHRUN_RANDOM_CTG33,KI270754.1,NT_187509.1 |
191 | chrUn_KI270755v1 | 36723 | unplaced | NaN | primary | HSCHRUN_RANDOM_CTG34,KI270755.1,NT_187510.1 |
192 | chrUn_KI270756v1 | 79590 | unplaced | NaN | primary | HSCHRUN_RANDOM_CTG35,KI270756.1,NT_187511.1 |
193 | chrUn_KI270757v1 | 71251 | unplaced | NaN | primary | HSCHRUN_RANDOM_CTG36,KI270757.1,NT_187512.1 |
194 rows × 6 columns
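The difference between the default view (25 sequences) and the view with roles="all" (194 sequences) comes down to selecting rows by role and unit. Here is a hedged sketch of that selection in plain pandas on three rows taken from the tables above. The helper select_sequences is hypothetical and only illustrates the idea; it is not how bioframe implements assembly_info.

```python
import pandas as pd

# Three rows copied from the hg38 seqinfo tables above.
seqinfo = pd.DataFrame(
    {
        "name": ["chr1", "chrM", "chrUn_KI270753v1"],
        "length": [248956422, 16569, 62944],
        "role": ["assembled", "assembled", "unplaced"],
        "unit": ["primary", "non-nuclear", "primary"],
    }
)


def select_sequences(seqinfo, roles, units):
    """Keep rows whose role is in `roles` and whose unit is in `units`."""
    mask = seqinfo["role"].isin(roles) & seqinfo["unit"].isin(units)
    return seqinfo[mask].reset_index(drop=True)


# Defaults listed for hg38 in the assemblies table above.
default = select_sequences(seqinfo, ["assembled"], ["primary", "non-nuclear"])

# Selecting every role and unit present keeps all rows.
everything = select_sequences(
    seqinfo, sorted(set(seqinfo["role"])), sorted(set(seqinfo["unit"]))
)

print(default["name"].tolist())
print(everything["name"].tolist())
```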
Contributing metadata for a new assembly build
To contribute a new assembly build to bioframe’s internal metadata registry, make a pull request with the following items:
Add a record to the assembly manifest file located at bioframe/io/data/_assemblies.yml. Required fields are as shown in the example below.
Create a seqinfo.tsv file for the new assembly build and place it in bioframe/io/data. Reference the exact file name in the manifest record’s seqinfo field. The seqinfo is a tab-delimited file with a required header line as shown in the example below.
Optionally, add a cytoband.tsv file adapted from a cytoBand.txt file from UCSC.
Note that we currently do not include sequences with alt or patch roles in seqinfo files.
Example
Metadata for the mouse mm9 assembly build as provided by UCSC.
_assemblies.yml
...
- organism: mus musculus
  provider: ucsc
  provider_build: mm9
  release_year: 2007
  seqinfo: mm9.seqinfo.tsv
  default_roles: [assembled]
  default_units: [primary, non-nuclear]
  url: https://hgdownload.soe.ucsc.edu/goldenPath/mm9/bigZips/
...
mm9.seqinfo.tsv
name	length	role	molecule	unit	aliases
chr1	197195432	assembled	chr1	primary	1,CM000994.1,NC_000067.5
chr2	181748087	assembled	chr2	primary	2,CM000995.1,NC_000068.6
chr3	159599783	assembled	chr3	primary	3,CM000996.1,NC_000069.5
chr4	155630120	assembled	chr4	primary	4,CM000997.1,NC_000070.5
chr5	152537259	assembled	chr5	primary	5,CM000998.1,NC_000071.5
chr6	149517037	assembled	chr6	primary	6,CM000999.1,NC_000072.5
chr7	152524553	assembled	chr7	primary	7,CM001000.1,NC_000073.5
chr8	131738871	assembled	chr8	primary	8,CM001001.1,NC_000074.5
chr9	124076172	assembled	chr9	primary	9,CM001002.1,NC_000075.5
chr10	129993255	assembled	chr10	primary	10,CM001003.1,NC_000076.5
chr11	121843856	assembled	chr11	primary	11,CM001004.1,NC_000077.5
chr12	121257530	assembled	chr12	primary	12,CM001005.1,NC_000078.5
chr13	120284312	assembled	chr13	primary	13,CM001006.1,NC_000079.5
chr14	125194864	assembled	chr14	primary	14,CM001007.1,NC_000080.5
chr15	103494974	assembled	chr15	primary	15,CM001008.1,NC_000081.5
chr16	98319150	assembled	chr16	primary	16,CM001009.1,NC_000082.5
chr17	95272651	assembled	chr17	primary	17,CM001010.1,NC_000083.5
chr18	90772031	assembled	chr18	primary	18,CM001011.1,NC_000084.5
chr19	61342430	assembled	chr19	primary	19,CM001012.1,NC_000085.5
chrX	166650296	assembled	chrX	primary	X,CM001013.1,NC_000086.6
chrY	15902555	assembled	chrY	primary	Y,CM001014.1,NC_000087.6
chrM	16299	assembled	chrM	non-nuclear	MT,AY172335.1,NC_005089.1
chr1_random	1231697	unlocalized	chr1	primary
chr3_random	41899	unlocalized	chr3	primary
chr4_random	160594	unlocalized	chr4	primary
chr5_random	357350	unlocalized	chr5	primary
chr7_random	362490	unlocalized	chr7	primary
chr8_random	849593	unlocalized	chr8	primary
chr9_random	449403	unlocalized	chr9	primary
chr13_random	400311	unlocalized	chr13	primary
chr16_random	3994	unlocalized	chr16	primary
chr17_random	628739	unlocalized	chr17	primary
chrX_random	1785075	unlocalized	chrX	primary
chrY_random	58682461	unlocalized	chrY	primary
chrUn_random	5900358	unplaced		primary
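Before opening a pull request, it may help to sanity-check a new seqinfo file. The sketch below is a hypothetical helper, not a bioframe tool. It assumes the required header is exactly the six columns shown in the example, that every row carries all six tab-separated fields (possibly empty), that lengths are non-negative integers, and that names are unique. The actual requirements enforced by bioframe may differ.

```python
import csv
import io

# Header columns as shown in the mm9.seqinfo.tsv example above.
REQUIRED = ["name", "length", "role", "molecule", "unit", "aliases"]


def check_seqinfo(text):
    """Return a list of problems found in seqinfo TSV text (empty if none)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader, None)
    if header != REQUIRED:
        return [f"unexpected header: {header}"]
    problems = []
    seen = set()
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(REQUIRED):
            problems.append(
                f"line {lineno}: expected {len(REQUIRED)} fields, got {len(row)}"
            )
            continue
        name, length = row[0], row[1]
        if not length.isdigit():
            problems.append(f"line {lineno}: length {length!r} is not an integer")
        if name in seen:
            problems.append(f"line {lineno}: duplicate name {name!r}")
        seen.add(name)
    return problems


# Two rows from the mm9 example; the second has an empty aliases field.
sample = (
    "name\tlength\trole\tmolecule\tunit\taliases\n"
    "chr1\t197195432\tassembled\tchr1\tprimary\t1,CM000994.1,NC_000067.5\n"
    "chr1_random\t1231697\tunlocalized\tchr1\tprimary\t\n"
)
print(check_seqinfo(sample))  # []
```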
Reading other genomic formats
See the docs for File I/O for other supported file formats.