Reading genomic dataframes

import bioframe

Bioframe provides several functions in bioframe.io for converting data stored in common genomic file formats to pandas DataFrames.

Reading tabular text data

The most common need is to read tabular data, which can be accomplished with bioframe.read_table. This function wraps pandas.read_csv/pandas.read_table (tab-delimited by default), but lets the user pass a schema (i.e., a list of predefined column names) for common genomic interval-based file formats.

For example,

df = bioframe.read_table(
    "https://www.encodeproject.org/files/ENCFF001XKR/@@download/ENCFF001XKR.bed.gz",
    schema="bed9",
)
display(df[0:3])
chrom start end name score strand thickStart thickEnd itemRgb
0 chr1 193500 194500 . 400 + . . 179,45,0
1 chr1 618500 619500 . 700 + . . 179,45,0
2 chr1 974500 975500 . 1000 + . . 179,45,0
df = bioframe.read_table(
    "https://www.encodeproject.org/files/ENCFF401MQL/@@download/ENCFF401MQL.bed.gz",
    schema="narrowPeak",
)
display(df[0:3])
chrom start end name score strand fc -log10p -log10q relSummit
0 chr19 48309541 48309911 . 1000 . 5.04924 -1.0 0.00438 185
1 chr4 130563716 130564086 . 993 . 5.05052 -1.0 0.00432 185
2 chr1 200622507 200622877 . 591 . 5.05489 -1.0 0.00400 185
df = bioframe.read_table(
    "https://www.encodeproject.org/files/ENCFF001VRS/@@download/ENCFF001VRS.bed.gz",
    schema="bed12",
)
display(df[0:3])
chrom start end name score strand thickStart thickEnd itemRgb blockCount blockSizes blockStarts
0 chr19 54331773 54620705 5C_304_ENm007_FOR_1.5C_304_ENm007_REV_40 1000 . 54331773 54620705 0 2 14528,19855, 0,269077,
1 chr19 54461360 54620705 5C_304_ENm007_FOR_26.5C_304_ENm007_REV_40 1000 . 54461360 54620705 0 2 800,19855, 0,139490,
2 chr5 131346229 132145236 5C_299_ENm002_FOR_241.5C_299_ENm002_REV_33 1000 . 131346229 132145236 0 2 2609,2105, 0,796902,

The schema argument looks up the column names for a file type in a registry of schemas stored in the bioframe.SCHEMAS dictionary:

bioframe.SCHEMAS["bed6"]
['chrom', 'start', 'end', 'name', 'score', 'strand']
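
To make the role of a schema concrete, here is a minimal sketch that uses pandas only. It applies the bed6 column names shown above to headerless, tab-delimited text. This is an illustration of the idea, not bioframe's implementation; the sample text is the first six fields of two rows from the bed9 example, and the variable names are chosen for this sketch.

```python
import io

import pandas as pd

# Column names copied from the bed6 schema shown above.
BED6 = ["chrom", "start", "end", "name", "score", "strand"]

# Two headerless, tab-delimited records (first six fields of rows
# from the bed9 example earlier in this section).
text = (
    "chr1\t193500\t194500\t.\t400\t+\n"
    "chr1\t618500\t619500\t.\t700\t+\n"
)

# A schema is just a list of names handed to the pandas reader.
df = pd.read_csv(io.StringIO(text), sep="\t", header=None, names=BED6)
print(df)
```

Passing schema="bed6" to bioframe.read_table saves you from typing the column list yourself.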

UCSC Big Binary Indexed files (BigWig, BigBed)

Bioframe also has convenience functions for reading and writing bigWig and bigBed binary files to and from pandas DataFrames.

bw_url = "http://genome.ucsc.edu/goldenPath/help/examples/bigWigExample.bw"
df = bioframe.read_bigwig(bw_url, "chr21", start=10_000_000, end=10_010_000)
df.head(5)
chrom start end value
0 chr21 10000000 10000005 40.0
1 chr21 10000005 10000010 40.0
2 chr21 10000010 10000015 60.0
3 chr21 10000015 10000020 80.0
4 chr21 10000020 10000025 40.0
df["value"] *= 100
df.head(5)
chrom start end value
0 chr21 10000000 10000005 4000.0
1 chr21 10000005 10000010 4000.0
2 chr21 10000010 10000015 6000.0
3 chr21 10000015 10000020 8000.0
4 chr21 10000020 10000025 4000.0
chromsizes = bioframe.fetch_chromsizes("hg19")
# bioframe.to_bigwig(df, chromsizes, 'times100.bw')

# Note: bioframe.to_bigwig requires the UCSC bedGraphToBigWig binary, which can be installed with:
# !conda install -y -c bioconda ucsc-bedgraphtobigwig
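
Since bedGraphToBigWig takes bedGraph text as input, it can help to see what that intermediate form looks like. The sketch below uses pandas only and covers just the bedGraph step, under the assumption that the frame has chrom, start, end, and value columns as in the example above. It does not call the UCSC binary and is not bioframe's own code.

```python
import io

import pandas as pd

# Rows copied from the scaled bigWig example above.
df = pd.DataFrame(
    {
        "chrom": ["chr21", "chr21", "chr21"],
        "start": [10000000, 10000005, 10000010],
        "end": [10000005, 10000010, 10000015],
        "value": [4000.0, 4000.0, 6000.0],
    }
)

# bedGraph is headerless, tab-delimited text with four columns.
buf = io.StringIO()
df[["chrom", "start", "end", "value"]].to_csv(
    buf, sep="\t", header=False, index=False
)
bedgraph_text = buf.getvalue()
print(bedgraph_text)
```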
bb_url = "http://genome.ucsc.edu/goldenPath/help/examples/bigBedExample.bb"
bioframe.read_bigbed(bb_url, "chr21", start=48000000).head()
chrom start end
0 chr21 48003453 48003785
1 chr21 48003545 48003672
2 chr21 48018114 48019432
3 chr21 48018244 48018550
4 chr21 48018843 48019099

Reading genome assembly information

The most fundamental information about a genome assembly is its set of chromosome sizes.

Bioframe provides functions to read a chromosome sizes file as a pandas.Series, with some useful filtering and sorting options:

bioframe.read_chromsizes(
    "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes"
)
chr1     248956422
chr2     242193529
chr3     198295559
chr4     190214555
chr5     181538259
chr6     170805979
chr7     159345973
chr8     145138636
chr9     138394717
chr10    133797422
chr11    135086622
chr12    133275309
chr13    114364328
chr14    107043718
chr15    101991189
chr16     90338345
chr17     83257441
chr18     80373285
chr19     58617616
chr20     64444167
chr21     46709983
chr22     50818468
chrX     156040895
chrY      57227415
chrM         16569
Name: length, dtype: int64
bioframe.read_chromsizes(
    "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes",
    filter_chroms=False,
)
chr1                248956422
chr2                242193529
chr3                198295559
chr4                190214555
chr5                181538259
                      ...    
chrUn_KI270539v1          993
chrUn_KI270385v1          990
chrUn_KI270423v1          981
chrUn_KI270392v1          971
chrUn_KI270394v1          970
Name: length, Length: 455, dtype: int64
dm6_url = "https://hgdownload.soe.ucsc.edu/goldenPath/dm6/database/chromInfo.txt.gz"
bioframe.read_chromsizes(
    dm6_url,
    filter_chroms=True,
    chrom_patterns=("^chr2L$", "^chr2R$", "^chr3L$", "^chr3R$", "^chr4$", "^chrX$"),
)
chr2L    23513712
chr2R    25286936
chr3L    28110227
chr3R    32079331
chr4      1348131
chrX     23542271
Name: length, dtype: int64
bioframe.read_chromsizes(
    dm6_url, chrom_patterns=[r"^chr\d+L$", r"^chr\d+R$", "^chr4$", "^chrX$", "^chrM$"]
)
chr2L    23513712
chr3L    28110227
chr2R    25286936
chr3R    32079331
chr4      1348131
chrX     23542271
chrM        19524
Name: length, dtype: int64
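
Judging from the output above, sequences are kept if they match any pattern and are grouped in the order the patterns are given. The following standard-library sketch reproduces that grouping for the dm6 sizes shown above. The helper name filter_by_patterns and the chrUn_demo entry are hypothetical, and within a pattern the sketch keeps input order rather than applying any natural sort.

```python
import re

# dm6 sizes copied from the outputs above, plus one hypothetical
# unplaced contig (chrUn_demo) to show that non-matching names are dropped.
sizes = {
    "chr2L": 23513712,
    "chr2R": 25286936,
    "chr3L": 28110227,
    "chr3R": 32079331,
    "chr4": 1348131,
    "chrX": 23542271,
    "chrM": 19524,
    "chrUn_demo": 1000,  # hypothetical entry, not from dm6
}


def filter_by_patterns(sizes, patterns):
    """Keep names matching any pattern, grouped in pattern order."""
    selected = {}
    for pattern in patterns:
        regex = re.compile(pattern)
        for name, length in sizes.items():
            if regex.search(name) and name not in selected:
                selected[name] = length
    return selected


patterns = [r"^chr\d+L$", r"^chr\d+R$", "^chr4$", "^chrX$", "^chrM$"]
selected = filter_by_patterns(sizes, patterns)
print(list(selected))
```

Anchoring each pattern with ^ and $ avoids accidental matches such as chr4 matching a longer contig name.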

Bioframe provides a convenience function to fetch chromosome sizes from UCSC given an assembly name:

chromsizes = bioframe.fetch_chromsizes("hg38")
chromsizes[-5:]
name
chr21     46709983
chr22     50818468
chrX     156040895
chrY      57227415
chrM         16569
Name: length, dtype: int64

Bioframe can also generate a list of centromere positions using information from some UCSC assemblies:

display(bioframe.fetch_centromeres("hg38")[:3])
chrom start end mid
0 chr1 121700000 125100000 123400000
1 chr2 91800000 96000000 93900000
2 chr3 87800000 94000000 90900000
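
In the three rows shown, the mid column coincides with the midpoint of start and end. The short check below confirms this for those rows only. Whether it holds for every chromosome and assembly is an assumption.

```python
# (chrom, start, end, mid) rows copied from the output above.
centromeres = [
    ("chr1", 121700000, 125100000, 123400000),
    ("chr2", 91800000, 96000000, 93900000),
    ("chr3", 87800000, 94000000, 90900000),
]

for chrom, start, end, mid in centromeres:
    computed = (start + end) // 2  # integer midpoint of the interval
    print(chrom, computed, computed == mid)
```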

These functions are just wrappers for a UCSC client. Users can also use UCSCClient directly:

client = bioframe.UCSCClient("hg38")
client.fetch_cytoband()
chrom start end name gieStain
0 chr1 0 2300000 p36.33 gneg
1 chr1 2300000 5300000 p36.32 gpos25
2 chr1 5300000 7100000 p36.31 gneg
3 chr1 7100000 9100000 p36.23 gpos25
4 chr1 9100000 12500000 p36.22 gneg
... ... ... ... ... ...
1544 chr19_MU273387v1_alt 0 89211 NaN gneg
1545 chr16_MU273376v1_fix 0 87715 NaN gneg
1546 chrX_MU273393v1_fix 0 68810 NaN gneg
1547 chr8_MU273360v1_fix 0 39290 NaN gneg
1548 chr5_MU273352v1_fix 0 34400 NaN gneg

1549 rows × 5 columns

Curated genome assembly build information

New in v0.5.0

Bioframe also has locally stored information for common genome assembly builds.

For a given provider and assembly build, this API provides additional sequence metadata:

  • A canonical name for every sequence, usually opting for UCSC-style naming.

  • A canonical ordering of the sequences.

  • Each sequence’s length.

  • An alias dictionary mapping alternative names or aliases to the canonical sequence name.

  • An assembly unit for each sequence: e.g., primary, non-nuclear, decoy.

  • A role for each sequence: e.g., assembled molecule, unlocalized, unplaced.

bioframe.assemblies_available()
    organism                  provider  provider_build  release_year  seqinfo               cytobands          default_roles  default_units                   url
0   homo sapiens              ncbi      GRCh37          2009          hg19.seqinfo.tsv      hg19.cytoband.tsv  [assembled]    [primary, non-nuclear-revised]  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
1   homo sapiens              ucsc      hg19            2009          hg19.seqinfo.tsv      hg19.cytoband.tsv  [assembled]    [primary, non-nuclear]          https://hgdownload.soe.ucsc.edu/goldenPath/hg1...
2   homo sapiens              ncbi      GRCh38          2013          hg38.seqinfo.tsv      hg38.cytoband.tsv  [assembled]    [primary, non-nuclear]          https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
3   homo sapiens              ucsc      hg38            2013          hg38.seqinfo.tsv      hg38.cytoband.tsv  [assembled]    [primary, non-nuclear]          https://hgdownload.soe.ucsc.edu/goldenPath/hg3...
4   homo sapiens              ncbi      T2T-CHM13v2.0   2022          hs1.seqinfo.tsv       hs1.cytoband.tsv   [assembled]    [primary, non-nuclear]          https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/0...
5   homo sapiens              ucsc      hs1             2022          hs1.seqinfo.tsv       hs1.cytoband.tsv   [assembled]    [primary, non-nuclear]          https://hgdownload.soe.ucsc.edu/goldenPath/hs1...
6   mus musculus              ncbi      MGSCv37         2010          mm9.seqinfo.tsv       NaN                [assembled]    [primary, non-nuclear]          https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
7   mus musculus              ucsc      mm9             2007          mm9.seqinfo.tsv       NaN                [assembled]    [primary, non-nuclear]          https://hgdownload.soe.ucsc.edu/goldenPath/mm9...
8   mus musculus              ncbi      GRCm38          2011          mm10.seqinfo.tsv      NaN                [assembled]    [primary, non-nuclear]          https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
9   mus musculus              ucsc      mm10            2011          mm10.seqinfo.tsv      NaN                [assembled]    [primary, non-nuclear]          https://hgdownload.soe.ucsc.edu/goldenPath/mm1...
10  mus musculus              ncbi      GRCm39          2020          mm39.seqinfo.tsv      NaN                [assembled]    [primary, non-nuclear]          https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
11  mus musculus              ucsc      mm39            2020          mm39.seqinfo.tsv      NaN                [assembled]    [primary, non-nuclear]          https://hgdownload.soe.ucsc.edu/goldenPath/mm3...
12  drosophila melanogaster   ucsc      dm3             2006          dm3.seqinfo.tsv       NaN                [assembled]    [primary, non-nuclear]          https://hgdownload.soe.ucsc.edu/goldenPath/dm3...
13  drosophila melanogaster   ucsc      dm6             2014          dm6.seqinfo.tsv       NaN                [assembled]    [primary, non-nuclear]          https://hgdownload.soe.ucsc.edu/goldenPath/dm6...
14  caenorhabditis elegans    ucsc      ce10            2010          ce10.seqinfo.tsv      NaN                [assembled]    [primary, non-nuclear]          https://hgdownload.soe.ucsc.edu/goldenPath/ce1...
15  caenorhabditis elegans    ucsc      ce11            2013          ce11.seqinfo.tsv      NaN                [assembled]    [primary, non-nuclear]          https://hgdownload.soe.ucsc.edu/goldenPath/ce1...
16  danio rerio               ucsc      danRer10        2014          danRer10.seqinfo.tsv  NaN                [assembled]    [primary, non-nuclear]          https://hgdownload.soe.ucsc.edu/goldenPath/dan...
17  danio rerio               ucsc      danRer11        2017          danRer10.seqinfo.tsv  NaN                [assembled]    [primary, non-nuclear]          https://hgdownload.soe.ucsc.edu/goldenPath/dan...
18  saccharomyces cerevisiae  ucsc      sacCer3         2011          sacCer3.seqinfo.tsv   NaN                [assembled]    [primary, non-nuclear]          https://hgdownload.soe.ucsc.edu/goldenPath/sac...
hg38 = bioframe.assembly_info("hg38")
print(hg38.provider, hg38.provider_build)
hg38.seqinfo
ucsc hg38
name length role molecule unit aliases
0 chr1 248956422 assembled chr1 primary 1,CM000663.2,NC_000001.11
1 chr2 242193529 assembled chr2 primary 2,CM000664.2,NC_000002.12
2 chr3 198295559 assembled chr3 primary 3,CM000665.2,NC_000003.12
3 chr4 190214555 assembled chr4 primary 4,CM000666.2,NC_000004.12
4 chr5 181538259 assembled chr5 primary 5,CM000667.2,NC_000005.10
5 chr6 170805979 assembled chr6 primary 6,CM000668.2,NC_000006.12
6 chr7 159345973 assembled chr7 primary 7,CM000669.2,NC_000007.14
7 chr8 145138636 assembled chr8 primary 8,CM000670.2,NC_000008.11
8 chr9 138394717 assembled chr9 primary 9,CM000671.2,NC_000009.12
9 chr10 133797422 assembled chr10 primary 10,CM000672.2,NC_000010.11
10 chr11 135086622 assembled chr11 primary 11,CM000673.2,NC_000011.10
11 chr12 133275309 assembled chr12 primary 12,CM000674.2,NC_000012.12
12 chr13 114364328 assembled chr13 primary 13,CM000675.2,NC_000013.11
13 chr14 107043718 assembled chr14 primary 14,CM000676.2,NC_000014.9
14 chr15 101991189 assembled chr15 primary 15,CM000677.2,NC_000015.10
15 chr16 90338345 assembled chr16 primary 16,CM000678.2,NC_000016.10
16 chr17 83257441 assembled chr17 primary 17,CM000679.2,NC_000017.11
17 chr18 80373285 assembled chr18 primary 18,CM000680.2,NC_000018.10
18 chr19 58617616 assembled chr19 primary 19,CM000681.2,NC_000019.10
19 chr20 64444167 assembled chr20 primary 20,CM000682.2,NC_000020.11
20 chr21 46709983 assembled chr21 primary 21,CM000683.2,NC_000021.9
21 chr22 50818468 assembled chr22 primary 22,CM000684.2,NC_000022.11
22 chrX 156040895 assembled chrX primary X,CM000685.2,NC_000023.11
23 chrY 57227415 assembled chrY primary Y,CM000686.2,NC_000024.10
24 chrM 16569 assembled chrM non-nuclear MT,J01415.2,NC_012920.1
hg38.chromsizes
name
chr1     248956422
chr2     242193529
chr3     198295559
chr4     190214555
chr5     181538259
chr6     170805979
chr7     159345973
chr8     145138636
chr9     138394717
chr10    133797422
chr11    135086622
chr12    133275309
chr13    114364328
chr14    107043718
chr15    101991189
chr16     90338345
chr17     83257441
chr18     80373285
chr19     58617616
chr20     64444167
chr21     46709983
chr22     50818468
chrX     156040895
chrY      57227415
chrM         16569
Name: length, dtype: int64
hg38.alias_dict["MT"]
'chrM'
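
The alias lookup above can be understood as a flat dictionary derived from the aliases column of the seqinfo table. Here is a minimal sketch of that idea, using three rows copied from the table. The helper build_alias_dict is hypothetical, and mapping each canonical name to itself is a choice made for this sketch rather than confirmed bioframe behavior.

```python
# (name, aliases) pairs copied from the hg38 seqinfo table above.
seqinfo_rows = [
    ("chr1", "1,CM000663.2,NC_000001.11"),
    ("chrX", "X,CM000685.2,NC_000023.11"),
    ("chrM", "MT,J01415.2,NC_012920.1"),
]


def build_alias_dict(rows):
    """Map every alias, and the canonical name itself, to the canonical name."""
    alias_dict = {}
    for name, aliases in rows:
        alias_dict[name] = name
        for alias in aliases.split(","):
            if alias:  # skip empty alias fields
                alias_dict[alias] = name
    return alias_dict


alias_dict = build_alias_dict(seqinfo_rows)
print(alias_dict["MT"])
print(alias_dict["NC_000023.11"])
```

A dictionary like this makes it straightforward to rename chromosomes from Ensembl-style or accession-style names to UCSC-style names before combining datasets.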
bioframe.assembly_info("hg38", roles="all").seqinfo
name length role molecule unit aliases
0 chr1 248956422 assembled chr1 primary 1,CM000663.2,NC_000001.11
1 chr2 242193529 assembled chr2 primary 2,CM000664.2,NC_000002.12
2 chr3 198295559 assembled chr3 primary 3,CM000665.2,NC_000003.12
3 chr4 190214555 assembled chr4 primary 4,CM000666.2,NC_000004.12
4 chr5 181538259 assembled chr5 primary 5,CM000667.2,NC_000005.10
... ... ... ... ... ... ...
189 chrUn_KI270753v1 62944 unplaced NaN primary HSCHRUN_RANDOM_CTG30,KI270753.1,NT_187508.1
190 chrUn_KI270754v1 40191 unplaced NaN primary HSCHRUN_RANDOM_CTG33,KI270754.1,NT_187509.1
191 chrUn_KI270755v1 36723 unplaced NaN primary HSCHRUN_RANDOM_CTG34,KI270755.1,NT_187510.1
192 chrUn_KI270756v1 79590 unplaced NaN primary HSCHRUN_RANDOM_CTG35,KI270756.1,NT_187511.1
193 chrUn_KI270757v1 71251 unplaced NaN primary HSCHRUN_RANDOM_CTG36,KI270757.1,NT_187512.1

194 rows × 6 columns
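
The difference between the default view and roles="all" comes down to filtering the seqinfo table on its role and unit columns. The pandas sketch below illustrates that filtering with three rows taken from the tables above. The helper select_sequences and its defaults are hypothetical; they mirror the default_roles and default_units values listed for hg38, not bioframe's internal code.

```python
import pandas as pd

# Three rows copied from the hg38 seqinfo outputs above.
seqinfo = pd.DataFrame(
    {
        "name": ["chr1", "chrM", "chrUn_KI270753v1"],
        "length": [248956422, 16569, 62944],
        "role": ["assembled", "assembled", "unplaced"],
        "unit": ["primary", "non-nuclear", "primary"],
    }
)


def select_sequences(seqinfo, roles=("assembled",), units=("primary", "non-nuclear")):
    """Hypothetical helper: keep rows whose role and unit are allowed.

    Passing "all" for roles or units disables that filter.
    """
    mask = pd.Series(True, index=seqinfo.index)
    if roles != "all":
        mask &= seqinfo["role"].isin(roles)
    if units != "all":
        mask &= seqinfo["unit"].isin(units)
    return seqinfo[mask]


print(select_sequences(seqinfo)["name"].tolist())
print(select_sequences(seqinfo, roles="all")["name"].tolist())
```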

Contributing metadata for a new assembly build

To contribute a new assembly build to bioframe’s internal metadata registry, make a pull request with the following items:

  1. Add a record to the assembly manifest file located at bioframe/io/data/_assemblies.yml. Required fields are as shown in the example below.

  2. Create a seqinfo.tsv file for the new assembly build and place it in bioframe/io/data. Reference the exact file name in the manifest record’s seqinfo field. The seqinfo is a tab-delimited file with a required header line as shown in the example below.

  3. Optionally, add a cytoband.tsv file adapted from a cytoBand.txt file from UCSC.

Note that we currently do not include sequences with alt or patch roles in seqinfo files.

Example

Metadata for the mouse mm9 assembly build as provided by UCSC.

_assemblies.yml

...
- organism: mus musculus
  provider: ucsc
  provider_build: mm9
  release_year: 2007
  seqinfo: mm9.seqinfo.tsv
  default_roles: [assembled]
  default_units: [primary, non-nuclear]
  url: https://hgdownload.soe.ucsc.edu/goldenPath/mm9/bigZips/
...

mm9.seqinfo.tsv

name	length	role	molecule	unit	aliases
chr1	197195432	assembled	chr1	primary	1,CM000994.1,NC_000067.5
chr2	181748087	assembled	chr2	primary	2,CM000995.1,NC_000068.6
chr3	159599783	assembled	chr3	primary	3,CM000996.1,NC_000069.5
chr4	155630120	assembled	chr4	primary	4,CM000997.1,NC_000070.5
chr5	152537259	assembled	chr5	primary	5,CM000998.1,NC_000071.5
chr6	149517037	assembled	chr6	primary	6,CM000999.1,NC_000072.5
chr7	152524553	assembled	chr7	primary	7,CM001000.1,NC_000073.5
chr8	131738871	assembled	chr8	primary	8,CM001001.1,NC_000074.5
chr9	124076172	assembled	chr9	primary	9,CM001002.1,NC_000075.5
chr10	129993255	assembled	chr10	primary	10,CM001003.1,NC_000076.5
chr11	121843856	assembled	chr11	primary	11,CM001004.1,NC_000077.5
chr12	121257530	assembled	chr12	primary	12,CM001005.1,NC_000078.5
chr13	120284312	assembled	chr13	primary	13,CM001006.1,NC_000079.5
chr14	125194864	assembled	chr14	primary	14,CM001007.1,NC_000080.5
chr15	103494974	assembled	chr15	primary	15,CM001008.1,NC_000081.5
chr16	98319150	assembled	chr16	primary	16,CM001009.1,NC_000082.5
chr17	95272651	assembled	chr17	primary	17,CM001010.1,NC_000083.5
chr18	90772031	assembled	chr18	primary	18,CM001011.1,NC_000084.5
chr19	61342430	assembled	chr19	primary	19,CM001012.1,NC_000085.5
chrX	166650296	assembled	chrX	primary	X,CM001013.1,NC_000086.6
chrY	15902555	assembled	chrY	primary	Y,CM001014.1,NC_000087.6
chrM	16299	assembled	chrM	non-nuclear	MT,AY172335.1,NC_005089.1
chr1_random	1231697	unlocalized	chr1	primary	
chr3_random	41899	unlocalized	chr3	primary	
chr4_random	160594	unlocalized	chr4	primary	
chr5_random	357350	unlocalized	chr5	primary	
chr7_random	362490	unlocalized	chr7	primary	
chr8_random	849593	unlocalized	chr8	primary	
chr9_random	449403	unlocalized	chr9	primary	
chr13_random	400311	unlocalized	chr13	primary	
chr16_random	3994	unlocalized	chr16	primary	
chr17_random	628739	unlocalized	chr17	primary	
chrX_random	1785075	unlocalized	chrX	primary	
chrY_random	58682461	unlocalized	chrY	primary	
chrUn_random	5900358	unplaced		primary	
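
Before opening a pull request, it may be worth sanity-checking a new seqinfo file against the layout shown above. The following standard-library sketch is one way to do that. The function check_seqinfo and its specific checks are assumptions made for illustration (required header, six fields per row, positive integer lengths, no alt or patch roles); bioframe may apply different or additional checks.

```python
import csv
import io

REQUIRED_COLUMNS = ["name", "length", "role", "molecule", "unit", "aliases"]
EXCLUDED_ROLES = {"alt", "patch"}  # roles not currently included in seqinfo files


def check_seqinfo(text):
    """Return a list of problems found in seqinfo TSV text (empty if none)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader, None)
    if header != REQUIRED_COLUMNS:
        return [f"unexpected header: {header}"]
    problems = []
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(REQUIRED_COLUMNS):
            problems.append(f"line {lineno}: expected 6 fields, got {len(row)}")
            continue
        name, length, role = row[0], row[1], row[2]
        if not length.isdigit() or int(length) <= 0:
            problems.append(f"line {lineno}: bad length for {name}: {length!r}")
        if role in EXCLUDED_ROLES:
            problems.append(f"line {lineno}: role {role!r} is not included")
    return problems


# Header and three rows copied from mm9.seqinfo.tsv above.
sample = (
    "name\tlength\trole\tmolecule\tunit\taliases\n"
    "chr1\t197195432\tassembled\tchr1\tprimary\t1,CM000994.1,NC_000067.5\n"
    "chrM\t16299\tassembled\tchrM\tnon-nuclear\tMT,AY172335.1,NC_005089.1\n"
    "chrUn_random\t5900358\tunplaced\t\tprimary\t\n"
)
print(check_seqinfo(sample))
```

Note that empty molecule and aliases fields, as in the chrUn_random row, still count as fields, so the trailing tab must be preserved.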

Reading other genomic formats

See the docs for File I/O for other supported file formats.