Reading genomic dataframes

import bioframe

Bioframe provides multiple methods in bioframe.io to convert data stored in common genomic file formats to pandas DataFrames.

Reading tabular text data

The most common need is to read tabular data, which can be accomplished with bioframe.read_table. This function wraps pandas.read_csv/pandas.read_table (tab-delimited by default), but lets the user pass a schema (i.e., a list of predefined column names) for common genomic interval-based file formats.

For example,

df = bioframe.read_table(
    "https://www.encodeproject.org/files/ENCFF001XKR/@@download/ENCFF001XKR.bed.gz",
    schema="bed9",
)
display(df[0:3])

	chrom	start	end	name	score	strand	thickStart	thickEnd	itemRgb
0	chr1	193500	194500	.	400	+	.	.	179,45,0
1	chr1	618500	619500	.	700	+	.	.	179,45,0
2	chr1	974500	975500	.	1000	+	.	.	179,45,0

df = bioframe.read_table(
    "https://www.encodeproject.org/files/ENCFF401MQL/@@download/ENCFF401MQL.bed.gz",
    schema="narrowPeak",
)
display(df[0:3])

	chrom	start	end	name	score	strand	fc	-log10p	-log10q	relSummit
0	chr19	48309541	48309911	.	1000	.	5.04924	-1.0	0.00438	185
1	chr4	130563716	130564086	.	993	.	5.05052	-1.0	0.00432	185
2	chr1	200622507	200622877	.	591	.	5.05489	-1.0	0.00400	185

df = bioframe.read_table(
    "https://www.encodeproject.org/files/ENCFF001VRS/@@download/ENCFF001VRS.bed.gz",
    schema="bed12",
)
display(df[0:3])

	chrom	start	end	name	score	strand	thickStart	thickEnd	blockCount	blockSizes	blockStarts
0	chr19	54331773	54620705	5C_304_ENm007_FOR_1.5C_304_ENm007_REV_40	1000	.	54331773	54620705	2	14528,19855,	0,269077,
1	chr19	54461360	54620705	5C_304_ENm007_FOR_26.5C_304_ENm007_REV_40	1000	.	54461360	54620705	2	800,19855,	0,139490,
2	chr5	131346229	132145236	5C_299_ENm002_FOR_241.5C_299_ENm002_REV_33	1000	.	131346229	132145236	2	2609,2105,	0,796902,

The schema argument looks up the file type in a registry of schemas stored in the bioframe.SCHEMAS dictionary:

bioframe.SCHEMAS["bed6"]

['chrom', 'start', 'end', 'name', 'score', 'strand']
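
To make the idea of a schema concrete, here is a minimal plain-Python sketch of what applying one amounts to: the column names from the registry are paired with the fields of each headerless, tab-delimited line. The TOY_SCHEMAS dictionary and the apply_schema helper are hypothetical names used only for illustration; they are not bioframe's registry or its implementation. The sample line reuses the first six fields of the first bed9 record shown earlier.

```python
import csv
import io

# Hypothetical mini-registry for illustration; the real one is bioframe.SCHEMAS.
TOY_SCHEMAS = {
    "bed3": ["chrom", "start", "end"],
    "bed6": ["chrom", "start", "end", "name", "score", "strand"],
}


def apply_schema(text, schema):
    """Pair each tab-delimited field with the column name given by the schema."""
    names = TOY_SCHEMAS[schema]
    records = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(fields) != len(names):
            raise ValueError(f"expected {len(names)} fields, got {len(fields)}")
        records.append(dict(zip(names, fields)))
    return records


# A headerless BED line: the file itself carries no column names.
bed6_text = "chr1\t193500\t194500\t.\t400\t+\n"
records = apply_schema(bed6_text, "bed6")
print(records[0]["chrom"], records[0]["start"], records[0]["end"])
```

Unlike a DataFrame reader, this sketch leaves every value as a string; type inference is one of the things the pandas-based reader adds on top.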

UCSC Big Binary Indexed files (BigWig, BigBed)

Bioframe also has convenience functions for reading and writing bigWig and bigBed binary files to and from pandas DataFrames.

bw_url = "http://genome.ucsc.edu/goldenPath/help/examples/bigWigExample.bw"
df = bioframe.read_bigwig(bw_url, "chr21", start=10_000_000, end=10_010_000)
df.head(5)

	chrom	start	end	value
0	chr21	10000000	10000005	40.0
1	chr21	10000005	10000010	40.0
2	chr21	10000010	10000015	60.0
3	chr21	10000015	10000020	80.0
4	chr21	10000020	10000025	40.0

df["value"] *= 100
df.head(5)

	chrom	start	end	value
0	chr21	10000000	10000005	4000.0
1	chr21	10000005	10000010	4000.0
2	chr21	10000010	10000015	6000.0
3	chr21	10000015	10000020	8000.0
4	chr21	10000020	10000025	4000.0

chromsizes = bioframe.fetch_chromsizes("hg19")
# bioframe.to_bigwig(df, chromsizes, 'times100.bw')

# note: requires UCSC bedGraphToBigWig binary, which can be installed as
# !conda install -y -c bioconda ucsc-bedgraphtobigwig

bb_url = "http://genome.ucsc.edu/goldenPath/help/examples/bigBedExample.bb"
bioframe.read_bigbed(bb_url, "chr21", start=48000000).head()

	chrom	start	end
0	chr21	48003453	48003785
1	chr21	48003545	48003672
2	chr21	48018114	48019432
3	chr21	48018244	48018550
4	chr21	48018843	48019099

Reading genome assembly information

The most fundamental information about a genome assembly is its set of chromosome sizes.

Bioframe provides functions to read a chromosome sizes file as a pandas.Series, with some useful filtering and sorting options:

bioframe.read_chromsizes(
    "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes"
)

chr1     248956422
chr2     242193529
chr3     198295559
chr4     190214555
chr5     181538259
chr6     170805979
chr7     159345973
chr8     145138636
chr9     138394717
chr10    133797422
chr11    135086622
chr12    133275309
chr13    114364328
chr14    107043718
chr15    101991189
chr16     90338345
chr17     83257441
chr18     80373285
chr19     58617616
chr20     64444167
chr21     46709983
chr22     50818468
chrX     156040895
chrY      57227415
chrM         16569
Name: length, dtype: int64

bioframe.read_chromsizes(
    "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes",
    filter_chroms=False,
)

chr1                248956422
chr2                242193529
chr3                198295559
chr4                190214555
chr5                181538259
                      ...    
chrUn_KI270539v1          993
chrUn_KI270385v1          990
chrUn_KI270423v1          981
chrUn_KI270392v1          971
chrUn_KI270394v1          970
Name: length, Length: 455, dtype: int64

dm6_url = "https://hgdownload.soe.ucsc.edu/goldenPath/dm6/database/chromInfo.txt.gz"

bioframe.read_chromsizes(
    dm6_url,
    filter_chroms=True,
    chrom_patterns=("^chr2L$", "^chr2R$", "^chr3L$", "^chr3R$", "^chr4$", "^chrX$"),
)

chr2L    23513712
chr2R    25286936
chr3L    28110227
chr3R    32079331
chr4      1348131
chrX     23542271
Name: length, dtype: int64

bioframe.read_chromsizes(
    dm6_url, chrom_patterns=[r"^chr\d+L$", r"^chr\d+R$", "^chr4$", "^chrX$", "^chrM$"]
)

chr2L    23513712
chr3L    28110227
chr2R    25286936
chr3R    32079331
chr4      1348131
chrX     23542271
chrM        19524
Name: length, dtype: int64
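
The ordering in the last output suggests that names are grouped by the pattern they match, in the order the patterns are given, and sorted naturally within each group. The following plain-Python sketch reproduces that behavior under that assumption. The helpers natural_key and filter_chromsizes are hypothetical names, not bioframe functions, and the chrUn_hypothetical entry is made up purely to show a name being filtered out.

```python
import re


def natural_key(name):
    """Split digit runs out of a name so that chr2 sorts before chr10."""
    parts = re.split(r"(\d+)", name)
    return [int(p) if p.isdigit() else p for p in parts]


def filter_chromsizes(sizes, patterns):
    """Keep names matching any pattern, grouped by pattern order."""
    selected = {}
    for pattern in patterns:
        regex = re.compile(pattern)
        # Names already claimed by an earlier pattern are not repeated.
        matches = [n for n in sizes if regex.search(n) and n not in selected]
        for name in sorted(matches, key=natural_key):
            selected[name] = sizes[name]
    return selected


# dm6 lengths copied from the outputs above, deliberately out of order.
sizes = {
    "chrM": 19524,
    "chr3R": 32079331,
    "chrX": 23542271,
    "chr2L": 23513712,
    "chrUn_hypothetical": 1000,  # made-up entry that no pattern matches
    "chr4": 1348131,
    "chr3L": 28110227,
    "chr2R": 25286936,
}
patterns = [r"^chr\d+L$", r"^chr\d+R$", "^chr4$", "^chrX$", "^chrM$"]
filtered = filter_chromsizes(sizes, patterns)
print(list(filtered))
```

Because the anchors ^ and $ are part of each pattern, "^chr4$" selects only chr4 and not, for example, a scaffold whose name merely begins with chr4.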

Bioframe provides a convenience function to fetch chromosome sizes from UCSC given an assembly name:

chromsizes = bioframe.fetch_chromsizes("hg38")
chromsizes[-5:]

name
chr21     46709983
chr22     50818468
chrX     156040895
chrY      57227415
chrM         16569
Name: length, dtype: int64

Bioframe can also generate a list of centromere positions using information from some UCSC assemblies:

display(bioframe.fetch_centromeres("hg38")[:3])

	chrom	start	end	mid
0	chr1	121700000	125100000	123400000
1	chr2	91800000	96000000	93900000
2	chr3	87800000	94000000	90900000
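
One way a table like this can be derived is from cytoband records, by merging the bands stained acen on each chromosome. The sketch below illustrates that idea only; it is not necessarily how bioframe obtains centromeres for any given assembly. The helper centromeres_from_bands is a hypothetical name, the two acen rows are illustrative values chosen to agree with the chr1 record above, and taking mid as the midpoint of the merged span is an assumption.

```python
def centromeres_from_bands(bands):
    """Merge the 'acen' bands of each chromosome into one span with a midpoint."""
    spans = {}
    for chrom, start, end, _name, stain in bands:
        if stain != "acen":
            continue
        lo, hi = spans.get(chrom, (start, end))
        spans[chrom] = (min(lo, start), max(hi, end))
    return [
        {"chrom": chrom, "start": start, "end": end, "mid": (start + end) // 2}
        for chrom, (start, end) in spans.items()
    ]


# Rows are (chrom, start, end, name, gieStain).
bands = [
    ("chr1", 0, 2300000, "p36.33", "gneg"),
    ("chr1", 121700000, 123400000, "p11.1", "acen"),  # illustrative acen band
    ("chr1", 123400000, 125100000, "q11", "acen"),  # illustrative acen band
]
centromeres = centromeres_from_bands(bands)
print(centromeres)
```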

These functions are just wrappers for a UCSC client. Users can also use UCSCClient directly:

client = bioframe.UCSCClient("hg38")
client.fetch_cytoband()

	chrom	start	end	name	gieStain
0	chr1	0	2300000	p36.33	gneg
1	chr1	2300000	5300000	p36.32	gpos25
2	chr1	5300000	7100000	p36.31	gneg
3	chr1	7100000	9100000	p36.23	gpos25
4	chr1	9100000	12500000	p36.22	gneg
...	...	...	...	...	...
1544	chr19_MU273387v1_alt	0	89211	NaN	gneg
1545	chr16_MU273376v1_fix	0	87715	NaN	gneg
1546	chrX_MU273393v1_fix	0	68810	NaN	gneg
1547	chr8_MU273360v1_fix	0	39290	NaN	gneg
1548	chr5_MU273352v1_fix	0	34400	NaN	gneg

1549 rows × 5 columns

Curated genome assembly build information

New in v0.5.0

Bioframe also has locally stored information for common genome assembly builds.

For a given provider and assembly build, this API provides additional sequence metadata:

A canonical name for every sequence, usually opting for UCSC-style naming.
A canonical ordering of the sequences.
Each sequence’s length.
An alias dictionary mapping alternative names or aliases to the canonical sequence name.
Each sequence is assigned to an assembly unit: e.g., primary, non-nuclear, decoy.
Each sequence is assigned a role: e.g., assembled molecule, unlocalized, unplaced.

bioframe.assemblies_available()

	organism	provider	provider_build	release_year	seqinfo	cytobands	default_roles	default_units	url
0	homo sapiens	ncbi	GRCh37	2009	hg19.seqinfo.tsv	hg19.cytoband.tsv	[assembled]	[primary, non-nuclear-revised]	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
1	homo sapiens	ucsc	hg19	2009	hg19.seqinfo.tsv	hg19.cytoband.tsv	[assembled]	[primary, non-nuclear]	https://hgdownload.soe.ucsc.edu/goldenPath/hg1...
2	homo sapiens	ncbi	GRCh38	2013	hg38.seqinfo.tsv	hg38.cytoband.tsv	[assembled]	[primary, non-nuclear]	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
3	homo sapiens	ucsc	hg38	2013	hg38.seqinfo.tsv	hg38.cytoband.tsv	[assembled]	[primary, non-nuclear]	https://hgdownload.soe.ucsc.edu/goldenPath/hg3...
4	homo sapiens	ncbi	T2T-CHM13v2.0	2022	hs1.seqinfo.tsv	hs1.cytoband.tsv	[assembled]	[primary, non-nuclear]	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/0...
5	homo sapiens	ucsc	hs1	2022	hs1.seqinfo.tsv	hs1.cytoband.tsv	[assembled]	[primary, non-nuclear]	https://hgdownload.soe.ucsc.edu/goldenPath/hs1...
6	mus musculus	ncbi	MGSCv37	2010	mm9.seqinfo.tsv	NaN	[assembled]	[primary, non-nuclear]	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
7	mus musculus	ucsc	mm9	2007	mm9.seqinfo.tsv	NaN	[assembled]	[primary, non-nuclear]	https://hgdownload.soe.ucsc.edu/goldenPath/mm9...
8	mus musculus	ncbi	GRCm38	2011	mm10.seqinfo.tsv	NaN	[assembled]	[primary, non-nuclear]	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
9	mus musculus	ucsc	mm10	2011	mm10.seqinfo.tsv	NaN	[assembled]	[primary, non-nuclear]	https://hgdownload.soe.ucsc.edu/goldenPath/mm1...
10	mus musculus	ncbi	GRCm39	2020	mm39.seqinfo.tsv	NaN	[assembled]	[primary, non-nuclear]	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
11	mus musculus	ucsc	mm39	2020	mm39.seqinfo.tsv	NaN	[assembled]	[primary, non-nuclear]	https://hgdownload.soe.ucsc.edu/goldenPath/mm3...
12	drosophila melanogaster	ucsc	dm3	2006	dm3.seqinfo.tsv	NaN	[assembled]	[primary, non-nuclear]	https://hgdownload.soe.ucsc.edu/goldenPath/dm3...
13	drosophila melanogaster	ucsc	dm6	2014	dm6.seqinfo.tsv	NaN	[assembled]	[primary, non-nuclear]	https://hgdownload.soe.ucsc.edu/goldenPath/dm6...
14	caenorhabditis elegans	ucsc	ce10	2010	ce10.seqinfo.tsv	NaN	[assembled]	[primary, non-nuclear]	https://hgdownload.soe.ucsc.edu/goldenPath/ce1...
15	caenorhabditis elegans	ucsc	ce11	2013	ce11.seqinfo.tsv	NaN	[assembled]	[primary, non-nuclear]	https://hgdownload.soe.ucsc.edu/goldenPath/ce1...
16	danio rerio	ucsc	danRer10	2014	danRer10.seqinfo.tsv	NaN	[assembled]	[primary, non-nuclear]	https://hgdownload.soe.ucsc.edu/goldenPath/dan...
17	danio rerio	ucsc	danRer11	2017	danRer10.seqinfo.tsv	NaN	[assembled]	[primary, non-nuclear]	https://hgdownload.soe.ucsc.edu/goldenPath/dan...
18	saccharomyces cerevisiae	ucsc	sacCer3	2011	sacCer3.seqinfo.tsv	NaN	[assembled]	[primary, non-nuclear]	https://hgdownload.soe.ucsc.edu/goldenPath/sac...

hg38 = bioframe.assembly_info("hg38")
print(hg38.provider, hg38.provider_build)
hg38.seqinfo

ucsc hg38

	name	length	role	molecule	unit	aliases
0	chr1	248956422	assembled	chr1	primary	1,CM000663.2,NC_000001.11
1	chr2	242193529	assembled	chr2	primary	2,CM000664.2,NC_000002.12
2	chr3	198295559	assembled	chr3	primary	3,CM000665.2,NC_000003.12
3	chr4	190214555	assembled	chr4	primary	4,CM000666.2,NC_000004.12
4	chr5	181538259	assembled	chr5	primary	5,CM000667.2,NC_000005.10
5	chr6	170805979	assembled	chr6	primary	6,CM000668.2,NC_000006.12
6	chr7	159345973	assembled	chr7	primary	7,CM000669.2,NC_000007.14
7	chr8	145138636	assembled	chr8	primary	8,CM000670.2,NC_000008.11
8	chr9	138394717	assembled	chr9	primary	9,CM000671.2,NC_000009.12
9	chr10	133797422	assembled	chr10	primary	10,CM000672.2,NC_000010.11
10	chr11	135086622	assembled	chr11	primary	11,CM000673.2,NC_000011.10
11	chr12	133275309	assembled	chr12	primary	12,CM000674.2,NC_000012.12
12	chr13	114364328	assembled	chr13	primary	13,CM000675.2,NC_000013.11
13	chr14	107043718	assembled	chr14	primary	14,CM000676.2,NC_000014.9
14	chr15	101991189	assembled	chr15	primary	15,CM000677.2,NC_000015.10
15	chr16	90338345	assembled	chr16	primary	16,CM000678.2,NC_000016.10
16	chr17	83257441	assembled	chr17	primary	17,CM000679.2,NC_000017.11
17	chr18	80373285	assembled	chr18	primary	18,CM000680.2,NC_000018.10
18	chr19	58617616	assembled	chr19	primary	19,CM000681.2,NC_000019.10
19	chr20	64444167	assembled	chr20	primary	20,CM000682.2,NC_000020.11
20	chr21	46709983	assembled	chr21	primary	21,CM000683.2,NC_000021.9
21	chr22	50818468	assembled	chr22	primary	22,CM000684.2,NC_000022.11
22	chrX	156040895	assembled	chrX	primary	X,CM000685.2,NC_000023.11
23	chrY	57227415	assembled	chrY	primary	Y,CM000686.2,NC_000024.10
24	chrM	16569	assembled	chrM	non-nuclear	MT,J01415.2,NC_012920.1

hg38.chromsizes

name
chr1     248956422
chr2     242193529
chr3     198295559
chr4     190214555
chr5     181538259
chr6     170805979
chr7     159345973
chr8     145138636
chr9     138394717
chr10    133797422
chr11    135086622
chr12    133275309
chr13    114364328
chr14    107043718
chr15    101991189
chr16     90338345
chr17     83257441
chr18     80373285
chr19     58617616
chr20     64444167
chr21     46709983
chr22     50818468
chrX     156040895
chrY      57227415
chrM         16569
Name: length, dtype: int64

hg38.alias_dict["MT"]

'chrM'
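
An alias dictionary of this kind can be built from the comma-separated aliases column of the seqinfo table. The sketch below shows the idea in plain Python using two rows copied from the table above. The helper build_alias_dict is a hypothetical name, and mapping each canonical name to itself is an assumption made for convenience, not a documented property of hg38.alias_dict.

```python
def build_alias_dict(rows):
    """Map every alias (and the canonical name itself) to the canonical name."""
    alias_to_name = {}
    for name, aliases in rows:
        # Assumption: canonical names resolve to themselves.
        alias_to_name[name] = name
        for alias in aliases.split(","):
            if alias:  # skip the empty string produced by a blank aliases field
                alias_to_name[alias] = name
    return alias_to_name


# (name, aliases) pairs taken from the hg38 seqinfo table above.
seqinfo_rows = [
    ("chr1", "1,CM000663.2,NC_000001.11"),
    ("chrM", "MT,J01415.2,NC_012920.1"),
]
alias_dict = build_alias_dict(seqinfo_rows)

# Translate names from another naming convention to the canonical one.
other_names = ["1", "MT", "NC_000001.11"]
print([alias_dict[n] for n in other_names])
```

A lookup table like this is what makes it practical to harmonize interval files that follow different providers' naming conventions before combining them.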

bioframe.assembly_info("hg38", roles="all").seqinfo

	name	length	role	molecule	unit	aliases
0	chr1	248956422	assembled	chr1	primary	1,CM000663.2,NC_000001.11
1	chr2	242193529	assembled	chr2	primary	2,CM000664.2,NC_000002.12
2	chr3	198295559	assembled	chr3	primary	3,CM000665.2,NC_000003.12
3	chr4	190214555	assembled	chr4	primary	4,CM000666.2,NC_000004.12
4	chr5	181538259	assembled	chr5	primary	5,CM000667.2,NC_000005.10
...	...	...	...	...	...	...
189	chrUn_KI270753v1	62944	unplaced	NaN	primary	HSCHRUN_RANDOM_CTG30,KI270753.1,NT_187508.1
190	chrUn_KI270754v1	40191	unplaced	NaN	primary	HSCHRUN_RANDOM_CTG33,KI270754.1,NT_187509.1
191	chrUn_KI270755v1	36723	unplaced	NaN	primary	HSCHRUN_RANDOM_CTG34,KI270755.1,NT_187510.1
192	chrUn_KI270756v1	79590	unplaced	NaN	primary	HSCHRUN_RANDOM_CTG35,KI270756.1,NT_187511.1
193	chrUn_KI270757v1	71251	unplaced	NaN	primary	HSCHRUN_RANDOM_CTG36,KI270757.1,NT_187512.1

194 rows × 6 columns

Contributing metadata for a new assembly build

To contribute a new assembly build to bioframe’s internal metadata registry, make a pull request with the following items:

Add a record to the assembly manifest file located at bioframe/io/data/_assemblies.yml. Required fields are as shown in the example below.
Create a seqinfo.tsv file for the new assembly build and place it in bioframe/io/data. Reference the exact file name in the manifest record’s seqinfo field. The seqinfo is a tab-delimited file with a required header line as shown in the example below.
Optionally, add a cytoband.tsv file adapted from a cytoBand.txt file from UCSC.

Note that we currently do not include sequences with alt or patch roles in seqinfo files.

Example

Metadata for the mouse mm9 assembly build as provided by UCSC.

_assemblies.yml

...
- organism: mus musculus
  provider: ucsc
  provider_build: mm9
  release_year: 2007
  seqinfo: mm9.seqinfo.tsv
  default_roles: [assembled]
  default_units: [primary, non-nuclear]
  url: https://hgdownload.soe.ucsc.edu/goldenPath/mm9/bigZips/
...

mm9.seqinfo.tsv

name	length	role	molecule	unit	aliases
chr1	197195432	assembled	chr1	primary	1,CM000994.1,NC_000067.5
chr2	181748087	assembled	chr2	primary	2,CM000995.1,NC_000068.6
chr3	159599783	assembled	chr3	primary	3,CM000996.1,NC_000069.5
chr4	155630120	assembled	chr4	primary	4,CM000997.1,NC_000070.5
chr5	152537259	assembled	chr5	primary	5,CM000998.1,NC_000071.5
chr6	149517037	assembled	chr6	primary	6,CM000999.1,NC_000072.5
chr7	152524553	assembled	chr7	primary	7,CM001000.1,NC_000073.5
chr8	131738871	assembled	chr8	primary	8,CM001001.1,NC_000074.5
chr9	124076172	assembled	chr9	primary	9,CM001002.1,NC_000075.5
chr10	129993255	assembled	chr10	primary	10,CM001003.1,NC_000076.5
chr11	121843856	assembled	chr11	primary	11,CM001004.1,NC_000077.5
chr12	121257530	assembled	chr12	primary	12,CM001005.1,NC_000078.5
chr13	120284312	assembled	chr13	primary	13,CM001006.1,NC_000079.5
chr14	125194864	assembled	chr14	primary	14,CM001007.1,NC_000080.5
chr15	103494974	assembled	chr15	primary	15,CM001008.1,NC_000081.5
chr16	98319150	assembled	chr16	primary	16,CM001009.1,NC_000082.5
chr17	95272651	assembled	chr17	primary	17,CM001010.1,NC_000083.5
chr18	90772031	assembled	chr18	primary	18,CM001011.1,NC_000084.5
chr19	61342430	assembled	chr19	primary	19,CM001012.1,NC_000085.5
chrX	166650296	assembled	chrX	primary	X,CM001013.1,NC_000086.6
chrY	15902555	assembled	chrY	primary	Y,CM001014.1,NC_000087.6
chrM	16299	assembled	chrM	non-nuclear	MT,AY172335.1,NC_005089.1
chr1_random	1231697	unlocalized	chr1	primary	
chr3_random	41899	unlocalized	chr3	primary	
chr4_random	160594	unlocalized	chr4	primary	
chr5_random	357350	unlocalized	chr5	primary	
chr7_random	362490	unlocalized	chr7	primary	
chr8_random	849593	unlocalized	chr8	primary	
chr9_random	449403	unlocalized	chr9	primary	
chr13_random	400311	unlocalized	chr13	primary	
chr16_random	3994	unlocalized	chr16	primary	
chr17_random	628739	unlocalized	chr17	primary	
chrX_random	1785075	unlocalized	chrX	primary	
chrY_random	58682461	unlocalized	chrY	primary	
chrUn_random	5900358	unplaced		primary	
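
Before opening a pull request, it can help to sanity-check a new seqinfo file. The sketch below is one possible pre-submission check, not part of bioframe: it verifies that the header matches the one shown in the example, and, as additional assumptions of this sketch, that names are unique and lengths are positive integers. The helper check_seqinfo is a hypothetical name.

```python
import csv
import io

# Header line as shown in the mm9 example above.
REQUIRED_COLUMNS = ["name", "length", "role", "molecule", "unit", "aliases"]


def check_seqinfo(text):
    """Return a list of human-readable problems found in seqinfo TSV text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    if reader.fieldnames != REQUIRED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    seen = set()
    for row in reader:
        line = reader.line_num
        name = row["name"] or ""
        length = row["length"] or ""  # short rows yield None for missing fields
        if not name:
            problems.append(f"line {line}: missing name")
        elif name in seen:
            problems.append(f"line {line}: duplicate name {name!r}")
        seen.add(name)
        if not length.isdigit() or int(length) <= 0:
            problems.append(f"line {line}: invalid length {length!r}")
    return problems


good = (
    "name\tlength\trole\tmolecule\tunit\taliases\n"
    "chr1\t197195432\tassembled\tchr1\tprimary\t1,CM000994.1,NC_000067.5\n"
    "chr1_random\t1231697\tunlocalized\tchr1\tprimary\t\n"
)
print(check_seqinfo(good))
```

Empty molecule or aliases fields, as in the chrUn_random and chr1_random rows of the example, are accepted by this check.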

Reading other genomic formats

See the docs for File I/O for other supported file formats.