{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Reading genomic dataframes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import bioframe" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Bioframe provides multiple methods in `bioframe.io` to convert data stored in common genomic file formats to pandas DataFrames.\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Reading tabular text data" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The most common need is to read tabular data, which can be accomplished with `bioframe.read_table`. This function wraps `pandas.read_csv`/`pandas.read_table` (tab-delimited by default), but lets the user easily pass a **schema** (i.e., a list of pre-defined column names) for common genomic interval-based file formats. \n", "\n", "For example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = bioframe.read_table(\n", " \"https://www.encodeproject.org/files/ENCFF001XKR/@@download/ENCFF001XKR.bed.gz\",\n", " schema=\"bed9\",\n", ")\n", "display(df[0:3])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = bioframe.read_table(\n", " \"https://www.encodeproject.org/files/ENCFF401MQL/@@download/ENCFF401MQL.bed.gz\",\n", " schema=\"narrowPeak\",\n", ")\n", "display(df[0:3])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = bioframe.read_table(\n", " \"https://www.encodeproject.org/files/ENCFF001VRS/@@download/ENCFF001VRS.bed.gz\",\n", " schema=\"bed12\",\n", ")\n", "display(df[0:3])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The `schema` argument looks up the file type in a registry of schemas stored in the `bioframe.SCHEMAS` dictionary:" ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bioframe.SCHEMAS[\"bed6\"]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### UCSC Big Binary Indexed files (BigWig, BigBed)\n", "\n", "Bioframe also has convenience functions for reading and writing bigWig and bigBed binary files to and from pandas DataFrames." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bw_url = \"http://genome.ucsc.edu/goldenPath/help/examples/bigWigExample.bw\"\n", "df = bioframe.read_bigwig(bw_url, \"chr21\", start=10_000_000, end=10_010_000)\n", "df.head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"value\"] *= 100\n", "df.head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "chromsizes = bioframe.fetch_chromsizes(\"hg19\")\n", "# bioframe.to_bigwig(df, chromsizes, 'times100.bw')\n", "\n", "# note: requires UCSC bedGraphToBigWig binary, which can be installed as\n", "# !conda install -y -c bioconda ucsc-bedgraphtobigwig" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bb_url = \"http://genome.ucsc.edu/goldenPath/help/examples/bigBedExample.bb\"\n", "bioframe.read_bigbed(bb_url, \"chr21\", start=48000000).head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Reading genome assembly information" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The most fundamental information about a genome assembly is its set of chromosome sizes.\n", "\n", " " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Bioframe provides functions to read a chromosome sizes file as a `pandas.Series`, with some useful filtering and sorting options:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"bioframe.read_chromsizes(\n", " \"https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bioframe.read_chromsizes(\n", " \"https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes\",\n", " filter_chroms=False,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dm6_url = \"https://hgdownload.soe.ucsc.edu/goldenPath/dm6/database/chromInfo.txt.gz\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bioframe.read_chromsizes(\n", " dm6_url,\n", " filter_chroms=True,\n", " chrom_patterns=(\"^chr2L$\", \"^chr2R$\", \"^chr3L$\", \"^chr3R$\", \"^chr4$\", \"^chrX$\"),\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bioframe.read_chromsizes(\n", " dm6_url, chrom_patterns=[r\"^chr\\d+L$\", r\"^chr\\d+R$\", \"^chr4$\", \"^chrX$\", \"^chrM$\"]\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Bioframe provides a convenience function to fetch chromosome sizes from UCSC given an assembly name:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "chromsizes = bioframe.fetch_chromsizes(\"hg38\")\n", "chromsizes[-5:]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Bioframe can also generate a list of centromere positions using information from some UCSC assemblies:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "display(bioframe.fetch_centromeres(\"hg38\")[:3])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "These functions are just wrappers for a UCSC client. 
Users can also use `UCSCClient` directly:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "client = bioframe.UCSCClient(\"hg38\")\n", "client.fetch_cytoband()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Curated genome assembly build information\n", "\n", "_New in v0.5.0_\n", "\n", "Bioframe also has locally stored information for common genome assembly builds. \n", "\n", "For a given provider and assembly build, this API provides additional sequence metadata:\n", "\n", "* A canonical **name** for every sequence, usually opting for UCSC-style naming.\n", "* A canonical **ordering** of the sequences.\n", "* Each sequence's **length**.\n", "* An **alias dictionary** mapping alternative names or aliases to the canonical sequence name.\n", "* Each sequence is assigned to an assembly **unit**: e.g., primary, non-nuclear, decoy.\n", "* Each sequence is assigned a **role**: e.g., assembled molecule, unlocalized, unplaced." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bioframe.assemblies_available()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hg38 = bioframe.assembly_info(\"hg38\")\n", "print(hg38.provider, hg38.provider_build)\n", "hg38.seqinfo" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hg38.chromsizes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hg38.alias_dict[\"MT\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bioframe.assembly_info(\"hg38\", roles=\"all\").seqinfo" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Contributing metadata for a new assembly build\n", "\n", "To contribute a new assembly build to bioframe's internal metadata registry, make a pull request with the following items:\n", "\n", "1. Add a record to the assembly manifest file located at `bioframe/io/data/_assemblies.yml`. Required fields are as shown in the example below.\n", "2. Create a `seqinfo.tsv` file for the new assembly build and place it in `bioframe/io/data`. Reference the exact file name in the manifest record's `seqinfo` field. The seqinfo is a tab-delimited file with a required header line as shown in the example below.\n", "3. 
Optionally, add a `cytoband.tsv` file adapted from a UCSC `cytoBand.txt` file.\n", "\n", "Note that we currently do not include sequences with alt or patch roles in seqinfo files.\n", "\n", "#### Example\n", "\n", "Metadata for the mouse mm9 assembly build as provided by UCSC.\n", "\n", "`_assemblies.yml`\n", "\n", "> ```\n", "> ...\n", "> - organism: mus musculus\n", "> provider: ucsc\n", "> provider_build: mm9\n", "> release_year: 2007\n", "> seqinfo: mm9.seqinfo.tsv\n", "> default_roles: [assembled]\n", "> default_units: [primary, non-nuclear]\n", "> url: https://hgdownload.soe.ucsc.edu/goldenPath/mm9/bigZips/\n", "> ...\n", "> ```\n", "\n", "`mm9.seqinfo.tsv`\n", "\n", "> ```\n", "> name\tlength\trole\tmolecule\tunit\taliases\n", "> chr1\t197195432\tassembled\tchr1\tprimary\t1,CM000994.1,NC_000067.5\n", "> chr2\t181748087\tassembled\tchr2\tprimary\t2,CM000995.1,NC_000068.6\n", "> chr3\t159599783\tassembled\tchr3\tprimary\t3,CM000996.1,NC_000069.5\n", "> chr4\t155630120\tassembled\tchr4\tprimary\t4,CM000997.1,NC_000070.5\n", "> chr5\t152537259\tassembled\tchr5\tprimary\t5,CM000998.1,NC_000071.5\n", "> chr6\t149517037\tassembled\tchr6\tprimary\t6,CM000999.1,NC_000072.5\n", "> chr7\t152524553\tassembled\tchr7\tprimary\t7,CM001000.1,NC_000073.5\n", "> chr8\t131738871\tassembled\tchr8\tprimary\t8,CM001001.1,NC_000074.5\n", "> chr9\t124076172\tassembled\tchr9\tprimary\t9,CM001002.1,NC_000075.5\n", "> chr10\t129993255\tassembled\tchr10\tprimary\t10,CM001003.1,NC_000076.5\n", "> chr11\t121843856\tassembled\tchr11\tprimary\t11,CM001004.1,NC_000077.5\n", "> chr12\t121257530\tassembled\tchr12\tprimary\t12,CM001005.1,NC_000078.5\n", "> chr13\t120284312\tassembled\tchr13\tprimary\t13,CM001006.1,NC_000079.5\n", "> chr14\t125194864\tassembled\tchr14\tprimary\t14,CM001007.1,NC_000080.5\n", "> chr15\t103494974\tassembled\tchr15\tprimary\t15,CM001008.1,NC_000081.5\n", "> chr16\t98319150\tassembled\tchr16\tprimary\t16,CM001009.1,NC_000082.5\n", "> 
chr17\t95272651\tassembled\tchr17\tprimary\t17,CM001010.1,NC_000083.5\n", "> chr18\t90772031\tassembled\tchr18\tprimary\t18,CM001011.1,NC_000084.5\n", "> chr19\t61342430\tassembled\tchr19\tprimary\t19,CM001012.1,NC_000085.5\n", "> chrX\t166650296\tassembled\tchrX\tprimary\tX,CM001013.1,NC_000086.6\n", "> chrY\t15902555\tassembled\tchrY\tprimary\tY,CM001014.1,NC_000087.6\n", "> chrM\t16299\tassembled\tchrM\tnon-nuclear\tMT,AY172335.1,NC_005089.1\n", "> chr1_random\t1231697\tunlocalized\tchr1\tprimary\t\n", "> chr3_random\t41899\tunlocalized\tchr3\tprimary\t\n", "> chr4_random\t160594\tunlocalized\tchr4\tprimary\t\n", "> chr5_random\t357350\tunlocalized\tchr5\tprimary\t\n", "> chr7_random\t362490\tunlocalized\tchr7\tprimary\t\n", "> chr8_random\t849593\tunlocalized\tchr8\tprimary\t\n", "> chr9_random\t449403\tunlocalized\tchr9\tprimary\t\n", "> chr13_random\t400311\tunlocalized\tchr13\tprimary\t\n", "> chr16_random\t3994\tunlocalized\tchr16\tprimary\t\n", "> chr17_random\t628739\tunlocalized\tchr17\tprimary\t\n", "> chrX_random\t1785075\tunlocalized\tchrX\tprimary\t\n", "> chrY_random\t58682461\tunlocalized\tchrY\tprimary\t\n", "> chrUn_random\t5900358\tunplaced\t\tprimary\t\n", "> ```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Reading other genomic formats" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "See the [docs for File I/O](https://bioframe.readthedocs.io/en/latest/api-fileops.html) for other supported file formats." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" } }, "nbformat": 4, "nbformat_minor": 4 }