{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading genomic dataframes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import bioframe"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Bioframe provides multiple methods to convert data stored in common genomic file formats to pandas dataFrames in `bioframe.io`.\n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reading tabular text data"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The most common need is to read tablular data, which can be accomplished with `bioframe.read_table`. This function wraps pandas `pandas.read_csv`/`pandas.read_table` (tab-delimited by default), but allows the user to easily pass a **schema** (i.e. list of pre-defined column names) for common genomic interval-based file formats. \n",
    "\n",
    "For example, "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = bioframe.read_table(\n",
    "    \"https://www.encodeproject.org/files/ENCFF001XKR/@@download/ENCFF001XKR.bed.gz\",\n",
    "    schema=\"bed9\",\n",
    ")\n",
    "display(df[0:3])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = bioframe.read_table(\n",
    "    \"https://www.encodeproject.org/files/ENCFF401MQL/@@download/ENCFF401MQL.bed.gz\",\n",
    "    schema=\"narrowPeak\",\n",
    ")\n",
    "display(df[0:3])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = bioframe.read_table(\n",
    "    \"https://www.encodeproject.org/files/ENCFF001VRS/@@download/ENCFF001VRS.bed.gz\",\n",
    "    schema=\"bed12\",\n",
    ")\n",
    "display(df[0:3])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `schema` argument looks up file type from a registry of schemas stored in the `bioframe.SCHEMAS` dictionary:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bioframe.SCHEMAS[\"bed6\"]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### UCSC Big Binary Indexed files (BigWig, BigBed)\n",
    "\n",
    "Bioframe also has convenience functions for reading and writing bigWig and bigBed binary files to and from pandas DataFrames."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bw_url = \"http://genome.ucsc.edu/goldenPath/help/examples/bigWigExample.bw\"\n",
    "df = bioframe.read_bigwig(bw_url, \"chr21\", start=10_000_000, end=10_010_000)\n",
    "df.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"value\"] *= 100\n",
    "df.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chromsizes = bioframe.fetch_chromsizes(\"hg19\")\n",
    "# bioframe.to_bigwig(df, chromsizes, 'times100.bw')\n",
    "\n",
    "# note: requires UCSC bedGraphToBigWig binary, which can be installed as\n",
    "# !conda install -y -c bioconda ucsc-bedgraphtobigwig"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bb_url = \"http://genome.ucsc.edu/goldenPath/help/examples/bigBedExample.bb\"\n",
    "bioframe.read_bigbed(bb_url, \"chr21\", start=48000000).head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reading genome assembly information"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The most fundamental information about a genome assembly is its set of chromosome sizes.\n",
    "\n",
    "    "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Bioframe provides functions to read chromosome sizes file as `pandas.Series`, with some useful filtering and sorting options:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bioframe.read_chromsizes(\n",
    "    \"https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bioframe.read_chromsizes(\n",
    "    \"https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes\",\n",
    "    filter_chroms=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dm6_url = \"https://hgdownload.soe.ucsc.edu/goldenPath/dm6/database/chromInfo.txt.gz\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bioframe.read_chromsizes(\n",
    "    dm6_url,\n",
    "    filter_chroms=True,\n",
    "    chrom_patterns=(\"^chr2L$\", \"^chr2R$\", \"^chr3L$\", \"^chr3R$\", \"^chr4$\", \"^chrX$\"),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bioframe.read_chromsizes(\n",
    "    dm6_url, chrom_patterns=[r\"^chr\\d+L$\", r\"^chr\\d+R$\", \"^chr4$\", \"^chrX$\", \"^chrM$\"]\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Bioframe provides a convenience function to fetch chromosome sizes from UCSC given an assembly name:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chromsizes = bioframe.fetch_chromsizes(\"hg38\")\n",
    "chromsizes[-5:]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Bioframe can also generate a list of centromere positions using information from some UCSC assemblies:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display(bioframe.fetch_centromeres(\"hg38\")[:3])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These functions are just wrappers for a UCSC client. Users can also use `UCSCClient` directly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "client = bioframe.UCSCClient(\"hg38\")\n",
    "client.fetch_cytoband()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Curated genome assembly build information\n",
    "\n",
    "_New in v0.5.0_\n",
    "\n",
    "Bioframe also has locally stored information for common genome assembly builds. \n",
    "\n",
    "For a given provider and assembly build, this API provides additional sequence metadata:\n",
    "\n",
    "* A canonical **name** for every sequence, usually opting for UCSC-style naming.\n",
    "* A canonical **ordering** of the sequences.\n",
    "* Each sequence's **length**.\n",
    "* An **alias dictionary** mapping alternative names or aliases to the canonical sequence name.\n",
    "* Each sequence is assigned to an assembly **unit**: e.g., primary, non-nuclear, decoy.\n",
    "* Each sequence is assigned a **role**: e.g., assembled molecule, unlocalized, unplaced."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bioframe.assemblies_available()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hg38 = bioframe.assembly_info(\"hg38\")\n",
    "print(hg38.provider, hg38.provider_build)\n",
    "hg38.seqinfo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hg38.chromsizes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hg38.alias_dict[\"MT\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bioframe.assembly_info(\"hg38\", roles=\"all\").seqinfo"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Contributing metadata for a new assembly build\n",
    "\n",
    "To contribute a new assembly build to bioframe's internal metadata registry, make a pull request with the following items:\n",
    "\n",
    "1. Add a record to the assembly manifest file located at `bioframe/io/data/_assemblies.yml`. Required fields are as shown in the example below.\n",
    "2. Create a `seqinfo.tsv` file for the new assembly build and place it in `bioframe/io/data`. Reference the exact file name in the manifest record's `seqinfo` field. The seqinfo is a tab-delimited file with a required header line as shown in the example below.\n",
    "3. Optionally, a `cytoband.tsv` file adapted from a `cytoBand.txt` file from UCSC.\n",
    "\n",
    "Note that we currently do not include sequences with alt or patch roles in seqinfo files.\n",
    "\n",
    "#### Example\n",
    "\n",
    "Metadata for the mouse mm9 assembly build as provided by UCSC.\n",
    "\n",
    "`_assemblies.yml`\n",
    "\n",
    "> ```\n",
    "> ...\n",
    "> - organism: mus musculus\n",
    ">   provider: ucsc\n",
    ">   provider_build: mm9\n",
    ">   release_year: 2007\n",
    ">   seqinfo: mm9.seqinfo.tsv\n",
    ">   default_roles: [assembled]\n",
    ">   default_units: [primary, non-nuclear]\n",
    ">   url: https://hgdownload.soe.ucsc.edu/goldenPath/mm9/bigZips/\n",
    "> ...\n",
    "> ```\n",
    "\n",
    "`mm9.seqinfo.tsv`\n",
    "\n",
    "> ```\n",
    "> name\tlength\trole\tmolecule\tunit\taliases\n",
    "> chr1\t197195432\tassembled\tchr1\tprimary\t1,CM000994.1,NC_000067.5\n",
    "> chr2\t181748087\tassembled\tchr2\tprimary\t2,CM000995.1,NC_000068.6\n",
    "> chr3\t159599783\tassembled\tchr3\tprimary\t3,CM000996.1,NC_000069.5\n",
    "> chr4\t155630120\tassembled\tchr4\tprimary\t4,CM000997.1,NC_000070.5\n",
    "> chr5\t152537259\tassembled\tchr5\tprimary\t5,CM000998.1,NC_000071.5\n",
    "> chr6\t149517037\tassembled\tchr6\tprimary\t6,CM000999.1,NC_000072.5\n",
    "> chr7\t152524553\tassembled\tchr7\tprimary\t7,CM001000.1,NC_000073.5\n",
    "> chr8\t131738871\tassembled\tchr8\tprimary\t8,CM001001.1,NC_000074.5\n",
    "> chr9\t124076172\tassembled\tchr9\tprimary\t9,CM001002.1,NC_000075.5\n",
    "> chr10\t129993255\tassembled\tchr10\tprimary\t10,CM001003.1,NC_000076.5\n",
    "> chr11\t121843856\tassembled\tchr11\tprimary\t11,CM001004.1,NC_000077.5\n",
    "> chr12\t121257530\tassembled\tchr12\tprimary\t12,CM001005.1,NC_000078.5\n",
    "> chr13\t120284312\tassembled\tchr13\tprimary\t13,CM001006.1,NC_000079.5\n",
    "> chr14\t125194864\tassembled\tchr14\tprimary\t14,CM001007.1,NC_000080.5\n",
    "> chr15\t103494974\tassembled\tchr15\tprimary\t15,CM001008.1,NC_000081.5\n",
    "> chr16\t98319150\tassembled\tchr16\tprimary\t16,CM001009.1,NC_000082.5\n",
    "> chr17\t95272651\tassembled\tchr17\tprimary\t17,CM001010.1,NC_000083.5\n",
    "> chr18\t90772031\tassembled\tchr18\tprimary\t18,CM001011.1,NC_000084.5\n",
    "> chr19\t61342430\tassembled\tchr19\tprimary\t19,CM001012.1,NC_000085.5\n",
    "> chrX\t166650296\tassembled\tchrX\tprimary\tX,CM001013.1,NC_000086.6\n",
    "> chrY\t15902555\tassembled\tchrY\tprimary\tY,CM001014.1,NC_000087.6\n",
    "> chrM\t16299\tassembled\tchrM\tnon-nuclear\tMT,AY172335.1,NC_005089.1\n",
    "> chr1_random\t1231697\tunlocalized\tchr1\tprimary\t\n",
    "> chr3_random\t41899\tunlocalized\tchr3\tprimary\t\n",
    "> chr4_random\t160594\tunlocalized\tchr4\tprimary\t\n",
    "> chr5_random\t357350\tunlocalized\tchr5\tprimary\t\n",
    "> chr7_random\t362490\tunlocalized\tchr7\tprimary\t\n",
    "> chr8_random\t849593\tunlocalized\tchr8\tprimary\t\n",
    "> chr9_random\t449403\tunlocalized\tchr9\tprimary\t\n",
    "> chr13_random\t400311\tunlocalized\tchr13\tprimary\t\n",
    "> chr16_random\t3994\tunlocalized\tchr16\tprimary\t\n",
    "> chr17_random\t628739\tunlocalized\tchr17\tprimary\t\n",
    "> chrX_random\t1785075\tunlocalized\tchrX\tprimary\t\n",
    "> chrY_random\t58682461\tunlocalized\tchrY\tprimary\t\n",
    "> chrUn_random\t5900358\tunplaced\t\tprimary\t\n",
    "> ```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reading other genomic formats"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "See the [docs for File I/O](https://bioframe.readthedocs.io/en/latest/api-fileops.html) for other supported file formats."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}