Skip to content

Use Case: Hi-C Analysis

Overview

This use case demonstrates how to use Coala to perform query-driven Hi-C analysis, including visualizing chromatin contact maps (cooler), calculating expected cis-contacts and compartment profiles, visualizing compartmentalization with saddle plots, and identifying boundaries between topologically associating domains (cooltools). We use a multi-resolution mcool file (test.mcool) containing Micro-C data for chromosomes 2 and 17 of HFF cell line , together with a view BED (view_hg38.tsv) and GC content track (gc.100000.tsv).

Setup

MCP Server Configuration

Start an MCP server with Hi-C analysis tools as shown in examples/Hi-C/Hi-C_mp.py:

python
from coala.mcp_api import mcp_api
import os
base_dir = os.path.dirname(__file__)

mcp = mcp_api(host='0.0.0.0', port=8000)
mcp.add_tool(os.path.join(base_dir, 'cooler_dump.cwl'))
mcp.add_tool(os.path.join(base_dir, 'cooler_show.cwl'))
mcp.add_tool(os.path.join(base_dir, 'cooltools_eigs_cis.cwl'))
mcp.add_tool(os.path.join(base_dir, 'cooltools_expected_cis.cwl'))
mcp.add_tool(os.path.join(base_dir, 'cooltools_genome_gc.cwl'))
mcp.add_tool(os.path.join(base_dir, 'cooltools_insulation.cwl'))
mcp.add_tool(os.path.join(base_dir, 'cooltools_saddle.cwl'))
mcp.serve()

This server exposes tools from cooler and cooltools:

  • cooler_show: Display contact matrices for specified chromosomes and resolutions
  • cooltools_expected_cis: Calculate expected Hi-C signal for cis regions (decay curve)
  • cooltools_eigs_cis: Compute compartment profiles via eigenvalue decomposition with optional GC phasing
  • cooltools_saddle: Generate saddle plots from eigenvectors and expected contacts
  • cooltools_insulation: Calculate diamond insulation scores and call topologically associating domains (TAD) boundaries

MCP Client Configuration

Configure your MCP client (e.g., Claude Desktop, Cursor) to connect to the MCP server:

json
{
    "mcpServers": {
        "user-hi_c": {
            "command": "python",
            "args": ["/path/to/examples/Hi-C/Hi-C_mp.py"]
        }
    }
}

Note: Replace /path/to/examples/Hi-C/Hi-C_mp.py with the actual path to the Hi-C_mp.py file in your workspace.

Data Sets

The data file (test.mcool) is from the Micro-C study on HFF cell line (see cooltools external test files). Place view_hg38.tsv, gc.100000.tsv, and bins.100000.tsv from the examples/Hi-C directory in the same working directory as test.mcool for the full workflow.

Use Case Workflow

The workflow uses test.mcool at multiple resolutions (10 kb, 100 kb, 1 Mb) and the view file view_hg38.tsv defining chr2_p, chr2_q, chr17_p, and chr17_q.

Step 1: Show Contact Map

User Query:

Here is a mcool file for the Hi-C data: test.mcool. Show the contact map on chromosome 2 with 1Mb resolution. Save the output to working directory.

Tool Invocation:

json
{
  "tool": "cooler_show",
  "parameters": {
    "data": [
      {
        "fcool": "/path/to/examples/Hi-C/test.mcool",
        "uri": "resolutions/1000000",
        "range": "chr2:0-242193529",
        "output": "/path/to/working_dir/contact_map_chr2_1mb.png"
      }
    ]
  }
}

Tool Response:

json
{
  "out": {
    "location": "file:///var/folders/.../contact_map_chr2_1mb.png",
    "basename": "contact_map_chr2_1mb.png",
    "class": "File",
    "size": 164916
  },
  "tool_name": "cooler_show",
  "tool_version": "quay.io/biocontainers/cooltools:0.7.1--py311h93dcfea_3"
}

Chat Response:

The chromosome 2 contact map at 1 Mb resolution has been generated and saved to your working directory as contact_map_chr2_1mb.png.

The image is a 242×242 contact matrix for chr2 (0–242,193,529 bp), showing Hi-C contact frequency. The MCP tool wrote the figure to a temp path; it was then copied to the working directory.

chromosome 2 contact map

User Query:

How about chr17?

Tool Invocation:

json
{
  "tool": "cooler_show",
  "parameters": {
    "data": [
      {
        "fcool": "/path/to/examples/Hi-C/test.mcool",
        "uri": "resolutions/1000000",
        "range": "chr17:0-83257441",
        "output": "/path/to/working_dir/contact_map_chr17_1mb.png"
      }
    ]
  }
}

Tool Response:

json
{
  "out": {
    "location": "file:///var/folders/.../contact_map_chr17_1mb.png",
    "basename": "contact_map_chr17_1mb.png",
    "class": "File",
    "size": 52278
  },
  "tool_name": "cooler_show",
  "tool_version": "quay.io/biocontainers/cooltools:0.7.1--py311h93dcfea_3"
}

Chat Response:

The chromosome 17 contact map at 1 Mb resolution has been generated and saved as contact_map_chr17_1mb.png. Chr17 is shorter than chr2, so the matrix is smaller (~83×83 bins).

chromosome 17 contact map

The contact map is a 2D heatmap of Hi-C contact frequency, reflecting 3D genome folding. Warmer colors indicate higher contact frequency; structures such as compartments and TADs appear as patterns in the matrix.


Step 2: Calculate Expected Cis-Contacts

User Query:

Now, let's calculate the expected contacts in 100k bp resolution in the regions defined in view_hg38.tsv file.

Tool Invocation:

json
{
  "tool": "cooltools_expected_cis",
  "parameters": {
    "data": [
      {
        "cool_file": "/path/to/examples/Hi-C/test.mcool",
        "uri": "resolutions/100000",
        "output_filename": "/path/to/working_dir/expected_100kb.tsv",
        "view_bed": "/path/to/examples/Hi-C/view_hg38.tsv"
      }
    ]
  }
}

Tool Response:

json
{
  "expected_tsv": {
    "location": "file:///var/folders/.../expected_100kb.tsv",
    "basename": "expected_100kb.tsv",
    "class": "File",
    "size": 397532
  },
  "tool_name": "cooltools_expected_cis",
  "tool_version": "quay.io/biocontainers/cooltools:0.7.1--py311h93dcfea_3"
}

Chat Response:

Expected cis contacts were computed at 100 kb resolution for the regions in view_hg38.tsv (chr2_p, chr2_q, chr17_p, chr17_q) and saved to expected_100kb.tsv.

The TSV contains columns: region1, region2, dist, dist_bp, contact_frequency, n_total, n_valid, count.sum, balanced.sum, count.avg, balanced.avg. Each row is a region pair and genomic distance bin; contact_frequency is the expected (balanced) contact value used for normalization and saddle plots.

User Query:

Please show me the first 10 rows of the calculated expected contacts result.

Chat Response:

First 10 data rows of expected_100kb.tsv (header + 10 rows):

region1region2distdist_bpcontact_frequencyn_totaln_validcount.sumbalanced.sumcount.avgbalanced.avg
chr2_pchr2_p00nan939878nannannannan
chr2_pchr2_p1100000nan938876nannannannan
chr2_pchr2_p22000000.07479378742738583.065.292922.710.0747
chr2_pchr2_p33000000.04709368721739972.041.011858.940.0470
chr2_pchr2_p44000000.03279358701184707.028.471267.070.0327
chr2_pchr2_p55000000.0244934868868701.021.19930.090.0244
chr2_pchr2_p66000000.0191933866669908.016.51718.020.0191
chr2_pchr2_p77000000.0154932864528838.013.29567.420.0154
chr2_pchr2_p88000000.0127931862430768.010.93462.690.0127
chr2_pchr2_p99000000.0107930860358274.09.16385.240.0107

These are chr2_p vs chr2_p at 0–900 kb; contact_frequency decreases with distance. Rows with dist 0 and 1 are nan (diagonal/adjacent often excluded).


Step 3: Generate Compartment Profiles

User Query:

Next, we want to obtain the compartment profiles in 100k bp resolution. Let's use the GC content profile in gc.100000.tsv.

Tool Invocation:

json
{
  "tool": "cooltools_eigs_cis",
  "parameters": {
    "data": [
      {
        "cool_file": "/path/to/examples/Hi-C/test.mcool",
        "uri": "resolutions/100000",
        "out_prefix": "/path/to/working_dir/compartment_100kb",
        "phasing_track_file": "/path/to/examples/Hi-C/gc.100000.tsv",
        "phasing_track_col": "GC",
        "view": "/path/to/examples/Hi-C/view_hg38.tsv"
      }
    ]
  }
}

Tool Response:

json
{
  "bigwig_out": null,
  "eigenvectors": {
    "location": "file:///var/folders/.../compartment_100kb.cis.vecs.tsv",
    "basename": "compartment_100kb.cis.vecs.tsv",
    "class": "File",
    "size": 202492
  },
  "lambdas": {
    "location": "file:///var/folders/.../compartment_100kb.cis.lam.txt",
    "basename": "compartment_100kb.cis.lam.txt",
    "class": "File",
    "size": 213
  },
  "tool_name": "cooltools_eigs_cis",
  "tool_version": "hubentu/cooltools_bedwig:0.7.1_2.10"
}

Chat Response:

Compartment profiles were computed with cooltools eigs-cis using GC from gc.100000.tsv for phasing. Outputs in the working directory:

  1. compartment_100kb.cis.vecs.tsv — Eigenvectors per 100 kb bin: chrom, start, end, weight, E1. E1 is the first principal component (compartment signal), phased so positive E1 aligns with GC-rich (B/active) and negative with GC-poor (A/inactive).
  2. compartment_100kb.cis.lam.txt — Eigenvalue per region (chr2_p, chr2_q, chr17_p, chr17_q).

User Query:

Please show me the first 10 rows of the output file containing eigenvectors. Please only show the first eigenvector.

Chat Response:

First 10 rows of the eigenvectors file, first eigenvector (E1) only:

chromstartendE1
chr20100000-1.567
chr2100000200000-1.751
chr2200000300000-0.368
chr2300000400000-1.325
chr2400000500000-1.432
chr2500000600000-0.971
chr2600000700000-0.157
chr2700000800000-0.458
chr2800000900000-0.772
chr29000001000000-0.475

These are the first 10 bins on chr2 (0–1 Mb). E1 values are negative in this range (A-like compartments); phasing was done with GC from gc.100000.tsv.


Step 4: Visualize Compartmentalization with Saddle Plot

User Query:

Next, we want to visualize the effect of compartmentalization using saddleplot, using our previously computed compartment profiles and expected cis-contacts.

Tool Invocation:

json
{
  "tool": "cooltools_saddle",
  "parameters": {
    "data": [
      {
        "cool_file": "/path/to/examples/Hi-C/test.mcool",
        "uri": "resolutions/100000",
        "expected_file": "/path/to/working_dir/expected_100kb.tsv",
        "expected_col": "balanced.avg",
        "track_file": "/path/to/working_dir/compartment_100kb.cis.vecs.tsv",
        "track_col": "E1",
        "out_prefix": "/path/to/working_dir/saddle_100kb",
        "qrange": [0.02, 0.98],
        "fig_format": "png",
        "view": "/path/to/examples/Hi-C/view_hg38.tsv"
      }
    ]
  }
}

Tool Response:

json
{
  "saddle_data": {
    "location": "file:///var/folders/.../saddle_100kb.saddledump.npz",
    "basename": "saddle_100kb.saddledump.npz",
    "class": "File",
    "size": 91075
  },
  "saddle_plot": {
    "location": "file:///var/folders/.../saddle_100kb.png",
    "basename": "saddle_100kb.png",
    "class": "File",
    "size": 25045
  },
  "saddle_tsv": {
    "location": "file:///var/folders/.../saddle_100kb.digitized.tsv",
    "basename": "saddle_100kb.digitized.tsv",
    "class": "File",
    "size": 87297
  },
  "tool_name": "cooltools_saddle",
  "tool_version": "quay.io/biocontainers/cooltools:0.7.1--py311h93dcfea_3"
}

Chat Response:

The saddle plot was generated with cooltools saddle using the existing eigenvectors (E1) and expected contacts (balanced.avg), with view_hg38.tsv for regions. Outputs saved:

  1. saddle_100kb.png — Saddle plot: contact enrichment (observed/expected) binned by E1 on both axes. A/A and B/B corners show enrichment; A/B and B/A show depletion.
  2. saddle_100kb.saddledump.npz — Saddle matrix data.
  3. saddle_100kb.digitized.tsv — Digitized saddle values in TSV form.

saddleplot 100kb

Saddle plot: inactive–inactive (top-left) and active–active (bottom-right) show higher relative contact frequency (more red); inactive–active show lower frequency. Diagonal enrichment reflects compartment structure.


Step 5: Call topologically associating domains (TAD) boundaries

User Query:

Lastly, let's find the boundaries between TADs. Let's use higher resolution data in 10kb, to define boundaries in window size 100 kb and 200 kb, using Li method for insulation scores.

Tool Invocation:

json
{
  "tool": "cooltools_insulation",
  "parameters": {
    "data": [
      {
        "cool_file": "/path/to/examples/Hi-C/test.mcool",
        "uri": "resolutions/10000",
        "window": [100000, 200000],
        "output_filename": "/path/to/working_dir/insulation_10kb.tsv",
        "threshold": "Li",
        "view": "/path/to/examples/Hi-C/view_hg38.tsv"
      }
    ]
  }
}

Tool Response:

json
{
  "insulation_bigwigs": null,
  "insulation_tsv": {
    "location": "file:///var/folders/.../insulation_10kb.tsv",
    "basename": "insulation_10kb.tsv",
    "class": "File",
    "size": 3556085
  },
  "tool_name": "cooltools_insulation",
  "tool_version": "hubentu/cooltools_bedwig:0.7.1_2.10"
}

Chat Response:

TAD boundaries were computed with cooltools insulation at 10 kb resolution, windows 100 kb and 200 kb, and Li threshold. Output:

insulation_10kb.tsv — One row per 10 kb bin. Columns include chrom, start, end, region, is_bad_bin; log2_insulation_score_100000/200000; n_valid_pixels; boundary_strength_100000/200000; is_boundary_100000 and is_boundary_200000 (True = TAD boundary at that window). Boundaries are called using the Li method on the insulation score distribution.

User Query:

Please show me the first 10 rows of the output file of insulation site analysis.

Chat Response:

First 10 rows of insulation_10kb.tsv:

chromstartendregionis_bad_binlog2_insulation_score_100000n_valid_pixels_100000log2_insulation_score_200000n_valid_pixels_200000boundary_strength_100000boundary_strength_200000is_boundary_100000is_boundary_200000
chr2010000chr2_pTruenan0.0nan0.0nannanFalseFalse
chr21000020000chr2_pFalse0.6928.01.12318.0nannanFalseFalse
chr22000030000chr2_pFalse0.76117.01.19737.0nannanFalseFalse
chr23000040000chr2_pFalse0.76727.01.21257.0nannanFalseFalse
chr24000050000chr2_pFalse0.67537.01.13577.0nannanFalseFalse
chr25000060000chr2_pFalse0.59647.01.06297.0nannanFalseFalse
chr26000070000chr2_pFalse0.55857.01.006117.0nannanFalseFalse
chr27000080000chr2_pFalse0.48967.00.919137.0nannanFalseFalse
chr28000090000chr2_pFalse0.43777.00.851157.0nannanFalseFalse
chr290000100000chr2_pFalse0.38787.00.794177.0nannanFalseFalse

The first bin is is_bad_bin=True (edge); the rest have decreasing log2 insulation scores. None of these first 10 bins are TAD boundaries (is_boundary_100000 and is_boundary_200000 are False).


Key Benefits

  1. Natural language interface: Hi-C analysis from contact maps to TAD boundaries via simple queries.
  2. Tool chain: Expected contacts → Compartment Profiles (with GC phasing) → Saddle plot; Insulation-based TAD boundaries uses the same view and resolution choices.
  3. Containerized execution: All tools run in specified Docker images (cooltools/cooltools_bedwig) with fixed versions.
  4. Human-in-the-loop: Users can ask for different chromosomes, resolutions, window sizes, or inspect output rows via natural language.

Technical Details

Tool Execution

Tools run in Docker as defined in their CWL:

  • cooler_show: cooler/cooltools (quay.io/biocontainers/cooltools:0.7.1)
  • cooltools_expected_cis: cooltools 0.7.1
  • cooltools_eigs_cis: hubentu/cooltools_bedwig:0.7.1_2.10 (eigenvectors + optional BigWig)
  • cooltools_saddle: cooltools 0.7.1
  • cooltools_insulation: hubentu/cooltools_bedwig:0.7.1_2.10

Main Parameters

ToolParameterExampleDescription
cooler_showuriresolutions/1000000Resolution path in mcool
cooler_showrangechr2:0-242193529UCSC-style region
cooltools_expected_cisuriresolutions/100000Resolution for expected curve
cooltools_expected_cisview_bedview_hg38.tsvBED of regions to include
cooltools_eigs_cisphasing_track_colGCColumn in phasing file for E1 orientation
cooltools_saddleqrange[0.02, 0.98]Quantile range for E1 bins
cooltools_saddletrack_col, expected_colE1, balanced.avgEigenvector and expected value columns
cooltools_insulationuriresolutions/10000Base resolution for insulation
cooltools_insulationwindow[100000, 200000]Window sizes in bp
cooltools_insulationthresholdLiMethod to call boundaries (Li/Otsu or float)

Data Flow

  1. Contact maps: The multi-resolution Hi-C file (mcool) is read at a chosen resolution and genomic range; cooler show renders the contact matrix as a heatmap image for the requested chromosome(s).
  2. Expected contacts: cooltools expected-cis computes the average contact frequency as a function of genomic distance within the regions defined in the view file, producing a decay curve used later for normalization and saddle plots.
  3. Compartments: cooltools eigs-cis performs eigenvalue decomposition on the observed/expected contact matrix; the first eigenvector (E1) is the compartment signal, and a GC content track is used to phase it so that positive E1 corresponds to GC-rich (active) and negative to GC-poor (inactive) regions.
  4. Saddle plot: cooltools saddle bins genomic bins by their E1 value and plots observed/expected contact enrichment between bin pairs; the resulting image shows whether A–A and B–B interactions are enriched (compartmentalization) and A–B depleted.
  5. TAD boundaries: cooltools insulation computes a diamond insulation score at each bin; low scores mark boundaries between domains. The Li method is used to choose a threshold and flag bins as boundaries, yielding a table of insulation scores and boundary calls for each window size.

Output Files

StepFileDescription
1contact_map_chr2_1mb.pngContact matrix chr2, 1 Mb
1contact_map_chr17_1mb.pngContact matrix chr17, 1 Mb
2expected_100kb.tsvExpected cis contacts, 100 kb, view regions
3compartment_100kb.cis.vecs.tsvEigenvectors (E1, …) per bin
3compartment_100kb.cis.lam.txtEigenvalues per region
4saddle_100kb.pngSaddle plot image
4saddle_100kb.saddledump.npzSaddle matrix data
4saddle_100kb.digitized.tsvDigitized saddle TSV
5insulation_10kb.tsvInsulation scores and boundary flags (100 kb & 200 kb windows)

Extending the Workflow

This use case can be extended to:

  • Additional chromosomes or resolutions: Use cooler_show with other ranges or uri (e.g. resolutions/50000) for different resolutions.
  • Compare conditions: Run expected, eigs, saddle, and insulation on multiple mcool files and compare compartment strength or TAD boundary sets.
  • Custom views: Define new BED files for specific loci or chromosomes and reuse the same pipeline.
  • Insulation with more windows: Add more window sizes in the insulation tool (e.g. 50 kb, 500 kb) for multi-scale TAD analysis.
  • Export tracks: Use BigWig output from eigs-cis or insulation (when enabled) for genome-browser visualization.
  • Downstream integration: Feed TAD boundaries or compartment E1 into enrichment tools, gene-set analysis, or correlation with RNA-seq/ATAC-seq.
  • Saddle at other resolutions: Recompute expected and eigenvectors at 50 kb or 250 kb and run saddle for different scales.

All of these extensions can be done by calling the same MCP tools with different parameters via natural language.

References

Released under the MIT License.