From CID to ATGC: Decoding Stereo-seq Barcodes

 

visitor badge GitHub

I made a silly story this week…

1. Background

I wanted to extract only the reads located within the tissue to reduce the FASTQ file size for running CARLIN analysis. These days, I rely heavily on AI. And this time, AI seriously misled me and wasted a lot of my time. Here’s what happened.

2. AI-explanations

The A02598A4.barcodeToPos.h5 file does not contain actual ATGC sequences, and I couldn’t find any documentation online about how to convert CID numbers into actual DNA barcodes. So, I turned to AI for help. AI responded:

Your integer CIDs are a compact representation of the CID DNA barcode. To use them for read filtering, you must convert the integer to the DNA sequence (typically 25 bases for Stereo-seq Bin50/100).

How the Encoding Works

  • The integer is a base-4 encoding of the 25bp DNA CID barcode.
  • A=0, C=1, G=2, T=3 (least significant base is at the end).
  • You can convert any integer CID to its corresponding DNA sequence using a small script.

It’s pretty great, right? The AI also kindly provides me a python script for converting the CID numbers into the actual sequence:

1
2
3
4
5
6
7
8
9
10
11
```python
def cid_to_seq(cid_int):
    base4 = []
    for _ in range(25):
        base = cid_int % 4
        base4.append("ACGT"[base])
        cid_int //= 4
    return ''.join(reversed(base4))

print("25-mer:", cid_to_seq(cid_value))
```

In addition, AI backed its explanation with what seemed like credible references, as shown below:

AI_Response_20250604

I was convinced by the AI’s explanation after briefly skimming the online resources it cited, including:

Fortunately, I’m not completely gullible—I had a moment of caution. I decided to validate AI’s algorithm using a read from the BAM file generated by MGI’s saw software. Let’s examine the example below:

  • The below is a read record from bam file. I extract it by using the IGV software:
Read name = E150018299L1C036R00400117279  
Read length = 100bp  
Flags = 0  
----------------------  
Mapping = Primary @ MAPQ 255  
Reference span = chr3:129,504,132-129,504,231 (+) = 100bp  
Cigar = 100M  
Clipping = None  
----------------------  
XF = 4  
NH = 1  
HI = 1  
nM = 0  
UR = 6B507  
AS = 98  
Cx = 6518  
Cy = 12274

---

Location = chr3:129,504,138  
Base = A @ QV 34
  • A bash script below, ./search.sh, was used to search the read name of “” across all 16 R1.fq.gz files of my stereo-seq:
1
2
3
4
5
```bash
for f in *_1.fq.gz; do
  zgrep -A1 'E150018299L1C036R00400117279' "$f" && echo "✅ Found in $f" && break
done
```
  • The above bash script returns:
🔎 Searching E150018299_L01_62_1.fq.gz                    [  6/ 16]  

✅ Found in file: E150018299_L01_62_1.fq.gz @E150018299L1C036R00400117279/1 TATAGTACATATGATTCCAGGTCCACGGTCCAACT

🔍 Search complete.
  • The AI’s algorithm returns:
1
2
3
4
5
6
7
8
9
```python
import h5py
import numpy as np
with h5py.File("./A02598A4/00.Rawdata/mask/A02598A4.barcodeToPos.h5", "r") as f:
	bp_matrix = f["bpMatrix_1"]
	x, y = 6518, 12274
	cid = int(bp_matrix[y, x, 0])
	print("CID (decimal):", cid)
```

CID (decimal): 100904410303266

1
2
3
4
5
6
7
8
9
10
11
```python
def cid_to_seq(cid_int):
	base4 = []
	for _ in range(25):
		base = cid_int % 4
		base4.append("ACGT"[base])
		cid_int //= 4
	return ''.join(reversed(base4))
	
print("25-mer ACGT:", cid_to_seq(cid))
```

25-mer ACGT: ACCGTTACCGGATGAGACAGTAGAG

Actual sequence (TATAGTACATATGATTCCAGGTCCACGGTCCAACT) versus AI’s algorithm (ACCGTTACCGGATGAGACAGTAGAG)…

There is no match between the AI’s algorithm-derived 25-mer (ACCGTTACCGGATGAGACAGTAGAG) and the actual R1 sequence (TATAGTACATATGATTCCAGGTCCACGGTCCAACT).

4. The More Artificial, the More Intelligent

Feeling disheartened, I decided to give Google one more careful search—and that’s when I saw it:

GitHub_Issue_20250604

Well… that’s life, isn’t it?

5. Not end yet…

Do you think it is over? No! My Linux environment do not meet the compile prerequisites of ST_BarcodeMap below:

  • linux
  • gcc-8.2.0
  • boost >=1.73.0
  • zlib >=1.2.11
  • hdf5 >=1.10.7

And after testing, we need boost==1.73.0 not >1.73.0

Then, I used conda with mamba to build an environment to meet there prerequisites:

🟢 mamba is a drop-in replacement for conda, but drastically faster for solving.

conda install mamba -n base -c conda-forge
mamba create -n barcodeenv boost=1.73 hdf5=1.10.7 zlib=1.2.11 -c conda-forge
conda activate barcodeenv
mamba install -c conda-forge gcc_linux-64 gxx_linux-64
mamba install -c conda-forge boost=1.73

However, the makefile of ST_BarcodeMap does not support to use conda environment. So I have to modify its makefile as below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
```bash
DIR_INC := ./inc
DIR_SRC := ./src
DIR_OBJ := ./obj

PREFIX ?= /usr/local
BINDIR ?= $(PREFIX)/bin

# Use Conda’s Boost, but system GCC
INCLUDE_DIRS = -I$(DIR_INC) -I$(CONDA_PREFIX)/include -I/usr/include/hdf5/serial
LIBRARY_DIRS = -L$(CONDA_PREFIX)/lib -L/usr/lib/x86_64-linux-gnu/hdf5/serial

SRC := $(wildcard ${DIR_SRC}/*.cpp)
OBJ := $(patsubst %.cpp,${DIR_OBJ}/%.o,$(notdir ${SRC}))

TARGET := ST_BarcodeMap-0.0.1
BIN_TARGET := ${TARGET}

CXX := g++
CXXFLAGS := -std=c++11 -g -O3 $(INCLUDE_DIRS)
LD_FLAGS := $(LIBRARY_DIRS) -lboost_serialization -lhdf5 -lz -lpthread

${BIN_TARGET}: ${OBJ}
	$(CXX) $(OBJ) -o $@ $(LD_FLAGS)

${DIR_OBJ}/%.o: ${DIR_SRC}/%.cpp make_obj_dir
	$(CXX) -c $< -o $@ $(CXXFLAGS)

.PHONY: clean
clean:
	rm -f ${DIR_OBJ}/*.o
	rm -f ${TARGET}

make_obj_dir:
	@if [ ! -d ${DIR_OBJ} ]; then mkdir -p ${DIR_OBJ}; fi

install:
	install ${TARGET} ${BINDIR}/${TARGET}
	@echo "Installed."
```

Now, I can happily compile the ST_BarcodeMap finally:

export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
cd ST_BarcodeMap
make

Finally, I got the actual ATGC sequence barcode for CID numbers.

./ST_BarcodeMap-0.0.1 --in A02598A4.barcodeToPos.h5 --out barcodes.txt --action 3

For the complete code and workflow related to my original goal, check out the entry
👉 Extract Tissue CIDs from Stereo-seq
in our BSGOU public wet/dry-lab notebooks online.

However, using the GitHub Actions workflow to compile the project is actually much easier and faster than setting it up and compiling it locally. Therefore, I forked the original ST_BarcodeMap repository and compiled it directly through GitHub.

You can check out my forked repository here: wong-ziyi/ST_BarcodeMap.

In the end, bioinformatics is still part biology, part informatics… and part “don’t blindly trust the robot” as Apple just found no evidence of formal reasoning in language models.

— by WANG Ziyi GitHub Profile

References


If you found this helpful, feel free to comment, share, and follow for more. Your support encourages us to keep creating quality content.

This work is licensed under a Attribution-NonCommercial 4.0 International license. Attribution-NonCommercial 4.0 International