Extract Tissue CIDs from Stereo-seqdigital knowledge repository

As always, you can directly jump for checking TL;DR first.

1. What’s inside the `*.tissue.gef`?

Let’s first explore the resulted <SN>.tissue.gef in Python:

import h5py
import numpy as np
 
def explore_gef(h5file):
	def visitor(name, obj):
		if isinstance(obj, h5py.Dataset):
		print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
	with h5py.File(h5file, 'r') as f:
		f.visititems(visitor)
		
gef_file = "./<SN>.tissue.gef"
 
with h5py.File('./<SN>.tissue.gef', 'r') as f:
	def visit(name):
		print(name)
	f.visit(visit)

Output:

contour
contour/tissueContour
geneExp
geneExp/bin1
geneExp/bin1/exon
geneExp/bin1/expression
geneExp/bin1/gene
geneExp/bin10
geneExp/bin10/exon
geneExp/bin10/expression
geneExp/bin10/gene
geneExp/bin100
geneExp/bin100/exon
geneExp/bin100/expression
geneExp/bin100/gene
geneExp/bin150
geneExp/bin150/exon
geneExp/bin150/expression
geneExp/bin150/gene
geneExp/bin20
geneExp/bin20/exon
geneExp/bin20/expression
geneExp/bin20/gene
geneExp/bin200
geneExp/bin200/exon
geneExp/bin200/expression
geneExp/bin200/gene
geneExp/bin5
geneExp/bin5/exon
geneExp/bin5/expression
geneExp/bin5/gene
geneExp/bin50
geneExp/bin50/exon
geneExp/bin50/expression
geneExp/bin50/gene
stat
stat/gene
wholeExp
wholeExp/bin1
wholeExp/bin10
wholeExp/bin100
wholeExp/bin150
wholeExp/bin20
wholeExp/bin200
wholeExp/bin5
wholeExp/bin50
wholeExpExon
wholeExpExon/bin1
wholeExpExon/bin10
wholeExpExon/bin100
wholeExpExon/bin150
wholeExpExon/bin20
wholeExpExon/bin200
wholeExpExon/bin5
wholeExpExon/bin50

Let’s check where <SN>.tissue.gef file stores the XY coordinates continuous in Python:

explore_gef(gef_file)

Output:

contour/tissueContour: shape=(), dtype=object
geneExp/bin1/exon: shape=(116772117,), dtype=uint8
geneExp/bin1/expression: shape=(116772117,), dtype=[('x', '<i4'), ('y', '<i4'), ('count', 'u1')]
geneExp/bin1/gene: shape=(57584,), dtype=[('geneID', 'S64'), ('geneName', 'S64'), ('offset', '<u4'), ('count', '<u4')]
geneExp/bin10/exon: shape=(93972549,), dtype=uint16
geneExp/bin10/expression: shape=(93972549,), dtype=[('x', '<i4'), ('y', '<i4'), ('count', '<u2')]
geneExp/bin10/gene: shape=(57584,), dtype=[('geneID', 'S64'), ('geneName', 'S64'), ('offset', '<u4'), ('count', '<u4')]
geneExp/bin100/exon: shape=(32675685,), dtype=uint16
geneExp/bin100/expression: shape=(32675685,), dtype=[('x', '<i4'), ('y', '<i4'), ('count', '<u2')]
geneExp/bin100/gene: shape=(57584,), dtype=[('geneID', 'S64'), ('geneName', 'S64'), ('offset', '<u4'), ('count', '<u4')]
geneExp/bin150/exon: shape=(22060836,), dtype=uint16
geneExp/bin150/expression: shape=(22060836,), dtype=[('x', '<i4'), ('y', '<i4'), ('count', '<u2')]
geneExp/bin150/gene: shape=(57584,), dtype=[('geneID', 'S64'), ('geneName', 'S64'), ('offset', '<u4'), ('count', '<u4')]
geneExp/bin20/exon: shape=(78271410,), dtype=uint16
geneExp/bin20/expression: shape=(78271410,), dtype=[('x', '<i4'), ('y', '<i4'), ('count', '<u2')]
geneExp/bin20/gene: shape=(57584,), dtype=[('geneID', 'S64'), ('geneName', 'S64'), ('offset', '<u4'), ('count', '<u4')]
geneExp/bin200/exon: shape=(15879595,), dtype=uint16
geneExp/bin200/expression: shape=(15879595,), dtype=[('x', '<i4'), ('y', '<i4'), ('count', '<u2')]
geneExp/bin200/gene: shape=(57584,), dtype=[('geneID', 'S64'), ('geneName', 'S64'), ('offset', '<u4'), ('count', '<u4')]
geneExp/bin5/exon: shape=(103779528,), dtype=uint8
geneExp/bin5/expression: shape=(103779528,), dtype=[('x', '<i4'), ('y', '<i4'), ('count', 'u1')]
geneExp/bin5/gene: shape=(57584,), dtype=[('geneID', 'S64'), ('geneName', 'S64'), ('offset', '<u4'), ('count', '<u4')]
geneExp/bin50/exon: shape=(52649679,), dtype=uint16
geneExp/bin50/expression: shape=(52649679,), dtype=[('x', '<i4'), ('y', '<i4'), ('count', '<u2')]
geneExp/bin50/gene: shape=(57584,), dtype=[('geneID', 'S64'), ('geneName', 'S64'), ('offset', '<u4'), ('count', '<u4')]
stat/gene: shape=(57584,), dtype=[('geneID', 'S64'), ('geneName', 'S64'), ('MIDcount', '<u4'), ('E10', '<f4')]
wholeExp/bin1: shape=(26460, 26460), dtype=[('MIDcount', 'u1'), ('genecount', '<u2')]
wholeExp/bin10: shape=(2646, 2646), dtype=[('MIDcount', '<u2'), ('genecount', '<u2')]
wholeExp/bin100: shape=(265, 265), dtype=[('MIDcount', '<u2'), ('genecount', '<u2')]
wholeExp/bin150: shape=(177, 177), dtype=[('MIDcount', '<u4'), ('genecount', '<u2')]
wholeExp/bin20: shape=(1323, 1323), dtype=[('MIDcount', '<u2'), ('genecount', '<u2')]
wholeExp/bin200: shape=(133, 133), dtype=[('MIDcount', '<u4'), ('genecount', '<u2')]
wholeExp/bin5: shape=(5292, 5292), dtype=[('MIDcount', '<u2'), ('genecount', '<u2')]
wholeExp/bin50: shape=(530, 530), dtype=[('MIDcount', '<u2'), ('genecount', '<u2')]
wholeExpExon/bin1: shape=(26460, 26460), dtype=uint8
wholeExpExon/bin10: shape=(2646, 2646), dtype=uint16
wholeExpExon/bin100: shape=(265, 265), dtype=uint16
wholeExpExon/bin150: shape=(177, 177), dtype=uint32
wholeExpExon/bin20: shape=(1323, 1323), dtype=uint16
wholeExpExon/bin200: shape=(133, 133), dtype=uint32
wholeExpExon/bin5: shape=(5292, 5292), dtype=uint16
wholeExpExon/bin50: shape=(530, 530), dtype=uint16

Now, we know that the <SN>.tissue.gef file stores the XY coordinates in geneExp/bin1/expression on the smallest level. Therefore, we will save them out into tissue_xy_coords.txt for later use by using Python:

with h5py.File(gef_file, 'r') as f:
	expression = f['geneExp/bin1/expression']
	x_coords = expression['x']
	y_coords = expression['y']
 
xy_set = set(zip(x_coords, y_coords))
 
# Save to tab-separated list
with open('tissue_xy_coords.txt', 'w') as out:
	for x, y in sorted(xy_set):
		out.write(f"{x}\t{y}\n")

Check the output tissue_xy_coords.txt:

ubuntu4@ubuntu4:~$ head tissue_xy_coords.txt
5713	26441
5787	26358
5788	26318
5788	26319
5788	26321
5788	26329
5789	26316
5789	26327
5790	26305
5790	26310

2. The —clean-reads-fastq option in `saw` software

Actually, with the —clean-reads-fastq option in saw software, we can get the read ID along with the XY coordinates and the biological sequence originally from Read2, as shown below:

ubuntu@ubuntu:~$ head <ID>_<LaneNo.>_<SampleNo.>_1.clean_reads.fq
@E150018299L1C004R03402098317|Cx:i:19892|Cy:i:17727 E8BBE63524CF BEAB3
GCCAATAGTATAGTGTGGTGTGCTTTTACGTGATGGCGAGTGGGCAGCGGGCGGTGGGCTGTACACAGCCGTCTGTCCTTTGAATCTCAATCTGCCTGCG
+
9A>CCCCACCCC>CACB?CCC>ACCCCCA>C@<AB8<ABB?>B;6BC7?/8=B>ACA=>CC@BAB<BC=CCCACABB<CCCC1CCAC8C/A;CCCCC@CA
@E150018299L1C004R03402098337|Cx:i:13026|Cy:i:14469 1C20554DC029 A5B9F
GCGTGGCACTCAGGCGGGGCCCTGGGAGCGCTGCGGGCACGGGGTGGCCGGCAGGACGCGGGCTGGATGGCTCTGGCCGCGCCAGGAGGAGGCCGACCTG
+
1B1AC>ABC<CCCBB8<C0>>1C;A95C:>9>@869<2C?@C;C1C?A=ACA@CC6BC=CA>6CAA:BCC?BCCCA<1@CCA:@CA1<A)CC93-CCCCC
@E150018299L1C004R03402098484|Cx:i:17752|Cy:i:16736 8E7513DF377D FD27F
GCCGGGGGCATTCGTATTGCGCCGCTAGAGGTGAAATTCTTGGACCGGCGCAAGACGGACCAGAGCGAAAGCATTTGCCAAGAATGTTTTCATTAATCAA

Therefore, we can use the tissue_xy_coords.txt to filter the *_1.clean_reads.fq files directly into a merged R2.tissue.fq.gz file by using the filter_cleanR1_by_XY_parallel.sh below:

#!/bin/bash
 
# ===== CONFIG =====
XY_LIST="tissue_xy_coords.txt"
OUTFILE="R2.tissue.fq.gz"
TMPDIR="tmp_cleanR1"
mkdir -p "$TMPDIR"
 
echo "📄 Preparing whitelist..."
# Extract and sort unique XY coordinates (Cx, Cy)
cut -f1,2 "$XY_LIST" | sort | uniq > whitelist_xy.txt
 
# Export for GNU parallel
export TMPDIR
export WL="$(realpath whitelist_xy.txt)"  # Use full path to avoid parallel context issues
 
# ===== FUNCTION: PROCESS EACH FILE =====
process_cleanR1() {
    fq="$1"
    base=$(basename "$fq" .clean_reads.fq)
    tmpout="$TMPDIR/${base}.tissue.fq"
 
    gawk -v WL="$WL" -v OUT="$tmpout" '
        BEGIN {
            while ((getline line < WL) > 0) {
                gsub(/\r/, "", line)  # Remove Windows-style line endings
                valid_xy[line] = 1
            }
        }
        {
            if (substr($0,1,1) == "@") {
                id = $0
                getline seq
                getline plus
                getline qual
                if (match(id, /\|Cx:i:([0-9]+)\|Cy:i:([0-9]+)/, m)) {
                    xy = m[1] "\t" m[2]
                    if (xy in valid_xy) {
                        print id >> OUT
                        print seq >> OUT
                        print plus >> OUT
                        print qual >> OUT
                    }
                }
            }
        }
    ' "$fq"
}
export -f process_cleanR1
 
# ===== PARALLEL PROCESSING =====
echo "🚀 Filtering reads based on XY coordinates..."
ls *_1.clean_reads.fq | parallel --bar -j "$(nproc)" process_cleanR1 {}
 
# ===== MERGE RESULTS =====
echo "🧬 Merging into $OUTFILE..."
if ls $TMPDIR/*.tissue.fq 1> /dev/null 2>&1; then
    cat $TMPDIR/*.tissue.fq | gzip > "$OUTFILE"
    echo "✅ Done. Output written to $OUTFILE"
else
    echo "⚠️  No matching reads found. Output file was not created."
fi

Output:

ubuntu@ubuntu:~$ zcat R2.tissue.fq.gz|head -n 12
@E150018299L1C004R03402098317|Cx:i:19892|Cy:i:17727 E8BBE63524CF BEAB3
GCCAATAGTATAGTGTGGTGTGCTTTTACGTGATGGCGAGTGGGCAGCGGGCGGTGGGCTGTACACAGCCGTCTGTCCTTTGAATCTCAATCTGCCTGCG
+
9A>CCCCACCCC>CACB?CCC>ACCCCCA>C@<AB8<ABB?>B;6BC7?/8=B>ACA=>CC@BAB<BC=CCCACABB<CCCC1CCAC8C/A;CCCCC@CA
@E150018299L1C004R03402098337|Cx:i:13026|Cy:i:14469 1C20554DC029 A5B9F
GCGTGGCACTCAGGCGGGGCCCTGGGAGCGCTGCGGGCACGGGGTGGCCGGCAGGACGCGGGCTGGATGGCTCTGGCCGCGCCAGGAGGAGGCCGACCTG
+
1B1AC>ABC<CCCBB8<C0>>1C;A95C:>9>@869<2C?@C;C1C?A=ACA@CC6BC=CA>6CAA:BCC?BCCCA<1@CCA:@CA1<A)CC93-CCCCC
@E150018299L1C004R03402098484|Cx:i:17752|Cy:i:16736 8E7513DF377D FD27F
GCCGGGGGCATTCGTATTGCGCCGCTAGAGGTGAAATTCTTGGACCGGCGCAAGACGGACCAGAGCGAAAGCATTTGCCAAGAATGTTTTCATTAATCAA
+
CCCB=CCCCCCCCCCDCCCCCCCCCDCCCCCC@CCCCCCCCC8BBCCCBCCCCBBCCCCCBCCBCCCCCCCCCCCCCBCCBCCCCCCCCCCCCCCBCCCC

To here, we got what we want.

3. If you need un-clean reads

According to SAW User Manual V8.1:

Workflow-of-CID-mapping

3.1. CID mapping

CID mapping requires FASTQs and a chip mask file, recording position information for sequencing reads.

Check the amount of Ns in Coordinate IDs:

If there is 1 N base in CID, the N base will be replaced by A/T/C/G.
CIDs without N bases will be directly poured into the match pool.
CIDs with more than one N base will be discarded.

In the absense of an N base, if a CID does not match to any positions, each base of the CID will replaced by the other three types iterately until a successful match. After the mapping, only the unique CID match will be retained for subsequent steps.

3.2. RNA filtering

Before genome alignment, it is necessary to confirm that the reads entering the next step are cDNA sequences. Ideally, this part of the data only contains cDNA fragments, but there may be cases where the fragmented cDNA is too short or some non-cDNA fragments are detected.

Reads will be discarded if any of the following conditions are triggered:

with a length of less than 30 after cutting out adapter sequences,
mapped to DNB sequences,
with a length of less than 30 after cutting out the poly-A sequence.

The above three are collectively known as “non-relevant short reads”.

3.3. MID filtering

MID sequences will be filtered out if they match any of the following:

having more than one base with quality ⇐ Q10,
having N bases >= 1.

In summary, the —clean-reads-fastq option lets the SAW software output the Clean Reads (before RNA alignment) in FASTQ format, which have undergone CID mapping, RNA filtering, and MID filtering.

Therefore, if you also need the reads before CID mapping, RNA filtering, and MID filtering (unclean reads), you can get if by following the below steps:

3.4. Extract Raw Reads by Barcodes

Let’s check the content of <SN>.barcodeToPos.h5 in the first by using Python:

3.4.1 What’s inside the `<SN>.barcodeToPos.h5`?

import h5py
import numpy as np
 
with h5py.File("<SN>.barcodeToPos.h5", "r") as f:
	def show(name):
		print(name)
	f.visit(show)

Output:

bpMatrix_1

def visit_func(name, obj):
	if isinstance(obj, h5py.Dataset):
		print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
with h5py.File("<SN>.barcodeToPos.h5", "r") as f:
	f.visititems(visit_func)

Output:

bpMatrix_1: shape=(26462, 26462, 1), dtype=uint64

with h5py.File("<SN>.barcodeToPos.h5", "r") as f:
    bp_matrix = f["bpMatrix_1"]
    preview = bp_matrix[:10, :10, 0]  # First 10x10 values from the first "layer"
    print(preview)

Output:

[[ 836120372779751   26690364227217  310455682980154  810417757238641  757835266385246  468131266122076  730298897644317
   114211586351941  905790851940934  144665928054079]
 [ 592167366271184  315691865553282  539954256853234  637943645074939  638218522981883  935068209817518  943872959887278
   145802161834220  384478059976677   90995135132138]
 [ 473147532191787  916598384181374  373439451720958  911457121349196  946003516929503  701023829742140                0
     1973895908933  482823435537541  882002041392058]
 [  75850009055013  497328043110101  157141292161565  912556632976964  895216052774896  860031680686064  433145314645463
   454758089196737  441632677595089  371550774555605]
 [  92342146417517                0  583753158693405  603543734459023  540662194593155  930723353156714  284657638092236
   423271773771732  312992807700385  459897744576492]
 [ 389942908629505  377380128554073   28216769442888  591127799854154  316247602980563  754467704761660  770956076084520
   298320288767464  918872495227747  204576691495555]
 [ 408347756505642  406372427344417  612717698316805  299964748619746  591348231657186  302371740460318  421300098808982
                 0  173670142342295   69361901129395]
 [ 324703201436906  981215793203770  204232473287550  496942343803851  355968231201706  865069609921374 1004588767307362
   609138334283912   47116856113832                0]
 [ 400619138252874                0  186640287370110  261566054772075  763646648268041  189457938907919  631286476979333
   860377963801277  871676830062988  856013244432792]
 [1030647789522790  679626823667083                0  381032100570714  451005439602351  400828544750347  977314407882773
   860382258752061   97469403671661  851753710617944]]

with h5py.File("<SN>.barcodeToPos.h5", "r") as f:
    bp_matrix = f["bpMatrix_1"]
    x, y = 6518, 12274
    cid = int(bp_matrix[y, x, 0])
    print("CID (decimal):", cid)

Output:

CID (decimal): 100904410303266

3.4.2 Convert `CID` to `ATGC`

As we mentioned in in blog of From CID to ATGC: Decoding Stereo-seq Barcodes, we can use the STOmics/ST_BarcodeMap to convert the CID numbers into the actual ATGC sequences, as shown below:

./ST_BarcodeMap-0.0.1 --in A02598A4.barcodeToPos.h5 --out barcodeToPos.txt --action 3

Output:

ubuntu@ubuntu:~$ head barcodeToPos.txt
TGCTCTATCGCAACCCATGCTCCAG	26457	26459
GGATTACCACGTGTCCATTTACCCC	26456	26459
AATTGAAGCCGACACTGTATGGGGG	26453	26459
GCATATCGACCTCCTTCGATCCCTG	26447	26459
GTCGATGTGCTTGCAATCATGAGCT	26445	26459
AAGGTTTGCTTGAGCACGGGGCCGA	26444	26459
GAGCCTATGAACTTTCGCTCCGAAA	26443	26459
AAACTTAACAGCCGCCTGTCCTTTC	26442	26459
GCTCTCGGCGTCATCCATTTACTAC	26441	26459
GTCGTTTGTCCACGATAGCAACATT	26437	26459

Then, we can use the file of tissue_xy_coords.txt obtained from Section 1 to get the barcodes inside the tissue:

awk 'NR==FNR {xy[$1"\t"$2]=1; next} ($2"\t"$3) in xy' tissue_xy_coords.txt barcodeToPos.txt > barcodeToPos_tissue.txt

Output:

ubuntu@ubuntu:~$ head barcodeToPos_tissue.txt
TTTCTGCCCCTTATAGCTGTTATCG	6436	26451
ACAAACCAACCTGTCTGTCCTGCGA	18777	26448
GACTATAACGGTAGCTTAGGGTCGT	14541	26448
CTTCATCGCTCGGTCCTCGCTTCTG	6066	26448
GCTTTCCCCAGCATCTCACGCACCT	14401	26446
CAACAGCCTTCCACACCTACGGCGA	5713	26441
CAGTGTCCTGAAAAATCGTTCTCTC	6599	26439
GCCCGCAAGCTCTCGCAAGCCTCAC	6194	26437
CGAGCGATATTGGCTCCGAGAAATT	19123	26435
GCATATCTAGGATAGCACTTAATTT	18867	26435

Now, we can create a white list, barcodes_in_tissue.txt, of barcodes in tissue for later use:

cut -f1 barcodeToPos_tissue.txt > barcodes_in_tissue.txt

3.4.3 Filter Read 1 FASTQ files

According to SAW User Manual V8.1, paired FASTQs include a pair of read files, read 1 for CID, MID information and read 2 for captured RNA sequencing data respectively. An example of paired FASTQ:

# read 1
@E100026571L1C001R00300000000/1
TGTCCAACGGAGACGGCTCCGACAAGGCACTGGCA
+
>DG;<BGH=>*EFE8*G/3E@2:F0-GBGG188F<
 
# read 2
@E100026571L1C001R00300000000/2
GTCTCACCATACTTTTACAAAGTTATTTCAACCCAAATCACAATTTAAGAATTATTTGTTCTACCTATGCCACACTTTAAATAAATGTCTATTAAAACCA
+
-GFEECG?ECBFF<=@A@<E@><;FGCF=>=E53FEF5>FGF@,0ADE9CEAG2GBE@HF3EA<CE;G2F@=G8=?@G9FBGE.EG6G2;974E*D9DE9

Therefore, we can retrieve the all barcode-related reads ID from read 1 FASTQ file by using the extract_R1_ReadIDs_parallel.sh below:

#!/bin/bash
 
# Prerequisite: GNU parallel must be installed
# sudo apt install parallel
 
# === CONFIG ===
BARCODE_LIST="barcodes_in_tissue.txt"
OUTFILE="matched_read_ids.txt"
TMPDIR="tmp_matched_ids_parallel"
mkdir -p "$TMPDIR"
 
export BARCODE_LIST
export TMPDIR
 
# === FUNCTION TO PROCESS ONE FILE ===
process_file() {
    fq="$1"
    base=$(basename "$fq" _1.fq.gz)
    tmpfile="$TMPDIR/${base}_ids.txt"
 
    zcat "$fq" | paste - - - - \
    | awk -v BARCODE="$BARCODE_LIST" -v TMP="$tmpfile" '
        BEGIN {
            while ((getline line < BARCODE) > 0) {
                barcodes[line] = 1
            }
        }
        {
            seq = substr($2, 1, 25)
            if (seq in barcodes) {
                split($1, parts, "[/]")
                readid = substr(parts[1], 2)
                print readid >> TMP
            }
        }
    '
}
export -f process_file
 
# === RUN IN PARALLEL ===
echo "🧠 Running parallel processing..."
ls *_1.fq.gz | parallel --bar -j $(nproc) process_file {}
 
# === MERGE & CLEANUP ===
echo "🧹 Merging and deduplicating..."
cat $TMPDIR/*_ids.txt | sort | uniq > "$OUTFILE"
rm -r "$TMPDIR"
 
echo "✅ Done. All matched read IDs saved to $OUTFILE"

Output:

ubuntu@ubuntu:~$ head matched_read_ids.txt
E150018299L1C001R00100000800
E150018299L1C001R00100000806
E150018299L1C001R00100000892
E150018299L1C001R00100000906
E150018299L1C001R00100000969
E150018299L1C001R00100001174
E150018299L1C001R00100001181
E150018299L1C001R00100001294
E150018299L1C001R00100001305
E150018299L1C001R00100001311

3.4.4 Filter Read 2 FASTQ files

By using the matched_read_ids.txt from the above section, we can finally extract all reads inside the tissue by using extract_R2_parallel.sh below:

#!/bin/bash
 
# ====== CONFIG ======
IDFILE="matched_read_ids.txt"
OUTFILE="R2.tissue.fq.gz"
TMPDIR="tmp_r2_extract"
mkdir -p "$TMPDIR"
 
# ====== PREP ======
echo "📄 Indexing matched read IDs..."
sort "$IDFILE" | uniq > sorted_read_ids.txt
 
# ====== EXPORT FOR PARALLEL ======
export TMPDIR
export IDFILE="sorted_read_ids.txt"
 
# ====== FUNCTION TO PROCESS A SINGLE R2 FILE ======
extract_r2_reads() {
    fq="$1"
    base=$(basename "$fq" .fq.gz)
    tmpout="$TMPDIR/${base}.tissue.fq"
 
    zcat "$fq" | paste - - - - \
    | awk -v IDFILE="$IDFILE" -v OUT="$tmpout" '
        BEGIN {
            while ((getline id < IDFILE) > 0) {
                keep[id] = 1
            }
        }
        {
            split($1, a, "/");
            readid = substr(a[1], 2);  # remove leading @
            if (readid in keep) {
                print $1 "\n" $2 "\n" $3 "\n" $4 >> OUT
            }
        }
    '
}
export -f extract_r2_reads
 
# ====== RUN PARALLEL ======
echo "🚀 Extracting from R2 FASTQs using parallel..."
ls *_2.fq.gz | parallel --bar -j $(nproc) extract_r2_reads {}
 
# ====== MERGE RESULTS ======
echo "🧬 Merging results into $OUTFILE..."
cat $TMPDIR/*.tissue.fq | gzip > "$OUTFILE"
 
# ====== CLEANUP ======
rm -r "$TMPDIR"
rm sorted_read_ids.txt
 
echo "✅ Done! Output written to $OUTFILE"

Output:

ubuntu@ubuntu:~$ zcat R2.tissue.fq.gz|head -n 12
@E150018299L1C001R00100000906/2
GTCTTAGGAAGACAATGTGAAGTTGGAGGTCGAGAGACCTCTGATAAGGTCGCCATGCCTCTCAGTACGTCAGCAGAAGCAGTGGTATCAACGCAGAGTA
+
DDDDDCCDDDDDDDDDCDDDDDDDDDCDCDDDCDCDCDDDDDDCDDCCCCDDDDCDDDDDCDDCDDDDDDDCCDCDCCDDCDCDDDCDDCCDDDCDCDDB
@E150018299L1C001R00100001311/2
CCTGATCAGTCTTAGGAAGACAAAGAAAAGTGGTGCCTAAGGCCTCACCTGATAAGGTCGCCATGCCTCTCAGTACGTCAGCAGAAGCAGTGGTATCAAC
+
BCDDCDCCCDCDDDCBAACCCDC@CCCCBDCDCCDCCDCCCBCCDBDBCDCCCCCDCCCDBBCCDCDDCCB>CDCBCCDCBCCC=<CCCCCBBACCC?BC
@E150018299L1C001R00100002321/2
ACAAAGCACCGAGGAAAAAAGTAATGAGTCTGATAAGGTCGCCATGCCTCTCAGTACGTCAGCAGAAGCAGTGGTATCAACGCAGAGTACATTCCGTTGA
+
CDDDCCDCCCDDDDDCCDCDDCCCDCCDCCDDCDCCDDDDDCCDDDCCDDDCDDDCDDDDCDDCDCCDDCCDDDDDDDCDDDDCDCDCCCDDCDCCDCDD

Reference:

🧾 TL;DR — Extract Tissue Reads from Stereo-seq Using GEF & XY Coordinates

This workflow extracts read sequences localized within tissue areas from Stereo-seq datasets. It starts by decoding spatial expression data in .tissue.gef, determines valid XY coordinates of tissue-covered areas, and then filters clean or raw FASTQ reads accordingly. The pipeline supports both pre-cleaned reads (--clean-reads-fastq) and unprocessed raw reads via barcode decoding from chip mask HDF5 files.

📄 Required Files

📁 File	🧾 Description
`<SN>.tissue.gef`	Gene expression matrix with spatial bins
`<SN>.barcodeToPos.h5`	Encoded CID matrix for raw coordinate mapping
`*_1.clean_reads.fq`	Read1 files containing CID+MID+coordinates
`_1.fq.gz`, `_2.fq.gz`	Raw paired FASTQs (Read1 for CID, Read2 RNA)

🛠 Required Scripts

⚙️ Script	🔧 Purpose
`explore_gef()`	Inspects HDF5 GEF structure to locate XY coordinates
`filter_cleanR1_by_XY_parallel.sh`	Filters clean Read1 FASTQs by tissue XY coords
`extract_R1_ReadIDs_parallel.sh`	Gets read IDs from raw Read1 matching tissue barcodes
`extract_R2_parallel.sh`	Extracts raw Read2 reads based on matched read IDs
`ST_BarcodeMap`	Converts CID numbers to ATGC barcodes

🧭 Workflow Summary

Parse .tissue.gef to extract XY expression data (geneExp/bin1/expression).
Save tissue XYs to tissue_xy_coords.txt.
Use --clean-reads-fastq Read1 files to filter by XY, outputting R2.tissue.fq.gz.
For raw reads: decode barcodes using barcodeToPos.h5 + ST_BarcodeMap.
Match decoded barcodes to XY, extract read IDs, then filter raw Read2 FASTQs.
Final output: clean or raw R2.tissue.fq.gz with sequences inside tissue.

📂 Output Files

📤 Output	📌 Content
`tissue_xy_coords.txt`	List of XY coords within tissue
`R2.tissue.fq.gz`	Filtered FASTQ reads from tissue regions
`barcodes_in_tissue.txt`	Whitelist of barcodes localized in tissue
`matched_read_ids.txt`	Read IDs from Read1 sequences matching tissue barcodes

flowchart TD
  subgraph Input_Files
    A1["📁 <SN>.tissue.gef"]
    A2["📁 <SN>.barcodeToPos.h5"]
    A3["📁 *_1.clean_reads.fq"]
    A4["📁 *_1.fq.gz"]
    A5["📁 *_2.fq.gz"]
  end

  subgraph Processing_Steps
    B1["⚙️ explore_gef()"]
    B2["📄 tissue_xy_coords.txt"]
    B3["⚙️ ST_BarcodeMap"]
    B4["📄 barcodeToPos.txt"]
    B5["📄 barcodeToPos_tissue.txt"]
    B6["📄 barcodes_in_tissue.txt"]
    B7["⚙️ filter_cleanR1_by_XY_parallel.sh"]
    B8["⚙️ extract_R1_ReadIDs_parallel.sh"]
    B9["📄 matched_read_ids.txt"]
    B10["⚙️ extract_R2_parallel.sh"]
  end

  subgraph Outputs
    C1["📤 R2.tissue.fq.gz"]
  end

  A1 --> B1 --> B2
  A2 --> B3 --> B4 --> B5 --> B6
  B2 --> B7 --> C1
  A3 --> B7
  A4 --> B8 --> B9
  B6 --> B8
  A5 --> B10
  B9 --> B10 --> C1

If you found this helpful, feel free to comment, share, and follow for more. Your support encourages us to keep creating quality content.

BSGOU

Bees-GO!

Extract Tissue CIDs from Stereo-seq

1. What’s inside the `*.tissue.gef`?

2. The —clean-reads-fastq option in `saw` software

3. If you need un-clean reads

3.1. CID mapping

3.2. RNA filtering

3.3. MID filtering

3.4. Extract Raw Reads by Barcodes

3.4.1 What’s inside the `<SN>.barcodeToPos.h5`?

3.4.2 Convert `CID` to `ATGC`

3.4.3 Filter Read 1 FASTQ files

3.4.4 Filter Read 2 FASTQ files

Reference:

🧾 TL;DR — Extract Tissue Reads from Stereo-seq Using GEF & XY Coordinates

📄 Required Files

🛠 Required Scripts

🧭 Workflow Summary

📂 Output Files

Graph View

Table of Contents

Backlinks

BSGOU

Bees-GO!

Extract Tissue CIDs from Stereo-seq

1. What’s inside the *.tissue.gef?

2. The —clean-reads-fastq option in saw software

3. If you need un-clean reads

3.1. CID mapping

3.2. RNA filtering

3.3. MID filtering

3.4. Extract Raw Reads by Barcodes

3.4.1 What’s inside the <SN>.barcodeToPos.h5?

3.4.2 Convert CID to ATGC

3.4.3 Filter Read 1 FASTQ files

3.4.4 Filter Read 2 FASTQ files

Reference:

🧾 TL;DR — Extract Tissue Reads from Stereo-seq Using GEF & XY Coordinates

📄 Required Files

🛠 Required Scripts

🧭 Workflow Summary

📂 Output Files

Get new posts from DK.BeesGO by email:

Graph View

Table of Contents

Backlinks

1. What’s inside the `*.tissue.gef`?

2. The —clean-reads-fastq option in `saw` software

3.4.1 What’s inside the `<SN>.barcodeToPos.h5`?

3.4.2 Convert `CID` to `ATGC`