The main website where RiboGrove is hosted may be unavailable outside Belarus due to technical troubles and the overall disaster.
Hence this mirror has been created, and RiboGrove files are available through Dropbox (links are below).




Downloads

Latest RiboGrove release — 9.215 (2022-11-30)

The release is based on RefSeq release 215.

The fasta file is compressed with gzip, and the metadata file is a zip archive. To decompress them, Linux and macOS users can use the gzip and unzip programs, which are usually preinstalled. For Windows users, the free and open-source (de)compression program 7-Zip is available.
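As a minimal sketch, the two archives could be unpacked on Linux or macOS as shown here. The fasta file name is the one used in the examples further down this page; the metadata archive name metadata_ribogrove_9.215.zip and the output directory ribogrove_metadata are hypothetical placeholders, so substitute the names of the files you actually downloaded.

```shell
# File names: the fasta name is taken from the examples on this page;
# the metadata archive name is a hypothetical placeholder.
fasta_gz="ribogrove_9.215_sequences.fasta.gz"
metadata_zip="metadata_ribogrove_9.215.zip"
fasta="${fasta_gz%.gz}"   # strip the .gz suffix to get the output name

# Decompress the fasta file to a new file, leaving the .gz archive in place.
if [ -f "$fasta_gz" ]; then
    gzip -dc "$fasta_gz" > "$fasta"
fi

# Extract the metadata archive into a separate directory.
if [ -f "$metadata_zip" ]; then
    unzip -o "$metadata_zip" -d ribogrove_metadata
fi
```

Decompressing the fasta file is optional if you only work with Seqkit, since it reads gzipped fasta files directly.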

RiboGrove release archive

You can find all releases in the RiboGrove release archive.

Release notes

Starting with release 9.215, we publish the section “Coverage of primer pairs for different V-regions of bacterial 16S rRNA genes”.

You can find notes for all RiboGrove releases on the release notes page.


Statistical summary

RiboGrove size
Metric | Bacteria | Archaea | Total
Number of gene sequences | 158,769 | 813 | 159,582
Number of unique gene sequences | 43,247 | 592 | 43,839
Number of species | 7,717 | 381 | 8,098
Number of genomes | 29,922 | 478 | 30,400
Number of genomes of category 1 | 19,915 | 174 | 20,089
Number of genomes of category 2 | 9,876 | 304 | 10,180
Number of genomes of category 3 | 131 | 0 | 131

16S rRNA gene lengths
Metric | Bacteria | Archaea
Minimum (bp) | 1,448.00 | 1,439.00
25th percentile (bp) * | 1,517.00 | 1,472.00
Median (bp) * | 1,531.00 | 1,474.00
75th percentile (bp) * | 1,542.00 | 1,486.50
Average (bp) * | 1,528.01 | 1,496.32
Mode (bp) * | 1,537.00 | 1,472.00
Maximum (bp) | 2,438.00 | 3,604.00
Standard deviation (bp) * | 25.63 | 137.70

* Metrics marked with an asterisk were calculated after preliminary normalization, i.e. each species is represented in the summary by its median within-species gene length.
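The normalization described in this footnote can be sketched with standard command-line tools. This is only an illustration under stated assumptions, not the code used to build RiboGrove: the input table, the species names and the lengths are hypothetical, and the mean is just one example of a summary statistic computed over the per-species medians.

```shell
# Toy input (hypothetical values): species name <TAB> gene length in bp, one line per gene.
genes=$(printf '%s\t%s\n' \
    'Species A' 1500 \
    'Species A' 1502 \
    'Species A' 1540 \
    'Species B' 1470 \
    'Species B' 1474)
tab=$(printf '\t')

# Step 1: one median per species (input is sorted by species, then by length).
species_medians=$(printf '%s\n' "$genes" \
    | sort -t "$tab" -k1,1 -k2,2n \
    | awk -F '\t' '
        function flush_species() {
            if (n == 0) return
            # Median of the sorted lengths collected for the current species.
            median = (n % 2) ? len[(n + 1) / 2] : (len[n / 2] + len[n / 2 + 1]) / 2
            printf "%s\t%s\n", species, median
            n = 0
        }
        $1 != species { flush_species(); species = $1 }
        { len[++n] = $2 }
        END { flush_species() }')

# Step 2: summarize the per-species medians, here with the mean.
mean_of_medians=$(printf '%s\n' "$species_medians" \
    | awk -F '\t' '{ sum += $2 } END { printf "%.2f\n", sum / NR }')
echo "$mean_of_medians"
```

Summarizing medians rather than raw gene lengths keeps species with many sequenced genomes from dominating the statistics.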

16S rRNA gene copy number
Copy number * | Bacteria: number of species | Bacteria: per cent of species (%) | Archaea: number of species | Archaea: per cent of species (%)
1 949 12.30 205 53.81
2 1,431 18.54 106 27.82
3 1,164 15.08 56 14.70
4 1,003 13.00 9 2.36
5 657 8.51 5 1.31
6 825 10.69 0 0.00
7 650 8.42 0 0.00
8 389 5.04 0 0.00
9 202 2.62 0 0.00
10 172 2.23 0 0.00
11 92 1.19 0 0.00
12 67 0.87 0 0.00
13 34 0.44 0 0.00
14 49 0.63 0 0.00
15 12 0.16 0 0.00
16 5 0.06 0 0.00
17 5 0.06 0 0.00
18 5 0.06 0 0.00
20 4 0.05 0 0.00
27 1 0.01 0 0.00
37 1 0.01 0 0.00

* These are median within-species copy numbers.

Top-10 longest 16S rRNA genes
Organism | Gene length (bp) | RiboGrove Sequence ID(s) | Assembly ID
Bacteria
Thermus thermophilus strain AA2-2 | 2,438 | G_10898951:NZ_AP024929.1:249100-251537:minus | 10898951
Ca. Annandia pinicola strain Ad13-065 | 1,887 | G_11277031:NZ_CP045876.1:290071-291957:minus | 11277031
Nitrosophilus labii strain HRV44 | 1,806 | G_8028891:NZ_AP022826.1:1258017-1259822:minus, G_8028891:NZ_AP022826.1:1532588-1534393:minus, G_8028891:NZ_AP022826.1:1939914-1941719:minus | 8028891
Gelria sp. Kuro-4 | 1,788 | G_10731991:NZ_AP024619.1:2016182-2017969:minus | 10731991
Thermoanaerobacter pseudethanolicus strain ATCC 33223 | 1,781 | G_40148:NC_010321.1:2265744-2267524:minus | 40148
Thermoanaerobacter brockii strain Ako-1 | 1,781 | G_282748:NC_014964.1:2252888-2254668:minus | 282748
Campylobacter sputorum strain RM3237 | 1,744 | G_1153941:NZ_CP019682.1:607981-609724:plus, G_1153941:NZ_CP019682.1:929565-931308:minus, G_1153941:NZ_CP019682.1:1501945-1503688:minus | 1153941
Campylobacter sputorum strain LMG 7795 | 1,744 | G_4499991:NZ_CP043427.1:609141-610884:plus, G_4499991:NZ_CP043427.1:930699-932442:minus, G_4499991:NZ_CP043427.1:1503078-1504821:minus | 4499991
Campylobacter sputorum strain CCUG 20703 | 1,743 | G_1153911:NZ_CP019683.1:606847-608589:plus, G_1153911:NZ_CP019683.1:935163-936905:minus, G_1153911:NZ_CP019683.1:1558189-1559931:minus | 1153911
Campylobacter hyointestinalis strain CHY5 | 1,742 | G_7294871:NZ_CP053828.1:357136-358877:plus, G_7294871:NZ_CP053828.1:1667816-1669557:minus | 7294871
Campylobacter sp. RM6137 | 1,742 | G_1101781:NZ_CP018789.1:273370-275111:plus, G_1101781:NZ_CP018789.1:1545743-1547484:minus | 1101781
Campylobacter sputorum strain RM8705 | 1,742 | G_1153931:NZ_CP019685.1:577810-579551:plus, G_1153931:NZ_CP019685.1:891862-893603:minus, G_1153931:NZ_CP019685.1:1479764-1481505:minus | 1153931
Archaea
Pyrobaculum ferrireducens strain 1860 | 3,604 | G_351728:NC_016645.1:127214-130817:plus | 351728
Pyrobaculum aerophilum strain IM2 | 2,213 | G_28808:NC_003364.1:1089640-1091852:plus | 28808
Pyrobaculum arsenaticum strain DSM 13514 | 2,212 | G_37488:NC_009376.1:623323-625534:minus | 37488
Aeropyrum pernix strain K1 | 2,202 | G_32288:NC_000854.2:1218712-1220913:minus | 32288
Pyrobaculum neutrophilum strain V24Sta | 2,197 | G_40848:NC_010525.1:690419-692615:plus | 40848
Ca. Mancarchaeum acidiphilum strain Mia14 | 2,008 | G_1145431:NZ_CP019964.1:751297-753304:minus | 1145431
Ca. Micrarchaeum sp. A_DKE | 2,003 | G_9220081:NZ_CP060530.1:203642-205644:minus | 9220081
Caldivirga maquilingensis strain IC-167 | 1,679 | G_39388:NC_009954.1:129150-130828:minus | 39388
Aeropyrum camini strain SY1 | 1,650 | G_127981:NC_022521.1:1165168-1166817:minus | 127981
Pyrolobus fumarii strain 1A | 1,576 | G_304318:NC_015931.1:84671-86246:minus | 304318

Top-10 shortest 16S rRNA genes
Organism | Gene length (bp) | RiboGrove Sequence ID(s) | Assembly ID
Bacteria
Hirschia baltica strain ATCC 49814 | 1,448 | G_44428:NC_012982.1:2336679-2338126:minus | 44428
Sagittula sp. P11 | 1,449 | G_1460951:NZ_CP021913.1:2386837-2388285:plus, G_1460951:NZ_CP021913.1:3597920-3599368:plus | 1460951
Hyphomonas sp. Mor2 | 1,451 | G_860061:NZ_CP017718.1:2304269-2305719:minus | 860061
Antarctobacter heliothermus strain SMS3 | 1,453 | G_1163161:NZ_CP022540.1:1369380-1370832:plus, G_1163161:NZ_CP022540.1:2482480-2483932:plus | 1163161
Mameliella alba strain KU6B | 1,454 | G_6279751:NZ_AP022337.1:267139-268592:plus, G_6279751:NZ_AP022337.1:1420942-1422395:plus, G_6279751:NZ_AP022337.1:3191208-3192661:minus | 6279751
Hyphomonas neptunium strain ATCC 15444 | 1,455 | G_34128:NC_008358.1:2818466-2819920:minus | 34128
Hyphomonas sp. KY3 | 1,455 | G_9503471:NZ_CP022271.1:2407999-2409453:minus | 9503471
Pseudooceanicola algae strain Lw-13e | 1,458 | G_8694041:NZ_CP060436.1:2482207-2483664:minus | 8694041
Ruegeria sp. SCSIO 43209 | 1,458 | G_10854641:NZ_CP065359.1:3157837-3159294:minus | 10854641
Paracoccus contaminans strain LMG 29738T | 1,459 | G_1078381:NZ_CP020612.1:582021-583479:minus, G_1078381:NZ_CP020612.1:1166317-1167775:minus | 1078381
Sulfitobacter mediterraneus strain SC1-11 | 1,459 | G_9217271:NZ_CP069004.1:3093411-3094869:plus | 9217271
Pelagovum pacificum strain SM1903 | 1,459 | G_8872011:NZ_CP065915.1:2729819-2731277:minus, G_8872011:NZ_CP065915.1:3593071-3594529:minus | 8872011
Sulfitobacter pontiacus strain W028 | 1,459 | G_13748391:NZ_CP081118.1:282264-284:minus | 13748391
Sulfitobacter sp. B30-2 | 1,459 | G_8738751:NZ_CP065429.1:477373-478831:plus | 8738751
Archaea
Ignicoccus hospitalis strain KIN4/I | 1,439 | G_39048:NC_009776.1:728362-729800:plus | 39048
Methanocaldococcus sp. SG7 | 1,457 | G_10131521:NZ_LR792632.1:542755-544211:plus | 10131521
Halorubrum sp. BOL3-1 | 1,463 | G_2220501:NZ_CP034692.1:397753-399215:minus | 2220501
Natronomonas halophila strain C90 | 1,466 | G_7330651:NZ_CP058334.1:1530622-1532087:minus | 7330651
Ca. Methanomethylophilus alvus strain MGYG-HGUT-02456 | 1,466 | G_4352521:NZ_LR699000.1:283607-285072:plus | 4352521
Natronomonas sp. ZY43 | 1,466 | G_13300761:NZ_CP101154.1:18680-20145:plus | 13300761
Methanospirillum hungatei strain JF-1 | 1,466 | G_34548:NC_007796.1:39814-41279:plus, G_34548:NC_007796.1:1301079-1302544:minus, G_34548:NC_007796.1:3501525-3502990:minus, G_34548:NC_007796.1:3507609-3509074:minus | 34548
Natronomonas gomsonensis strain KCTC 4088 | 1,466 | G_13300951:NZ_CP101323.1:2500564-2502029:plus | 13300951
Ca. Methanomethylophilus alvus strain Mx-05 | 1,466 | G_2068141:NZ_CP017686.1:283608-285073:plus | 2068141
Methanospirillum hungatei strain GP1 | 1,466 | G_10519241:NZ_CP077107.1:4649-6114:plus, G_10519241:NZ_CP077107.1:1359562-1361027:minus, G_10519241:NZ_CP077107.1:1365502-1366967:minus, G_10519241:NZ_CP077107.1:1986020-1987485:minus | 10519241
Methanospirillum sp. J.3.6.1-F.2.7.3 | 1,466 | G_10123301:NZ_CP075546.1:133354-134819:plus, G_10123301:NZ_CP075546.1:825954-827419:plus, G_10123301:NZ_CP075546.1:872641-874106:plus, G_10123301:NZ_CP075546.1:1727419-1728884:plus | 10123301
Ca. Methanomethylophilus alvus strain Mx1201 | 1,466 | G_599268:NC_020913.1:283607-285072:plus | 599268
Salinirubellus salinus strain ZS-35-S2 | 1,466 | G_13813051:NZ_CP104003.1:3070232-3071697:plus | 13813051

Top-10 genomes with the largest 16S rRNA copy numbers
Organism | Copy number | Assembly ID
Bacteria
Tumebacillus avium strain AR23208 | 37 | 1115491
Tumebacillus algifaecis strain THMBR28 | 27 | 1166771
Peribacillus asahii strain KF4 | 21 | 13022701
Priestia megaterium strain S2 | 21 | 6720751
Neobacillus drentensis strain JC05 | 20 | 11802511
Moritella sp. 28 | 20 | 9972251
Moritella sp. 5 | 20 | 9972261
Moritella sp. 36 | 20 | 9972241
Photobacterium damselae strain AS-15-3942-7 | 19 | 11907491
Metabacillus litoralis strain Bac94 | 19 | 2023811
Archaea
Methanococcoides orientis strain LMO-1 | 5 | 11622961
Natrinema sp. SYSU A 869 | 5 | 10842511
Natronorubrum bangense strain JCM 10635 | 5 | 2580821
Natronorubrum aibiense strain 7-3 | 5 | 5073821
Methanoplanus endosymbiosus strain DSM 3599 | 5 | 13492921
Methanosphaera stadtmanae strain DSM 3091 | 4 | 33648
Natronococcus occultus strain SP4 | 4 | 521038
Methanospirillum sp. J.3.6.1-F.2.7.3 | 4 | 10123301
Methanospirillum hungatei strain GP1 | 4 | 10519241
Halosiccatus urmianus strain IBRC-M: 10911 | 4 | 11057071
Halomicrobium salinisoli strain LT50 | 4 | 11151361
Haloterrigena salifodinae strain BOL5-1 | 4 | 9298621
Halomicrobium salinisoli strain TH30 | 4 | 11151391
Methanococcus vannielii strain SB | 4 | 38268
Methanospirillum hungatei strain JF-1 | 4 | 34548
Haloarcula sinaiiensis strain ATCC 33800 | 4 | 9962651
Methanosphaera stadtmanae strain MGYG-HGUT-02164 | 4 | 4349641

Top-10 genomes with the highest intragenomic variability of 16S rRNA genes
Organism | Sum of entropy * (bits) | Mean entropy * (bits) | Number of variable positions | Gene copy number | Assembly ID
Bacteria
Synechococcus sp. NB0720_010 | 243.35 | 0.16 | 265 | 3 | 12576831
Xanthomonas oryzae strain YNCX | 227.74 | 0.15 | 248 | 3 | 13407211
Sporomusa termitida strain DSM 4440 | 226.25 | 0.13 | 247 | 12 | 4155511
Campylobacter hyointestinalis strain CHY5 | 217.64 | 0.12 | 237 | 3 | 7294871
Campylobacter sp. RM6137 | 211.21 | 0.12 | 230 | 3 | 1101781
Acetivibrio thermocellus strain M3 | 211.00 | 0.14 | 211 | 2 | 13802461
Sinorhizobium meliloti strain AK76 | 184.58 | 0.12 | 201 | 3 | 9010851
Cylindrospermopsis raciborskii strain KLL07 | 168.97 | 0.11 | 184 | 3 | 11851031
Klebsiella pneumoniae strain GZ-1 | 167.21 | 0.10 | 216 | 5 | 8227731
Olleya sp. Bg11-27 | 145.25 | 0.10 | 156 | 3 | 1469691
Archaea
Halomicrobium sp. ZPS1 ** | 137.00 | 0.09 | 137 | 2 | 4982121
Halosiccatus urmianus strain IBRC-M: 10911 | 131.55 | 0.09 | 146 | 4 | 11057071
Halapricum desulfuricans strain HSR12-2 | 128.00 | 0.09 | 128 | 2 | 9390741
Halomicrobium salinisoli strain TH30 | 127.74 | 0.09 | 145 | 4 | 11151391
Halapricum desulfuricans strain HSR-Bgl | 127.00 | 0.09 | 127 | 2 | 9390521
Halomicrobium mukohataei strain JP60 | 125.81 | 0.09 | 137 | 3 | 2582391
Halomicrobium salinisoli strain LT50 | 123.31 | 0.08 | 140 | 4 | 11151361
Halapricum desulfuricans strain HSR-Est | 111.00 | 0.08 | 111 | 2 | 9390681
Halapricum desulfuricans strain HSR12-1 | 109.00 | 0.07 | 109 | 2 | 9390731
Halorussus sp. XZYJT49 | 105.10 | 0.07 | 113 | 3 | 12653301

* Entropy is Shannon entropy calculated for each column of the multiple sequence alignment (MSA) of all full-length 16S rRNA genes of a genome. Entropy is then summed up (column “Sum of entropy”) and averaged (column “Mean entropy”).

** Halomicrobium sp. ZPS1 is a remarkable case. This genome harbours two 16S rRNA genes, so every mismatching alignment column contributes exactly 1 bit, and the sum of entropy equals the number of mismatching nucleotides between the two gene sequences. Accordingly, the per cent identity between these two gene sequences is only 90.70%. This is remarkable because the usual (albeit arbitrary) genus demarcation threshold of per cent identity is 95%.
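The per-column entropy calculation described in the first footnote can be sketched with awk. This is an illustration under stated assumptions rather than the RiboGrove implementation: the three-sequence alignment is a hypothetical toy without gaps, and the footnote does not say how gaps are treated, so gap handling is left out.

```shell
# Toy alignment (hypothetical): three aligned gene copies, one per line, no gaps.
alignment=$(printf '%s\n' ACGTAC ACGTAT ACCTAG)

entropy_summary=$(printf '%s\n' "$alignment" | awk '
    {
        ncols = length($0)
        # Count every symbol observed in every alignment column.
        for (i = 1; i <= ncols; i++) count[i, substr($0, i, 1)]++
    }
    END {
        for (key in count) {
            split(key, part, SUBSEP)
            p = count[key] / NR
            # Shannon entropy term in bits: -p * log2(p)
            h[part[1]] -= p * log(p) / log(2)
        }
        for (i = 1; i <= ncols; i++) {
            total += h[i]
            if (h[i] > 0) variable++
        }
        # Output: sum of entropy, mean entropy, number of variable positions.
        printf "%.2f %.2f %d\n", total, total / ncols, variable
    }')
echo "$entropy_summary"
```

In this toy alignment only columns 3 and 6 vary, so the three printed values correspond to the “Sum of entropy”, “Mean entropy” and “Number of variable positions” columns of the table.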

Coverage* of primer pairs for different V-regions of bacterial 16S rRNA genes
Phylum | Number of genomes | Full gene, 27F–1492R (%) | V1–V2, 27F–338R (%) | V1–V3, 27F–534R (%) | V3–V4, 341F–785R (%) | V3–V5, 341F–944R (%) | V4–V5, 515F–944R (%) | V4–V6, 515F–1100R (%) | V5–V6, 784F–1100R (%) | V5–V7, 784F–1193R (%) | V6–V7, 939F–1193R (%) | V6–V8, 939F–1378R (%)
Proteobacteria 17,538 99.74 99.53 99.73 99.95 82.30 82.34 90.47 90.16 93.59 92.59 96.94
Firmicutes 6,721 99.97 99.85 99.96 99.96 95.73 95.63 99.51 97.87 97.17 98.54 99.27
Actinobacteria 2,840 99.79 98.80 99.58 94.05 63.35 63.13 96.13 99.61 99.72 99.75 97.01
Bacteroidota 1,161 95.09 94.57 95.00 99.91 61.15 60.72 38.67 38.93 94.83 92.42 94.49
Tenericutes 468 97.22 94.44 73.29 98.29 90.38 90.60 73.50 41.88 42.95 77.99 0.43
Spirochaetes 261 65.13 65.13 65.13 93.87 100.00 100.00 100.00 72.03 72.03 88.12 50.19
Cyanobacteria 213 100.00 100.00 100.00 100.00 5.16 5.16 100.00 0.94 0.94 100.00 99.53
Chlamydiae 186 0.00 0.00 0.00 100.00 100.00 0.00 0.00 100.00 100.00 100.00 93.55
Verrucomicrobia 113 99.12 0.00 99.12 100.00 8.85 8.85 100.00 0.88 0.88 100.00 100.00
Fusobacteria 80 100.00 96.25 100.00 100.00 100.00 100.00 100.00 98.75 98.75 100.00 0.00
Deinococcus-Thermus 74 100.00 100.00 100.00 100.00 100.00 100.00 100.00 0.00 0.00 50.00 100.00
Planctomycetota 57 100.00 19.30 100.00 100.00 64.91 64.91 0.00 0.00 0.00 3.51 0.00
Thermotogae 42 100.00 97.62 100.00 100.00 9.52 9.52 100.00 0.00 0.00 59.52 97.62
Chloroflexi 41 100.00 90.24 100.00 39.02 0.00 0.00 87.80 4.88 4.88 92.68 26.83
Acidobacteria 31 96.77 96.77 96.77 100.00 100.00 100.00 100.00 61.29 45.16 83.87 100.00
Aquificae 14 100.00 21.43 100.00 100.00 21.43 21.43 100.00 0.00 0.00 7.14 21.43
Chlorobi 14 100.00 100.00 100.00 100.00 0.00 0.00 0.00 100.00 92.86 85.71 7.14
Nitrospirae 10 100.00 100.00 100.00 100.00 60.00 60.00 100.00 100.00 60.00 60.00 100.00
Thermodesulfobacteria 7 100.00 100.00 100.00 100.00 0.00 0.00 100.00 0.00 0.00 100.00 100.00
Deferribacteres 6 100.00 100.00 100.00 100.00 0.00 0.00 100.00 100.00 100.00 100.00 100.00
Synergistetes 6 100.00 100.00 100.00 100.00 0.00 0.00 100.00 0.00 0.00 100.00 100.00
Ca. Saccharibacteria 6 100.00 100.00 100.00 100.00 16.67 16.67 16.67 0.00 0.00 100.00 100.00
Elusimicrobia 4 100.00 50.00 100.00 100.00 0.00 0.00 100.00 75.00 75.00 100.00 100.00
Gemmatimonadetes 4 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Kiritimatiellaeota 2 100.00 100.00 100.00 100.00 100.00 100.00 100.00 0.00 0.00 100.00 100.00
Ignavibacteriae 2 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Fibrobacteres 2 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Dictyoglomi 2 100.00 100.00 100.00 100.00 0.00 0.00 100.00 0.00 0.00 100.00 0.00
Chrysiogenetes 2 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Balneolaeota 1 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Ca. Bipolaricaulota 1 0.00 0.00 0.00 100.00 100.00 100.00 0.00 0.00 0.00 0.00 0.00
Caldiserica 1 100.00 100.00 100.00 100.00 0.00 0.00 0.00 0.00 100.00 100.00 100.00
Coprothermobacterota 1 0.00 0.00 0.00 100.00 100.00 100.00 0.00 0.00 0.00 100.00 0.00
Atribacterota 1 100.00 100.00 100.00 100.00 100.00 100.00 100.00 0.00 0.00 100.00 100.00
Armatimonadetes 1 100.00 0.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Calditrichaeota 1 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Ca. Omnitrophica 1 100.00 100.00 100.00 100.00 0.00 0.00 100.00 100.00 100.00 100.00 100.00
Ca. Cloacimonetes 1 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Ca. Absconditabacteria 1 100.00 0.00 100.00 100.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00

* Coverage of a primer pair is the per cent of genomes having at least one 16S rRNA gene which can be amplified by PCR using this primer pair. For details, see our paper about RiboGrove.
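The coverage definition above can be sketched as a small calculation. How a gene is judged to be amplifiable is described in the RiboGrove paper and is not reproduced here; the input table with its 0/1 flags and the assembly IDs G_1 to G_4 are hypothetical.

```shell
# Toy input (hypothetical): assembly ID <TAB> 1 if the gene is predicted to be
# amplified by the primer pair, 0 otherwise. One line per 16S rRNA gene.
pcr_results=$(printf '%s\t%s\n' \
    G_1 1 \
    G_1 0 \
    G_2 0 \
    G_2 0 \
    G_3 1 \
    G_4 1)

coverage=$(printf '%s\n' "$pcr_results" | awk -F '\t' '
    # Count each genome once.
    !($1 in seen) { seen[$1] = 1; genomes++ }
    # A genome is covered as soon as one of its genes can be amplified.
    $2 == 1 && !($1 in covered) { covered[$1] = 1; hits++ }
    END { printf "%.2f\n", 100 * hits / genomes }')
echo "$coverage"
```

Here three of the four genomes have at least one amplifiable gene, so the coverage of this imaginary primer pair is 75.00%.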

Primers used for coverage estimation
Primer name | Sequence | Reference
27F | AGAGTTTGATYMTGGCTCAG | Frank et al., 2008
338R | GCTGCCTCCCGTAGGAGT | Suzuki et al., 1996
341F * | CCTACGGGNGGCWGCAG | Klindworth et al., 2013
515F | GTGCCAGCMGCCGCGGTAA | Turner et al., 1999
534R | ATTACCGCGGCTGCTGG | Walker et al., 2015
784F | AGGATTAGATACCCTGGTA | Andersson et al., 2008
785R * | GACTACHVGGGTATCTAATCC | Klindworth et al., 2013
939F | GAATTGACGGGGGCCCGCACAAG | Lebuhn et al., 2014
944R | GAATTAAACCACATGCTC | Fuks et al., 2018
1100R | AGGGTTGCGCTCGTTG | Turner et al., 1999
1193R | ACGTCATCCCCACCTTCC | Bodenhausen et al., 2013
1378R | CGGTGTGTACAAGGCCCGGGAACG | Lebuhn et al., 2014
1492R | TACCTTGTTACGACTT | Frank et al., 2008

* Primers 341F and 785R are used in the library preparation protocol for sequencing the V3–V4 region of 16S rRNA genes on Illumina MiSeq.


Searching data in RiboGrove

RiboGrove is a very minimalistic database: it comprises a collection of plain fasta files with metadata. Thus, extended search instruments are not available for it. We acknowledge this limitation and provide a list of suggestions below, which should help you explore and select RiboGrove data.

Header format

Headers in RiboGrove fasta files have the following format:

>G_324861:NZ_CP009686.1:8908-10459:plus ;d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__cereus; category:1

Major blocks of a header are separated by spaces. A header consists of three such blocks:

  1. Sequence ID (seqID): G_324861:NZ_CP009686.1:8908-10459:plus. SeqID, in turn, consists of four parts separated by colons (:):
    1. The assembly ID of the genome, from which the gene originates: G_324861. Assembly ID is preceded by the prefix G_ to ensure search specificity.
    2. The accession number of the RefSeq sequence, from which the gene originates: NZ_CP009686.1.
    3. Coordinates of the gene within this RefSeq genomic sequence: 8908-10459 (coordinates are 1-based, left-closed and right-closed).
    4. Strand of the RefSeq genomic sequence, where the gene is located: plus (or minus).
  2. A taxonomy string, comprising domain (Bacteria), phylum (Firmicutes), class (Bacilli), order (Bacillales), family (Bacillaceae), genus (Bacillus) names, and the specific epithet (cereus).
    Each name is preceded by a prefix, which denotes rank: d__ for domain, p__ for phylum, c__ for class, o__ for order, f__ for family, g__ for genus, and s__ for specific epithet. Prefixes contain double underscores.
    The taxonomic names are separated and flanked by semicolons (;).
  3. The category of the genome, from which the gene sequence originates: (category:1).
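The header layout described above can be taken apart with standard tools. The following is a minimal sketch that uses the example header from this page (without the leading ">"); the variable names are illustrative, and it assumes that taxonomic names contain no spaces, as in this example.

```shell
header='G_324861:NZ_CP009686.1:8908-10459:plus ;d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__cereus; category:1'

# The three space-separated blocks.
seqid=$(printf '%s\n' "$header" | cut -d ' ' -f 1)
taxonomy=$(printf '%s\n' "$header" | cut -d ' ' -f 2)
category=$(printf '%s\n' "$header" | cut -d ' ' -f 3 | cut -d ':' -f 2)

# The four colon-separated parts of the SeqID.
assembly_id=$(printf '%s\n' "$seqid" | cut -d ':' -f 1 | sed 's/^G_//')
accession=$(printf '%s\n' "$seqid" | cut -d ':' -f 2)
coords=$(printf '%s\n' "$seqid" | cut -d ':' -f 3)
strand=$(printf '%s\n' "$seqid" | cut -d ':' -f 4)

# Coordinates are 1-based and closed on both sides, so length = end - start + 1.
start=${coords%-*}
end=${coords#*-}
gene_length=$((end - start + 1))

# A single rank from the taxonomy string, located by its prefix.
genus=$(printf '%s\n' "$taxonomy" | grep -Eo ';g__[^;]+' | sed 's/^;g__//')

echo "$assembly_id $accession $start $end $strand $gene_length $genus $category"
```

For this header the computed gene length is 1552 bp, because both the start and the end coordinate belong to the gene.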

Sequence selection

You can select specific sequences from fasta files using the Seqkit program (GitHub repo, documentation). It is free, cross-platform, multifunctional and pretty fast, and it can process both gzipped and uncompressed fasta files. The subcommands seqkit grep and seqkit seq are useful for sequence selection.

Search sequences by header

Given the downloaded fasta file ribogrove_9.215_sequences.fasta.gz, consider the following examples of sequence selection using seqkit grep:

Example 1. Select a single sequence by SeqID.

seqkit grep -p "G_324861:NZ_CP009686.1:8908-10459:plus" ribogrove_9.215_sequences.fasta.gz

The -p option sets the pattern to search for. By default, seqkit grep matches the pattern against sequence IDs only, not against whole headers.

Example 2. Select all gene sequences of a single RefSeq genomic sequence by accession number NZ_CP009686.1.

seqkit grep -nrp ":NZ_CP009686.1:" ribogrove_9.215_sequences.fasta.gz

Here, two more options are required: -n and -r. The former tells the program to match whole headers instead of IDs only. The latter tells the program to treat the pattern as a regular expression, so partial matches are allowed: if the pattern matches a substring of a header, the sequence will be printed to output.

To ensure search specificity, surround the Accession.Version with colons (:).

Example 3. Select all gene sequences of a single genome (Assembly ID 10577151).

seqkit grep -nrp "G_10577151:" ribogrove_9.215_sequences.fasta.gz

To ensure search specificity, the Assembly ID should be preceded by the prefix G_ and followed by a colon (:).

Example 4. Select all actinobacterial sequences.

seqkit grep -nrp ";p__Actinobacteria;" ribogrove_9.215_sequences.fasta.gz

To ensure search specificity, surround the taxonomy name with semicolons (;).

Example 5. Select all sequences originating from category 1 genomes.

seqkit grep -nrp "category:1" ribogrove_9.215_sequences.fasta.gz

Example 6. Select all sequences except for those belonging to Firmicutes.

seqkit grep -nvrp ";p__Firmicutes;" ribogrove_9.215_sequences.fasta.gz

Note the -v option within the option string -nvrp. This option inverts the match, i.e. the output will comprise sequences whose headers do not contain the substring “;p__Firmicutes;”.

Search sequences by length

You can use the seqkit seq program to select sequences by length.

Example 1. Select all sequences longer than 1600 bp.

seqkit seq -m 1601 ribogrove_9.215_sequences.fasta.gz

The -m option sets the minimum length of a sequence to be printed to output.

Example 2. Select all sequences shorter than 1500 bp.

seqkit seq -M 1499 ribogrove_9.215_sequences.fasta.gz

The -M option sets the maximum length of a sequence to be printed to output.

Example 3. Select all sequences having length in range [1500, 1600] bp.

seqkit seq -m 1500 -M 1600 ribogrove_9.215_sequences.fasta.gz

Selecting header data

It is sometimes useful to retrieve only header information from a fasta file. You can use the seqkit seq program for it.

Example 1. Select all headers.

seqkit seq -n ribogrove_9.215_sequences.fasta.gz

The -n option tells the program to output only headers.

Example 2. Select all SeqIDs (header parts before the first space).

seqkit seq -ni ribogrove_9.215_sequences.fasta.gz

The -i option tells the program to output only sequence IDs.

Example 3. Select all “Accession.Version”s.

seqkit seq -ni ribogrove_9.215_sequences.fasta.gz | cut -f2 -d':' | sort | uniq

This works only if you have the cut, sort and uniq utilities installed (Linux and macOS systems usually have them built in).

Example 4. Select all Assembly IDs.

seqkit seq -ni ribogrove_9.215_sequences.fasta.gz | cut -f1 -d':' | sed 's/G_//' | sort | uniq

This works only if you have the cut, sed, sort and uniq utilities installed (Linux and macOS systems usually have them built in).

Example 5. Select all phylum names.

seqkit seq -n ribogrove_9.215_sequences.fasta.gz | grep -Eo ';p__[^;]+' | sed -E 's/;|p__//g' | sort | uniq

This works only if you have the grep, sed, sort and uniq utilities installed (Linux and macOS systems usually have them built in).
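Building on Example 5, the same header stream can be tallied to show how many sequences each phylum contributes. This is a sketch only: count_phyla is a hypothetical helper name, and the three toy headers with shortened taxonomy strings stand in for real output of seqkit seq -n so that the snippet runs without the fasta file.

```shell
# Read fasta headers on stdin and print "phylum <TAB> number of sequences", sorted by name.
count_phyla() {
    grep -Eo ';p__[^;]+' \
        | awk '{ sub(/^;p__/, ""); n[$0]++ }
               END { for (name in n) printf "%s\t%d\n", name, n[name] }' \
        | sort
}

# Toy headers (hypothetical IDs, shortened taxonomy).
toy_headers=$(printf '%s\n' \
    'G_1:ACC_1.1:1-1500:plus ;d__Bacteria;p__Firmicutes;c__Bacilli; category:1' \
    'G_1:ACC_1.1:5001-6500:minus ;d__Bacteria;p__Firmicutes;c__Bacilli; category:1' \
    'G_2:ACC_2.1:1-1520:plus ;d__Bacteria;p__Actinobacteria;c__Actinomycetia; category:2')
phylum_counts=$(printf '%s\n' "$toy_headers" | count_phyla)
printf '%s\n' "$phylum_counts"

# With the real file: seqkit seq -n ribogrove_9.215_sequences.fasta.gz | count_phyla
```

Note that this counts gene sequences, not genomes; to count genomes per phylum, the assembly IDs would have to be deduplicated first.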



RiboGrove, 2023-05-19