RiboGrove mirror


The main website where RiboGrove is hosted may be unavailable outside Belarus due to technical troubles and the overall disaster.
Hence this mirror has been created, and RiboGrove files are available through Zenodo (links are below).

What is RiboGrove
- RiboGrove and other 16S rRNA databases
- Genome categories
Downloads

Statistical summary
Searching data in RiboGrove
Contacts
Citing RiboGrove
Questions people ask about RiboGrove

What is RiboGrove

RiboGrove is a database of 16S rRNA gene sequences of bacteria and archaea.

RiboGrove is based on the RefSeq database. It contains only full-length sequences of 16S rRNA genes, and the sequences are derived from completely assembled prokaryotic genomes deposited in RefSeq. Hence we posit high reliability of RiboGrove sequences.

RiboGrove and other 16S rRNA databases

The table below summarizes the qualitative differences between RiboGrove and similar rRNA sequence databases, namely rrnDB, Silva, RDP, and Greengenes. Briefly, RiboGrove contains fewer and less diverse sequences, but the sequences it does contain are more reliable.

	RiboGrove	rrnDB	Silva	RDP	Greengenes
Represented organisms	Bacteria Archaea	Bacteria Archaea	Bacteria Archaea Eukaryotes	Bacteria Archaea Eukaryotes	Bacteria Archaea
Represented ribosome subunits	Small	Small	Large Small	Large Small	Small
Contains sequences from assembled genomes	Yes	Yes	Yes	Yes	Yes
Contains amplicon sequences	No	No	Yes	Yes	Yes
Contains partial gene sequences	No	Yes	Yes	Yes	Yes
Discriminates genome categories	Yes	No	Not applicable	Not applicable	Not applicable

Genome categories

All genomes used for RiboGrove construction were divided into three categories according to their expected reliability:

Category 1 (the highest reliability). Genomes showing no signs of a low-quality assembly and sequenced either with PacBio technology or with a combination “Oxford Nanopore + Illumina”.
Category 2. Genomes showing no signs of a low-quality assembly and sequenced with any other technology (or the technology is not specified).
Category 3 (the lowest reliability). Genomes showing at least one sign of a low-quality assembly.

Signs of a low-quality assembly are the following:

The genome contains degenerate base(s) in 16S rRNA gene sequences.
The assembly includes at least one RefSeq record whose title contains the phrase “map unlocalized” and this record contains a 16S rRNA gene or a part of it.
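
The category rules above can be sketched as a small decision function. This is only an illustration, not the code from ribogrove-tools: the `Genome` record, its field names, and the normalized technology strings ("PacBio", "Oxford Nanopore", "Illumina") are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Genome:
    # Hypothetical record; the field names are illustrative only.
    seq_technologies: frozenset       # e.g. frozenset({"PacBio"})
    has_degenerate_bases_in_16s: bool
    has_unlocalized_16s: bool         # a "map unlocalized" record carries a 16S gene


def assign_category(genome):
    """Return 1, 2 or 3 following the category rules listed above."""
    low_quality = (
        genome.has_degenerate_bases_in_16s or genome.has_unlocalized_16s
    )
    if low_quality:
        return 3  # at least one sign of a low-quality assembly
    techs = genome.seq_technologies
    # Assumption: technology names are already normalized to these strings.
    if "PacBio" in techs or {"Oxford Nanopore", "Illumina"} <= techs:
        return 1
    return 2  # any other technology, or technology not specified
```

Note that the low-quality check comes first, so a PacBio genome with a degenerate base in a 16S rRNA gene still falls into category 3.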

The software used for the RiboGrove construction can be found in the following GitHub repository: ribogrove-tools.

Downloads

Latest RiboGrove release — 26.232 (2025-09-09)

The release is based on RefSeq release 232.

A fasta file of full-length 16S gene sequences. Download (gzipped fasta file, 11.03 MB)
Metadata Download (zip archive 7.25 MB)
What information exactly does the metadata contain?
The metadata consists of the following files:
1. discarded_sequences.fasta.gz
  This is a fasta file of sequences that were present in the source RefSeq genomes and were annotated as 16S rRNA genes, but were discarded because of incompleteness, internal repeats, etc., and thus were not included in RiboGrove.
2. source_RefSeq_genomes.tsv
  This is a TSV file, which contains information about which genomes were used for the RiboGrove construction.
3. gene_seqs_base_counts.tsv, discarded_gene_seqs_base_counts.tsv
  These are TSV files, which contain nucleotide composition and size of the gene sequences. The first file describes final RiboGrove gene sequences, the second file describes discarded sequences.
4. categories.tsv
  This is a TSV file, which contains information about what genome categories were assigned to each genome and why. Moreover, it contains information about what sequencing technology was used to sequence each genome.
5. taxonomy.tsv
  This is a TSV file, which contains taxonomic affiliation of each genome and gene.
6. intragenic_repeats.tsv
  This is a TSV file, which contains information about intragenic repeats found in gene sequences using RepeatFinder.
7. entropy_summary.tsv
  This is a TSV file, which contains a summary of intragenomic variability of the 16S rRNA genes. Intragenomic variability is calculated only for the category 1 genomes having more than one 16S rRNA gene. It is evaluated using Shannon entropy: we align gene sequences from each genome using MAFFT, and then we calculate Shannon entropy for each multiple alignment column (i.e. base).
8. 16S_GCNs.tsv
  This is a TSV file of 16S rRNA Gene Copy Numbers for each genome in the release.
9. primer_pair_genomic_coverage.tsv
  This is a TSV file which contains genomic coverage of primer pairs targeting different V-regions of 16S rRNA genes. For example, for Enterobacteriaceae, genomic coverage of a primer pair is the percent of Enterobacteriaceae genomes which contain at least one 16S rRNA gene that can (theoretically) produce a PCR product using the primer pair.

The fasta file is compressed with gzip, and the metadata file is a zip archive. To uncompress them, Linux and Mac OS users may use the gzip and unzip programs, which are usually preinstalled. For Windows users, the free and open-source (de)compression program 7-Zip is available.
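
The gzipped fasta file can also be read directly, without uncompressing it first. Below is a minimal sketch that uses only the Python standard library; the function name `read_fasta_gz` and the file name in the usage note are placeholders, not part of RiboGrove.

```python
import gzip


def read_fasta_gz(path):
    """Yield (header, sequence) pairs from a gzip-compressed fasta file."""
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    # Emit the last record of the file.
    if header is not None:
        yield header, "".join(chunks)
```

For example, `for header, seq in read_fasta_gz("ribogrove.fasta.gz"): ...` would iterate over all records, assuming the downloaded file was saved under that (hypothetical) name.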

RiboGrove release archive

You can find all releases in the RiboGrove release archive.

Release notes

Starting with release 26.232, RiboGrove assigns a kingdom rank (doi: 10.1099/ijsem.0.006242) to each sequence.

You can find notes to all RiboGrove releases on the release notes page.

Statistical summary

RiboGrove size
	Bacteria	Archaea	Total
Number of gene sequences	288,462	1,111	289,573
Number of unique gene sequences	68,300	780	69,080
Number of species	13,234	503	13,737
Number of genomes	52,357	633	52,990
Number of genomes of category 1	34,853	262	35,115
Number of genomes of category 2	17,226	371	17,597
Number of genomes of category 3	278	0	278

16S rRNA gene lengths
	Bacteria	Archaea
Minimum (bp)	1,401.00	1,439.00
25th percentile (bp) ^*	1,517.00	1,471.00
Median (bp) ^*	1,529.00	1,474.00
75th percentile (bp) ^*	1,542.00	1,483.00
Average (bp) ^*	1,527.13	1,491.07
Mode (bp) ^*	1,537.00	1,472.00
Maximum (bp)	2,438.00	3,604.00
Standard deviation (bp) ^*	25.12	120.22

^* Metrics marked with an asterisk were calculated with preliminary normalization, i.e. median within-species gene length was used for the summary.
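
The normalization mentioned in the footnote can be illustrated as follows: gene lengths are first collapsed to one median value per species, and the summary statistics are then computed over these per-species values, so that heavily sequenced species do not dominate. This is a sketch under that reading of the footnote; the function name and the input layout (pairs of species name and gene length) are assumptions.

```python
from collections import defaultdict
from statistics import mean, median


def normalized_length_summary(records):
    """records: iterable of (species, gene_length) pairs."""
    by_species = defaultdict(list)
    for species, length in records:
        by_species[species].append(length)
    # One value per species: the median within-species gene length.
    per_species = [median(lengths) for lengths in by_species.values()]
    return {"median": median(per_species), "mean": mean(per_species)}
```

With the toy input `[("A", 1500), ("A", 1500), ("A", 1600), ("B", 1400)]`, species A contributes a single value (1500) instead of three, so the normalized mean is 1450 rather than the raw mean of 1500.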

16S rRNA gene copy number
Copy number ^*	Bacteria		Archaea
	Number of species	Percent of species (%)	Number of species	Percent of species (%)
1	1,644	12.42	249	49.50
2	2,257	17.05	151	30.02
3	1,800	13.60	79	15.71
4	1,697	12.82	18	3.58
5	1,024	7.74	6	1.19
6	1,749	13.22	0	0.00
7	1,197	9.04	0	0.00
8	667	5.04	0	0.00
9	344	2.60	0	0.00
10	323	2.44	0	0.00
11	162	1.22	0	0.00
12	146	1.10	0	0.00
13	59	0.45	0	0.00
14	91	0.69	0	0.00
15	26	0.20	0	0.00
16	12	0.09	0	0.00
17	13	0.10	0	0.00
18	6	0.05	0	0.00
19	3	0.02	0	0.00
20	8	0.06	0	0.00
21	1	0.01	0	0.00
22	1	0.01	0	0.00
24	1	0.01	0	0.00
25	1	0.01	0	0.00
27	1	0.01	0	0.00
37	1	0.01	0	0.00

^* These are median within-species copy numbers.

Top-10 longest 16S rRNA genes
Organism	Gene length (bp)	RiboGrove Sequence ID(s)	Assembly accession
Bacteria
Thermus thermophilus strain AA2-2	2,438	GCF_019974355.1:NZ_AP024929.1:249100-251537:minus	GCF_019974355.1
Ca. Annandia pinicola strain Ad13-065	1,887	GCF_020541245.1:NZ_CP045876.1:290071-291957:minus	GCF_020541245.1
Thermoanaerobacter ethanolicus strain JW 200	1,812	GCF_003722315.1:NZ_CP033580.1:456062-457873:plus	GCF_003722315.1
Nitrosophilus labii strain HRV44	1,806	GCF_014466985.1:NZ_AP022826.1:1258017-1259822:minus GCF_014466985.1:NZ_AP022826.1:1532588-1534393:minus GCF_014466985.1:NZ_AP022826.1:1939914-1941719:minus	GCF_014466985.1
Agarivorans sp. QJM3NY_29	1,803	GCF_050870835.2:NZ_CP194036.2:4273146-4274948:minus	GCF_050870835.2
Agarivorans sp. QJM3NY_30	1,803	GCF_050870855.2:NZ_CP194038.2:4273147-4274949:minus	GCF_050870855.2
Agarivorans sp. Z349TD_7	1,803	GCF_050870845.2:NZ_CP194040.2:4273139-4274941:minus	GCF_050870845.2
Sporomusa rhizae strain DSM 16652	1,802	GCF_041428845.1:NZ_CP156925.1:3123180-3124981:minus	GCF_041428845.1
Gelria sp. Kuro-4	1,788	GCF_019668485.1:NZ_AP024619.1:2016182-2017969:minus	GCF_019668485.1
Helicobacter mastomyrinus strain Hm-17	1,785	GCF_039555295.1:NZ_CP145316.1:765140-766924:minus	GCF_039555295.1

Organism	Gene length (bp)	RiboGrove Sequence ID(s)	Assembly accession
Archaea
Pyrobaculum ferrireducens strain 1860	3,604	GCF_000234805.1:NC_016645.1:127214-130817:plus	GCF_000234805.1
Pyrobaculum aerophilum strain IM2	2,213	GCF_000007225.1:NC_003364.1:1089640-1091852:plus	GCF_000007225.1
Pyrobaculum arsenaticum strain DSM 13514	2,212	GCF_000016385.1:NC_009376.1:623323-625534:minus	GCF_000016385.1
Aeropyrum pernix strain K1	2,202	GCF_000011125.1:NC_000854.2:1218712-1220913:minus	GCF_000011125.1
Pyrobaculum neutrophilum strain V24Sta	2,197	GCF_000019805.1:NC_010525.1:690419-692615:plus	GCF_000019805.1
Ca. Mancarchaeum acidiphilum strain Mia14	2,008	GCF_002214165.1:NZ_CP019964.1:751297-753304:minus	GCF_002214165.1
Ca. Micrarchaeum sp. A_DKE	2,003	GCF_016806735.1:NZ_CP060530.1:203642-205644:minus	GCF_016806735.1
Caldivirga maquilingensis strain IC-167	1,679	GCF_000018305.1:NC_009954.1:129150-130828:minus	GCF_000018305.1
Aeropyrum camini strain SY1	1,650	GCF_000591035.1:NC_022521.1:1165168-1166817:minus	GCF_000591035.1
Pyrolobus fumarii strain 1A	1,576	GCF_000223395.1:NC_015931.1:84671-86246:minus	GCF_000223395.1
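
Judging by the IDs in the tables above, a RiboGrove Sequence ID appears to consist of four colon-separated fields: assembly accession, RefSeq record accession, start-end coordinates, and strand. The sketch below parses an ID under that assumption; the field names and the 1-based inclusive reading of the coordinates are inferred from the tables (they reproduce the listed gene lengths), not taken from a format specification.

```python
from typing import NamedTuple


class SeqID(NamedTuple):
    assembly: str   # e.g. "GCF_019974355.1"
    record: str     # e.g. "NZ_AP024929.1"
    start: int
    end: int
    strand: str     # "plus" or "minus"


def parse_seq_id(seq_id):
    """Split a Sequence ID into its (assumed) four fields."""
    assembly, record, coords, strand = seq_id.split(":")
    start, end = (int(x) for x in coords.split("-"))
    return SeqID(assembly, record, start, end, strand)


def gene_length(parsed):
    # Assumption: coordinates are 1-based and inclusive.
    return parsed.end - parsed.start + 1
```

For the Thermus thermophilus gene listed above, `gene_length(parse_seq_id("GCF_019974355.1:NZ_AP024929.1:249100-251537:minus"))` gives 2438, which matches the table.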

Top-10 shortest 16S rRNA genes
Organism	Gene length (bp)	RiboGrove Sequence ID(s)	Assembly accession
Bacteria
Anabaena sp. YBS01	1,401	GCF_009498015.1:NZ_CP034058.1:6920299-6921699:minus	GCF_009498015.1
Clostridioides difficile strain TW11	1,426	GCF_009362915.1:NZ_CP045224.1:4068440-4069865:minus	GCF_009362915.1
Roseicitreum antarcticum strain ZS2-28	1,447	GCF_014681765.1:NZ_CP061498.1:3436150-3437596:plus	GCF_014681765.1
Hirschia baltica strain ATCC 49814	1,448	GCF_000023785.1:NC_012982.1:2336679-2338126:minus	GCF_000023785.1
Sagittula stellata strain E-37	1,449	GCF_039724765.1:NZ_CP155729.1:664616-666064:plus GCF_039724765.1:NZ_CP155729.1:1804792-1806240:plus	GCF_039724765.1
Mameliella sp.	1,449	GCF_965277915.1:NZ_OZ255849.1:1028793-1030241:plus GCF_965277915.1:NZ_OZ255849.1:2596915-2598363:minus GCF_965277915.1:NZ_OZ255849.1:4859504-4860952:plus	GCF_965277915.1
Sagittula sp. P11	1,449	GCF_002814095.1:NZ_CP021913.1:3597920-3599368:plus GCF_002814095.1:NZ_CP021913.1:2386837-2388285:plus	GCF_002814095.1
Mameliella sp.	1,449	GCF_965249415.1:NZ_OZ252233.1:702863-704311:plus GCF_965249415.1:NZ_OZ252233.1:1895495-1896943:plus GCF_965249415.1:NZ_OZ252233.1:3463560-3465008:minus	GCF_965249415.1
Sagittula sp. MA-2	1,449	GCF_030126985.1:NZ_CP126145.1:439-1887:plus GCF_030126985.1:NZ_CP126145.1:2907211-2908659:minus	GCF_030126985.1
Mameliella alba strain KU6B	1,449	GCF_011405015.1:NZ_AP022337.1:1420943-1422391:plus GCF_011405015.1:NZ_AP022337.1:3191212-3192660:minus GCF_011405015.1:NZ_AP022337.1:267140-268588:plus	GCF_011405015.1

Other genes of the same size left out of the top by sheer chance

Organism	Gene length (bp)	RiboGrove Sequence ID(s)	Assembly accession
Mameliella sp.	1,449	GCF_965212485.1:NZ_OZ243118.1:780420-781868:minus GCF_965212485.1:NZ_OZ243118.1:3042962-3044410:plus GCF_965212485.1:NZ_OZ243118.1:4611080-4612528:minus	GCF_965212485.1

Organism	Gene length (bp)	RiboGrove Sequence ID(s)	Assembly accession
Archaea
Ignicoccus hospitalis strain KIN4/I	1,439	GCF_000017945.1:NC_009776.1:728362-729800:plus	GCF_000017945.1
Methanocaldococcus lauensis strain SG7	1,457	GCF_902827225.1:NZ_LR792632.1:542755-544211:plus	GCF_902827225.1
Halorubrum sp. BOL3-1	1,463	GCF_004114375.1:NZ_CP034692.1:397753-399215:minus	GCF_004114375.1
Methanomethylophilus alvi strain MGYG-HGUT-02456	1,466	GCF_902387285.1:NZ_LR699000.1:283607-285072:plus	GCF_902387285.1
Salinirubellus salinus strain ZS-35-S2	1,466	GCF_025231485.1:NZ_CP104003.1:3070232-3071697:plus	GCF_025231485.1
Methanospirillum purgamenti strain GP1	1,466	GCF_019263745.1:NZ_CP077107.1:4649-6114:plus GCF_019263745.1:NZ_CP077107.1:1359562-1361027:minus GCF_019263745.1:NZ_CP077107.1:1365502-1366967:minus GCF_019263745.1:NZ_CP077107.1:1986020-1987485:minus	GCF_019263745.1
Methanospirillum stamsii strain Pt1	1,466	GCF_046244385.1:NZ_CP176366.1:1311724-1313189:plus GCF_046244385.1:NZ_CP176366.1:2035802-2037267:plus GCF_046244385.1:NZ_CP176366.1:2042927-2044392:plus GCF_046244385.1:NZ_CP176366.1:3625347-3626812:minus	GCF_046244385.1
Salinirubellus litoreus strain SYNS196	1,466	GCF_037335815.1:NZ_CP147841.1:597195-598660:minus	GCF_037335815.1
Methanospirillum purgamenti strain J.3.6.1-F.2.7.3	1,466	GCF_018502485.1:NZ_CP075546.1:133354-134819:plus GCF_018502485.1:NZ_CP075546.1:825954-827419:plus GCF_018502485.1:NZ_CP075546.1:872641-874106:plus GCF_018502485.1:NZ_CP075546.1:1727419-1728884:plus	GCF_018502485.1
Natronomonas halophila strain C90	1,466	GCF_013391085.1:NZ_CP058334.1:1530622-1532087:minus	GCF_013391085.1

Other genes of the same size left out of the top by sheer chance

Organism	Gene length (bp)	RiboGrove Sequence ID(s)	Assembly accession
Natronomonas marina strain ZY43	1,466	GCF_024298905.1:NZ_CP101154.1:18680-20145:plus	GCF_024298905.1
Methanospirillum hungatei strain JF-1	1,466	GCF_000013445.1:NC_007796.1:39814-41279:plus GCF_000013445.1:NC_007796.1:1301079-1302544:minus GCF_000013445.1:NC_007796.1:3501525-3502990:minus GCF_000013445.1:NC_007796.1:3507609-3509074:minus	GCF_000013445.1
Methanomethylophilus alvi strain Mx-05	1,466	GCF_003711245.1:NZ_CP017686.1:283608-285073:plus	GCF_003711245.1
Ca. Methanomethylophilus alvi strain Mx1201	1,466	GCF_000300255.2:NC_020913.1:283607-285072:plus	GCF_000300255.2
Natronomonas gomsonensis strain KCTC 4088	1,466	GCF_024300825.1:NZ_CP101323.1:2500564-2502029:plus	GCF_024300825.1

Top-10 genomes with the largest 16S rRNA copy numbers
Organism	Copy number	Assembly accession
Bacteria
Tumebacillus avium strain AR23208	37	GCF_002162355.1
Tumebacillus algifaecis strain THMBR28	27	GCF_002243515.1
Photobacterium piscicola strain WVL24019	25	GCF_046058925.1
Photobacterium phosphoreum strain MIP2473	24	GCF_949787665.1
Mesobacillus maritimus strain ADH-29	22	GCF_044803185.1
Peribacillus asahii strain KF4	21	GCF_023823975.1
Photobacterium damselae strain Pdd1411	21	GCF_030168855.1
Photobacterium leiognathi strain Sr3.10	21	GCF_048537505.1
Aneurinibacillus sp. Ricciae_BoGa-3	21	GCF_028421645.1
Photobacterium leiognathi strain Sr3.21	21	GCF_048537525.1

Other genomes of the same copy number left out of the top by sheer chance

Organism	Copy number	Assembly accession
Photobacterium damselae strain Phdp Wu-1	21	GCF_003130755.1

Organism	Copy number	Assembly accession
Archaea
Methanoplanus endosymbiosus strain DSM 3599	5	GCF_024662215.1
Methanococcoides orientis strain LMO-1	5	GCF_021184045.1
Natrinema sp. SYSU A 869	5	GCF_019879105.1
Natronorubrum aibiense strain 7-3	5	GCF_009392895.1
Natronorubrum bangense strain JCM 10635	5	GCF_004799645.1
Methanolobus sp. ZRKC3	5	GCF_045291275.1
Methanospirillum lacunae strain Ki8-1	4	GCF_046195335.1
Natronococcus occultus strain SP4	4	GCF_000328685.1
Methanolobus sediminis strain FTZ6	4	GCF_031312595.1
Methanogenium organophilum strain DSM 3596	4	GCF_026684035.1

Other genomes of the same copy number left out of the top by sheer chance

Organism	Copy number	Assembly accession
Methanogenium sp. S4BF	4	GCF_029633965.1
Methanococcus vannielii strain SB	4	GCF_000017165.1
Haloarcula marismortui strain ATCC 33800	4	GCF_018200015.1
Haloterrigena salifodinae strain BOL5-1	4	GCF_016906025.1
Halomicrobium salinisoli strain LT50	4	GCF_020405185.1
Halomicrobium urmianum strain IBRC-M: 10911	4	GCF_020217425.1
Methanospirillum purgamenti strain J.3.6.1-F.2.7.3	4	GCF_018502485.1
Natrinema thermotolerans strain A29	4	GCF_031165565.1
Methanosphaera stadtmanae strain DSM 3091	4	GCF_000012545.1
Methanospirillum hungatei strain JF-1	4	GCF_000013445.1
Methanolobus mangrovi strain FTZ2	4	GCF_031312535.1
Methanolobus sp. WCC4	4	GCF_038022665.1
Methanospirillum purgamenti strain GP1	4	GCF_019263745.1
Halomicrobium salinisoli strain TH30	4	GCF_020405245.1
Methanospirillum stamsii strain Pt1	4	GCF_046244385.1
Methanochimaera problematica strain FWC-SCC4	4	GCF_032878975.1
Methanococcoides sp. FTZ1	4	GCF_052057775.1
Methanosphaera stadtmanae strain MGYG-HGUT-02164	4	GCF_902384015.1

Top-10 genomes with the highest intragenomic variability of 16S rRNA genes
Organism	Sum of entropy^* (bits)	Mean entropy^* (bits)	Number of variable positions	Gene copy number	Assembly accession
Bacteria
Clostridium perfringens strain A SNU21005	780.95	0.41	1,171	9	GCF_047150065.1
Escherichia coli strain P276M	433.81	0.26	569	6	GCF_009762385.1
Listeria monocytogenes strain 10-092876-1155 LM6	357.10	0.20	370	3	GCF_001999045.1
Klebsiella pneumoniae strain GZ-1	304.27	0.18	464	8	GCF_014854815.1
Streptococcus infantis strain SO	291.50	0.18	308	3	GCF_021497965.1
Synechococcus sp. NB0720_010	243.35	0.16	265	3	GCF_023078835.1
Streptomyces griseorubiginosus strain NBC_00586	231.55	0.15	342	6	GCF_036345135.1
Caminibacter mediatlanticus strain TB-2	228.78	0.15	282	4	GCF_005843985.1
Xanthomonas oryzae strain YNCX	227.74	0.15	248	3	GCF_024499285.1
Sporomusa termitida strain DSM 4440	226.25	0.13	247	12	GCF_007641255.1
Archaea
Halomicrobium sp. ZPS1 ^**	137.00	0.09	137	2	GCF_009217585.1
Halomicrobium urmianum strain IBRC-M: 10911	131.55	0.09	146	4	GCF_020217425.1
Halapricum desulfuricans strain HSR12-2	128.00	0.09	128	2	GCF_017094525.1
Halomicrobium salinisoli strain TH30	127.74	0.09	145	4	GCF_020405245.1
Halapricum desulfuricans strain HSR-Bgl	127.00	0.09	127	2	GCF_017094445.1
Halomicrobium mukohataei strain JP60	125.81	0.09	137	3	GCF_004803735.1
Halomicrobium sp. HM KBTZ05	124.38	0.08	134	3	GCF_041530035.1
Halomicrobium salinisoli strain LT50	123.31	0.08	140	4	GCF_020405185.1
Halapricum desulfuricans strain HSR-Est	111.00	0.08	111	2	GCF_017094465.1
Halapricum desulfuricans strain HSR12-1	109.00	0.07	109	2	GCF_017094505.1

^* Entropy is Shannon entropy calculated for each column of the multiple sequence alignment (MSA) of all full-length 16S rRNA genes of a genome. Entropy is then summed up (column “Sum of entropy”) and averaged (column “Mean entropy”).

^** Halomicrobium sp. ZPS1 is quite a remarkable case. This genome harbours two 16S rRNA genes, so each mismatching alignment column contributes exactly 1 bit, and the sum of entropy equals the number of mismatching nucleotides between the two gene sequences. Accordingly, the percent identity between these two gene sequences is 90.70%! This is remarkable because the usual (albeit arbitrary) genus demarcation threshold is 95% identity.
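
The entropy columns of the table can be reproduced in outline with the sketch below. It takes already aligned sequences of equal length (the alignment step with MAFFT is not shown) and computes Shannon entropy in bits for each column. The function names are illustrative, and treating a gap character as just another symbol is an assumption of this example.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def entropy_summary(aligned_seqs):
    """aligned_seqs: equal-length strings taken from one multiple alignment."""
    columns = list(zip(*aligned_seqs))
    entropies = [column_entropy(col) for col in columns]
    return {
        "sum": sum(entropies),
        "mean": sum(entropies) / len(entropies),
        "variable_positions": sum(1 for e in entropies if e > 0),
    }
```

For two sequences, a column with two different bases has entropy of exactly 1 bit, so `entropy_summary(["ACGT", "ACGA"])` yields a sum of 1.0 and one variable position. This is why, for a genome with two genes, the sum of entropy equals the number of mismatching columns.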

Coverage^* of primer pairs for different V-regions of 16S rRNA genes

^* Coverage of a primer pair is the percent of genomes having at least one 16S rRNA gene which can be amplified by PCR using this primer pair. For details, see our paper about RiboGrove.
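
To make the definition concrete, here is a simplified sketch of a coverage calculation. It assumes that a gene "can be amplified" when the forward primer and the reverse complement of the reverse primer both match the gene exactly (with IUPAC degenerate bases expanded) and in the right order. The actual procedure used for RiboGrove may differ, for example in how mismatches are tolerated; all function names here are illustrative.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    return "".join(IUPAC[base] for base in primer)


def can_amplify(gene, forward, reverse):
    """True if the forward site is followed by the reverse primer's site."""
    fwd = re.search(primer_regex(forward), gene)
    if fwd is None:
        return False
    rev = re.compile(primer_regex(reverse_complement(reverse)))
    # The reverse primer anneals downstream of the forward primer site.
    return rev.search(gene, fwd.end()) is not None


def genomic_coverage(genomes, forward, reverse):
    """genomes: dict mapping genome ID to a list of 16S gene sequences.

    Returns the percent of genomes with at least one amplifiable gene.
    """
    hits = sum(
        any(can_amplify(gene, forward, reverse) for gene in genes)
        for genes in genomes.values()
    )
    return 100.0 * hits / len(genomes)
```

Because a genome counts as covered as soon as one of its gene copies matches, coverage is computed per genome rather than per gene, in line with the definition above.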

In the tables below, you can find coverage of primer pairs that are commonly used to amplify bacterial and archaeal 16S rRNA genes (“bacterial” and “archaeal” primers).

You can find a more detailed table in the file primer_pair_genomic_coverage.tsv in the metadata. That table contains coverage not just for phyla, but also for each kingdom, class, order, family, genus, and species. Moreover, that table contains coverage values for additional primer pairs, namely 1115F-1492R, 349f-519r, 1106F-Ar1378R, 1106F-SSU1492Rngs, SSU1ArF-SSU468R, SSU1ArF-SSU520R. In the tables below, they are omitted for brevity.

Bacterial genes, “bacterial” primers
Phylum	Number of genomes	Full gene	V1–V2	V1–V3	V3–V4	V3–V5	V4	V4–V5	V4–V6	V5–V6	V5–V7	V6–V7	V6–V8
Phylum	Number of genomes	27F– 1492R (%)	27F– 338R (%)	27F– 534R (%)	341F– 785R (%)	341F– 944R (%)	515F– 806R (%)	515F– 944R (%)	515F– 1100R (%)	784F– 1100R (%)	784F– 1193R (%)	939F– 1193R (%)	939F– 1378R (%)
Pseudomonadota	28,599	99.49	99.31	99.48	99.82	84.10	99.90	84.28	88.49	88.16	93.55	92.65	96.47
Bacillota	12,125	99.84	99.77	99.82	99.94	95.28	99.98	95.16	99.49	98.13	97.58	98.74	99.41
Actinomycetota	5,430	99.91	99.12	99.74	94.95	65.71	94.77	65.47	97.15	99.78	99.85	99.85	97.13
Bacteroidota	1,794	96.71	96.38	96.77	99.89	64.33	99.39	63.94	38.24	38.35	92.25	92.08	95.76
Campylobacterota	1,327	100.00	100.00	100.00	100.00	100.00	99.92	99.92	99.92	99.47	99.47	99.70	99.55
Mycoplasmatota	846	90.31	84.52	73.76	99.05	91.96	99.17	92.32	72.46	48.82	43.97	78.84	0.71
Spirochaetota	421	57.48	57.72	57.96	93.59	99.76	93.59	99.76	99.76	72.45	72.45	89.31	45.61
Cyanobacteriota	383	99.74	99.74	99.74	100.00	3.92	100.00	3.92	100.00	1.31	1.31	100.00	99.74
Fusobacteriota	246	100.00	98.78	99.59	99.59	99.59	99.59	99.59	99.59	99.59	99.59	100.00	0.00
Chlamydiota	241	0.00	0.00	0.00	100.00	100.00	0.00	0.00	0.00	100.00	100.00	100.00	94.61
Thermodesulfobacteriota	156	100.00	99.36	100.00	100.00	39.10	100.00	39.10	100.00	95.51	91.67	96.15	99.36
Verrucomicrobiota	142	99.30	0.00	99.30	100.00	13.38	100.00	13.38	100.00	1.41	1.41	98.59	98.59
Myxococcota	124	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Deinococcota	98	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	52.04	100.00
Planctomycetota	84	100.00	26.19	100.00	100.00	61.90	100.00	61.90	0.00	0.00	0.00	2.38	0.00
Chloroflexota	52	100.00	92.31	100.00	42.31	0.00	94.23	0.00	90.38	11.54	11.54	94.23	26.92
Thermotogota	50	100.00	98.00	100.00	100.00	8.00	100.00	8.00	100.00	0.00	0.00	52.00	98.00
Bdellovibrionota	44	100.00	100.00	100.00	100.00	77.27	100.00	77.27	100.00	100.00	100.00	100.00	100.00
Acidobacteriota	43	97.67	97.67	97.67	100.00	100.00	100.00	100.00	100.00	72.09	58.14	86.05	100.00
Aquificota	18	100.00	16.67	100.00	100.00	16.67	100.00	16.67	100.00	0.00	0.00	0.00	16.67
Rhodothermota	16	43.75	43.75	43.75	100.00	100.00	100.00	100.00	81.25	81.25	100.00	100.00	100.00
Chlorobiota	15	100.00	100.00	100.00	100.00	0.00	0.00	0.00	0.00	100.00	93.33	86.67	6.67
Nitrospirota	15	100.00	100.00	100.00	100.00	73.33	100.00	73.33	100.00	100.00	73.33	73.33	100.00
Ca. Saccharimonadota	13	100.00	100.00	100.00	100.00	7.69	7.69	7.69	7.69	0.00	0.00	100.00	100.00
Gemmatimonadota	13	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Synergistota	10	100.00	100.00	100.00	100.00	0.00	100.00	0.00	100.00	0.00	0.00	100.00	100.00
Elusimicrobiota	6	100.00	66.67	100.00	100.00	0.00	100.00	0.00	100.00	50.00	50.00	100.00	100.00
Deferribacterota	6	100.00	100.00	100.00	100.00	0.00	100.00	0.00	100.00	100.00	100.00	100.00	100.00
Atribacterota	5	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00
Ignavibacteriota	3	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Balneolota	3	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Thermodesulfobiota	2	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00
Thermomicrobiota	2	100.00	100.00	100.00	100.00	0.00	100.00	0.00	100.00	0.00	0.00	50.00	50.00
Armatimonadota	2	100.00	50.00	100.00	50.00	50.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Chrysiogenota	2	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Dictyoglomota	2	100.00	100.00	100.00	100.00	0.00	100.00	0.00	100.00	0.00	0.00	100.00	0.00
Fibrobacterota	2	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Kiritimatiellota	2	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00
Ca. Fervidibacterota	1	100.00	0.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00
Ca. Cloacimonadota	1	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Ca. Bipolaricaulota	1	0.00	0.00	0.00	100.00	100.00	100.00	100.00	0.00	0.00	0.00	0.00	0.00
Ca. Absconditibacteriota	1	100.00	0.00	100.00	100.00	0.00	100.00	0.00	0.00	0.00	100.00	0.00	0.00
Calditrichota	1	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Caldisericota	1	100.00	100.00	100.00	100.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	100.00
Ca. Omnitrophota	1	100.00	100.00	100.00	100.00	0.00	100.00	0.00	100.00	100.00	100.00	100.00	100.00
Coprothermobacterota	1	0.00	0.00	0.00	100.00	100.00	100.00	100.00	0.00	0.00	0.00	100.00	0.00
Vulcanimicrobiota	1	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00
Thermosulfidibacterota	1	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00
Nitrospinota	1	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00
Minisyncoccota	1	0.00	0.00	0.00	100.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
Lentisphaerota	1	100.00	0.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00
Fidelibacterota	1	100.00	100.00	100.00	100.00	0.00	100.00	0.00	100.00	100.00	100.00	100.00	100.00

Archaeal genes, “archaeal” primers
Phylum	Number of genomes	Full gene	V1–V2	V1–V3	V1–V3	V3–V4	V3–V4	V3–V4	V3–V5	V3–V5	V4	V4–V5	V5–V7
Phylum	Number of genomes	SSU1ArF– SSU1492Rngs (%)	SSU1ArF– SSU280ArR (%)	SSU1ArF– SSU470R (%)	SSU1ArF– A519R (%)	349f– SSU666ArR (%)	340f– SSU666ArR (%)	340f– 806rB (%)	349f– SSU1000ArR (%)	340f– SSU1000ArR (%)	515fB– 806rB (%)	Parch519f– Arch915r (%)	A751F– UA1204R (%)
Methanobacteriota	465	89.03	86.24	89.25	89.03	51.40	50.32	100.00	99.35	100.00	100.00	99.57	89.68
Thermoproteota	110	96.36	98.18	100.00	100.00	72.73	98.18	100.00	69.09	93.64	100.00	99.09	98.18
Nitrososphaerota	31	96.77	96.77	96.77	96.77	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Thermoplasmatota	19	84.21	68.42	100.00	100.00	42.11	42.11	100.00	63.16	84.21	100.00	100.00	52.63
Ca. Nanohalarchaeota	4	0.00	25.00	0.00	100.00	0.00	0.00	100.00	50.00	100.00	100.00	100.00	0.00
Ca. Micrarchaeota	2	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	0.00	0.00	0.00
Nanobdellota	1	100.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	0.00	0.00	0.00
Promethearchaeota	1	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00	100.00

Bacterial genes, “archaeal” primers

Phylum	Number of genomes	Full gene	V1–V2	V1–V3	V1–V3	V3–V4	V3–V4	V3–V4	V3–V5	V3–V5	V4	V4–V5	V5–V7
Phylum	Number of genomes	SSU1ArF– SSU1492Rngs (%)	SSU1ArF– SSU280ArR (%)	SSU1ArF– SSU470R (%)	SSU1ArF– A519R (%)	349f– SSU666ArR (%)	340f– SSU666ArR (%)	340f– 806rB (%)	349f– SSU1000ArR (%)	340f– SSU1000ArR (%)	515fB– 806rB (%)	Parch519f– Arch915r (%)	A751F– UA1204R (%)
Pseudomonadota	28,599	1.19	0.02	0.51	0.58	0.00	0.00	0.09	0.00	0.00	99.90	27.72	0.00
Bacillota	12,125	2.43	0.06	0.12	1.39	0.02	0.00	0.06	0.01	0.00	99.98	98.44	0.00
Actinomycetota	5,430	0.96	0.22	0.77	1.22	0.00	0.00	0.04	0.00	0.00	94.77	88.07	0.00
Bacteroidota	1,794	1.95	0.00	1.90	2.01	0.00	0.00	0.17	0.00	0.00	99.39	99.28	0.00
Campylobacterota	1,327	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	99.92	0.15	0.00
Mycoplasmatota	846	1.77	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	99.17	80.14	0.00
Spirochaetota	421	0.48	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	93.59	93.35	0.00
Cyanobacteriota	383	3.13	0.00	0.26	0.26	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Fusobacteriota	246	0.41	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	99.59	99.59	0.00
Chlamydiota	241	1.66	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
Thermodesulfobacteriota	156	5.77	0.64	1.28	1.28	0.00	0.00	0.00	0.00	0.00	100.00	70.51	0.00
Verrucomicrobiota	142	6.34	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	10.56	0.70
Myxococcota	124	30.65	4.03	3.23	3.23	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Deinococcota	98	38.78	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	96.94	0.00
Planctomycetota	84	2.38	1.19	1.19	1.19	0.00	0.00	0.00	0.00	0.00	100.00	83.33	0.00
Chloroflexota	52	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	94.23	100.00	0.00
Thermotogota	50	38.00	0.00	28.00	28.00	0.00	0.00	6.00	0.00	0.00	100.00	100.00	0.00
Bdellovibrionota	44	0.00	0.00	0.00	0.00	0.00	0.00	0.00	4.55	0.00	100.00	27.27	0.00
Acidobacteriota	43	11.63	0.00	0.00	6.98	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Aquificota	18	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	83.33	44.44
Rhodothermota	16	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Chlorobiota	15	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
Nitrospirota	15	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Ca. Saccharimonadota	13	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	7.69	7.69	0.00
Gemmatimonadota	13	0.00	7.69	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Synergistota	10	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Elusimicrobiota	6	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Deferribacterota	6	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Atribacterota	5	60.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Ignavibacteriota	3	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Balneolota	3	33.33	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Thermodesulfobiota	2	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Thermomicrobiota	2	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Armatimonadota	2	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	50.00	0.00
Chrysiogenota	2	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Dictyoglomota	2	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Fibrobacterota	2	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Kiritimatiellota	2	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Ca. Fervidibacterota	1	100.00	0.00	0.00	100.00	0.00	0.00	0.00	0.00	0.00	100.00	0.00	0.00
Ca. Cloacimonadota	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Ca. Bipolaricaulota	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	0.00	0.00
Ca. Absconditibacteriota	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	0.00	0.00
Calditrichota	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Caldisericota	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
Ca. Omnitrophota	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Coprothermobacterota	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Vulcanimicrobiota	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Thermosulfidibacterota	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Nitrospinota	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Minisyncoccota	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
Lentisphaerota	1	100.00	0.00	100.00	100.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00
Fidelibacterota	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	0.00

Archaeal genes, “bacterial” primers

Phylum	Number of genomes	Full gene	V1–V2	V1–V3	V3–V4	V3–V5	V4	V4–V5	V4–V6	V5–V6	V5–V7	V6–V7	V6–V8
Phylum	Number of genomes	27F– 1492R (%)	27F– 338R (%)	27F– 534R (%)	341F– 785R (%)	341F– 944R (%)	515F– 806R (%)	515F– 944R (%)	515F– 1100R (%)	784F– 1100R (%)	784F– 1193R (%)	939F– 1193R (%)	939F– 1378R (%)
Methanobacteriota	465	0.00	0.00	0.00	0.00	0.00	100.00	0.00	82.37	0.00	0.00	0.00	0.00
Thermoproteota	110	0.91	0.00	0.00	0.00	0.00	100.00	0.00	89.09	0.00	0.00	0.00	0.00
Nitrososphaerota	31	0.00	0.00	0.00	0.00	0.00	100.00	0.00	0.00	0.00	0.00	0.00	0.00
Thermoplasmatota	19	0.00	0.00	0.00	0.00	0.00	100.00	0.00	0.00	0.00	0.00	0.00	0.00
Ca. Nanohalarchaeota	4	0.00	0.00	0.00	0.00	0.00	100.00	0.00	0.00	0.00	0.00	0.00	0.00
Ca. Micrarchaeota	2	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
Nanobdellota	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
Promethearchaeota	1	0.00	0.00	0.00	0.00	0.00	100.00	0.00	0.00	0.00	0.00	0.00	0.00

Primers used for coverage estimation

Primer name	Sequence	Reference
27F	AGAGTTTGATYMTGGCTCAG	Frank et al., 2008
338R	GCTGCCTCCCGTAGGAGT	Suzuki et al., 1996
341F^*	CCTACGGGNGGCWGCAG	Klindworth et al., 2013
515F	GTGCCAGCMGCCGCGGTAA	Turner et al., 1999
534R	ATTACCGCGGCTGCTGG	Walker et al., 2015
784F	AGGATTAGATACCCTGGTA	Andersson et al., 2008
785R^*	GACTACHVGGGTATCTAATCC	Klindworth et al., 2013
806R	GGACTACHVGGGTWTCTAAT	Caporaso et al., 2010
939F	GAATTGACGGGGGCCCGCACAAG	Lebuhn et al., 2014
944R	GAATTAAACCACATGCTC	Fuks et al., 2018
1100R	AGGGTTGCGCTCGTTG	Turner et al., 1999
1193R	ACGTCATCCCCACCTTCC	Bodenhausen et al., 2013
1378R	CGGTGTGTACAAGGCCCGGGAACG	Lebuhn et al., 2014
1492R	TACCTTGTTACGACTT	Frank et al., 2008
SSU1ArF	TCCGGTTGATCCYGCBRG	Bahram et al., 2018
SSU520R	GCTACGRRYGYTTTARRC	Bahram et al., 2018
340f	CCCTAYGGGGYGCASCAG	Gantner et al., 2011
806rB	GGACTACNVGGGTWTCTAAT	Apprill et al., 2015
349f	GYGCASCAGKCGMGAAW	Takai and Horikoshi, 2000
519r	TTACCGCGGCKGCTG	Klindworth et al., 2013
515fB	GTGYCAGCMGCCGCGGTAA	Parada et al., 2015
Parch519f	CAGCCGCCGCGGTAA	Øvreås et al., 1997
Arch915r	GTGCTCCCCCGCCAATTCCT	Raskin et al., 1994
1106F	TTWAGTCAGGCAACGAGC	Watanabe et al., 2007
Ar1378R^**	TGTGCAAGGAGCAGGGAC	Watanabe et al., 2007
A751F	CCGACGGTGAGRGRYGAA	Baker et al., 2003
SSU1492Rngs	CGGNTACCTTGTKACGAC	Bahram et al., 2018
SSU280ArR	TCAGWNYCCNWCTCSRGG	Bahram et al., 2018
SSU470R	DCNGCNGGTDTTACCGCG	Bahram et al., 2018
SSU468R	GNDCNGCNGGTDTTACCG	Bahram et al., 2018
A519R	GGTDTTACCGCGGCKGCTG	Wang and Qian, 2009
SSU666ArR	HGCYTTCGCCACHGGTRG	Bahram et al., 2018
SSU1000ArR	GGCCATGCAMYWCCTCTC	Bahram et al., 2018
UA1204R	TTMGGGGCATRCIKACCT	Baker et al., 2003

^* Primers 341F and 785R are used in the library preparation protocol for sequencing the V3–V4 region of 16S rRNA genes on Illumina MiSeq.

^** Ar1378R was originally named 1378R. We use the amended name to avoid confusion with the primer 1378R listed above.
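
Several primers in the table contain degenerate bases (for example, Y and M in 27F). As a rough illustration only, and not necessarily how the coverage figures above were computed, the sketch below expands the IUPAC codes of a forward primer into a regular expression and checks it against a sequence. The function name primer_to_regex and the test fragment are hypothetical. Reverse primers would first have to be reverse-complemented, mismatches are not tolerated, and the inosine (I) in UA1204R is not handled.

```shell
# Hypothetical helper: expand IUPAC degenerate codes into a regular expression.
# The replacements contain only A, C, G, T and brackets, so later
# substitutions cannot touch the output of earlier ones.
primer_to_regex() {
    printf '%s\n' "$1" | sed \
        -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
        -e 's/K/[GT]/g' -e 's/M/[AC]/g' \
        -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' \
        -e 's/N/[ACGT]/g'
}

# Primer 27F from the table above.
regex=$(primer_to_regex "AGAGTTTGATYMTGGCTCAG")
echo "$regex"

# Made-up sequence fragment containing one variant of 27F (Y=C, M=A).
fragment="TTAGAGTTTGATCATGGCTCAGGA"
if printf '%s\n' "$fragment" | grep -Eq "$regex"; then
    echo "27F matches exactly"
fi
```

This treats a primer as covering a gene only on an exact match, which is a simplifying assumption of the sketch.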

Searching data in RiboGrove

RiboGrove is a very minimalistic database: it comprises a collection of plain fasta files with metadata, so no advanced search tools are available for it. We acknowledge this limitation and list below some suggestions that should help you explore and select RiboGrove data.

Header format

Headers in RiboGrove fasta files have the following format:

>GCF_000978375.1:NZ_CP009686.1:8908-10459:plus ;d__Bacteria;k__Bacillati;p__Bacillota;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__cereus; category:1

Major blocks of a header are separated by spaces. A header consists of three such blocks:

Sequence ID (SeqID): GCF_000978375.1:NZ_CP009686.1:8908-10459:plus. SeqID, in turn, consists of four parts separated by colons (:):
1. The Assembly accession of the genome from which the gene originates: GCF_000978375.1.
2. The accession number of the RefSeq sequence, from which the gene originates: NZ_CP009686.1.
3. Coordinates of the gene within this RefSeq genomic sequence: 8908-10459 (coordinates are 1-based, left-closed and right-closed).
4. Strand of the RefSeq genomic sequence, where the gene is located: plus (or minus).
A taxonomy string, comprising domain (Bacteria), kingdom (Bacillati), phylum (Bacillota), class (Bacilli), order (Bacillales), family (Bacillaceae), genus (Bacillus) names, and the specific epithet (cereus).
Each name is preceded by a prefix, which denotes rank: d__ for domain, k__ for kingdom, p__ for phylum, c__ for class, o__ for order, f__ for family, g__ for genus, and s__ for specific epithet. Prefixes contain double underscores.
The taxonomic names are separated and flanked by semicolons (;).
The category of the genome, from which the gene sequence originates: (category:1).
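
The header layout described above can be taken apart with plain shell tools. The following is a minimal sketch, assuming that the SeqID is everything before the first space and the category is everything after the last space. The function name describe_header is hypothetical and is not part of RiboGrove or Seqkit. The sketch also derives the gene length from the 1-based, closed coordinates.

```shell
# Hypothetical helper: prints the fields of one RiboGrove header
# as "name=value" lines.
describe_header() {
    header=${1#>}               # drop the leading ">" if present
    seqid=${header%% *}         # block 1: everything before the first space
    category=${header##* }      # block 3: everything after the last space
    taxonomy=${header#* }       # strip block 1 ...
    taxonomy=${taxonomy% *}     # ... and block 3, leaving the taxonomy string

    # The four SeqID parts are separated by colons.
    coords=$(printf '%s\n' "$seqid" | cut -d: -f3)
    start=${coords%-*}
    end=${coords#*-}

    echo "assembly=$(printf '%s\n' "$seqid" | cut -d: -f1)"
    echo "accession=$(printf '%s\n' "$seqid" | cut -d: -f2)"
    echo "start=$start"
    echo "end=$end"
    # Coordinates are 1-based and closed on both sides, hence the "+ 1".
    echo "length=$((end - start + 1))"
    echo "strand=$(printf '%s\n' "$seqid" | cut -d: -f4)"
    echo "taxonomy=$taxonomy"
    echo "category=${category#category:}"
}

describe_header ">GCF_000978375.1:NZ_CP009686.1:8908-10459:plus ;d__Bacteria;k__Bacillati;p__Bacillota;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__cereus; category:1"
```

For the example header, the coordinates 8908-10459 correspond to a gene of 10459 - 8908 + 1 = 1552 bp.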

Sequence selection

You can select specific sequences from fasta files using the Seqkit program (GitHub repo, documentation). It is free, cross-platform, multifunctional and fast, and it can process both gzipped and uncompressed fasta files. The seqkit grep and seqkit seq subcommands are useful for sequence selection.

Search sequences by header

Given the downloaded fasta file ribogrove_26.232_sequences.fasta.gz, consider the following examples of sequence selection using seqkit grep:

Example 1. Select a single sequence by SeqID.

seqkit grep -p "GCF_000978375.1:NZ_CP009686.1:8908-10459:plus" ribogrove_26.232_sequences.fasta.gz

The -p option sets the pattern to search for. By default, seqkit grep compares the pattern with sequence IDs only, not with whole headers.

Example 2. Select all gene sequences of a single RefSeq genomic sequence by accession number NZ_CP009686.1.

seqkit grep -nrp ":NZ_CP009686.1:" ribogrove_26.232_sequences.fasta.gz

Here, two more options are required: -n and -r. The former tells the program to match whole headers instead of IDs only. The latter tells the program to treat the pattern as a regular expression, so partial matches are included in the output: if the pattern matches a substring of a header, the corresponding sequence is printed.

To ensure search specificity, surround the Accession.Version with colons (:).

Example 3. Select all gene sequences of a single genome (Assembly accession GCF_019357495.1).

seqkit grep -nrp "GCF_019357495.1:" ribogrove_26.232_sequences.fasta.gz

To ensure search specificity, put a colon (:) after the assembly accession.

Example 4. Select all actinobacterial sequences.

seqkit grep -nrp ";p__Actinomycetota;" ribogrove_26.232_sequences.fasta.gz

To ensure search specificity, surround the taxonomy name with semicolons (;).

Example 5. Select all sequences originating from category 1 genomes.

seqkit grep -nrp "category:1" ribogrove_26.232_sequences.fasta.gz

Example 6. Select all sequences except for those belonging to Bacillota.

seqkit grep -nvrp ";p__Bacillota;" ribogrove_26.232_sequences.fasta.gz

Note the -v option within the option sequence -nvrp. This option inverts the match, so the output comprises the sequences whose headers do not contain the substring “;p__Bacillota;”.
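
One caveat about the examples above: the -r option of seqkit grep makes the pattern a regular expression, so a dot in an accession number matches any character, not just a literal dot. In practice this rarely causes false matches, but the pattern can be made strict by escaping the dots. The helper name escape_dots below is hypothetical; the sketch only prepares the pattern and does not call Seqkit itself.

```shell
# Hypothetical helper: escape dots so that they match literally
# when the pattern is used as a regular expression.
escape_dots() {
    printf '%s\n' "$1" | sed 's/\./\\./g'
}

pattern=$(escape_dots ":NZ_CP009686.1:")
echo "$pattern"
```

The escaped pattern could then be passed to Seqkit in the usual way:

seqkit grep -nrp "$pattern" ribogrove_26.232_sequences.fasta.gz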

Search sequences by length

You can use the seqkit seq program to select sequences by length.

Example 1. Select all sequences longer than 1600 bp.

seqkit seq -m 1601 ribogrove_26.232_sequences.fasta.gz

The -m option sets the minimum length of a sequence to be printed to output.

Example 2. Select all sequences shorter than 1500 bp.

seqkit seq -M 1499 ribogrove_26.232_sequences.fasta.gz

The -M option sets the maximum length of a sequence to be printed to output.

Example 3. Select all sequences with lengths in the range [1500, 1600] bp.

seqkit seq -m 1500 -M 1600 ribogrove_26.232_sequences.fasta.gz

Selecting header data

It is sometimes useful to retrieve only header information from a fasta file. You can use the seqkit seq program for it.

Example 1. Select all headers.

seqkit seq -n ribogrove_26.232_sequences.fasta.gz

The -n option tells the program to output only headers.

Example 2. Select all SeqIDs (header parts before the first space).

seqkit seq -ni ribogrove_26.232_sequences.fasta.gz

The -i option tells the program to output only sequence IDs.

Example 3. Select all RefSeq “Accession.Version”s.

seqkit seq -ni ribogrove_26.232_sequences.fasta.gz | cut -f2 -d':' | sort | uniq

This works only if you have the cut, sort and uniq utilities installed (Linux and macOS systems usually have them built in).

Example 4. Select all Assembly accessions.

seqkit seq -ni ribogrove_26.232_sequences.fasta.gz | cut -f1 -d':' | sort | uniq

This works only if you have the cut, sort and uniq utilities installed (Linux and macOS systems usually have them built in).

Example 5. Select all phylum names.

This works only if you have the grep, sed, sort and uniq utilities installed (Linux and macOS systems usually have them built in).
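
No command is shown for Example 5 here. One possible pipeline, built from the utilities named above, is sketched below. It is an assumption rather than the authors' original command, and the function name phylum_names is hypothetical. The function reads headers from standard input, so it can be tried on made-up headers without Seqkit.

```shell
# Hypothetical helper: reads fasta headers on standard input and prints
# the unique phylum names, one per line.
phylum_names() {
    # grep keeps only the ";p__<name>" part of each header,
    # sed strips the leading semicolon and the rank prefix,
    # sort and uniq remove duplicates.
    grep -Eo ';p__[^;]+' | sed -E 's/^;p__//' | sort | uniq
}

# Demonstration on made-up, shortened headers in the RiboGrove style.
printf '%s\n' \
    'GCF_1.1:NZ_A.1:1-1500:plus ;d__Bacteria;k__Bacillati;p__Bacillota;c__Bacilli; category:1' \
    'GCF_2.1:NZ_B.1:1-1500:plus ;d__Bacteria;k__Bacillati;p__Bacillota;c__Clostridia; category:2' \
    'GCF_3.1:NZ_C.1:1-1500:minus ;d__Archaea;k__Thermoproteati;p__Thermoproteota;c__Thermoprotei; category:1' |
    phylum_names
```

With Seqkit installed, real headers could be piped in the same way:

seqkit seq -n ribogrove_26.232_sequences.fasta.gz | phylum_names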

Contacts

For any questions concerning RiboGrove, please contact Maksim Sikolenko at sikolenko [ at ] mbio.bas-net.by or maximdeynonih [ at ] gmail.com.

Citing RiboGrove

If you find RiboGrove useful for your research, please cite:

Maxim A. Sikolenko, Leonid N. Valentovich. “RiboGrove: a database of full-length prokaryotic 16S rRNA genes derived from completely assembled genomes” // Research in Microbiology, Volume 173, Issue 4, May 2022, 103936.
(DOI: 10.1016/j.resmic.2022.103936).

You can also cite RiboGrove itself on Zenodo:

RiboGrove in general (no particular release): DOI: 10.5281/zenodo.17190266;
RiboGrove release 26.232 specifically: DOI: 10.5281/zenodo.17201587.

Questions people ask about RiboGrove

1. How do I create a QIIME2-compatible taxonomy file from RiboGrove data?

Please use the make_qiime_taxonomy_file.py script to convert the RiboGrove file metadata/taxonomy.tsv to a QIIME2-compatible file. You can find out how to use this script in the corresponding README file.

2. How do I save selected sequences in Seqkit to a file?

People have already provided several useful answers in the corresponding discussion: https://bioinformatics.stackexchange.com/questions/20915/how-do-i-save-selected-sequences-in-seqkit-to-a-file.

3. How do I search a FASTA database by sequence in Seqkit?

People have already provided several useful answers in the corresponding discussion: https://www.biostars.org/p/9561418.

RiboGrove, 2025-10-05
