RiboGrove mirror

The main website where RiboGrove is hosted may be unavailable outside Belarus due to technical troubles and the overall disaster.
Hence this mirror has been created, and RiboGrove files are available through Zenodo (links are below).

Downloads

RiboGrove release 20.226 (2024-09-18)

The release is based on RefSeq release 226.

A fasta file of full-length 16S rRNA gene sequences. Download (gzipped fasta file, 9.20 MB)
Metadata. Download (zip archive, 6.95 MB)
What information exactly does the metadata contain?
The metadata consists of the following files:
1. discarded_sequences.fasta.gz
  This is a fasta file of sequences that were present in the source RefSeq genomes and annotated as 16S rRNA genes, but were discarded because of incompleteness, internal repeats, etc., and thus are not included in RiboGrove.
2. source_RefSeq_genomes.tsv
  This is a TSV file that lists the genomes used to construct RiboGrove.
3. gene_seqs_statistics.tsv, discarded_gene_seqs_statistics.tsv
  These are TSV files that contain the nucleotide composition, size, and genomic and taxonomic affiliation of the gene sequences. The first file describes the final RiboGrove gene sequences; the second describes the discarded sequences.
4. categories.tsv
  This is a TSV file that records which category was assigned to each genome and why. It also records which sequencing technology was used for each genome.
5. taxonomy.tsv
  This is a TSV file that contains the taxonomic affiliation of each genome and gene.
6. intragenic_repeats.tsv
  This is a TSV file that contains information about intragenic repeats found in gene sequences using RepeatFinder.
7. entropy_summary.tsv
  This is a TSV file that summarizes the intragenomic variability of 16S rRNA genes. Intragenomic variability is calculated only for category 1 genomes that have more than one 16S rRNA gene, and it is evaluated using Shannon entropy: the gene sequences of each genome are aligned with MAFFT, and Shannon entropy is then calculated for each column (i.e. base position) of the multiple alignment.
8. 16S_GCNs.tsv
  This is a TSV file of 16S rRNA Gene Copy Numbers for each genome in the release.
9. primer_pair_genomic_coverage.tsv
  This is a TSV file which contains genomic coverage of primer pairs targeting different V-regions of 16S rRNA genes. For example, for Enterobacteriaceae, genomic coverage of a primer pair is the percent of Enterobacteriaceae genomes which contain at least one 16S rRNA gene that can (theoretically) produce a PCR product using the primer pair.

The fasta file is compressed with gzip, and the metadata file is a zip archive. To uncompress them, Linux and Mac OS users can use the gzip and unzip programs, which are usually preinstalled. For Windows users, the free and open-source (de)compression program 7-Zip is available.
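
If none of these programs is at hand, both files can also be unpacked with the Python standard library. The following is a minimal sketch rather than an official RiboGrove tool: the helper names `gunzip` and `unzip` are made up for illustration, and the file names in the usage comments are placeholders for whatever names your downloaded files have.

```python
import gzip
import shutil
import zipfile


def gunzip(src, dst):
    """Decompress the gzip file `src` into the plain file `dst`."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def unzip(src, dst_dir):
    """Extract all members of the zip archive `src` into `dst_dir`.

    Returns the list of member names found in the archive.
    """
    with zipfile.ZipFile(src) as archive:
        archive.extractall(dst_dir)
        return archive.namelist()


# Usage (placeholder file names, adjust to your downloads):
# gunzip("ribogrove_20.226_sequences.fasta.gz", "ribogrove_20.226_sequences.fasta")
# print(unzip("metadata.zip", "metadata"))
```

Note that decompressing the fasta file is optional for the Seqkit examples on this page, because Seqkit reads gzipped files directly.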

RiboGrove release archive

You can find all releases in the RiboGrove release archive.

Release notes

No important differences from the previous release.

You can find notes to all RiboGrove releases on the release notes page.

Statistical summary

RiboGrove size
	Bacteria	Archaea	Total
Number of gene sequences	235,023	1,026	236,049
Number of unique gene sequences	58,334	731	59,065
Number of species	10,885	466	11,351
Number of genomes	42,805	594	43,399
Number of genomes of category 1	28,873	244	29,117
Number of genomes of category 2	13,716	350	14,066
Number of genomes of category 3	216	0	216

16S rRNA gene lengths
	Bacteria	Archaea
Minimum (bp)	1,401.00	1,439.00
25th percentile (bp) ^*	1,517.00	1,471.00
Median (bp) ^*	1,530.00	1,473.00
75th percentile (bp) ^*	1,542.00	1,483.75
Average (bp) ^*	1,527.47	1,492.20
Mode (bp) ^*	1,537.00	1,472.00
Maximum (bp)	2,438.00	3,604.00
Standard deviation (bp) ^*	25.34	124.82

^* Metrics marked with an asterisk were calculated after normalization: the median gene length within each species was computed first, and the summary statistics were then calculated from these per-species medians.
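
The normalization described in the footnote can be sketched as follows. This is an illustration under assumptions, not the code used to build the table: the function name `species_normalized_lengths` and the toy records are invented. Collapsing each species to one median presumably keeps species with many sequenced genomes from dominating the summary.

```python
from collections import defaultdict
from statistics import mean, median


def species_normalized_lengths(gene_records):
    """Collapse (species, gene_length) pairs to one median length per species."""
    by_species = defaultdict(list)
    for species, length in gene_records:
        by_species[species].append(length)
    return {sp: median(lengths) for sp, lengths in by_species.items()}


# Toy data, NOT taken from RiboGrove.
records = [
    ("Species A", 1530), ("Species A", 1532), ("Species A", 1534),
    ("Species B", 1500),
    ("Species C", 1470), ("Species C", 1480),
]

per_species = species_normalized_lengths(records)
normalized = list(per_species.values())

# Summary statistics are computed from one value per species,
# not from every individual gene sequence.
print("per-species medians:", per_species)
print("median of medians:", median(normalized))
print("mean of medians:", mean(normalized))
```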

16S rRNA gene copy number
Copy number ^*	Bacteria		Archaea
	Number of species	Percent of species (%)	Number of species	Percent of species (%)
1	1,374	12.62	241	51.72
2	1,843	16.93	135	28.97
3	1,530	14.06	68	14.59
4	1,336	12.27	17	3.65
5	836	7.68	5	1.07
6	1,335	12.26	0	0.00
7	993	9.12	0	0.00
8	599	5.50	0	0.00
9	306	2.81	0	0.00
10	290	2.66	0	0.00
11	135	1.24	0	0.00
12	122	1.12	0	0.00
13	47	0.43	0	0.00
14	78	0.72	0	0.00
15	22	0.20	0	0.00
16	8	0.07	0	0.00
17	11	0.10	0	0.00
18	7	0.06	0	0.00
19	2	0.02	0	0.00
20	7	0.06	0	0.00
21	1	0.01	0	0.00
24	1	0.01	0	0.00
27	1	0.01	0	0.00
37	1	0.01	0	0.00

^* These are median within-species copy numbers.

Top-10 longest 16S rRNA genes
Organism	Gene length (bp)	RiboGrove Sequence ID(s)	Assembly accession
Bacteria
Thermus thermophilus strain AA2-2	2,438	GCF_019974355.1:NZ_AP024929.1:249100-251537:minus	GCF_019974355.1
Ca. Annandia pinicola strain Ad13-065	1,887	GCF_020541245.1:NZ_CP045876.1:290071-291957:minus	GCF_020541245.1
Thermoanaerobacter ethanolicus strain JW 200	1,812	GCF_003722315.1:NZ_CP033580.1:456062-457873:plus	GCF_003722315.1
Nitrosophilus labii strain HRV44	1,806	GCF_014466985.1:NZ_AP022826.1:1258017-1259822:minus GCF_014466985.1:NZ_AP022826.1:1532588-1534393:minus GCF_014466985.1:NZ_AP022826.1:1939914-1941719:minus	GCF_014466985.1
Sporomusa rhizae strain DSM 16652	1,802	GCF_041428845.1:NZ_CP156925.1:3123180-3124981:minus	GCF_041428845.1
Gelria sp. Kuro-4	1,788	GCF_019668485.1:NZ_AP024619.1:2016182-2017969:minus	GCF_019668485.1
Helicobacter mastomyrinus strain Hm-17	1,785	GCF_039555295.1:NZ_CP145316.1:765140-766924:minus	GCF_039555295.1
Thermoanaerobacter pseudethanolicus strain ATCC 33223	1,781	GCF_000019085.1:NC_010321.1:2265744-2267524:minus	GCF_000019085.1
Thermoanaerobacter brockii strain Ako-1	1,781	GCF_000175295.2:NC_014964.1:2252888-2254668:minus	GCF_000175295.2
Thermoanaerobacter sp. RKWS2	1,754	GCF_026240795.1:NZ_CP110888.1:94012-95765:plus	GCF_026240795.1
Archaea
Pyrobaculum ferrireducens strain 1860	3,604	GCF_000234805.1:NC_016645.1:127214-130817:plus	GCF_000234805.1
Pyrobaculum aerophilum strain IM2	2,213	GCF_000007225.1:NC_003364.1:1089640-1091852:plus	GCF_000007225.1
Pyrobaculum arsenaticum strain DSM 13514	2,212	GCF_000016385.1:NC_009376.1:623323-625534:minus	GCF_000016385.1
Aeropyrum pernix strain K1	2,202	GCF_000011125.1:NC_000854.2:1218712-1220913:minus	GCF_000011125.1
Pyrobaculum neutrophilum strain V24Sta	2,197	GCF_000019805.1:NC_010525.1:690419-692615:plus	GCF_000019805.1
Ca. Mancarchaeum acidiphilum strain Mia14	2,008	GCF_002214165.1:NZ_CP019964.1:751297-753304:minus	GCF_002214165.1
Ca. Micrarchaeum sp. A_DKE	2,003	GCF_016806735.1:NZ_CP060530.1:203642-205644:minus	GCF_016806735.1
Caldivirga maquilingensis strain IC-167	1,679	GCF_000018305.1:NC_009954.1:129150-130828:minus	GCF_000018305.1
Aeropyrum camini strain SY1	1,650	GCF_000591035.1:NC_022521.1:1165168-1166817:minus	GCF_000591035.1
Pyrolobus fumarii strain 1A	1,576	GCF_000223395.1:NC_015931.1:84671-86246:minus	GCF_000223395.1

Top-10 shortest 16S rRNA genes
Organism	Gene length (bp)	RiboGrove Sequence ID(s)	Assembly accession
Bacteria
Anabaena sp. YBS01	1,401	GCF_009498015.1:NZ_CP034058.1:6920299-6921699:minus	GCF_009498015.1
Clostridioides difficile strain TW11	1,426	GCF_009362915.1:NZ_CP045224.1:4068440-4069865:minus	GCF_009362915.1
Staphylococcus warneri strain TWSL_1	1,440	GCF_032147125.1:NZ_CP135051.1:2625669-2627108:plus	GCF_032147125.1
Roseicitreum antarcticum strain ZS2-28	1,447	GCF_014681765.1:NZ_CP061498.1:3436150-3437596:plus	GCF_014681765.1
Hirschia baltica strain ATCC 49814	1,448	GCF_000023785.1:NC_012982.1:2336679-2338126:minus	GCF_000023785.1
Sagittula stellata strain E-37	1,449	GCF_039724765.1:NZ_CP155729.1:664616-666064:plus GCF_039724765.1:NZ_CP155729.1:1804792-1806240:plus	GCF_039724765.1
Mameliella alba strain KU6B	1,449	GCF_011405015.1:NZ_AP022337.1:1420943-1422391:plus GCF_011405015.1:NZ_AP022337.1:3191212-3192660:minus GCF_011405015.1:NZ_AP022337.1:267140-268588:plus	GCF_011405015.1
Sagittula sp. MA-2	1,449	GCF_030126985.1:NZ_CP126145.1:439-1887:plus GCF_030126985.1:NZ_CP126145.1:2907211-2908659:minus	GCF_030126985.1
Sagittula sp. P11	1,449	GCF_002814095.1:NZ_CP021913.1:3597920-3599368:plus GCF_002814095.1:NZ_CP021913.1:2386837-2388285:plus	GCF_002814095.1
Clostridioides difficile strain Cd18	1,450	GCF_018884705.1:NZ_CP037806.1:136016-137465:plus	GCF_018884705.1
Archaea
Ignicoccus hospitalis strain KIN4/I	1,439	GCF_000017945.1:NC_009776.1:728362-729800:plus	GCF_000017945.1
Methanocaldococcus lauensis strain SG7	1,457	GCF_902827225.1:NZ_LR792632.1:542755-544211:plus	GCF_902827225.1
Halorubrum sp. BOL3-1	1,463	GCF_004114375.1:NZ_CP034692.1:397753-399215:minus	GCF_004114375.1
Ca. Methanomethylophilus alvi strain Mx1201	1,466	GCF_000300255.2:NC_020913.1:283607-285072:plus	GCF_000300255.2
Natronomonas gomsonensis strain KCTC 4088	1,466	GCF_024300825.1:NZ_CP101323.1:2500564-2502029:plus	GCF_024300825.1
Salinirubellus salinus strain ZS-35-S2	1,466	GCF_025231485.1:NZ_CP104003.1:3070232-3071697:plus	GCF_025231485.1
Methanomethylophilus alvi strain MGYG-HGUT-02456	1,466	GCF_902387285.1:NZ_LR699000.1:283607-285072:plus	GCF_902387285.1
Methanospirillum hungatei strain GP1	1,466	GCF_019263745.1:NZ_CP077107.1:4649-6114:plus GCF_019263745.1:NZ_CP077107.1:1359562-1361027:minus GCF_019263745.1:NZ_CP077107.1:1365502-1366967:minus GCF_019263745.1:NZ_CP077107.1:1986020-1987485:minus	GCF_019263745.1
Natronomonas marina strain ZY43	1,466	GCF_024298905.1:NZ_CP101154.1:18680-20145:plus	GCF_024298905.1
Methanospirillum sp. J.3.6.1-F.2.7.3	1,466	GCF_018502485.1:NZ_CP075546.1:133354-134819:plus GCF_018502485.1:NZ_CP075546.1:825954-827419:plus GCF_018502485.1:NZ_CP075546.1:872641-874106:plus GCF_018502485.1:NZ_CP075546.1:1727419-1728884:plus	GCF_018502485.1
Natronomonas halophila strain C90	1,466	GCF_013391085.1:NZ_CP058334.1:1530622-1532087:minus	GCF_013391085.1
Salinirubellus sp. SYNS196	1,466	GCF_037335815.1:NZ_CP147841.1:597195-598660:minus	GCF_037335815.1
Methanomethylophilus alvi strain Mx-05	1,466	GCF_003711245.1:NZ_CP017686.1:283608-285073:plus	GCF_003711245.1
Methanospirillum hungatei strain JF-1	1,466	GCF_000013445.1:NC_007796.1:39814-41279:plus GCF_000013445.1:NC_007796.1:1301079-1302544:minus GCF_000013445.1:NC_007796.1:3501525-3502990:minus GCF_000013445.1:NC_007796.1:3507609-3509074:minus	GCF_000013445.1

Top-10 genomes with the largest 16S rRNA copy numbers
Organism	Copy number	Assembly accession
Bacteria
Tumebacillus avium strain AR23208	37	GCF_002162355.1
Tumebacillus algifaecis strain THMBR28	27	GCF_002243515.1
Photobacterium phosphoreum strain MIP2473	24	GCF_949787665.1
Photobacterium damselae strain Phdp Wu-1	21	GCF_003130755.1
Photobacterium damselae strain Pdd1411	21	GCF_030168855.1
Aneurinibacillus sp. Ricciae_BoGa-3	21	GCF_028421645.1
Peribacillus asahii strain KF4	21	GCF_023823975.1
Photobacterium damselae strain AS-15-0759-2	20	GCF_021768425.1
Photobacterium damselae strain XP-11	20	GCF_023973125.1
Photobacterium damselae strain CSP DAM2	20	GCF_021765875.1
Clostridium tagluense strain CM008	20	GCF_030585445.1
Neobacillus sp. SuZ13	20	GCF_030123365.1
Photobacterium damselae strain 04Ya311	20	GCF_026001825.1
Photobacterium damselae strain WMD-P3	20	GCF_038086915.1
Photobacterium damselae strain 9046-81	20	GCF_009763125.1
Photobacterium damselae strain WMD-P2	20	GCF_038086725.1
Photobacterium damselae strain CSP DAM1	20	GCF_021766015.1
Photobacterium damselae strain WMD-P1	20	GCF_038086615.1
Photobacterium damselae strain AS-16-0963-1	20	GCF_021768345.1
Photobacterium damselae strain RM-71	20	GCF_001708035.2
Domibacillus sp. DTU_2020_1001157_1_SI_ALB_TIR_016	20	GCF_032341995.1
Moritella sp. 5	20	GCF_018219455.1
Moritella sp. 28	20	GCF_018219435.1
Photobacterium toruni strain WD2103	20	GCF_024494545.1
Moritella sp. 36	20	GCF_018219415.1
Photobacterium damselae strain AS-15-3942-9	20	GCF_021768365.1
Neobacillus drentensis strain JC05	20	GCF_021560175.1
Photobacterium damselae strain AS-15-3942-7	20	GCF_021768405.1
Archaea
Natrinema sp. SYSU A 869	5	GCF_019879105.1
Natronorubrum bangense strain JCM 10635	5	GCF_004799645.1
Methanoplanus endosymbiosus strain DSM 3599	5	GCF_024662215.1
Natronorubrum aibiense strain 7-3	5	GCF_009392895.1
Methanococcoides orientis strain LMO-1	5	GCF_021184045.1
Haloterrigena salifodinae strain BOL5-1	4	GCF_016906025.1
Methanosphaera stadtmanae strain MGYG-HGUT-02164	4	GCF_902384015.1
Methanococcus vannielii strain SB	4	GCF_000017165.1
Methanogenium sp. S4BF	4	GCF_029633965.1
Natronococcus occultus strain SP4	4	GCF_000328685.1
Methanogenium organophilum strain DSM 3596	4	GCF_026684035.1
Natrinema thermotolerans strain A29	4	GCF_031165565.1
Methanospirillum hungatei strain JF-1	4	GCF_000013445.1
Methanosphaera stadtmanae strain DSM 3091	4	GCF_000012545.1
Methanolobus mangrovi strain FTZ2	4	GCF_031312535.1
Methanolobus sp. WCC4	4	GCF_038022665.1
Uncultured Methanospirillum sp.	4	GCF_963668415.1
Halomicrobium salinisoli strain LT50	4	GCF_020405185.1
Halomicrobium salinisoli strain TH30	4	GCF_020405245.1
Methanolobus sediminis strain FTZ6	4	GCF_031312595.1
Methanoplanus sp. FWC-SCC4	4	GCF_032878975.1
Methanospirillum hungatei strain GP1	4	GCF_019263745.1
Halomicrobium urmianum strain IBRC-M: 10911	4	GCF_020217425.1
Haloarcula sinaiiensis strain ATCC 33800	4	GCF_018200015.1
Uncultured Methanolobus sp.	4	GCF_963674485.1
Methanospirillum sp. J.3.6.1-F.2.7.3	4	GCF_018502485.1
Uncultured Methanospirillum sp.	4	GCF_963668475.1

Top-10 genomes with the highest intragenomic variability of 16S rRNA genes
Organism	Sum of entropy^* (bits)	Mean entropy^* (bits)	Number of variable positions	Gene copy number	Assembly accession
Bacteria
Escherichia coli strain P276M	433.81	0.26	569	6	GCF_009762385.1
Listeria monocytogenes strain 10-092876-1155 LM6	357.10	0.20	370	3	GCF_001999045.1
Klebsiella pneumoniae strain GZ-1	304.27	0.18	464	8	GCF_014854815.1
Streptococcus infantis strain SO	291.50	0.18	308	3	GCF_021497965.1
Synechococcus sp. NB0720_010	243.35	0.16	265	3	GCF_023078835.1
Streptomyces griseorubiginosus strain NBC_00586	231.55	0.15	342	6	GCF_036345135.1
Caminibacter mediatlanticus strain TB-2	228.78	0.15	282	4	GCF_005843985.1
Xanthomonas oryzae strain YNCX	227.74	0.15	248	3	GCF_024499285.1
Sporomusa termitida strain DSM 4440	226.25	0.13	247	12	GCF_007641255.1
Campylobacter hyointestinalis strain CHY5	217.64	0.13	237	3	GCF_013372165.1
Archaea
Halomicrobium sp. ZPS1 ^**	137.00	0.09	137	2	GCF_009217585.1
Halomicrobium urmianum strain IBRC-M: 10911	131.55	0.09	146	4	GCF_020217425.1
Halapricum desulfuricans strain HSR12-2	128.00	0.09	128	2	GCF_017094525.1
Halomicrobium salinisoli strain TH30	127.74	0.09	145	4	GCF_020405245.1
Halapricum desulfuricans strain HSR-Bgl	127.00	0.09	127	2	GCF_017094445.1
Halomicrobium mukohataei strain JP60	125.81	0.09	137	3	GCF_004803735.1
Halomicrobium sp. HM KBTZ05	124.38	0.08	134	3	GCF_041530035.1
Halomicrobium salinisoli strain LT50	123.31	0.08	140	4	GCF_020405185.1
Halapricum desulfuricans strain HSR-Est	111.00	0.08	111	2	GCF_017094465.1
Halapricum desulfuricans strain HSR12-1	109.00	0.07	109	2	GCF_017094505.1

^* Entropy is Shannon entropy calculated for each column of the multiple sequence alignment (MSA) of all full-length 16S rRNA genes of a genome. Entropy is then summed up (column “Sum of entropy”) and averaged (column “Mean entropy”).

^** Halomicrobium sp. ZPS1 is a remarkable case. This genome harbours two 16S rRNA genes, so each mismatching alignment column contributes exactly 1 bit, and the sum of entropy equals the number of mismatching nucleotides between the two gene sequences. Accordingly, the percent identity between these two sequences is only 90.70%. This is notable because the usual (albeit arbitrary) percent identity threshold for genus demarcation is 95%.
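
To make the entropy columns concrete, here is a small sketch of per-column Shannon entropy on an alignment that is already computed. It is an illustration, not the RiboGrove pipeline: the function names are invented, the toy sequences are not real genes, and treating a gap character as just another symbol is an assumption of this sketch.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column."""
    counts = Counter(column)
    total = len(column)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


def entropy_summary(aligned_seqs):
    """Return (sum of entropy, mean entropy, number of variable positions).

    `aligned_seqs` must be equal-length rows of a multiple sequence alignment.
    """
    entropies = [column_entropy(col) for col in zip(*aligned_seqs)]
    variable = sum(1 for h in entropies if h > 0)
    return sum(entropies), sum(entropies) / len(entropies), variable


# Toy alignment of two "gene copies" that differ at a single position.
total, mean_entropy, n_variable = entropy_summary(["ACGTACGT", "ACGAACGT"])
print(total, mean_entropy, n_variable)
```

With only two sequences, a column is either identical (0 bits) or split 1:1 (1 bit), which is why the sum of entropy equals the number of mismatches in the two-gene case.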

Coverage^* of primer pairs for different V-regions of bacterial 16S rRNA genes
Phylum	Number of genomes	Full gene	V1–V2	V1–V3	V3–V4	V3–V5	V4	V4–V5	V4–V6	V5–V6	V5–V7	V6–V7	V6–V8
Phylum	Number of genomes	27F–1492R (%)	27F–338R (%)	27F–534R (%)	341F–785R (%)	341F–944R (%)	515F–806R (%)	515F–944R (%)	515F–1100R (%)	784F–1100R (%)	784F–1193R (%)	939F–1193R (%)	939F–1378R (%)
Pseudomonadota	23,482	99.71	99.48	99.68	99.94	83.90	99.88	83.96	88.98	88.72	93.62	92.61	96.43
Bacillota	9,818	99.86	99.76	99.80	99.92	95.27	99.96	95.13	99.48	98.15	97.54	98.67	99.46
Actinomycetota	4,174	99.86	99.07	99.64	94.20	66.29	94.03	66.12	96.48	99.69	99.78	99.81	97.05
Bacteroidota	1,482	96.15	95.75	96.09	99.80	61.13	99.33	60.80	38.06	38.19	94.26	92.31	95.14
Campylobacterota	1,181	100.00	100.00	100.00	100.00	100.00	99.92	99.92	99.92	99.49	99.49	99.66	99.49
Mycoplasmatota	698	89.26	84.10	71.78	98.57	90.40	98.57	90.69	73.64	48.42	42.55	76.22	0.57
Spirochaetota	468	47.86	48.29	48.29	92.52	99.79	92.52	99.79	99.79	79.49	79.49	92.95	38.68
Cyanobacteriota	285	99.65	99.65	99.65	100.00	4.56	100.00	4.56	100.00	1.40	1.40	100.00	99.65
Fusobacteriota	215	100.00	98.60	100.00	100.00	100.00	100.00	100.00	100.00	97.67	97.67	100.00	0.00
Chlamydiota	206	0.00	0.00	0.00	100.00	100.00	0.00	0.00	0.00	100.00	100.00	100.00	94.66
Thermodesulfobacteriota	163	100.00	99.39	100.00	100.00	46.01	100.00	46.01	100.00	90.18	86.50	92.64	93.87
Verrucomicrobiota	137	99.27	0.00	99.27	100.00	12.41	100.00	12.41	100.00	1.46	1.46	98.54	98.54
Deinococcota	89	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	56.18	100.00
Planctomycetota	62	100.00	17.74	100.00	100.00	62.90	100.00	62.90	0.00	0.00	0.00	3.23	0.00
Myxococcota	51	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Chloroflexota	49	100.00	91.84	100.00	42.86	0.00	93.88	0.00	89.80	10.20	10.20	93.88	28.57
Thermotogota	43	100.00	97.67	100.00	100.00	9.30	100.00	9.30	100.00	0.00	0.00	58.14	97.67
Acidobacteriota	43	95.35	95.35	95.35	100.00	100.00	100.00	100.00	100.00	72.09	60.47	88.37	100.00
Bdellovibrionota	24	100.00	100.00	100.00	100.00	62.50	100.00	62.50	100.00	100.00	100.00	100.00	100.00
Aquificota	17	100.00	11.76	100.00	100.00	11.76	100.00	11.76	100.00	0.00	0.00	0.00	11.76
Nitrospirota	15	100.00	100.00	100.00	100.00	73.33	100.00	73.33	100.00	100.00	73.33	73.33	100.00
Chlorobiota	15	100.00	100.00	100.00	100.00	0.00	0.00	0.00	0.00	100.00	93.33	86.67	6.67
Rhodothermota	12	33.33	33.33	33.33	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Ca. Saccharibacteria	10	100.00	100.00	100.00	100.00	10.00	10.00	10.00	10.00	0.00	0.00	100.00	100.00
Synergistota	9	100.00	100.00	100.00	100.00	0.00	100.00	0.00	100.00	0.00	0.00	100.00	100.00
Gemmatimonadota	6	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Deferribacterota	6	100.00	100.00	100.00	100.00	0.00	100.00	0.00	100.00	100.00	100.00	100.00	100.00
Elusimicrobiota	4	100.00	50.00	100.00	100.00	0.00	100.00	0.00	100.00	75.00	75.00	100.00	100.00
Ignavibacteriota	3	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Thermomicrobiota	2	100.00	100.00	100.00	100.00	0.00	100.00	0.00	100.00	0.00	0.00	50.00	50.00
Dictyoglomota	2	100.00	100.00	100.00	100.00	0.00	100.00	0.00	100.00	0.00	0.00	100.00	0.00
Thermodesulfobiota	2	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00
Kiritimatiellota	2	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00
Atribacterota	2	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00
Armatimonadota	2	100.00	50.00	100.00	50.00	50.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Chrysiogenota	2	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Fibrobacterota	2	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Lentisphaerota	1	100.00	0.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00
Ca. Omnitrophota	1	100.00	100.00	100.00	100.00	0.00	100.00	0.00	100.00	100.00	100.00	100.00	100.00
Coprothermobacterota	1	0.00	0.00	0.00	100.00	100.00	100.00	100.00	0.00	0.00	0.00	100.00	0.00
Ca. Fervidibacterota	1	100.00	0.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00
Ca. Cloacimonadota	1	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Ca. Bipolaricaulota	1	0.00	0.00	0.00	100.00	100.00	100.00	100.00	0.00	0.00	0.00	0.00	0.00
Balneolota	1	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Calditrichota	1	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Nitrospinota	1	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00
Caldisericota	1	100.00	100.00	100.00	100.00	0.00	0.00	0.00	0.00	0.00	100.00	100.00	100.00
Ca. Absconditabacteria	1	100.00	0.00	100.00	100.00	0.00	100.00	0.00	0.00	0.00	100.00	0.00	0.00
Thermosulfidibacterota	1	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00
Vulcanimicrobiota	1	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00

^* Coverage of a primer pair is the percent of genomes having at least one 16S rRNA gene which can be amplified by PCR using this primer pair. For details, see our paper about RiboGrove.

You can find a more detailed table in the file primer_pair_genomic_coverage.tsv in the metadata. That table contains coverage not just for phyla, but also for each bacterial class, order, family, genus, and species. It also contains coverage values for the primer pair 1115F–1492R (V7–V9 region), which is omitted from the table above for brevity.

Primers used for coverage estimation
Primer name	Sequence	Reference
27F	AGAGTTTGATYMTGGCTCAG	Frank et al., 2008
338R	GCTGCCTCCCGTAGGAGT	Suzuki et al., 1996
341F^*	CCTACGGGNGGCWGCAG	Klindworth et al., 2013
515F	GTGCCAGCMGCCGCGGTAA	Turner et al., 1999
534R	ATTACCGCGGCTGCTGG	Walker et al., 2015
784F	AGGATTAGATACCCTGGTA	Andersson et al., 2008
785R^*	GACTACHVGGGTATCTAATCC	Klindworth et al., 2013
806R	GGACTACHVGGGTWTCTAAT	Caporaso et al., 2010
939F	GAATTGACGGGGGCCCGCACAAG	Lebuhn et al., 2014
944R	GAATTAAACCACATGCTC	Fuks et al., 2018
1100R	AGGGTTGCGCTCGTTG	Turner et al., 1999
1193R	ACGTCATCCCCACCTTCC	Bodenhausen et al., 2013
1378R	CGGTGTGTACAAGGCCCGGGAACG	Lebuhn et al., 2014
1492R	TACCTTGTTACGACTT	Frank et al., 2008

^* Primers 341F and 785R are used in the protocol for library preparation for sequencing of V3–V4 region of 16S rRNA genes on Illumina MiSeq.
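
Several primers in the table contain degenerate IUPAC bases (Y, M, N, W, H, V). The sketch below shows one simple way to turn such primers into regular expressions and to compute coverage as defined above, i.e. the percent of genomes with at least one amplifiable gene. It is a simplified illustration only: it assumes exact primer matching with no mismatches allowed, the function names and toy genes are invented, and the procedure actually used for RiboGrove (described in the RiboGrove paper) may differ.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def primer_regex(primer):
    """Translate a (possibly degenerate) primer into a regex string."""
    return "".join(IUPAC[base] for base in primer)


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def can_amplify(gene, forward, reverse):
    """True if the forward primer matches the gene and the reverse
    complement of the reverse primer matches downstream of it."""
    fwd = re.search(primer_regex(forward), gene)
    if fwd is None:
        return False
    site = primer_regex(reverse_complement(reverse))
    return re.search(site, gene[fwd.end():]) is not None


def coverage(genomes, forward, reverse):
    """Percent of genomes having at least one amplifiable gene."""
    hits = sum(
        any(can_amplify(gene, forward, reverse) for gene in genes)
        for genes in genomes.values()
    )
    return 100.0 * hits / len(genomes)


# Primer sequences 27F and 338R are taken from the table above.
F27, R338 = "AGAGTTTGATYMTGGCTCAG", "GCTGCCTCCCGTAGGAGT"

# Toy "genes" (not real sequences): good_gene carries both primer sites,
# bad_gene has a G where 27F requires Y (C or T).
good_gene = "AGAGTTTGATCCTGGCTCAG" + "GATTACA" + reverse_complement(R338)
bad_gene = "AGAGTTTGATGCTGGCTCAG" + "GATTACA" + reverse_complement(R338)

genomes = {
    "genome_1": [good_gene],
    "genome_2": [bad_gene, good_gene],
    "genome_3": [bad_gene],
}
print(round(coverage(genomes, F27, R338), 2))
```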

Searching data in RiboGrove

RiboGrove is a very minimalistic database: it comprises a collection of plain fasta files with metadata. Advanced search tools are therefore not available for it. We acknowledge this limitation and provide the suggestions below to help you explore and select RiboGrove data.

Header format

Headers in RiboGrove fasta files have the following format:

>GCF_000978375.1:NZ_CP009686.1:8908-10459:plus ;d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__cereus; category:1

Major blocks of a header are separated by spaces. A header consists of three such blocks:

Sequence ID (seqID): GCF_000978375.1:NZ_CP009686.1:8908-10459:plus. SeqID, in turn, consists of four parts separated by colons (:):
1. The Assembly accession of the genome from which the gene originates: GCF_000978375.1.
2. The accession number of the RefSeq sequence, from which the gene originates: NZ_CP009686.1.
3. Coordinates of the gene within this RefSeq genomic sequence: 8908-10459 (coordinates are 1-based, left-closed and right-closed).
4. Strand of the RefSeq genomic sequence, where the gene is located: plus (or minus).
A taxonomy string, comprising domain (Bacteria), phylum (Firmicutes), class (Bacilli), order (Bacillales), family (Bacillaceae), genus (Bacillus) names, and the specific epithet (cereus).
Each name is preceded by a prefix, which denotes rank: d__ for domain, p__ for phylum, c__ for class, o__ for order, f__ for family, g__ for genus, and s__ for specific epithet. Prefixes contain double underscores.
The taxonomic names are separated and flanked by semicolons (;).
The category of the genome, from which the gene sequence originates: (category:1).
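
The header layout described above can be parsed with a few string splits. The following sketch is illustrative: the `RiboGroveHeader` type and `parse_header` function are names invented here, not part of any RiboGrove tooling. It splits on the first and the last space only, so that a taxonomic name containing a space would not break the parsing.

```python
from typing import NamedTuple


class RiboGroveHeader(NamedTuple):
    assembly: str    # Assembly accession
    accession: str   # RefSeq Accession.Version
    start: int       # 1-based, inclusive
    end: int         # 1-based, inclusive
    strand: str      # "plus" or "minus"
    taxonomy: dict   # rank prefix -> name, e.g. {"g": "Bacillus"}
    category: int


def parse_header(header):
    """Parse one RiboGrove fasta header line into its components."""
    header = header.lstrip(">").strip()
    seq_id, rest = header.split(" ", 1)
    taxonomy_str, category_str = rest.rsplit(" ", 1)

    assembly, accession, coords, strand = seq_id.split(":")
    start, end = (int(x) for x in coords.split("-"))
    taxonomy = dict(
        item.split("__", 1)
        for item in taxonomy_str.strip().strip(";").split(";")
    )
    category = int(category_str.split(":")[1])
    return RiboGroveHeader(
        assembly, accession, start, end, strand, taxonomy, category
    )


example = (
    ">GCF_000978375.1:NZ_CP009686.1:8908-10459:plus "
    ";d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;"
    "f__Bacillaceae;g__Bacillus;s__cereus; category:1"
)
parsed = parse_header(example)
print(parsed.assembly, parsed.taxonomy["g"], parsed.category)

# Both coordinates are inclusive, so the gene length is end - start + 1.
print(parsed.end - parsed.start + 1)
```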

Sequence selection

You can select specific sequences from fasta files using the Seqkit program. It is free, cross-platform, multifunctional, and fast, and it can process both gzipped and uncompressed fasta files. The seqkit grep and seqkit seq subcommands are useful for sequence selection.

Search sequences by header

Given the downloaded fasta file ribogrove_20.226_sequences.fasta.gz, consider the following examples of sequence selection using seqkit grep:

Example 1. Select a single sequence by SeqID.

seqkit grep -p "GCF_000978375.1:NZ_CP009686.1:8908-10459:plus" ribogrove_20.226_sequences.fasta.gz

The -p option sets the pattern to search for. By default, seqkit grep compares the pattern with sequence IDs only, not with whole headers.

Example 2. Select all gene sequences of a single RefSeq genomic sequence by accession number NZ_CP009686.1.

seqkit grep -nrp ":NZ_CP009686.1:" ribogrove_20.226_sequences.fasta.gz

Here, two more options are required: -n and -r. The former tells the program to match against whole headers instead of sequence IDs only. The latter makes the program treat the pattern as a regular expression, so a header is selected whenever the pattern matches any part of it.

To ensure search specificity, surround the Accession.Version with colons (:).

Example 3. Select all gene sequences of a single genome (Assembly accession GCF_019357495.1).

seqkit grep -nrp "GCF_019357495.1:" ribogrove_20.226_sequences.fasta.gz

To ensure search specificity, put a colon (:) after the assembly accession.

Example 4. Select all actinobacterial sequences.

seqkit grep -nrp ";p__Actinobacteria;" ribogrove_20.226_sequences.fasta.gz

To ensure search specificity, surround the taxonomy name with semicolons (;).

Example 5. Select all sequences originating from category 1 genomes.

seqkit grep -nrp "category:1" ribogrove_20.226_sequences.fasta.gz

Example 6. Select all sequences except for those belonging to Firmicutes.

seqkit grep -nvrp ";p__Firmicutes;" ribogrove_20.226_sequences.fasta.gz

Note the -v option within the option sequence -nvrp. This option inverts the match, i.e. the output will comprise the sequences whose headers do not contain the substring “;p__Firmicutes;”.
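
The advice to surround patterns with delimiters can be illustrated with Python's `re` module, whose behaviour for these simple patterns matches ordinary regular-expression substring search. The two headers below are hypothetical and were constructed so that one accession is a prefix of the other; the `select` helper is likewise invented for this sketch.

```python
import re

# Hypothetical SeqIDs: the second accession differs only by a longer version.
headers = [
    "GCF_000000001.1:NZ_CP009686.1:8908-10459:plus",
    "GCF_000000002.1:NZ_CP009686.10:120-1671:minus",
]


def select(pattern, names):
    """Return the names in which the regular expression occurs anywhere."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


# Without delimiters, the pattern also hits the ".10" version.
loose = select("NZ_CP009686.1", headers)

# With colons on both sides, only the intended accession is selected.
strict = select(":NZ_CP009686.1:", headers)

# In a regular expression an unescaped dot matches any character;
# re.escape makes it match a literal dot only.
literal = select(re.escape(":NZ_CP009686.1:"), headers)

print(len(loose), len(strict), len(literal))
```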

Search sequences by length

You can use the seqkit seq program to select sequences by length.

Example 1. Select all sequences longer than 1600 bp.

seqkit seq -m 1601 ribogrove_20.226_sequences.fasta.gz

The -m option sets the minimum length of a sequence to be printed to output.

Example 2. Select all sequences shorter than 1500 bp.

seqkit seq -M 1499 ribogrove_20.226_sequences.fasta.gz

The -M option sets the maximum length of a sequence to be printed to output.

Example 3. Select all sequences having length in range [1500, 1600] bp.

seqkit seq -m 1500 -M 1600 ribogrove_20.226_sequences.fasta.gz

Selecting header data

It is sometimes useful to retrieve only header information from a fasta file. You can use the seqkit seq program for it.

Example 1. Select all headers.

seqkit seq -n ribogrove_20.226_sequences.fasta.gz

The -n option tells the program to output only headers.

Example 2. Select all SeqIDs (header parts before the first space).

seqkit seq -ni ribogrove_20.226_sequences.fasta.gz

The -i option tells the program to output only sequence IDs.

Example 3. Select all RefSeq “Accession.Version”s.

seqkit seq -ni ribogrove_20.226_sequences.fasta.gz | cut -f2 -d':' | sort | uniq

This works only if the cut, sort, and uniq utilities are installed (Linux and Mac OS systems should have them built in).

Example 4. Select all Assembly accessions.

seqkit seq -ni ribogrove_20.226_sequences.fasta.gz | cut -f1 -d':' | sort | uniq
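
On systems without cut, sort, and uniq, the same lists can be produced with the Python standard library alone. This is a sketch of an alternative, not a replacement for Seqkit; the function name `unique_seqid_fields` is invented, and the sort order may differ slightly from that of the `sort` utility, which depends on locale settings.

```python
import gzip


def unique_seqid_fields(fasta_gz_path, field):
    """Return the sorted unique values of one colon-separated SeqID field.

    field=0 gives Assembly accessions, field=1 gives RefSeq
    Accession.Versions (the same fields as cut -f1 and cut -f2).
    """
    values = set()
    with gzip.open(fasta_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                # The SeqID is the header part before the first whitespace.
                seq_id = line[1:].split(maxsplit=1)[0]
                values.add(seq_id.split(":")[field])
    return sorted(values)


# Usage with the downloaded file:
# for accession in unique_seqid_fields("ribogrove_20.226_sequences.fasta.gz", 0):
#     print(accession)
```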