RiboGrove mirror

The main website where RiboGrove is hosted may be unavailable outside Belarus due to technical troubles and the overall disaster.
Hence this mirror has been created, and RiboGrove files are available through Zenodo (links are below).


Downloads
- RiboGrove release 7.213
- RiboGrove release archive

Statistical summary
Searching data in RiboGrove

Downloads

RiboGrove release 7.213 (2022-07-20)

The release is based on RefSeq release 213.

A fasta file of full-length 16S gene sequences. Download (gzipped fasta file, 5.87 MB)
A “raw” version of the fasta file above. This file also contains partial sequences. Download (gzipped fasta file, 6.16 MB)
Metadata. Download (zip archive, 19.89 MB)
What information exactly does the metadata contain?
The metadata consists of the following files:
1. source_RefSeq_genomes.tsv
  This is a TSV file, which contains information about which genomes were used to construct RiboGrove.
2. gene_seqs_statistics.tsv, raw_gene_seqs_statistics.tsv
  These are TSV files, which contain the nucleotide composition, size, and genomic and taxonomic affiliation of the gene sequences. The “raw” file also includes information about partial genes.
3. per_replicon_statistics.tsv, raw_per_replicon_statistics.tsv
  These are TSV files, which contain information about the number of 16S rRNA genes in each RefSeq genomic sequence and about the sizes of these genes. The “raw” file also includes information about partial genes.
4. categories.tsv
  This is a TSV file, which contains information about which category was assigned to each genome and why. It also records which sequencing technology was used to sequence each genome.
5. taxonomy.tsv
  This is a TSV file, which contains taxonomic affiliation of each genome and gene.
6. intragenic_repeats.tsv
  This is a TSV file, which contains information about intragenic repeats found in gene sequences using RepeatFinder.
7. cmscan_output_table.tblout
  This is a TSV file, which contains the complete output of the cmscan program produced after processing all 16S rRNA sequences.
8. entropy_summary.tsv
  This is a TSV file, which contains a summary of the intragenomic variability of the 16S rRNA genes. Intragenomic variability is calculated only for category 1 genomes having more than one 16S rRNA gene. It is evaluated using Shannon entropy: we align the gene sequences from each genome using MUSCLE and then calculate Shannon entropy for each column (i.e. base position) of the multiple alignment.
9. QIIME2-compatible-taxonomy.txt
  This is a TSV file, which can be used to train a QIIME2 classifier (see the tutorial).

The fasta files are compressed with gzip, and the metadata file is a zip archive. To uncompress them, Linux and Mac OS users may use the gzip and unzip programs, which should be built in. For Windows users, the free and open-source (de)compression program 7-Zip is available.
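As a rough illustration of the decompression step, the sketch below runs gzip on a tiny stand-in file created on the spot. The file names and the temporary directory are assumptions made for this demo, not actual RiboGrove file names. The unzip call for the metadata archive is shown only as a comment, with a placeholder archive name.

```shell
# Demo setup: a tiny gzipped fasta file standing in for a downloaded
# RiboGrove file (hypothetical name, created only for this example).
workdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nTTGA\n' | gzip > "$workdir/demo.fasta.gz"

# Decompress into a new file while keeping the original .gz file.
gzip -dc "$workdir/demo.fasta.gz" > "$workdir/demo.fasta"

# Alternatively, stream the content without writing an uncompressed copy,
# e.g. to count the sequences (header lines start with ">").
n_seqs=$(gzip -dc "$workdir/demo.fasta.gz" | grep -c '^>')
echo "sequences: $n_seqs"

# The metadata zip archive can be unpacked in a similar way
# (placeholder archive name, not executed here):
#   unzip metadata.zip -d metadata
```

Streaming with gzip -dc is convenient when the uncompressed file is only needed as input for another command.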

RiboGrove release archive

You can find all releases in the RiboGrove release archive.

Statistical summary

RiboGrove size
	Bacteria	Archaea	Total
Number of gene sequences	146,527	759	147,286
Number of unique gene sequences	40,497	553	41,050
Number of species	7,242	349	7,591
Number of genomes	27,737	442	28,179
Number of genomes of category 1	18,357	144	18,501
Number of genomes of category 2	9,250	298	9,548
Number of genomes of category 3	130	0	130

16S rRNA gene lengths
	Bacteria	Archaea
Minimum (bp)	1,448.00	1,439.00
25th percentile (bp) ^*	1,517.50	1,472.00
Median (bp) ^*	1,532.00	1,474.00
75th percentile (bp) ^*	1,543.00	1,488.00
Average (bp) ^*	1,528.35	1,498.54
Mode (bp) ^*	1,537.00	1,472.00
Maximum (bp)	2,438.00	3,604.00
Standard deviation (bp) ^*	25.71	143.68

^* Metrics marked with this sign were calculated with preliminary normalization, i.e. median within-species gene length was used for the summary.
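The normalization mentioned in this footnote can be sketched as follows. This is only an illustration, assuming a two-column, tab-separated input of species name and gene length. The function name median_per_species and the input layout are hypothetical and do not reflect the actual columns of the RiboGrove metadata files.

```shell
# Hypothetical helper: reads "species<TAB>value" lines from stdin and
# prints "species<TAB>median" for each species.
median_per_species() {
    tab=$(printf '\t')
    # Group lines by species and sort values numerically within a species.
    sort -t "$tab" -k1,1 -k2,2n | awk -F '\t' '
        function emit() {
            if (n == 0) return
            if (n % 2) med = v[(n + 1) / 2]
            else med = (v[n / 2] + v[n / 2 + 1]) / 2
            printf "%s\t%s\n", sp, med
            n = 0
        }
        $1 != sp { emit(); sp = $1 }   # a new species starts: flush the previous one
        { v[++n] = $2 }
        END { emit() }
    '
}

# Demo on made-up values: species A has three genes, species B has two.
printf 'A\t1500\nA\t1540\nA\t1520\nB\t1470\nB\t1480\n' | median_per_species
```

Summary statistics computed on the second column of this output give every species equal weight, regardless of how many genomes or gene copies it contributes.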

16S rRNA gene copy number
(Bacteria and Archaea)
Copy number^*	Number of species	Per cent of species (%)
1	1,059	13.95
2	1,462	19.26
3	1,161	15.29
4	961	12.66
5	610	8.04
6	756	9.96
7	604	7.96
8	373	4.91
9	188	2.48
10	164	2.16
11	85	1.12
12	62	0.82
13	32	0.42
14	45	0.59
15	10	0.13
16	5	0.07
17	4	0.05
18	4	0.05
20	4	0.05
27	1	0.01
37	1	0.01

^* These are median within-species copy numbers.
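A table of this shape can be derived from per-species median copy numbers with a simple tally. The sketch below assumes one median copy number per line on stdin. The function name copy_number_table is hypothetical, and the demo values are made up.

```shell
# Hypothetical helper: reads one per-species copy number per line and prints
# "copy number<TAB>number of species<TAB>per cent of species".
copy_number_table() {
    sort -n | uniq -c | awk '
        { count[NR] = $1; value[NR] = $2; total += $1 }
        END {
            for (i = 1; i <= NR; i++)
                printf "%s\t%d\t%.2f\n", value[i], count[i], 100 * count[i] / total
        }
    '
}

# Demo on four made-up species.
printf '1\n2\n2\n3\n' | copy_number_table
```

The percentages in the table above follow the same formula: for example, 1,059 of 7,591 species is about 13.95%.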

Top-10 longest 16S rRNA genes
Organism	Gene length (bp)	RiboGrove Sequence ID(s)	Assembly ID
Bacteria
Thermus thermophilus strain AA2-2	2,438	NZ_AP024929.1:249100-251537_minus	10898951
Ca. Annandia pinicola strain Ad13-065	1,887	NZ_CP045876.1:290071-291957_minus	11277031
Nitrosophilus labii strain HRV44	1,806	NZ_AP022826.1:1258017-1259822_minus NZ_AP022826.1:1532588-1534393_minus NZ_AP022826.1:1939914-1941719_minus	8028891
Gelria sp. Kuro-4	1,788	NZ_AP024619.1:2016182-2017969_minus	10731991
Thermoanaerobacter pseudethanolicus strain ATCC 33223	1,781	NC_010321.1:2265744-2267524_minus	40148
Thermoanaerobacter brockii strain Ako-1	1,781	NC_014964.1:2252888-2254668_minus	282748
Campylobacter sputorum strain RM3237	1,744	NZ_CP019682.1:607981-609724_plus NZ_CP019682.1:929565-931308_minus NZ_CP019682.1:1501945-1503688_minus	1153941
Campylobacter sputorum strain LMG 7795	1,744	NZ_CP043427.1:609141-610884_plus NZ_CP043427.1:930699-932442_minus NZ_CP043427.1:1503078-1504821_minus	4499991
Campylobacter sputorum strain CCUG 20703	1,743	NZ_CP019683.1:606847-608589_plus NZ_CP019683.1:935163-936905_minus NZ_CP019683.1:1558189-1559931_minus	1153911
Campylobacter sp. RM6137	1,742	NZ_CP018789.1:273370-275111_plus NZ_CP018789.1:1545743-1547484_minus	1101781
Campylobacter hyointestinalis strain CHY5	1,742	NZ_CP053828.1:357136-358877_plus NZ_CP053828.1:1667816-1669557_minus	7294871
Campylobacter sputorum strain RM8705	1,742	NZ_CP019685.1:577810-579551_plus NZ_CP019685.1:891862-893603_minus NZ_CP019685.1:1479764-1481505_minus	1153931
Archaea
Pyrobaculum ferrireducens strain 1860	3,604	NC_016645.1:127214-130817_plus	351728
Pyrobaculum aerophilum strain IM2	2,213	NC_003364.1:1089640-1091852_plus	28808
Pyrobaculum arsenaticum strain DSM 13514	2,212	NC_009376.1:623323-625534_minus	37488
Aeropyrum pernix strain K1	2,202	NC_000854.2:1218712-1220913_minus	32288
Pyrobaculum neutrophilum strain V24Sta	2,197	NC_010525.1:690419-692615_plus	40848
Ca. Mancarchaeum acidiphilum strain Mia14	2,008	NZ_CP019964.1:751297-753304_minus	1145431
Ca. Micrarchaeum sp. A_DKE	2,003	NZ_CP060530.1:203642-205644_minus	9220081
Caldivirga maquilingensis strain IC-167	1,679	NC_009954.1:129150-130828_minus	39388
Aeropyrum camini strain SY1	1,650	NC_022521.1:1165168-1166817_minus	127981
Pyrolobus fumarii strain 1A	1,576	NC_015931.1:84671-86246_minus	304318

Top-10 shortest 16S rRNA genes
Organism	Gene length (bp)	RiboGrove Sequence ID(s)	Assembly ID
Bacteria
Hirschia baltica strain ATCC 49814	1,448	NC_012982.1:2336679-2338126_minus	44428
Sagittula sp. P11	1,449	NZ_CP021913.1:2386837-2388285_plus NZ_CP021913.1:3597920-3599368_plus	1460951
Hyphomonas sp. Mor2	1,451	NZ_CP017718.1:2304269-2305719_minus	860061
Antarctobacter heliothermus strain SMS3	1,453	NZ_CP022540.1:1369380-1370832_plus NZ_CP022540.1:2482480-2483932_plus	1163161
Mameliella alba strain KU6B	1,454	NZ_AP022337.1:267139-268592_plus NZ_AP022337.1:1420942-1422395_plus NZ_AP022337.1:3191208-3192661_minus	6279751
Hyphomonas sp. KY3	1,455	NZ_CP022271.1:2407999-2409453_minus	9503471
Hyphomonas neptunium strain ATCC 15444	1,455	NC_008358.1:2818466-2819920_minus	34128
Ruegeria sp. SCSIO 43209	1,458	NZ_CP065359.1:3157837-3159294_minus	10854641
Pseudooceanicola algae strain Lw-13e	1,458	NZ_CP060436.1:2482207-2483664_minus	8694041
Paracoccus contaminans strain LMG 29738T	1,459	NZ_CP020612.1:582021-583479_minus NZ_CP020612.1:1166317-1167775_minus	1078381
Sulfitobacter mediterraneus strain SC1-11	1,459	NZ_CP069004.1:3093411-3094869_plus	9217271
Sulfitobacter sp. B30-2	1,459	NZ_CP065429.1:477373-478831_plus	8738751
Pelagovum pacificum strain SM1903	1,459	NZ_CP065915.1:2729819-2731277_minus NZ_CP065915.1:3593071-3594529_minus	8872011
Archaea
Ignicoccus hospitalis strain KIN4/I	1,439	NC_009776.1:728362-729800_plus	39048
Methanocaldococcus sp. SG7	1,457	NZ_LR792632.1:542755-544211_plus	10131521
Halorubrum sp. BOL3-1	1,463	NZ_CP034692.1:397753-399215_minus	2220501
Ca. Methanomethylophilus alvus strain Mx-05	1,466	NZ_CP017686.1:283608-285073_plus	2068141
Natronomonas halophila strain C90	1,466	NZ_CP058334.1:1530622-1532087_minus	7330651
Methanospirillum sp. J.3.6.1-F.2.7.3	1,466	NZ_CP075546.1:133354-134819_plus NZ_CP075546.1:825954-827419_plus NZ_CP075546.1:872641-874106_plus NZ_CP075546.1:1727419-1728884_plus	10123301
Methanospirillum hungatei strain GP1	1,466	NZ_CP077107.1:4649-6114_plus NZ_CP077107.1:1359562-1361027_minus NZ_CP077107.1:1365502-1366967_minus NZ_CP077107.1:1986020-1987485_minus	10519241
Methanospirillum hungatei strain JF-1	1,466	NC_007796.1:39814-41279_plus NC_007796.1:1301079-1302544_minus NC_007796.1:3501525-3502990_minus NC_007796.1:3507609-3509074_minus	34548
Ca. Methanomethylophilus alvus strain Mx1201	1,466	NC_020913.1:283607-285072_plus	599268
Ca. Methanomethylophilus alvus strain MGYG-HGUT-02456	1,466	NZ_LR699000.1:283607-285072_plus	4352521

Top-10 genomes with the largest 16S rRNA copy numbers
Organism	Copy number	Assembly ID
Bacteria
Tumebacillus avium strain AR23208	37	1115491
Tumebacillus algifaecis strain THMBR28	27	1166771
Priestia megaterium strain S2	21	6720751
Peribacillus asahii strain KF4	21	13022701
Neobacillus drentensis strain JC05	20	11802511
Moritella sp. 5	20	9972261
Moritella sp. 28	20	9972251
Moritella sp. 36	20	9972241
Metabacillus litoralis strain Bac94	19	2023811
Photobacterium damselae strain AS-15-3942-7	19	11907491
Archaea
Natronorubrum aibiense strain 7-3	5	5073821
Methanococcoides orentis strain LMO-1	5	11622961
Natrinema sp. SYSU A 869	5	10842511
Natronorubrum bangense strain JCM 10635	5	2580821
Halomicrobium salinisoli strain TH30	4	11151391
Methanosphaera stadtmanae strain DSM 3091	4	33648
Methanosphaera stadtmanae strain MGYG-HGUT-02164	4	4349641
Halomicrobium salinisoli strain LT50	4	11151361
Halosiccatus urmianus strain IBRC-M: 10911	4	11057071
Natronococcus occultus strain SP4	4	521038
Methanospirillum hungatei strain GP1	4	10519241
Methanococcus vannielii strain SB	4	38268
Haloterrigena salifodinae strain BOL5-1	4	9298621
Haloarcula sinaiiensis strain ATCC 33800	4	9962651
Methanospirillum hungatei strain JF-1	4	34548
Methanospirillum sp. J.3.6.1-F.2.7.3	4	10123301

Top-10 genomes with the highest intragenomic variability of 16S rRNA genes
Organism	Sum of entropy^* (bits)	Mean entropy^* (bits)	Number of variable positions	Gene copy number	Assembly ID
Bacteria
Synechococcus sp. NB0720_10	243.35	0.16	265	3	12576831
Sporomusa termitida strain DSM 4440	226.25	0.13	247	12	4155511
Campylobacter hyointestinalis strain CHY5	217.64	0.12	237	3	7294871
Campylobacter sp. RM6137	211.21	0.12	230	3	1101781
Sinorhizobium meliloti strain AK76	184.58	0.12	201	3	9010851
Cylindrospermopsis raciborskii strain KLL07	168.97	0.11	184	3	11851031
Klebsiella pneumoniae strain GZ-1	167.21	0.10	216	5	8227731
Olleya sp. Bg11-27	145.25	0.10	156	3	1469691
Microbulbifer sp. YPW1	136.25	0.09	145	4	7292581
Selenomonas sp. 136 F0591	135.84	0.08	138	4	638441
Archaea
Halomicrobium sp. ZPS1 ^**	137.00	0.09	137	2	4982121
Halosiccatus urmianus strain IBRC-M: 10911	131.55	0.09	146	4	11057071
Halapricum desulfuricans strain HSR12-2	128.00	0.09	128	2	9390741
Halomicrobium salinisoli strain TH30	127.74	0.09	145	4	11151391
Halapricum desulfuricans strain HSR-Bgl	127.00	0.09	127	2	9390521
Halomicrobium mukohataei strain JP60	125.81	0.09	137	3	2582391
Halomicrobium salinisoli strain LT50	123.31	0.08	140	4	11151361
Halapricum desulfuricans strain HSR-Est	111.00	0.08	111	2	9390681
Halapricum desulfuricans strain HSR12-1	109.00	0.07	109	2	9390731
Halorussus sp. XZYJT49	105.10	0.07	113	3	12653301

^* Entropy is Shannon entropy calculated for each column of the multiple sequence alignment (MSA) of all full-length 16S rRNA genes of a genome. Entropy is then summed up (column “Sum of entropy”) and averaged (column “Mean entropy”).

^** Halomicrobium sp. ZPS1 is a remarkable case. This genome harbours two 16S rRNA genes, so each mismatching alignment column contributes exactly 1 bit, and the sum of entropy equals the number of mismatching nucleotides between the two gene sequences. Accordingly, the per cent identity between these two gene sequences is 90.70%! This is remarkable because the usual (albeit arbitrary) genus demarcation threshold is 95% identity.
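The entropy metrics described in these footnotes can be sketched with awk. This is a minimal illustration, not the code used to build RiboGrove. It assumes the input is an already aligned set of sequences, one per line and of equal length. It also assumes that gap characters are counted like any other symbol and that the mean is taken over all alignment columns. The function name column_entropy is hypothetical.

```shell
# Hypothetical helper: reads aligned sequences (one per line, equal length,
# no fasta headers) and prints the sum of entropy, the mean entropy per
# column and the number of variable columns, separated by tabs.
# Assumption: gap characters are counted like any other symbol.
column_entropy() {
    awk '
        {
            n++
            len = length($0)
            for (i = 1; i <= len; i++)
                count[i, toupper(substr($0, i, 1))]++
        }
        END {
            if (n == 0 || len == 0) exit 1
            # Shannon entropy of a column: H = -sum(p * log2(p)) over its symbols.
            for (key in count) {
                split(key, part, SUBSEP)
                p = count[key] / n
                h[part[1]] -= p * log(p) / log(2)
            }
            for (i = 1; i <= len; i++) {
                total += h[i]
                if (h[i] > 1e-9) variable++
            }
            printf "%.2f\t%.4f\t%d\n", total, total / len, variable
        }
    '
}

# Demo: two made-up aligned sequences that differ in columns 5 and 10.
printf 'ACGTACGTAC\nACGTTCGTAA\n' | column_entropy
```

With only two sequences, every mismatching column adds exactly 1 bit, so the sum of entropy equals the number of mismatches. In the demo, that is 2 mismatches in 10 columns, i.e. 80% identity.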

Searching data in RiboGrove

RiboGrove is a very minimalistic database: it comprises a collection of plain fasta files with metadata. Thus, advanced search tools are not available for it. We acknowledge this limitation and provide a list of suggestions below to help you explore and select RiboGrove data.

Header format

Headers in RiboGrove fasta files have the following format:

>NZ_CP079719.1:86193-87742_plus Bacillus_velezensis ;Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus; category:2

Major blocks of a header are separated by spaces. A header consists of four such blocks:

1. Sequence ID (seqID): NZ_CP079719.1:86193-87742_plus. The seqID, in turn, consists of three parts:
  - the accession number of the RefSeq sequence from which the gene originates: NZ_CP079719.1;
  - the coordinates of the gene within this RefSeq genomic sequence: 86193-87742 (coordinates are 1-based, left-closed and right-closed);
  - the strand of the RefSeq genomic sequence where the gene is located: plus (or minus).
2. The complete name of the organism, according to the NCBI Taxonomy database: Bacillus_velezensis.
3. Taxonomy string, comprising domain (Bacteria), phylum (Firmicutes), class (Bacilli), order (Bacillales), family (Bacillaceae), and genus (Bacillus) names. The names are separated and flanked by semicolons (;).
4. The category of the genome from which the gene sequence originates: category:2. Category 1 genomes are the most reliable, and category 3 genomes are the least reliable.
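The header layout described above can be taken apart with standard tools. The following is a minimal sketch, assuming headers follow exactly the four-block format shown in the example. The function name parse_header and the key=value output layout are hypothetical.

```shell
# Hypothetical helper: splits one RiboGrove header (passed as $1, with or
# without the leading ">") into its parts and prints them as key=value lines.
parse_header() {
    printf '%s\n' "$1" | awk '
        {
            seqid = $1; organism = $2; taxonomy = $3; category = $4
            sub(/^>/, "", seqid)

            split(seqid, a, ":")      # a[1]: accession, a[2]: "START-STOP_STRAND"
            split(a[2], b, "_")       # b[1]: "START-STOP", b[2]: strand
            split(b[1], c, "-")       # c[1]: start, c[2]: stop

            gsub(/^;|;$/, "", taxonomy)
            split(taxonomy, t, ";")   # t[1]: domain ... t[6]: genus
            sub(/^category:/, "", category)

            print "accession=" a[1]
            print "start=" c[1]
            print "stop=" c[2]
            print "strand=" b[2]
            # Coordinates are 1-based and closed on both ends.
            print "length=" (c[2] - c[1] + 1)
            print "organism=" organism
            print "domain=" t[1]
            print "genus=" t[6]
            print "category=" category
        }
    '
}

parse_header '>NZ_CP079719.1:86193-87742_plus Bacillus_velezensis ;Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus; category:2'
```

Because both coordinates are inclusive, the gene length is stop - start + 1, which gives 87742 - 86193 + 1 = 1550 bp for the example header.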

Sequence selection

You can select specific sequences from fasta files using the SeqKit program (GitHub repo, documentation). It is free, cross-platform, multifunctional and fairly fast, and it can process both gzipped and uncompressed fasta files. The seqkit grep and seqkit seq subcommands are useful for sequence selection.

Search sequences by header

Given the downloaded fasta file ribogrove_7.213_sequences.fasta.gz, consider the following examples of sequence selection using seqkit grep:

Example 1. Select a single sequence by SeqID.

seqkit grep -p "NZ_CP079719.1:86193-87742_plus" ribogrove_7.213_sequences.fasta.gz

The -p option sets the pattern to search for. By default, seqkit grep matches the pattern against sequence IDs only, not against whole headers.

Example 2. Select all gene sequences of a single RefSeq genomic sequence by accession number.

seqkit grep -nrp "NZ_CP079719.1" ribogrove_7.213_sequences.fasta.gz

Here, two more options are required: -n and -r. The former tells the program to match whole headers instead of sequence IDs only. The latter tells the program to treat the pattern as a regular expression, so partial matches are kept: if the pattern matches any part of a header, the corresponding sequence is printed to output.

Example 3. Select all gene sequences of a single genome (Assembly ID 10577151), which has two replicons: NZ_CP079110.1 and NZ_CP079111.1.

seqkit grep -nr -p "NZ_CP079110.1" -p "NZ_CP079111.1" ribogrove_7.213_sequences.fasta.gz

Example 4. Select all actinobacterial sequences.

seqkit grep -nrp ";Actinobacteria;" ribogrove_7.213_sequences.fasta.gz

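To be safe, surround the taxon name with semicolons (;), so that the pattern matches a complete taxon name within the taxonomy string rather than a part of a longer name.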

Example 5. Select all sequences originating from category 1 genomes.

seqkit grep -nrp "category:1" ribogrove_7.213_sequences.fasta.gz

Example 6. Select all sequences except for those belonging to Firmicutes.

seqkit grep -nvrp ";Firmicutes;" ribogrove_7.213_sequences.fasta.gz

Note the -v option within the option sequence -nvrp. This option inverts the match, i.e. without it the search would return only sequences belonging to Firmicutes.
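If SeqKit is not at hand, a similar header-based selection can be approximated with awk. The sketch below is not a replacement for seqkit grep and is not part of the original instructions. It matches a fixed substring (not a regular expression) against whole header lines. The function name fasta_grep and the demo records are made up for illustration.

```shell
# Hypothetical helper: prints fasta records (read from stdin) whose header
# contains the given fixed substring; with -v, prints all other records.
fasta_grep() {
    invert=0
    if [ "$1" = "-v" ]; then
        invert=1
        shift
    fi
    awk -v pat="$1" -v invert="$invert" '
        /^>/ {
            keep = (index($0, pat) > 0)
            if (invert) keep = !keep
        }
        keep { print }
    '
}

# Demo on three made-up records (not real RiboGrove data).
demo_fasta='>id1 Org_a ;Bacteria;Firmicutes;Bacilli; category:2
ACGT
>id2 Org_b ;Bacteria;Actinobacteria;Actinomycetia; category:1
TTGA
>id3 Org_c ;Archaea;Euryarchaeota;Halobacteria; category:1
GGCC'

printf '%s\n' "$demo_fasta" | fasta_grep ';Firmicutes;'
printf '%s\n' "$demo_fasta" | fasta_grep -v ';Firmicutes;'

# With a real file, the input could be streamed from the gzipped fasta file:
#   gzip -dc ribogrove_7.213_sequences.fasta.gz | fasta_grep 'category:1'
```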

Search sequences by length

You can use the seqkit seq program to select sequences by length.

Example 1. Select all sequences longer than 1600 bp.

seqkit seq -m 1601 ribogrove_7.213_sequences.fasta.gz

The -m option sets the minimum length of a sequence to be printed to output.

Example 2. Select all sequences shorter than 1500 bp.

seqkit seq -M 1499 ribogrove_7.213_sequences.fasta.gz

The -M option sets the maximum length of a sequence to be printed to output.

Example 3. Select all sequences having length in range [1500, 1600] bp.

seqkit seq -m 1500 -M 1600 ribogrove_7.213_sequences.fasta.gz

Selecting header data

It is sometimes useful to retrieve only header information from a fasta file. You can use the seqkit seq program for it.

Example 1. Select all headers.

seqkit seq -n ribogrove_7.213_sequences.fasta.gz

The -n option tells the program to output only headers.

Example 2. Select all SeqIDs (header parts before the first space).

seqkit seq -ni ribogrove_7.213_sequences.fasta.gz

The -i option tells the program to output only sequence IDs.

Example 3. Select all accession numbers.

seqkit seq -ni ribogrove_7.213_sequences.fasta.gz | cut -f1 -d':' | sort | uniq

This works only if you have the cut, sort and uniq utilities installed (Linux and Mac OS systems should have them built in).
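Header output can be summarised in the same spirit. The sketch below counts sequences per genome category, assuming the category is the last space-separated block of each header. The function name count_by_category and the demo headers are made up.

```shell
# Hypothetical helper: reads fasta headers (one per line) from stdin and
# prints "category<TAB>number of sequences", sorted by category.
count_by_category() {
    awk '{ n[$NF]++ } END { for (c in n) printf "%s\t%d\n", c, n[c] }' | sort
}

# Demo on three made-up headers (not real RiboGrove data).
printf '%s\n' \
    'id1 Org_a ;Bacteria;Firmicutes; category:2' \
    'id2 Org_b ;Bacteria;Actinobacteria; category:1' \
    'id3 Org_c ;Archaea;Euryarchaeota; category:1' |
    count_by_category

# With SeqKit installed, real headers could be piped in like this:
#   seqkit seq -n ribogrove_7.213_sequences.fasta.gz | count_by_category
```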

RiboGrove, 2025-09-28
