RiboGrove mirror

The main website where RiboGrove is hosted may be unavailable outside Belarus due to technical troubles and the overall disaster.
Hence this mirror has been created, and RiboGrove files are available through Zenodo (links are below).


Downloads

Statistical summary
Searching data in RiboGrove

Downloads

RiboGrove release 10.216 (2023-01-18)

The release is based on RefSeq release 216.

A fasta file of full-length 16S gene sequences. Download (gzipped fasta file, 6.69 MB)
Metadata. Download (zip archive, 18.36 MB)
What information exactly does the metadata contain?
The metadata consists of the following files:
1. discarded_sequences.fasta.gz
  This is a fasta file of sequences that were present in the source RefSeq genomes and were annotated as 16S rRNA genes but were discarded because of incompleteness, internal repeats etc., and thus have not been included in RiboGrove.
2. source_RefSeq_genomes.tsv
  This is a TSV file, which contains information about which genomes were used for the RiboGrove construction.
3. gene_seqs_statistics.tsv, discarded_gene_seqs_statistics.tsv
  These are TSV files, which contain nucleotide composition, size, genomic and taxonomic affiliation of the gene sequences. The first file describes the final RiboGrove gene sequences, and the second file describes the discarded sequences.
4. categories.tsv
  This is a TSV file, which contains information about what genome categories were assigned to each genome and why. Moreover, it contains information about what sequencing technology was used to sequence each genome.
5. taxonomy.tsv
  This is a TSV file, which contains taxonomic affiliation of each genome and gene.
6. intragenic_repeats.tsv
  This is a TSV file, which contains information about intragenic repeats found in gene sequences using RepeatFinder.
7. cmscan_output_table.tblout
  This is a TSV file, which contains the complete output of the cmscan program produced after processing all 16S rRNA sequences.
8. entropy_summary.tsv
  This is a TSV file, which contains a summary of intragenomic variability of the 16S rRNA genes. Intragenomic variability is calculated only for the category 1 genomes having more than one 16S rRNA gene. Intragenomic variability is evaluated using Shannon entropy. We align gene sequences from each genome using MUSCLE, and then we calculate Shannon entropy for each multiple alignment column (i.e. base).
9. QIIME2-compatible-taxonomy.txt
  This is a TSV file, which can be used to train a QIIME2 classifier (see the tutorial).

The fasta file is compressed with gzip, and the metadata file is a zip archive. To uncompress them, Linux and Mac OS users may use the gzip and unzip programs, which are usually preinstalled. For Windows users, the free and open-source (de)compression program 7-Zip is available.
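
As a minimal sketch of the decompression step on Linux or Mac OS, the helper below unpacks a gzipped file while keeping the original archive. The function name decompress_keep is ours, and "metadata.zip" in the comment is only a placeholder, since the actual name of the metadata archive depends on the download.

```shell
# Decompress a gzipped file next to the original, keeping the .gz file.
# $1: path to a file whose name ends in .gz
decompress_keep() {
  gzip -dc "$1" > "${1%.gz}"
}

# Intended use after downloading (not run here):
#   decompress_keep ribogrove_10.216_sequences.fasta.gz
#   unzip metadata.zip   # "metadata.zip" is a placeholder for the archive name
```

Decompressing the fasta file is optional for the Seqkit examples in this document, because Seqkit can read gzipped fasta files directly.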

RiboGrove release archive

You can find all releases in the RiboGrove release archive.

Release notes

Starting with RiboGrove 10.216 we use Rfam 14.9 for sequence filtering. In Rfam 14.9, the model RF00177 (Bacterial small subunit ribosomal RNA) is updated. Before RiboGrove 10.216 we used Rfam 14.6.

You can find notes to all RiboGrove releases on the release notes page.

Statistical summary

RiboGrove size
	Bacteria	Archaea	Total
Number of gene sequences	163,262	835	164,097
Number of unique gene sequences	44,021	606	44,627
Number of species	7,942	391	8,333
Number of genomes	30,880	491	31,371
Number of genomes of category 1	20,501	183	20,684
Number of genomes of category 2	10,248	308	10,556
Number of genomes of category 3	131	0	131

16S rRNA gene lengths
	Bacteria	Archaea
Minimum (bp)	1,448.00	1,439.00
25th percentile (bp) ^*	1,517.00	1,471.50
Median (bp) ^*	1,531.00	1,474.00
75th percentile (bp) ^*	1,542.00	1,486.00
Average (bp) ^*	1,527.85	1,495.76
Mode (bp) ^*	1,537.00	1,472.00
Maximum (bp)	2,438.00	3,604.00
Standard deviation (bp) ^*	25.63	135.98

^* Metrics marked with this sign were calculated with preliminary normalization, i.e. median within-species gene length was used for the summary.

16S rRNA gene copy number
Copy number ^*	Bacteria		Archaea
	Number of species	Per cent of species (%)	Number of species	Per cent of species (%)
1	1,007	12.68	211	53.96
2	1,459	18.37	109	27.88
3	1,192	15.01	56	14.32
4	1,038	13.07	10	2.56
5	679	8.55	5	1.28
6	847	10.66	0	0.00
7	670	8.44	0	0.00
8	391	4.92	0	0.00
9	203	2.56	0	0.00
10	174	2.19	0	0.00
11	92	1.16	0	0.00
12	71	0.89	0	0.00
13	34	0.43	0	0.00
14	49	0.62	0	0.00
15	14	0.18	0	0.00
16	4	0.05	0	0.00
17	7	0.09	0	0.00
18	5	0.06	0	0.00
20	4	0.05	0	0.00
27	1	0.01	0	0.00
37	1	0.01	0	0.00

^* These are median within-species copy numbers.

Top-10 longest 16S rRNA genes
Organism	Gene length (bp)	RiboGrove Sequence ID(s)	Assembly ID
Bacteria
Thermus thermophilus strain AA2-2	2,438	G_10898951:NZ_AP024929.1:249100-251537:minus	10898951
Ca. Annandia pinicola strain Ad13-065	1,887	G_11277031:NZ_CP045876.1:290071-291957:minus	11277031
Nitrosophilus labii strain HRV44	1,806	G_8028891:NZ_AP022826.1:1258017-1259822:minus G_8028891:NZ_AP022826.1:1532588-1534393:minus G_8028891:NZ_AP022826.1:1939914-1941719:minus	8028891
Gelria sp. Kuro-4	1,788	G_10731991:NZ_AP024619.1:2016182-2017969:minus	10731991
Thermoanaerobacter brockii strain Ako-1	1,781	G_282748:NC_014964.1:2252888-2254668:minus	282748
Thermoanaerobacter pseudethanolicus strain ATCC 33223	1,781	G_40148:NC_010321.1:2265744-2267524:minus	40148
Thermoanaerobacter sp. RKWS2	1,754	G_14447161:NZ_CP110888.1:94012-95765:plus	14447161
Campylobacter sputorum strain RM3237	1,744	G_1153941:NZ_CP019682.1:607981-609724:plus G_1153941:NZ_CP019682.1:929565-931308:minus G_1153941:NZ_CP019682.1:1501945-1503688:minus	1153941
Campylobacter sputorum strain LMG 7795	1,744	G_4499991:NZ_CP043427.1:609141-610884:plus G_4499991:NZ_CP043427.1:930699-932442:minus G_4499991:NZ_CP043427.1:1503078-1504821:minus	4499991
Campylobacter sputorum strain CCUG 20703	1,743	G_1153911:NZ_CP019683.1:606847-608589:plus G_1153911:NZ_CP019683.1:935163-936905:minus G_1153911:NZ_CP019683.1:1558189-1559931:minus	1153911
Archaea
Pyrobaculum ferrireducens strain 1860	3,604	G_351728:NC_016645.1:127214-130817:plus	351728
Pyrobaculum aerophilum strain IM2	2,213	G_28808:NC_003364.1:1089640-1091852:plus	28808
Pyrobaculum arsenaticum strain DSM 13514	2,212	G_37488:NC_009376.1:623323-625534:minus	37488
Aeropyrum pernix strain K1	2,202	G_32288:NC_000854.2:1218712-1220913:minus	32288
Pyrobaculum neutrophilum strain V24Sta	2,197	G_40848:NC_010525.1:690419-692615:plus	40848
Ca. Mancarchaeum acidiphilum strain Mia14	2,008	G_1145431:NZ_CP019964.1:751297-753304:minus	1145431
Ca. Micrarchaeum sp. A_DKE	2,003	G_9220081:NZ_CP060530.1:203642-205644:minus	9220081
Caldivirga maquilingensis strain IC-167	1,679	G_39388:NC_009954.1:129150-130828:minus	39388
Aeropyrum camini strain SY1	1,650	G_127981:NC_022521.1:1165168-1166817:minus	127981
Pyrolobus fumarii strain 1A	1,576	G_304318:NC_015931.1:84671-86246:minus	304318

Top-10 shortest 16S rRNA genes
Organism	Gene length (bp)	RiboGrove Sequence ID(s)	Assembly ID
Bacteria
Hirschia baltica strain ATCC 49814	1,448	G_44428:NC_012982.1:2336679-2338126:minus	44428
Sagittula sp. P11	1,449	G_1460951:NZ_CP021913.1:2386837-2388285:plus G_1460951:NZ_CP021913.1:3597920-3599368:plus	1460951
Hyphomonas sp. Mor2	1,451	G_860061:NZ_CP017718.1:2304269-2305719:minus	860061
Antarctobacter heliothermus strain SMS3	1,453	G_1163161:NZ_CP022540.1:1369380-1370832:plus G_1163161:NZ_CP022540.1:2482480-2483932:plus	1163161
Mameliella alba strain KU6B	1,454	G_6279751:NZ_AP022337.1:267139-268592:plus G_6279751:NZ_AP022337.1:1420942-1422395:plus G_6279751:NZ_AP022337.1:3191208-3192661:minus	6279751
Hyphomonas sp. KY3	1,455	G_9503471:NZ_CP022271.1:2407999-2409453:minus	9503471
Hyphomonas neptunium strain ATCC 15444	1,455	G_34128:NC_008358.1:2818466-2819920:minus	34128
Cognatishimia activa strain SOCE 004	1,456	G_14327851:NZ_CP096147.1:529008-530463:plus	14327851
Pseudooceanicola algae strain Lw-13e	1,458	G_8694041:NZ_CP060436.1:2482207-2483664:minus	8694041
Ruegeria sp. SCSIO 43209	1,458	G_10854641:NZ_CP065359.1:3157837-3159294:minus	10854641
Archaea
Ignicoccus hospitalis strain KIN4/I	1,439	G_39048:NC_009776.1:728362-729800:plus	39048
Methanocaldococcus sp. SG7	1,457	G_10131521:NZ_LR792632.1:542755-544211:plus	10131521
Halorubrum sp. BOL3-1	1,463	G_2220501:NZ_CP034692.1:397753-399215:minus	2220501
Natronomonas gomsonensis strain KCTC 4088	1,466	G_13300951:NZ_CP101323.1:2500564-2502029:plus	13300951
Methanospirillum sp. J.3.6.1-F.2.7.3	1,466	G_10123301:NZ_CP075546.1:133354-134819:plus G_10123301:NZ_CP075546.1:825954-827419:plus G_10123301:NZ_CP075546.1:872641-874106:plus G_10123301:NZ_CP075546.1:1727419-1728884:plus	10123301
Methanospirillum hungatei strain GP1	1,466	G_10519241:NZ_CP077107.1:4649-6114:plus G_10519241:NZ_CP077107.1:1359562-1361027:minus G_10519241:NZ_CP077107.1:1365502-1366967:minus G_10519241:NZ_CP077107.1:1986020-1987485:minus	10519241
Salinirubellus salinus strain ZS-35-S2	1,466	G_13813051:NZ_CP104003.1:3070232-3071697:plus	13813051
Methanospirillum hungatei strain JF-1	1,466	G_34548:NC_007796.1:39814-41279:plus G_34548:NC_007796.1:1301079-1302544:minus G_34548:NC_007796.1:3501525-3502990:minus G_34548:NC_007796.1:3507609-3509074:minus	34548
Ca. Methanomethylophilus alvus strain Mx-05	1,466	G_2068141:NZ_CP017686.1:283608-285073:plus	2068141
Natronomonas sp. ZY43	1,466	G_13300761:NZ_CP101154.1:18680-20145:plus	13300761
Ca. Methanomethylophilus alvus strain MGYG-HGUT-02456	1,466	G_4352521:NZ_LR699000.1:283607-285072:plus	4352521
Natronomonas halophila strain C90	1,466	G_7330651:NZ_CP058334.1:1530622-1532087:minus	7330651
Ca. Methanomethylophilus alvus strain Mx1201	1,466	G_599268:NC_020913.1:283607-285072:plus	599268

Top-10 genomes with the largest 16S rRNA copy numbers
Organism	Copy number	Assembly ID
Bacteria
Tumebacillus avium strain AR23208	37	1115491
Tumebacillus algifaecis strain THMBR28	27	1166771
Peribacillus asahii strain KF4	21	13022701
Priestia megaterium strain S2	21	6720751
Photobacterium damselae strain 04Ya311	20	14314271
Moritella sp. 36	20	9972241
Neobacillus drentensis strain JC05	20	11802511
Moritella sp. 5	20	9972261
Moritella sp. 28	20	9972251
Metabacillus litoralis strain Bac94	19	2023811
Photobacterium damselae strain AS-15-3942-7	19	11907491
Archaea
Methanococcoides orientis strain LMO-1	5	11622961
Natronorubrum aibiense strain 7-3	5	5073821
Natrinema sp. SYSU A 869	5	10842511
Natronorubrum bangense strain JCM 10635	5	2580821
Methanoplanus endosymbiosus strain DSM 3599	5	13492921
Methanogenium organophilum strain DSM 3596	4	14706461
Methanospirillum sp. J.3.6.1-F.2.7.3	4	10123301
Halosiccatus urmianus strain IBRC-M: 10911	4	11057071
Methanococcus vannielii strain SB	4	38268
Natronococcus occultus strain SP4	4	521038
Halomicrobium salinisoli strain TH30	4	11151391
Methanospirillum hungatei strain JF-1	4	34548
Halomicrobium salinisoli strain LT50	4	11151361
Methanospirillum hungatei strain GP1	4	10519241
Methanosphaera stadtmanae strain DSM 3091	4	33648
Haloterrigena salifodinae strain BOL5-1	4	9298621
Haloarcula sinaiiensis strain ATCC 33800	4	9962651
Methanosphaera stadtmanae strain MGYG-HGUT-02164	4	4349641

Top-10 genomes with the highest intragenomic variability of 16S rRNA genes
Organism	Sum of entropy^* (bits)	Mean entropy^* (bits)	Number of variable positions	Gene copy number	Assembly ID
Bacteria
Synechococcus sp. NB0720_010	243.35	0.16	265	3	12576831
Xanthomonas oryzae strain YNCX	227.74	0.15	248	3	13407211
Sporomusa termitida strain DSM 4440	226.25	0.13	247	12	4155511
Campylobacter hyointestinalis strain CHY5	217.64	0.12	237	3	7294871
Campylobacter sp. RM6137	211.21	0.12	230	3	1101781
Acetivibrio thermocellus strain M3	211.00	0.14	211	2	13802461
Sinorhizobium meliloti strain AK76	184.58	0.12	201	3	9010851
Cylindrospermopsis raciborskii strain KLL07	168.97	0.11	184	3	11851031
Klebsiella pneumoniae strain GZ-1	167.21	0.10	216	5	8227731
Olleya sp. Bg11-27	145.25	0.10	156	3	1469691
Archaea
Halomicrobium sp. ZPS1 ^**	137.00	0.09	137	2	4982121
Halosiccatus urmianus strain IBRC-M: 10911	131.55	0.09	146	4	11057071
Halapricum desulfuricans strain HSR12-2	128.00	0.09	128	2	9390741
Halomicrobium salinisoli strain TH30	127.74	0.09	145	4	11151391
Halapricum desulfuricans strain HSR-Bgl	127.00	0.09	127	2	9390521
Halomicrobium mukohataei strain JP60	125.81	0.09	137	3	2582391
Halomicrobium salinisoli strain LT50	123.31	0.08	140	4	11151361
Halapricum desulfuricans strain HSR-Est	111.00	0.08	111	2	9390681
Halapricum desulfuricans strain HSR12-1	109.00	0.07	109	2	9390731
Halorussus sp. XZYJT49	105.10	0.07	113	3	12653301

^* Entropy is Shannon entropy calculated for each column of the multiple sequence alignment (MSA) of all full-length 16S rRNA genes of a genome. Entropy is then summed up (column “Sum of entropy”) and averaged (column “Mean entropy”).

^** Halomicrobium sp. ZPS1 is quite a remarkable case. This genome harbours two 16S rRNA genes, therefore the sum of entropy is equal to the number of mismatching positions between the two gene sequences. Accordingly, the per cent identity between these two gene sequences is 90.70%! This is remarkable because the usual (albeit arbitrary) genus demarcation threshold of per cent identity is 95%.
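
To illustrate why a two-gene genome yields a sum of entropy equal to the number of mismatches, here is a minimal sketch of Shannon entropy (in bits) for a single alignment column. The function name column_entropy is ours, and the sketch treats every character of the column as a symbol; how RiboGrove's own pipeline handles gap characters is not reproduced here.

```shell
# Shannon entropy (bits) of one alignment column.
# $1: the characters of the column, one per gene copy, e.g. "AG"
column_entropy() {
  printf '%s\n' "$1" | awk '{
    n = length($0)
    # Count how many times each symbol occurs in the column.
    for (i = 1; i <= n; i++) count[substr($0, i, 1)]++
    h = 0
    for (b in count) {
      p = count[b] / n
      h -= p * log(p) / log(2)   # log base 2 gives bits
    }
    printf "%.2f\n", h
  }'
}

column_entropy "AG"   # a column where the two gene copies differ
```

With two gene copies, a column where the copies differ has two symbols with frequency 1/2 each, so it contributes exactly 1 bit, while a column where they agree contributes nothing. Summing over all columns therefore counts the mismatching positions.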

Coverage^* of primer pairs for different V-regions of bacterial 16S rRNA genes
Phylum	Number of genomes	Full gene	V1–V2	V1–V3	V3–V4	V3–V5	V4–V5	V4–V6	V5–V6	V5–V7	V6–V7	V6–V8
Phylum	Number of genomes	27F–1492R (%)	27F–338R (%)	27F–534R (%)	341F–785R (%)	341F–944R (%)	515F–944R (%)	515F–1100R (%)	784F–1100R (%)	784F–1193R (%)	939F–1193R (%)	939F–1378R (%)
Proteobacteria	18,075	99.73	99.51	99.72	99.95	82.51	82.55	90.17	89.87	93.64	92.63	96.68
Bacillota	6,885	99.97	99.85	99.94	99.96	95.89	95.79	99.51	97.88	97.21	98.56	99.27
Actinomycetota	2,946	99.80	98.85	99.59	94.09	63.88	63.68	96.13	99.63	99.73	99.76	97.15
Bacteroidota	1,185	95.19	94.68	95.11	99.92	61.60	61.18	38.73	38.99	94.77	92.41	94.51
Tenericutes	489	97.34	94.68	74.44	98.36	90.59	90.80	71.37	41.10	42.33	78.73	0.41
Spirochaetes	351	49.29	49.29	49.29	95.44	100.00	100.00	100.00	78.35	78.35	91.17	38.18
Cyanobacteria	224	100.00	100.00	100.00	100.00	4.91	4.91	100.00	0.89	0.89	100.00	99.55
Chlamydiae	187	0.00	0.00	0.00	100.00	100.00	0.00	0.00	100.00	100.00	100.00	93.58
Verrucomicrobia	114	99.12	0.00	99.12	100.00	8.77	8.77	100.00	0.88	0.88	100.00	100.00
Fusobacteria	80	100.00	96.25	100.00	100.00	100.00	100.00	100.00	98.75	98.75	100.00	0.00
Deinococcus-Thermus	76	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	50.00	100.00
Planctomycetota	57	100.00	19.30	100.00	100.00	64.91	64.91	0.00	0.00	0.00	3.51	0.00
Thermotogae	42	100.00	97.62	100.00	100.00	9.52	9.52	100.00	0.00	0.00	59.52	97.62
Chloroflexi	41	100.00	90.24	100.00	39.02	0.00	0.00	87.80	4.88	4.88	92.68	26.83
Acidobacteria	31	96.77	96.77	96.77	100.00	100.00	100.00	100.00	61.29	45.16	83.87	100.00
Chlorobi	15	100.00	100.00	100.00	100.00	0.00	0.00	0.00	100.00	93.33	86.67	6.67
Aquificae	14	100.00	21.43	100.00	100.00	21.43	21.43	100.00	0.00	0.00	7.14	21.43
Nitrospirae	10	100.00	100.00	100.00	100.00	60.00	60.00	100.00	100.00	60.00	60.00	100.00
Thermodesulfobacteria	7	100.00	100.00	100.00	100.00	0.00	0.00	100.00	0.00	0.00	100.00	100.00
Ca. Saccharibacteria	6	100.00	100.00	100.00	100.00	16.67	16.67	16.67	0.00	0.00	100.00	100.00
Synergistetes	6	100.00	100.00	100.00	100.00	0.00	0.00	100.00	0.00	0.00	100.00	100.00
Deferribacteres	6	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00	100.00	100.00	100.00
Elusimicrobia	4	100.00	50.00	100.00	100.00	0.00	0.00	100.00	75.00	75.00	100.00	100.00
Gemmatimonadetes	4	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Dictyoglomi	2	100.00	100.00	100.00	100.00	0.00	0.00	100.00	0.00	0.00	100.00	0.00
Fibrobacteres	2	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Ignavibacteriae	2	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Kiritimatiellaeota	2	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00
Chrysiogenetes	2	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Calditrichaeota	1	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Ca. Absconditabacteria	1	100.00	0.00	100.00	100.00	0.00	0.00	0.00	0.00	100.00	0.00	0.00
Ca. Bipolaricaulota	1	0.00	0.00	0.00	100.00	100.00	100.00	0.00	0.00	0.00	0.00	0.00
Caldiserica	1	100.00	100.00	100.00	100.00	0.00	0.00	0.00	0.00	100.00	100.00	100.00
Balneolota	1	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Atribacterota	1	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00
Ca. Cloacimonetes	1	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Armatimonadetes	1	100.00	0.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Ca. Omnitrophica	1	100.00	100.00	100.00	100.00	0.00	0.00	100.00	100.00	100.00	100.00	100.00
Coprothermobacterota	1	0.00	0.00	0.00	100.00	100.00	100.00	0.00	0.00	0.00	100.00	0.00

^* Coverage of a primer pair is the per cent of genomes having at least one 16S rRNA gene which can be amplified by PCR using this primer pair. For details, see our paper about RiboGrove.

Primers used for coverage estimation
Primer name	Sequence	Reference
27F	AGAGTTTGATYMTGGCTCAG	Frank et al., 2008
338R	GCTGCCTCCCGTAGGAGT	Suzuki et al., 1996
341F^*	CCTACGGGNGGCWGCAG	Klindworth et al., 2013
515F	GTGCCAGCMGCCGCGGTAA	Turner et al., 1999
534R	ATTACCGCGGCTGCTGG	Walker et al., 2015
784F	AGGATTAGATACCCTGGTA	Andersson et al., 2008
785R^*	GACTACHVGGGTATCTAATCC	Klindworth et al., 2013
939F	GAATTGACGGGGGCCCGCACAAG	Lebuhn et al., 2014
944R	GAATTAAACCACATGCTC	Fuks et al., 2018
1100R	AGGGTTGCGCTCGTTG	Turner et al., 1999
1193R	ACGTCATCCCCACCTTCC	Bodenhausen et al., 2013
1378R	CGGTGTGTACAAGGCCCGGGAACG	Lebuhn et al., 2014
1492R	TACCTTGTTACGACTT	Frank et al., 2008

^* Primers 341F and 785R are used in the protocol for library preparation for sequencing of V3–V4 region of 16S rRNA genes on Illumina MiSeq.

Searching data in RiboGrove

RiboGrove is a very minimalistic database: it comprises a collection of plain fasta files with metadata. Thus, extended search instruments are not available for it. We acknowledge this problem and provide a list of suggestions below. These suggestions should help you explore and select RiboGrove data.

Header format

Headers in RiboGrove fasta files have the following format:

>G_324861:NZ_CP009686.1:8908-10459:plus ;d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__cereus; category:1

Major blocks of a header are separated by spaces. A header consists of three such blocks:

Sequence ID (seqID): G_324861:NZ_CP009686.1:8908-10459:plus. SeqID, in turn, consists of four parts separated by colons (:):
1. The assembly ID of the genome, from which the gene originates: G_324861. Assembly ID is preceded by the prefix G_ to ensure search specificity.
2. The accession number of the RefSeq sequence, from which the gene originates: NZ_CP009686.1.
3. Coordinates of the gene within this RefSeq genomic sequence: 8908-10459 (coordinates are 1-based, left-closed and right-closed).
4. Strand of the RefSeq genomic sequence, where the gene is located: plus (or minus).
A taxonomy string, comprising domain (Bacteria), phylum (Firmicutes), class (Bacilli), order (Bacillales), family (Bacillaceae), genus (Bacillus) names, and the specific epithet (cereus).
Each name is preceded by a prefix, which denotes rank: d__ for domain, p__ for phylum, c__ for class, o__ for order, f__ for family, g__ for genus, and s__ for specific epithet. Prefixes contain double underscores.
The taxonomic names are separated and flanked by semicolons (;).
The category of the genome, from which the gene sequence originates: (category:1).
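
The header layout described above can be taken apart with standard shell tools. The following is a minimal sketch that uses the example header from this section; the variable names are ours, and it assumes that the three blocks never contain spaces themselves.

```shell
header='G_324861:NZ_CP009686.1:8908-10459:plus ;d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__cereus; category:1'

# The three major blocks are separated by spaces.
seqid=$(printf '%s\n' "$header" | cut -d' ' -f1)
taxonomy=$(printf '%s\n' "$header" | cut -d' ' -f2)
category=$(printf '%s\n' "$header" | cut -d' ' -f3 | cut -d':' -f2)

# The four parts of the seqID are separated by colons.
assembly_id=$(printf '%s\n' "$seqid" | cut -d':' -f1 | sed 's/^G_//')
accession=$(printf '%s\n' "$seqid" | cut -d':' -f2)
coords=$(printf '%s\n' "$seqid" | cut -d':' -f3)
strand=$(printf '%s\n' "$seqid" | cut -d':' -f4)

# Coordinates are 1-based and closed on both ends, hence the "+ 1".
start=${coords%-*}
end=${coords#*-}
gene_length=$((end - start + 1))

# Pick one rank out of the taxonomy string by its prefix.
genus=$(printf '%s\n' "$taxonomy" | grep -o 'g__[^;]*' | sed 's/^g__//')

echo "$assembly_id $accession $strand $gene_length $genus $category"
```

Because both coordinates are inclusive, the gene length is end minus start plus one, which gives 1552 bp for this header.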

Sequence selection

You can select specific sequences from fasta files using the Seqkit program (GitHub repo, documentation). It is free, cross-platform, multifunctional and pretty fast and can process both gzipped and uncompressed fasta files. Programs seqkit grep and seqkit seq are useful for sequence selection.

Search sequences by header

Given the downloaded fasta file ribogrove_10.216_sequences.fasta.gz, consider the following examples of sequence selection using seqkit grep:

Example 1. Select a single sequence by SeqID.

seqkit grep -p "G_324861:NZ_CP009686.1:8908-10459:plus" ribogrove_10.216_sequences.fasta.gz

The -p option sets the pattern to search for. By default, seqkit grep matches the pattern against sequence IDs only, not against whole headers.

Example 2. Select all gene sequences of a single RefSeq genomic sequence by accession number NZ_CP009686.1.

seqkit grep -nrp ":NZ_CP009686.1:" ribogrove_10.216_sequences.fasta.gz

Here, two more options are required: -n and -r. The former tells the program to match whole headers instead of IDs only. The latter tells the program to treat the pattern as a regular expression, so partial matches are not excluded from the output: if the pattern matches a substring of a header, the corresponding sequence will be printed.

To ensure search specificity, surround the Accession.Version with colons (:).

Example 3. Select all gene sequences of a single genome (Assembly ID 10577151).

seqkit grep -nrp "G_10577151:" ribogrove_10.216_sequences.fasta.gz

To ensure search specificity, Assembly ID should be preceded by prefix G_ and followed by a colon (:).

Example 4. Select all actinobacterial sequences.

seqkit grep -nrp ";p__Actinobacteria;" ribogrove_10.216_sequences.fasta.gz

To ensure search specificity, surround the taxonomy name with semicolons (;).

Example 5. Select all sequences originating from category 1 genomes.

seqkit grep -nrp "category:1" ribogrove_10.216_sequences.fasta.gz

Example 6. Select all sequences except for those belonging to Firmicutes.

seqkit grep -nvrp ";p__Firmicutes;" ribogrove_10.216_sequences.fasta.gz

Note the -v option within the option sequence -nvrp. This option inverts the match, i.e. the output will comprise sequences whose headers do not contain the substring “;p__Firmicutes;”.

Search sequences by length

You can use the seqkit seq program to select sequences by length.

Example 1. Select all sequences longer than 1600 bp.

seqkit seq -m 1601 ribogrove_10.216_sequences.fasta.gz

The -m option sets the minimum length of a sequence to be printed to output.

Example 2. Select all sequences shorter than 1500 bp.

seqkit seq -M 1499 ribogrove_10.216_sequences.fasta.gz

The -M option sets the maximum length of a sequence to be printed to output.

Example 3. Select all sequences having length in range [1500, 1600] bp.

seqkit seq -m 1500 -M 1600 ribogrove_10.216_sequences.fasta.gz

Selecting header data

It is sometimes useful to retrieve only header information from a fasta file. You can use the seqkit seq program for it.

Example 1. Select all headers.

seqkit seq -n ribogrove_10.216_sequences.fasta.gz

The -n option tells the program to output only headers.

Example 2. Select all SeqIDs (header parts before the first space).

seqkit seq -ni ribogrove_10.216_sequences.fasta.gz

The -i option tells the program to output only sequence IDs.

Example 3. Select all “Accession.Version”s.

seqkit seq -ni ribogrove_10.216_sequences.fasta.gz | cut -f2 -d':' | sort | uniq

This works only if you have the cut, sort and uniq utilities installed (Linux and Mac OS systems should have them built-in).

Example 4. Select all Assembly IDs.

seqkit seq -ni ribogrove_10.216_sequences.fasta.gz | cut -f1 -d':' | sed 's/G_//' | sort | uniq

This works only if you have the cut, sed, sort and uniq utilities installed (Linux and Mac OS systems should have them built-in).

Example 5. Select all phylum names.
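
One possible way to do this is sketched below as a small function, so that it can be tried on any stream of headers. The function name phylum_names is ours, the two demonstration headers are made up, and the pipeline is a sketch built from the utilities named below, not necessarily the exact command used by the RiboGrove authors.

```shell
# Print the unique phylum names found in fasta headers read from standard input.
phylum_names() {
  grep -Eo ';p__[^;]+' | sed 's/^;p__//' | sort | uniq
}

# Intended use (requires seqkit and the downloaded file; not run here):
#   seqkit seq -n ribogrove_10.216_sequences.fasta.gz | phylum_names

# Small demonstration on two made-up header lines in the RiboGrove format:
printf '%s\n' \
  'G_1:NZ_X.1:1-1500:plus ;d__Bacteria;p__Firmicutes;c__Bacilli; category:1' \
  'G_2:NZ_Y.1:1-1500:minus ;d__Bacteria;p__Firmicutes;c__Clostridia; category:2' \
  | phylum_names
```

The pattern starts with a semicolon for the same reason as in the search examples: it ensures that only the phylum field is matched.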

This works only if you have the grep, sed, sort and uniq utilities installed (Linux and Mac OS systems should have them built-in).

RiboGrove, 2025-09-28
