The main website where RiboGrove is hosted may be unavailable outside Belarus due to technical troubles and the overall disaster.
Hence this mirror has been created, and RiboGrove files are available through Dropbox (links are below).







Downloads

RiboGrove release 12.218 (2023-05-15)

The release is based on RefSeq release 218.

The fasta file is compressed with gzip, and the metadata file is a zip archive. To uncompress them, Linux and macOS users can use the gzip and unzip programs, which are usually preinstalled. For Windows users, the free and open-source (de)compression program 7-Zip is available.
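As a rough illustration of the decompression step, the sketch below works on a tiny stand-in file created on the spot (demo_sequences.fasta.gz is a made-up name, not a RiboGrove file), so it can be tried before downloading anything. The same gzip calls should apply to the real gzipped fasta file; the metadata archive can be unpacked with unzip followed by the archive name.

```shell
# Work in a scratch directory with a made-up stand-in file.
workdir="$(mktemp -d)"
cd "$workdir"
printf '>demo_seq\nACGT\n' | gzip > demo_sequences.fasta.gz

# Peek at the first lines without writing an uncompressed copy to disk.
gzip -dc demo_sequences.fasta.gz | head -n 2

# Decompress to a new file; writing via -c and a redirect keeps the .gz file in place.
gzip -dc demo_sequences.fasta.gz > demo_sequences.fasta
```

Decompression is optional for the Seqkit examples further down, since Seqkit reads gzipped fasta files directly.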

RiboGrove release archive

You can find all releases in the RiboGrove release archive.

Release notes

You can find notes to all RiboGrove releases on the release notes page.


Statistical summary

RiboGrove size
Metric  Bacteria  Archaea  Total
Number of gene sequences 174,836 878 175,714
Number of unique gene sequences 46,298 633 46,931
Number of species 8,263 411 8,674
Number of genomes 32,547 518 33,065
Number of genomes of category 1 21,804 196 22,000
Number of genomes of category 2 10,597 322 10,919
Number of genomes of category 3 146 0 146
16S rRNA gene lengths
Metric  Bacteria  Archaea
Minimum (bp) 1,401.00 1,439.00
25th percentile (bp) * 1,517.00 1,471.00
Median (bp) * 1,531.00 1,473.00
75th percentile (bp) * 1,542.00 1,486.00
Average (bp) * 1,527.16 1,494.32
Mode (bp) * 1,537.00 1,472.00
Maximum (bp) 2,438.00 3,604.00
Standard deviation (bp) * 25.51 132.74

* Metrics marked with an asterisk were calculated with preliminary normalization, i.e. median within-species gene length was used for the summary.
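The normalization described in this footnote can be sketched with standard utilities. This is only an illustration, not the RiboGrove pipeline itself: the input format (one line per gene, species name and gene length separated by a tab) and the function name per_species_median are assumptions made for the example.

```shell
# Hypothetical input: "<species><TAB><gene length in bp>", one line per gene.
# Output: "<species><TAB><median gene length>", one line per species.
per_species_median() {
  sort -t "$(printf '\t')" -k1,1 -k2,2n |
  awk -F '\t' '
    function flush_species(   median) {
      if (n == 0) return
      if (n % 2) median = len[(n + 1) / 2]
      else median = (len[n / 2] + len[n / 2 + 1]) / 2
      printf "%s\t%s\n", species, median
    }
    $1 != species { flush_species(); species = $1; n = 0 }  # a new species starts
    { len[++n] = $2 }                                       # lengths arrive sorted
    END { flush_species() }
  '
}

# Made-up lengths for two species.
printf 'Bacillus cereus\t1540\nBacillus cereus\t1552\nBacillus cereus\t1553\nBacillus subtilis\t1550\nBacillus subtilis\t1556\n' |
  per_species_median
```

The summary statistics marked with an asterisk would then be computed over the second column of this output, so that species with many sequenced genomes do not dominate the result.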

16S rRNA gene copy number
Copy number *  Bacteria: number of species  Bacteria: per cent of species (%)  Archaea: number of species  Archaea: per cent of species (%)
1 989 11.97 218 53.04
2 1,493 18.07 121 29.44
3 1,239 14.99 56 13.63
4 1,084 13.12 11 2.68
5 686 8.30 5 1.22
6 911 11.03 0 0.00
7 725 8.77 0 0.00
8 408 4.94 0 0.00
9 234 2.83 0 0.00
10 186 2.25 0 0.00
11 98 1.19 0 0.00
12 80 0.97 0 0.00
13 34 0.41 0 0.00
14 50 0.61 0 0.00
15 19 0.23 0 0.00
16 4 0.05 0 0.00
17 8 0.10 0 0.00
18 6 0.07 0 0.00
19 1 0.01 0 0.00
20 5 0.06 0 0.00
21 1 0.01 0 0.00
27 1 0.01 0 0.00
37 1 0.01 0 0.00

* These are median within-species copy numbers.

Top-10 longest 16S rRNA genes
Organism  Gene length (bp)  RiboGrove Sequence ID(s)  Assembly accession
Bacteria
Thermus thermophilus strain AA2-2 2,438 GCF_019974355.1:NZ_AP024929.1:249100-251537:minus GCF_019974355.1
Ca. Annandia pinicola strain Ad13-065 1,887 GCF_020541245.1:NZ_CP045876.1:290071-291957:minus GCF_020541245.1
Nitrosophilus labii strain HRV44 1,806 GCF_014466985.1:NZ_AP022826.1:1258017-1259822:minus, GCF_014466985.1:NZ_AP022826.1:1532588-1534393:minus, GCF_014466985.1:NZ_AP022826.1:1939914-1941719:minus GCF_014466985.1
Gelria sp. Kuro-4 1,788 GCF_019668485.1:NZ_AP024619.1:2016182-2017969:minus GCF_019668485.1
Thermoanaerobacter pseudethanolicus strain ATCC 33223 1,781 GCF_000019085.1:NC_010321.1:2265744-2267524:minus GCF_000019085.1
Thermoanaerobacter brockii strain Ako-1 1,781 GCF_000175295.2:NC_014964.1:2252888-2254668:minus GCF_000175295.2
Thermoanaerobacter sp. RKWS2 1,754 GCF_026240795.1:NZ_CP110888.1:94012-95765:plus GCF_026240795.1
Campylobacter sputorum strain LMG 7795 1,744 GCF_008245005.1:NZ_CP043427.1:609141-610884:plus, GCF_008245005.1:NZ_CP043427.1:930699-932442:minus, GCF_008245005.1:NZ_CP043427.1:1503078-1504821:minus GCF_008245005.1
Campylobacter sputorum strain RM3237 1,744 GCF_002220795.1:NZ_CP019682.1:607981-609724:plus, GCF_002220795.1:NZ_CP019682.1:929565-931308:minus, GCF_002220795.1:NZ_CP019682.1:1501945-1503688:minus GCF_002220795.1
Campylobacter sputorum strain CCUG 20703 1,743 GCF_002220735.1:NZ_CP019683.1:606847-608589:plus, GCF_002220735.1:NZ_CP019683.1:935163-936905:minus, GCF_002220735.1:NZ_CP019683.1:1558189-1559931:minus GCF_002220735.1
Archaea
Pyrobaculum ferrireducens strain 1860 3,604 GCF_000234805.1:NC_016645.1:127214-130817:plus GCF_000234805.1
Pyrobaculum aerophilum strain IM2 2,213 GCF_000007225.1:NC_003364.1:1089640-1091852:plus GCF_000007225.1
Pyrobaculum arsenaticum strain DSM 13514 2,212 GCF_000016385.1:NC_009376.1:623323-625534:minus GCF_000016385.1
Aeropyrum pernix strain K1 2,202 GCF_000011125.1:NC_000854.2:1218712-1220913:minus GCF_000011125.1
Pyrobaculum neutrophilum strain V24Sta 2,197 GCF_000019805.1:NC_010525.1:690419-692615:plus GCF_000019805.1
Ca. Mancarchaeum acidiphilum strain Mia14 2,008 GCF_002214165.1:NZ_CP019964.1:751297-753304:minus GCF_002214165.1
Ca. Micrarchaeum sp. A_DKE 2,003 GCF_016806735.1:NZ_CP060530.1:203642-205644:minus GCF_016806735.1
Caldivirga maquilingensis strain IC-167 1,679 GCF_000018305.1:NC_009954.1:129150-130828:minus GCF_000018305.1
Aeropyrum camini strain SY1 1,650 GCF_000591035.1:NC_022521.1:1165168-1166817:minus GCF_000591035.1
Pyrolobus fumarii strain 1A 1,576 GCF_000223395.1:NC_015931.1:84671-86246:minus GCF_000223395.1
Top-10 shortest 16S rRNA genes
Organism  Gene length (bp)  RiboGrove Sequence ID(s)  Assembly accession
Bacteria
Anabaena sp. YBS01 1,401 GCF_009498015.1:NZ_CP034058.1:6920299-6921699:minus GCF_009498015.1
Clostridioides difficile strain TW11 1,426 GCF_009362915.1:NZ_CP045224.1:4068440-4069865:minus GCF_009362915.1
Roseicitreum antarcticum strain ZS2-28 1,447 GCF_014681765.1:NZ_CP061498.1:3436150-3437596:plus GCF_014681765.1
Hirschia baltica strain ATCC 49814 1,448 GCF_000023785.1:NC_012982.1:2336679-2338126:minus GCF_000023785.1
Mameliella alba strain KU6B 1,449 GCF_011405015.1:NZ_AP022337.1:267140-268588:plus, GCF_011405015.1:NZ_AP022337.1:1420943-1422391:plus, GCF_011405015.1:NZ_AP022337.1:3191212-3192660:minus GCF_011405015.1
Clostridioides difficile strain Cd18 1,450 GCF_018884705.1:NZ_CP037806.1:136016-137465:plus GCF_018884705.1
Hyphomonas sp. Mor2 1,451 GCF_001854405.1:NZ_CP017718.1:2304269-2305719:minus GCF_001854405.1
Listeria monocytogenes strain 3BS28 1,452 GCF_018604265.1:NZ_CP075876.1:2326384-2327835:minus GCF_018604265.1
Antarctobacter heliothermus strain SMS3 1,453 GCF_002237555.1:NZ_CP022540.1:1369380-1370832:plus, GCF_002237555.1:NZ_CP022540.1:2482480-2483932:plus GCF_002237555.1
Paracoccus yeei strain CCUG 32053 1,454 GCF_003612915.1:NZ_CP031078.1:426078-427531:plus GCF_003612915.1
Archaea
Ignicoccus hospitalis strain KIN4/I 1,439 GCF_000017945.1:NC_009776.1:728362-729800:plus GCF_000017945.1
Methanocaldococcus lauensis strain SG7 1,457 GCF_902827225.1:NZ_LR792632.1:542755-544211:plus GCF_902827225.1
Halorubrum sp. BOL3-1 1,463 GCF_004114375.1:NZ_CP034692.1:397753-399215:minus GCF_004114375.1
Methanospirillum hungatei strain GP1 1,466 GCF_019263745.1:NZ_CP077107.1:4649-6114:plus, GCF_019263745.1:NZ_CP077107.1:1359562-1361027:minus, GCF_019263745.1:NZ_CP077107.1:1365502-1366967:minus, GCF_019263745.1:NZ_CP077107.1:1986020-1987485:minus GCF_019263745.1
Ca. Methanomethylophilus alvus strain MGYG-HGUT-02456 1,466 GCF_902387285.1:NZ_LR699000.1:283607-285072:plus GCF_902387285.1
Methanospirillum sp. J.3.6.1-F.2.7.3 1,466 GCF_018502485.1:NZ_CP075546.1:133354-134819:plus, GCF_018502485.1:NZ_CP075546.1:825954-827419:plus, GCF_018502485.1:NZ_CP075546.1:872641-874106:plus, GCF_018502485.1:NZ_CP075546.1:1727419-1728884:plus GCF_018502485.1
Ca. Methanomethylophilus alvus strain Mx1201 1,466 GCF_000300255.2:NC_020913.1:283607-285072:plus GCF_000300255.2
Natronomonas sp. ZY43 1,466 GCF_024298905.1:NZ_CP101154.1:18680-20145:plus GCF_024298905.1
Natronomonas gomsonensis strain KCTC 4088 1,466 GCF_024300825.1:NZ_CP101323.1:2500564-2502029:plus GCF_024300825.1
Methanospirillum hungatei strain JF-1 1,466 GCF_000013445.1:NC_007796.1:39814-41279:plus, GCF_000013445.1:NC_007796.1:1301079-1302544:minus, GCF_000013445.1:NC_007796.1:3501525-3502990:minus, GCF_000013445.1:NC_007796.1:3507609-3509074:minus GCF_000013445.1
Salinirubellus salinus strain ZS-35-S2 1,466 GCF_025231485.1:NZ_CP104003.1:3070232-3071697:plus GCF_025231485.1
Ca. Methanomethylophilus alvus strain Mx-05 1,466 GCF_003711245.1:NZ_CP017686.1:283608-285073:plus GCF_003711245.1
Natronomonas halophila strain C90 1,466 GCF_013391085.1:NZ_CP058334.1:1530622-1532087:minus GCF_013391085.1
Top-10 genomes with the largest 16S rRNA copy numbers
Organism  Copy number  Assembly accession
Bacteria
Tumebacillus avium strain AR23208 37 GCF_002162355.1
Tumebacillus algifaecis strain THMBR28 27 GCF_002243515.1
Photobacterium damselae strain Phdp Wu-1 21 GCF_003130755.1
Aneurinibacillus sp. Ricciae_BoGa-3 21 GCF_028421645.1
Priestia megaterium strain S2 21 GCF_012275205.1
Peribacillus asahii strain KF4 21 GCF_023823975.1
Moritella sp. 5 20 GCF_018219455.1
Photobacterium toruni strain WD2103 20 GCF_024494545.1
Photobacterium damselae strain CSP DAM1 20 GCF_021766015.1
Photobacterium damselae strain RM-71 20 GCF_001708035.2
Photobacterium damselae strain 04Ya311 20 GCF_026001825.1
Photobacterium damselae strain 9046-81 20 GCF_009763125.1
Neobacillus drentensis strain JC05 20 GCF_021560175.1
Photobacterium damselae strain AS-16-0963-1 20 GCF_021768345.1
Photobacterium damselae strain AS-15-3942-9 20 GCF_021768365.1
Photobacterium damselae strain AS-15-3942-7 20 GCF_021768405.1
Moritella sp. 28 20 GCF_018219435.1
Photobacterium damselae strain AS-15-0759-2 20 GCF_021768425.1
Photobacterium damselae strain XP-11 20 GCF_023973125.1
Moritella sp. 36 20 GCF_018219415.1
Photobacterium damselae strain CSP DAM2 20 GCF_021765875.1
Archaea
Natronorubrum aibiense strain 7-3 5 GCF_009392895.1
Methanococcoides orientis strain LMO-1 5 GCF_021184045.1
Natrinema sp. SYSU A 869 5 GCF_019879105.1
Methanoplanus endosymbiosus strain DSM 3599 5 GCF_024662215.1
Natronorubrum bangense strain JCM 10635 5 GCF_004799645.1
Natronococcus occultus strain SP4 4 GCF_000328685.1
Methanogenium organophilum strain DSM 3596 4 GCF_026684035.1
Methanospirillum sp. J.3.6.1-F.2.7.3 4 GCF_018502485.1
Haloarcula sinaiiensis strain ATCC 33800 4 GCF_018200015.1
Methanococcus vannielii strain SB 4 GCF_000017165.1
Halosiccatus urmianus strain IBRC-M: 10911 4 GCF_020217425.1
Halomicrobium salinisoli strain LT50 4 GCF_020405185.1
Methanospirillum hungatei strain JF-1 4 GCF_000013445.1
Methanosphaera stadtmanae strain DSM 3091 4 GCF_000012545.1
Halomicrobium salinisoli strain TH30 4 GCF_020405245.1
Methanospirillum hungatei strain GP1 4 GCF_019263745.1
Haloterrigena salifodinae strain BOL5-1 4 GCF_016906025.1
Methanogenium sp. S4BF 4 GCF_029633965.1
Methanosphaera stadtmanae strain MGYG-HGUT-02164 4 GCF_902384015.1
Top-10 genomes with the highest intragenomic variability of 16S rRNA genes
Organism  Sum of entropy * (bits)  Mean entropy * (bits)  Number of variable positions  Gene copy number  Assembly accession
Bacteria
Escherichia coli strain P276M 433.81 0.26 569 6 GCF_009762385.1
Listeria monocytogenes strain 10-092876-1155 LM6 357.10 0.20 370 3 GCF_001999045.1
Klebsiella pneumoniae strain GZ-1 304.27 0.18 464 8 GCF_014854815.1
Streptococcus infantis strain SO 291.50 0.18 308 3 GCF_021497965.1
Bacillus pumilus strain EB130 272.85 0.16 427 8 GCF_019710455.1
Synechococcus sp. NB0720_010 243.35 0.16 265 3 GCF_023078835.1
Caminibacter mediatlanticus strain TB-2 228.78 0.15 282 4 GCF_005843985.1
Xanthomonas oryzae strain YNCX 227.74 0.15 248 3 GCF_024499285.1
Sporomusa termitida strain DSM 4440 226.25 0.13 247 12 GCF_007641255.1
Campylobacter hyointestinalis strain CHY5 217.64 0.13 237 3 GCF_013372165.1
Archaea
Halomicrobium sp. ZPS1 ** 137.00 0.09 137 2 GCF_009217585.1
Halosiccatus urmianus strain IBRC-M: 10911 131.55 0.09 146 4 GCF_020217425.1
Halapricum desulfuricans strain HSR12-2 128.00 0.09 128 2 GCF_017094525.1
Halomicrobium salinisoli strain TH30 127.74 0.09 145 4 GCF_020405245.1
Halapricum desulfuricans strain HSR-Bgl 127.00 0.09 127 2 GCF_017094445.1
Halomicrobium mukohataei strain JP60 125.81 0.09 137 3 GCF_004803735.1
Halomicrobium salinisoli strain LT50 123.31 0.08 140 4 GCF_020405185.1
Halapricum desulfuricans strain HSR-Est 111.00 0.08 111 2 GCF_017094465.1
Halapricum desulfuricans strain HSR12-1 109.00 0.07 109 2 GCF_017094505.1
Halorussus sp. XZYJT49 105.10 0.07 113 3 GCF_023238205.1

* Entropy is Shannon entropy calculated for each column of the multiple sequence alignment (MSA) of all full-length 16S rRNA genes of a genome. Entropy is then summed up (column “Sum of entropy”) and averaged (column “Mean entropy”).

** Halomicrobium sp. ZPS1 is a remarkable case. This genome harbours two 16S rRNA genes, so every mismatching alignment column has an entropy of exactly 1 bit, and the sum of entropy equals the number of mismatching nucleotides between the two gene sequences. Accordingly, the per cent identity between these two gene sequences is only 90.70%! This is remarkable because the usual (albeit arbitrary) genus demarcation threshold of per cent identity is 95%.
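The entropy metrics from the footnotes above can be illustrated with a short awk sketch. It is not the code used to build RiboGrove: the function name msa_entropy and the input format (already aligned gene copies of one genome, one sequence per line) are assumptions for the example, gap characters are simply counted as another symbol, and the mean is taken over alignment columns.

```shell
# Input: aligned sequences of equal length, one per line.
# Output: "<sum of entropy> <mean entropy> <number of variable positions>".
msa_entropy() {
  awk '
    { seq[NR] = toupper($0); if (length($0) > width) width = length($0) }
    END {
      if (NR == 0 || width == 0) { print "0.00 0.00 0"; exit }
      for (col = 1; col <= width; col++) {
        split("", count)                      # reset per-column symbol counts
        for (row = 1; row <= NR; row++) count[substr(seq[row], col, 1)]++
        h = 0
        for (sym in count) {                  # Shannon entropy of the column, in bits
          p = count[sym] / NR
          h -= p * log(p) / log(2)
        }
        if (h > 0) variable++
        total += h
      }
      printf "%.2f %.2f %d\n", total, total / width, variable
    }
  '
}

# Two made-up gene copies that differ at two of eight positions.
printf 'ACGTACGT\nACGAACGC\n' | msa_entropy   # prints "2.00 0.25 2"
```

With exactly two copies, each mismatching column holds two symbols at frequency 0.5 and contributes 1 bit, which is why the sum of entropy for Halomicrobium sp. ZPS1 (137.00) coincides with its number of variable positions (137).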

Coverage* of primer pairs for different V-regions of bacterial 16S rRNA genes
Phylum  Number of genomes  Full gene, 27F–1492R (%)  V1–V2, 27F–338R (%)  V1–V3, 27F–534R (%)  V3–V4, 341F–785R (%)  V3–V5, 341F–944R (%)  V4, 515F–806R (%)  V4–V5, 515F–944R (%)  V4–V6, 515F–1100R (%)  V5–V6, 784F–1100R (%)  V5–V7, 784F–1193R (%)  V6–V7, 939F–1193R (%)  V6–V8, 939F–1378R (%)
Pseudomonadota 18,010 99.67 99.47 99.63 99.94 81.73 99.87 81.79 89.54 89.25 93.40 92.30 96.58
Bacillota 7,212 99.85 99.72 99.81 99.94 95.80 99.99 95.72 99.51 97.99 97.35 98.63 99.35
Actinomycetota 3,104 99.81 98.87 99.61 93.81 63.37 93.59 63.18 96.01 99.65 99.74 99.77 97.10
Bacteroidota 1,226 96.00 95.35 95.92 99.84 61.42 99.10 60.85 38.58 38.83 94.62 92.17 94.62
Campylobacterota 951 100.00 100.00 100.00 100.00 100.00 99.89 99.89 99.89 99.58 99.58 99.68 99.47
Mycoplasmatota 532 96.99 94.55 75.19 98.12 90.79 98.12 91.17 70.86 42.29 43.23 79.70 0.38
Spirochaetota 357 49.86 50.42 50.42 95.52 99.72 95.52 99.72 100.00 78.71 78.71 91.32 38.10
Cyanobacteriota 223 99.55 100.00 99.55 100.00 5.38 100.00 5.38 100.00 1.35 1.35 100.00 99.55
Chlamydiota 186 0.00 0.00 0.00 100.00 100.00 0.00 0.00 0.00 100.00 100.00 100.00 94.09
Verrucomicrobiota 116 99.14 0.00 99.14 100.00 8.62 100.00 8.62 100.00 0.86 0.86 100.00 100.00
Thermodesulfobacteriota 114 100.00 99.12 100.00 100.00 45.61 100.00 45.61 100.00 93.86 89.47 95.61 99.12
Fusobacteriota 87 100.00 96.55 100.00 100.00 100.00 100.00 100.00 100.00 98.85 98.85 100.00 0.00
Deinococcota 79 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 0.00 0.00 50.63 100.00
Planctomycetota 59 100.00 18.64 100.00 100.00 62.71 100.00 62.71 0.00 0.00 0.00 3.39 0.00
Myxococcota 47 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Thermotogota 42 100.00 97.62 100.00 100.00 9.52 100.00 9.52 100.00 0.00 0.00 59.52 97.62
Chloroflexota 41 100.00 90.24 100.00 36.59 0.00 92.68 0.00 87.80 4.88 4.88 92.68 26.83
Acidobacteriota 31 96.77 96.77 96.77 100.00 100.00 100.00 100.00 100.00 61.29 45.16 83.87 100.00
Bdellovibrionota 20 100.00 100.00 100.00 100.00 65.00 100.00 65.00 100.00 100.00 100.00 100.00 100.00
Chlorobiota 15 100.00 100.00 100.00 100.00 0.00 0.00 0.00 0.00 100.00 93.33 86.67 6.67
Aquificota 14 100.00 21.43 100.00 100.00 21.43 100.00 21.43 100.00 0.00 0.00 7.14 21.43
Rhodothermota 12 33.33 33.33 33.33 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Nitrospirota 10 100.00 100.00 100.00 100.00 60.00 100.00 60.00 100.00 100.00 60.00 60.00 100.00
Synergistota 7 100.00 100.00 100.00 100.00 0.00 100.00 0.00 100.00 0.00 0.00 100.00 100.00
Deferribacterota 6 100.00 100.00 100.00 100.00 0.00 100.00 0.00 100.00 100.00 100.00 100.00 100.00
Ca. Saccharibacteria 6 100.00 100.00 100.00 100.00 16.67 16.67 16.67 16.67 0.00 0.00 100.00 100.00
Elusimicrobiota 4 100.00 50.00 100.00 100.00 0.00 100.00 0.00 100.00 75.00 75.00 100.00 100.00
Gemmatimonadota 4 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Kiritimatiellota 2 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 0.00 0.00 100.00 100.00
Thermomicrobiota 2 100.00 100.00 100.00 100.00 0.00 100.00 0.00 100.00 0.00 0.00 50.00 50.00
Fibrobacterota 2 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Ignavibacteriota 2 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Dictyoglomota 2 100.00 100.00 100.00 100.00 0.00 100.00 0.00 100.00 0.00 0.00 100.00 0.00
Chrysiogenota 2 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Lentisphaerota 1 100.00 0.00 100.00 100.00 100.00 100.00 100.00 100.00 0.00 0.00 100.00 100.00
Atribacterota 1 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 0.00 0.00 100.00 100.00
Caldisericota 1 100.00 100.00 100.00 100.00 0.00 0.00 0.00 0.00 0.00 100.00 100.00 100.00
Balneolota 1 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Armatimonadota 1 100.00 0.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Coprothermobacterota 1 0.00 0.00 0.00 100.00 100.00 100.00 100.00 0.00 0.00 0.00 100.00 0.00
Ca. Omnitrophota 1 100.00 100.00 100.00 100.00 0.00 100.00 0.00 100.00 100.00 100.00 100.00 100.00
Ca. Cloacimonadota 1 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Ca. Bipolaricaulota 1 0.00 0.00 0.00 100.00 100.00 100.00 100.00 0.00 0.00 0.00 0.00 0.00
Ca. Absconditabacteria 1 100.00 0.00 100.00 100.00 0.00 100.00 0.00 0.00 0.00 100.00 0.00 0.00
Calditrichota 1 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00

* Coverage of a primer pair is the per cent of genomes having at least one 16S rRNA gene which can be amplified by PCR using this primer pair. For details, see our paper about RiboGrove.
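The coverage definition above boils down to a per-genome "at least one" test. The sketch below shows only that aggregation step and assumes that amplifiability has already been decided for every gene. The input format (assembly accession, tab, 1 or 0) and the function name primer_coverage are hypothetical, and how a gene is judged amplifiable is left to the RiboGrove paper.

```shell
# Hypothetical input: "<assembly accession><TAB><1 if the gene is amplifiable, else 0>",
# one line per 16S rRNA gene. Output: coverage in per cent of genomes.
primer_coverage() {
  awk -F '\t' '
    { seen[$1] = 1; if ($2 == 1) covered[$1] = 1 }   # one amplifiable gene is enough
    END {
      for (genome in seen) { total++; if (genome in covered) hits++ }
      if (total) printf "%.2f\n", 100 * hits / total
    }
  '
}

# Four made-up genomes; G1 and G3 have at least one amplifiable gene.
printf 'G1\t0\nG1\t1\nG2\t0\nG3\t1\nG3\t1\nG4\t0\n' | primer_coverage   # prints "50.00"
```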

Primers used for coverage estimation
Primer name  Sequence  Reference
27F AGAGTTTGATYMTGGCTCAG Frank et al., 2008
338R GCTGCCTCCCGTAGGAGT Suzuki et al., 1996
341F * CCTACGGGNGGCWGCAG Klindworth et al., 2013
515F GTGCCAGCMGCCGCGGTAA Turner et al., 1999
534R ATTACCGCGGCTGCTGG Walker et al., 2015
784F AGGATTAGATACCCTGGTA Andersson et al., 2008
785R * GACTACHVGGGTATCTAATCC Klindworth et al., 2013
806R GGACTACHVGGGTWTCTAAT Caporaso et al., 2010
939F GAATTGACGGGGGCCCGCACAAG Lebuhn et al., 2014
944R GAATTAAACCACATGCTC Fuks et al., 2018
1100R AGGGTTGCGCTCGTTG Turner et al., 1999
1193R ACGTCATCCCCACCTTCC Bodenhausen et al., 2013
1378R CGGTGTGTACAAGGCCCGGGAACG Lebuhn et al., 2014
1492R TACCTTGTTACGACTT Frank et al., 2008

* Primers 341F and 785R are used in the protocol for library preparation for sequencing of V3–V4 region of 16S rRNA genes on Illumina MiSeq.


Searching data in RiboGrove

RiboGrove is a very minimalistic database: it comprises a collection of plain fasta files with metadata. Thus, extended search instruments are not available for it. We acknowledge this limitation and provide a list of suggestions below to help you explore and select RiboGrove data.

Header format

RiboGrove fasta data has the following format of header:

>GCF_000978375.1:NZ_CP009686.1:8908-10459:plus ;d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__cereus; category:1

Major blocks of a header are separated by spaces. A header consists of three such blocks:

  1. Sequence ID (seqID): GCF_000978375.1:NZ_CP009686.1:8908-10459:plus. SeqID, in turn, consists of four parts separated by colons (:):
    1. The Assembly accession of the genome from which the gene originates: GCF_000978375.1.
    2. The accession number of the RefSeq sequence, from which the gene originates: NZ_CP009686.1.
    3. Coordinates of the gene within this RefSeq genomic sequence: 8908-10459 (coordinates are 1-based, left-closed and right-closed).
    4. Strand of the RefSeq genomic sequence, where the gene is located: plus (or minus).
  2. A taxonomy string, comprising domain (Bacteria), phylum (Firmicutes), class (Bacilli), order (Bacillales), family (Bacillaceae), genus (Bacillus) names, and the specific epithet (cereus).
    Each name is preceded by a prefix, which denotes rank: d__ for domain, p__ for phylum, c__ for class, o__ for order, f__ for family, g__ for genus, and s__ for specific epithet. Prefixes contain double underscores.
    The taxonomic names are separated and flanked by semicolons (;).
  3. The category of the genome, from which the gene sequence originates: (category:1).
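The header layout described above can be taken apart with plain shell parameter expansion. The sketch below is one possible way to do it, applied to the example header from this page; the variable names are arbitrary. It takes the SeqID as the text before the first space and the category as the text after the last space, so that a taxonomy string containing spaces (for example a "Ca." name) would still be kept whole.

```shell
header='GCF_000978375.1:NZ_CP009686.1:8908-10459:plus ;d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__cereus; category:1'

seqid="${header%% *}"        # block 1: text before the first space
category="${header##* }"     # block 3: text after the last space ("category:1")
rest="${header#* }"
taxonomy="${rest% *}"        # block 2: everything in between

# Split the SeqID into its four colon-separated parts.
old_ifs="$IFS"; IFS=':'
set -- $seqid
IFS="$old_ifs"
assembly="$1"; accession="$2"; coords="$3"; strand="$4"

# Coordinates are 1-based and closed on both ends, hence the "+ 1".
start="${coords%-*}"; end="${coords#*-}"
gene_length=$(( end - start + 1 ))

# Pull one rank out of the taxonomy string by its prefix.
genus="$(printf '%s\n' "$taxonomy" | sed -E 's/.*;g__([^;]*);.*/\1/')"

echo "$assembly $accession $start-$end $strand $genus ${category#category:} $gene_length"
# prints: GCF_000978375.1 NZ_CP009686.1 8908-10459 plus Bacillus 1 1552
```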

Sequence selection

You can select specific sequences from fasta files using the Seqkit program (GitHub repo, documentation). It is free, cross-platform, multifunctional and pretty fast, and it can process both gzipped and uncompressed fasta files. The seqkit grep and seqkit seq subcommands are useful for sequence selection.

Search sequences by header

Given the downloaded fasta file ribogrove_12.218_sequences.fasta.gz, consider the following examples of sequence selection using seqkit grep:

Example 1. Select a single sequence by SeqID.

seqkit grep -p "GCF_000978375.1:NZ_CP009686.1:8908-10459:plus" ribogrove_12.218_sequences.fasta.gz

The -p option sets a pattern to search in fasta headers (only in sequence IDs, actually).

Example 2. Select all gene sequences of a single RefSeq genomic sequence by accession number NZ_CP009686.1.

seqkit grep -nrp ":NZ_CP009686.1:" ribogrove_12.218_sequences.fasta.gz

Here, two more options are required: -n and -r. The former tells the program to match whole headers instead of IDs only. The latter tells the program to treat the pattern as a regular expression and to include partial matches in the output, i.e. if the pattern matches a substring of a header, the corresponding sequence will be printed.

To ensure search specificity, surround the Accession.Version with colons (:).
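A small caveat on specificity: because -r turns the pattern into a regular expression, an unescaped dot matches any character, not only a literal dot. The sketch below illustrates the difference with plain grep -E on two made-up SeqIDs (the second one is deliberately malformed). It demonstrates regular-expression behaviour in general rather than Seqkit itself.

```shell
# Two made-up SeqIDs that differ only in the character before the version number.
ids='GCF_000000001.1:NZ_CP009686.1:100-1600:plus
GCF_000000002.1:NZ_CP009686x1:100-1600:plus'

# Unescaped dot: both lines match, because "." stands for any character.
printf '%s\n' "$ids" | grep -cE ':NZ_CP009686.1:'    # prints 2

# Escaped dot: only the literal "NZ_CP009686.1" matches.
printf '%s\n' "$ids" | grep -cE ':NZ_CP009686\.1:'   # prints 1
```

For real accession numbers the unescaped form is very unlikely to produce a false match, but writing the pattern as ":NZ_CP009686\.1:" in the seqkit grep call removes the ambiguity.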

Example 3. Select all gene sequences of a single genome (Assembly accession GCF_019357495.1).

seqkit grep -nrp "GCF_019357495.1:" ribogrove_12.218_sequences.fasta.gz

To ensure search specificity, put a colon (:) after the assembly accession.

Example 4. Select all actinobacterial sequences.

seqkit grep -nrp ";p__Actinobacteria;" ribogrove_12.218_sequences.fasta.gz

To ensure search specificity, surround the taxonomy name with semicolons (;).

Example 5. Select all sequences originating from category 1 genomes.

seqkit grep -nrp "category:1" ribogrove_12.218_sequences.fasta.gz

Example 6. Select all sequences except for those belonging to Firmicutes.

seqkit grep -nvrp ";p__Firmicutes;" ribogrove_12.218_sequences.fasta.gz

Note the -v option within the option sequence -nvrp. This option inverts the match, i.e. the output will comprise sequences whose headers do not contain the substring “;p__Firmicutes;”.

Search sequences by length

You can use the seqkit seq program to select sequences by length.

Example 1. Select all sequences longer than 1600 bp.

seqkit seq -m 1601 ribogrove_12.218_sequences.fasta.gz

The -m option sets the minimum length of a sequence to be printed to output.

Example 2. Select all sequences shorter than 1500 bp.

seqkit seq -M 1499 ribogrove_12.218_sequences.fasta.gz

The -M option sets the maximum length of a sequence to be printed to output.

Example 3. Select all sequences having length in range [1500, 1600] bp.

seqkit seq -m 1500 -M 1600 ribogrove_12.218_sequences.fasta.gz
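If Seqkit is not at hand, the same length filter can be approximated with awk. The function below is a sketch under simple assumptions: the function name filter_by_length is made up, the input is uncompressed fasta on standard input, and each record is written back out with its sequence on a single line.

```shell
# Keep fasta records whose sequence length lies in [min_len, max_len].
# Usage: filter_by_length MIN MAX < input.fasta
filter_by_length() {
  awk -v min_len="$1" -v max_len="$2" '
    function emit() {
      if (header != "" && length(seq) >= min_len && length(seq) <= max_len)
        printf "%s\n%s\n", header, seq
    }
    /^>/ { emit(); header = $0; seq = ""; next }   # a new record starts
    { seq = seq $0 }                               # join wrapped sequence lines
    END { emit() }
  '
}

# Made-up records of length 6, 3 and 10; only the first falls within [4, 6].
printf '>a\nACGT\nAC\n>b\nACG\n>c\nACGTACGTAC\n' | filter_by_length 4 6
```

For the gzipped RiboGrove file, the input can be piped in, for example: gzip -dc ribogrove_12.218_sequences.fasta.gz | filter_by_length 1500 1600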

Selecting header data

It is sometimes useful to retrieve only header information from a fasta file. You can use the seqkit seq program for it.

Example 1. Select all headers.

seqkit seq -n ribogrove_12.218_sequences.fasta.gz

The -n option tells the program to output only headers.

Example 2. Select all SeqIDs (header parts before the first space).

seqkit seq -ni ribogrove_12.218_sequences.fasta.gz

The -i option tells the program to output only sequence IDs.

Example 3. Select all RefSeq “Accession.Version”s.

seqkit seq -ni ribogrove_12.218_sequences.fasta.gz | cut -f2 -d':' | sort | uniq

This works only if the cut, sort and uniq utilities are installed (Linux and macOS systems usually include them).

Example 4. Select all Assembly accessions.

seqkit seq -ni ribogrove_12.218_sequences.fasta.gz | cut -f1 -d':' | sort | uniq

This works only if the cut, sort and uniq utilities are installed (Linux and macOS systems usually include them).

Example 5. Select all phylum names.

seqkit seq -n ribogrove_12.218_sequences.fasta.gz | grep -Eo ';p__[^;]+' | sed -E 's/;|p__//g' | sort | uniq

This works only if the grep, sed, sort and uniq utilities are installed (Linux and macOS systems usually include them).
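The header-based pipelines above list distinct values; adding uniq -c turns them into counts. The sketch below tallies sequences per phylum and gene copies per genome. The three headers are made up for illustration and stand in for the output of seqkit seq -n; the function names are likewise hypothetical.

```shell
# Made-up headers in the RiboGrove format (placeholder accessions and taxon names).
headers='GCF_000000001.1:NZ_AA000001.1:1-1500:plus ;d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__cereus; category:1
GCF_000000001.1:NZ_AA000001.1:9001-10500:minus ;d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__cereus; category:1
GCF_000000002.1:NZ_AA000002.1:1-1480:plus ;d__Bacteria;p__Actinobacteria;c__ClassA;o__OrderA;f__FamilyA;g__GenusA;s__speciesa; category:2'

# Number of sequences per phylum, most frequent first (count, then phylum name).
count_phyla() {
  grep -Eo ';p__[^;]+' | sed -E 's/^;p__//' | sort | uniq -c | sort -k1,1nr
}

# Number of 16S rRNA gene copies per genome (count, then Assembly accession).
copies_per_genome() {
  cut -d':' -f1 | sort | uniq -c
}

printf '%s\n' "$headers" | count_phyla
printf '%s\n' "$headers" | copies_per_genome
```

With the real file, the same functions can be fed from Seqkit, for example: seqkit seq -n ribogrove_12.218_sequences.fasta.gz | count_phyla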



RiboGrove, 2023-07-19