SLTChemDB: A database of chemical compounds present in Smokeless tobacco products

Smokeless tobacco (SLT), a cause of potentially preventable diseases, has a diverse chemical composition encompassing toxicants as well as potent carcinogens. Although the chemical profile of SLT products has been analyzed previously, this information is not available in a comprehensive and easily accessible format. Hence, there is a pressing need for a one-stop source providing inclusive information on SLT products. SLTChemDB is the first such database that makes available detailed information on various properties of chemical compounds identified across different brands of SLT products. The primary information for the database was extracted through an extensive literature search and was further curated using popular chemical web servers and databases. At present, SLTChemDB contains comprehensive information on 233 unique chemical compounds and 82 SLT products. The database has been made user-friendly, with facilities for systematic search and filtering. SLTChemDB would provide the initial data on chemical compounds in SLT products to various tobacco testing laboratories. The database also highlights research gaps and would thus be a guide for researchers interested in the chemistry and toxicology of SLT products. With regular updates, it shall be a valuable evidence base for policymakers to formulate stringent policies for SLT control.

Consumption of SLT is a global menace, estimated to account for approximately 0.65 million deaths per year 1. Epidemiological studies indicate a significant role of SLT products in cancers, stroke, and nervous and reproductive disorders [1][2][3][4][5][6]. A recent in-silico study indicated a significant role of toxic chemical compounds in the diseases caused by SLT products 7. Worldwide, different forms of SLT products are available, ranging from simple tobacco to complex products having many additives and flavoring agents. Reports suggest that the chemical composition of a tobacco plant is altered significantly during the curing and processing of SLT products 8. Many chemical compounds in the form of non-tobacco plant materials (like areca nut), humectants, flavoring agents and alkaline agents are also introduced to enhance the attractiveness and addictiveness of the SLT products 8.
For effective control of SLT-attributable diseases, it is imperative to identify the chemical compounds present in SLT products, estimate their toxicity and study their specific role in diseases. No attempts have yet been made to compile data about the chemical compounds in SLT products. To the best of our knowledge, this study is the first attempt wherein we have collected and compiled vital details about the chemical information, physicochemical properties, biological information, toxicological information and distribution of chemical compounds present in SLT products. SLTChemDB is a one-stop information source crosslinked to various popular chemical databases like PubChem, ChemSpider and ChEMBL.

Results
Database statistics. SLTChemDB is a comprehensive database of all the chemicals identified by testing various SLT products. Presently, the database contains comprehensive information about 233 chemical compounds (+2 mixtures of these compounds) and 82 SLT products. We also provide brand-wise chemical composition, pH, moisture, free nicotine and tobacco content of 41 SLT products. Figure 1 summarizes the information available in SLTChemDB. The database contains information about chemical composition, pH, moisture, mode of intake, free nicotine, tobacco content and country-wise information of SLT products, along with biological, toxicological and physicochemical information of chemical compounds.
Out of the 233 chemical compounds, chemical information such as canonical SMILES and IUPAC names for 224 compounds was taken from PubChem 23 and ChemSpider 24. 3D structures/canonical SMILES taken from PubChem/ChemSpider were converted to 2D using Open Babel 25 (for more details refer to Supplementary Table 1). Structures of 5 compounds (unavailable on PubChem/ChemSpider) were drawn manually, and their SMILES were generated using the tool provided on the SLTChemDB website. Structures of 4 chemical compounds (2 PAHs, 1 coumarin and 1 radionuclide) could not be generated due to lack of complete chemical information. Canonical SMILES were used to calculate properties such as polar surface area, number of hydrogen bond donors and acceptors, molecular weight and molecular formula. Further, canonical SMILES were used to predict toxicological information using the pkCSM web server 26. Information about protein targets of 38 chemicals was extracted from ChEMBL 27. Complete statistics of the chemical compounds are shown in Table 1.
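As a small worked illustration of one such property, the sketch below computes a molecular weight from a molecular formula. It is not the tool used to build SLTChemDB: it starts from a formula rather than from canonical SMILES, and the `ATOMIC_MASS` table and `molecular_weight` function are hypothetical names introduced here that cover only four elements and bracket-free formulas.

```python
import re

# Standard atomic weights in g/mol (rounded); only the elements needed here.
# This table is an assumption of the sketch, not part of SLTChemDB.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(formula: str) -> float:
    """Sum atomic masses for a simple formula such as 'C10H14N2' (no brackets)."""
    total = 0.0
    # Each match is (element symbol, optional count); a missing count means 1.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


nicotine_mw = molecular_weight("C10H14N2")   # nicotine
formaldehyde_mw = molecular_weight("CH2O")   # formaldehyde
print(round(nicotine_mw, 2), round(formaldehyde_mw, 2))
```

In practice a cheminformatics toolkit would derive the formula and the remaining descriptors directly from the SMILES string; the arithmetic above only shows what the molecular weight field represents.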
As per the International Agency for Research on Cancer (IARC) classification of carcinogens 18, 69 of the 233 compounds were classified into various carcinogenic groups. Out of 69 classified compounds, 7 compounds (Formaldehyde, Beryllium, Arsenic, Cadmium, N-nitrosonornicotine,
Database utility. SLTChemDB holds immense utility for various stakeholders, viz. researchers and policymakers, by providing a one-stop information source on chemical constituents of SLT products. SLTChemDB is the first such database that contains raw data useful for tobacco testing laboratories across the globe. Information from the database can also be used to identify the products and brands having the minimum concentration of carcinogenic compounds. As an example, among different SLT products, the amount of NNN varies from 0.0132 µg/g in Rapé tobacco (Brand: Rapé Guarany Cristal) to 3085 µg/g (dry weight) in Toombak. The database also highlights research gaps by providing a product-wise list of identified chemical compounds. Thus, SLTChemDB will form the evidence base and initial data depicting the need for regulation and periodic testing of chemical constituents of SLT products.

SLTChemDB Web Interface
Data searching. SLTChemDB has a simple and user-friendly interface. Extensive search options are provided, as briefly explained below.
Simple search. This option allows users to search SLTChemDB using various keywords. Users can retrieve comprehensive information about SLT products and their chemical compounds using the options provided in the Search tab. This function is depicted in Supplementary Fig. 5.
Advanced search. Advanced search allows users to build complex queries using logical operators such as "AND" and "OR" to search across various fields. The advanced search is explained in Supplementary Fig. 6.
The available fields for Simple and Advanced search are depicted in Table 2.
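The combination of fields with "AND" and "OR" can be sketched as a small recursive filter. This is only an illustration under stated assumptions: the record layout, the field names (`product`, `brand`, `compound`) and the helper functions `matches` and `search` are hypothetical and do not reproduce the SLTChemDB implementation or its actual search fields.

```python
# Hypothetical records; the field names and rows are illustrative only and
# do not reproduce the actual SLTChemDB tables.
RECORDS = [
    {"product": "Moist snuff", "brand": "Grizzly", "compound": "NNN"},
    {"product": "Moist snuff", "brand": "Skoal", "compound": "NNN"},
    {"product": "Toombak", "brand": "", "compound": "NNN"},
]


def matches(record, query):
    """Evaluate a nested query: ("AND" | "OR", sub, ...) or a (field, text) leaf."""
    if query[0] in ("AND", "OR"):
        results = (matches(record, sub) for sub in query[1:])
        return all(results) if query[0] == "AND" else any(results)
    field, text = query
    # Leaf: case-insensitive keyword match within one field.
    return text.lower() in record[field].lower()


def search(records, query):
    """Return the records that satisfy the query, in their original order."""
    return [r for r in records if matches(r, query)]


# compound contains "NNN" AND (brand contains "Grizzly" OR product contains "Toombak")
query = ("AND", ("compound", "NNN"),
         ("OR", ("brand", "Grizzly"), ("product", "Toombak")))
hits = search(RECORDS, query)
print([(r["product"], r["brand"]) for r in hits])
```

Representing the query as nested tuples keeps the operators composable, so an "OR" group can sit inside an "AND" group without special cases.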
Structure search. Structure-based search allows the user to retrieve information about chemical compounds by providing a structure, either by drawing it or by uploading a SMILES string or mol file. The user can select from three search types to generate results: substructure/exact search, topological fingerprint-based search and MACCS key-based search. The Tanimoto coefficient, depicting structural similarity, is displayed against each search result. Structure-based search is performed using RDKit 28. Structures are visualized using JSmol 29. More information about the results of this search function is given in Supplementary Fig. 7.
Compare results. Using this option, users can compare the composition of different chemicals analyzed across available SLT products, brands and/or countries. This function is explained with an example in Supplementary Fig. 8.
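For readers unfamiliar with the Tanimoto coefficient, the following minimal sketch shows how it is computed for two fingerprints represented as sets of "on" bit positions. The bit sets below are made up for illustration; in the database the fingerprints are generated by the cheminformatics toolkit, and the function name `tanimoto` is an assumption of this sketch.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient: shared on-bits divided by on-bits in either fingerprint."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(union)


# Made-up fingerprints (positions of set bits), not derived from real molecules.
fp_query = {1, 4, 7, 9}
fp_hit = {1, 4, 8, 9, 12}

# 3 shared bits out of 6 distinct bits overall.
print(tanimoto(fp_query, fp_hit))
```

A value of 1 means the two fingerprints are identical and 0 means they share no bits, which is why the coefficient is a convenient score to show next to each search hit.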
Data browsing. The current version of SLTChemDB contains information on 233 chemical compounds and 82 SLT products. All information about SLT products and their chemicals is stored in seven tables. The following browse tabs are provided:
• Physicochemical Information: This tab displays the compound name, molecular weight, Log P, hydrogen bond donors, hydrogen bond acceptors, polar surface area and links to other chemical databases for all the chemical compounds.

Download tab. An option to download all the data in .csv format will be available soon.
Update of SLTChemDB. The database shall be updated regularly to incorporate newly published research on this topic. Chemico-toxicological information on more SLT products shall also be included whenever available in an authenticated form. Additionally, the database provides an option for users to submit their own information using the submission form available on the SLTChemDB website. However, such data shall be authenticated by our team before inclusion in the database.

Discussion
With the widespread use of SLT products, there is an enhanced rate of mortality and morbidity associated with SLT use. Since the health effects of SLT are attributed to its chemical constituents, it is essential to study in detail the chemical profile of various products. This research is hindered by the lack of easily available information on chemical composition of SLT products in a readily usable format. The situation is made more complex by the wide variation in chemical profile across brands of same product and within batches of a brand. Hence, SLTChemDB has been developed as the first comprehensive data repository of chemical, biological and toxicological information about chemical compounds identified across various brands of SLT products.
(2019) 9:7142 | https://doi.org/10.1038/s41598-019-43559-y | www.nature.com/scientificreports
This database holds promise as an invaluable resource for various stakeholders viz. researchers and policymakers by providing a one-stop information source on chemical profiling of SLT products. For instance, SLTChemDB contains information about 222 Moist Snuff brands. Among them, the number of chemicals identified varies from as high as 44 in Copenhagen to just 1 in other brands like Husky Long Cut Wintergreen, Husky Long Cut Natural etc. With this information, SLTChemDB will highlight the existing gaps in testing of SLT products. Thus, this database is likely to be a valuable resource for the researchers with interest in chemical profiling of SLT products.
Since the levels of various chemicals vary widely between SLT products and also within brands of a particular product 30, this database is a valuable source for comparison between products and between brands of a product. As an example, the amount of N-nitrosonornicotine (NNN) in all the moist snuff samples tested from the United States varied from 0.71 µg/g to 64 µg/g across different brands. Within one moist snuff brand from the United States, Grizzly, the amount of NNN ranged from 2.64 µg/g to 11.1 µg/g, while within Skoal it varied from 0.76 µg/g (wet weight) to 42.6 µg/g (wet weight). This information from SLTChemDB shall provide an evidence base for policymakers to form stringent policies on regulation of toxic contents in SLT products. One limitation is that, since the information on chemical composition of brands has been retrieved from published literature, a few brands in the database do not carry a name due to lack of this information in the concerned papers 12,[31][32][33][34].
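The spread in these NNN ranges can be made concrete with a short calculation of the fold variation (maximum divided by minimum) for each group. The numbers are the ones quoted in this paragraph; the dictionary layout and variable names are assumptions of this sketch, and the ranges are not all reported on the same weight basis, so the ratios are indicative only.

```python
# NNN ranges in µg/g as quoted in the text: (minimum, maximum).
# Skoal values are reported on a wet-weight basis; the basis for the other
# two ranges is not stated in the paragraph.
nnn_ranges = {
    "US moist snuff (all brands)": (0.71, 64.0),
    "Grizzly": (2.64, 11.1),
    "Skoal": (0.76, 42.6),
}

# Fold variation = highest reported level / lowest reported level.
fold = {name: high / low for name, (low, high) in nnn_ranges.items()}

for name in sorted(fold, key=fold.get, reverse=True):
    low, high = nnn_ranges[name]
    print(f"{name}: {low} to {high} µg/g, about {fold[name]:.1f}-fold")
```

Even within a single brand the highest reported level is several times the lowest, which is the kind of comparison the database is meant to support.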
SLT research is an evolving field, with ongoing research on various aspects. SLTChemDB has been developed as an activity of the WHO FCTC Global Knowledge Hub on Smokeless Tobacco established at ICMR-NICPR. The database shall be regularly updated to incorporate future information and maintain its comprehensiveness. In addition, future directions may include incorporation of metabolites of chemicals present in SLT products and validation of the toxicological and biological information (so far estimated in silico) through in-vivo and in-vitro methodologies.

Methods
Data collection and compilation. The information about chemical compounds present in SLT products was extracted through extensive search of peer-reviewed literature like papers, reports and monographs from PubMed and Google Scholar using various combinations of keywords (Table 3).
A flow diagram depicting the complete data collection process using PRISMA 35 is available in Fig. 4. A total of 821 articles were collected, which were filtered on the basis of availability of full text and data on testing of SLT products. Information about the classification and composition of chemical compounds, along with mode of intake, pH and moisture of SLT products, was extracted from 85 published articles.
Data were compiled to obtain information about physicochemical properties from well-known chemical databases such as PubChem 24 and ChemSpider 25. Further, information about biological targets was extracted from the large-scale bioactivity database ChEMBL 26. Each compound was classified into carcinogenic groups as per the IARC classification of carcinogens [27][28][29][30]. Toxicological information for each chemical compound was calculated using the pkCSM server 31.

Database framework and web interface. SLTChemDB is developed using efficient, open-source technologies such as Apache and MySQL. The front end is developed using HTML, PHP and JavaScript, while the back end is supported by PHP. Structure-based search is performed using RDKit 28. 3D structures obtained from existing chemical databases were converted to 2D using Open Babel 25. These are further utilized for display and structure-based search. Structures are visualized using JSmol 29.
Data organization.
Primary data. Primary data comprise information about the classification and composition of chemical compounds, along with mode of intake, pH and moisture of SLT products. This information was extracted from peer-reviewed published articles.
Secondary data. Physicochemical properties of all the identified chemical compounds were extracted from PubChem database. Biological information (protein targets) of each chemical compound was extracted from ChEMBL database. Canonical SMILES structure of chemical compounds taken from PubChem was used for calculation of toxicological properties using pkCSM web server.
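One way to picture how primary data (brand-level measurements) and secondary data (per-compound properties) could be linked is a relational layout keyed on the compound. The sketch below is purely illustrative: it uses Python's built-in `sqlite3` module as a stand-in for MySQL, and the table names, column names and two-table design are hypothetical; the actual seven-table SLTChemDB schema is not described in the text. The inserted rows reuse the NNN ranges quoted in the Discussion.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL back end.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Secondary data: one row per compound (property columns would go here).
    CREATE TABLE compound (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- Primary data: levels reported in the literature for a brand.
    CREATE TABLE brand_level (
        product     TEXT,
        brand       TEXT,
        compound_id INTEGER REFERENCES compound(id),
        min_ug_g    REAL,
        max_ug_g    REAL
    );
""")

conn.execute("INSERT INTO compound (id, name) VALUES (1, 'N-nitrosonornicotine')")
conn.executemany(
    "INSERT INTO brand_level VALUES (?, ?, ?, ?, ?)",
    [
        ("Moist snuff", "Grizzly", 1, 2.64, 11.1),
        ("Moist snuff", "Skoal", 1, 0.76, 42.6),
    ],
)

# Join primary and secondary data: brands ranked by their highest NNN level.
rows = conn.execute(
    """
    SELECT b.brand, b.max_ug_g
    FROM brand_level AS b
    JOIN compound AS c ON c.id = b.compound_id
    WHERE c.name = ?
    ORDER BY b.max_ug_g DESC
    """,
    ("N-nitrosonornicotine",),
).fetchall()
print(rows)
```

Keeping compound properties in their own table means each property is stored once, however many brands report that compound.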

Data Availability
The database is freely available at bic.icmr.org.in/sltchem