Nutritionally recommended food for semi- to strict vegetarian diets based on large-scale nutrient composition data

Diet design for vegetarian health is challenging due to the limited food repertoire of vegetarians. This challenge can be partially overcome by quantitative, data-driven approaches that utilise massive nutritional information collected for many different foods. Based on large-scale data of foods’ nutrient compositions, the recent concept of nutritional fitness helps quantify a nutrient balance within each food with regard to satisfying daily nutritional requirements. Nutritional fitness offers prioritisation of recommended foods using the foods’ occurrence in nutritionally adequate food combinations. Here, we systematically identify nutritionally recommendable foods for semi- to strict vegetarian diets through the computation of nutritional fitness. Along with commonly recommendable foods across different diets, our analysis reveals favourable foods specific to each diet, such as immature lima beans for a vegan diet as an amino acid and choline source, and mushrooms for ovo-lacto vegetarian and vegan diets as a vitamin D source. Furthermore, we find that selenium and other essential micronutrients can be subject to deficiency in plant-based diets, and suggest nutritionally-desirable dietary patterns. We extend our analysis to two hypothetical scenarios of highly personalised, plant-based methionine-restricted diets. Our nutrient-profiling approach may provide a useful guide for designing different types of personalised vegetarian diets.

x (avg. ± s.d.) denotes the average and standard deviation of the weights of a food (g per day) in irreducible food sets containing that food. For each food in Tables 1 and 3, we present the diets that give NF > 0.7 [C, control diet; O, ovo-lacto vegetarian diet; V, vegan diet; M, methionine-restricted diet; I, personalised diet I (61-year-old male); II, personalised diet II (58-year-old female)], along with the specific values of NF and x in parentheses beside each diet.

Diet and physical conditions
In the current study, we consider the following four diet styles of a physically active, 20-year-old male with standard height and weight: (i) control, (ii) ovo-lacto vegetarian, (iii) vegan, and (iv) methionine-restricted diets. In addition, we consider two hypothetical scenarios of highly personalised, methionine-restricted diets of (v) a 61-year-old male with low physical activity and (vi) a physically active 58-year-old female. The details of these diet and physical conditions are explained below.

Nutritional composition of food
For diets (i) and (iv), we use the data of food nutritional compositions from our previous study 1, wherein we accessed the USDA National Nutrient Database for Standard Reference, Release 24 2. The database provides the energy (calorie) and nutrient contents of 7,907 foods. The nutrient contents were normalised to the sum (100 g) of protein, total lipid, carbohydrate, water, ash, and alcohol of each food. From these foods, we considered raw foods, as well as other foods whose nutrient contents have been minimally modified. Specifically, we selected foods that fall into one of the following categories. First, we selected foods obtained directly from nature or directly from agriculture, fishery, or livestock farming, without any explicitly added or fortified ingredients such as salt, sugar, and vitamins. Those foods include various raw vegetables, fruits, meat, and fish. Second, we selected foods that belong to the first category but have some modifications in their physical properties. Those foods include ground products, e.g., wheat flour and ground meat. Third, we selected foods that belong to the first or second category but have additional, minor modifications in their nutrient contents, i.e., frozen, dried, low-fat, non-fat, and ultraviolet-treated products. In total, 1,068 foods were selected for diets (i) and (iv); here, all of them are simply called raw foods. For diet (iii), we consider only the plant-derived foods among them, and for diet (ii), the plant-derived foods plus egg and milk products. Diets (v) and (vi) include the plant-derived foods in diet (iii), along with limited amounts of eggs, whole milk, and certain types of fish (see below and Supplementary Data S2). Furthermore, diets (v) and (vi) include cheese products, although cheese products themselves do not belong to raw foods.

Consolidation of foods that have almost identical nutrient contents
In the previous study 1, we consolidated raw foods that have almost identical nutrient contents by calculating a quantity Fij for each pair of foods i and j. In its definition, ρi(j)m is the density of nutrient m in food i (j), nij is the number of nutrients m considered for foods i and j, and K is a positive real number selected to minimise Fij given {ρim} and {ρjm}. We only considered nutrients m that have explicit records of their quantities in both foods i and j and have non-zero quantities for food i or j. The resulting Fij ranges from zero to one, and a small Fij indicates that the foods i and j are similar in their relative nutrient amounts. The calculation of Fij works even with nutrients on very different scales or with different units for the quantities (e.g., μg RAE for vitamin A, and μg DFE for folate). From a probability distribution of Fij over all pairs of foods i and j, we found a sharp transition of the distribution at Fij ≈ 0.012. Accordingly, we created unified groups of foods; each group forms an isolated, single connected component in a network of foods linked through Fij < 0.012. For each unified group, the nutrient quantities (per 100 g) were averaged over the foods. The averages for the nutrients were calculated only from foods that had explicit records for the nutrient quantity and were not dried or frozen (differences in the water contents of foods cause large variations in nutrient densities, despite the similar nutrient compositions of the foods). By treating each unified group as a single food, we obtained a total of 653 raw foods (Supplementary Data S1; unlike the previous study 1, here we do not consider sea cucumber, because of its suspicious nutrient content data). In the present study, the aforementioned cheese products for diets (v) and (vi) were similarly consolidated into 16 cheese products by unifying the foods linked through Fij < 0.06 (Supplementary Data S2).
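The grouping step described above (linking foods with Fij below a threshold and taking connected components) can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `unify_foods`, the union-find approach, and the toy Fij values are assumptions made for the example, and the Fij values are taken as precomputed inputs.

```python
def unify_foods(foods, f_pairs, threshold=0.012):
    """Group foods into connected components of the network F_ij < threshold."""
    parent = {food: food for food in foods}

    def find(a):
        # Follow parent links to the component root, compressing the path.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for (i, j), f_ij in f_pairs.items():
        if f_ij < threshold:
            parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for food in foods:
        groups.setdefault(find(food), []).append(food)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical F_ij values for four foods (not taken from the study).
foods = ["A", "B", "C", "D"]
f_pairs = {
    ("A", "B"): 0.004, ("B", "C"): 0.010, ("A", "C"): 0.030,
    ("A", "D"): 0.250, ("B", "D"): 0.300, ("C", "D"): 0.200,
}
groups = unify_foods(foods, f_pairs)
```

In this toy case, A and C end up in the same unified group even though their own Fij exceeds the threshold, because both are linked to B; this is the behaviour implied by defining groups as connected components.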

Identification of food categories
We follow our previous grouping of foods 1 based on their nutritional similarity. We conducted an average-linkage hierarchical clustering of the foods (agglomerating the most nutritionally similar foods first, as described in the previous study 1) and built a dendrogram, in which each leaf is a food and branches represent groups of foods. Groups at deeper hierarchical levels (farther from the root and closer to the leaves) contain foods with greater nutritional similarity than groups at shallower levels. Near the root, six foods (raw, dried, and frozen egg whites, duck and goose fat, honey, and table salt) are first split from the others because their nutrient contents are dissimilar to those of most foods. The remaining foods are divided into two large parts: animal-derived and plant-derived. The animal-derived part has a layered, core-peripheral organisation: the core region (bulky clusters of foods) at the deeper hierarchical level includes protein-rich foods, while the peripheral region outside the core includes both protein-rich and fat-rich foods. In a similar fashion, the plant-derived part is divided into protein-rich, fat-rich, carbohydrate-rich, and low-macronutrient categories (Supplementary Data S1; the 'low-calorie' category in our previous study is called here the 'low-macronutrient' category, in order to prevent any nomenclatural confusion later).

Recommended levels of nutrient intakes
For the recommended daily levels of nutrient intakes, as in our previous study 1, we used three sources 3-5. We mainly used the data from the first source, while the second and third sources were references only for the data on cholesterol, saturated fatty acids, and trans-fatty acids.
The specific values for the lower and upper bounds of the recommended daily intake of nutrients depend on age and gender. For diets (i) to (v), the daily recommended energy E was calculated following the formula E (kcal, for a ≥19-year-old male) = 662 − (9.53 × y) + Pa × (15.91 × w + 539.6 × h), where y denotes the age in years, Pa stands for the physical activity level, w is the weight in kg, and h is the height in m. For diets (i) to (iv) and for diet (v), y = 20 and 61, Pa = 1.25 and 1.11, w = 70 and 73, and h = 1.77 and 1.68, respectively. For diet (vi), we use the formula E (kcal, for a ≥19-year-old non-pregnant female) = 354 − (6.91 × y) + Pa × (9.36 × w + 726 × h) with y = 58, Pa = 1.25, w = 62, and h = 1.70.
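As a worked check of the two energy formulas above, the following short sketch evaluates them with the parameter values given in the text. The function names `eer_male` and `eer_female` and the dictionary layout are chosen here for illustration only.

```python
def eer_male(age, pa, weight_kg, height_m):
    """Daily recommended energy (kcal) for a >=19-year-old male."""
    return 662 - 9.53 * age + pa * (15.91 * weight_kg + 539.6 * height_m)


def eer_female(age, pa, weight_kg, height_m):
    """Daily recommended energy (kcal) for a >=19-year-old non-pregnant female."""
    return 354 - 6.91 * age + pa * (9.36 * weight_kg + 726 * height_m)


# Parameter values as stated in the text for each diet.
energy = {
    "diets (i)-(iv)": eer_male(20, 1.25, 70, 1.77),
    "diet (v)": eer_male(61, 1.11, 73, 1.68),
    "diet (vi)": eer_female(58, 1.25, 62, 1.70),
}

for diet, kcal in energy.items():
    print(f"{diet}: {kcal:.0f} kcal/day")
```

With these inputs, the formulas give roughly 3,057 kcal for diets (i) to (iv), 2,376 kcal for diet (v), and 2,221 kcal for diet (vi).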
Unlike our previous study 1, here we do not impose any lower bound on sodium intake (Supplementary Table S1), under the assumption that the recommended sodium intake is readily achievable in common diets through the consumption of added salt, not necessarily only through raw food consumption. In the case of diet (iii), we do not impose any lower bound on vitamin B12 intake (Supplementary Table S1). This is because a linear programming (LP) problem with variable food weights to satisfy the recommended daily nutrient intake is infeasible as long as the lower bound of the recommended vitamin B12 intake is imposed in diet (iii) (see below for the details). For diets (iv) to (vi), we impose a very tight upper bound on methionine intake, merely 10% above the lower bound of the methionine intake (Supplementary Table S1).

Nutritional fitness of foods across diets
To calculate the nutritional fitness (NF) of each food, we start by constructing irreducible food sets 1, each of which is a set of a small number of different foods. The foods in an irreducible food set together satisfy our daily nutrient demands, and the set is not a superset of any other irreducible food set. To obtain a collection of irreducible food sets, we generated an initial food set by solving the following mixed-integer linear programming (MILP) problem:

Minimise ∑i qi
subject to constraints on the daily energy, the nutrient intakes, and the total food weight, where qi is a binary variable (if food i is in the food set, qi = 1; otherwise, qi = 0), xi is a real variable for the weight of food i to consume per day, E is the daily recommended energy (calorie) that we described above, Lj (Uj) is the lower (upper) bound of the daily recommended intake of nutrient j (Supplementary Table S1), Qj (Rj) is similar to Lj (Uj) but defined by the % of total energy (Supplementary Table S1), W is the limit of the total weight of daily food consumption (W = 4 kg in this study), ei is the energy density of food i, ρij is the density of nutrient j in food i, and cij is the energy density of nutrient j in food i. In the case of diet (iii), the above MILP problem was infeasible as long as Lj of vitamin B12 was imposed. Even the corresponding LP problem with qi = 1 for every food i in diet (iii) was infeasible. Therefore, we set Lj of vitamin B12 to zero in the case of diet (iii). For methionine in diets (iv) to (vi), we set Uj = 1.1 Lj to restrict methionine intake. In the cases of highly personalised diets (v) and (vi), we add the constraint ∑i∈S xi ≤ DS, where S is the collection of certain foods and DS is the limit of the daily consumed amount of foods in S. Specifically, S corresponds to the following foods in Supplementary Data S2: eggs [DS = 12.86 g and 6.43 g for diets (v) and (vi), respectively, which mean ~2 eggs/week and ~1 egg/week], whole milk [DS = 22.29 g and 7.43 g for diets (v) and (vi), respectively, which mean ~150 ml/week and ~50 ml/week], fish [DS = 21.43 g and 14.29 g for diets (v) and (vi), respectively, which mean 150 g/week and 100 g/week], and cheese [DS = 5.71 g and 2.86 g for diets (v) and (vi), respectively, which mean 40 g/week and 20 g/week].
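For orientation, one MILP formulation consistent with the variable definitions above can be written as follows. This is a hedged reconstruction, not the authors' exact statement: the symbol ρij for the density of nutrient j in food i, the equality form of the energy constraint, and the linking constraint xi ≤ W qi are assumptions made here; the objective ∑i qi and the bounds Lj, Uj, QjE, RjE, and W follow from the text.

```latex
\begin{aligned}
\text{Minimise}\quad & \sum_i q_i \\
\text{subject to}\quad
  & \sum_i e_i x_i = E
    && \text{(assumed equality on daily energy)} \\
  & L_j \le \sum_i \rho_{ij} x_i \le U_j
    && \text{for each nutrient } j \text{ with weight-based bounds} \\
  & Q_j E \le \sum_i c_{ij}\,\rho_{ij} x_i \le R_j E
    && \text{for each nutrient } j \text{ with energy-based bounds} \\
  & \sum_i x_i \le W \\
  & 0 \le x_i \le W q_i, \quad q_i \in \{0, 1\}
    && \text{(assumed link between } x_i \text{ and } q_i\text{)}
\end{aligned}
```

The linking constraint forces xi = 0 whenever food i is excluded (qi = 0), so that minimising ∑i qi minimises the number of foods actually consumed.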
The solution to this MILP problem in each diet gave a food set with the minimum size (i.e., minimum ∑i qi). Next, we expanded the collection of food sets by subsequently adding new food sets to the collection. At each step of adding a new food set, this food set is a solution of the above MILP problem, and is constrained to not be a superset of any previous food set in the collection. We only considered food sets with ∑i qi < 6, 7, 7, 6, 7, and 7 for diets (i) to (vi), respectively. If it was not feasible to find more food sets for the collection, the process was terminated. The final collection comprises 52,957, 43,924, 20,713, 4,101, 1,053, and 5,225 irreducible food sets in total for diets (i), (ii), (iii), (iv), (v), and (vi), respectively. Mathematically, such a collection of irreducible food sets is uniquely determined and has no degeneracy. For every diet except diet (i), irreducible food sets were first obtained using IBM ILOG CPLEX solver (v. 12.4) and subsequently obtained using Gurobi solver (v. 7.0.2); for diet (i), we only applied IBM ILOG CPLEX solver, to reduce an otherwise excessively long computation time.
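The expansion of the collection, in which each new food set must not be a superset of any set already found, can be illustrated on a toy problem. The sketch below is not the MILP procedure used in the study: it swaps in brute-force enumeration by increasing set size, and the feasibility test `covers_all` with its `supplies` data is a hypothetical stand-in for the nutrient constraints.

```python
from itertools import combinations


def irreducible_sets(foods, is_feasible, max_size):
    """Enumerate feasible food sets that contain no smaller feasible set."""
    found = []
    for size in range(1, max_size + 1):  # smallest sets first
        for combo in combinations(sorted(foods), size):
            candidate = frozenset(combo)
            # Skip any superset of a set already in the collection.
            if any(prev <= candidate for prev in found):
                continue
            if is_feasible(candidate):
                found.append(candidate)
    return found


# Hypothetical toy data: which of three required nutrients each food supplies.
supplies = {"a": {1, 2}, "b": {3}, "c": {1, 2, 3}, "d": {2, 3}}
required = {1, 2, 3}


def covers_all(food_set):
    """Stand-in feasibility test: the set must supply every required nutrient."""
    covered = set().union(*(supplies[f] for f in food_set))
    return required <= covered


sets_found = irreducible_sets(supplies, covers_all, max_size=3)
```

Because candidates are visited in order of increasing size, every set that passes both checks is irreducible, and the `max_size` argument plays the role of the cap on ∑i qi described above.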
The NFi of food i is given by NFi = log(fi+1)/log(N+1), where fi is the number of irreducible food sets including food i, and N is the total number of irreducible food sets. NFi ranges from zero to one, and a large NFi indicates that food i is nutritionally favourable. For the generalised definition of NFi, any functional form that monotonically increases with fi is acceptable, as long as only ordinal information of NFi matters. Note that fi is capable of quantifying NFi under the condition of small ∑i qi as in this study. Otherwise, it may be hard to estimate the true nutritional adequacy of food i' using solely fi'. For example, a nutritionally poor food i' in an irreducible food set will be easily complemented by many other foods (in the same set) to satisfy the above constraints if ∑i qi is not small enough.
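The NF formula above translates directly into code. In this minimal sketch, the function name `nutritional_fitness` and the three example food sets are illustrative assumptions; only the formula NFi = log(fi+1)/log(N+1) comes from the text.

```python
import math


def nutritional_fitness(food_sets):
    """Return NF_i = log(f_i + 1) / log(N + 1) for every food that occurs."""
    n_sets = len(food_sets)
    counts = {}
    for food_set in food_sets:
        for food in food_set:
            counts[food] = counts.get(food, 0) + 1  # f_i
    return {
        food: math.log(f + 1) / math.log(n_sets + 1)
        for food, f in counts.items()
    }


# Hypothetical collection of N = 3 irreducible food sets.
example_sets = [{"c"}, {"a", "b"}, {"a", "d"}]
nf = nutritional_fitness(example_sets)
```

Foods that appear in no irreducible food set are simply absent from the returned dictionary; by the formula, their NF is zero.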

Key nutrients contributing to the NF of each food
To identify the individual nutrients responsible for the NFs of foods, we measure the following quantity ϕij for each pair of food i and nutrient j: ϕij = 〈ρij xi / ∑k ρkj xk〉, where ρij is the density of nutrient j in food i, xi is the weight of food i to consume per day in a given irreducible food set, the sum over k runs over the foods in that set, and 〈·〉 is an average over all irreducible food sets that include the food i (if ∑k ρkj xk = 0 in any irreducible food set, this irreducible food set is excluded from the calculation). In other words, ϕij represents the food i's contribution to the total amount of the nutrient j in an irreducible food set, on average. The value of ϕij ranges from zero to one (Supplementary Data S1 and S2). We interpret the nutrient j with large ϕij as the main contributor to the food i's NF. For a given value ϕij, we tested its statistical significance by calculating the one-sided P value of how frequently ϕi'j of a randomly-chosen raw food i' is greater than or equal to ϕij (if the raw food i' did not appear in any irreducible food sets, ϕi'j was treated as zero in this calculation).
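The verbal definition of ϕij above (the average share of nutrient j that food i contributes within the irreducible food sets containing it) can be sketched as follows. The function name `phi`, the data layout, and the toy densities and weights are assumptions for illustration and are not values from the study.

```python
def phi(food, nutrient, food_sets, density):
    """Average share of `nutrient` supplied by `food` over the sets containing it.

    food_sets: list of dicts mapping food -> weight consumed per day (g).
    density:   dict mapping food -> {nutrient: amount per g of food}.
    """
    shares = []
    for weights in food_sets:
        if food not in weights:
            continue
        total = sum(
            density[k].get(nutrient, 0.0) * x for k, x in weights.items()
        )
        if total == 0:
            continue  # sets with no such nutrient are excluded, as in the text
        shares.append(density[food].get(nutrient, 0.0) * weights[food] / total)
    return sum(shares) / len(shares) if shares else 0.0


# Hypothetical densities (g protein per g food) and two food sets.
density = {"beans": {"protein": 0.20}, "rice": {"protein": 0.05}}
food_sets = [
    {"beans": 100.0, "rice": 200.0},  # beans supply 20 of 30 g protein
    {"beans": 150.0, "rice": 200.0},  # beans supply 30 of 40 g protein
]
phi_beans = phi("beans", "protein", food_sets, density)
```

Returning zero for a food that appears in no food set mirrors the convention stated above for the significance test.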
Because the aforementioned MILP problem may admit multiple solutions for the food weights xi within each irreducible food set, the specific ϕij value may vary among those solutions. To address this multiple-solution issue, we maximised or minimised each xi in a given irreducible food set, while maintaining all the previous constraints of the MILP problem for this irreducible food set, and thereby found the range of xi allowed by the multiple solutions. The relative difference between the maximum (or minimum) and original xi values [i.e., |xi max(min) − xi org| / xi org, with xi max(min) and xi org denoting the maximum (minimum) and original xi values, respectively] is found to be less than ~0.2 to ~0.35 for the majority (70%) of xi values in every diet. Given this limited variation of xi, the central limit theorem is expected to apply to the calculation of ϕij, which involves a rough 'average' of xi over the irreducible food sets containing food i (see the above definition of ϕij). Therefore, the variation of ϕij arising from the multiple solutions is unlikely to be large if food i has a high NF and thus belongs to many irreducible food sets.

Nutrients at risk of deficiency in each diet
For each diet, we examined whether a given nutrient j is subject to a risk of deficiency in its daily intake, through the calculation of a quantity θj that takes one of two forms. In these forms, Lj (Uj) is the lower (upper) bound of the recommended daily intake of nutrient j, max(·) is the maximum value among all multiple solutions with altered xk's in a given irreducible food set, and 〈·〉 is an average over all irreducible food sets. When calculating max(·), we maximised the value from a corresponding irreducible food set, while maintaining all the previous constraints of the MILP problem for this irreducible food set. If the nutrient j has an upper bound on its recommended daily intake, we calculate the first form of θj; otherwise, we calculate the second. In other words, θj quantifies the nutrient j's maximally possible excess over its minimally required intake level in an irreducible food set, on average. The value of θj ranges from zero to one, and a small θj value indicates a risk of nutrient j's deficiency in a given diet.
For the nutrients whose recommended daily intake is defined in terms of calories, ρkj, Lj, and Uj in θj are replaced by ρkjckj, QjE, and RjE, respectively, where ρkj is the density of nutrient j in food k, ckj is the energy density of nutrient j in food k, E is the daily recommended energy (calorie), and Qj (Rj) is similar to Lj (Uj) but defined by the % of total energy. For the nutrients that have both Lj (or Uj) and Qj (or Rj), we calculate both θj's and take the smaller value (i.e., the value under the stricter condition).

Calculation of the Pearson correlation
We calculated the Pearson correlation between the densities of selenium and protein across raw foods 1. Each selenium or protein density was measured as the quantity per dry weight. Only raw foods having explicit records of both selenium and protein amounts (and at least one of them with a non-zero quantity) were considered.
Note that the Pearson correlation coefficient (r) between two variables Xi and Yi (i = 1, 2,···, N) is easily distorted by the presence of outliers. When we measured the Pearson correlation, we excluded outliers as follows. We first standardised the variables as xi = (Xi − μx)/σx and yi = (Yi − μy)/σy, where μx(y) and σx(y) are the average and standard deviation of Xi (Yi), respectively. In a Cartesian plane, we drew a link connecting the data points Pi = (xi, yi) and Pi' = (xi', yi') if the Euclidean distance between Pi and Pi' was shorter than a certain cut-off dc (we chose dc = 3). In this 'network' of data points, we identified the data points in the largest connected component and considered the others to be outliers. The Pearson correlation was measured only for the data points in the largest connected component.
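The outlier-exclusion procedure above (standardise, link nearby points, keep the largest connected component, then compute r) can be sketched in plain Python. The function name `largest_component_pearson` is hypothetical, the use of the population standard deviation is an assumption, and the cut-off is left as a parameter `d_c` rather than fixed; the data are invented for the example.

```python
import math
from statistics import mean, pstdev


def largest_component_pearson(xs, ys, d_c):
    """Pearson r over the largest connected component of the point network."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)  # population s.d. (an assumption)
    pts = [((x - mx) / sx, (y - my) / sy) for x, y in zip(xs, ys)]

    seen, best = set(), []
    for start in range(n):
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:  # depth-first search over links shorter than d_c
            a = stack.pop()
            component.append(a)
            for b in range(n):
                if b not in seen and math.dist(pts[a], pts[b]) < d_c:
                    seen.add(b)
                    stack.append(b)
        if len(component) > len(best):
            best = component

    kept = sorted(best)
    kx, ky = [xs[i] for i in kept], [ys[i] for i in kept]
    ax, ay = mean(kx), mean(ky)
    cov = sum((a - ax) * (b - ay) for a, b in zip(kx, ky))
    var_x = sum((a - ax) ** 2 for a in kx)
    var_y = sum((b - ay) ** 2 for b in ky)
    return cov / math.sqrt(var_x * var_y), kept


# Invented data: five collinear points plus one distant outlier.
xs = [1, 2, 3, 4, 5, 100]
ys = [2, 4, 6, 8, 10, -50]
r, kept = largest_component_pearson(xs, ys, d_c=1.0)
```

In this example the distant sixth point is disconnected from the rest and is dropped, so the correlation is computed from the five collinear points only.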

Statistical significance
The statistical significance of the correlation (r) between selenium and protein densities across raw foods was tested as follows 1. We first remove the outlier raw foods defined above before generating the null model. Next, we select only the raw foods having explicit records for both selenium and protein amounts (and at least one among them with a non-zero quantity), and we randomly shuffle the densities (quantity per dry weight) of either selenium or protein across those raw foods. The Pearson correlations between such shuffled selenium and protein densities across the raw foods constitute the null distribution that gives a P value.
We measured a P value as follows. Let the N random numbers drawn from a null distribution be sorted in ascending order. Using this sorted sequence, we obtain the two-sided P value of a given number X (= r), with P = 1 if X equals a value Λ determined from the sequence. If P < 2 × 10^−3 for a given value of X (= r), we extrapolate the P value using the estimation P ∝ (1 − |X|)^γ at X → ±1 (i.e., at r → ±1; γ depends on the sign of X, and is estimated from the null distribution of X).
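The shuffling-based null model described in this section can be sketched as below. The helper names, the number of shuffles, and the fixed seed are assumptions for illustration. The exact two-sided P value formula of the study is not reproduced here; the function `empirical_two_sided_p` is a simple stand-in that counts null correlations at least as extreme as the observed r, and it does not implement the extrapolation for very small P values.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    ax, ay = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - ax) * (y - ay) for x, y in zip(xs, ys))
    var_x = sum((x - ax) ** 2 for x in xs)
    var_y = sum((y - ay) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def shuffle_null(xs, ys, n_shuffles=2000, seed=0):
    """Null distribution of r obtained by shuffling one variable across foods."""
    rng = random.Random(seed)
    shuffled = list(ys)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        null.append(pearson(xs, shuffled))
    return sorted(null)  # ascending order, as in the text


def empirical_two_sided_p(r, null):
    """Stand-in P value: fraction of null correlations with |r_null| >= |r|."""
    return sum(1 for v in null if abs(v) >= abs(r)) / len(null)


# Invented, perfectly correlated data for ten hypothetical foods.
selenium = [float(i) for i in range(10)]
protein = [2.0 * s + 1.0 for s in selenium]
r_obs = pearson(selenium, protein)
null = shuffle_null(selenium, protein)
p_value = empirical_two_sided_p(r_obs, null)
```

With a finite number of shuffles, this stand-in cannot resolve P values below 1/n_shuffles, which is the kind of limitation that the extrapolation at r → ±1 described above is meant to address.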