Introduction

Knowing where animals occur is a crucial component of science-based conservation management and global sustainability in the industrialized real world of the Anthropocene and its challenges (e.g.1,2). Methods to obtain such knowledge are commonly neither robust nor very advanced. As per textbook (see for instance3), they are primarily based on inappropriate linear functions4, simplistic use of step-wise coefficients5, frequency statistics and parsimony, unrealistic parametric assumptions, simplistic computing, and the use of relatively few predictors (e.g. < 5 predictor variables), widely 'under-describing' and biasing ecology; examples are shown in6,7,8. These problems are well known and have been described for decades (e.g.4,9,10,11,12), and they reflect poorly on a modern science-based management that should employ readily available computer models and engage with what complex ecology, with its myriad of linkages, really is about. Required progress has been widely insufficient1,2,12. A good example of dealing better with ecological complexities is telecoupling and spill-over effects13. But while widespread and freely available for over two decades, more holistic methods like machine learning algorithms14,15, ensemble models16,17,18 and supercomputing based on widely available open access 'Big Data' are still widely ignored19,20,21, underused and not applied to their potential (11 and citations within), e.g., multivariate analysis done with modern methods (22; see23 for a national application in the subarctic). Considering the global environmental crisis12, progress in globally relevant fields like conservation policy based on multivariate efforts has so far been quite insignificant (e.g.1,2,11). For instance, most species management models still remain in the single-species realm, ignoring species clusters and communities (11; see7 for Resource Selection Functions RSF, and4 for Habitat Suitability Index HSI).
Also, telemetry and geolocator data are still missing for most species and are widely biased in sample sizes and animal strata, frequently still hand-mined for perceived outliers or filtered using 'assumed common-sense' code (an example is shown in24, with an application by25). It is clear that the sheer magnitude and complexity of biodiversity cannot be geo-tagged into a solution, nor should it be. Promoting more geo-tagging efforts and mindsets as proper science keeps conservation far away from the realistic, natural species distribution and from global realities. Lacking even a relevant consideration of scale and autocorrelation, those approaches do not achieve any modern modeling concepts for the urgently needed population-level inference in times of the global biodiversity crisis. It remains a repetitive 'me too' point-and-click science 'group-think'. Such a low-performing institutional culture, without deeper reflection on progress or a guiding vision, still dominates, e.g., in regular SDMs the use of just a few predictors and Maximum Entropy (Maxent) (= a shallow-learning machine learning algorithm26,27). A relevant research design with relevant strata, a mutually accepted taxonomy for sampling, and meaningful absence and availability data linked with socio-economic or higher-precision climate change predictors are all conspicuously absent. For mandated biodiversity management this is often widely impossible even to achieve.
The codified species-habitat models such as HSIs, RSFs, Occupancy Models28 and Species Distribution Models (SDMs;29) widely compete with each other, are often not in mutual agreement, and still use methods that are at least 20 years old (11 and citations within), e.g., Maxent as a leading algorithm in regular SDMs (26,27,29; Maxent as an algorithm comes from the 1960s and has not been improved in relevant terms since the 1980s, still remaining in a probability framework based on parametric assumptions that are dubious to obtain in real-life biology, e.g.4,11). Modern ensemble model approaches based on J. Friedman's paradigm that 'many weak learners make for a strong learner' are, by contrast, few and far between but powerful (30; see also11). For HSIs, RSFs and Occupancy Models, still widely taught and used in the wildlife discipline, its institutions and the federal contractors applied for governance policy, the reality is even worse (based on ambiguous parsimony, linearity, few predictors and dubious model fittings for probability requiring a strict but unrealistic and rarely achieved research design;4,11,28 respectively).

In the meantime, with open access data sources on the rise in the Anthropocene, many managed species are now of great concern and the wider ecology is simply left unaddressed, still governed by an underlying understanding and policy that comes from over 100 years ago (see the dominant legal interpretation of 'Originalism'31; see32 for a critique and its failure). It does not remotely allow for modern, more relevant telecoupling approaches13 and similar concepts (see33 for Deep Ecology and holistic aspects) in the world we actually live in ('the Anthropocene'), or for the massive problems faced by humanity in the future.

Employing best-available methods for confidence in the inference11, and being accurate and precise, matters for proper habitat and species management3. That concept applies even more in areas already deeply affected by the Anthropocene20,21, and under a human-accelerated climate change where a vast environmental onslaught is predicted to occur. Sophistication matters for a good outcome.

Using a new and best-available large open access global geographic information system (GIS) predictor data set for Alaska, here we introduce and show an example of the improved options available: Super SDMs (34; for regular and latest SDMs see35–37, as well as23,27). Here we apply it to a species paradox: the charismatic, circumpolar, but greatly unknown, understudied and misunderstood so-called 'Phantom of the North' (https://abcbirds.org/bird/great-gray-owl/;38), the Great Gray Owl (Strix nebulosa). It is a very popular species in the public eye (featured, for instance, in the 'Into the Wild' movie and book set in remote Alaska39). This species is likely long-lived and has a circumpolar distribution38. Relevant distribution data for this species are, however, scarce and widely missing in Alaska40,41. We introduce here the generic concept of a 'Super SDM'34 based on a widely extended set of open access predictors and the latest computational methods. We investigate and promote it as a new but readily available science-mandated global baseline for inference in species-habitat associations. Knowing best-available species-habitat associations is of crucial importance on a finite planet, while consumption patterns, human population, social inequality, habitat fragmentation, sea levels, global temperatures, etc. are greatly on the rise, compromising wilderness and its species.

Methods

We started with the pioneering study approach presented by42 (based on34,35) and applied it as an update for Great Gray Owls (GGOW; taxonomic serial number TSN 177929) in Alaska. It follows the initial work of43 and was then extended with more, fine-tuned predictors and a cloud computing platform to overcome computing limitations. The workflow is described below and visualized in Fig. 1.

Figure 1

Generic workflow for this study and suggested for Super SDMs. Text in brackets indicates adjustable components as used in this study.

Data

We compiled likely the best-known, publicly available open access occurrence records for GGOWs in Alaska (n = 410), covering the years 1880 to 2019 (see Fig. 2); virtually all data points come from visual detections, whereas relevant nest location information is widely unknown in Alaska and unlikely to be contained in these data. The data are in the public domain (see43,44 for citizen science data), were merged from various publicly available sources and do not carry a unifying underlying protocol or research design (details in43; eBird citation provided further below). Because we let the algorithm take care of the data and its outliers for generalization (sensu11), we did not filter these precious data. Still, misidentifications and erroneous species confusions are virtually impossible for GGOW due to its unique appearance (for more data validity details see43,44,45). GGOWs are not known to occur in clusters and are usually found individually46; thus autocorrelation is not an apparent issue for this species and its data (our 'tree-based' model algorithms are relatively robust to such issues regardless; see11 and citations within). These presence data were merged with 'background data' (pseudo-absences) for all of the study area, resulting in a binary response (presence/absence) for the subsequent data mining and models based on a relative index of occurrence (RIO;11).
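The presence/background merge described above can be sketched as follows; this is a minimal, hypothetical Python/pandas illustration (column names and coordinates are placeholders, not the study's actual schema or records):

```python
import pandas as pd

# Hypothetical presence records (GGOW sightings) and background lattice
# points; coordinates here are illustrative only.
presence = pd.DataFrame({"lon": [-147.7, -149.9], "lat": [64.8, 61.2]})
background = pd.DataFrame({"lon": [-150.0, -152.3, -145.1], "lat": [66.0, 63.5, 60.9]})

presence["presence"] = 1      # observed occurrences
background["presence"] = 0    # study-area lattice points (pseudo-absences)

# One binary-response table for the subsequent data mining and models
response = pd.concat([presence, background], ignore_index=True)
```

In the study itself, the 410 presence records are merged with the full Alaska-wide lattice in the same spirit, yielding the binary response behind the RIO.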

Figure 2

Great Gray Owl sightings in the study area of Alaska.

In addition, we compiled the best-available global open access set of GIS layer predictors. Here we used Alaska as the study area, environmentally described by 100+ predictors ('Big Data'; we currently hold even larger data sets of over 132 and of 230 GIS layers at the global level33), but we focus on Alaska-specific questions and use its continuous predictors (many other categorical predictors remain unused, still awaiting their use and further assessment). The list of utilized predictors is shown in Table 1. This dataset exists in the form of ASCII/TIFF files in a WGS 1984 geographic projection of latitude and longitude in decimal degrees (see Data Availability section and the Appendix section within). For layer creation of Alaska-specific features we also used the Alaska state NAD1983 projection with coordinates in feet for a slightly higher accuracy of local variables.

Table 1 List of predictors for Alaska used in this study; the majority of predictors are climate-related (6 datasets with monthly mean metrics; n = 75) with some topographic (n = 5), biological (n = 5) and human-related ones (n = 15). This data set is a dynamic Open Access GIS layer dataset compiled by Sririam and Huettmann (unpublished, Andrews 2019 and Steiner and Huettmann in review). It lists overall more than 219 GIS Layers for Alaska.

We then used a 1 km point lattice for Alaska, created in the Open Source GIS QGIS (vers. 3.28 Firenze; https://blog.qgis.org/2022/10/25/qgis-3-28-firenze-is-released/). These lattice points were used as background (pseudo-absence) samples to be compared with the presence points in the study area as part of a binary response (see also11,47). The lattice was also later used as a point-prediction grid for the study area for overlays with the predictors (resulting in the 'data cube'), and in turn for scoring the predictions from the model described below to each lattice point (as presented in11). This step is crucial for geo-referencing the obtained predictions, allowing for a spatial representation of the model results. The data cube is exported as a stand-alone table in CSV format consisting of 373,423 rows (lattice points) and 105 columns, with a size of 206 MB.
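As a rough illustration of such a point lattice, the following hedged Python sketch builds a regular lon/lat grid over an arbitrary bounding box (the extent and spacing are placeholders, not the study's exact 1 km Alaska lattice, which was built in QGIS):

```python
import numpy as np

def make_lattice(lon_min, lon_max, lat_min, lat_max, step):
    """Return an (n, 2) array of [lon, lat] lattice points at the given spacing."""
    lons = np.arange(lon_min, lon_max + step, step)
    lats = np.arange(lat_min, lat_max + step, step)
    # Cartesian product of all lon/lat combinations
    grid = np.array(np.meshgrid(lons, lats)).T.reshape(-1, 2)
    return grid

# Illustrative bounding box and 0.5-degree spacing (placeholders)
lattice = make_lattice(-170.0, -130.0, 54.0, 72.0, 0.5)
```

Each lattice point would then be overlaid with the 100+ predictor values (e.g. by raster sampling in a GIS) to form the 'data cube' table described above.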

Thanks to the machine learning approach used here, one is able to handle all the compiled data, including some potentially uncertain data (aka 'bad apples'; see11 and citations within). Thus, we did not engage much in specific data cleaning, transformation or correction of the raw data (= GGOW locations and predictors). Being able to use the data as received speaks to the power of the research design we allow; here we relied on the data sections received (i.e. openly shared with the global public) and brought them together. In this study we let the algorithm 'learn' the signals in the data and handle all the data realities for generalization (sensu48,49; 'inference from predictions' as a core scheme of the approach chosen and promoted by Leo Breiman; see also11 and citations within). We then assess the major predictions with a test using several lines of evidence. Here we apply published and alternative data, e.g. data coming from a research design, as well as several citizen science data sources for this species within Alaska (examples shown in50).

Models and cloud computing

For a proof of concept, we used a basic RandomForest ('bagging', a powerful ensemble model classifier48,49,50,51) run in R on the data cube. To run this analysis, we utilized the R package 'randomForest' (https://cran.r-project.org/web/packages/randomForest/index.html; see52,53 for further justification of this application). We followed Formula 1 for the RandomForest run. Details of the base code we used in R are shown in Appendix 1 (see Data Availability section).

$$\begin{aligned} \text{Formula 1}:\quad & \text{Presence/Background} \sim \text{tmean\_1} + \text{tmean\_2} + \text{tmean\_3} + \text{tmean\_4} + \text{tmean\_5} \\ & + \text{tmean\_6} + \text{tmean\_7} + \text{tmean\_8} + \text{tmean\_9} + \text{tmean\_10} + \text{tmean\_11} + \text{tmean\_12} \\ & + \text{prec\_1} + \text{prec\_2} + \text{prec\_3} + \text{prec\_4} + \text{prec\_5} + \text{prec\_6} + \text{prec\_7} + \text{prec\_8} + \text{prec\_9} \\ & + \text{prec\_10} + \text{prec\_11} + \text{prec\_12} + \text{pdensit1} + \text{ndvi} + \text{globcover} + \text{glc2000} + \text{cloud1} \\ & + \text{cloud2} + \text{cloud3} + \text{cloud4} + \text{cloud5} + \text{cloud6} + \text{cloud7} + \text{cloud8} + \text{cloud9} \\ & + \text{cloud10} + \text{cloud11} + \text{bio\_1} + \text{bio\_2} + \text{bio\_3} + \text{bio\_4} + \text{bio\_5} + \text{bio\_6} + \text{bio\_7} \\ & + \text{bio\_8} + \text{bio\_9} + \text{bio\_10} + \text{bio\_11} + \text{bio\_12} + \text{bio\_13} + \text{bio\_14} + \text{bio\_15} \\ & + \text{bio\_16} + \text{bio\_17} + \text{bio\_18} + \text{bio\_19} + \text{aspect} + \text{solrad1} + \text{solrad2} + \text{solrad3} \\ & + \text{solrad4} + \text{solrad5} + \text{solrad6} + \text{solrad7} + \text{solrad8} + \text{solrad9} + \text{solrad10} + \text{solrad11} \\ & + \text{solrad12} + \text{hf} + \text{mammals} + \text{birds} + \text{distcoasta} + \text{distlakeri} + \text{EucDistTow} \\ & + \text{EucDstAirp} + \text{EucDistFir} + \text{DistPipeli} + \text{World\_MIN1} + \text{World\_MIN2} \\ & + \text{World\_MIn3} + \text{World\_MIn4} + \text{World\_MIn5} + \text{World\_MIN6} \\ & + \text{World\_MIN7} + \text{World\_Min8} + \text{World\_Min9} + \text{World\_Min10} \\ & + \text{World\_Min11} + \text{World\_Min12} + \text{GlobalRive} + \text{WorldSlope} \\ & + \text{WorldRoden} + \text{WorldSoil2} + \text{Model1} \\ \end{aligned}$$
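A run of this kind can be sketched in an analogous form; below is a minimal, hypothetical Python/scikit-learn version (not the authors' R base code from Appendix 1), with synthetic data standing in for the real data cube and the RIO shown as the predicted presence probability:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the data cube: lattice rows x ~100 predictor columns.
rng = np.random.default_rng(42)
n_points, n_predictors = 1000, 100
X = rng.normal(size=(n_points, n_predictors))
# A simple synthetic presence/background signal (placeholder, not real ecology)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_points) > 0).astype(int)

# Bagged ensemble of trees, with out-of-bag scoring as a built-in check
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)

# Relative index of occurrence (RIO): predicted presence probability,
# which would be scored back to every lattice point for mapping.
rio = rf.predict_proba(X)[:, 1]
```

In the actual study the trained R model is scored to all 373,423 lattice points of the data cube, which is where the memory bottleneck discussed next arises.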

Using these data initially on a consumer-grade laptop (16 GB memory), we ran into a run-time memory error, indicating that the analysis is not executable on a common laptop and thus cannot be completed as a model prediction without removing data or simplifying the prediction model. This bottleneck has, thus far, not allowed progress. Here we overcame it with supercomputing in a cloud-computing environment from Oracle Cloud Infrastructure (an Oracle for Research computing-credit grant provided to FH).

An Oracle Cloud virtual machine instance running Oracle Linux 8 was accessed via SSH through Windows PowerShell, with R 4.2.2 installed on the machine. Details of the virtual machine are shown in Table 2. These settings are not on the extreme side of cloud computing but were sufficient to complete the RandomForest run on a Big Data set that otherwise could not have been handled. This showcases the feasibility, magnitude, and potential of the workflow presented in this study, allowing for many subsequent applications.

Table 2 Supercomputing settings.

Model assessment

For a robust inference, model predictions are to be assessed for validity11, ideally with different lines of evidence. While we have exhausted all known publicly available data sources for this species, as available in GBIF.org and43, here we inquired with several alternative and more recent data sources beyond 2019, such as vetted bird watching listservs and citizen science web portals, e.g. iNaturalist (https://www.inaturalist.org/; new data collected).

Results

Data

We were able to compile the best publicly available occurrence dataset for Great Gray Owls (GGOW) in Alaska. It covers a unique time period from 1880 to 2019 and is a testable, quantified research component usable as a point data set (n = 410) in CSV (ASCII) format, originating from various sources and now also existing as a GIS shapefile (see Data Availability section, Appendix 3a within).

Further, we compiled the entire underlying GIS predictor set of over 100 GIS layers for Alaska and make it available (see Data Availability section, Appendix 2 within).

Both data sets are described with FGDC ISO compliant metadata in XML and HTML format (see the respective Data Availability section, Appendix within) to document the data, making the metadata an inherent outcome of this multi-year study.

Model run

For the first time, we were able to complete an open access and open source workflow using Big Data for GGOW with a basic ensemble model algorithm (RandomForest) in the R environment, run on a cloud computing workstation. We obtained good model convergence (Fig. 3). The model ran for c. 8 h; some of the figures required another 1 h overall to complete. Memory usage of the model run peaked at about 80% of the assigned 1,024 GB.

Figure 3

RandomForest model fit (error) by number of trees, showing a good and fast model fit.

Figure 4 shows the variable importance ranks of the 100 predictors we used, which present the basis for the subsequent predictions (Fig. 5) and are discussed for their meaning in the next section.
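Variable importance ranks like those in Fig. 4 can be illustrated with a hypothetical Python/scikit-learn sketch on synthetic data (the study itself used R's randomForest with MSE- and node-purity-based metrics; the impurity-based importances below are only an analogue, and all predictor indices are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 predictors, of which only #2 (strong) and #7 (weaker)
# actually drive the binary response.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 10))
y = (2 * X[:, 2] + X[:, 7] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=3).fit(X, y)

# Rank predictors from most to least important, as in a Fig. 4-style plot
ranked = np.argsort(rf.feature_importances_)[::-1]
```

In such a sketch the two informative predictors rise to the top of the ranking, which is the same multivariate signal-detection behavior exploited by the full 100-predictor model.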

Figure 4

Variable importance using two metrics (MSE, node purity), showing a variety of ecological predictors driving GGOW occurrence, with some predictor groups dominating, e.g. human impacts.

Figure 5

Great Gray Owl raw predictions in the study area of Alaska using randomForest; the relative index of occurrence (RIO) is shown along a color gradient of red (predicted presence) and green (predicted absence).

Model predictions and accuracy

The map shown in Fig. 5 is the first prediction using machine learning ensembles and Big Data ever completed for Great Gray Owls (GGOWs) in Alaska and around the globe using a cloud-computing environment.

Our prediction result shows hotspots and coldspots for GGOWs in Alaska, the state with the largest protected area system in the U.S. However, our predicted ecological niche of GGOW does not match well with traditional range maps: in the predicted ecological niche, the hotspots are primarily found along roads and in urban areas, as well as around human settlements (villages) and industrial areas, including some coastal zones and the Arctic tundra. The predicted coldspots, in contrast, are seen in western Alaska and in other vast sections of Alaska's wilderness, including many protected areas and some wilderness regions. According to the predicted ecological niche (as per11 and citations within), transferred from the geographic niche, this is a robust, quantifiable finding to test further (details shown below for evidence and confidence).

For a wider inference, it becomes clear from Fig. 4 that a multivariate set of ecological predictors, at least 20, drives the occurrence of GGOWs in Alaska: not just a few single predictors, but a wider range of predictors acting together in synergy across a wide environmental spectrum. A parsimonious approach, by contrast, does not capture GGOW's distribution in Alaska and must be biased, adding variance. Seen from that angle, however, the predictor group directly related to human impacts and urbanization stands out (Figs. 4 and 5), whereas the more typical ecological niche predictors like climate and landcover seem to play a much smaller role and are overruled by human/urban predictors. Figures 4, 6 and 7 make clear that GGOWs are found in habitats with a high human footprint, and/or occur next to such habitats, but usually not far away from them or in the remote wilderness. Lakes and fires (54; for the underlying ecology see55,56,57) could be a secondary, weak relationship for GGOW habitats. The predictors Distance to Coast and Proximity to Airports deserve more attention (many predictions are in coastal areas, and a few GGOW presence records come from the Federal Bird Strike airport database (https://wildlife.faa.gov/); as per43). The predictors related to human cities and towns, human footprint, distance to pipeline and human density are among the leading predictors for GGOWs, out of a diverse set of 100 predictors overall (their variable importance ranks are shown in Fig. 4). GGOWs are known to rely on small mammals for prey (e.g.58). Noteworthy in our model findings, however, is the high rank of the predictor called 'model 1', which is the predicted range of the 60+ bark beetle species community59. The correlation of GGOWs with bark beetles is a new finding that has never been described before (see60 for the traditionally reported small mammal link) and should be pursued further in future research projects.

Figure 6

(a–c) Partial dependence plots of the top three predictors using MSE (hf, pdens, hlake).

Figure 7

(a,b) Partial dependence plots of the top two predictors using node purity (EucDistFir, EucDistPipe; the other two partial dependence plots of this group are already shown in Fig. 6).
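Partial dependence plots like those in Figs. 6 and 7 can be sketched with a hedged Python/scikit-learn analogue on synthetic data (the predictor indices below are placeholders, not hf, pdens or the other study variables, and this is not the study's R code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

# Synthetic occurrence driven entirely by predictor 0 (placeholder setup)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Average predicted presence probability as predictor 0 varies,
# with the other predictors held at their empirical distribution.
pd_result = partial_dependence(rf, X, features=[0], kind="average")
curve = pd_result["average"][0]
```

The resulting curve is the model's averaged response to one predictor, which is what the panels in Figs. 6 and 7 display for the top-ranked variables.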

What is the meaning of 'background' in binary presence/pseudo-absence models? Here we model binary predictions in the absence of 'confirmed absence' data points for this species (as shown in47,60). While meaningful absence data, e.g. from a Breeding Bird Atlas, are missing for GGOWs in Alaska, we use a 1 km sample from all of Alaska and its diverse habitats, making it a next-to-perfect comparison with the best-available presence records of GGOWs61, covering a unique time period of 1880–2019.

We explain the mismatches with traditional GGOW maps by a lack of data, parsimony perspectives and methods, previously insufficient predictor sets, and plain human expert assessment and perception errors11,62. The ML/AI methods we present as a Super SDM can help to overcome those problems. It also disproves the 'human-desired' distribution range of the 'Phantom of the North'. At minimum, it provides a quantified and testable predicted ecological niche for GGOW to work from, together with a repeatable workflow.

How good and valid are the predictions achieved?

Using the Receiver Operating Characteristic11,64,65, our internal prediction accuracy shows a ROC value of over 90% for Alaska's lattice points (as provided by the software as a standard performance metric;11 and citations within). Alternative assessment data are more powerful but few (see the overview in43 for GGOW). However, as shown in Fig. 8, the existing ones at least fully confirm the model for the survey areas with high accuracy: the model predictions match the recent bird watching and iNaturalist records 'very well' (almost a 100% match for the locations tested); these records extend the data set by c. 1% of the training data.
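The internal ROC assessment described above can be illustrated with a minimal, hypothetical Python sketch (synthetic labels and scores; in the study the metric came from the R-side software as a standard output):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic presence/background labels and model scores (RIO-like values);
# the scores are deliberately correlated with the labels, as a well-fit
# model would produce.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=1000)
scores = labels * 0.6 + rng.random(1000) * 0.6

# Area under the ROC curve: 0.5 = random, 1.0 = perfect ranking
auc = roc_auc_score(labels, scores)
```

An AUC above 0.9, as reported for the lattice points here, indicates that the model ranks presences above background points almost everywhere.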

Figure 8

GGOW predictions from the RF model run in 'the cloud' supercomputing environment overlaid with the training data (black dots). In addition, alternative Great Gray Owl sightings are overlaid: (a) detailed field assessment from Andrews (2019), and (b) recent sightings of the last 4 years from citizen efforts such as birding listservers (b1,b2), iNaturalist (b3–b5) and Xeno-Canto (b6; 2 entries). This represents approximately an additional 1% of the training data available for this 'elusive' species.

GGOWs are widely described as a species of 'the taiga', e.g. in Google search results. Thus far, there are not many GGOW records for Alaska beyond the Brooks Range and in the Arctic tundra, but some exist (Fig. 5 and evaluation data; Fig. 8). However, already in adjacent Canada and in the Old World, GGOWs are reported at those and at higher northern latitudes. A sound recording was made in the Arctic area that we predict (for the Alaska-Canada border see https://xeno-canto.org/species/Strix-nebulosa). While prey abundance is generically high in those areas, it is not yet known whether the model output there predicts the realized niche or indicates a sister taxon, e.g. the Snowy Owl. Arguably, with the increased shrubification of the Arctic, the boreal ecosystem is already moving north, allowing for GGOW perch sites with prey.

Overall, the prediction results from the workflow we present are, thus far, difficult to beat on evidence, or to prove wrong with the empirical data at hand (see Fig. 8). They are far from overprediction, e.g. for wilderness and protected areas. Until better data become available, specifically GGOW presences and absences, or until nest, migration and telemetry data and expert information for GGOW are provided open access (e.g. from NGOs or governmental records), our results remain as good as they get and are to be used for management for some time to come. All data are publicly available for that reason and allow for extension, assessment, updates and improvements as needed in a quantified open access fashion.

Discussion

Here we present for the first time the best-available Open Access data for the Great Gray Owl (GGOW), as well as its 100+ geographic information system (GIS) habitat predictors for Alaska, with ISO compliant metadata for a public audience. This presents the largest and most modern data set ('Big Data') ever compiled for this species, its environment, and the state of Alaska (= the area in the U.S. with the largest wilderness and protected area system left), covering data from 1880 to 2019 and beyond (assessment data from 2019 onwards).

Further, we were able to run the first Alaska-wide Super SDM of GGOW predictions from such data. Super SDMs can have limitations depending on the data used and should always be assessed with several lines of independent evidence. They are not the ultimate and final statement on species-habitat associations, but they come close34. At minimum, they are low-cost rapid assessments capturing data quantitatively in time and space. They also represent a great leap forward in being more ecological and more inclusive of all available information and synergies, setting a new stage for species-habitat assessments11.

Beyond the data provided, the other strength of this work consists of the conceptual use and workflow of an ensemble model applied in a powerful cloud computing (supercomputer) environment. This allowed us to overcome a traditional computational bottleneck and to use 100 predictors for new findings that could not be achieved before for inference. Overcoming the memory limitations of the traditional computing environment enabled a showcase for new computational and biological insights and progress, e.g. that GGOWs associate consistently with a high human footprint.

We followed the approach of Leo Breiman48,49 to infer from the prediction, as well as Jerome Friedman's (cited in11,30) 'many weak learners create a strong learner'. The actual base code is made available (see Data Availability section, Appendix 4 within) for improvements, and the results were mapped in Open Source GIS for further use and application. Arguably, these ML models can be tested, improved and extended in various ways (for instance, the randomForest in R version can usually be challenged by Leo Breiman's code in the Minitab Salford Predictive Modeler System; https://www.minitab.com/en-us/products/spm/). But here we show a proof of concept with all settings, allowing Super SDMs to be run and established in a quantified and testable fashion.

We further pursued the concept of data mining, which keeps raw data and potential outliers 'as is', because that is a more powerful approach to this vast and otherwise accurate dataset. It leaves the actual ML algorithm to resolve problems and find the best prediction, rather than relying on biased human perception, assumptions and human errors11,65,66,67, and on human meddling with a raft of data and model settings within a complex ecological setting that is widely not understood (23,63,68; see11,65 for alternatives). The same applies to the concept of overfitting (better referred to as a full fit, as per11); randomForest is designed on the principle of 'bagging', which tends to avoid overfitting in the default setting, including a robust handling of outliers and autocorrelation11.
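The 'bagging' principle referenced above can be sketched by hand; below is a hedged Python illustration of bootstrap aggregation with out-of-bag (OOB) checking on synthetic data (the study used R's randomForest, which performs this internally):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data with a simple two-predictor signal (placeholder)
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

n = len(X)
votes = np.zeros(n)   # accumulated OOB presence votes per row
counts = np.zeros(n)  # how often each row was out-of-bag

for seed in range(50):
    idx = rng.integers(0, n, size=n)          # bootstrap sample (with replacement)
    oob = np.setdiff1d(np.arange(n), idx)     # rows this tree never saw
    tree = DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx])
    votes[oob] += tree.predict(X[oob])
    counts[oob] += 1

# Majority vote over only the trees that did NOT train on each row:
# a built-in check against overfitting, no separate hold-out set needed.
seen = counts > 0
oob_pred = (votes[seen] / counts[seen] > 0.5).astype(int)
oob_accuracy = (oob_pred == y[seen]).mean()
```

Because each row is judged only by trees that never trained on it, a high OOB accuracy indicates generalization rather than memorization, which is why bagged ensembles tolerate outliers left 'as is'.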

Biologically, it is known that GGOW populations, and their subsequent habitat needs, are somewhat cyclic58,66,67,68; here we present the year-wide average ecological niche across decades of observations as a testable and quantified prediction. From the raw data and predictions one can already easily show that GGOW is not a 'phantom of the north' (38, see also69) but instead a circumpolar species that also occurs in more southern areas70,71, e.g. in coastal areas and at latitudes of 40 degrees North72,73,74,75,76,77, and that has thus long lived in highly urbanized, industrial, forestry and farming landscapes among humans in the 'Total Anthropocene' (78; for specific GGOW examples in its range see79,80,81,82,83,84,85,86). GGOWs do associate with a high human footprint. In Alaska, albeit well known and enthusiastically reported87,88,89, the GGOW is quite a rare sighting as such, but it is clearly affiliated with human landscapes43. A solid description and an effective GGOW conservation plan with an associated budget exist elsewhere (see90 for Oregon, and91,92 for national forest practices) but are widely missing in (urban) Alaska (93,94; see95,96,97,98,99,100 for specific GGOW field protocols to be used; see101 for Alaska). Using a Super SDM, we can further infer102 and confirm that the GGOW in Alaska (= the state with the biggest wilderness in the U.S., holding its largest national park system) is in essence an urbanized bird that associates with industrial infrastructure, pipelines, roads, urbanized centers and farming. The vast tracts of Alaska, e.g. western Alaska, interior Alaska and the protected areas, are by contrast widely free of reported GGOW sightings and of high numbers/clusters (true for the raw data as well as for the predictions of the ecological niche using over 100 predictors). Essentially, our finding flips how this species must be perceived and managed (e.g. opposite from81,103).
As a minimum estimate, we find that the GGOW is an urbanized species, primarily detected thus far in association with humans and man-made habitats (104; this habitat link can cycle somewhat over the years, and it is even stronger during migration and in wintering areas, as found for a long time already in Alberta and Manitoba, Canada72,95,105,106, and in the Old World107; contrast it with93). A question remains for GGOWs in the high Arctic: whether the species occurs there much, or whether a sister taxon like the Snowy Owl occupies that niche. Arguably, prey is abundant for GGOW there, and so are perching options.

How generalizable are the ecological niche predictions for inference, and for the realized niche? In the wide absence of any relevant research design specific to the GGOW (see108,109,110 for road bias and how it was resolved), of representative sampling, and of an Alaskan Bird Atlas and Nesting Survey for that matter (compare with the Birds of Yukon111, or with bird banding/ringing work elsewhere in the GGOW range, e.g.112), and amid unsubstantiated narratives113, this question currently cannot be answered with ultimate accuracy (compare with114; see101 for owls in Southeast Alaska). Table 3 shows that more data and information exist that could actually be used, but unfortunately they are not presented to us, communicated with the public, or available for the public’s or science’s use. It is clear that much avian and raptor research was done but not shared, and thus opportunity was left unused; this is a generic pattern in wildlife-related research, specifically in Alaska, and for ML/AI applications (see for instance11,115, 116). As SDMs can indeed generalize11,28, here we used all publicly available GGOW information humanly possible to date, from 1880 onwards, in order to achieve these goals.

Table 3 Data sources for Great Gray Owls in Alaska.

While our model prediction assessments are ‘high’, arguably our prediction still underestimates reality and presents an incomplete truth; many pixels await ground-truthing. The limits of data, research design and pseudo-absences alone can potentially constrain inference (e.g.117). Cyclic aspects of the Arctic and its populations are not yet included (e.g.118,119), and more focused data will fill other gaps and provide model updates. However, it is undeniable, from the raw data and the predictions alike, that GGOWs occur in human-dominated areas of Alaska. Those sightings are indeed linked with man-made, urban and industrial habitats, beyond ‘myth’. This matches other wildlife research findings in Alaska, such as50.
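To illustrate the pseudo-absence issue raised above, the following minimal sketch draws pseudo-absence cells at random from a study-area raster while excluding cells holding confirmed presence records. It is a hypothetical, simplified illustration of the general technique (the function name, grid and sampling scheme are our assumptions), not the pipeline used in this study, which would additionally weigh detection bias, distance constraints and environmental strata.

```python
import numpy as np

def sample_pseudo_absences(presence_cells, grid_shape, n_samples, rng=None):
    """Draw pseudo-absence cells uniformly from a raster grid,
    excluding cells that hold a known presence record.

    presence_cells: iterable of (row, col) tuples with confirmed sightings
    grid_shape:     (n_rows, n_cols) of the study-area raster
    n_samples:      number of pseudo-absence cells to draw
    """
    rng = rng or np.random.default_rng(42)
    n_rows, n_cols = grid_shape
    presence = set(presence_cells)
    # Flat indices of all cells that are NOT presences
    candidates = [r * n_cols + c
                  for r in range(n_rows) for c in range(n_cols)
                  if (r, c) not in presence]
    chosen = rng.choice(len(candidates), size=n_samples, replace=False)
    return [(candidates[i] // n_cols, candidates[i] % n_cols) for i in chosen]

# Tiny demonstration on a 10 x 10 grid with three presence cells
presences = [(2, 3), (5, 5), (9, 0)]
absences = sample_pseudo_absences(presences, (10, 10), n_samples=20)
print(len(absences))                           # 20
print(any(a in presences for a in absences))   # False
```

Because such background points are sampled, not observed, any inference drawn from them inherits the sampling scheme's assumptions, which is exactly the limitation noted above.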

This research sets the stage for how habitat models (SDMs) can be run and improved. Leaving out predictors in the pursuit of parsimony, still widely done in most species-habitat work in Alaska to date, must be seen as willful: an untested hypothesis drop that knowingly creates uncertainty and bias and leaves many possible questions unanswered (see11,117, 118 for a vast range of applications). In light of Super SDMs, such scholastic work must be perceived as ignoring best-available options; arguably it has either not done its homework or does not want to use existing data and information and the easily available potential at hand, while better approaches have existed for many decades (see57,120,121,122,123,124 for other applications done in Alaska, and125,126,127,128,129,130,131 for other disciplines).
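A small numerical sketch can show why dropping predictors for parsimony is unnecessary with modern methods. Here we use ridge regularization on synthetic data purely for illustration (this is a swapped-in stand-in, not this study's ensemble ML pipeline; the predictor counts and coefficients are invented): shrinkage, rather than a priori predictor removal, controls overfitting even when most of the 120 candidate predictors are irrelevant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: 500 survey sites, 120 candidate environmental
# predictors. Only the first 5 predictors truly drive occurrence; the
# remaining 115 are noise a parsimony-driven analyst might drop a priori.
n_sites, n_pred = 500, 120
X = rng.normal(size=(n_sites, n_pred))
true_w = np.zeros(n_pred)
true_w[:5] = [1.5, -1.0, 0.8, 0.6, -0.5]
latent = X @ true_w + rng.normal(scale=0.5, size=n_sites)
y = (latent > 0).astype(float)  # 1 = presence, 0 = (pseudo-)absence

# Ridge-regularized least squares on the centered response: the penalty
# shrinks the 115 noise coefficients toward zero automatically.
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(n_pred), X.T @ (y - y.mean()))

pred = (X @ w > 0).astype(float)
accuracy = (pred == y).mean()
print(f"training accuracy with all 120 predictors kept: {accuracy:.2f}")
```

The fitted weights on the irrelevant predictors end up small relative to the true drivers, so keeping all candidate predictors costs little while dropping them untested would discard any signal they might carry.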

As commonly done in wildlife applications, e.g.11,132, here we show a ‘proof of concept’ with first inference. While primarily technical progress, it allows for bigger impacts on improved inference related to species and habitat management, in Alaska and globally. Here we were able to set a new, available and mandatory baseline for inference: we established the Super SDM. Having such concepts available allows for predictions of high accuracy (see132 for 1 m prediction resolution), specifically for impact assessments, e.g. with an optimized survey design133, projected into the future and under climate change (e.g.134,135,136). For Alaska, which already comes from a troubling industrial past (e.g.137), much more industrial development is the current path in the Anthropocene: state-wide mining and nuclear reactors are now tried and planned while the permafrost landscape melts and the boreal forest gets cut down and burns55,138, with a new major sector, seabed mining, exponentially on the rise139. As the decaying fate of natural resources and wilderness has shown140,141, regular ‘modern’ conservation governance has widely failed in Alaska and beyond (12; see for instance Alaska’s salmon crisis, including the disappearance of King Salmon within less than 50 years under such a regime, affecting habitats and the thousand-year-old indigenous cultures relying on them142,143). Here we provide some quantified progress on the best available human options for global sustainability.