Introduction

The environments in which the inhabitants of cities live, work and travel can influence their health, safety and wellbeing both positively and adversely1. For instance, different modes of transportation, such as walking, cycling, driving private cars or motorcycles, and taking public transportation, have implications for ease and cost of mobility and are associated with different risks of injury, levels of physical activity, and emissions of, and exposures to, air and noise pollution2. City markets provide a setting for income generation, access to goods and services and social interactions but also can be settings of social conflict and confrontations3,4,5,6. Single-use plastics, which are a common form of packaging found in many cities, enable working people to have ready access to food, beverage, and goods but are important contributors to solid waste which can block drainage channels (e.g., gutters) if their disposal is not properly managed7. This can lead to water logging problems and exacerbate urban flood risk7,8, which often disproportionately impacts poor communities9,10. Build-up of trash9 and the free roaming of livestock and other animals11,12 can increase the population and diversity of disease vectors. Some features of city environments that affect health, safety and wellbeing are stationary over short periods of time (e.g., buildings, trees and other forms of greenspace). Others, such as traffic or market activities, vary across space and time, as does the scale at which people carry out their daily activities. The combination of how people spend their time in and travel through different parts of a city, and the dynamics of environmental features and the objects that influence them, patterns a population’s experience of the city’s social, commercial and built environments.

The number of people living in cities in sub-Saharan Africa (SSA) increased from 51 million in 1970 to 450 million in 2020, growing faster than any other region13. As cities across SSA grow and change, there is a critical need to understand the spatial and temporal dynamics of urban environmental features and human–environment interactions that are relevant for health, safety and wellbeing. This information is essential for the formulation and evaluation of policies that promote positive outcomes, especially as such policies may be different from those in industrialised countries1,14,15. Our aim was to characterise spatial patterns and temporal dynamics of features of the urban environment that are relevant for health, safety and sustainability, across different neighbourhood types and time scales within a major SSA city, and to provide an approach for doing so in other cities in Africa and throughout other low and middle-income countries. To achieve this aim we collected a novel and bespoke image dataset in the city of Accra, Ghana, and adapted and applied a convolutional neural network (CNN) to these images, to detect features which help to understand the patterns and dynamics of human activity and environment.

Data, methodological context and contributions

Censuses, other routinely collected government data, and economic, transport and health surveys have demonstrated that African cities are expanding rapidly, accompanied by changes in population age structure, socioeconomic status, sanitation and transportation infrastructure, and housing characteristics14,16. However, census and large survey data are costly, take a long time to plan and implement, typically every few years, and hence lack the temporal resolution needed to understand the dynamics of urban life and environment at shorter timescales. These types of data are usually collected at the level of households or neighbourhoods17,18, and do not capture the use of, and interactions with, urban public spaces such as streets, roads and marketplaces. In the sections below, we describe how smart sensing of cities can provide complimentary types of data to censuses and surveys.

Researchers and policy agencies have recently incorporated different kinds of passive and active smart sensing data to study spatially and temporally varying urban environments and activities. The availability, information content, and the spatial and temporal scales of such data vary by domain of urban activity, including transportation, energy, water, telecommunication and retail19. The current collection and use of such data is limited in SSA countries20. Time-stamped and spatially resolved data from phone network usage, mobile social media or Global Positioning Systems (GPS) have been used to study mobility in cities21,22,23, including in some SSA cities24,25,26,27, including sudden changes in their dynamics, for example during the first few months of the COVID-19 pandemic and the corresponding social distancing measures28. However, information on urban environments is typically not available from mobile phone data29. Furthermore, such data, particularly in the SSA context, where there are a large number of urban poor, are not representative of the population since socioeconomic factors influence the ownership, type (e.g. smartphone vs feature phone) and frequency of mobile phone use30,31.

Short-term observations in specific locations in some SSA cities have included rich environmental data32,33,34, but are limited in spatial and temporal coverage. Street-level images, if collected with high spatial coverage, provide granular information on the urban environment and its spatial variations35,36,37,38,39,40,41, which can be extracted using computer vision techniques37. However, imagery data from cities in African countries are limited in quantity, with some parts of the city less represented42. Further, the temporal resolution of such data, typically on an annual basis and primarily during the day, makes it best suited for the analysis of static or slow changing features or average trends. For example, unlike mobile phone data, street images from sources such as Google Street View, have not been used for evaluating changes in urban activity with respect to COVID-19 lockdowns. The methods used to extract features from images have overwhelmingly relied on CNNs37 including object detection methods43,44 for countable and interpretable features (namely “objects”), such as persons45,46, safety barriers47, vehicles48 or street signage49. Because such algorithms were largely trained using data from high-income countries, they are biased towards representations of objects from these countries, leading to misidentification and poorer accuracy when applied to images elsewhere31,50.

Our study makes a number of novel contributions to the use of smart sensing and analytics to understand the urban environment and its impacts on people, especially in rapidly expanding cities in SSA and other developing regions. We present a unique dataset that was created through distributed collection of over two million street-level images at 145 representative sites throughout Accra with high temporal resolution. We systematically adapted computer vision techniques—including transfer learning and data augmentation—to the local environmental and social context. Using these data and methods, the paper presents a characterisation of the dynamics of urban human activity and environment, methodologically bridging the gap between studies focused on spatiotemporal urban mobility patterns and those extracting features of environments from images. We also evaluated whether this approach could detect the impact of policies and interventions on neighbourhood environment, traffic and/or human activities, and how the effects varied across the city. Such a policy was implemented in Accra in response to the COVID-19 pandemic (April 2020 city-wide lockdown), which coincided with our data collection campaign. Finally, we discuss how our results and approach can be used to address data gaps related to the urban environment relevant for health and wellbeing in SSA, and support urban planning and policy decisions to make cities more equitable, sustainable and healthy.

Study location

Our work covered the Greater Accra Metropolitan Area (GAMA, ~ 1500 km2), the administrative boundary of Accra, the capital and largest city in Ghana, with a population of about five million51. GAMA comprises the Accra Metropolitan Area (AMA, ~ 2 million people) at its core, other metropolises and municipalities (e.g., Industrial port city of Tema), and peri-urban and largely rural areas in the periphery. Accra has become one of SSA’s leading hubs for business, technology and education14,52, with large variations and inequalities in individual and community wealth, and a diversity of ethnicities and languages14,52. Accra has diverse land use, built environment, and provision of services which influence the spatial and temporal patterns of health determinants and outcomes53,54. There is virtually no train or tram service beyond a shuttle train between Accra central and Tema, and formal transit bus services are limited52. Therefore, private vehicles and privately owned and informally operated minibuses, known locally as tro tro are the main means of public transportation55,56 along with ride-share cars (e.g. Uber) and motorcycle-taxis.

Study design

Over a ~ 15 month period, we placed cameras (Fig. 1) at 145 sites throughout the Greater Accra Metropolitan Area (GAMA) for either weeklong (n = 135 sites) or ~ year long periods (n = 10 sites) capturing ~ 2.10 million images57. The sites were selected to be representative of the city’s diverse social, physical and natural environment, and were sampled as described in the study protocol paper57 from areas classified as i) formal, mostly low- and medium-density, residential areas; ii) informal, mostly high-density, settlements and slums; iii) commercial, business and industrial (CBI) areas; and iv) “other” areas that are often peri-urban or rural, and can have dense vegetation (i.e., forest, grassland) or barren land (i.e., sand, soil, dirt).

We used an interdisciplinary consensus process to identify 20 distinct objects within the images which are relevant for mobility, safety, leisure and play, daily life activities like shopping, air and noise pollution, and sanitation and hygiene. These objects were grouped as persons and market vendors; large vehicles (lorries, vans, buses and tro tros); small vehicles (cars, taxis and pick-up trucks); two wheelers (bicycles and motorcycles); objects from the market and street vending (market stalls, umbrellas, cookstoves, cooking pots/bowls, food and loudspeakers) which are common in African cities5,6; refuse (debris and trash); and animals. Members of the research team identified and labelled these objects in bounding boxes across 1,250 images. We then divided the labelled images into training, testing and validation sets and used them to retrain and test the performance of a pre-trained CNN as described in Methods and Data. The retrained CNN was then applied to the entire image set to identify the candidate objects in each image (Fig. 2).

Results

Number of people and objects

By applying the algorithm described in Methods and Data, we identified 23.5 million occurrences of the 20 target objects in our 2.10 million images. Of these, 9.66 million (41%) were persons, followed by cars (4.19 million; 18%), umbrellas (3.00 million; 13%) and tro tros (2.94 million; 13%) (Fig. 3). The large number of umbrellas reflect their use over many hours at the same place to protect market and roadside vendors and their merchandise from the sun and rain. The least common objects that were routinely identified were animals (36,712; 0.2%), food (49,672; 0.2%) and bicycles (102,620; 0.4%). Furthermore, our network identified no cookstoves or loudspeakers, and only 14 buses and 98 market vendors.

Spatial patterns and correlations of people and objects

The average counts of people, vehicles and market-related objects in images was correlated across sites (Fig. 4), though there was a stronger correlation among the number of people, large vehicles and market-related objects (correlation coefficients ranging from 0.67 to 0.71), than among these three categories and small vehicles (0.30–0.46) (Fig. 4). Large vehicles (7.32 average counts per image) and market-related objects (7.17 counts) were the most common at a site on a major throughput road (N1 West, Lapaz) that traverses northern AMA (Fig. 5). This site also had the second (17.90 counts) and fourth (7.30 counts) most common occurrences of people and small vehicles, respectively. The inspection of images and observations by the authors indicate that many people walk along this road or wait for and alight from tro tros that connect major parts of the city, which makes it attractive to roadside vendors.

More generally, small vehicles were most frequently found along secondary (average of 3.57 counts per image) and major (3.74 counts) roads, especially those in the AMA and in the municipality of Tema and were the only object category which outnumbered counts of people along major roads (Table 1). Large vehicles were most commonly found on secondary roads (1.61 counts), nearly five times as many as along tertiary roads (0.34 counts) and twice as many as along major roads (0.79 counts). Both small and large vehicles were more frequent in the centre of GAMA (obtained from the population-weighted average of centroids of census enumeration areas from the 2010 national Ghana census, corresponding to a location three kilometres west of Kotoka International Airport) than in peripheral areas (p-value for distance-from-centre gradient =  < 0.0001 for small vehicles and 0.02 for large vehicles). They were also most prevalent at the CBI sites, 50% greater than in high-density residential sites and twice as frequent as in low- and medium-density residential and peri-urban sites (Table 2). Two wheelers were present five-fold more frequently in high-density residential (0.26 counts) and CBI (0.22) sites than other categories but showed no specific spatial pattern as a function of distance from the centre of Accra (p = 0.74).

The presence of market-related objects, like market stalls, umbrellas that shade vendors from the sun, and cooking pots/bowls, largely followed that of people (correlation coefficient = 0.70) (Fig. 4). Within the AMA, people and market-related objects were more frequently observed in high-density residential sites than in CBI ones and nearly three-fold more frequently than in low- and medium-density residential and peri-urban sites (Table 2). Previous studies have indicated that many residents of these neighbourhoods buy food and household items from these vendors, and that in informal neighbourhoods, many families have home-based enterprises and roadside food vending33,58,59, which may explain these patterns. Similar to vehicles, counts of both people and market related objects decreased with distance from the centre of GAMA (p = 0.002 and 0.03 respectively).

Animals and refuse were weakly and inversely correlated with other objects across sites but were themselves positively correlated, since both appeared more frequently in the mostly rural and peri-urban peripheral areas of GAMA (p = 0.004 for refuse and < 0.0001 for animals) and were less frequently observed in CBI areas (Table 2).

Temporal dynamics of people and objects

The frequency of visible objects changed throughout the day and night times, and the extent of variation depended on both the type of object and site location (Fig. 6). When comparing changes in the proportion of images with one or more counts of a given object across different times of day, presence of people increased sharply between midnight and sunrise (6:00) (604–826% increase at different site types), followed by a midday drop before increasing again at sunset (18:00). At sites in all land use categories, market related objects had a unimodal pattern, peaking at midday, possibly because the sun is at peak intensity, leading to more umbrellas (which shade vendors and their produce) to be visible.

In the high-density informal settlements, people were visible in 30–60% of images at different hours during the night, contrasting with peri-urban and low- and medium-density residential areas, where the majority of night-time images contained no people (Fig. 6). Nonetheless, even in high-density settlements, the number of images with six or more people decreased to virtually zero at night. In all but peri-urban areas, small and large vehicles were present in 20–85% of day-time (6:00–18:00) images but only 10–60% of night-time (18:00–6:00) images (Fig. 6). The night-time decline in the proportion of images with one or more counts of vehicles relative to sunset (18:00) was smallest (5–21%) in high-density sites. Manual observation of the images indicated that this relative stability arose from a combination of continued activity at night-time and roadside vehicle parking. Very few (< 5%) night-time images were found to contain animals (Fig. 6).

COVID-19 analysis

During the city-wide lockdown in April 2020, there was a noticeable drop in the number of people, vehicles and market-related objects at CBI sites (Fig. 7). Other research has indicated that people who had non-essential business or could work from home avoided these areas60. In contrast, in the high-density settlements and slums, there was little change in the number of people and small vehicles, and more two-wheelers were detected. The number of large vehicles, including tro tros, declined throughout the city, because fewer people commuted for work and business. There were also fewer market-related objects at all sites. This finding is consistent with government advice to shop locally where possible, and relatively strict enforcement of social distancing and hygiene measures for market traders, which other studies indicated led to some traders being removed from the market or entire marketplaces being closed60. The number of animals and debris or trash increased slightly in the CBI areas during lockdown. In the two-month period immediately after the lockdown ended, the number and temporal patterns of people and all object categories returned to their pre-lockdown levels, with the exception of market-related objects in CBI areas, which only partially returned from lockdown to pre-lockdown levels (Fig. 7). This trend may be because some people working in the government or the private sector did not (fully) return to their workplace, and may indicate a longer lasting impact from COVID-19 restrictions on the informal market and commercial sector in Accra.

Discussion

Our unique dataset of spatially and temporally representative images and our contextually-adapted application of computer vision methodology has revealed the patterns and dynamics of the human–environment interface in Accra, a major African city. These novel insights can inform planning and policy decisions for making Accra, and other cities in Africa, more liveable, sustainable, equitable and healthy as they expand and develop. Below, we discuss key contemporary applications of our data and approach in Accra and other cities, and their integration with other sources, that are relevant for health and sustainability.

Mobility and transport

Our results show that the privately owned and informally operated tro tros were present in large numbers throughout the city’s major roads, while cars and taxis were present on all roads. These patterns arise at least partly because Accra does not have a formal urban public transportation option such as train, tram or even an extensive bus network. The strong correlation between the number of tro tros and the number of people is an indication of their important role in how people move around the city to access jobs and services55,56. It also shows their potential impact on population exposure to air and noise pollution, because they tend to be older diesel vehicles, often imported into the country after they were used in wealthier countries32,33,55,61, a situation exacerbated by limited enforcement of local emission standards. Similarly, the widespread presence of private cars, whose numbers have increased over time as a growing middle-class has purchased more cars62, worsens air and noise pollution and traffic congestion, and may make transportation less equitable among socioeconomic groups55,63.

Our data on people and vehicle categories at different times and locations provide a baseline to inform the design of mobility policies, and our image-based system can be used to monitor how they influence traffic volume and traffic fleet composition over time. They may also be used, together with other mobile-phone based mobility data, to indicate where and when public space is most frequently traversed by pedestrians. As we discuss below, these data are also essential inputs for modelling how policies influence traffic patterns, air and noise pollution, and injury risk.

Environmental pollution, sanitation and waste management

Our data on the spatial and temporal patterns of objects in Accra also help map air and noise pollution and their sources, and guide pollution control policies. For example, in comparing our image-based object count data with previously published measurements of noise levels64 and fine particulate matter air pollution (PM2.5) concentrations65 in Accra, we found that counts of people (Spearman correlation coefficient = 0.48), small vehicles (0.39), large vehicles (0.48), and market-related objects (0.40) were correlated with noise levels (measured in A-weighted decibels (dBA)) recorded within 30 s of the corresponding images collected at the same locations. Weaker correlations were found between co-located PM2.5 concentration and people (0.18), small vehicles (0.12), large vehicles (0.13) and market-related objects (0.13), possibly because PM2.5 has a combination of local and regional sources. Based on these observations and previous studies66,67,68, we expect that our images and the objects within them can be used as variables within predictive models of time-resolved air and noise pollution. In previous work69,70,71,72, image data were used to estimate annual average pollution levels, or temporally varying pollution in a limited number of locations. Our temporally representative data serves as a unique basis for such models with spatiotemporal data on human presence, to estimate population exposure to pollution, and how it may be impacted through planning and policy decisions.

Our data also showed that refuse, consisting of trash (discarded items) and debris (remnants of construction materials), was present throughout Accra, albeit at different levels in inner and outer areas, possibly because household refuse is not systematically and regularly collected in the city73. The presence and accumulation of refuse increases the risk of flooding, because it blocks open drainage channels which overflow with wastewater or rainwater9. These phenomena, as well as the presence of animals which we identified in the city’s peripheries, are also risk factors for vector-borne diseases11. Data on refuse and animals could reveal where these risks are most common and identify targets for their control.

Livelihood and environment in informal settlements

Our results also revealed the extent of human activity in, and the environment of, high-density residential neighbourhoods, many of which are typically classified as informal settlements and slums. In particular, compared to other parts of the city, these neighbourhoods had more human presence and higher volumes of market related activities that are important components of social and commercial networks in African cities and support the livelihood of numerous families74. As Accra develops and land use patterns change, it is essential that the social and economic benefits of these activities are protected. Achieving this aim requires careful upgrading of the environment and services to make these areas healthier and more liveable52,75,76, while putting in place land tenure arrangements that protect their residents and businesses against displacement.

Monitoring and evaluation of policies

Our spatiotemporally resolved images dataset and approach provide a model for a digital urban information system that can be used to evaluate the impacts of policy or technical interventions that affect the city’s infrastructure and environment. An example of this potential can be seen by the fact that our approach revealed the substantial changes in the number of people, vehicles and marketplace activity in commercial, business, and industrial areas that coincided with the introduction of pandemic-related lockdown measures, and the return of people and vehicles, but not market activity, to pre-lockdown levels at those same sites. An emerging policy-related application of this approach is to measure the impact of public transport infrastructure, such as a Bus Rapid Transport (BRT) system77,78,79, on traffic volumes and fleet composition in spatially and temporally resolved ways, as has been done in some countries using images from CCTV networks80,81, which have recently been also installed in Accra and other African cities82,83.

Strengths and limitations

Our work is among the first applications of computer vision methods for object detection in a city in a low- and middle-income country and especially in Africa. We collected a large number of images at sites representative of, and throughout, the entire city, with fine temporal granularity, which allowed us to assess the dynamics of urban environment and activity over space and time. We adapted deep learning models and training datasets to the specific social and environmental context, for example through selection and labelling of tro tros and market related objects, to enhance the local relevance of the data. We used transfer learning with data augmentation in order to maximise the use of our labelling resources. Our network’s performance was systematically analysed with widely used metrics, providing a benchmark for future global applications of computer vision object detection methods to street view imagery. We covered different land use categories, and stratified our results by these classes, which helps envision how urban development and change may influence cities’ environment and human interactions with the environment. Our approach can be scaled up to other cities, particularly those in West Africa, which have shared features with Accra related to local geography, travel characteristics, social structure and economic activities. The general approach may be replicated using any camera technology capable of capturing time-lapsed images (or video), including CCTV networks that are increasingly deployed in African cities83, though appropriate safeguards for privacy are needed for such data. We used open source software, and the weights of our network will be made publicly available, to allow application to other cities, though fine-tuning of the model may be needed to maintain or improve model performance.

Our work also has some limitations. Like any field measurement campaign, there were trade-offs between spatial and temporal granularity of data. Although we had images from over a hundred representative sites across Accra, and the use of stationary cameras allowed us to have temporally resolved images, we did not cover the entire city as data such as Google Street View often do in high-income countries. The number of objects detected within an image is sensitive to the field of view which the cameras capture. Although we implemented systematic positioning and placement of camera height as described in the Methods and Data section, the extent to which objects are visible within the cameras’ field of view may vary between sites and may affect comparison across sites. Finally, although our trained algorithm achieved the target performance threshold, performance varied across objects, with poorer performance for those that were sparsely represented in our training set, like cookstoves and street vendors, or which visually differed from the MS-COCO object categories which the model was pre-trained on, like loudspeakers, and/or have varied visual appearances within the same category, like food. This issue is common in object detection when datasets have uneven frequency across object categories or do not have a unique appearance84. Finally, although the Faster R-CNN algorithm, which we used in our transfer learning approach, was amongst the top performing object detection algorithms at time of our analysis, better performing models become available over time, and may improve performance on some object categories85,86.

Conclusions

To select, target and evaluate policies that aim to improve, and reduce inequalities in, health, safety and wellbeing in growing cities in Africa, there is a critical need for spatially and temporally resolved, representative data on human activities and the environment where they take place. Our work shows that systematic collection of image data and application of computer vision techniques can be used in an interdisciplinary approach to complement traditional administrative data sources to reduce the current data gap in cities in Africa and other developing regions and identify key aspects of urban life in a rapidly growing metropolis. Routine collection of images has the potential to complement traditional data platforms and provide governments, researchers, and civil society groups with additional information to identify areas in need of intervention and track the impacts of policies that address them.

Methods and data

Field study design for image collection

Over a 15-month period (10th April 2019 to 11th June 2020), we placed cameras at 145 sites throughout GAMA57. Sites were set up with permission from residents and owners’ which also helped ensure that the equipment were safe and not interfered with. We operated ten sites for the entire measurement period (referred to as fixed sites throughout the paper) and operated 135 sites for one week each57 (referred to as rotating sites). The rotating sites were selected through stratified random sampling from a dataset representing four land cover and neighbourhood classes for the city (20 m $$\times$$ 20 m resolution)87: formal, mostly low- and medium-density residential areas which mostly contained planned road networks; informal, mostly high-density, settlements and slums with small irregular buildings and narrow unpaved roads; commercial, business and industrial (CBI) areas with large buildings used for commercial, industrial, office or warehouse purposes; and other areas that were largely peri-urban or rural, and have relatively dense vegetation (i.e., forest, farmland and grassland), barren land (i.e., sand, soil and dirt) or water.

We oversampled sites in the AMA, which is the main metropolitan centre of GAMA, and where nearly half of the population of GAMA lives. After target locations were selected, we verified their suitability through inspection of aerial images and site visits. The locations of the fixed sites were chosen to cover different land-use classes, areas with different road types and other microenvironmental features, and neighbourhoods with different socioeconomic status and population density. The locations of the fixed sites were as follows: N1 West at Lapaz and Tema Motorway are at the west and east ends of the multi-lane N1 motorway; Asylum Down is on the Central Ring Road; Jamestown and Nima are poor, densely populated neighbourhoods in south and middle of the AMA, respectively; Taifa is an emerging medium-density neighbourhood north of the city; Labadi is an indigenous Ga community along on the Coast; East Legon is an affluent neighbourhood which has a mix of residential space and buildings that house corporate, commercial and small business ventures; Ashaiman is an emerging low-density residential neighbourhood next to the port city of Tema; and University of Ghana Hill is located on top of the quiet, forested Legon Hill and is a part of the university campus. More details on site selection and their assignment to land cover and neighbourhood classes are provided in the study protocol paper57.

From the 30th of March 2020 to the 20th of April 2020, Accra imposed a city-wide lockdown due to the COVID-19 pandemic. We have images from eight of the 10 fixed sites for as long as the cameras operated and had memory (median 15 days; 25th–75th percentiles 9–18 days). These data were excluded from our main results on temporal dynamics of object counts. However, in order to examine whether image-based analysis can detect changes in activity patterns, we compared object counts and their hourly variation before, during and after lockdown for these sites.

The study protocol was approved by the University of Ghana Ethics Committee and deemed exempt from full ethics review at Imperial College London and the University of Massachusetts Amherst.

Image collection hardware

We used the Moultrie-M50 camera trap for taking images. We selected these cameras for several reasons: First, they had sufficient memory for logging data, battery life, and were rugged so as to withstand weather conditions such as heavy rainstorms, humidity, heat and seasonal dust storms, while capturing high-quality data. Second, the cameras are programmable to take images at fixed intervals and, importantly, automatically switch to night-vision mode when it is dark which allows having data at night as well as day. Third, the cameras capture images with 20-megapixel quality in a 36.7° field of view, which is sufficient resolution for identifying features from the street-level with object detection, whilst allowing storage within camera memory for duration of measurement, and regular back-ups and upload to a data server. We programmed each camera to take a time-stamped image every five minutes which allowed the camera memory to store an entire week of images (n ~ 2000). Images were also captured at night in black and white with infrared flash. These qualities helped address logistical constraints of conducting robust high-quality environmental monitoring in a city with unreliable electricity supply and extreme weather. Additional details on camera specifications are available in the study protocol57.

We deployed cameras in weather protective cases, and affixed them to trees or poles which were either directly in the ground or on a rooftop or balcony (Fig. 1). The target installation height was at ~ 4 m above the ground. However, in practice we had to be within ± 1 m of the target height for logistical reasons. The cameras were mounted in metal protective cases with rotational multi-access brackets for ease of orientation on the outside of a box holding air and noise pollution measurement equipment (Fig. 1). We identified appropriate angles of view for each camera by assessing and adjusting its image angle using a tablet computer. The cameras were oriented such that they captured the public street view to incorporate the main thoroughfare. Some measurement sites had two cameras (90 sites) while others had one (55 sites), based on whether one or two fields of view would be needed to capture the public streetscape. Cameras were placed at all fixed sites and 4–5 rotating sites each week. For the rotating sites, we returned 7-days after initial set-up to take down the equipment and bring it to the lab for cleaning and data download. Fresh equipment was then re-deployed 48 h later at new sites. For the fixed sites, we brought replacement batteries and Secure Digital (SD) cards to the site so as not to have a disruption in continuous monitoring. Lastly, we asked the owner or resident of each site to call a member of the team if they noticed any problems with the equipment.

While we aimed to capture images at all target measurement sites and for the entire measurement period, in some cases, camera equipment failed or shut off due to stress from high temperature or a similar factor, or breakage from prolonged wear and tear. There was also a two-week interruption in continuous fixed-site monitoring in January 2020 due to a scheduled equipment quality control check. Detailed information on the missingness of images is provided in Supplementary Fig. 1.

Selection and labelling of objects in images

We used a systematic interdisciplinary consensus process to identify the objects relevant for various domains in urban environmental research. We first constructed a sample set of 100 images consisting of 10 images from each fixed site. Twenty-four researchers, from social and environmental sciences, urban planning, public health and machine learning, and from countries in Africa, Asia, Europe and North America independently reviewed the images and listed visual features relevant to mobility, safety, leisure and play, daily life activities like shopping, air and noise pollution and sanitation and hygiene. The reviewers were not explicitly informed on what is detectable using computer vision methods. This process led to identification of 113 unique features listed in the Supplementary Information S1.

Since our images were captured in space and time, we filtered these features to only include non-stationary objects. These are singly identifiable and countable “things”, with clear visual boundaries that could change in location or frequency of presence across time, e.g., cars but not trees (which are stationary) or grass (which is both stationary and not countable). After this process, 69 candidate objects remained. Nine of the study researchers independently scored these 69 objects on a scale from one to five on the following attributes: uniqueness compared to other listed objects, frequency as perceived from the sample images and utility or relevance from the perspective of urban environment, health and wellbeing research and policy. The mean for each attribute was calculated across all nine respondents, and the final score for each object was the average of its three attribute means. Empirically, mean frequency score was positively correlated with mean utility score across these 69 objects (Pearson correlation coefficient = 0.84).

We selected the top 20 scoring objects for labelling. We made one post-hoc alteration to the list: combining livestock, which was in the top 20, with goat and dog, which were in the subsequent 15, to create a combined category of animal. The 20 selected objects were: person, market vendor (a person carrying a container over their heads which is a common scene in Accra and other African markets), car, taxi, pick-up truck, bus, lorry, van, tro tro, motorcycle, bicycle, market stall, loudspeaker, umbrella (commonly used to protect market and roadside vendors from the sun and rain), cookstove, cooking pot/bowl (which frequently contain wares for sale in the marketplace), food, trash, (piece of) debris and animal. Examples of these objects can be seen in Fig. 2. For presentation, these objects are grouped into the following categories: people (persons and market vendors); large vehicles (lorries, vans, buses and tro tros); small vehicles (cars, taxis and pick-up trucks); two wheelers (bicycles and motorcycles); objects from the market and street vending which are common in African cities5,6 (market stalls, umbrellas, cookstoves, cooking pots/bowls, food and loudspeakers); refuse (trash and debris); and animals. Members of the research team identified and labelled these objects in bounding boxes across 1,250 images in a process described in SI S2.

Retraining of deep convolutional neural network with transfer learning

Training a CNN for object detection from scratch requires hundreds of thousands of labelled images88 which is not feasible for most applications. Therefore, most object detection applications retrain a pre-trained network using a smaller amount of data, a process known as transfer learning89,90,91. For this work, we retrained the Faster R-CNN model92 (atrous convolutions with an Inception V2 base), which is provided by the Tensorflow Object Detection API V193, with emphasis on enhancing performance for detecting objects that are seen in an African city like Accra. We used this model for two reasons. First, it had one of the best performances of any network when tested on a benchmark dataset at the time of model selection93, while not having excessively high memory requirement, hence balancing accuracy and efficiency. Second, the model was pre-trained on the Microsoft Common Objects in Context (MS-COCO) dataset94, which contains 91 object categories, some of which overlapped with those in our study, e.g., person, umbrella, bus, truck, car, bicycle, motorcycle, and multiple categories of animals and foods. By comparison, in our work, all food and animal types were grouped, while different vehicle types relevant to Accra’s context (e.g., tro tro and taxi) were distinguished. Other locally relevant objects such as trash, debris, market stalls, street vendors, cookstoves and loudspeakers, which were on our list, are not available in common computer vision datasets. We also tried a CNN with the same architecture as Faster R-CNN pre-trained on the Google OpenImages (V4) dataset95 but the performance of the retrained network on our images, as measured by mean Average Precision (mAP), was worse than that pre-trained on MS-COCO.

We retrained the Faster R-CNN model for our object detection task in the following manner. First, we simultaneously stratified the images by frequency and size (as measured by pixel count) of each object category in the image, and colour versus greyscale (which correspond with day and night time images), and split the strata into 60–20-20% (750–250-250 image) subsets for training, validation and testing as described in SI S3. In this way, objects in each category were represented as evenly as possible in all three sets, and under similar visibility conditions. The resulting validation set was used as described below to set training features (e.g., image augmentation operations) and hyperparameters (e.g., learning rate) that would optimise performance. Finally, the test set was used to evaluate the performance of the model (trained on the combined 1000 images of training and validation data) once all features of training had been configured. The retrained CNN was then applied to the entire image set to identify and locate the candidate objects in each image (Fig. 2).

Data augmentation and optimisation of training approach and hyperparameters

We used two types of data augmentation, described in SI S4, an approach that helps the trained network to avoid overfitting and identify objects in a broader set of conditions96,97. We optimised the training approach and hyperparameters as described in SI S5. The optimisation process yielded a 20% improvement in mAP compared with no augmentation or changes to the default learning schedule and maximum proposal number provided in Tensorflow Object Detection API V1. Object-specific improvements in mAP are shown in Supplementary Table 1.

Performance of the retrained network

After finalizing our training approach and parameters, we trained the algorithm on the combined 1000 images of the training and validation sets and evaluated on the 250-image test set which had been entirely held back from the model as described in SI S6.

Fixed site data down-sampling

We had approximately five times more images from fixed sites than from rotating sites. When reporting site-specific results (Figs. 4 and 5), we report object counts per image which eliminates the impact of different image numbers. For presenting hourly variation (Figs. 6, 7 and Supplementary Fig. 3), which combine data across sites, we down-sampled object count data from the fixed site camera images, such that each fixed site camera's contribution to the figures comprises an equivalent amount of data as a camera from a rotating site. For each object, we down-sampled by ordering the images within every hourly interval from each fixed site by the counts of objects, from lowest to highest, and selecting a subset of images from equally spaced quantiles of object counts such that the total number for each fixed site was approximately equal to the 2,016 images collected at the rotating sites. Ordered, stratified down-sampling better preserves the distribution of object counts in the fixed site images than simple random sampling. This process is described in detail in SI 7.

Ethical approval

The study protocol was approved by the University of Ghana Ethics Committee and deemed exempt from full ethics review at Imperial College London and the University of Massachusetts Amherst.