Hybrid YOLOv3 and ReID intelligent identification statistical model for people flow in public places

The statistical model for automatic flow recognition is significant for public place management. However, current models suffer from insufficient statistical accuracy and poor lightweight performance. Therefore, in this study, the structure of the lightweight object detection model "You Only Look Once v3" (YOLOv3) is optimized, and a "Deep Simple Online and Realtime Tracking" (DeepSORT) algorithm with a "Person Re-Identification" (ReID) module is designed, so as to construct a statistical model for people flow recognition. The results showed that the median PersonAP of the designed model was 94.2%, the total detection time was 216 ms, the Rank-1 and Rank-10 accuracies were 87.2% and 98.6%, respectively, and the maximum occupied memory over the whole test set was 2.57 MB, all better than the comparison models. These results indicate that the intelligent identification statistical model for public crowd flow obtained through this design and training has higher statistical accuracy, lower computational resource consumption, and faster computing speed, and therefore has application potential in the management and guidance of crowd flow in public places.


Improved YOLO lightweight model design for intelligent statistics of crowd flow
Crowd statistics can provide data support for the safety management of public places such as bus stops, transportation hubs, and gymnasiums [17][18][19]. However, traditional statistical methods based on machine learning algorithms are inefficient and have large statistical errors. Therefore, this research constructs an intelligent statistical model with advanced performance based on the YOLOv3 algorithm.

Lightweight crowd flow detection algorithm based on improved YOLO
YOLOv3, as a single-stage detector, has shown many superior characteristics in pedestrian detection. Compared with later versions of the YOLO series, YOLOv3 is a classic and widely used object detection algorithm whose network structure and performance have been fully validated. The relative simplicity of the YOLOv3 model also makes it easier to modify and optimize in order to verify the effectiveness of newly proposed methods or strategies. Furthermore, YOLOv3 has lower computational resource requirements than the later models, making it more appropriate in resource-limited environments. Firstly, YOLOv3 fully considers the global nature of the object detection problem: it uses a global convolutional structure to predict the positions and classes of all objects simultaneously in a single forward propagation 20,21. Compared with other neural networks and machine learning algorithms, YOLOv3 has a faster detection speed on pedestrian detection tasks, which meets real-time application requirements 22. In addition, YOLOv3 optimizes the objective function design and introduces multi-scale prediction, which can better handle pedestrians of various sizes and shapes and weaken the detection performance degradation caused by scale changes 9,23. Meanwhile, compared with previous versions of YOLO, YOLOv3 adds presets for more types of anchor boxes, which better adapt to pedestrians in various postures and correspondingly improve detection accuracy 24. YOLOv3 therefore achieves a good balance between detection speed and accuracy in pedestrian detection tasks, so it is selected to build the statistical model for pedestrian flow.
As the third-generation algorithm of the YOLO series, YOLOv3 differs from previous versions in three significant improvements. Firstly, YOLOv3 adopts the concept of the ResNet residual network to ensure that the model can converge normally even with extremely deep network structures. Secondly, YOLOv3 implements a pyramid-like network structure, providing multi-scale prediction at three different sizes and especially enhancing the detection performance for small-size targets. Thirdly, in terms of the loss function, YOLOv3 uses binary cross-entropy loss instead of the traditional softmax loss, which not only improves target prediction accuracy but also allows each bounding box to predict multiple targets [25][26][27]. The YOLOv3 feature extraction network Darknet-53 includes many 3 × 3 and 1 × 1 convolutional layers and is mainly composed of convolution and residual layers. Its complex depth structure is the reason for the slowdown of YOLOv3 training and detection 28. Five down-sampling operations are carried out in the network, features are output from the last three layers, and three scales of YOLO layers are generated after processing in the pyramid feature space 29,30. Darknet-53 dominates the feature extraction work, and the YOLO layers are responsible for the interaction between the different feature layers. The above improvements give YOLOv3 stronger human image recognition capabilities, which is one of the reasons for choosing this algorithm to build the model. In Fig. 1, "Conv" is the abbreviation of "Convolutional", and "Conv2d" stands for two-dimensional convolution. The numbers after each layer type in the figure are, from left to right, filter size, number of neurons, and step size.
YOLOv3 has shown excellent results in the field of pedestrian detection, but it still faces two major challenges. One is the increase in required storage space due to massive network parameters, which seriously affects its real-time performance during detection. The other is that against changing street backgrounds, differences in pedestrian target size and occlusion easily cause missed detections. The main network of YOLOv3, Darknet-53, includes 52 convolutional layers and a fully connected layer, and its massive computation and parameters hardly satisfy real-time requirements. In contrast, YOLOv3-tiny is a lightweight detection method with fewer than 9 million parameters and only 33.7 MB of storage space, so it trains fast, has low video memory requirements, and detects quickly. However, its ability to extract complex image features is weak, and its detection performance is poor under drastic lighting changes, serious occlusion, or small targets. In view of this, this chapter improves on YOLOv3 and proposes an I_YOLOv3 algorithm. The overall architecture is shown below; the number after the output layer label in Fig. 2 represents the output size.
The first step of the I_YOLOv3 algorithm is network prediction, which uses the I_YOLOv3 network model for forward propagation. The network uses MobileNet as the backbone and replaces Darknet-53 in YOLOv3 with depthwise separable convolution, reducing parameters and computational complexity. The second step is output processing: the tensor output of the network is reshaped into appropriate dimensions. Each grid cell predicts multiple bounding boxes, including bounding box coordinates, category confidence, and object confidence. The third step is confidence screening, which applies a sigmoid function to the object confidence of each bounding box and sets a threshold; bounding boxes with confidence higher than the threshold are retained as possible detection targets. The fourth step is coordinate inversion: based on the prediction method of I_YOLOv3, the actual coordinates of the bounding box are calculated from the offset of the bounding box and the prior box information obtained through K-means clustering. The fifth step is Non-Maximum Suppression (NMS), applied to the predicted bounding boxes of the same category to eliminate redundant boxes with high overlap while retaining the bounding box with the highest score. The sixth step outputs the results, draws the filtered bounding boxes on the original image, and labels the predicted categories to complete the pedestrian detection task. The output size O of the I_YOLOv3 algorithm follows Eq. (1).
Eq. (1) gives O = (I − K + 2P)/S + 1, where I represents the input image size, K the convolution kernel size, P the amount of padding, and S the convolution stride. In the backbone network, Darknet-53 is replaced by a MobileNet lightweight network using depthwise separable convolution, because the latter reduces the parameters and computational complexity of the entire algorithm and is thus more suitable for crowd statistics in public places. The depthwise separable convolution and standard convolution structures are shown in Fig. 3.
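As a minimal sketch (not the paper's code), the output-size relation of Eq. (1) can be written as a one-line helper; the 416-pixel input used in the comment is the common YOLOv3 input size, assumed here for illustration:

```python
def conv_output_size(i: int, k: int, p: int, s: int) -> int:
    """Eq. (1): output size O = (I - K + 2P) / S + 1 for one spatial dimension."""
    return (i - k + 2 * p) // s + 1

# e.g. a 416-pixel input, 3x3 kernel, padding 1, stride 2 halves the resolution:
# conv_output_size(416, 3, 1, 2) -> 208
```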
Depthwise convolution uses a separate convolution kernel for each channel, while point-wise convolution combines channels with 1 × 1 kernels. D_F is the width and height of the feature map. M and N are the number of input and output channels, respectively. D_K is the width and height of the convolution kernel of the depthwise convolution and standard convolution. If the size of the output feature map is D_F × D_F and the convolution kernel is D_K × D_K, the computational amount S_1 of the standard convolution is shown in Eq. (2): S_1 = D_K × D_K × M × N × D_F × D_F.
In contrast, the computational amount S_2 of depthwise separable convolution is shown in Eq. (3): S_2 = D_K × D_K × M × D_F × D_F + M × N × D_F × D_F.
The two terms before and after the plus sign in Eq. (3) are the computational amounts of depthwise convolution and point-wise convolution, respectively. Therefore, the ratio of the computational amount of the depthwise separable convolution to that of the standard convolution is described in Eq. (4): S_2/S_1 = 1/N + 1/D_K².
Equation (4) shows that the computational cost of the depthwise separable convolution is 1/N + 1/D_K² times that of the standard convolution. For example, when N is much larger than D_K² and D_K is 3, the computational cost of the depthwise separable convolution is about 8/9 lower than that of the traditional convolution. With this approach, MobileNet significantly reduces the amount of computation and parameters, thereby reducing redundant expression.
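The cost comparison of Eqs. (2) through (4) can be checked numerically with a short sketch; the layer dimensions below (3 × 3 kernel, 32 input channels, 64 output channels, 56 × 56 feature map) are illustrative assumptions, not values from the paper:

```python
def standard_conv_cost(dk: int, m: int, n: int, df: int) -> int:
    # Eq. (2): S1 = Dk * Dk * M * N * Df * Df
    return dk * dk * m * n * df * df

def depthwise_separable_cost(dk: int, m: int, n: int, df: int) -> int:
    # Eq. (3): depthwise term + point-wise (1x1) term
    return dk * dk * m * df * df + m * n * df * df

# The ratio S2 / S1 collapses to Eq. (4): 1/N + 1/Dk^2
ratio = depthwise_separable_cost(3, 32, 64, 56) / standard_conv_cost(3, 32, 64, 56)
```

For this example the ratio is 1/64 + 1/9 ≈ 0.127, i.e. the separable layer costs roughly one eighth of the standard one.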
Before predicting the target box, the anchor boxes are redefined by K-means clustering to accelerate model convergence. Without this step, the size ratios of the preset anchor boxes differ considerably from the targets in the dataset, because the initial candidate box selection does not consider the actual labeling information of the pedestrian dataset, which increases missed detections of small pedestrians. The K-means clustering method therefore optimizes the anchor box sizes and proportions. The clustering distance is measured by the intersection over union (IoU) between the candidate box and the actual bounding box, as shown in Eq. (5): d(box, center) = 1 − IoU(box, center), where center is the cluster center and box is the clustering sample.
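The IoU-based clustering distance of Eq. (5) can be sketched as follows, under the usual assumption for anchor clustering that boxes are compared by width and height only, aligned at a common corner:

```python
def iou_wh(box: tuple, center: tuple) -> float:
    """IoU of two boxes given as (width, height), aligned at a common corner."""
    inter = min(box[0], center[0]) * min(box[1], center[1])
    union = box[0] * box[1] + center[0] * center[1] - inter
    return inter / union

def kmeans_distance(box: tuple, center: tuple) -> float:
    # Eq. (5): d(box, center) = 1 - IoU(box, center)
    return 1.0 - iou_wh(box, center)
```

A box identical to its cluster center has distance 0; the more the shapes differ, the closer the distance gets to 1, so K-means groups anchors of similar aspect and scale.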

Mixed improvement of the flow statistics model based on YOLO and ReID processing
The improved YOLO algorithm designed above is used to detect pedestrian targets. However, in actual application scenarios, the input data contains video, and the detected pedestrians need to be tracked in order to count the flow. Therefore, an intelligent pedestrian tracking algorithm is designed.
DeepSORT is a typical and commonly used object tracking algorithm, which applies a Kalman filter in image space to complete target prediction and uses the Hungarian algorithm to associate inter-frame data one by one. The algorithm measures correlation using the overlap ratio of bounding boxes and shows excellent performance on high-frame-rate videos. The computational cost of the improved YOLO algorithm has been greatly reduced; however, the ReID network in DeepSORT is shallow and adapts poorly to the dataset, so the ReID module is improved to obtain Improved DeepSORT (I_DeepSORT). The main improvement of this algorithm is a Convolutional Neural Network (CNN) architecture based on deep cosine metric learning, which improves the extraction of appearance features. In terms of measurement, the algorithm adopts a comprehensive metric that combines the appearance and motion information of the target through linear weighting and enhances the weight of appearance features, with the aim of improving the matching between recognition and tracking targets and thereby enhancing algorithm performance. The I_DeepSORT algorithm includes ReID enhancement, trajectory processing, and other components.
Firstly, in terms of network structure, the improved deep CNN structure of I_DeepSORT is shown in Table 1. The main improvement is the addition of a four-layer residual network. By increasing the depth of the model, the network can learn more complex feature representations, thus improving its ability to identify pedestrian identities. However, deepening the network hierarchy tends to slow down convergence, because the input distribution of the activation function may gradually approach the saturation region of the nonlinear function, causing the gradients of lower-layer neurons to vanish during back propagation. Batch normalization (BN) re-normalizes data whose distribution has shifted, amplifying gradients and accelerating the learning of the network, which helps keep the inputs of the activation function in the sensitive region of the nonlinear function. Therefore, a BN operation is introduced after each added residual layer to ensure fast convergence even as the network depth increases.
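The BN operation described above can be sketched in a few lines of numpy (training-mode statistics only; running averages, and the learned per-channel gamma/beta, are simplified to scalars for illustration):

```python
import numpy as np

def batch_norm(x: np.ndarray, gamma: float = 1.0, beta: float = 0.0,
               eps: float = 1e-5) -> np.ndarray:
    """Re-normalize a batch (rows = samples) to zero mean / unit variance
    per feature, then rescale by gamma and shift by beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # centered inputs stay in the
    return gamma * x_hat + beta             # sensitive region of the nonlinearity
```

Because the normalized activations have zero mean and unit variance, the nonlinearity is fed inputs away from its saturation region, which is the convergence benefit the text describes.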
Then the loss function of the network is redesigned. Common neural network loss functions include the "0-1", Hinge, cross-entropy, and logarithmic losses, among which the latter two are the most widely used. Their calculations are shown in Eqs. (6) and (7), respectively.
In Eq. (6), L_CE = −(1/n) Σ_x [y ln a + (1 − y) ln(1 − a)] is the cross-entropy loss function of the binary classification problem, where y is the true label of sample x, a is the prediction output, and n is the total number of samples. In Eq. (7), L_log(y, p(y|x)) = −log p(y|x) is the value of the logarithmic loss function, where p(y|x) is the predicted probability of label y given the input x. However, to minimize the intra-class difference and maximize the inter-class difference, the network is trained with the Cosine softmax loss function, whose expression is shown in Eq. (8).
In Eq. (8), L_CS = −(1/n) Σ_i Σ_{k=1}^{C} I_{y_i=k} log p_CS(y = k | r_i), where I_{y_i=k} is the indicator function, evaluated as 1 when y_i = k and 0 otherwise. p_CS(·) is the Cosine softmax classifier in Eq. (9): p_CS(y = k | r) = exp(κ · w̃_k^T r) / Σ_{c=1}^{C} exp(κ · w̃_c^T r). Here κ is a free scaling factor, w̃_k^T is the normalized and transposed weight coefficient vector, r is the feature representation produced by the parameterized encoder network trained jointly with the classifier, and C is the maximum number of dataset labels. By minimizing the cross-entropy between the true label distribution and the Cosine softmax estimated probability, this loss function makes the estimated probability of the correct class close to 1 and the estimated probabilities of the other classes close to 0.
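A minimal numpy sketch of Eqs. (8) and (9), assuming (as in cosine metric learning generally) that both the feature and the class weight vectors are L2-normalized before the dot product:

```python
import numpy as np

def cosine_softmax(r: np.ndarray, W: np.ndarray, kappa: float) -> np.ndarray:
    """Eq. (9): p(y=k|r) = exp(kappa * w_k^T r) / sum_c exp(kappa * w_c^T r),
    with the feature r and each row of W L2-normalized."""
    r = r / np.linalg.norm(r)
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    logits = kappa * (W @ r)
    e = np.exp(logits - logits.max())       # stabilized softmax
    return e / e.sum()

def cosine_softmax_loss(r: np.ndarray, W: np.ndarray,
                        kappa: float, label: int) -> float:
    # Eq. (8): cross-entropy against the cosine softmax estimate
    return float(-np.log(cosine_softmax(r, W, kappa)[label]))
```

The scaling factor κ controls how sharply probability mass concentrates on the nearest class direction, which is what pushes the correct-class probability toward 1.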
Then, the trajectory processing and state estimation process of the I_DeepSORT algorithm is designed. The trajectory state is represented in the 8-dimensional space (u, v, γ, h, u̇, v̇, γ̇, ḣ), which includes the center coordinates (u, v) of the predicted bounding box and the predicted height h and aspect ratio γ of the pedestrian target box; the remaining four dimensions represent the velocities of the above parameters in image coordinates. With the help of a standard Kalman filter, target motion prediction is carried out based on a uniform-motion, linear-observation model, yielding the prediction (u, v, γ, h) at a standard frame rate of 25 fps. The algorithm sets a counter for each tracked target in order to count during Kalman filter prediction. The tracker's counter is reset when a track is successfully matched to a detection. If a tracker consistently fails to match within a specific time period, it is removed from the tracker list. When a new detection appears (i.e., one that does not match the existing tracking list), a new tracker is created for it. If the position prediction of the newly tracked target matches the detection result in three consecutive frames, the algorithm considers a new target to have appeared; otherwise, it is treated as a "false alarm" and the tracker is removed from the tracker list.
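One constant-velocity prediction step over the 8-dimensional state can be sketched as below; the process-noise model q is a simplified illustrative assumption (the real filter uses state-dependent noise), and dt matches the 25 fps frame rate mentioned above:

```python
import numpy as np

def kalman_predict(x: np.ndarray, P: np.ndarray,
                   dt: float = 1 / 25, q: float = 1e-2):
    """One uniform-motion Kalman prediction for the 8-d track state
    (u, v, gamma, h) plus their velocities."""
    F = np.eye(8)
    F[:4, 4:] = dt * np.eye(4)   # position components += velocity * dt
    Q = q * np.eye(8)            # simplified isotropic process noise (assumption)
    return F @ x, F @ P @ F.T + Q
```

Note that with no matched observation the covariance P only grows, which is exactly the occlusion-induced uncertainty that the cascade matching rule below has to compensate for.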
The information association and cascade matching rules of the algorithm are then designed. In terms of information association, the algorithm uses the Mahalanobis distance to describe the distance between the detection frame and the prediction frame, which evaluates the correlation of target motion information in Eq. (10).
In Eq. (10), d^(1)(i, j) = (l_j − p_i)^T X_i^(−1) (l_j − p_i) is the Mahalanobis distance between the detection frame and the prediction frame, where p_i is the predicted position of the i-th tracker, l_j is the position of the j-th detected target box, and X_i is the covariance matrix between the detected and averaged tracking positions. Taking into account the continuity of the target's motion state, a Mahalanobis distance matching filter is used, with the threshold set at the 95% confidence interval of the chi-square distribution. At the same time, because Mahalanobis association becomes invalid when the camera is in motion, association based on target appearance information is introduced; the calculation process is as follows.
Firstly, the appearance feature vector r_j of each detection l_j is calculated, with ||r_j|| = 1. The second step is to generate a gallery for each tracked target, containing the feature vectors successfully associated over the most recent 100 frames. The third step is to calculate, for each tracker, the minimum cosine distance between the gallery and the current frame's detection features; if the distance is less than the threshold, the association is successful. This distance is calculated in Eq. (11): d^(2)(i, j) = min{1 − r_j^T r_k^(i) | r_k^(i) ∈ R_i}.
In the final measurement, the improved algorithm uses a fusion metric that combines target appearance and motion information through linear weighting to perform the association measurement, so as to improve the matching between detections and tracking trajectories. This process is described in Eq. (12): c_{i,j} = λ d^(1)(i, j) + (1 − λ) d^(2)(i, j), where λ is the weighting coefficient.
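The three association measures of Eqs. (10) through (12) can be sketched together; the weighting value λ = 0.3 in the example call is an illustrative assumption, not a value taken from the paper:

```python
import numpy as np

def mahalanobis_sq(l: np.ndarray, p: np.ndarray, X: np.ndarray) -> float:
    # Eq. (10): d1(i,j) = (l_j - p_i)^T X_i^{-1} (l_j - p_i)
    d = l - p
    return float(d @ np.linalg.inv(X) @ d)

def min_cosine_distance(gallery: list, r: np.ndarray) -> float:
    # Eq. (11): d2(i,j) = min_k (1 - r_j^T r_k) over the tracker's gallery,
    # assuming all feature vectors are unit-normalized
    return float(min(1.0 - g @ r for g in gallery))

def fused_cost(d1: float, d2: float, lam: float) -> float:
    # Eq. (12): linear weighting of motion (d1) and appearance (d2) distances
    return lam * d1 + (1.0 - lam) * d2

cost = fused_cost(mahalanobis_sq(np.array([1.0, 0.0]), np.zeros(2), np.eye(2)),
                  min_cosine_distance([np.array([1.0, 0.0])], np.array([1.0, 0.0])),
                  lam=0.3)
```

Raising λ trusts motion continuity more; lowering it trusts appearance more, which is what the text means by enhancing the weight of appearance features.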
In terms of cascade matching, if an object is occluded for a period of time, the uncertainty of the Kalman filter prediction increases because the state can no longer be observed. If two trackers then compete for the matching right to the same detection, the covariance of the trajectory whose position has not been updated due to occlusion grows, increasing the uncertainty of its predicted position. When associating detections with trajectories, this makes detections easier to associate with objects that have been occluded for a long time, which greatly damages tracking continuity. Cascade matching effectively solves this problem by matching trajectories in order of their time since last observation, ensuring that the most recently seen objects are given the highest priority. The specific flow of this idea is shown in Fig. 4.
At this point, the improved I_DeepSORT algorithm has been designed. Next, the counting rules for crowd flow are determined, and the statistical model of crowd flow is constructed by combining the improved I_YOLOv3 and I_DeepSORT algorithms. For crowd flow statistics, the model uses virtual counting lines: the flow of people is counted by judging whether the pedestrian trajectory (formed by the center point of the tracking window) crosses a line. Due to the high randomness of pedestrian behavior, especially the uncertainty of pedestrians' motion direction and lingering, a two-way counting rule is designed as shown in Fig. 5. In this rule, a pedestrian trajectory from line A to line B is defined as downward: when the trajectory crosses line B, the number of people going down is increased by 1; when a pedestrian crosses line A upward, the number of people going up is increased by 1. The sum of the numbers going up and down is the total count. The improved I_YOLOv3 and I_DeepSORT algorithms are used for pedestrian target detection and pedestrian target tracking, respectively, forming the intelligent people flow statistical model shown in Fig. 6. The model counts people through the cross-line method, achieving two-way people flow statistics.
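The two-way counting rule can be sketched as follows, assuming image coordinates where y grows downward and line A sits above line B (the function and variable names are illustrative, not from the paper):

```python
def count_crossings(track_ys: list, line_a_y: float, line_b_y: float):
    """Bidirectional cross-line counting over one trajectory of tracking-window
    center y values. Crossing line B downward counts one 'down'; crossing
    line A upward counts one 'up'."""
    up = down = 0
    for y0, y1 in zip(track_ys, track_ys[1:]):
        if y0 <= line_b_y < y1:      # was at/above line B, now below: down
            down += 1
        elif y1 <= line_a_y < y0:    # was below line A, now at/above: up
            up += 1
    return up, down, up + down
```

A trajectory that lingers between the two lines without crossing either contributes nothing, which is how the rule tolerates pedestrians who stop or turn back.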

Performance test based on improved YOLO flow statistics model
After designing the statistical model for pedestrian flow recognition based on the improved I_YOLOv3 and I_DeepSORT algorithms, experiments are carried out to verify the application value of the model in people flow statistics, and the advantages and disadvantages of the model are discussed.

Test plan construction
Testing experiments are conducted using two publicly available pedestrian datasets, Caltech and CUHK Occlusion Pedestrian. The Caltech dataset is widely used in the computer vision field and contains multiple sub-datasets, such as Caltech101 and Caltech256. Accuracy PersonAP, Log Average Miss Rate (LAMR), Multi-Object Tracking Accuracy (MOTA), and Multi-Object Tracking Precision (MOTP) are used as evaluation metrics for model performance. PersonAP is calculated in Eq. (13).
In Eq. (13), PersonAP = precision_C / N(Tot_Ima)_C, where precision_C is the sum of the average recognition precision of all pedestrian types in all images, and N(Tot_Ima)_C is the number of images containing pedestrian objects among all tested images.
The LAMR averages the miss rate over the interval [10^−2, 10^0] of false positives per image, taking the logarithmic average at 9 uniformly log-spaced points, as in Eq. (14).
In Eq. (14), missrate(·) is the recognition miss rate, i.e., the ratio of the number of images with recognition errors to the total number of images tested. With other conditions constant, the smaller the LAMR value, the higher the detection accuracy of the model.
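A sketch of the log-average computation in Eq. (14), assuming (as is standard for this metric) a miss-rate curve sampled against false positives per image (FPPI), with step interpolation at the nine reference points; this is an illustrative reading of the equation, not the paper's code:

```python
import numpy as np

def lamr(fppi: np.ndarray, miss_rate: np.ndarray) -> float:
    """Log-average miss rate: geometric mean of the miss rate at nine FPPI
    reference points evenly spaced in log space over [1e-2, 1e0].
    fppi must be sorted ascending with fppi[0] <= 1e-2."""
    refs = np.logspace(-2, 0, 9)
    # step interpolation: take the curve value at or just below each reference
    ms = [miss_rate[np.searchsorted(fppi, f, side="right") - 1] for f in refs]
    ms = np.clip(ms, 1e-10, None)          # guard against log(0)
    return float(np.exp(np.mean(np.log(ms))))
```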
The MOTA calculation method is shown in Eq. (15): MOTA = 1 − (FN + FP + ID_SW)/GT.
In Eq. (15), FN, FP, ID_SW, and GT represent the number of false negatives, the number of false positives, the number of identity switches, and the number of ground truth objects, respectively.
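Eq. (15) is direct to sketch; the example counts in the test are illustrative:

```python
def mota(fn: int, fp: int, id_sw: int, gt: int) -> float:
    """Eq. (15): MOTA = 1 - (FN + FP + ID_SW) / GT."""
    return 1.0 - (fn + fp + id_sw) / gt
```

Note that MOTA can go negative when the summed error count exceeds the number of ground truth objects, which is expected behavior of the metric rather than a bug.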
The MOTP calculation method is shown in Eq. (16): MOTP = Σ_{t,i} d_{t,i} / Σ_t c_t.
In Eq. (16), i denotes a detection target, d_{t,i} is the measured distance between detection target i and its true value in frame t, and c_t is the number of successful matches in frame t.
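A minimal sketch of Eq. (16), assuming the per-frame match distances are collected as lists (the data layout is an illustrative choice):

```python
def motp(frame_distances: list, frame_matches: list) -> float:
    """Eq. (16): total matched distance d_{t,i} over all frames, divided by
    the total number of successful matches c_t.

    frame_distances: one list of matched distances per frame
    frame_matches:   number of successful matches c_t per frame
    """
    total_distance = sum(sum(dists) for dists in frame_distances)
    total_matches = sum(frame_matches)
    return total_distance / total_matches
```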
Table 2 shows the software environment, hardware environment, and parameter settings of the improved flow recognition model used in the test. Hyper-parameter ranges are first determined by manual experience and then tuned by grid search.
In order to improve the reliability of the test results, the common target detection algorithm YOLOv3, CenterNet proposed in 32, and the Receptive Field Block network (RFBNet) proposed in 33 are selected as comparison models.

Analysis of test calculation results
Firstly, in order to verify the superiority of the proposed method, an algorithm ablation experiment is conducted and compared with the recently proposed CenterNet and RFBNet algorithms. The average accuracy of the target detection algorithm is evaluated by Mean Average Precision (mAP), the accuracy of multi-target tracking by MOTA, and the precision of the tracking results by MOTP; Frames Per Second (FPS) is used to evaluate real-time performance, and model size to evaluate the size and storage requirements of the model. The experimental results are shown in Table 3.
According to the experimental results in Table 3, the algorithm proposed in this study achieved significant improvements in multiple aspects. In terms of object detection, I_YOLOv3 improved mAP by 2% compared with the baseline, demonstrating better detection accuracy. In the DeepSORT algorithm with four layers of ResNet added, MOTA increased by 5% and MOTP decreased by 30 cm, indicating a significant improvement in multi-target tracking accuracy; although the tracking precision slightly decreased, it was still within an acceptable range. In addition, the FPS of I_YOLOv3 increased from 20 to 25, improving real-time performance, and the model size was reduced by 20 MB, indicating that the algorithm was optimized in terms of computational efficiency and storage requirements. Compared with CenterNet and RFBNet, despite slight shortcomings in mAP and MOTA, I_YOLOv3 and its combination with DeepSORT perform excellently overall, especially in real-time performance and model size, verifying the superiority of the proposed method. The performance of each model during training is analyzed in Fig. 7, where "I_YO_SORT" represents the statistical model designed in this study. The horizontal axis represents the number of iterations, and the vertical axes in Fig. 7a,b show the loss function value and the PersonAP value, respectively. As iterations increased, the loss function of each statistical model gradually decreased, and the PersonAP value gradually increased and stabilized. The I_YO_SORT model converged at about 168 training iterations, the fastest convergence speed. After 300 iterations, training of each model was completed, and the loss functions and PersonAP values of the I_YO_SORT, CenterNet, RFBNet, and YOLOv3 models were 0.86, 1.47, 4.82, and 4.93, and 98.7%, 96.8%, 95.3%, and 94.9%, respectively.
After training each model, the study conducts application experiments in several crowded places in Shanghai. The statistical results of pedestrian detection PersonAP and detection time are shown in Fig. 8, where the horizontal and vertical axes describe the different pedestrian detection models and the PersonAP indicator values. To test model stability, each test scheme is repeated 30 times; differences are tested by T-test at a significance level of 0.05. Figure 8 shows that the median PersonAP of the I_YO_SORT, CenterNet, RFBNet, and YOLOv3 pedestrian detection models was 94.2%, 84.5%, 71.3%, and 65.1%, respectively, and the P values of the T-tests between the last three models and the I_YO_SORT model were all less than 0.05, i.e., the differences were significant.
To test the running speed and the sensitivity of each pedestrian detection model to the number of training samples, the testing set detection time of each model under different training sample numbers is tested, as shown in Fig. 9; each protocol is also performed in 30 replicates. In Fig. 9a,b, the horizontal axis represents the number of training samples. As the number of training samples increases, the average detection time and total detection time of each model generally show an upward trend, because the more samples participate in training, the higher the overall complexity of the trained CNN; the connections between neurons become more complex, which increases the computation time, although the increase is relatively small. When the training set contained 2988 images, the total detection time of the I_YO_SORT, CenterNet, RFBNet, and YOLOv3 pedestrian detection models was 216 ms, 638 ms, 482 ms, and 271 ms, respectively. The Rank-k metric evaluates detection accuracy, representing the probability that the top k images in the confidence ranking of the prediction output contain the correct label; the statistical results are shown in Fig. 10, where the horizontal axis is the value of k and the vertical axis is the corresponding Rank-k accuracy. As k increased, the Rank-k accuracy of each model also increased, and the average Rank-k accuracy of I_YO_SORT was always higher than that of the other comparison models. When k was 1 and 10, the average Rank-k accuracy of the I_YO_SORT, CenterNet, RFBNet, and YOLOv3 models was 87.2%, 84.8%, 82.6%, and 81.3%, and 98.6%, 96.2%, 92.5%, and 91.0%, respectively.
Table 4 compares four multi-target tracking performance indexes of each model under different Rank-k conditions: MOTA, MOTP, the proportion of hit trajectory hypotheses to trajectory ground truth (Mostly Tracked, MT), and the proportion of lost target trajectories to trajectory ground truth (Mostly Lost, ML). From Table 4, the values of the I_YO_SORT model on MOTA and MOTP were higher than those of the other three models under the same conditions, indicating higher multi-target tracking accuracy. At the same time, the MT and ML indicators of the I_YO_SORT model were lower than those of the other three models, indicating that it improved the integrity of the overall tracking trajectory.
Figure 11 compares the memory consumption of each model during tracking counts on the testing set. In Fig. 11, the horizontal axis represents the number of testing set samples participating in the calculation, the vertical axis represents the corresponding memory consumption, and a gray dotted line is drawn for reference. From the perspective of subjective evaluation, the statistical effect of crowd recognition of each model is also analyzed: 20 domestic object detection experts are invited to conduct multi-dimensional satisfaction scoring of each model's flow statistics, out of 10 points, with higher scores indicating higher satisfaction. The evaluation results are shown in Fig. 12. The median subjective scores of pedestrian detection and tracking for the I_YO_SORT, CenterNet, RFBNet, and YOLOv3 models were 9.3, 8.3, 7.2, and 7.5, and 9.2, 8.3, 7.5, and 7.6, respectively, with I_YO_SORT scoring highest overall.

Conclusion
Firstly, the model has not yet been widely deployed and tested in actual commercial-grade products. Secondly, in complex and constantly changing environments, the performance of the model may be affected. To make up for these shortcomings, future work should continue to explore performance optimization of the model under extreme weather conditions and integrate new technologies into the existing model to continuously improve its performance. Through these efforts, it is expected to provide more accurate, efficient, and intelligent solutions for crowd management in public places.

Figure 7. Comparison of training effects of pedestrian flow statistical models.

Figure 8. Comparison of PersonAP and detection time in the testing set of the flow statistical model.

Figure 10. Comparison of Rank-k indicators of each model testing set.
To solve the problems of insufficient recognition accuracy and high resource consumption, a lightweight intelligent statistical model for people flow based on improved YOLOv3 and ReID was designed. The test results are as follows. The I_YO_SORT model converged after about 168 training iterations, the fastest convergence speed. The median PersonAP of the I_YO_SORT, CenterNet, RFBNet, and YOLOv3 pedestrian detection models was 94.2%, 84.5%, 71.3%, and 65.1%, respectively, and the P values of the T-tests between the last three models and the I_YO_SORT model were all less than 0.05, i.e., significantly different. When the training set contained 2988 images, the total detection time of the I_YO_SORT, CenterNet, RFBNet, and YOLOv3 pedestrian detection models was 216 ms, 638 ms, 482 ms, and 271 ms, respectively. With the increase of k, the Rank-k accuracy of each model also increased, but the average Rank-k accuracy of I_YO_SORT was always higher than that of the other comparison models. The values of the I_YO_SORT model on MOTA and MOTP were higher than those of the other three models under the same conditions, while its MT and ML indicators were lower. When the number of samples participating in the test was 130 and 1280, respectively, the memory consumption of the I_YO_SORT, CenterNet, RFBNet, and YOLOv3 models was 2.42 MB, 7.81 MB, 12.85 MB, and 5.31 MB, and 2.57 MB, 12.48 MB, 25.62 MB, and 5.74 MB, respectively. The median subjective scores of pedestrian detection and tracking of the I_YO_SORT, CenterNet, RFBNet, and YOLOv3 models were 9.3, 8.3, 7.2, and 7.5, and 9.2, 8.3, 7.5, and 7.6, respectively, and the overall scores of I_YO_SORT were higher. The test results show that the flow statistics model designed in this study has higher tracking accuracy, faster statistical speed, and less resource consumption. Although the improved intelligent human traffic statistics model based
on YOLOv3 and ReID designed in this study has significantly improved performance, there are still some research shortcomings.

Figure 12. Comparison of subjective evaluation scores.
The CUHK dataset is a general term for several datasets published by the Chinese University of Hong Kong (CUHK). It is the first large-scale pedestrian re-identification dataset sufficient for deep learning: 43,264 images of 13,264 individuals are collected from six existing person re-identification datasets (including CUHK03) 31. The research team screens out images with missing labels and completely occluded pedestrians, obtaining a total of 4268 valid data images, partitioned into a training set and a testing set at 7:3, which ensures sufficient data for model training while retaining sufficient test data to evaluate model performance. An IoU threshold of 0.3 is used for NMS, which allows the model a certain degree of fault tolerance when locating pedestrians. The appearance feature vector dimension of 512 ensures the richness of feature expression, which helps improve the accuracy of detection and ReID. Weight decay of 0.0005 is used for regularization to prevent over-fitting and ensure the model's generalization ability.
Scientific Reports | (2024) 14:14601 | https://doi.org/10.1038/s41598-024-64905-9

Table 2. Software and hardware environment, improved flow identification model parameter setting scheme.

Table 3. Results of the algorithm ablation experiments.

Table 4. Comparison of multi-target tracking performance indexes of various models under different Rank-k conditions.

Figure 11. Memory consumption for each model testing set tracking counts.