An eight-camera fall detection system using human fall pattern recognition via machine learning on a low-cost Android box

Falls are a leading cause of unintentional injuries and can result in devastating disabilities and fatalities when left undetected and untreated. Current detection methods suffer from one or more of the following problems: frequent battery replacement, wearer discomfort, high cost, complicated setup, furniture occlusion, and intensive computation. In fact, all non-wearable methods fail to detect falls beyond ten meters. Here, we design a house-wide fall detection system capable of detecting stumbling, slipping, fainting, and various other types of falls at 60 m and beyond, including through transparent glass, screens, and rain. By analyzing fall patterns with machine learning and crafted rules on a local, low-cost single-board computer, true falls can be differentiated from daily activities and monitored through conventional, commercially available surveillance systems. Occlusion can be avoided either with a multi-camera setup in one room or with single cameras mounted at high positions. The system's flexibility enables a wide-coverage setup, ensuring safety in senior homes, rehab centers, and nursing facilities. It can also be configured as a high-precision, high-recall application to capture every single fall in high-risk zones.


Histogram of Oriented Gradients (HOG) Feature Extraction
Consider first a tiling of an image (a video frame) into a rectangular grid of cells, as shown in Supplementary Figure S1. Each cell is associated with 25 bins whose values are initialized to 0; the purpose of these 25 bins will be addressed shortly. Spatially, each cell also consists of some number of pixels, and the inset in Figure S1 shows one such example in magnified view. For a pixel, we approximate its horizontal and vertical intensity changes as (∆x, ∆y) = (R − L, T − B), where L, R, T, and B are the intensities of the left, right, top, and bottom spatial neighbors of the pixel, respectively (Supplementary Fig. S2). With this (∆x, ∆y), we compute a dot product with each of the 8 unit vectors shown in Supplementary Figure S3a representing 8 unsigned directions, and a dot product with each of the 16 unit vectors shown in Supplementary Figure S3b representing 16 signed directions, 8 of which are essentially the opposites of the 8 unsigned directions. Lastly, we note the unsigned direction whose dot product has maximal magnitude, and the signed direction whose dot product has maximal value. For ease of ongoing discussion, say the unsigned direction chosen for the pixel under consideration is 67.5° and the signed direction chosen is 247.5°. Let m_orientation be this maximal magnitude, which is the absolute value of the dot product with the 67.5° or the 247.5° unit vector. Finally, let m_texture be the sum of the absolute values of all 8 dot products from the unsigned directions. The choice of the names "orientation" and "texture" will be discussed shortly. Imagine for now that each of the 25 bins of a cell has a name: eight of the names are unsigned 0°, unsigned 22.5°, …, unsigned 157.5°; sixteen are signed 0°, signed 22.5°, …, signed 337.5°; and one is texture.
Then, for the cell C in which the pixel in the example above resides, we increment its bin values as follows: the unsigned 67.5° bin and the signed 247.5° bin are each incremented by m_orientation, and the texture bin is incremented by m_texture. Note that the unsigned and signed directions chosen in the manner above must always be identical or differ by 180°; the proof is straightforward and is hereby omitted. We traverse the pixels in the image and accumulate the bin values as depicted above. Each cell's bins are incremented multiple times, as a cell typically consists of multiple pixels; therefore, these bin values are not necessarily sparse.
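The per-pixel direction selection and per-cell bin accumulation described above can be sketched as follows. This is a minimal, unoptimized illustration: the cell size, function names, and the 25-bin layout (8 unsigned + 16 signed + 1 texture) are assumptions chosen to match the discussion, not the authors' implementation.

```python
import numpy as np

CELL = 8  # assumed cell side length in pixels (a tunable choice)

# 8 unsigned directions (0°, 22.5°, ..., 157.5°) and 16 signed (0°, ..., 337.5°)
unsigned_dirs = np.array([[np.cos(np.deg2rad(22.5 * i)),
                           np.sin(np.deg2rad(22.5 * i))] for i in range(8)])
signed_dirs = np.array([[np.cos(np.deg2rad(22.5 * i)),
                         np.sin(np.deg2rad(22.5 * i))] for i in range(16)])

def hog_bins(gray):
    """Return an array of shape (rows, cols, 25): 8 unsigned + 16 signed + texture."""
    h, w = gray.shape
    bins = np.zeros((h // CELL, w // CELL, 25))
    for y in range(1, h - 1):          # border pixels lack a full neighborhood
        for x in range(1, w - 1):
            cy, cx = y // CELL, x // CELL
            if cy >= bins.shape[0] or cx >= bins.shape[1]:
                continue
            # (Δx, Δy) = (R − L, T − B) from the four spatial neighbors
            d = np.array([float(gray[y, x + 1]) - float(gray[y, x - 1]),
                          float(gray[y - 1, x]) - float(gray[y + 1, x])])
            du = unsigned_dirs @ d            # 8 dot products
            ds = signed_dirs @ d              # 16 dot products
            i_u = int(np.argmax(np.abs(du)))  # maximal |dot|: unsigned direction
            i_s = int(np.argmax(ds))          # maximal dot: signed direction
            m_orientation = abs(du[i_u])
            m_texture = np.abs(du).sum()      # sum over all 8 unsigned responses
            bins[cy, cx, i_u] += m_orientation       # unsigned bin
            bins[cy, cx, 8 + i_s] += m_orientation   # signed bin
            bins[cy, cx, 24] += m_texture            # texture bin
    return bins
```

Since m_texture is the sum of all 8 response magnitudes while m_orientation is only the largest of them, the texture bin accumulates at least as much mass per pixel as any orientation bin.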
An image typically consists of three channels representing its color information. The HOG feature extraction algorithm above does not specify which channel the pixel intensity value comes from, and there are two possibilities. First, we carry out the algorithm three times, once per color channel, and for each pixel simply collect m_orientation and m_texture from whichever channel gives the maximum value. Let us denote this possibility by Grayscale HOG, where each cell has 25 bins forming effectively a 25-vector. Second, we keep all three outcomes intact, effectively expanding the bin count of each cell to 25 × 3 = 75. Let us denote this possibility by Color HOG, where the bins of each cell form essentially a 75-vector.
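The two channel-handling choices can be sketched as below. For Grayscale HOG, we interpret "whichever channel gives the maximum value" as picking, per pixel, the channel with the largest central-difference gradient magnitude; this criterion, like the function names, is an assumption for illustration only.

```python
import numpy as np

def grayscale_channel(rgb):
    """Per pixel, keep the intensity from the channel with the strongest
    gradient response, yielding one channel to feed the 25-bin HOG above."""
    f = rgb.astype(float)
    dx = np.zeros_like(f)
    dy = np.zeros_like(f)
    dx[:, 1:-1] = f[:, 2:] - f[:, :-2]     # horizontal change per channel
    dy[1:-1, :] = f[:-2, :] - f[2:, :]     # vertical change per channel
    mag = np.hypot(dx, dy)                 # (h, w, 3) gradient magnitudes
    pick = mag.argmax(axis=-1)             # winning channel per pixel
    return np.take_along_axis(f, pick[..., None], axis=-1)[..., 0]

def color_hog_cells(bins_rgb):
    """Color HOG: keep each channel's 25 bins, giving a 75-vector per cell."""
    # bins_rgb: three (rows, cols, 25) arrays, one per channel
    return np.concatenate(bins_rgb, axis=-1)
```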
It is worthwhile to pause here and consider the intuitive motivation behind the algorithm, as reflected in the naming convention. We note that m_orientation is collected from the dot product of maximal magnitude, which corresponds to the orientation in which the signal response is strongest, thereby capturing the intuitive notion of orientation, a crucial ingredient for shape and general fall pattern. Furthermore, m_texture aggregates the magnitudes of all orientations; without favoring any orientation, it captures an intuitive, if rough, notion of texture, in which patch surface appearance and consistency matter. In Grayscale HOG, we reduce information from three color channels to effectively one, mimicking a "grayscale" image. Of course, this is very different from the conventional reduction of an RGB image to a grayscale one; the name is used by analogy only. Finally, in Color HOG, we genuinely retain the information of each color channel (in the form of HOG features), which allows us to assess how color, or the lack thereof, affects the performance of our system.
We now convert the h-vector, where h = 25 or 75, of each cell into a single discrete value known as a texton. Consider first a (large) set of HOG h-vectors collected as described above over each frame of every video in the training data set. We apply the standard k-means clustering algorithm to obtain k = 400 clusters for this set of h-vectors. Each cluster corresponds to a specific texton, and we can use these 400 clusters as a dictionary to assign any h-vector to a specific texton by determining its closest cluster in the Euclidean space ℝ^h. Now, any rectangular subregion R of an image is simply a tiling of textons, as shown in Supplementary Figure S4. Consider the entire region R1 = R as shown in Supplementary Figure S5a, the 4 equally partitioned regions R2, R3, …, R5 in Supplementary Figure S5b, and the 16 equally partitioned regions R6, R7, …, R21 in Supplementary Figure S5c, all from the same image. In each of these 21 regions, we tabulate its residing textons into a 400-bin histogram representing the frequency of occurrence of each texton in the region. Concatenating these histograms, we see that any rectangular subregion can be represented as a 21 × 400 = 8400-vector. Finally, the histogram intersection function is applied to pairs of 8400-vectors to form a type of spatial pyramid kernel 1, which can be fed directly to the RVM. This establishes our image semantics-based feature engineering. It is important to note that numbers such as 8 (the number of discrete orientations) or 400 (the dictionary size) are tunable parameters. We use concrete numbers in our discussion only for ease of understanding and to avoid algebraic notational clutter, without any loss of generality.
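The texton dictionary and the 1 + 4 + 16 spatial-pyramid representation can be sketched as follows, using a minimal Lloyd's k-means in place of a library implementation; all function names are illustrative, not the authors' code.

```python
import numpy as np

K = 400  # dictionary size (a tunable parameter, as noted in the text)

def fit_dictionary(vectors, k=K, iters=20, seed=0):
    """Cluster training h-vectors into k texton centroids (k x h array)."""
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        dist = ((vectors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = vectors[labels == j].mean(axis=0)
    return centers

def assign_texton(vec, centers):
    """Texton id = index of the closest centroid in Euclidean R^h."""
    return int(((centers - vec) ** 2).sum(axis=1).argmin())

def pyramid_vector(texton_grid, k=K):
    """Concatenate k-bin histograms over 1 + 4 + 16 = 21 regions."""
    rows, cols = texton_grid.shape
    hists = []
    for s in (1, 2, 4):                       # 1x1, 2x2, 4x4 partitions
        for i in range(s):
            for j in range(s):
                region = texton_grid[i * rows // s:(i + 1) * rows // s,
                                     j * cols // s:(j + 1) * cols // s]
                hists.append(np.bincount(region.ravel(), minlength=k))
    return np.concatenate(hists)              # length 21 * k (8400 for k = 400)

def histogram_intersection(u, v):
    """Kernel entry for the spatial pyramid kernel: sum of element-wise minima."""
    return np.minimum(u, v).sum()
```

The pairwise histogram_intersection values over training instances form the Gram matrix that is handed to the kernelized RVM.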

Relevance Vector Machine (RVM) Learning Framework
We now use RVM 2 to train our detectors, with details of the learning framework as follows. Consider first the extracted features represented as the design matrix Φ of size N × M, where N is the number of training instances and M is the number of features. Denote the j-th column of Φ by φ_j, 1 ≤ j ≤ M. We seek a preferably sparse M-vector w for which σ(w·x), where x is a feature vector and σ(a) = 1/(1 + e^(−a)) is the sigmoid function, approximates the probability that the data is ham (human/a falling event) rather than spam (not human/not a falling event). The sigmoid ensures that σ(w·x) lies in the range (0, 1), so that x would be classified as ham if, for example, σ(w·x) > 0.5.
Consider a zero-mean Gaussian prior p(w | α), where each component w_j of the M-vector w satisfies w_j ~ N(0, α_j^(−1)) with hyperparameter α_j, for each j. Let t_n ∈ {0, 1} be the ham/spam ground-truth indicator variable for each training instance. As the posterior p(w | t, α) cannot be maximized over w in closed form, we follow the standard approximation technique of iteratively reweighted least squares (IRLS) to find a local optimum, using iterative updates akin to the second-order Newton method with gradient ∇ ln p = Φᵀ(t − y) − Aw and Hessian ∇∇ ln p = −(ΦᵀBΦ + A), where y is the N-vector of predicted probabilities y_n = σ(w·x_n), A is the diagonal matrix of the α_j, and B is an N × N diagonal matrix with entries β_nn = y_n(1 − y_n), yielding upon proper
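The IRLS/Newton update for the posterior mode, with the hyperparameters α held fixed, can be sketched as below. This is a minimal illustration of the gradient and Hessian just given, not the full RVM training loop (which also re-estimates α); names are illustrative.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid σ(a) = 1 / (1 + e^(−a))."""
    return 1.0 / (1.0 + np.exp(-a))

def irls_mode(Phi, t, alpha, iters=50):
    """Find w maximizing ln p(w | t, alpha) by Newton (IRLS) updates.

    Phi:   (N, M) design matrix
    t:     (N,) ground-truth labels in {0, 1}
    alpha: (M,) prior precisions (the diagonal of A)
    """
    N, M = Phi.shape
    w = np.zeros(M)
    A = np.diag(alpha)
    for _ in range(iters):
        y = sigmoid(Phi @ w)                 # predicted probabilities y_n
        grad = Phi.T @ (t - y) - alpha * w   # ∇ ln p = Φᵀ(t − y) − Aw
        B = np.diag(y * (1.0 - y))           # β_nn = y_n(1 − y_n)
        H = -(Phi.T @ B @ Phi + A)           # ∇∇ ln p = −(ΦᵀBΦ + A)
        w = w - np.linalg.solve(H, grad)     # Newton ascent step
    return w
```

Because the Gaussian prior makes the log posterior strictly concave, these Newton steps converge to the unique mode, around which the Laplace approximation is taken.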