AndroMalPack: enhancing the ML-based malware classification by detection and removal of repacked apps for Android systems

Due to the widespread usage of Android smartphones in the present era, Android malware has become a grave security concern. The research community relies on publicly available datasets to keep pace with evolving malware. However, a large share of the apps in those datasets are mere clones of previously identified malware. The reason is that, instead of creating novel versions, malware authors generally repack existing malicious applications to create malware clones with minimal effort and expense. This paper investigates three benchmark Android malware datasets to quantify repacked malware using package-name-based similarity. We consider 5,560 apps from the Drebin dataset, 24,533 apps from the AMD dataset, and 695,470 apps from the AndroZoo dataset for analysis. Our analysis reveals that 52.3% of the apps in Drebin, 29.8% of the apps in AMD, and 42.3% of the apps in AndroZoo are repacked malware. Furthermore, we present AndroMalPack, an Android malware detector trained on clone-free datasets and optimized using nature-inspired algorithms. Although trained on reduced versions of the datasets, AndroMalPack classifies novel and repacked malware with a detection accuracy of up to 98.2% and low false-positive rates. Finally, we publish a dataset of cloned apps in Drebin, AMD, and AndroZoo to foster research in the repacked malware analysis domain.

• The loudness of the bats can vary from a large value A_0 to a constant minimum value A_min.
Based on the aforementioned rules, the BA is presented in Algorithm 1. The closing steps of Algorithm 1 are:
    rank the bats and find the current best x*
    end while
    return x*

Appendix B Firefly Algorithm
FA is a meta-heuristic optimization algorithm inspired by the flashing patterns and behavior of fireflies. It is formulated using the following three rules:
• Fireflies are attracted to other fireflies based on the intensity of their brightness.
• The attractiveness and brightness of a firefly decrease as its distance from other fireflies increases. A firefly moves randomly if it cannot find a brighter firefly.
• An objective function determines the brightness of a particular firefly.
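The three rules above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `firefly_minimize`, its parameters (`beta0`, `gamma`, `alpha`, `seed`), and the exponential attractiveness term `beta0 * exp(-gamma * r^2)` are assumptions taken from the commonly used form of FA. The sketch minimizes the objective, so a lower objective value is treated as a brighter firefly.

```python
import math
import random


def firefly_minimize(objective, dim, bounds, n_fireflies=15, n_iter=50,
                     beta0=1.0, gamma=1.0, alpha=0.2, seed=0):
    """Hypothetical helper: minimize `objective` inside a box with a basic FA."""
    rng = random.Random(seed)
    lo, hi = bounds

    def clip(v):
        # Keep every coordinate inside the search box.
        return min(max(v, lo), hi)

    # Rule 3: the objective determines brightness (lower cost = brighter).
    swarm = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_fireflies)]
    cost = [objective(x) for x in swarm]
    best = min(range(n_fireflies), key=cost.__getitem__)
    best_x, best_cost = list(swarm[best]), cost[best]

    for _ in range(n_iter):
        for i in range(n_fireflies):
            moved = False
            for j in range(n_fireflies):
                if cost[j] < cost[i]:  # Rule 1: move towards a brighter firefly
                    r2 = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[j]))
                    # Rule 2: attractiveness decays with squared distance.
                    beta = beta0 * math.exp(-gamma * r2)
                    swarm[i] = [clip(a + beta * (b - a) + alpha * (rng.random() - 0.5))
                                for a, b in zip(swarm[i], swarm[j])]
                    cost[i] = objective(swarm[i])
                    moved = True
            if not moved:
                # Rule 2: no brighter firefly was found, so take a random step.
                swarm[i] = [clip(a + alpha * (rng.random() - 0.5)) for a in swarm[i]]
                cost[i] = objective(swarm[i])
            if cost[i] < best_cost:
                # Remember the brightest position seen so far.
                best_x, best_cost = list(swarm[i]), cost[i]

    return best_x, best_cost
```

For example, `firefly_minimize(lambda x: sum(v * v for v in x), dim=2, bounds=(-5.0, 5.0))` searches for the minimum of a two-dimensional sphere function. Tracking the best position separately is a design choice of this sketch, so that a random step by the brightest firefly cannot lose the best solution found.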
Based on the aforementioned rules, the pseudocode of FA is presented in Algorithm 2. The closing steps of Algorithm 2 are:
    rank the fireflies and find the current best
    end while
    return the brightest fireflies

Appendix C Grey Wolf Optimizer
GWO is a meta-heuristic algorithm inspired by the social hierarchy and hunting strategy of grey wolves. Grey wolves live in packs of 5 to 12 and are divided into four classes (alpha, beta, delta, and omega) based on individual responsibilities. The alpha wolf is the head of the pack (regardless of gender) and is responsible for organizing, making decisions, and leading the pack. The beta wolf is second in rank. It assists the alpha wolf in decision making and has the authority to take over command if the alpha wolf is injured or senile. Delta wolves are the scouts and are responsible for the security and hunting activities of the pack. Finally, omega wolves are the elders or the frail wolves. They are mostly responsible for taking care of the offspring. Grey wolves are known for their extraordinary hunting technique, which follows three steps:
• Track, tail, and approach the prey.
• Encircle, harass, and move towards the prey until it becomes stationary.
• Attack the prey simultaneously.
Equation 1 gives the mathematical model of how grey wolves encircle the prey.
Here, X represents the position of the wolf, t is the current iteration, and X_p is the current location of the prey. The control coefficients A and C are calculated with Equations 2 and 3, where r_1 and r_2 are random values drawn from the range [0, 1] at each iteration. The control parameter a decreases linearly from 2 to 0 over the iterations, as shown in Equation 4.
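For reference, the standard GWO formulation that matches the symbols defined above can be written as follows. This is a sketch under stated assumptions: merging the distance term into a single Equation 1 and the exact form of Equation 4, where T denotes the maximum number of iterations, are chosen to agree with the equation numbers cited in the text.

```latex
\begin{align}
\vec{X}(t+1) &= \vec{X}_p(t) - \vec{A} \cdot \left| \vec{C} \cdot \vec{X}_p(t) - \vec{X}(t) \right| \tag{1} \\
\vec{A} &= 2\,\vec{a} \cdot \vec{r}_1 - \vec{a} \tag{2} \\
\vec{C} &= 2\,\vec{r}_2 \tag{3} \\
a &= 2 - \frac{2t}{T} \tag{4}
\end{align}
```

With r_1 in [0, 1], the coefficient A lies in [-a, a], so the wolves take progressively smaller steps around the prey as a shrinks.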
Here, T represents the maximum number of iterations. The other wolves in the pack update their positions based on the positions of the alpha (α), beta (β), and delta (δ) wolves, where the distance of the current wolf from α, β, and δ is represented by Equation 5. The closing steps of the GWO pseudocode are:
    for each grey wolf in the pack do
        compute A and C by Eq. 2 and Eq. 3
        update the position of the current wolf using X_α, X_β, and X_δ by Eq. 11
    end for
    update a, A, and C
    calculate the fitness of each wolf
    update X_α, X_β, and X_δ
    end while
    return the best solution X_α
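The leader-guided update described above can be sketched in Python. For each leader l in {α, β, δ}, the sketch computes the distance D_l = |C · X_l − X| and the estimate X_l − A · D_l, then averages the three estimates. This is the standard GWO position update and is assumed here to correspond to Eq. 5 through Eq. 11. The function name `gwo_minimize`, its parameters, the clipping to the search box, and keeping the three best positions found so far as leaders are assumptions of this sketch, not details confirmed by the paper.

```python
import random


def gwo_minimize(objective, dim, bounds, n_wolves=12, max_iter=100, seed=0):
    """Hypothetical helper: minimize `objective` inside a box with a basic GWO.

    Assumes n_wolves >= 3 so that alpha, beta, and delta all exist.
    """
    rng = random.Random(seed)
    lo, hi = bounds
    wolves = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_wolves)]

    # leaders[0], leaders[1], leaders[2] play the roles of alpha, beta, delta.
    leaders = sorted(wolves, key=objective)[:3]

    for t in range(max_iter):
        a = 2.0 - 2.0 * t / max_iter  # control parameter, shrinks from 2 towards 0
        for i, x in enumerate(wolves):
            new_x = []
            for d in range(dim):
                estimates = []
                for leader in leaders:
                    A = 2.0 * a * rng.random() - a  # A = 2a*r1 - a
                    C = 2.0 * rng.random()          # C = 2*r2
                    D = abs(C * leader[d] - x[d])   # distance to this leader
                    estimates.append(leader[d] - A * D)
                # New coordinate: mean of the three leader-based estimates, clipped.
                mean = sum(estimates) / len(estimates)
                new_x.append(min(max(mean, lo), hi))
            wolves[i] = new_x
        # Keep the three best positions seen so far as the new leaders.
        leaders = sorted(wolves + leaders, key=objective)[:3]

    return leaders[0], objective(leaders[0])
```

For example, `gwo_minimize(lambda x: sum(v * v for v in x), dim=2, bounds=(-5.0, 5.0))` searches for the minimum of a two-dimensional sphere function. Because the leaders are chosen from both the current pack and the previous leaders, the cost of the alpha position never increases between iterations.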