Are you an EPFL student looking for a semester project?
Work with us on data science and visualisation projects, and deploy your project as an app on top of Graph Search.
A well-known approach for collaborative spam filtering is to determine which emails belong to the same bulk, e.g. by exploiting their content similarity. This allows, after observing an initial portion of a bulk, for the bulkiness scores to be assigned to the remaining emails from the same bulk. This also allows the individual evidence of spamminess to be joined, if such evidence is generated by collaborating filters or users for some of the emails from an initial portion of the bulk. Usually a database of previously observed emails or email digests is formed and queried upon receiving new emails. Previous evaluations [2,10] of the approach based on the email digests that preserve email content similarity indicate and partially demonstrate that there are ways to make the approach robust to increased obfuscation efforts by spammers. However, for the settings of the parameters that provide good matching between the emails from the same bulk, the unwanted random matching between ham emails and unrelated ham and spam emails stays rather high. This directly translates into a need for use of higher bulkiness thresholds in order to ensure low false positive (FP) detection of ham, which implies that larger initial parts of spam bulks will not be filtered, i.e. true positive (TP) detection will not be very high (FP-TP conflict). In this paper we demonstrate how, by use of the negative selection algorithm, the unwanted random matching between unrelated emails may be decreased at least by an order of magnitude, while preserving the same good matching between the emails from the same bulk. We also show how this translates into an order of magnitude (at least) of less undetected bulky spam emails, under the same ham miss- detection requirements.
Simon Nessim Henein, Mohammad Hussein Kahrobaiyan, Billy Nussbaumer