On perturbation measure for binary vectors

The paper is about the remoteness of objects described by nominal-valued attributes. Nominal values of the attributes are replaced by respective binary vectors. A new measure of remoteness between sets, based on binary attributes' values, is introduced. The new measure is called a measure of perturbation of one binary vector by another binary vector and can be treated as a binary version of the sets' perturbation measure developed by the authors. Values of the newly developed measure range between 0 and 1, and the perturbation of one binary vector by another is not the same as the perturbation of the second binary vector by the first one; that is, the measure is not symmetric in general.


Introduction
There are problems in which the comparison of objects plays an essential role, and the result of such a comparison often depends on the similarity measure applied to the objects. Generally, we can distinguish two different kinds of methods for measuring proximity between objects. The first kind is based on a measure of distance between points described in Cartesian coordinates; in the second kind an object is described by sets of features or attributes (Tversky [8]) instead of geometric points.
To define similarity (or dissimilarity) measures of two sets for nominal-valued attributes, Krawczak and Szkatula introduced the concept of perturbation of one set by another set (cf. Krawczak and Szkatula [3], [4], [5]). The proposed measure identifies changes of the first set after adding the second set and/or changes of the second set after adding the first set. It is shown that this measure is not symmetric, which means that the value of the measure of perturbation of the first set by the second set can be different than the value of the measure of perturbation of the second set by the first set. Of course, there are cases in which the perturbation measures are symmetric. The proposed measure can be normalized in different ways to a value ranging from 0 to 1, where 1 is the highest value of perturbation and 0 is the lowest value of perturbation. The measure of perturbation type 1 of one set by another set was introduced in the papers by Krawczak and Szkatula (cf. Krawczak and Szkatula [3], [4], [5], [6]). The mathematical properties of this measure were studied, and the authors gave equivalent definitions of a few selected measures in terms of the measure of perturbation type 1 (cf. Krawczak and Szkatula [6]). The measure of perturbation type 2 of one set by another set was proposed in the paper by Krawczak and Szkatula [7], where the mathematical properties of this measure were also studied.
In this paper, we introduce a binary vector representation of nominal-valued sets based on a procedure of binary encoding of sets. For this new representation we propose the perturbation of one binary vector by another binary vector, and next we introduce the measure of perturbation type 2 of one binary vector by another binary vector. This new definition allows us to compare the newly introduced measure to other proximity measures. Next, the mathematical properties of the measure are studied. Attaching the first set $A_i$ to the second set $A_j$, where $A_i, A_j \subseteq V$, can be regarded as perturbing the second set by the first set; in other words, the set $A_i$ perturbs the set $A_j$ to some degree. In this way we define the concept of perturbation of the set $A_j$ by the set $A_i$, which is denoted by $(A_i \mapsto A_j)$ and interpreted as the set $A_i \setminus A_j$. The cardinality of the set $A_i \setminus A_j$ can be normalized to a value ranging from 0 to 1 and then serves as a measure of perturbation. The measure of perturbation type 2 of one set by another set was proposed in the paper by Krawczak and Szkatula [7] in the following manner:
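The displayed formula (1) does not appear above, so the following is only a minimal sketch under an assumption: the cardinality of $A_i \setminus A_j$ is normalized by the cardinality of the union of the two sets. That choice agrees with the term a + b + c that appears later for binary vectors, but it is not confirmed by the text here. The function name `set_perturbation` and the convention for two empty sets are hypothetical.

```python
from fractions import Fraction


def set_perturbation(a_i, a_j):
    """Perturbation of set a_j by set a_i, written (a_i -> a_j) in the text.

    Numerator: number of elements of a_i that are not in a_j (from the text).
    Denominator: number of elements in the union of both sets. This
    normalization is an ASSUMPTION of this sketch, not taken from the source.
    """
    union = a_i | a_j
    if not union:
        # Convention chosen for this sketch: two empty sets do not perturb each other.
        return Fraction(0)
    return Fraction(len(a_i - a_j), len(union))


A1 = {"a", "b"}
A2 = {"b", "c", "d"}

print(set_perturbation(A1, A2))  # 1/4
print(set_perturbation(A2, A1))  # 1/2
```

The two printed values differ, which illustrates the lack of symmetry discussed in the Introduction.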

Asymmetric matching between binary vectors
The measure of perturbation type 2 of one set by another set (1) was developed for the nominal-valued representation of sets. By applying the following procedure of binary encoding of sets, we are able to replace the nominal representation of sets by a binary vector representation. This replacement allows us to compare selected measures for binary data with the newly developed measure of perturbation of one binary vector by another. The selected measures, taken from the literature (e.g. Choi et al. [1]), describe various forms of distance measures and similarity measures for binary cases.
Let us introduce the following procedure of binary encoding of sets, which will be applied to change the representation of sets from nominal-valued to binary vectors:
$$w_l^i = \begin{cases} 1 & \text{if } v_l \in A_i \\ 0 & \text{if } v_l \notin A_i \end{cases} \qquad (2)$$
for all $v_l \in V$. Equipped with procedure (2), we can formulate the new representation of the nominal sets, which are described by binary vectors of dimension equal to the cardinality of the set $V$. Let us illustrate the new representation of sets by the following example.

Example 1.
Consider the set $V = \{a, b, c, d\}$ and subsets $A_i \subseteq V$. With the introduced notation, for $card(V) = 4$, we can describe any subset of $V$ in the form of a binary vector, where the digits 1 and 0 correspond to the presence and absence of the respective nominal value in each subset, see Table 1. Table 1 should be interpreted as follows: the first set $A_1 = \{a\}$ is represented by the binary vector $A_1 = [1, 0, 0, 0]$, i.e., the binary vector $A_1$ describes the set $A_1$.
The last set $V = \{a, b, c, d\}$ is represented by the 4-dimensional unit vector $[1, 1, 1, 1]$, i.e., the 4-dimensional unit vector describes the set $V$.
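The encoding used in Example 1 can be sketched in a few lines. The helper names `encode` and `decode` are hypothetical, and the sketch assumes that the elements of $V$ are kept in a fixed order, so that each position of the vector corresponds to one nominal value.

```python
def encode(subset, universe):
    """Return the binary vector describing `subset` over the ordered `universe`.

    Position l holds 1 if the l-th value of `universe` is present in
    `subset`, and 0 if it is absent.
    """
    unknown = set(subset) - set(universe)
    if unknown:
        raise ValueError(f"values outside the universe: {sorted(unknown)}")
    return [1 if value in subset else 0 for value in universe]


def decode(vector, universe):
    """Inverse of `encode`: recover the set described by a binary vector."""
    if len(vector) != len(universe):
        raise ValueError("vector and universe must have the same length")
    return {value for bit, value in zip(vector, universe) if bit == 1}


V = ["a", "b", "c", "d"]  # fixed order of the nominal values

print(encode({"a"}, V))   # [1, 0, 0, 0]
print(encode(set(V), V))  # [1, 1, 1, 1]
```

The two printed vectors are the first and the last case described in Example 1.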
In the literature we can find various forms of distance measures and similarity measures for binary cases. Let us consider two $L$-dimensional binary vectors $A_i$ and $A_j$. We will need to define the subtraction, summation and intersection of the binary vectors $A_i$ and $A_j$, as well as the $L$-dimensional binary vector $A_k$, $A_k = [w_1^k, w_2^k, \ldots, w_L^k]$, as shown in Tables 2, 3 and 4.
[Equation (4); only the term $a + b + c$ survives.]
Having introduced the measure of perturbation type 2 of $L$-dimensional binary vectors, we will discuss some of its properties. It is important to notice that, by Definition 1, this measure is not symmetric in general.
It can be proved that this measure is non-negative and ranges between 0 and 1, where 0 is the lowest level of perturbation and 1 is the highest level of perturbation, as shown in Corollary 1. Additionally, we can prove that the sum of the measures of perturbation type 2 of $L$-dimensional binary vectors, taken in both directions, is always non-negative and does not exceed 1, as shown in Corollary 2.
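The stated bounds can be checked numerically on small vectors. The sketch below assumes that the perturbation $(A_i \mapsto A_j)$ counts the positions set in $A_i$ but not in $A_j$ and divides by the number of positions set in at least one of the two vectors. This reading fits the term a + b + c of (4) and the inequality used in the proof of Corollary 2, but the full definition is not reproduced here. The names `match_counts` and `perturbation`, and the value 0 returned for two all-zero vectors, are choices of this sketch.

```python
from fractions import Fraction
from itertools import product


def match_counts(x, y):
    """Count positions set in both vectors, only in x, and only in y."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    both = sum(1 for p, q in zip(x, y) if (p, q) == (1, 1))
    only_x = sum(1 for p, q in zip(x, y) if (p, q) == (1, 0))
    only_y = sum(1 for p, q in zip(x, y) if (p, q) == (0, 1))
    return both, only_x, only_y


def perturbation(x, y):
    """Assumed form of the measure for (x -> y); see the lead-in above."""
    both, only_x, only_y = match_counts(x, y)
    total = both + only_x + only_y
    if total == 0:
        return Fraction(0)  # convention of this sketch for two zero vectors
    return Fraction(only_x, total)


# Exhaustive check over all pairs of 4-dimensional binary vectors.
vectors = list(product([0, 1], repeat=4))
for x in vectors:
    for y in vectors:
        p, q = perturbation(x, y), perturbation(y, x)
        assert 0 <= p <= 1 and 0 <= q <= 1  # range of a single measure
        assert 0 <= p + q <= 1              # bound on the sum of both directions

x, y = [1, 1, 0, 0], [0, 1, 1, 1]
print(perturbation(x, y), perturbation(y, x))  # 1/4 1/2
```

The exhaustive loop is a sanity check for one small dimension only, not a proof; the printed pair shows a case where the two directions give different values.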

Corollary 2 The sum of the measures of perturbation type 2 for $L$-dimensional binary vectors $A_i$ and $A_j$ satisfies the following inequality $0 \le Per(A_i \mapsto A_j) + Per(A_j \mapsto A_i) \le 1$ (7)
Proof. 1) By Corollary 1, the sum $Per(A_i \mapsto A_j) + Per(A_j \mapsto A_i)$ is non-negative.
2) It can be noticed that the inequality $b + c \le b + c + a$ is satisfied for $a \ge 0$. The right side of inequality (7) can be written as $\frac{b + c}{a + b + c} \le 1$.
Additionally, we can prove an interesting property relating the measures of perturbation type 2 for $L$-dimensional binary vectors introduced in this paper to Jaccard's coefficient, presented as Corollary 3. Jaccard's coefficient for two binary vectors, denoted by $S_{Jaccard}(A_i, A_j)$, is defined in the following manner (e.g. Choi et al. [1]): $S_{Jaccard}(A_i, A_j) = \frac{a}{a + b + c}$.
Let us consider the following example, which illustrates the mutual relationships between the proximity measures recalled above.

Example 3.
Let us consider two 9-dimensional binary vectors $A_1$ and $A_2$, where $A_1 = [1, 1, 1, 1, 0, 1, 0, 0, 0]$ and $A_2 = [1, 1, 0, 1, 1, 1, 1, 0, 0]$. The problem is to calculate the degrees of proximity between these vectors. The values of the measures of perturbation type 2 and of the selected measures are compared. The relationships between the proximity measures are best illustrated graphically, as shown in Fig. 1. It must be emphasized that the measure values were calculated for these two exemplary binary vectors $A_1$ and $A_2$ only.
It is obvious that objects' proximity measures are not universal and, applied to the same objects, return different values (see Fig. 1). In general, the measures of objects' proximity known in the literature are developed and designed for specific data or even for a particular data mining problem. The same specialization is observed for the binary vector representation of sets. Such an approach is commonly used for nominal-valued data as well as for its binary vector representation. It seems that the proposed measure of perturbation type 2 of one vector by another vector can be considered more general, because we did not impose any preliminary conditions on the considered data set.
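For the two vectors of Example 3, the position counts and the resulting values can be worked out as in the sketch below. It assumes the same reading of the measure as above: positions set only in the perturbing vector, divided by the positions set in at least one vector. Jaccard's coefficient is taken in its usual form for binary data. The variable names are hypothetical, and the numbers are computed under this assumption rather than read from Fig. 1.

```python
from fractions import Fraction

A1 = [1, 1, 1, 1, 0, 1, 0, 0, 0]
A2 = [1, 1, 0, 1, 1, 1, 1, 0, 0]

# Position counts for the pair (A1, A2).
both = sum(1 for p, q in zip(A1, A2) if (p, q) == (1, 1))    # set in both vectors
only_1 = sum(1 for p, q in zip(A1, A2) if (p, q) == (1, 0))  # set only in A1
only_2 = sum(1 for p, q in zip(A1, A2) if (p, q) == (0, 1))  # set only in A2
either = both + only_1 + only_2                              # set in at least one

# Assumed form of the perturbation measure in both directions.
per_12 = Fraction(only_1, either)  # (A1 -> A2)
per_21 = Fraction(only_2, either)  # (A2 -> A1)

# Jaccard's coefficient for binary vectors.
jaccard = Fraction(both, either)

print(both, only_1, only_2)     # 4 1 2
print(per_12, per_21, jaccard)  # 1/7 2/7 4/7

# Under this reading, the two perturbations and Jaccard's coefficient sum to one.
assert per_12 + per_21 + jaccard == 1
```

The final assertion holds for any pair of vectors with at least one position set, because the three numerators add up to the common denominator.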

Conclusions
In this paper we consider the problem of remoteness of objects described by attributes with nominal values. In general, such problems are converted to a binary representation and processed as comparisons of binary vectors. Therefore we proposed a novel remoteness measure called the measure of perturbation of one binary vector by another binary vector. The proposed measure can be treated as an extension of the measure of perturbation of one set by another set previously developed by the authors. The binary version of the perturbation measure simplifies the procedure and additionally allows us to compare the developed measure to other approaches known in the literature. Some mathematical properties of the measure of perturbation type 2 for $L$-dimensional binary vectors proposed in this paper are explored. The proposed measure was compared with selected measures for binary data. It must be emphasized that the developed measure of perturbation of one binary vector by another has an advantage over other methods because it makes no initial assumptions about the structure of the considered data. Therefore the new measure can be considered more general than the others. Additionally, the measure has another advantage, namely it is not symmetric. The approach is illustrated by several examples which bring the new idea closer.