Bidirectional comparison of multi-attribute qualitative objects

In this paper, multi-attribute objects with repeating qualitative attribute values are considered. Each object is represented by a collection of multisets drawn from the sets of attribute values. The formalism of multiset theory allows all combinations of attribute values and the various versions of the objects to be taken into account simultaneously. An effective procedure for comparing such objects, as well as groups of such objects, is developed. The measure of perturbation of one object by another is proposed as the difference of the multisets representing the objects. The measure describes the remoteness between the objects and is, in general, asymmetric, and therefore cannot be treated as a distance. Next, we introduce a new measure of perturbation of one group of objects by another group, and then generate a description of each group in the form of classification rules that distinguish the considered groups. The proposed approach is illustrated on the task of grouping text documents described by multisets.


Introduction
In data mining tasks there is a genuine problem of using a suitable measure of proximity between objects.
Here, we consider a pair of objects A and B, together with a distance measure and the similarity between these two objects. Generally, a distance represents a quantitative degree and shows how far apart two objects are.
Similarity, in turn, indicates how close two objects are. It is important to notice that similarities focus on the matching of relations between non-identical objects, while differences focus on the mismatching of attributes. Usually, there is an additional assumption about the symmetry of objects' proximity, i.e., the proximity of the object A to the object B is equal to the proximity of B to A.
However, many types of data proximity are non-symmetric, e.g. in the psychological literature, especially that related to modeling human similarity judgments. It may happen that, when considering two objects, the object A is more associated with the object B than the other way round. Asymmetry may have various meanings. One example is telephone calls between cities: the number of telephone calls from city A to city B can differ from the number of telephone calls from city B to city A. Another is the cost of transforming figures, e.g. the figure "~" is more similar to the figure "c" than the figure "c" is to the figure "~". In judging similarity, Tversky found that the less prominent stimulus was more similar to the prominent stimulus than the other way round [Tversky, 1977]. Thus, objects can be viewed either as similar or as different, depending on the context and frame of reference [Goodman, 1972]. Sometimes researchers preprocess the data to make them symmetric. According to Beals et al. [1968], "if asymmetries arise they must be removed by averaging or by an appropriate theoretical analysis that extracts a symmetric dissimilarity index". On the other hand, asymmetry may carry important information, e.g. [Tversky, 1977, 2004], [Tversky and Gati, 1978], [Tversky and Kahneman, 1981]. Thus, it seems that symmetry should not be assumed in advance, because the asymmetry of data often should not be neglected.
We can distinguish qualitative properties, which describe objects in subjective terms, and quantitative properties, which describe objects in objective terms. The task of comparing objects requires choosing a proper method of data representation, including the representation of the data in the computer. In general, quantitative data represent numerical information about objects; such information can be measured, e.g. length, height, weight, time, cost, etc.
Qualitative data, in contrast, represent descriptive information about objects. Qualitative information is subjective and cannot be definitively measured. Thus, qualitative data can be observed but not measured, for example beauty, smells, tastes, etc. In general, qualitative data are described by sets of attributes, and the attributes are measured on nominal scales. Common distance measures are not directly applicable to determining similarities between "qualitative" objects. The problem of defining proximity measures seems to be less trivial for nominal than for real-valued attributes.
In the present paper, we consider a finite, non-empty set of objects. Each object is described by a set of attributes, each attribute is described by nominal values, and additionally it is assumed that the values of the attributes can be repeated in the object description. In other words, each multi-attribute object can be presented in m copies or versions, and the descriptions of the copies may vary in the values of the attributes. Such problems arise when, e.g., some object is evaluated by several independent experts upon multiple criteria, or the attributes of the object were measured in different conditions or by different methods. Multiple-valued attributes can be processed using transformations such as "averaging" or "weighting". However, in such a case, a collection of objects can have a different structure. Therefore, new methods for aggregating such objects are required. The formalism of multiset theory allows all possible combinations of attribute values to be taken into account simultaneously, and therefore various versions of the objects can be compared. Multiset theory thus gives a very convenient mathematical methodology for describing and analyzing collections of multi-attribute qualitative data with repeated values of objects' attributes.
In classical set theory, a set $V$ is a collection of distinct values, $v \in V$. If repetition of any value is allowed, then such a collection is called a multiset. Thus, the multiset $S$ can be understood as a set of pairs, with additional information about the multiplicity of the occurring elements. Let us assume now that every subset of the set $V$ of nominal values in which repetition of elements is allowed is called a multiset. The term "multiset" was introduced by Richard Dedekind in 1898. A complete survey of multiset theory can be found in several papers wherein the appropriate operations and their properties are investigated, e.g. [El-Sayed and Abo-Tabl, 2013; Girish and Sunil, 2012; Petrovsky, 1994, 2001, 2003; Singh, Ibrahim, Yohanna, and Singh, 2007, 2008; Syropoulos, 2001; Krawczak and Szkatula, 2015b, 2015c, 2016]. For instance, the multiset $\{(1,a),(3,b),(2,c)\}$ is understood as a set of three pairs in which there is one occurrence of the element $a$, three occurrences of the element $b$, and two occurrences of the element $c$. The applications of multiset theory can be divided into two main groups: mathematics (especially combinatorial and computational aspects) and computer science. The paper [Singh, Ibrahim, Yohanna, and Singh, 2007] contains a comprehensive survey of various applications of multisets.
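As a minimal sketch (not part of the original formalism), the pair notation above maps naturally onto Python's `collections.Counter`, which stores each element together with its multiplicity. The helper names `multiplicity` and `as_pairs` are illustrative assumptions.

```python
from collections import Counter

# The multiset {(1,a),(3,b),(2,c)} stored as element -> multiplicity.
S = Counter({"a": 1, "b": 3, "c": 2})


def multiplicity(multiset: Counter, v: str) -> int:
    """Number of occurrences of v; elements not in the multiset count as zero."""
    return multiset[v]


def as_pairs(multiset: Counter) -> list:
    """(multiplicity, element) pairs in element order, omitting zero multiplicities."""
    return [(multiset[v], v) for v in sorted(multiset) if multiset[v] > 0]


print(as_pairs(S))
print(multiplicity(S, "d"))
```

A `Counter` returns zero for a missing key without storing it, which matches the convention that absent elements have multiplicity zero.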
In this way, each multi-attribute qualitative object can be represented by a collection of multisets drawn from the sets of nominal values $V$ of the attributes describing each object. Following [Petrovsky, 1994, 1997, 2001, 2003], we recall selected cases of qualitative data: evaluation of projects, retrieval of textual documents, and recognition of graphic symbols. The first case is the evaluation of research projects by experts using predefined criteria with a qualitative scale. Each project can be described in the form of a multiset, wherein the number of elements is equal to the number of evaluations with the qualitative scale, while the value multiplicity is equal to the number of experts evaluating the project. In the second case, a collection of textual documents described by qualitative attributes is considered. Lexical attributes such as descriptors, keywords, terms, labels, etc., express the semantic content of documents. The description of each such document has the form of a multiset, where the multiplicities are equal to the numbers of occurrences of the lexical units in the document. For many lexical units, the collection of such multisets constitutes another multiset. The third case concerns a collection of graphic symbols and a collection of standard symbols. Each such graphic symbol has the form of a multiset, where the multiplicity is equal to the valuation of the recognized graphic symbol compared to the standard symbols.
In the present work we develop an effective procedure for comparing nominal-valued data in which attribute values are allowed to repeat within an object's description. For data represented by multisets, a new asymmetric measure of remoteness between two multisets is developed. Additionally, following Tversky's suggestions about the possibly asymmetric nature of similarities between objects, our aim is to verify the asymmetry of objects' proximity. Therefore, for data described by multisets we develop a new mathematical tool that provides satisfactory comparisons of two objects and then also of two groups of objects. Although fairly many proximity measures of objects are known, they usually assume symmetry. However, there are clearly problems in which the direction of the comparison of objects is significant. The appropriate choice of measure depends on both the properties of the objects considered and the nature of the data under consideration.
This paper is a continuation as well as an extension of the authors' previous papers on the perturbation of sets [Krawczak and Szkatula, 2014a, 2015a]. The term "perturbation of one set by another set" is used in the general sense and corresponds to Tversky's considerations about objects' similarities [Tversky, 1977, 2004]. The considerations are based on the theory of multisets and their basic operations. First, we define a description of each multi-attribute object as a K-tuple of multisets, i.e., an ordered collection of multisets. Next, we define a novel concept of perturbation of one multiset by another multiset, which constitutes a new multiset. Then, it is shown that the perturbation of one multiset by another multiset is described by the difference between these two multisets, and therefore the direction of the perturbation of multisets is significant. Due to normalization of the cardinality of this difference, the developed measure of the perturbation ranges between 0 and 1, wherein 0 indicates the lowest value of perturbation while 1 indicates the highest. We propose two types of the measure of multisets' perturbation. The first is called the measure of perturbation type 1, where the perturbation is normalized by the arithmetic addition of these two multisets [Krawczak and Szkatula, 2015b, 2015c]. The second is called the measure of perturbation type 2 [Krawczak and Szkatula, 2016], where the perturbation is normalized by the union of these two multisets. Then, we develop a description of a group of objects as an ordered collection of multisets, and next define a concept of perturbation of one group of objects by another group of objects. The perturbation represents the difference of the description of one group compared to the description of another group. The direction of the perturbation of the groups is likewise significant, because the difference of multisets (e.g. the arithmetic subtraction of multisets) is used. For example, the methodology allows classification rules to be generated that distinguish the considered groups (e.g. the text documents, as shown in Section 5). These rules can be used to classify new objects into one of the prescribed groups. Another application of this methodology is the possibility of evaluating distances between groups in order to solve clustering tasks, analogously to the authors' previous paper [Krawczak and Szkatula, 2014b].
The paper is organized as follows. Section 2 presents preliminary considerations on the asymmetric nature of the similarity of data. In Section 3 we present the perturbation methodology for multisets and the mathematical properties of the measures of perturbation type 1 and type 2. In Section 4 we present the measures of interactions between objects described by multisets. Section 5 presents the application of the measures of objects' perturbation to the classification problem. The considered classification rules have the form "IF certain conditions are satisfied THEN a given object is a member of a specific group".
The developed methodology is explained by an illustrative example.

Asymmetry of data proximity
There are several ways to model asymmetries in the proximity of data. The only assumption is that a measure of similarity or dissimilarity between two objects must be defined. Let us provide a short discussion of some such models, for instance the prospect theory, the "salience" and "goodness" of the form, and the "cost" of objects' transformation.

Tversky and Kahneman prospect theory
Human perception can be modeled by the prospect theory developed by Tversky and Kahneman [Tversky and Kahneman, 1981]. In outline, this theory describes people's rationality in decisions involving risk. The theory states that people make decisions based on the potential value of losses and gains. The value function is s-shaped and asymmetrical, see Fig. 1. The most evident characteristic of the prospect theory is that a loss creates a greater feeling of pain than the joy created by an equivalent gain. For example, see Fig. 1, the feeling of joy due to obtaining $100 is weaker than the pain caused by losing $100.

"Salient" and "goodness" of the form
The issue of symmetry was extensively analyzed by Tversky [Tversky, 1977, 2004], who considered objects represented by sets of features and proposed measuring similarity via a comparison of their common and distinctive features. Such assumptions lead to a different approach to comparing objects. Namely, when comparing two objects A and B there are the following fundamental questions: "how similar are A and B?", "how similar is A to B?" and "how similar is B to A?". The first question does not distinguish the direction of comparison and corresponds to symmetric similarity. The next two questions are directional, and the similarity of the objects need not be a symmetric relation. For example, comparing a person and his portrait, we say that "the portrait resembles the person" rather than "the person resembles the portrait" [Tversky and Gati, 1978].
The perceived similarity is strictly associated with data representation. In general, the direction of asymmetry is determined by the relative "salience of the stimuli". Thus, "The less salient stimulus is more similar to the more salient than the more salient stimulus is similar to the less salient" [Tversky, 1977]. If the object B is more salient than the object A, then A is more similar to B than B is to A. In other words, the variant is more similar to the prototype than the prototype is to the variant. A toy train is quite similar to a real train, because most features of the toy train are included in the real train. On the other hand, a real train is not as similar to a toy train, because many of the features of a real train are not included in the toy train.
The psychological nature of human perception was discussed, among others, by Tversky and Gati [1978]. They hypothesized that both "goodness of form" and complexity contribute to the salience of geometric figures. Moreover, they expected the "good figure" to be more salient than the "bad figure". To investigate these hypotheses, they constructed two sets of eight pairs of geometric figures. In the first set, one figure in each pair (denoted p) had a "better" form than the other figure (denoted q). In the second set, one figure in each pair (denoted p) was "richer or more complex" than the other (denoted q). Example pairs of figures from each set are presented in Fig. 2 and Fig. 3. A group of 69 respondents took part in the experiment, in which the two elements of each pair were displayed side by side. The respondents were asked to choose one of the following two statements: "the left figure is similar to the right figure," or "the right figure is similar to the left figure". The order of the presented figures was randomized so that figures appeared an equal number of times on the left and on the right side. As a result, more than 2/3 of the respondents selected the form "q is similar to p".
In the second experiment, the same pairs of figures were used. One group of respondents was asked to estimate (on a 20-point scale) the degree to which the figure on the left was similar to the figure on the right, while the second group was asked to estimate the degree to which the figure on the right was similar to the figure on the left. The results confirmed the hypothesis that the average similarity of the figures q to the figures p, S(q,p), was significantly higher than the average similarity of the figures p to the figures q, S(p,q). These experiments confirmed their hypothesis that similarity is asymmetrical, but they do not clarify the concept of "goodness of the form".

"Cost" of transformation
The distance between objects may be regarded as a transformational distance between two objects. Such a distance is described by the minimal cost (the smallest number of elementary operations) of transforming, by a computer program, the first object's representation into the second object's representation. This concept is known as Levenshtein's distance [Levenshtein, 1966]. The developed concept of the measure of perturbation can be regarded as an extension of Levenshtein's distance. However, the concept of perturbation is more general, because it is bidirectional and concerns nominal-valued attributes.
According to Tversky [1977] as well as Garner and Haun [1978], the objects' transformations involve the operations of addition and deletion. It seems that deleting a feature typically requires a less complete specification than adding one. Each comparison of the representations has a "short" and a "long" transformation; the arrows indicate the temporal order of stimulus presentation.
Such transformations for the exemplary shapes A and B are illustrated in Fig. 4. In order to generate the right figure from the left, the bottom line should be deleted. In the opposite case, the process of adding the bottom line is more complex, because it requires a specification of "what" and "where" exactly to add. One can also consider the overall transformation distance between two representations, which is characterized by the number of steps required to change one representation into the other [Hodgetts et al., 2009]. They distinguished three general transformations for comparing shapes: 1) create a new feature that is unique to the target representation; 2) apply a feature, an operation that takes a feature created via step 1 and applies it to one or both of the objects in the target representation; 3) swap a feature, e.g. shape or color, between a pair of objects. The transformation from the exemplary pair of two shapes A to the pair of two shapes B, and in the opposite direction, is illustrated in Fig. 5. Consider the first case, the transformation distance from the pair of shapes A to the pair of shapes B. Only one transformation is required, the application of the existing square, i.e., apply(square) = 1. In the second case, the transformation distance from the pair of shapes B to the pair of shapes A requires two transformations, the creation of a new triangle and the application of this new triangle, i.e., create(triangle) + apply(triangle) = 2. Thus, the transformation from the pair of two shapes A to the pair of two shapes B is "short" (it requires one operation), whereas the transformation from the pair of two shapes B to the pair of two shapes A is "long" (it requires two operations). Applying a feature that is currently available is simpler than introducing a new feature.
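The asymmetry between a "short" and a "long" transformation can be sketched with an edit distance in which inserting a feature costs more than deleting one. This is only an illustration under assumptions that are not in the text: objects are encoded as strings of feature symbols, and the cost values (deletion 1, insertion 2, substitution 2) are arbitrary choices. With all costs equal to 1 the function reduces to the classical Levenshtein distance.

```python
def edit_cost(source: str, target: str,
              insert_cost: float = 2.0,
              delete_cost: float = 1.0,
              substitute_cost: float = 2.0) -> float:
    """Minimal total cost of turning source into target by dynamic programming."""
    rows, cols = len(source) + 1, len(target) + 1
    cost = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * delete_cost      # delete every symbol of the prefix
    for j in range(1, cols):
        cost[0][j] = j * insert_cost      # insert every symbol of the prefix
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            cost[i][j] = min(
                cost[i - 1][j] + delete_cost,
                cost[i][j - 1] + insert_cost,
                cost[i - 1][j - 1] + (0.0 if same else substitute_cost),
            )
    return cost[-1][-1]


# Removing a feature is cheaper than adding it back (hypothetical encoding).
print(edit_cost("abc", "ab"), edit_cost("ab", "abc"))
```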

In the next section we present the description of the perturbation methodology for multisets.

(1)
In (1) the function $k_S(\cdot)$ is called the counting function or the multiplicity function, and the value $k_S(v)$ specifies the number of occurrences of the element $v \in V$ in the multiset $S$. An element that is not included in the multiset $S$ has a counting function equal to zero. The multiset space is the set of all multisets with elements of $V$ such that no element occurs more than $m$ times, and is denoted by $[V]^m$.
Definition 1 can be formulated in the following way, (2), which is understood to mean that the element $v_1 \in V$ appears $k_S(v_1)$ times in the multiset $S$, the element $v_2 \in V$ appears $k_S(v_2)$ times, and so on. In the case where $k_S(v_i) = 0$, the element $v_i \in V$ is omitted.
Let us consider two multisets $S_1$ and $S_2$. According to [Krawczak and Szkatula, 2015b, 2015c, 2016], the following basic operations and notions of multisets can be distinguished.
On the basis of the authors' previous research, a new asymmetric measure of proximity between two multisets $S_1$ and $S_2$ is introduced. The details of the proposed approach are presented below.

Concept of multisets' perturbation
Comparing the first multiset $S_1$ to the second multiset $S_2$ means that the second multiset is perturbed by the first, while comparing the second multiset $S_2$ to $S_1$ means that the first multiset is perturbed by the second. It is important to notice that the direction of the perturbation is significant. In other words, one multiset can perturb another multiset to some degree.
In [Krawczak and Szkatula, 2015b, 2015c, 2016], a novel concept of perturbation of one multiset $S_2$ by another multiset $S_1$ was defined. It is denoted by $(S_1 \mapsto S_2)$ and is interpreted as the difference between one multiset and the other, $S_1 \ominus S_2$, in the following way:
$$(S_1 \mapsto S_2) = S_1 \ominus S_2 = \{(\max\{k_{S_1}(v) - k_{S_2}(v), 0\}, v) : v \in V\} \qquad (4)$$
The counterpart definition is similar:
$$(S_2 \mapsto S_1) = S_2 \ominus S_1 = \{(\max\{k_{S_2}(v) - k_{S_1}(v), 0\}, v) : v \in V\} \qquad (5)$$
The interpretation of the perturbation of one multiset by another multiset is presented in the following example.
Example 1. Consider the set $V = \{a,b,c,d,e\}$ and two exemplary multisets $S_1 = \{(1,a),(1,e)\}$ and $S_2 = \{(1,a),(1,d),(3,e)\}$, $S_1, S_2 \in [V]^3$. The perturbation of the multiset $S_2$ by the multiset $S_1$ is the empty multiset, because $(S_1 \mapsto S_2) = S_1 \ominus S_2 = \emptyset$. The perturbation of the multiset $S_1$ by the multiset $S_2$ is the multiset $(S_2 \mapsto S_1) = S_2 \ominus S_1 = \{(1,d),(2,e)\}$. □
Note that each finite multiset drawn from an ordinary set of $L$ elements can be shown as a point in $L$-dimensional space. For example, assume that $L = 2$; then the multiset $\{b,a,b,b\}$ can be written in a simplified form as $\{(1,a),(3,b)\}$ (since the order of elements is irrelevant) and, by omitting the names of the elements, we get the point $(1,3)$ in 2-dimensional space.
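Example 1 can be reproduced with a short sketch. It assumes that multisets are stored as `collections.Counter` objects; the function name `perturbation` is illustrative. The `-` operator on counters subtracts multiplicities and keeps only the positive results, which is the arithmetic subtraction used in the example.

```python
from collections import Counter


def perturbation(s_first: Counter, s_second: Counter) -> Counter:
    """Perturbation of s_second by s_first: the difference s_first minus s_second."""
    # Counter subtraction discards zero and negative multiplicities.
    return s_first - s_second


S1 = Counter({"a": 1, "e": 1})
S2 = Counter({"a": 1, "d": 1, "e": 3})

print(perturbation(S1, S2))  # perturbation of S2 by S1
print(perturbation(S2, S1))  # perturbation of S1 by S2
```

The two calls return different multisets, which makes the directional character of the perturbation visible.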
The geometrical interpretation of the proposed concept of the perturbation in 2D space is provided below.
Next, we will present details of the proposed approach of the measure of the perturbation of one multiset by another multiset.

Measure of multisets' perturbation
Again, let us consider two multisets $S_1, S_2 \in [V]^m$, $V = \{v_1, v_2, \ldots, v_L\}$. The perturbation of one multiset by another constitutes a new multiset, and the problem arises of assigning a numerical value to such a perturbation. For this purpose, we give two proposals for defining the measure of the perturbation of one multiset by another, whose values range between 0 and 1. The value 0 indicates the lowest value of the perturbation measure, while 1 is the highest. The definitions are based on the cardinality of a multiset, a function that assigns a non-negative real number to each finite multiset $S \in [V]^m$, i.e., $card(S) = \sum_{i=1}^{L} k_S(v_i)$.

First, the arithmetic subtraction of two multisets, $S_1 \ominus S_2$, is determined and its cardinality is computed, and then the result is normalized.
Here, we propose the measure of perturbation type 1 of one multiset by another, with normalization by the arithmetic addition of these two multisets, $S_1 \oplus S_2$, and another measure of perturbation, type 2. The measure of perturbation type 1 of the multiset $S_2$ by the multiset $S_1$ (Definition 2) is
$$Per^{1}_{MS}(S_1 \mapsto S_2) = \frac{card(S_1 \ominus S_2)}{card(S_1 \oplus S_2)} = \frac{\sum_{i=1}^{L}\left(k_{S_1}(v_i) - k_{S_1 \cap S_2}(v_i)\right)}{\sum_{i=1}^{L}\left(k_{S_1}(v_i) + k_{S_2}(v_i)\right)} \qquad (6)$$
where $k_{S_1 \cap S_2}(v_i) = \min\{k_{S_1}(v_i), k_{S_2}(v_i)\}$. The intuitive meaning of the above definition is as follows: the measure of perturbation of one multiset by another is the total number of elements appearing in the multiset created as the arithmetic subtraction of these multisets. The measure is normalized by the total number of elements in the multiset created by the arithmetic addition of these multisets. The normalization ensures that the measure is not greater than 1.
In the counterpart case, the measure of perturbation of the multiset $S_1$ by the multiset $S_2$ is defined in a similar way:
$$Per^{1}_{MS}(S_2 \mapsto S_1) = \frac{card(S_2 \ominus S_1)}{card(S_1 \oplus S_2)} = \frac{\sum_{i=1}^{L}\left(k_{S_2}(v_i) - k_{S_1 \cap S_2}(v_i)\right)}{\sum_{i=1}^{L}\left(k_{S_1}(v_i) + k_{S_2}(v_i)\right)} \qquad (7)$$
The definitions of these two cases are similar; the difference lies in the directional character of the arithmetic subtractions $S_1 \ominus S_2$ and $S_2 \ominus S_1$, respectively.
The measure of multisets' perturbation type 1 satisfies the following properties:
Corollary 1. The measure of perturbation type 1 of the multiset $S_2$ by the multiset $S_1$ satisfies the following conditions
The measure of perturbation type 2 of the multiset $S_2$ by the multiset $S_1$ (Definition 3) is normalized by the union of the two multisets:
$$Per^{2}_{MS}(S_1 \mapsto S_2) = \frac{card(S_1 \ominus S_2)}{card(S_1 \cup S_2)} = \frac{\sum_{i=1}^{L}\left(k_{S_1}(v_i) - k_{S_1 \cap S_2}(v_i)\right)}{\sum_{i=1}^{L}\max\{k_{S_1}(v_i), k_{S_2}(v_i)\}} \qquad (8)$$
The definition of the counterpart case is similar:
$$Per^{2}_{MS}(S_2 \mapsto S_1) = \frac{card(S_2 \ominus S_1)}{card(S_1 \cup S_2)} = \frac{\sum_{i=1}^{L}\left(k_{S_2}(v_i) - k_{S_1 \cap S_2}(v_i)\right)}{\sum_{i=1}^{L}\max\{k_{S_1}(v_i), k_{S_2}(v_i)\}} \qquad (9)$$
The remark is the same, i.e., the difference lies in using the arithmetic subtractions $S_1 \ominus S_2$ and $S_2 \ominus S_1$, respectively. The measure of perturbation type 1 differs from the measure of perturbation type 2 in the form of the denominator. Namely, in Definition 2 there is the arithmetic addition $S_1 \oplus S_2$, while in Definition 3 there is the union of multisets $S_1 \cup S_2$. The measure of perturbation type 2 of one multiset by another multiset satisfies the following properties:
The idea of multisets' perturbation will now be illustrated by the following example.
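The two normalizations can be compared with a short sketch. It assumes multisets stored as `collections.Counter` objects and reuses the multisets of Example 1; the function names and the convention of returning 0 when both multisets are empty are assumptions, not part of the definitions. On counters, `-` is the arithmetic subtraction, `+` the arithmetic addition, and `|` the union (maximum of multiplicities).

```python
from collections import Counter
from fractions import Fraction


def card(s: Counter) -> int:
    """Cardinality of a multiset: the sum of all multiplicities."""
    return sum(s.values())


def per_type1(s_first: Counter, s_second: Counter) -> Fraction:
    """Type 1: card of the difference, normalized by the arithmetic addition."""
    denominator = card(s_first + s_second)
    # Assumption: two empty multisets do not perturb each other.
    return Fraction(card(s_first - s_second), denominator) if denominator else Fraction(0)


def per_type2(s_first: Counter, s_second: Counter) -> Fraction:
    """Type 2: card of the difference, normalized by the union."""
    denominator = card(s_first | s_second)
    return Fraction(card(s_first - s_second), denominator) if denominator else Fraction(0)


S1 = Counter({"a": 1, "e": 1})
S2 = Counter({"a": 1, "d": 1, "e": 3})

for name, measure in (("type 1", per_type1), ("type 2", per_type2)):
    print(name, measure(S1, S2), measure(S2, S1))
```

Exact fractions are used instead of floats so that the normalized values can be compared without rounding.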
In the subsequent subsection we provide the geometrical interpretations of the proposed measure of the multisets' perturbation in 2D and 3D space.

Geometrical interpretation of measure of multisets' perturbation
In order to demonstrate the meaning of the measures of perturbation, both type 1 and type 2, of a multiset $S_2$ by another multiset $S_1$, i.e., $Per^{1}_{MS}(S_1 \mapsto S_2)$ and $Per^{2}_{MS}(S_1 \mapsto S_2)$, as well as the counterpart cases, i.e., $Per^{1}_{MS}(S_2 \mapsto S_1)$ and $Per^{2}_{MS}(S_2 \mapsto S_1)$, we give geometrical interpretations of the measures of the perturbations of the multisets in 2D and in 3D.

Case 2D
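Let us assume that $V = \{a\}$, i.e., $L = card(V) = 1$, and consider the following two multisets $S_1, S_2 \in [V]^5$, denoted by $S_1 = \{(k_{S_1}(a), a)\}$ and $S_2 = \{(k_{S_2}(a), a)\}$. According to Eq. (6) and (7), the measures of perturbation type 1 have the following forms:
$$Per^{1}_{MS}(S_1 \mapsto S_2) = \frac{\max\{k_{S_1}(a) - k_{S_2}(a), 0\}}{k_{S_1}(a) + k_{S_2}(a)}, \qquad Per^{1}_{MS}(S_2 \mapsto S_1) = \frac{\max\{k_{S_2}(a) - k_{S_1}(a), 0\}}{k_{S_1}(a) + k_{S_2}(a)}$$
and according to Eq. (8) and (9), the measures of perturbation type 2 have the following forms:
$$Per^{2}_{MS}(S_1 \mapsto S_2) = \frac{\max\{k_{S_1}(a) - k_{S_2}(a), 0\}}{\max\{k_{S_1}(a), k_{S_2}(a)\}}, \qquad Per^{2}_{MS}(S_2 \mapsto S_1) = \frac{\max\{k_{S_2}(a) - k_{S_1}(a), 0\}}{\max\{k_{S_1}(a), k_{S_2}(a)\}}$$
The measures are considered as functions of the values of $k_{S_2}(a)$ (which change from 0 to 5), for the fixed value $k_{S_1}(a) = 2$. For the first case of the perturbation, $(S_1 \mapsto S_2)$, the measures $Per^{1}_{MS}(S_1 \mapsto S_2)$ and $Per^{2}_{MS}(S_1 \mapsto S_2)$ (indicated as the points on the blue lines in Fig. 7) are equal to 0 for $k_{S_1}(a) = 2 \le k_{S_2}(a) \le 5$. For the second case of the perturbation, $(S_2 \mapsto S_1)$, the measures $Per^{1}_{MS}(S_2 \mapsto S_1)$ and $Per^{2}_{MS}(S_2 \mapsto S_1)$ (indicated as the points on the red lines) are equal to 0 for $0 \le k_{S_2}(a) \le k_{S_1}(a) = 2$. It is interesting to note that both curves are convex.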
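The single-element case can be tabulated directly. The sketch below assumes the one-element forms of the two measures (difference of multiplicities clipped at zero, divided by the sum for type 1 and by the maximum for type 2) and fixes the multiplicity in $S_1$ at 2 while the multiplicity in $S_2$ runs from 0 to 5; the function and variable names are illustrative.

```python
from fractions import Fraction

K_S1 = 2  # fixed multiplicity of the element a in S1


def per_type1(k_first: int, k_second: int) -> Fraction:
    """One-element type 1 measure: clipped difference over the sum."""
    total = k_first + k_second
    return Fraction(max(k_first - k_second, 0), total) if total else Fraction(0)


def per_type2(k_first: int, k_second: int) -> Fraction:
    """One-element type 2 measure: clipped difference over the maximum."""
    largest = max(k_first, k_second)
    return Fraction(max(k_first - k_second, 0), largest) if largest else Fraction(0)


# Columns: k_S2, type 1 (S1 -> S2), type 1 (S2 -> S1), type 2 (S1 -> S2), type 2 (S2 -> S1)
rows = [
    (k_s2,
     per_type1(K_S1, k_s2), per_type1(k_s2, K_S1),
     per_type2(K_S1, k_s2), per_type2(k_s2, K_S1))
    for k_s2 in range(6)
]
for row in rows:
    print(*row)
```

In each row at most one direction is non-zero, and both directions vanish when the two multiplicities are equal.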

Case 3D
Now, let us consider the case characterized by $V = \{a,b\}$, i.e., $L = card(V) = 2$, and two exemplary multisets $S_1$ and $S_2$. As an example of the 3D case, let us consider the measure of perturbation type 2 for the multisets $S_1$ and $S_2$.

Comparing proximity measures
Let us consider two multisets $S_1$ and $S_2$. According to (4) and (5), the perturbations for the multisets $S_1$ and $S_2$ are interpreted as new multisets, described as follows: $(S_1 \mapsto S_2) = \{(\max\{k_{S_1}(a) - k_{S_2}(a), 0\}, a), (\max\{k_{S_1}(b) - k_{S_2}(b), 0\}, b)\}$, and analogously for $(S_2 \mapsto S_1)$. A graphical illustration of the selected measures and of the counting functions of the proposed perturbations, for the fixed multisets $S_1$ and $S_2$, is shown in Fig. 9.
Fig. 9. A graphical illustration of a few selected measures for fixed multisets $S_1$ and $S_2$.
It is easy to confirm that different criteria for evaluating the distance between multisets lead to different results. Obviously, the Chebyshev measure $d_{Chebyshev}(S_1, S_2) = 2$ (the purple segment), the Euclidean measure $d_{Euclidean}(S_1, S_2) = \sqrt{5}$ (the green segment) and the Manhattan measure $d_{Manhattan}(S_1, S_2) = 3$ (the red path shows one possible realization) are symmetric. However, if the direction of the comparison of multisets cannot be neglected, then the counting functions $k_{S_1 \mapsto S_2}(b) = 2$ and $k_{S_2 \mapsto S_1}(a) = 1$ of the perturbations (two black segments) may be used. Thus, it is impossible to indicate which measure is better in general. In other words, there is no best measure for evaluating the proximity between two arbitrary multisets, and the choice depends on the nature of the data under consideration.
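The contrast between the symmetric distances and the directional perturbations can be checked numerically. The concrete multisets below are hypothetical: they are chosen only so that the multiplicities differ by 1 on $a$ and by 2 on $b$ in opposite directions, which is consistent with the values quoted above; the multisets actually used for Fig. 9 are not recoverable from the text.

```python
import math
from collections import Counter

# Hypothetical multisets over V = {a, b}, viewed as the points (1, 3) and (2, 1).
S1 = Counter({"a": 1, "b": 3})
S2 = Counter({"a": 2, "b": 1})

elements = sorted(set(S1) | set(S2))
differences = [S1[v] - S2[v] for v in elements]

# Symmetric distances: swapping S1 and S2 only flips the signs of the differences.
chebyshev = max(abs(d) for d in differences)
euclidean = math.sqrt(sum(d * d for d in differences))
manhattan = sum(abs(d) for d in differences)

# Directional perturbations: each keeps only one sign of the differences.
perturbation_of_s2_by_s1 = S1 - S2
perturbation_of_s1_by_s2 = S2 - S1

print(chebyshev, euclidean, manhattan)
print(perturbation_of_s2_by_s1, perturbation_of_s1_by_s2)
```

The three distances collapse both directions into one number, while the two perturbations record separately which element is in excess in which multiset.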

Analytic case
The different measures known in the literature can be expressed as functions of the measures of perturbation type 1 of one multiset by another multiset [Krawczak and Szkatula, 2015b, 2015c], or of the measures of perturbation type 2 [Krawczak and Szkatula, 2016]. These measures can be split into two components, which correspond to the two directional perturbations. In the following corollaries we present several important properties of a few selected measures that involve our idea of the perturbation measures. For example, the Bray-Curtis dissimilarity, $d_{B\text{-}C}(S_1, S_2) = \frac{card(S_1 \triangle S_2)}{card(S_1 \oplus S_2)}$ [Bray and Curtis, 1957], which is popular in the environmental sciences, can be rewritten in such a way that the equivalent definition contains the sum of the measures of the perturbation type 1.
Corollary 3. The sum of the measures of the perturbation type 1 satisfies the following condition
Corollary 4. The sum of the measures of the perturbation type 2 satisfies the following condition
Proof. See Appendix.
Thus, the introduced measures of perturbation of one multiset by another multiset can be used to provide equivalent interpretations of the distances between two multisets.
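The relation to the Bray-Curtis dissimilarity can be checked numerically. The sketch below assumes multisets stored as `collections.Counter` objects and uses hypothetical example multisets. The second check, which relates the sum of the type 2 measures to one minus the ratio of the intersection and union cardinalities, is an identity that follows from the definitions; it is offered as an illustration and is not claimed to be the exact statement of the corollary.

```python
from collections import Counter
from fractions import Fraction


def card(s: Counter) -> int:
    """Cardinality of a multiset: the sum of all multiplicities."""
    return sum(s.values())


def per_type1(s_first: Counter, s_second: Counter) -> Fraction:
    return Fraction(card(s_first - s_second), card(s_first + s_second))


def per_type2(s_first: Counter, s_second: Counter) -> Fraction:
    return Fraction(card(s_first - s_second), card(s_first | s_second))


def bray_curtis(s1: Counter, s2: Counter) -> Fraction:
    """Sum of absolute multiplicity differences over the sum of all multiplicities."""
    elements = set(s1) | set(s2)
    return Fraction(sum(abs(s1[v] - s2[v]) for v in elements), card(s1 + s2))


# Hypothetical non-empty multisets.
S1 = Counter({"a": 1, "b": 3})
S2 = Counter({"a": 2, "b": 1})

sum_type1 = per_type1(S1, S2) + per_type1(S2, S1)
sum_type2 = per_type2(S1, S2) + per_type2(S2, S1)
jaccard_like = 1 - Fraction(card(S1 & S2), card(S1 | S2))

print(sum_type1, bray_curtis(S1, S2))
print(sum_type2, jaccard_like)
```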
Equipped with the fundamental definitions of the perturbation of multisets, in the forthcoming sections we will define a description of the multi-attribute object with repeating nominal values of attributes as an ordered collection of multisets. Next, the concept of the measure of perturbation of one multiset by another multiset is extended to all multisets describing the considered object and the group of such objects.

Description of multi-attribute object
Assuming that the objects are represented by their descriptions, the description of an object $e$ is denoted by $G_e$ and can be represented by an ordered collection of multisets; see the following definition.
A single object $e_1$ is characterized by a lack of repetitions of the values of all attributes, and each attribute $a_j$, $j = 1, \ldots, K$, can take only one value $v_{i(j),t(j,e_1)} \in V_{a_j}$. Because the value $v_{i(j),t(j,e_1)}$ appears once in the multiset $S_{j,t(j,e_1)}$, then $k_{S_{j,t(j,e_1)}}(v_{i(j),t(j,e_1)}) = 1$. In this case, the multiset $S_{j,t(j,e_1)}$, for $j = 1, \ldots, K$, in (16) is
Definition 5 (Join between descriptions of objects). The join between the description of an object $e_1$ and the description of an object $e_2$ is described as follows (18). The definition says that the description of two joined objects is again a collection of $K$ multisets. Each such $j$-th multiset, $j = 1, \ldots, K$, is constructed as the join of the two multisets $S_{j,t(j,e_1)} \oplus S_{j,t(j,e_2)}$ describing the attribute $a_j$ for the objects $e_1$ and $e_2$, respectively.
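The attribute-wise join can be sketched as follows, assuming that an object description is stored as a tuple of $K$ `collections.Counter` multisets, one per attribute. The function name `join_descriptions`, the attribute values, and the two example descriptions are hypothetical.

```python
from collections import Counter


def join_descriptions(g_first: tuple, g_second: tuple) -> tuple:
    """Join two object descriptions attribute by attribute (arithmetic addition)."""
    if len(g_first) != len(g_second):
        raise ValueError("both descriptions must cover the same K attributes")
    return tuple(s_first + s_second for s_first, s_second in zip(g_first, g_second))


# Hypothetical descriptions with K = 2 attributes.
G_e1 = (Counter({"good": 2}), Counter({"yes": 1}))
G_e2 = (Counter({"good": 1, "excellent": 1}), Counter({"no": 2}))

G_joined = join_descriptions(G_e1, G_e2)
print(G_joined)
```

The result is again a tuple of $K$ multisets, so a joined description can itself be joined with further objects, for example to build the description of a group.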
Case K = 1. Now let us consider another special case, for $K = 1$, i.e., an object $e$ is described by a single attribute, $A = \{a_1\}$, and the set $V_{a_1} = \{v_{1,1}, v_{2,1}, \ldots, v_{L_1,1}\}$ is the domain of this attribute. Each object $e$ can be represented by a single multiset $S_{1,t(1,e)}$ drawn from the ordinary set of values $V_{a_1}$. In this case, the description of each object $e$ defined in (15) is reduced to the form $G_e = \langle S_{1,t(1,e)} \rangle$, where $S_{1,t(1,e)}$ is the multiset, $S_{1,t(1,e)} \in [V_{a_1}]^m$, defined by (16), which now can be written in the form (19), where $v_{i(1),t(1,e)} \in V_{a_1}$, for $i(1) \in \{1, 2, \ldots, L_1\}$. The index $i(1),t(1,e)$ specifies which value $v_{i(1),t(1,e)} \in V_{a_1}$ of the attribute $a_1$ is used in the object $e$. For the object $e$ and for the attribute $a_1$ the value $v_{i(1),t(1,e)}$ appears $k_{S_{1,t(1,e)}}(v_{i(1),t(1,e)})$ times.

Next, we will present details of the proposed approach to the measure of perturbation of one object by another object.

Corollary 5. Measure of perturbation of the object

Thus, the sum of the measure of perturbation of the object $e_2$ by the object $e_1$, and the measure of perturbation of the object $e_1$ by the object $e_2$, gives an equivalent interpretation of the dissimilarity of two objects. In this way, Eq. (28) can be rewritten, and an equivalent definition of the similarity of the objects can be obtained, which is based on our idea of the object perturbation measures.
In order to clarify how the objects are represented using multisets, and how the perturbations are realized, let us discuss the following illustrative example.

3. Illustrative example - students described by several sets of the semester grades
The example concerns the question of how to describe an object which exists in several versions, e.g. students described by several sets of the semester grades. Interesting examples can also be found in the paper [Petrovsky, 2010]. Let us consider the high school student $e_1$ and his two sets of the semester grades in the same four obligatory subjects (attributes) $\{a_1, a_2, a_3, a_4\}$ and four optional subjects (attributes) $\{a_5, a_6, a_7, a_8\}$, all with the qualitative scale $V = \{v_2, v_3, v_4, v_5\}$ = {2 - "unsatisfactory", 3 - "satisfactory", 4 - "good", 5 - "excellent"}.

Going further, the concept of measuring the perturbation of one object by another object can be extended to groups of objects. Details of the proposed approach are presented in the forthcoming subsection.

Measure of perturbation of groups of objects
Now, let us assume that every non-empty subset of a finite set $U = \{e_n\}$, $n = 1, 2, \ldots, N$, is called a group.

Thus, the group of objects $g$ can be represented by an ordered collection of multisets, where each multiset is drawn from the ordinary set of values $V_{a_j}$, for $j = 1, 2, \ldots, K$, and the description of such a group is defined as $G_g = \bigoplus_{e_n \in g} G_{e_n}$; see Definition 8.
Definition 9 (Perturbation of one group by another). The perturbation of the group of objects $g_2$ by the group of objects $g_1$, denoted $(G_{g_1} \mapsto G_{g_2})$, can be represented by an ordered collection of multisets $S_{j,t(j,g_1)} \ominus S_{j,t(j,g_2)}$, $j = 1, 2, \ldots, K$, drawn from the ordinary sets of nominal values $V_{a_j}$ of the attributes $a_j$, respectively, and is defined as follows

Thus, the perturbation of one group of objects by another group of objects is defined in an analogous way to the perturbation of one object by another object. Namely, the perturbation of the group of objects $g_2$ by the group of objects $g_1$ is represented by a collection of perturbations $S_{j,t(j,g_1)} \ominus S_{j,t(j,g_2)}$ generated for the separate attributes $a_j$, $j = 1, 2, \ldots, K$. As a result, it constitutes a collection of multisets.
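The construction of a group description and of the perturbation of one group by another can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: descriptions are lists of `collections.Counter` multisets, the function names and the sample keyword counts are hypothetical, and only the unnormalized collection of multiset differences is produced.

```python
from collections import Counter
from functools import reduce

def join_descriptions(g1, g2):
    """Attribute-wise join (multiset addition) of two descriptions."""
    return [s1 + s2 for s1, s2 in zip(g1, g2)]

def group_description(objects):
    """Description of a non-empty group: the join of its objects' descriptions."""
    return reduce(join_descriptions, objects)

def perturbation(desc_g1, desc_g2):
    """Perturbation of g2 by g1: attribute-wise multiset difference S_g1 minus S_g2."""
    return [s1 - s2 for s1, s2 in zip(desc_g1, desc_g2)]

# Hypothetical groups of objects with K = 1 attribute (keywords).
group_1 = [[Counter({"financial": 2, "article": 1})], [Counter({"financial": 1})]]
group_2 = [[Counter({"article": 2})], [Counter({"article": 1, "submission": 1})]]

G_g1 = group_description(group_1)
G_g2 = group_description(group_2)
print(perturbation(G_g1, G_g2))  # values that g1 carries beyond g2
print(perturbation(G_g2, G_g1))  # values that g2 carries beyond g1
```

The two printed collections differ, which illustrates why the comparison is directional.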
The considered developments can be applied in data mining tasks with redundancy, like classification problems of multi-attribute qualitative objects, wherein the values of the attributes can be repeated. The classification of the objects is based on representing each object by multisets and on a set of elementary rules, and allows the objects to be assigned to the proper groups. Thus, in the forthcoming section, the groups' descriptions in the form of classification rules are considered.

Let us consider in general two groups of objects. In the first group $g_1 \subseteq U$, there are objects $\{e_n : n \in J_{g_1} \subseteq \{1, \ldots, N\}\}$, $card(J_{g_1}) = N_1$, while the other objects $\{e_n : n \in J_{g_2} \subseteq \{1, \ldots, N\}\}$, $card(J_{g_2}) = N_2$, do not belong to the first group but belong to the second group $g_2 \subseteq U$, where $J_{g_1} \cap J_{g_2} = \emptyset$. Additionally, it is assumed that the cardinalities of the groups are similar, i.e., $N_1 \approx N_2$. The classification rule for distinguishing the objects belonging to the group $g_1$ can be generated in the following algorithmic way.
Then, the set of pairs $PER^{\alpha}_{S_{g_1} \mapsto S_{g_2}}$ described by (35) can be used to create the set of one-condition elementary rules describing the group $g_1$. Each such one-condition elementary rule for the group $g_1$, denoted by $R^{\alpha}_{g_1,v_i}$, for $i = i_1, i_2, \ldots, i_{L_\alpha}$, is defined in the following manner

$R^{\alpha}_{g_1,v_i}$: IF [considered value = $v_i$]; $q(R^{\alpha}_{g_1,v_i})$ THEN a given object is a member of the group $g_1$ (36)

where $q(R^{\alpha}_{g_1,v_i})$, for $i \in \{i_1, i_2, \ldots, i_{L_\alpha}\}$, is called the strength coefficient of the rule $R^{\alpha}_{g_1,v_i}$, and is described by (38).

The above procedure shows how to create the classification rule for one group, taking into account the two existing groups. When we consider more than two groups, the procedure is run in a very similar way.
Namely, when generating the classification rule for the group $g$, all other groups are considered as one group containing the objects that do not belong to the group $g$. Then, e.g., when considering the classification rule for the group $g_2$, the objects from the remaining groups (i.e., $g_1$ and $g_3$, $g_4$, and so on) are considered as one group.
The classification rules are sequentially formed for each group.
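The threshold-based construction of the one-condition elementary rules can be sketched as below. The sketch makes two loud assumptions: the input pairs (elementary measure, value) and their numbers are hypothetical, and the strength coefficient of each rule is simply set equal to the elementary measure of perturbation, since the paper's own formula for the coefficient is not reproduced here. The function names are likewise illustrative.

```python
def reduce_pairs(pairs, alpha):
    """Sort (measure, value) pairs by descending measure and keep those >= alpha."""
    ordered = sorted(pairs, key=lambda pair: pair[0], reverse=True)
    return [pair for pair in ordered if pair[0] >= alpha]

def elementary_rules(pairs, alpha, group):
    """Build one elementary rule per retained pair.

    ASSUMPTION: the strength coefficient equals the elementary measure.
    """
    return [
        {"value": value, "strength": measure, "group": group}
        for measure, value in reduce_pairs(pairs, alpha)
    ]

# Hypothetical elementary measures for four nominal values.
pairs = [(0.2, "x"), (1.0, "y"), (0.7, "z"), (0.5, "w")]
rules = elementary_rules(pairs, alpha=0.7, group="g1")
for rule in rules:
    print(f'IF [considered value = "{rule["value"]}"]; {rule["strength"]} '
          f'THEN a given object is a member of the group {rule["group"]}')
```

The classification rule for the group is then the disjunction of the printed elementary rules; a larger threshold yields fewer, stronger conditions.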
The already generated classification rules (37) (i.e., $R^{\alpha}_{g_1}$, $R^{\alpha}_{g_2}$, and so on) can be applied to the classification of a new object $e$. The classification is carried out through verification of the fulfilment of the conditions in the conditional parts of the rules. The classification is unequivocal when only one classification rule is fulfilled. In equivocal situations, when more than one classification rule is fulfilled, a matching degree to the group is calculated [Szkatula, 1995]. The greatest degree of matching is the basis for the assignment. For example, for a new object $e$ and the group $g_1$, described by the classification rule (37), denoted by $R^{\alpha}_{g_1}$, the matching degree $MD(e, R^{\alpha}_{g_1})$ can be calculated in the following way:
The developed approach to generate the group description in the form of the classification rules will be illustrated by the following example.

Illustrative example - grouping text documents
A practical presentation of the proposed approach was carried out for the task of grouping text documents, assuming that the context and the semantics are neglected. Here, a text document $S$ is modeled as a multiset, drawn from the ordinary set of unique keywords and phrases appearing in the text, and can be represented by a set of $L$ ordered pairs, according to (1), i.e., $S$ = {(the number of occurrences of the keyword or phrase in the text document, the keyword or phrase)}, where $L$ is the number of distinguished unique keywords and phrases. Usually, the keywords and phrases can be weighted in various ways, but here, for simplicity, we assume the same importance for all keywords.
Having such objects (i.e., the text documents), the task is to divide the objects into similar groups and determine the number of these groups.
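A minimal sketch of turning a text into such a multiset is given below. It assumes single-word keywords only (multi-word phrases would need extra matching), a simple lowercase tokenizer, and a hypothetical keyword list and sample sentence; the function names are illustrative.

```python
import re
from collections import Counter

def document_multiset(text, keywords):
    """Count occurrences of the given single-word keywords in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    allowed = set(keywords)
    return Counter(token for token in tokens if token in allowed)

def as_pairs(multiset, keywords):
    """Represent the multiset as L ordered pairs (number of occurrences, keyword)."""
    return [(multiset[keyword], keyword) for keyword in keywords]

# Hypothetical keyword set and document.
keywords = ["financial", "training", "article"]
text = "Financial training: the financial report."
doc = document_multiset(text, keywords)
print(as_pairs(doc, keywords))
```

Keywords that do not occur in the text are kept in the pair list with a count of zero, so every document is described over the same $L$ keywords.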

Grouping of the objects
The aim of this task is to divide the set of the considered text documents $U$ into non-empty, disjoint groups, together containing all the considered documents.
First, in order to define the number of groups, we applied the taxonomic method proposed by Czekanowski in 1909 [Czekanowski, 1909]. The so-called Czekanowski's diagram is a graphic methodology for multidimensional grouping of objects, which used to be widely applied in physical anthropology, plant sociology, agricultural economics, etc. The Czekanowski method is regarded as an early, perhaps the first, method of cluster analysis in the world. Obviously, Czekanowski's methodology cannot be applied in all cases; however, the methodology gives important insight into the structure of the considered data as well as the number of groups in the data [Liiv, 2010]. Thus, considering a set of data characterized by the same keywords, let us form a square matrix with cells describing the values of the distance measure between all possible pairs of objects, with all diagonal values equal to zero.
Several distance measures are known in the relevant literature. One of them is Chebyshev's distance, given as $d_{Chebyshev}(S_{e_p}, S_{e_q}) = \max_{i} |k_{S_{e_p}}(v_i) - k_{S_{e_q}}(v_i)|$, where the multisets $S_{e_p}$ and $S_{e_q}$ represent the documents $e_p$ and $e_q$, with the counting functions $k_{S_{e_p}}(\cdot)$ and $k_{S_{e_q}}(\cdot)$, respectively. The Chebyshev distances between all pairs of objects are shown in Table 1. For better visualization of the structure of the values of Chebyshev's distances between the text documents, special graphic characters are used, i.e. black circles of different sizes. Czekanowski's diagram with randomly arranged objects is provided in Fig. 11.

Meanwhile, by simply swapping rows and columns, the matrix can be rearranged in order to gather the closest objects in distinguished groups. The proper reordering of rows and columns of the matrix can be treated as unsupervised learning that discovers similarity as well as relationships between the objects. Formerly, in the original works by Czekanowski, the reordering of rows and columns was done manually and was very burdensome. Fortunately, nowadays there are several computer programs for generating Czekanowski's diagrams, e.g. the software called MaCzek [Soltysiak and Jaskulski, 1999].
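The distance matrix underlying Czekanowski's diagram can be sketched in a few lines. The documents and their keyword counts below are hypothetical and are not the values from Table 1; multisets are modelled with `collections.Counter`, and the function names are illustrative. The reordering of rows and columns is not attempted here.

```python
from collections import Counter

def chebyshev(s_p, s_q):
    """Largest absolute difference between the counts of any single element."""
    keys = set(s_p) | set(s_q)
    return max((abs(s_p[key] - s_q[key]) for key in keys), default=0)

def distance_matrix(documents):
    """Square matrix of pairwise Chebyshev distances (zero on the diagonal)."""
    return [[chebyshev(a, b) for b in documents] for a in documents]

# Hypothetical documents represented by keyword multisets.
d1 = Counter({"financial": 3, "training": 1})
d2 = Counter({"financial": 2, "training": 2})
d3 = Counter({"article": 4})

matrix = distance_matrix([d1, d2, d3])
for row in matrix:
    print(row)
```

In this toy matrix the first two documents are close to each other and far from the third, which is the kind of block structure the reordered diagram is meant to expose.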
In the considered example, the reordered Czekanowski's diagram is provided in Fig. 12.

The rearranged objects in Fig. 12 clearly demonstrate that two groups of the considered objects can be distinguished, indicated by two separated blocks of meaningful symbols. In this way, it can be assumed that the considered text documents can be divided into two separated groups, namely $g_1 = \{e_1, e_4, e_6\}$ and $g_2 = \{e_2, e_3, e_5\}$. Then, we can create the descriptions of these two groups in the form of the classification rules. Details of the applied procedure can be described in the following way.

Generation of the classification rules
Now, let us consider the group $g_1 = \{e_1, e_4, e_6\}$ and the group $g_2 = \{e_2, e_3, e_5\}$ of the objects. Our aim is to construct the classification rule for the group $g_1$ as a disjunction of the one-condition elementary rules.
The algorithm is described in the following steps.
The above six pairs were rearranged with respect to the descending values of the elementary measures of perturbation, according to (34). As a result, the following set of rearranged pairs is considered: PER.
Next, the value of the threshold was assumed to be $\alpha = 0.7$. Then the reduced set of pairs $PER^{\alpha}_{S_{g_1} \mapsto S_{g_2}}$ was formed according to (35), keeping the pairs for which the values of the elementary measures of perturbation are greater than or equal to 0.7.

Step 5. At the final step, according to (36), the classification rule for the group $g_1$ is described as the following disjunction of two one-condition elementary rules:

$R^{\alpha}_{g_1}$: IF [considered value = "financial"]; 1.0 $\vee$ [considered value = "training"]; 1.0 THEN a given object is a member of the group $g_1$.

In this way the classification rule for the group $g_1$ was constructed. Next, let us construct the classification rule for the group $g_2$. The corresponding algorithm is described step by step below.
Next, the value of the threshold was also assumed to be $\alpha = 0.7$, and then the reduced set of pairs, for which the values of the elementary measures of perturbation are greater than or equal to 0.7, has the following form: $PER^{\alpha}_{S_{g_2} \mapsto S_{g_1}}$ = {(1, "submission"), (0.8, "article")}.

Step 5. At the end, the classification rule for the group $g_2$ is described as the following disjunction of two one-condition elementary rules:

$R^{\alpha}_{g_2}$: IF [considered value = "submission"]; 1.0 $\vee$ [considered value = "article"]; 0.8 THEN a given object is a member of the group $g_2$.

In this way, the classification rule for the group $g_2$ was constructed.

Brief analysis of the classification rules
Now, let us consider the six text documents $e_1, e_2, e_3, e_4, e_5$ and $e_6$, represented by multisets, and the generated classification rules $R^{\alpha}_{g_1}$ and $R^{\alpha}_{g_2}$ for the groups $g_1$ and $g_2$, respectively. Both generated classification rules are shown in Table 2. The number associated with each keyword is considered as the strength coefficient of the proper elementary rule, according to (38). The test classification of these documents to the appropriate group is carried out through verification of the fulfilment of the conditions in the conditional parts of the rules [Szkatula, 1995]. Details of the calculations are presented below. The classification is unequivocal when only one classification rule is fulfilled. The text documents $e_1$ and $e_6$ were unequivocally classified to the appropriate group $g_1$, and the text documents $e_2$, $e_3$ and $e_5$ were unequivocally classified to the appropriate group $g_2$.
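The way the two generated rules could be applied to a document can be sketched as follows. The keywords and strength coefficients are taken from the rules for $g_1$ and $g_2$ given in the text; everything else is an assumption made for illustration: an elementary rule is treated as fulfilled when its keyword occurs at least once, an unfulfilled rule contributes 0, the aggregation is taken to be the maximum, and the test documents are hypothetical.

```python
from collections import Counter

# Elementary rules (keyword, strength coefficient) from the example in the text.
RULES = {
    "g1": [("financial", 1.0), ("training", 1.0)],
    "g2": [("submission", 1.0), ("article", 0.8)],
}

def matching_degree(document, elementary_rules):
    """ASSUMPTION: aggregate fulfilled rules by the maximum strength, else 0."""
    return max((q for value, q in elementary_rules if document[value] > 0),
               default=0.0)

def classify(document, rules=RULES):
    """Return (group or None, matching degrees); None means no unique best group."""
    degrees = {group: matching_degree(document, r) for group, r in rules.items()}
    best = max(degrees.values())
    winners = [group for group, degree in degrees.items() if degree == best]
    if best == 0.0 or len(winners) > 1:
        return None, degrees
    return winners[0], degrees

# Hypothetical documents given as keyword multisets.
print(classify(Counter({"financial": 2, "article": 1})))
print(classify(Counter({"submission": 1})))
print(classify(Counter({"report": 1})))
```

With these assumptions, a document fulfilling rules of both groups is assigned to the group with the greater matching degree, and a document fulfilling no rule is left unassigned.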

Conclusions
In this paper we propose a new measure describing remoteness between multi-attribute objects with repeating qualitative values of attributes, and between groups of such objects. The concept is based on multiset operations. In our opinion, the approach can be considered as a new, alternative measure of remoteness between qualitative data, particularly where repetitions of values of attributes are permitted and the direction of comparison has significant meaning.
It is important to emphasize that this paper is the next one in a series of papers, written by the present authors, dedicated to the perturbation of one set by another, wherein different kinds of "sets" were considered, like ordinary sets, multisets, fuzzy sets, intuitionistic fuzzy sets and so on. The aim of the series is to compare objects described by nominal-valued attributes represented by different kinds of sets. Up till now, we have developed the perturbations of ordinary sets [Krawczak and Szkatula, 2014a, 2015a], of multisets [Krawczak and Szkatula, 2015b, 2015c, 2016], including this paper, and of fuzzy sets [under review].
Applications of the developed approach to objects within large, real databases (e.g. grouping of similar objects, retrieval of textual documents, document classification, etc.) seem to be an interesting topic for future research.

Appendix. Proofs of corollaries
Proof of Corollary 3. The left side of the equation can be rewritten as follows: $\sum \max\{k_{S_1}(v_i), \ldots\}$

Proof of Corollary 6. 1) First, we prove the left-hand side inequality $0 \le Per_O(G_{e_1} \mapsto G_{e_2}) + Per_O(G_{e_2} \mapsto G_{e_1})$.

Fig. 6. The graphical interpretations of perturbations of the multisets $S_1$ and $S_2$. The arrows indicate the directions of the perturbation.

type 2, with normalization caused by the union of the two considered multisets, $S_1 \cup S_2$. First, let us consider the measure of the multisets' perturbation type 1 of the multiset $S_2$ by the multiset $S_1$. This measure of the perturbation is calculated in the following way [Krawczak and Szkatula, 2015b, 2015c].

Definition 2 (Measure of perturbation type 1). The measure of perturbation type 1 of the multiset $S_2$ by the multiset $S_1$, denoted by $Per_{MS}(S_1 \mapsto S_2)$, is defined by a mapping $Per_{MS}: [V]^m \times [V]^m \to [0, 1]$, in a manner that involves the sum $\sum_{i=1}^{L} (k_{S_1}(v_i) - k_{S_1 \cap S_2}(v_i))$.

Fig. 8. The changes of the measure of the perturbations.

Step 4. We can consider any real number $\alpha \in [0, 1]$ as a parameter treated as the $\alpha$-threshold. The parameter is applied to the set of sorted pairs $PER_{S_{g_1} \mapsto S_{g_2}}$, defined by (34), to construct a new reduced set of pairs, denoted by $PER^{\alpha}_{S_{g_1} \mapsto S_{g_2}}$. The reduction is done by considering those pairs whose values of the elementary measures are greater than or equal to the value of the threshold parameter $\alpha$. The new set of the pairs is written in the following way (35)

$MD(e, R^{\alpha}_{g_1}) = MD(e, R^{\alpha}_{g_1,v_{i_1}} \vee R^{\alpha}_{g_1,v_{i_2}} \vee \ldots \vee R^{\alpha}_{g_1,v_{i_{L_\alpha}}}) = Agg(MD(e, R^{\alpha}_{g_1,v_{i_1}}), MD(e, R^{\alpha}_{g_1,v_{i_2}}), \ldots, MD(e, R^{\alpha}_{g_1,v_{i_{L_\alpha}}}))$,

where $MD(e, R^{\alpha}_{g_1,v_i})$ equals $q(R^{\alpha}_{g_1,v_i})$ if the rule $R^{\alpha}_{g_1,v_i}$ is fulfilled by the object $e$.
$S_1$ and $S_2$, drawn from the set $V = \{v_1, v_2, \ldots, v_L\}$ of nominal elements, such that $S_1, S_2 \in [V]^m$. It is important to mention that there are several known measures which can be applied for the comparison of two multisets. Proximity measures can be compared analytically, where two measures are considered equivalent or one measure is expressed as a function of the other measure, or empirically, for a given data set. Both cases are discussed below.