New Similarity Methods for Unsupervised Learning



2. Using Cosine Similarity

The competitive dynamics model used in Strategizing with Competitive Asymmetry has two dimensions, Market Commonality and Resource Similarity, and the possible combinations are:

Source: Competitor Analysis and Interfirm Rivalry: Toward a Theoretical Integration, Ming-Jer Chen, Academy of Management Review, 1996, Vol. 21, No. 1, 100-134.

Under this approach, each company was characterized by one vector per dimension, each encoding several determinant traits of its markets or its resources. Cosine Similarity was initially used to compare the vectors pairwise, but two problems arose.

First, Cosine Similarity is symmetric. The similarity of vector A with respect to vector B is the same as that of vector B with respect to vector A, so Cosine Similarity fails to represent competitive asymmetry.
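The symmetry is easy to check numerically. The snippet below is a minimal sketch, assuming NumPy; the function name and the two example vectors are made up for illustration and do not come from the study's data.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical trait vectors for two companies (illustrative values only).
company_a = [3.0, 1.0, 0.0, 2.0]
company_b = [1.0, 1.0, 1.0, 0.0]

sim_ab = cosine_similarity(company_a, company_b)
sim_ba = cosine_similarity(company_b, company_a)

# Swapping the arguments does not change the result.
print(sim_ab, sim_ba)
```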

Second, the similarities were very high. In a two-by-two matrix like Image 1 above, the intuitive threshold to classify a data point as high or low is 50%: above 50%, the two data points compared are more likely to be similar than not, so they are classified as “high”, and below 50% they are classified as “low”. With Cosine Similarity, even radically different companies had similarities above 50%. This problem is solvable if there is a training data set from which to find the optimal threshold instead of assuming 50%. That is the case for Market Commonality, where the industry and the countries in which a company operates are known. For unsupervised classification, however, no such data set exists, so whether the optimal threshold falls at the intuitive 50% has a significant impact on the accuracy of the classification. This is the case for Resource Similarity, where the skills of a company are neither easily nor publicly known.

3. Using Projection Similarity

An alternative method to compare the vectors was used in order to have asymmetric similarity. The projection similarity of vector A in relation to vector B was calculated as follows:

1. Calculate the orthogonal projection of vector A onto vector B

2. Divide the norm of the orthogonal projection by the norm of B, which will give the relative value of the norm of the orthogonal projection in relation to B

3. Subtract 1 from the resulting value of step 2 and take the absolute value (to take advantage of the symmetric distributions)
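The three steps above can be sketched in Python as follows. This is a minimal illustration and not the author's code: the function name `projection_difference` and the example vectors are hypothetical, and NumPy is assumed.

```python
import numpy as np

def projection_difference(a, b):
    """Difference statistic of vector a in relation to vector b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Step 1: orthogonal projection of a onto b.
    projection = (a @ b) / (b @ b) * b
    # Step 2: norm of the projection relative to the norm of b.
    ratio = np.linalg.norm(projection) / np.linalg.norm(b)
    # Step 3: subtract 1 and take the absolute value.
    return float(abs(ratio - 1.0))

# Hypothetical trait vectors (illustrative values only).
company_a = [3.0, 1.0, 0.0, 2.0]
company_b = [1.0, 1.0, 1.0, 0.0]

z_ab = projection_difference(company_a, company_b)  # A in relation to B
z_ba = projection_difference(company_b, company_a)  # B in relation to A
print(z_ab, z_ba)
```

Unlike the cosine, the two directions generally give different values, because each one is normalized by the norm of a different vector.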

This difference statistic can be used as the Z value of a standard normal distribution: the Standard Projection Similarity is twice the cumulative distribution function evaluated at -Z, that is, twice the area under the density from -∞ to -Z:
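Written out from the description above, this reads as follows. The notation is an assumption made here for illustration: $\Phi$ denotes the standard normal cumulative distribution function, $\mathrm{proj}_B A$ the orthogonal projection of A onto B, and $S_{\text{standard}}$ the Standard Projection Similarity.

```latex
Z = \left| \frac{\lVert \mathrm{proj}_B A \rVert}{\lVert B \rVert} - 1 \right|,
\qquad
S_{\text{standard}}(A \mid B) = 2\,\Phi(-Z)
  = 2 \int_{-\infty}^{-Z} \frac{1}{\sqrt{2\pi}}\, e^{-t^{2}/2}\, dt
```

With this form, $Z = 0$ (the projection has the same norm as B) gives a similarity of 1, and the similarity decreases toward 0 as $Z$ grows.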

It could also be used as the exponent of a logistic function to get the Logistic Projection Similarity:
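Both transformations can be sketched with the Python standard library. The standard version follows the description above directly. The exact logistic form is not spelled out in the text, so the one used here, 2 / (1 + e^Z), is an assumption chosen so that Z = 0 maps to a similarity of 1; the function names are hypothetical as well.

```python
import math

def standard_projection_similarity(z):
    """Twice the standard normal CDF at -z.

    Uses the identity Phi(-z) = 0.5 * erfc(z / sqrt(2)).
    """
    return math.erfc(z / math.sqrt(2.0))

def logistic_projection_similarity(z):
    """ASSUMED logistic form: 2 / (1 + e^z), equal to 1 when z = 0."""
    return 2.0 / (1.0 + math.exp(z))

# Both measures start at 1 for z = 0 and decrease as z grows.
for z in (0.0, 0.5, 1.0, 2.0):
    print(z, standard_projection_similarity(z), logistic_projection_similarity(z))
```

Under these assumptions, a combined measure such as Cosine · Logistic would simply be the product of the cosine similarity and `logistic_projection_similarity(z)` for the same pair of vectors.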

In the next section, we will examine the validity and accuracy of this alternative method.

4. Calculations

At this phase, five different measures were calculated: Cosine Similarity, Standard Projection Similarity, Logistic Projection Similarity, Cosine Similarity multiplied by Standard Projection Similarity, and Cosine Similarity multiplied by Logistic Projection Similarity. Each dimension had its own data set, one for Market Commonality and one for Resource Similarity. The criterion to decide whether two companies were similar was set by industry: if two companies are in the same industry, their similarity should be high in both dimensions; otherwise, low. Positive outcomes made up 32% of both data sets, and negative outcomes 68%. The chosen criterion carried the implicit assumption that companies in the same industry can differ, but not by much, in either dimension. The performance of each method as a function of the threshold value was as follows:
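A threshold sweep of this kind can be sketched as below. The scoring metric is assumed here to be plain accuracy against the same-industry label, the function name is hypothetical, and the similarity values and labels are toy numbers invented for the example, not the study's data.

```python
import numpy as np

def accuracy_by_threshold(similarities, same_industry, thresholds):
    """Accuracy of the rule 'similarity >= threshold means high' per threshold."""
    similarities = np.asarray(similarities, dtype=float)
    same_industry = np.asarray(same_industry, dtype=bool)
    results = {}
    for t in thresholds:
        predicted_high = similarities >= t
        results[float(t)] = float(np.mean(predicted_high == same_industry))
    return results

# Toy data: one similarity score and one same-industry label per company pair.
toy_similarities = [0.92, 0.81, 0.77, 0.66, 0.58, 0.40]
toy_same_industry = [True, True, False, False, False, False]

# Thresholds from 50% to 95% in steps of 5 points.
thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]

scores = accuracy_by_threshold(toy_similarities, toy_same_industry, thresholds)
best_threshold = max(scores, key=scores.get)  # first threshold with top accuracy
print(best_threshold, scores[best_threshold])
```

Running the same sweep once per measure and per data set gives one accuracy curve per method, from which the optimal threshold of each can be read.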

Market Commonality

Resource Similarity

5. Conclusions

The two challenges of using Cosine Similarity were its symmetry and an optimal threshold value far from the intuitive 50%.

For the former, the experiment showed that Cosine Similarity multiplied by Logistic Projection Similarity delivered the best asymmetric similarity, with a high accuracy on par with that of Cosine Similarity.

For the latter, the optimal threshold of Cosine · Logistic for Market Commonality (60%) was 5 percentage points below that of Cosine (65%), and for Resource Similarity it was 10 percentage points below (75% and 85%, respectively). But those values were still far from 50%. So, even if there was an improvement, the challenge for unsupervised classification remained.

Following closely, the suboptimal Cosine · Standard method delivered optimal thresholds of 55% for Market Commonality and 70% for Resource Similarity, which are slightly closer to 50%.

Read the original article on LinkedIn.
