riching
Computing Text Similarity with Cosine Similarity

For the underlying principle, see: http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html
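
For quick reference, here is the standard cosine similarity formula that the code below follows. The symbols are chosen for this note and are not taken from the original post: A_i and B_i are the counts of the i-th term in the first and second text, and n is the number of distinct terms across both texts.

```latex
\cos\theta
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

A term that appears in only one text still counts toward that text's sum of squares, so the sums must run over the union of both vocabularies.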

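Several people have said the package is wrong or that they don't know where to download it, so here is a download link: https://code.google.com/p/ik-analyzer/downloads/list. The IK jar is also attached to this post; you will need to download the other Apache Commons packages yourself.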
/**
 * 
 */
package com.text;

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.collections.MapUtils;
import org.apache.commons.lang3.tuple.MutablePair;
import org.apache.commons.lang3.tuple.Pair;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

/**
 * @author Riching
 * 
 * @date 2013-8-10
 */
public class IKMainTest {

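    /**
     * Segments two sample sentences, merges their term frequencies into one
     * table, and prints the cosine similarity of the two count vectors.
     *
     * @param args unused
     * @throws IOException if the segmenter fails to read the input
     */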
    public static void main(String[] args) throws IOException {
        String str1 = "我喜欢看电视,不喜欢看电影。";
        String str2 = "我不喜欢看电视,也不喜欢看电影。";
        Map<String, Integer> tf1 = getTF(str1);
        Map<String, Integer> tf2 = getTF(str2);
        Map<String, MutablePair<Integer, Integer>> tfs = new HashMap<String, MutablePair<Integer, Integer>>();
        for (String key : tf1.keySet()) {
            MutablePair<Integer, Integer> pair = new MutablePair<Integer, Integer>(tf1.get(key), 0);
            tfs.put(key, pair);
        }
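        for (String key : tf2.keySet()) {
            MutablePair<Integer, Integer> pair = tfs.get(key);
            if (null == pair) {
                // Term occurs only in str2: it must be added to the table,
                // otherwise it is left out of str2's vector norm.
                pair = new MutablePair<Integer, Integer>(0, tf2.get(key));
                tfs.put(key, pair);
            } else {
                pair.setRight(tf2.get(key));
            }
        }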
        double d = caclIDF(tfs);
        System.out.println(d);
    }

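    /**
     * Segments the text with IK and counts how often each term occurs.
     * Passing true to IKSegmenter selects its smart segmentation mode.
     */
    public static Map<String, Integer> getTF(String str) throws IOException {
        Map<String, Integer> map = new HashMap<String, Integer>();
        IKSegmenter ikSegmenter = new IKSegmenter(new StringReader(str), true);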
        Lexeme lexeme = null;
        while ((lexeme = ikSegmenter.next()) != null) {
            String key = lexeme.getLexemeText();
            Integer count = map.get(key);
            if (null == count) {
                count = 1;
            } else {
                count = count + 1;
            }
            map.put(key, count);
        }
        return map;
    }

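    /**
     * Computes the cosine similarity of the two term-frequency vectors held in
     * the left and right components of each pair. Despite its name, this
     * method does not compute IDF.
     */
    public static double caclIDF(Map<String, MutablePair<Integer, Integer>> tf) {
        if (MapUtils.isEmpty(tf)) {
            return 0;
        }
        double dotProduct = 0; // sum of left * right over all terms
        double sqdoc1 = 0;     // squared norm of the first text's vector
        double sqdoc2 = 0;     // squared norm of the second text's vector
        for (Pair<Integer, Integer> count : tf.values()) {
            dotProduct += count.getLeft() * count.getRight();
            sqdoc1 += count.getLeft() * count.getLeft();
            sqdoc2 += count.getRight() * count.getRight();
        }
        if (sqdoc1 == 0 || sqdoc2 == 0) {
            // One text produced no terms; return 0 instead of NaN from 0 / 0.
            return 0;
        }
        return dotProduct / (Math.sqrt(sqdoc1) * Math.sqrt(sqdoc2));
    }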
}
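
If you just want to check the arithmetic without setting up the IK and Commons jars, the sketch below does the same calculation with only the JDK. It is not the code from this post: the class name `CosineSketch`, its method names, and the hand-written token lists are all assumptions made for illustration. The token lists stand in for segmenter output, and IK may split the sentences differently.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CosineSketch {

    // Counts how often each token occurs (the term-frequency vector).
    public static Map<String, Integer> termFrequency(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    // Cosine similarity over the union of both vocabularies.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> vocabulary = new HashSet<>(a.keySet());
        vocabulary.addAll(b.keySet());
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (String term : vocabulary) {
            int x = a.getOrDefault(term, 0); // 0 when the term is absent
            int y = b.getOrDefault(term, 0);
            dot += x * y;
            normA += x * x;
            normB += y * y;
        }
        if (normA == 0 || normB == 0) {
            return 0; // an empty text has no direction to compare
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hand-made token lists (an assumption, not real IK output).
        List<String> doc1 = Arrays.asList(
                "我", "喜欢", "看", "电视", "不", "喜欢", "看", "电影");
        List<String> doc2 = Arrays.asList(
                "我", "不", "喜欢", "看", "电视", "也", "不", "喜欢", "看", "电影");
        System.out.println(cosine(termFrequency(doc1), termFrequency(doc2)));
    }
}
```

With these token lists the dot product is 13 and the squared norms are 12 and 16, so the score is 13 / (√12 · 4), roughly 0.938. Iterating over the union of both key sets is what ensures that terms found in only one text are included.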


Comments
#3 Interceptor2013 2015-10-23  
org.wltea.analyzer.core.IKSegmenter;
Could you send me this jar?    1306606945@qq.com
Thanks
#2 riching 2015-05-10  
liudeyuan wrote:
..When I ran your code, I got
java.lang.UnsupportedClassVersionError: org/wltea/analyzer/sample/IKAnalzyerDemo : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)

How do I fix this?


It's probably a version incompatibility. Try downloading the latest version.
#1 liudeyuan 2015-04-15  
..When I ran your code, I got
java.lang.UnsupportedClassVersionError: org/wltea/analyzer/sample/IKAnalzyerDemo : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)

How do I fix this?
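
A note on the UnsupportedClassVersionError quoted in the comments: class file version 51.0 corresponds to Java 7, so the error means the JVM that ran the code was older than the one the jar was compiled for. The sketch below is one way to check which Java release a class file needs. The class name `ClassVersionSketch` and its helpers are made up for illustration, and the path passed on the command line is assumed to point at a `.class` file extracted from the jar.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

class ClassVersionSketch {

    // A class file starts with: u4 magic (0xCAFEBABE), u2 minor, u2 major.
    public static int majorVersion(byte[] header) {
        ByteBuffer buf = ByteBuffer.wrap(header); // big-endian by default
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        buf.getShort();                 // minor version, ignored here
        return buf.getShort() & 0xFFFF; // major version as an unsigned value
    }

    // Maps a major version to its Java release (valid for Java 5 and later).
    public static int javaRelease(int major) {
        return major - 44;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ClassVersionSketch <file.class>");
            return;
        }
        byte[] header = new byte[8];
        DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
        try {
            in.readFully(header); // only the first 8 bytes are needed
        } finally {
            in.close();
        }
        int major = majorVersion(header);
        System.out.println("class file major version " + major
                + " needs Java " + javaRelease(major) + " or newer");
        System.out.println("running on Java " + System.getProperty("java.version"));
    }
}
```

If the release a class file needs is newer than the running JVM, the options are to upgrade the JVM or to use a build of the library compiled for the older release.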
