text = text.decode('euc-jp')

nkf for Python: http://city.plala.jp/moin/NkfPython
pykf: http://memo.jj-net.jp/jjnet_sandbox/pykf/pykf-0.3.4.tgz (SourceForge)

chardet: http://chardet.feedparser.org/

print type(text)
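
Hard-coding 'euc-jp' only works when the encoding is known in advance. A minimal sketch of a fallback that tries a few common Japanese encodings in turn. The helper name decode_japanese and the candidate order are illustrative assumptions, not part of nkf, pykf or chardet. It is written for Python 3, where bytes.decode returns str, and it is only a heuristic: short or unusual input can decode cleanly under the wrong encoding.

```python
def decode_japanese(data, candidates=('utf-8', 'euc-jp', 'cp932')):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    `data` is a bytes object. The candidate list is an assumption; put the
    encoding you expect most often first.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # not this encoding, try the next one
    raise ValueError('none of the candidate encodings matched')
```

A statistical detector such as chardet is the better choice when the candidate set is not known.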

text = text.replace('\r\n','\n')
text = text.replace('\r','\n')

text = text.replace('\n','')
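
The order of the replacements matters: '\r\n' has to be handled before a lone '\r', otherwise every Windows line ending turns into two newlines. A minimal sketch that folds the steps into one helper (the name normalize_newlines and the join flag are illustrative):

```python
import re

def normalize_newlines(text, join=False):
    """Convert CRLF and lone CR to LF; optionally drop the newlines entirely."""
    # '\r\n?' matches CRLF as one unit first, then any remaining lone CR.
    text = re.sub(r'\r\n?', '\n', text)
    if join:
        # Japanese text has no spaces between words, so lines are simply joined.
        text = text.replace('\n', '')
    return text
```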

import unicodedata

text = unicodedata.normalize('NFKC', text)
print text
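
NFKC is useful for Japanese text because it unifies width variants: full-width letters and digits become ASCII, and half-width katakana become ordinary katakana. A small sketch for Python 3 (the helper name to_nfkc is illustrative):

```python
import unicodedata

def to_nfkc(text):
    """Apply Unicode compatibility normalization (NFKC) to a str."""
    return unicodedata.normalize('NFKC', text)

# Width variants collapse to one canonical form, so these compare equal.
same = to_nfkc('ＡＢＣ１２３') == to_nfkc('ABC123')
```

Here to_nfkc('ＡＢＣ１２３') returns 'ABC123' and to_nfkc('ｶﾞｷﾞｸﾞ') returns 'ガギグ'. Note that NFKC also rewrites other compatibility characters, so apply it before tokenizing rather than after.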

http://straitmouth.jp/blog/setomits/139
http://straitmouth.jp/blog/setomits/877

import re
htmltag = re.compile(r'<.*?>', re.I | re.S)
text = htmltag.sub('', html)
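
The regular expression removes the tags but keeps everything between them, including the bodies of script and style elements. A sketch of an alternative using only the standard library html.parser module (Python 3); the class and function names are illustrative, and this is not how BeautifulSoup does it:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def strip_tags(html):
    """Return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)
```

For badly broken markup a dedicated library such as BeautifulSoup is likely to be more forgiving.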

http://www.crummy.com/software/BeautifulSoup/

&

http://www.programming-magic.com/20080820002254/
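
Character references such as &amp; or &#12354; survive regex-based tag stripping as literal text. A minimal sketch of decoding them with the standard library, assuming Python 3.4 or later where html.unescape is available; the helper name is illustrative:

```python
import html

def unescape_entities(text):
    """Replace named and numeric character references with the characters."""
    return html.unescape(text)
```

Decode the references after removing the tags, not before; otherwise an escaped &lt;b&gt; in the text would be turned into a real tag and stripped.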

MeCab

MeCab: http://mecab.sourceforge.net/

TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/

http://www.limsi.fr/Individu/pointal/python/treetaggerwrapper-doc/

guess-language: http://pypi.python.org/pypi/guess-language/

LanguageGuesser

ngram.py: http://thomas.mangin.me.uk/
http://thomas.mangin.me.uk/data/source/ngram.py

Lingua::LanguageGuesser:
http://gensen.dl.itc.u-tokyo.ac.jp/LanguageGuesser/hajimete_monogatari.html
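
The tools listed here guess the language from character n-gram statistics. A rough sketch of the rank-order ("out-of-place") comparison of n-gram profiles described by Cavnar and Trenkle, the method TextCat is known for. It is not the code of ngram.py, guess-language or LanguageGuesser; all function names, the profile size and the tiny training samples a caller would pass in are illustrative assumptions. Written for Python 3.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top=300):
    """Return the most frequent character n-grams (1..n_max), best first."""
    padded = ' ' + ' '.join(text.lower().split()) + ' '
    counts = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    # Sort by descending count, then by the n-gram itself, so ranks are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]

def out_of_place(doc_profile, lang_profile, max_penalty=300):
    """Sum of rank differences; n-grams missing from the language cost max_penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )

def guess_language(text, training):
    """Pick the key of `training` (language -> sample text) with the closest profile."""
    doc = ngram_profile(text)
    profiles = {lang: ngram_profile(sample) for lang, sample in training.items()}
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

Real profiles are built from large corpora per language; with samples of a sentence or two the result is only reliable for languages whose scripts barely overlap.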

TextCat

  • LanguageGuesser

TextCat

MeCab

http://mecab.sourceforge.net/dic.html

% mecab












2
1
1
1
2
1
1
1
1
1
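#!/usr/bin/env python
# -*- coding:utf-8 -*-
"""
feature_vector.py

Count how often each word (MeCab surface form) occurs in a text.

Command line:

% python feature_vector.py file

As a module:

import feature_vector

result = feature_vector.analyse(text)
"""
import MeCab

def analyse(text):
   # The set-up and counting lines are a reconstruction (assumption):
   # tokenize with MeCab and walk the resulting linked list of nodes.
   tagger = MeCab.Tagger()
   node = tagger.parseToNode(text)
   feature_vector = {}

   while node:
       # surface is the token as written in the text (UTF-8 bytes in Python 2)
       surface = node.surface.decode('utf-8')
       feature_vector[surface] = feature_vector.get(surface, 0) + 1
       node = node.next
   
   return feature_vector

if __name__ == '__main__':
   import sys
   filename = sys.argv[1]
   text = open(filename).read()
   feature_vector = analyse(text)

   # One "word<TAB>count" line per entry; encode so that piping the output works
   for word,freq in feature_vector.items():
       print "%s\t%d" % (word.encode('utf-8'),freq)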
yono@orca% cat test.txt                                              


yono@orca% python feature_vector.py test.txt             
 BOS/EOS,*,*,*,*,*,*,*,*











 BOS/EOS,*,*,*,*,*,*,*,*
        2









http://gist.github.com/231879
MeCab

MeCab

yono@orca% mecab                                                       



EOS


EOS



EOS
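
In its default output format mecab prints one token per line, the surface form followed by a tab and the comma-separated features, and closes each sentence with an EOS line. A small sketch that counts surface forms from such captured output without the Python binding; the function name count_surfaces is illustrative, and it assumes the default format (a custom output format would need a different split). Written for Python 3.

```python
from collections import Counter

def count_surfaces(mecab_output):
    """Count surface forms in text captured from the mecab command."""
    counts = Counter()
    for line in mecab_output.splitlines():
        # Skip the sentence terminator and anything that is not a token line.
        if line == 'EOS' or '\t' not in line:
            continue
        surface, _features = line.split('\t', 1)
        counts[surface] += 1
    return counts
```

The output can be captured by piping a file through mecab and reading the result, which avoids counting the BOS/EOS nodes that parseToNode returns.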

http://gist.github.com/271862