Counting word occurrences with Python
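Today, on the blog of Mr. Lai, the founder of the Pearl River Delta tech salon (http://blog.csdn.net/lanphaday/archive/2011/03/31/6291668.aspx), I came across a Python interview question that he solved, rather impressively, with the Pipe module. I have no idea how Pipe works, so I tried writing a solution the traditional way. Working through the problem, I found that the regular expression "\W" is remarkably handy.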
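The problem is as follows:
Read a file, count how many times each word appears in it, then sort the words by count from highest to lowest.
Here is the code: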
#!/usr/bin/env python
import re

addr = "/home/yarkee/test.txt"
p = re.compile(r"\W")  # \W matches any non-word character, i.e. [^a-zA-Z0-9_] for ASCII text
with open(addr) as f:
    text = f.read()
wn = p.split(text)  # list of all the words, plus empty strings between adjacent separators
wd = {}  # dict mapping each word to the number of times it appears
for w in wn:
    if w != '':
        w = w.lower()
        if w in wd:
            wd[w] += 1
        else:
            wd[w] = 1
wl = list(wd.items())
print(sorted(wl, key=lambda x: x[1], reverse=True))  # sort by count, descending
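The empty-string check in the loop is needed because splitting on a single "\W" produces an empty string wherever two separators sit next to each other, and again at a trailing separator. A minimal sketch of that behaviour follows; the sample sentence is made up purely for illustration.

```python
import re

sample = "Hello, world! Hello again."  # made-up sample text

# ", " and "! " are two separators in a row, and "." ends the string,
# so the split result contains empty strings at those positions.
parts = re.split(r"\W", sample)
print(parts)  # ['Hello', '', 'world', '', 'Hello', 'again', '']

# Dropping the empty strings and lowercasing leaves just the words.
words = [w.lower() for w in parts if w]
print(words)  # ['hello', 'world', 'hello', 'again']

# The underscore counts as a word character, so \W does not split on it.
print(re.split(r"\W", "snake_case"))  # ['snake_case']
```

Matching runs of word characters with re.findall(r"\w+", ...) gives the same words without the filtering step.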
Python 2.7 and later come with the super powerful Counter in the collections module, which makes this task a real pleasure:
http://docs.python.org/dev/library/collections.html
>>> # Find the ten most common words in Hamlet
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
('you', 554), ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]
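Applied to the original interview question, the Counter approach might look like the sketch below. The function names count_words and count_words_in_file are hypothetical helpers introduced here for illustration, not part of any library, and the sample sentence is made up.

```python
import re
from collections import Counter


def count_words(text):
    """Return (word, count) pairs sorted by count, highest first."""
    # \w+ matches runs of word characters, so no empty strings appear.
    words = re.findall(r"\w+", text.lower())
    # most_common() with no argument returns every pair, most frequent first.
    return Counter(words).most_common()


def count_words_in_file(path):
    """Hypothetical wrapper: read the file at `path` and count its words."""
    with open(path) as f:
        return count_words(f.read())


result = count_words("the cat and the hat and the bat")
print(result[:2])  # [('the', 3), ('and', 2)]
```

Words with equal counts come back in no guaranteed order on older Python versions, so anything that depends on tie order should sort on a second key explicitly.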