Python word segmentation and counting: Chinese word frequency statistics in Python

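Word frequency statistics means taking a sentence or an article as input and counting how many times each word occurs in it. This article walks through Chinese word frequency counting in Python; readers who need it can use it as a reference.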

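Contents: Preface; I. Text import; II. Usage steps (1. Import libraries; 2. Read in the data; 3. Extract the stop word list; 4. Segment and remove stop words, at which point Python's built-in functions can do the word frequency count directly; 5. Write the useful words, segmented and with stop words removed, to a txt file; 6. Function call; 7. Results); Appendix: given a passage, count how many times each word occurs.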

Preface

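I prepared a text file named abstract.txt.

I then downloaded stopword.txt from the web (a stop word list for jieba segmentation).

I added a few more words to it that I consider useless.

In addition, I built my own dictionary, extraDict.txt.

With the preparation done, let's see how to use these files!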

I. Text import

II. Usage steps

The code is as follows:

import jieba
from jieba.analyse import extract_tags
from sklearn.feature_extraction.text import TfidfVectorizer

1. Import libraries

The code is as follows:

jieba.load_userdict("extraDict.txt")  # load your own dictionary

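2. Read in the data

def stopwordlist():
    # One stop word per line; the file name follows the preface.
    with open("stopword.txt", encoding="UTF-8") as f:
        stopwords = [line.strip() for line in f]
    # --- extra stop words, adjust to your own data ---
    for i in range(19):
        stopwords.append(str(10 + i))  # the numbers 10 to 28 as strings
    # ---
    return stopwords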
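Because every token is later tested against the stop words, a set gives faster membership checks than a list. The following is a minimal sketch of that idea, assuming one stop word per line; `load_stopwords` and its `extra` parameter are hypothetical names introduced here for illustration and are not part of the original code.

```python
def load_stopwords(path, extra=()):
    """Read one stop word per line into a set (hypothetical helper)."""
    with open(path, encoding="UTF-8") as f:
        # Skip blank lines so the empty string never becomes a stop word.
        words = {line.strip() for line in f if line.strip()}
    words.update(extra)  # optional, caller-supplied additions
    return words
```

Loading the set once and passing it to the segmentation step would also avoid re-reading the file for every line of input.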

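3. Extract the stop word list

def seg_word(line):
    # Segment one line and drop stop words; returns one kept word per output line.
    # seg = jieba.cut_for_search(line.strip())
    seg = jieba.cut(line.strip())
    temp = ""
    counts = {}
    wordstop = stopwordlist()
    for word in seg:
        if word not in wordstop:
            if word != " ":
                temp += word
                temp += "\n"
                counts[word] = counts.get(word, 0) + 1  # count how often each word occurs
    return temp  # the segmentation result
    # return str(sorted(counts.items(), key=lambda x: x[1], reverse=True)[:20])  # the 20 most frequent words and their counts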
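Once the text has been segmented, the counting itself can be done with the standard library alone. The sketch below is one way to do it, assuming the tokens are already available as an iterable of strings; `top_words` and `sample_tokens` are hypothetical names used only for illustration.

```python
from collections import Counter

def top_words(tokens, stopwords, n=20):
    """Return the n most frequent tokens, ignoring stop words and blanks."""
    counts = Counter(
        tok for tok in tokens
        if tok.strip() and tok not in stopwords
    )
    return counts.most_common(n)

# Hand-written tokens standing in for the output of a segmenter.
sample_tokens = ["词频", "统计", "的", "词频", " ", "python", "统计", "词频"]
print(top_words(sample_tokens, {"的"}, n=2))
# → [('词频', 3), ('统计', 2)]
```

With jieba installed, the tokens could come from `jieba.cut(text)` instead of the hand-written list.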

5. Write the useful words (segmented, stop words removed) to a txt file

def output(inputfilename, outputfilename):
    inputfile = open(inputfilename, encoding="UTF-8", mode="r")
    outputfile = open(outputfilename, encoding="UTF-8", mode="w")
    for line in inputfile.readlines():
        line_seg = seg_word(line)  # segment each line and drop stop words
        outputfile.write(line_seg)
    inputfile.close()
    outputfile.close()
    return outputfile

6. Function call

if __name__ == "__main__":
    print("__name__", __name__)
    inputfilename = "abstract.txt"
    outputfilename = "a1.txt"
    output(inputfilename, outputfilename)

7. Results

Appendix: given a passage, count how many times each word occurs

Let me first explain the idea.

Suppose we are given the following passage:

Love is more than a word
it says so much.
When I see these four letters,
I almost feel your touch.
This is only happened since
I fell in love with you.
Why this word does this,
I haven’t got a clue.
　　

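To count how many times each word occurs, the idea is simple. Split the string into words, define an empty dictionary count_dict, and walk through the words one by one. If a word is not yet a key of count_dict, add it as a key with the value 1; if it is already there, add 1 to the value stored under that key.

Now the code: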

# Define the string (triple quotes are convenient for a long, multi-line string)
sentences = """
Love is more than a word
it says so much.
When I see these four letters,
I almost feel your touch.
This is only happened since
I fell in love with you.
Why this word does this,
I haven't got a clue.
"""
# Implementation
# Remove the commas; to strip many kinds of punctuation use a loop, two calls are enough here
sentences = sentences.replace(",", "")
sentences = sentences.replace(".", "")  # remove the periods
sentences = sentences.split()  # split into individual words; the result is a list
# print(sentences)
count_dict = {}
for sentence in sentences:
    if sentence not in count_dict:  # first time this word is seen
        count_dict[sentence] = 1
    else:  # the word is already in the dictionary
        count_dict[sentence] += 1
for key, value in count_dict.items():
    print(f"{key} appeared {value} time(s)")
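When more than a couple of punctuation marks need to be removed, one alternative to repeated `replace` calls is `str.translate` together with `collections.Counter`. This is a minimal sketch rather than the author's method; `count_words` is a hypothetical helper name, and it only strips ASCII punctuation as listed in `string.punctuation`.

```python
import string
from collections import Counter

def count_words(text):
    """Strip ASCII punctuation, split on whitespace, and count the words."""
    # A translation table that deletes every ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    return Counter(text.translate(table).split())

counts = count_words("Love is more than a word, it says so much. I fell in love with you.")
for word, value in counts.items():
    print(f"{word} appeared {value} time(s)")
```

As in the loop version, counting is case-sensitive, so "Love" and "love" are separate keys. Note also that deleting punctuation turns "haven't" into "havent", because the ASCII apostrophe is part of `string.punctuation`.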