Scenario
Official API:
Implementation
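font_path : string  # Path to the font file, including its extension, e.g. font_path = '黑体.ttf'
width : int (default=400)  # Width of the output canvas, 400 pixels by default
height : int (default=200)  # Height of the output canvas, 200 pixels by default
prefer_horizontal : float (default=0.90)  # How often words are laid out horizontally; with the default 0.9, vertical layout is used about 0.1 of the time
mask : nd-array or None (default=None)  # If not None, a binary mask that controls where words may be drawn. When a mask is given, width and height are ignored and the shape of the mask is used instead. Pure white (#FFFFFF) areas are left empty; all other areas are used for drawing. Example: bg_pic = imread('some_image.png')
scale : float (default=1)  # Scales the canvas proportionally; with 1.5, both width and height are 1.5 times those of the original canvas
min_font_size : int (default=4)  # Smallest font size used
font_step : int (default=1)  # Step size for the font; a step greater than 1 speeds up computation but may give a noticeably worse fit
max_words : number (default=200)  # Maximum number of words to display
stopwords : set of strings or None  # Words to exclude; if None, the built-in STOPWORDS set is used
background_color : color value (default="black")  # Background color, e.g. background_color='white' for a white background
max_font_size : int or None (default=None)  # Largest font size used
mode : string (default="RGB")  # When set to "RGBA" with background_color=None, the background is transparent
relative_scaling : float (default=.5)  # How strongly word frequency influences font size
color_func : callable (default=None)  # Function that returns a PIL color for each word; if None, self.color_func is used
regexp : string or None (optional)  # Regular expression used to split the input text into tokens
collocations : bool (default=True)  # Whether to include collocations (pairs) of two words
colormap : string or matplotlib colormap (default="viridis")  # Colormap from which a color is randomly drawn for each word; ignored if color_func is given
random_state : int or None  # Seed for the random number generator, so that a result can be reproduced

fit_words(frequencies)  # Generate a word cloud from word frequencies
generate(text)  # Generate a word cloud from text
generate_from_frequencies(frequencies[, ...])  # Generate a word cloud from word frequencies
generate_from_text(text)  # Generate a word cloud from text
process_text(text)  # Split a long text into words and remove stopwords (this works for English; Chinese text has to be segmented with another library first and then passed to fit_words(frequencies) above)
recolor([random_state, color_func, colormap])  # Recolor the existing output; this is much faster than regenerating the whole word cloud
to_array()  # Convert to a numpy array
to_file(filename)  # Write the image to a file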
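Because process_text(text) does not segment Chinese, the word frequencies have to be built beforehand. The following is a minimal sketch of that step using only the standard library. It assumes the text has already been split into tokens by a separate segmentation library; the function name build_frequencies, its min_length parameter, and the sample tokens are hypothetical and not part of wordcloud.

```python
from collections import Counter


def build_frequencies(tokens, stopwords=frozenset(), min_length=2):
    """Count tokens, skipping stopwords and tokens shorter than min_length.

    `tokens` is assumed to be the output of a separate word-segmentation
    step; this helper is illustrative and not part of wordcloud.
    """
    counts = Counter(
        token
        for token in tokens
        if token not in stopwords and len(token) >= min_length
    )
    return dict(counts)


# Hypothetical, already-segmented input
tokens = ["词云", "的", "词云", "字体", "的", "遮罩", "词云", "字体"]
frequencies = build_frequencies(tokens, stopwords={"的"})
print(frequencies)
```

The resulting dict maps each word to its count and can be passed to fit_words(frequencies) or generate_from_frequencies(frequencies). For Chinese words to render, font_path should point to a font that contains the required glyphs.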
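Supplement: using the WordCloud package in Python to generate word clouds
Result image:
This is a word cloud image generated with the wordcloud package in Python.
The basic usage of the wordcloud package is introduced below.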
The constructor of wordcloud takes the following parameters; each one is described in the official API listing above:
class wordcloud.WordCloud(font_path=None, width=400, height=200, margin=2, ranks_only=None, prefer_horizontal=0.9, mask=None, scale=1, color_func=None, max_words=200, min_font_size=4, stopwords=None, random_state=None, background_color='blac
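Example:
The shape the word cloud should take:
The black part of the image is where the word cloud will be drawn; no words are drawn in the white part. Below is a text document: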
How the Word Cloud Generator Works
The layout algorithm for positioning words without overlap is available on GitHub under an open source license as d3-cloud. Note that this is only the layout algorithm and any code for converting text into words and rendering the final output requires additional development.
As word placement can be quite slow for more than a few hundred words, the layout algorithm can be run asynchronously, with a configurable time step size. This makes it possible to animate words as they are placed without stuttering. It is recommended to always use a time step even without animations as it prevents the browser's event loop from blocking while placing the words.
The layout algorithm itself is incredibly simple. For each word, starting with the most "important":
1. Attempt to place the word at some starting point: usually near the middle, or somewhere on a central horizontal line.
2. If the word intersects with any previously placed words, move it one step along an increasing spiral.
3. Repeat until no intersections are found.
The hard part is making it perform efficiently! According to Jonathan Feinberg, Wordle uses a combination of hierarchical bounding boxes and quadtrees to achieve reasonable speeds.
Glyphs in JavaScript
There isn't a way to retrieve precise glyph shapes via the DOM, except perhaps for SVG fonts. Instead, we draw each word to a hidden canvas element, and retrieve the pixel data.
Retrieving the pixel data separately for each word is expensive, so we draw as many words as possible and then retrieve their pixels in a batch operation.
Sprites and Masks
My initial implementation performed collision detection using sprite masks. Once a word is placed, it doesn't move, so we can copy it to the appropriate position in a larger sprite representing the whole placement area.
The advantage of this is that collision detection only involves comparing a candidate sprite with the relevant area of this larger sprite, rather than comparing with each previous word separately.
Somewhat surprisingly, a simple low-level hack made a tremendous difference: when constructing the sprite I compressed blocks of 32 1-bit pixels into 32-bit integers, thus reducing the number of checks (and memory) by 32 times.
In fact, this turned out to beat my hierarchical bounding box with quadtree implementation on everything I tried it on (even very large areas and font sizes). I think this is primarily because the sprite version only needs to perform a single collision test per candidate area, whereas the bounding box version has to compare with every other previously placed word that overlaps slightly with the candidate area.
Another possibility would be to merge a word's tree with a single large tree once it is placed. I think this operation would be fairly expensive though compared with the analogous sprite mask operation, which is essentially ORing a whole block.
The following code generates a word cloud from this text:
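#!/usr/bin/python
# -*- coding: utf-8 -*-

# Import wordcloud, matplotlib, and the image-loading helpers.
# scipy.misc.imread has been removed from SciPy, so Pillow and NumPy are used instead.
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator

# Read the text file
with open('test.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Read the background image that serves as the mask
bg_pic = np.array(Image.open('3.png').convert('RGB'))

# Generate the word cloud
wordcloud = WordCloud(mask=bg_pic, background_color='white', scale=1.5).generate(text)
# Color function derived from the image; to apply it, call wordcloud.recolor(color_func=image_colors)
image_colors = ImageColorGenerator(bg_pic)

# Display the word cloud
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

# Save the image
wordcloud.to_file('test.jpg')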
Output:
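If no mask image such as 3.png is at hand, a mask can also be built directly as an array. The sketch below assumes NumPy is installed and follows the rule described for the mask parameter: pixels with value 255 (white) are left empty, and everything else is available for words. The function name circular_mask and its default sizes are hypothetical, chosen only for illustration.

```python
import numpy as np


def circular_mask(size=300, radius=130):
    """Return a (size, size) uint8 array: 0 inside the circle, 255 outside.

    White (255) entries are the ones a word cloud mask treats as
    "do not draw here"; the name and defaults are illustrative.
    """
    y, x = np.ogrid[:size, :size]
    center = size / 2
    outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    return (255 * outside).astype(np.uint8)


mask = circular_mask()
# Fraction of the canvas that is not white, i.e. available for words
drawable = float((mask != 255).mean())
print(f"{drawable:.0%} of the canvas is available for words")
```

Passing mask=mask to WordCloud in place of bg_pic should then confine the words to the circle; as noted in the parameter list, width and height are ignored once a mask is given.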
The above is based on my personal experience; I hope it serves as a useful reference. If anything is wrong or incomplete, corrections are welcome.