Removing stop words in Python

Donal · 2017-05-25 · via 博客园 - Donal

Try caching the stop word list, as shown below. Building it on every call seems to be the bottleneck: inside the list comprehension, stopwords.words("english") is evaluated once per word, not once per function call.

    from nltk.corpus import stopwords

    cachedStopWords = stopwords.words("english")

    def testFuncOld():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])

    def testFuncNew():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in cachedStopWords])

    if __name__ == "__main__":
        # range works on both Python 2 and 3 (xrange exists only on Python 2)
        for i in range(10000):
            testFuncOld()
            testFuncNew()

I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.

    nCalls  Cumulative Time  Function
    10000   7.723            words.py:7(testFuncOld)
    10000   0.140            words.py:11(testFuncNew)

So, caching the stop word list gives a roughly 55x speedup (7.723 s versus 0.140 s cumulative time).
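A possible further refinement, not measured in the profile above, is to keep the cached words in a set rather than a list, since set membership tests are constant time on average while list lookups scan the whole list. The sketch below assumes a small hard-coded stand-in for the NLTK list so that it runs without the corpus installed; the names STOP_WORDS and remove_stop_words are hypothetical. In practice the set would be built once from stopwords.words("english").

```python
# Stand-in stop word list (an assumption for illustration only); in real use,
# build this once with frozenset(stopwords.words("english")).
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "in"})


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return text with every whitespace-separated token found in stop_words dropped."""
    # Membership in a frozenset is a hash lookup, so the cost per word
    # does not grow with the size of the stop word list.
    return " ".join(word for word in text.split() if word not in stop_words)


if __name__ == "__main__":
    print(remove_stop_words("hello bye the the hi"))  # prints "hello bye hi"
```

Whether this helps noticeably depends on the length of the stop word list and of the input text, so it is worth profiling in the same way before relying on it.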
