To be really clear, this is 100 percent theoretical: It’s a research paper, not a product announcement or anything equally exciting. (Google publishes hundreds of research papers a year.) Still, the fact that a search engine could effectively evaluate truth, and that Google is actively contemplating that technology, should boggle the brain. After all, truth is a slippery, malleable thing — and grappling with it has traditionally been an exclusively human domain.
Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities. Stanford CoreNLP is an integrated framework, which makes it very easy to apply a bunch of language analysis tools to a piece of text. Starting from plain text, you can run all the tools on it with just two lines of code. Its analyses provides the foundational building blocks for higher-level and domain-specific text understanding applications.
Stanford CoreNLP integrates all Stanford NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, and the sentiment analysis tools, and provides model files for analysis of English. The goal of this project is to enable people to quickly and painlessly get complete linguistic annotations of natural language texts. It is designed to be highly flexible and extensible. With a single option, you can choose which tools should be enabled and which should be disabled.
A few months ago I was working on a project that had a word cloud-like feature. A word cloud is an interesting way to visually represent a popular theme or topic. I had a dataset of user reviews from another project that we wanted to parse and use. This began my first exposure to Natural Language Processing NLP and other advanced text analytics tools.
Instead manually of annotating text, one should try to benefit from an existing
annotated and publicly available text corpus that deals with a wide range of topics,
Our approach is rather simple: the text body of Wikipedia articles is rich in internal links
pointing to other Wikipedia articles. Some of those articles are referring to the entity classes
we are interested in (e.g. person, countries, cities, …). Hence we just need to find a way
to convert those links into entity class annotations on text sentences (without the
Wikimarkup formatting syntax).
The Apache OpenNLP library is a machine learning based toolkit for processing of natural language text. It includes a sentence detector, a tokenizer, a name finder, a parts-of-speech (POS) tagger, a chunker, and a parser. It has very good APIs that can be easily integrated with a Java program. However, the documentation contains unupdated information.
via OpenNLP example.
The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also included maximum entropy and perceptron based machine learning.
The goal of the OpenNLP project will be to create a mature toolkit for the abovementioned tasks. An additional goal is to provide a large number of pre-built models for a variety of languages, as well as the annotated text resources that those models are derived from.