Text Mining and Web Mining

Text Mining

Text mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking of the extracted information to form new facts or new hypotheses to be explored further by more conventional means of experimentation.

Text mining is different from what we’re familiar with in web search. In search, the user is typically looking for something that is already known and has been written by someone else. The problem is pushing aside all the material that currently isn’t relevant to your needs in order to find the relevant information.

In text mining, the goal is to discover heretofore unknown information, something that no one yet knows and so could not have yet written down.

The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts. Databases are designed for programs to process automatically; text is written for people to read. We do not have programs that can “read” text and will not have them for the foreseeable future. Many researchers think it will require a full simulation of how the mind works before we can write programs that read the way people do.

However, there is a field called computational linguistics (also known as natural language processing) which is making a lot of progress on small subtasks in text analysis. For example, it is relatively easy to write a program to extract phrases from an article or book that, when shown to a human reader, seem to summarize its contents. (The most frequent words and phrases in this article, minus the really common words like “the”, are: text mining, information, programs, and example, which is not a bad five-word summary of its contents.)
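The frequency-based summary described above can be sketched in a few lines. This is only an illustration: the function name `top_terms` and the short `STOP_WORDS` list are assumptions made for the example, and a real system would use a much longer stop-word list and handle phrases as well as single words.

```python
import re
from collections import Counter

# Assumed stop-word list, kept short for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that", "for", "from"}

def top_terms(text, n=5):
    """Return the n most frequent words in text that are not stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]

if __name__ == "__main__":
    sample = "The text of the text is text mining and mining"
    print(top_terms(sample, 2))
```

Counting single words is the simplest choice; extracting multi-word phrases such as “text mining” would need an extra step that counts adjacent word pairs.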

Typical applications of text mining include analyzing open-ended survey responses. For example, you may discover a certain set of words or terms that respondents commonly use to describe the pros and cons of a product or service under investigation, suggesting common misconceptions or confusion regarding the items in the study.

Another application is the automatic classification of texts. For example, it is possible to automatically filter out most undesirable “junk email” based on certain terms or words that are unlikely to appear in legitimate messages and instead identify undesirable electronic mail. Such messages can then be discarded automatically. Automatic systems for classifying electronic messages are also useful where messages need to be routed to the most appropriate department or agency. For example, email messages with complaints or petitions to a municipal authority can be routed automatically to the appropriate departments; at the same time, the emails are screened for inappropriate or obscene content, and offending messages are automatically returned to the sender with a request to remove the offending words or content.
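As a rough sketch of the routing idea above, a message can be assigned to whichever department's keyword list it matches best, after first being screened against a list of blocked words. The department names, keyword sets, the `BLOCKED` placeholder word and the `route` function are all hypothetical, chosen only to make the example runnable.

```python
import re

# Hypothetical departments and keyword sets, for illustration only.
ROUTES = {
    "roads": {"pothole", "street", "traffic"},
    "sanitation": {"garbage", "trash", "recycling"},
}
# Placeholder standing in for a list of inappropriate words.
BLOCKED = {"darn"}

def route(message):
    """Return the department a message should go to, based on keyword overlap."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    # Screen first: offending messages go back to the sender.
    if words & BLOCKED:
        return "return-to-sender"
    # Score each department by how many of its keywords appear in the message.
    scores = {dept: len(words & keywords) for dept, keywords in ROUTES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"

if __name__ == "__main__":
    print(route("There is a pothole on my street"))
```

Messages that match no department fall through to an assumed "general" queue rather than being discarded.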

A text mining algorithm consists of three steps.

1. Train. Create an attribute dictionary in which the attributes are words from articles related to a particular topic. Choose only words that occur a minimum number of times.
2. Filter. Remove the common words known to be useless for differentiating articles, e.g. “the”, “as”, “we”.
3. Classify. Check each document to be classified for the presence and frequency of the chosen attributes. Classify the document under a particular topic if it contains a predetermined minimum number of references to the chosen attributes for the topic.
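The three steps above can be sketched as follows. This is a minimal illustration, not a production classifier: the function names, the `STOP_WORDS` list and the default thresholds `min_count=2` and `min_refs=3` are assumptions chosen for the example.

```python
import re
from collections import Counter

# Assumed list of common words to filter out (step 2).
STOP_WORDS = {"the", "as", "we", "a", "an", "of", "and", "to", "in", "is"}

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())

def train(articles, min_count=2):
    """Step 1: keep words that occur at least min_count times in the topic's articles."""
    counts = Counter(word for article in articles for word in tokenize(article))
    return {word for word, count in counts.items() if count >= min_count}

def filter_common(dictionary):
    """Step 2: drop common words that do not help tell articles apart."""
    return dictionary - STOP_WORDS

def classify(document, dictionary, min_refs=3):
    """Step 3: the document belongs to the topic if it refers to the
    dictionary words at least min_refs times in total."""
    refs = sum(1 for word in tokenize(document) if word in dictionary)
    return refs >= min_refs

if __name__ == "__main__":
    sports = ["The goal scored in the match", "The match ended with a late goal"]
    dictionary = filter_common(train(sports))
    print(classify("A late goal won the match after another goal", dictionary))
```

With several topics, the same steps would be repeated per topic, giving each topic its own dictionary.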

Web Mining

Web mining is the application of data mining techniques to discover patterns from the Web. According to the analysis target, web mining can be divided into three types: Web usage mining, Web content mining and Web structure mining.

Web Usage Mining

Web usage mining is the process of finding out what users are looking for on the internet. Some users might be looking only at textual data, whereas others might want multimedia data. Web usage mining also helps to find the search patterns of a particular group of people belonging to a particular region.

Application of web usage mining

Web usage mining extracts useful information from clickstream analysis of web server logs, which contain details of web page visits and transactions. Web server log analyzers such as NetTracker and AWStats show how often a website is visited and, for an e-commerce website, which products are the best and worst sellers. The ability to track web users’ browsing behaviour down to individual mouse clicks makes it possible to personalise services for individual customers on a massive scale. This ‘mass customisation’ of services not only helps customers by satisfying their needs, but also results in customer loyalty. With a more personalised and customer-centred approach, the content and structure of a web site can be evaluated and adapted to the customer’s preferences, and the right offers can be made to the right customer.
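To make the log analysis concrete, here is a minimal sketch that counts successful page visits from server log lines. It assumes the logs are in the widely used Common Log Format; the `LOG_LINE` pattern and the `page_visits` function are illustrative names, and this is not how NetTracker or AWStats work internally.

```python
import re
from collections import Counter

# Assumed layout (Common Log Format):
# host ident user [timestamp] "METHOD path PROTOCOL" status size
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)[^"]*" (\d{3}) \S+')

def page_visits(log_lines):
    """Count requests per path, keeping only those answered with status 200."""
    visits = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group(4) == "200":
            visits[match.group(3)] += 1
    return visits

if __name__ == "__main__":
    log = [
        '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /products/1 HTTP/1.1" 200 2326',
        '10.0.0.2 - - [10/Oct/2023:13:56:01 +0000] "GET /products/1 HTTP/1.1" 200 2326',
        '10.0.0.1 - - [10/Oct/2023:13:57:12 +0000] "GET /products/2 HTTP/1.1" 200 1804',
        '10.0.0.3 - - [10/Oct/2023:13:58:40 +0000] "GET /missing HTTP/1.1" 404 512',
    ]
    for path, count in page_visits(log).most_common():
        print(path, count)
```

Grouping the same counts by the host field instead of the path would give a per-visitor view, which is the starting point for the personalisation described above.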

Web Structure Mining

Web structure mining is the process of using graph theory to analyse the node and connection structure of a web site. According to the type of web structural data, web structure mining can be divided into two kinds.

The first kind of web structure mining extracts patterns from hyperlinks in the web. A hyperlink is a structural component that connects a web page to a different location. The other kind of web structure mining is mining the document structure. It uses the tree-like structure of a page to analyse and describe the HTML (Hyper Text Markup Language) or XML (eXtensible Markup Language) tags within the web page.
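Both kinds of structural data can be pulled out of a page with Python's standard `html.parser` module. The sketch below collects hyperlink targets and records each tag together with its depth in the tag tree. The class name `StructureParser`, the helper `analyse` and the short `VOID_TAGS` list are assumptions for the example, and the depth tracking is simplified: it expects reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; a partial list assumed for illustration.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

class StructureParser(HTMLParser):
    """Collects hyperlink targets and a (depth, tag) outline of the document tree."""

    def __init__(self):
        super().__init__()
        self.links = []     # href values of <a> tags, in document order
        self.outline = []   # (depth, tag) for every opening tag
        self._stack = []    # currently open tags

    def handle_starttag(self, tag, attrs):
        self.outline.append((len(self._stack), tag))
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open tag.
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

def analyse(html):
    """Return (links, outline) for an HTML string."""
    parser = StructureParser()
    parser.feed(html)
    return parser.links, parser.outline

if __name__ == "__main__":
    page = '<html><body><p>See <a href="/about">about</a></p></body></html>'
    links, outline = analyse(page)
    print(links)
    for depth, tag in outline:
        print("  " * depth + tag)
```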

Application of Web Content and Web Structure Mining

Structure mining can help improve search results by identifying popular sites (so-called ‘authorities’), for example by analysing the number of links that refer to a particular site. Web content and structure mining are not only used to improve the quality of public search engines. Special search services can also be offered. Content and structure mining tools can, for instance, track down online misuse of brands, or analyse the content and structure of competitors’ web sites in detail to gain some strategic advantage. With content and structure mining tools, things like online curricula vitae or personal homepages can be collected. After interpreting the personal data found on personal pages, this information could be used for marketing purposes. Profiles of potential customers can be produced, and more detailed information can be added to the profiles of current customers. So mining the web not only contributes to acquiring new customers, it can also aid in retaining existing ones.
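The link-counting idea for spotting ‘authorities’ can be sketched as a simple in-link count over a link graph. The function name `rank_by_inlinks` and the dictionary representation of the graph are assumptions for illustration; search engines use considerably more refined measures than a raw count.

```python
from collections import Counter

def rank_by_inlinks(link_graph):
    """Rank pages by the number of distinct other pages that link to them.

    link_graph maps each page to the list of pages it links to.
    """
    inlinks = Counter()
    for source, targets in link_graph.items():
        # Count each linking page once, and ignore links from a page to itself.
        for target in set(targets):
            if target != source:
                inlinks[target] += 1
    return inlinks.most_common()

if __name__ == "__main__":
    graph = {
        "a": ["c", "b"],
        "b": ["c"],
        "c": ["a"],
        "d": ["c", "c"],
    }
    print(rank_by_inlinks(graph)[0])
```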
