Clustering News Articles

I subscribe to multiple publications, including the Wall Street Journal, the Financial Times, CNN and MSNBC. The problem with having so much information at my fingertips is that I usually go to the front page of each website and follow only a few links. I almost always miss relevant articles because the links to them are usually not easy to find. I wanted to aggregate the articles so that those on related topics were grouped together. To do that I had to

  • Scan and download all new articles on each site.
  • Cluster the articles automatically based on keywords.

I wrote a web crawler in Python that indexes each site and pulls in any new articles I have not downloaded before. There is plenty of literature out there on scraping websites. The book I found useful was Web Scraping with Python (sold on Amazon).
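The "pull in only new articles" step can be sketched with the standard library alone. This is a minimal illustration, not the crawler described above: the names LinkParser and find_new_links, the example.com URLs and the in-memory "seen" set are all assumptions made for the example. A real crawler would also fetch the pages over HTTP and persist the set of seen URLs between runs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the target of every <a href=...> tag on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links and drop any "#fragment" suffix
            # so the same article is not counted twice.
            url, _fragment = urldefrag(urljoin(self.base_url, href))
            self.links.append(url)


def find_new_links(html, base_url, seen):
    """Return links in `html` that are not in `seen`, and mark them as seen."""
    parser = LinkParser(base_url)
    parser.feed(html)
    fresh = []
    for url in parser.links:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


# Hypothetical front page with one repeated link and one already-seen link.
page = (
    '<a href="/articles/1">One</a> '
    '<a href="/articles/2#comments">Two</a> '
    '<a href="/articles/1">One again</a>'
)
seen = {"https://example.com/articles/2"}
new_links = find_new_links(page, "https://example.com/", seen)
print(new_links)
```

Running find_new_links a second time on the same page returns an empty list, because every link is in the seen set by then.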

Once we have the articles, we preprocess and cluster them as follows:

  • Convert everything to lowercase.
  • Tag each word with its part of speech. This classifies the word as a noun, verb, adverb, etc.
  • Lemmatize each word, i.e. group together the inflected forms of a word into a single item. For example:
      am, are, is ⇒ be
      having, have, had ⇒ have
      car, cars ⇒ car
  • Run TfIdf on the lemmatized documents. TfIdf stands for Term Frequency, Inverse Document Frequency. In a nutshell, a word is more important (and has a higher weight) if it occurs more often in a document (the Term Frequency part), unless it also occurs commonly in the other documents in our corpus (the Inverse Document Frequency part).
  • TfIdf gives us a matrix with one row per article and one column per word, where each entry is the weight of that word in that article. We can then cluster the articles using a common clustering algorithm like KMeans.
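To make the last two steps concrete, here is a small sketch of TfIdf weighting followed by k-means, written with only the standard library so the arithmetic is visible. It is not the code behind the results in this post, and the post does not say which libraries were used. The function names, the smoothed IDF formula, the farthest-point initialisation and the four toy documents are all assumptions made for the example. The input is assumed to be already lowercased and lemmatized.

```python
import math
from collections import Counter


def tfidf(docs):
    """Turn token lists into unit-length rows of TfIdf weights."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant): common terms get a lower weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iters=20):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        # Next seed: the row farthest from every centroid chosen so far.
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(far))
    labels = []
    for _ in range(iters):
        # Assignment step: each row joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(r, centroids[c]))
                  for r in rows]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


def top_terms(vocab, centroid, n=3):
    """The n highest-weighted terms of a centroid, usable as a cluster label."""
    ranked = sorted(zip(centroid, vocab), reverse=True)
    return [term for _, term in ranked[:n]]


# Toy corpus (hypothetical): two economy articles and two politics articles.
docs = [
    "economy growth quarter economy rise say".split(),
    "growth economy slow quarter say".split(),
    "trump administration official president say".split(),
    "president trump official warn say".split(),
]
vocab, rows = tfidf(docs)
labels, centroids = kmeans(rows, k=2)
for c, centroid in enumerate(centroids):
    print(c, top_terms(vocab, centroid))
```

On this toy input the two economy articles land in one cluster and the two politics articles in the other. The word "say" appears in every document, so its inverse document frequency is low and it does not make it into either cluster's top terms.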

Here is what some of the clusters look like for the WSJ articles from 30th April 2017. Each cluster is shown as its keywords, followed by the first 10 article titles in that cluster.

states, plan, house, cost, governments

  • Fugitive Mexican Ex-Governor Tomás Yarrington Had State Security While on the Run
  • Australia Considers Cross-Continent Pipeline to Beat Gas Shortages
  • GOP Health-Care Push Falls Short Again
  • America’s Most Anti-Reform Institution? The Media
  • Fugitive Mexican Governor Arrested in Guatemala
  • Pentagon Investigates Whether Army Rangers in Afghanistan Were Killed by Friendly Fire
  • New Plan, Same Hurdle in GOP’s Quest to Gut Obamacare
  • The Resurgent Threat of al Qaeda
  • Saudi Arabia Reinstates Perks for State Employees as Finances Improve
  • Trump Unveils Broad Tax-Cut Plan

growth, economy, economic, quarter, rose

  • Mexican Economy Maintains Growth in First Quarter
  • Outlook for Kenyan Economy Dimmed by Severe Drought
  • Economic Growth Lags Behind Rising Confidence Data
  • Economists See Growth Climbing in 2017, 2018, Then Dissipating
  • Economy Needs Consumers to Shop Again
  • U.K. Economy Slows Sharply Ahead of Election, Brexit Talks
  • From Diapers to Soda, Big Brands Feel Pinch as Consumers Pull Back
  • Stars Align for Emmanuel Macron—and France
  • Consumer Sentiment Remains High Despite GDP Report
  • South Korea’s Economy Grew 0.9% in First Quarter

trump, u.s., officials, president, administration

  • Pentagon Opens Probe Into Michael Flynn’s Foreign Payments
  • Two U.S. Service Members Killed in Afghanistan
  • Immigrant Crackdown Worries Food and Construction Industries
  • U.S. Launches Cruise Missiles at Syrian Air Base in Response to Chemical Attack
  • At NRA Meeting, Trump Warns of Challengers in 2020
  • Trump Backers in Phoenix Region Are Fine With His Learning Curve
  • Relative of Imprisoned Iranian-Americans Appeals to Trump for Help
  • Trump Issues New Warning to North Korea
  • U.S. Presses China on North Korea After Failed Missile Test
  • Trump’s Bipartisan War Coalition

share, billion, quarter, sales, company

  • What’s Keeping GM Going Strong Probably Won’t Last
  • Cardinal Health’s $6.1 Billion Deal for Some Medtronic Operations Raises Debt Concerns
  • UnitedHealth Profits Rise as it Exits Health-Care Exchanges
  • Lockheed Martin Hit By Middle East Charges
  • Johnson & Johnson Lifts Forecast on Actelion Tie-Up
  • Alphabet and Amazon Extend an Earnings Boom
  • How an ETF Gold Rush Rattled Mining Stocks
  • Ski-Park Operator Intrawest to Go Private in Latest Resort Deal
  • S&P’s Warning: Here Are 10 Public Retailers Most in Danger of Default
  • Advertising’s Biggest Threat Isn’t Digital Disruption

u.s., trade, company, trump, billion

  • Mexico Registers Small Trade Deficit in March
  • Coming to America: How Immigration Policy Has Changed the U.S.
  • Boeing Files Petition With Commerce Dept. Over Bombardier
  • Trump Administration Mulls More Trade Actions, Commerce Secretary Says
  • U.S. Hoteliers Go on Charm Offensive Amid Concerns Over Trump Policies
  • Today’s Top Supply Chain and Logistics News From WSJ
  • Dear Canada: It’s Not Personal, It’s Just Trade
  • Today’s Top Supply Chain and Logistics News From WSJ
  • Venezuela Creditor Seeks Asset Freeze on U.S. Refiner Citgo
  • Seoul Plays Down Possibility of Pre-Emptive U.S. Strike on North Korea