Data, simple. blog

David Tsukiyama
6/20/2015: Simple Polarity Scores with the New York Times API and TextBlob

A quick and dirty method to collect the comments on relevant New York Times articles and classify their text. First I want to restrict the comments to those on the week's most popular (most e-mailed) articles.


def emailed_results(data):
    # the list of articles sits under the 'results' key of the response
    return data['results']

def parsed_mailed(data):
    mailed = []
    for b in data:
        dic = {}
        dic['sub-title'] = b['abstract']
        dic['byline'] = b['byline']
        dic['column'] = b['column']
        dic['type'] = b['des_facet']
        dic['date'] =b['published_date']
        dic['section'] = b['section']
        dic['title'] = b['title']
        dic['url'] = b['url']
        
        mailed.append(dic)
    return mailed

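def title(data):
    # print each title with its index and return the titles as a list
    titles = [d['title'] for d in data]
    for i, t in enumerate(titles):
        print i, t
    return titles

def url(data):
    # print each URL with its index and return the URLs as a list
    urls = [d['url'] for d in data]
    for i, u in enumerate(urls):
        print i, u
    return urls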


def most_mailed(days, api):
    import urllib
    import json
    bucket = 'http://api.nytimes.com/svc/mostpopular/v2/mostemailed/all-sections/'
    string = bucket+days+api
    
    response_string = urllib.urlopen(string).read()
    response_dictionary = json.loads(response_string)
    
    results = emailed_results(response_dictionary)
    parsed_results = parsed_mailed(results)
    
    titles = title(parsed_results)
    urls = url(parsed_results)
    return titles, urls
    
    

#past seven days
days='7?'
api='api-key=##########'

#call the function
most_mailed(days, api)
		

Title url
To Lose Weight, Eating Less Is Far More Important Than Exercising More
How to Pick a Cellphone Plan for Traveling Abroad
Naomi Oreskes, a Lightning Rod in a Changing Climate
How to Make Online Dating Work
America’s Seniors Find Middle-Class ‘Sweet Spot’
Experts on Aging, Dying as They Lived
What It’s Like as a ‘Girl’ in the Lab
Three Simple Rules for Eating Seafood
Pope Francis, in Sweeping Encyclical, Calls for Swift Action on Climate Change
Stop Revering Magna Carta
Review: Pixar’s ‘Inside Out’ Finds the Joy in Sadness, and Vice Versa
Cardinals Investigated for Hacking Into Astros’ Database
In Tucson, an Unsung Architectural Oasis
Magna Carta, Still Posing a Challenge at 800
Democrats Being Democrats
Black Like Who? Rachel Dolezal’s Harmful Masquerade
A Sea Change in Treating Heart Attacks
In ‘Game of Thrones’ Finale, a Breakdown in Storytelling
My Choice for President? None of the Above
The Family Dog
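
The request URL above is put together by plain string concatenation. As a minimal sketch (written for Python 3, unlike the Python 2 snippets in this post), the same URL can be built with `urlencode`, and the titles and URLs can be returned as pairs rather than printed. The names `build_most_mailed_url` and `titles_and_urls` are hypothetical, and the sample response is made up to mirror the fields used in `parsed_mailed`.

```python
from urllib.parse import urlencode

# Endpoint taken from most_mailed() above.
BASE = 'http://api.nytimes.com/svc/mostpopular/v2/mostemailed/all-sections/'

def build_most_mailed_url(days, api_key):
    """Return the request URL for the most-emailed articles of the past `days` days."""
    return BASE + str(days) + '?' + urlencode({'api-key': api_key})

def titles_and_urls(response_dictionary):
    """Return (title, url) pairs from a decoded most-emailed response."""
    return [(d['title'], d['url']) for d in response_dictionary['results']]

# Made-up response with the same shape the post relies on.
sample = {'results': [{'title': 'The Family Dog',
                       'url': 'http://example.com/the-family-dog'}]}

print(build_most_mailed_url(7, 'YOUR-KEY'))
print(titles_and_urls(sample))
```

Returning pairs keeps each title next to its URL, which is convenient when the URLs are later fed to the comment-collecting loop.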

Next we loop through the list of article URLs and collect all the comments on each article. This takes several helper functions; the final one, nytimes, is called in the loop below.


df = []
for b in articles:
    initial_df = nytimes(b)
    df = df + initial_df 
    print 'Processing ' + str(b) + '...'
	
	
Processing http://www.nytimes.com/2015/06/16/upshot/to-lose-weight-eating-less-is-far-more-important-than-exercising-more.html...
Processing http://www.nytimes.com/2015/06/21/travel/how-to-pick-a-cellphone-plan-for-traveling-abroad.html...
Processing http://www.nytimes.com/2015/06/16/science/naomi-oreskes-a-lightning-rod-in-a-changing-climate.html...
Processing http://www.nytimes.com/2015/06/14/opinion/sunday/how-to-make-online-dating-work.html...
Processing http://www.nytimes.com/2015/06/15/business/economy/american-seniors-enjoy-the-middle-class-life.html...
Processing http://opinionator.blogs.nytimes.com/2015/06/19/depressed-try-therapy-without-the-therapist/...
Processing http://opinionator.blogs.nytimes.com/2015/06/17/experts-on-aging-dying-as-they-lived/...
Processing http://www.nytimes.com/2015/06/21/health/saving-heart-attack-victims-stat.html...
Processing http://www.nytimes.com/2015/06/18/opinion/what-its-like-as-a-girl-in-the-lab.html...
Processing http://www.nytimes.com/2015/06/19/world/europe/pope-francis-in-sweeping-encyclical-calls-for-swift-action-on-climate-change.html...
Processing http://www.nytimes.com/2015/06/14/opinion/three-simple-rules-for-eating-seafood.html...
Processing http://www.nytimes.com/2015/06/15/opinion/stop-revering-magna-carta.html...
Processing http://www.nytimes.com/2015/06/19/movies/review-pixars-inside-out-finds-the-joy-in-sadness-and-vice-versa.html...
Processing http://www.nytimes.com/2015/06/21/opinion/sunday/is-your-boss-mean.html...
Processing http://www.nytimes.com/2015/06/17/sports/baseball/st-louis-cardinals-hack-astros-fbi.html...
Processing http://www.nytimes.com/2015/06/14/travel/in-tucson-an-unsung-architectural-oasis.html...
Processing http://www.nytimes.com/2015/06/15/world/europe/magna-carta-still-posing-a-challenge-at-800.html...
Processing http://www.nytimes.com/2015/06/15/opinion/paul-krugman-democrats-being-democrats.html...
Processing http://www.nytimes.com/2015/06/16/opinion/rachel-dolezals-harmful-masquerade.html...
Processing http://www.nytimes.com/2015/06/16/arts/television/in-game-of-thrones-finale-a-breakdown-in-storytelling.html...
	 

We have now collected all the comments on the most-emailed articles of the past seven days.

len(df)
7092
df[0]

{'comment': u'The only way to lose weight permanently and arrive at a normal weight (117 pounds for me) is to create a calorie deficit and then a calorie equilibrium once that goal is achieved. 

Thirty-five hundred calories eaten but not burned is equal to one pound of fat gained and 3,500 calories burned, but not consumed is equal to one pound of fat lost. Obviously eating 3,500 calories is a lot quicker than burning them. It would take 35 miles of walking to do that.

Fifteen years ago I was obese and about 70 pounds heavier, but I have maintained a normal weight since losing my excess weight in the year 2,000.

After studying the long-term research results, I started free weight loss groups of people who were suffering from my problem. We logged our daily calories, exercise and weight with each other. I still do it and you can join me if you are motivated and have less than 100 pounds to lose. You can see our results and my before and after pictures on www.permanentweightloss.org.

Let me know if this interests you by emailing me at russellk100@gmail.com. See you lighter soon?', 'comment_type': u'comment', 'date': u'1434435437', 'editorsSelection': 0, 'email': u'russellk100@gmail.com', 'location': u'New York City', 'login': None, 'name': u'Roberta Russell', 'recommend': 23, 'replies': [], 'update_date': u'1434435486'}
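
The `date` and `update_date` fields in the entry above are strings of digits. Assuming they are seconds since the Unix epoch, a small Python 3 sketch like the one below turns them into readable UTC datetimes; `to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def to_utc(timestamp):
    """Convert a Unix timestamp given as a string (or number) to a UTC datetime."""
    return datetime.fromtimestamp(int(timestamp), tz=timezone.utc)

created = to_utc('1434435437')   # 'date' of the first comment above
updated = to_utc('1434435486')   # 'update_date' of the same comment

print(created.isoformat())
print((updated - created).total_seconds())
```

Under that assumption the first comment was posted on 2015-06-16, the same day the article's URL is dated.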

I am curious about how often comments are recommended by readers and how often they are selected by the editors.

import numpy as np
np.mean([b['editorsSelection'] for b in df])
0.025803722504230117
np.mean([b['recommend'] for b in df])
13.1056119571348
time=np.asarray([b['date'] for b in df]).astype(float)
recommendations=np.asarray([b['recommend'] for b in df]).astype(float)

import matplotlib.pyplot as plt
import seaborn

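# pair each timestamp with its recommendation count, sorted by time
model = sorted(zip(time, recommendations))
sliding = []

windowSize = 100

# running sums over the first window
tSum = sum([x[0] for x in model[:windowSize]])
rSum = sum([x[1] for x in model[:windowSize]])
sliding.append((tSum*1.0/windowSize,rSum*1.0/windowSize))

# slide the window forward one comment at a time
for i in range(windowSize,len(model)):
  tSum += model[i][0] - model[i-windowSize][0]
  rSum += model[i][1] - model[i-windowSize][1]
  sliding.append((tSum*1.0/windowSize,rSum*1.0/windowSize))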

X = [x[0] for x in sliding]
Y = [x[1] for x in sliding]

plt.plot(X, Y)
plt.title("Recommendations over Time", fontsize=15)
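
The loop above is a moving average: each plotted point is the mean timestamp and mean recommendation count over a window of 100 consecutive comments. As an alternative sketch (not the method used for the plot), the same kind of average can be computed with a cumulative sum in NumPy; `moving_average` is a hypothetical helper and the toy data stands in for the time-sorted recommendation counts.

```python
import numpy as np

def moving_average(values, window):
    """Mean of every run of `window` consecutive values."""
    values = np.asarray(values, dtype=float)
    # Prepend 0 so that csum[k] is the sum of the first k values.
    csum = np.cumsum(np.insert(values, 0, 0.0))
    return (csum[window:] - csum[:-window]) / window

# Toy data standing in for the time-sorted recommendation counts.
toy = [1, 2, 3, 4, 5, 6]
print(moving_average(toy, 3))
```

For a series of n values and a window of w, this returns n - w + 1 averages, one per full window.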


  

One of the quickest ways to get a read on text sentiment without having to build your own classifier is to use TextBlob. Its polarity score runs from -1.0 (most negative) to 1.0 (most positive).


from textblob import TextBlob

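text=[b['comment'] for b in df]
# lowercase each comment and strip punctuation from its two ends (punctuation inside the comment is kept)
text=[comment.strip("!\"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~").lower() for comment in text]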

blobs=[]
for b in text:
    blobs.append(TextBlob(b).sentiment.polarity)
	
blobs[:10]

[0.03712121212121212,
 0.048571428571428585,
 0.2380952380952381,
 0.08571428571428573,
 0.09743589743589742,
 -0.8,
 0.09166666666666667,
 0.25,
 0.08240740740740739,
 0.09757142857142857]
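
To go from raw polarity scores to classes, one simple option is to bucket them around zero. The sketch below (Python 3) does that for the ten scores printed above; the function name `label_polarity` and the 0.05 cutoff are assumptions chosen for illustration, not something TextBlob prescribes.

```python
from collections import Counter

def label_polarity(score, threshold=0.05):
    """Map a polarity score in [-1, 1] to 'positive', 'negative' or 'neutral'."""
    if score > threshold:
        return 'positive'
    if score < -threshold:
        return 'negative'
    return 'neutral'

# The first ten polarity scores from the output above.
scores = [0.03712121212121212, 0.048571428571428585, 0.2380952380952381,
          0.08571428571428573, 0.09743589743589742, -0.8,
          0.09166666666666667, 0.25, 0.08240740740740739,
          0.09757142857142857]

labels = [label_polarity(s) for s in scores]
print(Counter(labels))
```

A wider neutral band would move more of the mildly positive comments into the neutral class, so the cutoff is worth tuning against a sample of hand-read comments.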
 

4/19/2015: Getting comments from specific articles with the New York Times API

While there is a Python wrapper for the New York Times Article Search API, getting article comments requires using the New York Times API directly, and the results are parsed differently from those of the Article Search wrapper. In this example we query for the term 'misconduct' in an article's body, byline, and headline, and filter the results to articles that contain Facebook in the headline and whose source is Reuters, the AP, or The New York Times.


from nytimesarticle import articleAPI

api=articleAPI('API-key')
		
articles = api.search( q = 'misconduct', 
	fq = {'headline':'Facebook', 'source':['Reuters','AP', 'The New York Times']}, 
	begin_date = 20140101 )
		
articles

{u'copyright': u'Copyright (c) 2013 The New York Times Company.  All Rights Reserved.',
 u'response': {u'docs': [{u'_id': u'542d8c8638f0d87d7534ce9e',
    u'abstract': u'Facebook pledges that future research on its 1.3 billion users will be subjected to greater internal scrutiny from top managers, especially if it is focused on personal topics; pledge follows public backlash after company undertook study that used its newsfeed to manipulate the emotions of some users without telling them; declines to disclose particulars of new research guidelines.',
    u'blog': [],
    u'byline': {u'contributor': u'',
     u'original': u'By VINDU GOEL',
     u'person': [{u'firstname': u'Vindu',
       u'lastname': u'GOEL',
       u'organization': u'',
       u'rank': 1,
       u'role': u'reported'}]},
    u'document_type': u'article',
    u'headline': {u'main': u'Facebook Promises Deeper Review of User Research, but Is Short on the Particulars',
     u'print_headline': u'Facebook Vow on Research Is Short on the Particulars'},
    u'keywords': [{u'is_major': u'Y',
      u'name': u'organizations',
      u'rank': u'1',
      u'value': u'Facebook Inc'}
		

Next, parse the output. For example, if you want just the headline and publish date:


def parse(articles):
    
    news = []
    for i in articles['response']['docs']:
        dic = {}
        dic['headline'] = i['headline']
        dic['date'] = i['pub_date'][0:10]
        news.append(dic)
    return news
	
data=parse(articles)
data[0]

{'date': u'2014-10-03',
 'headline': {u'main': u'Facebook Promises Deeper Review of User Research, but Is Short on the Particulars',
  u'print_headline': u'Facebook Vow on Research Is Short on the Particulars'}}
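
As the output shows, `headline` is itself a dictionary. If only the main headline string is wanted, a variant of the parser can index into it. This is a Python 3 sketch; `parse_main` is a hypothetical name, and the sample document copies the headline shown above with a made-up time portion in `pub_date`.

```python
def parse_main(articles):
    """Return one flat dict per article with the main headline and publish date."""
    news = []
    for doc in articles['response']['docs']:
        news.append({
            'headline': doc['headline']['main'],
            'date': doc['pub_date'][:10],   # keep only YYYY-MM-DD
        })
    return news

# Sample shaped like the search response; the time part of pub_date is made up.
sample = {'response': {'docs': [{
    'headline': {
        'main': 'Facebook Promises Deeper Review of User Research, '
                'but Is Short on the Particulars',
        'print_headline': 'Facebook Vow on Research Is Short on the Particulars',
    },
    'pub_date': '2014-10-03T00:00:00Z',
}]}}

print(parse_main(sample))
```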
	 

Now we will use the New York Times Community API to get comments from specific articles. Here we get the comments on David Brooks's Sunday Op-Ed column, "The Moral Bucket List." I picked this article simply because it was one of the most e-mailed articles this past week. The API returns only 25 comments per call, so a loop may be needed to retrieve them all; check the documentation for the details. To retrieve the comments associated with a specific NYTimes.com URL, use the following URI structure:


bucket='http://api.nytimes.com/svc/community/v3/user-content/recent.json?api-key=your-API-key&url=http://www.nytimes.com/2015/04/12/opinion/sunday/david-brooks-the-moral-bucket-list.html'
 

import urllib
import json

response = urllib.urlopen(bucket).read()
response_dictionary = json.loads(response)
print response_dictionary.keys()
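
Because each call returns at most 25 comments, collecting every comment on an article means requesting page after page until one comes back short. The Python 3 sketch below shows only that loop. `fetch_all_comments` and `fetch_page` are hypothetical names, and how the offset is actually passed to the API (for example as a query parameter) is an assumption to check against the documentation. A fake in-memory source stands in for the network call.

```python
def fetch_all_comments(fetch_page, page_size=25):
    """Collect comments by calling fetch_page(offset) until a short or empty page.

    fetch_page is a hypothetical callable returning the list of comments
    that start at position `offset`.
    """
    comments = []
    offset = 0
    while True:
        page = fetch_page(offset)
        comments.extend(page)
        if len(page) < page_size:   # last page reached
            break
        offset += page_size
    return comments

# Stand-in for the API: 60 fake comments served 25 at a time.
fake = [{'commentBody': 'comment %d' % i} for i in range(60)]

def fake_fetch(offset):
    return fake[offset:offset + 25]

all_comments = fetch_all_comments(fake_fetch)
print(len(all_comments))
```

In a real run, `fetch_page` would build the request URL for the given offset, read it, and return the list found under `['results']['comments']`.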
	 

We write a function to extract only information we are interested in and call the function and view the first entry:


def parse(mail):
    
    brooks = []
    for b in mail:
        dic = {}
        dic['comment'] = b['commentBody']
        dic['date'] = b['createDate']
        dic['comment_type'] = b['commentType']
        dic['editorsSelection'] =b['editorsSelection']
        dic['email'] = b['email']
        dic['recommend'] = b['recommendationCount']
        dic['replies'] = b['replies']
        dic['name'] = b['userDisplayName']
        dic['location'] = b['userLocation']
        dic['login'] = b['login']
        
        brooks.append(dic)
    return brooks

comments=response_dictionary['results']['comments']
comments=parse(comments)
comments[0]

{'comment': u'The lament of the hollow man who sees but does not understand.',
 'comment_type': u'comment',
 'date': u'1428853503',
 'editorsSelection': 0,
 'email': u'wilkinson.eileen@gmail.com',
 'location': u'Maine',
 'login': None,
 'name': u'Eileen Wilkinson',
 'recommend': 4,
 'replies': []}	
	 

For more insight, see the New York Times API documentation.