Friday, June 10, 2016

Zeppelin and Spark: Finding Associated Hashtags

Introduction

I'm finding that eBay-related spam accounts for nearly 5% of all the tweets I'm analyzing. The @eBay username is a good indicator: I've found that fewer than about 3% of the tweets containing this username are valid.

Spam Hashtags

This code finds all the hashtags in tweets containing the @eBay keyword:
%pyspark

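# Load the collected tweets from Parquet.
df = sqlContext.read.parquet("/output/craig/buzz/parquet/twitter")

# x[5] is used as the tweet text: keep tweets that mention @ebay,
# split them into tokens, keep the hashtag tokens, and count each one.
df_ebay_tags = df.rdd \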
    .filter(lambda x: '@ebay' in x[5].lower()) \
    .flatMap(lambda x: x[5].split(" ")) \
    .filter(lambda x: "#" in x) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda x,y:x+y) \
    .map(lambda x: { 
        'tag': x[0],
        'count': x[1] 
    }).toDF().sort('count', ascending=False)

df_ebay_tags.registerTempTable("df_ebay_tags")
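
The same filter, split, and count logic can be checked locally on a handful of strings before running it on the cluster. This is only a minimal sketch in plain Python (no Spark), and the function name `count_hashtags` and the sample tweets are made up for illustration:

```python
from collections import Counter

def count_hashtags(tweets, keyword="@ebay"):
    """Count hashtag tokens in tweets whose text mentions `keyword`."""
    counts = Counter()
    for text in tweets:
        if keyword in text.lower():        # same test as the first filter
            for token in text.split(" "):  # same split as the flatMap
                if "#" in token:           # same test as the second filter
                    counts[token] += 1     # stands in for map + reduceByKey
    # most_common() orders by descending count, like sort('count', ascending=False)
    return counts.most_common()

# Made-up sample tweets, for illustration only.
sample = [
    "Great deal via @eBay #deal #auction",
    "Selling now @ebay #deal",
    "Watching #gotham tonight",
]
result = count_hashtags(sample)
print(result)
```

Like the Spark version, this counts tokens exactly as they appear, so "#Deal" and "#deal" would be tallied separately.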


Associated Hashtags

This code finds all the hashtags associated with the show Gotham in Spanish:
%pyspark

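import locale
import string

def cleanse(value):
    # Strip every ASCII punctuation character, including the leading "#".
    for ch in string.punctuation:
        value = value.replace(ch, "")
    return value

# df_fb is assumed to have been loaded in an earlier paragraph (not shown).
# x[4] is compared to the show name, x[7] to the language code, and x[6] is the text.
df_tags = df_fb.rdd \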
    .filter(lambda x: x[4].lower() == "gotham") \
    .filter(lambda x: x[7].lower() == "es") \
    .flatMap(lambda x: x[6].lower().split(" ")) \
    .filter(lambda x: "#" in x) \
    .map(lambda x: cleanse(x)) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda x,y:x+y) \
    .map(lambda x: { 
        'tag': x[0],
        'count': x[1] 
    }).toDF().sort('count', ascending=False)

df_tags.registerTempTable("df_tags")
print "Total: {}".format(
    locale.format(
        "%d", 
        df_tags.count(), 
        grouping=True))


Output:
Not surprisingly, the tag #gotham is the most prevalent. That #series shows up is promising. Not shown here is the prior exclusion of #movie and #movies, since we are specifically targeting a show.
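
To see what the cleanse step used in the code above does to individual tokens, here is a small sketch in plain Python. The token list is made up for illustration, and the function is a copy of the one in the post:

```python
import string

def cleanse(value):
    """Remove every ASCII punctuation character from value."""
    for ch in string.punctuation:
        value = value.replace(ch, "")
    return value

# Made-up tokens, for illustration only.
tokens = ["#gotham", "#gotham!", "#series,", "#"]
cleaned = [cleanse(t) for t in tokens]
print(cleaned)
```

Two things stand out: variants such as "#gotham!" collapse into the same key, and a bare "#" becomes an empty string, which may be worth filtering out before counting. Note also that `string.punctuation` covers ASCII only, so Spanish marks such as "¡" and "¿" are left in place.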


HTML Output

This analysis is similar, but produces an HTML report with clickable links to explore the hashtags on Twitter:
%pyspark

print u"%html"
print u"<h3>Hashtags</h3>"
print u"<br />"
    
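# The DataFrame was built from dicts, so its fields are ordered by key name:
# row[0] is the count and row[1] is the tag. Only tags seen more than 10 times are listed.
for row in df_tags.rdd.filter(lambda x: x[0] > 10).toLocalIterator():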
    
    url = u"""
        https://twitter.com/search?src=typd&q=%23{}
    """.strip().format(row[1].strip())
    
    html = u"""
        <a href="{}" target="_new">{}</a>
    """.strip().format(url, row[1].strip())
    
    print u"""
        {}: {}
    """.strip().format(html, row[0])
    print u"<br />"
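
The loop above formats each tag straight into the search URL. If tags can contain accented or other non-ASCII characters, as Spanish ones may, percent-encoding the query is safer. The following is a sketch only: `search_link` is a hypothetical helper, not part of the original notebook, and it reuses the URL pattern from the code above:

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2

def search_link(tag):
    """Return an HTML anchor linking to a Twitter hashtag search for tag."""
    tag = tag.strip()
    # Percent-encode "#" (as %23) together with any non-ASCII characters.
    query = quote((u"#" + tag).encode("utf-8"))
    url = "https://twitter.com/search?src=typd&q={}".format(query)
    return u'<a href="{}" target="_new">{}</a>'.format(url, tag)

print(search_link(u"gotham"))
```

The link text itself is inserted unescaped here, as in the original loop; tags that could contain characters such as "<" or "&" would also need HTML escaping.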

