Introduction
I'm finding that eBay-related spam accounts for nearly 5% of all the tweets I'm analyzing. The @eBay username is a good indicator; I've found that fewer than roughly 3% of all tweets with this username are valid.
Spam Hashtags
This code finds all the hashtags in tweets containing the @eBay keyword:

    %pyspark
    df = sqlContext.read.parquet("/output/craig/buzz/parquet/twitter")

    df_ebay_tags = (df.rdd
        .filter(lambda x: '@ebay' in x[5].lower())   # x[5] is the tweet text
        .flatMap(lambda x: x[5].split(" "))          # split the text into tokens
        .filter(lambda x: "#" in x)                  # keep hashtag tokens only
        .map(lambda x: (x, 1))
        .reduceByKey(lambda x, y: x + y)             # count occurrences per tag
        .map(lambda x: { 'tag': x[0], 'count': x[1] })
        .toDF()
        .sort('count', ascending=False))

    df_ebay_tags.registerTempTable("df_ebay_tags")
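To try the counting logic without a cluster, the same steps can be sketched in plain Python. This is only an illustration: the function name `count_ebay_hashtags` and the sample tweets are made up, and it works on a list of strings rather than on the Parquet data.

```python
from collections import Counter

def count_ebay_hashtags(tweets):
    """Count hashtag tokens in tweets that mention @ebay (case-insensitive)."""
    counts = Counter()
    for text in tweets:
        if '@ebay' not in text.lower():
            continue
        # Same tokenization as the Spark job: split on single spaces
        # and keep every token that contains a '#'.
        counts.update(tok for tok in text.split(" ") if "#" in tok)
    # Most frequent tags first, like sort('count', ascending=False)
    return counts.most_common()

# Made-up sample data
sample = [
    "Great deal #sale #deals via @eBay",
    "Check this out #sale @ebay",
    "Watching #gotham tonight",
]
result = count_ebay_hashtags(sample)
print(result)
```

As in the Spark version, tags are not lowercased or stripped of punctuation here, so `#Sale` and `#sale,` would be counted separately.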
Associated Hashtags
This code will find all the hashtags associated with the show Gotham in Spanish:

    %pyspark
    import locale
    import string

    def cleanse(value):
        # Strip all ASCII punctuation (this also removes the leading '#')
        for i in list(string.punctuation):
            value = value.replace(i, "")
        return value

    df_tags = (df_fb.rdd
        .filter(lambda x: x[4].lower() == "gotham")   # show name
        .filter(lambda x: x[7].lower() == "es")       # language code
        .flatMap(lambda x: x[6].lower().split(" "))   # lowercase and tokenize the text
        .filter(lambda x: "#" in x)                   # keep hashtag tokens only
        .map(lambda x: cleanse(x))
        .map(lambda x: (x, 1))
        .reduceByKey(lambda x, y: x + y)              # count occurrences per tag
        .map(lambda x: { 'tag': x[0], 'count': x[1] })
        .toDF()
        .sort('count', ascending=False))

    df_tags.registerTempTable("df_tags")

    print "Total: {}".format(locale.format("%d", df_tags.count(), grouping=True))
Output:
Not surprisingly, the tag #gotham is the most prevalent. That #series should show up is promising; not shown here is the prior exclusion of #movie and #movies as we're particularly targeting a show.
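The exclusion itself isn't shown in this post, so the following is only a guess at its shape, run on made-up tweets in plain Python. It assumes the rule is to drop any tweet that carries #movie or #movies before counting; the names `EXCLUDED`, `tags_in` and `count_tags` are hypothetical. `cleanse` is the same helper as in the Spark code.

```python
import string
from collections import Counter

# Tags named in the text, stored without '#' because cleanse() strips it.
EXCLUDED = {"movie", "movies"}

def cleanse(value):
    """Strip all ASCII punctuation, including the leading '#'."""
    for ch in string.punctuation:
        value = value.replace(ch, "")
    return value

def tags_in(text):
    """Lowercase, split on spaces, keep hashtag tokens, strip punctuation."""
    return [cleanse(tok) for tok in text.lower().split(" ") if "#" in tok]

def count_tags(texts):
    counts = Counter()
    for text in texts:
        tags = tags_in(text)
        # Assumed rule: skip the whole tweet if it carries an excluded tag.
        if EXCLUDED.intersection(tags):
            continue
        counts.update(tags)
    return counts

# Made-up sample data
sample = [
    "Me encanta #Gotham! La mejor #serie",
    "Nuevo episodio de #gotham, ya disponible",
    "Gran #movie esta noche #gotham",
]
counts = count_tags(sample)
print(counts.most_common())
```

Because `cleanse` removes every punctuation character, `#Gotham!` and `#gotham,` both collapse to `gotham`, which is what lets the counts add up across tweets.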
HTML Output
This analysis is similar, but produces an HTML report with clickable links to explore the hashtags on Twitter:

    %pyspark
    print u"%html"
    print u"<h3>Hashtags</h3>"
    print u"<br />"

    # row[0] is the count and row[1] the tag; only list tags seen more than 10 times
    for row in df_tags.rdd.filter(lambda x: x[0] > 10).toLocalIterator():
        url = u"https://twitter.com/search?src=typd&q=%23{}".format(row[1].strip())
        html = u'<a href="{}" target="_new">{}</a>'.format(url, row[1].strip())
        print u"{}: {}".format(html, row[0])
        print u"<br />"
Output:
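Spanish tags can contain accented characters, which the report code places in the URL as-is. A possible refinement, sketched here in standalone Python 3 rather than as notebook code, is to percent-encode the tag and HTML-escape the link. The helper name `hashtag_link` is made up for this example; `quote` and `escape` come from the standard library.

```python
from html import escape
from urllib.parse import quote

def hashtag_link(tag, count):
    """Build one report line: a link to the Twitter search for `tag`, then its count."""
    # Percent-encode the tag so non-ASCII characters are safe in the query string.
    url = "https://twitter.com/search?src=typd&q=%23{}".format(quote(tag, safe=""))
    # Escape the URL and label for HTML; '&' in the URL becomes '&amp;'.
    return '<a href="{}" target="_new">{}</a>: {}'.format(escape(url), escape(tag), count)

print(hashtag_link("gotham", 42))
```

Each returned string could then be printed followed by `<br />`, as in the loop in the report code.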