Tuesday, May 17, 2016

Learning basic PySpark concepts with S3


I'm learning Apache Spark with Python and S3. I tried them out following an example here https://www.codementor.io/spark/tutorial/spark-python-rdd-basics.

import boto

AWS_ACCESS_KEY_ID="myAccessKeyId"
AWS_SECRET_ACCESS_KEY="mySecretAccessKey"

conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket('nuhaa')

# view list
for file_key in bucket.list():
    print(file_key.name)

key_name = "kddcup-data_10_percent.csv"

# load data
data_rdd = sc.textFile("s3n://nuhaa/%s" % key_name)
    
from time import time
t0 = time()
data_count = data_rdd.count()
tt = time() - t0
print "total number of records: %d" % data_count
print "size %s" % hbytes(key.size)
print "count completed in {} seconds".format(round(tt,3))
key.close()

# using filter
normal_data_rdd = data_rdd.filter(lambda x: 'normal.' in x)
t0 = time()
normal_count = normal_data_rdd.count()
tt = time() - t0

print "there are {} 'normal' interactions".format(normal_count)
print "count completed in {} seconds".format(round(tt,3))

# using map
csv_data = data_rdd.map(lambda x: x.split(","))
t0 = time()
head_rows = csv_data.take(5)
tt = time() - t0

print "parse completed in {} seconds".format(round(tt,3))

# using collect
t0 = time()
all_data = data_rdd.collect()
tt = time() - t0

print "data collected in {} seconds".format(round(tt,3))



7 comments:

  1. In todays every developer started adopting the rich features of Bootstrap framework. The points you have shared regarding the benefits bootstrap compels most of the business people to make use of this technology.
    Hire Dedicated Web Developers
    Hire Dedicated Php Developer
    Hire Dedicated Opencart Developers
    Hire Dedicated Developers
    Hire Dedicated Programmers

    ReplyDelete
  2. I enjoy NorwayMagazine’s ability to combine global news with local Scandinavian insights. Its articles are both informative and engaging, making complex topics accessible while highlighting cultural, technological, and business trends in a meaningful and trustworthy manner.

    ReplyDelete
  3. Every visit to FreshlyTimes brings something new and informative. The platform’s ability to provide news as it happens, combined with clarity and thoughtful presentation, ensures readers receive accurate, timely, and highly engaging insights on world events.

    ReplyDelete
  4. I admire FastestMagazine for its dedication to quick, reliable news coverage. From trending stories to world events, their timely updates keep readers engaged and informed, offering a unique perspective that’s both comprehensive and fast-paced.

    ReplyDelete