jamal sasha
2013-04-01 21:25:26 UTC
Very dumb question..
I have data like the following:
id1, value
1, 20.2
1,20.4
....
I want to find the mean and median of the values for each id1.
I am using Python with Hadoop Streaming.
mapper.py
import os
import sys

for line in sys.stdin:
    try:
        # strip the trailing newline
        line = line.rstrip(os.linesep)
        tokens = line.split(",")
        print('%s,%s' % (tokens[0], tokens[1]))
    except Exception:
        continue
reducer.py
import os
import sys
from collections import defaultdict

# values seen by this reducer process, grouped by id
data_dict = defaultdict(list)

def mean(data_list):
    return sum(data_list) / float(len(data_list)) if len(data_list) else 0

def median(mylist):
    sorts = sorted(mylist)
    length = len(sorts)
    if not length % 2:
        # even count: average the two middle values
        return (sorts[length // 2] + sorts[length // 2 - 1]) / 2.0
    return sorts[length // 2]

for line in sys.stdin:
    try:
        line = line.rstrip(os.linesep)
        serial_id, duration = line.split(",")
        data_dict[serial_id].append(float(duration))
    except Exception:
        # skip the header and any malformed lines
        pass

for k, v in data_dict.items():
    print("%s,%s,%s" % (k, mean(v), median(v)))
I am expecting a single mean and median for each key,
but I see id1 duplicated with different means and medians.
Any suggestions?
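One possible cause, offered as a guess rather than a confirmed diagnosis: by default Hadoop Streaming treats everything up to the first tab in a mapper output line as the key, and a line with no tab is taken to be all key. With comma-separated output, "1, 20.2" and "1,20.4" would then be different keys and could land on different reducers, each printing its own result for id 1. Below is a minimal sketch of a tab-separated variant, assuming that is the problem. The names map_line and reduce_lines are made up for illustration, and statistics.mean/median (Python 3.4+) stand in for the hand-written helpers above.

```python
from collections import defaultdict
from statistics import mean, median


def map_line(line):
    """Turn 'id,value' into 'id<TAB>value' so the id alone is the key.

    Returns None for lines that are not 'id,number' (e.g. the header).
    """
    parts = line.strip().split(",")
    if len(parts) != 2:
        return None
    serial_id, value = parts[0].strip(), parts[1].strip()
    try:
        float(value)
    except ValueError:
        return None
    return "%s\t%s" % (serial_id, value)


def reduce_lines(lines):
    """Group 'id<TAB>value' lines by id and return (id, mean, median) tuples."""
    values = defaultdict(list)
    for line in lines:
        serial_id, value = line.rstrip("\n").split("\t")
        values[serial_id].append(float(value))
    return [(k, mean(v), median(v)) for k, v in sorted(values.items())]


if __name__ == "__main__":
    # Local check only: map, sort (standing in for the shuffle), reduce.
    sample = ["id1, value", "1, 20.2", "1,20.4", "2, 5.0"]
    mapped = sorted(m for m in (map_line(l) for l in sample) if m is not None)
    for serial_id, avg, med in reduce_lines(mapped):
        print("%s\t%s\t%s" % (serial_id, avg, med))
```

In a real job the two functions would sit in mapper.py and reducer.py, each looping over sys.stdin and printing its results. The median needs every value for an id in the same reducer, which is why the key has to be the id alone.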