Discussion:
Finding mean and median with Python Hadoop streaming
jamal sasha
2013-04-01 21:25:26 UTC
Very dumb question:
I have data as follows:
id1, value
1, 20.2
1,20.4
....

I want to find the mean and median for each id.
I am using Python Hadoop streaming.
mapper.py

import os
import sys

for line in sys.stdin:
    try:
        # strip the trailing newline before splitting
        line = line.rstrip(os.linesep)
        tokens = line.split(",")
        print '%s,%s' % (tokens[0], tokens[1])
    except Exception:
        continue


reducer.py

import os
import sys
from collections import defaultdict

data_dict = defaultdict(list)

def mean(data_list):
    return sum(data_list) / float(len(data_list)) if len(data_list) else 0

def median(mylist):
    sorts = sorted(mylist)
    length = len(sorts)
    if not length % 2:
        # even count: average the two middle values
        return (sorts[length / 2] + sorts[length / 2 - 1]) / 2.0
    return sorts[length / 2]

for line in sys.stdin:
    try:
        line = line.rstrip(os.linesep)
        serial_id, duration = line.split(",")
        data_dict[serial_id].append(float(duration))
    except Exception:
        pass

for k, v in data_dict.items():
    print "%s,%s,%s" % (k, mean(v), median(v))



I am expecting a single mean and median per key,
but I see the same id duplicated with different mean and median values.
Any suggestions?
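[Editor's aside, not from the thread: one detail worth checking is that Hadoop streaming, by default, treats everything up to the first tab in the map output as the key. Printing 'id,value' with a comma makes the entire line the key, so records with the same id may hash to different reducers. A minimal sketch of a tab-separated mapper (py3 syntax, names hypothetical):]

```python
import sys

def map_line(line):
    """Split an 'id,value' CSV line into a tab-separated key/value pair.

    Tab is the default key/value separator in Hadoop streaming, so the
    text before the tab is what gets partitioned and sorted.
    """
    key, value = line.strip().split(",", 1)
    return "%s\t%s" % (key.strip(), value.strip())

# In the actual streaming job:
#   for line in sys.stdin:
#       if line.strip():
#           print(map_line(line))
```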
jamal sasha
2013-04-01 21:27:01 UTC
data_dict is declared globally as
data_dict = defaultdict(list)
jamal sasha
2013-04-01 23:35:56 UTC
Pinging again.
Let me rephrase the question.
If my data is like:
id, value

and I want to find the average "value" for each id, how can I do that using
Hadoop streaming?
I am sure it should be very straightforward, but apparently my understanding
of how code works in Hadoop streaming is not right.
I would really appreciate it if someone could help me with this query.
Thanks
Yanbo Liang
2013-04-02 09:14:22 UTC
How many Reducers did you start for this job?
If you start many Reducers, the job will produce multiple output files
named part-*****, and each part contains only the local mean and median
values for that Reducer's partition.

Two kinds of solutions:
1. Call setNumReduceTasks(1) to set the Reducer count to 1; the job will
produce a single output file, with one mean and median per distinct key.
2. See org.apache.hadoop.examples.WordMedian in the Hadoop source code.
It processes all the output files produced by the multiple Reducers with
a local function, which produces the final result.
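
[Editor's aside: as a rough illustration of solution 2, and assuming the reducers were changed to emit 'key,sum,count' rather than a final mean, a small local script could merge the part files. Note that means themselves cannot be merged this way, which is why a merge step needs sums and counts (and why WordMedian works from frequency counts rather than local medians). A sketch, with hypothetical names:]

```python
from collections import defaultdict

def merge_partials(lines):
    """Combine per-reducer partial results of the form 'key,sum,count'.

    Returns {key: global_mean}. Each reducer contributes its local sum
    and count per key; the global mean is total_sum / total_count.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for line in lines:
        key, s, n = line.strip().split(",")
        totals[key][0] += float(s)
        totals[key][1] += int(n)
    return {k: s / n for k, (s, n) in totals.items()}

# e.g. feed it lines gathered from part-00000, part-00001, ...
```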

BR
Yanbo
jamal sasha
2013-04-04 20:24:43 UTC
Hi,
I have a question: why do I need to set the number of reducers to 1?
Since the keys are sorted, shouldn't all records with the same key go to the same reducer?
Yanbo Liang
2013-04-06 11:58:58 UTC
Because each Reducer computes only the local mean and median of the data
that is shuffled to that Reducer.
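
[Editor's aside: for per-key statistics specifically, the shuffle already delivers all records for a given key to one reducer, provided the mapper emits a tab-separated key. A reducer can then stream through its sorted input and emit once per key change, with no global dict, regardless of how many reducers run. A sketch assuming 'key<TAB>value' input, py3 syntax, names hypothetical:]

```python
import sys

def mean(values):
    return sum(values) / float(len(values)) if values else 0

def median(values):
    s = sorted(values)
    n = len(s)
    if n % 2 == 0:
        # even count: average the two middle values
        return (s[n // 2] + s[n // 2 - 1]) / 2.0
    return s[n // 2]

def reduce_stream(lines):
    """Yield (key, mean, median) once per key from sorted 'key\\tvalue' lines.

    Relies on Hadoop streaming sorting reducer input by key, so all
    values for a key arrive contiguously and a key change means the
    previous key is complete.
    """
    current, values = None, []
    for line in lines:
        key, value = line.strip().split("\t")
        if key != current and current is not None:
            yield current, mean(values), median(values)
            values = []
        current = key
        values.append(float(value))
    if current is not None:
        yield current, mean(values), median(values)

# In the actual streaming job:
#   for key, m, md in reduce_stream(sys.stdin):
#       print("%s,%s,%s" % (key, m, md))
```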