MapReduce是一种
分布式计算模型,通常用于处理大规模数据集。在这里,我们假设你有一个包含用户
访问日志的文本文件,每行记录包含日期和
访问次数。以下是
使用Python编写的基本
Hadoop MapReduce程序(
Hadoop Streaming API)来
统计每个日期的
访问次数:
```python
# Mapper (input is log line, output key-value pairs with date as key and 1 as value)
mapper.py
#!/usr/bin/env python
im
port sys
for line in sys.stdin:
# Split the line into parts ba
sed on a delimiter (e.g., space or comma)
data = line.strip().split(
',
')
# Assuming the first element is the date
date = data[0]
# Emit the key-value pair for each visit
print(f"{date} 1", sep=" ")
# Reducer (input is key-value pairs grouped by date, output sum of values)
reducer.py
#!/usr/bin/env python
im
port sys
current_date = None
total_visits = 0
for line in sys.stdin:
# Split the key-value pair
key, value = line.strip().split(" ")
if current_date and key != current_date:
# Output the previous day
's total
print(f"{current_date} {total_visits}", sep=" ")
current_date = key
total_visits += int(value)
# Last reducer to handle the last day
if current_date:
print(f"{current_date} {total_visits}", sep=" ")
#
Hadoop command to execute the job
hadoop jar
hadoop-streaming.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input input_log.txt -output output_dir/
```