I am using Hadoop to process an XML file, so I have written a mapper and a reducer in Python.
Suppose the input file to be processed is **test.xml**.
**mapper.py** file

    import sys
    import cStringIO
    import xml.etree.ElementTree as xml

    if __name__ == '__main__':
        buff = None
        intext = False
        for line in sys.stdin:
            line = line.strip()
            if line.find("<row") != -1:
                .............
                .............
                .............
                print '%s\t%s' % (campaignID, adGroupID)

**reducer.py** file

    import sys

    if __name__ == '__main__':
        for line in sys.stdin:
            print line.strip()

I ran Hadoop with the following command:

    bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
        -file /path/to/mapper.py -mapper /path/to/mapper.py \
        -file /path/to/reducer.py -reducer /path/to/reducer.py \
        -input /path/to/input_file/test.xml \
        -output /path/to/output_folder/to/store/file

When I run the above command, Hadoop correctly creates an output file at the output path, containing the required data in the format produced by `reducer.py`.
What I actually want, though, is different: I don't want to store the output data in the text file that Hadoop creates by default when I run the above command. Instead, I want to save the data into a `MySQL` database. So I wrote some Python code in `reducer.py` that writes the data directly to the `MySQL` database, and tried to run the above command without the output path, as below:

    bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
        -file /path/to/mapper.py -mapper /path/to/mapper.py \
        -file /path/to/reducer.py -reducer /path/to/reducer.py \
        -input /path/to/input_file/test.xml

I then get an error like the one below:

    12/11/08 15:20:49 ERROR streaming.StreamJob: Missing required option: output
    Usage: $HADOOP_HOME/bin/hadoop jar \
              $HADOOP_HOME/hadoop-streaming.jar [options]
    Options:
      -input DFS input file(s) for the Map step
      -output DFS output directory for the Reduce step
      -mapper The streaming command to run
      -combiner The streaming command to run
      -reducer The streaming command to run
      -file File/dir to be shipped in the Job jar file
      -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
      -outputformat TextOutputFormat(default)|JavaClassName Optional.
    .........................
    .........................

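
To make the intent concrete, here is a minimal sketch of the kind of database-writing reducer described above. It is an illustration under assumptions, not the actual `reducer.py`: the helper names `parse_pairs` and `store_pairs`, the table `ad_groups`, and its column names are all hypothetical, and the standard-library `sqlite3` module stands in for a MySQL driver so the sketch runs without a database server. Both follow the Python DB-API, but the placeholder syntax differs (`?` for `sqlite3`, `%s` for `MySQLdb`). The code avoids the `print` statement so it runs on Python 2 and Python 3.

```python
import sqlite3


def parse_pairs(lines):
    """Yield (campaign_id, ad_group_id) tuples from tab-separated lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split('\t')
        if len(parts) != 2:
            continue  # skip records that are not exactly two fields
        yield parts[0], parts[1]


def store_pairs(conn, pairs):
    """Insert all pairs in one transaction and return how many were stored.

    `conn` is any DB-API connection; the table and column names are
    hypothetical and must already exist.
    """
    rows = list(pairs)
    cur = conn.cursor()
    # sqlite3 uses '?' placeholders; with MySQLdb this would be '%s'.
    cur.executemany(
        'INSERT INTO ad_groups (campaign_id, ad_group_id) VALUES (?, ?)',
        rows)
    conn.commit()
    return len(rows)
```

In a real reducer, this would be driven by something like `store_pairs(conn, parse_pairs(sys.stdin))`, with `conn` coming from the MySQL driver's connect call. Since the streaming job insists on `-output` (see the error above), such a reducer could still write a short summary line to stdout, so the question is really whether that output directory can be avoided at all.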
1. How can I save the data in a database after processing the files?
2. In which file (`mapper.py` or `reducer.py`) should the code that writes the data to the database go?
3. Which command should be used to run Hadoop so that the data is saved into the database? When I removed the output folder path from the Hadoop command, it showed the error above.

Can anyone please help me solve this problem?