Discussion: Copying many files to HDFS
Mark Payne
2015-03-04 22:30:39 UTC
Johny, NiFi looks interesting but I can't really grasp how it will
help me. If you could provide some example code or a more detailed
explanation of how you set up a topology, that would be great.



Kevin,

With NiFi you wouldn't have example code. NiFi is a dataflow automation
tool where you construct your dataflow visually with drag-and-drop
components. You can download it from nifi.incubator.apache.org via the
Downloads link. Once downloaded, you would untar it and run
"bin/nifi.sh start".
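
For reference, once you've grabbed the tarball from the Downloads page,
getting it running is just a couple of commands (the exact archive name
will depend on which release you pick up):

  # untar the release and start NiFi
  tar xzf nifi-*-bin.tar.gz
  cd nifi-*/
  bin/nifi.sh start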

At that point you could build your dataflow by navigating your browser
to http://localhost:8080/nifi

There's actually a really good blog post on how to do essentially what
you're looking to do at
http://ingest.tips/2014/12/22/getting-started-with-apache-nifi/

The idea is that the dataflow pulls in any data from a local or network
drive into NiFi, deletes the file, and then pushes the data to HDFS. I
would caution though that in the blog post, failure to send to HDFS is
"auto-terminated," which means that the data would be deleted. In
reality, you should route the "failure" relationship back to PutHDFS. I
think this would make a lot more sense after you read the blog :)
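
Roughly, the flow described in the blog looks like this. I'm writing the
property names from memory, so double-check them in the UI, and the input
directory is just a placeholder for wherever your NFS mount lives:

  GetFile  (Input Directory: /mnt/nfs/data, Keep Source File: false)
      | success
      v
  PutHDFS  (Hadoop Configuration Resources: /path/to/core-site.xml,/path/to/hdfs-site.xml)
      | failure
      +--> route back into PutHDFS so failed files are retried rather than dropped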

There are also a lot of video tutorials on how to use NiFi at
https://kisstechdocs.wordpress.com/

If you've got any questions or comments, you can mail
On Fri, Feb 13, 2015 at 10:38 AM, johny casanova <pcgamer2426-
 Hi Kevin,
 
You can try Apache NiFi (https://nifi.incubator.apache.org/). It is a new
application that is still in incubation, but it's an awesome tool for
what you are looking for. It has processors that put data into and get
data from HDFS, and it can send continuously without having to use the
put command. Check it out and let me know if you need help. I also use
it to put data into HDFS, at high volumes like the ones you mentioned.
Date: Fri, 13 Feb 2015 09:25:35 -0500
Subject: Re: Copying many files
Ahmed,
Flume is a great tool but it doesn't cover my use case. I need to copy
the files in their entirety and keep their file names.
Alexander,
Thanks for sharing Slurper. From the code it looks like a reasonable
multi-threaded application to copy files. I'll keep looking at it.
On Fri, Feb 13, 2015 at 9:03 AM, Alexander Alten-Lorenz <wget.null
Kevin,
https://github.com/alexholmes/hdfs-file-slurper
BR,
 Alexander 
On 13 Feb 2015, at 14:28, Kevin <kevin.macksamie-
Hi,
I am setting up a Hadoop cluster (CDH 5.1.3) and I need to copy a
thousand or so files into HDFS, which total roughly 1 TB. The cluster
will be isolated on its own private LAN, with a single client machine
that is connected to the Hadoop cluster as well as the public network.
The data that needs to be copied into HDFS is mounted as an NFS share on
the client machine.
I can run `hadoop fs -put` concurrently on the client machine to try
and increase the throughput.
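For example, something along these lines, where the paths are just
placeholders for my NFS mount and the target HDFS directory:

  # fan out eight parallel puts from the NFS mount into HDFS
  ls /mnt/nfs/data | xargs -P 8 -I {} hadoop fs -put /mnt/nfs/data/{} /data/incoming/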
If these files were able to be accessed by each node in the Hadoop
cluster, then I could write a MapReduce job to copy a number of files
from the network into HDFS. I could not find anything in the
documentation saying that `distcp` works with locally hosted files (its
code in the tools package doesn't show any sign of it either) - but I
wouldn't expect it to.
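For what it's worth, the invocation I had in mind would be something like
the line below (placeholder paths), but as far as I can tell it would only
work if the source directory were readable at the same local path on every
node, which isn't the case here:

  hadoop distcp file:///mnt/nfs/data /data/incoming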