25 Questions and Answers About Hortonworks DataFlow
Last week, we had a jam-packed webinar on Hortonworks DataFlow with over 700 registrants, so we were unable to get back to everyone with answers to their questions. We’ve grouped the questions (and answers) below into the following categories. If you have more questions at any time, we encourage you to check out the Data Ingestion & Streaming track of Hortonworks Community Connection, where an entire community of folks is monitoring and responding to questions.
For those who may have missed the session, you can check out the on-demand webinar and SlideShare, and you can still sign up to attend the remaining webinars in the 7-part Data-In-Motion webinar series.
HDF Use Case Questions
Our plan with NiFi is to use it to ingest data from traditional data sources into our data lake. Is that an appropriate use of the technology?
Yes. This is one of the primary use cases. NiFi is very good at “matching impedances” between disparate data sources: getting data into the format that different systems need, and then feeding it to various consumers such as a data lake or streaming applications. Velocity and volume matter, as does mapping data into the systems that need it, in the format the consumer needs.
Is it possible to transfer data from Hive to AWS Redshift or Azure MSSQL Data Warehouse?
There is not a single processor for a direct connection to Redshift just yet, but there is a series of processors designed to work with many of the AWS technologies that feed Redshift, such as Kinesis streams or S3.
Can you schedule the flow to auto-run like one would with Coordinator?
By default, processors are already running continuously; NiFi is not a job-oriented thing. Once you start a processor, including a source processor, it runs continuously. Scheduling options are available should you choose to use them (for example, running a processor only on an hourly basis), but by default a processor is constantly running. It is also fine for processors to be ‘always running’. The better way to think about it is that they’re always ‘eligible’ to execute.
Does NiFi support real-time processing or streaming, e.g. a JMS queue as a source?
Yes. There is an out-of-the-box processor for pulling information out of JMS topics. A lot of the power of NiFi is its ability to tap into the existing Java ecosystem. You can leverage the many existing protocols, technologies, and languages and wrap them with the power of NiFi, using all the functions and features the platform provides to build extensions that meet your organization’s formats and needs.
Extensibility is key, as it’s impossible to know all the proprietary formats an organization may have, so NiFi provides a toolbox that works for common formats but can also be extended to meet the various needs you may have.
Can NiFi parse Apache log files?
Yes, there are a number of processors that handle various formats for data wrangling and data mapping. We’re happy to hear more ideas on needs so we can continue to make this even easier.
Can we integrate our existing MR/Spark processes with HDF (data ingestion via HDF)?
NiFi can certainly be used to drive data into and out of systems like HDFS, Spark, and many others. For systems like Spark, Storm, and several others, Kafka or HDFS are likely great intermediaries, and NiFi integrates with both very well. A common pattern for streaming apps is to use Kafka. For block-oriented apps, consider integrating by exchanging datasets via HDFS, Hive, HBase, etc.
Can we use NiFi to process / push PCAP (Wireshark) content to Kafka?
You can certainly use NiFi to capture, route, transform, and deliver PCAP data. There are PCAP libraries that can be leveraged by wrapping them as a NiFi processor.
How do you decide between NiFi versus Flume and Sqoop?
NiFi supports all Flume use cases, and has a Flume processor out of the box. NiFi also supports some capabilities similar to Sqoop’s: check out the GenerateTableFetch processor, which does incremental fetch and parallel fetch against source table partitions. Ultimately, what you want to look at is whether you’re solving a specific or singular use case. If so, then any one of the tools that works will work. NiFi’s benefits really shine when you consider multiple use cases being handled at once, and when critical flow management features like interactive and live command and control with full data provenance are applicable.
“How Does NiFi Work” Questions
Each processor metric has “5 min” next to it. Is that how often NiFi will refresh those metrics from the processor?
This is configurable. Out of the box, the refresh is UI driven, and the UI is powered by the same REST API that is available to external systems and developers who want to interface with it. The default refresh interval is 30 seconds, or you can manually right-click on the canvas to refresh. There are also some longer-running metrics that show how the processors have been behaving over the past day, such as the amount of data processed or the number of files processed. In general, what is shown live on the UI is a rolling ‘last five minutes’ view, so you understand recent behavior, whereas the metrics show you historical five-minute windows.
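To make the rolling ‘last five minutes’ idea concrete, here is a minimal Python sketch of a rolling-window counter. It is not NiFi’s implementation; the class name, method names, and sample numbers are all made up for illustration. It only shows how a statistic can cover the most recent 300 seconds regardless of how often a UI chooses to redraw it.

```python
from collections import deque


class RollingWindowCounter:
    """Hypothetical helper: tracks events seen in the last `window_seconds`."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, size_in_bytes), oldest first

    def record(self, timestamp, size_in_bytes):
        self._events.append((timestamp, size_in_bytes))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def snapshot(self, now):
        """Return (event_count, total_bytes) for the window ending at `now`."""
        self._evict(now)
        return len(self._events), sum(size for _, size in self._events)


counter = RollingWindowCounter(window_seconds=300)
counter.record(timestamp=0, size_in_bytes=1_000)
counter.record(timestamp=120, size_in_bytes=2_000)
counter.record(timestamp=290, size_in_bytes=500)

print(counter.snapshot(now=299))  # all three events are still inside the window
print(counter.snapshot(now=400))  # the event recorded at t=0 has aged out
```

The refresh interval (how often you call `snapshot`) and the window length (how far back the numbers reach) are independent settings, which is the distinction the answer above is drawing.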
How is the performance of NiFi processors? Is there any comparison matrix of NiFi performance with other processing tools?
Performance depends on what the flow is doing end-to-end. There are provisions in the NiFi architecture to optimize physical data movement by moving only pointers to the blobs of data stored in NiFi’s content repository. NiFi also records all of the metadata, history, and content as it changes, so any comparison won’t be apples-to-apples. That said, NiFi’s performance is often said to be quite good, particularly as flows can easily be set up that fully utilize the capabilities of the system in terms of network, CPU, disk, and memory.
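The “move pointers, not blobs” idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not NiFi code: `ContentRepository`, `FlowRecord`, `route`, and `update_attribute` are hypothetical names, and the store is an in-memory dictionary rather than anything on disk.

```python
import uuid
from dataclasses import dataclass, field


class ContentRepository:
    """Hypothetical in-memory stand-in for a content store keyed by claim ID."""

    def __init__(self):
        self._blobs = {}

    def write(self, data: bytes) -> str:
        claim_id = str(uuid.uuid4())
        self._blobs[claim_id] = data
        return claim_id

    def read(self, claim_id: str) -> bytes:
        return self._blobs[claim_id]


@dataclass(frozen=True)
class FlowRecord:
    """Lightweight record: attributes plus a pointer to the stored content."""
    claim_id: str
    attributes: dict = field(default_factory=dict)


def route(record: FlowRecord, queues: dict, queue_name: str) -> None:
    # Routing enqueues only the small record; the blob is never copied.
    queues.setdefault(queue_name, []).append(record)


def update_attribute(record: FlowRecord, key: str, value: str) -> FlowRecord:
    # A metadata change yields a new record that points at the same blob.
    return FlowRecord(record.claim_id, {**record.attributes, key: value})


repo = ContentRepository()
payload = b"x" * 1_000_000
record = FlowRecord(repo.write(payload), {"filename": "events.log"})
tagged = update_attribute(record, "source", "syslog")

queues = {}
route(tagged, queues, "to_kafka")
print(tagged.claim_id == record.claim_id)  # both records share one content claim
```

In this sketch a one-megabyte payload is written once, while tagging and routing touch only a record holding a claim ID and a small attribute map.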
How scalable are the internal NiFi repositories, and what is the storage/persistence engine? I assume there is a mechanism to prune old or irrelevant audit data?
All of these repositories are pluggable implementations. Out of the box, there are some custom-built repositories configured for key use cases. It is all on disk right now, and it is able to use multiple volumes to help with throughput and I/O. There are three kinds of repositories: the Content repository, the FlowFile repository, and the Provenance repository, and they can all be spread across multiple volumes. They can also be configured through their properties: how long you want to hold the data and how much volume you are willing to allocate. Remember, NiFi is not a long-term storage solution, but we do want to provide users with enough context to see how things are working, so how much to spend on disks and storage becomes a capital expenditure decision. The Apache NiFi community has a valuable and detailed document covering the internals here: https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
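As a rough illustration of “configured through their properties,” the Python sketch below parses a sample of repository settings and lists the directories and retention limits. The property names are the ones we understand the NiFi System Administrator’s Guide to describe for `nifi.properties`, but verify them against your NiFi version. The `/mnt/...` paths, the `disk1`/`disk2` suffixes, and the values are assumptions chosen for the example, and the helper functions are hypothetical.

```python
# Sample settings in nifi.properties style. Paths and values are illustrative only.
SAMPLE_PROPERTIES = """
# Content repository spread over two volumes (suffix names are arbitrary)
nifi.content.repository.directory.disk1=/mnt/disk1/content_repository
nifi.content.repository.directory.disk2=/mnt/disk2/content_repository
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%

# Provenance repository with time and size limits
nifi.provenance.repository.directory.default=./provenance_repository
nifi.provenance.repository.max.storage.time=24 hours
nifi.provenance.repository.max.storage.size=1 GB
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def repository_directories(props, repo):
    """Return {suffix: path} for every directory configured for `repo`."""
    prefix = f"nifi.{repo}.repository.directory."
    return {k[len(prefix):]: v for k, v in props.items() if k.startswith(prefix)}


props = parse_properties(SAMPLE_PROPERTIES)
print(repository_directories(props, "content"))
print(props["nifi.content.repository.archive.max.retention.period"])
print(props["nifi.provenance.repository.max.storage.size"])
```

The retention period and storage size are the two dials the answer mentions: how long to hold data and how much disk to allocate.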
Does NiFi get all the dependencies (associated processing libraries etc.) for all the 172 out-of-box processors?