Python, RapidMiner, and Carriage Returns

2018-08-22 00:00:00 Python RapidMiner Tutorials

I’ve been working on some Python code for a RapidMiner process. What I want to do is simplify my Instagram Hashtag Tool and make it go faster.

Part of that work is extracting the Instagram comments for text processing. I ran into utter hell trying to export those comments into a CSV file that RapidMiner could read. It was exporting the data just fine, but the comments came out with carriage returns embedded in them. For some strange reason, RapidMiner cannot read a cell that contains carriage returns; it only reads the first line. Luckily, with the help of some users, I found a workaround: do all the carriage return stripping on my end before export.

The trick is to match all carriage returns, spaces, tabs, etc. using the regular expression '\s', then replace each match with a single space ' '. While this isn’t elegant, it had to be done because Instagram comments are so messy to begin with.
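Here is a minimal sketch of that whitespace replacement on its own, separate from the Instagram code. The sample comments are made up for illustration, and the pattern here is '\s+' rather than the plain '\s' used in the full script, so a run of whitespace collapses into one space instead of several.

```python
import pandas as pd

# Hypothetical sample comments standing in for real Instagram captions.
comments = pd.DataFrame(
    {"comment": ["great film!\r\nloved it", "line one\n\nline two\t#film"]}
)

# \s matches any whitespace character (space, tab, \r, \n, ...).
# \s+ matches a whole run of them, so each run becomes a single space.
cleaned = comments.replace(r"\s+", " ", regex=True)

print(cleaned["comment"].tolist())
# ['great film! loved it', 'line one line two #film']
```

With the plain '\s' pattern, the "\r\n" in the first comment would turn into two spaces, which is harmless for RapidMiner but a little untidy.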

Code

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd
import json
from pprint import pprint
import csv
import re

df = pd.read_json('https://www.instagram.com/explore/tags/film/?__a=1')

name = df['graphql']['hashtag']['name']

hashtag_count = df['graphql']['hashtag']['edge_hashtag_to_media']['count']

count_list = []
likes = df['graphql']['hashtag']['edge_hashtag_to_media']['edges']
for i in likes:
    add = (df['graphql']['hashtag']['name'], df['graphql']['hashtag']['edge_hashtag_to_media']['count'], i['node']['id'], i['node']['edge_liked_by']['count'], i['node']['display_url'])
    count_list.append(add)
print(count_list)

count_out = pd.DataFrame(count_list)
count_out.columns = ['hashtag', 'count', 'user_id', 'likes', 'image_url']

# This just exports out a CSV with the above data. Plan is to use this in a RM process
count_out.to_csv('out.csv')

# Now comes the hard part, preparing the comments for a RM Process.
# This is where the carriage returns killed me for hours

text_export = []
rawtext = df['graphql']['hashtag']['edge_hashtag_to_media']['edges']
for i in rawtext:
    rawtext = i['node']['edge_media_to_caption']['edges']
    #print(rawtext)
    for j in rawtext:
        final_text = j['node']['text']
        df['final_text'] = j['node']['text']
        text_export.append(final_text)
        print(df['final_text'])

text_out = pd.DataFrame(text_export)

# This is the key: match every whitespace character with \s and replace it with a space ' '

text_out.replace(r'\s', ' ', regex=True, inplace=True)

text_out.columns = ['comment']

# Note: pandas 1.5 and later name this parameter lineterminator
text_out.to_csv('out2.csv', line_terminator='\r\n')
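One way to check the fix is to count the physical lines in the exported CSV before and after cleaning. This is only a sketch: the sample data and the helper name physical_lines are made up, and it writes to an in-memory buffer instead of a file on disk.

```python
import io

import pandas as pd

# Hypothetical data: one comment with an embedded carriage return, one without.
raw = pd.DataFrame({"comment": ["first\r\nsecond", "plain"]})
clean = raw.replace(r"\s+", " ", regex=True)


def physical_lines(frame):
    """Write the frame as CSV to memory and return its physical lines."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=False)
    return buffer.getvalue().splitlines()


# The raw export spreads one record over two lines; the cleaned one does not.
print(len(physical_lines(raw)))    # 4 (header + 3 physical lines)
print(len(physical_lines(clean)))  # 3 (header + 2 records)
```

If the cleaned count equals the number of rows plus the header, every comment sits on a single line, which is the shape RapidMiner needed in my case.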