Trying to load multiple json files and merge into one pandas dataframe
I am trying to load multiple json files from a directory in my Google Drive into one pandas dataframe.



I have tried quite a few solutions, but nothing seems to be yielding a positive result.



This is what I have tried so far:



path_to_json = '/path/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
jsons_data = pd.DataFrame(columns=['participants','messages','active','threadtype','thread path'])
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
        json_text = json.load(json_file)
        participants = json_text['participants']
        messages = json_text['messages']
        active = json_text['is_still_participant']
        threadtype = json_text['thread_type']
        threadpath = json_text['thread_path']
        jsons_data.loc[index] = [participants, messages, active, threadtype, threadpath]
jsons_data


And this is the full traceback of the error message I am receiving:



---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
<ipython-input-30-8385abf6a3a7> in <module>()
1 for index, js in enumerate(json_files):
2 with open(os.path.join(path_to_json, js)) as json_file:
----> 3 json_text = json.load(json_file)
4 participants = json_text['participants']
5 messages = json_text['messages']

/usr/lib/python3.6/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
297 cls=cls, object_hook=object_hook,
298 parse_float=parse_float, parse_int=parse_int,
--> 299 parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
300
301

/usr/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
352 parse_int is None and parse_float is None and
353 parse_constant is None and object_pairs_hook is None and not kw):
--> 354 return _default_decoder.decode(s)
355 if cls is None:
356 cls = JSONDecoder

/usr/lib/python3.6/json/decoder.py in decode(self, s, _w)
337
338 """
--> 339 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
340 end = _w(s, end).end()
341 if end != len(s):

/usr/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
355 obj, end = self.scan_once(s, idx)
356 except StopIteration as err:
--> 357 raise JSONDecodeError("Expecting value", s, err.value) from None
358 return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)


I have added a sample of the json files I am trying to read from.



Link to Jsons



Example of jsons:



{
  participants: [
    {
      name: "Test 1"
    },
    {
      name: "Person"
    }
  ],
  messages: [
    {
      sender_name: "Person",
      timestamp_ms: 1485467319139,
      content: "Hie",
      type: "Generic"
    }
  ],
  title: "Test 1",
  is_still_participant: true,
  thread_type: "Regular",
  thread_path: "inbox/xyz"
}

#second example
{
  participants: [
    {
      name: "Clearance"
    },
    {
      name: "Person"
    }
  ],
  messages: [
    {
      sender_name: "Emmanuel Sibanda",
      timestamp_ms: 1212242073308,
      content: "Dear",
      share: {
        link: "http://www.example.com/"
      },
      type: "Share"
    }
  ],
  title: "Clearance",
  is_still_participant: true,
  thread_type: "Regular",
  thread_path: "inbox/Clearance"
}

  • Instead of adding a link to the JSON files, can you edit your question and show example JSON from the files.

    – amanb
    Jan 14 at 7:44











  • @amanb added 2 examples of the JSON files

    – Emm
    Jan 14 at 7:52













  • These are malformed JSONs. The keys are not strings, and some values are also not strings and may error eventually. You should find a way to validate them using a JSON validator.

    – amanb
    Jan 14 at 8:01











  • Can you share an example output of the dataframe you expect to see? What shape do you expect? Some sort of a tabular representation of one of the sample JSONS should be ideal!

    – amanb
    Jan 14 at 20:01
python json pandas

asked Jan 14 at 7:18 by Emm (1,008)
edited Jan 14 at 7:51



2 Answers
I've checked your json files and found the same problem in document1.json, document2.json and document3.json: the property names are not enclosed in double quotes.



For example, document1.json should be corrected as:



{
  "participants": [
    {
      "name": "Clothing"
    },
    {
      "name": "Person"
    }
  ],
  "messages": [
    {
      "sender_name": "Person",
      "timestamp_ms": 1210107456233,
      "content": "Good day",
      "type": "Generic"
    }
  ],
  "title": "Clothing",
  "is_still_participant": true,
  "thread_type": "Regular",
  "thread_path": "inbox/Clothing"
}


EDIT: you can use the following line to add double quotes to the keys of a json file:



re.sub(r'([^\s^"]+):(.+)', r'"\1":\2', s)
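As a quick sanity check that the substitution behaves as intended (the backslashes are easy to lose when copying), here is a minimal sketch applying it to one malformed line from the example files:

```python
import re

# One of the unquoted-key lines from the sample files
line = 'name: "Test 1"'
fixed = re.sub(r'([^\s^"]+):(.+)', r'"\1":\2', line)
print(fixed)  # "name": "Test 1"
```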


  • I see ...problem is I have about 1000 json files all extracted from my Facebook. Would I have to manually make these changes?

    – Emm
    Jan 14 at 7:48











  • The regex string is helpful!

    – amanb
    Jan 14 at 9:57











  • Hi @Emm if this or any answer has solved your question please consider accepting it.

    – keineahnung2345
    Jan 15 at 9:18


There were a few challenges working with the JSON files you provided, and then some more converting them to dataframes and merging. This was because, first, the keys of the JSONs were not strings; secondly, the arrays of the resulting "valid" JSONs were of different lengths and could not be converted to dataframes directly; and thirdly, you did not specify the shape of the dataframe.



Nevertheless, this is an important problem, as malformed JSONs are more commonplace than "valid" ones, and despite several SO answers on fixing such JSON strings, every malformed-JSON problem is unique on its own.



I've broken down the problem into the following parts:




  • convert malformed JSON's in the files to valid JSON's

  • flatten the dict's in the valid JSON files to prepare for dataframe conversion

  • create dataframes from the files and merge into one dataframe


Note: For this answer, I copied the example JSON strings you provided into two files namely "test.json" and "test1.json" and saved them into a "Test" folder.



Part 1: convert malformed JSON's in the files to valid JSON's:



The two example JSON strings that you provided were not valid JSON at all, because the keys were not quoted strings. So, even if you load the JSON file and parse the contents, an error appears.



with open('./Test/test.json') as f:
    data = json.load(f)
print(data)
#Error:
JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)


The only way I found to get around this problem was:




  • to convert all the JSON files to txt files as this would convert the contents to string

  • perform regex on the JSON string in text files and add quotes(" ") around the keys

  • save the file as JSON again


The above three steps were accomplished with two functions that I wrote. The first one renames the files to txt files and returns a list of filenames. The second one accepts this list of filenames, fixes the JSON keys using a regex and saves them to JSON format again.



import json
import os
import re
import pandas as pd

#rename to txt files and return list of filenames
def rename_to_text_files():
    all_new_filenames = []
    for filename in os.listdir('./Test'):
        if filename.endswith("json"):
            new_filename = filename.split('.')[0] + '.txt'
            os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
            all_new_filenames.append(new_filename)
        else:
            all_new_filenames.append(filename)
    return all_new_filenames

#fix JSON string and save as a JSON file again, returns a list of valid JSON filenames
def fix_dict_rename_to_json_files(files):
    json_validated_files = []
    for index, filename in enumerate(files):
        filepath = os.path.join('./Test', filename)
        with open(filepath, 'r+') as f:
            data = f.read()
            dict_converted = re.sub(r'(\w+):(.+)', r'"\1":\2', data)
            f.seek(0)
            f.write(dict_converted)
            f.truncate()
        #rename
        new_filename = filename[:-4] + '.json'
        os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
        json_validated_files.append(new_filename)
    print("All files converted to valid JSON!")
    return json_validated_files


So, now I had two JSON files with valid JSON. But they were still not ready for dataframe conversion. To explain things better, consider the valid JSON from "test.json":



#test.json
{
  "participants": [
    {
      "name": "Test 1"
    },
    {
      "name": "Person"
    }
  ],
  "messages": [
    {
      "sender_name": "Person",
      "timestamp_ms": 1485467319139,
      "content": "Hie",
      "type": "Generic"
    }
  ],
  "title": "Test 1",
  "is_still_participant": true,
  "thread_type": "Regular",
  "thread_path": "inbox/xyz"
}


If I read the json into a dataframe, I still get an error because the array lengths are different for each of the keys. You can check this: the "messages" key value is an array of length 1 while "participants" has a value of array length 2:



df = pd.read_json('./Test/test.json')
print(df)
#Error
ValueError: arrays must all be same length
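The same error can be reproduced without any files, since pandas refuses a dict whose values are arrays of unequal length; this minimal sketch mirrors what the JSON above contains ("participants" has two entries, "messages" only one):

```python
import pandas as pd

# Two-element array for one key, one-element array for the other
try:
    pd.DataFrame({'participants': ['Test 1', 'Person'], 'messages': ['Hie']})
except ValueError as e:
    error_message = str(e)
print(error_message)
```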


In the next part, we fix this problem by flattening the dict in the JSON.



Part 2: Flatten dict for dataframe conversion:



As you had not specified the shape that you expect for your dataframe, I extracted the values in the best way possible and flattened the dict with the following function. This is assuming the keys provided in the example JSONs will not change across all JSON files:



#accepts a dictionary, flattens as required and returns the dictionary with updated key/value pairs
def flatten(d):
    values = []
    d['participants_name'] = d.pop('participants')
    for i in d['participants_name']:
        values.append(i['name'])
    for i in d['messages']:
        d['messages_sender_name'] = i['sender_name']
        d['messages_timestamp_ms'] = str(i['timestamp_ms'])
        d['messages_content'] = i['content']
        d['messages_type'] = i['type']
        if "share" in i:
            d['messages_share_link'] = i["share"]["link"]
    d["is_still_participant"] = str(d["is_still_participant"])
    d.pop('messages')
    d.update(participants_name=values)
    return d


This time let's consider the second example JSON string which also has a "share" key with a URL. The valid JSON string is as below:



#test1.json
{
  "participants": [
    {
      "name": "Clearance"
    },
    {
      "name": "Person"
    }
  ],
  "messages": [
    {
      "sender_name": "Emmanuel Sibanda",
      "timestamp_ms": 1212242073308,
      "content": "Dear",
      "share": {
        "link": "http://www.example.com/"
      },
      "type": "Share"
    }
  ],
  "title": "Clearance",
  "is_still_participant": true,
  "thread_type": "Regular",
  "thread_path": "inbox/Clearance"
}


When we flatten this dict with the above function, we get a dict that can be easily fed into a DataFrame function (discussed later):



with open('./Test/test1.json') as f:
    data = json.load(f)

print(flatten(data))
#Output:
{'title': 'Clearance',
 'is_still_participant': 'True',
 'thread_type': 'Regular',
 'thread_path': 'inbox/Clearance',
 'participants_name': ['Clearance', 'Person'],
 'messages_sender_name': 'Emmanuel Sibanda',
 'messages_timestamp_ms': '1212242073308',
 'messages_content': 'Dear',
 'messages_type': 'Share',
 'messages_share_link': 'http://www.example.com/'}


Part 3: Create Dataframes and merge them into one:



So now that we have a function that can flatten the dict, we can call this function inside our final function where we will:




  • open the JSON files one by one, load each JSON as a dict in memory using json.load().

  • call the flatten function on each dict

  • convert the flattened dicts to dataframes

  • append all dataframes to an empty list.

  • merge all dataframes with pd.concat() passing the list of dataframes as an argument.


The code to accomplish these tasks:



#accepts a list of valid json filenames, creates dataframes from flattened dicts in the JSON files,
#merges the dataframes and returns the merged dataframe
def create_merge_dataframes(list_of_valid_json_files):
    df_list = []
    for index, js in enumerate(list_of_valid_json_files):
        with open(os.path.join('./Test', js)) as json_file:
            data = json.load(json_file)
            flattened_json_data = flatten(data)
            df = pd.DataFrame(flattened_json_data)
            df_list.append(df)
    merged_df = pd.concat(df_list, sort=False, ignore_index=True)
    return merged_df


Let's give the whole code a test run. We begin with the functions in Part 1 and end with Part 3, to get a merged dataframe.



#rename invalid JSON files to text
files = rename_to_text_files()

#fix JSON strings and save as JSON files again. We pass the "files" variable above as an arg for this function
json_validated_files = fix_dict_rename_to_json_files(files)

#flatten and receive merged dataframes
df = create_merge_dataframes(json_validated_files)
print(df)


The final Dataframe:



       title is_still_participant thread_type      thread_path
0     Test 1                 True     Regular        inbox/xyz
1     Test 1                 True     Regular        inbox/xyz
2  Clearance                 True     Regular  inbox/Clearance
3  Clearance                 True     Regular  inbox/Clearance

  participants_name messages_sender_name messages_timestamp_ms
0            Test 1               Person         1485467319139
1            Person               Person         1485467319139
2         Clearance     Emmanuel Sibanda         1212242073308
3            Person     Emmanuel Sibanda         1212242073308

  messages_content messages_type      messages_share_link
0              Hie       Generic                      NaN
1              Hie       Generic                      NaN
2             Dear         Share  http://www.example.com/
3             Dear         Share  http://www.example.com/


You can change the order of columns as you like.
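Reordering is just column indexing; here is a sketch with a small hypothetical stand-in frame (the column names follow the flatten() output, the values are made up):

```python
import pandas as pd

# Stand-in for the merged dataframe
df = pd.DataFrame({
    'title': ['Test 1'],
    'thread_type': ['Regular'],
    'participants_name': ['Person'],
})

# Select the columns in whatever order you prefer
cols = ['participants_name', 'title', 'thread_type']
df = df[cols]
print(list(df.columns))  # ['participants_name', 'title', 'thread_type']
```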



Note:




  • The code does not have Exception handling and assumes the keys will be the same for the dicts as shown in your examples

  • The shape and columns of the Dataframes have also been assumed

  • You may add all the functions into one Python script and wherever "./Test" is used for the JSON folder path, you should enter your path. The folder should only contain malformed JSON files to begin with.

  • The whole script can be further modularized by putting the functions into a class.

  • It can also be further optimized with the use of hashable data types like tuples and sped up with threading and asyncio libraries. However, for a folder of 1000 files this code should work fairly well and shouldn't take very long.

  • Some errors may still crop up while converting the malformed JSON files to valid ones, as the contents of all the JSON files are not known.


The code discussed provides a workflow to accomplish what you need and I hope this helps you and anyone who comes across a similar problem.






share|improve this answer

























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54177131%2ftrying-to-load-multiple-json-files-and-merge-into-one-pandas-dataframe%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    1














    I've checked your json files, and found that there are same problems in document1.json, document2.json and document3.json: the property name are not enclosed with double quotes.



    For example, document1.json should be corrected as:



    {
    "participants": [
    {
    "name": "Clothing"
    },
    {
    "name": "Person"
    }
    ],
    "messages": [
    {
    "sender_name": "Person",
    "timestamp_ms": 1210107456233,
    "content": "Good day",
    "type": "Generic"
    }
    ],
    "title": "Clothing",
    "is_still_participant": true,
    "thread_type": "Regular",
    "thread_path": "inbox/Clothing"
    }


    EDIT: you can use following line to add double quote to the keys of a json file:



    re.sub("([^s^"]+):(.+)", '"\1":\2', s)





    share|improve this answer


























    • I see ...problem is I have about 1000 json files all extracted from my Facebook. Would I have to manually make these changes?

      – Emm
      Jan 14 at 7:48











    • The regex string is helpful!

      – amanb
      Jan 14 at 9:57











    • Hi @Emm if this or any answer has solved your question please consider accepting it .

      – keineahnung2345
      Jan 15 at 9:18
















    edited Jan 14 at 8:59

    answered Jan 14 at 7:44

    keineahnung2345













    There were a few challenges working with the JSON files you provided, and a few more converting them to dataframes and merging. First, the keys of the JSONs were not quoted strings; second, the arrays in the resulting "valid" JSONs were of different lengths and could not be converted to dataframes directly; and third, you did not specify the shape you expect for the dataframe.



    Nevertheless, this is an important problem, as malformed JSONs are more common than "valid" ones, and despite several SO answers on fixing such JSON strings, every malformed-JSON problem is unique in its own way.



    I've broken down the problem into the following parts:




    • convert the malformed JSONs in the files to valid JSONs

    • flatten the dicts in the valid JSON files to prepare for dataframe conversion

    • create dataframes from the files and merge them into one dataframe


    Note: For this answer, I copied the example JSON strings you provided into two files, namely "test.json" and "test1.json", and saved them in a "Test" folder.



    Part 1: convert malformed JSONs in the files to valid JSONs:



    The two example JSON strings you provided were not valid JSON, because the keys were not quoted strings. So when you load a JSON file and parse the contents, an error appears.



    with open('./Test/test.json') as f:
        data = json.load(f)
    print(data)

    #Error:
    #JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
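    Before fixing anything, it can help to know exactly which files fail to parse. A small sketch (the function name is my own; the folder is the "Test" folder used in this answer):

    ```python
    import json
    import os

    def find_malformed(folder):
        """Return (filename, error message) for every .json file in folder that fails to parse."""
        bad = []
        for fname in sorted(os.listdir(folder)):
            if fname.endswith('.json'):
                try:
                    with open(os.path.join(folder, fname)) as f:
                        json.load(f)
                except json.JSONDecodeError as e:
                    bad.append((fname, str(e)))
        return bad
    ```

    Running `find_malformed('./Test')` would list both example files along with the "Expecting property name enclosed in double quotes" message.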


    The only way I found to get around this problem was:




    • convert all the JSON files to txt files, so the contents can be treated as plain strings

    • run a regex over the JSON string in the text files and add quotes (" ") around the keys

    • save the files as JSON again


    The above three steps were accomplished with two functions that I wrote. The first one renames the files to txt files and returns a list of filenames. The second one accepts this list of filenames, fixes the JSON keys using a regex and saves them to JSON format again.



    import json
    import os
    import re
    import pandas as pd

    #rename to txt files and return list of filenames
    def rename_to_text_files():
        all_new_filenames = []
        for filename in os.listdir('./Test'):
            if filename.endswith("json"):
                new_filename = filename.split('.')[0] + '.txt'
                os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
                all_new_filenames.append(new_filename)
            else:
                all_new_filenames.append(filename)
        return all_new_filenames

    #fix the JSON string and save as a JSON file again; returns a list of valid JSON filenames
    def fix_dict_rename_to_json_files(files):
        json_validated_files = []
        for index, filename in enumerate(files):
            filepath = os.path.join('./Test', filename)
            with open(filepath, 'r+') as f:
                data = f.read()
                #add double quotes around the unquoted keys
                dict_converted = re.sub(r'(\w+):(.+)', r'"\1":\2', data)
                f.seek(0)
                f.write(dict_converted)
                f.truncate()
            #rename back to .json
            new_filename = filename[:-4] + '.json'
            os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
            json_validated_files.append(new_filename)
        print("All files converted to valid JSON!")
        return json_validated_files


    So, now I had two JSON files with valid JSON. But they were still not ready for dataframe conversion. To explain things better, consider the valid JSON from "test.json":



    #test.json
    {
        "participants": [
            {
                "name": "Test 1"
            },
            {
                "name": "Person"
            }
        ],
        "messages": [
            {
                "sender_name": "Person",
                "timestamp_ms": 1485467319139,
                "content": "Hie",
                "type": "Generic"
            }
        ],
        "title": "Test 1",
        "is_still_participant": true,
        "thread_type": "Regular",
        "thread_path": "inbox/xyz"
    }


    If I read the json into a dataframe, I still get an error, because the array lengths differ across keys. You can check this: the "messages" value is an array of length 1, while "participants" is an array of length 2:



    df = pd.read_json('./Test/test.json')
    print(df)

    #Error:
    #ValueError: arrays must all be same length
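    As an aside (not part of the original workflow): once a file parses, recent pandas versions also offer pd.json_normalize, which expands the "messages" records into rows while repeating the thread-level fields, avoiding the length mismatch. A sketch using the parsed contents of "test.json":

    ```python
    import pandas as pd

    # the parsed contents of test.json (the valid JSON from Part 1)
    data = {
        "participants": [{"name": "Test 1"}, {"name": "Person"}],
        "messages": [{"sender_name": "Person", "timestamp_ms": 1485467319139,
                      "content": "Hie", "type": "Generic"}],
        "title": "Test 1",
        "is_still_participant": True,
        "thread_type": "Regular",
        "thread_path": "inbox/xyz",
    }

    # one row per message; thread-level fields become repeated metadata columns
    df = pd.json_normalize(data, record_path="messages",
                           meta=["title", "is_still_participant", "thread_type", "thread_path"])
    ```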


    In the next part, we fix this problem by flattening the dict in the JSON.



    Part 2: Flatten dict for dataframe conversion:



    As you had not specified the shape that you expect for your dataframe, I extracted the values in the best way possible and flattened the dict with the following function. This is assuming the keys provided in the example JSON's will not change across all JSON files:



    #accepts a dictionary, flattens as required and returns the dictionary with updated key/value pairs
    def flatten(d):
        values = []
        d['participants_name'] = d.pop('participants')
        for i in d['participants_name']:
            values.append(i['name'])
        for i in d['messages']:
            d['messages_sender_name'] = i['sender_name']
            d['messages_timestamp_ms'] = str(i['timestamp_ms'])
            d['messages_content'] = i['content']
            d['messages_type'] = i['type']
            if "share" in i:
                d['messages_share_link'] = i["share"]["link"]
        d["is_still_participant"] = str(d["is_still_participant"])
        d.pop('messages')
        d.update(participants_name=values)
        return d


    This time let's consider the second example JSON string which also has a "share" key with a URL. The valid JSON string is as below:



    #test1.json
    {
        "participants": [
            {
                "name": "Clearance"
            },
            {
                "name": "Person"
            }
        ],
        "messages": [
            {
                "sender_name": "Emmanuel Sibanda",
                "timestamp_ms": 1212242073308,
                "content": "Dear",
                "share": {
                    "link": "http://www.example.com/"
                },
                "type": "Share"
            }
        ],
        "title": "Clearance",
        "is_still_participant": true,
        "thread_type": "Regular",
        "thread_path": "inbox/Clearance"
    }


    When we flatten this dict with the above function, we get a dict that can easily be fed to the DataFrame constructor (discussed later):



    with open('./Test/test1.json') as f:
        data = json.load(f)

    print(flatten(data))
    #Output:
    {'title': 'Clearance',
     'is_still_participant': 'True',
     'thread_type': 'Regular',
     'thread_path': 'inbox/Clearance',
     'participants_name': ['Clearance', 'Person'],
     'messages_sender_name': 'Emmanuel Sibanda',
     'messages_timestamp_ms': '1212242073308',
     'messages_content': 'Dear',
     'messages_type': 'Share',
     'messages_share_link': 'http://www.example.com/'}


    Part 3: Create Dataframes and merge them into one:



    So now that we have a function that can flatten the dict, we can call this function inside our final function where we will:




    • open the JSON files one by one, load each JSON as a dict in memory using json.load().

    • call the flatten function on each dict

    • convert the flattened dicts to dataframes

    • append all dataframes to an empty list.

    • merge all dataframes with pd.concat() passing the list of dataframes as an argument.


    The code to accomplish these tasks:



    #accepts a list of valid json filenames, creates dataframes from the flattened dicts
    #in the JSON files, merges the dataframes and returns the merged dataframe
    def create_merge_dataframes(list_of_valid_json_files):
        df_list = []
        for index, js in enumerate(list_of_valid_json_files):
            with open(os.path.join('./Test', js)) as json_file:
                data = json.load(json_file)
                flattened_json_data = flatten(data)
                df = pd.DataFrame(flattened_json_data)
                df_list.append(df)
        merged_df = pd.concat(df_list, sort=False, ignore_index=True)
        return merged_df


    Let's give the whole code a test run. We begin with the functions from Part 1 and end with Part 3, to get a merged dataframe.



    #rename invalid JSON files to text
    files = rename_to_text_files()

    #fix JSON strings and save as JSON files again. We pass the "files" variable above as an arg for this function
    json_validated_files = fix_dict_rename_to_json_files(files)

    #flatten and receive merged dataframes
    df = create_merge_dataframes(json_validated_files)
    print(df)


    The final Dataframe:



           title is_still_participant thread_type      thread_path
    0     Test 1                 True     Regular        inbox/xyz
    1     Test 1                 True     Regular        inbox/xyz
    2  Clearance                 True     Regular  inbox/Clearance
    3  Clearance                 True     Regular  inbox/Clearance

      participants_name messages_sender_name messages_timestamp_ms
    0            Test 1               Person         1485467319139
    1            Person               Person         1485467319139
    2         Clearance     Emmanuel Sibanda         1212242073308
    3            Person     Emmanuel Sibanda         1212242073308

      messages_content messages_type      messages_share_link
    0              Hie       Generic                      NaN
    1              Hie       Generic                      NaN
    2             Dear         Share  http://www.example.com/
    3             Dear         Share  http://www.example.com/


    You can change the order of columns as you like.
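    Reordering amounts to selecting the columns in the order you want them. A generic sketch (substitute the column names from the output above for the placeholder names here):

    ```python
    import pandas as pd

    df = pd.DataFrame({'b': [1, 2], 'a': [3, 4]})
    # select columns in the order you want them displayed
    df = df[['a', 'b']]
    ```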



    Note:




    • The code does not have exception handling and assumes the keys will be the same for all dicts, as shown in your examples

    • The shape and columns of the dataframe have also been assumed

    • You may add all the functions into one Python script, and wherever "./Test" is used for the JSON folder path, you should enter your own path. The folder should only contain the malformed JSON files to begin with.

    • The whole script can be further modularized by putting the functions into a class.

    • It can also be optimized with hashable data types like tuples, and sped up with the threading and asyncio libraries. However, for a folder of 1000 files this code should work fairly well and shouldn't take very long.

    • Some errors may still crop up while converting the malformed JSON files to valid ones, as the contents of all the JSON files are not known.


    The code discussed provides a workflow to accomplish what you need, and I hope it helps you and anyone who comes across a similar problem.
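    One possible simplification of the workflow above (my own suggestion, so treat it as an assumption): the rename-to-.txt detour isn't strictly required, because a .json file can be opened and read as plain text directly. The same key-quoting regex can then be applied in memory before parsing:

    ```python
    import json
    import re

    def load_fixed_json(path):
        """Read a malformed file as plain text, quote the bare keys, then parse."""
        with open(path) as f:
            text = f.read()
        # add double quotes around unquoted keys (one key per line assumed)
        fixed = re.sub(r'(\w+):(.+)', r'"\1":\2', text)
        return json.loads(fixed)
    ```

    Each returned dict could then be passed straight to flatten() and on to pd.DataFrame, without any renaming on disk.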






        3 Dear Share http://www.example.com/


        You can change the order of columns as you like.



        Note:




        • The code does not have Exception handling and assumes the keys will be the same for the dicts as shown in your examples

        • The shape and columns of the Dataframes has also been assumed

        • You may add all the functions into one Python script and wherever "./Test" is used for the JSON folder path, you should enter your path. The folder should only contain mailformed JSON files to begin with.

        • The whole script can be further modularized by putting the functions into a class.

        • It can also be further optimized with the use of hashable data types like tuples and sped up with threading and asyncio libraries. However, for a folder of 1000 files this code should work fairly well and shouldn't take very long.

        • It could be possible some errors may crop up while converting the malformed JSON files to valid ones, as the contents of all the JSON files is not known.


        The code discussed provides a workflow to accomplish what you need and I hope this helps you and anyone who comes across a similar problem.






        share|improve this answer















        There were a few challenges working with the JSON files you provided, and a few more converting them to dataframes and merging. First, the keys of the JSONs were not quoted strings; second, the arrays in the resulting "valid" JSONs were of different lengths and could not be converted to dataframes directly; and third, you did not specify the shape of the dataframe.



        Nevertheless, this is an important problem, as malformed JSONs are more common than "valid" ones, and despite several SO answers on fixing such JSON strings, every malformed-JSON problem is unique.



        I've broken down the problem into the following parts:




        • convert the malformed JSONs in the files to valid JSONs

        • flatten the dicts in the valid JSON files to prepare for dataframe conversion

        • create dataframes from the files and merge them into one dataframe


        Note: For this answer, I copied the example JSON strings you provided into two files namely "test.json" and "test1.json" and saved them into a "Test" folder.



        Part 1: convert malformed JSONs in the files to valid JSONs:



        The two example JSON strings you provided were not valid JSON at all, because the keys were bare identifiers rather than quoted strings. As a result, loading and parsing either file raises an error:



        with open('./Test/test.json') as f:
            data = json.load(f)
        print(data)
        #Error:
        JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
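Before converting anything, it can be useful to scan the folder and list exactly which files fail to parse. This is a small diagnostic sketch (the "./Test" folder path is the same assumption used throughout this answer, and the helper name is mine):

```python
import json
import os

def find_malformed_json_files(folder='./Test'):
    """Return the names of .json files in `folder` that json.load() rejects."""
    bad_files = []
    for filename in sorted(os.listdir(folder)):
        if not filename.endswith('.json'):
            continue
        with open(os.path.join(folder, filename)) as f:
            try:
                json.load(f)
            except json.JSONDecodeError:
                bad_files.append(filename)
    return bad_files
```

Running this before and after the fix-up step gives a quick sanity check that every file was repaired.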


        The only way I found to get around this problem was:




        • convert all the JSON files to txt files, as this treats the contents as plain strings

        • perform a regex substitution on the JSON string in the text files, adding quotes (" ") around the keys

        • save the files as JSON again


        The above three steps are accomplished with the two functions below. The first renames the files to txt files and returns a list of filenames; the second accepts that list, fixes the JSON keys using a regex, and saves the files in JSON format again.



        import json
        import os
        import re
        import pandas as pd

        #rename to txt files and return a list of filenames
        def rename_to_text_files():
            all_new_filenames = []
            for filename in os.listdir('./Test'):
                if filename.endswith("json"):
                    new_filename = filename.split('.')[0] + '.txt'
                    os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
                    all_new_filenames.append(new_filename)
                else:
                    all_new_filenames.append(filename)
            return all_new_filenames

        #fix the JSON string and save as a JSON file again; returns a list of valid JSON filenames
        def fix_dict_rename_to_json_files(files):
            json_validated_files = []
            for index, filename in enumerate(files):
                filepath = os.path.join('./Test', filename)
                with open(filepath, 'r+') as f:
                    data = f.read()
                    #wrap the bare keys in double quotes
                    dict_converted = re.sub(r"(\w+):(.+)", r'"\1":\2', data)
                    f.seek(0)
                    f.write(dict_converted)
                    f.truncate()
                #rename back to .json
                new_filename = filename[:-4] + '.json'
                os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
                json_validated_files.append(new_filename)
            print("All files converted to valid JSON!")
            return json_validated_files
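To make the regex step concrete, here is the intended form of the pattern (with the backslashes, which are easy to lose when copying) applied to a single unquoted line. Since `.` does not match newlines, the substitution repairs a multi-line file one line at a time in the same way:

```python
import re

# the key-quoting substitution used in fix_dict_rename_to_json_files
pattern, repl = r"(\w+):(.+)", r'"\1":\2'

line = 'thread_type: "Regular",'
print(re.sub(pattern, repl, line))  # → "thread_type": "Regular",
```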


        So now I had two files with valid JSON, but they were still not ready for dataframe conversion. To explain, consider the valid JSON from "test.json":



        #test.json
        {
            "participants": [
                {
                    "name": "Test 1"
                },
                {
                    "name": "Person"
                }
            ],
            "messages": [
                {
                    "sender_name": "Person",
                    "timestamp_ms": 1485467319139,
                    "content": "Hie",
                    "type": "Generic"
                }
            ],
            "title": "Test 1",
            "is_still_participant": true,
            "thread_type": "Regular",
            "thread_path": "inbox/xyz"
        }


        If I read this JSON into a dataframe directly, I still get an error, because the arrays have different lengths across keys. You can check this: the "messages" value is an array of length 1, while "participants" holds an array of length 2:



        df = pd.read_json('./Test/test.json')
        print(df)
        #Error
        ValueError: arrays must all be same length


        In the next part, we fix this problem by flattening the dict in the JSON.



        Part 2: Flatten dict for dataframe conversion:



        As you had not specified the shape you expect for your dataframe, I extracted the values as best I could and flattened the dict with the following function. This assumes the keys shown in the example JSONs do not change across the JSON files:



        #accepts a dict, flattens it as required and returns the dict with updated key/value pairs
        def flatten(d):
            values = []
            d['participants_name'] = d.pop('participants')
            for i in d['participants_name']:
                values.append(i['name'])
            for i in d['messages']:
                d['messages_sender_name'] = i['sender_name']
                d['messages_timestamp_ms'] = str(i['timestamp_ms'])
                d['messages_content'] = i['content']
                d['messages_type'] = i['type']
                if "share" in i:
                    d['messages_share_link'] = i["share"]["link"]
            d["is_still_participant"] = str(d["is_still_participant"])
            d.pop('messages')
            d.update(participants_name=values)
            return d


        This time let's consider the second example JSON string, which also has a "share" key with a URL. The valid JSON string is below:



        #test1.json
        {
            "participants": [
                {
                    "name": "Clearance"
                },
                {
                    "name": "Person"
                }
            ],
            "messages": [
                {
                    "sender_name": "Emmanuel Sibanda",
                    "timestamp_ms": 1212242073308,
                    "content": "Dear",
                    "share": {
                        "link": "http://www.example.com/"
                    },
                    "type": "Share"
                }
            ],
            "title": "Clearance",
            "is_still_participant": true,
            "thread_type": "Regular",
            "thread_path": "inbox/Clearance"
        }


        When we flatten this dict with the above function, we get a dict that can easily be fed into a DataFrame constructor (discussed later):



        with open('./Test/test1.json') as f:
            data = json.load(f)

        print(flatten(data))
        #Output:
        {'title': 'Clearance',
         'is_still_participant': 'True',
         'thread_type': 'Regular',
         'thread_path': 'inbox/Clearance',
         'participants_name': ['Clearance', 'Person'],
         'messages_sender_name': 'Emmanuel Sibanda',
         'messages_timestamp_ms': '1212242073308',
         'messages_content': 'Dear',
         'messages_type': 'Share',
         'messages_share_link': 'http://www.example.com/'}
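As an aside, pandas has a built-in helper for this kind of nesting: pandas.json_normalize (older versions expose it as pandas.io.json.json_normalize). A rough sketch of an alternative to the manual flatten, assuming the same keys as the examples; note it emits one row per message and flattens the nested "share" dict into a dotted column name:

```python
import pandas as pd

data = {
    "participants": [{"name": "Clearance"}, {"name": "Person"}],
    "messages": [{
        "sender_name": "Emmanuel Sibanda",
        "timestamp_ms": 1212242073308,
        "content": "Dear",
        "share": {"link": "http://www.example.com/"},
        "type": "Share",
    }],
    "title": "Clearance",
    "is_still_participant": True,
    "thread_type": "Regular",
    "thread_path": "inbox/Clearance",
}

# one row per entry in "messages"; the `meta` columns are repeated on each row
df = pd.json_normalize(
    data,
    record_path="messages",
    meta=["title", "is_still_participant", "thread_type", "thread_path"],
    record_prefix="messages_",
)
print(df.loc[0, "messages_share.link"])  # → http://www.example.com/
```

The "participants" list would still need separate handling with this approach, which is one reason the explicit flatten function above is a reasonable choice here.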


        Part 3: Create Dataframes and merge them into one:



        Now that we have a function that can flatten the dict, we can call it inside our final function, which will:




        • open the JSON files one by one, loading each JSON as a dict in memory with json.load()

        • call the flatten function on each dict

        • convert the flattened dicts to dataframes

        • append each dataframe to a list

        • merge all dataframes with pd.concat(), passing the list of dataframes as an argument


        The code to accomplish these tasks:



        #accepts a list of valid JSON filenames, builds a dataframe from the flattened dict in each file, merges the dataframes and returns the merged dataframe
        def create_merge_dataframes(list_of_valid_json_files):
            df_list = []
            for index, js in enumerate(list_of_valid_json_files):
                with open(os.path.join('./Test', js)) as json_file:
                    data = json.load(json_file)
                    flattened_json_data = flatten(data)
                    df = pd.DataFrame(flattened_json_data)
                    df_list.append(df)
            merged_df = pd.concat(df_list, sort=False, ignore_index=True)
            return merged_df


        Let's give the whole code a test run. We begin with the functions in Part 1 and end with Part 3, to get a merged dataframe.



        #rename invalid JSON files to text
        files = rename_to_text_files()

        #fix JSON strings and save as JSON files again. We pass the "files" variable above as an arg for this function
        json_validated_files = fix_dict_rename_to_json_files(files)

        #flatten and receive merged dataframes
        df = create_merge_dataframes(json_validated_files)
        print(df)


        The final Dataframe:



               title is_still_participant thread_type      thread_path  \
        0     Test 1                 True     Regular        inbox/xyz
        1     Test 1                 True     Regular        inbox/xyz
        2  Clearance                 True     Regular  inbox/Clearance
        3  Clearance                 True     Regular  inbox/Clearance

          participants_name messages_sender_name messages_timestamp_ms  \
        0            Test 1               Person         1485467319139
        1            Person               Person         1485467319139
        2         Clearance     Emmanuel Sibanda         1212242073308
        3            Person     Emmanuel Sibanda         1212242073308

          messages_content messages_type      messages_share_link
        0              Hie       Generic                      NaN
        1              Hie       Generic                      NaN
        2             Dear         Share  http://www.example.com/
        3             Dear         Share  http://www.example.com/


        You can change the order of columns as you like.
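For example, DataFrame.reindex with an explicit column list puts the columns in whatever order you prefer. A minimal sketch with a toy frame (any name left out of the list is dropped, and any listed name missing from the frame becomes a NaN column):

```python
import pandas as pd

df = pd.DataFrame({
    "thread_path": ["inbox/xyz"],
    "title": ["Test 1"],
    "messages_content": ["Hie"],
})

# reorder the columns explicitly
column_order = ["title", "messages_content", "thread_path"]
df = df.reindex(columns=column_order)
print(df.columns.tolist())  # → ['title', 'messages_content', 'thread_path']
```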



        Note:




        • The code does not have exception handling and assumes the keys will be the same across dicts, as in your examples

        • The shape and columns of the dataframes have also been assumed

        • You may put all the functions into one Python script; wherever "./Test" is used as the JSON folder path, substitute your own path. The folder should contain only the malformed JSON files to begin with

        • The whole script can be further modularized by putting the functions into a class

        • It can also be optimized with hashable data types like tuples, and sped up with the threading and asyncio libraries. However, for a folder of 1000 files this code should work fairly well and shouldn't take very long

        • Some errors may still crop up while converting the malformed JSON files to valid ones, as the contents of all the JSON files are not known


        The code discussed provides a workflow to accomplish what you need, and I hope it helps you and anyone who comes across a similar problem.







        edited Jan 22 at 11:57

























        answered Jan 18 at 15:29









        amanb
        1,145421





























